Redundancy introduced by cut-and-paste operations in text biases machine learning models for NLP.
This module takes a directory and returns a list containing a subset of its files, chosen so that the similarity between any two selected files does not exceed an upper bound.
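The idea can be illustrated with a minimal sketch. It is not the module's actual implementation: the similarity measure (Jaccard overlap of word trigrams), the greedy selection order, and every name here (`shingles`, `jaccard`, `select_subset`, `select_from_directory`, `max_similarity`) are assumptions made for illustration.

```python
from pathlib import Path


def shingles(text, n=3):
    """Return the set of word n-grams in text (assumed similarity unit)."""
    words = text.split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def select_subset(texts, max_similarity=0.5, n=3):
    """Greedily keep names whose similarity to every kept text is <= max_similarity.

    texts maps a file name to its content. Names are visited in sorted
    order, so the result is deterministic but not necessarily the largest
    possible subset.
    """
    kept = []
    for name in sorted(texts):
        candidate = shingles(texts[name], n)
        if all(jaccard(candidate, s) <= max_similarity for _, s in kept):
            kept.append((name, candidate))
    return [name for name, _ in kept]


def select_from_directory(directory, max_similarity=0.5, n=3):
    """Read every regular file in directory and apply select_subset."""
    texts = {
        p.name: p.read_text(encoding="utf-8", errors="ignore")
        for p in Path(directory).iterdir()
        if p.is_file()
    }
    return select_subset(texts, max_similarity, n)
```

Calling `select_from_directory("corpus/", max_similarity=0.3)` on a hypothetical `corpus/` directory would return the names of files that pairwise share at most 30% of their word trigrams under these assumptions. A greedy pass is simple, but it compares each candidate against every kept file, so it scales quadratically in the worst case.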
Website | https://corpusredundanc.sourceforge.io |
Tags | Linguistics |
License | GNU General Public License version 3.0 (GPLv3) |