
Corpus redundancy manager

Redundancy caused by copy-and-paste operations in text biases machine learning models for NLP.
This module takes a directory and returns a list containing a subset of the files in that directory, chosen so that the similarity between any two files in the subset stays within an upper bound.

Website https://corpusredundanc.sourceforge.io
Features
  • Identify copy-paste redundancy in a document corpus
  • Input: a folder of text documents and a similarity threshold
  • Output (a): a list of non-redundant documents (a non-redundant subset of the corpus)
  • Output (b): a list of document pairs found to be redundant, with the amount of redundancy for each pair
  • Python script (2.6), tested on various Linux flavours and Windows XP/7
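
The input and outputs listed above can be illustrated with a minimal sketch. This is not the project's own code: it is written for Python 3 rather than 2.6, it uses Jaccard similarity over word trigrams as a stand-in similarity measure, and the names read_corpus, shingles, jaccard, find_redundant_pairs and select_non_redundant are all hypothetical. The actual script may measure redundancy differently.

```python
import os
from itertools import combinations


def read_corpus(directory):
    """Read every regular file in `directory` into a {filename: text} dict."""
    texts = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path, encoding="utf-8", errors="replace") as handle:
                texts[name] = handle.read()
    return texts


def shingles(text, n=3):
    """Return the set of word n-grams in `text` (assumed similarity unit)."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_redundant_pairs(texts, threshold, n=3):
    """Output (b): (file_a, file_b, similarity) for pairs above the threshold."""
    sets = {name: shingles(text, n) for name, text in texts.items()}
    pairs = []
    for a, b in combinations(sorted(sets), 2):
        score = jaccard(sets[a], sets[b])
        if score > threshold:
            pairs.append((a, b, score))
    return pairs


def select_non_redundant(names, pairs):
    """Output (a): greedily keep files that are not redundant with any kept file."""
    redundant_with = {}
    for a, b, _ in pairs:
        redundant_with.setdefault(a, set()).add(b)
        redundant_with.setdefault(b, set()).add(a)
    kept = []
    for name in sorted(names):
        if not any(k in redundant_with.get(name, ()) for k in kept):
            kept.append(name)
    return kept
```

With a real folder, the calls would be chained as texts = read_corpus(folder), pairs = find_redundant_pairs(texts, threshold), kept = select_non_redundant(texts, pairs). The greedy pass keeps the first file (in name order) of each redundant group, so the result depends on ordering; it guarantees that no two kept files exceed the threshold, but not that the kept subset is the largest possible.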