Duplicate Finder

Duplicate finder is an open-source application for detecting similar text across one or more files. It finds exact (100%) duplicates as well as content that is similar but not identical. The tool supports several formats, including plain text, Markdown, and XML.

The duplicate finder tool can help you with:

Duplicate content example

Here's a quick example to give you an idea of what the tool detects:

Chunk 1:
Open the Google Play Store on your Android device, search for "AwesomeApp", then tap "Install" to download and install the app.
Chunk 2:
Open the App Store on your iOS device, search for "AwesomeApp", then tap "Get" to download and install the app.
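
To make "similar but not identical" concrete, here is a minimal sketch that scores the two chunks above. It assumes Jaccard similarity over character n-grams; the metric actually used by the duplicate finder is not described here, so treat the function names and the n-gram length of 3 as illustrative choices only.

```python
def ngrams(text: str, n: int = 3) -> set:
    """Return the set of character n-grams of text (lowercased, whitespace-normalized)."""
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        # Text shorter than n yields itself as the only gram (or nothing if empty).
        return {normalized} if normalized else set()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of the two n-gram sets: |A & B| / |A | B|."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty texts are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


chunk_1 = ('Open the Google Play Store on your Android device, search for '
           '"AwesomeApp", then tap "Install" to download and install the app.')
chunk_2 = ('Open the App Store on your iOS device, search for '
           '"AwesomeApp", then tap "Get" to download and install the app.')

score = similarity(chunk_1, chunk_2)
print(f"similarity: {score:.2f}")
```

A score of 1.0 means the n-gram sets are identical and 0.0 means they share nothing. The two chunks land somewhere in between, which is the kind of near-duplicate the tool is meant to surface.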

How to use

  1. Download the app. Alternatively, you can build it yourself from the sources.
  2. Make sure Java 16 or later is installed on your computer.
  3. In the terminal, open the folder with the .jar file you downloaded.
  4. Execute java -jar duplicate-finder.jar with the following parameters:

    -r / --root (required)
        Relative or absolute path to the folder where you want to search for duplicate content.
        Example: -r=./my-project/

    -o / --output
        Relative or absolute path to the folder where you want to save the results of the analysis. If no directory is specified, the duplicate finder uses the current working directory.
        Example: -o=./my-project/duplicates/

    -f / --fileMask
        Comma-separated list of file extensions to analyze. By default, all files are analyzed.
        Example: -f=md,mdx

    -i / --indexer
        What to consider as a text chunk. The following options are available:
        • md – a Markdown element
        • line – a single line of text
        • xml – an XML element
        • file – the entire content of a file
        • auto – attempt to infer from the file mask
        Example: -i=md

    -l / --minLength
        The minimum length (in characters) for a text chunk to be analyzed. Default: 100 (text fragments shorter than 100 characters are ignored).
        Example: -l=150

    -s / --minSimilarity
        The minimum degree of similarity between two text chunks for them to be considered duplicates. Default: 0.9 (90%).
        Example: -s=0.85

    -d / --minDuplicates
        The minimum number of duplicates a group must contain to be reported. Default: 1 (one duplicate is enough).
        Example: -d=5

    -h / --headless
        Do not open the duplicates viewer; only write the results to files.
        Example: -h

    -v / --verbose
        Whether to log the progress and errors to the console. Use this option if the analysis is taking too long and you suspect a problem. Default: no logging.
        Example: -v

    -m / --memory
        Low memory mode: minimizes the duplicate finder's memory footprint at the cost of analysis speed.
        Example: -m

    -g / --gram
        (advanced) n-gram length. Affects the speed, memory footprint, and accuracy of the analysis. The effect depends on the specifics of the content.
        Example: -g=10
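
The way the -l, -s, and -d thresholds interact can be sketched in a few lines. This is an illustration of the filtering logic as the parameter descriptions state it, not the tool's implementation: `find_duplicate_groups` is a hypothetical name, and `difflib.SequenceMatcher` from the Python standard library stands in for whatever similarity measure the duplicate finder really uses.

```python
from difflib import SequenceMatcher


def find_duplicate_groups(chunks, min_length=100, min_similarity=0.9, min_duplicates=1):
    """Return (reference, duplicates) pairs that pass all three thresholds."""
    # -l / --minLength: chunks shorter than min_length characters are ignored.
    candidates = [chunk for chunk in chunks if len(chunk) >= min_length]

    groups = []
    for i, reference in enumerate(candidates):
        # -s / --minSimilarity: keep chunks at least this similar to the reference.
        # SequenceMatcher.ratio() is a stand-in metric (an assumption).
        duplicates = [
            other
            for j, other in enumerate(candidates)
            if i != j
            and SequenceMatcher(None, reference, other).ratio() >= min_similarity
        ]
        # -d / --minDuplicates: report the group only if it is large enough.
        if len(duplicates) >= min_duplicates:
            groups.append((reference, duplicates))
    return groups


chunks = [
    "Tap Install to download and install the app.",
    "Tap Get to download and install the app.",
    "See the release notes for details.",
    "OK",
]
for reference, duplicates in find_duplicate_groups(chunks, min_length=20, min_similarity=0.8):
    print(reference, "->", duplicates)
```

Raising the minimum similarity or the minimum number of duplicates shrinks the report, while lowering the minimum length lets shorter fragments take part in the comparison. The pairwise loop is quadratic in the number of chunks, which is fine for a sketch but would be slow on a large project.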

Command example

Here is an example of what your command might look like:

java -jar duplicate-finder.jar -r=/Users/me.user/my-site -i=md -f=md,mdx -s=0.85 -d=5 -l=200

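The command above will do the following:

  • Search for duplicate content in the /Users/me.user/my-site folder (-r)
  • Treat each Markdown element as a text chunk (-i=md)
  • Analyze only files with the md and mdx extensions (-f=md,mdx)
  • Consider two chunks duplicates if they are at least 85% similar (-s=0.85)
  • Report only groups that contain at least 5 duplicates (-d=5)
  • Ignore chunks shorter than 200 characters (-l=200)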

Results

Depending on the settings and the size of the project, you may have to wait a little for the analysis to complete. After that, the results open in the duplicates viewer and are saved to the folder defined with the -o command-line option. If the option is not specified, the output is written to the working directory.
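
For unattended runs, such as a scheduled check, the viewer can be skipped with -h so that only the result files are written. The sketch below builds such a command and works out where the results are expected. `headless_command` is a hypothetical helper, and the names and format of the result files are not covered here, so the sketch does not try to read them.

```python
from pathlib import Path


def headless_command(root, output=None, jar="duplicate-finder.jar"):
    """Return (argv, results_dir) for a run that only writes result files."""
    # -h keeps the duplicates viewer closed; -r is the only required option.
    argv = ["java", "-jar", jar, f"-r={root}", "-h"]
    if output is not None:
        argv.append(f"-o={output}")
        results_dir = Path(output)
    else:
        # Without -o, the output is written to the working directory.
        results_dir = Path.cwd()
    return argv, results_dir


argv, results_dir = headless_command("./my-project/", output="./my-project/duplicates/")
print(" ".join(argv))
print("results expected in:", results_dir)

# To actually run the analysis (requires Java 16 or later and the .jar file):
# import subprocess
# subprocess.run(argv, check=True)
```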

Figure: Duplicate finder UI

  1. Toolbar: configure the font size, the sorting order, and whether to show only a single reference chunk (2) for each duplicate group.
  2. Reference chunk list: select the chunk that serves as a reference for comparison.
  3. Duplicate chunk list: after you've selected the reference chunk (2), this list shows the chunks that are similar to it. To preview a duplicate, select it from the list.
  4. Reference chunk preview: after you've selected the reference chunk (2), you can preview its content here. Common parts are shown in green and differing parts in red. The more duplicate chunks (3) share a part, the greener it appears.
  5. Duplicate chunk preview: after you've selected the duplicate chunk (3), its preview appears here. Use it for a quick comparison with the reference chunk preview (4).

Learn more & contact

If you're interested in the development of this tool, check out the related blog post series:

For feedback, you can reach out using the contacts in the footer of this page. I will be happy to hear your thoughts and feature requests.

License

The code is licensed under the MIT license, which means you are free to use it for any purpose as well as fork and modify it.
