Duplicate Finder
Duplicate finder is an open-source application for detecting similar text across one or more files. It can be used to find 100% duplicates as well as content that is similar but not identical. The tool is compatible with several formats, including plain text, Markdown, and XML.
The duplicate finder tool can help you with:
- Detecting plagiarism
- Content management
- Search engine optimization (SEO)
- Data deduplication
Duplicate content example
Here's a quick example to give you an idea of what the tool detects:
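As a rough stand-in, consider two invented documentation fragments that differ in only a couple of words. The sketch below scores them with Python's difflib, which is used here purely for illustration and is an assumption, not the comparison method the duplicate finder itself uses.

```python
from difflib import SequenceMatcher

# Two invented fragments (hypothetical content) that differ in only a few words.
chunk_a = (
    "To install the plugin, open the settings dialog, go to the Plugins page, "
    "search for the plugin by name, and click Install."
)
chunk_b = (
    "To install the theme, open the settings dialog, go to the Plugins page, "
    "search for the theme by name, and click Install."
)


def similarity(a: str, b: str) -> float:
    """Return a score between 0 and 1 (difflib's ratio, not the tool's metric)."""
    return SequenceMatcher(None, a, b).ratio()


score = similarity(chunk_a, chunk_b)
print(f"similarity: {score:.2f}")
```

Fragments like these are not identical, yet they score close to 1, which is the kind of content a similarity threshold below 100% is meant to catch.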
How to use
- Download the app. Alternatively, you can build it yourself from the sources.
- Make sure Java 16 or later is installed on your computer
- In the terminal, open the folder with the .jar file you downloaded
- Execute java -jar duplicate-finder.jar with the following parameters:

| Parameter | Meaning | Example |
| --- | --- | --- |
| -r / --root | (required) Relative or absolute path to the folder where you want to search for duplicate content. | -r=./my-project/ |
| -o / --output | Relative or absolute path to the folder where you want to save the results of the analysis. If no directory is specified, the duplicate finder uses the current working directory. | -o=./my-project/duplicates/ |
| -f / --fileMask | Comma-separated list of file extensions to analyze. By default, all files are analyzed. | -f=md,mdx |
| -i / --indexer | What to consider as a text chunk. The following options are available: md (a Markdown element), line (a single line of text), xml (an XML element), file (the entire file's content), auto (attempt to infer from the file mask). | -i=md |
| -l / --minLength | The minimum length (in characters) for a text chunk to be analyzed. Default: 100 (text fragments shorter than 100 characters are ignored). | -l=150 |
| -s / --minSimilarity | The minimum degree of similarity for two text chunks to be considered duplicates. Default: 0.9 (90%). | -s=0.85 |
| -d / --minDuplicates | The minimum number of duplicates for a duplicate group to be reported. Default: 1 (one duplicate is enough). | -d=5 |
| -h / --headless | Do not open the duplicates viewer; only write the results to files. | -h |
| -v / --verbose | Log the progress and errors to the console. Use this option if the analysis is taking too long and you suspect a problem. Default: no logging. | -v |
| -m / --memory | Low memory mode: minimizes the duplicate finder's memory footprint at the cost of analysis speed. | -m |
| -g / --gram | (advanced) n-gram length. Affects the speed, memory footprint, and accuracy of the analysis. The difference depends on the specifics of the content. | -g=10 |
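To give a feel for how the minimum length, minimum similarity, and n-gram length settings could interact, here is a minimal Python sketch. It assumes character n-grams compared with Jaccard similarity, which is a common technique but not necessarily what the duplicate finder does internally. The function names are hypothetical; the 100 and 0.9 defaults come from the parameter descriptions, while 10 is only the example value shown for -g, not a documented default.

```python
def ngrams(text: str, n: int) -> set:
    """Return the set of character n-grams in text (empty if text is shorter than n)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: str, b: str, n: int) -> float:
    """Share of n-grams that two texts have in common (assumed metric)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def is_duplicate(a: str, b: str, min_length: int = 100,
                 min_similarity: float = 0.9, gram: int = 10) -> bool:
    """Mimic the -l, -s, and -g settings in a simplified way."""
    # Chunks shorter than the minimum length are ignored entirely.
    if len(a) < min_length or len(b) < min_length:
        return False
    return jaccard(a, b, gram) >= min_similarity


original = "the quick brown fox jumps over the lazy dog"
edited = "the quick brown fax jumps over the lazy dog"

# A one-character edit invalidates up to n n-grams, so longer n-grams
# penalize small edits more heavily in this model.
print(jaccard(original, edited, 3) > jaccard(original, edited, 10))
```

In this simplified model, a larger n-gram length makes the comparison stricter for short texts, which is one way the setting can affect accuracy.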
Command example
Here is an example of what your command might look like:
java -jar duplicate-finder.jar -r=/Users/me.user/my-site -i=md -f=md,mdx -s=0.85 -d=5 -l=200
The command above will do the following:
- -r=/Users/me.user/my-site: search for similar content in '/Users/me.user/my-site' and its subdirectories
- -i=md: assume the content is written in Markdown and parse it according to Markdown rules
- -f=md,mdx: only consider files with the '.md' and '.mdx' extensions
- -s=0.85: only report matches with a similarity of 85% or higher
- -d=5: only report texts that are duplicated 5 or more times
- -l=200: only report texts that are 200 characters or longer
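If you run the analysis from a script, the same command can be assembled programmatically. The Python sketch below is one possible wrapper, not part of the tool: build_command and java_major_version are hypothetical helper names, and only the flags documented above are used. The version check relies on the fact that java -version prints a line such as openjdk version "17.0.2" (to standard error).

```python
import re


def java_major_version(version_output: str):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if not match:
        return None
    major = int(match.group(1))
    # Legacy scheme: "1.8.0_281" means Java 8.
    if major == 1 and match.group(2):
        return int(match.group(2))
    return major


def build_command(root, jar="duplicate-finder.jar", indexer=None, file_mask=None,
                  min_similarity=None, min_duplicates=None, min_length=None,
                  headless=False):
    """Return the argument list for the duplicate finder (hypothetical helper)."""
    cmd = ["java", "-jar", jar, f"-r={root}"]
    if indexer:
        cmd.append(f"-i={indexer}")
    if file_mask:
        cmd.append("-f=" + ",".join(file_mask))
    if min_similarity is not None:
        cmd.append(f"-s={min_similarity}")
    if min_duplicates is not None:
        cmd.append(f"-d={min_duplicates}")
    if min_length is not None:
        cmd.append(f"-l={min_length}")
    if headless:
        cmd.append("-h")
    return cmd


cmd = build_command("/Users/me.user/my-site", indexer="md", file_mask=["md", "mdx"],
                    min_similarity=0.85, min_duplicates=5, min_length=200)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True). Passing a list rather than a single string avoids shell quoting problems with paths that contain spaces.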
Results
Depending on the settings and the size of the project, you may have to wait a little for the analysis to complete. After that, the results will open in the duplicates viewer and be saved to the folder defined with the '-o' command-line option. If this option is not specified, the output is written to the working directory.
1. Toolbar: configure the font size, sorting order, and whether you want to only see a single reference chunk (2) for each of the duplicate groups.
2. Reference chunk list: select the chunk that serves as a reference for comparison.
3. Duplicate chunk list: after you've selected the reference chunk (2), this list will show the chunks that are similar to it. To preview a duplicate, select it from the list.
4. Reference chunk preview: after you've selected the reference chunk (2), you can preview its content here. Common parts are shown in green, while the differing ones are shown in red. The more of the duplicate chunks (3) have this part in common, the greener it will appear.
5. Duplicate chunk preview: after you've selected the duplicate chunk (3), its preview will appear here. You can use it for a quick comparison with the selected reference chunk (4).
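The red-to-green shading in the reference chunk preview can be pictured as a mapping from "share of duplicates that contain this part" to a color. The Python sketch below assumes a simple linear blend between pure red and pure green; the viewer's actual color computation is not documented here, and the shade function is a hypothetical name.

```python
def shade(shared_by: int, total_duplicates: int) -> str:
    """Return a hex color: red when no duplicate shares the part, green when all do."""
    if total_duplicates <= 0:
        raise ValueError("total_duplicates must be positive")
    # Clamp the share to the 0..1 range before blending.
    share = max(0.0, min(1.0, shared_by / total_duplicates))
    red = round(255 * (1 - share))
    green = round(255 * share)
    return f"#{red:02x}{green:02x}00"


# A part shared by more duplicates gets a greener color (assumed linear blend).
for count in range(5):
    print(count, shade(count, 4))
```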
Learn more & contact
If you're interested in the development of this tool, check out the related blog post series:
For feedback, you can reach out using the contacts in the footer of this page. I will be happy to hear your thoughts and feature requests.
License
The code is licensed under the MIT license, which means you are free to use it for any purpose as well as fork and modify it.