Duplicate Finder for Text: Algorithm
Other languages: Español Français Deutsch 日本語 한국어 Português 中文
The previous post covered the requirements and interface for our duplicate finder. Now it’s time to think about the actual algorithm. We’ll walk through algorithm explanations, JavaScript snippets, and some interactive demos so you can play with the concepts right in your browser.
Core concept
Finding similar text fragments comes down to two challenges:
- Representation – we need a way to represent text where similar texts end up with similar representations
- Search – we need to avoid comparing every possible pair (that would be way too slow)
We’ll follow an approach based on n-grams – overlapping groups of consecutive characters. More specifically, we’ll use trigrams (n-grams of length 3). Why does this work? When two texts are similar, they naturally share lots of the same short character sequences.
Splitting to n-grams
A trigram is a sequence of three consecutive characters. To generate trigrams from text, we slide a window of size 3 across the string:
function generateTrigrams(text) {
const trigrams = new Set();
for (let i = 0; i <= text.length - 3; i++) {
trigrams.add(text.substring(i, i + 3));
}
return trigrams;
}
For example, the text "hello"
produces the following trigrams: "hel" ,
"ell" ,
"llo" .
Try it yourself:
This representation has a useful property: if two texts are similar, they will share many trigrams. Even if the texts differ slightly, most trigrams will remain the same.
Building an inverted index
Comparing every text chunk against every other chunk would be too slow – the time complexity of such operation is O(n²), which is impractical for large datasets. Instead, we will build an inverted index that maps each trigram to the chunks containing it:
// Map: trigram → list of chunk IDs containing that trigram
const index = new Map();
function addToIndex(chunkId, text) {
const trigrams = generateTrigrams(text.toLowerCase());
trigrams.forEach(trigram => {
if (!index.has(trigram)) {
index.set(trigram, []);
}
index.get(trigram).push(chunkId);
});
}
Try adding some text chunks to see how they are indexed:
Notice how chunks with similar content share many index entries. This is what makes the search efficient: we can use the index to quickly find chunks that share trigrams with a given query.
Calculating similarity
Next, we need a way to measure how similar two texts are. As a metric, we will use the overlap of their trigram sets:
function calculateSimilarity(trigrams1, trigrams2) {
// Count common trigrams
const intersection = [...trigrams1]
.filter(t => trigrams2.has(t));
// Similarity = common / max(size1, size2)
const maxSize = Math.max(trigrams1.size, trigrams2.size);
return intersection.length / maxSize;
}
The formula is:
where A and B are the sizes of the trigram sets. This gives us a value between 0 (completely different) and 1 (identical).
Try different texts to see how the similarity changes:
Notice that:
- Identical texts score 100%
- Texts with minor differences still score high
- Completely different texts score close to 0%
Finding duplicates
Let’s put the pieces together. The algorithm works as follows:
- Index all chunks – for each chunk, generate trigrams and add them to the inverted index (done once per dataset)
- Find candidates – for a given chunk, use the index to find chunks with overlapping sets of trigrams
- Calculate similarity – for each candidate, calculate the similarity score
- Filter by threshold – keep only pairs above the minimum similarity
A simplified JavaScript implementation would look along the lines of:
function findDuplicates(minSimilarity) {
const duplicates = [];
for (const chunk of chunks) {
const candidates = findCandidates(chunk.id, chunk.trigrams);
for (const candidateId of candidates) {
const candidate = chunks[candidateId];
const similarity = calculateSimilarity(
chunk.trigrams,
candidate.trigrams
);
if (similarity >= minSimilarity) {
duplicates.push({
chunk1: chunk,
chunk2: candidate,
similarity
});
}
}
}
return duplicates;
}
Here’s an interactive demo. Add your own text chunks and adjust the similarity threshold to see duplicates detected in real-time:
Next steps
The JavaScript code in this article demonstrates the core algorithm. In the next posts, we’ll look at the actual implementation of the tool in Kotlin and explore how parallel execution and other optimizations improve the tool’s performance.
See you next time!