Large data sets usually contain many documents that are similar but not identical. Common examples include different revisions of a document that several people have worked on, and the same content in different formats such as a Word document and a PDF. To find these related documents, you can apply the process of shingling.
Shingling extracts the textual essence from each document and applies that pattern to a data set to find similar documents. This can help you quickly reduce the number of files to review or find the changes made between very similar files.
For example, say you wanted to find and deduplicate multiple copies of the same PDF file. Easy: You would simply search for the document’s cryptographic hash and that would locate all identical versions of that PDF across your data set.
However, this wouldn’t help you find the Word document the PDF was created from. Even though the PDF and the Word document have exactly the same textual content, they have different hash values. To find related documents, you need to compare document content, not exact binary make-up.
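To make the hash problem concrete, here is a minimal sketch in Python. It assumes SHA-256 as the cryptographic hash (the post does not name one) and uses two text encodings as stand-ins for two file formats such as PDF and Word. The sample sentence and the helper name `sha256_hex` are invented for illustration.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of some bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


text = "The quarterly report is attached for your review."

# Stand-ins for two file formats: identical text, different bytes on disk.
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16")

# An exact copy of the same bytes always produces the same hash.
same_copy = sha256_hex(as_utf8) == sha256_hex(text.encode("utf-8"))

# The same text stored in a different format produces a different hash.
cross_format = sha256_hex(as_utf8) == sha256_hex(as_utf16)

print(same_copy)     # True
print(cross_format)  # False
```

Both byte strings decode to exactly the same sentence, yet a hash search for one would never find the other.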
Shingles are, as the name implies, a series of overlapping phrases: typically five words each, with every phrase overlapping its neighbors on either end. Generating them is a very fast and efficient way to capture the content of each item and compare it with others.
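As a rough illustration of the idea, the Python sketch below slides a five-word window across a sentence one word at a time. The tokenization (lowercasing and splitting on non-word characters) and the one-word step are assumptions made for this example; they are not a description of how the Nuix Engine tokenizes text.

```python
import re


def shingles(text: str, size: int = 5) -> list[str]:
    """Return overlapping word shingles of the given size, in document order."""
    # Assumed tokenization: lowercase, keep runs of word characters.
    words = re.findall(r"\w+", text.lower())
    # Each shingle starts one word after the previous one, so neighbors overlap.
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


sample = "The quick brown fox jumps over the lazy dog"
for shingle in shingles(sample):
    print(shingle)
# the quick brown fox jumps
# quick brown fox jumps over
# brown fox jumps over the
# fox jumps over the lazy
# jumps over the lazy dog
```

A nine-word sentence yields five shingles, and each one shares four words with the shingle before it.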
Once you have these shingles, you can start thinking about how many would need to match for two documents to count as similar. Do the shingles need to be 100% identical or is 75% close enough? That percentage is called the “similarity coefficient” and you can adjust it to determine how exact a match you need.
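One common way to turn shared shingles into a percentage is the Jaccard coefficient: the number of shingles two documents have in common divided by the number of distinct shingles across both. The sketch below uses that definition as an assumption; the post does not say which formula Nuix applies, and the function names are hypothetical.

```python
import re


def shingle_set(text: str, size: int = 5) -> set[str]:
    """Return the set of distinct word shingles in a text."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(text_a: str, text_b: str) -> float:
    """Jaccard coefficient of two texts' shingle sets (an assumed definition)."""
    a, b = shingle_set(text_a), shingle_set(text_b)
    union = a | b
    if not union:
        return 0.0  # neither text is long enough to produce a shingle
    return len(a & b) / len(union)


def is_similar(text_a: str, text_b: str, threshold: float = 0.75) -> bool:
    """True when the similarity coefficient meets the chosen threshold."""
    return similarity(text_a, text_b) >= threshold


original = "The quick brown fox jumps over the lazy dog"
revised = "The quick brown fox jumps over the lazy cat"

coefficient = similarity(original, revised)  # 4 shared of 6 distinct shingles
print(round(coefficient, 2))                          # 0.67
print(is_similar(original, revised, threshold=0.75))  # False
print(is_similar(original, revised, threshold=0.60))  # True
```

Changing a single word at the end of the sentence breaks one shingle in each version, so the pair matches at a 60% threshold but not at 75%.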
How does Nuix do it?
During processing, if near-duplicates are enabled, the Nuix Engine generates shingles for each item. The shingles are stored in the Nuix index so you can use them later for searching and analysis.
Nuix provides a variety of ways to work with shingles:
- View all the shingles in a collection of documents and filter them by keyword
- View where in the document a shingle occurs
- Export all the shingles extracted from a document or multiple documents
- Search for documents with a specific similarity coefficient
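To give a feel for the first two operations in the list above, here is a generic Python sketch that records where each shingle starts in a document and filters the shingles by keyword. This is not the Nuix API; every function name here is hypothetical and the tokenization is an assumption.

```python
import re


def shingles_with_positions(text: str, size: int = 5) -> list[tuple[int, str]]:
    """Return (starting word index, shingle) pairs in document order."""
    words = re.findall(r"\w+", text.lower())
    return [(i, " ".join(words[i:i + size])) for i in range(len(words) - size + 1)]


def filter_by_keyword(entries: list[tuple[int, str]], keyword: str) -> list[tuple[int, str]]:
    """Keep only the shingles that contain the keyword as a whole word."""
    wanted = keyword.lower()
    return [(pos, shingle) for pos, shingle in entries if wanted in shingle.split()]


document = "Payment is due within thirty days of the invoice date"
hits = filter_by_keyword(shingles_with_positions(document), "invoice")
for position, shingle in hits:
    print(position, shingle)
# 4 thirty days of the invoice
# 5 days of the invoice date
```

The position is a word offset, which is enough to point a reviewer at the part of the document where the shingle occurs.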
What is it good for?
Shingling has a wide range of uses. For example, you could use shingle lists as part of a records classification exercise. A challenge for any records management or information governance initiative is managing and classifying legacy information. It’s impractical to manually review several years’ worth of accumulated digital debris on network file shares or email archives.
But what if you collected documents that matched a given records series and then had Nuix create a shingle list? This list would act as a textual fingerprint for all the documents assigned to this records series. You could then use the shingle list to search across all the unclassified data in your organization, adjusting the similarity coefficient to dial up or down the specificity of the search.
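The records classification idea can be sketched in a few lines of Python. The example below pools the shingles from a handful of exemplar documents into one fingerprint, then scores each unclassified document by the share of its own shingles that appear in that fingerprint. That scoring rule, the sample texts and all function names are assumptions for illustration, not the Nuix implementation.

```python
import re


def shingle_set(text: str, size: int = 5) -> set[str]:
    """Return the set of distinct word shingles in a text."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def build_fingerprint(exemplars: list[str]) -> set[str]:
    """Pool the shingles of every exemplar document into one shingle list."""
    fingerprint: set[str] = set()
    for text in exemplars:
        fingerprint |= shingle_set(text)
    return fingerprint


def score(text: str, fingerprint: set[str]) -> float:
    """Fraction of a document's shingles that also appear in the fingerprint."""
    doc = shingle_set(text)
    if not doc:
        return 0.0
    return len(doc & fingerprint) / len(doc)


def find_matches(unclassified: dict[str, str], fingerprint: set[str],
                 threshold: float = 0.75) -> list[str]:
    """Names of unclassified documents whose score meets the threshold."""
    return [name for name, text in unclassified.items()
            if score(text, fingerprint) >= threshold]


# Hypothetical documents already assigned to a "purchase orders" records series.
exemplars = [
    "This purchase order is issued subject to the standard terms and conditions of supply.",
    "Goods must be delivered to the address shown on this purchase order.",
]
unclassified = {
    "po_copy.txt": "Reminder: this purchase order is issued subject to the "
                   "standard terms and conditions of supply.",
    "lunch.txt": "Team lunch is moved to Friday at noon in the main kitchen.",
}

fingerprint = build_fingerprint(exemplars)
print(find_matches(unclassified, fingerprint, threshold=0.75))  # ['po_copy.txt']
print(find_matches(unclassified, fingerprint, threshold=0.95))  # []
```

Raising the threshold from 0.75 to 0.95 drops the purchase order copy, because one of its eleven shingles (the one starting with "reminder") is not in the fingerprint. That is the dial-up, dial-down effect of adjusting the similarity coefficient.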
You could apply a similar process to identify content in your data set that was taking up space but had no business value. Instead of building a shingling pattern based on records, you could build a couple of loosely defined patterns based on the unnecessary content and rapidly clear large volumes of that data from your system.
Want to protect yourself against malware? Shingling can help with that too. Antivirus systems are designed to protect against known threats, but advanced malware continually evolves and in some cases programmatically mutates to try to stay ahead of detection. Fortunately, the Nuix Engine extracts text from thousands of different file types, not just office documents and emails. This means that Nuix can look inside the text of potential malware and use shingles to search for strings or patterns, such as names of people, IP addresses, URLs or even individual calling cards of hackers.
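As a simplified illustration of searching shingles for such patterns, the Python sketch below builds shingles from whitespace-separated tokens (so that IP addresses and URLs stay intact) and keeps the ones that contain an IPv4-like string or an HTTP URL. The regular expressions are deliberately loose, the sample text is invented, and none of this reflects how Nuix itself performs the search.

```python
import re

# Loose patterns for illustration only; they do not validate addresses.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
URL = re.compile(r"https?://\S+")


def shingles(text: str, size: int = 5) -> list[str]:
    """Return overlapping shingles built from whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


def shingles_with_indicators(text: str, size: int = 5) -> list[str]:
    """Keep only the shingles that contain an IP-like string or a URL."""
    return [s for s in shingles(text, size) if IPV4.search(s) or URL.search(s)]


# Invented text standing in for strings extracted from a suspicious file.
extracted = ("connect to 203.0.113.7 on port 4444 then fetch "
             "http://example.test/payload.bin and run it")

flagged = shingles_with_indicators(extracted)
for shingle in flagged:
    print(shingle)
```

Because each shingle carries a few words of context around the address or URL, a match shows not only the indicator but also how it is used in the surrounding text.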
Nuix’s create, read and search functions for shingling are available programmatically through our APIs, so you can easily incorporate this type of textual analysis into your product.
Up next: shingling in action
In my next post, I’ll tell you about clustering, which uses shingling to group documents with similar content to allow for more efficient review and analysis. After that I’ll go into another use of shingling: near-duplicate analysis.
The article “Fruity Shingles: Identifying similar documents” by Andre Ross does a great job of explaining the theory and mathematics behind shingling.