Textual analytics: shingling

Large data sets usually contain many documents that are similar but not identical. Common examples include different revisions of a document that several people have worked on, and the same content in different formats such as a Word document and a PDF. To find these related documents, you can apply the process of shingling.

Shingling extracts the textual essence from each document and applies that pattern to a data set to find similar documents. This can help you speedily reduce the number of files to review or find the changes made between very similar files.

For example, say you wanted to find and deduplicate multiple copies of the same PDF file. Easy: You would simply search for the document’s cryptographic hash and that would locate all identical versions of that PDF across your data set.

However, this wouldn’t help you find the Word document the PDF was created from. Even though the PDF and the Word document have exactly the same textual content, they have different hash values. To find related documents, you need to compare document content, not exact binary make-up.
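
To make that distinction concrete, here is a minimal Python sketch. The file names and byte strings are invented stand-ins for real files, and SHA-256 is just one possible choice of cryptographic hash. Byte-identical copies share a digest, while the same sentence in a different container does not.

```python
import hashlib
from collections import defaultdict

def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(data).hexdigest()

# Invented stand-ins for file contents: two byte-identical PDFs and a
# Word file carrying the same sentence inside a different container.
files = {
    "report.pdf": b"%PDF-1.4 The quarterly report is attached.",
    "report-copy.pdf": b"%PDF-1.4 The quarterly report is attached.",
    "report.docx": b"PK\x03\x04 The quarterly report is attached.",
}

# Group file names by digest; any group with more than one member
# is a set of exact duplicates.
groups = defaultdict(list)
for name, data in files.items():
    groups[digest(data)].append(name)

duplicates = [sorted(names) for names in groups.values() if len(names) > 1]
print(duplicates)
```

The Word stand-in never joins the duplicate group, even though its text is the same, which is exactly the gap that content-based comparison fills.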

Mother with twin babies sitting at a window

Only their mother can tell them apart … shingling helps identify similar but not identical items.
Photo: Seattle Municipal Archives

Shingling, as the name implies, produces a series of overlapping phrases, typically five words each, in which every phrase shares words with its neighbors on either side. This is a very fast and efficient way to capture the content of each item and compare it with others.
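
As a rough illustration of the idea, the Python sketch below produces five-word shingles that advance one word at a time. The lowercasing, whitespace tokenization and one-word step are assumptions made for the example, not a description of how the Nuix Engine tokenizes text.

```python
def shingles(text, size=5):
    """Return every run of `size` consecutive words, advancing one word at a time."""
    words = text.lower().split()
    if len(words) < size:
        # Text shorter than one shingle: return it whole (or nothing if empty).
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]

result = shingles("the quick brown fox jumps over the lazy dog")
for phrase in result:
    print(phrase)
```

A nine-word sentence yields five shingles, and each one shares four words with the shingle before it.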

Once you have these shingles, you can start thinking about how many need to match before you consider two documents similar. Do the shingles need to be 100% identical or is 75% close enough? That percentage is called the “similarity coefficient” and you can adjust it to determine how exact a match you need.
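
One common way to turn shared shingles into a percentage is the Jaccard index: shingles in common divided by all distinct shingles. The sketch below assumes that measure purely for illustration; how Nuix calculates its similarity coefficient is not specified here.

```python
def shingle_set(text, size=5):
    """Return the set of overlapping `size`-word shingles in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(max(len(words) - size + 1, 0))}

def similarity(a, b, size=5):
    """Jaccard index: shared shingles divided by all distinct shingles."""
    sa, sb = shingle_set(a, size), shingle_set(b, size)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)

original = "the quick brown fox jumps over the lazy dog near the river bank"
revised = "the quick brown fox jumps over the lazy dog near the old mill"

score = similarity(original, revised)
print(round(score, 2), score >= 0.75, score >= 0.60)
```

Changing only the last two words leaves 7 of 11 distinct shingles in common, so this pair would be missed at a 75% threshold but found at 60%.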

How does Nuix do it?

During processing, if near-duplicates are enabled, the Nuix Engine generates shingles for each item. The shingles are stored in the Nuix index so you can use them later for searching and analysis.

Nuix provides a variety of ways to work with shingles:

  • View all the shingles in a collection of documents and filter them by keyword
  • View where in the document a shingle occurs
  • Export all the shingles extracted from a document or multiple documents
  • Search for documents with a specific similarity coefficient.

What is it good for?

Shingling has a wide range of uses. For example, you could use shingle lists as part of a records classification exercise. A challenge for any records management or information governance initiative is managing and classifying legacy information. It’s impractical to manually review several years’ worth of accumulated digital debris on network file shares or email archives.

But what if you collected documents that matched a given records series and then had Nuix create a shingle list? This list would act as a textual fingerprint for all the documents assigned to this records series. You could then use the shingle list to search across all the unclassified data in your organization, adjusting the similarity coefficient to dial up or down the specificity of the search.
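
The following Python sketch shows the general shape of that workflow. Everything in it is hypothetical: the example texts, the three-word shingle size (shortened so the example stays small), the scoring rule and the threshold are assumptions for illustration rather than Nuix behavior.

```python
def shingle_set(text, size=3):
    """Return the set of overlapping `size`-word shingles in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(max(len(words) - size + 1, 0))}

def build_fingerprint(examples, size=3):
    """Pool the shingles of every document already assigned to the records series."""
    fingerprint = set()
    for text in examples:
        fingerprint |= shingle_set(text, size)
    return fingerprint

def match_score(text, fingerprint, size=3):
    """Fraction of the document's shingles that also appear in the fingerprint."""
    found = shingle_set(text, size)
    return len(found & fingerprint) / len(found) if found else 0.0

classified = [
    "invoice number 1001 payment due within thirty days",
    "invoice number 1002 payment due within sixty days",
]
unclassified = {
    "a.txt": "please note payment due within thirty days of receipt",
    "b.txt": "lunch menu for the staff party on friday",
}

fingerprint = build_fingerprint(classified)
threshold = 0.25  # lower it to cast a wider net, raise it to be stricter
matches = [name for name, text in unclassified.items()
           if match_score(text, fingerprint) >= threshold]
print(matches)
```

Raising or lowering `threshold` plays the role of dialing the specificity of the search up or down.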

You could apply a similar process to identify content in your data set that was taking up space but had no business value. Instead of building a shingling pattern based on records, you could build a couple of loosely defined patterns based on the unnecessary content and rapidly clear large volumes of that data from your system.

Want to protect yourself against malware? Shingling can help with that too. Antivirus systems are designed to protect against known threats, but advanced malware continually evolves and in some cases programmatically mutates to try to stay ahead of detection. Fortunately, the Nuix Engine extracts text from thousands of different file types, not just office documents and emails. This means that Nuix can look inside the text of potential malware and use shingles to search for strings or patterns, such as names of people, IP addresses, URLs or even the individual calling cards of hackers.

Overlapping wooden shingles on a roof

Like a roof, overlapping word shingles ensure complete coverage. Photo: Eric Verspoor

Nuix’s create, read and search functions for shingling are available programmatically through our APIs, so you can easily incorporate this type of textual analysis into your product.

Up next: shingling in action

In my next post, I’ll tell you about clustering, which uses shingling to group documents with similar content to allow for more efficient review and analysis. After that I’ll go into another use of shingling: Near-duplicate analysis.

Further reading

The article “Fruity Shingles: Identifying similar documents” by Andre Ross does a great job of explaining the theory and mathematics behind shingling.

Posted in Developers

Want to kick goals with Office 365? Don’t forget the archive

Over the last month you may have been enjoying the FIFA World Cup, which many people say is the world’s greatest sporting event. Not me. I’ve been getting amped up for the 2014 Microsoft Worldwide Partner Conference (WPC), which is kind of like the World Cup for Microsoft partners. Like the German strikers in the semi-final against Brazil, WPC is off to a fast start!

Crowd scene at the 2014 FIFA World Cup final

Jubilant scenes reminiscent of the Microsoft Worldwide Partner Conference. Photo: Danilo Borges, Brazilian Federal Government World Cup Portal

This is my sixth straight WPC and my first with Nuix. Like many of you, we are talking a lot about accelerating adoption of Microsoft Exchange 2013 and Office 365. Our booth has been busy (come check us out at booth # 2004) and I’ve had all manner of email-related conversations.

It is clear to me that many Microsoft partners have developed deep expertise around email server migrations but not as many know a lot about migrating the email archives most organizations run alongside their email servers.

Migrating email archives is a critically important component of any successful overall email migration, especially to the cloud. So it has concerned me to hear so many misconceptions about the process and technology here in DC. Here are the top three misconceptions to be aware of:

Misconception #1: Email archive APIs were designed to extract legacy data. FALSE!

The Truth: The major on-premises archive vendors, including Symantec, EMC and HP Autonomy, offer APIs for ingesting data into their platforms, but these APIs were not designed for extraction, especially not extraction of large volumes of data.

Misconception #2: Certain email metadata can only be captured from the email archive using the API. FALSE!

The Truth: If you have the right technology, it is possible to capture 100% of the information you need, BCCs and all, without using the API.

Misconception #3: Faster email archive migration is at the expense of data integrity. FALSE!

The Truth: Speed has no bearing on data integrity or chain of custody during email archive migration. You can migrate as fast (or slow) as you like as long as you track every single individual item, manage 100% of exceptions and can make the numbers match up at the end.
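
A reconciliation check along those lines can be sketched in a few lines of Python. The item identifiers and the shape of the report are hypothetical; the point is only that every source item must end up either migrated or logged as an exception.

```python
def reconcile(source_ids, migrated_ids, exception_ids):
    """Compare what was in the archive with what was migrated or logged as an exception."""
    source = set(source_ids)
    migrated = set(migrated_ids)
    excepted = set(exception_ids)
    unaccounted = source - migrated - excepted   # items that silently went missing
    unexpected = (migrated | excepted) - source  # items that were never in the source
    return {
        "source": len(source),
        "migrated": len(migrated),
        "exceptions": len(excepted),
        "unaccounted": sorted(unaccounted),
        "balanced": not unaccounted and not unexpected,
    }

report = reconcile(
    source_ids=["m1", "m2", "m3", "m4", "m5"],
    migrated_ids=["m1", "m2", "m4"],
    exception_ids=["m3"],  # hypothetical: logged as unreadable and handled separately
)
print(report)
```

Here item `m5` is neither migrated nor explained, so the numbers do not match up and the run is flagged as unbalanced, however quickly it completed.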

In all three cases, the exact opposite of what many people believe is actually true.

The major reason email archive migrations fail, or take years longer than they should, is that most migration technology vendors still rely on APIs to extract the data.

If you are thinking about using an API-based technology to extract data from your legacy archive, consider these four big problems:

  1. If you are going to use your legacy archive API, your legacy archive has to be perfectly healthy—that means accurate indexes and healthy databases. Around 75% of on-premises archive owners say their archive is not as well maintained as they would like.
  2. API-based extraction tools rely on your archive index to know where all of the data is. What happens when your indexes are corrupt? The API-based tool will miss that data.
  3. Equally bad, if the archive’s databases are corrupt, using the API will scramble data that is healthy on your file system during extraction.
  4. Finally, the API is inherently a bottleneck because it only allows you to make one connection per server. This leads to projects taking months and months for relatively small amounts of data. We’re talking a year for 20 terabytes.
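
To put "a year for 20 terabytes" in perspective, the short calculation below converts it to a sustained transfer rate. It assumes decimal units (1 TB = 10^12 bytes) and a migration that runs around the clock for 365 days.

```python
TB = 10**12                   # decimal terabyte, in bytes (an assumption)
total_bytes = 20 * TB
seconds_per_year = 365 * 24 * 60 * 60

rate_mb_per_s = total_bytes / seconds_per_year / 10**6
gb_per_day = total_bytes / 365 / 10**9

print(round(rate_mb_per_s, 2), "MB/s")
print(round(gb_per_day, 1), "GB/day")
```

That works out to roughly 0.63 MB per second, or about 54.8 GB per day.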

Like many vendors, when Nuix started working on email archive migration, we used the archives’ APIs. However, after experiencing all of the issues above, we applied our patented engine towards a new approach that eliminated the API and all of the associated problems. We call that approach “binary extraction.”

Since making that move, we have completed more successful archive migration projects than any other vendor. We’ve migrated more data to Office 365 than anybody (petabytes!) and replaced dozens of API-based archive migration tools. About one of every three Nuix Intelligent Migration projects is actually a replacement of an API-based tool.

So, now you know. If you’re getting ready to help a client migrate to Office 365 or Exchange 2013, don’t forget to plan for the email archive migration and make sure to skip the API.

Posted in Email and Archive Migration

The investigator as storyteller: sharing the passion

Storytelling is an art form that existed before written language. It often included elements of painting or interpretive dance. (So glad that is no longer a part of the process; can you imagine the Shadow Volume Copy Shuffle, the MAC-a-rena, or the Malware Wobble?).

Two dancers performing an interpretive dance

Dancers perform the Hex-dump Hora. Photo: University of Wisconsin Digital Collections

The essence of storytelling has always been to communicate something to an audience in a manner they will understand and internalize. It should hold meaning for them that transcends the story itself and cross over to a place where knowledge and emotion are being transferred from the storyteller to the audience.

OK, so that may sound a bit touchy-feely for you, but let me explain in a manner that is more conducive to the digital forensics and incident response (DFIR) world.

Why do we do what we do? Specifically, why do we investigate data breaches and digital crimes? We don’t have to … there are plenty of other jobs out there, so why this particular job in this particular field? Personally, I think it’s got something to do with the way we are wired. We’re all put together a bit differently than everybody else, which is a good thing! Otherwise, who would fight the powers of cyber evil!?

My point is that, presumably, you chose this particular field because you are passionate about it. You love being a DFIR investigator and could not even imagine doing anything else (except maybe selling hot dogs outside a Lowe’s hardware store … that’s my next career when I retire … seriously). So, you have to use that passion along with your technical knowledge to bring the evidence to life for your target audience.

Don’t simply cut and paste evidence from your forensic workstation into your report; talk about it! Tell the reader why it’s there … why, of all of the other possible pieces of data you looked at, you chose to include this one. What does this finding mean to the overall case? The more you can elaborate and provide context and relevance, the better your audience will read and receive your report.

Every incident has the potential of being a story; it likely already has all of the elements that a good story needs.

Writing the next Hollywood blockbuster

The general rules of storytelling break down into five elements: setting, plot, characters, conflict and theme. Here is an example:

  • Setting: Company X, in Anytown, USA.
  • Plot: Company X has been infiltrated by cybercriminals who have stolen critical data (like every other episode of Agents of S.H.I.E.L.D.).
  • Characters: You, the cyber criminals, law enforcement, the client.
  • Conflict: Critical data has been stolen and the company is facing the backlash of disclosing the breach to the public.
  • Theme: Security is important and cannot be an afterthought. Failing to implement proper IT and security hygiene can be far more costly than proactive implementation.

Not every one of your cases will make the New York Times bestseller list but they will most certainly be a “bestseller” to the people who were impacted by the crime. They are extremely interested in your report and what it says! It’s their data, their breach, their crime, their story, and it needs to be told in a manner they can understand. They deserve answers to the questions I’m sure you have heard as many times as I have: “Why me?”, “How did this happen?”, “Where did my data go?”, “How can I protect myself from being a victim again?”

Tell the true crime story

By conveying the technical details of the case, incorporating the “so what” factor and using language your audience can understand and ingest, you can take your reporting to an entirely new level of effectiveness.

In my opinion, this is the single most difficult skill for the investigator to master, and there is no tool that will do it for you. You must be able to think your way through this process manually in a logical, methodical way and present your findings in a manner your target audience will receive and understand.

Second-hand books on a street stall

Your true crime story might not make the bestseller list but the people who need it will appreciate it.
Photo: Geraint Rowland

My challenge to you is to look at your next report as an opportunity to be a storyteller. Don’t just regurgitate a series of data points; tell the story of a crime. It doesn’t matter if it’s a credit card breach, PII theft, breach of contract, or a violent crime … that report is tremendously important to the victims of that crime and subsequently to the litigators, judges, juries, business owners, executives and board members who were impacted. So tell the story with the same level of tenacity and enthusiasm that got you into this field to begin with.

Good luck! #changingthehunt

Posted in Cybersecurity, Digital Investigation

Textual analytics: image and multimedia analysis

Images and multimedia are generally difficult types of unstructured data to analyze. They are full of metadata and visual information, and any text they contain requires optical character recognition to extract. However, the metadata and other information contained in photos, videos and other types of media can dramatically boost your analytical capabilities.

Don’t be surprised when I tell you the Nuix Engine can analyze images and multimedia, extract metadata and enable you to put that metadata to work when searching a data set.

For example, say you’re looking for inappropriate images taken at an Australian beach with a BlackBerry. After ingesting images into Nuix, you can use the GPS and exchangeable image file (EXIF) format metadata to locate where in the world each image was taken and the type of device that took the picture. By searching for these particular metadata, you can quickly reduce the result set.
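
In code, that kind of search amounts to filtering on two metadata attributes at once. The Python sketch below uses invented records, hypothetical field names and a deliberately rough latitude/longitude box for Australia; it does not use the Nuix API or read real EXIF data.

```python
from dataclasses import dataclass

@dataclass
class ImageMeta:
    name: str
    make: str    # camera or phone maker, as recorded in the EXIF data
    lat: float   # GPS latitude in decimal degrees
    lon: float   # GPS longitude in decimal degrees

# Rough bounding box for Australia (min_lat, max_lat, min_lon, max_lon).
# An approximation chosen for the example, not a precise border.
AUSTRALIA = (-44.0, -10.0, 112.0, 154.0)

def in_box(meta, box):
    """True if the image's GPS position falls inside the bounding box."""
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= meta.lat <= max_lat and min_lon <= meta.lon <= max_lon

images = [
    ImageMeta("beach.jpg", "BlackBerry", -33.89, 151.28),
    ImageMeta("office.jpg", "BlackBerry", 40.71, -74.01),
    ImageMeta("surf.jpg", "Canon", -28.00, 153.43),
]

hits = [m.name for m in images
        if m.make.lower() == "blackberry" and in_box(m, AUSTRALIA)]
print(hits)
```

Only the image that satisfies both conditions survives, which is how two cheap metadata filters shrink a large result set.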

A photo of an empty beach with EXIF and GPS metadata about the image

A picture taken with a BlackBerry in Sydney, Australia 42 meters above sea level, and the accompanying metadata in Nuix.

How does Nuix do it?

During processing, the Nuix Engine extracts and stores metadata for images and multimedia files. You can then use these metadata attributes for searching, analysis and reporting. Metadata categories include:

  • Color count
  • Color range
  • EXIF format
  • GPS location
  • Image size
  • Skin tone.

Nuix’s skin tone analysis is one of the most powerful features in image analysis. To analyze skin tone, the Nuix Engine performs a statistical analysis on the color of each pixel in an image file. Based on the percentage of flesh tones within the image, Nuix gives the image a numerical score between 0 and 1.01. You can sort and filter images using this score to find those with a large percentage of skin tones.
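
To give a feel for a per-pixel analysis like this, the Python sketch below scores an image as the fraction of its pixels that pass a simple RGB rule of thumb. The rule, its thresholds and the sample pixels are assumptions chosen for illustration; the actual statistical analysis Nuix performs is not described here.

```python
def looks_like_skin(r, g, b):
    """A simple RGB rule of thumb for flesh tones (illustrative only)."""
    return (r > 95 and g > 40 and b > 20
            and max(r, g, b) - min(r, g, b) > 15
            and abs(r - g) > 15 and r > g and r > b)

def skin_score(pixels):
    """Fraction of pixels classified as flesh tone, from 0.0 to 1.0."""
    pixels = list(pixels)
    if not pixels:
        return 0.0
    return sum(looks_like_skin(*p) for p in pixels) / len(pixels)

# Four sample (R, G, B) pixels: two warm tones, one blue, one near-white.
sample = [(224, 172, 105), (198, 134, 66), (30, 60, 200), (250, 250, 250)]
score = skin_score(sample)
print(score)
```

Sorting a set of images by a score like this puts those dominated by flesh tones at the top of the review queue.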

Several images of people; in the Document Navigator the skin tone selection is set to medium

Filtering images by skin tone.

We even make videos easy to review. During ingestion, Nuix converts videos into storyboards of thumbnail images. The thumbnails are taken from key frames and combined into one image, allowing you to see the overview of an entire movie in a single image.

Do you have hundreds of images taken by phones? The ubiquity of mobile devices and cellular phone networks means that more devices include geospatial information on all the pictures they take. Nuix extracts this information and makes it readily searchable, mappable and exportable.

Using image size, color count and color range, it becomes much easier to differentiate photographs from company logos, PowerPoint slides or faxed information. For example, if you’re looking for a faxed image, you could search for an image with a color range of black and white.
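
A hedged Python sketch of that triage follows. The labels, the 16-color cutoff and the synthetic pixel lists are all assumptions for the example; Nuix's own color count and color range attributes may be defined differently.

```python
def color_count(pixels):
    """Number of distinct (R, G, B) values in the image."""
    return len(set(pixels))

def is_grayscale(pixels):
    """True if every pixel has equal red, green and blue components."""
    return all(r == g == b for r, g, b in pixels)

def guess_kind(pixels, few_colors=16):
    """Very rough triage based only on color range and color count."""
    pixels = list(pixels)
    if is_grayscale(pixels):
        return "fax"
    if color_count(pixels) <= few_colors:
        return "logo"
    return "photo"

# Synthetic pixel lists standing in for real images.
fax = [(0, 0, 0), (255, 255, 255)] * 50
logo = [(200, 0, 0), (255, 255, 255), (0, 0, 120)] * 30
photo = [(i, (i * 7) % 256, (i * 13) % 256) for i in range(256)]

kinds = [guess_kind(fax), guess_kind(logo), guess_kind(photo)]
print(kinds)
```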

Nuix’s create, read and search functions for image and multimedia analysis are available programmatically through our APIs, so you can easily incorporate this type of analysis into your product.

Up next: shingling

In my next post, I’ll tell you about shingling, which allows you to extract the textual essence from a collection of documents using a series of overlapping phrases and reuse that pattern repeatedly across an entire data set. Shingling makes clustering, near-duplicate identification and email threading possible.

Posted in Developers

The investigator as storyteller: “so what?”

There is no question that the modern incident responder or digital forensics investigator has to be many things: Mediator, technical advisor, auditor, project manager and even counselor (I have actually had clients cry on my shoulder). Likewise, you need myriad skills including knowledge of Windows (and all her variants), Linux, networking, embedded applications, mobile devices, malware and research, just to name a few.

However, one of the most important roles the investigator or responder can play is storyteller.

Having the ability to take difficult technical concepts, and communicate them to a specific target audience in a manner they can understand, is of the utmost importance. It cannot be overshadowed by the other more technical skills the investigator needs.

Garrison Keillor at 2011 National Book Festival, Washington DC

Want to become a better investigator? Learn to tell a story. Photo: Ryan Somma

Translating the technical

What we as investigators do is very technical. We pore over timelines, registry hives, memory, malware, and file systems and use terms like $MFT, binary, shingling, hex encoding and unallocated clusters. For those of us who live in the digital forensics and incident response (DFIR) world these are common terms that make total sense to us (in the same way that I wouldn’t need to spell out DFIR if I was writing for an exclusively investigator audience).

For those “other” people, who do not live in our little world, we might as well be speaking Klingon for all the sense we are making. We open our mouths and try to explain what we are doing, and you can see a haze settle over their faces (kind of like the Neutral Zone) and they are undoubtedly wondering what’s for lunch (thanks for the analogy, Troy Larson).

So, apart from having the technical acumen to do our jobs, we need the verbal and written communication skills to articulate what we did, why we did it and what it means to an audience who very likely has no idea what we are talking about but desperately wants to know. This is the essence of being a really good investigator. Without this skill, I’m afraid you will never cross the threshold of mediocrity.

So what?

During my tenure at IBM/ISS I worked closely with Harlan Carvey, who knows a thing or two about writing. One of the best things he taught me was something one of us dubbed the “so what” factor. I can remember submitting reports to him for review and getting them back with the words “So what?” written in the margin just to the right of my findings. It would frustrate me to no end! I would think, “What the heck does that mean!? Just tell me what I’m missing here and I’ll fix it!”

Looking back, I can see the lesson he taught me, and the wisdom in it. He was trying to get me to think about my findings and why I was including them in my report. Why did these particular findings make it into the report while others didn’t? What’s so special about these findings? Since then, I have trained a few other investigators, and have used the same frustrating lesson. I challenged them to elaborate and be descriptive; tell the reader, “so what.”

The term “so what,” I came to learn, meant something very profound that every DFIR investigator needs to understand and appreciate. These three points can give your report writing a shot in the arm in terms of readability and subsequent customer satisfaction. I strongly suggest you employ them:

  • What is the relevance of the finding?
  • What does it mean in the context of the evidence?
  • Why is it important to the overall investigation?

I can’t tell you how many reports I’ve read that included some technical finding that I’m sure the investigator thought was important but neglected to provide the details around the finding that would make it relevant to the reader. Unfortunately, this can marginalize even the most critical of findings into little more than technical jargon that gets glossed over.

I have even been hired to interpret other firms’ reports and found them full of generalities, incomplete thoughts and evidence with no apparent relevance to the case, and ultimately without a logical conclusion. Even though I’ve been an investigator for the better part of the past 14 years, and knew what they were trying to communicate, I still had a hard time making sense out of some of these reports. It’s not that the investigation was done improperly; it was that the investigator was simply not a good writer.

Consider the audience’s needs

That’s kind of a big deal, since all of us are ultimately communicating to a key stakeholder who needs to know and understand the specific details of our investigations.

Remember, the audience we are writing for is largely non-technical. This includes such groups as litigators, judges, juries and business owners, as well as executives and board members. These people are by no means unintelligent; they may be medical doctors or PhDs, lawyers, MBAs or other experts in their fields. They are simply not as technical as we investigators are and need some handholding.

When they read terms like “the stack,” “last write times” and “portable executable,” they are likely thinking about the laundry they need to fold when they get home, the last email they wrote and a pocket guillotine.

Toy guillotine

Remember your audience may not think “portable executable” means what you think it means.
Photo: davidd

So, clearly, the responsibility falls to us to become the storyteller and convey precisely what took place in a manner easily understood by our target audience.

Coming next: share your passion

Next time around I’ll continue this discussion with some more ideas about sharing the passion you have for investigating data breaches and digital crimes with the people who are paying your invoices.

Posted in Cybersecurity, Digital Investigation

Textual analytics: named entity extraction

Named entity extraction is a powerful way to overcome the major challenge of unstructured data: The fact that it’s unstructured.

Let me give you an example. Let’s say you wanted to extract a list of IP addresses from the database of your IT department’s asset management system. That’d be pretty easy: it would most likely be in a database field called “IP_address.” You’d write a simple query and get the list.

Extracting IP addresses from a web server log file is a little tricky, but it’s made easier by the fact that each log file entry follows a predictable structure—generally you can extract the IP address by counting the number of spaces, commas or other delimiters.
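
For instance, assuming a log layout in which the client address is the first space-delimited field (as in the widely used Common Log Format), a minimal Python sketch looks like this. The log lines are invented and use reserved documentation addresses.

```python
# Invented log entries in a Common Log Format style layout.
LOG_LINES = [
    '203.0.113.9 - - [10/Jul/2014:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '198.51.100.24 - - [10/Jul/2014:13:56:02 +0000] "POST /login HTTP/1.1" 302 0',
]

def client_ip(log_line):
    """Return the first space-delimited field, where this layout puts the client address."""
    return log_line.split(" ", 1)[0]

ips = [client_ip(line) for line in LOG_LINES]
print(ips)
```

This works only because every entry follows the same structure; nothing in the code recognizes what an IP address looks like.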

But what if you wanted to extract IP addresses from a bunch of emails or random documents on a file share? First you’d have to normalize them so you could search their contents, but then what?

If you’re already shouting out “regular expressions,” you’re way ahead of me. Regular expressions are an amazingly powerful and flexible way to identify text that follows a particular pattern. For example, IPv4 addresses are always four groups of between one and three digits, separated by periods. A common way to capture that in a regular expression is:
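
One widely used form of that pattern is sketched below with Python's standard `re` module. The sample sentence is invented, and this is not necessarily the exact expression Nuix ships with.

```python
import re

# Four groups of one to three digits, separated by literal periods,
# with word boundaries so the match does not start or end mid-number.
IPV4 = re.compile(r"\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b")

text = "Server 192.168.1.10 forwarded the mail to 10.0.0.254, per ticket 4471."
found = IPV4.findall(text)
print(found)
```

Note that this simple form checks only the shape of the text, so it also accepts impossible addresses such as 999.1.1.1; a stricter expression would constrain each group to 0 through 255.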


Nuix’s named entity extraction is the process of identifying patterns in your normalized text using regular expression searches.

How does Nuix do it?

During processing, the Nuix Engine evaluates each extracted item against all configured named entities—these include Nuix’s out-of-the-box entities and any you define yourself. It records each match and stores the matching value in our index as a metadata attribute of the item. You can then use that metadata attribute for searching, analysis and reporting.
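
Conceptually, that loop can be sketched in Python as follows. The entity names, the patterns (including the made-up `PN-` part number format), the sample items and the dictionary standing in for an index are all hypothetical; they illustrate the idea rather than Nuix's configuration format or API.

```python
import re

# Hypothetical entity definitions: two general-purpose patterns and
# one custom pattern for an invented part number format.
ENTITIES = {
    "ip-address": re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "part-number": re.compile(r"\bPN-\d{5}\b"),
}

def extract_entities(text):
    """Evaluate the text against every configured entity and keep the distinct matches."""
    found = {}
    for name, pattern in ENTITIES.items():
        matches = sorted(set(pattern.findall(text)))
        if matches:
            found[name] = matches
    return found

items = {
    "mail-001": "Ask jo@example.com why 10.1.2.3 ordered PN-00417 twice.",
    "memo-002": "No identifiers here.",
}

# Store the matches against each item, like metadata attributes in an index.
index = {item_id: extract_entities(text) for item_id, text in items.items()}
print(index)
```

Once the matches are stored per item, finding every item that mentions a given address or part number becomes a lookup rather than a fresh scan of the text.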

Viewing the default named entities in Nuix.

Nuix’s out-of-the-box named entities cover a wide range of basic patterns:

  • Company names
  • Countries
  • Credit card numbers
  • Email addresses
  • IP addresses
  • Personal identity numbers
  • Sums of money
  • URLs

But you’re not limited to those eight. For example, you might be looking for unique word or number patterns that indicate:

  • Phone numbers for an investigation
  • Patent numbers for intellectual property
  • Healthcare codes for HIPAA compliance
  • Part numbers for a manufacturing dispute.

If you can define a regular expression to find it, Nuix’s parallel processing engine can locate it.

A results list for IP address named entities in Nuix.

Named entities give you a powerful way to quickly scan data for private information, financial data, company names and many more. From unstructured data, you now have the information you’re seeking in an easily searchable format from which you can power your analytics.

Nuix’s create, read and search functions for named entities are available programmatically through our APIs, so you can easily incorporate this type of textual analysis into your product.

Up next: image and multimedia analysis

In my next post, I’ll tell you about image and multimedia analysis, which enables you to find and sort pictures and scanned documents using criteria such as skin tone, color depth, metadata and thumbnail images from videos.

Posted in Developers

Photos from the Nuix Insider Roadshow Sydney

Photos from the Nuix Insider Roadshow event held in Sydney on June 19, 2014.

If you’d like to attend our upcoming events in Silicon Valley on July 22 or Chicago on July 24, register your interest here.

Posted in Nuix
