Textual analytics: auto-classification

In computer science and machine learning circles it’s called “supervised learning.” In the eDiscovery space it’s called “predictive coding” or “technology-assisted review.” No matter what you call it, auto-classification is really just teaching a computer to assign tags to documents.

This type of machine learning requires a tremendous amount of math and computer science. These technologies can use a huge range of models, algorithms, and approaches. This may seem like an intimidating black box, but it doesn’t need to be.

Have you ever told your email client that a specific message is spam? Then you’ve already used auto-classification technology. By telling the filter that this email is spam, you are training the model. The more items that you and other users tag as spam, the better the model gets.

How does Nuix do it?

Nuix’s approach has always been to get the industry familiar with auto-classification. Approachable and affordable technology has a better chance of moving into the mainstream and encouraging more people to embrace the opportunities of machine learning.

This approach has also been designed with portability and scale in mind, letting you reuse the same model across multiple data sets and scale linearly as the number of documents in the set increases.

To be accessible to more people, we avoid the term “predictive coding” as this represents a very specific technology use case. Predictive coding models are often trained on the facts of a specific case, which prevents reuse. Nuix technology is designed to deliver auto-classification of unstructured documents, so we work from a data mining perspective, not predictive coding.

As a result, we embraced the Naive Bayes classifier model because of its reliability and industry acceptance. We use the Predictive Model Markup Language (PMML) to export and reuse our auto-classification models.

If you have a collection of documents and you want to “find more like these,” Nuix’s auto-classification makes it easy. Auto-classification can be as simple or complex as is required by your use case, but at its core auto-classification requires only a few simple steps:

  1. Sample: Select a collection of items to use as your training set.
  2. Train: Have someone knowledgeable on the topic classify each item in the training set.
  3. Build: Use Nuix to build a model.
  4. Repeat: Repeat the above steps until the model returns results that are accurate enough for your needs.
  5. Apply: Apply the model to the rest of the items in the collection.
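The train/build/apply loop above can be sketched with a tiny Naive Bayes classifier, the model family mentioned earlier. This is a pure-Python illustration with invented training examples, not Nuix's implementation:

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """labeled_docs: list of (text, label) pairs coded by a knowledgeable reviewer.
    Returns class log-priors, per-class word counts, and the vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    total = sum(class_counts.values())
    priors = {c: math.log(n / total) for c, n in class_counts.items()}
    return priors, word_counts, vocab

def classify(text, priors, word_counts, vocab):
    """Pick the class with the highest log-probability, with Laplace smoothing
    so unseen words don't zero out a class."""
    scores = {}
    for c in priors:
        denom = sum(word_counts[c].values()) + len(vocab)
        score = priors[c]
        for word in text.lower().split():
            score += math.log((word_counts[c][word] + 1) / denom)
        scores[c] = score
    return max(scores, key=scores.get)

training = [
    ("cheap meds buy now", "spam"),
    ("limited offer buy cheap", "spam"),
    ("meeting agenda for tuesday", "ham"),
    ("please review the attached contract", "ham"),
]
model = train(training)
print(classify("buy cheap meds", *model))  # → spam
```

Applying the model to the rest of the collection (step 5) is just a loop calling `classify` on each remaining item.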


Nuix provides a variety of methodologies to create a training set of sample documents.

  • Random: Nuix randomly selects a percentage of items as part of the sample set.
  • Seeded: Use the Nuix Query Syntax to select a more focused group of documents.
  • Putting it all together: Nuix randomly samples cluster pivot documents and unclustered items to ensure a high level of uniqueness in the training set.
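As a sketch, the random and seeded strategies might look like the following; the `build_training_set` helper and its parameters are hypothetical, for illustration only:

```python
import random

def build_training_set(items, pct=5.0, query=None, seed=42):
    """Hypothetical helper: 'Seeded' narrows the pool with a query predicate,
    then 'Random' samples a percentage of the (possibly narrowed) pool."""
    pool = [i for i in items if query(i)] if query else list(items)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = max(1, round(len(pool) * pct / 100))
    return rng.sample(pool, k)

docs = [f"doc-{n}.msg" for n in range(50)] + [f"doc-{n}.pdf" for n in range(50)]
print(len(build_training_set(docs, pct=10)))  # → 10
print(len(build_training_set(docs, pct=10, query=lambda d: d.endswith(".msg"))))  # → 5
```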


For each auto-classification model, a human reviewer makes a binary decision (true/false or match/not match) for each item, which makes it possible to gauge the accuracy of the model.


Nuix looks at the coding decisions for various items, analyzes each item’s content, removes any items that don’t have sufficient text for a proper analysis, and creates the model.

Nuix then presents the results of the model as an industry-standard confusion matrix with the precision and recall scores for the model.

The Automatic Classifier Confusion Matrix dialog box shows the item count, confusion matrix, and statistics for this build.

The confusion matrix uses a grid to compare the automatic classifier’s relevant and irrelevant predictions against the actual reviewer decisions.
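Precision and recall fall directly out of the four confusion-matrix cells. A minimal sketch, with invented example labels:

```python
def confusion_matrix(predicted, actual):
    """Count true/false positives and negatives from parallel label lists."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    tn = sum(1 for p, a in zip(predicted, actual) if not p and not a)
    return tp, fp, fn, tn

# True = "relevant"; the model's predictions vs. the reviewer's decisions.
predicted = [True, True, True, False, False, True, False, False]
actual    = [True, True, False, False, True, True, False, False]
tp, fp, fn, tn = confusion_matrix(predicted, actual)
precision = tp / (tp + fp)  # of items the model called relevant, how many were
recall    = tp / (tp + fn)  # of truly relevant items, how many the model found
print(precision, recall)    # → 0.75 0.75
```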


Repeat the training process until the model has achieved the required precision and recall results. If you are using auto-classification to guide a process or suggest a tag, you may not need high precision and recall scores. If you’re not submitting the results as evidence, you probably won’t need to worry about extreme precision.


Select the rest of the items in the collection and select the Automatically Classify Items option. Nuix will use the model you built to evaluate each item and classify it as either matching or not matching the model.

What is it good for?

Pre-tag documents as part of a litigation support workflow

With Nuix’s auto-classification methodology you can run every case you process through a series of predefined auto-classifiers. These are trained generically to provide suggestive tagging or enriched metadata. For example, a quick review could tag items as spam, junk mail, or system alerts. Or you can reuse models from previous cases that were trained based on decisions made in those cases. This doesn’t replace a human review, but it can accelerate the process by allowing reviewers to triage how they review the documents.

Build institutional knowledge into investigations

With each new case, investigators must rely on the lessons of the past. They leverage their years of experience to find patterns in data, read between the lines in communication, and “sniff out” evidence of crime, malpractice, or fraud. It’s a stretch to say that a machine can replace a human when it comes to this type of intuition. But imagine that an investigator took all of the key evidence in a case and added it to an auto-classifier. Would it automatically find the smoking gun? Probably not. But by jump-starting the process, it could save valuable time.

Continuously train your model against the latest decisions

In any auto-classification review process, the more training data you feed the model, the more accurate it becomes. So why not continuously train the model and reapply it as new decisions come in?

Since Nuix’s auto-classification model is based on individual document decisions, it scales linearly as the number of records increases. Many other algorithms, most notably latent semantic indexing, are vector-based. This means each time you add data, there is a lengthy process to recalculate all the relationships between the items. With Nuix’s parallel processing framework and the fact that each item is evaluated individually, Nuix can scale to rapidly reapply the model across any size data set.
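One reason a count-based model like Naive Bayes scales this way: folding in a newly coded document only increments a handful of counters, rather than recomputing relationships across the whole set. A sketch of that incremental update (illustrative only, with invented documents):

```python
from collections import Counter, defaultdict

class IncrementalModel:
    """Count-based model state: updating with one document touches only
    that document's own words, so training cost grows linearly with volume."""
    def __init__(self):
        self.class_counts = Counter()
        self.word_counts = defaultdict(Counter)

    def update(self, text, label):
        self.class_counts[label] += 1
        for word in text.lower().split():
            self.word_counts[label][word] += 1

model = IncrementalModel()
model.update("quarterly gas indices attached", "relevant")
model.update("happy hour on friday", "not relevant")
model.update("updated gas indices for december", "relevant")
print(model.class_counts["relevant"])            # → 2
print(model.word_counts["relevant"]["indices"])  # → 2
```

A vector-based approach, by contrast, would need to revisit the relationships between all items after each addition.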

Up next: topic modeling

In my next post I’ll talk about topic modeling—an unsupervised learning technique where Nuix tells you what’s in your data.

Posted in Developers

Imagining the future at ILTA 2014

ILTA is over for another year and as I head home, I thought I’d share some quick reflections.

ILTA is where the legal community comes together to imagine the future and learn from each other’s experiences. It really shows how small the global legal technology community is. While there are always some new faces, it’s the once-a-year catch-up for the legal support family from all corners of the globe. It seems to me that the family is growing—this year seemed busier than most, especially on the exhibition floor.

Kevin the camel poses in front of the ILTA 2014 welcome sign

ILTA attracted members of the legal support family from all corners of the globe. Photo: Angela Bunting

So what were my take-aways…?

Discovery is always evolving

My first observation is that discovery, in the true sense of the word, is alive and making a comeback. As one attorney eloquently put it, when we started doing discovery it was a simple matter of identifying just 100 or so documents. Then we moved to reviewing thousands or millions of documents in the hope of finding those 100 items. Now we’re looking for intelligent ways to find those 100 documents quickly from a sea of corporate data.

There didn’t seem to be much in the way of new products or players in discovery, but visualizations and analytics were the talk of the show. From the discussions I had, the consensus was that only a handful of clear leaders had usable solutions in this area. Getting eyes on the data in meaningful ways earlier allows discovery, in the literal sense, to be the focus once again. This is essential as the data becomes larger and more diverse.

A cloudy future?

Storing data in the cloud is already a reality but we must look to the future. There were plenty of great new technologies on show for savvy tech professionals who are responsible for organizing and storing information. But while we rush to solve today’s storage issues, we might run into bigger problems in the future. Sure, it’s easy to get data into the cloud, but how easy will it be to get it out again when required?

While we are starting to see some connectors to the cloud for traditional discovery products, I think they will still face the same issues we see in software-as-a-service models today. Data is simply getting too large to drag across the wire from one location or datacenter to another. The winning solution will make it possible to do discovery in place next to the data in the cloud, allowing firms to conduct eDiscovery proportionally and only extract and produce the documents that are really required.

Data is king

Everything we do online creates a digital record, and now is the time to start analyzing it. What could your data reveal? With that insight, what behaviors would you change that would make a difference?

For example, cybersecurity has been a hot topic in the news recently, and this has led many firms to realize they could be easy targets because they hold some of their clients’ most sensitive data. Understanding what sensitive data you hold and where you keep it are the first steps in focusing your efforts on securing the most important data, rather than doing an average job of security across all of it.

Convergence is coming

Firms are realizing how many of their business problems start with or rely on the same unstructured data they handle in discovery matters. And they’re starting to think: once we’ve indexed this data, what else could we use it for?

Many discovery projects are born from a very reactive mindset, and most organizations don’t derive much more value once they’ve put out that particular fire. Do we ever ask, “If the data looked that bad for our discovery project, what does the rest of the corporate information look like?”

Leading firms will start sharing these valuable insights about risky data, privacy concerns and information that could be taken off hold and defensibly deleted. This kind of advice will feature heavily as firms try to diversify their traditional roles and work in deeper collaboration with their clients. If this topic interests you, I recommend reading our new white paper, It’s All About the Data.

Most of all, ILTA was a lot of fun. The relaxed atmosphere and comic themes made a great lighthearted environment for learning, catching up with old friends and making new ones. It will be interesting to see how the next year is shaped by what we imagined in Nashville.

Posted in eDiscovery

Textual analytics: email threading

An email thread, or a set of emails about one topic from multiple people and groups, can contain hundreds of emails. Grouping these emails as a single thread helps with analysis and can speed up the review process.

Email threading is a classic document review and investigation tool. It’s useful in many circumstances, but its importance stands out clearly in situations where private material is sent via email.

Take, for example, a patent infringement case involving privileged emails. Generally, privileged emails are assigned a privileged tag to protect their content. However, if the entire thread of these emails isn’t marked as privileged, then some private information could slip through the cracks and be seen by people who shouldn’t have access to it. When used together, chained near-duplicates and email threading can reveal document coding issues like this and highlight them before privileged information is released inappropriately.

How does Nuix do it?

Email threading is part of Nuix’s clustering functionality. Threads are produced during the cluster creation process and are formed based on the specific metadata attributes of the emails. After threads have been created, they can be used inside Nuix or passed to downstream applications.

The Document Navigator in Nuix Workstation shows a cluster of email threads about going out for drinks and a list of messages in a thread.

The emails in this thread are unlikely to be relevant to most cases.

You can combine chained-near-duplicate functionality with email threading to get a more comprehensive view of an email and its related messages. In addition, all of the search and analysis functionality available for clusters is available for email threads.

In most instances, threading is purely based on metadata. However, Nuix also provides the option to include content in email threads.
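A metadata-only threading pass can be sketched by walking each message's In-Reply-To header back to the root of its conversation. This is an illustration with invented messages, not Nuix's actual algorithm; real implementations also consult the References header and guard against reference cycles:

```python
from collections import defaultdict

def build_threads(messages):
    """Group messages into threads by following In-Reply-To back to the root.
    messages: dicts with 'id', 'in_reply_to' (None for a thread starter), 'subject'."""
    parent = {m["id"]: m.get("in_reply_to") for m in messages}

    def root(mid):
        # Walk up until there is no known parent; a reply to a message we
        # never collected simply becomes its own thread root.
        while parent.get(mid):
            mid = parent[mid]
        return mid

    threads = defaultdict(list)
    for m in messages:
        threads[root(m["id"])].append(m["subject"])
    return dict(threads)

msgs = [
    {"id": "a", "in_reply_to": None, "subject": "Drinks?"},
    {"id": "b", "in_reply_to": "a", "subject": "Re: Drinks?"},
    {"id": "c", "in_reply_to": "b", "subject": "Re: Drinks?"},
    {"id": "d", "in_reply_to": None, "subject": "Q3 report"},
]
print(build_threads(msgs))
# → {'a': ['Drinks?', 'Re: Drinks?', 'Re: Drinks?'], 'd': ['Q3 report']}
```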

Up next: auto-classification

In my next post, I’ll tell you about auto-classification, which is a supervised learning technique that enables you to train Nuix to find “more like these.”

Posted in Developers

Compliance does not equal security

This week, another large American business, United Parcel Service (UPS), announced that a data breach across 51 of its stores compromised transaction data that may have included names, postal or email addresses and credit card numbers.

This comes on the heels of Community Health Systems announcing it had lost, as the result of a “cyber attack,” more than 4.5 million electronic personal healthcare records, and Supervalu Supermarkets announcing a data breach that affected 180 different locations.

Credit and debit cards

Recent breaches have led to the loss of millions of credit and debit card numbers. Photo: Sean MacEntee

In each of these instances, the impacted businesses lost sensitive data. In each case, storing, processing and transmitting that data was strictly regulated by governance, risk and compliance (GRC) regimes: the Payment Card Industry Data Security Standard (PCI DSS) v3.0 for UPS and Supervalu, and the HIPAA Health Information Technology for Economic and Clinical Health (HITECH) standard for Community Health Systems.

This underscores something I and other security professionals have been saying for years: Compliance does not equal security.

Security isn’t just checking boxes

When organizations focus on compliance with GRC regimes, their security officers and staff are primarily concerned with “checking boxes.” PCI DSS and HIPAA HITECH both have specific security elements that organizations must meet in order to obtain compliance. But a compliance certificate is only ever a good place to start—the “floor” rather than the “ceiling.”

To be fair, in more than eight years as a Payment Card Industry Forensic Investigator, I have never investigated a breach in which the victim was 100% compliant. Some were very close, where perhaps a single element was missing, but none ever had all of the elements in place. More typically, the victim had been compliant at one point in time but had since made changes to its environment that rendered it non-compliant.

Compliance is a point in time

Let me give you an example. Let’s say my small chain of restaurants, Chris Pogue’s Irish Pubs, is in the process of becoming PCI compliant. I instruct my IT guy, let’s call him Bob, to do whatever it takes to make sure we meet all 12 sections of the PCI DSS v3.0, so that I can achieve the requisite documentation from our Qualified Security Assessor (QSA).

Bob spends several months making changes, buying hardware, patching, upgrading, fixing, whatever until we meet all the requirements. The QSA comes in and looks everything over, checks all the boxes, and we get our certification. Woohoo! I’m compliant…I’m done right? So I tell Bob to go ahead and make stuff work properly again: Change passwords back to what they used to be, disable the firewall access control lists, which were making it hard to process transactions with my bank, and re-enable remote access.

So my pub met compliance at the point in time during which the QSA reviewed the systems. The changes I made after being certified compliant have caused me to be non-compliant. But since nobody checks my compliance on an ongoing basis, who cares!? I’ll worry about it again when it’s time for me to renew my compliance. I can stop spending my time and energy on data protection, and get back to selling Guinness, Irish breakfasts and bangers and mash.

Compliance makes people focus on checking boxes rather than thinking about security strategy. Most organizations see compliance as a nuisance, not a business enabler. They don’t understand why they have to make changes; they just know that they have to do a bunch of stuff that is not their core competency. So as soon as they accomplish their goal of compliance, they stop thinking about security and move on.

Handwritten checklist

Compliance makes people focus on checking boxes rather than thinking about security strategy.
Photo: mt 23

Our current approach isn’t working

Today’s list of recent breaches is also a point in time, the latest in a growing number of high-profile hacks. Whatever these businesses are doing to safeguard their data, it isn’t working.

So this raises the question: should businesses adjust their security spending focus? Think about it like this: We’re spending all this money on prevention but in all these cases the hackers have compromised their targets anyway. What is the value, then, in spending more time and money doing more of the same?

Nothing, and I mean nothing, is un-hackable. That doesn’t mean we should discard our security safeguards. We’re all doomed to get breached, but you don’t want to make it easy for the bad guys. What I am saying is that detection and reaction are just as important as prevention, if not more so.

Detection and remediation are the future

If organizations shift their focus to detection, they can shorten the duration from initial breach to detection, from detection to containment, from containment to remediation, and from remediation to business resumption. The shorter these timeframes are, the smaller the impact on their business—which includes financial loss due to fraud, compliance violation fines and loss of sensitive data, customer confidence and market share.

These latest breaches are new in terms of media attention but not for cybercrime investigators. Breaches have been taking place for years and will continue for the foreseeable future. As long as there is something to steal, there will be somebody to steal it.

As a result, businesses need to include more comprehensive response strategies, which they should view as equally important as compliance. Until this shift takes place, I’m afraid we’re in for more of the same.

I have said many times, there really are only three types of organizations: those that have been breached, those that are about to be breached, and those that are breached and simply don’t know it. Which are you?

Posted in Cybersecurity

Five pathways for successful litigation support vendors at ILTA 2014

Spending the past couple of days at ILTA 2014 has been a fantastic opportunity to catch up with Nuix customers who work in eDiscovery and litigation support. I’ve had dozens of meetings and conversations with more than 100 people. And I’m seeing some strong themes emerging in the litigation support business.

Growing the market—but not the number of customers

Leading LSVs recognize there is no great untapped pool of net new customers who have big eDiscovery problems. This means growing your market share has to come at the expense of someone else in the industry, which is costly and unpleasant for everyone, or through acquisition. Instead, they need to grow the market in other ways. Which leads me to…

Nuix comic hero poster

Stop by the Nuix booth (#327) at ILTA 2014 for your comic hero poster.

Leveraging their unstructured data skills

Rather than trying to poach customers, LSVs are looking to get deeper and wider engagement with their existing customer base. Fortunately, they have a core skill that is in high demand across many parts of the business: Understanding and managing large volumes of unstructured data. Many LSVs are seeking to parlay those talents into areas such as investigations, privacy, cybersecurity, records management, storage management and data migration. More of that later.

Acquiring—but only the right companies

Larger LSVs are looking to acquire smaller ones, but only if they have substantial customer bases and revenues. This has left some of the smaller LSVs stranded and thinking about throwing in the towel—especially if being acquired was the owners’ exit strategy. The lesson here is: If you want to get acquired, you have to be stand-out successful. But as we’re seeing, there are plenty of ways you can turn a small business into a larger one.

Thinking seriously about cybersecurity

Litigation support by nature involves handling customers’ high-risk and high-value data, which is a big target for cybercrooks. That’s why a big topic in the pre-conference hype and many of the conference sessions was how LSVs can maintain the security of their clients’ data. Quite a few people I spoke to were also wondering if developing security incident response capabilities would be a good way to differentiate themselves.

Moving into information governance

A lot of LSV people I spoke to were talking about information governance as a competitive advantage. Helping clients get proactive about managing their information is a great way to drive revenue and take away many of the pitfalls of eDiscovery.

Does it mean less work and a lower volume of data when litigation actually happens? Perhaps, but it’s only the repetitive and arduous parts of the job that you’ll avoid, and who wouldn’t be glad to be rid of those? And it’s all part of developing a deeper partnership with customers, which has to be good in the long run.

Posted in eDiscovery, Information Governance

Textual analytics: clustering

When searching large data sets with many documents, grouping similar items is a powerful way to reduce the time it takes to review them. This is where clustering comes into play. Clustering is a process where the Nuix Engine groups documents with similar content, which it identifies using shingling. Shingling provides a way to locate items with similar text across vast quantities of information, while clustering allows you to group those related documents for more efficient analysis.

For example, say you are searching a data set for a particular email. By clustering similar documents to that email, you can skip whole groups of documents that aren’t related to what you’re looking for. This saves you or your review team a lot of time.

How does Nuix do it?

Nuix creates clusters based on the degree of similarity between items in a data set. The Nuix Engine calculates similarity using chained near-duplicates or email threads. Chained near-duplicates are sets of similar items based on item content. Email threads are groups of similar items based on email header information—we’ll cover these in more detail in a future post.

When Nuix creates clusters, it goes through all targeted items, whether selected manually or returned by a query, and looks for similar content. It then groups these items together and assigns each a common metadata identifier: a cluster name.

The Engine uses a similarity coefficient as a threshold to determine if two documents are similar. If a document fits the threshold limit, it is then assigned a metadata cluster ID. All documents in a cluster are given the same ID. This operation also defines a pivot item, which is the document most like the others in a cluster. Once the pivot document is defined, Nuix calculates the Cluster Pivot Resemblance for each document in the cluster. This is a measure of how similar each item is to the pivot document.
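The resemblance calculation can be sketched with word shingles and a Jaccard-style similarity coefficient. This is an illustrative stand-in, not the Nuix Engine's internals, and the example documents are invented:

```python
def shingles(text, k=3):
    """Break text into its set of overlapping k-word shingles."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(doc_a, doc_b, k=3):
    """Jaccard similarity of the two documents' shingle sets (0.0 to 1.0)."""
    a, b = shingles(doc_a, k), shingles(doc_b, k)
    return len(a & b) / len(a | b)

pivot = "gas indices report for december 7"
other = "gas indices report for december 9"
print(resemblance(pivot, pivot))  # → 1.0 (an item is identical to itself)
print(resemblance(pivot, other))  # → 0.6
```

Clustering would then group any document whose resemblance to the pivot meets the threshold.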

A list of emails called Gas Indices sent on different dates in the email-153 cluster with differing Cluster Pivot Resemblances

A list of documents in the email-153 cluster.

The cluster in this example has been assigned Cluster ID email-153. The “pivot item” (Cluster Pivots equals true) was sent on December 7th, 2001. Since this is the pivot document, it has a Cluster Pivot Resemblance of 1.0—in other words, it is identical to itself. The email with the lowest cluster pivot resemblance, 0.5, was sent on the date farthest from December 7th. If these emails are form emails, it’s not surprising that the email sent on the date farthest from that of the pivot document would be the least similar to the pivot document.

Nuix also applies business logic during the clustering process to ensure the clusters it creates are valuable. For example, if an item has insufficient text to determine its similarity to other items, it is placed in the Ignored category.

Nuix assigns the following metadata for clusters:

  • Cluster ID is the name of the cluster. This is also used as a metadata identifier.
  • Cluster Pivot Resemblance is a numerical value that indicates how similar the item is to the pivot.
  • Cluster Pivot is a true/false value stating whether the item is the cluster’s pivot document.

Nuix provides the following ways to search clusters:

  • By cluster run
  • By cluster
  • By pivot item
  • By all pivot documents in a cluster run

Nuix’s create, read, search, and delete functions for clustering are available programmatically through our APIs, so you can easily incorporate this type of textual analysis into your product.

Up next: email threading

In my next post, I’ll tell you about email threading, which is the grouping of email threads from multiple sources or people for accelerated analysis.

Posted in Developers

Nuix speaks your language

At Nuix, when we talk about the complexity of data, we’re not only referring to the huge variety of file and storage formats involved. Many of our customers operate in multiple countries and languages. One of the big advantages we’ve had over competing products is our ability to work with what’s called “double-byte character sets”—like Chinese, Japanese and Korean.

Giving customers the ability to work with other languages isn’t the same as allowing them to work in those languages. Over the past few years we’ve added the ability to use the Nuix Workbench interface in Chinese, German, Japanese and Korean. In version 5.2 we added Dutch, Brazilian Portuguese and Latin American Spanish. And when we release version 6.0, we’ll add Arabic to the list.

You probably know that Arabic is one of a dozen or so languages that read from right to left. So our developers put in a lot of work to make sure the entire Workbench interface works that way too.

If you’re used to the Workbench in left-to-right mode, this screenshot will twist your mind just a little bit.

A screenshot of the Nuix Workbench in Arabic, with the interface running from right to left

Posted in Big Data
