In computer science and machine learning circles it’s called “supervised learning.” In the eDiscovery space it’s called “predictive coding” or “technology-assisted review.” No matter what you call it, auto-classification is really just teaching a computer to assign tags to documents.
Under the hood, this type of machine learning draws on a tremendous amount of math and computer science, and on a huge range of models, algorithms, and approaches. That can make it seem like an intimidating black box, but it doesn’t need to be.
Have you ever told your email client that a specific message is spam? Then you’ve already used auto-classification technology. By telling the filter that this email is spam, you are training the model. The more items that you and other users tag as spam, the better the model gets.
How does Nuix do it?
Nuix’s approach has always been to make auto-classification familiar to the industry. Approachable and affordable technology has a better chance of moving into the mainstream and encouraging more people to embrace the opportunities of machine learning.
The approach has also been designed with portability and scale in mind: you can reuse the same model across multiple data sets, and classification scales linearly as the number of documents in the set increases.
To keep the technology accessible to more people, we avoid the term “predictive coding,” as it represents a very specific technology use case. Predictive coding models are often trained on the facts of a specific case, which prevents reuse. Nuix technology is designed to deliver auto-classification of unstructured documents, so we work from a data mining perspective, not a predictive coding one.
As a result, we embraced the Naive Bayes classifier model because of its reliability and industry acceptance. We use the Predictive Model Markup Language (PMML) to export and reuse our auto-classification models.
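To make the Naive Bayes idea concrete, here is a minimal from-scratch word-count classifier. This is an illustrative sketch, not Nuix’s implementation: it assumes simple whitespace tokenization and Laplace smoothing.

```python
import math
from collections import Counter

class NaiveBayes:
    """Minimal word-count Naive Bayes text classifier (illustrative only)."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, text, label):
        # A human reviewer's binary decision (match / not match) per document
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def score(self, text, label):
        total_docs = sum(self.doc_counts.values())
        log_p = math.log(self.doc_counts[label] / total_docs)  # class prior
        n = sum(self.word_counts[label].values())
        vocab = len(set(self.word_counts[True]) | set(self.word_counts[False]))
        for word in text.lower().split():
            # Laplace smoothing so unseen words don't zero out the probability
            log_p += math.log((self.word_counts[label][word] + 1) / (n + vocab))
        return log_p

    def classify(self, text):
        return self.score(text, True) > self.score(text, False)
```

Training amounts to counting words per class; classifying a new document compares its log-probability under each class, which is why the model is cheap to build and apply.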
If you have a collection of documents and you want to “find more like these,” Nuix’s auto-classification makes it easy. Auto-classification can be as simple or complex as is required by your use case, but at its core auto-classification requires only a few simple steps:
- Sample: Select a collection of items to use as your training set.
- Train: Have someone knowledgeable on the topic classify each item in the training set.
- Build: Use Nuix to build a model.
- Repeat: Repeat the above steps until the model returns results that are accurate enough for your needs.
- Apply: Apply the model to the rest of the items in the collection.
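The steps above can be sketched as a loop. The callables here (`label_fn`, `build_model`, `evaluate`) are hypothetical placeholders standing in for the human reviewer, the model build, and the accuracy check, and evaluating against the labeled set itself is a simplification (a held-out set would be more rigorous):

```python
import random

def classify_collection(items, label_fn, build_model, evaluate,
                        target=0.9, batch=50):
    """Sample -> Train -> Build -> Repeat -> Apply (illustrative sketch)."""
    pool = list(items)   # items not yet reviewed
    training_set = []    # (item, human label) pairs
    while True:
        # Sample: draw a batch of items for human review
        sample = random.sample(pool, min(batch, len(pool)))
        for item in sample:
            pool.remove(item)
        # Train: a knowledgeable reviewer labels each sampled item
        training_set += [(item, label_fn(item)) for item in sample]
        # Build: fit a model on everything labeled so far
        model = build_model(training_set)
        # Repeat until precision and recall hit the target (or items run out)
        precision, recall = evaluate(model, training_set)
        if (precision >= target and recall >= target) or not pool:
            break
    # Apply: classify the remaining, unreviewed items with the final model
    return model, [(item, model(item)) for item in pool]
```

The loop keeps growing the training set until the model is accurate enough, then applies the model once to everything left over.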
Nuix provides a variety of methodologies to create a training set of sample documents.
- Random: Nuix randomly selects a percentage of items as part of the sample set.
- Seeded: Use the Nuix Query Syntax to select a more focused group of documents.
- Putting it all together: Nuix randomly samples cluster pivot documents and unclustered items to ensure a high level of uniqueness in the training set.
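As a sketch of the simplest of these options, random sampling needs only the standard library. The 5% fraction and the seed here are arbitrary choices for illustration:

```python
import random

def random_training_sample(items, fraction=0.05, seed=None):
    """Randomly select a fraction of the collection as the training set."""
    rng = random.Random(seed)                 # seed makes the sample reproducible
    k = max(1, round(len(items) * fraction))  # at least one item
    return rng.sample(items, k)               # sampling without replacement
```

Seeded and clustered strategies replace the uniform draw with a query-filtered or cluster-pivot pool, but the shape of the step is the same.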
For each auto-classification model, a human reviewer makes a binary decision (true/false or match/not match) for each item, which makes it possible to gauge the accuracy of the model.
Nuix looks at the coding decisions for various items, analyzes each item’s content, removes any items that don’t have sufficient text for a proper analysis, and creates the model.
Nuix then presents the results as an industry-standard confusion matrix, along with the model’s precision and recall scores.
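For reference, precision and recall fall straight out of the cells of a confusion matrix. This sketch computes them from paired human and model decisions:

```python
def precision_recall(truth, predicted):
    """Compute precision and recall from binary coding decisions.

    truth:     the human reviewer's decisions (True = match)
    predicted: the model's decisions for the same items
    """
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)      # true positives
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)  # false positives
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0  # of predicted matches, how many are real
    recall = tp / (tp + fn) if tp + fn else 0.0     # of real matches, how many were found
    return precision, recall
```

High precision means few false positives; high recall means few missed matches. Which one matters more depends on the use case, as the next step discusses.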
Repeat the training process until the model achieves the precision and recall you require. If you’re using auto-classification to guide a process or suggest tags, rather than submitting the results as evidence, you probably don’t need extreme precision and recall scores.
Select the rest of the items in the collection and choose the Automatically Classify Items option. Nuix will use the model you built to evaluate each item and classify it as either matching or not matching the model.
What is it good for?
Pre-tag documents as part of a litigation support workflow
With Nuix’s auto-classification methodology you can run every case you process through a series of predefined auto-classifiers. These are trained generically to provide suggestive tagging or enriched metadata. For example, a quick review could tag items as spam, junk mail, or system alerts. Or you can reuse models from previous cases that were trained based on decisions made in those cases. This doesn’t replace a human review, but it can accelerate the process by allowing reviewers to triage how they review the documents.
Build institutional knowledge into investigations
With each new case, investigators must rely on the lessons of the past. They leverage their years of experience to find patterns in data, read between the lines in communication, and “sniff out” evidence of crime, malpractice, or fraud. It’s a stretch to say that a machine can replace a human when it comes to this type of intuition, but imagine an investigator taking all of the key evidence in a case and adding it to an auto-classifier. Would it automatically find the smoking gun? Probably not. But by jump-starting the process, it could save valuable time.
Continuously train your model against the latest decisions
As part of any review auto-classification process, the more training data you feed the model, the more accurate it becomes. So why not continuously train the model against the latest decisions and reapply it?
Since Nuix’s auto-classification model is based on individual document decisions, it scales linearly as the number of records increases. Many other algorithms, most notably latent semantic indexing, are vector-based. This means each time you add data, there is a lengthy process to recalculate all the relationships between the items. With Nuix’s parallel processing framework and the fact that each item is evaluated individually, Nuix can scale to rapidly reapply the model across any size data set.
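One way to see why per-document models scale linearly: folding in a newly coded document only touches that document’s own tokens, with no pass over previously trained items. A hypothetical sketch of the bookkeeping behind a count-based model such as Naive Bayes:

```python
from collections import Counter

class IncrementalClassifierStats:
    """Per-class word counts for a count-based model (illustrative only).

    Adding a new coding decision costs O(words in that document),
    regardless of how many documents were trained before -- unlike
    vector-based methods that recompute relationships between all items.
    """

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def add_decision(self, text, label):
        # Only this document's tokens are updated; earlier items are untouched.
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())
```

Because each update is independent, new review decisions can also be folded in from parallel workers without rebuilding the model from scratch.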
Up next: topic modeling
In my next post I’ll talk about topic modeling: an unsupervised learning technique where Nuix tells you what’s in your data.