Textual analytics: Topic modeling

Does learning about the contents of a document without having to do any work sound good to you? Then let me tell you about topic modeling.

Topic modeling is a form of unsupervised machine learning. It’s based on the premise that a document is made up of multiple topics and that a collection of documents represents a collection of topics. With topic modeling, you can select a single document or multiple documents and return a list of topics within those documents.

How does Nuix do it?

Nuix’s strength is search technology; our indexes are robust, reliable and return consistent results. This is powerful when you know what to ask for. To take advantage of this, though, you need to know where to begin. Enter topic modeling. Nuix performs topic modeling using a common topic modeling algorithm: Latent Dirichlet allocation.

The key to generating good topics is text summarization. Nuix can automatically summarize text during processing by extracting the five most important sentences from each document. Importance is determined by a combination of attributes, including the position and length of the sentence.

A summary including five important sentences from an email.

A text summary for a specific email.
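To make the idea of scoring sentences by position and length concrete, here is a minimal sketch in Python. It is not Nuix's implementation: the scoring formula, the `ideal_length` parameter and the `summarize` function name are assumptions chosen purely for illustration.

```python
import re


def summarize(text, max_sentences=5, ideal_length=20):
    """Pick the highest-scoring sentences, returned in their original order."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    def score(index, sentence):
        # Hypothetical scoring: earlier sentences score higher, and so do
        # sentences whose word count is close to ideal_length.
        position_score = 1.0 / (index + 1)
        length_score = 1.0 / (1 + abs(len(sentence.split()) - ideal_length))
        return position_score + length_score

    ranked = sorted(
        range(len(sentences)),
        key=lambda i: score(i, sentences[i]),
        reverse=True,
    )
    keep = sorted(ranked[:max_sentences])  # restore document order
    return [sentences[i] for i in keep]


text = " ".join(f"Sentence number {i} has a few words." for i in range(8))
summary = summarize(text)
for sentence in summary:
    print(sentence)
```

Because every sentence in the sample text has the same length, only position decides the ranking here, so the first five sentences are kept.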

Using these summaries, Nuix generates topics for an individual document or a collection of documents. Each topic is represented by seven statistically significant words that have been extracted from the items. You can access these topics dynamically or store them in each item’s custom metadata database for future use.

A list of topics each named with the seven statistically significant words. The result “credit oct negative term market stock week” is highlighted.

A list of topics generated for a collection of emails.
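For readers curious about what Latent Dirichlet allocation does with text like this, the following is a small, self-contained sketch of the algorithm using collapsed Gibbs sampling. It is a toy for illustration only and says nothing about how Nuix implements it: the function name `lda_topics`, the hyperparameters `alpha` and `beta`, and the sample documents are all assumptions.

```python
import random
from collections import Counter


def lda_topics(docs, num_topics=2, words_per_topic=7, iterations=200,
               alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; returns the top words per topic."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Count tables: topics per document, words per topic, tokens per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []
    # Start by assigning every token to a random topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Take the token out of the counts before resampling it.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Weight each topic by its fit to this document and this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return [
        [w for w, count in topic_word[k].most_common(words_per_topic) if count > 0]
        for k in range(num_topics)
    ]


# Made-up, already tokenized summaries standing in for real documents.
docs = [
    "credit market stock term week".split(),
    "stock market credit negative term".split(),
    "meeting lunch schedule friday office".split(),
    "office schedule meeting friday lunch".split(),
]
topics = lda_topics(docs, num_topics=2)
for words in topics:
    print(" ".join(words))
```

Each topic comes back as a list of up to seven words, mirroring the seven-word topic labels described above. On a corpus this small the split between topics can vary with the seed.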

Nuix’s create function for topic modeling is available programmatically through our APIs, so you can easily incorporate this type of textual analysis into your product.

What is it good for?

Unsupervised learning and Nuix’s implementation of topic modeling offer tremendous value and a huge number of use cases.

Review triage and prioritization

Use topic modeling to provide additional value to customers. Make topic modeling part of your eDiscovery workflow. Go above and beyond the standard keyword hit reports. Provide your customers with a topic list for emails, documents and spreadsheets as part of your standard reports.

Present these robust reports to counsel who can use them to prioritize reviews. Topics can be divided for reviews based on their relevance. Combine topic modeling with clusters to reduce the time it takes to find the facts.

Improve understanding

A key part of an investigation is understanding who knew what and when. Imagine if you could provide a list of topics, by custodian, for each week of an investigation?

This is easy to do using Nuix’s analytics and scripting capabilities. Create clusters for each custodian’s email. Generate topics for each cluster and store them in the custom metadata database against the pivot document. Then all you have to do is search for a specific date range for each custodian and put that content into a report. With this simple report you can quickly see who was talking about what and when, all without needing to know the facts of the investigation.
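The reporting step can be sketched in a few lines of Python. The snippet below does not call any Nuix API; it assumes the topics have already been exported into plain records, and the field names `custodian`, `sent` and `topics` are hypothetical.

```python
from collections import defaultdict
from datetime import date


def topics_by_custodian_week(items):
    """Group topic words by (custodian, ISO year, ISO week)."""
    report = defaultdict(set)
    for item in items:
        iso = item["sent"].isocalendar()
        key = (item["custodian"], iso[0], iso[1])
        report[key].update(item["topics"])
    return {key: sorted(words) for key, words in sorted(report.items())}


# Made-up records standing in for items exported from a case.
items = [
    {"custodian": "alice", "sent": date(2014, 9, 1), "topics": ["credit", "market"]},
    {"custodian": "alice", "sent": date(2014, 9, 3), "topics": ["stock", "credit"]},
    {"custodian": "alice", "sent": date(2014, 9, 8), "topics": ["audit"]},
    {"custodian": "bob", "sent": date(2014, 9, 2), "topics": ["lunch"]},
]
report = topics_by_custodian_week(items)
for (custodian, year, week), words in report.items():
    print(custodian, year, week, ", ".join(words))
```

Grouping on the ISO week keeps items from the same Monday-to-Sunday span together, which is one reasonable reading of "each week of an investigation."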

Posted in Developers

One step ahead: Introduction

Modern computing environments are getting larger and more dispersed. Breaches are getting more complex. Attackers are getting smarter. The number of available places for a hacker to attack (commonly referred to as the “attack surface”) is growing exponentially.

All of these factors are moving so quickly that security services companies and software vendors are stretched to their limits just to keep up.

But what if we could get in front of the threat? Obviously that is the point where the industry wants to be. But it seems all but impossible because we’re so busy responding to so many security events. Well my friends, all that is about to change…

Can we predict breaches?

Threat prediction is an extremely difficult concept for security services and software vendors to nail down. We’re still struggling to respond quickly to events, a challenge that hangs around the neck of modern incident responders like a cyber-albatross. It could hardly be any different while 20-year-old methodologies are still the order of the day.

Under these circumstances, even claiming we can predict breaches seems more like science fiction than forensic science. I might as well be talking about unicorns or zombies. (Though to be honest, zombies are infinitely cooler than unicorns. They would probably just eat any unicorns they came into contact with. Even with those horns, unicorns couldn’t put up a serious fight against the overwhelming numbers of the undead. But I digress).

Zombies may have the numbers, but unicorns can gallop. Photo: Rob Boudon

How should we investigate?

There will always be challenges with growing infrastructures, emerging technologies, and a creative and active enemy. Those are givens, and in my opinion, they have never been the real problem. The problem has always been how to investigate.

Shouldn’t there be a set of core principles that transcends artificial boundaries? Something that says, “Regardless of the technology, this is how an investigation like this should go.”

I mean, we’ve seen so many breaches in the past five years; there has got to be some intelligence we can take away to better prepare ourselves for the next attack. Right?

The short answer is, “Yes, there should be.” And I believe there is. But to explain it, I will have to take you on a bit of a detour. Don’t worry, it’ll be fun, and I promise it will all make sense in the end.

Up next: standing on the shoulders of giants

Can you predict when or where a crime is going to occur? Plenty of people have tried, and their efforts can tell us some really interesting things about cybersecurity.

Posted in Cybersecurity

Do you trust your archive? Part 1: Broken index

Have you ever woken up from a nightmare where you’ve placed all of your company’s valuable email data into an archive and now you can’t get any of it out?

It happens all the time. A company has put its trust in an on-premise archive platform that is getting a bit on the elderly side—in IT terms that’s anything more than a few years old. The company has faith the archive will remain stable and be able to produce some or all of its data when required. After all, they ingested this valuable email into the archive for safe keeping. It should still be there, right?

Along comes a large eDiscovery, investigation or audit. Or perhaps the company has decided it’s time to migrate the on-premise archive to the cloud.

What if you found out you couldn’t produce any quantity of data out of the archive because your archive’s indexes were corrupt?

We see more and more companies in this predicament every day.

Don’t lose sleep over your on-premise archive. Photo: Nadio

Recently Nuix helped a well known online retailer out of a real-life archive nightmare. This customer was very worried. Its IT people were running eDiscovery searches using the archive’s native search tool, but each time they ran the search they got different results.

They called the legacy archive manufacturer for support. After several wasted months and a lot of hand-wringing and frustration, the manufacturer said the archive’s indexes were so corrupt they could no longer rely on them at all.

So they decided to do what most companies would in that situation: Abandon that old platform.

The online retailer contacted an established archive migration company recommended by the archive vendor. This migration company used technology that tried to extract data from the archive using its application programming interface (API) to locate and produce the data at a snail’s pace.

This method uses the archive’s server resources and infrastructure, including the indexes, to extract the data. When it works, it works very slowly. But guess what happens when the indexes are corrupt?

The archive migration company was forced to tell the customer: “We can’t extract your data from the archive because we have to use the indexes to perform that work and your indexes are beyond repair.”

The company was now ten months into the effort of rescuing the data with no results and a terminal diagnosis. Just when you thought the nightmare couldn’t get any worse, right?

Luckily the online retailer asked for a second opinion. Nuix Intelligent Migration processes legacy archive data in its highly proprietary format on the archive file system. We go directly to the disk rather than using the API.

Using this technology, we managed to restore more than 157 million archived emails. We found only 321 files were actually corrupt on the file system—that’s an extraordinarily good exception rate of 0.0002%. And we even produced these corrupt emails for the customer to ensure the chain of custody left no file unaccounted.

We had the customer’s data back in its native form in just six weeks.

Whew … it was only a bad dream. You can rest easy: Nuix will be there for you when your trust runs out.

Posted in Email and Archive Migration

BlackPOS v2: New variant or different family?

Media outlets have been abuzz the past week or so about a supposedly new variant of the infamous BlackPOS malware family.

BlackPOS gained notoriety as the malware responsible for the massive Target breach that occurred in late 2013. Security vendor Trend Micro recently published a blog post discussing a potential new variant to BlackPOS. Then Brian Krebs posted an article about a potential connection between this malware and the Home Depot breach.

While I agree that a connection between the malware used in numerous high-profile breaches would make for a good headline, the reality is there is no empirical evidence of such a connection.

As a malware analyst, I’ve looked at a number of point of sale (PoS) malware families, such as BlackPOS, Alina, JackPOS, Chewbacca, Dexter, and most recently Backoff. So my ears perked up when I heard about this new BlackPOS variant.

After careful review of both samples, I don’t believe the sample in question is actually part of the BlackPOS malware family. While I thought Trend Micro’s technical analysis was fantastic and overall a good read, it does not clearly identify a connection between the two samples. The goal of this post is to highlight the inherent differences in coding style and functionality and hopefully to stop any misinformation from spreading.

So, why do I think this malware sample is something different entirely and not BlackPOS? There are a number of reasons.


First, let’s discuss how subsystems were configured for both samples. A subsystem is used to specify what sort of environment an executable will be run in. The most common choices are “CONSOLE” and “WINDOWS” (or “GUI” as it’s sometimes shown).

A console application is designed to run on the command line, while a Windows application typically needs some sort of graphical component. The BlackPOS sample that hit Target was built with the Windows subsystem, while the new malware was built with the console subsystem. It’s a minor modification, but it demonstrates an initial difference.

File properties showing BlackPOS was written with a windows subsystem while the new malware uses a console subsystem.

Subsystem use in BlackPOS and the new malware.
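If you want to check the subsystem of a sample yourself, the value sits in the PE optional header. Here is a minimal sketch in Python that reads it with the standard library only. The function name `pe_subsystem` and the `fake_pe` helper are my own for illustration; the helper builds a synthetic header so the snippet runs without a real sample.

```python
import struct

# Subsystem values defined by the PE format.
SUBSYSTEMS = {2: "WINDOWS (GUI)", 3: "CONSOLE"}


def pe_subsystem(data):
    """Return the subsystem named in a PE file's optional header."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    # The DOS header stores the offset of the PE signature at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    # Skip the 4-byte signature and the 20-byte COFF file header.
    optional_header = pe_offset + 4 + 20
    # Subsystem is a 16-bit field at offset 68 in both PE32 and PE32+.
    (subsystem,) = struct.unpack_from("<H", data, optional_header + 68)
    return SUBSYSTEMS.get(subsystem, f"other ({subsystem})")


def fake_pe(subsystem):
    """Build a synthetic, mostly empty PE header for demonstration."""
    data = bytearray(0x40 + 24 + 70)
    data[0:2] = b"MZ"
    struct.pack_into("<I", data, 0x3C, 0x40)
    data[0x40:0x44] = b"PE\0\0"
    struct.pack_into("<H", data, 0x40 + 24 + 68, subsystem)
    return bytes(data)


print(pe_subsystem(fake_pe(2)))
print(pe_subsystem(fake_pe(3)))
```

With a real sample you would pass in the file's bytes, for example `pe_subsystem(open(path, "rb").read())`, ideally inside an isolated analysis environment.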


Interestingly enough, both samples are configured to run as a service. Being configured as a service is a common persistence technique we see on a wide range of malware samples.

Unfortunately, this persistence mechanism is where the similarities end. BlackPOS was configured to be run without any command-line arguments. It would check if it was running as a service, and in the event it wasn’t, it would create a new service with the following information:

Service Name: POSWDS
Display Name: POSWDS
Description: [N/A]
Startup Type: Automatic

It’s also important to note that the BlackPOS malware doesn’t include a description, while the new malware does.

Unlike BlackPOS, the new malware is configured to take a number of command-line arguments, as we see below:

C:\Documents and Settings\Administrator\Desktop>FrameworkServiceLog.exe
Usage: -[start|stop|install|uninstall]

As we can see, a number of arguments can be supplied. Additionally, the malware can take the ‘-service’ argument when it is being run as a service.

Another tactic we see this new malware family use is the service dependency technique. This is an important addition, which we didn’t see in the BlackPOS malware family. By adding itself as a dependency of another service, the new malware makes itself harder to remove. After the malware installs itself as a service, it executes the following system command, which configures it as a dependency for the legitimate LanmanWorkstation service:

%WINDIR%\\SYSTEM32\\sc.exe config LanmanWorkstation depend= mcfmisvc

The following image shows a de-compilation of the installation routine.

Installation routines for BlackPOS and the new malware.


String Obfuscation

Both samples use string obfuscation; however, the techniques they employ are quite different. BlackPOS uses a simple character shift technique that rearranges previously garbled characters into their original form. It’s somewhat similar to taking your alphabet soup and rearranging the letters so they make actual words.

Conversely, the new malware makes use of a simple XOR encryption routine, where the string is XORed against a one-byte key of 0x4D. XOR, or “exclusive OR”, is a logical operation that malware authors often use for simple encryption routines.
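For analysts who want to recover such strings, single-byte XOR is easy to reverse because applying the same key twice returns the original bytes. A minimal Python sketch follows; the sample string is made up, and only the 0x4D key comes from the analysis above.

```python
KEY = 0x4D  # one-byte key described above


def xor_bytes(data, key=KEY):
    """XOR every byte with the key; the same call encodes and decodes."""
    return bytes(b ^ key for b in data)


obfuscated = xor_bytes(b"example string")  # made-up sample string
recovered = xor_bytes(obfuscated)
print(obfuscated.hex())
print(recovered.decode())
```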

Dump File Obfuscation

While both samples dump their harvested card data to a fake DLL file, the dumped data is obfuscated in very different ways.

BlackPOS uses a customized version of Base64 to obfuscate dumped track data. As a reminder, Base64 is an encoding scheme that is often used for translating binary data into an ASCII representation.
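A customized Base64 usually means the standard algorithm with a rearranged alphabet. The sketch below shows that idea in Python. The alphabet here is hypothetical (the standard one rotated by 13 places) and is not the one BlackPOS uses, which this post does not reproduce.

```python
import base64
import string

STANDARD = (
    string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"
).encode()
# Hypothetical custom alphabet: the standard alphabet rotated by 13 places.
CUSTOM = STANDARD[13:] + STANDARD[:13]

ENCODE_TABLE = bytes.maketrans(STANDARD, CUSTOM)
DECODE_TABLE = bytes.maketrans(CUSTOM, STANDARD)


def custom_b64encode(data):
    """Standard Base64, then swap each character into the custom alphabet."""
    return base64.b64encode(data).translate(ENCODE_TABLE)


def custom_b64decode(text):
    """Map characters back to the standard alphabet, then decode."""
    return base64.b64decode(text.translate(DECODE_TABLE))


encoded = custom_b64encode(b"track data")  # made-up sample input
print(encoded.decode())
print(custom_b64decode(encoded).decode())
```

The `=` padding character is not part of either table, so it passes through unchanged. A stock Base64 decoder would produce garbage from the encoded text, which is the whole point of the customization.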

The new malware makes use of a substitution cipher, not unlike those decoder rings that were really popular a number of years back. The following two tables are used for character swapping.
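The malware's actual tables are not reproduced in this text. To show how a two-table substitution cipher works in general, here is a short Python sketch with made-up tables: `PLAIN` and `CIPHER` are hypothetical and cover digits only.

```python
# Hypothetical tables for illustration; NOT the tables used by the malware.
PLAIN = "0123456789"
CIPHER = "7305918264"

ENCODE = str.maketrans(PLAIN, CIPHER)
DECODE = str.maketrans(CIPHER, PLAIN)


def substitute(text):
    """Swap each character in PLAIN for the one at the same index in CIPHER."""
    return text.translate(ENCODE)


def restore(text):
    """Reverse the swap to recover the original text."""
    return text.translate(DECODE)


swapped = substitute("5342=1310")  # made-up digits, not card data
print(swapped)
print(restore(swapped))
```

Characters that appear in neither table, such as the `=` separator, are left untouched.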


Additionally, the format of the harvested data is quite different. BlackPOS includes a command, such as |%ADD%|, while the new malware includes the victim’s IP address, as we can see below. (The following data samples are examples only and do not represent actual payment card data.)

Example BlackPOS dump data (de-obfuscated):


Example new malware dump data (de-obfuscated):

;5342120251699171^Smith/John^131010100000019301000000877000000;5342120251699171=131010119301


Exfiltration is where the two samples are most similar, as they both use network shares to move dump files to another machine on the compromised network.

The technique of using network shares to move harvested data is not terribly common. However, after the excessive press coverage of the Target breach, the technique is common knowledge among security researchers and malware authors.

While both malware samples moved data via network shares, their methods for doing so were very different. BlackPOS uses direct system() calls, while the new malware writes its commands out to a batch script and executes it with a call to the CreateProcessA() Windows API. For less technical readers, this simply means the author of each malware sample took a very different approach to accomplishing the same thing.

Process Enumeration

In yet another example of how the two samples are different, we see differences in how they enumerate processes. BlackPOS uses the common EnumProcesses() Windows API call to identify processes to target, while the new malware uses the CreateToolhelp32Snapshot() Windows API call. Similar to what we saw in exfiltration, both authors essentially perform the same task in a different way.

The following code shows the de-compiled process enumeration techniques described above.

Memory scraping code from BlackPOS and the new malware.


Additionally, the original BlackPOS sample that was used in the Target breach used a whitelist approach to determine what processes to target for memory scraping. It specifically looked for a pos.exe executable.

The new malware, however, uses a blacklist approach. As originally reported, it has a large list of known good process names that it will ignore. Subsequently, it will scrape memory from any process not in this list.

This means that the author of BlackPOS took a far more targeted approach to memory scraping, as it will only look for card data in a single process. The new malware, however, has a much less targeted approach, as it is simply ignoring known Microsoft Windows processes.
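The practical difference between the two approaches is easy to see with a toy process list. In this Python sketch only pos.exe comes from the analysis above; the other process names and the contents of the "known good" list are made up for illustration.

```python
# Made-up list of running processes on a hypothetical register.
running = ["explorer.exe", "pos.exe", "svchost.exe", "backoffice.exe"]

WHITELIST = {"pos.exe"}                      # BlackPOS-style: one named target
BLACKLIST = {"explorer.exe", "svchost.exe"}  # hypothetical "known good" names


def select_whitelist(processes, allowed):
    """Keep only processes that are explicitly named."""
    return [p for p in processes if p.lower() in allowed]


def select_blacklist(processes, ignored):
    """Keep every process that is not explicitly ignored."""
    return [p for p in processes if p.lower() not in ignored]


print(select_whitelist(running, WHITELIST))
print(select_blacklist(running, BLACKLIST))
```

The whitelist yields a single process, while the blacklist sweeps in anything the author did not think to exclude, which is why the second approach is so much broader.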


As we’ve seen, there are a number of differences between the BlackPOS family witnessed in the Target breach, and the most recently discussed malware family.

A single difference, or perhaps a couple of differences, might be the result of minor changes in a code base. However, the number and degree of variances between these two samples are a clear indication that they were more than likely coded by different people. I think you’ll agree that when we look at the big picture, the new malware does not share any significant resemblance with the malware that hit Target.

It is unclear at this time whether this new malware is in fact the malware that was seen in the Home Depot breach. Many details have not yet been made public, so at this point in time, your guess is as good as mine. That’s the unfortunate reality and nature of electronic breaches, as ongoing investigations often prevent the dissemination of information.

While this particular sample may not be the newest variant of BlackPOS, it is still very much a serious threat. It employs a number of simple tactics that make it difficult to detect without specific knowledge of the malware family itself. Overall, I think we can all agree that no matter what this family of malware is called, it still certainly has the capability to steal a wealth of information.

See also: VirusTotal analysis for the BlackPOS Sample and the new malware sample

Posted in Cybersecurity

Developer diaries: Making good developers great

Greetings my fellow Nuix technology enthusiast. I’m John Henry, lifelong web developer and the Technology Lead for Web Review & Analytics at Nuix’s Philadelphia location. My fancy title means that I’m privy to the selection, implementation, and support of technologies for all web development at Nuix. Let me share with you a bit of the wisdom that comes from such a perspective.

Personally, I like to scan the bullet points of any substantial article before I read it. So, for my first blog post, I think I’ll also start with bullet points: Three things that you can do to become a better developer. (And, who knows, these may also apply to the rest of your life.)

Over the last 15 years or so, I’ve identified three key characteristics I value in other software developers. I see these positive traits in the people I enjoy working with the most, and I value their opinion.

1. “Will he finish what he begins?” ~ Yoda

Own the code you write. Own the process you create, the script you’ve been asked to maintain, and the meeting bullet you’ve been called on to address. Train people until they feel like the process they’re inheriting is theirs. If you volunteer, or you’re nominated, for a position and take responsibility for it, excel in it. If there aren’t enough hours in the day to dedicate to the list of things you’ve accepted, don’t accept any more, and work to reduce your obligations. Nobody likes halfhearted efforts, and if you swoop in to “handle” an opportunity and leave a mess for the next guy, they will resent you.

The best people I’ve worked with continue to support their original efforts long after they’ve passed the torch to the next generation. You’ll be surprised how relevant your experience is even after a process has evolved for a few years without your direct involvement. Knowing why something was done can help people understand what should come next.

If you find yourself nominated, do your best. At the very least, you’ll sleep better at night knowing the work you are doing meets your standards of excellence.

“Do. Or do not. There is no try.” Yoda, Star Wars: The Empire Strikes Back. Photo: Pablo Garcia

2. Don’t water the weeds

Software, and the practice of developing it, gives you many opportunities for improvement. Say, for example, you modify the scope of a parameter and the name is no longer truly representative of the function. You may be tempted to leave it so you won’t have to deal with the rabbit hole of changes and refactors today. You rationalize this and convince yourself you’re doing your peers a favor. Know this: you’re probably wrong.

Instead, be the person who takes a run at refactoring that awful 500-line method from the BlackBox class that no one else wants to untangle. But only if you finish what you start!

3. Bark up or down?

There’s a lot of debate in the software development field. Software developers can argue about spaces or tabs, appropriate checkstyle rules, and code coverage efforts with a fervor usually reserved for political debate. It doesn’t help that every language or system has its own subset of these conversations.

“Should we use Windows, OS X, or Linux for our integration server?”

“What should we name this RESTful endpoint?”

This is our football. This is our casual conversation. This is how we relate to each other and find common threads during the work day. Have an opinion. Draw on your experiences or scholarly education and share it with your fellows. And, if they have a counter argument that makes some sense, add it to your brain, and maybe even adjust your opinion next time the topic comes up.

I could go on citing virtues that I admire, encourage and strive to achieve. The combination of commitment, initiative and interest in an individual can make the difference between a good developer and a great one.

Posted in Developers

Moving ever closer to the “find all evidence” button

Nuix has demonstrated time and again that there are smarter ways to investigate big data. Customers use technologies such as near-duplicate analysis, shingle lists, topic modeling, text summarization and named entities as powerful shortcuts to the evidence they seek. What’s more, they can view all their evidence sources through a single pane of glass, pulling together data from mobile devices, the cloud and archives as well as traditional hard drives and storage devices.

The release of Nuix 6.0 once again makes it easier and faster to draw out intelligence from diverse data sets. We’ve done this with a series of new filters under the Document Navigator pane in Nuix Workbench—available in all our Investigator and eDiscovery products.

Essentially, we draw together data from all your evidence sources based on type. For example, all internet data is grouped under categories such as internet history, cache data, bookmarks and downloads. There are also groups for communications, mobile devices and information about the computer from the registry and system files.

A close-up of the filter categories and their sub-categories in the Document Navigator.

Powerful new filters gather all evidence under internet, communications, mobile and computer categories.

For many of you, the advantages should be obvious.

Let’s say you’re doing an investigation that involves browsing histories. Each browser has a history, a cache, bookmarks and other information of interest. Each computer from each suspect may have two, three or more browsers. Doing this the old way, you’d have to examine dozens of different locations individually, or construct a complex query to look over all of them. Now you just check a single box and get all the browser histories from all the browsers on all the computers in your case file.

Posted in Digital Investigation, eDiscovery

Speaking your language: Mac OS and Linux

A few weeks ago Eddie Sheehy posted a teaser image showing the Nuix 6.0 Workbench interface in Arabic. But this isn’t the only way Nuix is speaking your language.

Over the years we’ve had a lot of requests for Mac OS and Linux versions of our software. Our developers put a lot of work into it and I’m pleased to say Nuix 6.0 is available for Mac OS and Linux as well as Windows.

Macs are very common in law enforcement environments, so I know this is welcome news for our customers who wear the shield, as they say in the US.

As well as making our app run on Mac OS, we have significantly increased our support for Mac file formats and forensic artifacts. Parallels virtual disk images have caused headaches for investigators because there is limited forensic support for this format. Investigators have been forced to convert these virtual images to different formats to make them readable, but this potentially compromises forensic integrity. Nuix 6.0 eliminates this problem by directly ingesting Parallels virtual disks.

The Nuix Workbench interface in Mac OS.


Another difficulty for investigators is the Mac OS productivity applications Pages, Numbers and Keynote—again, there’s very limited support for these formats in forensic tools. They’re fully supported in Nuix 6.0, so investigators can now work with Mac OS formats in a native application.

Linux is a popular choice for digital forensics and cybersecurity specialists, due to the availability of forensic tools for that platform. Linux is also common in virtual and cloud infrastructure, and plenty of our customers are looking to run our software in the cloud. To simplify deployment, we’ve created packages for Debian/Ubuntu and Red Hat/CentOS.

The Nuix Workbench interface in Linux.


Our dev people have put in a lot of work to make sure the Mac OS and Linux builds are as easy to deploy as the Windows version. In both cases, a single download contains everything you need to run Nuix, although you’ll need separate downloads for added functionality such as working with Lotus Notes and creating image thumbnails.

And in another language-related move, we know lots of investigators love to create scripts to help them automate repeatable processes. We recognize that Python has become a popular language for scripting, especially with cybersecurity incident responders, so customers can now use Python as well as Ruby and ECMAScript to script the Nuix Engine.

Posted in Cybersecurity, Digital Investigation, eDiscovery, Information Governance
