Understanding natural language processing and machine learning in the EMS's automatic categorization feature

The EMS's categorization feature uses something called natural language processing to sort audience submissions into groups based on their similarity to one another.

But our algorithm isn't perfect, and the process of categorizing pieces of text (called “clustering”) is an opaque one. When something doesn’t work the way we want or expect it to, it can be hard to tell why.

Here’s a brief introduction to natural language processing to help us understand how the EMS’s categorization feature works and why it has its limitations.

Vectorization

We start with a bunch of EMS submissions—sentences of human language—that we need to sort according to their similarity. Computers are entirely mathematical creatures, so the first step is to turn human language into computer language: numbers!

The process of turning human language into numbers is called vectorization. There are different ways to vectorize a bunch of text (which is formally called a "corpus," from the Latin for body) depending on what you are trying to accomplish, but we use a method called the Universal Sentence Encoder (USE).

The USE takes a sentence (in the EMS’s case, sometimes multiple sentences) and turns it into 512 numbers. How it does this, I frankly don’t understand, and we don’t need to understand it for this explainer. We just need to know that for each sentence we have a list of 512 numbers that mathematically represent different qualities of the sentence: a 512-dimensional vector.
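To make "a sentence becomes a list of 512 numbers" a little more concrete, here is a minimal Python sketch. It is not the Universal Sentence Encoder: toy_vectorize is a hypothetical stand-in that simply counts words into 512 hashed buckets, while the real encoder is a trained model that captures meaning. The only thing the two share is the shape of the output.

```python
import hashlib

DIMENSIONS = 512  # the vector length described above


def toy_vectorize(sentence):
    """Turn a sentence into a fixed-length list of numbers.

    Hypothetical stand-in for illustration only: each word is hashed
    into one of 512 buckets and counted. A real sentence encoder is a
    trained neural network, not a word counter.
    """
    vector = [0.0] * DIMENSIONS
    for word in sentence.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % DIMENSIONS
        vector[bucket] += 1.0
    return vector


vec = toy_vectorize("When will the new library open?")
print(len(vec))  # one number per dimension
```

However long or short the sentence is, the result always has the same length, which is what lets the computer compare any two submissions.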

Clustering

Next, the computer compares the vectors of our sentences to one another, determines how similar they are, and categorizes (or "clusters") them according to their similarity.
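Here is a rough sketch of what "determining how similar two vectors are" can look like. This explainer doesn't say which similarity measure the EMS uses, so treat cosine similarity as an assumption chosen because it is a common one for sentence vectors. The three short vectors are made up for illustration; real ones would have 512 numbers each.

```python
import math


def cosine_similarity(a, b):
    """Return a score near 1.0 for vectors pointing the same way, lower otherwise."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-number vectors standing in for three submissions
road_closure = [0.9, 0.1, 0.2]
road_repairs = [0.8, 0.2, 0.3]
school_lunch = [0.1, 0.9, 0.4]

print(cosine_similarity(road_closure, road_repairs))  # higher score: similar
print(cosine_similarity(road_closure, school_lunch))  # lower score: less similar
```

The two road-related vectors score much higher against each other than either does against the school lunch vector, so a clustering step would tend to group them together.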

While it’s hard to imagine how this happens in 512 dimensions, we can visualize how it happens in 2 or 3:

In the illustration on the left, each sentence, represented by a dot, has a 2-dimensional vector (a list of two numbers: one number for the X-axis and one for the Y-axis). On the right, each one has a 3-dimensional vector (three numbers: one each for the X-, Y-, and Z-axes). In each case, the clustering algorithm has tried its best to place each sentence in one of three clusters.

From these illustrations alone, we can already begin to imagine how messy this can get. In the 2-dimensional illustration, what about all the sentences in the red and green clusters that are almost overlapping—how does it know for sure which sentence belongs to which cluster? In the 3-dimensional illustration, the sentences that are outliers in their clusters—is it best to include them in clusters at all, or are they just noise?
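To see how a borderline sentence still gets forced into a cluster, here is a minimal sketch of one well-known clustering method, k-means, on 2-dimensional points. This is an assumption for illustration: this explainer doesn't name the clustering algorithm the EMS uses, and the points and starting centers below are made up.

```python
import math


def kmeans(points, centers, rounds=10):
    """Minimal k-means: assign points to the nearest center, then move the centers."""
    centers = list(centers)
    labels = []
    for _ in range(rounds):
        # Step 1: give each point the index of its closest center
        labels = [
            min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            for p in points
        ]
        # Step 2: move each center to the average position of its points
        for i in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                centers[i] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centers


points = [
    (1.0, 1.0), (1.2, 0.8), (0.9, 1.1),  # a tight group
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),  # a second tight group
    (9.0, 1.0), (9.1, 1.2), (8.9, 0.8),  # a third tight group
    (2.9, 3.0),                          # sits almost halfway between the first two
]
initial_centers = [points[0], points[3], points[6]]  # hypothetical starting guesses

labels, centers = kmeans(points, initial_centers)
print(labels)
```

The last point is nearly as close to the second group as to the first, but the algorithm has no way to say "not sure." It has to pick one, which is the kind of close call described above.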

If we use a 2-dimensional example with more complex topography, we can see how many different ways there are for different clustering algorithms to cluster a given set of data:

Remember: in the EMS, the clustering algorithm has to work with the 512 dimensions that the Universal Sentence Encoder produces for each sentence, not just 2 or 3. That is so complex that the algorithm doesn't get everything right on its own. That's where it needs your help.

Machine learning

Machine learning is the process by which the EMS's automatic categorization system can learn from your feedback and improve over time.

The automatic categorization system has a complex set of rules it uses to determine how important each of a sentence's 512 dimensions is for sorting that sentence into a cluster.
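One simple way to picture "how important a dimension is" is as a weight applied when measuring the distance between two vectors. The sketch below is an assumption for illustration, not the EMS's actual rules: weighted_distance and the 3-number vectors are hypothetical.

```python
import math


def weighted_distance(a, b, weights):
    """Distance between two vectors where each dimension counts by its weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


# Made-up vectors that differ mostly in their second dimension
first = [0.2, 0.9, 0.5]
second = [0.3, 0.1, 0.5]

equal_weights = [1.0, 1.0, 1.0]
ignore_second_dimension = [1.0, 0.0, 1.0]

far = weighted_distance(first, second, equal_weights)
near = weighted_distance(first, second, ignore_second_dimension)
print(far, near)
```

The same two vectors look far apart when every dimension counts equally and close together when the second dimension is ignored. Changing the weights changes which sentences end up looking similar.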

To help us illustrate that process, here are a bunch of emojis. Each of the following emoji objects has numerous properties: size, weight, texture, shape, material, edibility, whether it is grown or manufactured, etc. These properties are like the dimensions associated with the sentences in the EMS.

Let's say we run all these emojis through a clustering process and among the results get the following clusters:

🍑🍊🏀
🍋🥎🥑
⚽️🥥🏐

We can guess based on the output that the algorithm has determined that two properties are most important: shape and color. But these clusters wouldn't make sense in most other contexts. If we humans sorted these objects according to their similarity, it would probably make more sense to sort them into these two clusters:

🍑🍊🍋🥑🥥
🏀🥎⚽️🏐

So we re-cluster them and send them back to the algorithm for consideration. When we manually re-cluster the emojis (or in the case of the EMS, sentences of text), we teach the algorithm which properties are most important for identifying clusters. In this case, we teach it that an object's status as a fruit and an object's status as a ball used in sports are more important properties than an object's shape or an object's color. (Did you know coconuts are fruit??)

Using machine learning, the clustering algorithm then incorporates our feedback and changes its rules so that it can cluster future submissions appropriately:

🍐🍏🍅🥝🍑🍊🍋🥑🥥
🏈🎾🏉⚾️🏀🥎⚽️🏐
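Here is a toy sketch of that feedback loop using the emoji example. Everything in it is an assumption for illustration: the property values, the learn_weights rule, and closest_cluster are hypothetical and much simpler than whatever the EMS really does. The idea is only that a property gets a high weight when objects we grouped together tend to share it, and a low weight when they don't.

```python
from itertools import combinations

# Hypothetical property values, invented for this example
objects = {
    "🍑": {"color": "orange", "kind": "fruit"},
    "🍊": {"color": "orange", "kind": "fruit"},
    "🍋": {"color": "yellow", "kind": "fruit"},
    "🥥": {"color": "white", "kind": "fruit"},
    "🏀": {"color": "orange", "kind": "ball"},
    "🥎": {"color": "yellow", "kind": "ball"},
    "🏐": {"color": "white", "kind": "ball"},
}

# The clusters a human made by hand: fruits in one, balls in the other
human_clusters = [["🍑", "🍊", "🍋", "🥥"], ["🏀", "🥎", "🏐"]]


def learn_weights(objects, clusters):
    """Weight each property by how well it matches the human grouping."""
    cluster_of = {name: i for i, cluster in enumerate(clusters) for name in cluster}
    properties = next(iter(objects.values())).keys()
    weights = {}
    for prop in properties:
        same_agree = same_total = diff_agree = diff_total = 0
        for a, b in combinations(objects, 2):
            agree = objects[a][prop] == objects[b][prop]
            if cluster_of[a] == cluster_of[b]:
                same_total += 1
                same_agree += agree
            else:
                diff_total += 1
                diff_agree += agree
        # High when cluster-mates share the property and outsiders do not
        weights[prop] = max(0.0, same_agree / same_total - diff_agree / diff_total)
    return weights


def closest_cluster(new_object, objects, clusters, weights):
    """Index of the cluster whose members differ least from new_object."""
    def mismatch(a, b):
        return sum(w for prop, w in weights.items() if a[prop] != b[prop])

    averages = [
        sum(mismatch(new_object, objects[name]) for name in cluster) / len(cluster)
        for cluster in clusters
    ]
    return averages.index(min(averages))


weights = learn_weights(objects, human_clusters)
print(weights)

tennis_ball = {"color": "yellow", "kind": "ball"}  # a new, unseen object
print(closest_cluster(tennis_ball, objects, human_clusters, weights))
```

After the manual re-clustering, the "kind" property ends up with all the weight and "color" with none, so the new yellow tennis ball lands with the other balls rather than with the lemon.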

In the EMS's categorization interface, when you create new categories, delete old categories, or move audience submissions from one category to another, you're performing the same teaching process for the EMS's clustering algorithm so it can make better sense of the questions your audience asks you. You're helping it determine how important each of the 512 dimensions it measures is for each submission.

If the categories the automatic categorization feature generates are imperfect, or if it places audience submissions in the wrong category, just remember that the system is working with an incredibly complex dataset and that it needs your help to get better.

There's plenty more to learn about natural language processing and machine learning, but we hope this is enough to make both more accessible and to help the EMS's automatic categorization feature make more sense.