This post is about Distributed Representations, a concept that is foundational not only to the understanding of data processing in machine learning, but also to the understanding of information processing and storage in the brain. Distributed representations of data are the de-facto approach for many state-of-the-art deep learning techniques, notably in the area of Natural Language Processing, which will be the focus of this blog post.
If you’re anything like me, you probably feel somewhat overwhelmed by the technical jargon and content of machine learning publications; every paper or article you read seems to require an understanding of many underlying concepts (“prerequisites”). Distributed representations is one of those concepts, and the modest aim of this blog post is to hopefully check this prerequisite off your list and move one step closing to a stronger command of machine learning.
We’re going to start with a thought experiment: let’s imagine that my brain has 3 neurons, or 3 information processing units. My wife tells me that sometimes when she’s talking to me, it feels to her like my brain is mostly turned off, so perhaps this image is more apt than we think. Let’s also imagine that my brain is a complete blank slate; in other words, I have no knowledge learned/stored in my brain yet. My wife also believes this about my brain, but we digress. Now let’s say that I’m staring out of my window and I see a small red car driving by. The concept “small red car” is completely new to me, and so my brain decides to assign one of its units to the representation of this concept. This means that, from now on, I can recall the concept of “small red car” by retrieving it from this one neuron. Let’s write this in somewhat “mathy” terms, shall we:
|Small Red Car||[ 1 ]|
|Not a Small Red Car||[ 0 ]|
In this example, 1 designates the neuron “firing” and 0 represents the neuron not “firing.” As a side corollary, we can imagine a blank [ ] vector representing my brain knowing absolutely nothing.
Now, I’m still staring out of my window, and I see a second car pass by, this time it’s a Large Blue SUV. In similar fashion, my brain assigns this new concept to a second processing unit, and we end up with the following updated knowledge representations:
|Small Red Car||[ 1 0 ]|
|Large Blue SUV||[ 0 1 ]|
We now have two processing units (neurons), each either on or off (firing or not firing). Note that it’s theoretically possible to have [ 0 0 ], representing the notion “NOT a small red car AND NOT a large blue SUV,” but it is not possible to have [ 1 1 ] because then a small red red would also be a large blue SUV, and car commercials would suddenly become very confusing.
Continuing on with our window staring (it’s a slow day), one more vehicle passes by and it’s a Large Red SUV. Three concepts in total, 3 available processing units: perfect. We end up with the following knowledge representations:
|Small Red Car||[ 1 0 0 ]|
|Large Blue SUV||[ 0 1 0 ]|
|Large Red SUV||[ 0 0 1 ]|
With this type of representation, one thing immediately become apparent: 3 processing units only allow me to store 3 pieces of information. If I were to keep staring out my window, and see a large red car pass by, I would have no room for that new concept. I would either have to discard it, or replace an existing one.
A second, less dramatic result of this representation is the fact that each processing unit can contribute to only one concept. In other words, unit 1 only “fires” for a small red car, and doesn’t involve itself in representing anything else. This way of representing information is called a localist representation. It is a simple and accessible way of representing data, in the sense that you could point at some location in my brain and say “that neuron is where this concept is stored.” Similarly, if a localist approach is used to represent data or state in a machine, it would offer some advantages in terms of identifying data (just find the one hardware unit that is on), and changing data (turn one unit off, turn one unit on).
One-Hot Encoding (OHE)
If you’ve worked with machine learning within the context of NLP, there’s a chance you’ve come across a localist representation called One-Hot Encoding. For example: you have a vocabulary of n words and you represent each word using a vector that is n bits long, in which all bits are zero except for one bit that is set to 1. Different words get a different bit “assigned.” For example, let's consider this quote from Andy Weir's "The Martian":
“Duct tape works anywhere. Duct tape is magic and should be worshiped.”
The above sentence is made up of 12 words, of which 10 are unique (“duct” and “tape” are repeated). Note that for this tutorial’s purposes, we are currently ignoring letter case. Our vocabulary is therefore made up of 10 words: “duct”, “tape”, “works”, “anywhere”, “is”, “magic”, “and”, “should”, “be”, “worshiped.” We can use a vector of 10 bits to represent any of these words, by simply making the 1st bit represent the word “duct”, the second bit the word “tape”, etc. As follows:
“duct” --> [ 1 0 0 0 0 0 0 0 0 0 ] “tape” --> [ 0 1 0 0 0 0 0 0 0 0 ] … “magic” --> [ 0 0 0 0 0 1 0 0 0 0 ] … “worshiped” --> [ 0 0 0 0 0 0 0 0 0 1 ]
One-Hot Encoding is very common in natural language processing, because it involves a simple localist representation with all the advantages that entails. However, OHE has its limitations:
- Typical vocabularies contain upwards of 100,000 words, so these vectors are usually very long. And if we want to represent a sequence of 10 words, we’re already looking at input vectors with over 1,000,000 units! (cue Dr. Evil)
- In order to be able to represent n possible words, I need n bits. Or conversely, a vector of length n can at most be used to represent n words.
- Finally, and critically for machine learning, all words are equally similar (different) from each other. Ideally, when talking about semantics, we would like to have the words “dog” and “fox” be represented by vectors that are somewhat close to each other, at least as opposed to a vector representing a word like “lazy.” With One-Hot Encoding, we can’t really do that.
Below is an example function in Python that performs One-Hot Encoding. The input to this function is the vocabulary and the word we want to One-Hot Encode. The function returns an OHE vector.
def one_hot_encode_my_word(my_word, the_vocabulary): one_hot_encoded_vector =  for word in the_vocabulary: if word == my_word: one_hot_encoded_vector.append(1) else: one_hot_encoded_vector.append(0) v = ["duct", "tape", "works", "anywhere", "is", "magic", "and", "should", "be", "worshiped"] one_hot_encode_my_word("tape", v)
>>> [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
Let’s go back to our original thought experiment with small red cars and large blue SUVs. Let’s also add a few more concepts into the mix -- the assumption here is that I went to sleep and my brain grew a few more neurons, then I woke up and watched some YouTube videos, so my knowledge map now looks as follows:
|Small Red Car||[ 1 0 0 0 0 0 0 0 ]|
|Large Blue SUV||[ 0 1 0 0 0 0 0 0 ]|
|Large Red SUV||[ 0 0 1 0 0 0 0 0 ]|
|Green Apple||[ 0 0 0 1 0 0 0 0 ]|
|Bumble Bee||[ 0 0 0 0 1 0 0 0 ]|
|Tall Building||[ 0 0 0 0 0 1 0 0 ]|
|Small Fish||[ 0 0 0 0 0 0 1 0 ]|
|Banana||[ 0 0 0 0 0 0 0 1 ]|
Here we face the same problems we mentioned in our One-Hot Encoding discussion: high computational cost, high dimensionality, and the concepts are equally similar. For example, to go from any concept to another, you go the same distance: -1 in one dimension, +1 in a second dimension. As far as this representation goes, a large blue SUV is as similar to a large red SUV as it is to a banana; not exactly nuanced. But we can do better.
Ever since the mid 1980’s (see Rumelhart, McClelland and Hinton - 1986), the connectionist (read "anti-localist") crowd has advocated something called Distributed Representations, which offer some advantages over the localist approaches and help us navigate around their limitations. The name “distributed representation” is mainly driven by the fact that the representation of any single concept is distributed over many, if not all, processing units. In many cases, the unit values in the vectors are continuous values, instead of just 1’s and 0’s.
Let’s apply this type of representation on my brain knowledge and see if we can make me smarter (my wife has already despaired, so it’s all on you now, no pressure):
|Small Red Car||[ 0.555 0.761 0.243 0.812 ]|
|Large Blue SUV||[ 0.773 0.309 0.289 0.835 ]|
|Large Red SUV||[ 0.766 0.780 0.294 0.834 ]|
|Green Apple||[ 0.153 0.022 0.654 0.513 ]|
|Bumble Bee||[ 0.045 0.219 0.488 0.647 ]|
|Tall Building||[ 0.955 0.085 0.900 0.773 ]|
|Small Fish||[ 0.118 0.192 0.432 0.618 ]|
|Banana||[ 0.184 0.232 0.671 0.589 ]|
What do we have here?
- Continuous values instead of discrete 1’s and 0’s.
- Each processing unit contributes to any and all concepts.
- The representations are dense (vs. localist representations which are sparse).
- Concepts are no longer localized in one unit (hence the “distributed” designation).
- We’re able to represent a very large number of concepts with only 4 processing units (as opposed to being limited by n units to n concepts).
- We can learn new concepts without adding new units. All we need is a new configuration of values.
- Most importantly, we are able to represent similarities better: Large Red SUV [ 0.773 0.309 0.289 0.835 ] and Large Blue SUV [ 0.766 0.780 0.294 0.834 ] are much more similar to each other than they are to Small Fish [ 0.118 0.192 0.432 0.618 ].
The last point above is crucial and has a wonderful implication when it comes to learning new concepts we haven’t seen before. If I see some new object that I don’t recognize, let’s say a river trout, and it has the following representation [ 0.144 0.187 0.439 0.606 ], I can quickly find which existing concept’s vector it is most similar to (the small fish vector) and I can guess that it must be a fish-like object, without any additional information about it!
Furthermore -- and here’s why this type of representation is so powerful -- we are able to generalize . I don’t need all the units in the vector to have values, I can “guess” what concept category they might fall into with only a partial representation. For example, [ 0.158 0.030 -- -- ] is of similar characteristics to a Green Apple. It could be a Green Pear, or a Yellow Lemon, but it’s very unlikely to be a Tall Building.
We looked at One-Hot Encoding as a localist type of representation, now let’s look at Word Embeddings, an example of distributed representation used in Deep Learning for NLP.
The idea of Word Embeddings is to take a corpus of text, and figure out distributed vector representations of words that retain some level of semantic similarity between them. When that happens, the words “duct” and “tape” are “closer” to each other than they are to “magic” (which we couldn’t do with One-Hot Encoding). This can be applied to any corpus, whether it’s a collection of novels, or documents within a certain scientific discipline, or many other applications. From here, these word embeddings can be used as inputs to additional models such as an SVM or recurrent neural network. This input is usually in the form of a matrix of vectors, just like we saw in the previous section.
In the following posts in this series, we will be exploring one popular implementation of a word embedding model: word2vec. Based on a recent paper (See Mikolov ‘13), word2vec turns one-hot-encoded vectors into embedded vectors, and achieves outstanding results along the way. It manages to produce vectors that allow us to do things like W(“King”) - W(“Man”) + W(“Woman”) = W(“Queen”)! How does it do that? Make sure you stay tuned for our next post!