Fairy Tale Word Vectors
This post continues our exploration from the last blog post, Why Hyperdimensional Socks Never Match. We are still working our way through Kanerva’s paper. This time, with the basics of hypervectors under our belts, we’re ready to explore how words can be expressed as context vectors. Once words are in high-dimensional form, you can compare two of them to see how similar they are and even perform reasoning with them.
To kick off our word vector adventure, we need some words. Preferring whimsy over Google News, we will take our text from ten freely available fairy tale books on http://www.gutenberg.org/.
Gather ye nouns
Our goal is to assemble a frequency matrix with all the different nouns as the rows and the documents as the columns, where each entry records whether the word appears in the document. Our matrix will be binary, with just 1s and 0s. A document will be a sentence or fragment of words. A small visualization is below.
Noun | Doc1 | Doc2 |
---|---|---|
flower | 1 | 0 |
king | 0 | 1 |
fairy | 1 | 0 |
gold | 1 | 1 |
The size of the matrix will be big enough to support hypervector behavior, but not so big as to make computation annoyingly slow. It will be nouns x 10,000.
The first task is to get a set of nouns to fill out the rows. Although there are numerous online sources for linguistic nouns, they unfortunately do not cover the same language spectrum as old fairy tale books, so we are going to collect our own. Using Stanford CoreNLP, we can collect a set of nouns using Grimm’s book as a guide. That gives us about 2500 nouns, a nice sample to play with, and makes our total matrix size ~2500 x 10,000.
Now that we have our nouns, let’s get down to business. We want to create a mapping from each noun to its row index, a noun-idx, and then create a sparse matrix for our word frequency matrix.
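A minimal sketch of this setup in Clojure with core.matrix might look like the following. The names freq-matrix and the tiny placeholder nouns list are mine, and vectorz-clj is assumed on the classpath for a mutable sparse implementation.

```clojure
(ns fairy-tales.core
  (:require [clojure.core.matrix :as m]
            [clojure.string :as string]))

;; vectorz-clj is assumed for a mutable sparse matrix implementation
(m/set-current-implementation :vectorz)

;; stands in for the ~2500 nouns collected with CoreNLP
(def nouns ["flower" "king" "fairy" "gold"])

;; noun-idx: map each noun to its row in the matrix
(def noun-idx (zipmap nouns (range)))

;; nouns x 10,000 sparse frequency matrix, initially all zeros
(def freq-matrix (m/new-sparse-array [(count nouns) 10000]))
```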
The next thing we need is some functions to take a book, read it in, split it into documents, and update the frequency matrix.
Random indexing for the win
The interesting thing about the update method is that we can use random indexing. We don’t need to worry about having a column for each document. Because of the nature of hyperdimensions, we can randomly assign 10 columns to each document.
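A sketch of such an update function; the name update-doc! and the tokenization details are assumptions rather than the post’s exact code.

```clojure
(defn update-doc!
  "Pick 10 random columns for this document and set them to 1 for every
  known noun that appears in it."
  [doc-words]
  (let [doc-cols (repeatedly 10 #(rand-int 10000))]
    (doseq [word doc-words
            :let [row (noun-idx word)]
            :when row
            col doc-cols]
      (m/mset! freq-matrix row col 1))))
```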
The whole book is processed by slurping in the contents, splitting it into docs with a regex, and updating the matrix for each doc.
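A sketch of that processing step, continuing with the helpers above; splitting on sentence-ending punctuation is my assumption.

```clojure
(defn process-book!
  "Slurp in a book, split it into sentence-sized docs, and update the
  frequency matrix for each one."
  [filename]
  (let [text (slurp filename)
        docs (string/split text #"[.?!]")]
    (doseq [doc docs]
      (update-doc! (string/split (string/lower-case doc) #"[^a-z]+")))))
```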
We can now run the whole processing with:
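Something along these lines kicks off the run; the filenames are placeholders for the ten Project Gutenberg fairy tale books.

```clojure
;; placeholder filenames for the ten fairy tale books
(def book-files ["resources/grimms_fairy_tales.txt"
                 "resources/english_fairy_tales.txt"])

(time (doseq [f book-files] (process-book! f)))
```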
On my system, it only takes about 3 seconds.
Great! Now we have hypervectors associated with word frequencies. They are now context word vectors. What can we do with them?
How close is a king to a goat?
One of the things that we can do with them is measure how closely related the contexts of two words are using their cosine similarity. First, we need a handy function to turn a string word into a word vector by getting it out of our frequency matrix.
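A sketch of that lookup, assuming the noun-idx and freq-matrix defined earlier; the name word-vec is mine.

```clojure
(defn word-vec
  "Look up the context hypervector (the matrix row) for a word."
  [word]
  (m/get-row freq-matrix (noun-idx word)))
```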
Then we can make another nice function to compare two words and give an informational map back.
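A sketch of the comparison, built on a small cosine similarity helper; the function names are assumptions.

```clojure
(defn cosine-sim
  "Cosine similarity between two hypervectors."
  [v1 v2]
  (/ (m/dot v1 v2)
     (* (Math/sqrt (m/dot v1 v1))
        (Math/sqrt (m/dot v2 v2)))))

(defn compare-words
  "Compare two words and return an informational map about their similarity."
  [word1 word2]
  {:word1 word1
   :word2 word2
   :cosine-sim (cosine-sim (word-vec word1) (word-vec word2))})
```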
Let’s take a look at the similarities of some words to king.
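Usage might look like this; the actual similarity numbers depend on the corpus and the random column assignments, so none are shown here.

```clojure
(compare-words "king" "queen")
(compare-words "king" "princess")
(compare-words "king" "guard")
(compare-words "king" "goat")
;; exact values vary from run to run, but the royal family
;; lands closer to "king" than "guard" or "goat" does
```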
As expected, the royal family is closer to the king than a guard or goat is.
One of the interesting things is that we can now do addition and subtraction with these word vectors and see how it affects their relation to other words.
Boy + Gold = King, Boy + Giant = Jack
We can take a look at how close boy and king are by themselves.
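With the helpers sketched above, that is just:

```clojure
(compare-words "boy" "king")
```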
Now we can add some gold to the boy and that new word vector will be closer to king than boy was alone.
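A sketch of that addition, comparing the combined vector against king:

```clojure
(cosine-sim (m/add (word-vec "boy") (word-vec "gold"))
            (word-vec "king"))
;; the combined vector should land closer to "king" than "boy" did alone
```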
Doing the same for boy and jack, we find that adding a giant moves the context closer.
Amusingly, a frog and a princess make a prince.
We can take this even further by subtracting some words and adding others. For example, a similarity to the word queen can be obtained by subtracting man from king and adding woman.
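Sketched with the same helpers, using core.matrix's sub and add:

```clojure
(cosine-sim (m/add (m/sub (word-vec "king") (word-vec "man"))
                   (word-vec "woman"))
            (word-vec "queen"))
```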
Similarly, a contextual closeness to father can be obtained by subtracting woman from mother and adding man.
But wait, that’s not all. We can also express facts with these word vectors and reason about them.
Reasoning with word vectors, with the database as a hyperdimensional value
The curious nature of hypervectors allows us to store multiple entity-attribute pairs in a single vector and retrieve approximations of them later with simple linear math, using only XOR multiplication and addition. This gives us the database as a value, in the form of a high-dimensional vector.
For example, say we want to express the fact that Hansel is a brother of Gretel. We can do this by adding the XOR product of brother with Hansel to the XOR product of brother with Gretel.
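A sketch of encoding that fact; the binarize and xor-mul helpers are my own simplification of binary XOR binding, not necessarily the post’s implementation.

```clojure
(defn binarize
  "Clamp a context vector down to a binary 0/1 hypervector."
  [v]
  (m/emap (fn [x] (if (pos? x) 1 0)) v))

(defn xor-mul
  "Bind two hypervectors with elementwise XOR, binarizing first."
  [v1 v2]
  (m/emap (fn [a b] (mod (+ (long a) (long b)) 2))
          (binarize v1) (binarize v2)))

;; Hansel is a brother of Gretel:
;; (brother XOR hansel) + (brother XOR gretel)
(def hansel-brother-gretel
  (m/add (xor-mul (word-vec "brother") (word-vec "hansel"))
         (xor-mul (word-vec "brother") (word-vec "gretel"))))
```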
We can also express that Jack is a brother of Hansel.
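Encoded the same way:

```clojure
(def jack-brother-hansel
  (m/add (xor-mul (word-vec "brother") (word-vec "jack"))
         (xor-mul (word-vec "brother") (word-vec "hansel"))))
```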
We can add these two facts together to make a new hypervector value.
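For example:

```clojure
(def brother-facts
  (m/add hansel-brother-gretel jack-brother-hansel))
```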
Now we can actually reason about them and ask questions. Is Jack a brother of Hansel? With a high cosine similarity, we can assume the answer is likely yes.
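One way such a probe could look (the post’s exact query may differ): unbind brother from the facts vector and check how strongly each name shows up in what remains.

```clojure
(defn brother-fact-check
  "Unbind brother from the facts and see how strongly each name shows up
  in what is left."
  [facts name1 name2]
  (let [unbound (xor-mul facts (word-vec "brother"))]
    {name1 (cosine-sim unbound (word-vec name1))
     name2 (cosine-sim unbound (word-vec name2))}))

(brother-fact-check brother-facts "jack" "hansel")
;; if both similarities come back high, the answer is likely yes
```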
What about someone unrelated? Is Cinderella the brother of Gretel? - No
Is Jack the brother of Gretel? - Yes
We can take this further by adding more facts and inventing a relation of our own.
Siblings in Hyperspace
Let’s invent a new word vector that is not in our nouns - siblings. We are going to create a new random hypervector to represent it.
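A sketch of creating that random hypervector:

```clojure
;; a brand new 10,000-dimensional random binary hypervector
(def siblings (m/array (repeatedly 10000 #(rand-int 2))))
```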
We will define it in terms of word vectors that we already have. That is, siblings will be the sum of brother + sister. We XOR multiply that sum by the siblings hypervector to associate the two.
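Sketched with the xor-mul helper from before:

```clojure
;; associate the new siblings vector with brother + sister via XOR binding
(def siblings-def
  (xor-mul siblings (m/add (word-vec "brother") (word-vec "sister"))))
```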
Now we can add some more facts. Gretel is a sister of Hansel.
Gretel is also a sister of Jack.
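Both sister facts, encoded the same way as the brother facts:

```clojure
;; Gretel is a sister of Hansel
(def gretel-sister-hansel
  (m/add (xor-mul (word-vec "sister") (word-vec "gretel"))
         (xor-mul (word-vec "sister") (word-vec "hansel"))))

;; Gretel is also a sister of Jack
(def gretel-sister-jack
  (m/add (xor-mul (word-vec "sister") (word-vec "gretel"))
         (xor-mul (word-vec "sister") (word-vec "jack"))))
```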
Collecting all of our facts into one hypervector (as a database).
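For example:

```clojure
(def facts-db
  (m/add brother-facts
         siblings-def
         gretel-sister-hansel
         gretel-sister-jack))
```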
Now we can ask some questions.
Are Hansel and Gretel siblings? - Yes
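One possible probe, again a sketch rather than the post’s exact code: unbind a name from the database and compare what remains with the unbound siblings concept.

```clojure
(defn sibling-signal
  "Unbind a name from the database and compare what is left with the
  unbound siblings concept."
  [db person]
  (cosine-sim (xor-mul db (word-vec person))
              (xor-mul db siblings)))

(sibling-signal facts-db "hansel")
(sibling-signal facts-db "gretel")
;; actual siblings should score noticeably higher here than unrelated
;; characters like "john" or "roland"
```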
Are John and Roland siblings? - No
Are Jack and Hansel siblings? - Yes
It is interesting to realize that nothing is stopping us at this point from retracting facts by simply subtracting the fact-encoded word vectors from our “database” value and making a new value from it.
Conclusions
In this fun but casual exploration of word vectors, we have seen the potential for reasoning about language in a way that uses nothing more complicated than addition and multiplication. The ability of hypervectors to store dense information, give it back with simple methods, and collect it flexibly with random indexing shows their versatility and power. Hyperdimensional vectors might hold the key to unlocking a deeper understanding of cognitive computing, or perhaps even true artificial intelligence.
It is interesting to note that this technique is not limited to words. Other applications can be done the same way: for example, video recommendation using a hypervector of movie titles, or perhaps even anomaly detection using sensor readings over a regular weekly time period.
Looking back over our journey with word vectors: at the beginning, word vectors seemed magical. Now, with an understanding of the basics, they still seem like magic.
If you are interested in exploring further, feel free to use my GitHub repo hyperdimensional-playground as a starting point.