In my previous post, I did some statistical analyses of the Dresden Files novels. Now I want to do some fancy stuff with vectors.
Word2Vec is a method for turning the words in a text into related vectors, as the name implies. Using a shallow neural network, it finds vectors such that words appearing in similar contexts have similar vector representations. The most famous example says that king − man + woman = queen, as computed with the word vectors. Gensim has a built-in function for training a Word2Vec model, so I threw all 15 books of the Dresden Files at it and asked for vectors with 100 dimensions. (Using 50 didn’t work well, and 200 wasn’t noticeably better, so I stuck with 100.)
There are four main sections to this post, each using Word2Vec:
- Related words: What words are similar to each other?
- Word equations: Can we make things like the queen equation above?
- Clustering: How do words group together?
- Thematic analysis: Can we see the rise and fall of themes in the books?
With this model, I can ask for the words most similar to a given word. Here are some nice examples, along with their closeness metrics (-1 is opposite, +1 is fully aligned):
Sword (0.90–0.83): hilt, pommel, belt, amoracchius, knife, glove, armrest, fidelacchius, blade, sheath
Beer (0.94–0.89): drink, coke, sandwich, swigged, pad, ale, finished, notepad, homebrew, sip
Nervous (0.96–0.90): sick, uncomfortable, queasy, ashamed, surprised, frighten, twisty, midsentence, distract, awkward
Fuego (0.98–0.88): forzare, servita, venta, hexus, heartfelt, vento, infriga, eraser, smote, hammer
Karrin (0.95–0.92): valmont, sanya, charity, anna, andi, fitz, billy, rawlin, mort, ramirez
Stagger (0.98–0.95): stumble, scramble, tumble, scarecrow, limp, backward, flopped, fling, crawl, crouch
Harry (0.87–0.79): mr, christ, dammit, endgame, sorry, boy, tiff, jesus, god
Bob (0.73–0.69): psychopathic, arrrrr, germanic, uninformed, intell, hardpoint, hosebeast, craftsmanlike, crap, whee
It does a very good job! Mostly. Linking “Bob” with “hosebeast” is a bit of a stretch, but it is an obscene skull spirit, after all. Also, “Harry” results in some oddly religious terms. Overall, the vectors seem to be well-related.
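Under the hood, those closeness scores are cosine similarities between word vectors. Here’s a minimal numpy sketch of the lookup, with invented toy vectors; with gensim it’s just model.wv.most_similar("sword"):

```python
import numpy as np

def most_similar(word, vectors, topn=10):
    """Rank words by cosine similarity to `word` (-1 opposite, +1 aligned)."""
    target = vectors[word] / np.linalg.norm(vectors[word])
    scores = {w: float(v @ target / np.linalg.norm(v))
              for w, v in vectors.items() if w != word}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:topn]

# Toy 2-D vectors standing in for the trained 100-D embeddings.
toy = {"sword": np.array([1.0, 0.1]),
       "hilt":  np.array([0.9, 0.2]),
       "beer":  np.array([-0.2, 1.0])}
print(most_similar("sword", toy, topn=2))  # "hilt" should rank first
```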
Since the words are turned into vectors, you can make analogies into equations, like the queen example above. Here are some examples that worked well from the Dresden text:
Brother-man+woman = sister (0.89)
Harry-magic+gun = Karrin (0.83)
Thomas-Justine+Kinkaid = Murphy (0.93)
Unfortunately, equations that work are hard to find. Most of the ones I tried don’t make much sense. For example: priest − church + Marcone = Ebenezar (0.96). I’d like this one to capture a sense of leadership, but it doesn’t. Oh well.
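Mechanically, each equation is just vector arithmetic followed by a nearest-neighbor search. A sketch with invented toy vectors (gensim’s equivalent is model.wv.most_similar(positive=[...], negative=[...])):

```python
import numpy as np

def analogy(a, b, c, vectors):
    """Find the word whose vector is closest to a - b + c."""
    target = vectors[a] - vectors[b] + vectors[c]
    target = target / np.linalg.norm(target)
    scores = {w: float(v @ target / np.linalg.norm(v))
              for w, v in vectors.items() if w not in (a, b, c)}
    return max(scores.items(), key=lambda kv: kv[1])

# Toy vectors built so that brother = man + sibling, sister = woman + sibling.
toy = {"man":     np.array([1.0, 0.0, 0.0]),
       "woman":   np.array([0.0, 1.0, 0.0]),
       "brother": np.array([1.0, 0.0, 1.0]),
       "sister":  np.array([0.0, 1.0, 1.0]),
       "gun":     np.array([1.0, 0.2, 0.0])}
print(analogy("brother", "man", "woman", toy))  # best match should be "sister"
```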
Clustering via Splitting
How do the words group together? To build a clustering, I started with the full set of words and iteratively split the groups (using k-means) until each group contained a single word.
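Here’s a sketch of that splitting procedure using scikit-learn’s KMeans, with invented toy vectors; the real version runs on the learned 100-dimensional embeddings:

```python
import numpy as np
from sklearn.cluster import KMeans

def split_tree(words, vectors):
    """Recursively bisect a word group with 2-means until each leaf is one word."""
    if len(words) == 1:
        return words[0]
    X = np.array([vectors[w] for w in words])
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    left = [w for w, lab in zip(words, labels) if lab == 0]
    right = [w for w, lab in zip(words, labels) if lab == 1]
    # Label the split with the word closest to the group's mean vector.
    mean = X.mean(axis=0)
    name = min(words, key=lambda w: np.linalg.norm(vectors[w] - mean))
    return (name, split_tree(left, vectors), split_tree(right, vectors))

toy = {"fuego":   np.array([5.0, 5.0]), "forzare": np.array([5.1, 4.9]),
       "beer":    np.array([0.0, 0.0]), "ale":     np.array([0.1, 0.2])}
tree = split_tree(list(toy), toy)  # spells end up in one branch, drinks in the other
```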
The figure below shows that splitting, applied to the 50 most frequently used words. The splits are labeled with the word that is closest to the average of that group, and the tips of the branches are the individual words. Check it out:
Since these are the most frequent words, they are basically boring, but we can learn something from them. “Harry” is in the green group, along with what he does most: look, ask, tell, etc. Since the writing is following Harry around, we learn a lot about his personal actions. Murphy and Thomas, as sidekicks, are grouped together in the violet section. Then there’s lots of other typical words for humans doing things.
It’s more interesting to focus on subsets of words. For example, here are the 50 words that are most similar to “Fuego”:
The cyan group is all made up of spells, and the violet one is spells plus violence. There seem to be some wastebasket taxa, since it’s hard to cleanly separate all these words.
The gallery below has more examples, using “Murphy”, “staff”, “Uriel”, and “Harry”. The “Murphy” figure groups characters mostly by their roles: sidekicks/partners, enemies, and questionable types. The “staff” figure shows other magical and physical weapons and tools, and some spells. The “Uriel” figure shows that he is a magical being who makes a number of facial expressions and mmmms. The “Harry” plot shows so much cursing, as well as how others talk to him.
If you want way, way, way too much information, this link will show you the plot of the top 2,550 words, which have been used more than ten times throughout the series. The most central word is “unicorn”, which splits into “prophecy” and “float” and on and on. Too big to be useful!
Since we can turn words into vectors, I can turn a text into a matrix, where one direction is time-like (progress through the book) and the other encodes the properties of the words. Encoding every word would make each book a massive matrix, so I’ll instead encode each paragraph. To highlight distinctive words in a paragraph, I’ll divide each word’s vector representation by its frequency in the full data set. Then I’ll sum all the word vectors in the paragraph, normalize the result, and call that single vector the encoded paragraph. This yields a manageably sized matrix.
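A sketch of that paragraph encoding, with toy vectors and counts (the real freq values are word counts over all 15 books):

```python
import numpy as np

def encode_paragraph(words, vectors, freq):
    """Sum frequency-weighted word vectors, then normalize to unit length."""
    dim = len(next(iter(vectors.values())))
    total = np.zeros(dim)
    for w in words:
        if w in vectors:
            total += vectors[w] / freq[w]  # rare words contribute more
    norm = np.linalg.norm(total)
    return total / norm if norm > 0 else total

toy_vecs = {"fuego": np.array([1.0, 0.0]), "harry": np.array([0.0, 1.0])}
toy_freq = {"fuego": 10, "harry": 1000}
p = encode_paragraph(["fuego", "harry", "harry"], toy_vecs, toy_freq)
# The rare word "fuego" dominates the encoded paragraph.
```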
With this big matrix, the best thing to do is a singular value decomposition (SVD) (also called a proper orthogonal decomposition, which I’ve written about here). It’s a great way to reduce a signal to its strongest components. In this case, it will show us how intensely certain words are used over the length of the book.
First, I subtracted off the average vector, since it says little about variation within the book. This average is close to a lot of rarely used words, such as the Latin ones. The SVD of the mean-subtracted data yields 100 orthogonal time series showing the change of the word vectors over time, their relative amplitudes, as well as the word-directions that these time series oscillate along (“axes”). The first time series has the greatest contribution, and they taper off quickly. You can see the magnitudes of their contributions (singular values) here, where the first time series holds almost 10% of the total “energy” of the book.
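In numpy terms, the decomposition looks like this (random data standing in for a real book’s paragraph matrix):

```python
import numpy as np

# Rows are encoded paragraphs, columns are the 100 vector dimensions.
rng = np.random.default_rng(0)
P = rng.normal(size=(300, 100))   # stand-in for one book's matrix

P = P - P.mean(axis=0)            # subtract the average vector
U, S, Vt = np.linalg.svd(P, full_matrices=False)

# U[:, k] * S[k] is the k-th time series over the book;
# Vt[k] is the word-direction ("axis") it oscillates along.
energy = S**2 / np.sum(S**2)      # relative contribution of each mode
```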
But what does each axis mean? Using the words most closely aligned to the directions of each axis, we find that the first axis is “trust” in the positive direction, and “wood” in the negative. Huh. That is very confusing. Looking at the words near these vectors, a better description would be the Emotional-Physical axis.
To illustrate the SVD approach, check out the following diagram that shows the themes of the 12th book in the series, Changes. Each red/blue barcode shows the oscillation of its axis over the length of the book: red is intense one way, and blue is intense the other. The start of the book is the left end, and the conclusion is at the right. These are listed from top to bottom in order of most to least influence. They are titled by my summary of the axis-related words (since “trust-wood” is basically meaningless). The first word in each pair is the red direction, and the second is blue. There are also a bunch of vertical lines plotted, which indicate important plot points. The events at each line are summarized below the figure, hidden by a spoiler tag.
For example, on the top plot, red is “physical”. Line 1 is where Dresden’s office explodes with much force, and line 15 is where the Ick is defeated by crushing it under tons of rock. In contrast, you see blue (“emotional”) when Harry is pleading with Uriel (line 11), or stressed while at the FBI office (line 13). This same sort of analysis can be used for each of the axes.
At each dashed line (SPOILERS!):
This is a pretty good way to illustrate the themes in a book! A more intelligent choice of word-axes could be useful in the future, though.
Since I have all the books, here is a diagram showing the top five axes in each book. You can get a general sense of where exciting things happen, but there’s so much data!
I particularly like the idea of turning books into red/blue barcodes. It’s a neat way to look at them!