Text Processing of The Dresden Files: Statistics

The Dresden Files is a series of very fun books by Jim Butcher, in which a wizard detective gets into trouble, saves the world, and all that. I’ve wanted to do some Natural Language Processing (NLP) on a body of text, and with fifteen novels so far (two more later this year!), this is a nice big one to play with.

This post deals primarily with word statistics. The next post will use more advanced analysis techniques, such as Word2Vec and clustering. I’ll work on answering:

  • How many words are used in each book? How has this changed over the series?
  • How many unique words are used? How many of those words are hapax legomena, which are only used once?
  • What words are special in each book? That is, what words are used unexpectedly frequently?
  • What silly extrapolations can be formed from this data?

I’m using a package called Gensim, which simplifies a lot of the analysis. In combination with Pattern, this makes it simple to recognize that “walking”, “walks”, and “walked” are all basically the same word, through a process called lemmatization. It also trims out a lot of the grammatical elements like articles, conjunctions, etc. This is my first time using Gensim, and I found a lot of help in a tutorial on Machine Learning Plus.


Word Counts

First, the simple metrics are plotted below. The left plots are the total word count and the number of unique words in each book. On the right, those are plotted against each other.

As the series continues, the books are generally getting longer and using more unique words. The relationship looks very linear, so I included a linear fit. The fit shows that there’s a basic vocabulary size of about 3000 words (the intercept), and every extra 1000 words of text necessitates another 62 words of vocabulary (the slope).
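For a sense of how a fit like this could be computed, here is a minimal sketch in plain Python. It assumes each book is already available as a list of lemmatized tokens. The names `vocabulary_points` and `linear_fit` are made up for illustration and are not from the original analysis, which may well have used a library routine instead.

```python
def vocabulary_points(books):
    """Return (total word counts, unique word counts), one entry per book.

    `books` is assumed to be a list of token lists, one list per book.
    """
    totals = [len(tokens) for tokens in books]
    uniques = [len(set(tokens)) for tokens in books]
    return totals, uniques


def linear_fit(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x, and of cross deviations in x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

With total words on the x-axis and unique words on the y-axis, the intercept plays the role of the basic vocabulary size, and a slope of 0.062 unique words per word of text is the same thing as 62 new words per 1000 words.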

But that’s counting all of the words. How frequently are the words used? According to Zipf’s law, a word’s frequency is roughly inversely proportional to its rank, so a small number of words should be used much more frequently than the rest. Let’s plot that, where the x-axis is the ranking of the word (from most- to least-used within a book), and the y-axis is the frequency of use in that book.

The dashed line is (index)^-1, for comparison.
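The rank-versus-frequency data behind a plot like this could be built roughly as follows. This is only a sketch under the assumption that a book is a flat list of tokens; `rank_frequency` and `zipf_reference` are hypothetical helper names, and the reference curve here is the bare 1/rank with no rescaling, which may differ from how the dashed line was actually drawn.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return [(rank, relative frequency), ...], most-used word first."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # Sort relative frequencies from largest to smallest; rank starts at 1.
    freqs = sorted((c / total for c in counts.values()), reverse=True)
    return list(enumerate(freqs, start=1))


def zipf_reference(n_ranks):
    """Unscaled (rank)^-1 comparison curve for ranks 1..n_ranks."""
    return [(rank, 1.0 / rank) for rank in range(1, n_ranks + 1)]
```

Plotting both lists on log-log axes makes the comparison easy to read, since a pure power law shows up there as a straight line.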

Indeed, the most-used words account for so much of the text that it may be worth ignoring some of the least-used words. Let’s include a plot that removes the least frequently used 5% of words:

The trimmed vocabulary grows by only about 22 words per 1000 words of text. That means that most of the unique words are used infrequently. Just how infrequently? Let’s go for the extreme, and look at the hapax legomena (HL). There are 6,327 words used only one time throughout the series, out of 21,285 unique words. How are these distributed?

The following figure shows the number of HL per book in the top plot, and the ratio of HL to vocabulary size for each book in the bottom plot. As the books get longer, more HL are introduced. Looking at the ratio, their rate of introduction is also increasing, but slowly.
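One way to count hapax legomena per book is sketched below. It assumes that "used only once" is judged across the whole series (as in the 6,327 figure) and that each HL is then credited to the one book it appears in; the original analysis may have counted differently. The `books` mapping and the function name `hapax_per_book` are hypothetical.

```python
from collections import Counter


def hapax_per_book(books):
    """Count series-wide hapax legomena in each book.

    `books` is assumed to map a title to that book's list of tokens.
    Returns (set of hapaxes, {title: (hapax count, hapax / vocabulary size)}).
    """
    series_counts = Counter()
    for tokens in books.values():
        series_counts.update(tokens)
    # A hapax legomenon occurs exactly once in the combined text.
    hapaxes = {word for word, count in series_counts.items() if count == 1}

    per_book = {}
    for title, tokens in books.items():
        vocab = set(tokens)
        n_hapax = len(vocab & hapaxes)
        ratio = n_hapax / len(vocab) if vocab else 0.0
        per_book[title] = (n_hapax, ratio)
    return hapaxes, per_book
```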


Special Words in Each Book

The most frequently used words are boring. The top ten are: say, get, go, one, look, like, know, back, could, eye. Extremely boring. What’s more interesting is the words that are over-represented in each book.

Since I can compute the frequency of a word for each book (FB) and for the full library (FL), I can say that a word is special to a book if (FB)/(FL) is large. Unfortunately, this ratio is unreliable for words that are only used a handful of times, so I’ll ignore any word that is used fewer than ten times in the full text. I’ve left the words in their lemmatized form to show some of the difficulties of NLP. For example, the lemmatizer decides to simplify “Tommy” to “tomm” and converts “thieves” to “thieve”. But you can figure out what they’re supposed to be. Here are the 10 most special words for each book:

Storm Front: linda, monica, stanton, jennifer, randall, gimpy, tomm, threeeye, paula, providence
Fool Moon: streetwolf, harris, flatnose, benn, jailer, tera, macfinn, parker, denton, kim
Grave Peril: kyle, kelly, ferro, agatha, hagglethorn, micky, lydium, malone, kravos, dais
Summer Knight: talos, grum, korrick, elidee, reuel, meryl, chlorofiend, unravele, centaur, ronald
Death Masks: snakeman, garcia, ulsharavas, gaston, thieve, francisca, vincent, shiro, mordite, ortega
Blood Rites: giselle, lisa, vixen, silverlight, paintball, trixie, tricia, genosa, darkhound, arturo
Dead Beat: mendoza, bartlesby, kumori, grevane, tyrannosaur, shiela, casey, xian, bock, necromancy
Proven Guilty: lemonade, parapet, greene, glau, pell, sandra, rosie, rick, scarecrow, crane
White Night: priscilla, olivia, vitto, bonnie, skavis, vittorio, ordo, helen, blanche, malvora
Small Favor: billie, hob, namshiel, aquarium, oceanarium, myrk, bart, rosanna, magog, torelli
Turn Coat: derek, vince, shagnasty, evelyn, skinwalker, madeline, graver, kirby, peabody, mai
Changes: alamaya, esteban, esmerelda, stevie, devourer, tilly, duchess, arianna, eebs, centipede
Ghost Story: aristedes, lemur, marci, stu, felicia, wolfwaffen, hideout, quadrant, fitz, stuart
Cold Days: sharkface, munstermobile, cleaver, redcap, rawhead, sarissa, sith, lacuna, winchester, barge
Skin Game: grinder, parkour, ascher, skateboard, octokong, hannah, genoskwa, grail, goodman, harvey

To see the top 10 words that aren’t people’s names:

Storm Front: gimpy, threeeye, providence, canister, scorpion, monday, saturday, scuttle, sell, balcony
Fool Moon: jailer, streetwolf, garou, loup, lycanthrope, moon, northwest, compass, werewolf, monitor
Grave Peril: rift, courtyard, infant, nightmare, invitation, hellhound, talisman, nurse, basket, hush
Summer Knight: chlorofiend, unravele, centaur, unicorn, carriage, changele, toad, podium, vote, emissary
Death Masks: snakeman, thieve, mordite, prophecy, shroud, concourse, tattoo, plague, stadium, noose
Blood Rites: vixen, silverlight, darkhound, paintball, malocchio, barbie, renfield, pup, puppy, mama
Dead Beat: tyrannosaur, necromancy, undead, campus, zombie, liver, disciple, brioche, polka, zombies
Proven Guilty: lemonade, parapet, scarecrow, phage, reaper, convention, splattercon, fetch, venatori, summoner
White Night: ordo, trainee, thrall, ghoul, cavern, suicide, throne, bead, ash, cloak
Small Favor: hob, oceanarium, myrk, aquarium, dolphin, pentagram, gruff, snowball, workshop, mantis
Turn Coat: skinwalker, naagloshii, edinburgh, wheelchair, traitor, intellectus, spider, dock, zero, cottage
Changes: devourer, duchess, centipede, ick, jaguar, crutch, pyramid, backboard, slim, altar
Ghost Story: hideout, lemur, wolfwaffen, quadrant, baldy, wraith, eternal, immaterial, ectomancer, fomor
Cold Days: sharkface, munstermobile, cleaver, rawhead, winchester, barge, malk, tugboat, adversary, caddy
Skin Game: parkour, genoskwa, skateboard, octokong, grail, squire, amphitheater, vault, slaughterhouse, sleet

This does a good job of highlighting important themes, characters, and scenes. A few things jump out that I didn’t remember immediately. For example, in Proven Guilty, there is a scene in Mac’s tavern where lemonade is mentioned several times (the only point in the series that uses “lemonade”). Also in Proven Guilty, a climactic scene is on the parapet of a castle in the Nevernever. In Changes, Stevie D is a hired gun who bursts into the room with Butters after Harry is hurt. Skin Game has the exciting chase with the skateboard. Lots to remember!
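The (FB)/(FL) score used for these lists can be sketched as follows. This is an illustration rather than the code that produced the lists: it assumes `books` maps each title to a list of lemmatized tokens, the function name `special_words` is made up, and ties are broken alphabetically as an arbitrary choice.

```python
from collections import Counter


def special_words(books, min_series_count=10, top_n=10):
    """Rank words in each book by (frequency in book) / (frequency in series)."""
    series_counts = Counter()
    for tokens in books.values():
        series_counts.update(tokens)
    series_total = sum(series_counts.values())

    result = {}
    for title, tokens in books.items():
        book_counts = Counter(tokens)
        book_total = len(tokens)
        scores = {}
        for word, count in book_counts.items():
            # Skip rare words, whose ratios are dominated by noise.
            if series_counts[word] < min_series_count:
                continue
            fb = count / book_total                    # frequency in this book
            fl = series_counts[word] / series_total    # frequency in the full library
            scores[word] = fb / fl
        # Highest ratio first; alphabetical order breaks ties.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[title] = ranked[:top_n]
    return result
```

A word that appears only in one book gets the largest possible ratio for that book, which is why character names dominate the first set of lists.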


Extrapolations

Let’s make some unreasonable extrapolations using linear fits, which is a bad and wrong idea!

When will book 16 come out? We know the true answer (July 14, 2020), but this reflects a slow-down from what had been an extremely rapid release pace. If that pace had held, book 16 would have come out on September 28, 2015, and the most recent book would have been the twentieth, released in January 2020.

Let’s be even more unreasonable. When will the books be entirely made of hapax legomena? This will be the 521st novel, which will have 8,900 pages and be released in July of 2557. This uses an average of 146 words per page, based on the first edition hardcover page counts.

When will a book use each of the 170,000 words in the Oxford English Dictionary? That’ll happen in book number 1107, which will be 18,600 pages long, released in February of 3185.

Let’s say that a fast reader covers 300 words per minute. When will a book be impossible to finish in one year? No sleep, no food, only Dresden Files. That’d be 1.1 million pages long, and the 65,453rd book in the series. Expect it in January of 72178.
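The page count for the one-year book follows directly from the numbers quoted above (300 words per minute, 146 words per page, reading around the clock). A quick check, assuming a 365-day year; the constant and function names are just for illustration:

```python
WORDS_PER_MINUTE = 300            # reading speed quoted in the text
WORDS_PER_PAGE = 146              # average words per page quoted in the text
MINUTES_PER_YEAR = 60 * 24 * 365  # assumes a 365-day year of nonstop reading


def pages_readable_in_one_year(wpm=WORDS_PER_MINUTE, words_per_page=WORDS_PER_PAGE):
    """Pages a reader could finish in one year of uninterrupted reading."""
    words = wpm * MINUTES_PER_YEAR
    return words / words_per_page


if __name__ == "__main__":
    print(f"{pages_readable_in_one_year():,.0f} pages")
```

That works out to about 1.08 million pages, which rounds to the 1.1 million quoted. The book number and release date depend on the fitted growth in book length and release pace, which are not reproduced here.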

Life should be extinct on Earth in about 2.8 billion years. This is a solid upper bound on the end date of the Dresden Files. By then, the 2.6 billionth novel in the series should be wrapping things up. This book will be 43 billion pages long, and about as tall as the distance from New York to Los Angeles. Unfortunately, those cities will be long gone, along with much of the continent, oceans, tectonic motion, and hope for more sequels.
