I watched a great video about the creation of the NATO phonetic spelling alphabet, and wondered if I could do something similar with optimization.
The video showed that the NATO alphabet was designed to make it easier to spell words over radio and telephone channels, which may have poor signal quality. There are a few requirements:
- One word per letter of the English alphabet.
- Words are easy to distinguish from each other.
- Words are fairly short.
- Words are easy to pronounce for speakers of English, French, and Spanish.
(To simplify my approach to the problem, I’m going to ignore the multilingual requirement.)
So how can we approach this algorithmically? We need a few things:
- A function that tells us the “distance” between two words, based on their pronunciation.
- A list of common words, filtered for length and complexity.
- Code that will optimize a list of words (one per letter) to minimize their similarity.
Since this involves a bunch of word processing, I'll use Python.
## Distance between words
What does it even mean to compute the distance between words? I want them to sound different, so I’ll need a way to turn a word into phonemes, then a metric for comparing those to another word’s phonemes.
Luckily, there are libraries to do these things! First, I’ll use pronouncing to convert the words into ARPABET phonemes, then use phonecodes to convert to the International Phonetic Alphabet, then convert each phoneme into a feature vector using panphon. A feature vector has 22 entries to describe the sound: where it’s formed in the mouth, if it’s sonorant, voiced, nasal, etcetera. Ultimately, each word becomes a list of feature vectors. (Well, if the word has multiple pronunciations, it’ll be a list of lists. But close enough.)
Great, now I can use panphon's weighted edit distance to tell me how much editing is needed to transform one word into another. This is weighted by the feature vectors, since some sounds substitute for each other more readily than others.
Finally, I’ve got a function to compute that (bat, bat) has a distance of 0, (bat, cat) has a distance of 2.25, and (bat, dinosaur) is 39.125. Identical words have distance 0, similar words have a small distance, and very different words have a large distance. Good!
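Under the hood, that's a weighted edit distance over sequences of feature vectors. Here's a self-contained sketch of the idea with made-up three-entry feature vectors (panphon's real vectors have 22 entries and real weights, so the numbers won't match the ones above):

```python
# A toy weighted edit distance over phoneme feature vectors.
# Real feature vectors come from panphon and have 22 entries;
# these 3-entry vectors (voiced, nasal, sonorant) are made up.
FEATURES = {
    "b":  (1, 0, 0),
    "k":  (0, 0, 0),
    "t":  (0, 0, 0),
    "ae": (1, 0, 1),
}

def sub_cost(p, q):
    """Substitution cost: the number of features that differ."""
    return float(sum(a != b for a, b in zip(FEATURES[p], FEATURES[q])))

def weighted_edit_distance(xs, ys, indel=1.0):
    """Edit distance over phoneme sequences; substitutions between
    similar-sounding phonemes cost less than between dissimilar ones."""
    m, n = len(xs), len(ys)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * indel
    for j in range(1, n + 1):
        d[0][j] = j * indel
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + indel,                               # deletion
                d[i][j - 1] + indel,                               # insertion
                d[i - 1][j - 1] + sub_cost(xs[i - 1], ys[j - 1]),  # substitution
            )
    return d[m][n]

print(weighted_edit_distance(["b", "ae", "t"], ["b", "ae", "t"]))  # 0.0
print(weighted_edit_distance(["b", "ae", "t"], ["k", "ae", "t"]))  # 1.0
```

With the real feature vectors and weights, "bat" vs "cat" lands at the 2.25 above instead of 1.0, but the mechanics are the same.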
Next, I need to create a list of candidate words.
## List of words
The pattern library has a useful file: a sorted list of the normalized frequency of English language words. It has 11,990 words, ranging in frequency from “the” to “zoo”. Right around the middle is “vacuum”.
I'll filter this list to limit the syllable count and number of phonemes, remove profanity, and drop words that sound like they start with the wrong letter (like express, know, effort, and colonel).
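As a sketch of that filtering, with a tiny hypothetical pronunciation table standing in for the pronouncing library, syllables approximated by counting vowel phonemes, and an illustrative (incomplete) map of which first phonemes are acceptable per letter:

```python
# Stand-in pronunciations (ARPABET-style, stress digits stripped).
# The real code pulls these from the `pronouncing` library.
PRONUNCIATIONS = {
    "kick":     ["K", "IH", "K"],
    "know":     ["N", "OW"],   # starts with an N sound, not a K sound!
    "kilogram": ["K", "IH", "L", "AH", "G", "R", "AE", "M"],
}

VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
          "EY", "IH", "IY", "OW", "OY", "UH", "UW"}

# Which first phonemes are acceptable for a letter (illustrative only).
LETTER_SOUNDS = {"k": {"K"}}

def keep(word, max_phonemes=6, max_syllables=3):
    """Keep a word only if it's short enough and sounds like it
    starts with the letter it's supposed to represent."""
    phones = PRONUNCIATIONS[word]
    syllables = sum(p in VOWELS for p in phones)
    starts_right = phones[0] in LETTER_SOUNDS.get(word[0], {phones[0]})
    return (len(phones) <= max_phonemes
            and syllables <= max_syllables
            and starts_right)

print([w for w in PRONUNCIATIONS if keep(w)])  # ['kick']
```

Here "know" is rejected for starting with the wrong sound and "kilogram" for having too many phonemes.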
## Optimization
What would the best spelling alphabet look like? I want to be sure the words aren’t confused, so it should maximize the minimum distance between any words. That would ensure that no two words are similar. Then as a secondary metric, the average distance between words should be large.
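That objective can be expressed as a tuple compared lexicographically: minimum pairwise distance first, mean pairwise distance as the tiebreaker. A sketch, with a stand-in distance function in place of the phonetic one:

```python
from itertools import combinations
from statistics import mean

def score(alphabet, dist):
    """Score a candidate alphabet: maximize the minimum pairwise
    distance first, then break ties on the mean pairwise distance.
    Tuples compare lexicographically, so bigger is better."""
    ds = [dist(a, b) for a, b in combinations(alphabet, 2)]
    return (min(ds), mean(ds))

# Stand-in distance for illustration: absolute length difference.
toy_dist = lambda a, b: abs(len(a) - len(b))

print(score(["alpha", "bravo", "charlie"], toy_dist))
print(score(["alpha", "bravo", "delta"], toy_dist))
```

Swap `toy_dist` for the phonetic distance function and this scores real alphabets.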
Unfortunately, there are an enormous number of possible combinations of words. Using the word list with 11,990 entries, each letter has between 3 (X) and 1,418 (S) candidate words. That yields 4.4×10^62 combinations. That's absurd. Truly impossible to check each one. But that's why we have the frequency-ordered list. We can just keep the top N most frequent words for each letter, assuming they are valid, and work with those. If we keep the top 50, that yields 9.7×10^41. Still absurd, but much smaller!
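Keeping the top N per letter is a single pass over the frequency-sorted list. A minimal sketch:

```python
from collections import defaultdict

def top_n_per_letter(words_by_frequency, n=50):
    """Keep only the n most frequent words for each starting letter.
    Assumes the input is already sorted from most to least common
    (and already filtered for validity)."""
    buckets = defaultdict(list)
    for word in words_by_frequency:
        if len(buckets[word[0]]) < n:
            buckets[word[0]].append(word)
    return dict(buckets)

words = ["the", "time", "about", "after", "than", "apple"]
print(top_n_per_letter(words, n=2))
# {'t': ['the', 'time'], 'a': ['about', 'after']}
```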
So we can’t do an exhaustive search. Time to be slightly more clever.
The simplest approach is random search, with some narrowing over time. At the start, we'd swap 50% of the entries with random words, and at the end just try one at a time. At each iteration, we keep the new alphabet if it's better than the previous one. This works, but isn't smart.
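A sketch of that narrowing random search; `score` and `candidates` are whatever objective function and per-letter candidate lists you're using:

```python
import random

def random_search(alphabet, candidates, score, iterations=1000):
    """Random search: swap a shrinking fraction of entries for random
    candidate words, keeping the new alphabet whenever it scores better.

    candidates[i] is the list of valid words for position i."""
    best = list(alphabet)
    best_score = score(best)
    for i in range(iterations):
        frac = 0.5 * (1 - i / iterations)        # narrow from 50% toward 0
        n_swaps = max(1, int(frac * len(best)))  # always try at least one
        trial = list(best)
        for j in random.sample(range(len(best)), n_swaps):
            trial[j] = random.choice(candidates[j])
        trial_score = score(trial)
        if trial_score > best_score:
            best, best_score = trial, trial_score
    return best
```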
How about starting with a better alphabet? For each letter, we could choose the word that is farthest from the candidate words for all the other letters. That's a slow initial investment (about 25·26·N² comparisons, or 6.5 million with N=100), but yields an initial alphabet that's quite distinct.
Instead, we could go letter-by-letter and choose the word that is most distinct from the rest of the word list. That would provide a decent starting list.
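A sketch of the farthest-start idea, picking each letter's word to maximize its total distance to every candidate for the other letters (`dist` is the phonetic distance from earlier; this is the slow quadratic version):

```python
def farthest_start(candidates, dist):
    """For each letter, pick the candidate word whose total distance to
    every candidate word for the *other* letters is largest.

    candidates maps a letter to its list of valid words."""
    letters = sorted(candidates)
    others = {
        L: [w for M in letters if M != L for w in candidates[M]]
        for L in letters
    }
    return {
        L: max(candidates[L], key=lambda w: sum(dist(w, o) for o in others[L]))
        for L in letters
    }
```

The cheaper letter-by-letter variant would compare each candidate against the whole word list once instead of against every other letter's candidates.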
Another way to improve is greedy: find the letter whose word is the worst in the list, then swap it with the best available replacement.
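A sketch of one greedy step, using "distance to the nearest neighbour" as each word's measure of badness (an assumption on my part; any per-word badness works here):

```python
def greedy_swap(alphabet, candidates, dist):
    """One greedy step: find the word closest to its nearest neighbour,
    then replace it with the candidate that improves that the most.

    candidates[i] is the list of valid words for position i."""
    def nearest(i, words):
        return min(dist(words[i], w) for j, w in enumerate(words) if j != i)

    worst = min(range(len(alphabet)), key=lambda i: nearest(i, alphabet))
    best_word = max(
        candidates[worst],
        key=lambda w: nearest(worst, alphabet[:worst] + [w] + alphabet[worst + 1:]),
    )
    out = list(alphabet)
    out[worst] = best_word
    return out
```

Repeating this step until no swap improves the score gives the greedy optimizer.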
But I have no idea what'll work well, so let's try each combination of initial conditions (random choice, NATO, most dissimilar) and optimizers (random search, greedy swap).
## Results
Matching the NATO alphabet’s constraints:
I’ll limit the number of phonemes and syllables to match the NATO alphabet, and hang onto 100 words per letter (and the NATO words). How do the results look?
- The NATO alphabet is a good starting point. Farthest-start is better. Random start is bad because it has no thought behind it.
- The optimization works! In each case, the optimizer does better than the initial data. Random search works decently, but the greedy search does better.
- The best result was farthest-start with the greedy optimization.
- A minimum word distance of 21 is pretty great, compared to the NATO value of 8.9.

## Limiting syllables
So we can optimize for more dissimilar words, but what about more restricted cases? Like fixing the number of syllables? For these weird cases, if the filter removes all the words for a letter, I’ll use the NATO word for that one.
- I’m surprised that the optimized one-syllable word list is “better” than the original NATO alphabet.
- The syllabically-limited cases aren’t as optimal as the earlier result, but still decent.
- Oddly, the 2-syllable case is the best of these.
| Constraint | NATO original | 1 syllable | 2 syllables | 3 syllables | 4 syllables | NATO-like |
|---|---|---|---|---|---|---|
| Min Distance | 8.9 | 10.5 | 18.5 | 17.875 | 13.375 | 21.5 |
| Mean Distance | 24.9 | 22.7531 | 33.6208 | 37.6246 | 48.0515 | 34.9742 |
| A | alpha | act | anger | advantage | acknowledgement | abruptly |
| B | bravo | blind | breakdown | beautifully | bureaucratic | behind |
| C | charlie | change | cambridge | concerto | carbohydrate | closely |
| D | delta | dry | doorway | digestion | disappointment | drink |
| E | echo | ear | earthquake | erosion | eternally | east |
| F | foxtrot | french | foxtrot | factual | formulation | foxtrot |
| G | golf | glimpse | goldsmith | genuine | genuinely | genuine |
| H | hotel | her | household | hopelessly | hierarchical | heroic |
| I | india | inch | instinct | influence | indigestion | instinct |
| J | juliet | joke | judgement | journalist | jurisdiction | justify |
| K | kilo | kick | kibbutz | kilogram | kilo | kilogram |
| L | lima | lounge | livestock | likelihood | legislative | loyalty |
| M | mike | march | mainstream | murderer | menstruation | mixture |
| N | november | noon | network | nostalgia | negligible | nitrogen |
| O | oscar | ounce | opera | observer | objectively | orthodox |
| P | papa | plump | pleasure | picturesque | perimeter | power |
| Q | quebec | queue | quantum | quotation | questionable | quantity |
| R | romeo | realm | restraint | regiment | reconstruction | romeo |
| S | sierra | strength | shakespeare | scientist | subsequently | spirit |
| T | tango | twelve | transport | treasurer | triumphantly | temple |
| U | uniform | urge | unkind | unbroken | unemployment | uniform |
| V | victor | valve | viewpoint | volcano | vocational | volcano |
| W | whiskey | warmth | wildlife | warrior | wonderfully | wildlife |
| X | x-ray | x-ray | x-ray | x-ray | x-ray | x-ray |
| Y | yankee | york | yorkshire | yesterday | yankee | youngest |
| Z | zulu | zoo | zulu | zimbabwe | zulu | zimbabwe |
## The worst spelling alphabet
What if I try to make the worst spelling alphabet? That’d have the words that are the easiest to confuse. In this case, I optimized for the lowest average distance between words.
Here’s a matrix showing how similar each word is to each other one, for a mean distance of 9.7. That’s much, much worse than the useful cases above.

For example, it has "tight / sight / night". Also "bait / date / gate / hate". Overall, just a terrible list of words. If you want to really be useless at spelling, use these.
## Conclusion
Personally, I like the two-syllable word list. It feels more resistant to errors than the single-syllable list, and feels less silly than the broadly-optimized list.
It’s fun to optimize for silly things.