Optimizing the English Alphabet

The alphabet used for the English language isn’t great. It generally works, but doesn’t capture the full spread of sounds. Depending on the dialect (and analysis method), there are 38 to 49 phonemes in spoken English. Also, the spelling system is awful.

Of course, every problem is approachable with optimization, so I’ll run an analysis and make some weird phonetic “alphabets”. I have that in quotes because I’ll just be using IPA symbols, rather than inventing new symbols. (I’m not Sejong the Great, after all.)

Approach

  • Collect a list of English phonemes and their frequencies.
  • Cluster the phonemes, using their feature vectors, into a smaller “alphabet”.
  • Translate an input sentence into the new alphabet.

More detail, complications

I used a word frequency file from the pattern library to get the frequency of each word. Then I used pronouncing to convert the words into ARPABET phonemes, and phonecodes to convert those to the International Phonetic Alphabet. Finally, I converted each phoneme into a feature vector using panphon.
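As a rough sketch of the counting step, the snippet below turns word frequencies into phoneme frequencies. The tiny `WORD_FREQ` and `PRONUNCIATIONS` tables are hypothetical stand-ins for the pattern word list and the pronouncing lookup, and stripping the ARPABET stress digits is my assumption about how the symbols were normalized.

```python
from collections import Counter

# Hypothetical stand-ins for the real word list and pronunciation lookup.
WORD_FREQ = {"the": 500, "cat": 20, "that": 100}
PRONUNCIATIONS = {
    "the": ["DH AH0", "DH IY0"],    # only the first pronunciation is used
    "cat": ["K AE1 T"],
    "that": ["DH AE1 T", "DH AH0 T"],
}

def strip_stress(phone):
    """Drop a trailing ARPABET stress digit, e.g. 'AE1' -> 'AE'."""
    return phone.rstrip("012")

def phoneme_frequencies(word_freq, pronunciations):
    """Add each word's frequency to every phoneme in its first pronunciation."""
    counts = Counter()
    for word, freq in word_freq.items():
        variants = pronunciations.get(word)
        if not variants:
            continue  # skip words with no known pronunciation
        for phone in variants[0].split():
            counts[strip_stress(phone)] += freq
    return counts

freqs = phoneme_frequencies(WORD_FREQ, PRONUNCIATIONS)
print(freqs.most_common())
```

A phoneme that occurs twice in a word is counted twice, which seems like the natural choice for measuring how often a sound is actually spoken.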

Unfortunately, the libraries don’t cooperate perfectly. The feature vectors don’t include some diphthongs, so I’ve had to change the ARPABET symbols “OY” into “AO IH” and “AW” into “AE AU”. They also don’t include rhoticity, so I used the feature vector of “ɜ” in place of “ɝ” (or “ER”). Finally, I had to properly include tie bars for the digraphs “d͡ʒ” (“JH”) and “t͡ʃ” (“CH”).
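These patches amount to a couple of lookup tables. Here is a minimal sketch, with the replacement values copied from the paragraph above; the function names are made up for illustration, and I'm assuming the converter emits the affricates without tie bars.

```python
TIE = "\u0361"  # combining double inverted breve, the IPA tie bar

# Diphthongs with no feature vector, split as described in the text.
ARPABET_SPLITS = {
    "OY": ["AO", "IH"],
    "AW": ["AE", "AU"],  # copied as written in the post
}

# IPA symbols swapped before the feature lookup.
IPA_FIXES = {
    "ɝ": "ɜ",                  # no rhoticity feature, so use the plain vowel
    "dʒ": "d" + TIE + "ʒ",     # JH
    "tʃ": "t" + TIE + "ʃ",     # CH
}

def split_diphthongs(phones):
    """Expand unsupported diphthongs into their two-vowel replacements."""
    out = []
    for phone in phones:
        out.extend(ARPABET_SPLITS.get(phone, [phone]))
    return out

def fix_ipa(symbol):
    """Return the symbol whose feature vector should be looked up."""
    return IPA_FIXES.get(symbol, symbol)

print(split_diphthongs(["T", "OY"]))
print(fix_ipa("tʃ"))
```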

Then I used panphon’s feature weights to scale the feature vectors, to get a better measure of the distance between phonemes. The sklearn package’s KMeans let me weight the phonemes by their frequency, then create clusters. I found each cluster’s most representative phoneme, and used those as the new alphabet.
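To make the clustering step concrete without pulling in any libraries, here is a hand-rolled, frequency-weighted Lloyd's iteration standing in for sklearn's KMeans. Everything in the toy data is invented for illustration (the four features, their weights, and the frequencies are not panphon's values), and I'm assuming "most representative" means the cluster member closest to the centroid.

```python
import math

# Hypothetical toy data: four phonemes, four invented features and weights.
SYMBOLS = ["ʌ", "ɪ", "t", "d"]
FEATURES = [[1, -1, 1, -1], [1, -1, 1, 1], [-1, 1, -1, -1], [-1, 1, 1, -1]]
FEATURE_WEIGHTS = [1.0, 1.0, 0.5, 0.25]
FREQS = [50, 30, 40, 25]

def assign(points, centers):
    """Index of the nearest centre for every point."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]

def weighted_kmeans(points, freqs, k, iterations=50):
    # Deterministic start from the k most frequent points (an assumption;
    # sklearn's KMeans defaults to k-means++ seeding).
    order = sorted(range(len(points)), key=lambda i: -freqs[i])
    centers = [list(points[i]) for i in order[:k]]
    for _ in range(iterations):
        labels = assign(points, centers)
        updated = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            total = sum(freqs[i] for i in members)
            if total == 0:
                updated.append(centers[c])  # leave an empty cluster in place
                continue
            # Frequency-weighted mean of the members, one dimension at a time.
            updated.append([sum(freqs[i] * points[i][d] for i in members) / total
                            for d in range(len(points[0]))])
        if updated == centers:
            break
        centers = updated
    return assign(points, centers), centers

def representatives(symbols, points, labels, centers):
    """For each cluster, the member closest to the cluster centre."""
    reps = {}
    for c, center in enumerate(centers):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            best = min(members, key=lambda i: math.dist(points[i], center))
            reps[c] = symbols[best]
    return reps

# Scale each feature by its weight before measuring distances.
scaled = [[v * w for v, w in zip(vec, FEATURE_WEIGHTS)] for vec in FEATURES]
labels, centers = weighted_kmeans(scaled, FREQS, k=2)
reps = representatives(SYMBOLS, scaled, labels, centers)
print(sorted(reps.values()))
```

In this toy run the vowels and the consonants end up in separate clusters, and the more frequent member of each pair pulls the centroid toward itself, so it becomes the representative.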

Now I can make alphabets ranging from one to thirty-six characters!

Resulting alphabets

Now that I’ve got alphabets, how well do they work?

Let’s compute the error as the total frequency-weighted distance from each symbol to its replacement. As a baseline, say that a single symbol has 100% error. As we increase the number of symbols, the error goes down.

With a single symbol, the alphabet is just “h”. That’s no good. Two symbols gets us a consonant and a vowel, “ʌ, d”, for 42% error.

It looks like there’s a big improvement with 8 symbols, then diminishing returns. It seems to really level off at 20, after which it is fine-tuning.
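The error measure can be sketched in a few lines. The points, frequencies, and symbol names below are hypothetical; the only part taken from the description is the rule itself: sum frequency times distance to the replacement, then divide by the single-symbol result to get a percentage.

```python
import math

# Hypothetical toy data: three symbols with 2-D feature vectors and frequencies.
POINTS = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 2.0]}
FREQS = {"a": 0.5, "b": 0.3, "c": 0.2}

def weighted_error(points, freqs, replacement_of):
    """Sum of frequency x distance from each symbol to its replacement."""
    return sum(freqs[s] * math.dist(points[s], points[replacement_of[s]])
               for s in points)

# One-symbol alphabet: everything is replaced by "a". This is the 100% baseline.
baseline = weighted_error(POINTS, FREQS, {"a": "a", "b": "a", "c": "a"})

# Two-symbol alphabet {"a", "c"}: only "b" still has to be replaced.
two = weighted_error(POINTS, FREQS, {"a": "a", "b": "a", "c": "c"})

relative = two / baseline
print(f"{relative:.0%} of the baseline error")
```

Symbols that survive into the alphabet contribute nothing, since their distance to themselves is zero, so the error can only shrink as symbols are added.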

Here are those alphabets:

Length, error level, alphabet
1, 2.362796, HH
2, 0.993778, AH, D
3, 0.722559, AH, N, T
4, 0.601745, AH, N, T, Z
5, 0.470228, AH, N, T, R, Z
6, 0.410011, AH, N, T, IH, R, Z
7, 0.388953, AH, N, T, IH, R, Z, HH
8, 0.323342, AH, N, T, IH, R, L, Z, K
9, 0.312420, AH, N, T, IH, R, L, Z, K, NG
10, 0.288048, AH, N, T, IH, R, L, Z, K, V, NG
11, 0.265201, AH, N, T, IH, R, L, Z, K, V, P, NG
12, 0.244793, AH, N, T, IH, R, L, Z, K, V, UW, P, NG
13, 0.227734, AH, N, T, IH, R, L, Z, K, V, UW, P, HH, NG
14, 0.208122, AH, N, T, IH, R, L, Z, K, V, UW, W, P, HH, NG
15, 0.187842, AH, N, T, IH, R, L, Z, K, V, UW, W, P, HH, EY, NG
16, 0.162346, AH, N, T, IH, R, L, IY, Z, K, V, UW, W, P, AY, HH, NG
17, 0.148394, AH, N, T, IH, R, L, IY, Z, K, V, UW, W, P, AY, AO, HH, NG
18, 0.129282, AH, N, T, IH, S, R, L, DH, IY, K, V, UW, W, P, AY, AO, HH, NG
19, 0.110008, AH, N, T, IH, S, R, L, DH, IY, M, K, V, UW, W, P, AY, AO, HH, NG
20, 0.096154, AH, N, T, IH, S, R, L, DH, IY, EH, M, K, V, UW, W, P, AY, AO, HH, NG
21, 0.093533, AH, N, T, IH, R, L, IY, AE, Z, EH, M, K, V, UW, W, P, AY, AO, HH, NG, CH
22, 0.086428, AH, N, T, IH, R, L, IY, AE, Z, EH, M, K, V, UW, W, P, AY, AO, HH, NG, Y, CH
23, 0.070163, AH, N, T, IH, S, R, L, DH, IY, AE, EH, M, K, V, UW, W, P, AY, AO, HH, NG, Y, CH
24, 0.058347, AH, N, T, IH, S, R, L, DH, IY, AE, EH, M, K, ER, V, UW, W, P, AY, AO, HH, NG, Y, CH
25, 0.051286, AH, N, T, IH, S, R, L, DH, IY, AE, EH, M, K, ER, V, UW, W, P, AY, AO, HH, EY, NG, Y, CH
26, 0.046194, AH, N, T, IH, S, R, L, DH, IY, AE, EH, M, K, ER, V, UW, W, P, AY, AO, HH, EY, NG, SH, Y, CH
27, 0.036348, AH, N, T, IH, S, R, L, DH, IY, AE, Z, EH, M, K, ER, V, UW, W, P, AY, AO, HH, EY, NG, SH, Y, CH
28, 0.031893, AH, N, T, IH, S, R, L, DH, IY, AE, Z, EH, M, K, ER, V, UW, W, P, AY, AO, HH, EY, NG, UH, SH, Y, CH
29, 0.026388, AH, N, T, IH, S, R, L, DH, IY, AE, Z, EH, M, K, ER, V, UW, W, P, AY, AO, HH, EY, OW, NG, UH, SH, Y, CH
30, 0.017668, AH, N, T, IH, S, R, L, DH, IY, D, AE, Z, EH, M, K, ER, V, UW, W, P, AY, AO, HH, EY, OW, NG, UH, SH, Y, CH
31, 0.013202, AH, N, T, IH, S, R, L, DH, IY, D, AE, Z, EH, M, K, ER, V, UW, W, P, AY, F, AO, HH, EY, OW, NG, UH, SH, Y, CH
32, 0.008891, AH, N, T, IH, S, R, L, DH, IY, D, AE, Z, EH, M, K, ER, V, UW, W, P, AY, F, AA, AO, HH, EY, OW, NG, UH, SH, Y, CH
33, 0.004412, AH, N, T, IH, S, R, L, DH, IY, D, AE, Z, EH, M, K, ER, V, UW, W, P, AY, B, F, AA, AO, HH, EY, OW, NG, UH, SH, Y, CH
34, 0.002545, AH, N, T, IH, S, R, L, DH, IY, D, AE, Z, EH, M, K, ER, V, UW, W, P, AY, B, F, AA, AO, HH, EY, OW, NG, UH, SH, Y, G, CH
35, 0.001246, AH, N, T, IH, S, R, L, DH, IY, D, AE, Z, EH, M, K, ER, V, UW, W, P, AY, B, F, AA, AO, HH, EY, OW, NG, UH, SH, Y, G, CH, JH
36, 0.000175, AH, N, T, IH, S, R, L, DH, IY, D, AE, Z, EH, M, K, ER, V, UW, W, P, AY, B, F, AA, AO, HH, EY, OW, NG, UH, SH, Y, G, CH, JH, TH
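The error column above is the raw weighted distance, not a percentage. To read it against the single-symbol baseline, divide each row by the first one. A quick sketch using a handful of rows copied from the table:

```python
# Raw error values copied from selected rows of the table above.
RAW_ERROR = {
    1: 2.362796,
    2: 0.993778,
    8: 0.323342,
    20: 0.096154,
    27: 0.036348,
    36: 0.000175,
}

baseline = RAW_ERROR[1]
relative = {n: err / baseline for n, err in RAW_ERROR.items()}

for n, share in relative.items():
    print(f"{n:>2} symbols: {share:.1%} of the single-symbol error")
```

The two-symbol row works out to about 42%, which matches the figure quoted earlier.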

How legible are these? I’ll switch back to IPA, and show example text for the different alphabets.

I’ve taken a sample text from the Speech Accent Archive, which designed this as a way to record a wide range of phonemes in spoken English.

The text is: “Please call Stella. Ask her to bring these things with her from the store: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob. We also need a small plastic snake and a big toy frog for the kids. She can scoop these things into three red bags, and we will go meet her Wednesday at the train station.”

The links below lead to a website that will read the IPA text. I’ll show the full result, then build back up to an understandable level.

  • Thirty-six: pliz kɔl stɛlʌ. æsk hɜ tu bɹɪŋ ðiz θɪŋz wɪð hɜ fɹʌm ðʌ stɔɹ: sɪks spunz ʌv fɹɛʃ snoʊ piz, faɪv θɪk slæbz ʌv blu tʃiz, ʌnd meɪbi ʌ snæk fɔɹ hɜ bɹʌðɜ bɑb. wi ɔlsoʊ nid ʌ smɔl plæstɪk sneɪk ʌnd ʌ bɪɡ tɔɪ fɹɑɡ fɔɹ ðʌ kɪdz. ʃi kæn skup ðiz θɪŋz ɪntu θɹi ɹɛd bæɡz, ʌnd wi wɪl ɡoʊ mit hɜ wɛnzdi æt ðʌ tɹeɪn steɪʃʌn.
    • This is the largest number of letters, and is very intelligible. Sorta robotic, thanks to the robot reading it, but fine.
  • Two: ddʌd dʌd ddʌdʌ. ʌdd dʌ dʌ dʌʌd dʌd dʌdd ʌʌd dʌ dʌʌd dʌ ddʌʌ: dʌdd ddʌdd ʌd dʌʌd ddʌ dʌd, dʌd dʌd ddʌdd ʌd ddʌ dʌd, ʌdd dʌdʌ ʌ ddʌd dʌʌ dʌ dʌʌdʌ dʌd. ʌʌ ʌddʌ dʌd ʌ ddʌd ddʌddʌd ddʌd ʌdd ʌ dʌd dʌʌ dʌʌd dʌʌ dʌ dʌdd. dʌ dʌd ddʌd dʌd dʌdd ʌddʌ dʌʌ ʌʌd dʌdd, ʌdd ʌʌ ʌʌd dʌ dʌd dʌ ʌʌdddʌ ʌd dʌ dʌʌd ddʌdʌd.
    • Entirely useless. A binary alphabet is theoretically usable, but not practical for humans.
  • Eight: tlɪz kʌl ztɪlʌ. ɪzk lʌ tɪ tɹɪn zɪz zɪnz ɹɪz lʌ zɹʌn zʌ ztʌɹ: zɪkz ztɪnz ʌz zɹɪz znʌ tɪz, zʌz zɪk zlɪtz ʌz tlɪ tɪz, ʌnt nɪtɪ ʌ znɪk zʌɹ lʌ tɹʌzʌ tʌt. ɹɪ ʌlzʌ nɪt ʌ znʌl tlɪztɪk znɪk ʌnt ʌ tɪk tʌɪ zɹʌk zʌɹ zʌ kɪtz. zɪ kɪn zkɪt zɪz zɪnz ɪntɪ zɹɪ ɹɪt tɪkz, ʌnt ɹɪ ɹɪl kʌ nɪt lʌ ɹɪnztɪ ɪt zʌ tɹɪn ztɪzʌn.
    • Starts to sound like a language, but not understandable.
  • Twenty: plis kɔl stɛlʌ. aɪsk hɛ tu pɹɪŋ ðis ðɪŋs wɪð hɛ vɹʌm ðʌ stɔɹ: sɪks spuns ʌv vɹɛs snʌ pis, vaɪv ðɪk slaɪps ʌv plu tis, ʌnt mipi ʌ snaɪk vɔɹ hɛ pɹʌðɛ paɪp. wi ɔlsʌ nit ʌ smɔl plaɪstɪk snik ʌnt ʌ pɪk tɔɪ vɹaɪk vɔɹ ðʌ kɪts. si kaɪn skup ðis ðɪŋs ɪntu ðɹi ɹɛt paɪks, ʌnt wi wɪl kʌ mit hɛ wɛnsti aɪt ðʌ tɹin stisʌn.
    • It sounds like a thick accent, but mostly understandable.
  • Twenty-seven: pliz kɔl stɛlʌ. æsk hɜ tu pɹɪŋ ðiz ðɪŋz wɪð hɜ vɹʌm ðʌ stɔɹ: sɪks spunz ʌv vɹɛʃ snʌ piz, vaɪv ðɪk slæpz ʌv plu tʃiz, ʌnt meɪpi ʌ snæk vɔɹ hɜ pɹʌðɜ paɪp. wi ɔlsʌ nit ʌ smɔl plæstɪk sneɪk ʌnt ʌ pɪk tɔɪ vɹaɪk vɔɹ ðʌ kɪtz. ʃi kæn skup ðiz ðɪŋz ɪntu ðɹi ɹɛt pækz, ʌnt wi wɪl kʌ mit hɜ wɛnzti æt ðʌ tɹeɪn steɪʃʌn.
    • This is where it sounds fine to me.
    • It’s only missing: ɑ, b, d, f, g, dʒ, oʊ, θ, ʊ, which are apparently fairly easy to replace.
      • Side note: θ is the final phoneme to be used in any of these alphabets!
    • If we split the digraphs and diphthongs, this alphabet length would reduce to 25 characters.
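The substitution behind these samples is just a lookup from each phoneme to its cluster's representative. Here is a minimal sketch that works directly on an IPA string; the mapping is read off the first few words of the eight-symbol sample above and covers only the symbols needed for this one phrase, and the `translate` helper is hypothetical.

```python
# Phoneme -> replacement, read off the eight-symbol sample (partial mapping).
EIGHT = {
    "p": "t", "l": "l", "i": "ɪ", "z": "z", "k": "k",
    "ɔ": "ʌ", "s": "z", "t": "t", "ɛ": "ɪ", "ʌ": "ʌ",
}

def translate(ipa_text, mapping):
    """Replace each phoneme with its representative, leaving other text alone."""
    # Try longer keys first so that digraphs and diphthongs win over
    # their first character.
    keys = sorted(mapping, key=len, reverse=True)
    out = []
    i = 0
    while i < len(ipa_text):
        for key in keys:
            if ipa_text.startswith(key, i):
                out.append(mapping[key])
                i += len(key)
                break
        else:
            out.append(ipa_text[i])  # spaces, punctuation, unmapped symbols
            i += 1
    return "".join(out)

print(translate("pliz kɔl stɛlʌ.", EIGHT))
```

Working on a list of phonemes instead of a string would avoid the longest-match step, but the string version is easier to try on the samples above.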

Seems like we should have around 27 characters!

Caveats

This didn’t actually count all the phonemes in English. This analysis only used the first pronunciation of each word, and only for North American English. With additional work, it could be generalized to world English.

To make shorter alphabets, it might also help to split all the digraphs and diphthongs into their constituents.

It would also be interesting to see what letters in the current alphabet are associated with each phoneme. Spelling is so messy that we will get some strange results.
