Introduction
This post is inspired by the first problem from Google's 2012 Code Jam programming contest, which involved a substitution cipher. Substitution cipher encryption is a famous encryption technique where each alphabet letter is mapped to another character. To encrypt, we replace each letter in the message with its corresponding character. Decryption works similarly.
For example, suppose we have the alphabet {a, b, c} and we map a => 4, b => G, c => 0. Then I could encrypt the message "abccab" as "4G004G" and you could decrypt it back to "abccab" since you know the mapping.
In Google's problem, we're given a mapping and some ciphertexts which we must decrypt using the mapping. In this post, we'll write python code to decrypt messages without being given the mapping.
The 'word pattern' property
There are a high number of potential mappings (26 factorial), so we must only test mappings that are likely to decrypt the message to English. We can do this by exploiting the following property of substitution ciphers: If you read an encrypted word and note the positions where new letters are introduced, the positions will be the same in the decrypted version of the word.
For instance, 'racecar' has the pattern 'abcdcba'. Suppose we encrypt 'racecar' with the mapping {r -> v, a -> f, c -> u, e -> q}, to get 'vfuqufv'. An adversary could recognize that 'vfuqufv' has the pattern 'abcdcba', and it turns out that only 5 words in all of English satisfy that pattern: reifier, deified, reviver, rotator, and racecar. So even without the mapping our adversary could guess our word pretty easily.
We'll use this property to decrypt encrypted messages, just like our adversary did with 'racecar'. First though, let's use python to create a table that maps word patterns to english words. That way given an encrypted word, we can immediately see what English words it could be representing. I used the dictionary at sil.org which is free for non-commercial use.
Decrypting a message
When decrypting a message with multiple words, some words' translations might contradict each other. To see this, suppose we find a message encrypted as 'esddv wvzdb', and consider the possible decryptions 'pizza world' and 'hello world'. 'hello world' is fine, but in 'pizza world', there is a contradiction because 'd' maps to 'z' in 'pizza' but 'l' in 'world'.
So to decrypt a message, we could generate all possible combinations of all translations of each word and filter out all results that contain contradictions. However this method would take too long for messages with more than a few words. For instance consider the ciphertext 'pd udxu udgc upp upp cowwgki hwkcd apowi bkwu, udxa, xqi skcpwnk guckwh gqup x ika'. There are only 5 valid translations, but there are over 1042 contradictory ones, which is far too many.
A simple yet efficient way to solve this problem is with depth-first search. That is, we iterate through the words, picking arbitrary translations for each word. If the current word cannot be translated, we backtrack to previous words and pick different translations for them, and then proceed forward again. The following python code implements this algorithm. The next_mappings helper function is omitted here for brevity but is included in the final source code (scroll down):
An optimization which speeds up our algorithm greatly is to consider the words from largest to smallest. This way there is less unecessary branching done at and near the root of our search tree. The following code is included where we read the input:
Conclusion
We now have a python script that can decrypt substitution cipher messages. Click here for the full code, which includes a helper function and some input/output code omitted from the code samples above. Here are some example ciphertexts, each decrypted in under a second with our algorithm:
-
Ciphertext from the Google Code Jam problem that inspired this post
ejp mysljylc kd kxveddknmc re jsicpdrysi rbcpc ypc rtcsra dkh wyfrepkym veddknkmkrkcd de kr kd eoya kw aej tysr re ujdr lkgc jv
our language is impossible to understand there are twenty six factorial possibilities so it is okay if you want to just give up
-
First phrase of Alice in Wonderland, encrypted with a random key
mresx vmg zxlehhehl jn lxj kxpc jepxo nu gejjehl zc yxp gegjxp nh jyx zmhb
alice was beginning to get very tired of sitting by her sister on the bank
-
A quote from Hamlet by Shakespeare, encrypted with a random key
pd udxu udgc upp upp cowwgki hwkcd apowi bkwu, udxa, xqi skcpwnk guckwh gqup x ika
Oh that this too too sullied flesh would melt, thaw, and resolve itself into a dew
Further Reading
A paper I discovered after writing but before publishing this post covers this topic in much greater detail, and provides a faster, more resilient algorithm for decrypting substitution ciphertexts. The paper, by Edwin Olson, can be found here. The paper's treatment of the topic is excellent so I hesitated to publish this post. However I published it anyway, hoping to satisfy those looking for an easier and simpler approach.