Python: Cryptography decoding a Caesar shift (frequency analysis)
Due to the simple nature of the Caesar cipher, it could easily be brute forced by trying all possible 25 keys and then looking by eye to see if the plaintext was revealed (this too can be automated by checking for common English words to see if the solution was probable). However the much more elegant method of frequency analysis can be used.
Below is a table of the frequency of letters in the English language:
Letter | Frequency (percent) | Frequency (decimal) | Normalised Frequency |
a | 8.17% | 0.08167 | 0.64297 |
b | 1.49% | 0.01492 | 0.11746 |
c | 2.78% | 0.02782 | 0.21902 |
d | 4.25% | 0.04253 | 0.33483 |
e | 12.70% | 0.12702 | 1.00000 |
f | 2.23% | 0.02228 | 0.17541 |
g | 2.02% | 0.02015 | 0.15864 |
h | 6.09% | 0.06094 | 0.47977 |
i | 6.97% | 0.06966 | 0.54842 |
j | 0.15% | 0.00153 | 0.01205 |
k | 0.77% | 0.00772 | 0.06078 |
l | 4.03% | 0.04025 | 0.31688 |
m | 2.41% | 0.02406 | 0.18942 |
n | 6.75% | 0.06749 | 0.53133 |
o | 7.51% | 0.07507 | 0.59101 |
p | 1.93% | 0.01929 | 0.15187 |
q | 0.10% | 0.00095 | 0.00748 |
r | 5.99% | 0.05987 | 0.47134 |
s | 6.33% | 0.06327 | 0.49811 |
t | 9.06% | 0.09056 | 0.71296 |
u | 2.76% | 0.02758 | 0.21713 |
v | 0.98% | 0.00978 | 0.07700 |
w | 2.36% | 0.02360 | 0.18580 |
x | 0.15% | 0.00150 | 0.01181 |
y | 1.97% | 0.01974 | 0.15541 |
z | 0.07% | 0.00074 | 0.00583 |
And shown graphically:
Using the following code we can use frequency analysis to find the solution to ciphertext created using the Caesar shift demonstrated previously (see Caesar shift and Caesar shift using makestrans).
from Numeric import * from string import maketrans def translator(text,alphabet,key): trantab = maketrans(alphabet,key) return text.translate(trantab) def caesar_decode(ciphertext,s): alpha="abcdefghijklmnopqrstuvwxyz" return translator(ciphertext,alpha,alpha[-s:]+alpha[:-s]) class frequency_analysis: def __init__(self, ciphertext): self.cor=[0.64297,0.11746,0.21902,0.33483,1.00000,0.17541, 0.15864,0.47977,0.54842,0.01205,0.06078,0.31688,0.18942, 0.53133,0.59101,0.15187,0.00748,0.47134,0.49811,0.71296, 0.21713,0.07700,0.18580,0.01181,0.15541,0.00583] self.ciphertext=ciphertext.lower() self.freq() self.min_error() self.key=self.minimum[0] self.solution=caesar_decode(self.ciphertext,self.minimum[0]) def freq(self): self.arr=zeros(26,Float64) for l in self.ciphertext: x=ord(l) if (x>=97 and x<=122): self.arr[x-97]+=1.0 self.arr/=max(self.arr) def error(self): e=0 for i in range(0,len(self.arr)): e+=abs(self.arr[i]-self.cor[i])**2 return e def min_error(self): self.minimum=[0,10000] for rot in range(0,25): e=self.error() print rot,e if e<self.minimum[1]: self.minimum[1]=e self.minimum[0]=rot x=self.arr[-1] del self.cor[-1] self.cor.insert(0,x) ciphertext="ymjwj fwj ybt ydujx tk jshwduynts: tsj ymfy bnqq "+\ "uwjajsy dtzw xnxyjw kwtr wjfinsl dtzw infwd fsi tsj ymfy bnqq "+\ "uwjajsy dtzw ltajwsrjsy. ymnx nx f ajwd nrutwyfsy qjxxts yt "+\ "wjrjgjwjxujhnfqqd ktw fyyfhpx zxnsl kwjvzjshd fsfqdxnx bmnhm "+\ "wjvznwj qtsljw ufxxflj tk yjcy ns twijw yt fhmnjaj gjyyjw wjxzqyx." FA=frequency_analysis(ciphertext) print FA.solution
This code will calculate the error in statistical frequency for each letter squared to generate an error for each possible rotation. Using a sufficiently long piece of ciphertext this code should accurately reveal the Caesar rotation use. The table below shows the error for each rotation:
Rotation | Error |
0 | 4.11797847386 |
1 | 3.05305477067 |
2 | 3.70059678828 |
3 | 3.66330931218 |
4 | 3.5078619579 |
5 | 0.361318100755 |
6 | 3.17289666386 |
7 | 3.66072641654 |
8 | 3.39769855873 |
9 | 1.74854802027 |
10 | 2.92550921273 |
11 | 2.67524757297 |
12 | 2.86847189573 |
13 | 3.06980318397 |
14 | 2.56886153328 |
15 | 2.17180117031 |
16 | 2.24503724763 |
17 | 2.95579718798 |
18 | 1.74002183444 |
19 | 1.83328601011 |
20 | 1.74779021766 |
21 | 2.71332097813 |
22 | 1.5409364067 |
23 | 1.83209213494 |
24 | 1.54904808883 |
The lowest error is for 5 rotations (correctly so) with an error of 0.361318100755, the next lowest error is 22 rotations with an error of 1.5409364067. This is ~4.3x difference, which gives a very large degree of confidence to our solution and below is the deciphered text.
there are two types of encryption: one that will prevent your sister from reading your diary and one that will prevent your government. this is a very important lesson to remeberespecially for attacks using frequency analysis which require longer passage of text in order to achieve better results.
Future
The frequency analysis presented here can be used along with some other techniques in order to crack the viginere cipher.