Break Caesar Cypher with frequency analysis
In this Article
Overview
I have described how the Caesar Cypher algorithm works in a previous post that you can find here. Today I am going to focus on how to break a Caesar Cypher encrypted text without knowing its key. Of course this type of cypher algorithm is not complex and can be easily brute-forced since the key has a finite and small number of values (max of 255 if you are thinking to use all the characters in the ASCII table). In this article I am going to focus on a different approach called the Frequency Analysis.
Solution
The idea is that any spoken language has letters that are more frequent than others, therefore if I find the most frequent letter in the cyphered text I can then subtract its index value to the index value of the most frequent letter in a language. In this case I am going to focus on the English language, however this approach can be used on any language.
To implement the frequency analysis I am going to use a Priority Queue. I have described the priority queue implementation in the Dijkstra algorithm post
import Element from "../../../common/collections/queue/element";
import PriorityQueue from "../../../common/collections/queue/priority.queue";
/**
* Class that allows to run an analysis on characters frequency in a string
*/
export default class FrequencyAnalysis {
/**
* Analyse text and returns a priority queue representing character frequency
*
* @param text the text to analyse character frequency for
* @returns a priority queue with the most frequent character as the first element
* and the least frequent character as the last element
*/
public static analyze(text: string): Array<Element<string>> {
const pq = new PriorityQueue<string>();
const textChars: Array<string> = text.split('');
// loop through all the char in the text
for (let i = 0; i < textChars.length; i++) {
const char = textChars[i];
// skip empty spaces since this will be the most common char
if (char === ' ') {
continue;
}
const removed: Element<string> | null = pq.remove(char);
if (removed === null) {
// if this is the firs time encountering this char then set the frequency to 1
pq.enqueue(char, 1)
} else {
// if the priority queue has already this char, increase its frequency
pq.enqueue(char, removed.getPriority() + 1)
}
}
// return an array of string elements sorted by most frequent to least frequent
return pq.getElements().reverse();
}
}
The implementation is pretty simple, it loops all the characters in the text, skips empty spaces since those will be the most common and we are assuming the key is not an empty space. It then puts each character into the queue with the corresponding frequency as the priority. The elements with the lowest priority will be the most frequents therefore at the end the method is reversing the array being returned.
The Analysis
Now let’s start run some analysis. First of all I am going to write up some random english text and analyse it to get what the most frequent character is:
const text = 'This general text will be used for the purpose of analysing the character frequency in the English language';
const textFrequency: Array<Element<string>> = FrequencyAnalysis.analyze(text);
// expect the most frequent char to be 'e'
expect(textFrequency.length === 0).toBe(false);
const languageMostFrequent = textFrequency[0];
expect(languageMostFrequent.getElement()).toBe('e')
The most frequent character in the example above is the letter e. Now let’s analyse some cyphered text:
// given some cyphered text
const cypheredText: string = 'MLyq44msqL0rLx07q';
const cypheredTextFrequency: Array<Element<string>> = FrequencyAnalysis.analyze(cypheredText);
// expect the most frequent character to be 'q' and 'L'
// however since 'q' was the last char it is going to be at the top
expect(cypheredTextFrequency.length === 0).toBe(false);
const cypheredMostFrequent = cypheredTextFrequency[0];
expect(cypheredMostFrequent.getElement()).toBe('q')
In this case both L and q are the most frequent characters with the same number of occurrences but q will be the first element in the array because it was the last computed. Since this is a guess game, I will start from the first element and work my way down until I have found a key that will de-cypher the text into some meaningful text.
Given the alphanumeric string as:
// given the following alphanumeric string
const alphanumericString: string = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789 ';
// find the index of the key
const i = alphanumericString.indexOf(cypheredMostFrequent.getElement()) - alphanumericString.indexOf(languageMostFrequent.getElement());
// find the key
const key = alphanumericString[i]
expect(key).toBe('M')
// decypher the message
const result: string = new CaesarCypher().decrypt(cypheredText, key);
expect(result).toBe('A message of love');
In the snippet above I have subtracted the cyphered text most frequent character index in the alphanumeric string with the language text most frequent character index in the alphanumeric string. The difference pointed me to M so I used that as the key to decrypt the cyphered text and voilà! I have got some meaningful text. In this instance I have been lucky and got the key at my first guess. As long as the alphanumeric string being used is known then it’s possible with frequency analysis to break some Caesar cyphered text without knowing it’s key. The complete source code for can be found in the nodejs-series repository.