Introduction to Programming Machine Problem IV from Spring 1993

Background

Caesar cyphers can be automatically broken by taking advantage of the known letter frequencies of the language of the encrypted text. The frequencies for the English language, based on a sample of a few hundred thousand letters, are given in the file freq; these add to about 1 (not exaxtly 1 because of imprecision in the number representation)

Assignment

Your assignment is to write a program that uses this data to crack the code for arbitrary Casear cyphers. The input data you will be given will be encrypted using a rotation of between 0 and 25 places. The basic rotation code is as it was in MP2, that is, only the letters are changed; all other characters are unchanged, and capitalization is preserved. The following basic algorithm will work for solving this problem:

  1. Read the input text and accumulate the letter frequency data for the text as it is in encrypted form.

  2. Compare the resulting data with the data given in freq, looking for a rotation of the frequency data that will give you the best match. To do this, try each of the 26 rotations, and remember which gives the smallest difference between the frequencies you observed and the frequencies given.

  3. Reset the input so that it can be re-read, then copy it to the output rotated the amount that was determined in step 2.

Your program should operate in the same way as MP2 and 3; that is, it should read cyphertext from standard input and write decrypted plaintext to standard output.

Your program should put a one line title on the output data to indicate the rotation that was used to give the best result. Assuming that the best match occured with a rotation of 13 places, the title should be:

***ROT13***
This leaves one big problem: How do you compare two lists of letter frequencies to find the best match? This requires a measure of the difference between the two lists; for this, use the sum of the squares of the differences of the corresponding entries in the lists. If you minimize this sum by trying different rotations, you are doing what is called a least squares fit. Note that each letter frequency is the quotient of the number of times that letter was seen divided by the total number of letters examined.

Grading Criteria

This problem is worth 5% of your grade. If you get your assignment to work perfectly, you will earn only half of credit. The other half will depend on the style of your code and commentary.