22C:21 Project 2

Due date: Wednesday, 12/3, Midnight


This project builds on Project 1. There are two main differences between this project and Project 1.

  1. You will now build a larger dictionary and will therefore use more sophisticated data structures to provide efficient insertion and search. In particular, you will keep track of word frequencies when you build the dictionary. This will allow you to use a "two tier" data structure, consisting of a fast and small data structure for the most frequently occurring words and a slower and larger data structure for the less frequently occurring words.

  2. You will provide a more sophisticated interface to the user, by suggesting possible replacement words for errors discovered by spellCheck. In order to do this, you will perform a breadth-first search on an implicit graph of words.


As in Project 1, you will have to write two separate programs: buildDict and spellCheck. As in Project 1, the program buildDict builds a dictionary of words, while the program spellCheck uses this dictionary of words to determine which words in a given document are misspelled. The program buildDict builds a dictionary by extracting words from a document known to be free of spelling errors. However, in this project, buildDict also keeps track of the frequency of occurrence of each word that it extracts from the document.

When buildDict completes extracting all the words from the document, as in Project 1, it should write the extracted words onto a text file called words.dat. However, now it should write, in each line, a word followed by the frequency of occurrence of that word in the text files it has used to produce the dictionary. Furthermore, the words should be ordered in non-increasing order of word frequency, with the most frequently occurring words appearing at the beginning of the output file.

The spellCheck program reads words.dat and creates not one, but two dictionaries: a primary dictionary containing the T most frequent words and a secondary dictionary containing the remaining words. T is a parameter you can set to 500. This is justified by the fact that many empirical studies have found that 80% of the words in an average document come from a list of about 500 words! For both the primary and secondary dictionaries, you can use hashing with quadratic probing. Since the number of words expected to be in the primary dictionary is small, you can afford to make its hash table quite large; for example, as large as 10 times the value of T. Recall that in Project 1, your hash table had a size around 2-5 times the number of words in the table. You can continue to use that rule for the secondary dictionary.
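As a sketch of the sorting step buildDict must perform before writing words.dat, the following helper (class and method names are illustrative, not required) orders (word, frequency) pairs in non-increasing order of frequency; breaking ties alphabetically is an assumption, since the handout leaves tie-breaking open:

```java
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: order (word, frequency) pairs in non-increasing order of
// frequency, as buildDict must do before writing words.dat.
public class FrequencySort {
    // Returns "word frequency" lines, most frequent first.
    // Ties are broken alphabetically (an assumption; the handout
    // does not specify how ties should be ordered).
    public static List<String> sortedLines(Map<String, Integer> freq) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(freq.entrySet());
        entries.sort((a, b) -> {
            int cmp = b.getValue().compareTo(a.getValue()); // non-increasing frequency
            return (cmp != 0) ? cmp : a.getKey().compareTo(b.getKey());
        });
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> e : entries)
            lines.add(e.getKey() + " " + e.getValue());
        return lines;
    }

    public static void main(String[] args) throws Exception {
        Map<String, Integer> freq = new HashMap<>();
        freq.put("the", 5); freq.put("whale", 2); freq.put("ship", 2);
        try (PrintWriter out = new PrintWriter("words.dat")) {
            for (String line : sortedLines(freq)) out.println(line);
        }
    }
}
```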

Building a Dictionary

The rules that define what a word is are as described in the Project 1 handout, with one change: now I also want you to throw out capitalized words that do not start a sentence. Such strings should no longer be considered valid words and should not be inserted into the dictionary. A word is said to be capitalized if its first letter is in upper case and the rest of the letters are in lower case. A word is said to begin a sentence if the non-whitespace character immediately preceding the first letter of the word is a period. The reason for throwing out capitalized words that do not begin a sentence is that such words are likely to be proper nouns. It is possible that this rule will not remove all proper nouns and may in fact remove some words that are not proper nouns.

For this project, buildDict needs to keep track of the frequency of occurrence of the words it is extracting from the available texts. This means that the wordList class needs to be modified to keep track of frequency information. Call the modified wordList class the fwList class (short for "frequency and word list"). Here is what I would like you to have as member functions of the fwList class.
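The capitalization and sentence-start tests described above might look like the following sketch (class and method names are illustrative, not required; treating the very first word of a document as starting a sentence is an assumption):

```java
// Sketch of the word rules above: a word is capitalized if its first
// letter is upper case and the rest are lower case; it begins a
// sentence if the nearest non-whitespace character before it is a period.
public class WordRules {
    public static boolean isCapitalized(String word) {
        if (word.isEmpty() || !Character.isUpperCase(word.charAt(0)))
            return false;
        for (int i = 1; i < word.length(); i++)
            if (!Character.isLowerCase(word.charAt(i)))
                return false;
        return true;
    }

    // text is the full document; start is the index of the word's
    // first letter. Scans backward for the nearest non-whitespace char.
    public static boolean beginsSentence(String text, int start) {
        for (int i = start - 1; i >= 0; i--) {
            char c = text.charAt(i);
            if (!Character.isWhitespace(c))
                return c == '.';
        }
        // Assumption: the first word of the document starts a sentence.
        return true;
    }
}
```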

  1. It makes sense to just copy all of the methods of the wordList class into the fwList class. These methods can also retain their original meanings (semantics) except for the insert method. The insert method should now increment the frequency of the word being inserted.
  2. You should also add the following new methods to the fwList class:
    1. int getFrequency(String myWord)
      This function takes a word and returns its frequency, as currently recorded. If the word does not exist in the list of words, the function returns 0.
    2. String[] getWords(int k)
      This function returns a String array with the k most frequently occurring words, with the array being sorted in non-increasing order of frequency.
    3. String[] getWords()
      This function behaves just like the above function, except that it returns an array with all of the words in the fwList class.
    4. void printWordsAndFrequencies()
      This method prints all of the words in the fwList object, along with their frequencies, in non-increasing order of frequency. The output is formatted to have a word and its frequency on each line.
    5. void printWordsAndFrequencies(String fileName)
      This method is like the above method, except that it prints the output onto the file specified by the String fileName.
    You are free to add other member functions that make the class more convenient to use.
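For illustration only, here is a sketch of the new frequency-related methods, using a java.util.HashMap as a stand-in for the hash table your fwList class would actually use:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the fwList interface; a HashMap stands in
// for the class's real hash table, and only the new methods are shown.
public class FwListSketch {
    private final Map<String, Integer> freq = new HashMap<>();

    // insert now increments the frequency of the word being inserted
    public void insert(String word) {
        freq.merge(word, 1, Integer::sum);
    }

    // frequency of a word as currently recorded, or 0 if absent
    public int getFrequency(String word) {
        return freq.getOrDefault(word, 0);
    }

    // the k most frequent words, in non-increasing order of frequency
    public String[] getWords(int k) {
        List<String> words = new ArrayList<>(freq.keySet());
        words.sort((a, b) -> freq.get(b) - freq.get(a));
        return words.subList(0, Math.min(k, words.size())).toArray(new String[0]);
    }

    // all words, in non-increasing order of frequency
    public String[] getWords() {
        return getWords(freq.size());
    }
}
```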

In addition to the texts of the three novels you used in Project 1, you should also use the following three new novels to build the dictionary.

  1. Leo Tolstoy's "War and Peace": war.txt.
  2. Jonathan Swift's "Gulliver's Travels": gulliver.txt.
  3. R.L. Stevenson's "Treasure Island": treasure.txt.
Note that all of these files (especially Tolstoy's "War and Peace") are much larger than the ones you used in Project 1.

The spellCheck program

After the buildDict program completes its task and creates the dictionary file words.dat, the spellCheck program is ready to take over. This program starts by reading the first T words from the file words.dat and storing these in the primary dictionary. It then reads the remaining words and places them in the secondary dictionary. As mentioned before, for both the primary and secondary dictionaries, you should use hashing with quadratic probing.

Then, as in Project 1, spellCheck starts extracting and spell checking words from the user document. For each extracted word, spellCheck searches the primary dictionary first. If the word is not found in the primary dictionary, spellCheck searches the secondary dictionary. Any word found in neither dictionary is identified as misspelled. As in Project 1, spellCheck responds to a misspelled word by providing the following prompt:
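A minimal sketch of a dictionary built on hashing with quadratic probing, together with the primary-then-secondary lookup, might look like this (class name, table size, and hash function are all illustrative choices, not requirements):

```java
// Sketch of a string set using quadratic probing: the probe sequence
// from home slot h is h, h+1, h+4, h+9, ... (mod table size). With a
// prime table size and a low load factor, an empty slot is found
// before the probe loop runs out.
public class QuadHashSet {
    private final String[] table;

    public QuadHashSet(int size) {
        table = new String[size];
    }

    private int home(String key) {
        return (key.hashCode() & 0x7fffffff) % table.length;
    }

    public void insert(String key) {
        int h = home(key);
        for (int i = 0; i < table.length; i++) {
            int slot = (h + i * i) % table.length;
            if (table[slot] == null || table[slot].equals(key)) {
                table[slot] = key;
                return;
            }
        }
        throw new IllegalStateException("table full");
    }

    public boolean contains(String key) {
        int h = home(key);
        for (int i = 0; i < table.length; i++) {
            int slot = (h + i * i) % table.length;
            if (table[slot] == null) return false;
            if (table[slot].equals(key)) return true;
        }
        return false;
    }

    // two-tier lookup: search the primary dictionary first, then the
    // secondary; a word found in neither is misspelled
    public static boolean found(QuadHashSet primary, QuadHashSet secondary, String w) {
        return primary.contains(w) || secondary.contains(w);
    }
}
```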

Do you want to replace (R), replace all (P), ignore (I), ignore all (N), or exit (E)?

If the user responds with an R or a P, then the program responds by producing a list of at most 10 suggested replacement words that are labeled 0 through 9, followed by an option that allows the user to input their own replacement word. So for example, if spellCheck encounters the word ont it might respond as follows:

	(0) ant
	(1) one
	(2) on
	(3) opt
	(4) out
	(5) Use my replacement
Note that the list of suggested replacements may not always contain 10 words; 10 is just the maximum number of replacement words produced by the program. Also note that sometimes the program may find no replacement words, in which case the user will just see the prompt
	(0) Use my replacement
The user responds by typing the number corresponding to the option they want. If, in the above example, the user types a number between 0 and 4 (inclusive), then the program just uses the corresponding replacement word from the list. Otherwise, if the user types 5, then the program asks the user for a replacement word and uses whatever the user types in response.
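This menu interaction could be sketched as follows, assuming the suggestion list has already been generated (class and method names are illustrative, and input validation is omitted for brevity):

```java
import java.util.Scanner;

// Sketch of the replacement menu: suggestions are printed with
// numeric labels, followed by a "Use my replacement" option.
public class ReplacementMenu {
    public static String choose(String[] suggestions, Scanner in) {
        for (int i = 0; i < suggestions.length; i++)
            System.out.println("(" + i + ") " + suggestions[i]);
        System.out.println("(" + suggestions.length + ") Use my replacement");
        int choice = Integer.parseInt(in.nextLine().trim());
        if (choice >= 0 && choice < suggestions.length)
            return suggestions[choice];          // pick from the list
        return in.nextLine().trim();             // user types their own word
    }
}
```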

In the next section of the handout, I'll describe how to generate possible replacement words.

How to generate replacements?

We generate possible replacements for misspelled words by assuming that the most likely errors a user makes while typing are deleting a letter, inserting an extra letter, substituting one letter for another, and swapping a pair of adjacent letters.

Now consider the graph whose vertices are all possible strings and whose edges connect pairs of strings s and s' if
  1. s' is obtained by deleting a letter in s or
  2. s' is obtained by inserting a letter into s or
  3. s' is obtained by substituting another letter for some letter in s or
  4. s' is obtained by swapping a pair of adjacent letters in s.
Note that this is an infinitely large graph because it has infinitely many vertices (all possible strings)! So we will never maintain it explicitly and will explore it by generating neighbors "on-the-fly."
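Generating neighbors on-the-fly might look like the following sketch, which applies the four operations to produce every string one edit away from s (it assumes lower-case words, and the class name is illustrative):

```java
import java.util.LinkedHashSet;
import java.util.Set;

// On-the-fly neighbor generation for the implicit graph: all strings
// one deletion, insertion, substitution, or adjacent swap away from s.
public class Neighbors {
    public static Set<String> getNeighbors(String s) {
        Set<String> out = new LinkedHashSet<>(); // a set removes duplicates
        // deletions: drop the letter at position i
        for (int i = 0; i < s.length(); i++)
            out.add(s.substring(0, i) + s.substring(i + 1));
        // insertions: insert each letter a-z at each position
        for (int i = 0; i <= s.length(); i++)
            for (char c = 'a'; c <= 'z'; c++)
                out.add(s.substring(0, i) + c + s.substring(i));
        // substitutions: replace the letter at position i
        for (int i = 0; i < s.length(); i++)
            for (char c = 'a'; c <= 'z'; c++)
                out.add(s.substring(0, i) + c + s.substring(i + 1));
        // swaps of adjacent letters
        for (int i = 0; i + 1 < s.length(); i++)
            out.add(s.substring(0, i) + s.charAt(i + 1) + s.charAt(i) + s.substring(i + 2));
        out.remove(s); // substituting a letter for itself regenerates s
        return out;
    }
}
```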

Suppose we have encountered a misspelled word w. To find meaningful replacements for w, we find all words in the dictionary (correctly spelled words) that are at most 2 hops away from w in the graph defined above. The choice of 2 is somewhat arbitrary, and depending on the results of some experimentation, you may want to fine-tune this value. Also, recall that we are interested in reporting at most 10 possible replacements, so if the number of dictionary words within 2 hops of w is more than 10, we have to prune the list to contain exactly 10 words. To find all dictionary words within 2 hops of w, you should implement a search function with the following signature:

	String[] kNeighborhood(String w, int k) 

This function returns all words in the dictionary that are within k hops of w. This method uses breadth-first traversal to find the k-neighborhood of w. The kNeighborhood method can be obtained by making modifications to the breadth-first traversal code we wrote as part of the myListGraph class. Here are two items that need to change:

  1. The getNeighbors method can no longer scan an adjacency list to find neighbors, since we are not explicitly maintaining the graph. This method now needs to generate neighbors of a given string w by explicitly performing the 4 operations mentioned above (i.e., skipping a letter, including an extra letter, substituting one letter for another, swapping a pair of consecutive letters). You might want to write a method corresponding to each of these operations and combine the four String arrays returned by these methods into one String array with no duplicates.
  2. The breadth-first traversal needs to terminate as soon as all nodes within the k-neighborhood of w have been reached.
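Both changes can be seen in the following sketch of a depth-limited breadth-first traversal over the implicit word graph. Here dict stands in for a membership test against both dictionaries, and for brevity the internal neighbors helper generates only deletions and substitutions; your real getNeighbors must perform all four operations:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

// Sketch: breadth-first traversal that stops after exploring k levels
// of the implicit graph, collecting every dictionary word it reaches.
public class KNeighborhood {
    public static String[] kNeighborhood(String w, int k, Set<String> dict) {
        List<String> found = new ArrayList<>();
        Set<String> visited = new HashSet<>();
        Queue<String> queue = new ArrayDeque<>();
        visited.add(w);
        queue.add(w);
        for (int depth = 0; depth < k; depth++) {
            int levelSize = queue.size();          // nodes on the current level
            for (int n = 0; n < levelSize; n++) {
                String s = queue.remove();
                for (String t : neighbors(s)) {
                    if (visited.add(t)) {          // true if t is newly discovered
                        if (dict.contains(t)) found.add(t);
                        queue.add(t);
                    }
                }
            }
        }
        return found.toArray(new String[0]);
    }

    // Trimmed stand-in for getNeighbors: deletions and substitutions
    // only. The real method must also generate insertions and swaps.
    static Set<String> neighbors(String s) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i < s.length(); i++) {
            out.add(s.substring(0, i) + s.substring(i + 1));         // deletion
            for (char c = 'a'; c <= 'z'; c++)
                out.add(s.substring(0, i) + c + s.substring(i + 1)); // substitution
        }
        out.remove(s);
        return out;
    }
}
```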


You should document your code extensively. Your comments should be clear and precise and not add unnecessary clutter to your code. Avoid the approach of ritualistically filling your code with comments, especially those that are pointless (e.g. "this is a variable declaration") or imprecise (e.g. "this loop goes around and around until it figures out the answer"). Ambiguous or inaccurate comments can be worse than none at all. Consider using the Javadoc style of comments, especially since Eclipse can recognize these. Go to Sun's webpages on Javadoc for more information.


By the time you have finished Project 2, you will have created a reasonably interesting piece of software. Now place yourself in the role of a user and try to evaluate your experience with your spell-checker. Focus on three main issues:

  1. How good is your spell-checker at catching errors? Are there certain kinds of errors that are not flagged? Are there correctly spelled words that are flagged as errors?
  2. How good is your spell-checker at suggesting replacements? Does it generate too few suggestions? Does it generate too many irrelevant suggestions?
  3. What do you think of the performance of your spell-checker? Is it fast enough for you when it starts up and also when it generates replacements?

Submit a 2-page write-up addressing these issues clearly and concisely. Suggest possible improvements you could make to the spell-checker based on each of the three issues you have considered.

What to submit:

You are required to submit via ICON, by midnight on Wednesday, 12/3, the following Java files: buildDict.java, spellCheck.java, wordPair.java, and fwList.java. You should also submit a README file in which you mention all known errors, gross inefficiencies, and required methods that are unimplemented, along with documentation for all "extra" methods that you implemented. Finally, you should submit your write-up as a PDF document named project2.pdf. Make sure that your file names are exactly as above and that you do not submit any additional files. If any of these requirements change in the next 2 weeks, I will post updates in the announcements section of the course webpage.

A final word:

While I have specified quite a few details, I have left enough unspecified so as to leave you with a few design decisions to make. These decisions will affect the efficiency, robustness, and readability of your program. So make your choices with care.

This project is more complicated than Project 1 along many dimensions. I suspect it will take most of you all of three weeks, so please do not wait for two more weeks to start!

In general the graders of your program will look for several things besides the correctness of your program. The overall organization of your program, the partitioning of various tasks into functions, and program documentation will be taken into account while grading.

Sriram Pemmaraju