Introduction
This project builds on Project 1. There are two main differences between this project and Project 1.
Overview
As in Project 1, you will have to write two separate programs: buildDict and spellCheck. As in Project 1, the program buildDict builds a dictionary of words, while the program spellCheck uses this dictionary of words to determine which words in a given document are misspelled. The program buildDict builds a dictionary by extracting words from a document known to be free of spelling errors. However, in this project, buildDict also keeps track of the frequency of occurrence of each word that it extracts from the document. When buildDict completes extracting all the words from the document, as in Project 1, it should write the extracted words onto a text file called words.dat. However, now it should write, in each line, a word followed by the frequency of occurrence of that word in the text files it has used to produce the dictionary. Furthermore, the words should be ordered in non-increasing order of word frequency, with the most frequently occurring words appearing at the beginning of the output file. The spellCheck program reads words.dat and creates not one, but two dictionaries: a primary dictionary containing the T most frequent words and a secondary dictionary containing the remaining words. T is a parameter you can set to 500. This is justified by the fact that many empirical studies have found that 80% of the words in an average document come from a list of about 500 words! For both the primary and secondary dictionaries, you can use quadratic hashing with probing. Since the number of words expected to be in the primary dictionary is small, you can afford to make the hash table quite large; for example, as large as 10 times the value of T. Recall that in Project 1, your hash table had size that was around 2-5 times the number of words in the table. You can continue to use that rule for the secondary dictionary.
Building a Dictionary
The rules that define what a word is are as described in the Project 1 handout, with one change. Now I also want you to throw out capitalized words that do not start a sentence. That is, such strings should no longer be considered valid words and should not be inserted into the dictionary. A word is said to be capitalized, if its first letter is in upper case and the rest of the letters are in lower case. A word is said to begin a sentence, if the non-whitespace character immediately preceding the first letter of the word is a period. The reason for throwing out capitalized words that do not begin a sentence is that such words are likely to be proper nouns. It is possible that throwing out words as described above may not remove all proper nouns and may in fact remove some words that are not proper nouns. For this project buildDict needs to keep track of the frequency of occurrence of the words it is extracting from the available texts. This means that the wordList class needs to be modified to keep track of frequency information. Call the modified wordList class, fwList class (short for "frequency and word list"). Here is what I would like you to have as member functions of the fwList class.
In addition to the texts of the three novels you used in Project 1, you should use the following three new novels also to build a dictionary.
The spellCheck program
After the buildDict program completes its task and creates a dictionary file words.dat, the spellCheck program is ready to take over. This program starts by reading the first T words from the file words.dat and stores these in the primary dictionary. It then reads the remaining words and places them in a secondary dictionary. As mentioned before, for both the primary and secondary dictionaries, you should use hashing with quadratic probing. Then, as in Project 1, spellCheck starts extracting and spell checking words from the user document. For each extracted word spellCheck searches the primary dictionary first. If the word is not found in the primary dictionary, spellCheck searches in the secondary dictionary. Any word not found in both dictionaries gets identified as a misspelled word. Like in Project 1, spellCheck responds to a misspelled word by providing the following prompt:
Do you want to replace (R), replace all (P), ignore (I), ignore all (N), or exit (E)?
If the user responds with an R or a P, then the program responds by producing a list of at most 10 suggested replacement words that are labeled 0 through 9, followed by an option that allows the user to input their own replacement word. So for example, if spellCheck encounters the word ont it might respond as follows:
(0) ant (1) one (2) on (3) opt (4) out (5) Use my replacementNote that the list of suggested replacements may not always contain 10 words, 10 is just the maximum number of replacement words produced by the program. Also note that sometimes, the program may find no replacement words and in this case the user will just see the prompt
(0) Use my replacementThe user responds by typing a number corresponding to the option they want. If, in the above example the user types number between 0 and 4 (inclusive) then the program just uses the corresponding replacement word from the list. Otherwise, if the user types 5, then the program needs to ask the user for a replacement word and it uses whatever the user types in response.
In the next section of the handout, I'll describe how to generate possible replacement words.
How to generate replacements?
We generate possible replacements for misspelled words by assuming that the most likely errors a user makes while typing are
Suppose we have encountered a misspelled word w. To find meaningful replacements for w we find all words in the dictionary (correctly spelled words) that are at most 2 hops away from w in the graph defined above. The choice of 2 is somewhat arbitrary and depending on the results of some experimentation, you may want to fine tune this value. Also, recall that we are interested in reporting at most 10 possible replacements, so if the number of words in the dictionary within 2 hops of w is more than 10, we have to prune the list to contain exactly 10 words. To find the all dictionary words within 2 hops of w, you should implement a search function with the following signature:
String[] kNeighborhood(String w, int k)
This function returns all words in the dictionary that are within k hops of w. This method uses breadth-first traversal, to find the k-neighborhood of w. The kNeighborhood method can be obtained by making modifications to the breadth-first traversal code we wrote as part of the myListGraph class. Here are two items that need to change:
Documentation:
You should document your code extensively. Your comments should be clear and precise and not add unnecessary clutter to your code. Avoid the approach of ritualistically filling your code with comments, especially those that are pointless (e.g. "this is a variable declaration") or imprecise (e.g. "this loop goes around and around until it figures out the answer"). Ambiguous or inaccurate comments can be worse than none at all. Consider using the Javadoc style of comments, especially since Eclipse can recognize these. Go to Sun's webpages on Javadoc for more information.
Write-up:
By the time you have finished Project 2, you will have created a reasonably interesting piece of software. Now place yourself in the role of a user and try to evaluate your experience with your spell-checker. Focus on three main issues:
Submit a 2-page write-up addressing these issues, clearly and concisely. Suggest possible improvements you could make to the spell-checker based on each the three issues you have considered.
What to submit:
You are required submit via ICON by midnight on Wednesday, 12/3 the following java files: buildDict.java, spellCheck.java, wordPair.java, and fwList.java. You should also submit a README file in which you should mention all known errors, gross inefficiencies, all required methods that are unimplemented, and documentation for all "extra" methods that were implemented. Finally, you should submit your write-up as a pdf document names project2.pdf. Make sure that your file names are exactly as above and make sure that you do not submit any additional files. If any of these requirements change in the next 2 weeks, I will post updates in the announcements section of the course webpage
A final word:
While I have specified quite a few details, I have left enough unspecified so as to leave you with a few design decisions to make. These decisions will affect the efficiency, robustness, and readability of your program. So make your choices with care.
This project is more complicated than Project 1 along many dimensions. I suspect it will take most of you all of three weeks; so please do not wait for two more weeks to start!
In general the graders of your program will look for several things besides the correctness of your program. The overall organization of your program, the partitioning of various tasks into functions, and program documentation will be taken into account while grading.