22C:21 Project 2

Due date: Friday, 4/14


This project builds on Project 1. Here are the main differences between this project and Project 1.

  1. You will use a more sophisticated 2-level data structure to provide efficient insertion and search. Specifically, the buildDict program will identify the 500 most frequently occurring words in the text documents it reads. It will write these words into words.dat first, followed by the rest of the less frequently occurring words. When spellCheck builds the dictionary from words.dat, it puts the 500 most frequently occurring words in one "high priority" data structure (we will call this the primary dictionary) and the rest of the words in a second, "lower priority" data structure (we will call this the secondary dictionary). The primary dictionary needs to be extremely fast because the words in it are likely to be accessed very often. The secondary dictionary can afford to be a little slower.

  2. You will provide a more sophisticated interface to the user, by suggesting possible replacement words for errors discovered by spellCheck. In order to do this, we need to model the kinds of errors that people make and then use this model to start with a misspelled word and search for possible correct words the user might have actually meant to type. Details of how to do this are described below and in Problem Set 8.

  3. The spelling checker built in Project 1 makes no attempt to correctly deal with words that contain apostrophes such as "I'll" or "could've" etc. This project asks you to fix this problem.

Keeping track of frequencies

In this project, buildDict keeps track of the frequency of occurrence of each word that it extracts from the given text documents. As in Project 1, when buildDict completes extracting all the words from the document, it writes the extracted words onto a text file called words.dat. However, in this project, buildDict first writes the 500 most frequent words (in any order) into words.dat and then writes the remaining words (in any order) into words.dat. The spellCheck program reads words.dat and creates not one, but two dictionaries: a primary dictionary containing the 500 most frequent words and a secondary dictionary containing the remaining words. This plan implies that as buildDict is extracting words from the document, it is also updating their frequency of occurrence. To be able to do this, modify the Record class to include an integer data member for word frequency, in addition to the two string data members it already has. As in Project 1, buildDict uses a hash table to keep track of the words that it extracts.
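The modified Record class might look like the following sketch. The field names here are assumptions; use whatever names your Project 1 code already uses for the two string data members.

```java
// A minimal sketch of the extended Record class. The names "key" and
// "value" are placeholders for the two string members from Project 1;
// only the integer frequency member is new.
public class Record {
    public String key;      // the word itself
    public String value;    // the second string member from Project 1
    public int frequency;   // number of times the word has been seen

    public Record(String key, String value) {
        this.key = key;
        this.value = value;
        this.frequency = 1; // a Record is created on the word's first occurrence
    }
}
```

Each time buildDict extracts a word that is already in the hash table, it simply increments that Record's frequency member instead of inserting a new Record.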

Two-level dictionary

The program buildDict separates the words it extracts into two categories of high frequency and low frequency words so as to help spellCheck build a more efficient, two-level dictionary data structure. The spellCheck program stores the 500 high frequency words in a "primary dictionary". Since the number of words being stored in the primary dictionary is small, we can afford to use a hash table that has a small load factor - say 1/10. This means that the size of the hash table will be at least 5000. So, as your table size, you might want to pick the smallest prime greater than 5000. The rest of the words (i.e., the low frequency words) are stored in the "secondary dictionary." We may have to store a large number of words in the secondary dictionary, and therefore, due to memory constraints, we cannot use a hash table with too small a load factor. So for the secondary dictionary, use a hash table with a load factor of about 2 or 3. In other words, use a table whose size is about half or even a third of the number of words that you intend to store in the table.
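To find the smallest prime greater than 5000, you can either look one up by hand or compute it with a small helper like the sketch below. This helper is a convenience, not something the assignment requires; trial division is fast enough at this scale.

```java
// A small helper to pick a prime hash-table size: returns the
// smallest prime >= n, using simple trial division.
public class TableSize {
    public static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; (long) d * d <= n; d++)
            if (n % d == 0) return false;
        return true;
    }

    public static int smallestPrimeAtLeast(int n) {
        while (!isPrime(n)) n++;
        return n;
    }
}
```

For example, smallestPrimeAtLeast(5001) returns 5003, which would be a reasonable table size for the primary dictionary.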

The motivation for this two-level organization of the dictionary is that many empirical studies have found that 80% of the words in an average document come from a list of about 500 words! For these words, we can afford to build a hash table with small load factor and get, almost instantaneous response to insert and search operations. For the rest of the words, we can afford to provide slightly slower access, because they are accessed less frequently. This two-level organization provides a good balance between time and space efficiency.

While checking for spelling errors, the spellCheck program takes each word and first looks for it in the primary dictionary. If the word does not appear in the primary dictionary, it then looks in the secondary dictionary. This order of looking first in the primary dictionary and then in the secondary dictionary is quite important. If the word is not found in either dictionary, spellCheck will flag the word as being misspelled.
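The lookup order can be sketched as follows. Here java.util.HashSet stands in for the hash-table Dictionary class you will write; the method name isCorrect is an assumption, not part of the required interface.

```java
import java.util.HashSet;
import java.util.Set;

public class TwoLevelLookup {
    // Sketch of the two-level lookup order, with HashSet standing in
    // for your Dictionary class.
    static boolean isCorrect(String word, Set<String> primary, Set<String> secondary) {
        // Check the small, fast primary dictionary first ...
        if (primary.contains(word)) return true;
        // ... and only fall back to the larger secondary dictionary on a miss.
        return secondary.contains(word);
    }
}
```

Because roughly 80% of the words in a typical document hit the primary dictionary, most lookups never touch the slower secondary table.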

Suggesting Replacements

As in Project 1, spellCheck responds to a misspelled word by providing the following prompt:

Do you want to replace (R), replace all (P), ignore (I), ignore all (N), or abort (A)?

However, what the program does next is different. If the user responds with an R or a P, then the program responds by producing a list of at most 10 suggested replacement words that are labeled 0 through 9, followed by an option that allows the user to input their own replacement word. So for example, if spellCheck encounters the word "ont", it might respond as follows:

        (0) ant
        (1) one
        (2) on
        (3) opt
        (4) out
        (5) Use my replacement
Note that the list of suggested replacements may not always contain 10 words; 10 is just the maximum number of replacement words produced by the program. We use the limit of 10 because we do not want to inundate the user with too many options. Also note that sometimes the program may find no replacement words, in which case the user will just see the prompt
        (0) Use my replacement
The user responds by typing a number corresponding to the option they want. If, in the above example, the user types a number between 0 and 4 (inclusive), then the program just uses the corresponding replacement word from the list. Otherwise, if the user types 5, then the program needs to ask the user for a replacement word, and it uses whatever the user types in response.
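Handling the menu choice might look like the sketch below. The class and method names are assumptions; in your program this logic would live inside SpellCheck.java, and you would also print the numbered list and a prompt before reading input.

```java
import java.util.Scanner;

public class ReplaceMenu {
    // Sketch of handling the numbered suggestion menu: an index into
    // the suggestion list selects that word; the last option number
    // ("Use my replacement") reads a word typed by the user instead.
    static String chooseReplacement(String[] suggestions, Scanner in) {
        int choice = Integer.parseInt(in.nextLine().trim());
        if (choice >= 0 && choice < suggestions.length)
            return suggestions[choice];   // user picked a listed word
        // "Use my replacement": the next line of input is the word to use
        return in.nextLine().trim();
    }
}
```

For robustness you would also want to re-prompt on input that is not a number in range; that error handling is one of the design decisions left to you.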

The big question now is: how do we generate reasonable replacement words? We do this by assuming that the most likely errors a user makes while typing are single-letter mistakes: deleting a letter, inserting an extra letter, typing one letter in place of another, or swapping a pair of letters.

Now consider the graph whose vertices are all possible strings and whose edges connect pairs of strings s and t if
  1. t is obtained by deleting a letter in s or
  2. t is obtained by inserting a letter into s or
  3. t is obtained by substituting another letter for some letter in s or
  4. t is obtained by swapping a pair of letters in s.
Suppose we have encountered a misspelled word w. To find meaningful replacements for w, we find all words in the dictionary (i.e., correctly spelled words) that are at most 2 hops away from w in the graph defined above. The choice of 2 is somewhat arbitrary, and depending on the results from this project we may fine-tune this value. Also, recall that we are interested in reporting at most 10 possible replacements, so if the number of words in the dictionary within 2 hops of w is more than 10, we have to prune the list to contain exactly 10 words. Correctly spelled words that are one hop away should be considered better replacements than those that are two hops away. You should keep this in mind when pruning the list of candidate replacements.
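To illustrate the edge definition above, the following sketch enumerates every string one hop away from a given word. Note the assumptions: the alphabet here is just lowercase 'a' through 'z', and "swapping a pair of letters" is interpreted as swapping adjacent letters (check Problem Set 8 for the intended definition). Extending this to two hops, and intersecting the results with the dictionary, is still your responsibility.

```java
import java.util.HashSet;
import java.util.Set;

public class Edits {
    // Generate every string one "hop" from w in the imaginary edit
    // graph: deletions, insertions, substitutions, and adjacent swaps.
    // Assumes a lowercase a-z alphabet for illustration.
    static Set<String> oneHop(String w) {
        Set<String> out = new HashSet<>();
        for (int i = 0; i < w.length(); i++)                  // delete letter i
            out.add(w.substring(0, i) + w.substring(i + 1));
        for (int i = 0; i <= w.length(); i++)                 // insert c at position i
            for (char c = 'a'; c <= 'z'; c++)
                out.add(w.substring(0, i) + c + w.substring(i));
        for (int i = 0; i < w.length(); i++)                  // substitute c for letter i
            for (char c = 'a'; c <= 'z'; c++)
                out.add(w.substring(0, i) + c + w.substring(i + 1));
        for (int i = 0; i + 1 < w.length(); i++)              // swap letters i and i+1
            out.add(w.substring(0, i) + w.charAt(i + 1) + w.charAt(i)
                    + w.substring(i + 2));
        return out;
    }
}
```

For the misspelled word "ont", this set contains "one", "on", "ant", "opt", and "out" (the very suggestions shown in the example above), among many strings that are not words at all; filtering against the dictionary keeps only the real words.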

The actual procedure for finding all correctly spelled words that are at most 2 hops away from w is fairly simple, and you are responsible for figuring this out. Problem Set 8 should be helpful in this regard. Note that your program should make no attempt to explicitly build the graph mentioned above - it is too big! I mention the graph only so that we can precisely define what we mean by good replacements for a misspelled word. The procedure you come up with for generating words that are at most 2 hops away from w implicitly traverses the edges of the imaginary graph to find reasonable replacements.

Dealing with apostrophes

As you may have already noticed, the spelling checker in Project 1 makes no attempt to correctly deal with words that contain apostrophes such as "I'll" or "don't" or "could've." For example, if a text document used by buildDict contains the sentence:

	I could've taken it yesterday, but I guess now I'll have to
	take it next Sunday.
then "ve" and "ll" are treated as valid words, which they clearly are not. In this project I want you to implement a clean way of dealing with apostrophes. Specifically, if words such as "I'll" and "could've" occur in the text document that buildDict reads, then these words should be considered to be correctly spelled words. However, "ve" and "ll" should not be considered correct. You should think about modifying our definition of a word from Project 1, in order to cleanly deal with apostrophes. You are responsible for figuring out a simple and clean scheme that can reasonably deal with apostrophes.
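One possible revised definition of a word, sketched below, is a run of letters that may contain interior apostrophes but never a leading or trailing one. This regex-based check is just one clean scheme, not the only acceptable one, and it is offered as a starting point for your own design.

```java
public class WordDef {
    // A candidate definition of a "word": one or more letter groups
    // joined by single interior apostrophes. Leading or trailing
    // apostrophes, and doubled apostrophes, do not match.
    static boolean isWord(String s) {
        return s.matches("[A-Za-z]+('[A-Za-z]+)*");
    }
}
```

Under this definition "I'll", "don't", and "could've" are single words, so buildDict would never split them and insert the fragments "ve" or "ll" into the dictionary.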

What to submit:

You are required to electronically submit by midnight on Friday, 4/14 the following files: BuildDict.java, SpellCheck.java, Dictionary.java, Link.java, LinkList.java, and Record.java. Make sure that your file names are exactly as above and make sure that you do not submit any additional files.

A final word:

While I have specified quite a few details, I have left enough unspecified so as to leave you with a few design decisions to make. These decisions will affect the efficiency, robustness, and readability of your program. So make your choices with care.

Sriram Pemmaraju