22C:21 Project 1

Due date: Friday, 10/24, 5 pm via ICON

Introduction

This is the first in a series of 3 programming projects at the end of which you will have constructed a spell checker -- program that checks and possibly corrects spelling errors in documents. In this project you will take a first step, by building a somewhat primitive spell checker.

Overview

As part of this project you will have to write two separate programs: buildDict and spellCheck. As the names suggest, the program buildDict builds a small dictionary of words, while the program spellCheck uses this dictionary of words to determine which words in a given document are misspelled. You will use a hash table that is quadratically probed as the primary data structure for both buildDict and spellCheck.

Building a Dictionary

In this project, we will build a dictionary by extracting words from large on-line documents known to be error free. I am making available the following three famous novels in text form:

Lewis Carroll, Alice in Wonderland: Through The Looking Glass.
Charles Dickens, A Christmas Carol.
Robert Louis Stevenson, Dr. Jekyl and Mr. Hyde.

The above documents can be found in the files: alice.txt, carol.txt, and hyde.txt.

The program buildDict should read each of these documents, extract words from them, and insert them into a dictionary. It is important to have a precise understanding of what we mean by a word before we proceed. As far as this project is concerned, a word is a contiguous sequence of 2 or more letters (lower case or upper case) immediately followed by a non-letter character. For this project we will ignore the difference between upper and lower case letters and therefore the words The and the will be considered identical. To understand the above definition, consider the following text:

The giant telephone company AT&T bought an IBM360 computer last year. Since then the company has had an error-free run in the telecommunications industry.

The distinct words extracted from this text are listed below in alphabetical order:

 
an	at	bought	company	computer	error 		free
giant	had	has 	ibm 	in 		industry	last
run	since	telephone 	telecommunications	 	the 		 
then 	year

Notice that the word at has been extracted from AT&T and the hyphenated word error-free has contributed two words error and free to the above list. Each of the extracted words needs to be inserted into a data structure that can also be searched quickly (so that the same word is not inserted twice). To support this you should define and implement a class called wordList that can be used by both the buildDict program and the spellCheck program.

You should use hashing with quadratic probing to implement the wordList class. In the lecture on Wednesday, October 1st, I mentioned the simple probing function

h(s, i) = (h(s) + i²) mod M. A more general quadratic probing function is h(s, i) = (h(s) + c₁ i + c₂ i²) mod M. Here c₁ and c₂ are constants. I would like you to use this more general probing function. See Wikipedia's page on Quadratic Probing for some advice on specific values of constants to use. For h(s) you can use a hashing fucntion described in the lectures (see your notes from Wednesday, Oct 1st and Friday, Oct 3rd) and implemented in chainingHashTable.java. As mentioned in class, it is a good idea for M (the size of the hash table) to be a prime number. Furthermore, you might want to choose M to be about 2-5 times the number of items you expect to insert into the table. You might have to do some experimentation to figure out a reasonable value for M.

The class worldList should contain at least the following member functions:

public boolean search(String myWord)
This method returns true if myWord is found in the hash table; otherwise it returns false.
public void insert(String myWord)
Using this function the user will attempt to insert words into the has table. This method inserts myWord into the hash table if it is not already in it. After myWord is inserted into the hash table, you should make sure that there are still sufficiently many empty slots in the table. Otherwise, probing can become quite inefficient. Specifically, suppose that you have chosen M to be at least twice as large as the number of words in the table. After inserting a new word into the table, you should check if M is still at least twice as large as the number of words in the table. If not, it is time to resize (i.e., expand) the table. You can use the usual trick of expanding the table to be twice its current size. However, you will have to rehash all of the words into the table since M, the table size, determines which slots words are hashed into.
public int numberWords()
This method returns the number of words in the hash table and it should run in constant time.
public void sortedPrint()
This method prints out all the words in the hash table in lexicographic order in a neat and readable format.
public void sortedPrint(String fileName)
This method is the same as the above method except that it writes into a file with the given fileName.

The program buildDict should start by asking the user for a list of text files to read from. The user should respond by typing the file names as

[filename1] [white space] [filename2] [white space] [filename3]...

In this version of the project you will be using only three files: alice.txt, carol.txt, and hyde.txt. Then buildDict will read all specified text files and build a dictionary of all distinct words found. It should then write the contents of this dictionary into a file called dictionary.dat. This file constitutes the dictionary that will subsequently be used by the program spellCheck.

The spellCheck program

After the buildDict program completes its task and creates a dictionary file dictionary.dat, the spellCheck program is ready to take over. This program should start by reading the list of words in the file dictionary.dat and storing these in a wordList object. In addition to the words in dictionary.dat, the program spellCheck should add valid single letter words such as A and I to the data structure. After spellCheck has read the list of words from dictionary.dat it should prompt the user for the name of a document she wants spell-checked. spellCheck should then read words from this user specified document, one-by-one and check which words appear in its dictionary. As a running example, suppose that the user wants a file called myLetter to be spell-checked. As spellCheck reads from the file myLetter, it writes into a file called myLetter.out. After spellCheck has completely processed myLetter, the file myLetter.out will be the corrected version of myLetter. Note that myLetter.out is identical to myLetter except for misspelled words being replaced by correctly spelled words. The spellCheck program should make sure that all the white spaces and punctuation marks that appear in myLetter appear exactly in the same manner in myLetter.out. As the program scans the user specified document, it may encounter a word that is not present in its dictionary. It takes such a word to be misspelled and responds by (i) displaying the misspelled word and (ii) producing the following prompt:

Do you want to replace (R), replace all (P), ignore (I), ignore all (N), or exit (E)?

The user is expected to respond by typing an R, P, I, N, or an E. If the user types anything else, the program should simply ask the question again. Here is what your program should do for each of the possible responses from the user.

The user types R. In this case, your program should ask the user to type a replacement word. The program should read the replacement word and replace the current occurance of the misspelled word by the replacement word. If the same misspelled word occurs later in the document, the user needs to be prompted again.
For example, if the document myLetter contains the word colour, then when spellCheck encounters this word, it searches for it in its dictionary, does not find it, and asks the user what she wants to do. Assuming that the user responds by typing R and then types the replacement word color, the program writes color instead of colour in myLetters.out.
The user types P. In this case also, your program should ask the user to type a replacement word. The program should read the replacement word and replace all occurances of the misspelled word by the replacement word. This means that the user will not be prompted when another occurance of the misspelled word is encountered later in the document.
The user types I. In this case, your program should just ignore the misspelled word and move on to the next word. If the same misspelled word occurs later in the document, the user needs to be prompted again.
The user types N. In this case, your program should ignore all occurances of the misspelled word. This means that the user will not be prompted when another occurance of the misspelled word is encountered later in the document.
The user types E. In this case the program should simply terminate without changing the text file at all.

The actions corresponding to the responses R, I, and E are easy to implement. The actions corresponding to the responses P and N require some modifications to the wordList class. For example, if the user asks that all occurrances of colour be replaced by color, then the program needs to remember this so that whenever subsequent occurances of the word colour are encountered they are replaced by color. Similarly, for N.

To implement P and N, let us think of each item in a wordList object as consisting of two words: (i) an incorrect word and (ii) a replacement word. The incorrect word will serve as the key for the item. The spellCheck program should define an object, say replacementWords, that belongs to the wordList class. Suppose the misspelled word colour is encountered and the user responds by asking that all occurances of colour be replaced by color. Then, the pair of words colour and color would form a single item, with colour being the incorrect word and color being the replacement word. This item would be inserted into the object replacementWords. Similarly, if the user responds by asking that all occurances of the word colour be ignored, then the program should insert the word colour as the incorrect word and the replacement word. So the same word is taken to be its replacement and as a result no change is made to the document. This organization means that when a word is encountered, the spellCheck program should first look into the replacementWords object and only if it does not find the word there, it should look in the dictionary. This modification to the wordList class means that it might be helpful to implement the following method as part of the wordList class:

               public void insert(String incorrectWord, String replacementWord)

This method makes an item with with incorrectWord and replacementWord in it and inserts this item into a wordList object.

Documentation:

You should document your code extensively. Your comments should be clear and precise and not add unnecessary clutter to your code. Avoid the approach of ritualistically filling your code with comments, especially those that are pointless (e.g. "this is a variable declaration") or imprecise (e.g. "this loop goes around and around until it figures out the answer"). Ambiguous or inaccurate comments can be worse than none at all. Consider using the Javadoc style of comments, especially since Eclipse can recognize these. Go to Sun's webpages on Javadoc for more information.

What to submit:

You are required submit via ICON by 5 pm on Friday, 10/24 the following java files: buildDict.java, spellCheck.java, and wordList.java. In addition you should submit a README file in which you should mention all known errors, gross inefficiencies, all required methods that are unimplemented, and documentation for all "extra" methods that were implemented. Make sure that your file names are exactly as above and make sure that you do not submit any additional files. If any of these requirements change in the next 2 weeks, I will post updates in the announcements section of the course webpage

Final words:

While I have specified quite a few details, I have left enough unspecified so as to leave you with a few design decisions to make. You have 3 weeks for this project because I expect that it will take you 3 weeks! So please start right away.

Sriram Pemmaraju
Thursday October 2 2008