Introduction
This is the first in a series of 3 programming projects at the end of which you will have constructed a spell checker -- program that checks and possibly corrects spelling errors in documents. In this project you will take a first step, by building a somewhat primitive spell checker.
Overview
As part of this project you will have to write two separate programs: buildDict and spellCheck. As the names suggest, the program buildDict builds a small dictionary of words, while the program spellCheck uses this dictionary of words to determine which words in a given document are misspelled. You will use a hash table that is quadratically probed as the primary data structure for both buildDict and spellCheck.
Building a Dictionary
In this project, we will build a dictionary by extracting words from large on-line documents known to be error free. I am making available the following three famous novels in text form:
The program buildDict should read each of these documents, extract words
from them, and insert them into a dictionary.
It is important to have a precise understanding of what we mean by a
word before we proceed.
As far as this project is concerned, a word is a contiguous sequence of 2 or more
letters (lower case or upper case) immediately followed by a non-letter character.
For this project we will ignore the difference between upper and lower case letters
and therefore the words The and the will be considered identical.
To understand the above definition, consider the following text:
The giant telephone company AT&T bought an IBM360 computer last year.
Since then the company has had an error-free run in the telecommunications
industry.
The distinct words extracted from this text are listed below in alphabetical order:
an at bought company computer error free
giant had has ibm in industry last
run since telephone telecommunications the
then year
Notice that the word at has been extracted from AT&T and the hyphenated word error-free has contributed two words error and free to the above list. Each of the extracted words needs to be inserted into a data structure that can also be searched quickly (so that the same word is not inserted twice). To support this you should define and implement a class called wordList that can be used by both the buildDict program and the spellCheck program.
You should use hashing with quadratic probing to implement the wordList class. In the lecture on Wednesday, October 1st, I mentioned the simple probing function
The class worldList should contain at least the following member functions:
The program buildDict should start by asking the user for a list
of text files to read from.
The user should respond by typing the file names as
[filename1] [white space] [filename2] [white space] [filename3]...In this version of the project you will be using only three files: alice.txt, carol.txt, and hyde.txt. Then buildDict will read all specified text files and build a dictionary of all distinct words found. It should then write the contents of this dictionary into a file called dictionary.dat. This file constitutes the dictionary that will subsequently be used by the program spellCheck.
The spellCheck program
After the buildDict program completes its task and creates a dictionary file
dictionary.dat, the spellCheck program is ready to take over.
This program should start by reading the list of words in the
file dictionary.dat and storing these in a wordList object.
In addition to the words in dictionary.dat, the program spellCheck
should add valid single letter words such as A and
I to the data structure.
After spellCheck has read the list of words from dictionary.dat it should
prompt the user for the name of a document she wants spell-checked.
spellCheck should then read words from this user specified document,
one-by-one and check which words appear in its dictionary.
As a running example, suppose that the user wants a file called
myLetter to be spell-checked.
As spellCheck reads from the file myLetter, it writes into
a file called myLetter.out.
After spellCheck has completely processed myLetter,
the file myLetter.out will be the corrected version of
myLetter.
Note that myLetter.out is identical to myLetter except
for misspelled words being replaced by correctly spelled words.
The spellCheck program
should make sure that all the white spaces and punctuation marks
that appear in myLetter appear exactly in the same manner in
myLetter.out.
As the program scans the user specified document, it may encounter a word
that is not present in its dictionary.
It takes such a word to be misspelled and responds by
(i) displaying the misspelled word and (ii) producing the following prompt:
Do you want to replace (R), replace all (P), ignore (I), ignore all (N), or exit (E)?The user is expected to respond by typing an R, P, I, N, or an E. If the user types anything else, the program should simply ask the question again. Here is what your program should do for each of the possible responses from the user.
For example, if the document myLetter contains the word colour, then when spellCheck encounters this word, it searches for it in its dictionary, does not find it, and asks the user what she wants to do. Assuming that the user responds by typing R and then types the replacement word color, the program writes color instead of colour in myLetters.out.
The actions corresponding to the responses R, I, and E are easy to implement. The actions corresponding to the responses P and N require some modifications to the wordList class. For example, if the user asks that all occurrances of colour be replaced by color, then the program needs to remember this so that whenever subsequent occurances of the word colour are encountered they are replaced by color. Similarly, for N.
To implement P and N, let us think of each item in a wordList object as consisting of two words: (i) an incorrect word and (ii) a replacement word. The incorrect word will serve as the key for the item. The spellCheck program should define an object, say replacementWords, that belongs to the wordList class. Suppose the misspelled word colour is encountered and the user responds by asking that all occurances of colour be replaced by color. Then, the pair of words colour and color would form a single item, with colour being the incorrect word and color being the replacement word. This item would be inserted into the object replacementWords. Similarly, if the user responds by asking that all occurances of the word colour be ignored, then the program should insert the word colour as the incorrect word and the replacement word. So the same word is taken to be its replacement and as a result no change is made to the document. This organization means that when a word is encountered, the spellCheck program should first look into the replacementWords object and only if it does not find the word there, it should look in the dictionary. This modification to the wordList class means that it might be helpful to implement the following method as part of the wordList class:
public void insert(String incorrectWord, String replacementWord)This method makes an item with with incorrectWord and replacementWord in it and inserts this item into a wordList object.
Documentation:
You should document your code extensively. Your comments should be clear and precise and not add unnecessary clutter to your code. Avoid the approach of ritualistically filling your code with comments, especially those that are pointless (e.g. "this is a variable declaration") or imprecise (e.g. "this loop goes around and around until it figures out the answer"). Ambiguous or inaccurate comments can be worse than none at all. Consider using the Javadoc style of comments, especially since Eclipse can recognize these. Go to Sun's webpages on Javadoc for more information.
What to submit:
You are required submit via ICON by 5 pm on Friday, 10/24 the following java files: buildDict.java, spellCheck.java, and wordList.java. In addition you should submit a README file in which you should mention all known errors, gross inefficiencies, all required methods that are unimplemented, and documentation for all "extra" methods that were implemented. Make sure that your file names are exactly as above and make sure that you do not submit any additional files. If any of these requirements change in the next 2 weeks, I will post updates in the announcements section of the course webpage
Final words:
While I have specified quite a few details, I have left enough unspecified so as to leave you with a few design decisions to make. You have 3 weeks for this project because I expect that it will take you 3 weeks! So please start right away.