This is the first in a series of 3 programming projects at the end of which you will have constructed a spell checker -- program that checks and possibly corrects spelling errors in documents. In this project you will take a first step, by building a somewhat primitive spell checker. It is my hope that this exercise will give you a glimpse into some of difficulties involved in constructing good programs for checking and correcting spellings.
As part of this project you will have to write two separate programs: buildDict and spellCheck. As the name suggests, the program buildDict builds a small dictionary of words, while the program spellCheck uses this dictionary of words to determine which words in a given document are misspelled. The task that buildDict performs can be thought of as a preprocessing step; buildDict will be used rarely as compared to spellCheck. Once a dictionary has been built, then there is no reason to use buildDict again until the performance of spellCheck is found to be poor enough to require a modification of the dictionary. The implication is that buildDict can expend more time and space resources on its task as compared to spellCheck.
Building a Dictionary
Building a good dictionary is a rather hard task and in this first attempt, the dictionary we build, will most likely lead to rather poor performance by spellCheck. However, in subsequent attempts we will improve and enlarge our dictionary. In this project, we will build a dictionary by extracting words from large on-line documents known to be error free. I am making available the following three famous novels in text form:
The program buildDict should read each of these documents, extract words
from them, and insert them into a dictionary.
It is important to have a precise understanding of what we mean by a
word before we proceed.
As far as this project is concerned, a word is a contiguous sequence of 2 or more
letters (lower case or upper case) immediately followed by a non-letter character.
For this project we will ignore the difference between upper and lower case letters
and therefore the words The and the will be considered identical.
To understand the above definition, consider the following text:
The giant telephone company AT&T bought an IBM360 computer last year.
Since then the company has had an error-free run in the telecommunications
The distinct words extracted from this text are listed below in alphabetical order:
an at bought company computer error free
giant had has ibm in industry last
run since telephone telecommunications the
Notice that the word at has been extracted from AT&T and the hyphenated word error-free has contributed two words error and free to the above list. Each of the extracted words needs to be inserted into a data structure that can also be searched quickly (so that the same word is not inserted twice). To support this you should define and implement a class called wordList that can be used by the buildDict program. This class should contain a vector of words in alphabetically sorted order. I know that there are many different vector classes. For the sake of consistency and because we discussed this particular vector class in my lectures, I would like you to use the apvector class available on the course page. The class worldList should contain at least the following member functions:
cout << dictionary;
in her programs. As before, dictionary is an object belonging to the class wordList. The effect of the above output statement should be to insert in alphabetical order the list of words in dictionary into the output stream in a neat and readable format.
The program buildDict should start by asking the user for a list
of text files to read from.
The user should respond by typing the file names as
[filename1] [white space] [filename2] [white space] [filename3]...In this version of the project you will be using only three files: alice.txt, carol.txt, and hyde.txt. Then buildDict will read all specified text files and build a dictionary of all distinct words found. It should then write the contents of this dictionary into a file called words.dat. This file constitutes the dictionary that will subsequently be used by the program spellCheck. The file words.dat should contain words in alphabeticallt sorted order.
The spellCheck program
After the buildDict program completes its task and creates a dictionary file words.dat, the spellCheck program is ready to take over. This program should start by reading the list of words in the file words.dat into a data structure that can be quickly searched. In addition to the words in words.dat, the program spellCheck should add valid single letter words such as A and I to the data structure. For this data structure also, you should use the wordList class that you used in buildDict so that you do not have to write fresh code. Overload the extract operator >> so that it can be used in the following manner:
cin >> dictionary;
to extract from the input stream a list of words and store them in the
wordList object called dictionary.
After spellCheck has read the list of words from words.dat it should
prompt the user for the name of a document she wants spell-checked.
spellCheck should then read words from this user specified document,
one-by-one and test if the words appears in its dictionary.
As a running example, suppose that the user wants a file called
myLetter to be spell-checked.
As spellCheck reads from the file myLetter, it writes into
a file called myLetter.out.
After spellCheck has completely processed myLetter,
the file myLetter.out will be the corrected version of
Note that myLetter.out is identical to myLetter except
for misspelled words being replaced by correctly spelled words.
The spellCheck program
should make sure that all the white spaces and punctuation marks
that appear in myLetter appear exactly in the same manner in
As the program scans the user specified document, it may encounter a word
that is not present in its dictionary.
It takes such a word to be misspelled and responds by
(i) displaying the misspelled word and (ii) producing the following prompt:
Do you want to replace (R), replace all (P), ignore (I), ignore all (N), or exit (E)?The user is expected to respond by typing an R, P, I, N, or an E. If the user types anything else, the program should simply ask the question again. Here is what your program should do for each of the possible responses from the user.
What to submit:
You are required to electronically submit by 5 pm on Tuesday, 10/7 the following files: buildDict.cxx, spellCheck.cxx, wordList.cxx, wordList.h, map.cxx, and map.h. Make sure that your file names are exactly as above and make sure that you do not submit any additional files. You will submit your project using a program called submit that you can use on computer science departmental machines. I will post more details related to this program on the course page.
While I have specified quite a few details, I have left enough unspecified so as to leave you with a few design decisions to make. For example, you should define some number of constructors for the wordList and the map class. You need to decide what the syntax for these constructors should be and what exactly they should accomplish.
You have 3 weeks for this project because I expect that it will take you 3 weeks! So please start right away.
We will soon post a grading scheme for the project that will help you decide what pieces to focus on first.