22C:21 Project 3
Due date: Monday, May 8th
Introduction
This project builds on Project 2.
In this, you will generate suggested replacements in a much more efficient way.
Suggesting Replacements
The procedure for suggesting replacements, used in Project 2, was inefficient
and somewhat simple-minded.
Here is a more sohisticated and efficient algorithm for generating
replacements.
In Project 2 we defined a graph whose vertices
are all possible strings and whose edges
connect certain
pairs of strings s and t.
Let us call this the word graph.
The edges in the word graph were of 4 types:
- Deletion edges: t is obtained by deleting
a letter in s,
- Insertion edges: t is obtained by inserting a letter into
s,
- Substitution edges: t is obtained by substituting another letter
for some letter in s, and
- Swapping edges: t is obtained by swapping a pair of letters in
s.
We will now think of some of the edges in the graph as being important
and others as being unimportant.
The idea is that an edge (s, t) is important
if there are reasons to believe that a user is likely to use the
word s instead of t (or vice versa).
We will use the following principle to determine which edges are important
and which are not:
If two words sound similar, it is more likely that a user will use
one instead of another.
This principle has specific implications to the 4 kinds of edges we consider.
Here are examples.
- Deletion edges: Dropping the letter t from the word letter
does not change the sound of the word much, because of the two consecutive t's
in the word.
Similarly, dropping the letter i or the second e in the word
receive, does not change the sound of the word too much.
So edges such as (letter, leter) and (receive, receve)
are considered important whereas edges such as
(letter, lette) and (receive, eceive)
are not.
This is because in the latter cases, the sound of the word changes significantly when
the letter is dropped.
- Insertion edges: If an edge (s, t) is an insertion
edge, then it is also a deletion edge.
Therefore, the above discussion holds for insertion edges as well.
For example, I have often written neccessary for the word
necessary.
Since the mispelled word, obtained from the correctly spelled word via
the insertion of an extra c, sounds similar to the original word,
the edge (necessary, neccessary) is considered
important.
- Substition edges:
If a pair of letters sound the same then a user is more likely to substitute
one for another. For example, the letters c and s sound similar
in some contexts and a user might type nesessary instead of
necessary.
This means that edges such as (necessary, nesessary) are
considered important, whereas the edge (necessary, netessary)
is not.
In the latter case, replacing the letter c by a t drastically
changes the sound of the word.
- Swapping edges: If a pair of letters sound the same then swapping them will
create a new word that sounds similar to the original word.
For example, the swapping the letters e and i in
receive yields a similar sounding word and so the
edge (receive, recieve) is considered important whereas
the edge (receive, erceive) is not.
To do apply these ideas systematically, here is a table of pairs of letters
that we will think of as being similar in sound.
a e
c s
c k
e i
g j
i y
o u
v w
s z
Also, every letter sounds similar to itself.
Given this table, we can tag each edge as either being important or
unimportant as follows, depending on the type of the edge.
- Deletion edges: If two similar sounding letters occur
consecutively in a word s and one of them is deleted,
to give us the word t, then the edge (s, t)
is important. All other deletion edges are unimportant.
- Insertion edges: Since every deletion edge is an insertion edge
as well, the important/unimportant tags for
these kinds of edges are defined above.
- Substitution edges If the word t is obtained by
substituting a letter in the word s by a similar sounding
letter, then the edge (s, t)
is important. All other substition edges are unimportant.
- Swapping edges: If a two similar sounding letters occur
in a word s and these are swapped,
to give us the word t, then the edge (s, t)
is important. All other swapping edges are unimportant.
Here is how the idea of important and unimportant edges comes into play while
searching for replacements.
Given a mispelled word w, use the following steps to find
replacements.
-
Do a breadth first search (BFS) on the word graph starting with w
as the source and using just the important edges.
New strings will be generated as BFS proceeds; check whether a string that
is generated belongs to the dictionary (as in Project 2) or not.
If a generated string does belong to the dictionary, enqueue it into
a queue of replacement strings.
Terminate BFS when either 10 replacement strings have been found or
when 2000 strings have been generated.
The number 2000 is somewhat arbitrary and should be fine-tuned for better
performance.
Note that BFS may proceed down several levels starting at the root, before
it is terminated.
Recall that in Project 2, the search went down just 2 levels.
This was because in each level there were a lot of words (especially in
the second level) and we could not afford to go down more levels.
Here, since we are only using important edges, our search is narrower
and can therefore afford to be deeper.
Also, note that the queue of replacements contains words in the order
in which they were generated and this order does correspond to a decreasing
order of relevance of the replacement words.
-
Suppose that in Step (1), x replacement words have been found,
where x is smaller than 10.
In Step (2), the goal is to find the remaining 10 - x
replacement words.
In Step (2) we do a modified BFS.
Starting with the mispelled word w, we use all edges
(not just important edges) to generate level one.
For subsequent levels we use just the important edges.
As in Step (1), we terminate the BFS either when all the replacement words
we seek have been found or we have generated 2000 words.
Note that the word graph is too big to store explicitly and explore.
Therefore, you will run BFS on an implicitly stored graph, by generating
neighbors not by scanning the adjacency matrix/list representation of
the word graph, but instead by using the rules mentioned above.
What to submit:
You are required to electronically submit by midnight 5-8
the following files: BuildDict.java,
SpellCheck.java,
Dictionary.java,
Link.java,
LinkList.java, and
Record.java.
Make sure that your file names are exactly as above and make sure that you
do not submit any additional files.
A final word:
While I have specified quite a few details, I have left enough unspecified
so as to leave you with a few design decisions to make. These decisions will
affect the efficiency, robustness, and readability of your program.
So make your choices with care.
Sriram Pemmaraju