22C:21 Project 3

Due date: Monday, May 8th

Introduction

This project builds on Project 2. In this, you will generate suggested replacements in a much more efficient way.

Suggesting Replacements

The procedure for suggesting replacements, used in Project 2, was inefficient and somewhat simple-minded. Here is a more sohisticated and efficient algorithm for generating replacements. In Project 2 we defined a graph whose vertices are all possible strings and whose edges connect certain pairs of strings s and t. Let us call this the word graph. The edges in the word graph were of 4 types:

Deletion edges: t is obtained by deleting a letter in s,
Insertion edges: t is obtained by inserting a letter into s,
Substitution edges: t is obtained by substituting another letter for some letter in s, and
Swapping edges: t is obtained by swapping a pair of letters in s.

We will now think of some of the edges in the graph as being important and others as being unimportant. The idea is that an edge (s, t) is important if there are reasons to believe that a user is likely to use the word s instead of t (or vice versa). We will use the following principle to determine which edges are important and which are not: If two words sound similar, it is more likely that a user will use one instead of another. This principle has specific implications to the 4 kinds of edges we consider. Here are examples.

Deletion edges: Dropping the letter t from the word letter does not change the sound of the word much, because of the two consecutive t's in the word. Similarly, dropping the letter i or the second e in the word receive, does not change the sound of the word too much. So edges such as (letter, leter) and (receive, receve) are considered important whereas edges such as (letter, lette) and (receive, eceive) are not. This is because in the latter cases, the sound of the word changes significantly when the letter is dropped.
Insertion edges: If an edge (s, t) is an insertion edge, then it is also a deletion edge. Therefore, the above discussion holds for insertion edges as well. For example, I have often written neccessary for the word necessary. Since the mispelled word, obtained from the correctly spelled word via the insertion of an extra c, sounds similar to the original word, the edge (necessary, neccessary) is considered important.
Substition edges: If a pair of letters sound the same then a user is more likely to substitute one for another. For example, the letters c and s sound similar in some contexts and a user might type nesessary instead of necessary. This means that edges such as (necessary, nesessary) are considered important, whereas the edge (necessary, netessary) is not. In the latter case, replacing the letter c by a t drastically changes the sound of the word.
Swapping edges: If a pair of letters sound the same then swapping them will create a new word that sounds similar to the original word. For example, the swapping the letters e and i in receive yields a similar sounding word and so the edge (receive, recieve) is considered important whereas the edge (receive, erceive) is not.

To do apply these ideas systematically, here is a table of pairs of letters that we will think of as being similar in sound.

	a	e
	c	s
	c 	k
	e	i
	g	j
	i	y
	o	u
	v	w
	s	z

Also, every letter sounds similar to itself. Given this table, we can tag each edge as either being important or unimportant as follows, depending on the type of the edge.

Deletion edges: If two similar sounding letters occur consecutively in a word s and one of them is deleted, to give us the word t, then the edge (s, t) is important. All other deletion edges are unimportant.
Insertion edges: Since every deletion edge is an insertion edge as well, the important/unimportant tags for these kinds of edges are defined above.
Substitution edges If the word t is obtained by substituting a letter in the word s by a similar sounding letter, then the edge (s, t) is important. All other substition edges are unimportant.
Swapping edges: If a two similar sounding letters occur in a word s and these are swapped, to give us the word t, then the edge (s, t) is important. All other swapping edges are unimportant.

Here is how the idea of important and unimportant edges comes into play while searching for replacements. Given a mispelled word w, use the following steps to find replacements.

Do a breadth first search (BFS) on the word graph starting with w as the source and using just the important edges. New strings will be generated as BFS proceeds; check whether a string that is generated belongs to the dictionary (as in Project 2) or not. If a generated string does belong to the dictionary, enqueue it into a queue of replacement strings. Terminate BFS when either 10 replacement strings have been found or when 2000 strings have been generated. The number 2000 is somewhat arbitrary and should be fine-tuned for better performance. Note that BFS may proceed down several levels starting at the root, before it is terminated. Recall that in Project 2, the search went down just 2 levels. This was because in each level there were a lot of words (especially in the second level) and we could not afford to go down more levels. Here, since we are only using important edges, our search is narrower and can therefore afford to be deeper. Also, note that the queue of replacements contains words in the order in which they were generated and this order does correspond to a decreasing order of relevance of the replacement words.
Suppose that in Step (1), x replacement words have been found, where x is smaller than 10. In Step (2), the goal is to find the remaining 10 - x replacement words. In Step (2) we do a modified BFS. Starting with the mispelled word w, we use all edges (not just important edges) to generate level one. For subsequent levels we use just the important edges. As in Step (1), we terminate the BFS either when all the replacement words we seek have been found or we have generated 2000 words.

Note that the word graph is too big to store explicitly and explore. Therefore, you will run BFS on an implicitly stored graph, by generating neighbors not by scanning the adjacency matrix/list representation of the word graph, but instead by using the rules mentioned above.

What to submit:

You are required to electronically submit by midnight 5-8 the following files: BuildDict.java, SpellCheck.java, Dictionary.java, Link.java, LinkList.java, and Record.java. Make sure that your file names are exactly as above and make sure that you do not submit any additional files.

A final word:

While I have specified quite a few details, I have left enough unspecified so as to leave you with a few design decisions to make. These decisions will affect the efficiency, robustness, and readability of your program. So make your choices with care.

Sriram Pemmaraju