22C:30, 22C:115 Project 3

Due date: Wednesday, 12/10, 5 pm

Introduction

This project differs from projects 1 and 2 in that the emphasis is on experimentation and interpretation of results rather than on coding. In projects 1 and 2 we made several choices of data structures. For example, we used a hash table for the primary dictionary and a sorted array of words for the secondary dictionary. How good are these choices as compared to alternatives? This is the kind of question we will seek to answer via a series of experiments.

Experiment 1
Implement a binary search tree data structure. Use this instead of a vector of words in the buildDict program. Specifically, while maintaining the functionality of the class fwList implement it using a binary search tree. How much difference does this make to the running time of buildDict? Explain this difference or lack of it.

Experiment 2
Implement the AVL tree data structure. Perform an experiment similar to Experiment 1, but use an AVL tree instead of a plain binary search tree. How much difference does it make to buildDict to use a balanced search tree rather than search tree that can be quite skewed in the worst case? Explain this difference or lack of it.

Experiment 3
In going from Project 1 to Project 2 we placed high frequency words in a "primary dictionary", which we implemented as a hash table. How much difference did this decision make to spellCheck. Specifically, how much did the response time of spellCheck improve as a result of storing high frequency words separately? Explain your observations.

Experiment 4
What if we decide to store many more words in the hash table? Specifically, what if we store the 5000 most frequent words in a hash table with 5000 slots? How does the response time of spellChecker change? More generally, experimentally investigate how the ratio of the number of words stored in a hash table to the number of slots in the hash table affects the performance of the hash table. Explain your observations.

Experiment 5
What fraction of spellCheck's running time is spent in computing replacements for misspelled words? In your view, is this fraction large enough that we should invest time improving on the running time of the algorithm that computes replacements.

Experiment 6
How does your spellCheck program compare with the ispell program available on the linux lab machines? Explain the differences in a paragraph and suggest ways in which spellCheck could be improved to match ispell in performance.

Notes:



Sriram Pemmaraju