22C:21 Computer Science II: Data Structures: Homework

Homework 10: Due 11/29

The last homework and Project 2 asked you to implement Prim's MST algorithm. This homework asks you to do that again, but much more carefully, using a binary heap implementation of the PRIORITY QUEUE ADT. In particular, as discussed in class, your implementation should run in (m+n)*log(n) for graphs with n vertices and m edges. To make sure that this is the case, you will test your implementation on large, sparse graphs - with about 10,000 vertices. For an n-vertex, sparse graph, your implementation should run in n*log(n) time, in the worst case. These large, sparse graphs will be randomly generated and generating these sparse, random graphs is part of this homework.

As discussed in class, Prim's algorithm can be efficiently implemented by maintaining in a binary min-heap the "non-tree" vertices that have at least one neighbor among the tree vertices. To do this, you will have to modify the implementation of the Heap class provided in heap.java. Specifically, create a new class called VertexHeap. A VertexHeap object will contain a bunch of nodes, with each node having the following three fields:

an int field called identity that represents the index of vertex,
an int field called parent that represents the index of a neighbor among tree vertices to which this vertex has the lightest edge, and
a field called priority that represents the weight of a lightest edge that connects this vertex to a neighbor among tree vertices.

The methods insert, delete, and change should run in log(n) time, in the worst case. Also, recall that the VertexHeap also needs a method called getIndex that takes the index of a vertex and returns the location of this vertex in the heap; this method should run in constant time. For this you should add an additional data member, called map to the class to map vertex indentities to locations in the heap.

To make sure that the running time of your implementation is (n + m)*log(n), you will have to carefully call other methods in the myWeightedGraph and also make sure that these methods are as efficient as possible. This may involve making other changes to the myWeightedGraph class. For example, consider the implementation I presented in class and in particular the code that updates the heap in response to the migration of a vertex v from the non-tree vertex set to the tree vertex set. We get all the neighbors of v, scan these one by one, and for each non-tree vertex u, get the weight of edge {v, u}, and if necessary update u's priority and parent in the heap. Now note that getting the weight of edge {v, u} (using the method getWeight) is not a constant time operation. It involves scanning the neighbors linked list of v until we find the EdgeLink containing u and then returning the weight that is stored in that EdgeLink object. In the worst case this may take n time (e.g., if v has n-1 neighbors and u is the last neighbor in v's adjacency list). To solve this problem, you should implement a more sophisticated version of the getNeighbors method so that it not only gets the neighbors of a vertex v, but also the weights of the edges from v to these vertices. There may be other instances of such inefficiencies that you may have to deal with.

Generating Random Graphs:
To generate large, sparse random graphs we use the Erdos-Renyi model of random graphs. Suppose we want to generate a graph with n vertices, for some positive integer n. Now let p be a real number between 0 and 1 (inclusive). First generate n vertices labeled 1 through n and then for each pair of vertices i and j, connect the vertices by an edge with probability p. In other words, we visit every pair {i, j} of vertices and flip a biased coin. With probability p the coin toss has outcome "heads" and with probability 1-p the coin toss has outcome "tails." If the coin toss has outcome "heads" we add edge {i, j}; otherwise we don't. This model of a random graph was first discussed in a series of papers by mathematicians Paul Erdos and Alfred Renyi, starting in 1959, and hence the name. The probability p can be used to control the number of edges (i.e., the sparsity) of the generated graph. Note that on average we expect p*(n-1) edges to be incident on each vertex. This is because each vertex has a probability p of being connected to each of the remaining n-1 vertices. So, for example, if we set p = 10/(n-1), then we would expect 10 edges to be incident on each vertex, on average.

It is possible that the graphs generated in the manner described above are not connected (i.e., they may have several connected components). Here is a simple way to connect up the graph and turn it into one that has exactly one connected component. Suppose the graph has k connected components. Pick two of these (arbitrarily), pick one vertex from each (at random), and add an edge between the two chosen vertices. This will reduce the number of connected components to k-1. Repeating this process for a total of k-1 times will leave exactly one connected component.

Perform the following experiments to test and time your MST implementation. Generate random graphs for n = 10000, 10100, 10200, 10300,..., 10900 with p = 30/(n-1). For each of the 10 (n, p) values mentioned above, generate 10 random graphs with that value of n and p. Make sure that each generated graph is also connected up, in the way described above, and has just one connected component. Compute the MST of each graph, first using your old implementation from Project 2 and then using your new implementation. Time each MST computation and report the average time (over the 10 graphs that were generated for the particular (n, p) value). Thus for each (n, p) value you will report two times: one for the old MST implementation and one for the new MST implementation.

Submit a document, called homework10.pdf that should contain a write-up describing your experimental results. This write-up should address the following issues:

How does the running time of your new MST implementation grow, as a function of n? Present a plot of these running times. Does it our expectation that the running time of MST should be n*log(n) for a sparse graph with n vertices. Recall that by a sparse graph, we mean a graph in which the number of edges is some constant times n, the number of vertices. For the graphs we are generating, since the expected degree is 30, the total number of edges is expected to be about 15*n; thus, these are sparse graphs.
How does the running time of your old MST implementation grow, as a function of n? Present a plot of these running times. Based on the plot of the running times, can you get a sense of what the function might be? It is possible that your old MST implementation takes forever to run on a graph with 10,000 vertices. In this case, generate smaller graphs and run your implementation on those and generate plots.
Based on your experimental results, how much faster do you think your new implementation is compared to your old implementation?