22C:21: Computer Science II: Data Structures

Project 1. Posted 2/16. Due 3/12


Introduction

These days we hear a lot about small-world networks, which are graphs with certain characteristics that seem to occur in a surprisingly large number of natural phenomena. Informally speaking, a small-world network is a graph with three properties:
  1. The average degree of a vertex is small relative to the number of vertices in the graph.
  2. For any vertex v, the neighbors of v are very likely to themselves be neighbors. Graphs with this property are said to be highly clustered.
  3. The average path length between pairs of vertices, measured in hops, is small relative the size of the graph.
In other words, small-world networks are sparse, highly clustered graphs with small average path length.

Small-world networks represent the small-world phenomenon, which is the hypothesis that everyone in the world can be reached through a short chain of social acquaintances. The concept gave rise to the famous phrase "six degrees of separation" after a 1967 small world experiment by social psychologist Stanley Milgram which suggested that two random US citizens were connected on average by a chain of six acquaintances. Other popular examples of small-world networks and the small-world phenomena are the Kevin Bacon game, the Erdos number project, connectivity of the internet, the power grid in the western part of the U.S., gene networks, online social networking communities, etc. Wikipedia has many related entries; see Small-world network, Small-world phenomenon, Social network, etc., for more interesting details and references.

Your task in this project is to experiment with methods for generating large, random, small-world networks. You are expected to write code, conduct experiments by running your code, gather data from your experiments, and finally write a report summarizing and explaining your findings.

Some Definitions

Let G be a graph with vertex set V and edge set E.

Average degree. The degree of a vertex is the number of neighbors it has in the graph and the average degree of the graph is the average of all the vertex degrees. We will use d(v) to denote the degree of v and d(G) to denote the average degree of G.

Clustering coefficient. For any vertex v in G, the clustering coefficient of v, denoted C(v), is the ratio of the number of edges between pairs of neighbors of v to the total number of pairs of neighbors of v. For example, say vertex v has 10 neighbors. There are (10 choose 2) = 45 pairs of neighbors and suppose that of these pairs of neighbors 15 pairs are connected by edges. Then the clustering coefficient of v is 1/3. The clustering coefficient of G, denoted C(G), is the average of the clustering coefficients of all the vertices in G. The value of C(v), for any v, is between 0 and 1 and therefore the value of C(G) is also between 0 and 1.

Average path length. The distance between a pair of vertices u and v in G is the number of hops on a shortest path between u and v. The average path length of G, denoted L(G), is the average of the distances between all pairs of vertices in G.

Small-world networks are graphs G that have a relatively small value for d(G) and L(G) and relatively high value for C(G).

Random Graph Models

It is important to be able generate random input to test algorithms and to prove properties of these. For example, if we design a new sorting algorithm, we would want to be able to generate large sequences of randomly ordered numbers, to test the algorithm on. Similarly, in order to test graph algorithms, we would like to generate random graphs and there is a simple, well-known way to generate random graphs. Let n be a positive integer and let p be a real number between 0 and 1 (inclusive). First generate n vertices labeled 1 through n and then for each pair of vertices i and j, connect the vertices by an edge with probability p. In other words, we visit every pair {i, j} of vertices and flip a biased coin. With probability p the coin toss has outcome "heads" and with probability 1-p the coin toss has outcome "tails." If the coin toss has outcome "heads" we add edge {i, j}; otherwise we don't. This model of a random graph was first discussed in a series of papers by mathematicians Paul Erdos and Alfred Renyi, starting in 1959. We will call this the Erdos-Renyi model of random graphs.

However, it turns out that the Erdos-Renyi random graphs have very small clustering coefficient relative to many of the well-known small-world networks. For example, the Kevin Bacon graph has a clustering coefficient of 0.79, whereas an Erdos-Renyi random graph with the same number of vertices has clustering coefficient 0.00027. This is not surprising because in the Erdos-Renyi model, two vertices have the same probability of being connected by an edge, whether they have a common neighbor or not. The clustering coefficient numbers mentioned above appear in an influential paper by Watts and Strogatz that appeared in Nature in June 1998. In this paper, Watts and Strogatz also suggest a simple, alternate random graph model that yields graphs with small average degree, high clustering coefficient, and small average path length. For specific details of the Watts-Strogatz model look at Figure 1 in this paper. However, the Watts-Strogatz model seems somewhat simplistic and your job is to experiment with an alternate, maybe more general, way of generating small-world networks. This is described in detail in the next section.

A More General Model

Let d and n be positive integers and let p be a real number in the range 0 through 1 (inclusive). The construction of a small-world network occurs in two steps:
  1. Fix a d by d square in the plane. Distribute n points uniformly at random into the square. These are the vertices of the graph being generated. Connect by edges, pairs of points that are at distance at most 1. At this point we have a graph that is called a unit disk graph. This is because for any vertex v, the neighbors of v are exactly the vertices in the disk of radius one (unit disk) centered at v.

  2. Now we will "rewire" some of the edges in the unit disk graph constructed in Step (1). Suppose the vertices are labeled 1 through n. Consider vertices in the order 1 through n and suppose that we are currently considering a vertex v. Consider each edge {v, w} connecting v to a higher labeled vertex w. With probability p, replace edge {v, w} by an edge {v, z} where z is picked picked uniformly at random from the set of all vertices that are not neighbors of v (including w).

We would like to believe that the random graph generated in the manner described above is a small-world network. The construction attempts to model the fact that people are typically connected to a few other people in physical proximity, but have some "long-range" connections as well. Here are some basic observations about the graphs generated by the above construction.

Experiments

You should run experiments to test the following hypothesis:

As p increases starting from 0, the average path length falls quickly. The clustering coefficient also falls, but more more slowly. Hence, we will get to a value of p with small average path length, but fairly high clustering coefficient.

To run experiments to test this hypothesis, let us fix n = 10,000 and average vertex degree at 10. So each vertex is connected to, on average, 1 in 1000 vertices. First find by a simple "back-of-the-envelope" calculation what value of d you should be using to get an average vertex degree of 10. You might want experiment with different values of d to get this right. Now increase p slowly from 0 to 1. For each value of p, generate a graph, using the two-step process described above and calculate its (i) clustering coefficient and (ii) average path length. Let G(p) denote the graph generated for a particular value of p. Plot the ratios C(G(p))/C(G(0)) and L(G(p))/L(G(0)) so that we get a sense of whether our hypothesis is true or not. You can decide how to vary p. For example, you may increase p in increments of 1/10 or in increments of 1/100, etc. But, the more points you have in your plot, the more convincing your results will be. You might want to look at Figure 2 of the Watts-Strogatz paper to get an idea of exactly what the x-axis and y-axis of your plot should be and roughly what shape your plots should have.

To strengthen your experimental results, you might want to run these for other, larger values of n and other slightly larger values of average degree. You might also want to see if you can generate graphs for which L(G) and C(G) match the numbers in Table 1 in Watts-Strogatz paper for the film actors graph, power grid graph, and the C. elegans graph.

As you run your experiments you may encounter issues that require tweaking your parameters. For example, for the values of n = 10,000 and average degree = 10 it may be that the constructed graph is not connected. In such a case, you might want to increase the value of the average degree.

Implementation

Here is a high level description of the steps in your implementation. There are many details to think about before you get started.

  1. Start with an adjacency list implementation of the graph class. This is available here. For this project the adjacency list representation is much more efficient than the adjacency matrix representation because the graphs we are concerned with, are all sparse. Turn this class into a generic class.

  2. Add a constructor to the graph class that takes arguments d, n, and p and returns a graph constructed in the way described above.

  3. Implement a method in the graph class that takes a vertex and returns its cluster coefficient. The header of this method would be
    double clusteringCoefficient(vertexType v)
    Then implement a method in the graph class that returns the clustering coefficient of the graph. This method would have the header:
    double clusteringCoefficient()

  4. Implements a program called Experiments that runs the experiments and outputs the data. This should be in a file called Experiments.java.

  5. Implement a method to compute the average path length in the graph. The header for this method would be
    double averagePathLength()

  6. Implement a breadth first search method that takes no arguments and returns a breadth first search tree of the graph. The header for this method would be
    treeType bfs()

What to submit

Working code distributed in various files, a clearly written README, and a report in pdf summarizing and explaining your results. All of these in a directory called Project1. More details on the exact names of the code files, size of the report, etc. will be posted soon.