The Indexable Web is more than 11.5 billion pages
Abstract
What is the current size of the Web? At the time of this writing, Google claims to index more than 8 billion pages, MSN Beta about 5 billion pages, Yahoo! at least 4 billion, and Ask/Teoma more than 2 billion. Two sources for tracking the growth of the Web are [6,7], although they are not kept up to date.
Estimating the size of the whole Web is quite difficult, due to its dynamic nature. (According to Andrei Broder, the size of the whole Web depends strongly on whether his laptop is on the Web, since it can be configured to produce links to an infinite number of URLs!) Nevertheless, it is possible to assess the size of the publicly indexable Web. The indexable Web [4] is defined as "the part of the Web which is considered for indexing by the major engines". In 1997, Bharat and Broder [2] estimated the size of the Web indexed by Hotbot, Altavista, Excite and Infoseek (the largest search engines at that time) at 200 million pages. They also pointed out that the estimated intersection of the indexes was less than 1.4%, or about 2.2 million pages. Furthermore, in 1998, Lawrence and Giles [3] gave a lower bound of 800 million pages. These estimates have now become obsolete.
In this short paper, we revise and update the estimated size of the indexable Web to at least 11.5 billion pages as of the end of January 2005. We also estimate the relative size and overlap of the largest Web search engines. Specifically, Google is the largest engine, followed by Yahoo!, Ask/Teoma, and MSN Beta. We adopted the methodology proposed in 1997 by Bharat and Broder [2], but extended the number of queries used for testing from 35,000 in English to more than 438,141 in 75 different languages. We remark that an estimate of the size of the Web is useful in many situations, such as when compressing, ranking, spidering, indexing and mining the Web.
Data files
The data used in the experiment are available for download in UTF-8 plain text format, compressed with bzip2. Each record is formatted as follows:
SearchTime, Engine, Query, Rank, URL, CheckTime, GMTY
The field SearchTime is the integer returned by the system function time() at the time of the search. The field Engine indicates the queried search engine, and Query is the word used in the search. Rank indicates the position of the URL among the first 100 results returned by the search engine.
The last two fields are related to the checking procedure. The field CheckTime is the integer returned by the system function time() at the time of the check, and the field GMTY indicates whether the URL was recognized (1) or not recognized (0) by each search engine (G=Google, M=Msn Beta, T=Ask/Teoma, Y=Yahoo!).
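The record layout above can be read with a few lines of Python. This is only a sketch under stated assumptions: the fields are taken to be comma-separated exactly as in the format line, GMTY is taken to be a four-character string of 0/1 flags in the order G, M, T, Y, and the names Record, parse_record and read_records are hypothetical helpers, not part of the published data set.

```python
import bz2
from typing import Dict, Iterator, NamedTuple

ENGINE_FLAGS = "GMTY"  # G=Google, M=Msn Beta, T=Ask/Teoma, Y=Yahoo!


class Record(NamedTuple):
    search_time: int
    engine: str
    query: str
    rank: int
    url: str
    check_time: int
    recognized: Dict[str, bool]  # flag letter -> recognized by that engine


def parse_record(line: str) -> Record:
    """Parse one 'SearchTime, Engine, Query, Rank, URL, CheckTime, GMTY' line."""
    # Assumption: comma-separated fields. A URL may itself contain commas,
    # so take four fields from the left and two from the right.
    head = line.strip().split(",", 4)
    if len(head) != 5:
        raise ValueError(f"too few fields: {line!r}")
    search_time, engine, query, rank, rest = head
    tail = rest.rsplit(",", 2)
    if len(tail) != 3:
        raise ValueError(f"too few fields: {line!r}")
    url, check_time, gmty = (part.strip() for part in tail)
    # Assumption: GMTY is four 0/1 characters, one per engine.
    if len(gmty) != len(ENGINE_FLAGS) or set(gmty) - {"0", "1"}:
        raise ValueError(f"unexpected GMTY field: {gmty!r}")
    return Record(
        int(search_time), engine.strip(), query.strip(), int(rank), url,
        int(check_time), {f: bit == "1" for f, bit in zip(ENGINE_FLAGS, gmty)},
    )


def read_records(path: str) -> Iterator[Record]:
    """Stream records from a bzip2-compressed UTF-8 file such as round1.urls.bz2."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_record(line)
```

Splitting from both ends keeps a URL with embedded commas intact. If the real files use a different delimiter or flag encoding, only parse_record needs to change.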
~ Round 1 ~
Engines Coverage %
|          | Google | Msn   | Teoma | Yahoo! |
| Coverage | 76.30  | 62.03 | 57.58 | 69.28  |
Engines Intersections %
|        | Google | Msn   | Teoma | Yahoo! |
| Google | -      | 55.80 | 35.56 | 55.63  |
| Msn    | 78.40  | -     | 49.56 | 67.38  |
| Teoma  | 58.83  | 42.99 | -     | 54.13  |
| Yahoo! | 67.96  | 49.33 | 45.21 | -      |
Download the data file round1.urls.bz2 [3.0Mb]
~ Round 2 ~
Engines Coverage %
|          | Google | Msn   | Teoma | Yahoo! |
| Coverage | 76.09  | 61.90 | 57.69 | 69.39  |
Engines Intersections %
|        | Google | Msn   | Teoma | Yahoo! |
| Google | -      | 55.27 | 35.89 | 56.60  |
| Msn    | 78.48  | -     | 49.57 | 67.28  |
| Teoma  | 58.17  | 42.95 | -     | 53.70  |
| Yahoo! | 67.71  | 49.38 | 45.32 | -      |
Download the data file round2.urls.bz2 [3.0Mb]
~ Round 3 ~
Engines Coverage %
|          | Google | Msn   | Teoma | Yahoo! |
| Coverage | 76.27  | 61.87 | 57.70 | 69.37  |
Engines Intersections %
|        | Google | Msn   | Teoma | Yahoo! |
| Google | -      | 55.23 | 35.96 | 56.04  |
| Msn    | 78.42  | -     | 49.87 | 67.30  |
| Teoma  | 58.20  | 42.68 | -     | 54.13  |
| Yahoo! | 68.45  | 49.56 | 44.98 | -      |
Download the data file round3.urls.bz2 [3.0Mb]
~ Round 4 ~
Engines Coverage %
|          | Google | Msn   | Teoma | Yahoo! |
| Coverage | 76.05  | 61.73 | 57.57 | 69.30  |
Engines Intersections %
|        | Google | Msn   | Teoma | Yahoo! |
| Google | -      | 55.30 | 35.75 | 56.23  |
| Msn    | 78.52  | -     | 49.42 | 67.09  |
| Teoma  | 57.81  | 42.18 | -     | 53.88  |
| Yahoo! | 67.85  | 49.45 | 45.11 | -      |
Download the data file round4.urls.bz2 [3.0Mb]
~ Round 5 ~
Engines Coverage %
|          | Google | Msn   | Teoma | Yahoo! |
| Coverage | 76.11  | 61.96 | 57.56 | 69.26  |
Engines Intersections %
|        | Google | Msn   | Teoma | Yahoo! |
| Google | -      | 55.52 | 35.46 | 56.06  |
| Msn    | 78.42  | -     | 49.77 | 67.15  |
| Teoma  | 58.19  | 42.74 | -     | 53.84  |
| Yahoo! | 67.84  | 49.58 | 45.02 | -      |
Download the data file round5.urls.bz2 [3.0Mb]
URL normalization
Since we used the web interface of each search engine to check the presence of the URLs, great care was taken in normalizing them. In addition, we excluded from our computations the URLs not recognized by their originating search engine after normalization, to avoid any bias that this procedure could introduce.
We applied the following steps to each retrieved URL:
- Hex-encoded characters (%XX) were converted to standard ISO-8859-1 characters
- HTML entities were converted to their corresponding standard characters
- Every URL not terminating with a dot (.) followed by 2-5 characters was considered a directory, and a slash (/) was added to its tail
- Everything (parameters) after the question mark (?) was removed (question mark included)
- URLs containing invalid characters (spaces, quotes, equals...) were eliminated
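The steps above can be sketched in Python as follows. This is an illustration, not the authors' code: the function name normalize_url is hypothetical, the query string is stripped before the directory test (the list does not say whether the order matters), a URL that already ends in a slash is left alone, and only the three invalid characters actually named in the list are checked because the rest of that list is elided.

```python
import html
import re
from typing import Optional
from urllib.parse import unquote

# Assumption: only the characters the text names (space, quotes, equals).
INVALID_CHARS = set(" \"'=")
# A dot followed by 2-5 characters at the very end of the URL.
FILE_SUFFIX = re.compile(r"\.[^./]{2,5}$")


def normalize_url(url: str) -> Optional[str]:
    """Return the normalized URL, or None if the URL must be discarded."""
    # 1. Decode %XX escapes as ISO-8859-1 characters.
    url = unquote(url, encoding="iso-8859-1")
    # 2. Replace HTML entities (e.g. &amp;) with the characters they stand for.
    url = html.unescape(url)
    # 3. Drop the question mark and everything after it.
    #    (Assumption: done before the directory test.)
    url = url.split("?", 1)[0]
    # 4. Discard URLs that contain invalid characters.
    if any(ch in INVALID_CHARS for ch in url):
        return None
    # 5. Treat URLs without a 2-5 character suffix as directories.
    if not url.endswith("/") and not FILE_SUFFIX.search(url):
        url += "/"
    return url


if __name__ == "__main__":
    for sample in ("http://example.com/docs?id=3",
                   "http://example.com/index.html",
                   "http://example.com/a%20b.html"):
        print(sample, "->", normalize_url(sample))
```

Applied literally, the suffix rule treats a bare host such as http://example.com as a file because it ends in ".com". Whether the original procedure special-cased this is not stated.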
Engine size estimation
In the square distance method, we tried to minimize the square distance between the estimated sizes in each pair of engines. Let A and B be two search engines, and let x and y be the relative size coefficients such that x*A=B and y*B=A. Using the declared sizes of the engines' indexes as lower bounds, we tried to minimize (for each engine) the squared difference between the declared size and the relative size obtained from the pairwise overlaps.
In the linear program approach, we built a linear program with 12 constraints of the form A-y*B<=Cn for each n from 1 to 12 (one per ordered pair of the four engines). The objective was to minimize the sum of the Cn variables. Each engine variable (in this example A and B) represents its index size and can assume any value greater than or equal to the engine's declared size.
Both approaches give similar engine sizes.
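Both methods rely on relative sizes derived from pairwise overlaps. The sketch below illustrates only that ingredient, using the ratio estimator of Bharat and Broder [2], and does not implement either optimization. It assumes that the entry in row X and column Y of an "Engines Intersections %" table is the percentage of URLs sampled from engine X that engine Y also recognizes. The page does not state this, although the reading is consistent with the size ranking reported in the abstract. The function name relative_size is hypothetical.

```python
# Round 1 "Engines Intersections %" values from this page.
# Assumption: OVERLAP[x][y] is the percentage of URLs sampled from engine x
# that engine y also recognizes.
OVERLAP = {
    "Google": {"Msn": 55.80, "Teoma": 35.56, "Yahoo!": 55.63},
    "Msn": {"Google": 78.40, "Teoma": 49.56, "Yahoo!": 67.38},
    "Teoma": {"Google": 58.83, "Msn": 42.99, "Yahoo!": 54.13},
    "Yahoo!": {"Google": 67.96, "Msn": 49.33, "Teoma": 45.21},
}


def relative_size(overlap, a, b):
    """Estimate |A| / |B| from the two directed overlap percentages.

    overlap[a][b] approximates |A and B| / |A|, and overlap[b][a]
    approximates |A and B| / |B|, so their quotient approximates |A| / |B|.
    """
    return overlap[b][a] / overlap[a][b]


if __name__ == "__main__":
    engines = list(OVERLAP)
    for i, a in enumerate(engines):
        for b in engines[i + 1:]:
            print(f"|{a}| / |{b}| ~ {relative_size(OVERLAP, a, b):.3f}")
```

With four engines there are six unordered pairs and therefore twelve directed ratios, which in general are not mutually consistent. A fitting step is needed to reconcile them, and both approaches described above perform one.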
Indexed Web estimation
Analyzing the coverage of each engine over the 5 rounds, we obtained the following engine coverages on the test data:
Google=76.16%, Msn Beta=61.90%, Ask/Teoma=57.62%, Yahoo!=69.32%
Since we generated the same number of URLs from each engine, and since we eliminated the URLs not recognized by the originating engine after the normalization process, we can consider these values as representative of each engine's coverage of the Indexed Web. Thus, dividing each engine's estimated index size by its coverage gives one estimate of the size of the Indexed Web per engine. Averaging these values, we obtained the declared 9.36 billion pages.
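The arithmetic of this paragraph can be sketched as follows. The per-round coverages are copied from the five tables on this page, and averaging them reproduces the four percentages quoted above. The index sizes in the example are the claimed sizes from the abstract, used purely as stand-ins: the estimated sizes the authors actually used are not listed on this page, so the printed total is not expected to reproduce 9.36 billion. The function names are hypothetical.

```python
from statistics import mean

ENGINES = ("Google", "Msn Beta", "Ask/Teoma", "Yahoo!")

# Coverage % per round, from the five "Engines Coverage %" tables.
COVERAGE_BY_ROUND = [
    (76.30, 62.03, 57.58, 69.28),
    (76.09, 61.90, 57.69, 69.39),
    (76.27, 61.87, 57.70, 69.37),
    (76.05, 61.73, 57.57, 69.30),
    (76.11, 61.96, 57.56, 69.26),
]

# Stand-in sizes (billions of pages): the sizes each engine CLAIMS in the
# abstract, not the estimated sizes used for the published figure.
CLAIMED_SIZES = {"Google": 8.0, "Msn Beta": 5.0, "Ask/Teoma": 2.0, "Yahoo!": 4.0}


def mean_coverage(rounds):
    """Average each engine's coverage (in percent) over all rounds."""
    return {name: mean(column) for name, column in zip(ENGINES, zip(*rounds))}


def indexed_web_estimate(index_sizes, coverage_pct):
    """Average, over the engines, of index size divided by coverage."""
    return mean(index_sizes[name] / (coverage_pct[name] / 100.0)
                for name in index_sizes)


if __name__ == "__main__":
    coverage = mean_coverage(COVERAGE_BY_ROUND)
    for name in ENGINES:
        print(f"{name}: {coverage[name]:.2f}%")
    total = indexed_web_estimate(CLAIMED_SIZES, coverage)
    print(f"Indexed Web (with stand-in sizes): {total:.2f} billion pages")
```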
Furthermore, by counting how many URLs in the test data are recognized by every search engine, we can estimate the portion of the Indexed Web shared by the four search engines. The estimated intersection of the engines' indexes turned out to be 28.85% of the Indexed Web, or about 2.7 billion pages.
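A minimal sketch of this counting step, assuming each record has been reduced to a pair of originating engine and GMTY flag string. The engine labels used as dictionary keys and the function name coverage_and_intersection are hypothetical, and the data files may spell the engine names differently.

```python
# Hypothetical engine labels mapped to their letter in the GMTY field.
FLAG_OF = {"Google": "G", "Msn Beta": "M", "Ask/Teoma": "T", "Yahoo!": "Y"}
ORDER = "GMTY"


def coverage_and_intersection(records):
    """Compute per-engine coverage and the all-engine intersection, in percent.

    records: iterable of (originating_engine, gmty) pairs, e.g. ("Google", "1011").
    URLs not recognized by their own originating engine are dropped first,
    as described in the URL normalization section.
    """
    kept = [
        gmty for engine, gmty in records
        if gmty[ORDER.index(FLAG_OF[engine])] == "1"
    ]
    if not kept:
        raise ValueError("no usable records")
    total = len(kept)
    coverage = {
        name: 100.0 * sum(g[ORDER.index(flag)] == "1" for g in kept) / total
        for name, flag in FLAG_OF.items()
    }
    # A URL is in the intersection when all four flags are set.
    shared = 100.0 * sum(g == "1111" for g in kept) / total
    return coverage, shared


if __name__ == "__main__":
    sample = [("Google", "1111"), ("Google", "1010"), ("Msn Beta", "0111"),
              ("Yahoo!", "1110"), ("Ask/Teoma", "1111")]
    print(coverage_and_intersection(sample))
```

Multiplying the shared percentage by the estimated size of the Indexed Web gives the page count of the intersection; with the figures above, 28.85% of 9.36 billion is about 2.7 billion.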
Bibliography
[1] A. Gulli and A. Signorini, Building an open source meta search engine [WWW2005]
[2] K. Bharat and A. Broder, A technique for measuring the relative size and overlap of public web search engines [WWW1998]
[3] S. Lawrence and C.L. Giles, Accessibility of information on the web [Nature 400:107-109, 1999]
[4] E. Selberg, Towards Comprehensive Web Search [PhD thesis, University of Washington, 1999]
[5] Lawrence Web site (http://www.neci.nj.nec.com/homepages/lawrence/)
[6] SearchEngineShowDown (http://searchengineshowdown.com/stats/)
[7] SearchEngineWatch (http://searchenginewatch.com/reports/article.php/2156481)