
How many documents are on the Web?

Fletcher Perri

June 30, 2011


We want to extrapolate from what we know about Google's index of documents to find the number of documents on the Web. There are specifically two things that we want to gather from Google's index: the number of indexed documents and the rate at which new documents are added to the index.

What is a document? We're not counting documents on the Deep Web. The Deep Web is connected to the Web only by mostly-barred gates; it is commonly thought to include access-controlled databases, peer-to-peer networks, and corporate intranets, among others. We're also not counting files that aren't human-readable; the file must be some sort of text document. Pictures, code, etc. are not counted. To a certain extent, we are counting spam, but only insofar as it masquerades as legitimate content in Google's index. In brief, again, we are considering human-readable, widely available text documents; this includes HTML files and Google-searchable PDFs. Probably the only other widely represented Google-searchable file format is the Microsoft Word .doc format.

We are also assuming some things about the structure of the Web. For one, we are assuming it is static: the total number N of documents does not change in our model. Since we have already discarded the Deep Web, we assume the Web is connected: you can get there from here.

Our next step is finding out about Google's index. There are no published statistics, so we need to hunt down the data on our own. Using conditional probabilities and assuming independence of search terms a and b, we can find I, the number of documents in Google's index. We first find f_a and f_b, the number of results in the index for search terms a and b respectively. Then we find f_ab, the number of results in the index when our query includes both a and b. Assuming independence, we have

    (f_a / I)(f_b / I) = f_ab / I.

Solving for I, we have

    I = f_a f_b / f_ab.

However, this did not work out as we hoped it would: the assumption of independence wasn't valid for our test search terms. In order to apply the independence assumption validly, we would have to analyze large corpora of text similar to documents found on the Web and find independent pairs of search terms there.
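To make the estimator concrete, here is a minimal Python sketch of the calculation. The hit counts are made-up placeholders rather than real query results; in practice f_a, f_b, and f_ab would be read off the reported result counts of three searches.

    def index_size_estimate(f_a, f_b, f_ab):
        """Estimate the index size I from result counts, assuming terms a and b
        occur independently across indexed documents:
            (f_a / I) * (f_b / I) = f_ab / I   =>   I = f_a * f_b / f_ab
        """
        if f_ab == 0:
            raise ValueError("terms never co-occur; the estimator is undefined")
        return f_a * f_b / f_ab

    # Hypothetical hit counts for two search terms and their conjunction.
    f_a, f_b, f_ab = 3.0e9, 2.0e8, 5.0e7
    print(f"estimated index size: {index_size_estimate(f_a, f_b, f_ab):.3e} documents")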

Fortunately, Maurice de Kunder (worldwidewebsize.org) had done exactly this. We used ImageJ (rsbweb.nih.gov/ij/) and the axis labels on a graph we found on de Kunder's website to get numbers for I, the size of Google's index:

    I = 1.513 × 10^10 docs,

and for dI/dt, the rate of growth of Google's index:

    dI/dt = 4.635 × 10^-7 docs/sec.

How will we extrapolate from these numbers to the number of documents on the Web? First, let's look at the citation network model. For the citation network model (as given in Tung, Topics in Mathematical Modeling), n is the total number of papers. The average number of citations per paper is m. The number of papers cited k times, or the number of papers of degree class k, is N_k. The proportion of papers cited k times is p_k = N_k / n. Using n as a timelike variable, we can write the change in the number of papers in a degree class k with respect to n as

    N_k[n+1] - N_k[n] = (n+1) p_k[n+1] - n p_k[n].   (1)

For n large, p_k[n+1] ≈ p_k[n], so we can write

    N_k[n+1] - N_k[n] ≈ p_k[n].   (2)

The fundamental assumption of this model is preferential attachment, or "the rich get richer." That is, we assume that the probability that a new paper (the (n+1)th paper) will cite a paper with k citations is proportional to k. However, this leaves us with a chicken-and-egg problem; viz., it's now impossible for a paper that's never been cited to be cited. We get around this by making the probability p_{cite from N_k} that a new paper will cite a paper with k citations proportional to k + 1 instead:

    p_{cite from N_k} = (k+1) p_k / Σ_k (k+1) p_k = (k+1) p_k / (m+1),   (3)

bearing in mind that Σ_k k p_k = m and Σ_k p_k = 1. To get the average number of new citations of existing papers, we multiply this probability by m:

    m p_{cite from N_k} = m (k+1) p_k / (m+1).   (4)

Now we use (4) to show the flow of papers into and out of a degree class k:

    N_k[n+1] - N_k[n] = (m/(m+1)) k p_{k-1}[n] - (m/(m+1)) (k+1) p_k[n].   (5)

Combining (2) and (5), we get

    p_k[n] ≈ (m/(m+1)) k p_{k-1}[n] - (m/(m+1)) (k+1) p_k[n].   (6)

Dropping the time-step n and setting the two sides equal,

    p_k = (m/(m+1)) k p_{k-1} - (m/(m+1)) (k+1) p_k.

Solving for p_k,

    p_k = (m/(m+1)) k p_{k-1} / [1 + (k+1) m/(m+1)]
        = k p_{k-1} / [(m+1)/m + k + 1]
        = k p_{k-1} / (2 + k + 1/m),

which, when solved in terms of p_0 and plotted, yields a power law for large k: p_k ∝ k^{-(2+1/m)}.
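As a quick check on that claim, the following sketch (with an arbitrary choice of m, purely for illustration) iterates the recurrence p_k = k p_{k-1} / (2 + k + 1/m) and estimates the tail exponent from a log-log slope; it should come out near -(2 + 1/m).

    import math

    # Iterate p_k = k * p_{k-1} / (2 + k + 1/m) and inspect the tail behaviour.
    m = 5.0              # arbitrary average number of citations per paper
    K = 10_000           # number of degree classes to compute

    p = [1.0]            # unnormalized p_0; normalization does not affect the slope
    for k in range(1, K):
        p.append(k * p[k - 1] / (2 + k + 1 / m))

    # For a power law p_k ~ k^(-a), log(p_k) is linear in log(k) with slope -a.
    k1, k2 = 1_000, 9_000
    slope = (math.log(p[k2]) - math.log(p[k1])) / (math.log(k2) - math.log(k1))
    print(f"empirical tail exponent: {slope:.3f}  (expected about {-(2 + 1/m):.3f})")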
We end up borrowing some of the same terminology for our model. Our timelike variable is c, the total number of unique crawled documents. We let I_k[c] be the number of documents in our index of degree class k when we've crawled c unique documents. N is the total number of documents, indexed and unindexed. The proportion of total documents of degree class k is p_k. We want to use this same preferential attachment assumption and say that, given a document in some degree class k, the probability that it's already in the index is proportional to k. We don't, however, use this directly. Instead we simply say that the probability that a document of some degree class k is already in the index is what one might think it is:

    p_{in N & in I_k} = I_k[c] / (N p_k).   (7)

Now we go back to the citation network model, and calculate the change in the number of documents in our index of a given degree class k with respect to our time-step as

    I_k[c+1] - I_k[c] = p_k (1 - I_k[c] / (N p_k)),   (8)

where 1 - I_k[c]/(N p_k) is the probability that a document of degree class k is not already in the index. We multiply by the proportion of total documents in our degree class to serve as a relative frequency; this is really only an expected change, as we don't actually know ahead of time whether the crawled document is of degree class k. Rewriting:

    I_k[c+1] = I_k[c] + p_k (1 - I_k[c] / (N p_k))
             = I_k[c] + p_k - I_k[c]/N
             = p_k + ((N-1)/N) I_k[c].

Here, for convenience of manipulation, we define P = p_k and B = (N-1)/N. Now, subtracting 1 from our time-step c and temporarily disregarding the degree class k, we have

    I_c = B I_{c-1} + P.   (9)
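Before solving the recurrence in closed form, a minimal numerical sketch shows its behaviour; N and p_k below are toy values chosen only for illustration, and the degree class starts out unindexed.

    # Iterate I_c = B*I_{c-1} + P for a toy degree class and watch it
    # saturate at N*p_k (every document of the class is eventually indexed).
    N = 1_000_000        # toy total number of documents
    p_k = 0.01           # toy proportion of documents in this degree class
    P, B = p_k, (N - 1) / N

    I = 0.0              # the index starts empty
    for c in range(1, 5_000_001):
        I = B * I + P
        if c % 1_000_000 == 0:
            print(f"c = {c:>9,}:  I_c = {I:12,.1f}   (N*p_k = {N * p_k:,.0f})")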

We can see that the initial condition should be I_0 = 0. Now, we use the method of generating functions:

    G(x) = Σ_{c=0}^∞ I_c x^c
         = 0 + Σ_{c=1}^∞ (B I_{c-1} + P) x^c
         = Bx Σ_{c=1}^∞ I_{c-1} x^{c-1} + P Σ_{c=1}^∞ x^c
         = Bx G(x) + P x/(1 - x)

(with the assumption that |x| < 1). Solving for G(x),

    G(x) = [P/(1 - Bx)] [x/(1 - x)].

After partial fraction decomposition, we find

    G(x) = [P/(1 - B)] [1/(1 - x) - 1/(1 - Bx)]
         = [P/(1 - B)] [Σ_{c=0}^∞ x^c - Σ_{c=0}^∞ (Bx)^c]
         = [P/(1 - B)] Σ_{c=0}^∞ (1 - B^c) x^c,

and recalling that G(x) = Σ_{c=0}^∞ I_c x^c, we have

    Σ_{c=0}^∞ I_c x^c = [P/(1 - B)] Σ_{c=0}^∞ (1 - B^c) x^c,   (10)

so we can equate coefficients and see

    I_c = [P/(1 - B)] (1 - B^c).   (11)
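These manipulations are easy to double-check symbolically. The sketch below uses sympy (a tooling choice of ours, not something the derivation depends on) to confirm the partial fraction decomposition and to spot-check that the series coefficients of G(x) match (11).

    import sympy as sp

    x, B, P = sp.symbols('x B P')

    # G(x) obtained from solving G = B*x*G + P*x/(1 - x).
    G = P * x / ((1 - x) * (1 - B * x))

    # Claimed partial fraction decomposition.
    decomp = P / (1 - B) * (1 / (1 - x) - 1 / (1 - B * x))
    print(sp.cancel(G - decomp))              # 0  => the decomposition is correct

    # The coefficient of x^c in G(x) should equal P/(1-B) * (1 - B^c).
    expansion = sp.series(G, x, 0, 6).removeO()
    for c in range(6):
        diff = sp.cancel(expansion.coeff(x, c) - P / (1 - B) * (1 - B**c))
        print(c, diff)                        # all 0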

Checking with induction, we see our base case works out nicely:

    I_0 = [P/(1 - B)] (1 - B^0) = 0.   (12)

We assume that our expression (11) agrees with the recurrence (9) for I_c, and check with I_{c+1}:

    I_{c+1} = [P/(1 - B)] (1 - B^{c+1})
            = [P/(1 - B)] (1 - B B^c)
            = [P/(1 - B)] (1 - B B^c + B - B)
            = [P/(1 - B)] (1 - B) + B [P/(1 - B)] (1 - B^c)
            = P + B [P/(1 - B)] (1 - B^c)
            = P + B I_c.

This agrees with our recurrence at (9). Returning to our original notation and making substitutions for P and B, we get

    I_k[c] = N p_k (1 - ((N-1)/N)^c).   (13)

Summing over degree classes k, we use Σ_k p_k = 1 and write I[c] = Σ_k I_k[c] = I:

    I = N (1 - ((N-1)/N)^c).   (14)
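Equation (14) amounts to the expected number of distinct documents seen after c uniform random draws, with replacement, from a pool of N. A small simulation (toy values of N and c, chosen so it runs quickly) makes that concrete:

    import random

    # Simulate crawling c documents uniformly at random from N, and compare the
    # number of distinct documents indexed with I = N * (1 - ((N-1)/N)**c).
    N, c, trials = 10_000, 25_000, 20

    observed = 0.0
    for _ in range(trials):
        seen = set()
        for _ in range(c):
            seen.add(random.randrange(N))
        observed += len(seen) / trials

    predicted = N * (1 - ((N - 1) / N) ** c)
    print(f"simulated index size: {observed:10.1f}")
    print(f"equation (14):        {predicted:10.1f}")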

However, we have no idea what c is! Although I is dependent on c, we won't even investigate this, and just use the estimate for I at the present time from the data we found. There's not necessarily any simple relationship between the number of documents crawled by Google and the size of their index or the rate at which their index is growing. We've only assumed that they pick documents at random from those documents available on the internet. We have no dependence on how documents link to each other, other than in an abstract sense, as preferential attachment forms a basis for our model. While we have no way of guessing a number for how many documents Google has crawled, it's much more realistic for us to estimate their current crawl rate in docs/sec. So, we assume that c is large enough that we can treat our function as continuous, and we differentiate:

    dI/dc = -N ((N-1)/N)^c log((N-1)/N).   (15)

We use the chain rule to write

    dI/dc = (dI/dt)(dt/dc),   (16)

where dI/dt is our known current rate of index growth, and 1/(dt/dc) is our plausibly-estimated current crawl rate. If we rewrite equation (14) as

    N ((N-1)/N)^c = N - I,   (17)

we can now combine (15), (16), and (17) to write

    (dI/dt)(dt/dc) = -N ((N-1)/N)^c log((N-1)/N)
                   = -(N - I) log((N-1)/N)
                   = (N - I) log(N/(N-1)).

Since N is presumably very large, we can make an asymptotic approximation:

    (dI/dt)(dt/dc) = (N - I) log(N/(N-1))
                   = (N - I) log(1/(1 - 1/N))
                   ≈ (N - I) log(e^{1/N})
                   = (N - I)/N.

Solving for N, we get

    N ≈ I / (1 - (dI/dt)(dt/dc)),   (18)

which is a fairly bald statement; it in effect says Google has indexed the entire Web. If, as suggested by the data we found, we let I (our index size) be 1.513 × 10^10 docs, let dI/dt (our index growth rate) be 4.635 × 10^-7 docs/sec, and then let our crawl rate, 1/(dt/dc), be any positive number, we see N ≈ I.

What conclusions can we draw? Well, the possibilities for error are mostly twofold. One, the data we found might not have a firm connection to reality. We can only really find out by doing our own testing in that arena. Two, our model may not be realistic, or may be flawed in a more subtle way than we can easily see. With the data we have right now, specifically the extremely low rate of index growth we found, the answer our model is giving seems to make sense.

Thankfully for our model, it is basically impossible to get into a situation where (dI/dt)(dt/dc) > 1, that is, a situation where we are adding documents to our index faster than we are crawling them. If we somehow got into this situation, N would go negative and not be a representation of reality in any sense. Again, however, from the way our model is set up, we always get N ≥ I given (dI/dt)(dt/dc) < 1, which is as it should be. Part of the failure condition of our model, when (dI/dt)(dt/dc) = 1, actually makes sense in a way: when every document we crawl gets added to the index, the index is growing as fast as it can and we can't have any idea of how large N is, just that we might be able to tell once index growth relative to crawl rate slows down.
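As a closing sanity check, here is a minimal numeric sketch of equation (18) using the figures quoted above; the crawl rates are made-up stand-ins for the "any positive number" in the argument, and the point is simply that N comes out indistinguishable from I.

    # Evaluate N ~ I / (1 - (dI/dt)*(dt/dc)) from equation (18) for the
    # measured index size and growth rate, over a range of assumed crawl rates.
    I = 1.513e10          # index size, docs (read off de Kunder's graph)
    dI_dt = 4.635e-7      # index growth rate, docs/sec (same graph)

    for crawl_rate in (1.0, 1e3, 1e6):      # assumed crawl rates, docs/sec
        dI_dc = dI_dt / crawl_rate          # index growth per crawled document
        N = I / (1 - dI_dc)
        print(f"crawl rate {crawl_rate:>9,.0f} docs/sec  ->  N = {N:.6e} docs")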
