Fortunately, Maurice de Kunder (worldwidewebsize.org) had done exactly this. We used ImageJ (rsbweb.nih.gov/ij/) and the axis labels on a graph we found on de Kunder's website to get numbers for I, the size of Google's index:

    I = 1.513 × 10^10 docs,

and for dI/dt, the rate at which the index is growing:

    dI/dt = 4.635 × 10^7 docs/sec.
How will we extrapolate from these numbers to the number of documents on the Web? First, let's look at the citation network model (as given in Tung, Topics in Mathematical Modeling). Here n is the total number of papers and m is the average number of citations per paper. The number of papers cited k times, or the number of papers of degree class k, is N_k, so the proportion of papers cited k times is p_k = N_k/n. Using n as a timelike variable, we can show the change in the number of papers in a degree class k with respect to n as

    N_k[n+1] - N_k[n] = (n+1) p_k[n+1] - n p_k[n].   (1)

For n large, p_k[n+1] ≈ p_k[n], so we can write

    N_k[n+1] - N_k[n] ≈ p_k[n].   (2)
The fundamental assumption of this model is preferential attachment, or "the rich get richer." That is, we assume that the probability that a new paper (the (n+1)th paper) will cite a paper with k citations is proportional to k. However, this leaves us with a chicken-and-egg problem; viz., it is now impossible for a paper that has never been cited to be cited. We get around this by making the probability p_{cite from N_k} that a new paper will cite a paper with k citations proportional to k + 1, instead:

    p_{cite from N_k} = (k+1) p_k / Σ_k (k+1) p_k = (k+1) p_k / (m+1),   (3)
bearing in mind that Σ_k k p_k = m and Σ_k p_k = 1. To get the average number of new citations of existing papers, we multiply this probability by m:

    m · p_{cite from N_k} = m (k+1) p_k / (m+1).   (4)
Now we use (4) to show the flow of papers into and out of a degree class k:

    N_k[n+1] - N_k[n] = (m/(m+1)) k p_{k-1}[n] - (m/(m+1)) (k+1) p_k[n].   (5)

Combining (2) and (5), we get

    p_k[n] ≈ (m/(m+1)) k p_{k-1}[n] - (m/(m+1)) (k+1) p_k[n].   (6)
Dropping the time-step n and setting the sides equal,

    p_k = (m/(m+1)) k p_{k-1} - (m/(m+1)) (k+1) p_k,

so

    p_k (1 + (m/(m+1))(k+1)) = (m/(m+1)) k p_{k-1}
    p_k = k p_{k-1} / ((m+1)/m + k + 1)
        = k p_{k-1} / (2 + k + 1/m),

which, when solved in terms of p_0 and plotted, yields a power law for large k: p_k ∝ k^{-(2+1/m)}.

We end up borrowing some of the same terminology for our model. Our timelike variable is c, the total number of unique crawled documents. We let I_k[c] be the number of documents in our index of degree class k when we've crawled c unique documents. N is the total number of documents, indexed and unindexed. The proportion of total documents of degree class k is p_k. We want to use this same preferential attachment assumption and say that, given a document in some degree class k, the probability that it's already in the index is proportional to k. We don't, however, use this directly. Instead we simply say that the probability that a document of some degree class k is already in the index is what one might think it is:

    p_{in N & in I_k} = I_k[c] / (N p_k).   (7)

Now we go back to the citation network model, and calculate the change in the number of documents in our index of a given degree class k with respect to our time-step as

    I_k[c+1] - I_k[c] = p_k (1 - I_k[c] / (N p_k)),   (8)
where 1 - I_k[c]/(N p_k) is the probability that a document of degree class k is not already in the index. We multiply by the proportion of total documents in our degree class to serve as a relative frequency; this is really only an expected change, as we don't actually know ahead of time whether the crawled document is of degree class k. Rewriting:

    I_k[c+1] = I_k[c] + p_k (1 - I_k[c] / (N p_k)).
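Stepping back to the citation-network result for a moment: the claimed power-law exponent can be sanity-checked by iterating the recurrence p_k = k p_{k-1}/(2 + k + 1/m) numerically. A minimal sketch, where m = 10 is an arbitrary choice and p_0 is left unnormalized:

```python
# Iterate the recurrence p_k = k * p_{k-1} / (2 + k + 1/m) and compare
# the tail against the predicted power law p_k ~ k^-(2 + 1/m).
import math

m = 10.0   # assumed average citations per paper (arbitrary test value)
p = [1.0]  # p_0, up to an overall normalization constant
for k in range(1, 2001):
    p.append(k * p[k - 1] / (2 + k + 1 / m))

# Estimate the local log-log slope deep in the tail.
k1, k2 = 1000, 2000
slope = (math.log(p[k2]) - math.log(p[k1])) / (math.log(k2) - math.log(k1))
print(f"fitted exponent {slope:.3f}, predicted {-(2 + 1 / m):.3f}")
```

The fitted slope converges to -(2 + 1/m) as k grows, as the algebra above predicts.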
Here, for convenience of manipulation, we define P = p_k and B = (N-1)/N; expanding the right-hand side above gives I_k[c+1] = (1 - 1/N) I_k[c] + p_k = B I_k[c] + P. Now, subtracting 1 from our time-step c and temporarily disregarding the degree class k, we have

    I_c = B I_{c-1} + P.   (9)
We can see that the initial condition should be I_0 = 0. Now, we use the method of generating functions:

    G(x) = Σ_{c=0}^∞ I_c x^c
         = I_0 + Σ_{c=1}^∞ (B I_{c-1} + P) x^c
         = 0 + Bx Σ_{c=1}^∞ I_{c-1} x^{c-1} + P Σ_{c=1}^∞ x^c
         = Bx G(x) + P x/(1 - x),

so that

    G(x) = (P/(1 - Bx)) (x/(1 - x)).

Expanding by partial fractions,

    G(x) = (P/(1 - B)) (1/(1 - x) - 1/(1 - Bx))
         = (P/(1 - B)) (Σ_{c=0}^∞ x^c - Σ_{c=0}^∞ (Bx)^c)
         = (P/(1 - B)) Σ_{c=0}^∞ (1 - B^c) x^c.   (10)

Since G(x) = Σ_{c=0}^∞ I_c x^c, matching coefficients of x^c gives

    I_c = (P/(1 - B)) (1 - B^c).   (11)
Checking with induction, we see our base case works out nicely:

    I_0 = (P/(1 - B)) (1 - B^0) = 0.   (12)
We assume that our expression (11) agrees with the recurrence (9) for I_c, and check I_{c+1}:

    I_{c+1} = (P/(1 - B)) (1 - B^{c+1})
            = (P/(1 - B)) (1 - B·B^c)
            = (P/(1 - B)) (1 - B·B^c + B - B)
            = (P/(1 - B)) (1 - B) + B (P/(1 - B)) (1 - B^c)
            = P + B (P/(1 - B)) (1 - B^c)
            = P + B I_c.
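The induction can also be confirmed numerically, by iterating the recurrence (9) directly and comparing against the closed form (11). A quick sketch, with P and B set to arbitrary test values:

```python
# Iterate the recurrence I_c = B * I_{c-1} + P from (9) and compare the
# result against the closed form I_c = P/(1-B) * (1 - B**c) from (11).
P = 0.37    # arbitrary test value standing in for p_k
B = 0.999   # arbitrary test value standing in for (N-1)/N
steps = 500

I_rec = 0.0  # initial condition I_0 = 0
for _ in range(steps):
    I_rec = B * I_rec + P

I_closed = P / (1 - B) * (1 - B ** steps)
print(I_rec, I_closed)  # the two agree to floating-point precision
```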
This agrees with our recurrence at (9). Returning to our original notation and making substitutions for P and B, we get

    I_k[c] = N p_k (1 - ((N-1)/N)^c).   (13)

Summing over degree classes k, we get

    I = N (1 - ((N-1)/N)^c).   (14)
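Equation (14) has a simple interpretation: it is the expected number of distinct documents seen after c uniform random draws, with replacement, from a pool of N. A small simulation illustrates the agreement (N, c, and the number of trials are arbitrary test values):

```python
# Crawl c documents uniformly at random (with replacement) from a pool
# of N, and compare the number of distinct documents indexed against
# the model's prediction I = N * (1 - ((N-1)/N)**c) from (14).
import random

random.seed(0)
N, c, trials = 10_000, 15_000, 50

avg_distinct = sum(
    len({random.randrange(N) for _ in range(c)}) for _ in range(trials)
) / trials

predicted = N * (1 - ((N - 1) / N) ** c)
print(avg_distinct, round(predicted, 1))
```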
However, we have no idea what c is! Although I is dependent on c, we won't even investigate this, and just use the estimate for I at the present time from the data we found. There's not necessarily any simple relationship between the number of documents crawled by Google and the size of their index or the rate at which their index is growing. We've only assumed that they pick documents at random from those available on the internet. We have no dependence on how documents link to each other, other than in an abstract sense, as preferential attachment forms a basis for our model. While we have no way of guessing a number for how many documents Google has crawled, it's much more realistic for us to estimate their current crawl rate in docs/sec. So, we assume that c is large enough that we can treat our function as continuous, and we differentiate:

    dI/dc = -N log((N-1)/N) ((N-1)/N)^c.   (15)

We use the chain rule to write

    dI/dc = (dI/dt)(dt/dc),   (16)
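As a spot check, the derivative (15) can be compared against a finite difference of (14); a minimal sketch with arbitrary values of N and c:

```python
# Spot-check the derivative (15), dI/dc = -N * log((N-1)/N) * ((N-1)/N)**c,
# against a central difference of I(c) = N * (1 - ((N-1)/N)**c) from (14).
import math

N = 1.0e6   # arbitrary test value for the total number of documents
c = 2.5e6   # arbitrary test value for the number of crawled documents
B = (N - 1) / N

def index_size(c):
    return N * (1 - B ** c)

analytic = -N * math.log(B) * B ** c
numeric = index_size(c + 0.5) - index_size(c - 0.5)  # step of 1 doc
print(analytic, numeric)
```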
where dI/dt is our known current rate of index growth, and 1/(dt/dc) is our plausibly-estimated current crawl rate. If we rewrite equation (14) as

    N ((N-1)/N)^c = N - I,   (17)

then (15) becomes

    dI/dc = -(N - I) log((N-1)/N),

and so, by (16),

    (dI/dt)(dt/dc) = -(N - I) log((N-1)/N).
Since N is presumably very large, we can make an asymptotic approximation, log((N-1)/N) = log(1 - 1/N) ≈ -1/N, giving

    (dI/dt)(dt/dc) = -(N - I) log((N-1)/N) ≈ (N - I)/N = 1 - I/N,   (18)
which is a fairly bald statement; it in effect says Google has indexed the entire Web. If, as suggested by the data we found, we let I (our index size) be 1.513 × 10^10 docs, let dI/dt (our index growth rate) be 4.635 × 10^7 docs/sec, and then let our crawl rate, 1/(dt/dc), be any positive number, we see N ≈ I. What conclusions can we draw? Well, the possibilities for error are mostly twofold. One, the data we found might not have a firm connection to reality. We can only really find out by doing our own testing in that arena. Two, our model may not be realistic, or may be flawed in a more subtle way than we can easily see. With the data we have right now, specifically the extremely low rate of index growth we found, the answer our model is giving seems to make sense. Thankfully for our model, it is basically impossible to get into a situation where (dI/dt)(dt/dc) > 1, that is, a situation where we are adding documents to our index faster than we are crawling them. If we somehow got into this situation, N would go negative and not be a representation of reality in any sense. Again, however, from the way our model is set up, we always get N ≥ I given (dI/dt)(dt/dc) < 1, which is as it should be. Part of the failure condition of our model, when (dI/dt)(dt/dc) = 1, actually makes sense in a way: when every document we crawl gets added to the index, the index is growing as fast as it can, and we can't have any idea of how large N is, just that we might be able to tell once index growth relative to crawl rate slows down.
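For concreteness, (18) can be inverted to give N = I / (1 - (dI/dt)/r), where r = 1/(dt/dc) is the crawl rate. A minimal sketch; the crawl rates tried below are assumed placeholders, not measured values:

```python
# Invert (18): (dI/dt)(dt/dc) = 1 - I/N  =>  N = I / (1 - (dI/dt)/r),
# where r = 1/(dt/dc) is the crawl rate in docs/sec.
def estimate_total_docs(index_size, index_growth_rate, crawl_rate):
    """Estimate N, the total number of documents on the Web.

    Only valid while index_growth_rate < crawl_rate; at equality we hit
    the model's failure condition and N is undefined.
    """
    ratio = index_growth_rate / crawl_rate
    if ratio >= 1:
        raise ValueError("index cannot grow as fast as, or faster than, we crawl")
    return index_size / (1 - ratio)

I = 1.513e10     # Google's index size in docs, from de Kunder's data
dI_dt = 4.635e7  # index growth rate in docs/sec, from the same data
for r in (1e9, 1e10, 1e11):  # assumed placeholder crawl rates, docs/sec
    print(f"crawl rate {r:.0e} docs/sec -> N = {estimate_total_docs(I, dI_dt, r):.4e}")
```

The faster the assumed crawl rate is relative to dI/dt, the closer the estimate comes to N = I, in line with the conclusion above.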