


lucky.igen@gmail.com | khushpar@gmail.com | yuvistreet@gmail.com | nikhilbhat2008@gmail.com

Mentor: Bhanukiran Vinzamuri | Group ID: Web Miners

Abstract: A crawler is a computer program that browses the Web in a methodical, automated manner. Our task here is to build a priority crawler that crawls the URL data set provided for 120 days and selectively chooses which URLs to keep in a limited-size queue, based on data quality and the information each URL provides, so that we gain a better view of how a crawler works.

We carried out this implementation to understand the working of a priority-based crawler and to find out what improvements could make the crawling process more time-efficient. Since the experimentation data was huge, with 16,000-20,000 URLs per day and over 3.2 million attributes per URL, the biggest challenge was obviously handling such a large and anonymous amount of data. The task became even more interesting because we had no prior experience of handling data at this scale.

KEYWORDS: Crawler, URL, Feature Selection, SVM.


The data set provided to us was anonymous and unstructured. It consisted of 121 files in SVM format, named DayX.svm where X ranges from 0 to 120. Each file contains 16,000-20,000 URLs, and each URL has up to 3.2 million attributes. The most challenging tasks were correctly understanding the data set and performing correct feature selection, since the attributes were anonymous and their properties unknown. The only information directly provided was whether a URL is benign or malignant, indicated by +1 or -1 respectively at the beginning of the line.


In this project, we have found the top 10 URLs for each day over the 121 days of data by appropriate feature selection. The crawler we implemented had to be time-efficient, since the results for a query in the real world should be instantaneous as well as accurate in finding the top 20 URLs. The main modules of our implementation are:

Extraction of key-value pairs using regular expressions (REGEX)

Removal of unnecessary data

Calculation of entropy as well as a weighted sum for each URL

An interactive crawling module


Pre-processing:

Since the experimentation data is noisy, incomplete, and inconsistent, we have to make it ready to use by pre-processing, i.e. eliminating the unimportant and redundant data, since quality decisions must be based on quality data.

First, to deal with the incompleteness of the data, we extracted all the <key : value> pairs using regular-expression matching and stored them in a 2-D hash map. By doing this we did not have to worry about incompleteness and did not have to manually add 0s in place of missing attributes, saving a lot of precious pre-processing time.

Secondly, we did not want highly redundant data to appear in our calculations every time, so we calculated the standard deviation and mean of every attribute over all the URLs and removed those features with standard deviation = 0 or mean = 1, since both imply a high number of identical occurrences.

Even after these steps, some redundancy remained across columns. Since columns represent attributes of a particular line/row, it is important to remove attributes that have only a minute effect on the actual data: if more than 40% of the lines share an attribute, it may be considered unimportant.

Finally, the most important matter of concern was that we did not want to lose rare attributes of high importance, so we kept any attribute present in less than 1% of the total number of lines, since such attributes carry a very high importance value.

The whole pre-processing step can be summed up mathematically as: discard an attribute if

((SD == 0 || M == 1 || f > 0.4) && !(f < 0.01))

where SD = standard deviation of the attribute, M = mean of the attribute, and f = frequency of the attribute over the total number of lines.
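The pre-processing rule above can be sketched as follows. The original crawler was written in Perl; this is an illustrative Python re-implementation under our own assumptions about the SVM-format lines (a +1/-1 label followed by sparse attr:value pairs). The function and variable names are ours, not the authors'.

```python
import re
import math
from collections import defaultdict

def parse_day(lines):
    """Parse SVM-style lines ('+1 4:0.25 7:1 ...') into (label, {attr: value})
    pairs, skipping absent attributes entirely instead of padding with 0s."""
    pair_re = re.compile(r"(\d+):([\d.eE+-]+)")
    urls = []
    for line in lines:
        label = 1 if line.lstrip().startswith("+1") else -1
        attrs = {int(k): float(v) for k, v in pair_re.findall(line)}
        urls.append((label, attrs))
    return urls

def select_attributes(urls):
    """Apply the rule from the text: discard an attribute when
    (SD == 0 || M == 1 || f > 0.4) && !(f < 0.01).
    Mean/SD are computed over the URLs that contain the attribute
    (an assumption, since absent attributes are not padded with 0s)."""
    values = defaultdict(list)
    for _, attrs in urls:
        for a, v in attrs.items():
            values[a].append(v)
    n = len(urls)
    keep = set()
    for a, vs in values.items():
        f = len(vs) / n                              # fraction of lines with a
        m = sum(vs) / len(vs)                        # mean of a's values
        sd = math.sqrt(sum((v - m) ** 2 for v in vs) / len(vs))
        if (sd == 0 or m == 1 or f > 0.4) and not f < 0.01:
            continue                                 # redundant and not rare
        keep.add(a)
    return keep
```

A constant-valued attribute (SD = 0) or a very frequent one (f > 0.4) is dropped unless it is rare enough (f < 0.01) to be kept for its high importance value.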

Not all pages are necessarily of equal interest to a crawler's client. For instance, if the client is building a specialized database on a particular topic, then pages that refer to that topic are more important and should be visited as early as possible. To make this possible, we must find the important attributes in the data set; the URL containing more important attributes should be fetched first. To implement this, we have considered two important heuristics. o Entropy

The main heuristic we used is an entropy metric computed over the 16,000 URLs and the set of attributes left after pre-processing:

Entropy = - Σ p_i log(p_i)

where p_i = (number of URLs in which the attribute takes this value) / (total number of URLs containing that attribute). For example, consider line number 100. If it contains att1:100 att2:50, and att1 is present in 1000 URLs with the value 100 occurring 300 times, while att2 is present in 400 URLs with the value 50 occurring only twenty times, then the entropy score for this URL is -[(300/1000) log(300/1000) + (50/400) log(50/400)]. o Weighted sum
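The entropy scoring described above can be sketched as follows. This is an illustrative Python version under our assumptions (the original is in Perl); each URL is represented simply as a dict of attr -> value.

```python
import math
from collections import Counter

def entropy_scores(urls):
    """Score each URL by the entropy-style metric from the text: for every
    attr:value pair, p = (# URLs where the attr takes this exact value)
    / (# URLs containing the attr); the URL's score is -sum(p * log p)."""
    attr_count = Counter()   # how many URLs contain each attribute
    pair_count = Counter()   # how many URLs contain each attr:value pair
    for attrs in urls:
        for a, v in attrs.items():
            attr_count[a] += 1
            pair_count[(a, v)] += 1
    scores = []
    for attrs in urls:
        s = -sum(
            (pair_count[(a, v)] / attr_count[a])
            * math.log(pair_count[(a, v)] / attr_count[a])
            for a, v in attrs.items()
        )
        scores.append(s)
    return scores
```

Two passes suffice: one to count attribute and attr:value occurrences over all URLs, one to score each URL from those counts.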

The second heuristic assigns a weighted sum to each URL based on the goodness of the attributes present in it. In this method, we rank attributes according to how rare they are and calculate a score for each URL accordingly. Consider again the example of line number 100: if it contains att1:100 att2:50, and att1 is present in 50 URLs while att2 is present in 200 URLs, then the weighted sum for this URL is 100/50 + 50/200. After computing the entropy or weighted sum, we sort the URLs by these values. After all the processing, an interactive user-driven crawling module appears, asking whether to fetch or drop n URLs, and correspondingly displays the results as the top 20 URLs.
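The weighted-sum heuristic and the final sorting step can be sketched as follows, again as an illustrative Python re-implementation (function names are ours):

```python
from collections import Counter

def weighted_sum_scores(urls):
    """Weighted-sum heuristic from the text: each attribute contributes
    value / (# URLs containing that attribute), so rare attributes with
    large values dominate. `urls` is a list of {attr: value} dicts."""
    attr_count = Counter()
    for attrs in urls:
        for a in attrs:
            attr_count[a] += 1
    return [sum(v / attr_count[a] for a, v in attrs.items()) for attrs in urls]

def top_k(urls, scores, k=20):
    """Sort URL indices by score, descending, and return the first k,
    mirroring the final 'top 20 URLs' step."""
    return sorted(range(len(urls)), key=lambda i: scores[i], reverse=True)[:k]
```

For the line-100 example above, att1:100 seen in 50 URLs and att2:50 seen in 200 URLs yields exactly 100/50 + 50/200.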



We have implemented our priority crawler in Perl, as it contains many built-in features, such as regular expressions, for handling such huge and anonymous data. Also, instead of traditional data structures like queues and stacks, we used hash maps for storing the attribute values, so that their occurrences across different URLs can be compared easily and accurate results found.



Calculation by weighted sum takes almost half the run time of the entropy heuristic, but we also saw a drastic change in the URL queue. The total run time of the code depends on the day being processed; the average run time is around 250 seconds to fetch the top 20 URLs for each day's data. A snapshot of the results over the data of Day 120:

URL ID   Entropy value
11225    3.60009816371488
2794     3.60009816371488
3540     3.53217880206689
3466     3.53217880206689
3570     3.53217880206689
3557     3.53217880206689
3550     3.53217880206689
3460     3.53217880206689
14279    3.51733817831168
13950    3.49387686344043
11721    3.39696069252859
6277     3.3452109708853

[Figure: scatter plot of Day (x-axis) vs. run time in seconds (y-axis); run times range roughly from 0 to 500 seconds.]




For Day 0, the variation in entropy values can be enumerated as follows: 4.06722374538319, 4.03754144448823, 3.97515162331633, 3.90347165343665, 3.90347165343665, 3.89863541884464, 3.77368213065205, 3.76249595546937, 3.76249595546937, 3.71368047511625, 3.71368047511625, 3.70846175620895, 3.70846175620895, 3.6962117775535, 3.61183637301763, 3.61183637301763, 3.60009816371488, 3.60009816371488, 3.60009816371488

Snapshot of the 2nd heuristic, i.e. weighted sum, on the data of Day 120 (top URL IDs in order): 7289, 3237, 1243, 3492, 3454, 14133, 13922, 3374, 3297, 3694, 3465, 4694, 4658, 2764, 9955, 10357, 11144, 13441, 2721

Surveying other possible approaches is an important part of this paper, since we must weigh the pros and cons of our approach to improve it in future. In this paper, we have surveyed five possible approaches for a priority-based crawler and compared them with the method we have implemented.

Team Quarks first pre-processes the data by removing the threshold 0s and inserting 0s in place of incomplete data, then normalizes all features to remove the redundant ones. Their pre-processing takes 1.5 hours, which is definitely not preferable when designing something as time-constrained as a crawler. After pre-processing they arrive at an unbelievably small figure of 23 important attributes out of the total 3.2 million. Calculating the entropy metric of 16,000 URLs over these 23 attributes, they produce the top 100 URLs in 5 minutes in total.

Team Spiders 2011 does simple pre-processing by adding 0s in place of incomplete data, which takes around 1 minute, and removing redundancy, which takes around 3 minutes. This leaves the top 84,000 attributes, on which they calculate entropy gain and information gain over just 1,000 URLs to give the top 100 URLs. There are two loopholes in this approach. First, their pre-processing takes no care to save the rarely occurring but important data, which could be removed during redundancy removal. Second, they work on only 1,000 URLs of Day 0; since all 16,000 URLs are not considered, the quality of their results is necessarily affected.

The third approach we consider is that of team SKIM. In pre-processing, they consider only the 67,000 attributes occurring in 1,000 randomly selected URLs. On those attributes they perform redundant-0 removal and variance classification, i.e. an attribute is removed if its variance exceeds 0.9. They then apply PCA dimension reduction on three matrices of size 1000x100, 500x100 and 100x100, and at the final step calculate the importance of each URL by the formula

Importance of a URL = Σ (eigenvalue of a feature × feature value)

where the summation is over all features in that URL. Finally, they sort the URLs in descending order of importance to present the top 10. The main flaw, in our view, is the random sampling of 1,000 URLs, because of which the final result depends only on that sample and not on the complete data set, though their idea of PCA reduction is appreciable since it gives an accurate idea of a URL's importance.

The fourth team we surveyed is A3M. They work on only 5,040 URLs obtained by random sampling from the total of 16,000, and consider only the features occurring in those URLs, reducing the number of attributes from 3.2 million to 33,000. After calculating the information gain of every feature, they select the top 100 features sorted by entropy gain. They then take only those URLs labelled +1 (there were 5,963 of them), and assign each feature a value of 0 or 1 depending on its occurrence in these URLs (0 if the feature is absent from the URL). An importance matrix of size 5963x100 is then formed, in which importance is calculated through entropy, and finally the top 100 URLs are ranked by importance value. Team A3M can be criticised for neglecting the importance of pre-processing and relying on random sampling, which may or may not give reliable results.

The last team we surveyed is Web Spiders. They work on only the top 1,000 attributes and 50 URLs. In pre-processing, they use the naive approach of inserting 0s in place of missing data. They then calculate the entropy of each attribute, subtract from it the conditional entropy, and compute the information gain for each attribute. The attributes are then bubble-sorted, and at the last step IR metrics are applied to the URLs to select them for enqueuing and dequeuing. Although they applied really good techniques such as entropy, information gain and IR metrics, they should have worked on the whole range of attributes and URLs to get a clear picture of the crawler they designed and to obtain more reliable results.
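SKIM's final scoring rule, as we understand it from their description, can be sketched in a few lines. This is our illustrative Python reading, not their code; the eigenvalues would come from their PCA step and are assumed given here.

```python
def skim_importance(url_attrs, eigenvalue):
    """Sketch of SKIM's rule as described above: importance of a URL is the
    sum, over the features present in it, of (PCA eigenvalue of the feature)
    * (feature value). `eigenvalue` maps feature id -> eigenvalue; features
    with no eigenvalue (dropped by PCA) contribute nothing."""
    return sum(eigenvalue.get(a, 0.0) * v for a, v in url_attrs.items())
```

URLs would then be sorted in descending order of this score to present the top 10.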

Finally, we can conclude that the teams we surveyed either sampled some number k of URLs or took a fixed number of attributes, while we, on the other hand, took care of all attributes, reduced them, and then assigned a weighted sum to each URL based on the goodness of its attributes, so as to have more reliable and accurate results. We can also sum up the time taken by our approach and the others, to get a clear picture of which approach is better:

Team Name      Pre-processing time   Time to display results after pre-processing
Web Miners     18-20 seconds         around 250 seconds
Quarks         1.5 hours             5 minutes
SPIDERS 2011   1-2 minutes           10 minutes
SKIM           N/A                   6.712 seconds
Web Spiders    N/A                   1-2 seconds
A3M            N/A                   0.386 seconds

1) http://en.wikipedia.org/wiki/Web_crawler
2) Slides and class notes of the Web Mining and Knowledge Management course
3) Information provided by other coding teams