
A Comprehensive and Relatively Fast Web Archiving Application for Offline Browsing

Mr. B. Vijaya Babu 1 and Prof. M. S. Prasad Babu 2

1 Professor, Dept. of Computer Science, Usha Rama College of Engineering & Technology, Telaprolu, Near Gannavaram, Krishna (Dt.)
vijaymtech28@gmail.com

2 Dept. of CS&SE, A.U. College of Engineering, Andhra University, Visakhapatnam
msprasadbabu@yahoo.co.in

Abstract: The task of archiving or mirroring websites has become more difficult in recent years because of the rapid developments in website design technologies. Users and organizations face many challenges in terms of restoring all types of links originally present on the websites. Moreover, the preservation, accessing and interpretation of the data present on the websites for future reference pose new problems. This paper focuses on the design, implementation and analysis of an optimized, multi-threaded, website-specific application that saves websites in a relatively short time onto a user-defined location of a local disk in the form of a file, which is useful for offline browsing. The problems and limitations of existing state-of-the-art archiving tools in terms of comprehensive retrieval of links and speed of archiving have been addressed. The application is compared with the existing open sourced utilities web eater 0.2.1 and winHTTrack Website copier 3.43-6.

Key words: Web archiving, open source tool, running time, active threads.

1. Introduction

Web archiving or mirroring is the process of collecting portions of the World Wide Web (WWW) and ensuring the collection is digitally preserved for future reference and interpretation by research scholars, scientists, business people, and various government and private organizations of different countries. The web is a pervasive and ephemeral medium where modern culture in a large sense finds a natural form of expression. Publication, debate, creation, work and social interaction in a large sense: many aspects of society are happening or reflected on the Internet in general and the web in particular [7].

Archiving websites is important for three main reasons. Firstly, archiving websites is justified by the documentary value the websites possess themselves. Archived websites are necessary evidence material for research into the history of this medium and its evolution. Secondly, the websites themselves have large informational value; it is important to realize that the websites can be frozen as on-line sources one way or the other. Thirdly, websites also have some cultural value, because they belong to the digital heritage and are the material witness of our society; without a proper archiving policy, they will be lost for the future [12].

Since a website is a unit of linked files including text files, sound files, databases, scripts and so on, the author or the organization that has created the site is protected in exactly the same way as the author of a classical literary work. An organization or the author of the site has to consider the legal copyright implications when drawing up an archiving strategy for its websites. Libraries of different countries and various organizations that deal with the reproduction of information see copyright as an obstruction. The current copyright laws do not allow enough possibilities to make reproductions of publications without the author's permission. A lot of information on the sites is shielded by copyright, making its reproduction subject to a number of clearly defined conditions [12][7].

2. Study of literature on related work

An extensive study has been done on the literature of various web archiving and mirroring utilities, applications and tools, both licensed and open sourced. The study of the state-of-the-art technologies on the related work has provided enough base and platform for the development of our application.

J. Cho et al. have described the design and performance of WebBase, a tool for Web research. The system includes a highly customizable crawler, a repository for collected Web pages, an indexer for both text and link-related page features, and a high-speed content distribution facility. The distributed model enables researchers world-wide to retrieve pages from WebBase and stream them across the Internet at high speed [9]. Masanès J. has presented the various crawling algorithms and approaches undertaken today by different institutions, discussing their focuses, strengths and limits, as well as a model for appraisal and for identifying potential complementary aspects amongst them. He stated that the completeness of a Web archive can be measured horizontally by the number of relevant entry points found within the designated perimeter, and vertically by the number of relevant linked nodes found from each entry point [7][10]. It was shown that not all pages are stored and that some hidden pages are missing, and he also presented models for extensive and intensive archiving mechanisms. B. Vijaya Babu et al. in their comparative study have concluded that the comprehensive archiving or retrieval of the links of a website also depends on the technologies used to design the individual pages of the website [1].

We have also gone through the literature of existing open source archiving tools which are useful in web site mirroring and other re-creation activities. Out of all those mirroring tools, cURL [3], HTTrack [9] and Stanford WebBase [9][18] are the most optimized, but each has its own advantages and disadvantages. cURL is fast, extensible, and fully featured as a web-downloading tool. It can use the telnet, FTP and HTTP protocols. It has built-in encryption for SSL connections. It fills in forms, providing passwords to websites if requested. It follows all redirects and collects cookies automatically. cURL has a command line interface and supports resumed transfers both ways on both FTP and HTTP. The main drawback of cURL is that it lacks spidering ability: by itself, it cannot mirror whole websites, so a wrapper must extend cURL to spider sites and rotate through links [9].
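As an illustration of the kind of wrapper mentioned above (not part of cURL itself or of any of the cited tools), the short Python sketch below shells out to the curl command line to fetch a page and then extracts the href links that a spider would follow next; the fetch_with_curl and extract_links helpers and the one-level traversal are assumptions made for brevity.

    # Minimal illustration of wrapping the curl CLI to add spidering ability.
    # Assumes curl is installed; the helpers are hypothetical, not from the cited tools.
    import re
    import subprocess

    def fetch_with_curl(url: str) -> str:
        # -s: silent, -L: follow redirects; returns the page body as text
        result = subprocess.run(["curl", "-sL", url], capture_output=True, text=True)
        return result.stdout

    def extract_links(html: str) -> list[str]:
        # A crude href extractor; a real wrapper would also resolve relative URLs
        return re.findall(r'href="([^"]+)"', html, flags=re.IGNORECASE)

    if __name__ == "__main__":
        page = fetch_with_curl("http://www.usharama.com")
        for link in extract_links(page):
            print(link)  # each link would be queued for the next round of fetching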
Tarmo Robal has designed a web crawler software component which is able to map the Estonian Internet domain structure and index its web page sources [15]. Vladislav Shkapenyuk and Torsten Suel in their work have described the design and implementation of a distributed web crawler architecture that runs on a network of workstations. The crawler scales to several hundred pages per second, is resilient against system crashes and other events, and can be adapted to other applications [16]. Allan Heydon and Marc Najork have described Mercator [17], a scalable and extensible web crawler, and discussed the alternative trade-offs in the design of web crawlers. Paolo Boldi et al. have reported the implementation of UbiCrawler, another scalable and distributed web crawler, with features such as platform independence and fault tolerance [18].

2.1 Study of Existing Open sourced Tools on the Related Work

A number of tools and utilities are available on the related work, both licensed with copyrights and as open source applications, but most of them lack ease of implementation and comprehensiveness. Moreover, they do not meet the basic requirements of archiving.

The open sourced tools cURL, winHTTrack Website copier 3.43-6, and web eater 0.2.1 have been studied. Both cURL and web eater are command line tools, and web eater 0.2.1 is not comprehensive in terms of retrieval of links and pages. The archiving efficiency of web eater 0.2.1 [13] is not up to the mark. The screenshots for running and implementation of web eater 0.2.1 are shown in the following diagram.

Figure 1. Screen shots of web eater 0.2.1

The winHTTrack Website copier 3.43-6, on the other hand, is a GUI tool and is easier to use. The screenshot of its home page is shown below.

Figure 2. Screen shots of winHTTrack Website copier 3.43-6

In our study we have found that HTTrack website copier [9] is better than cURL as far as recreating the links is concerned, and it employs the most professional interface of the evaluated crawlers. It actually has two interfaces, one GUI and one command line. It has many options and accurately downloads images and rewrites links. It has SSL capability [9]. It spiders to a prescribed depth into the site and can spider several sites in series. But it lacks the extensibility of cURL: difficulties in the core code can lead to awkward workarounds. CDL, for instance, might not be able to build form guessing into HTTrack, which is required to reach much of the deep web [9].

Stanford's WebBase [11] mirrors sites with almost perfect recall of mirrored HTML. Simple sites with little dynamic content can be mirrored with ease. But Stanford WebBase does not rewrite links within pages. It does not download files required in an EMBED tag as used by Flash and other programs. It does not parse or even scan JavaScript for links or images. WebBase may be extended
only with great difficulty: the C++ code is not ANSI compliant and cannot be compiled with a modern compiler [9][11]. Rui Cai et al. have studied an intelligent crawler whose main idea is to learn the site map of a forum site from a few pre-sampled pages and then decide how to select an optimal traversal path that avoids duplicate and invalid pages [2].

3. Pseudo Code
According to Shalin Shah [19], part of the pseudo code summary of the algorithm that can be used to implement the application is given below:

Add the URL to the empty list of URLs to search
While not empty (the list of URLs to search)
{
    Take the first URL in from the list of URLs
    Mark this URL as already searched URL
    If the URL protocol is not HTTP
        then break;
        go back to while
    If robots.txt file exists on the site then
        If file includes "Disallow" statement
            then break;
            go back to while
    Open the URL
    If the opened URL is not an HTML file
        then break;
        go back to while
    Iterate the HTML file
    While the HTML text contains another link
    {
        If robots.txt file exists on URL/site then
            If file includes "Disallow" statement
                then break;
                go back to while
        If the opened URL is an HTML file then
            If the URL is not marked as searched then
                Mark this URL as already searched URL
    }
}
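As a concrete, runnable illustration of the summary above (a sketch under stated assumptions, not the authors' actual implementation), the following Python code follows the same steps: a frontier of URLs to search, a set of already-searched URLs, a robots.txt check, and link extraction restricted to HTML pages. The archive folder layout and the same-host restriction are assumptions added for brevity.

    # A minimal crawl loop mirroring the pseudo code: frontier, visited set,
    # robots.txt check, HTML-only parsing, and saving each page to a local folder.
    import os
    import re
    from collections import deque
    from urllib import request, robotparser
    from urllib.parse import urljoin, urlparse

    def save_page(url: str, html: str, folder: str = "archive") -> None:
        # Write the page under a flat, filesystem-safe name inside the archive folder.
        os.makedirs(folder, exist_ok=True)
        name = re.sub(r"[^A-Za-z0-9]+", "_", url) + ".html"
        with open(os.path.join(folder, name), "w", encoding="utf-8") as f:
            f.write(html)

    def crawl(start_url: str, max_pages: int = 50) -> None:
        robots = robotparser.RobotFileParser()
        robots.set_url(urljoin(start_url, "/robots.txt"))
        robots.read()

        frontier = deque([start_url])      # the list of URLs to search
        visited = set()                    # URLs already marked as searched

        while frontier and len(visited) < max_pages:
            url = frontier.popleft()       # take the first URL from the list
            if url in visited:
                continue
            visited.add(url)               # mark this URL as already searched
            if urlparse(url).scheme not in ("http", "https"):
                continue                   # the URL protocol is not HTTP
            if not robots.can_fetch("*", url):
                continue                   # robots.txt includes a "Disallow" for this URL
            try:
                with request.urlopen(url) as resp:
                    if resp.headers.get_content_type() != "text/html":
                        continue           # only HTML files are iterated for further links
                    html = resp.read().decode("utf-8", errors="replace")
            except OSError:
                continue
            save_page(url, html)
            for href in re.findall(r'href="([^"]+)"', html, flags=re.IGNORECASE):
                link = urljoin(url, href)
                if urlparse(link).netloc == urlparse(start_url).netloc:
                    frontier.append(link)  # queue links from the same site

    if __name__ == "__main__":
        crawl("http://www.usharama.com/")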
4. Architecture Diagram

Figure 3. Architecture Diagram of standard Web Crawler

5. Flow Diagram of the Web Archiving Application

Figure 4. Flow Diagram of the Web Archiving Application

6. Implementation of the Application and Case Study

This application is implemented on a Dell PowerEdge T100 model server system with 4 GB RAM, a 250 GB SATA hard disk and a 1 Gbps network (Ethernet) card. The system has 2 Mbps dedicated leased line internet connectivity. The website www.usharama.com is taken as the reference site, with prior permission, for the implementation of the application.
Figure 5. Screen Shots of Home Page of Web Archiving Application

The implementation results, in the form of observations of running time for different numbers of active threads with the BFS and DFS algorithms, are shown below.

Figure 6. Screen Shots for Threads=1 and Depth=5 with BFS Algorithm

Figure 7. Screen Shots for Threads=1 and Depth=5 with DFS Algorithm

A page size limit of 3000 KB and a depth of 5 are set for the implementation. Some interesting results have been observed for the different numbers of active threads selected: as the number of threads increases, the running time for saving the website decreases. This running time also depends on factors such as the bandwidth of the Internet connectivity, the packet transfer rate, congestion at that particular moment, and the configuration of the system.
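The influence of the thread count can be sketched with a standard thread pool whose worker count is the tunable parameter. The snippet below is only an illustration of the measurement idea, not the application's own code; the urls workload is a placeholder.

    # Timing the same batch of downloads with different numbers of worker threads.
    # 'urls' is a placeholder list; in the case study these would be the site's pages.
    import time
    from concurrent.futures import ThreadPoolExecutor
    from urllib import request

    def fetch(url: str) -> int:
        with request.urlopen(url, timeout=10) as resp:
            return len(resp.read())        # bytes downloaded for this page

    def timed_run(urls: list[str], threads: int) -> float:
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=threads) as pool:
            list(pool.map(fetch, urls))    # download the whole batch with 'threads' workers
        return time.perf_counter() - start

    if __name__ == "__main__":
        urls = ["http://www.usharama.com/"] * 10   # placeholder workload
        for threads in (1, 2, 4, 8):
            print(threads, "threads:", round(timed_run(urls, threads), 2), "s")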

The content of the website is archived in a folder named UR on the desktop, and it consists of the following folders:
a) Images
b) Includes (JScript and CSS files)
c) Jsfiles
d) Photos
e) Pages
To browse the website offline, an index.html has also been created. The observations are tabulated in Table 1 and Table 2 as shown below.

Table 1: Observations for BFS algorithm

Table 2: Observations for DFS algorithm

From the results, the crawling efficiency of BFS is better than that of DFS at the cost of running time, and vice versa. The values of running time for the various numbers of active threads are depicted in Figure 8.

Figure 8. Comparison of running time for BFS and DFS
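In a crawl loop like the sketch in Section 3, the only structural difference between the two orders is the frontier discipline: BFS takes the oldest discovered link first, DFS the newest. The helper below is illustrative only and is not the measured implementation.

    # BFS and DFS crawling differ only in how the next URL is taken from the frontier.
    from collections import deque

    def next_url(frontier: deque, strategy: str) -> str:
        # BFS: first-in first-out (breadth-first); DFS: last-in first-out (depth-first)
        return frontier.popleft() if strategy == "BFS" else frontier.pop()

    frontier = deque(["page1.html", "page2.html", "page3.html"])
    print(next_url(frontier, "BFS"))   # page1.html - oldest discovered link first
    print(next_url(frontier, "DFS"))   # page3.html - newest discovered link first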
7. Comparison of Open sourced tools web eater 0.2.1 and winHTTrack Website copier 3.43-6 versions

The open source tools winHTTrack Website copier 3.43-6 and web eater 0.2.1 are implemented and compared with our application. The winHTTrack Website copier 3.43-6 is comprehensive in terms of restoration of the links, but takes a much longer running time of 22 minutes and 21 seconds.

Figure 9. Output screenshot of winHTTrack website copier 3.43-6

The web eater 0.2.1 tool, however, has given poor results in terms of restoration of the links as well as running time: it has retrieved only the home (index) page.

Figure 10. Output screenshot of web eater 0.2.1

8. Conclusions

This archiving application saves all the links originally present in the site, except external links, and also retrieves the pages written in ASPX, HTML, PHP, etc. into the archived folder. The application is compared with the existing state-of-the-art open sourced archiving tools web eater 0.2.1 and winHTTrack Website copier 3.43-6. It has shown much improved performance in terms of running time and number of links visited. As the ultimate objective of our work is to develop a comprehensive, open sourced web archiving tool, this work provides a platform and scope for future enhancements such as the retrieval of external links and, consequently, solutions to the related memory requirement problems.

9. Acknowledgements

I extend my sincere thanks to the management of Usha

Rama College of Engineering and Technology, Telaprolu, Krishna (Dt.), Andhra Pradesh, INDIA, for giving me the permission to use the website http://www.usharama.com to conduct the case study, and I also thank the staff and colleagues for their cooperation in completing the work.

References
[1] B. Vijaya Babu and M. S. Prasad Babu, "Performance Evaluation and Comparative Study of Web Archiving Tools," International Journal of Computer Engineering and Information Technology (ISSN 0974-2034), Volume 01, Number 01, pp. 100-106, Nov 2008 - Jan 2009.
[2] Rui Cai, Jiang-Ming Yang, Wei Lai, Yida Wang, and Lei Zhang, "iRobot: An Intelligent Crawler for Web Forums," in Proceedings of WWW 2008, Refereed Track: Search - Crawlers, April 21-25, 2008, Beijing, China.
[3] Ricardo Baeza-Yates, Aristides Gionis, Flavio P. Junqueira, Vanessa Murdock, Vassilis Plachouras, and Fabrizio Silvestri, "Design Trade-Offs for Search Engine Caching," ACM Transactions on the Web, Vol. 2, No. 4, Article 20, October 2008.
[4] Seung Hwan Ryu, Fabio Casati, Halvard Skogsrud, Boualem Benatallah, and Régis Saint-Paul, "Supporting the Dynamic Evolution of Web Service Protocols in Service-Oriented Architectures," ACM Transactions on the Web, Vol. 2, No. 2, Article 13, April 2008.
[5] http://www.curl.haxx.se
[6] Robert C. Miller and Krishna Bharat, "SPHINX: A Framework for Creating Personal, Site-Specific Web Crawlers," in Proceedings of the Seventh International World Wide Web Conference (WWW7), Brisbane, Australia, April 1998; printed in Computer Networks and ISDN Systems, Vol. 30, pp. 119-130, 1998.
[7] Julien Masanès, Web Archiving, Springer-Verlag Berlin Heidelberg, 2006, ISBN-10 3-540-23338-5, European Web Archive.
[8] http://www.httrack.com
[9] Junghoo Cho, Hector Garcia-Molina, Taher Haveliwala, Wang Lam, Andreas Paepcke, Sriram Raghavan, and Gary Wesley, "Stanford WebBase Components and Applications," ACM Transactions on Internet Technology, Vol. 6, No. 2, May 2006, pp. 153-186.
[10] Julien Masanès, "Towards Continuous Web Archiving," D-Lib Magazine, Volume 8, Number 12, December 2002.
[11] www-iglib.stanford.edu
[12] DAVID: Archiving Websites, Version 1.0, Antwerp-Leuven, July 2002.
[13] http://freshmeat.net/projects/webeater/
[14] Fitch, K., "Web site archiving: an approach to recording every materially different response produced by a website," AusWeb 2003: the Ninth Australian World Wide Web Conference, Hyatt Sanctuary Cove, Gold Coast, Australia, 5-9 July 2003. http://ausweb.scu.edu.au/aw03/papers/fitch/
[15] Tarmo Robal, "Agile Web-Crawler: Design and Implementation," 2007. http://www.scribd.com/doc/100903/Agile-webcrawler-design-and-implementation
[16] Vladislav Shkapenyuk and Torsten Suel, "Design and Implementation of a High-Performance Distributed Web Crawler."
[17] Allan Heydon and Marc Najork, "Mercator: A Scalable, Extensible Web Crawler." http://www.mias.uiuc.edu/files/tutorials/mercator.pdf
[18] Paolo Boldi, Bruno Codenotti, Massimo Santini, and Sebastiano Vigna, "UbiCrawler: A Scalable Fully Distributed Web Crawler," 2003. http://eprints.kfupm.edu.sa
[19] Shalin Shah, "Implementing an Effective Web Crawler," eInfochips Dash Board, September 2006.

Authors Profile:

Prof. B. Vijaya Babu received B.Tech (ECE) and M.Tech (Computer Science) degrees from JNTU, Hyderabad, in 1993 and 2004 respectively. Presently he is working as Professor and Head of the CSE/IT Departments in Usha Rama College of Engineering & Technology, Telaprolu, Near Gannavaram, Krishna (Dt.), Andhra Pradesh, INDIA. During his 16 years of experience in teaching and research, he has attended many National and International conferences and seminars in India and has contributed a number of research papers to various International journals.

Prof. Maddali Surendra Prasad Babu obtained his B.Sc., M.Sc., M.Phil., and Ph.D. degrees from Andhra University in 1976, 1978, 1981 and 1986 respectively. He was the Head of the Department of Computer Science & Systems Engineering, Andhra University, from 2006 to 2009. During his 30 years of experience in teaching and research, he has attended about 30 National and International conferences and seminars in India and contributed about 60 research papers either in journals or in National and International conferences and seminars. He received the ISCA Young Scientist Award at the 73rd Indian Science Congress in 1986.
