Dept. of CS&SE, A.U. College of Engineering, Andhra University, Visakhapatnam
msprasadbabu@yahoo.co.in
Abstract: The task of archiving or mirroring websites has become more difficult in recent years because of the rapid developments in website design technologies. Users and organizations face many challenges in terms of restoration of all types of links originally present on the websites. Moreover, preservation, access and interpretation of the data present on the websites for future reference pose new problems. This paper focuses on the design, implementation and analysis of an optimized, multithreaded, website-specific application that saves websites in a relatively short time to a user-defined location on a local disk, in the form of a file useful for offline browsing. The problems and limitations of existing state-of-the-art archiving tools, in terms of comprehensive retrieval of links and speed of archiving, have been addressed. This application is compared with the existing open-source utilities web eater 0.2.1 and WinHTTrack Website Copier 3.43-6.

Key words: Web archiving, open source tool, running time, active threads.
1. Introduction

Web archiving or mirroring is the process of collecting portions of the World Wide Web (WWW) and ensuring that the collection is digitally preserved for future reference and interpretation by research scholars, scientists, business people, and various government and private organizations of different countries. The web is a pervasive and ephemeral medium where modern culture, in a large sense, finds a natural form of expression. Publication, debate, creation, work and social interaction: many aspects of society are happening or are reflected on the Internet in general and the web in particular [7].

Archiving websites is important for three main reasons. Firstly, archiving websites is justified by the documentary value the websites possess themselves. Archived websites are necessary evidence material for research into the history and the evolution of this medium itself. Secondly, the websites themselves have large informational value, and it is important to realize that websites can be frozen as on-line sources one way or the other. Thirdly, websites also have some cultural value, because they belong to the digital heritage and are the material witness of our society; without a proper archiving policy, they will be lost for the future [12].

Since a website is a unit of linked files, including text files, sound files, databases, scripts, etc., the author or the organization that has created the site is protected in exactly the same way as the author of a classical literary work. An organization or the author of a site has to consider the legal copyright implications when drawing up an archiving strategy for its websites. Libraries of different countries and various organizations that deal with the reproduction of information see copyright as an obstruction. The current copyright laws do not allow enough possibilities to make reproductions of publications without the author's permission. A lot of information on the sites is shielded by copyright, making its reproduction subject to a number of clearly defined conditions [12][7].

2. Study of literature on related work

An extensive study has been done of the literature on various web archiving and mirroring utilities, applications and tools, both licensed and open source. The study of the state-of-the-art technologies on the related work has provided a sufficient base and platform for the development of our application.

J. Cho et al. have described the design and performance of WebBase, a tool for Web research. The system includes a highly customizable crawler, a repository for collected Web pages, an indexer for both text and link-related page features, and a high-speed content distribution facility. The distributed model enables researchers world-wide to retrieve pages from WebBase and stream them across the Internet at high speed [9]. Masanès, J. has presented the various crawling algorithms and approaches undertaken today by different institutions, discussing their focuses, strengths and limits, as well as a model for appraisal and for identifying potential complementary aspects amongst them.
3. Pseudo Code

According to Shalin Shah [19], part of the pseudo-code summary of the algorithm that can be used to implement the application is given below:
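As a concrete illustration of that algorithm, the following is a minimal, self-contained Java sketch of the multithreaded approach the paper describes: a bounded-depth crawl driven by a fixed pool of active threads, with a page-size cut-off, saving each retrieved page into a local folder. The class name SiteArchiver, the regex-based link extractor and all other identifiers are illustrative assumptions, not the authors' original code.

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SiteArchiver {

    // Rough href extractor; a production tool would use a real HTML parser.
    private static final Pattern LINK =
            Pattern.compile("href=[\"'](http[^\"'#]+)[\"']");

    private final String seedUrl;
    private final int depthLimit;
    private final long maxPageBytes;
    private final ExecutorService pool;           // the "active threads"
    private final Set<String> visited = ConcurrentHashMap.newKeySet();
    private final Path outDir;
    private final Phaser pending = new Phaser(1); // tracks outstanding page tasks

    public SiteArchiver(String seedUrl, int threads, int depthLimit,
                        long maxPageBytes) throws IOException {
        this.seedUrl = seedUrl;
        this.depthLimit = depthLimit;
        this.maxPageBytes = maxPageBytes;
        this.pool = Executors.newFixedThreadPool(threads);
        this.outDir = Files.createDirectories(Paths.get("archive"));
    }

    public void run() {
        submit(seedUrl, 0);
        pending.arriveAndAwaitAdvance(); // block until every queued page is done
        pool.shutdown();
    }

    private void submit(String url, int depth) {
        if (depth > depthLimit || !visited.add(url)) return; // depth cut-off, dedup
        pending.register();
        pool.execute(() -> {
            try {
                String page = fetch(url);
                if (page == null) return;          // over the page-size limit
                save(url, page);
                Matcher m = LINK.matcher(page);    // queue every discovered link
                while (m.find()) submit(m.group(1), depth + 1);
            } catch (IOException e) {
                System.err.println("skipped " + url + ": " + e.getMessage());
            } finally {
                pending.arriveAndDeregister();
            }
        });
    }

    private String fetch(String url) throws IOException {
        try (InputStream in = new URL(url).openStream()) {
            byte[] body = in.readNBytes((int) maxPageBytes + 1);
            return body.length > maxPageBytes ? null : new String(body);
        }
    }

    private void save(String url, String page) throws IOException {
        String name = url.replaceAll("[^A-Za-z0-9.]", "_") + ".html";
        Files.writeString(outDir.resolve(name), page);
    }
}

A Phaser is used here rather than shutdown()/awaitTermination() so that running tasks can keep registering child tasks for newly discovered links while the caller waits; any equivalent completion-tracking scheme would serve.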
(IJCNS) International Journal of Computer and Network Security, 79
Vol. 2, No. 1, January 2010
A page size of 3000 KB and a depth of 5 were set for the implementation. Some interesting results were observed for the different numbers of active threads selected: as the number of threads increases, the running time for saving the website decreases. This running time also depends on factors such as the bandwidth of the Internet connection, the packet transfer rate, the congestion at that particular moment, and the configuration of the system.
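Under these settings (3000 KB page-size limit, depth 5), the thread-scaling experiment could be reproduced with a small harness like the one below, which reuses the illustrative SiteArchiver sketch from Section 3; the thread counts and seed URL are assumptions, not the authors' actual test configuration.

import java.util.concurrent.TimeUnit;

public class ThreadScalingBenchmark {
    public static void main(String[] args) throws Exception {
        String seedUrl = "http://www.example.com/";       // hypothetical test site
        for (int threads : new int[] {1, 2, 4, 8, 16}) {  // active-thread counts
            long start = System.nanoTime();
            // depth limit 5, page-size limit 3000 KB, as in the experiment above
            new SiteArchiver(seedUrl, threads, 5, 3000L * 1024).run();
            long ms = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            System.out.printf("threads=%-2d  running time=%d ms%n", threads, ms);
        }
    }
}

Timings from such a harness vary with bandwidth, packet transfer rate and congestion, as noted above, so each configuration would normally be run several times and averaged.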
The web eater 0.2.1 tool, however, gave poor results in terms of both restoration of the links and running time: it retrieved only the home page (index page).
References

[7] Julien Masanès, Web Archiving, Springer-Verlag, Berlin Heidelberg, 2006, ISBN-10 3-540-23338-5, European Web Archive.
[8] http://www.httrack.com
[9] Junghoo Cho, Hector Garcia-Molina, Taher Haveliwala, Wang Lam, Andreas Paepcke, Sriram Raghavan, and Gary Wesley, "Stanford WebBase Components and Applications," ACM Transactions on Internet Technology, Vol. 6, No. 2, May 2006, pp. 153-186.
[10] Julien Masanès, "Towards Continuous Web Archiving," D-Lib Magazine, Volume 8, Number 12, December 2002.
[11] www-iglib.stanford.edu
[12] DAVID, Archiving Websites, Version 1.0, Antwerp-Leuven, July 2002.
[13] http://freshmeat.net/projects/webeater/

Prof. Maddali Surendra Prasad Babu obtained his B.Sc., M.Sc., M.Phil., and Ph.D. degrees from Andhra University in 1976, 1978, 1981 and 1986 respectively. He was the Head of the Department of Computer Science & Systems Engineering, Andhra University, from 2006-09. During his 30 years of experience in teaching and research, he attended about 30 national and international conferences/seminars in India and contributed about 60 research papers in journals or in national and international conferences/seminars. He received the ISCA Young Scientist Award at the 73rd Indian Science Congress in 1986.