In domain-specific data retrieval, the main idea is to fetch the content that belongs to a certain domain and display it to the user. The relevant data may not be a whole website, or even a whole web page; it may be a small section within a page. A technical approach is therefore required to retrieve such data. Several systems and libraries can retrieve specific data from websites; among them, jsoup looks promising for this purpose because of its features.

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. jsoup is open source, which makes it a good development tool for this project, since it can be modified to suit the purpose. Because jsoup is developed specifically for the Java environment, it is also a good fit for the development process. The project needs to extract some sections from a given website whose URL is available. jsoup suits this task because it can load HTML from a given URL, from a file, or from a string. It can therefore fetch the page at the given URL, store the data, and extract (scrape) sections from it.

The jsoup API has many features that can enhance the extraction process. For example, data can be extracted by reading the DOM structure of a page. Because websites are written in HTML, jsoup can parse a page, traverse its DOM structure, and return the intended content. In jsoup, HTML tags and attributes are easy to identify, and data is retrieved by referring to them. Tags are represented as elements, and elements provide a range of DOM-like methods to find other elements and to extract and manipulate their data. The DOM getters are contextual: called on a parent document, they find matching elements under the document; called on a child element, they find elements under that child.
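
The section extraction and the contextual getters described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class name `SectionExtractor`, its two helper methods, and the sample HTML are all hypothetical, and the page is parsed from a string so the example runs without network access. It assumes a jsoup version that provides `selectFirst` (1.11 or later).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, used only for illustration.
public class SectionExtractor {

    // Returns the text of the first element matching the CSS query,
    // or an empty string when nothing matches.
    public static String extractSectionText(String html, String cssQuery) {
        // Parsing from a string; Jsoup.connect(url).get() would load from a URL
        // and Jsoup.parse(File, String charsetName) from a file.
        Document doc = Jsoup.parse(html);
        Element section = doc.selectFirst(cssQuery);
        return section == null ? "" : section.text();
    }

    // Contextual getter: getElementsByTag is called on the matched section,
    // so only <p> elements under that section are counted.
    public static int countParagraphsIn(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element section = doc.selectFirst(cssQuery);
        return section == null ? 0 : section.getElementsByTag("p").size();
    }

    public static void main(String[] args) {
        // Invented sample page with a navigation block and an article block.
        String html = "<html><body>"
                + "<div id=\"nav\"><p>Home</p></div>"
                + "<div id=\"article\"><h1>Title</h1><p>First.</p><p>Second.</p></div>"
                + "</body></html>";

        // Only the article section is read; the navigation block is ignored.
        System.out.println(extractSectionText(html, "div#article"));
        System.out.println(countParagraphsIn(html, "div#article"));
    }
}
```

Calling `getElementsByTag("p")` on the whole document instead of on the selected element would also count the paragraph inside the navigation block, which is the practical meaning of the getters being contextual.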
jsoup provides many such elements and getters. This makes data extraction easy, because only the intended sections need to be extracted, without grabbing a bunch of web pages.

The extracted data needs to be classified by content type, that is, as text, images, videos, links, and so on, and jsoup can be used for this categorization. It can identify text, links, and images separately, and the extraction process can be split on that basis. It identifies the content type from the HTML tags and, based on that, uses the appropriate functions to extract each content type. Content classification can thus be achieved with jsoup while the data is being extracted.

Considering the requirements of the data extraction process, jsoup can be regarded as a highly capable tool for data extraction. Its features and functions make Java-based data extraction much easier, and because it is open source, it is also cost-effective.
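
The tag-based classification described above might look something like the sketch below. It is one possible arrangement under stated assumptions, not the project's implementation: the class name `ContentClassifier`, the `classify` method, the category keys, and the sample HTML and base URI are hypothetical. Text is taken from `<p>` elements, links from `<a href>`, and images from `<img src>`; videos could be handled in the same way with a further selector.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class, used only for illustration.
public class ContentClassifier {

    // Groups page content into "text", "links", and "images" based on HTML tags.
    public static Map<String, List<String>> classify(String html, String baseUri) {
        // The base URI lets jsoup resolve relative href/src values to absolute URLs.
        Document doc = Jsoup.parse(html, baseUri);

        Map<String, List<String>> result = new LinkedHashMap<>();
        result.put("text", new ArrayList<>());
        result.put("links", new ArrayList<>());
        result.put("images", new ArrayList<>());

        // Paragraph text.
        for (Element p : doc.select("p")) {
            result.get("text").add(p.text());
        }
        // Link targets; the "abs:" prefix returns the absolute form of the attribute.
        for (Element a : doc.select("a[href]")) {
            result.get("links").add(a.attr("abs:href"));
        }
        // Image sources, also resolved against the base URI.
        for (Element img : doc.select("img[src]")) {
            result.get("images").add(img.attr("abs:src"));
        }
        return result;
    }

    public static void main(String[] args) {
        // Invented sample fragment and base URI.
        String html = "<p>Hello</p><a href=\"/about\">About</a><img src=\"logo.png\">";
        Map<String, List<String>> byType = classify(html, "https://example.com/");
        byType.forEach((type, items) -> System.out.println(type + ": " + items));
    }
}
```

Keeping each content type behind its own selector means the extraction step for one type can be changed without touching the others.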