
Introduction

This application collects articles from different sites and shows them in our own panel. In our case the front-end panel is a mobile application; we chose mobile for client interaction because it is the most widely used platform.
In the mobile industry there are two main competitors: Android and iOS. Android is open source and maintained by Google, while iOS is not open source (developers cannot modify its code or SDK). Android and iOS are built on different technologies, and it is difficult for us to learn and maintain both. So we consulted software engineers with experience in the field and found a newer approach called hybrid applications.
Nowadays hybrid applications are very popular; Facebook, AliExpress, daraz.pk and many other applications use this concept. With a hybrid application the code is written once and compiled for both Android and iOS, which saves us from writing everything twice. Hybrid applications are built on Cordova, the main component that compiles the code and transforms it into a native application.
Native applications are those developed with the SDKs provided by the companies that develop and maintain the mobile operating systems.
Mobile applications mostly depend on a REST API, which makes them dynamic. For our system we chose Laravel, a web framework written in the PHP programming language. PHP is a dynamically typed scripting language, which means we do not have to worry about compiling the code or declaring the data types of functions and variables.
Most of the time PHP web applications are hosted on Linux servers. The stack used for PHP development is called the LAMP stack: Linux, Apache, MySQL and PHP. All parts of this stack are open source, and we will explain each of them in detail.
Having introduced the technologies we will use for this system, we now explain the system in simple words.
We want to crawl data from the top news sites and show it under one umbrella. A user typically reads daily news from several different websites, such as the BBC and ARY News websites. A news item essentially consists of a headline and its details. Our idea is therefore to crawl some renowned sites, store the results in our database, and write a REST API for our mobile application.
How will this system work? Basically, we write schedulers that run daily in the morning, crawl the sites, fetch the latest news and store it in our database.
Crawling and storing all of the data from other sites would create problems for us, so we decided to fetch only the headline and a short description from each news site, and to provide a link to the original site for reading the full story, because most people are interested mainly in the headlines.
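A minimal sketch of how such a daily schedule could look in Laravel is shown below; the artisan command name news:crawl and the 6 a.m. run time are assumptions for illustration only, not the final implementation.

<?php
// app/Console/Kernel.php (sketch) – the command name "news:crawl" is hypothetical
namespace App\Console;

use Illuminate\Console\Scheduling\Schedule;
use Illuminate\Foundation\Console\Kernel as ConsoleKernel;

class Kernel extends ConsoleKernel
{
    protected function schedule(Schedule $schedule)
    {
        // Crawl the news sites every morning and store the latest headlines
        $schedule->command('news:crawl')->dailyAt('06:00');
    }
}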
A Brief History of Web Crawlers

Web crawlers visit web applications, collect data, and discover new pages from the pages they visit. Web crawlers have a long and interesting history. Early web crawlers collected statistics about the web. In addition to gathering statistics about the web and indexing applications for search engines, modern crawlers can be used to perform accessibility and vulnerability checks on an application. The rapid growth of the web, and the complexity added to web applications, have made the process of crawling a very challenging one. Throughout the history of web crawling, many researchers and industrial groups have addressed the different problems and challenges that web crawlers face. Different solutions have been proposed to reduce the time and cost of crawling. Performing an exhaustive crawl is a challenging question; in addition, capturing the model of a modern web application and extracting data from it automatically is another open question. What follows is a brief history of the different techniques and algorithms used from the early days of crawling up to recent days. We introduce criteria to evaluate the relative performance of web crawlers, and based on these criteria we plot the evolution of web crawlers and compare their performance.
How it Works
A crawler is a type of robot that lives on the WWW (World Wide Web). It is known by many names, such as web spider or automatic indexer, but whatever it is called, its function remains the same.
A crawler is built and employed by a search engine to update its web data or to index the web content of other websites. It copies the pages so that they can be processed later by the search engine, which allows users to find web pages faster. The web crawler also validates links and HTML code, and sometimes it extracts other information from the website.
Robots.txt
By creating a robots.txt file in the root folder of your website on the server, you can give web crawlers rules, such as allow or disallow, that they must follow. You can define default rules that apply to all bots, or target a specific bot by its User-agent string.

# Block all crawlers from the whole site
User-agent: *
Disallow: /

# Allow all crawlers full access
User-agent: *
Disallow:

Top web crawler bots


Googlebot (User-agent: Googlebot), Bingbot, Slurp Bot and DuckDuckBot.
Web crawler working
Figure: Web crawler working (components: Internet, Crawler, Scheduler, URL Parser, Indexing, Index, Search).
Web Scraping
Web scraping is the collection of information from web pages, extracted using a program and a set of instructions.
A web page contains information in a special language called HTML. When we open a site in our browser, it consists of HTML, which the browser understands and uses to show us the page.
The main task in web scraping is to gather the data relevant to our research question from a large amount of textual data.
Inside the HTML we are usually interested in structured data, since we want to study data that can be used with quantitative methods. Structured items can be numbers or recurring names, for example countries or addresses.
There are three steps:
First, we extract the unstructured data, which is basically the HTML.
Second, we work out the recurring patterns behind the data we are looking for.
Third, we apply these patterns to the unstructured text to return the structured data.
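The short PHP sketch below illustrates these three steps; the URL and the <h2 class="headline"> pattern are assumptions, since the real pattern depends on the page being scraped.

<?php
// Step 1: extract the unstructured data (the raw HTML of the page)
$html = file_get_contents('https://www.example.com/news'); // hypothetical URL

// Step 2: describe the recurring pattern behind the data we are looking for
$pattern = '/<h2 class="headline">(.*?)<\/h2>/s';

// Step 3: apply the pattern to the unstructured text to return structured data
preg_match_all($pattern, $html, $matches);
$headlines = array_map('strip_tags', $matches[1]);

print_r($headlines);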

Process of Web Crawling/Scraping


Web crawling
Make an HTTP request to the website whose data you want to take and save, then download the response object, which is mostly HTML but is sometimes an XML feed or JSON data.
Data Parsing and Extraction
The fetched HTML is processed by a parser that extracts data from each downloaded page using different techniques such as regular expressions, an HTML parser, or artificial intelligence.
Data Cleaning and Transformation
The parsed data is converted into a structured form to be saved in a CSV file, a JSON file or any database. Usually the records are fed into a queue that is consumed by the data writer, which is usually a script of instructions.
Data Serialization and Storage
The writer reads from the queue of records and writes the data into a format such as CSV, JSON, JSON Lines or XML, or loads it into a relational database, depending on the requirements.
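As a rough sketch of the whole process, the steps could be chained in PHP as shown below; the target URL, the //h2 query and the news.csv file name are placeholders, not part of the final system.

<?php
// 1. Crawling: fetch the raw HTML
$html = file_get_contents('https://www.example.com/news'); // placeholder URL

// 2. Parsing and extraction: pull headline text out with the DOM parser
$dom = new DOMDocument();
@$dom->loadHTML($html); // suppress warnings caused by imperfect real-world HTML
$xpath = new DOMXPath($dom);
$nodes = $xpath->query('//h2'); // placeholder query

// 3. Cleaning and transformation: build structured records and feed a queue
$queue = [];
foreach ($nodes as $node) {
    $queue[] = ['headline' => trim($node->textContent), 'source' => 'example.com'];
}

// 4. Serialization and storage: the writer consumes the queue and writes a CSV file
$file = fopen('news.csv', 'w');
fputcsv($file, ['headline', 'source']);
while ($record = array_shift($queue)) {
    fputcsv($file, $record);
}
fclose($file);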

Figure: Web scraping service — web pages and data from the Internet flow into the web scraping service; the extracted data is written to a CSV file and to a relational database (RDB) used by the application.
HTML and XML


HyperText Markup Language (HTML) is a language used to create websites. Websites can be viewed by anyone who is online, through web browsers such as Opera and Safari. HTML is a very easy computer language to learn, with fundamentals that are accessible to anyone, yet it is quite powerful and lets you easily create a website. It continually goes through improvement and development to fulfil the demands of all the people on the internet, under the W3C.
HTML stands for HyperText Markup Language.
Hypertext is the approach by which you travel around the web, by clicking on text called a hyperlink. A hyperlink is usually shown in blue with an underline, and is written with an <a> tag, for example <a href="https://www.google.com">Google</a>, which navigates the user to the next web page.
The <html> tag tells the browser that this is an HTML document and represents the base (root) of the HTML document; it is written as "<html>".
HTML consists of opening and closing tags and elements.

We highlight below the most commonly used tags and elements of the HTML language:


Image Tag
Link Tag
Heading Tags
Paragraph Tag
Span Tag
Code
<!DOCTYPE html>
<html>
<body>
<h1>Big Heading</h1>
<p>A paragraph.</p>
<a href="https://www.yahoo.com">Yahoo</a>
</body>
</html>

What we see in the browser


Output:

Big Heading
A paragraph.
Yahoo
HTML Element Syntax
An HTML element is a main part of an HTML page, written document or DOM, and it gives meaning to the HTML page.
For example, the title element of an HTML page represents the title of the page.
Most HTML elements are written with an opening tag and a closing tag, with the information or data placed between the opening and closing tag.
Elements can also contain attributes that define their other properties.
For example, to put an image in a web page we use the HTML element "<img>".
DOM
The DOM (Document Object Model) is a programming API for HTML (HyperText Markup Language) and XML (Extensible Markup Language) documents.
The DOM defines the logical structure of a document and the way a document is accessed and manipulated.
Many other kinds of data are stored in various ways, and much of this would traditionally be regarded as data rather than documents. Nevertheless, XML presents this data as documents, and the DOM can also be used to manage that data.
With the DOM, programmers can produce and build documents, navigate their structure, and add, modify or delete elements and their content.
Anything found in an HTML or XML document can be accessed, changed, deleted or added using the DOM, with a few exceptions; in particular, the DOM interfaces for the internal subset and external subset have not yet been specified.
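As a small illustration of these operations (not part of our system; the element names and attribute are chosen only for the example), PHP's built-in DOMDocument class can build, modify and delete nodes:

<?php
$doc = new DOMDocument('1.0', 'UTF-8');

// Build: create elements and attach them to the tree
$html = $doc->createElement('html');
$body = $doc->createElement('body');
$h1   = $doc->createElement('h1', 'Over Title');
$doc->appendChild($html);
$html->appendChild($body);
$body->appendChild($h1);

// Modify: change the text content and add an attribute
$h1->nodeValue = 'The Header';
$h1->setAttribute('id', 'main-heading');

// Delete: remove the element again
$body->removeChild($h1);

echo $doc->saveHTML();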

Document
  Root element: <html>
    Element: <head>
      Element: <title>
        Text: "Over Title"
    Element: <body>
      Element: <a>  (attribute: "href")
        Text: "The Link"
      Element: <h1>
        Text: "The Header"

How DOM is parsed


Now we understand what the DOM, the Document Object Model, is: it basically contains the data of the page in its structure. In HTML we use many elements, some of them with different attributes. For example the hyperlink tag (the "a" tag) holds links to different pages, and we can also specify how those links should open: in the same window, in another tab, or in another window, which we usually call a popup window. For this we use another attribute called "target".
Suppose we have a table on some informative website that basically holds helpful links for an article. We all know that when we search an educational topic on Google it gives us lots of links, and sometimes it is pretty difficult for us to find the right ones.
Let us say we have an abc website that provides 100 links for an article, recommended by other users, and we want to crawl its top 10 links to show on our page. We download the page's HTML, then find all the hyperlink tags and parse them; basically we read the href attribute of these tags and put the links into our database.
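A minimal sketch of this step is shown below; the page URL is a placeholder, and in the real system the links would be inserted into our database instead of printed.

<?php
// Download the page HTML (placeholder URL)
$html = file_get_contents('https://www.example.com/article-links');

$dom = new DOMDocument();
@$dom->loadHTML($html); // ignore warnings from imperfect markup

// Read the href attribute of every hyperlink tag and keep the top 10 links
$links = [];
foreach ($dom->getElementsByTagName('a') as $anchor) {
    $href = $anchor->getAttribute('href');
    if ($href !== '') {
        $links[] = $href;
    }
    if (count($links) === 10) {
        break;
    }
}

print_r($links); // in the real system: insert these links into the database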
Here we show the web page of a player on the cricketing site ESPN.

These are the stats of the player. If we want to save the player's stats in our file or database, we download the HTML of the page, analyze its structure, and then parse the DOM, extract the related data and save it in our database.
So the core part is to analyze the structure of the page; if we analyze it well, it is very easy for us to save the data in the database.
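Under the assumption that the stats sit in an ordinary HTML table (the URL and the table class player-stats below are made up for the sketch), the parsing step could look like this:

<?php
$html = file_get_contents('https://www.example.com/player-stats'); // placeholder URL

$dom = new DOMDocument();
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Walk every row of the (hypothetical) stats table
$rows = $xpath->query('//table[@class="player-stats"]//tr');
$stats = [];
foreach ($rows as $row) {
    $cells = [];
    foreach ($xpath->query('.//td', $row) as $cell) {
        $cells[] = trim($cell->textContent);
    }
    if ($cells) {
        $stats[] = $cells; // one record per table row
    }
}

print_r($stats); // each record can now be saved to a file or the database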

cURL
Introduction
cURL is a command-line tool for loading or downloading web page content; basically it provides the HTML of a URL.
cURL uses libcurl and supports a variety of protocols, presently including HTTP, HTTPS, FTP, FTPS, SFTP, TFTP, SCP and SMTP.
cURL supports HTTPS and performs SSL certificate verification when a secure protocol is specified. When cURL connects to a remote server over HTTPS, it obtains the remote server's certificate and checks it against its CA certificate store to verify that the remote server really is the one it claims to be.
Some cURL packages are bundled with a CA certificate store file, which is searched for in the following locations:
The directory where the cURL program is located.
The present working directory.
The Windows system directory.
The Windows directory.
The directories listed in the PATH environment variable.

Examples
The most basic use of cURL is to simply type curl followed by an address at the command line (cmd):

curl www.goodbyeworld.com

The following is the response we get when we cURL a web address in a Linux terminal.


<div class="post_cell post-layout--right">

<div class="post-text" item_prop="text">

<p>In PHP programming language and how does all of this is working?</p>

<p>Reference link: <a href="http://php.net/manual/en/book.curl.php" rel="no_refer">cURL</a></p>
</div>

<div class="post_tag grid gs4 gsy fd-column">


<div class="grid ps-relative d-block">
<a href="/question/tagg/php" class="post_tag" title="Show questions;"
rel="tagg">PHP</a> <a href="/question/tagg/curl" class="post_tag" title="Show Question;"
rel="tagg">curl</a>
</div> </div>
<div class="mb0 ">
So if we analyze the code above, we can see some informative content: basically we have found, inside the Stack Overflow page, the address of the link that explains how PHP cURL works.
PHP-cURL
The cURL component for the PHP language makes it possible for a PHP script or program to use libcurl. cURL is a library of the PHP programming language and a command-line tool, like wget, that helps us send files and also download data over HTTP and FTP. cURL supports proxies, lets us transfer data over SSL connections, set cookies, and even fetch files that sit behind a login.
In order to use the cURL functions in our program we need to install the libcurl package; PHP requires libcurl version 7.10.5 or later.
Following are some functions of cURL:
- curl_close – is used to close a cURL session.
- curl_copy_handle – is used to copy a cURL handle along with all of its options.
- curl_error – returns a string containing the last error of the current session (a short error-handling sketch follows below).
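Below is a short, self-contained sketch of that error handling; the URL is only a placeholder.

<?php
$curl = curl_init('http://localhost/does-not-exist'); // placeholder URL
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

$result = curl_exec($curl);

if (curl_errno($curl)) {
    // curl_error() returns the last error message of the current session
    echo 'cURL error: ' . curl_error($curl);
} else {
    echo $result;
}

curl_close($curl);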
Program Example
<?php
// The data we are sending to the service
$the_data = array(
    'msg'     => 'Hi Earth',
    'id_name' => 'Ahmed'
);

$curl = curl_init();
// You can also pass the URL of your choice directly to curl_init():
// $curl = curl_init('http://localhost/hi_earth');

// Here we POST our data
curl_setopt($curl, CURLOPT_POST, 1);
// The URL you want to call
curl_setopt($curl, CURLOPT_URL, 'http://localhost/try.php');
// Make sure the data coming back is returned as a string
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
// Attach the POST fields
curl_setopt($curl, CURLOPT_POSTFIELDS, $the_data);

// Send the request
$result = curl_exec($curl);

// Get some information back about the transfer
$info = curl_getinfo($curl);
echo 'Data type: ' . $info['content_type'] . '<br />';
echo 'HTTP code: ' . $info['http_code'] . '<br />';

// Free the resources used by $curl
curl_close($curl);

echo $result;
?>
