
Basic WWW Technologies



2.1 Web Documents.

2.2 Resource Identifiers: URI, URL, and URN.

2.3 Protocols.

2.4 Log Files.

2.5 Search Engines.

Modeling the Internet and the Web


School of Information and Computer Science
University of California, Irvine
What Is the World Wide Web?
The World Wide Web (the Web) is a network of
information resources. The Web relies on three
mechanisms to make these resources readily
available to the widest possible audience:
1. A uniform naming scheme for locating resources
on the Web (e.g., URIs).
2. Protocols for accessing named resources over
the Web (e.g., HTTP).
3. Hypertext, for easy navigation among resources
(e.g., HTML).

Internet vs. Web
Internet:
• Internet is the more general term
• Includes the physical aspects of the underlying networks
and mechanisms such as email, FTP, HTTP, …
Web:
• Associated with information stored on the
Internet
• Can also refer to a broader class of networks, e.g. the
web of English literature
Both the Internet and the Web are networks
Essential Components of WWW


Resources:
• Conceptual mappings to concrete or abstract entities, which do not
change in the short term
• e.g. the DTU website (web pages and other kinds of files)
Resource identifiers (e.g. hyperlinks):
• Strings of characters that represent generalized addresses and may
contain instructions for accessing the identified resource
• e.g. http://www.ics.uci.edu identifies the ICS homepage
Transfer protocols:
• Conventions that regulate the communication between a browser
(web user agent) and a server

Standard Generalized Markup
Language (SGML)
• Based on GML (Generalized Markup Language),
developed at IBM in the 1960s
• An international standard (ISO 8879:1986) that
defines how descriptive markup should be
embedded in a document
• Gave birth to the Extensible Markup Language
(XML), a W3C Recommendation in 1998

Continued
• A technology that allows additional information to
be added to electronic documents so that the value
of the documents is maximized.
• This added information makes it possible to manage
and access documents, automate their processing, and
detect structural errors.
• Examples of such additional information include
specifications for arranging and formatting the document
for different types of output (electronic vs. paper forms),
cross-reference links to other documents, and a description
of the document's structure that facilitates information
finding and document merging.

CONTINUED…
• Word-processing packages and computer typesetting systems commonly
use procedural markup (Helvetica font, 18 points, etc.), which specifies
how text will be processed or how it will appear in the output.
• In contrast, SGML uses structural markup (example, figure), which
enables the description of structured information independently of how
the information is processed.
• Every SGML-based document requires a description, or set of rules, about
the structured information in the document. This description is called a
document type definition (DTD), and the SGML language provides a
standard syntax for expressing DTDs.
• Any information that is marked up or added to an SGML document
must follow the descriptions in its DTD.
• DTDs are explicit statements of the preferences an author has about a
document during authoring. Documents of the same type can share the
same DTD, and each document written against it is considered a
document instance.
CONTINUED
• The DTD is the core element of an SGML document.
• A DTD consists of definitions of the SGML markup
(tags) allowed in a document.
• The definitions also include a formal description of the
document and of the relationships among its elements
(such as chapters, footnotes, or indices).
• A marked-up unit, an "element", has a start tag <Q>,
content, and an end tag </Q>.
• For example, a title might be marked up with the title tag
as <TITLE>Document Processing</TITLE>. Markup can
describe graphics, images, and other entities as well as
text.
SGML Components
SGML documents have three parts:
• Declaration: specifies which characters and delimiters
may appear in the application. Essentially determines
whether a parser can interpret a document. Invisible to most users
• DTD/style sheet: defines the syntax of the markup
constructs. Usually referenced from the document source
• Document instance: the actual text (with tags) of the
document. One DTD can control thousands of documents.
More info: http://www.w3.org/MarkUp/SGML/

DTD Example One
<!ELEMENT UL - - (LI)+>
• ELEMENT is a keyword that introduces a new
element type, here the unordered list (UL)
• The two hyphens indicate that both the start tag
<UL> and the end tag </UL> for this element
type are required
• The content model (LI)+ specifies that a UL must
contain one or more list items (LI)

DTD Example Two


<!ELEMENT IMG - O EMPTY>


• The element type being declared is IMG
• The hyphen and the following "O" indicate that
the start tag is required and the end tag may be omitted
• Together with the content model EMPTY, this is
strengthened to the rule that the end tag must be
omitted (there is no closing tag)

HTML Background
• HTML was originally developed by Tim Berners-Lee
while at CERN, and popularized by the Mosaic
browser developed at NCSA.
• The Web depends on Web page authors and
vendors sharing the same conventions for HTML.
This has motivated joint work on specifications
for HTML.
• HTML standards are organized by the W3C:
http://www.w3.org/MarkUp/

HTML Functionalities
HTML gives authors the means to:
• Publish online documents with headings, text, tables,
lists, photos, etc.
– Include spreadsheets, video clips, sound clips, and other
applications directly in their documents
• Link information via hypertext links, at the click of a button
• Design forms for conducting transactions with remote
services, for use in searching for information, making
reservations, ordering products, etc.

HTML Versions
• HTML 4.01 is a revision of the HTML 4.0 Recommendation first
released on 18 December 1997.
– HTML 4.01 Specification:
http://www.w3.org/TR/1999/REC-html401-19991224/html40.txt
• HTML 4.0 was first released as a W3C Recommendation on 18
December 1997
• HTML 3.2 was W3C's first Recommendation for HTML, and
represented the consensus on HTML features for 1996
• HTML 2.0 (RFC 1866) was developed by the IETF's HTML
Working Group, which set the standard for core HTML
features based upon current practice in 1994.

Sample Webpage
(figure: screenshot of a rendered sample page; its HTML source is shown on the next slide)
Sample Webpage HTML Structure
<HTML>
<HEAD>
<TITLE>The title of the webpage</TITLE>
</HEAD>
<BODY> <P>Body of the webpage
</BODY>
</HTML>

HTML Structure
• An HTML document is divided into a head section
(here, between <HEAD> and </HEAD>) and a body
(here, between <BODY> and </BODY>)
• The title of the document appears in the head (along
with other information about the document)
• The content of the document appears in the body. The
body in this example contains just one paragraph,
marked up with <P>

HTML Hyperlink


<a href="relations/alumni">alumni</a>
• A link is a connection from one Web resource
to another
• It has two ends, called anchors, and a direction
• Starts at the "source" anchor and points to the
"destination" anchor, which may be any Web
resource (e.g., an image, a video clip, a sound
bite, a program, an HTML document)

Resource Identifiers
• Resources are referred to by means of strings with a
controlled syntax and semantics
• URI: Uniform Resource Identifiers - the general set of
identifiers, focused on the concepts of extensibility and
completeness
• Justified by viewing resources as conceptual mappings to entities
• URL: Uniform Resource Locators - identifiers that explicitly
encode details about the algorithm used to access the
resource
• URN: Uniform Resource Names - those URIs that must
persist and remain unique even if the resource becomes
unavailable
Introduction to URIs
• Every resource available on the Web has an address
that may be encoded by a URI.
• An absolute URI consists of two portions separated
by a colon: a scheme specifier followed by a
second portion whose syntax and semantics
depend on the particular scheme.
• In BNF (Backus–Naur form), the general syntax of
absolute URIs is expressed as
⟨absoluteURI⟩ ::= ⟨scheme⟩:(⟨hierarchicalPart⟩ | ⟨opaquePart⟩)

URI Example

• As an example, three common schemes for identifying Web
resources are http, https and ftp.
• In these three cases, the second part of the URI is hierarchical and
consists of a double slash // followed by a so-called authority
component (e.g. a host name) and optionally by a path and/or a
query.
• For example, the URI
http://bsd.slashdot.org/article.pl?sid=02/08/16/0041250
contains all three components: the authority is bsd.slashdot.org,
the absolute path is /article.pl, and the query is the substring
following the question mark
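
To make the decomposition concrete, here is a minimal sketch (not from the original slides) that splits the example URI into these components with Python's standard urllib.parse module:

from urllib.parse import urlsplit

uri = "http://bsd.slashdot.org/article.pl?sid=02/08/16/0041250"
parts = urlsplit(uri)

print(parts.scheme)  # 'http'                 (scheme specifier)
print(parts.netloc)  # 'bsd.slashdot.org'     (authority component)
print(parts.path)    # '/article.pl'          (absolute path)
print(parts.query)   # 'sid=02/08/16/0041250' (query string)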
Protocols


• Protocols describe how messages are encoded and exchanged
• Protocols are organized in a layered hierarchy:
– The lowest layer is typically related to the physical communication
mechanisms (e.g. it may be concerned with optical fibers or wireless
communication)
– Higher levels are related to the functioning of specific applications,
such as email or file transfer
• For example, a host on the Internet is characterized by a 32-bit address,
independently of whether the physical connection goes through a home DSL
cable, through an office Ethernet LAN, or through an airport wireless network.
• Similarly, email protocols are a lingua franca spoken by clients and servers.
Sending email just involves connecting to a server and using this language.
• All of the details about how the information is actually transmitted to the
recipient are hidden by the hierarchical mechanism of encapsulation, by which
higher-level messages are embedded into a format understood by the
lower-level protocols.
Different layering architectures:
• ISO OSI 7-layer architecture
• TCP/IP 4-layer architecture

ISO OSI Layering Architecture
(figure: the seven-layer OSI stack; the layers are listed on the "ISO Layers" slide below)
ISO’s Design Principles
• A layer should be created where a different level
of abstraction is needed
• Each layer should perform a well-defined function
• The layer boundaries should be chosen to
minimize information flow across the interfaces
• The number of layers should be large enough
that distinct functions need not be thrown
together in the same layer, and small enough that
the architecture does not become unwieldy

ISO Layers

• Physical: concerned with the transmission of electrical or optical
(possibly also acoustic) signals.
• Data link: provides error control and divides data into frames.
• Network: provides routing of packets from source to destination.
• Transport: provides end-to-end reliability (guaranteeing that all
packets reach the destination).
• Session: primitive functions that coordinate the dialog between
applications.
• Presentation: concerned with the transfer syntax used by
applications.
• Application: applications such as file transfer, email, browsing, etc.

TCP/IP Layering Architecture
• A simplified model that provides end-to-end
reliable connections
• The network layer
– Hosts drop packets into this layer; the layer
routes them toward the destination
– Promises only best-effort delivery ("try my best")
• The transport layer
– Provides a reliable byte-oriented stream

Hypertext Transfer Protocol (HTTP)


• An application-level protocol used to carry
WWW traffic between a browser and a server
• Runs over TCP, a connection-oriented
transport-layer protocol supported by the Internet
• HTTP communication is established via a
TCP connection, typically to server port 80

GET Method in HTTP
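
The figure from the original slide is not reproduced here. As a stand-in, the following hedged Python sketch issues a GET request by hand over a TCP connection to server port 80, as described on the previous slide (the host name is illustrative):

import socket

host = "www.example.com"  # illustrative host
request = f"GET / HTTP/1.0\r\nHost: {host}\r\n\r\n"

with socket.create_connection((host, 80)) as sock:
    sock.sendall(request.encode("ascii"))
    response = b""
    while chunk := sock.recv(4096):  # read until the server closes
        response += chunk

print(response.decode("latin-1").split("\r\n")[0])  # status line, e.g. 'HTTP/1.0 200 OK'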

Domain Name System
DNS (Domain Name System): the mapping from domain
names to IP addresses
IPv4:
• IPv4 was initially deployed on 1 January 1983 and
is still the most commonly used version.
• 32-bit addresses, written as a string of 4 decimal numbers
separated by dots, ranging from 0.0.0.0 to
255.255.255.255.
IPv6:
• A revision of IPv4 with 128-bit addresses
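
As a small illustration (a sketch assuming Python's standard socket module; the host name is only an example), the DNS mapping can be exercised like this:

import socket

name = "www.ics.uci.edu"           # domain name (illustrative)
ipv4 = socket.gethostbyname(name)  # resolve via DNS to a dotted-quad IPv4 address
print(name, "->", ipv4)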

Top Level Domains (TLD)
Top-level domain names: .com, .edu, .gov, and the ISO
3166 country codes
There are three types of top-level domains:
• Generic domains were created for use by the Internet
public
• Country code domains were created to be used by
individual countries
• The .arpa (Address and Routing Parameter Area)
domain is designated to be used exclusively for
Internet-infrastructure purposes

Registrars
• Domain names ending with .aero, .biz, .com, .coop,
.info, .museum, .name, .net, .org, or .pro can be
registered through many different companies
(known as "registrars") that compete with one another
• InterNIC: http://internic.net
• Registrars Directory: http://www.internic.net/regist.html
Server Log Files
Server Transfer Log: transactions between a
browser and server are logged
• IP address and the time of the request
• Method of the request (GET, HEAD, POST, …)
• Status code, the response from the server
• Size in bytes of the transaction
Referrer Log: where the request originated
Agent Log: the browser software making the request (e.g. a spider)
Error Log: requests that resulted in errors (e.g. 404)
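
As an illustrative sketch (not from the slides; the sample line is made up), one entry of a transfer log in the common log format can be parsed in Python like this:

import re

# Common log format: host ident user [time] "method url protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)

line = '192.0.2.1 - - [10/Oct/2003:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
m = LOG_PATTERN.match(line)
if m:
    print(m.group("ip"), m.group("method"), m.group("status"), m.group("size"))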

Server Log Analysis
• Most and least visited web pages
• Entry and exit pages
• Referrals from other sites or search
engines
• Which keywords were searched
• How many clicks/page views a page
received
• Error reports, such as broken links
Search Engines
According to the Pew Internet Project Report (2002),
search engines are the most popular way to
locate information online:
• About 33 million U.S. Internet users query
search engines on a typical day
• More than 80% have used search engines
Search engines are measured by coverage and
recency

Coverage
Overlap analysis is used for estimating the size
of the indexable web:
• W: the set of all web pages
• Wa, Wb: the sets of pages crawled by two independent
engines a and b
• P(Wa), P(Wb): the probabilities that a random page was
crawled by a or b
• P(Wa) = |Wa| / |W|
• P(Wb) = |Wb| / |W|
Overlap Analysis
• P(Wa ∩ Wb | Wb) = P(Wa ∩ Wb) / P(Wb)
                  = |Wa ∩ Wb| / |Wb|
• If a and b are independent:
  P(Wa ∩ Wb) = P(Wa) · P(Wb)
• P(Wa ∩ Wb | Wb) = P(Wa) · P(Wb) / P(Wb)
                  = P(Wa)
                  = |Wa| / |W|

Overlap Analysis
Using |W| = |Wa| / P(Wa), with P(Wa) estimated as
|Wa ∩ Wb| / |Wb|, the researchers found:
• The web had at least 320 million pages in 1997
• About 60% of the web was covered by the six major
engines combined
• The maximum coverage of any single engine was
about 1/3 of the web
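
A small numeric sketch of this capture-recapture style estimate (the counts below are made up for illustration):

# Capture-recapture estimate of the web's size from two crawls
size_a = 100_000         # |Wa|: pages crawled by engine a
size_b = 80_000          # |Wb|: pages crawled by engine b
overlap = 20_000         # |Wa ∩ Wb|: pages crawled by both

p_a = overlap / size_b   # estimate of P(Wa) = |Wa ∩ Wb| / |Wb|
web_size = size_a / p_a  # |W| ≈ |Wa| / P(Wa)
print(int(web_size))     # 400000 pages under these assumptions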

How to Improve the Coverage?
• Meta-search engine: dispatches the user
query to several engines at the same time, then
collects and merges the results into one list
for the user
• Any suggestions?

Web Crawler
• A crawler is a program that picks up a page
and follows all the links on that page
• Crawler = Spider
• Types of crawler:
– Breadth First
– Depth First

Breadth First Crawlers
Uses the breadth-first search (BFS) algorithm
(a minimal sketch follows below):
• Get all links from the starting page and
add them to a queue
• Pick the first link from the queue, get all
links on that page, and add them to the queue
• Repeat the above step until the queue is empty
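
A minimal Python sketch of this strategy; get_links is a hypothetical helper standing in for real page fetching and link extraction:

from collections import deque

def get_links(url):
    """Hypothetical helper: fetch `url` and return the URLs it links to."""
    return []  # stand-in; a real crawler would fetch and parse the page

def bfs_crawl(start_url, max_pages=100):
    queue = deque([start_url])        # FIFO queue of URLs to visit
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()         # pick the first link from the queue
        if url in visited:
            continue
        visited.add(url)
        queue.extend(get_links(url))  # add the page's links to the queue
    return visited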

Depth First Crawlers
Uses the depth-first search (DFS) algorithm
(a minimal sketch follows below):
• Get the first unvisited link from the start
page
• Visit the link and get its first unvisited link
• Repeat the above step until there are no unvisited links left
• Go back to the next unvisited link at the previous
level and repeat from the second step
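
The same sketch with a LIFO stack in place of the FIFO queue, which turns the traversal into depth-first order (get_links is the same hypothetical helper as in the breadth-first sketch):

def dfs_crawl(start_url, max_pages=100):
    stack = [start_url]               # LIFO stack of URLs to visit
    visited = set()
    while stack and len(visited) < max_pages:
        url = stack.pop()             # most recently discovered link first
        if url in visited:
            continue
        visited.add(url)
        stack.extend(get_links(url))  # its links go on top of the stack
    return visited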

WEB CRAWLER
• Spiders or robots that automatically
download web pages
• A crawler can visit many sites to collect
information that can be analyzed and
mined in a central location, either online
(as it is downloaded) or off-line (after it is
stored)

APPLICATIONS OF WEB CRAWLERS

• BUSINESS INTELLIGENCE
• MONITOR WEB SITES AND PAGES OF
INTEREST
• UNIVERSAL CRAWLERS
• PREFERENTIAL CRAWLERS

BASIC CRAWLER ALGORITHM
(figure: flowchart of the crawling loop described on the next slide)
CONTINUED…..
• The crawler maintains a list of unvisited URLs called the frontier.
• The list is initialized with seed URLs, which may be provided by
the user or by another program.
• In each iteration of its main loop, the crawler picks the next URL
from the frontier, fetches the page corresponding to the URL
through HTTP, parses the retrieved page to extract its URLs,
adds the newly discovered URLs to the frontier, and stores the
page (or other extracted information, possibly index terms) in
a local disk repository.
• The crawling process may be terminated when a certain
number of pages have been crawled. The crawler may also
be forced to stop if the frontier becomes empty. A hedged,
self-contained sketch of this loop follows below.
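
The sketch below implements the frontier loop just described, using only Python's standard library; politeness (robots.txt, crawl delays) and robust error handling are omitted, and the seed URLs are whatever the caller supplies:

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attributes of <a> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url, self.links = base_url, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def crawl(seeds, max_pages=10):
    frontier = deque(seeds)       # list of unvisited URLs
    visited, repository = set(), {}
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()  # pick the next URL from the frontier
        if url in visited:
            continue
        visited.add(url)
        try:
            page = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue              # skip unreachable or malformed URLs
        repository[url] = page    # store the page locally
        parser = LinkExtractor(url)
        parser.feed(page)         # parse the page to extract its URLs
        frontier.extend(parser.links)  # add newly discovered URLs
    return repository

Stopping after max_pages corresponds to the termination condition above; an empty frontier also ends the loop.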
