
Basic WWW Technologies



2.1 Web Documents.

2.2 Resource Identifiers: URI, URL, and URN.

2.3 Protocols.

2.4 Log Files.

2.5 Search Engines.

Modeling the Internet and the Web


School of Information and Computer Science
University of California, Irvine
What Is the World Wide Web?
The World Wide Web (the Web) is a network of
information resources. The Web relies on three
mechanisms to make these resources readily
available to the widest possible audience:
1. A uniform naming scheme for locating resources
on the Web (e.g., URIs).
2. Protocols for accessing named resources over
the Web (e.g., HTTP).
3. Hypertext, for easy navigation among resources
(e.g., HTML).

Internet vs. Web
Internet:
• Internet is the more general term
• Includes the physical aspects of the underlying networks
and mechanisms such as email, FTP, HTTP, …
Web:
• Associated with information stored on the
Internet
• Can also refer to a broader class of networks, e.g. the
web of English literature
Both the Internet and the Web are networks
Essential Components of WWW


Resources:
• Conceptual mappings to concrete or abstract entities, which do not
change in the short term
• e.g. the DTU website (web pages and other kinds of files)
Resource identifiers (e.g. hyperlinks):
• Strings of characters that represent generalized addresses and may
contain instructions for accessing the identified resource
• e.g. http://www.ics.uci.edu identifies the ICS homepage
Transfer protocols:
• Conventions that regulate the communication between a browser
(web user agent) and a server

Standard Generalized Markup
Language (SGML)
• Based on GML (Generalized Markup Language),
developed at IBM in the 1960s
• An international standard (ISO 8879:1986) that
defines how descriptive markup should be
embedded in a document
• Gave birth to the Extensible Markup Language
(XML), a W3C Recommendation in 1998

Continued
• A technology that allows additional information to
be added to electronic documents so that the value
of the documents is maximized.
• This added information makes it possible to manage
and access documents, automate their processing, and
detect structural errors.
• Examples of such additional information include
specifications for arranging and formatting the document
for different types of output (electronic vs. paper forms),
cross-reference links to other documents, and a description
of the document's structure that facilitates information
finding and document merging.

CONTINUED…
• Word-processing packages and computer typesetting systems commonly
use procedural markup (Helvetica font, 18 points, etc.), which specifies
how text will be processed or how it will appear in the output.
• In contrast, SGML uses structural markup (example, figure), which
enables the description of structured information independently of how
the information is processed.
• Every SGML-based document requires a description, or set of rules, about
the structured information in the document. This description is called a
document type definition (DTD), and the SGML language provides a
standard syntax for expressing DTDs.
• Any information that is marked up or added to an SGML document
must follow the descriptions in its DTD.
• DTDs are explicit statements of the preferences an author has about a
document during authoring. Documents of the same type can share the
same DTD, and each document written against it is considered a
document instance.
CONTINUED
• The DTD is the core element of an SGML document.
• A DTD consists of definitions of the SGML markup
(tags) allowed in a document.
• The definitions also include a formal description of the
document and of the relationships among its elements
(such as chapters, footnotes, or indices).
• A marked-up unit, an "element", has a start tag <Q>,
content, and an end tag </Q>.
• For example, a title might be marked up with the title tag
as <TITLE>Document Processing</TITLE>. Markup can
describe graphics, images, and other entities as well as
text.
SGML Components
SGML documents have three parts:
• Declaration: specifies which characters and delimiters
may appear in the application. Essentially determines
whether a parser can interpret a document. Invisible to most users
• DTD/style sheet: defines the syntax of the markup
constructs. Usually referenced from the document source
• Document instance: the actual text (with tags) of the
document. One DTD can control thousands of documents.
More info: http://www.w3.org/MarkUp/SGML/

DTD Example One
<!ELEMENT UL - - (LI)+>
• ELEMENT is a keyword that introduces a new
element type, here the unordered list (UL)
• The two hyphens indicate that both the start tag
<UL> and the end tag </UL> for this element
type are required
• The content model (LI)+ specifies that a UL must
contain one or more list items (LI)

DTD Example Two


<!ELEMENT IMG - O EMPTY>


• The element type being declared is IMG
• The hyphen and the following "O" indicate that
the start tag is required and the end tag may be omitted
• Together with the content model EMPTY, this is
strengthened to the rule that the end tag must be
omitted (there is no closing tag)

HTML Background
• HTML was originally developed by Tim Berners-Lee
while at CERN, and popularized by the Mosaic
browser developed at NCSA.
• The Web depends on Web page authors and
vendors sharing the same conventions for HTML.
This has motivated joint work on specifications
for HTML.
• HTML standards are organized by the W3C:
http://www.w3.org/MarkUp/

HTML Functionalities
HTML gives authors the means to:
• Publish online documents with headings, text, tables,
lists, photos, etc.
– Include spreadsheets, video clips, sound clips, and other
applications directly in their documents
• Link information via hypertext links, at the click of a button
• Design forms for conducting transactions with remote
services, for use in searching for information, making
reservations, ordering products, etc.

HTML Versions
• HTML 4.01 is a revision of the HTML 4.0 Recommendation first
released on 18 December 1997.
– HTML 4.01 Specification:
http://www.w3.org/TR/1999/REC-html401-19991224/html40.txt
• HTML 4.0 was first released as a W3C Recommendation on 18
December 1997
• HTML 3.2 was W3C's first Recommendation for HTML, and
represented the consensus on HTML features for 1996
• HTML 2.0 (RFC 1866) was developed by the IETF's HTML
Working Group, which set the standard for core HTML
features based upon current practice in 1994.

Sample Webpage
(figure: screenshot of a rendered sample page; its HTML source is shown on the next slide)
Sample Webpage HTML Structure
<HTML>
<HEAD>
<TITLE>The title of the webpage</TITLE>
</HEAD>
<BODY> <P>Body of the webpage
</BODY>
</HTML>

HTML Structure
• An HTML document is divided into a head section
(here, between <HEAD> and </HEAD>) and a body
(here, between <BODY> and </BODY>)
• The title of the document appears in the head (along
with other information about the document)
• The content of the document appears in the body. The
body in this example contains just one paragraph,
marked up with <P>

HTML Hyperlink


<a href="relations/alumni">alumni</a>
• A link is a connection from one Web resource
to another
• It has two ends, called anchors, and a direction
• Starts at the "source" anchor and points to the
"destination" anchor, which may be any Web
resource (e.g., an image, a video clip, a sound
bite, a program, an HTML document)

Resource Identifiers
• Resources are referred to by means of strings with a
controlled syntax and semantics
• URI: Uniform Resource Identifiers - the general set of
identifiers, focused on the concepts of extensibility and
completeness
• Justified by viewing resources as conceptual mappings to entities
• URL: Uniform Resource Locators - identifiers that explicitly
encode details about the algorithm used to access the
resource
• URN: Uniform Resource Names - those URIs that must
persist and remain unique even if the resource becomes
unavailable
Introduction to URIs
• Every resource available on the Web has an address
that may be encoded by a URI.
• An absolute URI consists of two portions separated
by a colon: a scheme specifier followed by a
second portion whose syntax and semantics
depend on the particular scheme.
• In BNF (Backus–Naur form), the general syntax of
absolute URIs is expressed as
⟨absoluteURI⟩ ::= ⟨scheme⟩:(⟨hierarchicalPart⟩ | ⟨opaquePart⟩)

URI Example

• As an example, three common schemes for identifying Web
resources are http, https and ftp.
• In these three cases, the second part of the URI is hierarchical and
consists of a double slash // followed by a so-called authority
component (e.g. a host name) and optionally by a path and/or a
query.
• For example, the URI
http://bsd.slashdot.org/article.pl?sid=02/08/16/0041250
contains all three components: the authority is bsd.slashdot.org,
the absolute path is /article.pl, and the query is the substring
following the question mark
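
To make the decomposition concrete, here is a minimal sketch (not from the original slides) that splits the example URI into these components with Python's standard urllib.parse module:

from urllib.parse import urlsplit

uri = "http://bsd.slashdot.org/article.pl?sid=02/08/16/0041250"
parts = urlsplit(uri)

print(parts.scheme)  # 'http'                 (scheme specifier)
print(parts.netloc)  # 'bsd.slashdot.org'     (authority component)
print(parts.path)    # '/article.pl'          (absolute path)
print(parts.query)   # 'sid=02/08/16/0041250' (query string)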
Protocols


• Protocols describe how messages are encoded and exchanged
• Protocols are organized in a layered hierarchy:
– The lowest layer is typically related to the physical communication
mechanisms (e.g. it may be concerned with optical fibers or wireless
communication)
– Higher levels are related to the functioning of specific applications,
such as email or file transfer
• For example, a host on the Internet is characterized by a 32-bit address,
independently of whether the physical connection goes through a home DSL
cable, through an office Ethernet LAN, or through an airport wireless network.
• Similarly, email protocols are a lingua franca spoken by clients and servers.
Sending email just involves connecting to a server and using this language.
• All of the details about how the information is actually transmitted to the
recipient are hidden by the hierarchical mechanism of encapsulation, by which
higher-level messages are embedded into a format understood by the
lower-level protocols.
Different layering architectures:
• ISO OSI 7-layer architecture
• TCP/IP 4-layer architecture

ISO OSI Layering Architecture
(figure: the seven-layer OSI stack; the layers are listed on the "ISO Layers" slide below)
ISO’s Design Principles
• A layer should be created where a different level
of abstraction is needed
• Each layer should perform a well-defined function
• The layer boundaries should be chosen to
minimize information flow across the interfaces
• The number of layers should be large enough
that distinct functions need not be thrown
together in the same layer, and small enough that
the architecture does not become unwieldy

ISO Layers

• Physical: concerned with the transmission of electrical or optical
(possibly also acoustic) signals.
• Data link: provides error control and divides data into frames.
• Network: provides routing of packets from source to destination.
• Transport: provides end-to-end reliability (guaranteeing that all
packets reach the destination).
• Session: primitive functions that coordinate the dialog between
applications.
• Presentation: concerned with the transfer syntax used by
applications.
• Application: applications such as file transfer, email, browsing, etc.

TCP/IP Layering Architecture
• A simplified model that provides end-to-end
reliable connections
• The network layer
– Hosts drop packets into this layer; the layer
routes them toward the destination
– Promises only best-effort delivery ("try my best")
• The transport layer
– Provides a reliable byte-oriented stream

Hypertext Transfer Protocol (HTTP)


• An application-level protocol used to carry
WWW traffic between a browser and a server
• Runs over TCP, a connection-oriented
transport-layer protocol supported by the Internet
• HTTP communication is established via a
TCP connection, typically to server port 80

GET Method in HTTP
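
The figure from the original slide is not reproduced here. As a stand-in, the following hedged Python sketch issues a GET request by hand over a TCP connection to server port 80, as described on the previous slide (the host name is illustrative):

import socket

host = "www.example.com"  # illustrative host
request = f"GET / HTTP/1.0\r\nHost: {host}\r\n\r\n"

with socket.create_connection((host, 80)) as sock:
    sock.sendall(request.encode("ascii"))
    response = b""
    while chunk := sock.recv(4096):  # read until the server closes
        response += chunk

print(response.decode("latin-1").split("\r\n")[0])  # status line, e.g. 'HTTP/1.0 200 OK'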

Domain Name System
DNS (Domain Name System): the mapping from domain
names to IP addresses
IPv4:
• IPv4 was initially deployed on 1 January 1983 and
is still the most commonly used version.
• 32-bit addresses, written as a string of 4 decimal numbers
separated by dots, ranging from 0.0.0.0 to
255.255.255.255.
IPv6:
• A revision of IPv4 with 128-bit addresses
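
As a small illustration (a sketch assuming Python's standard socket module; the host name is only an example), the DNS mapping can be exercised like this:

import socket

name = "www.ics.uci.edu"           # domain name (illustrative)
ipv4 = socket.gethostbyname(name)  # resolve via DNS to a dotted-quad IPv4 address
print(name, "->", ipv4)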

Top Level Domains (TLD)
Top-level domain names: .com, .edu, .gov, and the ISO
3166 country codes
There are three types of top-level domains:
• Generic domains were created for use by the Internet
public
• Country code domains were created to be used by
individual countries
• The .arpa (Address and Routing Parameter Area)
domain is designated to be used exclusively for
Internet-infrastructure purposes

Registrars
• Domain names ending with .aero, .biz, .com, .coop,
.info, .museum, .name, .net, .org, or .pro can be
registered through many different companies
(known as "registrars") that compete with one another
• InterNIC: http://internic.net
• Registrars Directory: http://www.internic.net/regist.html
Server Log Files
Server Transfer Log: transactions between a
browser and server are logged
• IP address and the time of the request
• Method of the request (GET, HEAD, POST, …)
• Status code, the response from the server
• Size in bytes of the transaction
Referrer Log: where the request originated
Agent Log: the browser software making the request (e.g. a spider)
Error Log: requests that resulted in errors (e.g. 404)
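
As an illustrative sketch (not from the slides; the sample line is made up), one entry of a transfer log in the common log format can be parsed in Python like this:

import re

# Common log format: host ident user [time] "method url protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)'
)

line = '192.0.2.1 - - [10/Oct/2003:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
m = LOG_PATTERN.match(line)
if m:
    print(m.group("ip"), m.group("method"), m.group("status"), m.group("size"))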

Server Log Analysis
• Most and least visited web pages
• Entry and exit pages
• Referrals from other sites or search
engines
• Which keywords were searched
• How many clicks/page views a page
received
• Error reports, such as broken links
Search Engines
According to the Pew Internet Project Report (2002),
search engines are the most popular way to
locate information online:
• About 33 million U.S. Internet users query
search engines on a typical day
• More than 80% have used search engines
Search engines are measured by coverage and
recency

Coverage
Overlap analysis is used for estimating the size
of the indexable web:
• W: the set of all web pages
• Wa, Wb: the sets of pages crawled by two independent
engines a and b
• P(Wa), P(Wb): the probabilities that a random page was
crawled by a or b
• P(Wa) = |Wa| / |W|
• P(Wb) = |Wb| / |W|
Overlap Analysis
• P(Wa ∩ Wb | Wb) = P(Wa ∩ Wb) / P(Wb)
                  = |Wa ∩ Wb| / |Wb|
• If a and b are independent:
  P(Wa ∩ Wb) = P(Wa) · P(Wb)
• P(Wa ∩ Wb | Wb) = P(Wa) · P(Wb) / P(Wb)
                  = P(Wa)
                  = |Wa| / |W|

Overlap Analysis
Using |W| = |Wa| / P(Wa), with P(Wa) estimated as
|Wa ∩ Wb| / |Wb|, the researchers found:
• The web had at least 320 million pages in 1997
• About 60% of the web was covered by the six major
engines combined
• The maximum coverage of any single engine was
about 1/3 of the web
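
A small numeric sketch of this capture-recapture style estimate (the counts below are made up for illustration):

# Capture-recapture estimate of the web's size from two crawls
size_a = 100_000         # |Wa|: pages crawled by engine a
size_b = 80_000          # |Wb|: pages crawled by engine b
overlap = 20_000         # |Wa ∩ Wb|: pages crawled by both

p_a = overlap / size_b   # estimate of P(Wa) = |Wa ∩ Wb| / |Wb|
web_size = size_a / p_a  # |W| ≈ |Wa| / P(Wa)
print(int(web_size))     # 400000 pages under these assumptions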

How to Improve the Coverage?
• Meta-search engine: dispatches the user
query to several engines at the same time, then
collects and merges the results into one list
for the user
• Any suggestions?

Web Crawler
• A crawler is a program that picks up a page
and follows all the links on that page
• Crawler = Spider
• Types of crawler:
– Breadth First
– Depth First

Breadth First Crawlers
Uses the breadth-first search (BFS) algorithm
(a minimal sketch follows below):
• Get all links from the starting page and
add them to a queue
• Pick the first link from the queue, get all
links on that page, and add them to the queue
• Repeat the above step until the queue is empty
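
A minimal Python sketch of this strategy; get_links is a hypothetical helper standing in for real page fetching and link extraction:

from collections import deque

def get_links(url):
    """Hypothetical helper: fetch `url` and return the URLs it links to."""
    return []  # stand-in; a real crawler would fetch and parse the page

def bfs_crawl(start_url, max_pages=100):
    queue = deque([start_url])        # FIFO queue of URLs to visit
    visited = set()
    while queue and len(visited) < max_pages:
        url = queue.popleft()         # pick the first link from the queue
        if url in visited:
            continue
        visited.add(url)
        queue.extend(get_links(url))  # add the page's links to the queue
    return visited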

Depth First Crawlers
Uses the depth-first search (DFS) algorithm
(a minimal sketch follows below):
• Get the first unvisited link from the start
page
• Visit the link and get its first unvisited link
• Repeat the above step until there are no unvisited links left
• Go back to the next unvisited link at the previous
level and repeat from the second step
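
The same sketch with a LIFO stack in place of the FIFO queue, which turns the traversal into depth-first order (get_links is the same hypothetical helper as in the breadth-first sketch):

def dfs_crawl(start_url, max_pages=100):
    stack = [start_url]               # LIFO stack of URLs to visit
    visited = set()
    while stack and len(visited) < max_pages:
        url = stack.pop()             # most recently discovered link first
        if url in visited:
            continue
        visited.add(url)
        stack.extend(get_links(url))  # its links go on top of the stack
    return visited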

WEB CRAWLER
• Spiders or robots that automatically
download web pages
• A crawler can visit many sites to collect
information that can be analyzed and
mined in a central location, either online
(as it is downloaded) or off-line (after it is
stored)

APPLICATIONS OF WEB CRAWLERS

• BUSINESS INTELLIGENCE
• MONITOR WEB SITES AND PAGES OF
INTEREST
• UNIVERSAL CRAWLERS
• PREFERENTIAL CRAWLERS

BASIC CRAWLER ALGORITHM
(figure: flowchart of the crawling loop described on the next slide)
CONTINUED…..
• The crawler maintains a list of unvisited URLs called the frontier.
• The list is initialized with seed URLs, which may be provided by
the user or by another program.
• In each iteration of its main loop, the crawler picks the next URL
from the frontier, fetches the page corresponding to the URL
through HTTP, parses the retrieved page to extract its URLs,
adds the newly discovered URLs to the frontier, and stores the
page (or other extracted information, possibly index terms) in
a local disk repository.
• The crawling process may be terminated when a certain
number of pages have been crawled. The crawler may also
be forced to stop if the frontier becomes empty. A hedged,
self-contained sketch of this loop follows below.
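
The sketch below implements the frontier loop just described, using only Python's standard library; politeness (robots.txt, crawl delays) and robust error handling are omitted, and the seed URLs are whatever the caller supplies:

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects absolute URLs from the href attributes of <a> tags."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url, self.links = base_url, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def crawl(seeds, max_pages=10):
    frontier = deque(seeds)       # list of unvisited URLs
    visited, repository = set(), {}
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()  # pick the next URL from the frontier
        if url in visited:
            continue
        visited.add(url)
        try:
            page = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue              # skip unreachable or malformed URLs
        repository[url] = page    # store the page locally
        parser = LinkExtractor(url)
        parser.feed(page)         # parse the page to extract its URLs
        frontier.extend(parser.links)  # add newly discovered URLs
    return repository

Stopping after max_pages corresponds to the termination condition above; an empty frontier also ends the loop.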
