Exer 5 Ducusin

Deep Web Means
The "deep web" (aka "deep net", aka "the hidden web") refers to the web sites that are not accessible via search engines (Google, Bing, etc.).
There are certain measures that a webmaster can take in order to ensure that their sites are not indexable by the search engines; in many cases (but not all), webmasters make their sites inaccessible to search engines for a reason.
Sites in the "deep web" can include web sites that are readily accessible if accessed directly but not indexed by search engines, web sites that are password protected, and web sites that are only accessible via an anonymity network, such as the Tor network.
For instance, the site "Silk Road", which was a marketplace for drugs and other illegal items, was only accessible via the Tor network. The Tor network, according to their site, is a "network of virtual tunnels that allows people and groups to improve their privacy and security on the Internet".
Information
Size
It is difficult to measure the size of the deep web precisely, because the majority of the information is hidden or locked inside databases. Early estimates suggested that the deep web is 4,000 to 5,000 times larger than the surface web. However, since more information and sites are always being added, it can be assumed that the deep web is growing exponentially at a rate that cannot be quantified.
Estimates based on extrapolations from a study done at University of California, Berkeley in 2001 speculate that the deep web consists of about 7.5 petabytes. More accurate estimates are available for the number of resources in the deep Web: research of He et al. detected around 300,000 deep web sites in the entire Web in 2004, and, according to Shestakov, around 14,000 deep web sites existed in the Russian part of the Web in 2006.
Naming
Bergman, in a seminal paper on the deep Web published in The Journal of Electronic Publishing, mentioned that Jill Ellsworth used the term invisible Web in 1994 to refer to websites that were not registered with any search engine. Bergman cited a January 1996 article by Frank Garcia:
It would be a site that's possibly reasonably designed, but they didn't bother to register it with any of the search engines. So, no one can find them! You're hidden. I call that the invisible Web.
Another early use of the term Invisible Web was by Bruce Mount and Matthew B. Koll of Personal Library Software, in a description of the @1 deep Web tool found in a December 1996 press release.
The first use of the specific term Deep Web, now generally accepted, occurred in the aforementioned 2001 Bergman study.
Deep resources
Deep Web resources may be classified into one or more of the following categories:
• Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form, especially if open-domain input elements (such as text fields) are used; such fields are hard to navigate without domain knowledge.
• Unlinked content: pages which are not linked to by other pages, which may prevent Web crawling programs from accessing the content. This content is referred to as pages without backlinks (or inlinks).
• Private Web: sites that require registration and login (password-protected resources).
• Contextual Web: pages with content varying for different access contexts (e.g., ranges of client IP addresses or previous navigation sequence).
• Limited access content: sites that limit access to their pages in a technical way (e.g., using the Robots Exclusion Standard, CAPTCHAs, or no-cache Pragma HTTP headers, which prohibit search engines from browsing them and creating cached copies). [13]
• Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions.
• Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines.
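The Robots Exclusion Standard named under limited-access content is a plain-text file (robots.txt) that a well-behaved crawler reads before fetching pages. The following is a minimal sketch of that check using Python's standard urllib.robotparser module. The robots.txt rules, the crawler name ExampleBot, and the example.com URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A page outside /private/ may be crawled; one inside it may not.
    print(can_crawl(ROBOTS_TXT, "ExampleBot", "http://example.com/index.html"))
    print(can_crawl(ROBOTS_TXT, "ExampleBot", "http://example.com/private/report.html"))
```

Pages under a disallowed path stay reachable for anyone who knows the address, but a compliant crawler will skip them, so they do not appear in search results.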
Accessing the Deep Web
While it is not always possible to discover a specific web server's external IP address, theoretically almost any site can be accessed via its IP address, regardless of whether or not it has been indexed.
Certain content is intentionally hidden from the regular internet, accessible only with special software, such as Tor. Tor allows users to access websites using the .onion host suffix anonymously, hiding their IP address. Other such software includes I2P and Freenet.
In 2008, in order to facilitate user access and search engine indexing of hidden services using the .onion suffix, Aaron Swartz designed Tor2web, a proxy software able to provide access to Tor hidden services by means of common web browsers. [14]
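A gateway of this kind is typically reached by rewriting the hidden service's hostname so that an ordinary browser can resolve it. The sketch below only illustrates that rewriting step. It assumes a gateway that serves name.onion at name.onion.to; the function name, the default suffix, and the sample address are hypothetical, and real gateways may use a different scheme.

```python
from urllib.parse import urlsplit, urlunsplit


def to_gateway_url(onion_url: str, gateway_suffix: str = "to") -> str:
    """Rewrite a .onion URL so that it points at a Tor2web-style gateway.

    Assumption: the gateway serves "<name>.onion" at "<name>.onion.<suffix>".
    """
    parts = urlsplit(onion_url)
    host = parts.hostname or ""
    if not host.endswith(".onion"):
        raise ValueError("not a .onion address: " + onion_url)
    netloc = host + "." + gateway_suffix
    if parts.port:
        netloc += ":" + str(parts.port)
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


if __name__ == "__main__":
    # The address below is a placeholder, not a real hidden service.
    print(to_gateway_url("http://exampleonionaddr.onion/wiki?x=1"))
```

Only the host part changes; the path and query string are passed through untouched, which is what lets an ordinary link-following crawler index pages behind such a gateway.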
To discover content on the Web, search engines use web crawlers that follow hyperlinks through known protocol virtual port numbers. This technique is ideal for discovering resources on the surface Web but is often ineffective at finding Deep Web resources. For example, these crawlers do not attempt to find dynamic pages that are the result of database queries, due to the indeterminate number of queries that are possible. It has been noted that this can be (partially) overcome by providing links to query results, but this could unintentionally inflate the popularity of a member of the deep Web.
DeepPeep, Intute, Deep Web Technologies, Scirus, and Ahmia.fi are a few search engines that have accessed the Deep Web. Intute ran out of funding and has been a temporary static archive since July 2011. Scirus was retired near the end of January 2013.
Classifying resources
Automatically determining if a Web resource is a member of the surface Web or the deep Web is difficult. If a resource is indexed by a search engine, it is not necessarily a member of the surface Web, because the resource could have been found using another method (e.g., the Sitemap Protocol, mod_oai, OAIster) instead of traditional crawling. If a search engine provides a backlink for a resource, one may assume that the resource is in the surface Web. Unfortunately, search engines do not always provide all backlinks to resources. Furthermore, a resource may reside in the surface Web even though it has yet to be found by a search engine.
Most of the work of classifying search results has been in categorizing the surface Web by topic. For classification of deep Web resources, Ipeirotis et al. presented an algorithm that classifies a deep Web site into the category that generates the largest number of hits for some carefully selected, topically-focused queries.
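The idea in the sentence above can be sketched in a few lines: send a handful of topic-specific probe queries to a site's search form, add up the hit counts per topic, and pick the topic with the largest total. This is only a simplified illustration of that one sentence, not the published algorithm. The probe queries, the category names, and the fake_count_hits stand-in for a real search form are all assumptions.

```python
from typing import Callable, Dict, List


def classify_by_probes(
    count_hits: Callable[[str], int],
    probes: Dict[str, List[str]],
) -> str:
    """Return the category whose probe queries produce the most hits in total."""
    totals = {
        category: sum(count_hits(query) for query in queries)
        for category, queries in probes.items()
    }
    return max(totals, key=totals.get)


# Hypothetical hit counts that a medical database might report.
FAKE_SITE_HITS = {"heart disease": 120, "vaccine": 80, "cheap flights": 3, "hotel booking": 1}


def fake_count_hits(query: str) -> int:
    """Stand-in for submitting a query to the site's search form."""
    return FAKE_SITE_HITS.get(query, 0)


# Hypothetical topically focused probe queries for two categories.
PROBES = {
    "health": ["heart disease", "vaccine"],
    "travel": ["cheap flights", "hotel booking"],
}

if __name__ == "__main__":
    print(classify_by_probes(fake_count_hits, PROBES))
```

Because the classifier sees only hit counts, it never needs to download the database itself, which is what makes the approach workable for content hidden behind a search form.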
Deep Web directories under development include OAIster at the University of Michigan, Intute at the University of Manchester, Infomine [25] at the University of California at Riverside, and DirectSearch (by Gary Price). This classification poses a challenge while searching the deep Web whereby two levels of categorization are required. The first level is to categorize sites into vertical topics (e.g., health, travel, automobiles) and sub-topics according to the nature of the content underlying their databases.
The more difficult challenge is to categorize and map the information extracted from multiple deep Web sources according to end-user needs. Deep Web search reports cannot display URLs like traditional search reports. End users expect their search tools not only to find what they are looking for, but also to be intuitive and user-friendly. In order to be meaningful, the search reports have to offer some depth as to the nature of the content that underlies the sources, or else the end user will be lost in a sea of URLs that do not indicate what content lies beneath them. The format in which search results are to be presented varies widely by the particular topic of the search and the type of content being exposed. The challenge is to find and map similar data elements from multiple disparate sources so that search results may be exposed in a unified format on the search report irrespective of their source.
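Mapping similar data elements from different sources into one format can be pictured as a per-source table of field names. The sketch below is a toy illustration under that assumption. The source names, field names, and sample records are invented, and real deep Web sources would need far more careful matching.

```python
from typing import Any, Dict

# Hypothetical mapping from each source's own field names to unified names.
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "source_a": {"title": "title", "writer": "author", "yr": "year"},
    "source_b": {"name": "title", "author_name": "author", "published": "year"},
}


def unify(source: str, record: Dict[str, Any]) -> Dict[str, Any]:
    """Rename the fields of one raw record into the unified format."""
    mapping = FIELD_MAPS[source]
    return {
        unified_name: record[raw_name]
        for raw_name, unified_name in mapping.items()
        if raw_name in record
    }


if __name__ == "__main__":
    # Two differently shaped records end up with the same keys.
    print(unify("source_a", {"title": "Report", "writer": "Lee", "yr": 2001}))
    print(unify("source_b", {"name": "Survey", "author_name": "Cruz", "published": 2004}))
```

Once every record has the same keys, results from different databases can be listed side by side in a single report.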
How to search the Deep Web?
A brand-new search engine named TorSearch debuted last week, making waves around tech media. It's not hard to see why: The Oct. 2 seizure of Silk Road captivated netizens by exposing a corner of cyberspace that didn't appear on Google and was difficult to navigate unless you already knew what you were looking for.
If TorSearch founder Chris MacNaughton has truly built a new engine that can help us find our way around the Deep Web, that's genuinely exciting. But does it deliver?
We took it for a test drive and then pitted it against some of its competitors. One spoiler: TorSearch has a long way to go before it becomes the best way to search the Deep Web.
Evil Wiki: Without a doubt, this is the single best entry point into the world of Tor. The well-maintained website provides an organized list of links to hidden services with explanations and even reviews. It's not meant to be used as a search engine, but it often is.
TorSearch: A new search engine that has garnered some buzz in publications like VentureBeat. It operates in much the same way as Google, with a link-crawling spider that will forever build its arsenal.
Google: With proxy tools like Onion.to, Google actually crawls much of the Deep Web in a roundabout way. And because it's so popular, it's the first tool that almost anyone who hears about the Deep Web uses.
DuckDuckGo: Similar to Google but with one significant difference, DuckDuckGo offers anonymous search, a feature in keeping with Tor's powers of anonymity. It's no surprise that it's popular among the Tor crowd.
Torch: An older Deep Web search engine, Torch has existed for a long time but with little fanfare.
Search: Silk Road replacement
In light of Silk Road's seizure on Oct. 2, many users have been searching for a substitute. Let's see who comes up with the best one.
Winner: Google immediately pointed me to Black Market Reloaded's subreddit, a hub on social news site Reddit for one of the major Deep Web black markets. It also gave me a long list of articles offering other replacements. Other services gave me nonsensical or outdated results. TorSearch returned Silk Road itself as a top result.
Search: Black Market Reloaded
If you already know the name of a new market, you still need to find the address to reach it. Which search engine can get me where I want to go?
The Deep Web (or Invisible web) is the set of information resources on the World Wide Web not reported by normal search engines.
According to several studies, the principal search engines index only a small portion of the overall web content; the remaining part is unknown to the majority of web users.
What would you think if you were told that under our feet there is a world larger than ours and much more crowded? You would probably be shocked, and this is the reaction of those who come to understand the existence of the Deep Web: a network of interconnected systems that are not indexed, with a size hundreds of times larger than the current web (around 500 times).
A very effective definition was provided by the founder of BrightPlanet, Mike Bergman, who compared searching on the Internet today to dragging a net across the surface of the ocean: a great deal may be caught in the net, but there is a wealth of information that is deep and therefore missed.
Ordinary search engines find content on the web using software called "crawlers". This technique is ineffective for finding the hidden resources of the Web, which can be classified into the following categories:
• Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form, especially if open-domain input elements (such as text fields) are used; such fields are hard to navigate without domain knowledge.
• Unlinked content: pages which are not linked to by other pages, which may prevent Web crawling programs from accessing the content. This content is referred to as pages without backlinks (or inlinks).
• Private Web: sites that require registration and login (password-protected resources).
• Contextual Web: pages with content varying for different access contexts (e.g., ranges of client IP addresses or previous navigation sequence).
• Limited access content: sites that limit access to their pages in a technical way (e.g., using the Robots Exclusion Standard, CAPTCHAs, or no-cache Pragma HTTP headers, which prohibit search engines from browsing them and creating cached copies).
• Scripted content: pages that are only accessible through links produced by JavaScript, as well as content dynamically downloaded from Web servers via Flash or Ajax solutions.
• Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines.
• Text content using the Gopher protocol and files hosted on FTP that are not indexed by most search engines. Engines such as Google do not index pages outside of HTTP or HTTPS.
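Several of the categories above stay hidden for the same reason: a crawler discovers new pages only by following the hyperlinks it finds. The following is a minimal sketch of that step using Python's standard html.parser module. The sample page and its paths are invented. It collects the links a crawler could follow and, separately, the search forms it would normally leave alone.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class LinkExtractor(HTMLParser):
    """Collect hyperlink targets and form actions from one HTML page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []
        self.forms: List[str] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])
        elif tag == "form":
            self.forms.append(attributes.get("action") or "")


# Hypothetical page: two ordinary links and one search form.
PAGE = """
<html><body>
<a href="/about.html">About</a>
<a href="/catalog/page1.html">Catalog</a>
<form action="/search" method="get"><input type="text" name="q"></form>
</body></html>
"""


def extract(html: str) -> Tuple[List[str], List[str]]:
    """Return (links a crawler can follow, forms it cannot fill in)."""
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links, parser.forms


if __name__ == "__main__":
    links, forms = extract(PAGE)
    print("can follow:", links)
    print("left unexplored:", forms)
```

Everything reachable only by typing a query into the form at /search never enters the crawl queue, which is how dynamic content ends up in the deep Web.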
A parallel web holding a much larger amount of information is an invaluable resource for private companies, governments, and especially cybercrime. In the imagination of many people, the term Deep Web is associated with the concept of anonymity that goes with criminal intents that cannot be pursued because they are submerged in an inaccessible world.
As we will see, this interpretation of the Deep Web is deeply wrong: we are facing a network that is definitely different from the usual web but that in many ways repeats the same issues in a different sense.
Why isn't everything visible?
There are still some hurdles search engine crawlers cannot leap. Here are some examples of material that remains hidden from general search engines:
• The Contents of Searchable Databases. When you search in a library catalog, article database, statistical database, etc., the results are generated "on the fly" in answer to your search. Because the crawler programs cannot type or think, they cannot enter passwords on a login screen or keywords in a search box. Thus, these databases must be searched separately.
   o A special case: Google Scholar is part of the public or visible web. It contains citations to journal articles and other publications, with links to publishers or other sources where one can try to access the full text of the items. This is convenient, but results in Google Scholar are only a small fraction of all the scholarly publications that exist online. Much more, including most of the full text, is available through article databases that are part of the invisible web. The UC Berkeley Library subscribes to over 200 of these, accessible to our students, faculty, staff, and on-campus visitors through our Find Articles page.
• Excluded Pages. Search engine companies exclude some types of pages by policy, to avoid cluttering their databases with unwanted content.
   o Dynamically generated pages of little value beyond single use. Think of the billions of possible web pages generated by searches for books in library catalogs, public-record databases, etc. Each of these is created in response to a specific need. Search engines do not want all these pages in their web databases, since they generally are not of broad interest.
   o Pages deliberately excluded by their owners. A web page creator who does not want his/her page showing up in search engines can insert special "meta tags" that will not display on the screen, but will cause most search engines' crawlers to avoid the page.
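The text does not say which meta tags are meant. A common convention is a robots meta tag such as <meta name="robots" content="noindex">, and the sketch below assumes that form. It shows how a crawler might check a page for the tag before indexing it; the sample pages are invented.

```python
from html.parser import HTMLParser
from typing import Set


class RobotsMetaParser(HTMLParser):
    """Collect the directives of any <meta name="robots"> tag on a page."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def is_indexable(html: str) -> bool:
    """Return False when the page asks crawlers not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" not in parser.directives and "none" not in parser.directives


# Hypothetical pages: one opts out of indexing, the other does not.
HIDDEN_PAGE = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
PLAIN_PAGE = "<html><head><title>Hello</title></head></html>"

if __name__ == "__main__":
    print(is_indexable(HIDDEN_PAGE))
    print(is_indexable(PLAIN_PAGE))
```

The tag sits in the page's head section, so visitors never see it, yet a crawler that honors it will leave the page out of its index.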
How to Find the Invisible Web
Simply think "databases" and keep your eyes open. You can find searchable databases containing invisible web pages in the course of routine searching in most general web directories. Of particular value in academic research are:
ipl2
Infomine
Use Google and other search engines to locate searchable databases by searching a subject term and the word "database". If the database uses the word database in its own pages, you are likely to find it in Google. The word "database" is also useful in searching a topic in the Yahoo! directory, because they sometimes use the term to describe searchable databases in their listings.
Ducusin, Chris
BSIT1
