"deep net", aka "the Hidden web") refers to the web sites that are not accessible via search engines (Google, Bing, etc) There are certain !eas"res that a web!aster can take in order to ens"re that their sites are not inde# able b$ the search engines % in !an$ cases (b"t not all), web!asters !ake their sites inaccessible to search engines for a reason &ites in the "deep web" can incl"de web sites that are readil$ accessible if accessed directl$ b"t not inde#ed b$ search engines, web sites that are password protected and web sites that are onl$ accessible via an anon$!it$ network, s"ch as the Tor network 'or instance % the site "&ilk (oad", which was a !arketplace for dr"gs and other illegal ite!s, was onl$ accessible via the Tor network The Tor network, according to their site, is a "network of virt"al t"nnels that allows people and gro"ps to i!prove their privac$ and sec"rit$ on the )nternet" )nfor!ation*s+ Size )t is i!possible to !eas"re or p"t esti!ates onto the si,e of the deep web beca"se the !a-orit$ of the infor!ation is hidden or locked inside databases .arl$ esti!ates s"ggested that the deep web is /,000 to 1,000 ti!es larger than the s"rface web However, since !ore infor!ation and sites are alwa$s being added, it can be ass"!ed that the deep web is growing e#ponentiall$ at a rate that cannot be 2"anti3ed .sti!ates based on e#trapolations fro! a st"d$ done at 4niversit$ of 5alifornia, Berkele$ in 6007,spec"late that the deep web consists of abo"t 81 pet b$tes More acc"rate esti!ates are available for the n"!ber of reso"rces in the deep Web+ research of He et al detected aro"nd 900,000 deep web sites in the entire Web in 600/, : and, according to &hestakov, aro"nd 7/,000 deep web sites e#isted in the ("ssian part of the Web in 600; Naming Berg!an, in a se!inal paper on the deep Web p"blished in The Journal of Electronic Publishing, !entioned that <ill .llsworth "sed the ter! 
invisible Web in 1994 to refer to websites that were not registered with any search engine. Bergman cited a January 1996 article by Frank Garcia:

"It would be a site that's possibly reasonably designed, but they didn't bother to register it with any of the search engines. So, no one can find them! You're hidden. I call that the invisible Web."

Another early use of the term Invisible Web was by Bruce Mount and Matthew B. Koll of Personal Library Software, in a description of the #1 deep Web tool found in a December 1996 press release. The first use of the specific term Deep Web, now generally accepted, occurred in the aforementioned 2001 Bergman study.

Deep resources

Deep Web resources may be classified into one or more of the following categories:

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form, especially if open-domain input elements (such as text fields) are used; such fields are hard to navigate without domain knowledge.
Unlinked content: pages which are not linked to by other pages, which may prevent Web crawling programs from accessing the content. This content is referred to as pages without backlinks (or inlinks).
Private Web: sites that require registration and login (password-protected resources).
Contextual Web: pages with content varying for different access contexts (e.g., ranges of client IP addresses or previous navigation sequence).
Limited access content: sites that limit access to their pages in a technical way (e.g., using the Robots Exclusion Standard, CAPTCHAs, or no-cache Pragma HTTP headers, which prohibit search engines from browsing them and creating cached copies) [13].
Scripted content: pages that are only accessible through links produced by JavaScript as well as content dynamically downloaded from
Web servers via Flash or Ajax solutions.
Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines.

Accessing the Deep Web

While it is not always possible to discover a specific web server's external IP address, theoretically almost any site can be accessed via its IP address, regardless of whether or not it has been indexed. Certain content is intentionally hidden from the regular internet, accessible only with special software, such as Tor. Tor allows users to access websites using the .onion host suffix anonymously, hiding their IP address. Other such software includes I2P and Freenet. In 2008, in order to facilitate user access and search engine indexing of hidden services using the .onion suffix, Aaron Swartz designed Tor2web, a proxy software able to provide access to Tor hidden services by means of common web browsers [14].

To discover content on the Web, search engines use web crawlers that follow hyperlinks through known protocol virtual port numbers. This technique is ideal for discovering resources on the surface Web but is often ineffective at finding Deep Web resources. For example, these crawlers do not attempt to find dynamic pages that are the result of database queries, due to the indeterminate number of queries that are possible. It has been noted that this can be (partially) overcome by providing links to query results, but this could unintentionally inflate the popularity of a member of the deep Web.

DeepPeep, Intute, Deep Web Technologies, Scirus, and Ahmia.fi are a few search engines that have accessed the Deep Web. Intute ran out of funding and is now a temporary static archive as of July 2011. Scirus retired near the end of January 2013.

Classifying resources

Automatically determining if a Web resource is a member of the surface Web or the deep Web is difficult. If a resource is indexed by a search engine, it is not necessarily a member of the surface Web, because the resource could have been
found using another method (e.g., the Sitemap Protocol, mod_oai, OAIster) instead of traditional crawling. If a search engine provides a backlink for a resource, one may assume that the resource is in the surface Web. Unfortunately, search engines do not always provide all backlinks to resources. Furthermore, a resource may reside in the surface Web even though it has yet to be found by a search engine.

Most of the work of classifying search results has been in categorizing the surface Web by topic. For classification of deep Web resources, Ipeirotis et al. presented an algorithm that classifies a deep Web site into the category that generates the largest number of hits for some carefully selected, topically-focused queries. Deep Web directories under development include OAIster at the University of Michigan, Intute at the University of Manchester, Infomine [25]
at the University of California at Riverside, and DirectSearch (by Gary Price). This classification poses a challenge while searching the deep Web, whereby two levels of categorization are required. The first level is to categorize sites into vertical topics (e.g., health, travel, automobiles) and sub-topics according to the nature of the content underlying their databases.

The more difficult challenge is to categorize and map the information extracted from multiple deep Web sources according to end-user needs. Deep Web search reports cannot display URLs like traditional search reports. End users expect their search tools not only to find what they are looking for, but also to be intuitive and user-friendly. In order to be meaningful, the search reports have to offer some depth on the nature of the content that underlies the sources, or else the end user will be lost in a sea of URLs that do not indicate what content lies beneath them. The format in which search results are to be presented varies widely by the particular topic of the search and the type of content being exposed. The challenge is to find and map similar data elements from multiple disparate sources so that search results may be exposed in a unified format on the search report irrespective of their source.

How to search the Deep Web?

A brand-new search engine named TorSearch debuted last week, making waves around tech media. It's not hard to see why: the Oct. 2 seizure of Silk Road captivated netizens by exposing a corner of cyberspace that didn't appear on Google and was difficult to navigate unless you already knew what you were looking for. If TorSearch founder Chris MacNaughton has truly built a new engine that can help us find our way around the Deep Web, that's genuinely exciting. But does it deliver? We took it for a test drive and then pitted it against some of its competitors. One spoiler: TorSearch has a long way to go before it becomes the best way to search the Deep Web.

Evil Wiki:
Without a doubt, this is the single best entry point into the world of Tor. The well-maintained website provides an organized list of links to hidden services with explanations and even reviews. It's not meant to be used as a search engine, but it often is.
TorSearch: A new search engine that has garnered some buzz in publications like VentureBeat. It operates in much the same way as Google, with a link-crawling spider that will forever build its arsenal.
Google: With proxy tools like Onion.to, Google actually crawls much of the Deep Web in a roundabout way. And because it's so popular, it's the first tool that almost anyone who hears about the Deep Web uses.
DuckDuckGo: Similar to Google but with one significant difference, DuckDuckGo offers anonymous search, a feature in keeping with Tor's powers of anonymity. It's no surprise that it's popular among the Tor crowd.
Torch: An older Deep Web search engine, Torch has existed for a long time but with little fanfare.

Search: Silk Road replacement
In light of Silk Road's seizure on Oct. 2, many users have been searching for a substitute. Let's see who comes up with the best one.
Winner: Google immediately pointed me to Black Market Reloaded's subreddit, a hub on social news site Reddit for one of the major Deep Web black markets. It also gave me a long list of articles offering other replacements. Other services gave me nonsensical or outdated results. TorSearch returned Silk Road itself as a top result.

Search: Black Market Reloaded
If you already know the name of a new market, you still need to find the address to reach it. Which search engine can get me where I want to go?

The Deep Web (or Invisible web) is the set of information resources on the World Wide Web not reported by normal search engines.

According to several studies, the principal search engines index only a small portion of the overall web content; the remaining part is unknown to the majority of web users. What would you think if you were told that under our feet, there is a world
larger than ours and much more crowded? You would probably be shocked, and that is the reaction of people who first grasp the existence of the Deep Web, a network of interconnected systems that are not indexed and whose size is hundreds of times (around 500 times) that of the visible web.

A very thorough definition is provided by the founder of BrightPlanet, Mike Bergman, who compared searching on the Internet today to dragging a net across the surface of the ocean: a great deal may be caught in the net, but there is a wealth of information that is deep and therefore missed.

Ordinary search engines find content on the web using software called "crawlers". This technique is ineffective for finding the hidden resources of the Web, which can be classified into the following categories:

Dynamic content: dynamic pages which are returned in response to a submitted query or accessed only through a form, especially if open-domain input elements (such as text fields) are used; such fields are hard to navigate without domain knowledge.
Unlinked content: pages which are not linked to by other pages, which may prevent Web crawling programs from accessing the content. This content is referred to as pages without backlinks (or inlinks).
Private Web: sites that require registration and login (password-protected resources).
Contextual Web: pages with content varying for different access contexts (e.g., ranges of client IP addresses or previous navigation sequence).
Limited access content: sites that limit access to their pages in a technical way (e.g., using the Robots Exclusion Standard, CAPTCHAs, or no-cache Pragma HTTP headers, which prohibit search engines from browsing them and creating cached copies).
Scripted content: pages that are only accessible through links produced by JavaScript as well as content dynamically downloaded from
Web servers via Flash or Ajax solutions.
Non-HTML/text content: textual content encoded in multimedia (image or video) files or specific file formats not handled by search engines.
Text content using the Gopher protocol and files hosted on FTP that are not indexed by most search engines. Engines such as Google do not index pages outside of HTTP or HTTPS.

A parallel web holding a far larger amount of information is an invaluable resource for private companies, governments, and especially cybercrime. In many people's imagination, the term Deep Web is associated with anonymity in the service of criminal intent that cannot be pursued because it is submerged in an inaccessible world. As we will see, this interpretation of the Deep Web is deeply wrong: we are facing a network that is definitely different from the usual web but that in many ways repeats the same issues in a different form.

Why isn't everything visible?

There are still some hurdles search engine crawlers cannot leap. Here are some examples of material that remains hidden from general search engines:

The Contents of Searchable Databases. When you search in a library catalog, article database, statistical database, etc., the results are generated "on the fly" in answer to your search. Because the crawler programs cannot type or think, they cannot enter passwords on a login screen or keywords in a search box. Thus, these databases must be searched separately.
o A special case:
Google Scholar is part of the public or visible web. It contains citations to journal articles and other publications, with links to publishers or other sources where one can try to access the full text of the items. This is convenient, but results in Google Scholar are only a small fraction of all the scholarly publications that exist online. Much more, including most of the full text, is available through article databases that are part of the invisible web. The UC Berkeley Library subscribes to over 200 of these, accessible to our students, faculty, staff, and on-campus visitors through our Find Articles page.

Excluded Pages. Search engine companies exclude some types of pages by policy, to avoid cluttering their databases with unwanted content.
o Dynamically generated pages of little value beyond single use. Think of the billions of possible web pages generated by searches for books in library catalogs, public-record databases, etc. Each of these is created in response to a specific need. Search engines do not want all these pages in their web databases, since they generally are not of broad interest.
o Pages deliberately excluded by their owners. A web page creator who does not want his/her page showing up in search engines can insert special "meta tags" that will not display on the screen, but will cause most search engines' crawlers to avoid the page.

How to Find the Invisible Web

Simply think "databases" and keep your eyes open. You can find searchable databases containing invisible web pages in the course of routine searching in most general web directories. Of particular value in academic research are:
ipl2
Infomine
Use Google and other search engines to locate searchable databases by searching a subject term and the word "database". If the database uses the word database in its own pages, you are likely to find it in Google. The word "database" is also useful in searching a topic in the Yahoo! directory, because they sometimes use the term
to describe searchable databases in their listings.

Ducusin, Chris BSIT1
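
The two owner-side exclusion mechanisms described above, the Robots Exclusion Standard and robots "meta tags", can be illustrated with a short sketch. This is only an illustration under stated assumptions, not how any particular search engine works: the function names allowed_by_robots_txt and may_index, the user agent "ExampleBot", and the example.com URLs are hypothetical. Only the Python standard-library modules urllib.robotparser and html.parser are real.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            # The content attribute is a comma-separated directive list.
            for token in attr.get("content", "").split(","):
                self.directives.add(token.strip().lower())


def allowed_by_robots_txt(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def may_index(html):
    """Return False if the page carries a robots meta tag with 'noindex'."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" not in parser.directives


if __name__ == "__main__":
    # Hypothetical robots.txt and page, used only for demonstration.
    robots = "User-agent: *\nDisallow: /private/\n"
    page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
    print(allowed_by_robots_txt(robots, "ExampleBot", "https://example.com/private/report.html"))
    print(may_index(page))
```

A well-behaved crawler would run both checks before adding a page to its index. Neither mechanism technically blocks access; both rely on the crawler choosing to comply, which is why such pages stay reachable by direct link while remaining absent from search results.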