ABSTRACT
Spammers use questionable search engine optimization (SEO) techniques to promote their spam links into top search results. In this paper, we focus on one prevalent type of spam – redirection spam – where one can identify spam pages by the third-party domains that these pages redirect traffic to. We propose a five-layer, double-funnel model for describing end-to-end redirection spam, present a methodology for analyzing the layers, and identify prominent domains on each layer using two sets of commercial keywords – one targeting spammers and the other targeting advertisers. The methodology and findings are useful for search engines to strengthen their ranking algorithms against spam, for legitimate website owners to locate and remove spam doorway pages, and for legitimate advertisers to identify unscrupulous syndicators who serve ads on spam pages.

Categories and Subject Descriptors
H.3.5 [Information Storage and Retrieval]: Online Information Services - Commercial services, Web-based services

General Terms: Measurement, Security, Experimentation
Keywords: Search Spam, Web Spam, Redirection and Cloaking, Advertisement Syndication

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others.
WWW 2007, May 8–12, 2007, Banff, Alberta, Canada.
ACM 978-1-59593-654-7/07/0005.

1. INTRODUCTION

Search spammers (or web spammers) are those who use questionable search engine optimization (SEO) techniques to promote their low-quality links into top search rankings. Common SEO techniques include stuffing keywords, creating link farms (e.g., a large number of mutually linked, made-for-ads websites), posting links to spam pages as comments at public forums (referred to as comment spamming), and using crawler-browser cloaking techniques [8] to serve different pages to crawlers and end users. To evade spam investigation, some spammers in recent years have started using click-through cloaking techniques [15,22] to display bogus content to spam investigators who visit their pages directly without clicking through any search results. We use redirection spam to refer to the web pages that redirect browsers to visit known spammer-controlled third-party domains.

Many redirection spam pages use syndication where they participate in pay-per-click programs and display ads-portal pages.

We motivate our work using a real example. Around mid-October 2006, the following three doorway URLs appeared among the top-10 Live Search results for “cheap ticket”:
• http://-cheapticket.blogspot.com/
• http://sitegtr.com/all/cheap-ticket.html
• http://cheap-ticketv.blogspot.com/

All these pages appeared to be spam: they used cloaking, their URLs were posted as comments at numerous open forums¹, and they redirected traffic to known-spammer redirection domains vip-online-search.info, searchadv.com, and webresourses.info. Surprisingly, ads for orbitz.com, a reputable company, appeared on all three of these spam pages. A search using similar keywords² at Google and Yahoo! revealed another two spam pages, hosted on hometown.aol.com.au and megapage.de, that also displayed orbitz.com ads. If we believe that a reputable company is unlikely to buy service directly from spammers, a natural question to ask is: who are the middlemen who indirectly sell spammers’ service to sites like orbitz.com?

¹ For ease of presentation, throughout the paper, we use the term “forums” to include all blogs, bulletin boards, message boards, guest books, web journals, diaries, galleries, archives, etc. that can be abused by web spammers to promote spam URLs.
² We use the terms “keyword”, “query”, and “search term” interchangeably in this paper to refer to the entire query phrase that a user enters into a search box to perform a query.

We discovered the answer by “following the money”: when we clicked the orbitz.com ads on each of the five pages and monitored the resulting HTTP traffic using the Fiddler tool [27], we saw that the ads click-through traffic was funneled into either 64.111.210.206 or the block of IP addresses between 66.230.128.0 and 66.230.191.255 [30]. Moreover, the chain of redirections stopped at http://r.looksmart.com, which then redirected to orbitz.com using HTTP 302.

In this paper, we analyze end-to-end redirection spam activities comprehensively with an emphasis on syndication-based spam. We propose a five-layer double-funnel model in which displayed ads flow in one direction and click-through traffic flows in the other direction. By constructing two different benchmarks of commercial search terms and using the Strider Search Ranger system [21] to analyze tens of thousands of spam links that appeared in top results across three major search engines, we identified the major domains in each of the five layers and their interesting characteristics.

The paper is organized as follows. Section 2 gives an overview of the Search Ranger system and introduces the double-funnel model. In Section 3 we construct a spammer-targeted search
benchmark. Section 4 analyzes spam density and double-funnel for this benchmark. In Section 5 we construct an advertiser-targeted benchmark and compare the analysis results using this benchmark with those in Section 4. Section 6 discusses non-redirection spam that also connects to the double-funnel model. Section 7 surveys related work, and Section 8 concludes the paper. Since all the analyses in this paper are based on the data gathered in September and October of 2006, some spam URLs may no longer be active.

2. REDIRECTION SPAM

2.1 Definitions: Search Spam and Redirection

SEO techniques span a wide spectrum. Since the precise boundary between legitimate SEO techniques and search spam is often subjective and fuzzy, we focus on one type of spam – redirection spam – which is widely used by large-scale spammers to associate many doorway pages with a single redirection domain. These doorway pages often exhibit similar patterns in their appearance, their cloaking and code obfuscation techniques for avoiding detection, and the way by which their URLs appear in the comment fields of public forums. These repeated patterns allow human investigators to judge spam pages more easily and confidently. We will describe the exact steps in detecting spam in the next subsection. In Sections 4 and 5, we will show that redirection spam accounts for significant spam densities in both our benchmarks, which indicates that our spam detection mechanism is effective in practice.

After a user instructs the browser to visit a URL (the primary URL), the browser may visit other URLs (secondary URLs) automatically. The secondary URLs may contribute to inline contents (e.g., Google AdSense ads) on the primary page, or may replace the primary page entirely (i.e., they replace the URL in the address bar). We consider both these types of secondary URLs redirection. See [31] for screenshots of sample redirection spam.

2.2 Strider Search Ranger System

The Strider Search Ranger system [21] is an automated spam detection system with the following three key features:

1. Web Patrol with Search Monkeys [19] - Since search engine crawlers typically do not execute scripts, spammers exploit this fact using crawler-browser cloaking techniques, which serve one page to crawlers for indexing but display a different page to browser users [8,23]. To defend against cloaking, Search Monkeys visit each web page with a full-fledged popular browser, which executes all client-side scripts. To combat the newer click-through cloaking technique, which serves spam content only to users who click through search results, our monkey programs mimic the click-through by first retrieving a search-result page to set the browser’s document.referrer variable, then inserting a link to the spam page in the search-result page, and finally clicking through the inserted link.

2. Follow the Money through Redirection Tracking – Common approaches to detecting “spammy” content and link structures merely catch “what” spammers are doing today. By contrast, if we follow the money by tracking traffic redirection, we would be closer to identifying “who” is behind spam activities, even if their spam techniques evolve. Search Ranger uses the Strider URL Tracer [20] to intercept browser redirection traffic at the network layer to record all redirection URLs. As Sections 4 and 5 will demonstrate, we apply redirection analysis to tracking both the ads-fetching traffic and the ads click-through traffic.

3. Similarity-based Grouping for Identifying Large-scale Spam – Rather than analyzing all crawler-indexed pages, Search Ranger focuses on monitoring search results of popular queries targeted by spammers to obtain a list of URLs with high spam densities. It then analyzes the similarity between the redirections from these pages to identify related pages, which are potentially operated by large-scale spammers. In its simplest form, this similarity analysis identifies doorway pages that share the same redirection domain. After we verify that the domain is responsible for serving the spam content, we then use the domain as a seed to perform “backward propagation of distrust” [13] to detect other related spam pages.

In summary, Search Ranger identifies spam URLs using the process summarized below.

Search Ranger Spam Detection Process

Step 1: Given a set of search terms and a target search engine, Search Monkeys retrieve the top-N search results for each query, remove duplicates, and scan each unique URL to produce an XML file that records all URL redirections.

Step 2: At the end of a batched scan, Search Ranger applies redirection analysis to all the XML files to classify URLs that redirected to known-spammer redirection domains as spam.

Step 3: Search Ranger groups unclassified URLs by each of the third-party domains that received redirection traffic.

Step 4: Search Ranger submits sample URLs from each group to a spam verifier, which gathers evidence of spam activities associated with these URLs. Specifically, the spam verifier checks if each URL uses crawler-browser cloaking to fool search engines or uses click-through cloaking to evade manual spam investigation. It also checks if the URL has been widely comment-spammed at public forums.

Step 5: Search Ranger submits groups of unclassified URLs, ranked by their group sizes and tagged by spam evidence, to human judges. Once the judges determine a group to be spam, Search Ranger adds the redirection domains responsible for serving the spam content to the set of known spam domains, which will be used in Step-2 classification in future scans.

2.3 Spam Double-Funnel

A typical advertising syndication business consists of three layers: the publishers who attract traffic by providing quality content on their websites to achieve high search rankings, the advertisers who pay for displaying their ads on those websites, and the syndicators who provide the advertising infrastructure to connect the publishers with the advertisers. The Google AdSense program [29] is an example syndicator. Although some spammers have abused the AdSense program [28], the abuse is most likely the exception rather than the norm.

In a questionable advertising business, spammers assume the role of publishers, who set up websites of low-quality content and use black-hat SEO techniques to attract traffic. To better survive spam detection and blacklisting by search engines, many spammers have split their operations into two layers. At the first layer are the doorway pages, whose URLs the spammers promote into top search results. When users click those links, their browsers are instructed to fetch spam content from redirection domains, which occupy the second layer.
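A minimal Python sketch of Steps 2 and 3, plus the size-based ranking used in Step 5, of the Search Ranger process above may make the redirection analysis more concrete. It is an illustration under stated assumptions, not the Search Ranger implementation: the input format (a mapping from each primary URL to the secondary URLs recorded during its scan), the function names, and the exact host-name matching are all hypothetical simplifications.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def domain_of(url):
    """Return the lowercased host name of a URL (no public-suffix handling)."""
    return (urlsplit(url).hostname or "").lower()


def classify_and_group(scans, known_spam_domains):
    """Split scanned URLs into spam and size-ranked unclassified groups.

    scans (hypothetical format) maps each primary URL to the list of
    secondary URLs recorded during its scan.
    """
    spam = {}
    groups = defaultdict(set)
    for primary, secondary_urls in scans.items():
        own = domain_of(primary)
        # Third-party hosts are all contacted hosts other than the primary's.
        third_party = {domain_of(u) for u in secondary_urls} - {own, ""}
        hits = third_party & known_spam_domains
        if hits:
            # Step 2: traffic went to a known-spammer redirection domain.
            spam[primary] = sorted(hits)
        else:
            # Step 3: group by every third-party domain that received traffic.
            for host in third_party:
                groups[host].add(primary)
    # Step 5 (ranking only): largest groups first, ties broken by name.
    ranked = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return spam, ranked
```

Under these assumptions, a doorway whose scan recorded traffic to a domain in the known set (for example, searchadv.com) is classified as spam, while the remaining URLs are grouped by the third-party hosts they contacted, largest group first, ready for verification and human review.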
To attract prudent legitimate advertisers who do not want to be too closely connected to the spammers, many syndicators have also split their operations into two or more layers, which are connected by multiple redirections, to obfuscate the connection between the advertisers and the spammers. Since these syndicators are typically smaller companies, they often join forces through traffic aggregation to attract sufficient traffic providers and advertisers.

We model this end-to-end search spamming business with the five-layer double-funnel illustrated in Figure 1: tens of thousands of advertisers (Layer #5) pay a handful of syndicators (Layer #4) to display their ads. The syndicators buy traffic from a small number of aggregators (Layer #3), who in turn buy traffic from web spammers to insulate syndicators and advertisers from spam pages. The spammers set up hundreds to thousands of redirection domains (Layer #2), create millions of doorway pages (Layer #1) that fetch ads from these redirection domains, and widely spam the URLs of these doorways at public forums. If any such URLs are promoted into top search results and are clicked by users, all click-through traffic is funneled back through the aggregators, who then de-multiplex the traffic to the right syndicators. Sometimes there is a chain of redirections between the aggregators and the syndicators due to multiple layers of traffic affiliate programs, but almost always one domain at the end of each chain is responsible for redirecting to the target advertiser’s website.

Figure 1: Spam Double-Funnel (Layer #1: Doorways; Layer #2: Redirection Domains; Layer #3: Aggregators; Layer #4: Syndicators; Layer #5: Advertisers; flows: Ads Display, Click-Thru)

In the case of AdSense-based spammers, the single domain googlesyndication.com plays the role of the middle three layers, responsible for serving ads, receiving click-through traffic, and redirecting to advertisers. Specifically, browsers fetch AdSense ads from the redirection domain googlesyndication.com and display them on the doorway pages; ads click-through traffic goes into the aggregator domain googlesyndication.com before reaching advertisers’ websites.

3. SPAMMER-TARGETED KEYWORDS

To study the common characteristics of redirection spam, our first step was to discover the keywords and categories heavily targeted by redirection spammers. In this section, we describe our methodology for deriving 10 spammer-targeted categories and a benchmark of 1,000 keywords, which serve as the basis for the analyses presented in Section 4.

Redirection spammers often use their targeted keywords as the anchor text of their spam links at public forums, exploiting a typical algorithm by which common search engines index and rank URLs. For example, the anchor text for the spam URL http://coach-handbag-top.blogspot.com/ is typically “coach handbag”. Therefore, we collect spammer-targeted keywords by extracting all the anchor text from a large number of spammed forums and ranking the keywords by their frequencies.

Between June and August of 2006, we manually investigated spam reports from multiple sources including search user feedback, heavily spammed forum types, online spam discussion forums, etc. We compiled a list of 323 keywords that returned spam URLs among the top 50 results at one of the three major search engines. We then queried these keywords at all three search engines, extracted the top-50 results, scanned them with an earlier version of Search Ranger, and identified 4,803 unique redirection-spam URLs.

Next, we issued a “link:” query on each of the 4,803 URLs and retrieved 35,878 unique pages that contained at least one of these spam URLs. From these pages, we collected a total of 1,132,099 unique keywords, with a total of 6,026,699 occurrences, and ranked the keywords by their occurrence counts. The top-5 keywords are all drugs-related: “phentermine” (8,117), “viagra” (6,438), “cialis” (6,053), “tramadol” (5,788), and “xanax” (5,663). Among the top one hundred, 74 are drugs-related, 16 are ringtone-related, and 10 are gambling-related.

Among the above 1,132,099 keywords, we could select a top list, say top 1000, for our subsequent analyses. However, we observed that keywords related to drugs and ringtones dominate the top-1000 list. Since it would be useful to study spammers who target different categories, we decided to construct our benchmark by manually selecting ten of the most prominent categories from the list. They are:

1. Drugs: phentermine, viagra, cialis, tramadol, xanax, etc.
2. Adult: porn, adult dating, sex, etc.
3. Gambling: casino, poker, roulette, texas holdem, etc.
4. Ringtone: verizon ringtones, free polyphonic ringtones, etc.
5. Money: car insurance, debt consolidation, mortgage, etc.
6. Accessories: rolex replica, authentic gucci handbag, etc.
7. Travel: southwest airlines, cheap airfare, hotels las vegas, etc.
8. Cars: bmw, dodge viper, audi monmouth new jersey, etc.
9. Music: free music downloads, music lyrics, 50 cent mp3, etc.
10. Furniture: bedroom furniture, ashley furniture, etc.

We then selected the top-100 keywords from each category to form our first benchmark of 1,000 spammer-targeted search terms.

4. REDIRECTION-SPAM ANALYSIS

In late September 2006, we submitted the 1,000 keywords to the Search Ranger system, which retrieved the top-50 results from all three major search engines. In total, we collected 101,585 unique URLs from 1,000x50x3=150,000 search results. With a set of approximately 500 known-spammer redirection domains and AdSense IDs at that time, the system identified 12,635 unique spam URLs, which accounted for 11.6% of all the top-50 appearances. (The actual redirection-spam density should be higher because some of the doorway pages had been deactivated, which were no longer causing URL redirections when we scanned them.) We first give a brief analysis of per-category spam densities in Section 4.1 and then focus on the double-funnel analysis for the remainder of this section.

4.1 Spam Density Analysis

Figure 2 compares the per-category spam densities across the 10 spammer-targeted categories. The numbers range from 2.7%
for Money to 30.8% for Drugs. Two categories, Drugs and Ringtone, are well above twice the average (shown on the far right). Three categories – Money, Cars, and Furniture – are well below half the average. We also calculated DCG (Discounted Cumulated Gain) [10] spam densities, which give more weight to spam URLs appearing near the top of the search-result list, but found no significant difference from Figure 2.
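The two density measures can be sketched in Python as follows. This is an illustrative sketch, not the computation used for Figure 2: the input format (one rank-ordered list of spam flags per query) is hypothetical, and the 1/log2(rank + 1) discount is one common DCG-style weighting that is assumed here because the exact discount is not stated in this section.

```python
import math


def spam_density(result_lists):
    """Fraction of all result appearances that are flagged as spam.

    result_lists holds one list per query; each list contains booleans in
    rank order (True means the URL at that rank was classified as spam).
    """
    total = sum(len(flags) for flags in result_lists)
    spam = sum(sum(flags) for flags in result_lists)
    return spam / total if total else 0.0


def dcg_spam_density(result_lists):
    """Rank-discounted spam density with an assumed 1/log2(rank + 1) weight."""
    spam_weight = 0.0
    total_weight = 0.0
    for flags in result_lists:
        for rank, is_spam in enumerate(flags, start=1):
            weight = 1.0 / math.log2(rank + 1)  # assumed discount function
            total_weight += weight
            if is_spam:
                spam_weight += weight
    return spam_weight / total_weight if total_weight else 0.0
```

With this weighting, a spam URL at rank 1 contributes more to the discounted density than the same URL at rank 50, whereas the plain density treats both appearances equally.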
Figure 2: Per-Category Spam Density

Figure 3: Layer #1: top-15 primary domains/sites by spam doorway appearance counts
sites to scrutinize their URLs. Figure 4 shows that 14 of the top-15 doorway domains have a spam percentage³ higher than 74%;
#2, #3, #4, and #7 in Figure 3 all belong to the same company, an
to paysefeed.net.
(2) Unprotected upload areas, such as http://uenics.evansville.edu:8888/school/uploads/1/buy-carisoprodol-cheap.html and http://xdesign.ucsd.edu/twiki/bin/view/main/tramadolonline.
(3) Home page-like directories, such as http://aquatica.mit.edu/albums/gtin/texas-country-ringtones.html and http://find.uchicago.edu/~loh/albums/ cial.php?id=56.

Figure 6: Layer #2: top-15 redirection domains by doorway-URL appearance counts
using a different benchmark based on the most-bid keywords from legitimate advertisers.

5.1 Benchmark of 1,000 Most-Spammed Advertiser-Targeted Keywords

For our second benchmark, we obtained a list of 5,000 most-bid keywords from a legitimate ads syndication program, queried them at all three major search engines to retrieve the top-50
with Search Ranger, and selected the 1,000 keywords with the
previous benchmark, and there are two partial explanations. First, this second benchmark has fewer keywords from the heavily spammed categories in Figure 2. Second, we measured the second benchmark two weeks after we measured the first one, while one of the three major search engines started to remove spam URLs right after our first measurement.

5.3 Double-Funnel Analysis

We next analyze the five layers and compare them with the results from the first benchmark. In all the figures, we color those
Figure 9: Layer #1: top-15 primary domains/sites by spam doorway appearance counts

Figure 10: Layer #2: top-15 redirection domains by number of spam doorway appearances

TLD      .com   .org   .net   .biz   .info
Spam %   4.1%   11%    12%    53%    68%

Table 1: Spam percentages for Top-Level Domains (TLDs) based on search results in our second benchmark
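The per-TLD percentages in Table 1 amount to a grouped ratio over unique URLs. The Python sketch below shows one way such a table could be computed; the function names and inputs are hypothetical, and taking the last host-name label as the TLD is a simplification (it ignores multi-label suffixes and raw IP hosts).

```python
from collections import Counter
from urllib.parse import urlsplit


def tld_of(url):
    """Return the last dot-separated label of the host, e.g. '.info'."""
    host = (urlsplit(url).hostname or "").lower()
    if "." not in host:
        return ""
    return "." + host.rsplit(".", 1)[-1]


def spam_percentage_by_tld(unique_urls, spam_urls):
    """Map each TLD to the percentage of its unique URLs flagged as spam."""
    totals = Counter()
    spam = Counter()
    for url in set(unique_urls):  # count each URL once
        tld = tld_of(url)
        if not tld:
            continue
        totals[tld] += 1
        if url in spam_urls:
            spam[tld] += 1
    return {tld: 100.0 * spam[tld] / count for tld, count in totals.items()}
```

The result is a dictionary keyed by TLD, which can then be sorted by percentage to produce a row like the one in Table 1.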
5.3.2 Layer #2: Redirection Domains

Figure 10 shows the top-15 redirection domains, all of which were syndication-based. Seven of them overlap with the list in Figure 6, and nudai.com was previously discussed. Topsearch10.com stands out as the only redirection domain that was behind over 1,000 spam appearances in both benchmarks. In
209.8.25.150~209.8.25.159 IP block continued to have a significant presence with 2,208 doorway appearances, which accounted for 25% of all spam appearances. The most notable differences are that drugs and adult spammers are replaced by money spammers, reflecting the different compositions of the two benchmarks. Finally, we note that veryfastsearch.com (64.111.196.122) and nudai.com (64.111.199.189) belonged to the 64.111 IP block described in Section 4.2.3, and could potentially connect to the aggregator more directly. Again, none of the AdSense spammers appeared in the top-15 list. The highest-ranking one was ca-pub-2706172671153345, who ranked #31 with 61 spam appearances of 27 unique spam blogs at blogspot.com.

5.3.3 The Bottom Three Layers

Among the 6,153 unique spam URLs, we extracted 2,995 ads-portal pages that contained a total of 37,962 ads.

Layer #3: Aggregators (Page analysis)

Figure 11 shows that, again, the 66.230 and 64.111 IP blocks contained dominating receiver domains for spam-ads click-through traffic. In total, we collected 28,938 and 6,041 ads for these two IP blocks, respectively.

Figure 11: Layer #3: top-15 click-through traffic receiver domains by the number of ads appearances on spam pages (page analysis)

Layer #5: Advertisers (Page analysis)

Figure 12 identifies the top-15 advertisers, which are significantly different from the ones in Figure 8; only six of them overlap. Well-known sites – such as bizrate.com, shopping.com, dealtime.com, and shopzilla.com, which previously ranked between #20 and #60 – now move into the top 15. This reflects the fact that advertiser-targeted keywords better match these shopping websites than spammer-targeted keywords.

Figure 12: Layer #5: top-15 advertisers by number of ads appearances on spam pages (page analysis)

By visiting each page and analyzing the ads URLs, we found that all 17,050 ads forwarded click-through traffic to 64.111.196.117, which was #12 in Figure 7 and #7 in Figure 11.
extracted 15,580 ads and found that 6,200 of them were funneling through.
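Whether a click-through receiver falls inside the two aggregator IP blocks named in this paper (66.230.128.0~66.230.191.255 and 64.111.192.0~64.111.223.255) can be checked with a few lines of Python. This is only a sketch using the standard ipaddress module; the constant and function names are hypothetical and are not part of the tools described in the paper.

```python
import ipaddress

# Inclusive start~end ranges as written in the paper.
AGGREGATOR_RANGES = [
    ("66.230.128.0", "66.230.191.255"),
    ("64.111.192.0", "64.111.223.255"),
]

# Convert each start~end range into the CIDR networks that cover it exactly.
AGGREGATOR_NETWORKS = [
    net
    for first, last in AGGREGATOR_RANGES
    for net in ipaddress.summarize_address_range(
        ipaddress.IPv4Address(first), ipaddress.IPv4Address(last)
    )
]


def in_aggregator_block(ip):
    """Return True if the dotted-quad string falls in either IP block."""
    addr = ipaddress.IPv4Address(ip)
    return any(addr in net for net in AGGREGATOR_NETWORKS)
```

For example, in_aggregator_block("64.111.196.117") returns True, while an address in the 209.8.25.150~209.8.25.159 redirection-domain block, such as 209.8.25.150, returns False.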
Layer #4: Syndicators (Click-through analysis)

Our click-through analysis shows that the two benchmarks shared the same list of top-3 syndicators, despite the fact that the benchmarks had only 15% overlap in the list of keywords and very different top-advertiser lists. Again, the top-3 syndicators appeared on a large number of redirection chains in our analysis: looksmart.com (881), findwhat.com (809), and 7search.com (335), which together accounted for 2,025 (68%) of the 2,995 chains. These numbers demonstrate that these syndicators appear to be involved in the search spam industry both broadly and deeply.

6. OTHER COMMON SPAM

In this section, we show that many syndication-based spammers who do not use client-side browser redirections to fetch ads share the same bottom half of the double-funnel with redirection spammers; that is, although they fetch ads on the server side, they also funnel the click-through traffic from their pages into the same IP blocks that we uncovered in the previous sections. This shows that the aggregators and the syndicators are profiting from even more spam traffic. All scans were performed in the month of October 2006.

6.1 BLOG FARMS

The web page at http://urch.ogymy.info/ is a commonly seen made-for-ads blog page that consists of three parts: a list of ads, followed by a few programmatically generated short comments, followed by a long list of meaningless paragraphs designed to promote several randomly named .org and .info URLs sprinkled throughout the paragraphs. By issuing the following queries – "Welcome to my blog" "Hello, thanx for tips" phentermine domain:info, as well as “linkdomain:ogymy.info” and “linkfromdomain:ogymy.info” – we found 1,705 unique pages that shared the same format and belonged to the same blog farm.

7. RELATED WORK

Cloaking and redirection are two techniques that Gyongyi and Garcia-Molina identified as tactics for hiding spam content [8]. Wu and Davison studied cloaking and redirection on the web and found that more than 8% of the top 200 URLs returned by Google employed cloaking and that some sites even used redirection cloaking, i.e., redirecting different user agents to different sites [23]. They proposed an automated method to detect semantic cloaking, which first identifies suspect pages by the content of the pages returned to a browser and a crawler, and then uses machine learning to create a classifier [25]. Our Search Monkeys are able to foil cloaking, including the newer click-through cloaking techniques, by mimicking search users’ behavior using a full-fledged browser so that redirection analyses are performed on true pages displayed to the users.

Money is a major incentive for spammers. Jansen observed that despite the problem of click-fraud, sponsored search could reduce the amount of spam [9]. Sarukkai proposed a way to quantify a search term’s monetizability [17]. Chellapilla and Chickering investigated cloaking from an economic perspective by comparing search results from the top 5000 queries and the top 5000 monetizable queries. They observed that for queries whose results used cloaking, 73.1% of the pages for the popular queries were spam while 98.5% of the pages for the monetizable queries were spam [5]. We focus on detecting large-scale spammers by following the money to track down major domains that appear in the redirection chains involving spam ads.

Various ranking mechanisms, such as Pagerank, HITS, and Trust Rank, incorporate the idea that a link is a “vote” of trust [13, 26]. Baeza-Yates, Castillo, and Lopez found that Pagerank was vulnerable to Sybil attacks in which pages with low score formed a complete subgraph or a star [2]. However, Adali et al. argued that maximizing rank could be as simple as a link bomb consisting of one central page to which every other page links [1]. Methods for adapting ranking algorithms to combat link farms include investigating trust starting with a known-bad seed or
introducing a measure of distrust [26]. Krishnan and Raj used this
idea for Anti-Trust Rank, in which they used an algorithm similar
to Trust Rank to propagate anti-trust from an initial seed set [12],
similar to the work on identifying “neighborhoods” of distrust on
the web [13] and link farms [24]. Benczur and Csalogany
presented SpamRank as an automated spam detection technique by
identifying pages that violated the power law distribution by
linking to one another [4]. They observed that link similarity
measures could be more effective than trust/distrust measures in
classifying spam pages. Similarly, Carvalho et al. focused on
identifying “noisy” links, that is, links between sites that show
abnormal support for each other, by measuring the amount of linking
between two sites [6]. Becchetti et al. analyzed several heuristics –
purely link-based analyses, PageRank, TrustRank, Truncated PageRank,
and various combinations of these heuristics – for spam detection and
compared their performance [3]. In contrast, we use link analyses
only to identify spammed forums, but rely on redirection analysis
to identify spam pages.

Content analysis is also useful for detecting spam. Kolari, Finin,
and Joshi took a machine learning approach by building a
classifier based on meta tags, anchor text, and tokenized URLs
[11]. Fetterly, Manasse, and Ntoulas began with content-independent
heuristics, such as URL structure and average change
throughout a site [7], and continued with site-dependent
heuristics, such as the words used in a page or title and the
fraction of visible content [16]. Urvoy et al. modeled the style of
HTML documents based on properties such as spacing and
HTML tags to determine stylistic similarities that could be used to
identify authors [18]. Mishne, Carmel, and Lempel compared the
language models of a sample blog entry and the target page
specified by a comment [14]. Our traffic-based analysis is
complementary to these content-based analyses.

8. CONCLUSIONS

We have presented redirection-spam analyses using the Strider
Search Ranger system, which detects spam pages by monitoring
their redirection traffic to known-spammer domains. Using a
benchmark of spammer-targeted keywords, we showed that
“drugs” and “ringtone” were the two most-spammed categories,
with an average search-result spam density as high as 30.8% and
27.5%, respectively. We have also constructed a second
benchmark of advertiser-targeted keywords in order to study the
similarities and differences in spam characteristics between the two
benchmarks.

We have presented a five-layer double-funnel model for
analyzing redirection spam, in which ads from merchant
advertisers are funneled through a number of syndicators,
aggregators, and redirection domains to get displayed on spam
doorway pages, whereas click-through traffic from these spam ads
is funneled, in the reverse direction, through the aggregators and
syndicators to reach the advertisers. Domains in the middle layers
provide the critical infrastructure for converting spam traffic to
money, but they have mostly been hiding behind the scenes. We
used systematic and quantitative traffic-analysis techniques to
identify the major players and to reveal their broad and deep
involvement in the end-to-end spam activities.

For Layer #1 – doorway domains, we showed that the free
blog-hosting site blogspot.com had an order of magnitude more
spam appearances in top search results than other hosting domains
in both benchmarks, and was responsible for about one in every
four spam appearances (22% and 29% in the two benchmarks
respectively, to be exact). In addition, at least three in every four
unique blogspot URLs that appeared in top-50 results for
commercial queries were spam (77% and 75%). We also showed
that over 60% of unique .info URLs in our search results were
spam, which was an order of magnitude higher than the spam
percentage for .com URLs.

For Layer #2 – redirection domains, we showed that the
spammer domain topsearch10.com was behind over 1,000 spam
appearances in both benchmarks, and that the
209.8.25.150~209.8.25.159 IP block where it resided hosted
multiple major redirection domains that were collectively
responsible for 22-25% of all spam appearances. We also
observed that the majority of the top redirection domains were
syndication-based, serving text-based ads-portal pages.

For Layer #3 – aggregators, we presented the surprising finding
that two IP blocks, 66.230.128.0~66.230.191.255 and
64.111.192.0~64.111.223.255, appeared to be responsible for
funneling an overwhelmingly large percentage of spam-ads
click-through traffic. In our study, we easily collected over 100,000
spam ads that were associated with these two IP blocks, including
many ads served by non-redirection spammers as well. These two
IP blocks occupy the “bottleneck” of the spam double-funnel and
may prove to be the best layer for attacking the search spam
problem.

For Layer #4 – syndicators, we discovered that a handful of ads
syndicators appeared to serve as the middlemen for connecting
advertisers with the majority of the spammers. In particular, the
top-3 syndicators were involved in 59-68% of the spam-ads
click-through redirection chains that we sampled. By serving ads on a
large number of low-quality spam pages at potentially lower
prices, these syndicators could become major competitors to
mainstream advertising companies that serve some of the same
advertisers’ ads on search-result pages and other high-quality,
non-spam pages.

For Layer #5 – advertisers, we showed that even well-known
websites’ ads had a significant presence on spam pages.
Ultimately, it is advertisers’ money that is funding the search
spam industry, which is increasingly cluttering the web with
low-quality content and reducing web users’ productivity. By
exposing the end-to-end search spamming activities, we hope to
educate users not to click spam links and spam ads, and to
encourage advertisers to scrutinize those syndicators and traffic
affiliates who are profiting from spam traffic at the expense of the
long-term health of the web.

9. REFERENCES

[1] Adali, S., Liu, T., and Magdon-Ismail, M. Optimal Link
Bombs are Uncoordinated. In the 1st International Workshop
on Adversarial Information Retrieval on the Web (AIRWeb),
May 2005.
[2] Baeza-Yates, R., Castillo, C., and Lopez, V. Pagerank Increase
Under Different Collusion Topologies. In the 1st International
Workshop on Adversarial Information Retrieval on the Web
(AIRWeb), May 2005.
[3] Becchetti, L., Castillo, C., Donato, D., Leonardi, S., and
Baeza-Yates, R. Link-based Characterization and Detection of Web
Spam. In the 2nd International Workshop on Adversarial
Information Retrieval on the Web (AIRWeb), August 2006.
[4] Benczur, A., Csalogany, K., Sarlos, T., and Uher, M.
SpamRank – Fully Automatic Link Spam Detection. In the 1st
International Workshop on Adversarial Information Retrieval
on the Web (AIRWeb), May 2005.
[5] Chellapilla, K. and Chickering, D.M. Improving Cloaking
Detection Using Search Query Popularity and Monetizability.
In the 2nd International Workshop on Adversarial Information
Retrieval on the Web (AIRWeb), August 2006.
[6] da Costa Carvalho, A. L., Chirita, P., de Moura, E. S., Calado,
P., and Nejdl, W. Site Level Noise Removal for Search
Engines. In Proc. International World Wide Web
Conference (WWW), May 2006.
[7] Fetterly, D., Manasse, M., and Najork, M. Spam, Damn
Spam, and Statistics: Using Statistical Analysis to Locate
Spam Web Pages. In Proc. of the 7th International Workshop
on the Web and Databases, pp. 1-6, 2004.
[8] Gyongyi, Z. and Garcia-Molina, H. Web Spam Taxonomy. In
the 1st International Workshop on Adversarial Information
Retrieval on the Web (AIRWeb), 2005.
[9] Jansen, B.J. Adversarial Information Retrieval Aspects of
Sponsored Search. In the 2nd International Workshop on
Adversarial Information Retrieval on the Web (AIRWeb),
2006.
[10] Jarvelin, K. and Kekalainen, J. IR Evaluation Methods for
Retrieving Highly Relevant Documents. In Proc. ACM
SIGIR Conference on R&D in Information Retrieval, 2000.
[11] Kolari, P., Finin, T., and Joshi, A. SVMs for the
Blogosphere: Blog Identification and Splog Detection. In
AAAI Spring Symposium on Computational Approaches to
Analysing Weblogs, March 2006.
[12] Krishnan, V. and Raj, R. Web Spam Detection and
Anti-Trust Rank. In the 2nd International Workshop on
Adversarial Information Retrieval on the Web (AIRWeb),
August 2006.
[13] Metaxas, P. and DeStephano, J. Web Spam, Propaganda and
Trust. In the 1st International Workshop on Adversarial
Information Retrieval on the Web (AIRWeb), May 2005.
[14] Mishne, G., Carmel, D., and Lempel, R. Blocking Blog
Spam with Language Model Disagreement. In the 1st
International Workshop on Adversarial Information
Retrieval on the Web (AIRWeb), May 2005.
[15] Niu, Y., Wang, Y. M., Chen, H., Ma, M., and Hsu, F. A
Quantitative Study of Forum Spamming Using
Context-based Analysis. In Proc. Network and Distributed System
Security (NDSS) Symposium, February 2007.
[16] Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D.
Detecting Spam Web Pages through Content Analysis. In
Proc. International World Wide Web Conference (WWW),
May 2006.
[17] Sarukkai, R.R. How Much is a Keyword Worth? In Proc.
International World Wide Web Conference (WWW), May
2005.
[18] Urvoy, T., Lavergne, T., and Filoche, P. Tracking Web Spam
with Hidden Style Similarity. In the 2nd International
Workshop on Adversarial Information Retrieval on the Web
(AIRWeb), August 2006.
[19] Wang, Y. M., Beck, D., Jiang, X., Roussev, R., Verbowski,
C., Chen, S., and King, S. Automated Web Patrol with
Strider HoneyMonkeys: Finding Web Sites That Exploit
Browser Vulnerabilities. In Proc. Network and Distributed
System Security (NDSS) Symposium, February 2006.
[20] Wang, Y. M., Beck, D., Wang, J., Verbowski, C., and
Daniels, B. Strider Typo-Patrol: Discovery and Analysis of
Systematic Typo-Squatting. In Proc. 2nd Workshop on Steps
to Reducing Unwanted Traffic on the Internet (SRUTI), July
2006.
[21] Wang, Y. M. and Ma, M. Strider Search Ranger: Towards an
Autonomic Anti-Spam Search Engine. Microsoft Research
Technical Report, MSR-TR-2006-174, December 2006.
[22] Wang, Y. M. and Ma, M. Detecting Stealth Web Pages That
Use Click-Through Cloaking. Microsoft Research Technical
Report, MSR-TR-2006-178, December 2006.
[23] Wu, B. and Davison, B.D. Cloaking and Redirection: A
Preliminary Study. In the 1st International Workshop on
Adversarial Information Retrieval on the Web (AIRWeb),
2005.
[24] Wu, B. and Davison, B.D. Identifying Link Farm Pages. In
Proc. International World Wide Web Conference (WWW),
2005.
[25] Wu, B. and Davison, B.D. Detecting Semantic Cloaking on
the Web. In Proc. International World Wide Web Conference
(WWW), August 2006.
[26] Wu, B., Goel, V., and Davison, B.D. Propagating Trust and
Distrust to Demote Web Spam. In Proc. Models of Trust for
the Web Workshop (MTW), International World Wide Web
Conference, 2006.
[27] Fiddler HTTP Proxy, http://www.fiddlertool.com/
[28] Fighting Splogs, http://fightsplog.blogspot.com/
[29] The Google AdSense Program, http://google.com/adsense
[30] Network Whois records,
http://whois.domaintools.com/66.230.138.211 and
http://whois.domaintools.com/64.111.214.154
[31] Screenshots of sample redirection spam pages,
http://research.microsoft.com/SearchRanger/Redirection-spam_3_types.htm
[32] Screenshots of sample click-through analyses,
http://research.microsoft.com/SearchRanger/Spam_ads_click-through_analysis.htm