Академический Документы
Профессиональный Документы
Культура Документы
1. INTRODUCTION
Food gained by fraud tastes sweet to a man, but he ends up with a mouth full
of gravel. Proverbs 20:17 (New International Version)
Fraud, the act of deceiving others for personal gain, is certainly as old as civilization itself. The word comes from the
Latin fraudem, meaning deceit or injury, and over the years has
come to represent a wide array of injustices, including forged
artwork, confidence schemes, academic plagiarism, and email
advance-fee frauds (such as the well-known Nigerian-email
scam). Although these forms of fraud are very different in nature, they all have in common a dishonest attempt to convince
an innocent party that a legitimate transaction is occurring when
in fact it is not. Usually the fraud is for monetary gain, but not
always, as fraud may be perpetrated for political causes (e.g.,
electoral fraud), personal prestige (e.g., plagiarism), or selfpreservation (e.g., perjury).
In the twentieth century, fraud matured in the area of transactional businesses, most notably in the telecommunications and
credit card industries. Due to the sheer volume of transactions
in these businesses, fraud could go unnoticed fairly easily, because it was such a small proportion of the overall business.
In the early days of the telecommunications business, social
engineering was used to convince telephone operators to give
access to lines or complete calls that were not authorized. In
the 1950s, AT&T started automating direct-dial long distance
calling, exposing themselves to the first generation of hackers.
Because fraud could now be perpetrated without speaking to a
human, it could be automated.
In the early days of fraud, perpetrators (hereinafter referred to
as fraudsters) were able to inflict significant losses on telecommunications companies, resulting in billions of dollars in uncollectable revenue. As time went on, these companies tried many
techniques to combat the fraudsters, with some success. Thus
began an arms race that has continued to this day, with the
I want to work for Ma Bell. I dont hate Ma Bell the way some phone phreaks
do. I dont want to screw Ma Bell. With me its the pleasure of pure knowledge.
Theres something beautiful about the system when you know it intimately the
way I do.
domestic access charges between companies. The Communications Fraud Control Association (cfca.org) periodically estimates the extent of worldwide telecommunications fraud. In
1999 this estimate was $12 billion, in 2003 it was between $35
and $40 billion, in 2006 it was between $55 and $60 billion,
and in 2009 it was between $70 and $78 billion. Other industries, most notably the credit card industry, also have been challenged with fighting sophisticated fraud schemes, with varying
levels of success.
One characteristic of fighting fraud that reaches across most
transaction-based industries is the massive scale of the problem.
Massive is a hard word to define and is clearly relative to the
analysts resources and the era. In the case of current telecommunications and credit card fraud, there are billions of transactions per day plus associated metadata. Clearly, data storage,
compression, and management must be a well-designed part of
any fraud detection scheme.
With the rise of the Internet and Internet-based businesses
in the 1990s and early 2000s, new types of fraud began to
emerge. More transactions and purchases were taking place online, and new companies, such as eBay.com and Amazon.com,
had to concern themselves with fraud detection. The Internet
era has spawned new types of fraud, including click-fraud, email spam, phishing schemes, and denial of service attacks. In
addition, the Internet provides access to data about individuals
that was previously much more difficult to collect and consolidate, resulting in the serious problem of identity theft.
In recent years, fraud modeling and detection has become
an academic pursuit; attempts to summarize the state of the
art include reviews by Bolton and Hand (2002) and Phua et
al. (2005). In this article we review some of the strategies that
AT&T has used to fight fraud. Although fraud cuts across many
industries and takes many shapes and forms, we believe some
of the lessons learned over the decades of experience at AT&T
are relevant, and that there are some common philosophies that
apply to fighting many different kinds of fraud. Some of these
philosophies are specific to problems involving massive data
sets and thus are quite relevant to new types of fraud emerging
in the twenty-first century.
The article is organized as follows. Section 2 discusses the
nature of fraud. Section 3 explores how we have gone about
fighting fraud over the years. Section 4 covers issues in implementation of fraud systems. Finally, Section 5 summarizes the
lessons learned at AT&T and how these might apply to the fraud
detection problems of today and tomorrow.
2. THE NATURE OF FRAUD
The most difficult aspect of fighting fraud is identifying it. In
the context of telecommunications, a fraudulent phone call is
one in which there is no intent to paytheft of service. When
fraudulent phone usage is found, the normal response is to shut
down whatever avenue the fraudster found, although in egregious cases, law enforcement may be involved.
Usage management is a business function closely related to
fraud detection, dealing with accounts that are likely to be uncollectable. There is no sharp line between intent to pay and
21
ability to pay; thus it is often reasonable to have the same detection tools look for both fraud and usage management problems. In fact, fraud often will masquerade as a usage management problem; rather than seeking clever ways to subvert the
telecom networks, the fraudster will simply subscribe to a legitimate service, but with no intention to pay.
Historically, fraud takes many different forms, ranging from
teenagers hacking into systems from their bedrooms to sophisticated organized crime rings (Angus and Blackwell 1993). Our
goal was to create a fraud management system that was powerful enough to handle the many different types of fraud that we
encountered and flexible enough to potentially apply to things
we had not seen yet. We next provide examples of some common varieties of fraud in the telecommunications world.
1. Subscription fraud. Subscription fraud happens when
someone signs up for service (e.g., a new phone, extra
lines) with no intent to pay. In this case, all calls associated with the given fraudulent line are fraudulent but are
consistent with the profile of the user.
2. Intrusion fraud. This occurs when an existing, otherwise
legitimate account, typically a business, is compromised
in some way by an intruder, who subsequently makes or
sells calls on this account. In contrast to subscription calls,
the legitimate calls may be interspersed with fraudulent
calls, calling for an anomaly detection algorithm.
3. Fraud based on loopholes in technology. Consider voice
mail systems as an example. Voice mail can be configured in such a way that calls can be made out of the voice
mail system (e.g., to return a call after listening to a message), as a convenience for the user. However, if inadequate passwords are used to secure the mailboxes, it creates a vulnerability. The fraudster looks for a way into
a corporate voice mail system, compromises a mailbox
(perhaps by guessing a weak password), and then uses the
system to make outgoing calls. Legally, the owner of the
voice mail system is liable for the fraudulent calls; after
all, it is the owner that sets the security policy for the voice
mail system.
4. Social engineering. Instead of exploiting technological
loopholes, social engineering exploits human interaction
with the system. In this case the fraudster pretends to be
someone he or she is not, such as the account holder, or
a phone repair person, to access a customers account.
Recently, this technique has been used by pretexters in
some high-profile cases of accessing phone records to spy
on fellow board members and reporters (Kaplan 2006).
5. Fraud based on new technology. New technology, such as
Voice over Internet Protocol (VoIP), enables international
telephony at very low cost and allows users to carry their
US-based phone number to other countries. Fraudsters realized that they could purchase the service at a low price
and then resell it illegally at a higher price to consumers
who were unaware of the new service, unable to get it
themselves, or technologically unsophisticated. Detecting
this requires monitoring and correlating telephony usage,
IP traffic and ordering systems.
6. Fraud based on new regulation. Occasionally, regulations
intended to promote fairness end up spawning new types
TECHNOMETRICS, FEBRUARY 2010, VOL. 52, NO. 1
22
of fraud. In 1996, the FCC modified payphone compensation rules, requiring payphone operators to be compensated by the telecommunication providers. This allowed
these operators to help cover the cost of providing access
to phone lines, such as toll-free numbers, which do not
generate revenue for the payphone operator. This spawned
a new type of fraudpayphone owners or their associates
placing spurious calls from payphones to toll-free numbers simply to bring in compensation income from the
carriers.
7. Masquerading as another user. Credit card numbers can
be stolen by various means (e.g., shoulder surfing
looking over someones shoulder at a bank of payphones,
say) and used to place calls masquerading as the cardholder.
There are many more fraud techniques, some of which are
quite sophisticated and combine more than one known method.
Telecommunications fraud is not static; new techniques evolve
as the telecom companies put up defenses against existing ones.
The fraudsters are smart opponents, continually looking for exploitable weaknesses in the telecom infrastructure. Part of their
motivation is accounted for by the fact that once an exploit is
defined, there are thousands (or millions) of potential targets.
New types of fraud appear regularly, and these schemes evolve
and adapt to attempts to stop them.
3. DETECTING FRAUD
A fraud detection system must be flexible to respond quickly
and effectively to a variety of fraud types. There is one simple
rule to follow in detecting fraud: Follow the money. When the
notorious Willie Sutton was asked Why do you rob banks?, he
replied Because thats where the money is. Similarly, telecom
fraud most often involves international calls, because those calls
are the most costly. Although early phone phreakers often did
their exploits for the glory, much of current telecom fraud is
involved with making money. If a fraudster can steal a phone
call and then sell it to someone else, that is pure profit. This is
also why there are various schemes associated with audiotext
numbers: high-cost, international chat lines or lines with adult
content. They are expensive services that can generate lots of
revenue quickly, and fraudsters can receive a cut of the proceeds
from these expensive calls.
Fighting fraud is complicated by the existence of multiple
telecom carriers. Since equal access was instituted in 1982
(Kellogg, Huber, and Thorne 1999), hundreds of long-distance
carriers have come on the scene, and they can be used on a percall basis by dialing a carrier prefix when placing a call. Thus
fraudsters can easily migrate from one carrier to another; just
as paying customers have a choice of long-distance carriers, so
do the fraudsters. That means that once a carrier detects and
stops fraudulent usage on its network, that usage is likely to
migrate onto another network. Notifying a PBX owner to secure the compromised system is often the only way to ensure
that all of the leaks are plugged.
There is an interesting phenomenon associated with the differential ability of telecom companies to carry out this task. If
one company is better than others, fraudsters will tend to migrate to the others, where their cost of doing business will be
TECHNOMETRICS, FEBRUARY 2010, VOL. 52, NO. 1
23
24
(1)
where Xp and Xn are the previous and new values of the parameter and Dc is the information that comes from the current
call.
These types of EWMA models (Winters 1960) have been
widely used in statistical forecasting and statistical process
management. The properties of EWMA models have been studied extensively and are covered in standard textbooks (e.g.,
Abraham and Ledolter 1983). EWMAs update parameters in
a smoother fashion than simple moving averages and allow for
streaming updating of parameters without the need to access
data that have already been seen.
Two important parameters must be set in an EWMA: the parameter and the updating interval. Both of these together help
determine how quickly new information washes out old information. If is large, then the decay curve is steep and new information quickly washes out old information. If the updating
interval is small, as in call-based updating, then might have
TECHNOMETRICS, FEBRUARY 2010, VOL. 52, NO. 1
25
26
(2)
652-5467 5.2
756-2656 5.2
543-6547 10.0
756-2656 5.0
756-2656 6.2 652-5467 4.6
652-4132 4.5
652-5467 2.0 652-4132 3.9
653-4231 2.3 + 652-4132 0.8 653-4231 2.0
(1 )
=
624-3142 1.9
624-3142 1.6
735-4212 1.8
735-4212 1.5
423-1423 0.8
543-6547 1.5
534-2312 0.5
423-1423 0.7
526-4532 0.2
534-2312 0.4
Other
0.1
Other
0.3
Other
0.0
Figure 2. Computing a new top-k edge set from the old top-k edge set and todays edges. Note how a new edge enters the top-k edge set,
forcing the addition of an old edge to other.
TECHNOMETRICS, FEBRUARY 2010, VOL. 52, NO. 1
27
2|A B|
|A| + |B|
or twice the cardinality of the intersection of the two sets divided by the sum of the cardinalities of the sets. The Dice criterion is bounded between 0 and 1, with a value of 1 when the two
sets are identical. We use a weighted version of this criterion to
account for the weights on the edges.
The second predictive criterion is based on the Hellinger distance (Beran 1977), which was designed to measure distances
between statistical distributions. Applied to graph matching, the
Hellinger distance becomes
HD(A, B) =
wA (j)wB (j),
jAB
3.3.4 Catching Fraud via Graph Matching. The COI signatures serve as a communications fingerprint, a way to characterize the usage not of a phone number, but rather of the person behind the phone number. As such, they have been useful
in tracking down miscreants who try to cover their tracks by
changing their phone number, name, or address. One example
of this is our repetitive debtors database (RDD), designed to
keep a running database of COI signatures of delinquent customers in an attempt to track them down if they tried to assume
other identities. Consider the case where a customer is disconnected due to nonpayment. In some cases this customer might
try to sign up for a new account under another identity using
a different name, a different billing address, or a stolen credit
card. Sometimes just permuting letters in the name is enough
to avoid detection by the sales representatives. But in this case,
even with different identifying information, the new account belongs to an individual likely to have the same communications
fingerprint as the delinquent account. Our intuition is that if we
build COI signatures on all new accounts and match them to the
signatures of the delinquent accounts, we can find these fraudsters. This approach entails building a distance metric between
COI signature graphs. The distance function is based primarily on the overlap between the two graphs (i.e., phone numbers
appearing in both signatures) and the proportion of the overall
28
Figure 4. Shows calls from a set of seven phone numbers associated with a single customer over a 1-year period. Two sets of spikes are quite
high, associated with two fraud incidents in the year.
TECHNOMETRICS, FEBRUARY 2010, VOL. 52, NO. 1
29
Figure 5. Light-colored blocks show calls to a specific foreign country occurring out of hours, starting on January 15 and continuing to
January 28. The display shows 7,000 calls. Notice the uniform height of calls at 45 minutes.
Figure 6 shows another instance of fraud in this same country. In this case, the fraudulent calls occur out of normal business hours (which are shown by the shaded rectangles), in contrast to the normal calling pattern.
The next example, shown in Figure 7, demonstrates how intense fraudulent activity can contrast with infrequent normal
behavior. Identifying the fraud pattern here is particularly easy.
Finally, Figure 8, shows a single social-engineering event superimposed over a year of regular calling. These calls are often
very long duration.
These displays are static and monochromatic. But when analysts are investigating fraud, the displays are fully interactive
and use color to encode categorical variables, such as called
country. The analyst may, for example, pop up a histogram that
shows the number of calls going to each country. A menu item
can deactivate all countries, and a button click can reactivate the
country that appears to be the recipient of the fraudulent calls.
With just those calls displayed, the horizontal axis can be adjusted to narrow the time window. A rectangle swept over the
tip of one or more of the call spikes, causes the full call detail
for those calls to be displayed in another window. If the analyst sees that one of the particularly long fraudy-looking calls
is operator-assisted, then it is easy to use a histogram to display
all of the operator-assisted calls.
This tool is natural and powerful, allowing the analyst to perform an interactive data analysis session on the call detail, posing and answering questions that arise from seeing the calls.
4. IMPLEMENTATION
In 1998, AT&T implemented a new fraud detection system
called the Global Fraud Management System (GFMS), built
mostly by statisticians. How did a group of statisticians get involved in writing a production fraud detection system? That
was not our plan. The initial concept was that we would provide a signature-based alerting algorithm to plug into a new
fraud system to be built by other organizations within AT&T.
But the new system did not get built through standard development channels, and we were left with a stand-alone detection
algorithm. At that time, it became apparent that we would have
to provide the entire fraud system to deliver the signature alerting. This was also under a strict deadline due to the Y2K effort
within the company; the older fraud detection system was not
Y2K-compliant.
We succeeded because we had a small team and used simple
components, many of which we had developed to carry out fullvolume testing of the signature alerting. In particular, we had
already constructed the call detail database and various tools to
Figure 6. A 2-day set of fraudulent calls to a specific foreign country on the middle Tuesday and Wednesday. Note how these calls are
primarily out of business hours, and that the general pattern for this phone is calls only during business hours.
TECHNOMETRICS, FEBRUARY 2010, VOL. 52, NO. 1
30
Figure 7. One year of calling behavior for a single phone line. The pattern of intense activity at the center is fraudulent, contrasting with the
infrequent call pattern the rest of the year.
help with the huge flow of data. We also used extreme programming teams (Nosek 1998), in which two programmers worked
jointly to produce code quickly and effectively.
The systems that we built all shared several important characteristics. They were implemented in UNIX-like environments,
they used basic tools in those environments in small pieces of
code, and these pieces were connected to one another in ways
natural to the operating system. As an example, one of the paradigms that we used extensively to manage the flow of data
was what we term hose parts. The word hose is intended
to recall the well-known analogy comparing obtaining an MIT
education with taking a drink from a firehoseprocessing continuously flowing data on a massive scale is like this. In our
infrastructure, a hose is a UNIX directory with subdirectory
parts. Each part is a place where data is received, stored,
processed, and eventually deleted. The part has subdirectories
to accomplish these tasks, and by using, for example, the inherently atomic nature of the UNIX mv command (which renames a file), we were able to elaborate a relatively complex set
of generic operations with a surprisingly small amount of code.
Our vetter is a typical user of the hose paradigm. We receive hundreds of thousands of files of AMA data daily from
hundreds of telecommunication switches around the country.
Each switch has a mechanism in place for serializing the data
Figure 8. One year of calls, showing a spike in the middle due to social engineering, with long-duration operator-assisted calls to a known
high-fraud country.
TECHNOMETRICS, FEBRUARY 2010, VOL. 52, NO. 1
tion closes any open files (and removes files that were never
written to).
The vetter is implemented as a streamer, as is the fraud detection software. In the latter case, files extracted from the call
stream by each plug-in are typically processed by a job scheduled to run at regular time intervals. Often these jobs will apply thresholds of various sorts, producing alerts when appropriate. Each detection algorithm comprises a plug-in and the code
that processes the extracted data to generate alerts. Plug-ins can
themselves produce alerts if they detect from a single call that
an alert is justified; an example of this situation is an alert based
on a call of very long duration.
As new fraud patterns arise, detection algorithms are generally implemented by adding a new plug-in and some associated
logic, allowing a quick response to the changing face of fraud.
We have been able to add new plug-ins, fully tested, in as little
as 45 minutes.
Our system is validated in a few different ways. The cases
that we detect are handed off to an internal fraud investigation
department, which makes a final determination as to whether
or not to take action on a particular case. The results of these
investigations allow for regular assessment of our detection algorithms. Because the system is modular, it is easy to adjust a
model and reimplement it quickly if necessary. Another method
of validation comes from the fact that the impact of fraud often shows up on the bills of innocent customers. Therefore, we
have millions of customers acting as fraud detectors when they
inspect their bills carefully on a monthly basis. Reports from
customers regarding incorrect charges often lead us to determine that some fraud has occurred. Keeping track of such reports allows us to get a high-level sense of the amount of falsenegatives (i.e., undetected cases of fraud) in our system. These
measures ensure that our models effectively address the fraud
problem over time.
5. CONCLUSION
In our effort to develop world-class fraud management tools
from massive data streams, we have attacked problems that on
the surface appear to be quite different: subscription fraud, intrusion detection, repetitive debtors, access management fraud,
and the like, each with its own characteristics that hinder detection. Nonetheless, over the years we have found some common
themes that propagate across these different classes of fraud,
specifically in cases dealing with extremely large data sets:
The need to join data analysis and data management. In
large data analysis projects, there is no way to decouple the
analysis of the data from the storage, management, and retrieval of the data. In our early days, we thought we would
simply be building an algorithm, but at massive scales of
data, even simple exploratory data analysis methods are
challenging. Questions like how much data exist, what the
distributions are, and whether outliers exist become difficult. We discovered that having a stake in data management allowed us to answer these questions more easily and
also led to new possibilities and learning. In addition, after
observing how the fraud team used our tools, it was clear
that one of the most useful tools that they had was the ability to take high-level summaries of the data, hone in on
31
32
ACKNOWLEDGMENTS
The authors thank the referees and editors for their many
helpful comments.
[Received July 2008. Revised April 2009.]
REFERENCES
Abraham, B., and Ledolter, J. (1983), Statistical Methods for Forecasting, New
York: Wiley. [24]
Angus, I., and Blackwell, G. (1993), Phone Pirates, Ajax, Ontario, Canada:
Telemanagement Press. [21]
Beran, R. (1977), Minimum Hellinger Distance Estimates for Parametric Models, The Annals of Statistics, 5 (3), 445463. [27]
Bolton, R., and Hand, D. (2002), Statistical Fraud Detection: A Review, Statistical Science, 17 (3), 235255. [21,22]
Cahill, M. H., Lambert, D., Pinheiro, J. C., and Sun, D. X. (2002), Detecting
Fraud in the Real World, Norwell, MA: Kluwer Academic. [22,24]
Cortes, C., and Pregibon, D. (2001), Signature-Based Methods for Data
Streams, Data Mining and Knowledge Discovery, 5 (3), 167182. [24]
Cortes, C., Pregibon, D., and Volinsky, C. (2001), Communities of Interest, in
Advances in Intelligent Data Analysis. Lecture Notes in Computer Science,
Vol. 2189, Cascais, Portugal, pp. 105114. [25]
Cortes, C., Pregibon, D., and Volinsky, C. (2003), Computational Methods for
Dynamic Graphs, Journal of Computational and Graphical Statistics, 12,
950970. [26]
Dice, L. R. (1945), Measures of the Amount of Ecologic Association Between
Species, Ecology, 26 (3), 297302. [27]
Drechsler, R. L., and Mocenigo, J. M. (2007), The Yoix Scripting Language:
A Different Way of Writing Java Applications, Software: Practice and Experience, 37, 643667. [28]
Fawcett, T., and Provost, F. (1997), Adaptive Fraud Detection, Data Mining
and Knowledge Discovery, 1 (3), 291316. [22]
Goodman, J., Cormack, G. V., and Heckerman, D. (2007), Spam and the Ongoing Battle for the Inbox, Communications of the ACM, 50 (2), 2433.
[31]
Greer, R. (1999), Daytona and the Fourth-Generation Language Cymbal, in
Proceedings of the ACM SIGMOD International Conference on Management of Data, Philadelphia, PA: ACM Press, pp. 525526. [23]
Hill, S., Agarwal, D., Bell, R., and Volinsky, C. (2006), Building an Effective Representation for Dynamic Networks, Journal of Computational and
Graphical Statistics, 15 (3), 584608. [24,26,27]
International Telecommunication Union (1988), Recommendation Q.761, Signalling System no. 7, technical report, ITU. [23]
(1993), Recommendation Q.764, Signalling System no. 7ISDN
User Part Signalling Procedures, technical report, ITU-T. [23]
Kaplan, D. A. (2006), Intrigue in High Places: To Catch a Leaker, Hewlett
Packards Chairwoman Spied on the HomePhone Records of Its Board of
Directors, Newsweek (September). [21]
Kellogg, M. K., Huber, P. W., and Thorne, J. (1999), Federal Telecommunications Law (2nd ed.), New York: Aspen Law and Business. [22]
Lambert, D., Pinheiro, J. C., and Sun, D. X. (2001), Estimating Millions of
Dynamic Timing Patterns in Real Time, Journal of the American Statistical
Association, 96, 316330. [24]
Metwally, A., Agrawal, D., and Abbadi, A. E. (2005), Efficient Computation
of Frequent and Top-k Elements in Data Streams, in Database Theory
ICDT 2005, eds. T. Eiter and L. Libkin, BerlinHeidelberg: Springer,
pp. 398412. [24]
Moreau, Y., Preneel, B., Burge, P., Shawe-Taylor, J., Stoermann, C., and Cooke,
C. (1997), Novel Techniques for Fraud Detection in Mobile Telecommunication Networks, in ACTS Mobile Summit, Grenada, Spain, Brussels,
Belgium: ACTS. [22]
Nosek, J. T. (1998), The Case for Collaborative Programming, Communications of the ACM, 41 (3), 105108. [30]
Phua, C., Lee, V., Smith, K., and Gayler, R. (2005), A Comprehensive Survey
of Data Mining-Based Fraud Detection Research, technical report, Clayton
School of Information Technology, Monash University, available at http:
// clifton.phua.googlepages.com/ fraud-detection-survey.pdf . [21,22]
Rosenbaum, R. (1971), Secrets of the Little Blue Box, Esquire, 76, 117125,
222226. [20]
Rosset, S., Murad, U., Neumann, E., Idan, Y., and Pinkas, G. (1999), Discovery of Fraud Rules for TelecommunicationsChallenges and Solutions, in
KDD99: Proceedings of the Fifth ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, New York, NY: ACM Press,
pp. 409413. [22]
33