Randy Abbott
KM60370
Professor Dolan
10 Dec 2017
Benford’s Law, sometimes called the First-Digit Law, was discovered twice, as Stoessiger (2013)
describes: once by the astronomer Newcomb in 1881, and again in 1938 by Benford, the physicist after
whom the law is named (pp. 29-30). Both men discovered it the same way: each noticed that in shared
books of logarithm tables the dirtiest pages were the ones consulted most often, namely the
lowest-numbered pages. One might think this is simply how a well-used book wears, since every reader
starts at the front and works through to the area of interest. That explanation falls short,
however, because as Stoessiger (2013) points out, the
astronomer Newcomb “noticed that the first pages of log tables, which show numbers starting with 1,
were much grubbier than pages starting with other digits. He realised that the log table users, he and his
fellow astronomers, must be looking up the logs of numbers that started with the digit 1 much more
commonly than other digits” (p. 29). Benford rediscovered this phenomenon in 1938 and began to
research it by tabulating how often each digit occurs in the leading position of the values in the
datasets he had access to. Miller (2015) indicates that “…Benford’s
The Law of Anomalous Numbers, published in the Proceedings of the American Philosophical Society in
1938” was better received than the 1881 work of his predecessor Newcomb (pp. 4, 5, 7).
There are, it seems, two kinds of random data in the world. The first kind is what state lottery
officials produce when they draw numbers: every value has an equal probability of being drawn at any
moment. The second kind has unequal probabilities of occurrence because it arises from many values
interacting with one another on a sliding scale, so the outcomes cannot all be equally likely, and
their observed distribution is uneven as a result. An example is the wind, which is the sum of
countless factors interacting across seasons and geography. It is random in that the wind blows
where it will, yet not entirely random, because some factors raise or lower the probability that it
blows one way or another. The numeric data described by Benford’s Law is of this second kind.
Benford’s Law cannot describe all numeric data; its applicability is limited. It applies to the
structure that emerges in data whose values stand in ratios of proportion to one another, rather
than to data whose values are drawn with equal probability or manufactured. It is commonly assumed
that each of the digits 1-9 appears as the first digit with equal frequency. In reality, however,
the frequency with which each digit appears first is anything but equal. The
pattern is so consistent across so many kinds of data that it has been called Benford’s Law or the
First-Digit Law, because it predicts how often each digit appears as the first digit of a value.
Stoessiger (2013) describes what Benford discovered: each digit has a stable frequency as the first
digit. The digit 1 appears first 30.1% of the time, 2 appears 17.6%, 3 appears 12.5%, 4 appears
9.7%, 5 appears 7.9%, 6 appears 6.7%, 7 appears 5.8%, 8 appears 5.1%, and 9 appears 4.6% (p. 29).
For digits in the second, third, and fourth positions of the same value, the frequencies move
progressively closer to a uniform distribution, until by about the fifth position all digits are
nearly equally probable. Stoessiger (2013) sums up how counter-intuitive the First-Digit Law is
with this observation about its stark contrast: “Benford’s Law requires that either 1 or 2
should be the leading digit about 47.7% of the time, while 8 and 9 should be the initial digits
on less than 10% of occasions” (p. 30). Miller (2015) mentions Newcomb’s 1881 observation
“…noting the importance of scale. The numerical value of a physical quantity clearly depends on the
scale used, and thus Newcomb suggests that the correct items to study are ratios of measurements” (p.
5). Badal-Valero, Alvarez-Jareño, and Pavía (2017) identify two related properties:
“Benford’s Law has as its main properties invariance in scale and in base. The scale-invariant
property implies that Benford’s Law continues to be fulfilled even if the units of measurement are
changed. That is, the level of fit of some data to Benford’s Law is independent of the measurement
system. In economic terms, the currency in which the variable is measured does not influence the result.
The base-invariant property states that the logarithmic law remains independent of the base used. It is
equally valid in base 10, in binary basis, or in any other base. Hill proves that Benford’s Law is the
unique continuous distribution base-invariant and that scale-invariance (a property impossible for
continuous variables) entails base-invariance” (p. 26). In summary, Benford’s Law
describes how often each of the digits 1-9 appears first. Because those frequencies rest on ratios
between the quantities captured in the data, the data can be measured in any scale or written in
any base-counting system, and equivalent first-digit frequencies hold for whatever digits that
system uses in place of 1-9. One caveat must be stated about when Benford’s Law
can be applied to a dataset. Cartlidge (2010) quotes the mathematician T. Hill (the same Hill
mentioned earlier by Badal-Valero et al.): “The ubiquity of Benford’s Law,” he says, “especially in
real-life data, remains mysterious.” Noting Hill’s prominence in work on Benford’s Law,
Cartlidge says “Hill proved mathematically in 1995 that Benford's law is the only possible universal law
describing the distribution of digits that is invariant under changes of scale... But neither he nor anyone
else has discovered a general principle that can predict a priori which kinds of data sets should obey the [law].”
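The first-digit percentages quoted from Stoessiger (2013) above agree with the logarithmic formula commonly given for Benford’s Law, P(d) = log10(1 + 1/d). This paper does not state the formula, so the following Python sketch is offered only as an illustration under that assumption. The function names and the sample data (powers of 2, multiplied by 7 to imitate a change of units) are hypothetical choices made for this example, not taken from any of the cited sources.

```python
import math


def benford_probability(digit: int) -> float:
    """Expected share of values whose first digit is `digit` (1-9)."""
    return math.log10(1 + 1 / digit)


def first_digit(n: int) -> int:
    """Leading decimal digit of a positive integer."""
    return int(str(n)[0])


def first_digit_shares(values: list) -> dict:
    """Observed share of each first digit 1-9 in a list of positive integers."""
    counts = {d: 0 for d in range(1, 10)}
    for v in values:
        counts[first_digit(v)] += 1
    return {d: counts[d] / len(values) for d in counts}


expected = {d: benford_probability(d) for d in range(1, 10)}

# Illustrative data only: powers of 2 span many orders of magnitude.
powers = [2 ** n for n in range(1000)]
original = first_digit_shares(powers)

# Multiply every value by the same factor, as a change of units would.
rescaled = first_digit_shares([7 * v for v in powers])

for d in range(1, 10):
    print(
        f"{d}: expected {expected[d]:.1%}  "
        f"original {original[d]:.1%}  rescaled {rescaled[d]:.1%}"
    )
```

If the scale-invariance property described above holds for this sample, the original and rescaled columns should both stay close to the expected column.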
The practical value of Benford’s Law lies in its ability to predict how often each digit should
appear first in a data field, how often each should appear second, and so on, for data whose values
stand in ratios of proportion. The digit frequencies observed in a dataset can then be compared with
those the law predicts to measure how closely the data approximates it. Badal-Valero et al. (2017)
mention that, historically, Benford’s Law has been proposed as a means of detecting fraud: the
first-digit frequencies in financial documents and databases are compared with the naturally
occurring Benford pattern (p. 25). Badal-Valero et al. (2017) note in their article that “the use of
Benford’s Law in the field of accounting is prominent, having shown itself able to detect anomalies in
accounting data” (p. 25). It has been suggested that Benford’s Law can detect fraud in the
same way in the tax returns and supporting documents of corporations and individuals, by
predicting what the first-digit frequencies should be if the data in such documents conforms to
Benford’s Law. Per Badal-Valero et al. (2017), “Benford’s Law has been extensively
used as a tool for detecting election fraud” (p. 25). These examples show that Benford’s Law is
used most often to detect irregular digit distributions in financial documents and other records.
Such irregularities arise when manipulated values distort the proportions among the first-digit
frequencies, and they surface when those frequencies are compared with the ones predicted by
Benford’s Law. This gets to the heart of why Benford’s Law is used for fraud detection: values
introduced by fraudulent methods disturb the digit distribution that genuine data would otherwise
share with Benford’s Law. Badal-Valero et al. (2017) point out what makes Benford’s Law
distinctive in this role: “…when people
manufacture data manually, the data rarely fits Benford’s Law” (p. 26).
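The comparison described in this section can be sketched in Python as follows. This is only an illustration under stated assumptions: the paper does not name a measure of deviation, so the mean absolute deviation used here is one simple choice among several, and both samples are made up for the example (powers of 2 as a stand-in for conforming data, and the integers 100-999 as a stand-in for values whose first digits are all equally common). No threshold for "suspicious" is implied.

```python
import math


def benford_expected() -> dict:
    """Expected first-digit shares, assuming P(d) = log10(1 + 1/d)."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit_shares(values: list) -> dict:
    """Observed share of each first digit 1-9 in a list of positive integers."""
    counts = {d: 0 for d in range(1, 10)}
    for v in values:
        counts[int(str(v)[0])] += 1
    total = len(values)
    return {d: c / total for d, c in counts.items()}


def mean_absolute_deviation(observed: dict, expected: dict) -> float:
    """Average gap between observed and expected shares over the nine digits."""
    return sum(abs(observed[d] - expected[d]) for d in expected) / len(expected)


# Hypothetical samples, invented for this sketch.
natural_like = [2 ** n for n in range(1, 501)]
manufactured_like = list(range(100, 1000))  # each first digit occurs 100 times

expected = benford_expected()
mad_natural = mean_absolute_deviation(first_digit_shares(natural_like), expected)
mad_manufactured = mean_absolute_deviation(
    first_digit_shares(manufactured_like), expected
)

print(f"natural-like sample:      deviation {mad_natural:.4f}")
print(f"manufactured-like sample: deviation {mad_manufactured:.4f}")
```

A larger deviation only flags a dataset for closer inspection; by itself it does not establish that fraud has occurred.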
2.) Introduction to Benford’s Law processed with the RStudio/Rattle graphical user interface.
The purpose of this paper is to document how an algorithm based on Benford’s Law, which examines
the frequency with which each digit appears first in numeric values, can screen numeric data for
irregular patterns suggesting fraud when it is run from a data mining application such as
RStudio/Rattle. In the Rattle GUI, the Benford’s Law algorithm is called up by choosing the
“distributions” radio button under the “explore” tab and then selecting how many of the digits 1-9
are to be analyzed for frequency of first appearance in the data file. This is done after selecting
the file under the “data” tab, using the filename field to browse for the file and clicking execute
once it is found. The algorithm invoked by the user’s choice of “distributions” and “Benfords” in the
“explore” tab, with a value specified in the “Starting Digit” field, must pair each of the nine
digits (if that value is chosen in the “Starting Digit” field) with its frequency of first
appearance, so as to show whether the dataset conforms to Benford’s Law. The algorithm must look at
each data value in each data field, find its first digit from one through nine according to the
data format, and total the frequency of first appearance for each digit, field by field, for
comparison with the frequencies found in nature that Benford’s Law describes.
When a dataset is analyzed in RStudio/Rattle, conformity with Benford’s Law is displayed either as
bars in a bar graph or as plot lines, depending on the viewer’s choice in the “explore” tab; plot
lines are the default unless the bars option is chosen. Whichever view is chosen, the results
appear in the RStudio “Plots” viewing window, whereas all other options are set within the Rattle
GUI. When the bar graph or plot lines are compared with the standard Benford’s Law first-digit
frequencies quoted from Stoessiger (2013) earlier in this paper, there are two possible outcomes:
the dataset conforms approximately, or it cannot meaningfully be compared with Benford’s Law
because it lacks the internal structure of numeric ratios. The algorithm has no other processing
options: the dataset either has the structure that allows Benford’s Law to be applied or it does
not, and it either approximates the distribution predicted by Benford’s Law or it does not. This
simplifies the algorithm’s task, because nature sets the bar for what approximates Benford’s Law
through the frequency with which each counting digit appears first.
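The tallying step described in this section might look like the Python sketch below. It is not Rattle’s actual implementation, which this paper does not reproduce; the function names, the sample field values, and the text bars are all hypothetical. It assumes each field holds a single number, possibly with currency symbols, signs, or leading zeros, and it skips values that contain no digit from 1 to 9.

```python
import re
from collections import Counter


def leading_digit(value):
    """Return the first nonzero digit (1-9) in the text of `value`, or None."""
    match = re.search(r"[1-9]", str(value))
    return int(match.group()) if match else None


def tally_first_digits(values) -> Counter:
    """Count how often each digit 1-9 appears first, skipping unusable values."""
    tally = Counter()
    for v in values:
        digit = leading_digit(v)
        if digit is not None:
            tally[digit] += 1
    return tally


# Hypothetical field values in mixed formats, invented for this sketch.
sample_field = ["$1,204.50", "0.0032", "-875", "19", "N/A", "0", "2e5", 340]

tally = tally_first_digits(sample_field)
total = sum(tally.values())

# Text bars, loosely echoing the bar-graph view described above.
for d in range(1, 10):
    share = tally[d] / total
    print(f"{d} {'#' * round(share * 40):<40} {share:.1%}")
```

With a sample this small the shares say nothing about conformity; the point is only to show how first digits are extracted and totaled before any comparison with the expected frequencies.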
References
Badal-Valero, E., Alvarez-Jareño, J. A., & Pavía, J. M. (2017). Combining Benford's Law and machine
learning to detect money laundering. An actual Spanish court case. Forensic Science International, 282.
Cartlidge, E. (2010, October). In Nature, Number One Dominates. Physicsworld.com. Retrieved from
http://physicsworld.com/cws/article/news/2010/oct/20/in-nature-number-one-dominates
Miller, S. J. (Ed.). (2015). Benford's Law: Theory and Applications. Princeton University Press.
Stoessiger, R. (2013). Benford's Law and why the integers are not what we think they are: A critical
numeracy of Benford's law. Australian Senior Mathematics Journal, 27(1), 29. Retrieved from
https://files.eric.ed.gov/fulltext/EJ1093383.pdf