Randy Abbott

KM60370

Professor Dolan

10 Dec 2017

Benford’s Law or The First-Digit Law

1.) Introduction to Benford’s Law.

a.) How Benford’s Law was discovered.

Benford’s Law, sometimes called the First-Digit Law, was discovered twice, as Stoessiger (2013) describes: first by an astronomer named Newcomb in 1881, and again in 1938 by Benford, the physicist after whom the law is named (pp. 29-30). Both men made the discovery in the same way. Each noticed that the pages of a publicly accessible book of logarithm tables were dirtiest where they were consulted most often, namely the lowest-numbered pages. One might think this is simply how a well-used book is supposed to look, since everyone begins at the start and works through to the area of interest. That would be a reasonable explanation, except that, as Stoessiger (2013) points out, the astronomer Newcomb “noticed that the first pages of log tables, which show numbers starting with 1, were much grubbier than pages starting with other digits. He realised that the log table users, he and his fellow astronomers, must be looking up the logs of numbers that started with the digit 1 much more commonly than other digits” (p. 29). Benford rediscovered this phenomenon in 1938 and began to research it by analyzing how frequently each digit occurs in each position of the numeric values in the datasets he had access to. Miller (2015) points out that “…Benford’s The Law of Anomalous Numbers, published in the Proceedings of the American Philosophical Society in 1938” was better received than the work of his predecessor Newcomb in 1881 (pp. 4, 5, 7).

b.) The real-world nature of data described by Benford’s Law.

There are, it seems, two kinds of random data in the world. The first kind arises when, for example, state lottery officials or institutions draw numbers: every value has an equal probability of being drawn at that moment. The second kind has unequal probabilities of occurrence, because many values interact with one another on a sliding scale and so cannot all be equally likely at the same moment; the result is an uneven distribution of outcomes. Wind is an example. It is the sum of countless quantities interacting across the particulars of season and geography. It is random in that the wind blows where it will, yet not wholly random, because some factors, through their interaction with others, raise or lower the probability that the wind will blow this way or that. Benford’s Law, and the kind of numeric data it describes, can be characterized in the same way.

Benford’s Law cannot describe all numeric data; its applicability is limited. It describes the structure that seems to emerge in data whose values are related to one another by ratios of proportion, rather than data whose values are generated uniformly at random or manufactured. It is commonly assumed that the digits 1 through 9 are equally likely to appear as the first digit of a number. In reality, the frequencies with which the digits 1 through 9 appear first are anything but equal. The pattern is so consistent across so many kinds of data that it has been called Benford’s Law, or the First-Digit Law, for its prediction of how often each digit leads a number. Stoessiger (2013) describes what Benford discovered: each digit has a stable frequency of “first” appearance. The digit 1 appears first 30.1% of the time, 2 appears 17.6%, 3 appears 12.5%, 4 appears 9.7%, 5 appears 7.9%, 6 appears 6.7%, 7 appears 5.8%, 8 appears 5.1%, and 9 appears 4.6% (p. 29). The digits in the second, third, and fourth positions of the same numbers are distributed progressively more evenly, until by about the fifth position all ten digits, 0 through 9, are very nearly equally probable. Stoessiger (2013) sums up the counter-intuitiveness of the First-Digit Law with his observation about its stark contrast: “Benford’s Law requires that either 1 or 2 should be the leading digit about 47.7% of the time, while 8 and 9 should be the initial digits on less than 10% of occasions” (p. 30). Miller (2015) mentions Newcomb’s 1881 observation

“…noting the importance of scale. The numerical value of a physical quantity clearly depends on the

scale used, and thus Newcomb suggests that the correct items to study are ratios of measurements” (p.

5). Badal-Valero, Alvarez-Jareño, and Pavía (2017) describe two related properties: “Benford’s Law has as its main properties invariance in scale and in base. The scale-invariant

property implies that Benford’s Law continues to be fulfilled even if the units of measurement are

changed. That is, the level of fit of some data to Benford’s Law is independent of the measurement

system. In economic terms, the currency in which the variable is measured does not influence the result.

The base-invariant property states that the logarithmic law remains independent of the base used. It is

equally valid in base 10, in binary basis, or in any other base. Hill proves that Benford’s Law is the

unique continuous distribution base-invariant and that scale-invariance (a property impossible for

continuous variables) entails base-invariance” (p. 26). In summary, Benford’s Law describes how often each of the digits 1 through 9 appears first, and those frequencies arise from ratios between the quantities captured in the data. The data can therefore be measured on any scale, or written in any base, without changing whether it fits the law; in another base, that base’s own digits simply take the place of 1 through 9. One caveat must be mentioned, because Benford’s Law cannot be applied to every dataset. Cartlidge (2010) quotes the mathematician T. Hill (the same Hill mentioned earlier by Badal-Valero et al.): “The ubiquity of Benford’s Law,” he says, “especially in real-life data, remains mysterious.” Cartlidge also notes Hill’s prominence in the study of Benford’s Law: “Hill proved mathematically in 1995 that Benford's law is the only possible universal law describing the distribution of digits that is invariant under changes of scale... But neither he nor anyone else has discovered a general principle that can predict a priori which kinds of data sets should obey the law” (para. 8).
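The first-digit percentages quoted from Stoessiger (2013) in this section can be reproduced from the logarithmic form of the law, P(d) = log10(1 + 1/d), which the passages quoted here do not state explicitly. The Python sketch below is an illustration only; its function names are hypothetical and are not taken from the cited works. It also evaluates the standard extension of the formula to later digit positions, as a rough check on how quickly the digit distribution flattens toward 10% per digit.

```python
import math


def first_digit_probability(d):
    """Benford probability that the leading digit equals d (d = 1..9)."""
    return math.log10(1 + 1 / d)


def nth_digit_probability(d, position):
    """Benford probability that the digit at `position` (2, 3, ...) equals d (d = 0..9).

    Sums the Benford probability over every possible run of digits that
    could precede d; k ranges over all (position - 1)-digit prefixes.
    """
    if position < 2:
        raise ValueError("use first_digit_probability for position 1")
    low = 10 ** (position - 2)
    high = 10 ** (position - 1)
    return sum(math.log10(1 + 1 / (10 * k + d)) for k in range(low, high))


if __name__ == "__main__":
    print("First digit:")
    for d in range(1, 10):
        print(f"  {d}: {first_digit_probability(d):.1%}")

    # Gap between the most and least likely digit at each later position.
    for position in (2, 3, 4, 5):
        probs = [nth_digit_probability(d, position) for d in range(10)]
        spread = max(probs) - min(probs)
        print(f"Position {position}: spread = {spread:.5f}")
```

The first loop prints values that round to the percentages quoted above. The second loop shows the spread shrinking at each position; the later digits approach equal probability but never reach it exactly.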

c.) The application of Benford’s Law in the real-world.

The practical value of Benford’s Law lies in its ability to predict how often each digit should appear first in a data field, how often each should appear second, and so on, for data whose values reflect ratios of proportion. The digit frequencies observed in a dataset can then be compared with those predicted by Benford’s Law to measure how closely the data approximates the law. Badal-Valero et al. (2017) mention that, historically, Benford’s Law has been suggested as a way to detect fraud: the first-digit frequencies found in financial documents and databases are compared with the naturally occurring Benford pattern (p. 25). They note in their article that “the use of Benford’s Law in the field of accounting is prominent, having shown itself able to detect anomalies in accounting data” (p. 25). It has been suggested that Benford’s Law can detect fraud in the tax returns and supporting documents of corporations and individuals in the same way, by predicting what the first-digit frequencies should be if the data in those documents conforms to the law. Per Badal-Valero et al. (2017), “Benford’s Law has been extensively used as a tool for detecting election fraud” (p. 25). These examples highlight that Benford’s Law is most often used to detect irregular digit distributions in financial documents and other records. Manipulated values do not arise from the natural process that produced the rest of the data, so they distort the proportions among the first-digit frequencies, and the distortion shows up when those frequencies are compared with the ones Benford’s Law predicts. This gets to the heart of why Benford’s Law is used for fraud detection: introducing fraudulent values disturbs the very digit distribution that the law predicts for unmanipulated data. What makes Benford’s Law distinctive in this role is pointed out by Badal-Valero et al. (2017): “…when people manufacture data manually, the data rarely fits Benford’s Law” (p. 26).
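One common way to make this comparison concrete is a chi-square goodness-of-fit test on the first-digit counts. The Python sketch below is an assumption about how such a check might look, not the method used by Badal-Valero et al. (2017); the function names and both sample datasets are hypothetical. The constant 15.507 is the standard chi-square critical value for 8 degrees of freedom at the 5% significance level.

```python
import math
from collections import Counter

# Chi-square critical value for 8 degrees of freedom at the 5% level.
CRITICAL_VALUE_5_PERCENT = 15.507


def leading_digit(value):
    """Return the first nonzero digit of a number, or None if it has none."""
    for ch in str(abs(value)):
        if ch in "123456789":
            return int(ch)
    return None


def first_digit_counts(values):
    """Tally how often each digit 1-9 appears as the leading digit."""
    counts = Counter(leading_digit(v) for v in values)
    return [counts.get(d, 0) for d in range(1, 10)]


def benford_chi_square(values):
    """Chi-square statistic of observed first-digit counts against Benford's Law."""
    observed = first_digit_counts(values)
    total = sum(observed)
    if total == 0:
        raise ValueError("no values with a leading digit 1-9")
    statistic = 0.0
    for d, count in zip(range(1, 10), observed):
        expected = total * math.log10(1 + 1 / d)
        statistic += (count - expected) ** 2 / expected
    return statistic


if __name__ == "__main__":
    # Hypothetical data: powers of 2 track Benford's Law closely, while
    # the integers 100-999 have perfectly uniform first digits.
    natural = [2 ** n for n in range(1000)]
    uniform = list(range(100, 1000))
    for name, data in (("powers of 2", natural), ("100-999", uniform)):
        stat = benford_chi_square(data)
        if stat < CRITICAL_VALUE_5_PERCENT:
            verdict = "consistent with"
        else:
            verdict = "deviates from"
        print(f"{name}: chi-square = {stat:.2f}, {verdict} Benford's Law")
```

A statistic above the critical value is only a flag that the digits deserve a closer look. It is not proof of fraud, since, as noted earlier, not every legitimate dataset follows Benford’s Law.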

2.) Introduction to Benford’s Law processed with the RStudio/Rattle graphical user interface.

The purpose of this paper is to document how a data mining application such as RStudio/Rattle uses an algorithm based on Benford’s Law, which counts how often each digit appears first in numeric values, to screen data for irregular patterns that suggest fraud. To run the analysis in the Rattle GUI, the user first selects the file on the “data” tab, using the filename field to browse for it, and clicks execute once it is found. The user then chooses the distributions radio button under the “explore” tab and selects the number of digits, 1-9, to be analyzed for frequency of first appearance within the data file. The algorithm invoked by the user’s choice of “distributions” and “Benfords”, together with the value specified in the “Starting Digit” field, must match each of the nine digits (if that value is chosen in the “Starting Digit” field) with its frequency of first appearance, so that it can be seen whether the dataset conforms to Benford’s Law. To do so, the algorithm examines each value in each data field, according to the data format, for its first digit from one through nine, and totals the frequency of first appearance for each digit, field by field, for comparison with the frequencies that Benford’s Law predicts for naturally occurring data.

The results appear either as bars in a bar graph or as plot lines, depending on the viewer’s choice within the “explore” tab; plot lines are the default unless the bars option is chosen. Either way, the results are displayed in the RStudio “Plots” viewing window, whereas all other choices are made within the Rattle GUI application. When the bar graph or plot lines are compared with the standard first-digit frequencies quoted from Stoessiger (2013) earlier in this paper, there are three possible outcomes: the dataset approximately conforms to the frequencies predicted by Benford’s Law, it does not, or it cannot be compared with Benford’s Law at all because it lacks the structure of numeric ratios that the law requires. There are no other outcomes. The dataset either has the kind of structure that allows Benford’s Law to be applied or it does not, and if it does, it either approximates the predicted distribution or it does not. This simplifies the algorithm’s task, because nature sets the bar for conformity: the first-digit frequencies that Benford’s Law assigns to the digits 1 through 9.
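The tallying and plotting steps described in this section can also be sketched outside Rattle. The Python example below is a hypothetical stand-in, not Rattle’s own code: it tallies leading digits, then prints observed and expected proportions as text bars in place of the bar graph shown in the RStudio “Plots” window. The function names, the bar scale, and the sample data are all assumptions made for illustration, and the sketch only recognizes the simplest “cannot be compared” case, a dataset with no usable numeric values.

```python
import math


def leading_digit(value):
    """Return the first nonzero digit of a number, or None if it has none."""
    for ch in str(abs(value)):
        if ch in "123456789":
            return int(ch)
    return None


def observed_proportions(values):
    """Share of values led by each digit 1..9, or None if nothing is usable."""
    digits = [leading_digit(v) for v in values]
    digits = [d for d in digits if d is not None]
    if not digits:
        return None
    return [digits.count(d) / len(digits) for d in range(1, 10)]


def benford_rows(values, width=100):
    """Build two text rows per digit: observed bar ('#'), then expected bar ('-')."""
    observed = observed_proportions(values)
    if observed is None:
        return ["no numeric values with a leading digit 1-9; cannot compare"]
    rows = []
    for d, share in zip(range(1, 10), observed):
        expected = math.log10(1 + 1 / d)
        rows.append(f"{d} observed {share:6.1%} " + "#" * round(share * width))
        rows.append(f"  expected {expected:6.1%} " + "-" * round(expected * width))
    return rows


if __name__ == "__main__":
    # Hypothetical sample: the first 200 powers of 3.
    for row in benford_rows([3 ** n for n in range(200)]):
        print(row)
```

Reading the paired bars by eye mirrors the visual comparison described above; deciding how close is close enough still requires a judgment or a formal test.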

References

Badal-Valero, E., Alvarez-Jareño, J. A., & Pavía, J. M. (2017). Combining Benford's Law and machine learning to detect money laundering. An actual Spanish court case. Forensic Science International, 282, 24. Retrieved from https://ac.els-cdn.com/S0379073817304644/1-s2.0-S0379073817304644-main.pdf?_tid=514b5398-dc52-11e7-979e-00000aab0f27&acdnat=1512763365_63d1a34fd4288b38e7c0dd7d7803ab41

Cartlidge, E. (2010, October). In Nature, Number One Dominates. Physicsworld.com. Retrieved from http://physicsworld.com/cws/article/news/2010/oct/20/in-nature-number-one-dominates

Miller, S. J. (Ed.). (2015). Benford's Law: Theory and Applications. Princeton University Press. Retrieved Chapter One from http://assets.press.princeton.edu/chapters/s10527.pdf

Stoessiger, R. (2013). Benford's Law and why the integers are not what we think they are: A critical numeracy of Benford's law. Australian Senior Mathematics Journal, 27(1), 29. Retrieved from https://files.eric.ed.gov/fulltext/EJ1093383.pdf
