Fraud Detection By The Numbers
February 21, 2011 at 7:39 am, by Terence Craig
Mary mentioned our new fraud detection capabilities in her last post. Our primary fraud detection mechanism uses what is known as Benford’s law. Benford’s law, also known as the first-digit law, describes how often each digit from 1 to 9 appears as the leading digit in many naturally occurring sets of numbers. The neat little test built on it checks whether the leading digits in a large group of numbers (or in a randomly selected subset of it) match those expected probabilities.
The law is powerful, but you have to be careful that your problem really fits within its constraints. Benford’s law works best on:
- Highly variable numeric data, such as stock prices, global sales figures, and tax returns (but not IQs, body weights, or most other things that follow a normal distribution)
- Data that is truly numeric and not an identifier (for example, a price versus a Social Security number)
- Large data sets (if sampling a larger set, make sure to use a truly random sample)
Also, even if your data fits these criteria, you need to remember that a mismatch with Benford’s law is only an indication that there might be fraud, not proof that there is fraud.
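The checklist above can be turned into a rough pre-check before running any analysis. The sketch below is only an illustration: the function names `looks_benford_suitable` and `random_subset` are hypothetical, and the thresholds (at least 1,000 non-zero values, spread over at least three orders of magnitude) are assumptions chosen for the example, not established cutoffs.

```python
import math
import random


def looks_benford_suitable(values, min_count=1000, min_orders=3.0):
    """Crude screen: enough non-zero values, spread over several orders of magnitude.

    Both thresholds are illustrative assumptions, not established cutoffs.
    """
    magnitudes = [abs(v) for v in values if v != 0]
    if len(magnitudes) < min_count:
        return False
    spread = math.log10(max(magnitudes)) - math.log10(min(magnitudes))
    return spread >= min_orders


def random_subset(values, k, seed=None):
    """Uniform random sample without replacement (not just the first k rows)."""
    rng = random.Random(seed)
    return rng.sample(list(values), k)
```

A check like this cannot tell an identifier from a true quantity, so the second criterion still needs human judgment.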
How It Works
For a detailed explanation, see this Wikipedia article or this great book from O’Reilly: Statistics Hacks. The shorthand version is that, for data meeting the above requirements, it has been observed that the first non-zero digit takes each value with the following probabilities:
First-digit probabilities under Benford’s law
First digit | Probability according to Benford’s law |
1 | 0.301 |
2 | 0.176 |
3 | 0.125 |
4 | 0.097 |
5 | 0.079 |
6 | 0.067 |
7 | 0.058 |
8 | 0.051 |
9 | 0.046 |
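The values in the table also follow a closed-form expression: the expected probability of first digit d is log10(1 + 1/d). As a minimal sketch (the function name `benford_probability` is made up for this example), the table can be reproduced like this:

```python
import math


def benford_probability(d):
    """Expected share of values whose first non-zero digit is d (1 to 9)."""
    if d not in range(1, 10):
        raise ValueError("first digit must be between 1 and 9")
    return math.log10(1 + 1 / d)


# Print the nine probabilities, rounded to three places like the table above.
for d in range(1, 10):
    print(d, round(benford_probability(d), 3))
```

Because the nine probabilities sum to 1, digit 1 alone accounts for roughly 30% of leading digits, while digit 9 accounts for under 5%.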
To use Benford’s law for fraud detection, you calculate how often each digit from 1 to 9 occurs as the first non-zero digit in your data set and compare those relative frequencies to the table above. Large discrepancies mean that your data should be viewed with some skepticism. Benford’s law has been accepted as evidence in U.S. courts of law and has become popular lately with the IRS, SEC, and forensic accountants. Accountants tend to refer to Benford’s law as digital frequency analysis. Here is an example using U.S. tax data from author T.P. Hill.
To give you an idea of how widely Benford’s law applies, take a look at this great graphic from this ANU research paper that shows how closely non-fraudulent, naturally occurring data matches expected Benford values.
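The comparison of first-digit frequencies against the expected values can be sketched in a few lines of Python. This is an illustration only, not how PAF does it: the helper names are hypothetical, and the largest absolute gap between observed and expected frequencies is just one simple way to measure a discrepancy (a formal test such as chi-square would be a stricter alternative).

```python
import math
from collections import Counter

# Expected first-digit probabilities: log10(1 + 1/d) for d = 1..9.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(x):
    """First non-zero digit of x, or None if x is zero (or not finite)."""
    for ch in str(abs(x)):
        if ch in "123456789":
            return int(ch)
    return None


def first_digit_frequencies(values):
    """Relative frequency of each first digit (1 to 9) in values."""
    digits = [d for d in map(first_digit, values) if d is not None]
    if not digits:
        raise ValueError("no non-zero values to analyze")
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}


def max_deviation(values):
    """Largest absolute gap between observed and expected frequencies."""
    observed = first_digit_frequencies(values)
    return max(abs(observed[d] - BENFORD[d]) for d in range(1, 10))


# Powers of two are a classic Benford-like sequence; the integers 100..999
# have uniformly distributed first digits and should deviate strongly.
powers_of_two = [2 ** n for n in range(1, 1001)]
uniform = list(range(100, 1000))
print(max_deviation(powers_of_two))
print(max_deviation(uniform))
```

How large a deviation counts as suspicious depends on the size of the data set, so any cutoff should be chosen with that in mind.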
In the next version of the PatternBuilders Analytic Framework (PAF), Benford’s law can be applied with a single click to the first two non-zero digits of any time series data that we track. PAF will also warn you if the data set is not a good candidate for Benford’s law. We will be putting up some videos of this and some other new features once we finish tweaking the UI based on some of our beta feedback. It is pretty cool: for example, it spotted that some of our unit test data was fake. Happy fraud detection!
Entry filed under: Data, General Analytics. Tags: analytics, Benford's Law, first-digit law, fraud detection, PatternBuilders Analytic Framework, Statistics Hacks.