Classifying Online Fraud
The Internet has the power to connect people from all over the world, making it an incredible tool for communication and sharing of knowledge. However, the web has also opened up enormous new fraud opportunities. Before the web, a person in Nigeria or Russia would have found it exceedingly difficult to rip off a business in the U.S. This fraud "gold rush" has significantly increased the importance of fraud prevention software on the web -- any business that neglects fraud will eventually be overrun by anonymous fraudsters trying to make a quick buck.
Companies in this space almost always talk about fraud detection or fraud prevention, which is natural. We tend to focus on the thing we want to prevent. I would like to talk about this problem from the point of view of an online merchant. The following discussion is about this general problem, and applies to anyone doing any kind of fraud prevention.
As a data scientist I see this as a binary classification problem, independent of the labels we put on it. The linked Wikipedia article goes into gory detail (of course it does, it is Wikipedia) if you are interested. However, the two important points are:
- There are only two possibilities - something is a member of one class or another.
- In most cases, the two classes are not proportionate - there are many more members of one than the other.
So, you are either a fraudster or you are "legitimate". Furthermore, membership in one class means you are not a member of the other. When we classify someone, we can be wrong (of course), in two different ways, leading to a two-by-two matrix of actual class (fraudster or genuine) against predicted class (fraudster or genuine).
We are going to arbitrarily designate a correct fraudster determination as a "positive". That leads us to two types of errors:
- We incorrectly identify a fraudster as genuine (false negative). That is bad because it means we let in a fraudster who can do damage and cost us money.
- We incorrectly identify a genuine person as a fraudster (false positive). That is bad because we turn away a customer who is actually honest, and we lose that business.
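To make the four outcomes concrete, here is a minimal sketch that tallies them from a list of actual and predicted labels, with "fraud" as the positive class. The function name `confusion_counts` and the six sample customers are hypothetical, made up purely for illustration.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count the four outcomes, treating "fraud" as the positive class."""
    counts = Counter()
    for truth, guess in zip(actual, predicted):
        if truth == "fraud" and guess == "fraud":
            counts["true_positive"] += 1    # caught a fraudster
        elif truth == "fraud" and guess == "genuine":
            counts["false_negative"] += 1   # let a fraudster in
        elif truth == "genuine" and guess == "fraud":
            counts["false_positive"] += 1   # turned away an honest customer
        else:
            counts["true_negative"] += 1    # correctly accepted a customer
    return counts

# Hypothetical labels for six customers (not real data).
actual    = ["fraud", "genuine", "genuine", "fraud", "genuine", "genuine"]
predicted = ["fraud", "genuine", "fraud", "genuine", "genuine", "genuine"]

counts = confusion_counts(actual, predicted)
for outcome, n in sorted(counts.items()):
    print(f"{outcome}: {n}")
```

The two error cells (false negative and false positive) are the ones discussed in the list above.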
While we would love to eliminate errors, that is impossible in any real classification problem. However, we can tune things based upon which type of error is worse, and upon the relative probabilities of the two classes. Unfortunately, when we optimize to reduce one kind of error, we increase the other kind. No way around that.
Which error is worse, though? A lost customer costs us $X, where X is what they would have paid us had they used our service. The cost of a fraudster, by contrast, theoretically has no upper limit, depending upon the type of business.
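The trade-off between the two error types, and the effect of their unequal costs, can be sketched with a toy example. Everything here is an assumption for illustration: the fraud scores, the labels, the helper `errors_at`, and the two cost figures (in particular, a fixed `FRAUD_COST` stands in for a loss that in reality may be unbounded).

```python
# Hypothetical (fraud_score, is_fraud) pairs; a higher score means more suspicious.
scored = [
    (0.05, False), (0.10, False), (0.20, False), (0.35, False),
    (0.40, True), (0.55, False), (0.60, True), (0.80, True),
]

LOST_SALE_COST = 50   # assumed cost of turning away one genuine customer
FRAUD_COST = 500      # assumed average cost of letting one fraudster in

def errors_at(threshold, scored):
    """Return (false_positives, false_negatives) when scores >= threshold are blocked."""
    fp = sum(1 for score, is_fraud in scored if score >= threshold and not is_fraud)
    fn = sum(1 for score, is_fraud in scored if score < threshold and is_fraud)
    return fp, fn

for threshold in (0.3, 0.5, 0.7):
    fp, fn = errors_at(threshold, scored)
    cost = fp * LOST_SALE_COST + fn * FRAUD_COST
    print(f"threshold={threshold}: false positives={fp}, "
          f"false negatives={fn}, cost={cost}")
```

With these made-up numbers, raising the threshold removes false positives one at a time but adds false negatives, and because a missed fraudster is assumed to cost ten times a lost sale, the strictest threshold is the cheapest. Different cost assumptions would move that answer.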
So, it would seem that we should focus on getting rid of the fraudsters, even if that costs us some genuine customers, right? Electronic ID verification should be a no-brainer. Well, it isn't that simple. The second factor deals with something called prior probabilities. Basically, when a new customer comes in, what is the probability they are fraudulent vs genuine, assuming we know nothing at all about them? That turns out to be a really important question.
Assuming most of our new customers are genuine, we run into something called the False Positive Paradox. This is a really interesting concept, but the basic idea is that the rarer the thing we are testing for (e.g. fraud), the more precise our test has to be. Visually, it looks kind of like this:
The image on the left is from one of the earliest video games, Space Invaders. The image on the right is from one of the latest - Grand Theft Auto V. If you need to point to a particular pixel on the left, you could use a broad tip pencil. To point to a particular pixel on the right, you would need something super-fine. That's how classification works when the thing you are checking for is relatively rare - you need a very precise, accurate test.
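The same point can be worked through with Bayes' rule. The sketch below is only an illustration under assumed numbers: a 1% prior probability of fraud, a test that catches 99% of fraudsters, and a 5% false positive rate. None of these figures come from real data, and `prob_fraud_given_flag` is a hypothetical helper name.

```python
def prob_fraud_given_flag(prior_fraud, true_positive_rate, false_positive_rate):
    """Bayes' rule: probability that a flagged customer really is a fraudster."""
    flagged_fraud = prior_fraud * true_positive_rate
    flagged_genuine = (1 - prior_fraud) * false_positive_rate
    return flagged_fraud / (flagged_fraud + flagged_genuine)

# Assumed numbers: fraud is rare (1 in 100 new customers).
rare = prob_fraud_given_flag(prior_fraud=0.01,
                             true_positive_rate=0.99,
                             false_positive_rate=0.05)

# Same test, but in a world where half of all new customers are fraudsters.
common = prob_fraud_given_flag(prior_fraud=0.50,
                               true_positive_rate=0.99,
                               false_positive_rate=0.05)

print(f"fraud is rare:   {rare:.1%} of flagged customers are fraudsters")
print(f"fraud is common: {common:.1%} of flagged customers are fraudsters")
```

Under these assumptions, when fraud is rare only about one flagged customer in six is actually a fraudster, even though the test looks accurate on paper. The identical test is right about flagged customers more than 95% of the time when fraud is common, which is why the prior probability matters so much.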
One of the big complaints we hear from people using fraud detection techniques is that they have too high a false positive rate - they classify people as fraudulent when they are genuine. That makes sense to me. When you are trying to prevent fraud, you are continually battling with fraudsters. They get better and better (they have to in order to survive). But the genuine people don't get more sophisticated at proving they are genuine because that isn't their job. So the net effect is that fraud techniques become more invasive over time to keep up with the fraudsters, causing both more friction for genuine customers and more false positives.
All of this is a long way of explaining that talking about identity and fraud prevention isn't as easy as it sounds, because the concept of accuracy, when applied to biometric technologies, is more complicated than it first appears.