RESEARCH: Bias Correction for Supervised Machine Learning

I am thrilled to share that, after a ton of work, I recently published a research paper on arXiv called Sampling Bias Correction for Supervised Machine Learning. A local link is available, as well as a Local Maximum podcast episode (218).

 
 

The paper shows that if you receive only a portion of a dataset for machine learning, and you know the mechanism by which the original data was abridged, you can work out the formulas for learning as if you had the original dataset.

The lost data will make your results less certain, but at least the bias can be counteracted in a principled way.

I originally wrote this because of a problem I worked out while building the attribution product for Foursquare in 2017. We were using machine learning to predict the likelihood of people visiting places. Because the “non-visit” examples vastly outnumbered the “visit” examples, we downsampled them much more aggressively. I researched the mathematics of sampling bias in order to counteract the skew this introduced.

I encourage anyone who is interested in machine learning, and computer science in general, to check out the introduction. It’s a quick read, and it presents the high-level ideas in our field.

Section 3 is really just “machine learning 101,” which I started writing in order to establish the vocabulary and eventually decided to make into a free-standing walk-through of the machine learning process. If you want an introduction to machine learning from a Bayesian perspective, read this section! It serves as a high-level primer for people who are unfamiliar with machine learning but have some mathematical background.

Supervised machine learning is not the only variety - and it is not always presented as a Bayesian inference problem - but this framing is an excellent tool for anyone trying to learn how it all works.

If you want to gain expertise in the bias correction problem, and want a deep understanding of how the Bayesian formulation solves this, then you should read the whole thing! Of course, if you need to solve the bias correction problem for work and the deadline is coming up, feel free to skip to the answers in section 5 (and 6 for logistic regression)!
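For a taste of what those answers look like: under simple label-dependent downsampling, a logistic regression fit on the sample has its log-odds shifted by a constant offset, and that offset can be undone. The sketch below shows the widely known intercept-offset correction, not the paper's own (more general) derivation; the function name and keep rates are just for illustration.

```python
import math

def correct_log_odds(sampled_log_odds, keep_rate_pos, keep_rate_neg):
    """If positives were kept with probability keep_rate_pos and negatives with
    keep_rate_neg, the log-odds learned on the sampled data are inflated by
    ln(keep_rate_pos / keep_rate_neg); subtracting that offset recovers the
    log-odds on the original, unsampled data."""
    return sampled_log_odds - math.log(keep_rate_pos / keep_rate_neg)

# Example: all positives kept, 1% of negatives kept.
# true_log_odds = correct_log_odds(model_log_odds, 1.0, 0.01)
```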

I hope that this research demonstrates that formulating a machine learning problem in the language of Bayesian Inference helps to break down and answer really tough questions. This solution would not have been possible without thinking in terms of Bayesian distributions - and I think this paper will serve as an excellent case study for people to understand why that is. Once you stop thinking in terms of exact answers and start thinking in terms of beliefs over possible answers, a whole new world of insights opens up.

Finally, I want this to be used in practice.

Much noise has been made about bias in datasets. The solution I present here allows practitioners to make assumptions about that bias and peek at the consequences of those assumptions, which is a useful first step.

But more immediate is the original motivation of the paper, which is to reduce the size and change the composition of training sets so that computing becomes more efficient. The thinking behind this progresses as follows:

1) Big Data Product: Great news - we have a ton of data to run this model on! We can build something useful here.

2) Big Data Drawback: Hey, using all this data takes up a ton of resources. Are we past the point of diminishing returns? How about we throw out some data - pocket the savings - and the result will be just as good, or like 95% as good, which is totally fine for us.

3) Uniform Random Sampling: How many examples do we have - 100 million? I think this will work on only 1 million, so for each data point I'll draw a random number and keep it 1% of the time. Then we'll end up with around a million points.

4) Biased Sampling: But wait!! Some data points are more valuable than others because we have a label imbalance. So let's be more selective about what we throw away (see the sketch after this list). This means we can either safely throw away more data, or we can get better performance at the same 1% rate as before.
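Here is a minimal sketch of steps 3 and 4, assuming a binary-labeled dataset. The function names, keep rates, and the idea of attaching an importance weight of 1/keep-rate to each surviving example are illustrative, not the paper's exact formulation.

```python
import random

def downsample(dataset, keep_prob):
    """Step 3: uniform random sampling -- keep each example with probability keep_prob."""
    return [(x, y) for (x, y) in dataset if random.random() < keep_prob]

def biased_downsample(dataset, keep_prob_by_label):
    """Step 4: label-dependent sampling -- rarer labels get a higher keep rate.
    Each surviving example carries a weight of 1 / keep_prob so that downstream
    estimates can be corrected for the sampling bias."""
    sampled = []
    for x, y in dataset:
        p = keep_prob_by_label[y]
        if random.random() < p:
            sampled.append((x, y, 1.0 / p))   # importance weight
    return sampled

# Example: keep all of the rare "visit" (1) examples, 1% of the "non-visit" (0) ones.
# dataset = [(features, label), ...]
# small = biased_downsample(dataset, {1: 1.0, 0: 0.01})
```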

Now that biased sampling can be dealt with appropriately, this method can be deployed routinely and will hopefully bring about compounding savings.

Work still needs to be done in terms of solving for specific cases (as this paper does for logistic regression) and accounting for different sampling types. Let me know if you’re interested in any of these questions - and I’ll give you my assessment!

This exercise has also made me realize that the teaching of Bayesian inference, while fairly standardized, is still ripe for innovation from educators and systematizers.

For example, the mathematics of Bayes’ rule benefits from considering probability to be proportional rather than absolute. This allows us to safely remove all of those pesky constant factors, which are ultimately unnecessary and confuse everyone trying to follow the math. I tried to rely more on proportionalities and probability ratios rather than raw probabilities - but I’d like to see better notation around this. For example, in proportionality statements it should be obvious which symbols are considered variable and which are considered constant.
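To make this concrete, the proportional form of Bayes’ rule (in standard notation, not necessarily the paper’s) reads:

P(theta | data) ∝ P(data | theta) · P(theta)

The normalizing constant P(data) is dropped because it doesn’t depend on theta; you can always recover it at the end by making the posterior sum (or integrate) to 1.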
I hope to continue this line of research and incorporate these tools into my future projects, including newmap.ai, which is currently in development.

The Idea of Subjective Probability

I've been deep in Bayesian analysis recently, and I want to discuss some of the philosophical foundations.

The background here is that there are roughly two camps of statistical thought: the Frequentists and the Bayesians.  They represent very different ways of thinking about the world.  I fall squarely on the Bayesian side. The purpose of this post isn't to construct some grand argument. I just want to introduce a simple idea: Subjective Probability.

Just like the world of statistics is divided between the Frequentists and the Bayesians, the interpretation of probability is divided into objective and subjective. Objective probability is associated with the frequentists and subjective with the Bayesians.

The prime example of objective probability is a coin flip. Suppose that this is a fair coin and it produces heads on half of all flips. It is an objective property of the coin that it produces heads one out of every two times.

Let's look at another example: a deck of cards. A standard deck is weighted to produce a heart a quarter of the time, and to produce a picture card 3/13 of the time. Again, it's helpful to think of the deck as yielding an objective probability - but this way of thinking is limiting.

For example, suppose you have a deck of cards on the table and you again want to assign a probability of seeing a picture card. You know that it's 3/13, but you keep staring at the top card in that deck. You see the back of that card. You know it's either a picture or it's not. "What are you?" you say. As soon as you turn over that card, the probability either goes to 0 (it's not a picture) or it goes to 1 (it is a picture).

What if you shuffled the deck and happened to get a peek at the card on the bottom? You'd then change your expectations of what the top card is going to be. What if you caught a glimpse of that card, but you're not exactly sure what you saw?

The probability now isn't some inherent property of the deck; it's a number in your mind that represents your expectation of the top card being a picture card. This number can take into account the inherent properties of the deck, of course, but it can also take into account any other information you have, as well as your experience. For example, maybe you suspect the deck is rigged. Your belief about the deck might be different from someone else's.

Subjective probability applies much better in real-world forecasting situations. Let's say you want to assign a probability to a particular candidate winning an election. In the end, they'll either win or they'll lose - but the probability you assign is an expectation of that event. You don't need to be well informed to have a subjective expectation - but you want to set yourself up to have more accurate expectations as you gather more information.

Sometimes we assign binary expectations to an event. For example, if I am absolutely sure something will occur, I will assign it a 1. If I believe it is impossible, I'll assign it a 0. And then I make decisions based on that belief. But it turns out that we can make better decisions by hedging. If I see on my phone that there's a 30% chance of rain, maybe I won't bring my umbrella, but I'll wear clothes that I don't mind getting wet.

What does it mean to have a degree of belief of 30% rain? It's not like we're living in a frequentist world where that particular day can be repeated over and over again to get a fraction. This is a difficult concept to define, but another way to think about it is as a ratio of expectations. If there's a 30% chance of rain, that means there's a 70% chance of no rain, and the ratio of expectations is 3:7. It's related to the amount of risk we're willing to take on a certain outcome.

When the event finally occurs, we can quantify how surprising that event was by using logarithms on the assigned probability of that event. For the example above, if it rains the surprise is -ln(0.3), or roughly 1.2. If it doesn't rain, it's -ln(0.7) or roughly 0.35.
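Here is that calculation as a tiny snippet (natural log, so surprise is measured in nats; the helper name is just for illustration):

```python
import math

def surprise(prob_assigned):
    """Surprise of an outcome to which you had assigned this probability."""
    return -math.log(prob_assigned)

# The 30% rain forecast (rain : no-rain odds of 3:7):
print(surprise(0.3))  # it rains: about 1.20
print(surprise(0.7))  # it stays dry: about 0.36
```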

Just because you're very surprised doesn't mean you were wrong to assign the probabilities that you did. It could be that your forecasting was really good given the information at hand, and a rare event occurred. But it's generally true that if you are surprised less often after adjusting your methods for assigning probabilities, your new methods are probably better. In complex systems, there's no optimal method - you can always add more data and computation. In simple games, there usually is an optimal method - and the probabilities it yields can be thought of as objective probabilities.

Anyone can assign a subjective probability to an event. You'll often hear in casual conversation remarks like "there's a 20% chance we'll be on time". These probabilistic assignments are often made before any thought has been put into them. If you want to assign better probabilities, a good start is to follow some basic logic. For example, if X always leads to Y, the probability of Y must be greater than or equal to the probability of X. There's also the indifference principle: if you have no information distinguishing two mutually exclusive events, then you should assign them equal probabilities.

And finally, there's Bayes' rule. This tells us how to update our beliefs when we are exposed to new information. This most important rule is how the idea of subjective probability gives rise to Bayesian inference.
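As a tiny worked example of such an update (with made-up numbers), suppose you give a 5% prior to the deck from earlier being rigged so that every card is a picture card, and then the top card turns out to be a picture card:

```python
# Toy Bayes-rule update for the rigged-deck suspicion; the 5% prior and the
# "every card is a picture card" hypothesis are invented for illustration.
prior_rigged = 0.05
prior_fair = 0.95

likelihood_rigged = 1.0      # a rigged all-picture deck always shows a picture card
likelihood_fair = 3 / 13     # a fair deck shows a picture card 3/13 of the time

# Work with unnormalized ("proportional") beliefs, then normalize at the end.
unnorm_rigged = prior_rigged * likelihood_rigged
unnorm_fair = prior_fair * likelihood_fair

posterior_rigged = unnorm_rigged / (unnorm_rigged + unnorm_fair)
print(round(posterior_rigged, 3))  # about 0.186 -- the suspicion nearly quadruples
```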