RESEARCH: Bias Correction for Supervised Machine Learning
I am thrilled to share that, after a ton of work, I recently published a research paper on arXiv called Sampling Bias Correction for Supervised Machine Learning. A local link is available, as is a Local Maximum podcast episode (218).
The paper says that if you receive only a portion of a dataset for machine learning, and you know the mechanisms by which the original data was abridged, you can work out the formulas for learning on the original dataset.
The lost data will make your results less certain, but at least the bias can be counteracted in a principled way.
I originally wrote this because of a problem that I worked out when I was building the attribution product for Foursquare in 2017. We were using machine learning to predict the likelihood of people visiting places. Because the set of “non-visit” examples was humongous, we downsampled it more heavily than the “visit” examples. I researched the mathematics of sampling bias in order to counteract the distortion this introduces.
I encourage anyone who is interested in machine learning and computer science in general to check out the introduction. It’s a quick read, and it presents the high-level ideas in our field.
Section 3 is really just “machine learning 101” which I started writing in order to establish the vocabulary, and eventually decided to make it a free-standing walk-through of the machine learning process. If you want an introduction to machine learning from a Bayesian perspective, read this section! It serves as a high-level primer for people who are unfamiliar with machine learning but have some mathematical background.
Supervised machine learning is not the only variety - and it is not always presented as a Bayesian inference problem - but it is an incredible tool for anyone trying to learn how this all works.
If you want to gain expertise in the bias correction problem, and want a deep understanding of how the Bayesian formulation solves this, then you should read the whole thing! Of course, if you need to solve the bias correction problem for work and the deadline is coming up, feel free to skip to the answers in section 5 (and 6 for logistic regression)!
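To give a flavor of what such a correction can look like, here is a minimal sketch for the simplest case, where each example is kept with a probability that depends only on its label. It uses a standard consequence of Bayes' rule (sampling multiplies the odds of a positive label by the ratio of the keep rates) and works on point estimates, so it is not necessarily the formula the paper derives in sections 5 and 6. The function names and the keep-rate parameters are hypothetical, chosen for illustration.

```python
import math


def correct_probability(p_sampled, keep_rate_pos, keep_rate_neg):
    """Map a probability learned on label-sampled data back to the original data.

    Assumption (not taken from the paper): every example was kept
    independently, with probability keep_rate_pos if its label is 1 and
    keep_rate_neg if its label is 0. Requires 0 <= p_sampled < 1.
    """
    # Sampling multiplies the odds of y=1 by keep_rate_pos / keep_rate_neg,
    # so multiplying by the inverse ratio recovers the original odds.
    odds_sampled = p_sampled / (1.0 - p_sampled)
    odds_original = odds_sampled * keep_rate_neg / keep_rate_pos
    return odds_original / (1.0 + odds_original)


def intercept_shift(keep_rate_pos, keep_rate_neg):
    """Amount to subtract from a logistic-regression intercept fit on sampled data."""
    return math.log(keep_rate_pos / keep_rate_neg)


# A model trained after keeping every positive but only 1% of negatives
# says "50/50" for some input; on the original data that is roughly 1%.
corrected = correct_probability(0.5, keep_rate_pos=1.0, keep_rate_neg=0.01)
```

In logistic regression the same correction is a constant shift of the log-odds, which is why only the intercept changes in this simple case.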
I hope that this research demonstrates that formulating a machine learning problem in the language of Bayesian Inference helps to break down and answer really tough questions. This solution would not have been possible without thinking in terms of Bayesian distributions - and I think this paper will serve as an excellent case study for people to understand why that is. Once you stop thinking in terms of exact answers and start thinking in terms of beliefs over possible answers, a whole new world of insights opens up.
Finally, I want this to be used in practice.
Much noise has been made about bias in datasets. The solution I present here will allow practitioners to make assumptions about this bias and peek at the consequences of those assumptions, which is a useful first step.
But more immediate is the original motivation of the paper, which is to reduce the size of training sets and adjust their composition so that computing becomes more efficient. The thinking behind this progresses as follows:
1) Big Data Product: Great news - we have a ton of data to run this model on! We can build something useful here.
2) Big Data Drawback: Hey, using all this data takes up a ton of resources. Are we past the point of diminishing returns? How about we throw out some data - pocket the savings - and the result will be just as good, or like 95% as good which is totally fine for us.
3) Uniform Random Sampling: How many examples do we have - 100 million? I think this will work on only 1 million, so for each datapoint I am going to pick a random number and keep the point 1% of the time. Then we'll end up with around a million points.
4) Bias Sampling: But wait!! Some data points are more valuable than others because we have a label imbalance. So let's be more selective about what we throw away. This means we can either safely throw away more data, or we can get better performance from the 1% rate we did before.
Now that bias sampling can be dealt with appropriately, this method can be deployed routinely and hopefully bring about compounding savings.
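Steps 3 and 4 above can be sketched in a few lines. This is only an illustration under made-up assumptions: a synthetic dataset of one million labels with about 2% positives, and keep rates picked so that both schemes retain roughly the same number of points. The helper name sample_by_label and all the numbers are hypothetical.

```python
import numpy as np


def sample_by_label(labels, keep_rate, rng):
    """Return a boolean mask that keeps each example independently
    with probability keep_rate[label] (labels are 0 or 1)."""
    rates = np.where(labels == 1, keep_rate[1], keep_rate[0])
    return rng.random(labels.shape[0]) < rates


rng = np.random.default_rng(0)

# Hypothetical imbalanced dataset: about 2% positives ("visits").
n = 1_000_000
labels = (rng.random(n) < 0.02).astype(int)

# Step 3: uniform sampling keeps every example 1% of the time.
uniform_mask = sample_by_label(labels, {0: 0.01, 1: 0.01}, rng)

# Step 4: biased sampling keeps 25% of positives and 0.5% of negatives,
# which retains about the same number of points but far more positives.
biased_mask = sample_by_label(labels, {0: 0.005, 1: 0.25}, rng)

print("uniform:", uniform_mask.sum(), "kept,", labels[uniform_mask].sum(), "positive")
print("biased: ", biased_mask.sum(), "kept,", labels[biased_mask].sum(), "positive")
```

The keep rates used here are exactly the "mechanism by which the data was abridged", so they need to be recorded alongside the sampled dataset for any later correction to be possible.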
Work still needs to be done in terms of solving for specific cases (as this paper does for logistic regression) and accounting for different sampling types. Let me know if you’re interested in any of these questions - and I’ll give you my assessment!
Finally, this exercise has made me come to understand that the teaching of Bayesian Inference, while fairly standardized, is still ripe for innovation from educators and systematizers.
For example, the mathematics of Bayes' rule benefits from considering probability to be proportional rather than absolute. This allows us to safely remove all of these pesky constant factors which are ultimately unnecessary and confuse everyone trying to follow the math. I tried to rely more on proportionalities and the idea of probability ratios rather than raw probabilities - but I’d like to see better notation around it. For example, in proportionality statements it should be obvious which symbols are considered variable and which are considered constant.
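To make the point concrete, here is the textbook form of Bayes' rule next to its proportional and ratio forms. The symbols (theta for the unknown parameters, D for the observed data) are chosen for illustration and are not necessarily the paper's notation, and the subscript on the proportionality sign is just one hypothetical way of marking which symbol is treated as variable.

```latex
% Textbook form: p(D) does not depend on \theta, so it is a constant factor.
p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}

% Proportional form: the subscript (hypothetical notation) says that \theta
% varies while everything else, including D, is held fixed.
p(\theta \mid D) \;\propto_{\theta}\; p(D \mid \theta)\, p(\theta)

% Ratio form: comparing two candidate values, the constant cancels exactly.
\frac{p(\theta_1 \mid D)}{p(\theta_2 \mid D)}
  = \frac{p(D \mid \theta_1)}{p(D \mid \theta_2)} \cdot \frac{p(\theta_1)}{p(\theta_2)}
```

The ratio form never mentions p(D) at all, which is what makes it convenient when the normalizing constant is hard to compute.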
I hope to continue this line of research and incorporate these tools into my future projects, including newmap.ai which is currently in development.