How I Made a 35% ROI in football betting using Machine Learning – Part 1

One day, when I was looking around for interesting data science toy projects to play with, an article from a machine learning tutorial website caught my attention. It says, “How to predict football results easily using machine learning”. Interesting, I thought, and I clicked into the article.

As I read the article with a curious mind, it didn’t take long for me to be sucked into the idea of making money with machine learning this way! I was thinking: Finally! I can convert all my machine learning skills into cold, hard cash! All those precious weekends spent tuning the bells and whistles of classifiers were not wasted after all! YES! A scene from The Lord of the Rings appeared in my head, and I was the Frodo who had finally turned to the dark side in the face of temptation.

But wait! If it’s so easy, why wouldn’t everyone do it? I’m sure tons of people smarter than me thought of this years ago. Some further googling set me on a path into the fascinating world of sports betting. I was amazed that there are tons of sports betting, tipster, prediction, and seriously analytical websites out there that provide updated predictions for every single match! I also realized that there are many academic papers that dive into the idea of using machine learning to predict football results. They were extremely helpful (at a later stage) because they allowed me to look at football betting from a rational, mathematical perspective.

While the article I first read explained a method that didn’t have a high accuracy score, it gave me good insight into the topic and, most importantly, ignited a tiny spark for my next toy project: make a profit in football betting using machine learning!

Could I do better than the many tutorials, analytical individuals, and book-thick academic papers, and eventually make a tidy profit from my ambitious project? Join me as we enter the magical world of sports betting…


One of the first steps in a data science project is to check for the availability of public datasets. Fortunately, I found a website here that not only provides the intricate details of football matches for almost every league you can think of, but also has historical data going back 10 years, which gives you thousands of samples to start with! This is a pure gold mine for any data science enthusiast, and you can imagine my grin when I saw the website.

For a start, I narrowed my scope down to the English Premier League (EPL), since the upcoming season is just around the corner and I might be able to bet real money soon if this project is a success.

You can download the EPL dataset for the 2007 to 2017 seasons (10 years of data) here.

When you inspect the dataset, you will see a lot of columns, covering everything from basic match statistics to betting odds across dozens of sites. There’s also a “data dictionary” on the download page that provides the column definitions. For ease of reading, I’ll just paste the important ones here:

Div = League Division
Date = Match Date (dd/mm/yy)
HomeTeam = Home Team
AwayTeam = Away Team
FTHG and HG = Full Time Home Team Goals
FTAG and AG = Full Time Away Team Goals
FTR and Res = Full Time Result (H=Home Win, D=Draw, A=Away Win)
HTHG = Half Time Home Team Goals
HTAG = Half Time Away Team Goals
HTR = Half Time Result (H=Home Win, D=Draw, A=Away Win)

Attendance = Crowd Attendance
Referee = Match Referee
HS = Home Team Shots
AS = Away Team Shots
HST = Home Team Shots on Target
AST = Away Team Shots on Target
and many other related match statistics...

B365H = Bet365 home win odds
B365D = Bet365 draw odds
B365A = Bet365 away win odds
BSH = Blue Square home win odds
BSD = Blue Square draw odds
BSA = Blue Square away win odds
and many other betting odds...

As a non-football fan, I could easily identify a few features that play a big role. For example, the “Home Team” feature could represent the home team advantage. It’s commonly known that the home team has a higher probability of winning than the away team, probably because of the presence of more supporters and familiarity with the physical surroundings.

The economics of Sports Betting

While I’m not a “King of Gambling” in sports betting (in fact I only picked up the knowledge for this project), I’ll go through the rough basics of sports betting to give you a better picture of what’s going on and why it matters. If you are not a sports betting newbie, skip this.

There are tons of columns in the dataset that represent the betting odds for a particular match from major online betting websites. I thought that these features would be a good representation of the winning probabilities set by the bookmakers (the betting companies). Bookmakers don’t conjure their odds out of thin air. They set the initial odds according to their own estimates of the outcome probabilities and then let market forces move the odds.

For example, bookmaker A felt that Team Apple is going to win hands-down for the match on Team Apple VS Team Orange. They will probably set the following odds:

Option 1: Team Apple Win: 1.10
Option 2: Draw: 2.09
Option 3: Team Orange Win: 3.80

It means that bookmaker A feels that the probability of Team Apple winning is: 1/1.10 = 0.91 or 91%
The probability of a Draw is: 1/2.09 = 0.48 or 48%
The probability of Team Orange winning is: 1/3.80 = 0.26 or 26%

Wait a second, 91% + 48% + 26% = 165%! Yep, that’s right. The surplus beyond 100% is the margin that bookmakers build into each match. Therefore, as you can see, they will almost certainly make money on each match as long as a good number of people bet on each of the 3 outcomes. They might make a loss on 1 outcome, but they will make it back on the other 2.
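The same arithmetic can be sketched in a few lines of Python. The odds below are made up for a different, more evenly matched game (they are not taken from the dataset or from the example above), and the function name `implied_probabilities` is just something I picked for illustration:

```python
def implied_probabilities(decimal_odds):
    """Convert decimal odds to the raw probabilities they imply (1 / odds)."""
    return [1.0 / o for o in decimal_odds]

# Hypothetical odds for home win, draw, away win (made up for illustration)
odds = [1.90, 3.80, 3.80]

raw = implied_probabilities(odds)
total = sum(raw)                   # more than 1.0 because of the margin
margin = total - 1.0               # the bookmaker's built-in surplus
fair = [p / total for p in raw]    # rescaled so the three sum to 1.0

print([round(p, 3) for p in raw])
print(round(margin, 3))
print([round(p, 3) for p in fair])
```

Dividing by the total is only one simple way to strip out the margin. Bookmakers do not necessarily spread their margin evenly across the three outcomes, so treat the rescaled numbers as a rough estimate.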

What happens if everyone bets on the safe side (1.10) and Team Apple really wins? Wouldn’t the bookmakers lose a lot of money?

Here’s the fun part. The odds above are only the initial odds, and the bookmaker will adjust them to reflect market demand and supply. In our example, at some point in the match, the bookmaker might raise the odds of a Team Orange win (3.80) even higher, perhaps to 4.0, to attract more people to bet on that outcome. So, even if Team Apple really wins, the bookmaker can cover the loss with the bets placed on Team Orange. It’s pretty fun to watch the odds for a live match, as you can witness how they fluctuate at different points of the game.

There’s an entire world of sports betting analytics out there and how people are taking advantage of odds leveraging to make lots of money but that’s well beyond the scope of our machine learning project.

Back to machine learning

What has this got to do with our machine learning project?

There’s a theory called the “Wisdom of the Crowd”, which basically says that the combined guess of many people (hundreds or even thousands) is much more accurate than that of a few individuals. There are tons of experiments and papers on this theory, and many researchers claim that it holds. If that’s the case, does that mean there’s a correlation between the odds and the likelihood of a team winning? If most people think that Team Apple will win, they will bet on “Team Apple Win”, and the bookmaker will react by lowering those odds further and raising the opposing odds to deter people from betting on the favoured team. Does that mean we can just use the odds in the dataset as features and proceed with training our model? We will soon find out.
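One rough way to probe that question is to check how often the outcome with the lowest odds (the market favourite) actually happens. Here is a minimal sketch using the `B365H`, `B365D`, `B365A` and `FTR` columns from the data dictionary. The five rows are made up for illustration, so the resulting number says nothing about the real EPL data:

```python
import pandas as pd

# Made-up matches: Bet365 odds plus the actual full-time result
matches = pd.DataFrame({
    "B365H": [1.50, 2.80, 4.00, 1.80, 2.10],
    "B365D": [4.00, 3.20, 3.50, 3.60, 3.30],
    "B365A": [6.50, 2.50, 1.90, 4.50, 3.60],
    "FTR":   ["H",  "D",  "A",  "H",  "A"],
})

odds_cols = ["B365H", "B365D", "B365A"]

# The lowest odds mark the outcome the market considers most likely;
# the last letter of the column name ('H', 'D' or 'A') matches the FTR coding
favourite = matches[odds_cols].idxmin(axis=1).str[-1]

# Share of matches where the favourite was the actual result
hit_rate = (favourite == matches["FTR"]).mean()
print(hit_rate)
```

Running the same few lines on the real dataset would give a first feel for how much signal the odds carry before any model is trained.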

It’s important to grasp that the accuracy of your model alone isn’t enough to make a good profit. You have to choose betting odds that are worth betting on, given the accuracy of your model. It took me quite a while to realize this because of my lack of betting knowledge, and I hope it makes sense to you.
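To make the link between accuracy and odds concrete, here is a small sketch of the expected profit per unit staked. It assumes, purely for illustration, a model that is right 60% of the time no matter which odds it bets at (a strong simplification), and the function names are my own:

```python
def expected_profit_per_unit(hit_rate, decimal_odds):
    """Average profit per 1 unit staked if bets at these odds win `hit_rate` of the time."""
    # A winning bet returns `decimal_odds` units (stake included); a losing bet returns 0
    return hit_rate * decimal_odds - 1.0

def break_even_odds(hit_rate):
    """Decimal odds at which the average profit is exactly zero."""
    return 1.0 / hit_rate

hit_rate = 0.60  # hypothetical model accuracy
for odds in (1.30, 1.67, 2.00):
    print(odds, round(expected_profit_per_unit(hit_rate, odds), 3))
print(round(break_even_odds(hit_rate), 3))
```

Under that assumption, any bet priced below roughly 1.67 loses money on average even when the model is right 60% of the time.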

Here’s an example:

Your model accurately predicts 60% of the match results and these are the opening odds for the next 5 matches:

$50 down the drain

If you had blindly bet on all 5 matches, you would probably have lost money even with a stellar 60% accuracy. Not every match is created equal, and you wasted your “probability points” on matches that don’t give you a worthwhile reward.

Therefore, if you select only the bets whose odds meet a minimum value, for example 2.0, you would make a profit when the model predicts correctly 60% of the time:

That’s a neat 34% profit!
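The bookkeeping behind this kind of comparison is easy to sketch. The two slates below are made up (they are not the odds from my example, so the totals differ), each with 3 correct predictions out of 5 and a flat $10 stake per match:

```python
def flat_stake_result(bets, stake=10.0):
    """Return (total staked, profit) for a list of (decimal_odds, won) bets."""
    staked = stake * len(bets)
    # Decimal odds include the stake, so a winning bet returns stake * odds
    returned = sum(stake * odds for odds, won in bets if won)
    return staked, returned - staked

# Hypothetical slate of short-priced favourites, 60% predicted correctly
low_odds_slate = [(1.20, True), (1.30, True), (1.25, True), (1.40, False), (1.35, False)]

# Hypothetical slate where every bet is priced at 2.0 or better, also 60% correct
value_slate = [(2.10, True), (2.00, True), (2.50, True), (2.40, False), (2.10, False)]

staked_low, profit_low = flat_stake_result(low_odds_slate)
staked_value, profit_value = flat_stake_result(value_slate)

print(staked_low, round(profit_low, 2))
print(staked_value, round(profit_value, 2))
```

The hit rate is identical in both slates; only the prices differ, and that alone flips a loss into a profit.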

Tip: When you see the opening odds are not worthwhile, don’t strike it off totally. Wait till the match is live, and you might see some new worthwhile odds for your bets.

Going back to the point of this blog post, and more specifically this project of profiting from sports betting using machine learning, you can see that domain knowledge is extremely valuable and is required to succeed. This applies not just to this project but also to real data science projects in your day job as a data scientist. I’m sure many data scientists would agree with me on that.

What if you don’t have enough domain knowledge?

Obviously, the first option is to go through the pain and learn from books or online resources by yourself. This is a long, time-consuming process that might impact your data science project timeline.

The second option is to engage a domain expert or subject matter expert: someone who has been working in the area for years and knows the topic much better than the Average Joe.

Fortunately, I have a friend who is very knowledgeable about everything to do with football, and he became my subject matter expert for this project! I gained a lot of practical insights from him, from betting strategies to team performance analytics, that made this project a success. I’ll come to that later.

I believe being a successful data scientist goes beyond the technical skills of cleaning data (real boring, I know), feature engineering, employing machine learning algorithms, and tuning models. It also takes the soft skills of presenting results to stakeholders, convincing them the results are true (sounds silly, I know), and effectively engaging individuals who know the subject much better than you.

Okay, we have talked a lot about the theories and perhaps it’s time to do some hands-on coding, shall we?

Looking at the tons of features available, I thought that everything I needed to make a good prediction was already there. In fact, there are at least 20 good features (including betting odds from various bookmakers) ready for training the model, and, as I said before about the wisdom of the crowd theory, the betting odds should be good representations of the likelihood of each outcome. Therefore, I would just need to feed these 10 years of data into a machine learning model and it should spit out highly accurate predictions, right? At most, I would need to do some model evaluation and hyper-parameter tuning, and that wouldn’t take too much time. Does that sound too simple for our journey to riches? Let’s find out.

#Attempt 1 – Feed and forget

As usual, let’s load the 10 datasets into data frames.

import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score,StratifiedKFold

#load one CSV per season (2017 down to 2008), skipping malformed lines
#on_bad_lines="skip" replaces error_bad_lines=False, which was removed in pandas 2.0
seasons = range(2017, 2007, -1)
datasets = [pd.read_csv(f"dataset/p{year}.csv", on_bad_lines="skip") for year in seasons]

#concat all the datasets together as a single dataframe
giantData = pd.concat(datasets, join='outer', axis=0, ignore_index=True)

When we import the datasets from various years, we can see that some columns did not exist in the earlier seasons, which gives us null or empty values when we merge the datasets together. We could have filled the null values with an estimated average, but for simplicity’s sake, let’s just remove the columns that have null values, since we have tons of columns to play with.

Obviously, we have to drop match statistics like total home goals (for this particular match) because at the point of predicting, we wouldn’t have them (if you did, you wouldn’t be reading this), and keeping them would defeat the purpose of predicting.
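Here is a tiny illustration of what happens, using two made-up one-row “seasons” rather than the real files. The older one lacks the `B365H` column, so the merged frame gets a null there, and `dropna(axis=1, how='any')` then removes the whole column:

```python
import pandas as pd

# Two tiny made-up "seasons"; the older one has no B365H column
season_new = pd.DataFrame({"HomeTeam": ["Arsenal"], "FTR": ["H"], "B365H": [1.5]})
season_old = pd.DataFrame({"HomeTeam": ["Chelsea"], "FTR": ["D"]})

# An outer join keeps every column seen in any season and fills the gaps with NaN
merged = pd.concat([season_new, season_old], join="outer", ignore_index=True)
print(merged.isna().sum())

# Keep only the columns that are complete for every row
complete = merged.dropna(axis=1, how="any")
print(list(complete.columns))
```

The trade-off is that a column is thrown away even if only a single row is missing, which is the price of keeping things simple here.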

#see the count of non-null values for each column
print(giantData.count())

#drop off columns that contains null values in at least 1 row
giantData.dropna(axis=1, how='any',inplace=True)

#drop off match results that we realistically wouldn't have at the time of predicting
giantData.drop(['AC', 'AF', 'AR', 'AS', 'AST', 'AY', 'Date', 'Div', 'HC', 'HF', 'HR', 'HS', 'HST', 'FTHG', 'FTAG', 'HTAG', 'HTHG', 'HTR', 'HY'],axis=1,inplace=True)

Here, we do some data cleaning and pre-processing, such as encoding the categorical values as numerical values and renaming some columns, which avoids problems later on.

#the column names must be alphanumeric or else training will throw a bunch of errors later
giantData.rename(columns = {'BbAv<2.5':'BbAvB2.5'},inplace=True)
giantData.rename(columns = {'BbAv>2.5':'BbAvA2.5'},inplace=True)
giantData.rename(columns = {'BbMx<2.5':'BbMxB2.5'},inplace=True)

#Encode some columns
lbRef = preprocessing.LabelEncoder()
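#strip stray non-alphanumeric characters from the referee names (capital letters are kept);
#assign the result back instead of relying on inplace=True on a single column
giantData['Referee'] = giantData['Referee'].replace(to_replace='[^0-9A-Za-z ]+', value='', regex=True)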
giantData['Referee'] = lbRef.fit_transform(giantData['Referee'])
lbHTeam = preprocessing.LabelEncoder()
giantData['HomeTeam'] = lbHTeam.fit_transform(giantData['HomeTeam'])
lbATeam = preprocessing.LabelEncoder()
giantData['AwayTeam'] = lbATeam.fit_transform(giantData['AwayTeam'])
lbRes = preprocessing.LabelEncoder()
giantData['FTR'] = lbRes.fit_transform(giantData['FTR'])

#Create a separate series for the y true values
giantData_Y = giantData.pop('FTR')
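If you are wondering what the encoded values stand for, here is a small standalone sketch of how `LabelEncoder` treats the three result codes. It uses a short made-up list instead of the real column:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoded = encoder.fit_transform(["H", "D", "A", "H"])

# Classes are stored in sorted order, and each label becomes its index in that list
print(list(encoder.classes_))
print(list(encoded))

# inverse_transform maps the integers back to the original labels
print(list(encoder.inverse_transform([0, 1, 2])))
```

So in `giantData_Y`, 0 should correspond to an away win, 1 to a draw and 2 to a home win. Keep in mind that the integer codes for teams and referees are arbitrary labels rather than magnitudes.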

Finally, we whip up our good ol’ Logistic Regression for model training. We use shuffled, stratified 10-fold cross-validation to train and evaluate the model: the data is split into 10 folds, and each fold takes a turn as the 10% test set while the model trains on the other 90%, giving us an array of 10 accuracy scores.

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg_scores = cross_val_score(logreg, giantData, giantData_Y,cv =StratifiedKFold(n_splits=10,shuffle = True))
#average the 10 fold scores into a single accuracy figure
print(logreg_scores.mean())
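To see what “stratified” buys us, here is a minimal sketch on made-up labels (50 home wins, 30 away wins, 20 draws, which are not the real class proportions). Every test fold ends up with the same class mix as the full set:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels: 50 home wins, 30 away wins, 20 draws
y = np.array(["H"] * 50 + ["A"] * 30 + ["D"] * 20)
X = np.zeros((len(y), 1))   # the features do not matter for the split itself

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

fold_counts = []
for train_idx, test_idx in skf.split(X, y):
    # Count how many of each result land in this test fold
    labels, counts = np.unique(y[test_idx], return_counts=True)
    fold_counts.append(dict(zip(labels.tolist(), counts.tolist())))

print(fold_counts[0])
```

The `random_state=0` here is only for repeatability of the sketch. The call in the post leaves it unset, so its fold assignment changes from run to run.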

This image(5-fold) does a better job explaining the concept of Stratified K-Fold:

Here’s the moment we have been waiting for! Can our 10 minutes (okay, perhaps a few hours) of data science magic predict the results well, given that the dataset already has so many rich features?

Drum roll please…


The result is out: We got a 53% accuracy without doing any feature engineering or any sorts of magical tricks!

Okay, that’s a little disappointing, and you might argue that Logistic Regression is not the golden ticket. So let’s try a few other machine learning algorithms that have different decision boundaries, just in case our data is not linearly separable.

One thing I have learnt over time is that people try other machine learning algorithms when one doesn’t work not simply because they are desperate, and certainly not because the Logistic Regression model is broken.

It’s because when you apply a machine learning algorithm that doesn’t fit the characteristics of the dataset, you get lousy accuracy. An example would be fitting data that is not linearly separable with a linear model like Logistic Regression. You would probably get higher accuracy by fitting the dataset with a non-linear model like an SVM or a tree-based model like a decision tree, since those models can learn the “exceptional cases” that linear models cannot.

Here’s a picture of comparison that speaks a thousand words:

As you can see, the RBF SVM and decision tree can learn data points that belong to a class in awkward places.
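You can reproduce the effect yourself on synthetic data. The sketch below uses scikit-learn’s `make_circles` (one class forming a ring around the other), which has nothing to do with the football data and is only meant to show the difference in decision boundaries:

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic data: one class forms a ring around the other
X, y = make_circles(n_samples=400, noise=0.1, factor=0.3, random_state=0)

# A straight line cannot separate an inner blob from an outer ring
linear_acc = cross_val_score(LogisticRegression(), X, y, cv=5).mean()

# The RBF kernel lets the SVM draw a closed boundary around the inner class
rbf_acc = cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean()

print(round(linear_acc, 2), round(rbf_acc, 2))
```

On data shaped like this, the linear model hovers around chance while the RBF SVM gets nearly everything right. Whether the football data behaves like this is exactly what the next experiment checks.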

Let’s give the SVM and a tree-based model (XGBoost) a try.

from sklearn import svm
svm_clf = svm.SVC(kernel='rbf')
svm_scores = cross_val_score(svm_clf, giantData, giantData_Y,cv =StratifiedKFold(n_splits=10,shuffle = True))
print(svm_scores.mean())
#prints 0.46093711279464278

import xgboost as xgb
xgb_clf = xgb.XGBClassifier()
xgb_scores = cross_val_score(xgb_clf, giantData, giantData_Y,cv =StratifiedKFold(n_splits=10,shuffle = True))
print(xgb_scores.mean())
#prints 0.52944780377617029

SVC (Support Vector Machine with an RBF kernel) gives an unsatisfactorily low score of 0.46, and XGBoost (a tree-based ensemble) comes close with a score of 0.53, which is similar to our Logistic Regression model.

It appears that linear models like Logistic Regression are a better fit for our data, and that the low accuracy score is not caused by the data being non-linearly separable.

Anyway, it seems that the best we can do is 53% accuracy (Logistic Regression)! It’s not a figure that we should be proud of, and in fact, a theoretical coin flip can give you 50% accuracy (heads half of the time and tails the other half).
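A quick way to put a figure like 53% in context is to score a “model” that ignores the features entirely. The sketch below uses scikit-learn’s `DummyClassifier`, which always predicts the most common class. The labels are made up, and the 46/28/26 split is an assumption for illustration, not a number measured from the dataset:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Made-up labels: 46% home wins, 28% away wins, 26% draws
y = np.array(["H"] * 460 + ["A"] * 280 + ["D"] * 260)
X = np.zeros((len(y), 1))   # the dummy model never looks at the features

# Always predicts the most frequent class seen during training
baseline = DummyClassifier(strategy="most_frequent")
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
baseline_acc = cross_val_score(baseline, X, y, cv=cv).mean()

print(round(baseline_acc, 2))
```

Swapping in `giantData` and `giantData_Y` would give the real baseline to hold against the 53%. With three possible outcomes, that baseline does not have to sit at 50%.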

Would you put your hard earned money on a bet based on a coin flip? Definitely not! I certainly wouldn’t do that.

We have to do much better than 53% to even begin talking about making money with this model. You will see how we eventually arrive at a more accurate and reliable model through the techniques applied over numerous attempts and some valuable opinions from our subject matter expert.

This post has gotten a little too long and I’ll leave some good bits for the next part. Thanks for reading, and joining me in this exciting journey. Stay tuned!
