Use Case #3: Horse Racing Prediction: A Machine Learning Approach (Part 2)


Us, running at the hippodrome after our discovery (image from Racebets)


Written by Idrissa Ndiaye & Koffi Cornelis

Abstract

In the first part of this article, we demonstrated that intuitive approaches to horse racing prediction, such as picking a horse at random or always betting on the favorite, yield a negative profit in the long term. Our motivation is to find an edge and generate profitable models.
The purpose of this second part is to apply an approach based on machine learning and deep learning.

With some horse racing knowledge, we first created useful features (Feature Engineering) to improve the performance of machine learning models.

Then, we built our training sets incrementally so the models benefit from horses’ most recent results.

Finally, we developed a tree-based model (LightGBM) and a neural network to evaluate the profit/loss for each month.
In this article, we will focus on ensemble models.

Development environment

Here is the tech stack used for this project:

  • Python
  • Keras and Tensorflow
  • Scikit-Learn
  • LightGBM
  • Hyperopt

A GCP-oriented project

This horse racing use case naturally relies on machine learning algorithms. Benefiting from our GCP certifications, we used GCP for all our machine learning development.

We used Cloud Storage to store our data, notebooks from AI Platform to write our code and Compute Engine to create and run our virtual machine.

To reproduce the results, here is a link to our GitHub repository with the source code.

I. Data Pipeline

This part shows how we handle feature creation using a data pipeline in Python. It also explains our approach and how we create our training and test datasets.

A. The initial data

The initial dataset comes from Kaggle and covers races in Hong Kong from 1997 to 2005. It consists of 6,348 races with 4,405 runners. Initially, we had 104 different features, but since we had access to the date of the race, we could create a lot more features.

B. Feature engineering

The features you use influence more than everything else the result. No algorithm alone, to my knowledge, can supplement the information gain given by correct feature engineering.

— Luca Massaron

We created features that measure:

  • the horse’s change in weight,
  • the rest time between races,
  • jockey changes between races,
  • the horse’s past results.

We ended up with 1,618 features after this step. We used a complete pipeline with several functions for the different feature groups. We called this Python file extract_features and you can find it here.
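As an illustration, here is a minimal sketch of the kind of features listed above. The column names (horse_id, race_date, jockey_id, actual_weight, result) are assumptions; the real pipeline lives in extract_features.

```python
import pandas as pd

def add_basic_features(runs: pd.DataFrame) -> pd.DataFrame:
    """Illustrative versions of the features described above.

    Column names are assumptions; the real pipeline is in extract_features.
    """
    runs = runs.sort_values(["horse_id", "race_date"]).copy()
    by_horse = runs.groupby("horse_id")

    # Rest time: days elapsed since the horse's previous race.
    runs["rest_days"] = by_horse["race_date"].diff().dt.days

    # Change in carried weight compared to the previous race.
    runs["weight_change"] = by_horse["actual_weight"].diff()

    # Jockey change: 1 when the jockey differs from the previous race.
    runs["jockey_changed"] = by_horse["jockey_id"].transform(
        lambda s: (s != s.shift()).astype(int)
    )

    # Past results: rolling mean of the last 3 finishing positions,
    # shifted so the current race never leaks into its own feature.
    runs["mean_past_position"] = by_horse["result"].transform(
        lambda s: s.shift().rolling(3, min_periods=1).mean()
    )
    return runs
```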

C. Our approach

As explained before, we need to compare the ML results with our previous baseline models, which can be found here. So, we need to compare the exact same races to determine which of the baseline and ML models brings the best profit.

Our test set goes from January 2005 to August 2005.

We will create different models based on incremental periods of time so that each test month benefits from all the data available before it. For each month, we will have a model, and we will combine their results to compare with our baseline models.

The graph below shows the number of races available that we will use for training and testing.

Number of races run from 1997 to 2005

Set 1:

  • Train: races before 2005
  • Predict: January 2005

Set 2:

  • Train: races before 2005 + January 2005
  • Predict: February 2005

This continues until we reach the eighth set.

Just remember that set #4 doesn’t exist because there are no races in April 2005.
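A minimal sketch of this expanding-window split is shown below, assuming a race_date datetime column; the list of test months matches the sets described above (April 2005 is skipped).

```python
import pandas as pd

# Test months for sets #1 to #8; set #4 (April 2005) is skipped
# because there are no races that month.
TEST_MONTHS = ["2005-01", "2005-02", "2005-03", "2005-05",
               "2005-06", "2005-07", "2005-08"]

def incremental_sets(races: pd.DataFrame):
    """Yield (month, train, test): train on every race before the test month."""
    month = races["race_date"].dt.to_period("M").astype(str)
    for test_month in TEST_MONTHS:
        yield test_month, races[month < test_month], races[month == test_month]
```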

II. Machine Learning and Betting

This second part briefly explains how betting works for horse racing, then focuses on our model-based approach and our metrics.

A. Betting for horse racing

As explained in the first part of this use case (here), there are two ways of betting: for each race, we can bet on the winner or on placed horses.

  • Winner: we select a horse before the race and, if it wins, we win our bet.
  • Placed: we select a group of 2 or 3 horses before the race (depending on the number of runners) and, for each of them that finishes in the top 2 or top 3, we win that bet.

If it is not very clear, feel free to check here for a full explanation.

B. Ensemble is strength

For both bet types, winner and placed, we use two different prediction approaches:

  • Deep learning with Keras and Tensorflow
  • Tree based model classifier with LightGBM and HyperOpt Optimization

We proceed as follows:

  • We combine both results in terms of probabilities.
  • We apply a weighting coefficient of 0.3 to the deep learning values and 0.7 to the LGBM values.

For example, if the predictions for a race are:

  • Deep learning: [0.3, 0.2, …]
  • LGBM: [0.2, 0.3, …]
  • Then, the final prediction is [0.23, 0.27, …] (0.3 × 0.3 + 0.7 × 0.2 = 0.23, and so on), as in the sketch below.
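A minimal sketch of this weighted blend; it reproduces the example above.

```python
import numpy as np

W_DL, W_LGBM = 0.3, 0.7  # weighting coefficients used in this article

def ensemble_probs(dl_probs, lgbm_probs):
    """Blend the per-horse probabilities of both models for one race."""
    return W_DL * np.asarray(dl_probs) + W_LGBM * np.asarray(lgbm_probs)

# Example from above: [0.3, 0.2] and [0.2, 0.3] -> [0.23, 0.27]
print(ensemble_probs([0.3, 0.2], [0.2, 0.3]))
```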

C. And what about metrics?…

Well, in this case we chose to focus on two things:

  • The percentage of successful bets:
    This is not a great indicator because we can succeed in a lot of bets but still end up losing money because of low odds.
  • The profit metric:
    The most important metric for our use case is the profit we generate from our models on the test set.

The goal of this model is to improve the profit we had before with our baseline models (part 1).
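As a rough sketch of the profit metric, here is one possible implementation assuming $1 flat bets and decimal odds (the payout of a winning bet is stake × odds); the exact implementation is in the repository.

```python
import numpy as np

def profit(bet_won, odds, stake=1.0):
    """Profit over a series of flat bets.

    bet_won: 1 when the horse we backed won (or placed), else 0.
    odds: decimal win (or place) odds of the horse we backed.
    """
    bet_won = np.asarray(bet_won, dtype=float)
    odds = np.asarray(odds, dtype=float)
    payout = (bet_won * odds * stake).sum()   # returns from winning bets
    return payout - stake * len(bet_won)      # minus the total amount staked

# Three $1 bets, one winner at odds 4.0 -> payout 4.0, staked 3.0, profit $1.0
print(profit([0, 1, 0], [2.5, 4.0, 6.0]))
```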

III. Results for winner

In this part, we will only focus on winner bets.

This section will be split between explanations and results:

  • Explain our approach with HyperOpt optimization for LGBM
  • Explain how we train ML algorithms for all sets corresponding to each month
  • Show results for a specific month, January 2005
  • Show final results, a consolidation of all models and sets

A. HyperOpt: Optimization for LGBM

First, we define our hyperparameter search space: the range of values our optimizer can select from. We use the TPE algorithm implemented in HyperOpt.

Then we evaluate several parameter combinations over multiple rounds using our profit metric.

To find the best parameters, we use our first set of data. We save all those rounds in a CSV file and we retrieve the best parameters from it.

Here are the final best parameters for the LGBM, where the profit metric indicates $27.4 for set #1.
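A minimal sketch of the TPE search with Hyperopt follows. The search space shown here and the train_and_evaluate helper (training LightGBM on set #1 and returning the profit on its test month) are illustrative assumptions; the real search space is defined in the repository.

```python
from hyperopt import fmin, tpe, hp, Trials

# Illustrative search space; the actual one is defined in the repository.
space = {
    "num_leaves": hp.quniform("num_leaves", 16, 128, 1),
    "learning_rate": hp.loguniform("learning_rate", -5, -1),
    "min_child_samples": hp.quniform("min_child_samples", 5, 100, 1),
    "feature_fraction": hp.uniform("feature_fraction", 0.4, 1.0),
}

def objective(params):
    params["num_leaves"] = int(params["num_leaves"])
    params["min_child_samples"] = int(params["min_child_samples"])
    # train_and_evaluate is a hypothetical helper: it trains LightGBM on
    # set #1 with these parameters and returns the profit on January 2005.
    return -train_and_evaluate(params)  # Hyperopt minimizes, so negate the profit

trials = Trials()
best_params = fmin(fn=objective, space=space, algo=tpe.suggest,
                   max_evals=200, trials=trials)
```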

B. The LGBM approach

For LightGBM, we used the parameters found earlier with the hyperparameter optimization. We trained our LGBM models for all sets, always keeping the same optimized parameters (the best parameters from set #1). Once a model is trained, we save it.
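A minimal sketch of training and saving one LGBM model per set, assuming the optimized parameters from set #1 and a binary target (1 if the horse won its race); the file path here is hypothetical.

```python
import lightgbm as lgb
import joblib

def train_and_save_lgbm(X_train, y_train, best_params, set_id):
    """Train one LGBM classifier for a given set and persist it to disk."""
    model = lgb.LGBMClassifier(**best_params)  # same optimized parameters for every set
    model.fit(X_train, y_train)                # y_train: 1 if the horse won, else 0
    joblib.dump(model, f"models/lgbm_winner_set_{set_id}.pkl")  # hypothetical path
    return model
```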

We will focus on the first set where the test set is January 2005.

We can see below the evolution of our profit with an initial investment of $100. After 20 bets, we have around $20 of profit, but after the 70th bet, the profit is around -$13.

Evolution of the profit for LGBM for the first set

Here are the results for set #1 with LGBM:

LGBM’s winner results for set #1

The last two lines show that we are not necessarily always betting on the favorite horse. These two insights hold for the other analyses as well.

C. LGBM feature importance

The feature importance for the LGBM model for January is displayed below. Features related to “win_odds” and “declared_weight” appear quite often.
We used the default feature importance calculation: the score is the number of times a feature is used in the model (split count).

TOP 20 most important features for the LGBM model for set #1
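A short sketch of how this split-count importance can be read from a trained model, reusing the model and X_train names from the sketch above (X_train is assumed to be a DataFrame):

```python
import pandas as pd

# LGBMClassifier's default importance_type is "split": the number of times
# each feature is used to split a node across all trees.
importances = pd.Series(model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(20))
```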

D. The Deep Learning approach

First, we train our deep learning models for each set. We chose a single hidden layer of 96 neurons, 40 epochs, and a batch size of 100. The image below shows the architecture of the network.

Image from author: the neural network used to train our deep learning model for each set
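A minimal Keras sketch of this single 96-neuron hidden layer; the sigmoid output per runner, the optimizer, and the loss are assumptions, and the exact configuration is in the repository.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_winner_model(n_features):
    """One hidden layer of 96 neurons, as described above."""
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),
        layers.Dense(96, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # probability that the runner wins (assumed output)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

# model = build_winner_model(X_train.shape[1])
# model.fit(X_train, y_train, epochs=40, batch_size=100)
```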

We saved the model for each set. The figure below shows deep learning results for January 2005 test set.

Evolution of the profit for Deep Learning for the first set

Here are the results for set #1 with deep learning:

Deep learning’s winner results for set #1

Important note: the average win odds of the horses picked by this model are roughly twice the average win odds of the favorites.
This means that a winning bet placed with the deep learning model pays out about twice as much as a winning bet on the favorite.

E. Consolidated results with ensemble models for winner

To obtain our final result, we take all models from both algorithms and predict the winner for each race in each set.

The table below shows :

  • Deep learning results
  • LGBM results
  • Consolidated results from the ensemble model combining deep learning and LGBM, with a weighting coefficient of 0.7 for LGBM. This coefficient favors LGBM over deep learning and allows us to reach better results.

Consolidated winner results for all sets

Here are the final results:

  • Investment of $470 for 470 bets
  • Loss of $8.7 for 7 different sets (7 months)

The ensemble model is very useful here because the profit is higher for the consolidated model than for either of the two individual models. However, the percentage of successful bets is lower for the consolidated model than for LGBM.

IV. Results for Placed

In this next part, we will only focus on placed horses.

We continue to use the set-based approach with the same 470 races as before.
Now, however, we bet 2 or 3 times on each race depending on the number of runners, as explained in the previous part here. This corresponds to 746 bets.

For placed bets, we discard all races where place odds are unavailable. This lack of data decreases the number of races per set (no races remain for the last two sets).

A. The LGBM approach

For the placed approach, we decided not to train new LGBM models: we keep the “winner” models with the same optimization.

We will predict the probabilities for each class and select 2 or 3 horses depending on the number of runners.

For this type of bet, we don’t take the odds related to a win (“win_odds”) but the ones related to a placed horse (“place_odds”). Obviously, place_odds are lower than win_odds since it’s easier to predict a placed horse than a winner.
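A minimal sketch of this selection step follows; the threshold on the number of runners (2 horses for small fields, 3 otherwise) is an assumption based on the betting rules described above.

```python
import numpy as np

def pick_placed_horses(probs, n_runners):
    """Select the 2 or 3 horses with the highest predicted probabilities."""
    k = 2 if n_runners < 7 else 3          # field-size threshold is an assumption
    order = np.argsort(np.asarray(probs))[::-1]
    return order[:k]                       # indices of the k most likely horses

# Example: 10 runners -> back the 3 horses with the highest probabilities
print(pick_placed_horses([0.05, 0.20, 0.10, 0.25, 0.08,
                          0.07, 0.06, 0.09, 0.04, 0.06], 10))
```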

We can see below the graph of the evolution of our profit for the second set corresponding to February 2005.

Evolution of the profit for LGBM for the second set
LGBM’s placed results for set #2 (February 2005)

B. The Deep Learning approach

For deep learning, we decided to train new models. We use the same neural network, but instead of one positive label per race (the winner), we now have 2 or 3 (the placed horses). We use these new models to predict placed horses for all our sets.
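A short sketch of how such a placed target can be built, assuming hypothetical race_id and result (finishing position) columns and the same field-size threshold as above.

```python
import pandas as pd

def add_placed_target(runs: pd.DataFrame) -> pd.DataFrame:
    """Binary target: 1 if the horse finished in the paying positions."""
    runs = runs.copy()
    n_runners = runs.groupby("race_id")["horse_id"].transform("count")
    top_k = n_runners.apply(lambda n: 2 if n < 7 else 3)  # threshold is an assumption
    runs["placed"] = (runs["result"] <= top_k).astype(int)
    return runs
```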

We can see below results for February 2005.

Evolution of the profit for Deep Learning for the second set
Deep learning’s placed results for set #2 (February 2005)

This month specifically, the model detected outsider horses (see the last two lines of the table). The mean win odds are 10.48!
That’s why our profit is so high compared to other months.

C. Consolidated result with ensemble models for placed

We proceeded exactly as before, and the final table below shows all results for placed bets. These results only concern the first 5 sets because the last 2 have no races available due to the missing “place_odds”.

We can see that only the second set (February) has a positive profit, and by a wide margin, which compensates for all the losses.

Here is a comparison of all percentages:

  • Mean percentage of successful bets for Deep Learning: 30%
  • Mean percentage of successful bets for LGBM: 42%
  • Mean percentage of successful bets for Consolidated: 41.3%

Consolidated placed results for all sets

The ensemble model is less useful here because the profit is lower for the consolidated model than for either of the two individual models. However, the ensemble model attenuates heavy losses where deep learning and LGBM were less profitable (sets #1 and #3). Furthermore, the percentage of successful bets is lower for the consolidated model than for LGBM, but higher than for the deep learning model.

Conclusion

The initial goal was to compare the baseline models from part 1 with our ensemble model from this part. The table below compares them on various factors:

  • profit in $
  • # of winning bets
  • % of successful bets

Summary table with all models

We can easily conclude that machine learning helps generate better profit than the basic strategies of the baseline models.

We can also notice that our results with ensemble models seem consistent. Even if we win fewer bets than with the favorite method (see winning_rate_bet in the table above), the bets we win are on non-favorite horses, which are more profitable.

However, one question might bother you: how can we win $336.7 on placed horses with a positive profit for only one set? We agree this is somewhat strange, but sometimes the art of betting doesn’t rely only on statistics and features.

Luck remains the most important feature and we guess we had some luck there.

The GitHub code is ready to run on GCP, but feel free to ask questions; we will be happy to help.

Useful Links:

Part 1

GitHub horse racing prediction repository
