Penaltyblog Python Package Updated to v0.5.1

Introduction

My penaltyblog Python package has recently been updated to v0.5.1, so let's take a look at some of the new features.

Python 3.7 Support

By popular request, penaltyblog is now compatible with Python 3.7. The main reason for doing this is to allow it to run on Google Colab, which at the time of writing is still stuck on what is now a fairly old version of Python.

penaltyblog isn't included in Colab by default, but it can easily be installed via pip by running the command below in one of your notebook's cells:

!pip install penaltyblog==0.5.1

Once it's installed, you can then import penaltyblog as normal and use all of its functions.

import penaltyblog as pb

understat = pb.scrapers.Understat("ENG Premier League", "2022")
fixtures = understat.get_fixtures()
fixtures.head()

Bayesian Hierarchical Goals Model

Another exciting update is the addition of a new goals model based on Bayesian hierarchical modelling. I explained the theory behind this approach in a previous article, so I won't repeat it here, but the model is now included in the package.

The hierarchical model follows the same API as all the other goals models, meaning that you can optionally apply a decay weighting to the data so that more recent fixtures are considered more important when fitting the model.

Here's a quick example to get you started:

import penaltyblog as pb

fd = pb.scrapers.FootballData("ENG Premier League", "2021-2022")
fixtures = fd.get_fixtures()

fixtures["weights"] = pb.models.dixon_coles_weights(fixtures["date"], 0.001)

model = pb.models.BayesianHierarchicalGoalModel(
    fixtures["goals_home"],
    fixtures["goals_away"],
    fixtures["team_home"],
    fixtures["team_away"],
    fixtures["weights"],
)

model.fit()

prediction = model.predict("Man City", "Chelsea")

print(prediction)
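
To give a feel for what the decay weighting in the example above is doing, here is a minimal sketch. It assumes the weights follow the exponential decay from Dixon and Coles' paper, with time measured in days back from the most recent fixture. The decay_weights function is a hypothetical stand-in written for illustration, not penaltyblog's own code, and the package's implementation may differ in its details.

```python
from datetime import date
from math import exp


def decay_weights(dates, xi=0.001, base_date=None):
    """Exponential time-decay weights (illustrative, not penaltyblog's code).

    Each fixture gets the weight exp(-xi * days_ago), where days_ago is
    counted back from base_date (the most recent fixture by default).
    """
    base = max(dates) if base_date is None else base_date
    return [exp(-xi * (base - d).days) for d in dates]


# Made-up fixture dates: the latest match, one a day earlier, one a year earlier
fixture_dates = [date(2022, 5, 22), date(2022, 5, 21), date(2021, 5, 22)]

for d, w in zip(fixture_dates, decay_weights(fixture_dates, xi=0.001)):
    print(d, round(w, 3))
```

Under this assumption, a larger xi makes older fixtures fade faster, while xi = 0 gives every fixture the same weight.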

Bayesian Bivariate Poisson Goals Model

As well as the hierarchical model, I've also added a Bayesian bivariate Poisson model. I'll probably write a separate article at some point to explain the theory behind it, so again I won't go into too many details here.

However, as I've mentioned in previous articles, there is a common issue with Poisson-based models where they treat both teams' scores as independent of each other.

This doesn't reflect reality, though, where each team's goals scored and conceded are likely not independent. For example, if the score is 0-0 with 15 minutes to go then the underdog may settle for a draw and not push to score. Or if a team goes a goal down early on they may park the bus to prevent a more humiliating scoreline.

The bivariate model attempts to account for this by modelling the two scores jointly with a bivariate Poisson distribution. So instead of having separate Poisson distributions for the home and away teams, we have one combined distribution.
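
To make the idea of a combined distribution more concrete, here is a small sketch of one common way to build a bivariate Poisson: each team's score is its own Poisson term plus a shared Poisson term, and the shared term is what links the two scores. This is an illustration of the general technique only. The function names and the scoring rates are made up, and the model in penaltyblog may be parameterised differently.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Probability of exactly k goals under a Poisson distribution."""
    return exp(-lam) * lam ** k / factorial(k)


def bivariate_poisson_pmf(x, y, lam1, lam2, lam3):
    """P(home = x, away = y) where home = W1 + W3 and away = W2 + W3.

    W1, W2 and W3 are independent Poisson variables with rates lam1, lam2
    and lam3. The shared term W3 makes the two scores positively
    correlated, with covariance equal to lam3.
    """
    return sum(
        poisson_pmf(x - k, lam1) * poisson_pmf(y - k, lam2) * poisson_pmf(k, lam3)
        for k in range(min(x, y) + 1)
    )


def draw_probability(lam1, lam2, lam3, max_goals=15):
    """Sum the probabilities of 0-0, 1-1, 2-2, ... up to max_goals each."""
    return sum(
        bivariate_poisson_pmf(g, g, lam1, lam2, lam3) for g in range(max_goals + 1)
    )


# Illustrative rates only: both setups give the home side 1.4 expected goals
# and the away side 1.1, but the second one shares 0.2 of that between teams.
independent = draw_probability(1.4, 1.1, 0.0)
correlated = draw_probability(1.2, 0.9, 0.2)

print(f"Draw probability, independent scores: {independent:.3f}")
print(f"Draw probability, correlated scores:  {correlated:.3f}")
```

With the shared rate set to zero this collapses back to two independent Poisson distributions. With a positive shared rate and the same expected goals, more of the probability ends up on draws.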

Here's another quick example to get you started. Notice how similar the code is to the hierarchical example above - all we have to do is change one word to switch out the model, making it easy to try out different approaches.

import penaltyblog as pb

fd = pb.scrapers.FootballData("ENG Premier League", "2021-2022")
fixtures = fd.get_fixtures()

fixtures["weights"] = pb.models.dixon_coles_weights(fixtures["date"], 0.001)

model = pb.models.BayesianBivariateGoalModel(
    fixtures["goals_home"],
    fixtures["goals_away"],
    fixtures["team_home"],
    fixtures["team_away"],
    fixtures["weights"],
)

model.fit()

prediction = model.predict("Man City", "Chelsea")

print(prediction)

So which model should you use? 🤷

Unfortunately, there's no simple answer here. If you want something fast and reliable then go with the Dixon and Coles model. Otherwise, if you've got more time or computational power then try out the Bayesian models. The only real way of knowing, though, is backtesting them on your data to find out which ones work best for your particular use case.
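
As a rough sketch of what scoring a backtest could look like, the snippet below uses the ranked probability score, which is one common metric for football forecasts (lower is better). The choice of metric is my assumption here rather than anything built into the example, and the probabilities and results are made-up numbers. In practice you would fill them in from each model's predictions on fixtures it wasn't fitted on.

```python
def ranked_probability_score(probs, outcome):
    """Ranked probability score for a single match.

    probs:   [home win, draw, away win] probabilities, summing to 1
    outcome: index of what actually happened (0 = home, 1 = draw, 2 = away)
    """
    cum_pred = 0.0
    cum_obs = 0.0
    total = 0.0
    # Compare cumulative predicted and observed probabilities
    for i, p in enumerate(probs[:-1]):
        cum_pred += p
        cum_obs += 1.0 if i == outcome else 0.0
        total += (cum_pred - cum_obs) ** 2
    return total / (len(probs) - 1)


def mean_rps(predictions, outcomes):
    """Average ranked probability score over a set of matches."""
    scores = [ranked_probability_score(p, o) for p, o in zip(predictions, outcomes)]
    return sum(scores) / len(scores)


# Made-up predictions from two hypothetical models for the same three matches
model_a = [[0.5, 0.3, 0.2], [0.2, 0.3, 0.5], [0.4, 0.3, 0.3]]
model_b = [[0.6, 0.2, 0.2], [0.3, 0.3, 0.4], [0.3, 0.4, 0.3]]
results = [0, 2, 1]  # home win, away win, draw

print("Model A:", round(mean_rps(model_a, results), 4))
print("Model B:", round(mean_rps(model_b, results), 4))
```

Whichever model has the lower average score on your held-out fixtures is the better forecaster by this measure.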

So Fifa

The scrapers in penaltyblog have also been updated to include So Fifa. The get_players function essentially scrapes the front page of the website, which contains top-level player data. You can control the number of pages to scrape and how the data should be sorted to make it easier to just get the top-ranked players if that's all you're interested in.

import penaltyblog as pb

sofifa = pb.scrapers.SoFifa()

player_info = sofifa.get_players(max_pages=2, sort_by="potential")
print(player_info.head())

You can then use the get_player function to get more detailed stats about players you're interested in based on So Fifa's player ID.

import pandas as pd
import penaltyblog as pb
from time import sleep

sofifa = pb.scrapers.SoFifa()

player_info = sofifa.get_players(max_pages=1, sort_by="value")

# Fetch detailed stats for the first five players, pausing between requests
players = []
for id_ in player_info.index[:5]:
    tmp = sofifa.get_player(id_)
    players.append(tmp)
    sleep(1)

players = pd.concat(players)
print(players)

Remember to scrape nicely, though: please don't crash someone's website by scraping too much or too fast.

What's Next?

My TODO list has plenty more modelling approaches to try and more websites to scrape, but if there's anything else you think would be good to include then let me know.

Thanks for reading!