A question that often comes up in data science is how you determine the performance of different marketing channels. For example, somebody might land on your website by clicking an advert on Google. They may find something they want to buy but leave the site only to return a few minutes later via an affiliate link with a discount code. They then add something to their shopping basket but never complete the purchase so you send them an email to remind them and finally they convert.
Which channel drove the conversion here? Was it the final channel, the one they converted from? Was it the first one, as it brought them onto the site in the first place? Or was it one of the other ones in-between that helped guide them to the conversion?
Okay, this is a blog about football analytics so why am I writing about marketing? Well, consider this next example - Aymeric Laporte tackles the opposition's striker to win the ball and passes it to Fernandinho. Fernandinho then plays a short ball to Bernardo Silva, who knocks it out wide to Raheem Sterling. Sterling runs down the wing, beats the fullback and crosses the ball to Sergio Agüero, who scores. Which player was responsible for the goal here? Was it Laporte for winning possession in the first place? Was it Agüero since he scored? Or was it one of the players in-between who helped move the ball down the pitch so Agüero could score?
Hopefully you've spotted that both examples are effectively the same problem - how do you determine the value of all the events leading up to a conversion?
Traditionally, this has been 'solved' using simple heuristics. For example, the last player who touches the ball is awarded the goal, or the last marketing channel a customer interacts with gets the credit for their purchase. Depending on the metric being measured, people will occasionally give credit to the first event in the sequence instead, or perhaps share everything out equally because it sounds fairer. If they are feeling really adventurous they may even apply some sort of curve to it, but there's no real scientific rationale being used. It's typically just somebody's personal preference.
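To make those heuristics concrete, here is a minimal Python sketch of last-touch, first-touch and equal-share credit. The function name `heuristic_credit` and the shot value of 0.3 xG are made up for illustration and aren't taken from any real dataset.

```python
def heuristic_credit(sequence, value, rule="last"):
    """Split `value` across the events in `sequence` using a simple heuristic."""
    credit = {event: 0.0 for event in sequence}
    if rule == "last":
        # All the credit goes to the final event (e.g. the shooter)
        credit[sequence[-1]] += value
    elif rule == "first":
        # All the credit goes to the event that started the sequence
        credit[sequence[0]] += value
    elif rule == "equal":
        # Everybody involved gets the same share
        share = value / len(sequence)
        for event in sequence:
            credit[event] += share
    else:
        raise ValueError(f"unknown rule: {rule}")
    return credit


# The possession from the example above, with an invented shot value of 0.3 xG
possession = ["Laporte", "Fernandinho", "Bernardo Silva", "Sterling", "Agüero"]
last = heuristic_credit(possession, 0.3, "last")
first = heuristic_credit(possession, 0.3, "first")
equal = heuristic_credit(possession, 0.3, "equal")
```

None of these rules look at the data at all. The split is fixed in advance, which is exactly the weakness being described.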
A more scientific approach taken from the world of marketing analytics is to use multi-touch attribution modelling to quantify the importance of each event in the sequence and assign a fractional amount of credit to it based on how much it drives the final outcome.
There are lots of different ways of doing this, including using Markov chains. These are mathematical systems that model the probability of a sequence transitioning from one event to the next. For example, a chain can capture the probability of a customer clicking through to a company's home page from a tweet, followed by the probability of that being their last interaction (and therefore failing to convert) or the probability they move on to some other interaction with the company.
We can apply this same principle to football, e.g. if Sergio Agüero is in possession of the football then what is the probability he passes, what is the probability he scores, what is the probability the sequence of possession ends with him?
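As a rough illustration, a first-order Markov chain can be written down as a table of "where does the ball go next" probabilities, and the probability of a whole sequence is just the product of the steps along it. The numbers below are invented purely for illustration, and the `GOAL` and `END` labels are my own naming rather than anything from a real model.

```python
import math

# Illustrative numbers only: the probability of each next event given who has the ball
transition = {
    "Sterling": {"Agüero": 0.30, "De Bruyne": 0.40, "END": 0.30},
    "De Bruyne": {"Agüero": 0.35, "Sterling": 0.35, "END": 0.30},
    "Agüero": {"GOAL": 0.10, "De Bruyne": 0.40, "END": 0.50},
}


def path_probability(path):
    """Probability of following `path`, multiplying each step's transition probability."""
    return math.prod(transition[a][b] for a, b in zip(path, path[1:]))


# Sterling -> De Bruyne -> Agüero -> goal: 0.40 * 0.35 * 0.10
p = path_probability(["Sterling", "De Bruyne", "Agüero", "GOAL"])
```

Each row sums to one because, from any given player, something has to happen next: a pass, a goal or the end of the possession.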
To apply multi-touch attribution modelling to football and Expected Goals (xG), I created a dataset of possession sequences from the Premier League. Each sequence contains the players involved, followed by a True / False flag that terminates the sequence and designates whether it ended with a shot or not. I then used these sequences to train a Markov chain.
[Ederson, Laporte, Stones, Sterling, True]
Figure 1: Example possession sequence used to train the Markov chain
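One simple way to "train" a first-order Markov chain on sequences in this format is to count how often each event follows another and normalise the counts. The sketch below assumes that approach. The `START`, `SHOT` and `NO_SHOT` state names, the `fit_transitions` helper and the toy sequences are all hypothetical, and this is not necessarily the code used for this post.

```python
from collections import Counter, defaultdict


def fit_transitions(sequences):
    """Estimate first-order transition probabilities from possession sequences.

    Each sequence is a list of player names followed by a True / False flag
    (True means the possession ended with a shot).
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        *players, ended_in_shot = seq
        states = ["START", *players, "SHOT" if ended_in_shot else "NO_SHOT"]
        # Count every consecutive pair of states in the possession
        for current, nxt in zip(states, states[1:]):
            counts[current][nxt] += 1
    # Normalise each row of counts so it sums to one
    return {
        state: {nxt: n / sum(nxts.values()) for nxt, n in nxts.items()}
        for state, nxts in counts.items()
    }


# Toy data, invented for illustration only
sequences = [
    ["Ederson", "Laporte", "Stones", "Sterling", True],
    ["Ederson", "Laporte", "Sterling", False],
    ["Laporte", "Stones", "Sterling", True],
    ["Ederson", "Stones", False],
]
transitions = fit_transitions(sequences)
```

With this toy data, `transitions["Sterling"]` says that two thirds of the possessions reaching Sterling went on to end in a shot.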
The trained Markov Chain was then used to simulate possessions by picking a starting player and taking a random walk through the probabilities until it hit a True / False event. From here we can calculate the importance of each player in terms of shot generation - essentially, each possession sequence's propensity to lead to a shot changes as different players become involved. These differences in shot propensity can then be used to reattribute the xG from a given shot across all the players in the possession leading up to it - the more important a player is the more xG is awarded to them even if they didn't take the shot.
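The exact attribution rule isn't spelled out here, so the sketch below swaps in the "removal effect" that is common in marketing attribution: simulate possessions, then re-simulate with one player taken out and see how far the shot rate drops. The transition probabilities, the `START` state, the function names and the 0.3 xG shot value are all assumptions made for illustration.

```python
import random

# Hypothetical transition probabilities (each row sums to one), not real data
TRANSITIONS = {
    "START": {"Laporte": 0.5, "Fernandinho": 0.3, "Sterling": 0.2},
    "Laporte": {"Fernandinho": 0.7, "NO_SHOT": 0.3},
    "Fernandinho": {"Sterling": 0.5, "Laporte": 0.2, "NO_SHOT": 0.3},
    "Sterling": {"SHOT": 0.4, "Fernandinho": 0.2, "NO_SHOT": 0.4},
}


def simulate(transitions, rng, removed=None, max_steps=50):
    """Random walk from START until SHOT / NO_SHOT; returns True on a shot.

    If the walk reaches the `removed` player the possession is treated as lost.
    """
    state = "START"
    for _ in range(max_steps):
        options = transitions[state]
        state = rng.choices(list(options), weights=list(options.values()))[0]
        if state == "SHOT":
            return True
        if state == "NO_SHOT" or state == removed:
            return False
    return False  # give up on very long possessions


def shot_rate(transitions, removed=None, n=20_000, seed=0):
    """Share of simulated possessions that end in a shot."""
    rng = random.Random(seed)
    return sum(simulate(transitions, rng, removed) for _ in range(n)) / n


def attribute_xg(transitions, players, xg):
    """Share `xg` between players in proportion to their removal effects."""
    base = shot_rate(transitions)
    effects = {p: max(base - shot_rate(transitions, removed=p), 0.0) for p in players}
    total = sum(effects.values())
    if total == 0:
        return {p: 0.0 for p in players}
    return {p: xg * effect / total for p, effect in effects.items()}


axg = attribute_xg(TRANSITIONS, ["Laporte", "Fernandinho", "Sterling"], xg=0.3)
```

In this toy chain every shot comes from Sterling, so removing him wipes out all the shots and he gets the biggest share. Fernandinho and Laporte still pick up credit because many of the routes to Sterling run through them.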
The table below shows the attributed xG (axG) for Manchester City's 2018/2019 season. The first thing to note is that the most attacking players, such as Agüero and Jesus, have lower axG compared with xG. This is to be expected as traditional xG models will credit them with 100% of the value of the shot whereas axG takes some of that xG and reassigns it to the players involved in the build up play. They still come out with the highest axG scores overall though as these are the players taking the majority of the shots generating the xG so their presence in the possession sequences is important in terms of shot generation.
| Player | xG | axG | axG / xG |
|---|---|---|---|
| Kevin De Bruyne | 1.99 | 2.26 | 1.14 |
Table 1: Manchester City axG 2018/2019
Looking at the ratio of axG to xG shows that the biggest beneficiaries for Manchester City are their defenders, particularly the fullbacks. Kyle Walker has a 3.7-fold increase in xG credited to him and Oleksandr Zinchenko has a 4.7-fold increase. Laporte and Fernandinho also have noticeable increases, reflecting their importance in City's build up play.
It's not just Manchester City's fullbacks who do well when we reattribute xG. It's pretty common across other teams too, as shown in the table below. These players are typically out wide where they can't take many shots but are important for getting the ball into the danger zones for the attacking players. It's still small volumes of xG compared with attackers, but once we start accounting for fullbacks' involvement in the build up play their xG numbers increase noticeably.
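The fold increases quoted here are simply axG divided by xG. As a quick check, using figures that appear in the tables in this post:

```python
# (xG, axG) pairs copied from the tables in this post
players = {
    "Kyle Walker": (0.45, 1.66),
    "Oleksandr Zinchenko": (0.19, 0.90),
    "Sergio Agüero": (19.92, 15.86),
}

# A ratio above 1 means the player gains credit once the build up play is accounted for
ratios = {name: round(axg / xg, 2) for name, (xg, axg) in players.items()}
```

Walker and Zinchenko come out at roughly 3.7 and 4.7, whereas Agüero drops below 1 because some of his shot value is handed back to the players who created the chances.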
| Team | Player | xG | axG | axG / xG |
|---|---|---|---|---|
| West Ham | Arthur Masuaku | 0.08 | 0.42 | 5.25 |
| Manchester City | Oleksandr Zinchenko | 0.19 | 0.9 | 4.74 |
| Manchester City | Kyle Walker | 0.45 | 1.66 | 3.69 |
| Manchester United | Ashley Young | 0.47 | 1.42 | 3.02 |
| West Ham | Pablo Zabaleta | 0.17 | 0.51 | 3 |
| Newcastle United | Javier Manquillo | 0.07 | 0.16 | 2.29 |
| Manchester United | Luke Shaw | 1.1 | 2.15 | 1.95 |
Table 2: Fullback axG 2018/2019
| Team | Player | xG | axG | axG / xG |
|---|---|---|---|---|
| Manchester United | Paul Pogba | 15.6 | 17.06 | 1.09 |
| Manchester City | Sergio Agüero | 19.92 | 15.86 | 0.8 |
| Wolverhampton Wanderers | Raúl Jiménez | 15.46 | 14.48 | 0.94 |
| Newcastle United | Salomón Rondón | 11.86 | 12.32 | 1.04 |
| Manchester City | Raheem Sterling | 13.14 | 12.2 | 0.93 |
| Manchester City | David Silva | 8.07 | 8.79 | 1.09 |
| Manchester United | Romelu Lukaku | 10.18 | 8.57 | 0.84 |
| Manchester City | Bernardo Silva | 6.74 | 7.78 | 1.15 |
Table 3: Top 25 Players by axG 2018/2019
It's worth clarifying that this is not an expected possession value (EPV) model. It's taking the output from a shots-based xG model and redistributing it across the players involved in the build up play based on the propensity of a shot occurring from that particular group of players.
In many ways, the output of the model is closer to a Shapley Value in that it's looking at all the different combinations of players in the possession sequences to quantify how much each player contributed to the propensity of a shot occurring. In fact, this is something I want to play around with further to see what other uses it has.
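For anyone unfamiliar with Shapley values, here is a minimal sketch of the idea: average each player's marginal contribution to shot propensity over every possible order in which the players could be added to the group. The propensity numbers and the `shapley_values` helper are invented for illustration and are not output from the model described in this post.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values by averaging marginal contributions over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # How much does adding this player change the group's value?
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


# Hypothetical shot propensities for each group of players (invented numbers)
PROPENSITY = {
    frozenset(): 0.0,
    frozenset({"Walker"}): 0.02,
    frozenset({"De Bruyne"}): 0.05,
    frozenset({"Agüero"}): 0.10,
    frozenset({"Walker", "De Bruyne"}): 0.09,
    frozenset({"Walker", "Agüero"}): 0.15,
    frozenset({"De Bruyne", "Agüero"}): 0.20,
    frozenset({"Walker", "De Bruyne", "Agüero"}): 0.30,
}

shapley = shapley_values(
    ["Walker", "De Bruyne", "Agüero"], lambda group: PROPENSITY[group]
)
```

The individual values always add up to the propensity of the full group, which is the same "share everything out without losing any" property that makes the reattributed xG easy to explain. The catch is that the number of orderings grows factorially, so an exact calculation like this only works for small groups.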
Whilst the approach described here is perhaps not as complex as some EPV models, it has a couple of advantages. First of all, it's quick to process the data, but most importantly it's easy to explain to stakeholders. This isn't some complicated and uninterpretable black box that senior management need to take a leap of faith to trust. Multi-touch attribution is just sharing out xG more fairly based on the probabilities of shots occurring during the sequence of play, and for me that's a big win. A simpler approach that people can relate to often has a bigger impact in a business than a bigger model that's much more complicated to get buy-in for.
Multi-touch attribution has pretty much achieved this in marketing analytics now. It's a significant improvement on the simple heuristics without being so complex it scares the C-Suite off. Perhaps it could play a similar role in football as a step up from xG without alienating the more data-reluctant coaches?