Sport Informatics and Analytics/Performance Monitoring/Expected Goals

Introduction
The availability of data from tracking technologies in professional sports has made it possible to engage in fine-grained performance monitoring and analysis. See, for example, the discussions in: Peter Larsson (2003); Steve Edgecomb and Kevin Norton (2006); Robert Aughey (2011); Martin Buchheit and his colleagues (2014); Daniel Link, Steffen Lang and Phillipp Seidenschwartz (2016); Daniel Memmert, Koen Lemmink and Jaime Sampaio (2017); and Tom Stevens, Cornelis de Ruiter and Jos Twisk (2017).

One example is the study of goal scoring in professional association football with a metric called Expected Goals, abbreviated as ExpG or xG in discussions about it. Alex Rathke provided a review of some of the discussions that have taken place in blog posts and on other social media platforms such as Twitter. Keith Lyons aggregated a bibliography about expected goals that provided additional references and links to the literature.

Joe Mulberry (2017) has provided a note of caution about the use made of football tracking data and their accuracy. He observed that in an xG model "two shots with a 100cm positional change could result in a large variance of xG values". Joe concluded his discussion with this consideration: Although coaches utilise long-term trends, they have a large thirst for the data of singular matches. Therefore we have to ask ourselves are we comfortable with the dataset’s inherent error when presenting results at this level of granularity? If we are not, how do we present this margin of error/uncertainty to coaches?

Background
There is some debate about the origin of the term Expected Goals. Vic Barnett and his colleague S. Hilditch referred to 'expected goals' in their 1993 paper that investigated the effects of artificial pitch (AP) surfaces on home team performance in association football in England. Their paper included this observation: Quantitatively we find for the AP group about 0.15 more goals per home match than expected and, allowing for the lower than expected goals against in home matches, an excess goal difference (for home matches) of about 0.31 goals per home match. Over a season this yields about 3 more goals for, an improved goal difference of about 6 goals.

Jake Ensum, Richard Pollard and Samuel Taylor (2004) reported their study of data from 37 matches in the 2002 World Cup in which 930 shots and 93 goals were recorded. Their research sought "to investigate and quantify 12 factors that might affect the success of a shot". Their logistic regression identified five factors that had a significant effect on determining the success of a kicked shot: distance from the goal; angle from the goal; whether or not the player taking the shot was at least 1 m away from the nearest defender; whether or not the shot was immediately preceded by a cross; and the number of outfield players between the shot-taker and goal. They concluded "the calculation of shot probabilities allows a greater depth of analysis of shooting opportunities in comparison to recording only the number of shots". In a subsequent paper (2004), Richard, Jake and Samuel combined data from the 1986 and 2002 World Cup competitions to identify three significant factors that determined the success of a kicked shot: distance from the goal; angle from the goal; and whether or not the player taking the shot was at least 1 m away from the nearest defender.

In 2004, Alan Ryder shared a methodology for the study of the quality of an ice hockey shot at goal. His discussion started with this sentence “Not all shots on goal are created equal”. Alan's model for the measurement of shot quality was:
 * Collect the data and analyze goal probabilities for each shooting circumstance
 * Build a model of goal probabilities that relies on the measured circumstance
 * For each shot, determine its goal probability
 * Expected Goals: EG = the sum of the goal probabilities for each shot
 * Neutralize the variation in shots on goal by calculating Normalized Expected Goals
 * Shot Quality Against

Alan concluded: The model to get to expected goals given the shot quality factors is simply based on the data. There are no meaningful assumptions made. The analytic methods are the classics from statistics and actuarial science. The results are therefore very credible.
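Alan's core calculation, summing per-shot goal probabilities, can be sketched in a few lines of Python. The probability values below are invented for illustration and are not outputs of his model:

```python
# Expected Goals as the sum of per-shot goal probabilities.
def expected_goals(shot_probabilities):
    """EG = the sum of the goal probability of each shot."""
    return sum(shot_probabilities)

# Five shots with modelled goal probabilities (illustrative values):
shots = [0.05, 0.30, 0.12, 0.08, 0.45]
print(round(expected_goals(shots), 2))  # 1.0
```

The interesting work in any such model lies in estimating each shot's probability from its circumstances; the summation itself is deliberately simple.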

In 2007, Alan issued a product recall notice for his shot quality model. He presented “a cautionary note on the calculation of shot quality” and pointed to “data quality problems with the measurement of the quality of a hockey team’s shots taken and allowed”.

He reported: I have been worried that there is a systemic bias in the data. Random errors don’t concern me. They even out over large volumes of data. But I do think that ... the scoring in certain rinks has a bias towards longer or shorter shots, the most dominant factor in a shot quality model. And I set out to investigate that possibility.

Howard Hamilton (2009) proposed "a useful statistic in soccer" that "will ultimately contribute to what I call an 'expected goal value' — for any action on the field in the course of a game, the probability that said action will create a goal".

Sander Ijtsma (2011) discussed "a method to assign different value to different chances created during a football match" and in doing so concluded: we now have a system in place in order to estimate the overall value of the chances created by either team during the match. Knowing how many goals a team is expected to score from its chances is of much more value than just knowing how many attempts to score a goal were made. Other applications of this method of evaluation would be to distinguish a lack of quality attempts created from a finishing problem or to evaluate defensive and goalkeeping performances. And a third option would be to plot the balance of play during the match in terms of the quality of chances created in order to graphically represent how the balance of play evolved during the match.

Sarah Rudd (2011) discussed probable goal scoring patterns (P(Goal)) in her use of Markov Chains for tactical analysis (including the proximity of defenders) from 123 games in the 2010-2011 English Premier League season. In a video presentation of her paper at the 2011 New England Symposium on Statistics in Sports, Sarah reported her use of analysis methods to compare "expected goals" with actual goals and her process of applying weightings to incremental actions for P(goal) outcomes.

The term 'expected goals' appeared in a paper about ice hockey performance presented by Brian Macdonald at the MIT Sloan Sports Analytics Conference in 2012. Brian's method for calculating expected goals was reported in the paper: We used data from the last four full NHL seasons. For each team, the season was split into two halves. Since midseason trades and injuries can have an impact on a team’s performance, we did not use statistics from the first half of the season to predict goals in the second half. Instead, we split the season into odd and even games, and used statistics from odd games to predict goals in even games. Data from 2007-08, 2008-09, and 2009-10 was used as the training data to estimate the parameters in the model, and data from the entire 2010-11 season was set aside for validating the model. The model was also validated using 10-fold cross-validation. Mean squared error (MSE) of actual goals and predicted goals was our choice for measuring the performance of our models.

In April 2012, Sam Green wrote about 'expected goals' in his assessment of Premier League goalscorers. He asked "So how do we quantify which areas of the pitch are the most likely to result in a goal and therefore, which shots have the highest probability of resulting in a goal?". He added: If we can establish this metric, we can then accurately and effectively increase our chances of scoring and therefore winning matches. Similarly, we can use this data from a defensive perspective to limit the better chances by defending key areas of the pitch.

Sam proposed a model to determine "a shot's probability of being on target and/or scored". With this model "we can look at each player's shots and tally up the probability of each of them being a goal to give an expected goal (xG) value".

Chronology
After a small number of mentions of 'expected goals' between 1993 and 2012, there was a significant growth in the discussions about the metric and its operational definition in association football. This review of the literature is presented as a chronology (it excludes discussions about expected goals in ice hockey; Keith Lyons provided a summary of some of this literature written between 2013 and 2017). One of the important catalysts in the expected goals discussions was the quantification of shot rates and shot ratios. Paul Riley, among others, noted: We can begin to assign an ‘expected goal’ value for each shot by looking back into history. Not all shots are the same, so you have to group similar ones together. How often over time is the same type of chance converted? Working this out will give us an ‘expected goal’ value.
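Paul's description amounts to a lookup-table model: bin historical shots by type and use each bin's conversion rate as the shot's 'expected goal' value. A minimal sketch, with invented categories and counts:

```python
# A lookup-table xG model: bin similar historical shots together and use
# each bin's conversion rate as the 'expected goal' value for new shots.
# The categories and counts below are invented for illustration.
history = {
    # category: (goals scored, shots taken)
    "six_yard_box": (450, 1200),
    "penalty_area": (900, 9000),
    "outside_box": (300, 10000),
}

def xg_for(category):
    goals, shots = history[category]
    return goals / shots

print(round(xg_for("six_yard_box"), 3))  # 0.375
```

Finer-grained bins give more faithful values but need more historical shots per bin, which is one reason later models moved to regression over raw shot features.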

Shot rates and ratios
James Grayson (2011a, 2011b, 2012a) introduced a shots on target ratio (STR) as a measure to predict future performance in football. Subsequently (2012b), James investigated total shots ratio (TSR) and a refined version, TSR2.4, as “the gold standard for in-season measures”.
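Both ratios are simple proportions: the share of all shots (or shots on target) in a team's matches that the team itself takes. A quick sketch with illustrative season totals:

```python
# Total shots ratio (TSR) and shots on target ratio (STR): the share of
# all shots (or shots on target) in a team's matches taken by that team.
def total_shots_ratio(shots_for, shots_against):
    return shots_for / (shots_for + shots_against)

def shots_on_target_ratio(sot_for, sot_against):
    return sot_for / (sot_for + sot_against)

# Illustrative season totals:
print(round(total_shots_ratio(520, 380), 3))  # 0.578
```

A value above 0.5 indicates a team out-shooting its opponents over the period measured.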

Paul Riley (2012a, 2012b, 2013a, 2013b) shared the process he used to combine historic data (30,000 English Premier League shots over three seasons) and current data in order “to go beyond basic goals and assists to find the true, long-term value of players' actions” and to create a shot model he termed the 'Shot Position Average Model' (SPAM).

Ted Knutson (2013) discussed shots on target (SoT). These shots “are almost three times as likely to result in a goal than a basic 'shot'”. He added “Teams who win the battle for Shots on Target show a significantly increased probability of winning compared to those who just create more shots than their opponents”. Ted analysed data from four European league competitions from the 2009-2010 to 2011-2012 seasons “to build a kind of baseline or 'shots on target par' (SoTPar) to compare how teams are doing at creating shots on target in current seasons versus historical norms”.

Mark Taylor (2013) discussed goals scored in a counter attack. He observed "If we use the x,y shot data matched to actual outcome, the ultimate origin of the chance does appear to be significant". In a subsequent post (2014), Mark made these comments: Shot attempts and saves, when measured against a robust expected baseline can increasingly be used to identify striking and keeping talent, but the use of shot models to highlight potentially fortunate team wins or unlucky losses, may be more problematical. The fluid nature of the game inevitably means that a shot scored will inevitably alter the path a match takes compared to that same shot being saved. Not only do subsequent events inevitably take different courses in the two mutually exclusive scenarios, but the game state, a product of the score, relative abilities of each side and the time remaining will also meet a fork in the road, depending upon the success or otherwise of a single attempt.

Michael Caley (2013a, 2013b, 2013c) explored a range of shot and goal metrics. In (2013a), Michael created “a database of all the game events recorded in the WhoScored game logs for the 2012-2013 Premier League season”. He investigated the effects of game state on shots on target (SoT) and shared data on “minutes played at different game states, shots (S/Min) and shots on target per minute (SoT/Min) at those game states, and conversion rate of shots (G/S) and shots on target at those game states (G/SoT)”. (For further discussion of game state issues, see, for example, Ben Pugsley (2013, 2014), Sander Ijtsma (2013b).) Two of Michael's posts (2013b, 2013c) provide an important connection with the quantification of expected goals conversations of that time (and subsequently).

2013
Sander Ijtsma (2013a) investigated strike zones and identified four zones for the conversion of shots. His approach used "a fine grid to identify where on the pitch the biggest drop-off in shot quality occurs" in order to work out the best zones. He observed: right in front of the goal, nearly all shots are converted into goals. I will therefore refer to this zone as Zone 1. The next most threatening part, Zone 2, covers mostly the central penalty box area, but stretches just beyond the edge of the penalty box. Then comes Zone 3, covering the wider penalty box area as well as longer distance central shots. The rest of the pitch will be Zone 4. Sander provided a mean conversion rate chart for each of the four zones.

Michael Caley (2013b) developed his expected goals formula and described how his approach to measuring goals was changing. This approach used a minute-by-minute database. He noted: Shots in the box on target (SiBoT) are converted at a way, way higher rate than shots out of the box on target (SoBoT). For SiBoT, it's 35%. SoBoT are about one-third as likely to be scored, at a 12% conversion rate. In the same paper, he added: For the past few months, I have been running spreadsheets using an expected goals formula based on shots on target, big chances, and shots in the box. I want to test a different expected goals formula, one based only on shots on target in the box and shots on target outside the box. As a control, I will also use a dummy formula based only on shots on target.
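The formula Michael describes can be sketched directly from the conversion rates he quotes (35% for SiBoT, 12% for SoBoT). This is an illustrative reading of his description, not his actual implementation:

```python
# Expected goals from shots on target split by location, using the
# conversion rates quoted above: 35% for shots in the box on target
# (SiBoT) and 12% for shots outside the box on target (SoBoT).
def xg_from_sot(sibot, sobot):
    return 0.35 * sibot + 0.12 * sobot

# A match with 4 SiBoT and 3 SoBoT:
print(round(xg_from_sot(4, 3), 2))  # 1.76
```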

Michael tested his models with correlation coefficients and root mean square errors of his estimators.

Sander Ijtsma continued his discussion in a second post (2013b): Using our recent explorations on strike zones and game states, we can stratify shots according to location and match situation and come up with expected goals per shot. This is much more valuable than simply adding shot numbers, as it removes the basic – and incorrect – assumption that all shots are of equal value. Sander proposed a single number to quantify "how many goals the average Eredivisie player would have scored from the attempts that a team, or a player, has had". This number was Expected Goals Scored. Sander concluded that this parameter could be expressed "per shot, over a match or over a series of matches" and that "It has an offensive and a defensive side and the former can be applied to teams and players, while the latter is limited to team level, since shots conceded can’t be linked to single defenders".

Colin Trainor (2013) introduced an expected goals (ExpG) metric in his discussion of team selection at Chelsea. ExpG was defined as the number of goals a league average player was expected to score "based on the type of chances that the players attempted". He said of this metric "The inputs to this measure won’t be disclosed, but we find that it is fairly accurate and allows us to compare the quality of chances created and then the efficiency with which they were finished".

Colin's colleague Constantinos Chappas (2013) continued the discussion of a Goal Expectation (ExpG) metric for association football: The reason behind the introduction of ExpG would be to provide a metric that chances / strikers / teams can be compared on. If a striker has a 25% conversion rate, that does not mean that he is a better finisher compared to someone with a 20% conversion rate. Perhaps his chances were from more favourable positions compared to the other striker’s chances. Therefore unless we somehow break down the conversion rate (e.g. shots from inside/outside the area) and look at those individual figures, we would be comparing apples with oranges. Constantinos investigated shooting efficiency in five European league competitions in the 2012-2013 season and, for comparison purposes, he looked at the deviation of each shooting efficiency figure from 1.00.

Michael Caley (2013c) discussed shot location and expected goals and identified eight distinct shot location categories. His data from 2009 to 2013 indicated that “Zone 1 shots are gold” (1908 shots, 57% on target, 43% of goals from shots taken in this zone and 75% of goals from shots on target in the zone). The post (2013c) concluded with a table of expected goals from shots on target (SoT) (“league average numbers for SoT conversion for all shot types and multiply them by each club's SoT numbers”) that was “an admittedly limited first draft”.

Antonio José Sáez-Castillo, José Rodríguez-Avi and José María Pérez-Sánchez (2013) provided one of the first published papers to discuss an expected number of goals. In their paper, the authors presented: A Bayesian regression model for the number of goals scored by players in the Spanish football league during nine seasons is fitted. The model handles overdispersion in such a way that individual footballers ability for scoring may be estimated regardless of the number of minutes played, the position in the field and the team in which they play.

2014
Mark Taylor (2014a) explored goal expectations in detail and observed: Accrued goal expectation may be the prime driver in evaluation if an actual result was deserved, but the ability to create a couple of outstanding chances (which may or may not be repeatable at team or player level) may also play a minor, yet important role. In certain circumstances, the benefit may be greater than simply the sum of the parts. In a second post (2014b), Mark provided an explanation of how to calculate expected goals and the use made of match simulation.

Sander Ijtsma (2014a) provided a summary of his use of the expected goals (ExpG) parameter: It measures not how many goals a team has scored, but how many goals an average team would have scored with the amount and quality of shots created. Each goal scoring attempt is assigned a number based on the chance that this attempt produces a goal. Typical parameters to use are shot location and shot type (shot vs header). Some models, including the one I use on 11tegen11, also use assist information to separate through-balls from crosses. Teams that produce more ExpG than they concede have the best chances of winning football matches. In a subsequent post (2014b), Sander provided more detail about his expected goals (ExpG) metric: each goal scoring attempt is assigned a number between 0 and 1 that represents "the chance that this goal scoring attempt results in an actual goal". He added that "each goal scoring attempt is judged on the basis of its relevant contextual information".

Martin Eastwood wrote a number of posts in 2014 (2014a, 2014b, 2014c, 2014d, 2014e, 2014f) to share his Expected Goals model. In an introductory post (2014a), he observed: It seems that everybody has their own expected goals models for football nowadays but they all seem to be top secret and all appear to give different results so I thought I'd post a quick example of one technique here to try and stimulate a bit of chat about the best way to model them. By the end of 2014, Martin had aggregated a database of 45,000 shots at goal, including records of 7,500 headed shots, that enabled him to investigate expected goals and exponential decay ("the probability of scoring a goal decreases exponentially based upon the distance from goal the shot is taken from").

Richard Whittall (2014a), in his discussion of expected goals, proposed that “it’s more helpful to think of ExpG in a more general sense as methods that incorporate shot characteristics, like type or location, into raw shot data”. In his synthesis of a range of performance metrics, Richard pointed out “each metric is a tool to be used alongside one another to paint a more comprehensive picture”. In a second post (2014b), Richard outlined his support for the expected goals metric.

Constantinos Chappas (2014) continued his discussion of expected goals with a comparison of expected goals and actual goals. Constantinos concluded: This is a method to compare actual versus expected goals scored which does not make any unnecessary distributional assumptions on ExpG. It can therefore accommodate comparisons either in terms of specific players who could have a non-homogeneous shot profile (e.g. including shots from favourable positions, penalty shots or long range attempts) or even be used to evaluate a team’s scoring performance. Finally, it can be applied from a defensive standpoint to evaluate how teams or goalkeepers have prevented shots from being scored. Constantinos' colleague, Colin Trainor (2014) discussed expected goals and long-range shooting at the 2014 OptaPro Analytics Forum.

Devin Pleuler (2014) considered expected goals in terms of the repeatability of goal scoring year-on-year. He noted that whilst expected goals per 90 minutes demonstrates “a clear correlation between a player’s year-on-year Expected Goals rate”, the correlation disappears "when we take the obvious next step and examine the year-on-year repeatability of Goals above Expected Goals”. He proposed “we need an additional step to our model that measures an opportunity "post-shot" so that we can measure the change in probability against the chance quality ‘pre-shot’”. This led Devin to propose a metric, Expected Goals Added, and the consideration of a player’s finishing skill.

Michiel de Hoog (2014) extended the discussion of expected goals to expected assists when he discussed the performance of Lex Immers.

Paul Riley (2014a, 2014b, 2014c) developed his expected goals model using shots on target and shot location. His database included 13,000 shots on target over four seasons in the English Premier League (EPL). He noted: An expected goals model can be the start of team (and player) diagnostics. If it ain’t going right, it can start to tell you where it’s gone wrong and put a figure on how much it’s costing you.

Dan Altman (2014) produced an explanatory video about expected goals.

Michael Caley (2014) shared his predictions for the 2014-2015 season. He used an expected goals model for these predictions and provided a methodological background for his system. For each shot, Michael calculated "an estimated expected goals value" that used:
 * shot location
 * shot type
 * assist type
 * speed of attack
 * long balls
 * dribble
 * set play

Michael's system used an exponential decay formula ("your chance of scoring decreases non-linearly as you move away from goal, with larger drops occurring in moves from three to six yards, and relatively smaller drops coming at twelve to sixteen yards").
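An exponential decay model of this kind can be sketched as follows; the coefficients a and b here are invented for illustration, not Michael's fitted values:

```python
import math

# Exponential decay of scoring probability with distance from goal.
# Coefficients a and b are illustrative, not fitted values.
def xg_by_distance(distance_yards, a=1.2, b=0.12):
    return min(1.0, a * math.exp(-b * distance_yards))

# Drops are larger close to goal than far from it:
for d in (3, 6, 12, 16):
    print(d, round(xg_by_distance(d), 3))
```

Because the curve is exponential, the fall from three to six yards is much steeper than the fall from twelve to sixteen yards, which matches the behaviour Michael describes.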

2015
Michael Caley returned to the discussion of expected goals in 2015. His first paper (2015a) was a response to criticism of expected goals methodology. A second paper (2015b) later in the year shared his projections for the English Premier League and provided details of a new expected goals model with six shot types and six equations. His projections were run with a Monte Carlo simulation that used a random sample from a bivariate Poisson distribution for goals.
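A Monte Carlo projection of this kind can be sketched with a bivariate Poisson draw (two independent Poisson components plus a shared component that correlates the teams' goal counts). The rates, shared-component value and sample size here are illustrative, not Michael's:

```python
import math
import random

def poisson(lam, rng):
    # Knuth's method: multiply uniforms until the product drops
    # below exp(-lam); the number of multiplications is the sample.
    limit, k, product = math.exp(-lam), 0, rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k

def simulate_match(xg_home, xg_away, shared=0.1, n=10000, seed=42):
    """Estimate home win/draw/loss probabilities from expected goals."""
    rng = random.Random(seed)
    wins = draws = losses = 0
    for _ in range(n):
        common = poisson(shared, rng)
        home = poisson(xg_home - shared, rng) + common
        away = poisson(xg_away - shared, rng) + common
        if home > away:
            wins += 1
        elif home == away:
            draws += 1
        else:
            losses += 1
    return wins / n, draws / n, losses / n

print(simulate_match(1.8, 1.1))
```

The shared component is what makes the distribution bivariate rather than two independent Poissons; setting it to zero recovers the independent case.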

Sander Ijtsma also continued his discussion of expected goals in two posts (2015a, 2015b). The second post (2015b) provided a detailed look at the development of Sander's expected goals models (ExpG) since 2013 and included three video examples of the calculation of an ExpG metric (see also a 2016 video explanation).

Matthias Kullowatz (2015) shared an expected goals 3.0 methodology (xGoals 3.0) that included a log-distance variable. The model also took into account the goal mouth available to a player taking a shot. (Matthias shared a validation of the xGoals model in a subsequent post (2017) that included R code for fitting the model and a gradient-boosted tree model.)

Will Gürpınar-Morgan wrote a number of posts in 2015 that discussed a variety of topics related to expected goals. These included: territorial advantage (2015a); territory and possession (2015b); single match expected goal totals (2015c); and unexpected goals (2015d). Will concluded his discussion of single match expected goals with this observation: Due to the often random nature of football and the many flaws of these models, we wouldn’t expect a perfect match between actual and expected goals but these results suggest that incorporating these numbers with other observations from a match is potentially a useful endeavour. He recommended that the reporting of expected goals take the form "Team x’s expected goal numbers would typically have resulted in the following…here are some observations of why that may or may not be the case today".

Michael Bertin presented a critique of expected goals models early in the year (2015a). He presented his expected goals candidate models in a second post (2015b). His calculations included a time interval “that breaks the game down into 12 mostly equal time periods (one for every 7.5 minutes)”. He noted that “most of the info in an ExpG calculation is in the shot location ... If all you know is the distance to the goal from where the shot was taken, you can make a decent ExpG model”. Michael shared the coefficients generated by his model. He provided a link to his data set of four seasons of player calculations for four European league competitions.

Martin Eastwood (2015) discussed expected goals as a binary problem and suggested "we can view expected goals as a classification problem rather than regression". He proposed that support vector machine models are “a viable approach to calculating expected goals and comfortably outperform the naïve model that ignores shot locations”. In his post, Martin advocated that given the uncertainty of expected goals forecasts, confidence intervals for models should be reported.

Ola Lidmark Eriksson (2015a) launched his blog with a discussion of his use of a machine learning approach to expected goals in Sweden. He described his approach as that of a Python programmer rather than a statistician. A subsequent post (2015b) provided details of his methodology. Ola gathered data from Swedish football and compiled a data set that contained: coordinates; whether the finish came from a corner (yes/no); whether it came from a free kick (yes/no); whether it was a penalty (yes/no); the time of the event; and the current possession of the team creating the event (in 5-minute intervals). He used the scikit-learn library to train his model. He discussed his use of a Gradient Boosting Regressor to increase the sensitivity of his model. A third post (2015c) shared Ola’s consideration of game state, and three subsequent posts (2015d, 2015e, 2015f) extended that discussion.

Mathijs Steneker (2015) shared his expected goals model from data gathered in Argentina's Primera Division (23,931 shots from 1209 games).

Zorba (2015a, 2015b, 2015c) shared his analysis of expected goals in Swedish football. (There were follow-up posts in 2016, 2017 and 2018.)

Thom Lawrence (2015) discussed the anatomy of a shot in his consideration of expected goals models. He noted "I haven’t used the word ‘expected’ anywhere, because I think it does subtle things to your outlook that aren’t helpful. But I think creating these different models and using them in different situations might be helpful".

Sam Gregory (2015) proposed a variation on the expected goals metric in order to develop a non-shot based expected goals model called Pass ExpG Projections that focused on passes into the danger zone (“essentially passes into the penalty area”). Sam's discussion noted other analysts who were using this kind of approach (Dan Altman (2015), Will Gürpınar-Morgan (2015b) and Dustin Ward (2015)). Sam also noted Tom Worville’s (2015) discussion of defensive contribution.

Danny Page (2015) integrated insights from ice hockey and football in his discussion of expected goals. He proposed that any reporting of expected goals must also consider variance ("Summing xG is only half of the story; we must also consider variance as well. Failure to do so may mislead the reader into taking the wrong conclusion from a simple sum xG scoreline"). Danny shared his Match Expected Goals Simulator to exemplify his approach.
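Danny's simulator idea can be sketched by treating each shot as a Bernoulli trial with probability equal to its xG: two shot lists with the same 1.0 xG total produce very different goal distributions. The shot lists below are invented for illustration:

```python
import random

# Each shot is a Bernoulli trial with probability equal to its xG.
def goal_distribution(shot_xgs, n=20000, seed=1):
    rng = random.Random(seed)
    counts = {}
    for _ in range(n):
        goals = sum(1 for p in shot_xgs if rng.random() < p)
        counts[goals] = counts.get(goals, 0) + 1
    return {g: c / n for g, c in sorted(counts.items())}

# Two invented shot lists, both totalling 1.0 xG:
print(goal_distribution([0.1] * 10))   # ten small chances
print(goal_distribution([0.9, 0.1]))   # one big chance plus a small one
```

The ten-small-chances team fails to score far more often (roughly 0.9^10 ≈ 35% of simulations) than the team with one near-certain chance, which is exactly the variance a bare xG scoreline hides.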

2016
Will Gürpınar-Morgan (2016) extended his discussion of non-shot expected goals ("Each player is a ‘variable’ in the equation, with the idea being that their contribution to a team’s expected goal difference can be ‘solved’ via the regression equation").

SciSports (2016) shared an expected goals model that investigated the strength and skill of a defensive line and individual attackers. Their post presented Base xG Maps derived from 240,000 attempts at goal in seven competitions. The authors noted that their results apply to long-term estimation of xG values and pointed to variance in short-term estimation of xG values.

Martin Eastwood (2016) returned to the discussion of expected goals. He considered uncertainty, sampling error and confidence intervals. He concluded: The variance associated with expected goals, especially at the level of individual matches (let alone individual shots), is such that the uncertainty needs to be clearly accounted for. Without this information, expected goals are at best an inaccurate measure and at worst misleading or wrong. His recommendation was: So instead of just saying a shot is worth 0.25 expected goals, we would say the shot was worth 0.25 ± 0.1 expected goals at the 95% confidence level. This essentially means that we are 95% confident the true value of expected goals for that particular shot lies somewhere in the range of 0.15 - 0.35.
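Martin's recommendation can be illustrated with a bootstrap: resample the historical shots behind a conversion-rate estimate and report the percentile interval. The sample below (50 goals from 200 shots of one type) is invented for illustration:

```python
import random

# The xG for a shot category is a conversion rate estimated from a
# finite historical sample, so it carries sampling error. Bootstrap
# the sample to get a 95% interval for that rate.
def bootstrap_ci(outcomes, n_boot=5000, seed=7):
    rng = random.Random(seed)
    rates = sorted(
        sum(rng.choice(outcomes) for _ in outcomes) / len(outcomes)
        for _ in range(n_boot)
    )
    return rates[int(0.025 * n_boot)], rates[int(0.975 * n_boot)]

# Invented sample: 200 shots of one type, 50 scored (rate 0.25):
outcomes = [1] * 50 + [0] * 150
low, high = bootstrap_ci(outcomes)
print(f"xG = 0.25 (95% CI {low:.2f} - {high:.2f})")
```

A larger historical sample narrows the interval, which is why simple models built on sparse bins carry the most hidden uncertainty.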

Nils Mackay posted a number of expected goals articles including: (2016a); (2016b); and (2016c). His first post (2016a) shared his approach to shot locations and the goals scored from those locations (80,000 shots over 10 seasons): All models I know calculate the influence of shot location by dividing it into several factors such as distance to goal, and angle to goal (some use even more). Although it might be a good approximation, to me it sounded like a very complex way to compute the influence of location. The problem is that the goal posts make the distribution of values across the field very complex. For instance, a shot from 10 cm on the outside of the goalpost on the goal line will have an xG-value of practically zero, whereas the xG-value for a shot from 10 cm on the inside of the goalpost on the goal line will be about 1. This makes the exact values very hard to approximate by using angle and distance only. Nils evaluated his model and other xG models in a second post (2016b). A third post (2016c) discussed bias in xG models. In that post, he concluded: Not all xG models are the same. When I see a statistic that uses xG I usually just assume it’s correct, while my analysis has shown that especially simpler models can make big systemic errors. I feel like the use of a too simple xG model might lead you to wrong conclusions.

Thom Lawrence (2016) added to the discussion of non-shot expected goals with his consideration of time-to-shot.

David Sumpter (2016) discussed an expected goals model that was based on the probability of a shot from one of three zones (outside the 18 yard box, inside the 18 yard box and inside the 6 yard box) being a goal. He noted that football clubs use much more sophisticated models but “the basic methodology is the same: past shots from similar situations are used to give a statistical model of the probability that a certain shot results in a goal”.

Olav Nørstebø, Vegard Rødseth Bjertnes and Eirik Vabo (2016) extended the empirical reach of conversations about shots and goals with their discussion of Norwegian football. They discussed an expected goals (xG) model that uses data from 13,440 shots attempted in 480 football matches in Norway’s Tippeligaen league. Olav, Vegard and Eirik reported: The likelihood of scoring is estimated using binary logistic regression with ten explanatory variables. This model is used as a foundation to evaluate the performance of players with regard to their shot efficiency.
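The binary-logistic-regression form such a model takes can be sketched with two of the usual explanatory variables; the coefficients below are invented for illustration, not the fitted values from the ten-variable Tippeligaen model:

```python
import math

# P(goal) = 1 / (1 + exp(-(b0 + b1*x1 + b2*x2 + ...))).
# Coefficients and the two variables are invented for illustration.
def shot_probability(distance_m, angle_deg, b0=-0.5, b_dist=-0.12, b_angle=0.02):
    z = b0 + b_dist * distance_m + b_angle * angle_deg
    return 1.0 / (1.0 + math.exp(-z))

print(round(shot_probability(8, 60), 3))   # a close, central shot
print(round(shot_probability(25, 20), 3))  # a distant, narrow-angle shot
```

The logistic link guarantees the output is a valid probability regardless of the linear predictor's value, which is why it is the standard choice for shot-outcome models.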

Chad Murphy wrote a series of posts in 2016 about expected goals. In his posts (no longer available online), using National Women’s Soccer League (NWSL) data, he: provided an introduction to expected goals (xG 101) and regression analysis (2016a); discussed the effects of defensive pressure on xG (500 shots); shared a visualisation of his data and analyses; and considered what shot to take when losing in a game. Some of this work appeared in a post written for FourFourTwo (2016).

James Curley (2016) used data from 1014 MLS games (seasons 2012-2014) to explore how to work with play-by-play sports data and develop an expected goals (ExG) model.

Jan Mullenberg (2016) discussed expected goals in the context of performance in the Eredivisie in the Netherlands.

Ted Knutson (2016) reflected on how goal scoring data are used in television commentary and considered the importance of contextualising data for an audience.

Ben Mayhew (2016) discussed the visualisation of individual players as goalscorers.

Matt Rhein (2016) noted there are a lot of misconceptions about ‘expected goals’. He pointed to a difficulty in developing a rich data account of Scottish football (“data publicly available for Scotland is very limited”). Matt reported his data from expected goals in the Scottish Premier League.

2017
The ubiquity of an expected goals metric in conversations about association football analytics prompted a number of review articles in 2017 as well as discussions of specific topics.

Review articles

Sam Gregory (2017) provided a recap and overview of expected goals (xG). He noted that Opta's methodology for measuring xG is based on 300,000 shots from the Opta database.

David Cheever (2017) presented an overview of the xG metric using Opta data from the United States of America.

Paul MacInnes (2017) discussed some of the origins of the expected goals metric. He suggested: The genesis of expected goals most likely lies with Opta, the data company that has been analysing football matches since 2001, recording all the information that for years has appeared in the small statistical summaries that round up each match on TV and in the papers. According to Caley, it was two of the company’s analysts, Sam Green and Devin Pleuler, who first began modelling xG in the late noughties. Such is the complicated and often cumulative nature of such research however, Sarah Rudd of StatDNA, also an American, was working on similar models at the same time.

Spencer Jackman (2017) provided another review of “football’s trendiest stat”. His response to current practice was: Goals are the currency of soccer so figuring out how many a team should have scored, and, more importantly, projecting how many they will score, is a worthwhile endeavor. The challenge for soccer analytics folks is to connect xG to all actions on the field with a continuously more nuanced approach.

Garry Gelade reviewed expected goals models in two posts: the first (2017b) investigated shots, and the second (2017c) focused on big chances. (See also Mark Taylor's (2017) discussion of big chances.)

Benjamin Cronin (2017) analysed a number of expected goal models in his survey of the metric.

Richard Whittall (2017a) proposed "football analytics might get a little more traction if it ... refocused its efforts on more mid-to-long term diagnosis and treatment". (See also Richard Whittall (2017b) on xG in single games.)

He suggested: In general, this would mean less adding up xG totals after individual games, and more looking at 5-10 game trends. It would mean less relying solely on xG and it’s derivative stats (xA etc.), and more using xG in tandem with other statistical tools, as well as video and tactical analysis, to present a more vivid theory as to why a team may be underperforming. It would mean less obsession with which teams are ‘riding their luck’ or are ‘unlucky,’ and more to see why certain clubs fail to create or prevent dangerous chances in the first place.

Topics

A StatsBomb post early in 2017 sought "to squeeze one last drop of juice from the desiccated lemon that is expected goals" by exploring an xGChain (xGC) metric: find all the possessions each player is involved in; find all the shots within those possessions; sum their xG; assign that sum to each player, however involved they were.
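The xGChain recipe above translates almost directly into code. The data layout here (possessions recorded as a list of players involved plus the xG values of any shots the possession produced) is a simplifying assumption for illustration:

```python
# Sketch of the xGChain idea: every player involved in a possession is
# credited with the full xG of the shots that possession produced,
# regardless of their level of involvement.
from collections import defaultdict

def xg_chain(possessions):
    """possessions: list of dicts with 'players' and 'shot_xgs' keys."""
    totals = defaultdict(float)
    for possession in possessions:
        chain_xg = sum(possession["shot_xgs"])  # sum the xG of the chain's shots
        for player in set(possession["players"]):  # credit each player once
            totals[player] += chain_xg
    return dict(totals)

# Toy possession data -- purely illustrative.
possessions = [
    {"players": ["A", "B", "C"], "shot_xgs": [0.1, 0.3]},
    {"players": ["B", "D"], "shot_xgs": [0.05]},
    {"players": ["A"], "shot_xgs": []},  # a possession with no shot adds nothing
]

totals = xg_chain(possessions)
```

Because every involved player receives the whole chain's xG, the metric rewards participation in dangerous possessions rather than individual contribution within them.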

Imran Khan (2017a) sought to offer an xG model based on 225,372 shots from 9,133 matches that produced 21,424 goals. A second post (2017b) explored the use of machine learning to model Expected Goals. The approach chosen used a k-nearest neighbours algorithm.
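A toy illustration of the k-nearest-neighbours idea: a shot's xG is estimated as the goal rate among its k most similar historical shots. The features used here (pitch coordinates only) and the sample shots are assumptions for illustration, not Imran's data or model:

```python
# k-nearest-neighbours xG sketch: the probability of a shot becoming a
# goal is the goal rate among the k most similar historical shots.
import math

def knn_xg(shot, history, k=3):
    """shot: (x, y); history: list of ((x, y), scored) -> estimated xG."""
    by_distance = sorted(history, key=lambda h: math.dist(shot, h[0]))
    nearest = by_distance[:k]
    return sum(int(scored) for _, scored in nearest) / k

# Toy shot history: coordinates in metres from the goal line, plus outcome.
history = [
    ((6, 0), True), ((7, 1), True), ((8, 0), False),        # close-range shots
    ((25, 5), False), ((30, -4), False), ((28, 0), False),  # long-range shots
]

close_xg = knn_xg((7, 0), history)  # lands in the close-range cluster
far_xg = knn_xg((27, 0), history)   # lands in the long-range cluster
```

With richer features (angle, assist type, defensive pressure) the same neighbour-lookup logic applies; the choice of k and of the distance metric then becomes the main modelling decision.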

Peter Goldstein (2017) discussed overperformers and underperformers in the context of expected goals. He noted: First things first: it’s not The Stat That Solves All Our Problems, and has never been claimed as such. It’s a number like any other, which tells us some things and doesn’t tell us others. But for the moment (and a very strong emphasis on those three words), it appears to be the number which best sums up a team’s overall performance level.

Howard Hamilton (2017) returned to a discussion he initiated in 2009. He notes that an expected goals model "is a conditional probability model that answers the question, “Given a collection of parameters that describes a shot toward goal, what is the probability that a goal is scored?”. He observes: It’s the selection of the shot parameters that makes every analyst’s xG model unique as a product of observation, conjecture, and judgment. Some parameters are common to almost all models, but more exotic ones depend on the richness of the data set and the willingness to search the entire possession chain.

Howard used the LogisticRegression class from scikit-learn to create his xG model. He illustrated his model with data from Argentina's Primera División.
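A minimal sketch of this scikit-learn approach, fitting a logistic regression to shot outcomes. The features (shot distance and angle) and the synthetic training data are illustrative assumptions standing in for a real shot dataset, not Howard's parameters:

```python
# Fit an xG model with scikit-learn's LogisticRegression on synthetic shots.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
distance = rng.uniform(5, 35, n)   # metres from goal
angle = rng.uniform(0.1, 1.4, n)   # radians of goal mouth visible

# Synthetic outcomes: closer, wider-angle shots score more often.
p_goal = 1 / (1 + np.exp(0.15 * distance - 1.5 * angle))
scored = rng.random(n) < p_goal

X = np.column_stack([distance, angle])
model = LogisticRegression().fit(X, scored)

# predict_proba returns [P(no goal), P(goal)] per shot; column 1 is the xG.
xg = model.predict_proba([[10.0, 1.0], [30.0, 0.3]])[:, 1]
```

Swapping in more exotic parameters, as Howard observes, only changes the columns of `X`; the conditional-probability framing stays the same.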

Andrew Beasley (2017a) discussed how to calculate expected goals. His post includes a definition of expected goals: The number of goals a team (or teams) would expect to score in a match. This is determined by assigning a value to shots on goal, the number of shots, shot location, the in-game situation and the proximity of opposition defenders. (See also Andrew's (2017b) discussion of using expected goals to predict the Premier League table.)

Mike Goodman (2017) discussed Manchester United's performance in the context of expected goal metrics. Hector Ruiz and his colleagues (2017) used "new soccer analytics tools" (expected goal value, expected save value, strategy-plots and passing quality measures) to compare Leicester City's performances in the 2015-2016 and 2016-2017 EPL seasons. (For a detailed discussion about the objective measurement of risk and reward in passing see Paul Power and colleagues' (2017) paper.) Will Gurpinar-Morgan (2017) investigated Burnley's performance as "the poster-child for expected goals flaws".

Paul Riley (2017) in his discussion of an xG model noted: When I started all this nearly 5 years ago, collecting enough shot position data to build a model took months worth of effort. Today, you can get a decent enough equivalent in 20 minutes for a ton of leagues.

Paul's post provided guidelines about how to build an xG model. An example of this advice can be found in a post on the way we play blog.

Nils Mackay returned to the discussion of expected goals in 2017. In his first post (2017a), Nils replicated his evaluation of xG models introduced in an earlier post. In a second post (2017b) he discussed improvements he had made to his xG added model.

A number of video explainers about expected goals appeared in 2017. They included videos from Adam Bate, Opta, Michael Caley and James Lorenzo.

Garry Gelade discussed expected and unexpected goals in posts written in 2017. The first (2017a) investigated 'great goals'. He noted: Goals we judge as great are often subjectively experienced as surprising, astonishing, coming out of nowhere – in a word unexpected; statistically speaking they have a low probability of being scored. If this is so, great goals should have lower xG (expected goal) values than ordinary goals.

He adds "great goals are 'unexpected' in xG terms. A great goal is only half as expected – and  therefore  twice as surprising – as an ordinary goal".

In subsequent posts, Garry assessed expected goals models (2017b (shots); 2017c (big chances)).

Ravi Mistry (2017) discussed the ways in which expected goals might be visualised.

Marek Kwiatkowski (2017) shared his approach to xG models in his consideration of the quantification of finishing skill and his use of Bayesian inference. His post included a reflection on the limitations of his model. (Marek acknowledged readers of a draft of his post. These included Will Gürpınar-Morgan, Martin Eastwood, Devin Pleuler, Sam Gregory, Ben Torvaney, Łukasz Szczepański and Thom Lawrence.)

Matt Rhein discussed the use of expected goals data from Scottish football. His posts included: an introduction to a prediction model that used expected goals data (2017a); and 'advanced' expected goals and the analysis of single game data (2017b). (See also Ferdia O'Hanrahan (2017) on interpreting the expected goals of single matches.)

Ted Knutson (2017) discussed the care needed when using xG information in television broadcasts. (See also Henry Bushnell (2017); Breaking News (2017); and David Sumpter (2017).)

Ben Mayhew discussed the relationship between expected goals and league position (2017a) and in a second post visualised attack breakdowns (2017b).

James Tippett (2017) discussed expected goals in a book published in September 2017.

Played off the Park wrote about expected goals and Stratabet data in three posts (2017a, 2017b, 2017c). (See also David Willoughby's (2017) discussion of StrataData.)

Bobby Gardiner (2017) sought to answer a number of questions about expected goals. He noted that "all of its proponents have problems with it" (original emphasis). (See also Oli Platt (2017).)

2018
Bobby Gardiner (2018) provided one of the first posts about expected goals in 2018 with his discussion of Raheem Sterling's performance in Manchester City games.

Mark Taylor (2018) discussed the potential of xG2 analysis of goal scoring. Mark noted that xG2 "is based entirely upon shots or headers that require a save and uses a variety of post shot information, such as placement, power, trajectory and deflections. Typically this model would be the basis for measuring a keeper's shot stopping abilities".

Andrew Beasley (2018) reported the use of expected goals to analyse the importance of speed of attack.

Tifo Football (2018) produced a video explainer of expected goals.

William Spearman (2018) constructed "a probabilistic physics-based model" that used spatiotemporal player tracking data to quantify off-ball scoring opportunities. William noted that "with the proliferation of spatiotemporal tracking data, exciting new ways of measuring the probability of scoring have been developed" and cited the work undertaken by Patrick Lucey and his colleagues and Daniel Link, Steffen Lang and Philipp Seidenschwarz.

Criticism of expected goals
In 2015, Michael Bertin presented a critique of expected goals models. In a first post (2015a), Michael suggested that some ExpG models were sub-optimally constructed and misused r-squared. He argued also that logistic regression indicates "the results are not at all good". He concluded: It's early days for the analyticification of soccer. Part of moving the process along is figuring out what's good and what's bad. At this point, I'd look at ExpG as "marginally interesting, but nothing to get religious about." It's better than nothing, but the extraordinarily complex models don't look to be vastly superior to what you can get if all you knew about a shot was where it was taken.

Michael Caley (2015a) replied to Michael's ExpG comments. His response to the r-squared criticism was "I don't see in the piece any specific argument explaining why this r-squared is meaningful. If this is the primary critique of expected goals, I'm looking for an explanation of why these two particular r-squared values suffice as a critique". He added: The expected goals method is very much a work in progress. It is open to critique from a wide variety of angles. I published my expected goals method openly in order to facilitate critique, as well as to help other analysts who want to use it for their own purposes, to build upon it or edit it.

Michael Caley identified his own concerns with the ExpG approach:
 * It does not know where the defenders are
 * It does not care who is shooting
 * It may be missing on the very best teams

He concluded: We should always be improving our methods and we should always be critiquing them. A minutely descriptive, intuitively articulated, highly accurate expected goals model remains no more than a goal itself. But the only way we're going to get there is by continuing to trudge this road and build and improve and discuss the models we have.

A second post (2015b) extended Michael Bertin's critique. In it he used three seasons of data to argue for an alternative to existing ExpG models.

Michael Bertin's third post (2015c) started: Of all the positions I may or may not have about expected goals, the one that’s truly indefensible is to say there’s not much value in ExpG, then sit on the results as if it’s a set of nuclear launch codes. In the post, Michael provided the details of his alternative approach to quantifying goal scoring in association football.

In 2016, Nils Mackay posted three articles, (2016a), (2016b), and (2016c), about evaluating xG models.

Nils concluded in his third post (2016c): Not all xG models are the same. When I see a statistic that uses xG I usually just assume it’s correct, while my analysis has shown that especially simpler models can make big systemic errors. I feel like the use of a too simple xG model might lead you to wrong conclusions. On the other hand half of the models I tested fall within the margin of error on every test I did. To me it seems like it is definitely worth it to invest a few extra hours to improve your xG model to get into that category. Simpler models can obviously still be used for analysis, but only if we understand their limitations and communicate this when publishing results.

Jack Coles (2016) noted some criticisms of expected goals models and the role of data analytics in football.

Marcus Cleaver (2017) suggested "there is a real concern that by giving the current iteration of xG legitimacy by putting it on Match of the Day it will actually prove to be a major setback for advanced statistics in football" in his discussion of football analytics.