Delta Corsi and Assessment of Individual Player Impacts on Possession Accounting for Usage
Statistical analysis of skaters in the NHL (and other hockey leagues) is a difficult and multifaceted process. At this point in the hockey analytics (or #fancystats) community, one of the biggest problems with analyzing and assessing hockey play from a possession perspective, and particularly a defensive one, is tracking which players actually make a significant impact within a given system. Generally this has been tackled by assessing a player's results within the context of their usage, then comparing a variety of statistical metrics to those of their "usage" peers. The best tools available at this point are player usage charts (Vollman), which are now available in a few locations. Metrics have also been devised that consider expected goals (Parkatti), expected shooting percentage (Pfeffer), and win contributions, i.e. Total Hockey Rating, aka THoR (Schuckers and Curro).
The problem one typically finds with trying to analyze players in "context" is that eyeballing context isn't especially easy. We all know what tough minutes are, generally speaking: facing top opposition, playing big minutes, starting in your own zone frequently, playing with weaker line-mates, etc. Unfortunately there are implicit assumptions and unfounded impressions floating around that very few people seem to have spent much energy assessing. Work must be done to determine which combination of factors has a meaningful impact, thus providing us with a context for interpreting each player's results.
Following a number of rough forays into modelling impacts based on context, I decided that a multi-variate linear regression would be an effective means to predict what Corsi results a given player should expect based on their usage. As a means of assessing usage, a series of variables were correlated to a player's 5v5 Corsi. "Usage" in this instance is determined by weighting a number of factors that are implicitly beyond an individual skater's control during the course of play and that may affect their Corsi results.
In an effort to determine a skater's personal impact on shot-based possession metrics (Corsi), I conducted a multi-variate linear regression using R to assess the player's Expected Corsi For and Expected Corsi Against. The residuals (differentials) between the Expected Corsi For and observed Corsi For, and between the Expected Corsi Against and observed Corsi Against, are then combined into a single dCorsi (delta Corsi) score. This dCorsi value represents the seasonal average level above or below Expected Corsi a player has produced for every 20 minutes of 5v5 game play in a given season when usage is taken into account.
The logical justification for separating out Corsi For and Corsi Against is built upon an examination of the correlation between the two. They were found to have a Pearson's r correlation value of -0.13, with a coefficient of determination (r²) of 0.02. Thus Corsi For explains only about 2% of the variance in Corsi Against, and vice versa, which is very weak. The linkage between the two has been over-stated in many corners in the past; at the individual skater level, this appears to be a flawed assumption.
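As a quick check on the arithmetic above, the coefficient of determination is just the square of Pearson's r. The sketch below is written in Python rather than the R used for the original analysis, and the five skaters' CF/20 and CA/20 values are invented purely for illustration; only the r = -0.13 figure comes from the text.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# The correlation reported in the text for Corsi For vs. Corsi Against.
r_reported = -0.13
r_squared = r_reported ** 2  # 0.0169, which rounds to the 0.02 quoted above

# Hypothetical CF/20 and CA/20 values for five skaters (made up for illustration).
cf20 = [18.2, 19.5, 17.1, 20.3, 18.8]
ca20 = [18.9, 17.6, 19.4, 18.1, 18.0]
r_sample = pearson_r(cf20, ca20)
```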
Thus a player's Expected Corsi For and Expected Corsi Against are determined using the following factors:
Expected Corsi For - Regression Variables
• 5v5 TOI/60
• Team Mate CF/20 (average of all line-mates' Corsi For per 20 minutes WITHOUT the skater being analyzed on the ice, weighted by the TOI the line-mate and skater in question played together)
• OZFO%, NZFO% (percentage of faceoffs taken in offensive and neutral zones)
• OZFOW%, NZFOW% (faceoff win percentage in the offensive and neutral zones)
• Team (a dummy variable is used to represent the team of the skater in question) †‡
Expected Corsi Against - Regression Variables
• Team Mate CA/20 (average of all line-mates' Corsi Against per 20 minutes WITHOUT the skater being analyzed on the ice, weighted by the TOI the line-mate and skater in question played together)
• Opposition CF/20 (average of all opposition players' Corsi For per 20 minutes WITHOUT the skater being analyzed on the ice, weighted by the TOI the opposing player and skater were on the ice together)
• DZFO% (percentage of faceoffs taken in the defensive zone)
• DZFOW%, NZFOW% (faceoff win percentage in the defensive and neutral zones)
• Team (a dummy variable is used to represent the team of the skater in question)†‡
† A note on the Team variables: no players who switched teams mid-season were included in the regression, as manipulating the data to separate out their TOI with distinct clubs was deemed laborious.
‡ Secondarily - this same process has been conducted with Yearly Team Effects accounted for as dummy variables, which then showed high collinearity with Team Mate CF/20 and CA/20. These considerations are still being examined - and look to be an improvement on the current model - but have not been completed as of the date of this posting.
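The Team Mate and Opposition variables listed above are all TOI-weighted averages of "without the skater" rates. A minimal sketch of that weighting follows, in Python rather than the original R. The function name and the line-mate figures are hypothetical, not taken from the data set.

```python
def weighted_without_rate(partners):
    """TOI-weighted average of other players' 'without the skater' rates.

    partners: list of (toi_together_minutes, rate_without_skater) tuples,
    one per line-mate (or opponent) who shared ice time with the skater.
    """
    total_toi = sum(toi for toi, _ in partners)
    if total_toi == 0:
        raise ValueError("skater has no shared ice time")
    return sum(toi * rate for toi, rate in partners) / total_toi

# Hypothetical line-mates: (minutes played with the skater, CF/20 without the skater)
linemates = [
    (400.0, 19.0),
    (250.0, 17.0),
    (100.0, 21.0),
]
tm_cf20 = weighted_without_rate(linemates)  # about 18.6
```

The line-mate with 400 shared minutes pulls the average toward 19.0, while the 100-minute line-mate barely moves it. The same function would serve for Team Mate CA/20 and Opposition CF/20 by swapping in the relevant rates.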
The regression was run using data from stats.hockeyanalysis.com, behindthenet.ca, and hockey-reference.com.
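To make the shape of the Expected Corsi For model concrete, here is a heavily simplified sketch of a least-squares fit with a team dummy variable. It is written in plain Python rather than the R used for the real analysis, keeps only two of the listed predictors (Team Mate CF/20 and OZFO%), and uses two made-up teams and six made-up skaters. Dropping the first team as a baseline, so the dummies are not collinear with the intercept, is an assumption of this sketch and not necessarily how the original model was coded.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so row operations apply to both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (collinear predictors?)")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    coefs = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][k] * coefs[k] for k in range(i + 1, n))
        coefs[i] = (m[i][n] - tail) / m[i][i]
    return coefs

def fit_ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)

def design_row(tm_cf20, ozfo_pct, team, teams):
    """Intercept, two usage predictors, and team dummies (first team is baseline)."""
    dummies = [1.0 if team == t else 0.0 for t in teams[1:]]
    return [1.0, tm_cf20, ozfo_pct] + dummies

TEAMS = ["AAA", "BBB"]  # hypothetical team codes

# Hypothetical skaters: (Team Mate CF/20, OZFO%, team, observed CF/20)
skaters = [
    (18.0, 30.0, "AAA", 18.1),
    (19.0, 35.0, "AAA", 19.0),
    (17.5, 28.0, "AAA", 17.4),
    (18.5, 33.0, "BBB", 19.3),
    (20.0, 31.0, "BBB", 20.2),
    (17.0, 36.0, "BBB", 18.0),
]

rows = [design_row(tm, oz, team, TEAMS) for tm, oz, team, _ in skaters]
observed = [cf for _, _, _, cf in skaters]

coefs = fit_ols(rows, observed)
expected_cf20 = [sum(b * x for b, x in zip(coefs, row)) for row in rows]
# Residual for each skater: observed CF/20 minus Expected CF/20.
d_cf = [obs - exp for obs, exp in zip(observed, expected_cf20)]
```

Because the fit includes an intercept, the residuals sum to zero across the sample, which is why a dCorsi-style residual reads as "above or below expectation" relative to the rest of the population.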
Once a model is obtained for Expected Corsi For and Expected Corsi Against the two values can be compared to the skaters’ observed Corsi results to obtain dCorsi For and dCorsi Against. By combining the two resulting values, an overall Expected Corsi result can be determined, which can then be compared to the skaters’ observed Corsi to obtain their dCorsi for a given season.
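The combination step can be sketched as follows. The text does not spell out the sign convention, so this assumes dCorsi is dCorsi For minus dCorsi Against (allowing fewer shot attempts than expected counts in the skater's favour). That is equivalent to comparing observed Corsi differential with Expected Corsi differential. The numbers are hypothetical.

```python
def d_corsi(cf20, expected_cf20, ca20, expected_ca20):
    """Return (dCorsi For, dCorsi Against, dCorsi) per 20 minutes of 5v5 play.

    Assumes dCorsi = dCF - dCA; the sign convention is this sketch's assumption.
    """
    d_cf = cf20 - expected_cf20   # positive: more attempts for than usage predicts
    d_ca = ca20 - expected_ca20   # negative: fewer attempts against than usage predicts
    return d_cf, d_ca, d_cf - d_ca

# Hypothetical skater: observed and expected rates per 20 minutes of 5v5 play.
d_cf, d_ca, dcorsi = d_corsi(cf20=19.4, expected_cf20=18.6,
                             ca20=17.9, expected_ca20=18.5)
# d_cf is about +0.8, d_ca about -0.6, so dcorsi is about +1.4
```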
The following graph presents Expected Corsi 20 vs. Actual Corsi 20 for all NHL skaters with 200+ mins of 5v5 TOI in each season from 2007-2014.
The graph below displays a frequency plot of the residual dCorsi values. The values are approximately normally distributed, which implies that the majority of players (roughly two-thirds) fall within a single standard deviation (σ = 2.0404) of the population mean of μ = 0.019. A normal distribution of dCorsi values would also suggest that, to some extent, the majority of outcomes being observed are random, but it also implies that extreme outcomes at either the high positive or low negative end are unlikely to be caused by random variation alone.
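To put numbers on the "single standard deviation" remark, the sketch below treats dCorsi as exactly normal with the μ and σ reported above. That is an idealization, since the observed distribution is only approximately normal. The +5.0 skater is hypothetical.

```python
import math

MU = 0.019      # population mean of dCorsi reported above
SIGMA = 2.0404  # standard deviation of dCorsi reported above

def normal_cdf(x, mu=MU, sigma=SIGMA):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def share_within(k):
    """Share of a normal population within k standard deviations of the mean."""
    return normal_cdf(MU + k * SIGMA) - normal_cdf(MU - k * SIGMA)

def z_score(d_corsi):
    """How many standard deviations a dCorsi value sits from the mean."""
    return (d_corsi - MU) / SIGMA

within_one = share_within(1)  # roughly 68% of skaters
within_two = share_within(2)  # roughly 95% of skaters
z = z_score(5.0)              # hypothetical skater with a dCorsi of +5.0
```

Under this idealization, a skater more than two standard deviations from the mean (|dCorsi| above roughly 4.1) sits in about the outer 5% of the population.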
Analysis of Results
Following the regression as described, the correlation between Expected Corsi For and the skater’s observed Corsi For was found to have an adjusted coefficient of determination of r² = 0.6117 (61.17%). The correlation between Expected Corsi Against and the skater’s observed Corsi Against was found to have an adjusted coefficient of determination of r² = 0.5542 (55.42%). The overall correlation between Expected Corsi and the skater’s observed Corsi was thus r = 0.7334, with a coefficient of determination of r² = 0.5379 (53.79%). This translates into the view that contextual factors outside of the individual skater’s control (i.e. usage) explain at least 53% of what is being observed on the ice in terms of shot differentials (and likely more as the model is improved).
As a verification of the model, an out-of-sample correlation was also assessed for 100 players randomly withdrawn from the 7-year population of skaters prior to the regression being performed. The correlation between Expected Corsi and observed Corsi for the out-of-sample group was r = 0.7854, with r² = 0.6168 (61.68%). This would indicate that the model is quite effective in predicting outcomes for the population in question.
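The hold-out step can be sketched as a seeded random split performed before any fitting. This is an illustration in Python, not the original R procedure. The population size of 1000 and the record labels are placeholders, and only the hold-out count of 100 and the reported r come from the text.

```python
import random

def holdout_split(records, n_holdout, seed=0):
    """Randomly withdraw n_holdout records; return (training, held_out)."""
    rng = random.Random(seed)  # seeded so the split is reproducible
    held_out_idx = set(rng.sample(range(len(records)), n_holdout))
    training = [r for i, r in enumerate(records) if i not in held_out_idx]
    held_out = [r for i, r in enumerate(records) if i in held_out_idx]
    return training, held_out

# Placeholder records standing in for the skater population.
records = [f"skater_{i}" for i in range(1000)]
training, held_out = holdout_split(records, n_holdout=100, seed=42)

# The model would be fit on `training` only, then Expected Corsi would be
# predicted for `held_out` and correlated with their observed Corsi.
r_out_of_sample = 0.7854              # value reported in the text
r_squared = r_out_of_sample ** 2      # about 0.6168, matching the text
```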
dCorsi represents the unexplained residual portion of Corsi results observed for a given skater in a given season. Admittedly variation in these numbers can arise from a combination of randomness, skill, and other factors not assessed in this regression such as coaching, defensive or offensive system structures, and injury. Also, it should be noted that smaller samples are inevitably prone to greater variation due to random occurrences, and are thus less reliable as a descriptor of skill. As we move to larger sample sizes (i.e. greater time on ice) we develop a clearer picture of a player’s impact on their team’s shot differential.
Conceptually this model is accounting for team effects as if they are fixed. In reality this is known not to be the case, as a variety of factors we are already aware of will be captured by the Team Effects coefficient. Rink scorekeeper biases, roster turnover, coaching effects, and team-wide systems will quite likely play a role in what is being observed. As of this writing it would be best to consider this an area where both further efforts and more time could describe these effects in more detail.
Further analysis has been done on the auto-correlation (or repeatability) of Expected Corsi and dCorsi in comparison to observed Corsi values across the 7 seasons in our data set. Also consideration has been given to the interaction between coaching adjustments and observed possession results. For a more complete exploration of the data that resulted from this analysis, and to obtain the actual dCorsi equation I would direct readers to the following:
Readers can obtain outcome data for individual skaters using the following Tableau Visualizations.§
§NOTE - The Linked Tableau Viz is updated using the most recent regression model which differs from the one discussed in this posting fairly significantly. Update includes seasonal Team Effects Factors (TEF), and the removal of TMCF20 and TMCA20 which were found to be collinear with the aforementioned TEF.
Readers interested in the full dCorsi paper including tables of player results can find the document in PDF form here.
I'm a math (and physics) teacher with the TDSB. I have a degree in Mechanical Engineering. I like working with statistics. I use twitter (a lot).