dCorsi - Introductions

Steve Burtch
July 19 2014 12:00PM



Delta Corsi and Assessment of Individual Player Impacts on Possession Accounting for Usage


Statistical analysis of skaters in the NHL (and other hockey leagues) is a difficult and multifaceted process. At this point in the hockey analytics (or #fancystats) community, one of the biggest problems with analyzing and assessing hockey play from a possession perspective, and particularly a defensive one, is identifying which players actually make a significant impact within a given system. Generally this has been tackled by assessing a player's results within the context of their usage, then comparing a variety of statistical metrics to those of their "usage" peers. The best tools available at this point are player usage charts (Vollman), which are now available in a few locations. Metrics have also been devised that consider expected goals (Parkatti), expected shooting percentage (Pfeffer), and win contributions, i.e. Total Hockey Rating, aka THoR (Schuckers and Curro).

The problem one typically finds when trying to analyze players in "context" is that eyeballing context isn't easy. We all know what tough minutes are, generally speaking: facing top opposition, playing big minutes, starting in your own zone frequently, playing with weaker line-mates, etc. Unfortunately there are implicit assumptions and unfounded impressions floating around that very few people seem to have spent much energy assessing. Work must be done to determine which combination of factors has a meaningful impact, thus providing us with a context for interpreting each player's results.

Following a number of rough forays into modelling impacts based on context, I decided that a multiple linear regression (one outcome, several predictors) would be an effective means to predict what Corsi results a given player should expect based on their usage. As a means of assessing usage, a series of variables were correlated with a player's 5v5 Corsi. "Usage" in this instance is determined by weighting a number of factors that are beyond an individual skater's control during the course of play and that may affect their Corsi results.


In an effort to determine a skater's personal impact on shot-based possession metrics (Corsi), I ran multiple linear regressions in R to estimate each player's Expected Corsi For and Expected Corsi Against. The residuals (the differences between observed and Expected Corsi For, and between observed and Expected Corsi Against) are then combined into a single dCorsi (delta Corsi) score. This dCorsi value represents how far above or below Expected Corsi a player's results were, on average, per 20 minutes of 5v5 play in a given season once usage is taken into account.

The logical justification for separating Corsi For and Corsi Against is built upon an examination of the correlation between the two. They were found to have a Pearson correlation of r = -0.13, for a coefficient of determination (R²) of about 0.02. Thus the explanatory value of Corsi For for Corsi Against, or vice versa, is very weak (approximately 2%). The linkage between the two has been overstated in many corners in the past; at the individual skater level, treating them as tightly linked appears to be a flawed assumption.
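As a rough illustration of the arithmetic behind that comparison, the sketch below computes a Pearson correlation and squares it. The per-skater values are hypothetical placeholders, not data from the regression; only the r = -0.13 figure comes from the text above.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# HYPOTHETICAL per-skater CF/20 and CA/20 values, for illustration only.
cf20 = [18.2, 17.5, 19.1, 16.8, 18.9, 17.0]
ca20 = [17.9, 18.4, 17.2, 18.0, 18.8, 17.1]

r = pearson_r(cf20, ca20)
r_squared = r ** 2  # share of variance in one series "explained" by the other

# Using the correlation reported in the article:
reported_r_squared = (-0.13) ** 2  # 0.0169, which rounds to about 2%
```

Squaring the reported correlation gives 0.0169, which is where the "approximately 2%" figure comes from.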


Thus a player's Expected Corsi For and Expected Corsi Against are determined using the following factors:

Expected Corsi For - Regression Variables

• Age

• Position

• 5v5 TOI/60

• Team Mate CF/20 (average of all line-mates' Corsi For per 20 minutes WITHOUT the skater being analyzed on the ice, weighted by the TOI the line-mate and the skater in question played together)

• OZFO%, NZFO% (percentage of faceoffs taken in offensive and neutral zones)

• OZFOW%, NZFOW% (faceoff win percentage in the offensive and neutral zones)

• Team (a dummy variable is used to represent the team of the skater in question) †‡
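The Team Mate CF/20 input in the list above is described as a TOI-weighted average. A minimal sketch of that weighting follows; the function name, the tuple layout, and the numbers are assumptions made for illustration, not the original data pipeline.

```python
def toi_weighted_average(linemates):
    """Average a per-20 rate across line-mates, weighted by shared ice time.

    linemates: list of (rate_without_skater, shared_toi_minutes) tuples, where
    the rate is the line-mate's CF/20 while the skater being analyzed is OFF
    the ice, and shared_toi_minutes is the time the two played together.
    """
    total_toi = sum(toi for _, toi in linemates)
    if total_toi == 0:
        raise ValueError("skater has no shared ice time with any line-mate")
    return sum(rate * toi for rate, toi in linemates) / total_toi

# HYPOTHETICAL line-mates: (CF/20 without the skater, shared 5v5 minutes).
linemates = [
    (19.0, 600.0),  # most frequent partner, so the largest weight
    (17.0, 300.0),
    (16.0, 100.0),  # occasional partner, so a small weight
]

team_mate_cf20 = toi_weighted_average(linemates)
```

With these placeholder values the weighted result is 18.1, closer to the most frequent partner's 19.0 than the unweighted mean of about 17.3 would be.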

Expected Corsi Against - Regression Variables

• Age

• Position

• Team Mate CA/20 (average of all line-mates' Corsi Against per 20 minutes WITHOUT the skater being analyzed on the ice, weighted by the TOI the line-mate and the skater in question played together)

• Opposition CF/20 (average of all opposition players' Corsi For per 20 minutes WITHOUT the skater being analyzed on the ice, weighted by the TOI the opposing player and the skater were on the ice together)

• DZFO% (percentage of faceoffs taken in the defensive zone)

• DZFOW%, NZFOW% (faceoff win percentage in the defensive and neutral zones)

• Team (a dummy variable is used to represent the team of the skater in question)†‡

† A note on the Team variables, for the sake of explanation: no players who switched teams mid-season were included in the regression, as manipulating the data to separate out their TOI with distinct clubs was deemed too laborious.

‡ Secondarily - this same process has been conducted with Yearly Team Effects accounted for as dummy variables, which then showed high collinearity with Team Mate CF/20 and CA/20.  These considerations are still being examined - and look to be an improvement on the current model - but have not been completed as of the date of this posting.
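To make the role of the Team dummy variables concrete, here is a small sketch of an ordinary least squares fit with one usage variable plus team dummies. It is written in Python rather than the R used for the actual analysis, it uses far fewer variables than the lists above, and the team codes, coefficients, and skater values are all synthetic assumptions chosen so the fit can be checked by hand.

```python
def one_hot(team, teams):
    """Dummy-encode a team, dropping the first team as the reference level
    so the dummies are not collinear with the intercept."""
    return [1.0 if team == t else 0.0 for t in teams[1:]]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# SYNTHETIC example: hypothetical team codes and Team Mate CF/20 values.
teams = ["AAA", "BBB", "CCC"]
skaters = [(17.0, "AAA"), (19.0, "AAA"), (18.0, "BBB"),
           (20.0, "BBB"), (16.5, "CCC"), (18.5, "CCC")]

# Design rows: intercept, Team Mate CF/20, then the team dummies.
rows = [[1.0, tm_cf20] + one_hot(team, teams) for tm_cf20, team in skaters]

# Observed CF/20 generated from known coefficients (no noise) to check the fit.
true_beta = [2.0, 0.9, 1.0, -0.5]
observed = [sum(b * v for b, v in zip(true_beta, row)) for row in rows]

beta = fit_ols(rows, observed)
expected_cf20 = [sum(b * v for b, v in zip(beta, row)) for row in rows]
```

Each team coefficient here is read as an offset relative to the reference team, which is the first entry in the list.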

The regression was run using data from stats.hockeyanalysis.com, behindthenet.ca, and hockey-reference.com.

Once a model is obtained for Expected Corsi For and Expected Corsi Against the two values can be compared to the skaters’ observed Corsi results to obtain dCorsi For and dCorsi Against. By combining the two resulting values, an overall Expected Corsi result can be determined, which can then be compared to the skaters’ observed Corsi to obtain their dCorsi for a given season.
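A minimal sketch of that last step follows. It assumes that overall Corsi is the differential CF/20 minus CA/20, so that dCorsi works out to dCorsi For minus dCorsi Against; the function name and the sample numbers are hypothetical.

```python
def d_corsi(obs_cf20, exp_cf20, obs_ca20, exp_ca20):
    """Return (dCorsi For, dCorsi Against, dCorsi) per 20 minutes of 5v5 play.

    ASSUMPTION: overall Corsi is CF/20 - CA/20, so
    dCorsi = (observed CF - observed CA) - (expected CF - expected CA)
           = dCF - dCA.
    """
    d_cf = obs_cf20 - exp_cf20  # positive: more shot attempts for than expected
    d_ca = obs_ca20 - exp_ca20  # negative: fewer shot attempts against than expected
    return d_cf, d_ca, d_cf - d_ca

# HYPOTHETICAL skater-season values, for illustration only.
d_cf, d_ca, d_total = d_corsi(obs_cf20=19.0, exp_cf20=18.0,
                              obs_ca20=17.5, exp_ca20=18.0)
```

Under this convention a positive dCorsi means the skater's shot differential was better than their usage would predict.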

The following graph presents Expected Corsi 20 vs. Actual Corsi 20 for all NHL skaters with 200+ mins of 5v5 TOI in each season from 2007-2014.

[Figure: Expected Corsi 20 vs. Actual Corsi 20]

The graph below displays a frequency plot of the residual dCorsi values. The distribution is approximately normal, which implies that the majority of players (roughly 68% under a normal distribution) fall within one standard deviation (σ = 2.0404) of the population mean (μ = 0.019). A normal distribution of dCorsi values would also suggest that, to some extent, the majority of the outcomes being observed are random, while also implying that extreme outcomes at either the high positive or low negative end are unlikely to be caused by random variation alone.

[Figure: frequency plot of dCorsi residuals]
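The statements about the normal curve can be checked with Python's standard library. The sketch below uses the μ and σ reported above and assumes the residuals are exactly normal, which the real distribution only approximates.

```python
from statistics import NormalDist

# Mean and standard deviation of dCorsi as reported in the article.
dist = NormalDist(mu=0.019, sigma=2.0404)

# Share of skater-seasons expected within one standard deviation of the mean.
within_one_sigma = dist.cdf(dist.mean + dist.stdev) - dist.cdf(dist.mean - dist.stdev)

# Share expected more than two standard deviations ABOVE the mean.
upper_tail_two_sigma = 1.0 - dist.cdf(dist.mean + 2 * dist.stdev)

def z_score(d_corsi_value):
    """How many standard deviations a dCorsi value sits from the mean."""
    return (d_corsi_value - dist.mean) / dist.stdev
```

Under that assumption about 68% of skater-seasons fall within one σ, and only a little over 2% land more than two σ above the mean.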

Analysis of Results

Following the regression as described, the correlation between Expected Corsi For and the skater’s observed Corsi For was found to have an adjusted coefficient of determination of r² = 0.6117 (61.17%). The correlation between Expected Corsi Against and the skater’s observed Corsi Against was found to have an adjusted coefficient of determination of r² = 0.5542 (55.42%). The overall correlation between Expected Corsi and the skater’s observed Corsi was thus r = 0.7334, with a coefficient of determination of r² = 0.5379 (53.79%). In other words, contextual factors outside of the individual skater’s control (i.e. Usage) explain roughly 54% of what is being observed on the ice in terms of shot differentials, and likely more as the model is improved.
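For readers unfamiliar with the adjusted coefficient of determination, the sketch below shows the standard adjustment for the number of predictors and checks the r-to-r² arithmetic quoted above. The sample size and predictor count in the example call are hypothetical; the article does not state them.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Standard adjusted R^2: penalizes R^2 for the number of predictors."""
    dof = n_obs - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / dof

# Reported overall correlation between Expected and observed Corsi.
overall_r = 0.7334
overall_r2 = overall_r ** 2  # about 0.5379, matching the article

# HYPOTHETICAL inputs: a raw R^2 of 0.62 from 1000 skater-seasons and 40
# predictors (most of them team dummies). These counts are placeholders.
example_adjusted = adjusted_r2(0.62, n_obs=1000, n_predictors=40)
```

The adjustment always pulls the value down when predictors are added, which matters for a model carrying a few dozen team dummies.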

As a verification of the model, an out-of-sample correlation was also assessed for 100 players randomly withdrawn from the seven-year population of skaters before the regression was performed. The correlation between Expected Corsi and observed Corsi for the out-of-sample group was r = 0.7854, with r² = 0.6168 (61.68%). This indicates that the model predicts outcomes for the population in question quite effectively.
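The holdout procedure described above can be sketched as follows. This is a simplified stand-in, not the original analysis: it uses a single synthetic "usage" predictor and simulated skaters, and only the holdout size of 100 is taken from the article.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def fit_line(xs, ys):
    """Simple least-squares line; returns (intercept, slope)."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# SYNTHETIC population: a single usage index and an observed Corsi that
# depends on it plus noise. Population size and coefficients are assumptions.
population = []
for _ in range(1000):
    usage = rng.gauss(0.0, 2.0)
    observed = 0.8 * usage + rng.gauss(0.0, 1.5)
    population.append((usage, observed))

# Withdraw 100 skaters at random BEFORE fitting, as described in the article.
rng.shuffle(population)
holdout, training = population[:100], population[100:]

intercept, slope = fit_line([u for u, _ in training], [o for _, o in training])

# Score the model only on the skaters it never saw.
predicted = [intercept + slope * u for u, _ in holdout]
r_out = pearson_r(predicted, [o for _, o in holdout])
r2_out = r_out ** 2
```

Because the holdout skaters play no part in estimating the coefficients, their correlation is a fairer check on predictive power than the in-sample figure.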

dCorsi represents the unexplained residual portion of Corsi results observed for a given skater in a given season. Admittedly variation in these numbers can arise from a combination of randomness, skill, and other factors not assessed in this regression such as coaching, defensive or offensive system structures, and injury. Also, it should be noted that smaller samples are inevitably prone to greater variation due to random occurrences, and are thus less reliable as a descriptor of skill. As we move to larger sample sizes (i.e. greater time on ice) we develop a clearer picture of a player’s impact on their team’s shot differential.

Conceptually, this model accounts for team effects as if they were fixed. In reality this is known not to be the case, as a variety of factors we are already aware of will be captured by the Team Effects coefficient. Rink scorekeeper biases, roster turnover, coaching effects, and team-wide systems will quite likely play a role in what is being observed. As of this writing, it would be best to consider this an area where further effort and more time could describe these effects in more detail.

Further analysis has been done on the auto-correlation (or repeatability) of Expected Corsi and dCorsi in comparison to observed Corsi values across the 7 seasons in our data set. Also consideration has been given to the interaction between coaching adjustments and observed possession results. For a more complete exploration of the data that resulted from this analysis, and to obtain the actual dCorsi equation I would direct readers to the following:

Readers can obtain outcome data for individual skaters using the following Tableau Visualizations.§


§NOTE - The Linked Tableau Viz is updated using the most recent regression model which differs from the one discussed in this posting fairly significantly.  Update includes seasonal Team Effects Factors (TEF), and the removal of TMCF20 and TMCA20 which were found to be collinear with the aforementioned TEF.

Readers interested in the full dCorsi paper including tables of player results can find the document in PDF form here.

 

I'm a math (and physics) teacher with the TDSB. I have a degree in Mechanical Engineering. I like working with statistics. I use twitter (a lot).
#1 Poughy
July 19 2014, 10:20PM

Great stuff!

One question - why was TOI/60 included in the Corsi For regression but not the Corsi Against?

#3 David Johnson
July 20 2014, 10:21AM

I haven't taken a deep look into dCorsi yet but I have a few concerns.

I don't like the idea of using age as a variable. Sure, it may be a factor in a player's performance, but it devalues your stat as a player evaluation metric. The reason is that it tells us more about whether the player is a good player for his age, not a good player overall. For example, if a 38-year-old and a 28-year-old have the same dCorsi, one might think they are similarly valuable players. They are not, because the 28-year-old outperforms a much higher expectation than the 38-year-old. Here you are adding context to a metric you are claiming is useful because it removes context. It is more important to know who the good players are than it is to know who the good players for their age are (though there is some value in that too).

The same might hold true for your team metric. You end up being able to say "that player was better than his team mates" but depending on whether the player plays on a particularly good or particularly bad team it could mean different things.

#4 Jay32600
July 20 2014, 10:38AM

Just to throw out a potential explanation for why TOI/60 is significant in the CF but not CA regression (in the original version): Coaches may be better at evaluating effective offensive players than defensive players, so the TOI distribution more accurately matches the skill distribution for shot generation than shot suppression.

#5 Jay32600
July 20 2014, 11:46AM

If you remove age you end up with the opposite issue of the one that David brings up above. Without including age if you have a 25-27 year old putting up a positive dCorsi it may just be because they are in their prime, but still aren't performing well relative to other players in their age group. So it would be good for saying this player was valuable this past year, but it doesn't really help you determine if that player might be good going forward once they begin to decline with age.

On the other hand, with age included you can see if a player is outperforming within their age group, which could help identify players who can continue to be effective as they age. Either way you end up needing to look at age yourself to add context to the metric. So in the end if you include age or not depends on if you are interested in using the metric as a descriptive or predictive measure.

**It may be a somewhat moot point anyway, because the magnitude of the age coefficient is rather small.

#8 Mack
July 20 2014, 05:06PM

Great stuff. Really impressed by this, will definitely use it in the future

#9 Jay32600
July 20 2014, 05:22PM

I don't see it as a problem at all--I should have avoided the word issue. Great work on this by the way.

#10 JBloom
July 21 2014, 07:02AM

New to the "fancy stats"...where on the chart does it indicate how good a player is? Is a guy like Pavel Datsyuk a model for where you want players to be? Who is a good indicator of an efficient player based on this model? Perhaps this is a dumb question, but is it better to be closer to 0 or do you want as high of a number as possible?

#12 JBloom
July 21 2014, 10:37AM

@Steve Burtch

Awesome. Thanks so much.

#13 John
July 21 2014, 12:43PM

@Steve Burtch

First, this is awesome, and long overdue.

I understand where you're coming from with regards to determining "expected" results, but I think the number would be more valuable for assessment of players if the expected Corsi did not include variables that speak to the player's skill rather than to the context in which they are deployed (zone starts, quality of teammates, quality of opposition).

Thus, I think it'd be better without Age (or at least potentially have two sets, one age-adjusted, one not).

I also think TOI could be more telling of a player's skill than context. (I suppose very high TOI could have a negative impact on an individual player, as they'd be tired more often, but it's probably more likely that better players get more time).

#14 Daniel W.
July 22 2014, 06:03AM

great post, interesting to see the thing as a whole after reading about bits on twitter.

Have you thought about adjusting for score effects?

#15 Colin
July 22 2014, 07:12AM

Interesting analysis. However, the observations for individual players are not independent - for example, each time player X takes a defensive zone faceoff, that's one fewer faceoff for player Y to take. Or player X and player Y are on the ice for the same faceoff. Players aren't randomly assigned to "treatments", so this violates a core assumption of regression analysis. There are other options though - robust regression or randomization tests could provide more accurate results. Given the large number of explanatory variables included, an AIC or BIC criterion could be used to find the best model - adj. R-squared is inflated as more variables are added.

I can see how your model would help evaluate performance relative to expectations, but what is really needed is a better model to predict absolute Corsi.

#18 nateb123
July 25 2014, 12:32PM

I would call any model with an R2 north of 0.5 a huge success when analyzing such a complex game. Well done.

Most interesting to me is the t-stats and coefficients of the different variables. Age REALLY counts for a lot less than many claim, both for CA20 and CF20. Also, I had no idea that OZFO% (not even OZFO win %) was so significant.

A lot of the most statistically significant variables hint at being the result of hidden variables, which really outline "player skill".

One nitpicking point: r2 is for regression with a single independent variable, R2 is for models using multiple independent variables. You confused the two a couple of times.

#19 Jani K
July 25 2014, 11:21PM

If I pick one player at non-random.. Say, Patrick Kane. So dCorsi tells us that he has underperformed since season 2007-2008, especially the last two seasons? Is he expected to perform at Sidney Crosby level, or is he actually not so good at hockey? I don't follow Chicago so I have no eye-ball data to form an opinion myself.

#20 SmellOfVictory
July 28 2014, 12:07AM

Steve Burtch wrote:

While I accept and agree that this is an issue in how you are interpreting it - that is not the intent of the statistic.

The intent (mine to be specific) is to identify which players are performing above or below expectations. If age factors impact upon expected results, that should be factored in.

The reason for its original inclusion was the identification of distinct aging curves I came across as I examined earlier results.

A player's "expected" results are the issue here - not whether or not they're older or younger than their peers. If age is confounding the identification of skill/random effects I don't think it's desirable to include it.

Either way - this is part of why I'm making this all clear - I can provide both results if desired - but this wasn't my intent.

I think it would be worthwhile to have your presented version of dCorsi alongside a variation that lacks the age variable. I do understand the desire to simply see how a player is faring in regard to his circumstances as a matter of interest. However, David Johnson's point is a good one - dCorsi could potentially be a very useful player evaluation tool, and age is the only variable included that is actually a trait of the player himself rather than his circumstances. It's akin to (assuming one could acquire the data) using fitness level as a variable and controlling for that; do we really care if a player is better than other players of the same fitness level for the most part? Or do we simply want to ignore that variable and go after the player who performs better regardless of this personal attribute?

#21 Steve Burtch
July 29 2014, 08:23PM

@SmellOfVictory

Part of the problem with this IMO is removing Age from the regression will change the beta values and lower the R2 of the regression, making it less accurate by removing information we KNOW is relevant to what you're observing on the ice.

I understand the logic of what you guys are saying - but in essence what the regression will do is attempt to fit the observed Corsi results to the remaining variables and inflate or deflate the beta values that are left to compensate for the age information it now lacks.

I personally think this would be reducing the usefulness not adding to it.
