# Selection Bias, Methodologies and Outcomes

Kent Wilson
July 23 2013 10:30AM

By: Patrick D. (SnarkSD) of Fear the Fin

## Introduction

When it comes to statistical analysis, the population of interest is everything. Outcomes only apply to the population in the study, and any manipulation of the population adds bias. Sometimes this is obvious; other times the bias hides behind the curtain even when we apply tests of statistical significance. That is why methodology is important. In an effort not to bore you, let’s first discuss the more pertinent issue, selection bias, and return to methodology at the end of the article.

## On Selecting Data to Use

Analyzing NHL data often involves selecting the largest population one can collect. In hockeymetrics, analysts are often forced to do this to increase the power of the study (power is a statistical term for the likelihood of a confirmatory result, i.e., of rejecting the null hypothesis). But this doesn’t come without perils. I tend to focus on team measures for a variety of reasons, one of which is that I don’t have to deal with as many selection bias issues.

Selection bias refers to an error in the outcomes that results from the way the data was collected or sampled. When analyzing teams, this is easily avoided by collecting most if not all games in each season, then randomizing the team-games for every year. For analysis of skaters or goalies this is tremendously more difficult. We aren’t working with a natural population in which new individuals enter and leave at random. Individuals are selected (remember Darwin: survival of the fittest!) on past, not current or intrinsic, characteristics. This is made all the more difficult because these selection pressures are often the same variables we are interested in studying.
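A minimal sketch of that team-level approach (the team names, seasons, and record fields here are hypothetical, purely for illustration):

```python
import random

# Hypothetical team-game records: one entry per team per game,
# for every game of every season in the period of interest.
games = [{"team": t, "season": s, "game_no": g}
         for s in (2010, 2011, 2012)
         for t in ("SJS", "DET", "CHI")
         for g in range(1, 83)]

random.seed(42)
random.shuffle(games)  # randomize the team-games across all years

# Nothing was filtered out before analysis, so no subgroup of games
# is systematically over- or under-represented.
print(len(games))
```

Because every game is retained and only the ordering is randomized, the sample is the population and no selection step can bias it.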

One such example is shooting percentage.[1] As Eric T. showed recently, GMs select heavily on shooting percentage. Any analysis we perform on shooting percentage will therefore carry significant selection bias, which an analyst must take into account.

I ran a simple simulator to show the effects of selection bias on shooting percentage, and then compared it to actual NHL data to check the fit as a secondary outcome. Our primary interest is in how shooting percentage changes when we introduce selection pressure.

Briefly, I selected every forward who appears on Behind the Net (BTN) at 5v5 strength, 2007-2012 (N=941), and compiled their games played (GP) and cumulative ice time (TOI) over that period. I then ran a histogram of how long (in minutes) each of these players played in this sample. (Note: this isn’t a study of survivorship, because some players are still active at the end of the time period selected.)

I selected bins of TOI based on this histogram, with each bin representing approximately 10% of the population. I then generated the average TOI/G, average cumulative TOI, GP, and shooting percentage for each bin of forwards.
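To illustrate the binning step, here is a minimal sketch; the exponentially decaying TOI values are a made-up stand-in for the real BTN histogram:

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical cumulative-TOI values (minutes) for 941 forwards;
# an exponential decay roughly mimics the real histogram's shape.
toi = rng.exponential(scale=1500.0, size=941)

# Cut-offs at every 10th percentile, so each bin holds ~10% of players.
cutoffs = np.percentile(toi, np.arange(10, 101, 10))
bins = np.searchsorted(cutoffs, toi, side="left")

counts = np.bincount(bins, minlength=10)
# Each decile bin contains roughly 941 / 10 ≈ 94 players.
print(counts)
```

Choosing cut-offs from the population's own percentiles, rather than at round numbers, is what keeps the bins from being arbitrary.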

| | Bin 1 | Bin 2 | Bin 3 | Bin 4 | Bin 5 | Bin 6 | Bin 7 | Bin 8 | Bin 9 | Bin 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| TOI cut-off | 45 | 155 | 345 | 685 | 1070 | 1955 | 2755 | 3725 | 4720 | 6605 |
| On-ice Sh% | 3.92% | 6.38% | 7.20% | 6.46% | 7.30% | 7.47% | 7.59% | 7.94% | 8.50% | 9.19% |
| Avg GP | 2.97 | 11.74 | 30.29 | 57.45 | 86.09 | 149.93 | 216.07 | 280.78 | 336.77 | 379.25 |
| Avg total TOI | 21.52 | 91.16 | 238.59 | 503.45 | 859.37 | 1520.32 | 2331.42 | 3230.12 | 4251.63 | 5219.34 |
| Avg TOI/G | 7.25 | 7.77 | 7.88 | 8.76 | 9.98 | 10.14 | 10.79 | 11.50 | 12.62 | 13.76 |

On-ice Sh%: on-ice shooting percentage (the shooting percentage of a forward’s team while he is on the ice). GP: games played. TOI: time on ice, 5v5. TOI/G: time on ice per game.

As you can see, there is a sharp reduction in the population early in the careers of many forwards. On average, our first 10% barely make it to 3 games, playing an average of 7 minutes per game. This steadily increases as we move up to our “elite” players, the top 10%, who averaged 379.25 games at 13.76 minutes of 5v5 ice time per game.

I’m sure your eyes went straight to On-ice Sh%. There is clearly a trend in shooting percentage here. As our population decreases (remember, we lose 10% every time we move to a new column), our shooting percentage increases until our final population is shooting at a career rate of 9.19%, well above the total population average of 8.28%. That qualifier is significant, however: our population isn’t decreasing randomly. Players are being selected by GMs, and we know that GMs prefer players with higher career shooting percentages, despite evidence that shooting percentage regresses heavily to the mean.

But maybe GMs and scouts are deft at identifying high-shooting-percentage players even when they aren’t shooting well. Maybe they can see through the initial variance to pick long-term winners, hence the increased shooting percentage of our final 10-20%. While I can’t unequivocally prove that wrong, I think GMs and scouts are much more likely to act on the results in front of them, dropping players with low shooting percentages from the NHL. (Which, by the way, still wouldn’t explain the regression to the mean.)

In order to show how selection bias results in a surviving population with a career average above the mean, I created a simulator in Excel. It’s very basic. I created a sample of “sim” forwards that all have an on-ice shooting percentage of exactly the NHL league average, 8.28%. I then let them “play” for the average ice time of each bin in our population (e.g. 21 minutes for bin 1, 70 + 21 = 91 total minutes for bin 2).

I let their on-ice shooting percentages vary as if each shot had a completely random chance (based on the mean and the TOI of that bin) of going in. That is to say, the variance in the population is the variance expected if on-ice shooting percentage is all luck and no skill. For each bin I calculated the weighted “career” shooting percentage.

Now the fun part.

I dropped the worst 10% of “sim” forwards by on-ice shooting percentage after each bin of ice time ended, and calculated the on-ice shooting percentage of each group of dropped forwards. Thus our population decreases at the same rate as in our NHL sample, and on-ice shooting percentage is calculated in a similar way. We can then compare the on-ice shooting percentage of our dropped “sim” forwards to the sample derived from NHL data, to see what the trends look like.
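A rough re-creation of that procedure in Python rather than Excel (the 8.28% mean and the per-bin cumulative TOI come from the article's table; the rate of roughly 0.5 on-ice shots-for per 5v5 minute is my assumption):

```python
import numpy as np

rng = np.random.default_rng(1)

N = 10_000           # number of "sim" forwards
TRUE_SH = 0.0828     # every forward's true on-ice Sh% (league average)
SHOTS_PER_MIN = 0.5  # assumed: roughly 30 on-ice shots-for per 60 min

# Average cumulative TOI for each bin, from the article's table;
# np.diff converts these to minutes played *during* each bin.
cum_toi = [21.52, 91.16, 238.59, 503.45, 859.37,
           1520.32, 2331.42, 3230.12, 4251.63, 5219.34]
inc_toi = np.diff([0.0] + cum_toi)

shots = np.zeros(N)
goals = np.zeros(N)
active = np.arange(N)   # indices of forwards still "in the league"
dropped_sh = []         # career on-ice Sh% of each dropped group

for minutes in inc_toi:
    # Shots and goals this bin: pure luck around the common mean.
    n_shots = rng.poisson(minutes * SHOTS_PER_MIN, size=active.size)
    n_goals = rng.binomial(n_shots, TRUE_SH)
    shots[active] += n_shots
    goals[active] += n_goals
    # Cut the bottom 10% of the original population by career Sh%.
    career = goals[active] / np.maximum(shots[active], 1)
    order = np.argsort(career)
    cut_ids = active[order[:N // 10]]
    dropped_sh.append(goals[cut_ids].sum() / shots[cut_ids].sum())
    active = active[order[N // 10:]]

# Observed career Sh% climbs across the bins even though every
# forward's true talent is identical: selection bias, not skill.
print([round(100 * s, 2) for s in dropped_sh])
```

Note the groups cut late in the loop shoot well above 8.28% despite having exactly the same true talent as the groups cut early, which is the whole point of the exercise.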

The most important part of this study is not that the “sim” data fits our NHL data nicely; it’s that as we move across the time bins, shooting percentage increases, despite the fact that our “sim” forwards have a true shooting percentage of exactly league average (8.28%) and our sample variance was all due to luck (i.e., the variance expected from a completely random distribution).

As you can see, the selection pressure we induced artificially by cutting the bottom 10% at each ice-time bin produced an effect in the population that was still “active.” We selected for higher shooting percentages even though no such skill existed. This, as it turns out, is similar to what NHL front offices do. A few notes: I gave all “sim” players exactly the same total TOI for each bin, whereas the forwards in our NHL sample only average a total TOI near each bin’s value. Also, the “sim” forwards generate exactly league-average shot and goal rates, which probably differs from the NHL.

In conclusion, you can see how selecting only a subset of the total population of NHL skaters can create biases. This effect is not limited to on-ice shooting percentage; it applies to every stat NHL front offices use to select players, including Corsi and zone starts. The NHL is not a random system: the population is selected for, and those selection pressures will alter our data. If we are going to analyze individual skaters, the effect must be either mitigated or accounted for.

We’ll switch gears here and discuss methodology in general.

## Methodology

When I refer to “method,” I use it as a basket term for everything one does in a study to come to a conclusion. This includes a) the initial inquiry, b) the study design, c) the method used to collect data, d) analysis of that data, and e) interpretation of the results. Each of these steps carries potential biases. I’ll summarize each stage briefly.

A) The initial inquiry is the question being asked and the reason for collecting data. For example: “What is the average length of attacking-zone time?” Solid research must have a foundational question it is attempting to answer. A common error is to collect a ton of data and then “look” for patterns. Due to random chance, trends will emerge. Furthermore, if no outcome (what your final conclusion will address) is specified, the study becomes a fishing expedition heavily influenced by whatever trends chance happens to produce. Conclusions are then based on interpretation of those trends or of new findings not initially considered, instead of on an initial question.

B) I could talk about study design for pages, but let’s keep it simple. A study that asks a question and then collects data in real time, moving forward, is known as prospective. Drawing conclusions at the end of such a study period provides the best evidence for the initial inquiry. For example, a question for this season might be: “Are injuries reduced by the new divisional schedule?” By asking the question now, we don’t introduce biases from what we may learn about the data in the future. Due to time constraints, data is often examined retrospectively (looking back at existing data), which works well but provides weaker evidence than a prospective study.

C) Data collection is critical, and this is where a lot of bias is introduced. As discussed above, filtering data can lead to selection bias, which can really change the results of a study. It’s best to include as much of the population of interest as possible, and to search diligently for confounders.

D) Data analysis is perhaps the most critical step. The method of analysis must be tailored to the data. If we have a normal distribution, we can often apply standard tools such as the Pearson correlation; if the relationship is linear, we can use regression. The character of the data (distribution, mean, skewness, or pattern) must be confirmed before using these tools.
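As a sketch of that check-the-data-first habit, with synthetic data (the shots-for / scoring-chances interpretation is just an invented example):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic example: a roughly linear relationship with normal noise.
x = rng.normal(50.0, 5.0, size=200)           # e.g. shots-for per game
y = 0.8 * x + rng.normal(0.0, 3.0, size=200)  # e.g. scoring chances

# First, inspect the character of the data: skewness near 0 is one
# quick sanity check that a distribution is plausibly normal.
def skewness(a):
    d = a - a.mean()
    return (d ** 3).mean() / (d ** 2).mean() ** 1.5

print(f"skew(x) = {skewness(x):.2f}, skew(y) = {skewness(y):.2f}")

# Only after confirming the data's character is Pearson r a
# reasonable summary of the (linear) association.
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {r:.2f}")
```

With heavily skewed or non-linear data, the same `corrcoef` call would still return a number; the check is what tells you whether that number means anything.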

These powerful statistical tools are (almost) a must if we want to offer any evidence toward our conclusions. A list of the top 20 or bottom 20 by a stat is in no way statistical proof, nor even suggestive of it. Binning data arbitrarily confers a generous ability to manipulate it toward a conclusion the analyst formed in advance. If you’re interested in analyzing NHL data, learn about these powerful tools (they’re all readily available in Excel as add-ons).

E) Lastly, we interpret our results, applying our findings to real-world concepts. Again we must be careful not to overstate or understate our results, and must again consider the population we selected and the strength of the evidence we collected. Tests of statistical significance (p-value, t-statistic, 95% CI, hazard ratio) are all available to help gauge the strength of our findings, but it is on the researcher to determine the applicability, generalizability, and validity of the study.
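As one concrete example, a two-proportion z-test (built from the normal approximation, stdlib only) gives a p-value for a difference in on-ice Sh% between two groups; the goal and shot totals below are invented for illustration, not taken from the table:

```python
from math import sqrt, erf

def two_prop_z(g1, s1, g2, s2):
    """z-test for a difference between two shooting percentages.
    g = goals, s = shots."""
    p1, p2 = g1 / s1, g2 / s2
    pooled = (g1 + g2) / (s1 + s2)
    se = sqrt(pooled * (1 - pooled) * (1 / s1 + 1 / s2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical: one group at 9.19% on 20,000 shots versus
# another at 6.40% on 2,000 shots.
z, p = two_prop_z(1838, 20000, 128, 2000)
print(f"z = {z:.2f}, p = {p:.5f}")
```

Here the difference is large and the samples are big, so the p-value comes out tiny; with a one-point gap and a few hundred shots, the same test would be far from significant.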

## Notes

1. I used to think there would come a time when the majority of analysis would move beyond this, but it is such a pervasive truism, held by a huge population of fans and analysts alike, that its existence will never go away. There will likely always be those who oppose the existence of shot quality, and those who believe in it.

A substantial amount of evidence is available for the former, and I still haven't seen a study firmly confirming shot quality as a repeatable skill. The one area where one might show shot quality at the NHL level is face-punchers. These players have been selected for an entirely different skill set than the vast majority of the NHL, and if there ever were a population that showed (a lack of) shot quality, it would be these skaters.

## Other Thoughts

1. Be open-minded. It’s easy to see through an author’s biases after a few articles. These will ultimately reduce the quality of your work, because the community knows how biases unconsciously and consciously change your results.

2. If you want a study to carry weight, you must show the percentage of regression to the mean and the correlation with winning. That way we know how the stat compares to other important stats.

3. Although a formal review process is not available, sending work to others who are familiar with the inquiry question can be very helpful. They often uncover errors and additional implications in the work that may have been missed. The liberal transparency of the statistical community is its strongest attribute.

4. Diligently look for errors before posting, and always include a paragraph about what may have caused a false confirmatory finding (if applicable).

(Thanks to Patrick for submitting this article. Follow him on twitter @FTFs_SnarkSD)

Former Nations Overlord. Current FN contributor and curmudgeon. For questions, complaints, criticisms, etc. contact Kent @ kent.wilson@gmail. Follow him on Twitter here.
#1 RexLibris
July 23 2013, 04:26PM

Strong article.

I have been wondering about intrinsic biases in some of the stats that are popular today, such as Corsi, CorsiRel, and so on. Trying to tease out whether there are any structural problems with the data that might trend towards a given conclusion, creating a biased data set, is proving difficult.

Thanks for the article, Patrick. This was a great read.

#2 jvuc
July 23 2013, 08:27PM

Excellent article, and thank you, Patrick, for being one of the few keeping advanced stats true. I'm steadily losing confidence given the abundance of bad analysis out in the blogosphere and on twitter (some of which I'm guilty of from ignorance). Part of the problem, as I see it, is that many do not understand their own biases. And some of us just blindly follow what these other seemingly smart bloggers post because math and fancy graphs.

I wish there were a better peer review process, as you allude to; I'm moving further away from advanced stats and more toward "watching" hockey because of the abundance of "bad stat" bloggers.

#3 Patrick D. (SnarkSD)
July 23 2013, 09:12PM

Thanks for the comments,

There is no doubt that having a peer review process strengthens publications. However, those articles are only as strong as the individuals reviewing them.

The blogosphere is unique in that anyone with any amount of education or experience can write, freely, without an editor. I enjoy that freedom, and enjoy reading the unfiltered writings of many peers. Instead of editors reviewing the work, the onus shifts to the consumer, the readers, to determine validity.

One of the ways we as a statistical community have mediated the work available to the public is through commenting. I've personally done my best to comment frequently, as have many other bloggers, to provide some of that context. While we will always have a substantial issue with confirmation bias, both within the statistical analysis community and its viewership, the truth always seems to settle out in the end.

The key is to keep reading, and keep learning.

#4 RexLibris
July 24 2013, 03:00PM

@Patrick D. (SnarkSD)

The online experience is a fascinating one when the topics blend academic backgrounds with something like sports analysis.

If the bar is set high enough through content and careful moderation, it can attract a community of readers whose comments further raise the standard of discussion. The wrong balance results in a dumbing-down of content and comments, or an adversarial experience during the feedback process.

It is an interesting juxtaposition between the professional/academic review process (one I have seen secondhand but not been a part of directly) and the seemingly more "democratic" process of writing and responding to online comments. The sheer variety of personalities and perspectives one encounters offers a broader spectrum of writer experience and interaction.

The point being, when done well, an online community can, like the tide, raise everybody's level.

#5 Patrick D. (SnarkSD)
July 24 2013, 05:08PM

@RexLibris

Right, I agree with what you're saying.

As an author of these articles, you're pushed to convey information both to people who have a lot of background knowledge in your topic of interest or type of analysis, and to those for whom it is all new. That creates a challenge, but I think it often helps clarify the thoughts in the article further. Additionally, you speak to a much broader audience.

#6 Badger M
July 24 2013, 05:25PM

Why does the author complain about selection bias, when earlier he cites that Shooting Percentage Regression article as "evidence"?

Selecting players who played over 2500 minutes over three seasons (then players who played over 2500 minutes over the next three seasons) seems incredibly arbitrary.

129 players also seems like far too small a sample size.

#7 Patrick D. (SnarkSD)
July 24 2013, 09:14PM

@Badger M

Badger, thanks for reading. I'll try to address each point you bring up individually, for clarity.

> Why does the author complain about selection bias, when earlier he cites that Shooting Percentage Regression article as "evidence"?

I'm presenting selection bias because I don't think it's been widely discussed. It's an observed phenomenon of populations: specifically, how those populations are selected for (in the case of the NHL), or how analysts sample the available data.

Shooting percentage regression is a separate issue from selection bias. Selection bias occurs when a population (or sample of data) is collected, whereas regression is a statistical phenomenon observed in most real-world data (e.g., exam scores regress heavily toward the mean). Selection bias occurs prior to the study, in the data-collection phase; regression is observed in any data with noise, i.e., a component of randomness due to outside influences or measurement error.
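A quick synthetic sketch of that distinction: regression appears even with no selection at all, purely because observations are noisy (all the numbers below are made up):

```python
import numpy as np

rng = np.random.default_rng(5)

# 2,000 hypothetical players with genuinely different true talent,
# observed through two seasons of independent noise.
true_talent = rng.normal(0.08, 0.01, size=2000)
season1 = true_talent + rng.normal(0.0, 0.02, size=2000)
season2 = true_talent + rng.normal(0.0, 0.02, size=2000)

# Take the top 10% by season-1 results and watch them in season 2.
top = np.argsort(season1)[-200:]
print(f"top 200, season 1: {season1[top].mean():.3f}")
print(f"same players, season 2: {season2[top].mean():.3f}")
# Season 2 falls back toward 0.08 even though nobody was removed
# from the population: that's regression, not selection bias.
```

The same group stays (slightly) above average in season 2 because some of their season-1 result was real talent, but most of the apparent edge evaporates because much of it was noise.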

> Selecting players who played over 2500 minutes over three seasons (then players who played over 2500 minutes over the next three seasons) seems incredibly arbitrary.

I'm not entirely sure what you're saying here, but I think it has to do with the way I've binned the data. I agree that putting data into bins can sometimes be a red flag for data manipulation, but as I explained above, I chose those bins based on the decay we see in the NHL population. Every bin represents approximately (nearly exactly) 10% of the original population (941 × 0.1 ≈ 94 players per bin). It wasn't arbitrary.

> 129 players also seems like far too small a sample size.

129 players isn't a great sample, I'll agree with that, but it's what's available to me on BTN. Nevertheless, the primary outcome of the study was to show how selection bias can influence outcomes. I showed this through our "sim" model (a 10,000-player sample). I fixed the "sim" players at an on-ice shooting percentage of 8.28%, let a normal distribution fall out based on NHL shot rates, and cut the bottom 10% at each step. As a result, the players that survive this iteration 10 times show higher percentages than the players cut earlier, even though their true on-ice shooting percentage is exactly the same. This is what selection bias does.

#8 Patrick D. (SnarkSD)
July 24 2013, 09:22PM

I should have also added,

When you want to determine the sample size needed for a study, the statistical concept you need is power, and you can find the formulas for the calculation in various places online.

Power refers to the probability of detecting a real difference between two groups. The above study doesn't imply there is a difference; on the contrary, it suggests there is no real difference at all.

Furthermore, if you plug in those numbers, you'll see that you would need approximately 10,000 players to conclusively show a difference between the 1st and 10th bins, with much larger samples needed between adjacent bins. Collecting that many players simply isn't possible with the data available now.
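For the curious, the standard normal-approximation sample-size formula for comparing two proportions looks like this; the units here are individual shots rather than players, and the two-sided alpha of 0.05 with 80% power are my assumed conventions:

```python
from math import sqrt, ceil

def n_per_group(p1, p2, z_alpha=1.96, z_power=0.84):
    """Approximate per-group sample size for detecting p1 vs p2
    (two-sided alpha = 0.05, 80% power, normal approximation)."""
    p_bar = (p1 + p2) / 2
    num = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
           + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / (p1 - p2) ** 2)

# Bin 1 vs bin 10 (3.92% vs 9.19%): a big gap, a few hundred shots.
print(n_per_group(0.0392, 0.0919))
# League average vs bin 10 (8.28% vs 9.19%): a roughly one-point
# gap needs many thousands of shots per group.
print(n_per_group(0.0828, 0.0919))
```

The required n blows up as the gap between the two proportions shrinks, which is why adjacent bins would need far larger samples than the extremes.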

Comments are closed for this article.