“Ah yes, the two genders” – Analyzing the Statistical Differences in Play Between Women’s and Men’s Basketball Leagues [PART 1: Score My Gender in That Box!]

When I was a young boy, sometime around 2004, my father took me to New Orleans. Growing up in Memphis, it represented a relatively affordable yet adventurous spring break vacation, especially considering the preferred method to reach NOLA from MEM is by train, the folk-famous “City of New Orleans” running along the North-South rail corridor from Chicago all the way down to fabulous NOLA.

I couldn’t help but notice that in the areas near the Superdome, there were a lot more people walking the streets wearing orange and white. My dad had also noticed that a lot of butch women were wearing orange and white, something I was oblivious to at the time.

It turns out that we had accidentally booked our trip for the same time as the NCAA Women’s Final Four. Tennessee, at the time, was coached by the legendary Pat Summitt. Coach Summitt was already well into an all-time great career by this point, a career that would see her teams win 8 National Championships and appear in 18 Final Fours. At the time of her retirement, she had the most recorded wins of any college basketball coach, women’s or men’s, and left a legacy as one of the greatest basketball coaches of all time.

I was intrigued. We first figured out something was up when, while walking through a mall connected to an adjacent hotel/convention center, a suited figure from the UT Alumni association noticed our clothing and directed us into a ballroom in said adjacent hotel. They had assumed that we were there… for the pep rally.

And so one thing led to another, and we ended up purchasing some tickets from a fan whose team had been eliminated in the semifinals. We ended up sitting next to two older women in that little New Orleans Arena right next to the Superdome, and I took in the first in-person college basketball experience of my life.

So, if you know anything about how the next 15+ years of my life unfolded after that, you probably have lots of questions. And I have some, but not all, of those answers.

But this post isn’t about me, really. It’s about something my dad said, while we were talking with those two older women, about the game of basketball.

He said something to the effect of “I think I like the women’s game better actually. It’s more about finesse and fundamentals”.

That stuck with me for a while, a memory filed carefully into a little filing drawer somewhere in the back of “childhood gender memories”, and it wouldn’t be reviewed until a lovely conversation I had with my partner about the differences between the women’s and men’s basketball games.

The Point

So what is the point I’m trying to go after here? I want to quantify what those words mean. I want to measure, analyze, break down, and understand every aspect of that statement, from the cultural implications which appear screaming in the mind of anyone who’s ever “experienced gender in some way”, to the statistical implications of individual games.

This series will do that. I will be diving, in progressively deeper levels, into just what it is that makes us watch these two games and conclude that they are “different”. And since this is a blog focused on analytics, I’ll be writing about that side of it here. Thousands upon thousands of pages of ink have been thoughtfully put to paper on the other elephant in this room (the cultural aspects of why we see these games differently), and I will not flatter myself with the thought that I could add a single milligram of new perspective or insight in that direction.

Oh no. I’ll be staying in the numbers lane.

Part 1 TL;DR

The TL;DR of today’s post is this:

Box score analysis shows a clear statistical difference between the women’s and men’s games at the professional level.
This difference is unlikely to be caused by:
- Different measures or statistics compiled for one game but not the other
- Missing or incomplete data
- Differences in statistical variance within each box score category (these are controlled for)
- Differences arising from the unequal lengths of a single game (also controlled for)
- Differences due to an imbalance between the number of games played in a women’s league and in a men’s league (also controlled for)
The difference makes itself most apparent in several correlated variables that are difficult to separate from an overall “pace of play” hidden variable that is not directly accessible from a box score
- A variable which we will tackle in a future edition of this series
The difference is highly significant (p-value less than 0.01)
Outliers seem to be more significant in the men’s game than in the women’s game.

Code

This post has code! You can find it and instructions to replicate here:

basketball-gender on GitHub

Analytical Method and Background

We will be using a combination of several moderately sophisticated techniques today to examine this question.

The core of the analysis is a technique I originally learned from a paper by Rabanser et al. Simply put, this technique tackles the general “data drift” problem with a two-step process: first reduce the dimensions of the source data down to a single variable (tuned to maximize separation between the two sets), then perform a two-sample test on the resulting single-variable samples.

image credit Rabanser et al.

More specifically, the reduction will be via a binary classifier trained on set membership labels (i.e., whether a given row comes from set 1 or set 2). The two-sample test used will be Kolmogorov-Smirnov.
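To make the two steps concrete, here is a minimal standard-library sketch. It is not the code from the repository: a tiny hand-rolled logistic regression stands in for the real classifier, the two "leagues" are synthetic Gaussian rows, and every name (`train_logistic`, `ks_statistic`, `make_rows`) is my own invention for illustration. In practice `scipy.stats.ks_2samp` does the K-S step and also returns a p-value.

```python
import math
import random

def predict(weights, bias, row):
    """Probability that a row belongs to set 1 under a logistic model."""
    z = bias + sum(w * v for w, v in zip(weights, row))
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(rows, labels, epochs=200, lr=0.1):
    """Fit logistic regression with plain batch gradient descent."""
    n = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for row, label in zip(rows, labels):
            err = predict(weights, bias, row) - label
            grad_b += err
            for j, v in enumerate(row):
                grad_w[j] += err * v
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def ks_statistic(a, b):
    """Two-sample K-S statistic: the largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        v = min(a[i], b[j])
        while i < len(a) and a[i] <= v:
            i += 1
        while j < len(b) and b[j] <= v:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap

def make_rows(rng, n, shift):
    """Synthetic 3-feature rows; only the first feature is shifted."""
    return [[rng.gauss(shift, 1.0), rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
            for _ in range(n)]

rng = random.Random(0)
set_0 = make_rows(rng, 400, shift=0.0)  # stand-in for one league
set_1 = make_rows(rng, 400, shift=1.5)  # stand-in for the other

# Step 1: train on half of each set to predict set membership.
train_rows = set_0[:200] + set_1[:200]
train_labels = [0] * 200 + [1] * 200
weights, bias = train_logistic(train_rows, train_labels)

# Step 2: score the held-out halves and compare the two score samples.
scores_0 = [predict(weights, bias, r) for r in set_0[200:]]
scores_1 = [predict(weights, bias, r) for r in set_1[200:]]
drift_stat = ks_statistic(scores_0, scores_1)
print(f"K-S statistic on classifier scores: {drift_stat:.3f}")
```

The statistic lives between 0 (the two score distributions coincide) and 1 (they do not overlap at all), which is why a large value is bad news for the null hypothesis that both sets come from the same distribution.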

This technique is relatively elegant to implement in code, and it also has the advantage of providing some level of insight into the drivers of data drift, via permutation feature importance and similar methods.
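For readers who haven't met permutation feature importance, the idea is simple enough to sketch in plain Python: shuffle one column, see how much the classifier's accuracy drops, and repeat for every column. The toy data, the fixed-threshold `toy_model`, and the function names below are all hypothetical; with a real model, `sklearn.inspection.permutation_importance` is the library route.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(1 for row, label in zip(rows, labels) if model(row) == label)
    return hits / len(rows)

def permutation_importance(model, rows, labels, n_repeats=10, seed=0):
    """Mean accuracy drop when each column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in rows]
            rng.shuffle(shuffled)
            # Rebuild every row with the shuffled value in this column only.
            permuted = [row[:col] + [value] + row[col + 1:]
                        for row, value in zip(rows, shuffled)]
            drops.append(baseline - accuracy(model, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical toy data: column 0 separates the two sets, column 1 is noise.
rng = random.Random(1)
rows = ([[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(200)] +
        [[rng.gauss(2.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(200)])
labels = [0] * 200 + [1] * 200

# Stand-in "classifier": a fixed threshold on column 0.
def toy_model(row):
    return 1 if row[0] > 1.0 else 0

importances = permutation_importance(toy_model, rows, labels)
print(importances)
```

Because the toy model never looks at column 1, shuffling that column changes nothing and its importance comes out as exactly zero, while column 0 shows a clear accuracy drop.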

This technique is also notably more powerful than traditional statistical methods when it comes to the multiple-comparisons problem: the dimensionality reduction step “selects” a subset of variables on which hypothesis tests can be performed, so there are fewer comparisons (and thus a smaller correction is needed to control type I error).

Datasets

The datasets examined in this analysis will be box scores from the highest levels of women’s and men’s professional basketball: the WNBA and NBA, respectively. For reasons of completeness, games will be pulled from 2017-2021, and only complete box scores will be considered (no missing or incomplete values).

The datasets were collected from the fantastic repository known as Sportsdataverse. Go give them a look.

Pretreatment

The following preprocessing steps are done sequentially:

Per-minute normalization: The WNBA and NBA use different game lengths; a regulation WNBA game is four 10-minute quarters, and a regulation NBA game is four 12-minute quarters. So, we divide by 40 and 48 respectively, for metrics that are counts.
Winsorization: I noticed pretty early on that the variances of the NBA and WNBA box scores were somewhat unequal, which skewed the means of the two samples differently. To control for this, a 60th-percentile Winsorization was applied, until the mean and median differences were about equivalent.
Sampling: To reduce any class imbalance issue, equal samples are taken from the two datasets.
Normalization: Finally, an overall normalization was applied to try to reduce any as-yet-uncontrolled variation from external factors.
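The four steps above can be sketched in a few lines of standard-library Python. This is an illustration under assumptions, not the repository's pipeline: the rebound totals are made up, the Winsorization limits (0.2 and 0.8) are placeholders rather than the setting used in this post, and I am assuming the final normalization is a z-score over the pooled sample.

```python
import random
import statistics

def per_minute(count, game_minutes):
    """Convert a raw count into a per-minute rate."""
    return count / game_minutes

def winsorize(values, lower_q, upper_q):
    """Clip values to the [lower_q, upper_q] quantile band (nearest rank)."""
    ordered = sorted(values)
    last = len(ordered) - 1
    low = ordered[int(lower_q * last)]
    high = ordered[int(upper_q * last)]
    return [min(max(v, low), high) for v in values]

def balanced_sample(a, b, seed=0):
    """Draw equally sized random samples from two datasets."""
    rng = random.Random(seed)
    k = min(len(a), len(b))
    return rng.sample(a, k), rng.sample(b, k)

def zscore(values):
    """Center to mean 0 and scale to unit (population) standard deviation."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    return [(v - mean) / spread for v in values]

# Hypothetical rebound totals per game, one list per league.
wnba_reb = [34, 31, 40, 29, 36, 33, 52]
nba_reb = [44, 41, 47, 39, 50, 43, 45, 62, 42]

# 1. Per-minute normalization (40- and 48-minute regulation games).
wnba = [per_minute(v, 40) for v in wnba_reb]
nba = [per_minute(v, 48) for v in nba_reb]

# 2. Winsorization; the 0.2/0.8 limits are placeholders, not the post's setting.
wnba = winsorize(wnba, 0.2, 0.8)
nba = winsorize(nba, 0.2, 0.8)

# 3. Equal-size sampling to balance the two classes.
wnba, nba = balanced_sample(wnba, nba)

# 4. Overall normalization across the pooled sample (assumed to be a z-score).
pooled = zscore(wnba + nba)
wnba_z, nba_z = pooled[:len(wnba)], pooled[len(wnba):]
print(len(wnba_z), len(nba_z))
```

Order matters here: clipping before sampling means the quantile limits are estimated on all available games, and pooling before the z-score keeps any genuine gap between the leagues intact instead of normalizing it away.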

Before (left) and After (right) Winsorization.
These are the differences of differences between mean (above) and median (below) for men’s and women’s box scores.

Classifier

Today I will be boring and just use XGBoost. It’s a good balance between model interpretability and model power. I did not tune the hyperparameters of the classifier.

Results and Discussion

The accuracy of our classifier is quite good, which bodes ill for the null hypothesis when we eventually run our K-S test. Before we get there, though, I’d like to dive a little bit deeper into the drivers of the apparent difference in box scores and propose a hypothesis to be tested in a future edition of this series.

I couldn’t help but notice that although percentages did not seem terribly important, absolute numbers did have an impact, even after normalization. Consider a hypothesis of the form “The WNBA and NBA are different because WNBA players take lower-percentage shots”. If such a hypothesis held, we would expect the percentages to be important factors in telling a WNBA box score from an NBA one. They are not, so that explanation looks unlikely.

This leads me immediately to the following hypothesis, based on my somewhat amateurish understanding (relative to some in this community, who study basketball quite intensely): the difference may come down to pacing, spacing, and fast breaks. These are hidden variables which are not reflected in ordinary box scores, but they can change the game in a pretty significant way for the viewer, and could also explain the discrepancy we’ve arrived at here.

Of course, it also deserves mention that the number of rebounds seems to be objectively glaring at us. This is a takeaway others have made, including this excellent pre-read by Zoe Vernon.

The immediate next step to be taken in understanding this better is the recursive elimination of variables until accuracy of the classifier weakens. This helps us understand just what statistics are the most crucial in differentiating the two leagues’ style of play.
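The elimination loop itself is easy to sketch. To keep the example dependency-free I have swapped in a nearest-centroid classifier for XGBoost and used the gap between class centroids as a crude importance ranking, so treat this as the shape of the procedure rather than the analysis actually run. The column names, shifts, and the `floor` stopping threshold are all hypothetical, and the features are assumed to be on a common scale already.

```python
import random

def centroid(rows, cols):
    """Per-column mean over the selected columns."""
    return [sum(r[c] for r in rows) / len(rows) for c in cols]

def fit(rows_0, rows_1, cols):
    """Nearest-centroid classifier restricted to the given columns."""
    c0, c1 = centroid(rows_0, cols), centroid(rows_1, cols)

    def model(row):
        d0 = sum((row[c] - m) ** 2 for c, m in zip(cols, c0))
        d1 = sum((row[c] - m) ** 2 for c, m in zip(cols, c1))
        return 1 if d1 < d0 else 0

    # Importance proxy: absolute gap between the two centroids per column.
    gaps = {c: abs(a - b) for c, a, b in zip(cols, c0, c1)}
    return model, gaps

def accuracy(model, rows_0, rows_1):
    """Share of held-out rows assigned to the correct set."""
    hits = sum(model(r) == 0 for r in rows_0) + sum(model(r) == 1 for r in rows_1)
    return hits / (len(rows_0) + len(rows_1))

def recursive_elimination(train, test, names, floor=0.6):
    """Drop the top-ranked column until held-out accuracy falls to `floor`."""
    cols = list(range(len(names)))
    history = []
    while cols:
        model, gaps = fit(train[0], train[1], cols)
        acc = accuracy(model, test[0], test[1])
        history.append(([names[c] for c in cols], acc))
        if acc <= floor:
            break
        cols.remove(max(gaps, key=gaps.get))
    return history

# Hypothetical standardized features; only the first two differ between sets.
names = ["reb", "pts", "fg_pct", "ft_pct"]
shifts = [2.0, 1.0, 0.0, 0.0]
rng = random.Random(0)

def draw(n, shifted):
    return [[rng.gauss(s if shifted else 0.0, 1.0) for s in shifts]
            for _ in range(n)]

train = (draw(300, False), draw(300, True))
test = (draw(300, False), draw(300, True))

history = recursive_elimination(train, test, names)
for kept, acc in history:
    print(f"{acc:.2f}  {kept}")
```

Each row of the printed history shows which columns were still in play and how well the two sets could be told apart with them, which is the question the elimination is meant to answer.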

Doing so does give us some clarity. By removing 6 of the most significant variables, we were able to reduce the K-S test statistic by a decent margin. However, even with those significant variables removed, the test statistic still shows that there is a gap between the two datasets, meaning that further study is needed.

Anecdotally, as someone who’s worked on a few different ML platform teams, a statistic that high is a definite red flag that there is some data drift going on.

It goes without saying, of course, that box scores contain many correlated variables. A recursive elimination approach must therefore sacrifice a proportionally large share of the dataset in order to achieve parity. This justifies applying subject-matter knowledge up front; put simply, we need to find the hidden variables that drive everything shown in a box score, and examine those variables instead.

A Rough Roadmap for Future Work

In no particular order, the following ideas have come to me while preparing this analysis.

Analyzing the play-by-play breakdown
- Does one league shoot earlier in the clock?
- Later?
- Is the number of passes made different?
Shot Selection
- Can the differences be explained once shot selection is controlled for (e.g., comparing midrange jumpers between leagues, correcting for the difference in the 3-point line, and so on)?
Per-possession normalization
- Most immediate followup; this can be done to help control for the pace of play distinction.
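On the per-possession idea: box scores don't record possessions directly, but a common rule of thumb estimates them as FGA − OREB + TOV + 0.44 × FTA. The sketch below uses that approximation with made-up team totals. It is not necessarily the formula a later post in this series will settle on, and the 0.44 free-throw weight is an empirical convention rather than something derived here.

```python
def estimate_possessions(fga, oreb, tov, fta, ft_weight=0.44):
    """Rule-of-thumb possession estimate from box-score totals."""
    return fga - oreb + tov + ft_weight * fta

def per_100_possessions(stat, possessions):
    """Rescale a counting stat to a per-100-possessions rate."""
    return 100.0 * stat / possessions

# Hypothetical single-game team totals (not real data).
game = {"fga": 85, "oreb": 10, "tov": 14, "fta": 25, "reb": 42}

poss = estimate_possessions(game["fga"], game["oreb"], game["tov"], game["fta"])
reb_rate = per_100_possessions(game["reb"], poss)
print(f"possessions ~ {poss:.1f}, rebounds per 100 possessions ~ {reb_rate:.1f}")
```

Dividing by possessions instead of minutes means a fast-paced game and a slow one are compared on equal footing, which is exactly the pace-of-play effect that per-minute normalization cannot remove.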

This is just a sketch, of course. There will doubtlessly be many illuminating conversations we have in the future that refine this roadmap.

But until then, that’s all for now. If you have been, thanks for reading.
