What I'm Up To: Winter Break (and some early March Madness)

Joe
Dec 29, 2018
3 min read

Updated: Jan 6, 2019

Every year, over my three to four weeks of winter break in college, I’ve resolved to do something. Whether it was to learn to program, get ahead in a spring class, or lose 10 pounds, I always had some kind of goal. This year, it’s to build a March Madness bracket.

But this isn’t just any bracket. I’ve decided that I finally have the programming skills and experience I need to really over-engineer this. And while my chops as a data scientist are underdeveloped at best, I figure I’ve got nothing to lose.

So what, exactly, does this project look like?

There are four parts (one for each week of break). First, I need to collect data, including matchups with some custom statistics and the winners of each game. I’ll go into detail about that later in this post. Second, I want to train a machine learning model to predict the outcome of a game based on these custom stats. In essence, I hope to input a matchup, with the custom stats I develop, and output a probability or certainty for the result. The third step is to simulate one or more tournaments many times. I’ve had some experience with Monte Carlo simulation before, and this is a pretty straightforward application of it. Lastly, I’ll use the simulated results to try to optimize my bracket. Instead of striving for a perfect bracket, I’m shooting for the bracket with the highest expected score. Along with the classic bracket challenge, where you can only fill out the bracket at the beginning of the tournament, I want to develop a different bracket for the challenges that allow one change after a certain point in the tournament. The optimization step is far and away the least defined part of this process. But that’s a week four problem, not a week one problem.
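To make the simulation step a little more concrete, here is a minimal sketch of Monte Carlo simulation on a toy four-team bracket. Everything in it is an assumption for illustration: the team names, the strength numbers, and the win_prob function, which stands in for whatever trained model ends up producing the probabilities.

```python
import random
from collections import Counter

# Hypothetical team strengths; a trained model would replace this.
strength = {"A": 4.0, "B": 3.0, "C": 2.0, "D": 1.0}


def win_prob(team1, team2):
    """Stand-in for the model: probability that team1 beats team2."""
    return strength[team1] / (strength[team1] + strength[team2])


def simulate_tournament(bracket, rng):
    """Play one single-elimination tournament and return the champion.

    bracket lists the teams in order, so adjacent pairs meet in round one.
    """
    teams = list(bracket)
    while len(teams) > 1:
        next_round = []
        for i in range(0, len(teams), 2):
            t1, t2 = teams[i], teams[i + 1]
            # Draw a random number to decide the game from the win probability.
            winner = t1 if rng.random() < win_prob(t1, t2) else t2
            next_round.append(winner)
        teams = next_round
    return teams[0]


def champion_frequencies(bracket, n_sims, seed=0):
    """Estimate each team's title chances as a share of simulated wins."""
    rng = random.Random(seed)
    counts = Counter(simulate_tournament(bracket, rng) for _ in range(n_sims))
    return {team: counts[team] / n_sims for team in bracket}


freqs = champion_frequencies(["A", "D", "B", "C"], n_sims=10_000)
print(freqs)
```

The same loop scales to a 64-team field; only the bracket list and the probability function change.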

This week, I focused on downloading all the necessary data. I found a Python library called sportsreference, which accesses the sports-reference API to download data. Once I got the hang of it, using the objects built into the library was pretty straightforward.

For each season, I downloaded a “team” object, which gives me access to a host of information about teams and their schedules. Iterating through it, I created comparative statistics for each team. Most of these follow a similar format; each statistic takes an existing one and compares it to the opponent’s “opponent” statistic. For example, the offensive efficiency statistic divides a team’s offensive rating by the defensive rating (or “opponent offensive rating”) of their opponent. Of note here is that I used the whole season to develop each of these stats; I’m basically assuming that the final average of the team’s statistic is the “true” value. Maybe this is the law of large numbers? Don’t quote me on it.
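As a rough illustration of the offensive efficiency statistic described above, here is a small sketch. The team names, the numbers, and the off_rtg and def_rtg keys are all made up for the example; they are not the field names from the library or from my script.

```python
# Hypothetical season-average ratings (points per 100 possessions), not real data.
season_ratings = {
    "Team A": {"off_rtg": 110.0, "def_rtg": 95.0},
    "Team B": {"off_rtg": 100.0, "def_rtg": 100.0},
}


def offensive_efficiency(team, opponent, ratings):
    """Team's offensive rating divided by the opponent's defensive rating."""
    return ratings[team]["off_rtg"] / ratings[opponent]["def_rtg"]


# The statistic depends on the matchup, so each side gets its own value.
a_vs_b = offensive_efficiency("Team A", "Team B", season_ratings)
b_vs_a = offensive_efficiency("Team B", "Team A", season_ratings)
print(a_vs_b, b_vs_a)
```

A value above 1 means the offense scores more efficiently, on average, than that defense usually allows.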

After iterating through all of the teams and their seasons to create a single dataframe with all of these statistics, I iterated back through the schedule (I know, probably not super efficient) to build a training set of matchups. This training set pulls the statistics in terms of “team1” and “team2,” as well as a label for the winner. To ensure I didn’t get a ton of duplicates, I created a list to keep track of which games have been added to the training set and which haven’t.
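The duplicate problem comes up because every game appears on both teams’ schedules. Here is a minimal sketch of one way to handle it, assuming a made-up schedule layout of (date, opponent, won) tuples and a single hypothetical stat per team. It tracks games in a set rather than the list I used, since checking membership in a set stays fast as the number of games grows.

```python
def build_matchups(schedules, team_stats):
    """Build one training row per game, skipping the mirrored copy."""
    seen = set()
    rows = []
    for team, games in schedules.items():
        for date, opponent, team_won in games:
            # The same game gets the same key from either team's schedule.
            key = (date, frozenset((team, opponent)))
            if key in seen:
                continue
            seen.add(key)
            rows.append({
                "team1": team,
                "team2": opponent,
                "team1_stat": team_stats[team],
                "team2_stat": team_stats[opponent],
                "team1_won": int(team_won),  # label: 1 if team1 won, else 0
            })
    return rows


# Hypothetical schedules: each game is listed once per participating team.
schedules = {
    "Team A": [("2018-11-10", "Team B", True), ("2018-11-17", "Team C", False)],
    "Team B": [("2018-11-10", "Team A", False)],
    "Team C": [("2018-11-17", "Team A", True)],
}
team_stats = {"Team A": 1.10, "Team B": 0.98, "Team C": 1.05}

matchups = build_matchups(schedules, team_stats)
print(len(matchups))
```

Four schedule entries collapse into two training rows, one per game actually played.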

I faced some issues with this part. At first, I just wanted to use games from the tournament. But I couldn’t find data for my statistics from before the last couple of years, at least not through the API package. Since I’m not a great programmer, I decided to just compensate by creating a training set from all games, not just NCAA tournament games. While this won’t capture the special flair of the NCAA tournament, it does give me plenty of games to work with when it comes to training the model. Just in case, I saved the tournament-only data to my computer. I’m hoping it’ll work well, but I fear the model will overfit.

After running the script, I ended up with about 11,000 game entries. Next week, I’ll work on picking the best machine learning model for predicting the outcomes of these games. Hopefully, with that, I’ll have a good foundation for the simulation and optimization steps to follow.

If you’d like to see the ins and outs of the script, you can find it here. It’s fairly well commented, so it shouldn’t be too difficult to follow. It took a long time to run, so if you want the data I ended up with, you can find the tournament-only data here and the full data here.
