
What I'm Up To: March Madness Part 2

  • Writer: Joe
  • Jan 6, 2019
  • 5 min read

Since last week, I’ve worked on developing the machine learning part of my March Madness prediction model.

To be completely honest, this model is based on a machine learning tutorial I found online. It’s pretty basic for anybody who actually knows about machine learning, but for the exercise of it (and for other noobs like me), I’ll give an outline of that tutorial’s model, then explain where I made some modifications.

The first thing is to import libraries and data. I used pandas to read last week’s data from the csv file and load it into a dataframe. Pretty straightforward stuff.
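Here’s a rough sketch of that step (the filename below is a placeholder, not necessarily what my actual script uses):

```python
import pandas as pd

# Load last week's game data into a DataFrame.
games = pd.read_csv("games.csv")   # placeholder filename
print(games.shape)                 # roughly 11,000 rows in my case
```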

In the aforementioned tutorial, the author does some brief visualization of the data. I skipped that and went straight to the modeling. The next step is to set the “X” vectors (the inputs) and the “y” values (the outputs). Now, the dataset I’m using has about 11,000 entries, and I don’t think I need all of them, so I used numpy’s random number generator to pick out entries from the data, up to the total number I want to use in the model. This is where I first ran into issues.
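A sketch of what that first pass looked like (the label column “home_win” is a made-up placeholder for whatever my actual output column is):

```python
import numpy as np
import pandas as pd

games = pd.read_csv("games.csv")               # placeholder filename
n_samples = 1000

# Naive approach: draw random row positions WITH replacement.
rows = np.random.randint(0, len(games), size=n_samples)
sample = games.iloc[rows]

X = sample.drop(columns=["home_win"]).values   # inputs
y = sample["home_win"].values                  # outputs
```

That np.random.randint call is exactly where the trouble starts, as the next bit explains.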

A big topic in machine learning is minimizing overfitting. With overfitting, the model fits the particular sample it was trained on rather than the population itself. Now I’m no expert, but I suspect having duplicate data entries encourages overfitting. Herein lies the problem. At first, I didn’t remove rows from the big dataframe after drawing them, so the same game could be selected more than once. With 11,000 entries, this doesn’t seem like a big deal. The chance of drawing a duplicate in 500 or 1,000 draws is still next to zero, right?

WRONG!

If you’ve taken a probability course, you’ve likely seen the famous birthday problem. The essence of the problem is simple: given a certain number of people in a room, what’s the chance that two of them share a birthday? If you’d like to see the full solution to the birthday version of the problem, check out the Wikipedia article on it.

If you want to see how the probability applies to my situation, just keep reading. My version of the problem is nearly identical; the main difference is that the number of “days in the year” is 11,051, not 365.

The easiest way to approach this problem is via the complement, or the probability that there are no duplicates in the sample. On the first draw, the chance of having a duplicate is zero; there’s nothing to be a duplicate of. After that first draw, the probability of a unique draw is 11050/11051, or the number of unpicked samples divided by the total. Likewise, the probability that the next draw is unique, given that we have no duplicates already, is 11049/11051. Multiplying these conditional probabilities together gives the probability that the whole sample is duplicate-free, and one minus that is the probability of at least one duplicate.
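In code, that calculation looks something like this (just the math, nothing specific to my script):

```python
import numpy as np

N = 11_051  # number of games in the big dataframe

def p_duplicate(n, N=N):
    # P(no duplicates in n draws) = (N/N) * ((N-1)/N) * ... * ((N-n+1)/N)
    p_all_unique = np.prod([(N - k) / N for k in range(n)])
    return 1 - p_all_unique

for n in (100, 250, 500, 1000):
    print(n, round(p_duplicate(n), 4))
```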

The best way to visualize these probabilities is in a graph.


[Graph: probability of drawing at least one duplicate vs. number of samples drawn]

On the horizontal axis, we have the number of samples drawn from our big dataframe. On the vertical axis, we have the probability of getting a duplicate. That probability looks like it hits one around 600 samples, which made me think it was an error, but in reality it’s just really close to one. The point here is that I was almost certainly drawing duplicates in my sample. To eliminate that possibility, I just deleted each drawn row from my “big” data frame (the one I drew entries from) and reset its index. That way, the only games that could be drawn were games that hadn’t been chosen yet.
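A sketch of that fix, continuing from the sampling snippet above (again with placeholder names):

```python
import numpy as np
import pandas as pd

big = pd.read_csv("games.csv")   # placeholder filename
picked = []

for _ in range(1000):
    i = np.random.randint(0, len(big))                    # random row position
    picked.append(big.iloc[i])                            # keep the drawn game
    big = big.drop(big.index[i]).reset_index(drop=True)   # remove it and re-index

sample = pd.DataFrame(picked)
```

As an aside, pandas’ built-in DataFrame.sample(n=1000, replace=False) would do the same sampling without replacement in one line.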

After taking my detour into applied probability, I returned to the tutorial for the next step: defining the method for training and validation sets, and choosing my models. Most of this came straight from the tutorial, including the clever list of tuples for names and models. However, I ran into issues using the SVM model, since it’s not equipped to spit out probabilities (which is what I’m after). A quick Google search told me that I could get around this by wrapping the support vector machine in a calibrated classifier (sklearn’s CalibratedClassifierCV). I’ll be honest, I don’t know what this means, but it worked!
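Something along these lines (the specific model lineup here is illustrative, not necessarily the tutorial’s exact list):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# List of (name, model) tuples; wrapping the SVM in CalibratedClassifierCV
# gives it a predict_proba method so it can output probabilities.
models = [
    ("LR", LogisticRegression(max_iter=1000)),
    ("RF", RandomForestClassifier()),
    ("SVM", CalibratedClassifierCV(SVC())),
]
```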

Now it was time to pick a model. The tutorial I drew from used cross validation, so I followed suit. This kind of comes in two parts. The first part is the “k-fold,” which creates an object that splits the data into 10 sets of training and testing data. This is supposed to find the most robust algorithm for the data by testing it a bunch of times instead of just testing it once. After this, we evaluate our results and build a dictionary with the score of each method.
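Roughly like this, picking up the models list and the X and y arrays from the sketches above (the scoring choice is explained in the next paragraph):

```python
from sklearn.model_selection import KFold, cross_val_score

kfold = KFold(n_splits=10, shuffle=True, random_state=7)

# Cross-validate each candidate model and keep its mean score.
scores = {}
for name, model in models:
    cv_results = cross_val_score(model, X, y, cv=kfold, scoring="neg_log_loss")
    scores[name] = cv_results.mean()   # sklearn negates log loss, so higher is better
```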

This was another adjustment I made to the tutorial’s method: opting to use a log loss score instead of an accuracy score. I was first exposed to scoring rules my sophomore year, when a professor (shoutout Dr. Bickel) let me sit in on meetings with his PhD students. One of the students was working on a paper related to scoring rules, and I understood their premise: how do you assign a “score” to probabilistic forecasts (i.e. forecasts that estimate the probabilities of a set of outcomes)? I learned more about them in my Applied Probability class last semester.

When it comes to scoring rules, there’s a quality of certain rules called “strict propriety.” When a scoring rule is strictly proper, the optimal set of reported probabilities is the reporter’s true beliefs. In other words, the reporter maximizes their expected score by reporting what they believe to be the true probabilities. Using a simple binary rule (“correct” or “incorrect”) encourages the reporter to put all the probability on the more likely outcome. Now, the “reporter” in this case (in a loose sense) is my computer, which won’t do that unless I tell it to. Seen through a different lens, using a strictly proper scoring rule gives me a more accurate picture of which model is better, since the score can differentiate between a 70% probability in favor of the winner and a 99% probability in favor of the winner.

There were a lot of strictly proper scoring rules at my disposal. I decided, for this version of the model, to use the logarithmic scoring rule because it’s a local scoring rule. In other words, the score depends only on the probability reported for the outcome that occurred, and not on the other probabilities. This scoring rule also heavily penalizes forecasts that put a large probability on the wrong outcome. This is good for my purposes: if there’s a chance a team wins, my later probabilistic analysis will capture it and use it as part of the decision process. The log loss rule has the added benefit of being built into sklearn’s scoring methods, which makes my life easier.
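To see that penalty in action, here’s a tiny example with sklearn’s log_loss (made-up numbers, just to illustrate the point):

```python
from sklearn.metrics import log_loss

y_true = [1, 1]                            # the favorite won both games
cautious = [[0.3, 0.7], [0.3, 0.7]]        # 70% on the winner each time
reckless = [[0.99, 0.01], [0.01, 0.99]]    # one 99% miss, one 99% hit

print(log_loss(y_true, cautious, labels=[0, 1]))   # ~0.36
print(log_loss(y_true, reckless, labels=[0, 1]))   # ~2.31, dominated by the confident miss
```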

Now, I’m not claiming that this is the best scoring rule to use. If it’s not already incredibly clear, I’m sort of an analytics rookie. It just has the right balance of convenience and persuasiveness.

Once the scores are in, the script picks the model with the max log score and saves it. That way, I can use it in the simulation part of the modeling.
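That last step might look something like this (using joblib to save the model is my assumption about how the saving works; the scores here are the negated log losses from the cross-validation sketch, so taking the max really does pick the best model):

```python
import joblib

best_name = max(scores, key=scores.get)    # highest (least negative) log score
best_model = dict(models)[best_name]

best_model.fit(X, y)                       # refit the winner on the full sample
joblib.dump(best_model, "best_model.pkl")  # load it later with joblib.load
```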

Just like last time, I’ve commented through the code so it should be fairly easy to follow. You can find it here. This week, I’ll be working on simulating the tournament. I also realized I can probably use the Law of Total Probability to evaluate it analytically, so we’ll see what I end up going with. Check back next week to find out!
