What I'm Up To: March Madness part 3
- Joe
- Jan 11, 2019
- 5 min read
This week, I worked on generating some probability data for the tournament. I chose to use the 2018 tournament data as my starting point, and I wanted to calculate the probability of each team getting to each round in the tournament.
My first intuition here was a Monte Carlo simulation. In this case, it just requires a uniform random number generator (which returns a random value between zero and one). If this is below the probability of Team A winning, then Team A wins. Otherwise, Team B wins.
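The weighted coin flip is small enough to sketch in a few lines (an illustrative snippet, not my actual code; the function name and arguments are made up):

```python
import random

def play_game(team_a, team_b, p_a_wins):
    # Uniform draw on [0, 1): team_a wins when the draw falls below its win probability
    return team_a if random.random() < p_a_wins else team_b
```

Any uniform generator works here; `random.random()` is just the handiest one in Python.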
To implement this, I first built a dictionary of games covering every possible matchup. I did this by drawing stats from the season data frame and using the trained model to estimate probabilities. For each round, I created a list of teams in the order you’d see on a bracket. That way, I could create matchups using their indices: team 0 plays team 1, team 2 plays team 3, and so on. Every game is evaluated with our weighted “coin flip,” with a winner emerging and moving on to the next round. The process repeats until a champion is crowned. I stored the results in a dataframe: each time a team reached a round in my simulation, it gained an extra “point” in that round. For example, a team with 50 points in the Final Four got to that round 50 times. Dividing by the total number of simulations gives the estimated probability of a team reaching any given point in the tournament.
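The whole loop can be sketched roughly like this (a simplified stand-in for my code: `win_prob` plays the role of the matchup dictionary, and all the names are invented for illustration):

```python
import random
from collections import defaultdict

def simulate_tournament(teams, win_prob, n_sims=10_000):
    # teams: list in bracket order (length a power of two)
    # win_prob: dict mapping (team_a, team_b) -> probability that team_a wins
    counts = defaultdict(lambda: defaultdict(int))  # counts[round][team] = times reached
    for _ in range(n_sims):
        alive = list(teams)
        rnd = 0
        while len(alive) > 1:
            winners = []
            # Bracket-ordered pairing: team 0 plays team 1, team 2 plays team 3, ...
            for i in range(0, len(alive), 2):
                a, b = alive[i], alive[i + 1]
                winners.append(a if random.random() < win_prob[(a, b)] else b)
            alive = winners
            rnd += 1
            for team in alive:
                counts[rnd][team] += 1  # an extra "point" for reaching this round
    # Divide by the number of simulations to turn counts into probabilities
    return {r: {t: c / n_sims for t, c in by_team.items()} for r, by_team in counts.items()}
```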
There are a bunch of advantages to this. For one, it’s super intuitive. With some minor tweaking (evaluating games with a random number instead of an actual basketball game), you can essentially imagine your computer tracking the games as they happen. Moreover, if I put in the scoring system (including my picks), I could do better than a simple arithmetic mean: I could find the score at any arbitrary percentile. There’s a good chance I’ll rely on simulation in the next step for this reason.
But this also has some drawbacks. First, it’s pretty computationally intensive: generating a random number and evaluating the logic for every game of every simulated tournament adds up. Second, it takes a lot of iterations to converge. If you run the simulation twice with too few iterations, you can get vastly different results. It’s a classic tradeoff when it comes to numerical solutions: they’re easier to set up, but they aren’t as accurate and require significant computing power.
Last week I alluded to an analytical solution using the Law of Total Probability (henceforth referred to as LOTP). It’s difficult to explain this without an example.
The first four teams in my seed list (not the top four seeds, but the order in which they appear on the bracket) are Villanova, Radford, Virginia Tech, and Alabama. In the Round of 64, the probability of Villanova winning is simply the probability that it beats Radford; that matchup is set. But since we don’t know which of VA Tech and Bama is going to win the Round of 64, we can’t just use one of the matchup probabilities. In this case, the probability of Villanova winning the Round of 32 (and advancing to the Sweet Sixteen) is: (probability of VA Tech getting to the Round of 32 * probability of Villanova beating VA Tech) + (probability of Bama getting to Round of 32 * probability of Villanova beating Bama). This logic is contained in the LOTP function.
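With some made-up numbers (the real values come from the trained model, so these are for illustration only), the Round of 32 calculation looks like this:

```python
# Hypothetical model outputs -- not the actual estimated probabilities
p_reach = {"VA Tech": 0.6, "Bama": 0.4}    # P(opponent wins its Round of 64 game)
p_beat = {"VA Tech": 0.85, "Bama": 0.80}   # P(Villanova beats that opponent)

# Law of Total Probability over the two possible opponents
p_nova_advances = sum(p_reach[opp] * p_beat[opp] for opp in p_reach)
# 0.6 * 0.85 + 0.4 * 0.80 = 0.83
```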
It’s clunky to represent this in words (and Wix doesn’t support equations copied and pasted from Pages), but for the Round of 32 the math is pretty straightforward.
Then it gets more complicated. Where the probability of Villanova getting to the Sweet Sixteen, given it got to the Round of 32, involves two possible opponents, the next round (winning the Sweet Sixteen given that it’s there) involves four. Then eight. Then sixteen.
I really didn’t want to do this by hand. So I had to find a pattern.
Because of the way brackets are designed, matchups between two specific teams can only happen at one point. For example, Villanova can only play Virginia Tech or Bama in the Round of 32. For this purpose, VA Tech and Bama are in the same “group”. The next round, that group will grow to four teams, based on which teams can play the winner of Villanova’s group. This continues through the tournament.
Every round has a two-dimensional array. Each row in the array has three entries: the team’s name, the probability of that team reaching that particular round, and a group number. A group is the set of possible opponents a team could face in that round; the teams within it have already played each other off in earlier rounds. In other words, it’s a list of teams that a team will never play again after this point.
In every round, the program evaluates the probability of reaching the next round using an LOTP function. This function picks out the opposing group of teams and includes them in the LOTP calculation. The resulting conditional probability is then multiplied by the probability of reaching the current round, to get an absolute probability (this assumes independence, which may not be entirely justifiable, but the convenience outweighs the risk of over-assuming). The result of the computation is the array for the next round, with the same names, updated probabilities, and new group numbers (reflecting that the groups double in size). All of this logic lives in the bayes_next_round function.
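A stripped-down sketch of the idea (not my original bayes_next_round; the group-pairing convention here is one I made up for illustration: group 2k plays group 2k+1, and winners merge into group k):

```python
def bayes_next_round(rows, win_prob):
    # rows: list of (team, p_reach_this_round, group) in bracket order
    # win_prob: dict mapping (team_a, team_b) -> probability that team_a wins
    next_rows = []
    for team, p_reach, group in rows:
        # The opposing group is this group's partner in the bracket
        opp_group = group + 1 if group % 2 == 0 else group - 1
        opponents = [(t, p) for t, p, g in rows if g == opp_group]
        # LOTP over the possible opponents, then multiply by the probability of
        # being here at all (assuming the two events are independent)
        p_win = sum(p_opp * win_prob[(team, opp)] for opp, p_opp in opponents)
        next_rows.append((team, p_reach * p_win, group // 2))
    return next_rows
```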
I had a lot of issues with this part. At one point, the probabilities were astronomically high numbers, and certainly not less than one. As a gut check, when I finally got numbers that looked plausible, I summed the probabilities in every round; the sum across all teams should equal the number of teams in that round. This held true for every round, so I was good to go!
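That gut check is one line per round. Using the row layout described above (team name, probability, group number), it comes out to something like this (illustrative, not the original code):

```python
def check_round(rows, teams_in_round, tol=1e-6):
    # Across all teams, P(reach this round) should sum to the round's team count
    return abs(sum(p for _, p, _ in rows) - teams_in_round) < tol
```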
The advantages and disadvantages to this way of solving the problem are essentially mirror images to those of the simulation. This computation took less than a second, and I know my probabilities are as accurate as they can be. However, if you asked me to figure out the top quartile of a given bracket’s score, I probably couldn’t tell you.
The next part of the project is optimization. Given the information that I have, how can I maximize my possible score? I mentioned before that I don’t really have a clue on how to make this happen, and little has changed since then. I figure I’ve got about a week to figure it out before winter break comes to an end.
In all likelihood, I’ll probably test various heuristics for picking each matchup (like picking the one with the maximum expected score or picking the team with a higher probability each time). At any rate, you’ll just have to check back in next week.
If you’ve been looking at my previous blog posts, you’ve probably seen all the different code files. I’m in the process of reorganizing them so everything is a bit easier to run (including putting all of the functions and constants into a separate file that I can import) to make the whole thing more streamlined. When I settle on a final way of organizing it, I’ll be sure to update a README doc in the folder with all of the requisite files (which you can find here).