Ten-Minute Talks: Pickup Basketball and Data Science
- Joe
- Oct 15, 2019
- 9 min read
Updated: Oct 17, 2019
So I've been in MIT Sloan's Business Analytics program for a month now. I have a lot of takeaways, but in this post I'm just going to discuss three.
The first takeaway is about pickup basketball. If I'm being honest, I expected the level of pickup basketball here at MIT to be a lot less intense than at UT. I mean, UT is a big state school with tons of people. So obviously, the people who play there would be better! I especially expected that trend to continue with the MBA students, who are typically older and have had more years since their competitive playing days.
Boy was I wrong. To begin, these guys are around 27, so they're in their athletic primes. But they also play insane defense. I mean the kind of defense that makes you scared of dribbling. And beyond that, they're all a lot faster and stronger than I am. Realizing I can't hang with these guys is definitely a disappointment.

To be honest though, it's not the biggest disappointment I've had. A big focus of this program is data science. Don't get me wrong: data science is an extremely relevant and popular field, and that we're diving into it is a big plus of the program. It's not the program's emphasis I'm disappointed in; it's data science itself. Before I launch into this discussion, I want to mention a couple of things. The first is a shoutout to my professor, Dr. Alex Jacquillat, who graciously took some time last week to discuss this topic with me. The conversation cleared up a lot of what's going to show up later in this post. The second qualification is that these claims are still my own, and I am by no means an expert, so you shouldn't blindly trust my opinions. My point of view is based on a single month's worth of knowledge.
With that out of the way, let's talk about my gripe with data science. My background (and main interest) is in Operations Research. The two fields have related but different goals. The goal of data science is to take data and make effective predictions. To come back to the pickup basketball example, a good data scientist would leverage the right data to predict that Sloan MBA students are, in fact, very good at basketball.
Operations Research comes in at a different part of the workflow. Given what we know about these MBAs and their basketball abilities, how should the agent (in this case, me) respond? Should I just play whenever and hope they're not there? Seek out opportunities to challenge them so I can get better? Infiltrate their group chat so I can avoid them like the plague?
In the real world, this problem becomes far more complicated. Let's imagine, for example, that not all of the MBAs were inexplicably good; just a few of them were. Then I would have a "flatter" distribution of skill levels to contend with, and maybe my strategy would change. Instead of avoiding them altogether, I could just avoid the good ones.
The characterizations of data science methods I've seen are obsessed with getting as accurate an answer as possible. As a data scientist, I don't want to hand you a distribution full of uncertainty. Nobody wants their data scientist to be "uncertain." It's just not sexy!
So when I pose a prediction problem ("How much will the MBAs embarrass me on the court?"), the machine learning model takes all the uncertainty around who'll show up and how I'll play and collapses it into a single prediction: "you'll commit 3 turnovers and miss all of your shots." As a prediction of the "average," will this answer be exactly correct?
No! I may turn over the ball 4 times but make a couple shots. Maybe I get matched up with someone who's a (far) below-average defender and I can actually show off my lackluster ball handling a little. Perhaps I miss almost every shot but hit the game winner (it happens surprisingly often). Each outcome has a probability, however small, that it could happen. And how I make my decision should rely on those non-average outcomes and their probabilities more than on the average outcome alone.
Data science isn't well equipped (yet) to give me answers beyond an average.
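To make that concrete, here's a toy sketch (every number and the utility function below are completely made up) of how a decision based only on the average outcome can disagree with one based on the full distribution, assuming for a moment that I somehow had that distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented distribution of "net enjoyment" from showing up to play:
# 70% of the time the MBAs are out in force and I get embarrassed (-2),
# 30% of the time only a few show up and I have a great time (+6).
outcomes = rng.choice([-2.0, 6.0], size=100_000, p=[0.7, 0.3])

print("average outcome:", outcomes.mean())  # about +0.4, so a mean-only rule says "go play"

# An invented risk-averse utility: losses sting more than equal-sized gains feel good.
def utility(x):
    return np.where(x >= 0, x, 2.5 * x)

print("expected utility of playing:", utility(outcomes).mean())  # about -1.7
print("expected utility of staying home:", 0.0)                  # so the distribution says stay home
```

The mean says play; the distribution, paired with how much I hate getting embarrassed, says stay home. The mean alone could never have told me that.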
If you look at some of the most commonly used methods for regression, like linear regression, CART, and random forests, the usual decision rule is to predict a single value. In linear regression, it comes from an equation for a line. In CART, it's based on the average of a bunch of similar points. And in random forests, it's based on the average of a bunch of smaller CART models (to oversimplify).
It's especially with methods like CART that I'm frustrated, because they're primed to return a distribution. For the uninitiated, CART essentially assigns data points to groups, called "leaves." The idea is to make each leaf as homogeneous as possible. At first, you "grow" the tree to create a bunch of leaves, then you "prune" it to consolidate to fewer leaves. This makes your model more robust and less likely to overfit. It also means you'll have a lot of points on any given leaf, which you assume are similar enough (or, in statistical terms, come from the same distribution). The typical practice is to just take the average of all these points, but with so many of them, why not return a distribution instead? It would quantify how certain your model is: a narrow distribution means the points on the leaf are close together, while a flat, wide one would betray a lack of certainty. That tells you far more than the mean alone ever could.
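As a sketch of what I mean (using scikit-learn, with invented data standing in for features like day of week and weather), pulling a leaf's empirical distribution out of a fitted CART model only takes a few lines:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Invented training data: X could be day of week, weather, time of day;
# y is the number of MBAs at the gym that day.
X = rng.random((500, 3))
y = np.round(10 * X[:, 0] + rng.normal(0, 1, 500))

# min_samples_leaf keeps plenty of points on every leaf
tree = DecisionTreeRegressor(min_samples_leaf=30).fit(X, y)

def leaf_distribution(model, X_train, y_train, x_new):
    """All training targets that landed on the same leaf as x_new."""
    train_leaves = model.apply(X_train)              # leaf id for each training point
    new_leaf = model.apply(x_new.reshape(1, -1))[0]  # leaf id for the new point
    return y_train[train_leaves == new_leaf]

x_new = np.array([0.8, 0.2, 0.5])
dist = leaf_distribution(tree, X, y, x_new)

print("point prediction (leaf mean):", tree.predict(x_new.reshape(1, -1))[0])
print("leaf 10th/50th/90th percentiles:", np.percentile(dist, [10, 50, 90]))
```

The point prediction the tree normally returns is just the mean of that leaf's targets; the percentiles are information the model was already sitting on.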
This is the crux of my gripe with data science: as a field so poised to help us make better decisions, it (in its current form) so often restricts us to suboptimal decision-making.
Let's continue with the basketball example. Since this is just a single decision (to play or not to play) given some potential gains (having fun) and potential losses (time cost and getting embarrassed), I'll bring in some principles from Decision Analysis.
Let's start by framing the decision. The MBAs are really good. My uncertainties are which (and how many) of them will show up and how well I'll play. My decision is whether or not to play. Let's just assume I have the prediction from your typical machine learning model: an estimate of how many of them will be at the gym when I get there.
Surely this isn't enough information! What if we got some more information about one of these uncertainties? Maybe I can pay my friend, who claims to be a psychic, to tell me whether I'll play well. If the answer is yes, then the outcome where only a few MBA students show up becomes a lot more valuable: I'll play well enough to feel like a rockstar (this is what my ego considers the "best case" scenario).
So if I were to decide how much to pay for this information, I'd need to know the distribution of outcomes for the "Joe's on-the-day ability" uncertainty. If there's a good chance few of them show up, and I'm guaranteed to be on fire when I play, it would counterbalance the probability on the other side where the whole game is just MBA students, and they'll shut me out no matter how well I play (we're considering reasonable upper limits on my ability here). While there's got to be a way to pick that up from the data, my machine learning method only gave me a point estimate.
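Here's a toy version of that calculation (every probability and payoff below is invented), showing why pricing the psychic's tip requires a distribution over scenarios rather than a single point estimate:

```python
# Invented probabilities, assumed independent for simplicity
p_few_mbas = 0.3    # chance only a few MBAs show up
p_play_well = 0.4   # chance I'm on my game

# Invented payoffs for choosing to play, by scenario
payoff_play = {
    ("few", "well"): 8,    # rockstar scenario
    ("few", "poor"): 1,
    ("many", "well"): -1,
    ("many", "poor"): -5,  # total embarrassment
}
payoff_stay_home = 0

def prob(mbas, form):
    p = p_few_mbas if mbas == "few" else 1 - p_few_mbas
    return p * (p_play_well if form == "well" else 1 - p_play_well)

# Without the psychic: one decision made against the whole distribution
ev_play = sum(prob(m, f) * v for (m, f), v in payoff_play.items())
best_without_info = max(ev_play, payoff_stay_home)

# With the psychic: I learn my form first, then decide to play or stay home
def ev_play_given(form):
    return (p_few_mbas * payoff_play[("few", form)]
            + (1 - p_few_mbas) * payoff_play[("many", form)])

best_with_info = (p_play_well * max(ev_play_given("well"), payoff_stay_home)
                  + (1 - p_play_well) * max(ev_play_given("poor"), payoff_stay_home))

print("most I should pay the psychic:", best_with_info - best_without_info)
```

The whole calculation runs on probabilities of non-average scenarios; hand me only the point estimate and there's nothing to compute.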
You're probably rolling your eyes right now. The decision in this case is heavily over-engineered. Yet the same principle applies to many higher-stakes scenarios. Say you're an upstream oil company getting ready to invest tens (hundreds?) of millions of dollars in a well. Or you're NASA, and you need to get a sense of the reliability of one of your parts. Maybe you're a company trying to forecast demand for a new product that requires a lot of up-front investment. Perhaps you're using data to give medical advice to a doctor. Or you're just trying to figure out who to start in your fantasy football league. These are all decisions where you'd want to consider your uncertainty, making a decision on more than just an average estimate.
This principle is actually quite like another principle already understood in data science: interpretability, or the ability of a model to be understood by human reasoning. Interpretability isn't a hard-and-fast rule that matters in every single case. But it certainly matters in some, and occasionally trumps improvements in accuracy.
Now that I've gone on about how important getting the full distribution is, let me soften my argument a bit. See, my basketball dilemma is just one example of how your decision process should influence your model-building process. If you're making a different kind of decision, then you may be OK with (or even prefer) a point estimate.
This is where using data becomes an art, and not just a science.
It's in this process that I see parallels between my background in mechanical engineering and the data science workflow. When it comes to mechanical design, there's a lot to consider. For a simple example, let's talk about a refrigerator's compressor. The compressor takes coolant in, compresses it, and pushes it through the rest of the system so that the system can remove heat from (aka cool) the inside of the fridge. There's some more math (and thermodynamics) involved, but this is the gist.
On the most basic level, the compressor only has to compress to some specified pressure. That's a simple enough goal. Then beyond that, we may care about minimizing its cost. Making a compressor that works and is as cheap as possible is a very scientific, well-defined process on the surface, and should be the same for every kind of refrigerator you can think of (walk-ins, freezers, mini-fridges, etc). But as you think more about design, there are more elements to the refrigerator's design that you have to consider. What about energy efficiency? Does it matter here? Is our design decision to improve our energy efficiency going to affect our cost too much? What about maintainability? One of the first lessons I learned in design was to think about sticking a wrench in whatever you're designing. For a cheap mini-fridge, it's not an issue; you can just buy a new one. But if you're designing a refrigerated truck or an important refrigeration unit for a large space, repairability can be essential, even trading off with core features like cost and the rate of heat removal. In some cases, the noise also matters, while in others it doesn't. So even though the surface level choice of a refrigerator's compressor is both simple and routine, fitting it to particular needs quickly becomes a very involved design problem.
In the same way, modeling isn't just about getting the "right" or "best" answer, because what form that answer takes can change in ways we can't quantify through mean squared error alone. In many cases, the decision is trivial given the prediction. For example, a self-driving car doesn't have much uncertainty to evaluate at a traffic light; if the car thinks it's got a red light, it stops. Further, there are plenty of prediction problems that are already incredibly difficult, and sometimes your choice of model is entirely determined by what actually works. But yes, there are also cases where you need a full distribution to make a decision. The takeaway here is that the design of your modeling process should be more involved than just checking a bunch of R-squared values. The process of making predictions may be defined by data "science," but the art of modeling is better described as engineering.
While this design analogy is kind of abstract, there's another, more directly applicable engineering principle I want to invoke: the gut check. An essential routine when doing engineering calculations is to gut check your answer at the end. If you do four pages of calculations and end up with a negative mass, then you've definitely messed up.
But where we often have an intuitive feel for things with engineering, data science is all about answering questions for which we don't have a convenient baseline answer. So what do you do?
In the case of modeling, you may get different results from different models. Maybe one returns a single value, while another returns a distribution. But these models should expose the same qualitative relationships. In the context of this MBA-tracking example, I shouldn't just look at my uber-rigorous, distribution-returning model. I may also run a quick regularized linear regression (a very standard model) to return a single value prediction.
If my plain and simple regression tells me that rain strongly deters MBAs from playing basketball while my complex and innovative distribution-returning model says it brings them to the gym, then I have a relevant and important contradiction. And while the regression isn't the best way to generate predictions for my decision, its simplicity means that the qualitative understanding I pull from it is often solid.
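A minimal sketch of that gut check (scikit-learn, with invented data) might look like this: fit the simple model and check whether the direction of the rain effect matches what the fancier model implies.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Invented data: did it rain, what day it was, and how many MBAs showed up.
rain = rng.integers(0, 2, 300)          # 1 if it rained that day
day_of_week = rng.integers(0, 7, 300)
mbas_at_gym = 8 - 3 * rain + rng.normal(0, 1, 300)  # invented ground truth: rain keeps them away

X = np.column_stack([rain, day_of_week])
simple_model = Ridge(alpha=1.0).fit(X, mbas_at_gym)

print("rain coefficient:", simple_model.coef_[0])   # strongly negative here

# If my fancy distribution-returning model implies *more* MBAs on rainy days,
# that contradiction is worth chasing down before I trust either model.
```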
So if I had to distill this whole post into a single paragraph (read: my second takeaway), it would go like this.
As people who work with data, we have to look beyond cookie-cutter methods for every problem. The problem we're solving is, overall, a decision problem and not just a prediction problem. While the field has strong roots in math, at the end of the day the data scientist must engineer the modeling process to fit the needs of the decision-maker as closely as possible.
Now if you've stuck with this post this far, you may be wondering what the third takeaway is. And in this case, it's an admission. I have a problem.
I think I'm addicted to Chewy bars.
There's a spot here on campus that has free coffee and some free snacks, and those free snacks include Quaker Chewy bars. Now I don't know who decided to call these "granola" because they are just straight up candy bars. They're literally bits of grain (I think?) bound together by syrup, then finished with chocolate. They taste like America's obsession with giving sugary stuff to kids and calling it healthy (s/o to the American cereal lobby). But either way, they're freaking delicious, and for better or for worse, they've become a part of my daily routine. While I already have a worrying habit, I can only imagine what's coming as the weeks get more stressful. I guess everyone's got a vice.