
Data Science - Function and Form

  • Writer: Joe
  • Nov 14, 2019
  • 8 min read

Since my last post about data science, I’ve had more exposure to the field and time to formulate some thoughts about it. As I’ve seen more examples of data science, a certain underlying structure has become clear to me.

Before I talk about what makes these techniques the same, let me talk about how different they are. Models vary widely. Most broadly, some are used only for regression or classification, while many are used for both. Some are parametric, enforcing a particular structure on your data (the easy example here is linear regression). Some are interpretable, giving clear and easy-to-understand explanations of their classification or regression process (like CART). Some methods employ a single model, while others combine many into an ensemble. Particular methods call for special treatment of the data, like bootstrapping. All of this variation should signal the amazing work that's gone into the field, in both theoretical and applied research.
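One of those special treatments, bootstrapping, is easy to sketch: you draw a "new" dataset of the same size by sampling from the original with replacement. The numbers below are made up for illustration.

```python
# A quick sketch of bootstrapping: build a resampled dataset of the same
# size by drawing from the original with replacement. Ensemble methods
# like bagging fit one model per bootstrap sample.
import random

def bootstrap_sample(data, seed=None):
    rng = random.Random(seed)
    return [rng.choice(data) for _ in data]

data = [4, 8, 15, 16, 23, 42]
sample = bootstrap_sample(data, seed=1)
# Same size as the original; some values repeat, others get left out.
```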

All of this variation is the "form" of the method.

Yet with all these different methods and their many variations, the ML workflow is usually the same. First, you split your data into training and testing sets. Then you feed the training data into the model (telling it what the correct answers are), and it "learns" a model of the underlying processes. Then you check it on the test set to see how it performs. Once you have a finished model, you can take new input data and make a prediction from it. The function of machine learning is honestly pretty simple.
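That workflow can be sketched end to end with a toy model. The 1-nearest-neighbor classifier and the cat/dog numbers below are purely illustrative; any regressor or classifier drops into the same four steps.

```python
# A minimal sketch of the generic ML workflow, using a toy 1-nearest-neighbor
# classifier in plain Python. The model itself is interchangeable; the four
# steps (split, train, test, predict) are the point.
import random

def train_test_split(X, y, test_frac=0.25, seed=0):
    """Shuffle the data and split it into training and testing sets."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(len(idx) * (1 - test_frac))
    train, test = idx[:cut], idx[cut:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test],  [y[i] for i in test])

def fit(X_train, y_train):
    """'Training' a 1-NN model just means memorizing the labeled data."""
    return list(zip(X_train, y_train))

def predict(model, x):
    """Predict the label of the closest memorized training point."""
    return min(model, key=lambda pair: abs(pair[0] - x))[1]

# Step 1: split the labeled data.
X = [1.0, 1.2, 0.9, 5.0, 5.3, 4.8, 1.1, 5.1]
y = ["cat", "cat", "cat", "dog", "dog", "dog", "cat", "dog"]
X_tr, y_tr, X_te, y_te = train_test_split(X, y)

# Step 2: feed the training data (with the correct answers) to the model.
model = fit(X_tr, y_tr)

# Step 3: check performance on the held-out test set.
accuracy = sum(predict(model, x) == t for x, t in zip(X_te, y_te)) / len(X_te)

# Step 4: make a prediction on brand-new input.
print(predict(model, 4.9))  # prints: dog
```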


Function: The Machine Learning Workflow

Machine learning is all about taking a lot of data on a well-documented phenomenon and making a prediction while assuming the underlying relationships haven’t changed. In doing so, it effectively performs a single function: to make predictions based on a lot of data.

In this post, I’m going to elaborate on this structure. I’ll talk about its limitations, give a hypothesis of why the structure is so prevalent, and try to predict the future (and hint at the upside) of data science methods.

Let’s start with why it’s so useful.

Data science takes advantage of the newest asset organizations are adding to their arsenal: data. Data is to prediction what height is to basketball. Even if you don't have a lot of specific basketball skill (analogous to very nuanced knowledge of the underlying processes), height alone can take you a long way on the court (just as data alone can produce some good predictions).

Now this isn’t to say that domain knowledge isn’t important, or that data science is only useful in areas where we don’t have domain knowledge. Many processes are incredibly difficult to understand qualitatively, let alone model quantitatively. Using the data allows us to generate some direct insights without developing our own abstract understandings.

Developing these abstract understandings is often a very difficult task.

When physicists come up with mathematical understandings of the world, they first completely strip it of its complexities. They may assume point masses, or spherical volumes, or isentropic processes for the sake of the computation. Even after these assumptions are made, the process can be incredibly difficult to model with pen-and-paper math. Beyond that, one has to devise an experiment and figure out how to collect data.

With data science, you just have to feed data into the model and see how it performs. There’s some tweaking here and there, but overall it’s a much less involved process than building a theory from the ground up. And while data science doesn’t give you a very deep answer, it usually doesn’t need to. Sometimes, I just want my computer to tell me whether an image is of a dog or a cat.

And that leads into the second reason data science is (rightfully) such a hot topic. While the actual structure (inputs get pumped through a model, which makes a prediction) is incredibly narrow, it is also simple enough to be a basic building block of our decision-making. Whether you're predicting tomorrow's weather to decide on bringing an umbrella, making demand forecasts in a complex market, or processing a tissue image to determine its malignancy, you're using this simple prediction structure more than you realize. With so much data at our fingertips, data science naturally gets applied in many areas. This is the underlying engine of its massive popularity.

Finally, data science, like many NBA rookies, has a lot of “upside.” While it is still pretty raw in its current form, there’s immense potential for it to grow past its limitations. For some of these, this process has already begun.

To explore one of these limitations, let me contrast data science with a different field: engineering. Engineering has many models. We develop models for thermodynamic processes, dynamic processes, material processes, and more. And while some of these are based on experimental data, the majority comes from physical principles established independently of any particular dataset.

Let’s take a very common dynamic model: the suspension in a car. This system contains masses, springs, and dampers, making it a classic mechanical engineering modeling problem. Characterizing this model requires essentially no data; the physical mechanisms at work (Hooke’s law, inertia, gravity, and friction) lay the foundation to fully define the problem abstractly. Further, these principles allow us to do more than extrapolate on past behavior. We can test the response of the system to new impulses with the constitutive mechanisms at hand. We can use our dynamic model to make design decisions that influence the behavior of the system. Both of these can happen without any “new data.” So what does this model have that our data-driven ones do not?
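To make this concrete, here's a minimal sketch of that idea: a single mass on a spring and damper (a quarter-car-style model), simulated purely from Newton's second law, Hooke's law, and viscous damping. The parameter values are illustrative, not taken from any real vehicle; the point is that no training data appears anywhere.

```python
# A sketch of the classic suspension model: a single sprung mass on a spring
# and damper, defined entirely by physical principles. Parameter values below
# are illustrative, not measured from a real car.

m = 300.0    # sprung mass per wheel (kg)
k = 20000.0  # spring stiffness (N/m), from Hooke's law F = -k*x
c = 1500.0   # damping coefficient (N*s/m)

def simulate(x0, v0, dt=0.001, t_end=3.0):
    """Integrate m*x'' = -k*x - c*x' with simple (semi-implicit) Euler steps."""
    x, v = x0, v0
    trajectory = [x]
    for _ in range(int(t_end / dt)):
        a = (-k * x - c * v) / m  # Newton's second law: a = F/m
        v += a * dt
        x += v * dt
        trajectory.append(x)
    return trajectory

# Response to a "new" disturbance: start displaced 5 cm and release.
traj = simulate(x0=0.05, v0=0.0)
# After 3 seconds the damped oscillation has decayed toward equilibrium.
```

Changing `k` or `c` and re-running is exactly the kind of design exploration the post describes: we can predict the response to conditions we have never observed, because the mechanism, not the data, defines the model.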

Causality.

This is the first major boundary of our input-prediction-output workflow. Fundamentally, our prediction models do not use any causal relationships. While strictly following the data makes the models closer to “ground truth,” it makes them far less robust in their ability to test new hypotheses.

While engineering is a classic example of where these kinds of mechanistic understandings lead to models that can exist beyond data, there are other examples. For now, I'll just pay some lip service to System Dynamics, which applies the same dynamic models and control theory to the social sciences. In addition to this field, there's an active area of research in causal inference. While data science hasn't yet figured this problem out, work is under way. Researchers like Judea Pearl, who wrote The Book of Why, have already begun to take a stab at incorporating causality into the mathematics underlying AI. I expect that, as we encourage our models to gain a more holistic and thorough view of the world, causality will become a livelier topic of conversation.

As I mentioned before, there are plenty of cases where causal mechanisms are difficult to outline qualitatively. These are ripe opportunities to fall back on data, but doing so comes at a cost.

In our Analytics Edge class, cases related to predicting recidivism have come up a couple of times. I won't go into exhaustive detail, but I'll deliver one of the punchlines: it's very difficult to develop a model that predicts recidivism at equal rates for black and white Americans. This pattern holds across many very strong prediction methods, even the models with the lowest prediction error.

So what's the issue? Why are these models racist?

The answer is clear when you examine the data. Namely, the outcome in our dataset is not whether an ex-convict actually committed another crime, but whether they were convicted of one. In a system with well-documented and oft-studied racial bias, this dooms methods based on the data to failure, as they begin with a tainted dataset. This is the next drawback of our data-driven methods.

Incorrect Data.

Think back to your last lab class. Labs are pretty straightforward: you calculate some expected outcome based on the physical principles you learned in class, then perform an experiment to see how close you are. And invariably, your theoretical and experimental results are wildly different. This isn’t because the physical principles aren’t true; they’ve been verified plenty of times before both by mathematics and by far more experiments than your lab. What’s flawed is the data you’ve collected.

So why is it that we assume our data is correct with such blind faith?

Learning requires intellectual tunnel vision. To focus on, say, why bagging combined with feature subset selection makes random forests such a good model, we need to make a quick assumption about the quality of our data, namely, that it's good. The development of new models is effectively a learning process, one where we strip away the limitations of real data to innovate in our methods.

But in the grand scheme of things, this is already becoming a fairly weak assumption. Relaxing it means adding "robustness" (an oft-studied data science concept) to your model. Depending on who you ask, this can be accomplished with a variety of methods. If you're a fellow MBAn, you'll know that regularization is basically equivalent to solving a robust optimization problem; if you're not, you can read this paper to convince yourself that it's true. Robustness is a topic that's already being discussed within the data science community.
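This isn't the paper's argument, just a toy picture of what regularization does. In one dimension, least squares with an L2 (ridge) penalty has the closed form w = Σxy / (Σx² + λ), so a larger penalty λ shrinks the fitted slope, making it less sensitive to noise in the data.

```python
# A toy illustration of regularization: 1-D least squares with an L2 (ridge)
# penalty. Minimizing sum((y - w*x)^2) + lam * w^2 over w gives the closed
# form below; larger lam shrinks w toward zero.

def ridge_slope(xs, ys, lam):
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]  # roughly y = 2x plus noise

w_ols = ridge_slope(xs, ys, lam=0.0)   # ordinary least squares: 1.99
w_reg = ridge_slope(xs, ys, lam=10.0)  # regularized: shrunk toward zero
```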

Even the issue of missing data has already gotten some response. Whether the methods are just cross-applications of our typical data science methods or newly developed ones, data imputation is a field that is already being explored.
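For a flavor of the simplest such technique, here is a sketch of mean imputation; the ages below are made up. More sophisticated methods treat the missing entries as a prediction problem in their own right.

```python
# A minimal sketch of mean imputation: fill each missing value (None) in a
# column with the mean of the observed values in that column.

def impute_mean(column):
    observed = [v for v in column if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in column]

ages = [34, None, 29, 41, None, 36]
print(impute_mean(ages))  # prints: [34, 35.0, 29, 41, 35.0, 36]
```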

With progress under way on these more circumstantial limitations, what about the structural one? More than anything else, this structure strictly limits where data science can be applied. Not every problem is as simple as making a prediction. Can data science ever grow beyond the simple thought structure of today?

I think the answer is yes. To justify it, I can point at things that have already begun to eschew the typical prediction structure.

Take clustering. Often, we have some information about a set of objects and we want to sort them into groups, but we don't have an a priori way of classifying them. So we use clustering methods, which essentially group the objects for us.
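Here's a bare-bones sketch of the idea using one-dimensional k-means, a standard clustering method (the data below is made up). Note there's no train/test split and no "correct answers" anywhere: the grouping emerges from the data alone.

```python
# A bare-bones k-means sketch in one dimension: alternately assign each point
# to its nearest center, then move each center to the mean of its cluster.

def kmeans_1d(points, k=2, iters=20):
    centers = points[:k]  # naive initialization: the first k points
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:  # assignment step
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]  # update step
                   for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.5, 0.8, 10.2, 9.8, 10.5, 1.2, 9.9]
centers, clusters = kmeans_1d(points)
# The two centers settle near 1.1 and 10.1, one per natural group.
```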

Then there’s the ability to use these methods to look deeper into a dataset and perhaps find something we missed. Let’s say you run a clinical trial on some patients, and on the whole you don’t see much of an impact. Can that rich data yield more information than “the drug doesn’t work”? My machine learning professor, Dr. Dimitris Bertsimas, would argue that it can (and he published a paper saying so). While this method is certainly data-driven, it’s not in our conventional machine learning style.



When I began writing this blog post, I wasn’t really sure where I was going. A couple weeks, many revisions, and a trip to Texas later, I’m ready to bring it all home.

I believe that data science is paradoxically both overrated and under-utilized. People believe that data science will fundamentally change how the world works. To a degree, it will (more in a future post). But there are still some very strict limitations to the methods we currently have. These mostly stem from the very particular form it tends to take. And while this is an incredibly common structure, it’s hardly representative of all of our decision problems.

So in its current form, machine learning and data science are heavily overrated. But there’s still so much potential that lies in the future. As data becomes available in larger quantities and more places, the methods to extract meaning from it will evolve.

This means different things for different people. For most of us, it probably means pivoting our educational goals toward learning how to apply these methods, at least at a high level. While demand for data science and its practitioners is growing ever faster, the supply is not.

With those dynamics come some challenges for companies as well. Data will revolutionize industries the same way IT has, and these supply-demand dynamics create interesting management problems for companies to solve.

I’ll save all that for next time. For now, I’ve got some machine learning homework to do.


©2018 by Joe Zaghrini. Proudly created with Wix.com
