Uncertainty: Accounting for Known and Unknown Outcomes

Note: for the last few posts, I’ve been exhausting the store-house of prewritten pieces from other websites that I hadn’t yet transferred to this website. I believe all have been posted here now, so let’s return to our regularly scheduled programming.

I’ve had an article saved on Pocket for a few months now with a section highlighted. I don’t often highlight sections of articles because I don’t often keep articles on Pocket; once I’ve read one, I delete it (so highlighting would be superfluous). However, a few months ago I came across an article with a passage that stopped me in my tracks. It was in a rather weeds-y article about the Twitter war (er, strongly worded discussion?) between Nate Silver and Nassim Nicholas Taleb. I won’t get into the details, because they aren’t really necessary for the passage that popped, though I did want to set the context in case anyone clicked through to the link and was confused.

About halfway through the article, the author begins a discussion of uncertainty. In particular, he’s talking about two kinds of uncertainty: aleatory and epistemic. [NOTE: You’re not alone if you had to look up aleatory; I don’t recall ever coming across that word before.] Anyway, here’s the key bit:

Aleatory uncertainty is concerned with the fundamental system (probability of rolling a six on a standard die). Epistemic uncertainty is concerned with the uncertainty of the system (how many sides does a die have? And what is the probability of rolling a six?).
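To make the distinction concrete, here’s a minimal sketch (my own illustration, not from the article). With a fair six-sided die, the system is fully known: the probability of a six is exactly 1/6, and only the roll itself is random (aleatory). With a die whose number of sides is hidden from us, even that probability has to be estimated from observation (epistemic).

```python
import random

random.seed(42)

# Aleatory uncertainty: the system is fully known (a fair six-sided die),
# so the probability of rolling a six is exactly 1/6. The only randomness
# left is in the roll itself.
p_six_known = 1 / 6

# Epistemic uncertainty: we don't know how many sides the die has, so we
# can only estimate the probability of a six from observed rolls.
true_sides = 8  # hidden from the observer in this thought experiment
rolls = [random.randint(1, true_sides) for _ in range(10_000)]
p_six_estimated = sum(r == 6 for r in rolls) / len(rolls)

print(f"known P(six), fair d6:        {p_six_known:.4f}")
print(f"estimated P(six), unknown die: {p_six_estimated:.4f}")  # close to 1/8, not 1/6
```

An observer who wrongly assumes the die has six sides would report far too much confidence in P(six) = 1/6; the model’s error lives in its assumptions about the system, not in the dice.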

How many times have you come across a model that purports to predict the outcome of something, without there being a way to “look under the hood” and see how it came to its conclusions? OK, maybe looking under the hood doesn’t suit your fancy, but I bet you partake in the cultural phenomenon of following who’s “up” or “down” in the forecasts for the 2020 election. Will POTUS be re-elected? Will the other party win? Or what about our friends on the other side of the pond: Brexit!? Will there be a hard Brexit, a soft Brexit, or will they hold another election?!

These are all events where the author of the piece or the creator of the model might not be adequately representing (or disclosing) the amount of epistemic risk inherent in answering the underlying question.

~

So let’s bring this closer to home. You make decisions every day. Some of you might make decisions that affect a larger number of people, but regardless of how many people are impacted, your decisions have effects. When you take in information to make a decision and run it through your internal circuitry (the internal model you have for how your decision will play out), are you accounting for the right kind of uncertainty? Do you think you know all possible outcomes, so the probabilities are “elementary, my dear Watson” (aleatory)? Or is it possible that the answer to whether you should have cereal for breakfast is actually “elephants in the sky” (epistemic)? OK, maybe that example is a bit dramatic and off-beat, but you never know when you’re going to see elephants in the sky while pondering which box of cereal to pull down off the top of the fridge.

What is Data Science?

There’s no question that “data science” is becoming more and more popular. In fact, Booz Allen Hamilton (a consultancy) found:

The term Data Science appeared in the computer science literature throughout the 1960s-1980s. It was not until the late 1990s, however, that the field as we describe it here, began to emerge from the statistics and data mining communities. Data Science was first introduced as an independent discipline in 2001. Since that time, there have been countless articles advancing the discipline, culminating with Data Scientist being declared the sexiest job of the 21st century.

Unsurprisingly, there are countless graduate and undergraduate programs in data science (Harvard, Berkeley, Waterloo, etc.), but what is data science, exactly?

Given that the field is still in its proverbial infancy, there are a number of different perspectives. Booz Allen offers the following in their Field Guide to Data Science from 2015: “Describing Data Science is like trying to describe a sunset — it should be easy, but somehow capturing the words is impossible.”

Pithiness aside, there does seem to be consensus around some of the pertinent themes contained within data science. For instance, a key component is usually “Big Data” (both structured and unstructured). Dovetailing with Big Data, “statistics” is often cited as an important component: an understanding of the science of statistics (hypothesis testing, etc.), the ability to manipulate data, and, almost always, the ability to turn that data into something non-data scientists can understand (i.e., charts, graphs, etc.). The other big component is “programming.” Given the size of the datasets, Excel often isn’t the best option for interacting with the data. As a result, most data scientists need to keep their programming skills up to snuff (oftentimes in more than one language).

What’s a Data Scientist?

Now that we know the three major components of data science are statistics, programming, and data visualization, do you think you could tell data scientists apart from statisticians, programmers, or data visualization experts? It’s a trick question: they’re all data scientists (broadly speaking).

A few years ago, O’Reilly Media conducted research on data scientists:

Why do people use the term “data scientist” to describe all of these professionals?

[…]

We think that terms like “data scientist,” “analytics,” and “big data” are the result of what one might call a “buzzword meat grinder.” The people doing this work used to come from more traditional and established fields: statistics, machine learning, databases, operations research, business intelligence, social or physical sciences, and more. All of those professions have clear expectations about what a practitioner is able to do (and not do), substantial communities, and well-defined educational and career paths, including specializations based on the intersection of available skill sets and market needs. This is not yet true of the new buzzwords. Instead, ambiguity reigns, leading to impaired communication (Grice, 1975) and failures to efficiently match talent to projects.

So… the ambiguity in understanding the meaning of data science stems from a failure to communicate? Classic movie references aside, the research from O’Reilly identified four main “clusters” of data scientists, along with roles within each cluster.

Within these clusters fit some of the components described earlier, plus two additional ones: math/operations research (including things like algorithms and simulations) and business (including things like product development, management, and budgeting). The graphic below demonstrates the T-shaped nature of data scientists: they have depth of expertise in one area and knowledge of other, closely related areas. NOTE: ML is an acronym for machine learning.

NOTE: This post originally appeared on GCconnex.