Tag Archives: Big Data

What is Data Science?

There’s no question that “data science” is becoming more and more popular. In fact, Booz Allen Hamilton (a consultancy) found:

The term Data Science appeared in the computer science literature throughout the 1960s-1980s. It was not until the late 1990s, however, that the field as we describe it here, began to emerge from the statistics and data mining communities. Data Science was first introduced as an independent discipline in 2001. Since that time, there have been countless articles advancing the discipline, culminating with Data Scientist being declared the sexiest job of the 21st century.

Unsurprisingly, there are countless graduate and undergraduate programs in data science (Harvard, Berkeley, Waterloo, etc.), but what is data science, exactly?

Given that the field is still in its proverbial infancy, there are a number of different perspectives. Booz Allen offers the following in their Field Guide to Data Science from 2015: “Describing Data Science is like trying to describe a sunset — it should be easy, but somehow capturing the words is impossible.”

Pithiness aside, there does seem to be consensus around some of the pertinent themes contained within data science. For instance, a key component is usually “Big Data” (both unstructured and structured data). Dovetailing with Big Data, “statistics” is often cited as an important component. In particular, this means an understanding of the science of statistics (hypothesis-testing, etc.), including the ability to manipulate data and, almost always, the ability to turn that data into something that non-data scientists can understand (e.g., charts and graphs). The other big component is “programming.” Given the size of the datasets, Excel often isn’t the best option for interacting with the data. As a result, most data scientists need to have their programming skills up to snuff (often in more than one language).

What’s a Data Scientist?

Now that we know the three major components of data science are statistics, programming, and data visualization, do you think you could distinguish data scientists from statisticians, programmers, or data visualization experts? It’s a trick question: they’re all data scientists (broadly speaking).

A few years ago, O’Reilly Media conducted research on data scientists:

Why do people use the term “data scientist” to describe all of these professionals?

[…]

We think that terms like “data scientist,” “analytics,” and “big data” are the result of what one might call a “buzzword meat grinder.” The people doing this work used to come from more traditional and established fields: statistics, machine learning, databases, operations research, business intelligence, social or physical sciences, and more. All of those professions have clear expectations about what a practitioner is able to do (and not do), substantial communities, and well-defined educational and career paths, including specializations based on the intersection of available skill sets and market needs. This is not yet true of the new buzzwords. Instead, ambiguity reigns, leading to impaired communication (Grice, 1975) and failures to efficiently match talent to projects.

So… the ambiguity in understanding the meaning of data science stems from a failure to communicate? Classic movie references aside, the research from O’Reilly identified four main “clusters” of data scientists (and roles within said “clusters”):

Some of the components described earlier fit within these clusters, along with two additional components: math/operations research (including things like algorithms and simulations) and business (including things like product development, management, and budgeting). The graphic below demonstrates the T-shaped nature of data scientists: they have depth of expertise in one area and knowledge of other closely related areas. NOTE: ML is an acronym for machine learning.

 

NOTE: This post originally appeared on GCconnex.

The Problem with Big Data: Lies, Damn Lies, and Statistics

I’ve used this subtitle in a previous post, and I think it fits the content of this post well enough to be worth using again. I was reading a post from Tim Ferriss the other day and it made me think of statistics. The post is about alternative medicine, but understanding that isn’t entirely necessary for the point I’m making. Here’s some context:

Imagine you catch a cold or get the flu. It’s going to get worse and worse, then better and better until you are back to normal. The severity of symptoms, as is true with many injuries, will probably look something like a bell curve.

The bottom flat line, representing normalcy, is the mean. When are you most likely to try the quackiest shit you can get your hands on? That miracle duck extract Aunt Susie swears by? The crystals your roommate uses to open his heart chakra? Naturally, when your symptoms are the worst and nothing seems to help. This is the very top of the bell curve, at the peak of the roller coaster before you head back down. Naturally heading back down is regression toward the mean.

If you are a fallible human, as we all are, you might misattribute getting better to the duck extract, but it was just coincidental timing.

The body had healed itself, as could be predicted from the bell curve–like timeline of symptoms. Mistaking correlation for causation is very common, even among smart people.

And the important part of the quote [Emphasis Added]:

In the world of “big data,” this mistake will become even more common, particularly if researchers seek to “let the data speak for themselves” rather than test hypotheses.

Spurious connections galore–that’s what the data will say, among other things.  Caveat emptor.

This analogy reminded me of the first time I learned about correlation and causation, in my first psychology class as an undergraduate. It had to do with ice cream, hot summer days, and swimming pools. In fact, here’s a quick summary from Wikipedia:

An example of a spurious relationship can be illuminated by examining a city’s ice cream sales. These sales are highest when the rate of drownings in city swimming pools is highest. To allege that ice cream sales cause drowning, or vice-versa, would be to imply a spurious relationship between the two. In reality, a heat wave may have caused both. The heat wave is an example of a hidden or unseen variable, also known as a confounding variable.

Getting back to what Ferriss was saying near the end of his quote: as “Big Data” grows in popularity (and use), there may be an increased likelihood of making errors in the form of spurious relationships. One way to mitigate this error is education. That is, if the people who are handling Big Data know and understand things like correlation vs. causation and spurious relationships, these errors may be less likely to occur.
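To make the ice cream example concrete, here is a minimal Python sketch. Every number in it (the temperature range, the slopes, the noise levels) is made up for illustration, and `pearson`, `residuals`, and `simulate` are hypothetical helper names, not anything from the sources quoted above. In the simulation, temperature drives both ice cream sales and drownings, and neither of those two affects the other.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def residuals(ys, xs):
    """What is left of ys after subtracting a least-squares line fitted on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]


def simulate(days=2000, seed=0):
    """Return (raw correlation, correlation after removing temperature)."""
    rng = random.Random(seed)
    # Assumed daily temperatures in degrees Celsius (the hidden variable).
    temp = [rng.uniform(10, 35) for _ in range(days)]
    # Both series depend on temperature plus independent noise (assumed values).
    ice_cream = [20 * t + rng.gauss(0, 60) for t in temp]
    drownings = [0.3 * t + rng.gauss(0, 1.5) for t in temp]
    raw = pearson(ice_cream, drownings)
    adjusted = pearson(residuals(ice_cream, temp), residuals(drownings, temp))
    return raw, adjusted


if __name__ == "__main__":
    raw, adjusted = simulate()
    print(f"raw correlation: {raw:.2f}")
    print(f"after accounting for temperature: {adjusted:.2f}")
```

The raw correlation between ice cream sales and drownings comes out strongly positive, even though the code never lets one influence the other. Once the straight-line effect of temperature is removed from both series, the correlation drops to roughly zero, which is what a confounding variable looks like in practice.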

I suppose it’s also possible that some, knowing about these kinds of errors and how little the average person might know when it comes to statistics, could deliberately report misleading statistics. I’d like to think that people aren’t doing this and it just has more to do with confirmation bias.

Regardless, one way to guard against this inaccurate reporting would be to use hypotheses. That is, before you look at the data, make a prediction about what you’ll find in the data. It’s certainly not going to solve all the issues, but it’ll go a long way towards doing so.
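Here is a rough Python sketch of why a prediction made in advance helps. It assumes a made-up dataset of pure noise: one outcome and 200 unrelated candidate variables, each with 50 data points. The function name `count_chance_hits` and all of the defaults are hypothetical, and the 0.28 cutoff is only approximately the correlation that a conventional 5% significance test would flag with 50 points.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def count_chance_hits(n_candidates=200, n_points=50, threshold=0.28, seed=1):
    """Count unrelated random variables that still look 'related' to the outcome."""
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_points)]
    hits = 0
    for _ in range(n_candidates):
        # Each candidate is generated independently of the outcome.
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        if abs(pearson(candidate, outcome)) > threshold:
            hits += 1
    return hits


if __name__ == "__main__":
    print("candidates that pass by luck:", count_chance_hits())
```

With a 5% cutoff, about 1 in 20 unrelated variables is expected to pass by luck alone, so scanning 200 of them will typically turn up a handful of “findings” in data that contains none. Testing the one variable you predicted beforehand gives luck only a single chance instead of 200.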

The Long View Perspective on Big Data and Metrics?

One of the things that I like to write about is perspective. In my opinion, it’s so important to continue to look at things from different angles and assume other viewpoints to understand the many ways that things can interact. A little over a week ago, I came across a series of tweets from Chris Hayes that presented a perspective that I hadn’t considered:

Big Data is certainly something that has captivated the popular press and some might even say rightfully so. Of course, it’s important that we use metrics when making decisions, but is it possible that the pendulum has swung too far to metrics? It’s hard to say. Chris Hayes certainly seems to think so.

I like how he’s compared this to another phenomenon (can we call it a phenomenon?) from history where engineering took the world by storm. To be honest, given my age, and what I know about ‘recent’ history, I don’t know that engineering had as much hoopla as big data has today. Regardless, this perspective, this long view, is something that we all would be better off with: looking at things over a longer time horizon and considering the adage that ‘history repeats itself.’ Maybe there’s something from our recent past that would help us better understand where we are today.

A good example of this might be international relations. If you’re looking for a ‘fictional’ example, may I recommend the movie “Now You See Me”?

How Big Data Can Make Watching Baseball More Fun

I like baseball. I played it all throughout my youth and my years as a teenager. So, not surprisingly, I also like to watch baseball. Watching baseball on TV has come a long way. While baseball was first televised in the 1930s, instant replay didn’t come along until almost 1960. Nowadays, you can’t watch a game without seeing just about every “key play” replayed, from the last double in the gap to the last pitch that was so close to being called a strike. And on that note about strikes, we can now see a makeshift strike zone on the screen next to the batter/catcher.

My post today is a pitch (pardon the pun) about how to improve the viewing experience in the context of that makeshift strike zone, which, on some networks, is called the pitch tracker.

On the pitch tracker, we can see a few things that have happened during the at bat. We can see where each pitch crossed the plate and at what height. We can also see if the pitch was fouled off and if the pitch was a ball. While all of this is great, in my opinion, there is one major flaw: the “strike zone” isn’t universal. That is, as many players will tell you, each umpire has a different “strike zone.” Some umpires like to call a “wider” strike zone, meaning that even when a pitch appears on the screen to be quite a few inches outside of the strike zone, the umpire calls that pitch a strike.

To the casual fan this may be confusing, but to a fan who watches baseball frequently, this may be frustrating. Especially as the game wears on, you might hear the announcer state that the last pitch was called a strike earlier in the game, but now it’s being called a ball. I’d like to eliminate the need for the announcer to tell me this. I’d also like to eliminate the confusion of the fan who sees a pitch that appears outside the strike zone, but is called a strike. How can we do this? Big Data.

Umpires go through a rigorous process before becoming an MLB umpire. As a result, their strike zone will probably be pretty much set in stone by the time they get to umpire their first MLB game. I propose that instead of using the “standard” or traditional strike zone on the screen during the game, networks show us the strike zone of the umpire. So, if an umpire usually calls strikes that appear 6 inches outside, we can see that because that’s the strike zone on the screen. We could even use a rolling average over the umpire’s career, such that only the last 3 seasons are taken into account when creating the strike zone on the screen.
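As a rough sketch of what that could look like, here is some Python that estimates an umpire’s on-screen zone from the pitches they called strikes over the last three seasons. Everything here is an assumption for illustration: the `CalledPitch` record, its field names, the `umpire_zone` function, and the choice to count the current season as one of the three. It also swaps the literal average for a percentile range, since what gets drawn on screen is the edge of the zone rather than its center.

```python
from dataclasses import dataclass


@dataclass
class CalledPitch:
    """Hypothetical record of one pitch the batter did not swing at."""
    season: int          # year the pitch was thrown
    x_inches: float      # horizontal offset from the center of the plate
    z_inches: float      # height above the ground
    called_strike: bool  # True if the umpire called it a strike


def _middle_range(values, coverage):
    """Return the (low, high) values that bound the middle `coverage` share."""
    ordered = sorted(values)
    last = len(ordered) - 1
    tail = (1.0 - coverage) / 2.0
    return ordered[round(tail * last)], ordered[round((1.0 - tail) * last)]


def umpire_zone(pitches, current_season, window=3, coverage=0.9):
    """Estimate one umpire's zone from called strikes in the last `window` seasons.

    Returns a dict of edges in inches, or None if there is no recent data.
    """
    recent = [
        p for p in pitches
        if p.called_strike and current_season - window < p.season <= current_season
    ]
    if not recent:
        return None
    left, right = _middle_range([p.x_inches for p in recent], coverage)
    bottom, top = _middle_range([p.z_inches for p in recent], coverage)
    return {"left": left, "right": right, "bottom": bottom, "top": top}
```

A network could call `umpire_zone(pitches, current_season=2023)` with that umpire’s pitch history before the game and draw the returned box instead of the standard one. The `coverage` setting trims the rare, extreme calls so that one odd strike call doesn’t stretch the whole zone.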

I suggested Big Data as the solution because, of all the sports, baseball is one of the ones with reams of data. Bill James did an excellent job of using data to allow us to better understand the success and failure of players. I think it’s time we use some of that data to make watching baseball just a bit more interesting.

The Habits of Successful Organizations: The Power of Habit, Part 2

In Part 1a, we had an introduction to Duhigg’s book on habits. In yesterday’s post, we looked at some of the highlights and the key points from the first section (on individuals) of the book. In today’s post, we’ll look at the second section of the book and pull out some of the key highlights on successful organizations.

Upon reading the first chapter of this section, I was a bit surprised that there was a story about Michael Phelps, although, in the context of the information on keystone habits, it makes sense. In fact, as with Tony Dungy in yesterday’s post, I was surprised that I’d never heard about Michael Phelps winning a gold medal in the 200m butterfly in the 2008 Olympics without the use of his vision. Duhigg’s retelling of the story is actually quite compelling and helps to illustrate the point of “small wins.”

There’s also a great story about Paul O’Neill, a former Secretary of the Treasury who was also the Chairman and CEO of Alcoa, one of the largest aluminum producers on the planet. When O’Neill took over as the CEO of Alcoa, it was worth $3 billion. When he left, it was worth roughly nine times as much ($27.53 billion). Many folks would be interested to know how he did it. The short answer: safety. O’Neill used this focus on safety to change the culture of the organization (and, by extension, the habits!), which allowed profits to soar.

~

If you’ve ever worked at Starbucks, you know some of the secret ingredients: service with a smile and the LATTE method of handling unpleasant situations. Duhigg explains how becoming a Starbucks employee changed someone’s life by giving them the life skills they hadn’t learned elsewhere. This made me think: why don’t we teach students these kinds of skills in school? This kind of emotional intelligence is just as important as learning about history and science. Some may even argue that it’s more important.

There were three other really compelling stories in this section: one about the King’s Cross fire in the London Underground over 25 years ago, one about issues between nurses and doctors at the Rhode Island Hospital, and one about how Target is able to know when someone’s pregnant before they are. You probably read about the Target story last year and, if you’re old enough, you probably remember the King’s Cross fire and some of the aftermath that ensued. Reading about the King’s Cross fire was particularly compelling for me because of what I perceived as common rifts that are seen in organizations all the time. The problem with the rifts among the workers at King’s Cross was that they cost people their lives. The story of the Rhode Island Hospital ran in a similar vein in that it *potentially* cost someone their life because of the rift between the nurses and the doctors.

Some of these stories of tragedy reminded me of the idea I had about treating one’s workforce not as liabilities, but as assets. I wrote about this a couple of days ago with some help from Henry Blodget.

In tomorrow’s post, we’ll look at the habits of societies.