In this piece, we'll get down to the core of what data means, how we can extract knowledge from data, and how to use it as a savvy businessperson, researcher, or analyst. You won't need any preparation, but there will be some code available for aspiring programmers to use and modify. You will walk away from this page with a better understanding of what your data needs or doesn't need, and how to use it most effectively.
Part 1: Information vs Data
Webster's online dictionary defines data as factual measurements or statistics used as a basis for reasoning, deliberation, or calculation. We want to pay specific attention to the word "basis" here, as it highlights the fact that data can be manipulated quite easily. If we don't want to be susceptible to this manipulation, then we should make sure we turn that data into information. Information's dictionary entry is "knowledge obtained from investigation, study, or instruction," and that is really what's at play when we're making smart, data-driven decisions:
Fallacies and Manipulation
The first step in being a conscious member of the data community is to identify and refute bad data appropriately. There are many resources for the many, varied ways to manipulate data, but we'll look at only the most common and most relevant ones here.
Confirmation Bias
The biggest consequence (good or bad) of the information age is that data is available at the click of a mouse or a wake word in almost everybody's home. It certainly is good for us to have access to more, but it means we now have to be aware of confirmation bias as we interpret this data. Even if you're looking for the answer to a question you don't have an opinion on, slight differences in phrasing will affect the content of your results. Take coffee and acne, for example: in a simple Google search we see very different results based on the choice of "cause" vs. "cure":
(Search results for "cure" and "cause" shown side by side.)
We see that even if you're not trying to put bias into your search, your results can be skewed. This issue can be mitigated by looking at aggregates of data instead of single sources. Relying on single studies is also what makes way for "pop science," where a study finds that X causes/cures Y, where both X and Y are common conditions. We see more of how these unfortunate statistical anomalies become daytime news stories here.
Data Dredging
Data dredging is what happens when a single dataset offers too many variables or too many opportunities (intentional or not) for correlation. There is a famous example of this: FiveThirtyEight reported on the field of nutrition, where the available data already has issues. They noted how nutrition studies generate "huge data sets with many, many variables," which are the most susceptible to data dredging. The article highlights a few particularly remarkable relationships found in the respondent data:
You'll (hopefully) not be surprised to know that there is no real-life correlation between cabbage and innie bellybuttons, or between tomatoes and Judaism. This is, as the article explains, a function of over 27,000 regressions being run on fewer than 60 complete responses, with an expected 5% false-positive rate. Why 5%? Well, that brings us to the next topic: p-hacking.
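Before we get there, it's worth seeing how cheaply data dredging produces results like these. Below is a minimal simulation of the effect, not the article's actual survey: it assumes numpy and scipy are installed, and every variable name is made up. We generate purely random "survey" answers, run thousands of pairwise correlations, and count the false positives that fall out anyway.
# A minimal sketch of data dredging: purely random "survey" columns,
# many pairwise correlations, and the false positives that appear anyway.
# Assumes numpy and scipy are available; all names here are illustrative.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
num_respondents = 54     # roughly "fewer than 60 complete responses"
num_questions = 100      # 100 columns -> 4,950 pairwise tests

answers = rng.normal(size=(num_respondents, num_questions))

false_positives = 0
num_tests = 0
for i in range(num_questions):
    for j in range(i + 1, num_questions):
        _, p_value = pearsonr(answers[:, i], answers[:, j])
        num_tests += 1
        if p_value < 0.05:
            false_positives += 1

print("tests run:", num_tests)
print("'significant' results from pure noise:", false_positives)
print("rate:", round(false_positives / num_tests, 3))  # hovers around 0.05
Even though every column is noise, roughly one test in twenty will look "significant," which is exactly the trap the nutrition survey fell into at scale.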
P-Hacking and Cherry Picking
You might have noticed that the far-right column in the picture above has a "p-value" label. This is the value statisticians use to measure the risk of a false positive. In general, studies with a p-value of less than 0.05, or 5%, can be published. That is a hard bar to clear in some situations and absurdly easy in others. Linked in the above article is an interactive piece by FiveThirtyEight that lets you see p-hacking in action. This one comes with the bonus of letting you lean into confirmation bias as well, as it contains economic and political data from the last ~70 years. You get to cherry-pick the types of political data to use and the ways to measure a "good" economy, and the right panel shows you how significant your results are. It should go without saying that if you cherry-pick enough, you can prove any conclusion you want to.
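You don't need the interactive to see the mechanics. Here is a rough sketch of specification-shopping in code, with random data standing in for FiveThirtyEight's real economic and political series; the metric and grouping names below are invented for illustration. The idea: test every combination of "economy" metric and "party in power" definition, then report only the combinations that clear p < 0.05.
# A rough sketch of p-hacking by specification-shopping: random "economy"
# metrics and random "party in power" labels, tested in every combination.
# Assumes numpy and scipy are available; names are illustrative only.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
years = 70
metrics = {name: rng.normal(size=years) for name in
           ["gdp_growth", "employment", "inflation", "stock_index"]}
groupings = {name: rng.integers(0, 2, size=years).astype(bool) for name in
             ["president_party", "senate_party", "house_party"]}

hits = []
for metric_name, metric in metrics.items():
    for group_name, in_power in groupings.items():
        _, p_value = ttest_ind(metric[in_power], metric[~in_power])
        if p_value < 0.05:
            hits.append((metric_name, group_name, round(p_value, 3)))

print("specifications tried:", len(metrics) * len(groupings))
print("'publishable' combinations found by chance:", len(hits))
print(hits)
Run it a few times with different seeds and you will usually find at least one "publishable" pairing, despite there being nothing to find.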
This phenomenon can happen at any stage in the transfer of information, from data sources to scientists to reporters to consumers, and it can be intentional or unintentional. We see it taken to a comedic level in XKCD's exploration of jelly beans:
What you should take away here is that manipulated data is not meaningful, and I would argue it is not even information, as it does not represent knowledge in any real way. Let's not be too cynical, though: there is meaningful information in the world, and we want to separate it from this darker part of the data sphere.
Meaningful Information
Here we have information, not misleading or malicious, and want to determine what to do with it. So then, what makes your information worth producing and using?
"Useful" Information
When we say "useful", what we want is SMART. SMART is an acronym that is generally used for goal-setting, and it stands for:
- Specific: This information should be scoped to answer the question(s) you have, as information too broad risks being less applicable to the situation
- Measurable: The information should come with numbers you can act on, not just directions; this is the subject of the next part
- Attainable: Similar to specificity, information too specific might be too difficult to build a solution for, or not worth the development time
- Relevant: If your information is not about your subject, it won't be a good basis for solutions
- Timely: While data being too old doesn't mean that it's wrong, it makes the correlation to today weaker and harder to build for. Likewise, a projection too far in the future might not be appropriate for your current project or plans.
These are certainly the criteria you want in goal-setting, but we also want them in information, and that's no coincidence (p < 0.05). We want information that supports the goals we have when using it. Otherwise, information is as good as trivia, which is not necessarily bad, but is useless by definition.
"Strong" Information
While this sounds like a dataset for powerlifting competitions, it reflects the next stage in data-driven decision making that we need to address: what do we do with the data?
Consider the following information:
Men 18-25 years old in the last 6 months were more likely to purchase jeans than men aged 25-40.
What does a company that produces jeans do with this information? Probably increase brand marketing to men who are 18-25, but by how much? The information makes a distinction (more likely vs. less likely), not a measurement (X times more likely). If the information were instead:
Men 18-25 years old in the last 6 months were 10 times more likely to purchase jeans than men aged 25-40.
Ad agencies would be jumping out of their seats to start hiring male actors between 18 and 25! This touches on how useful information can be and, as a bonus, is generally a good way to tell how well the survey question behind it was designed.
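As a toy illustration of the difference (the purchase and population counts below are entirely made up), the weak statement only supports a comparison, while the strong one produces a number you can actually plug into a marketing budget:
# Made-up numbers to show why "10 times more likely" is actionable
# in a way that "more likely" is not.
purchases = {"18-25": 500, "25-40": 50}            # jeans purchases in the period
population = {"18-25": 10000, "25-40": 10000}      # shoppers surveyed per group

rate_young = purchases["18-25"] / population["18-25"]
rate_older = purchases["25-40"] / population["25-40"]

print("weak form:   18-25 buys more jeans than 25-40 ->", rate_young > rate_older)
print("strong form: 18-25 is {:.0f}x more likely to buy".format(rate_young / rate_older))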
Datasets
On our journey to data-enlightenment, we do need to start somewhere. I've enjoyed sharing comics and other cherry-picked factoids for you all, but let's now practice what we preach.
Covid-19 data
The ongoing pandemic has had an enormous impact on the world, as I'm sure you have already noticed. The ways it has had an impact are measurable, both on the human body and on human society, and many high-stakes decisions depend on understanding this data. There is an abundance of sources online, including repositories like Kaggle and Google Research and official sources like data.gov and the CDC.
Thankfully, there are many data visualizations and interactives, as well as experts scrutinizing and summarizing this data for us all. If you are looking for a starter project or want to practice with some visualization tools, however, this might be a good place to start working.
A particularly well-made example is the ArcGIS dashboard, which aggregates data and lets you see the impact on your local county as well as national and international trends. It does this without leaving room for p-hacking: it doesn't allow inappropriate filtering or manipulation, but it still answers the questions it sets out to address. You can see the trendline for every state, as well as any country, by selecting it in the list on the left. See here:
Census data
Article 1, Section 2 of the U.S. Constitution mandates that every 10 years, on the years ending in 0, there is a nationwide tally of the people who live in the country. There is some historical context here as well, including the exclusion of Indian reservations, which are not really part of the U.S. in many ways, and the three-fifths compromise, which was removed by the 14th Amendment.
So what do we do with this? Well, there has been a tally of people living in the U.S. since its founding, and that data is aggregated and made available to the public. The individual entries are removed to protect people's privacy, but the country is big enough that the aggregates can still be quite granular. A nice repository of this information is available online and is used every day for things like redistricting, allocating federal funds, and business decisions. The last of these is probably the most applicable to us here today, as business decisions feed into the types of conclusions we want to draw from this information, turning knowledge into conclusions.
Flight data
In the following part, we will work with a dataset I have constructed specifically for practicing both high- and low-level decisions, as well as cleaning a dataset and detecting anomalies. This is not a real dataset, as will become clear, but hopefully it is large enough for meaningful analysis and simple enough to interpret, with real-world correlations baked in.
Part 2: Knowledge into Decisions
Flight data, continued
This dataset is made to represent flights leaving SFO over a given time period, with information about the destination airport, the number of seats on the plane, a field called "occupied", and a time gauge. We see the first few lines of the dataset here:
AIRPORT,SEATS,OCCUPIED,TIME
LAX,200,180,WEEKDAY
JFK,200,161,WEEKDAY
LAX,200,186,WEEKDAY
YVR,200,0.8315980689983522,WEEKDAY
LAX,250,227,REDEYE
JFK,250,226,REDEYE
LAX,250,228,REDEYE
YVR,250,0.8663797750002538,REDEYE
⋮
We will access the full file (roughly 1,800 lines) in the code, but all of its contents follow the format above.
Types of Conclusions
What makes a backward-looking conclusion?
These types of conclusions are very broad generalizations and are mostly observations rather than predictions. One common way to categorize the types of insight gained from data analysis is: descriptive, diagnostic, predictive, and prescriptive. These can be thought of as increasing in both specificity and work required:
A backward-looking conclusion is then one of the first two of these and is broader in scope. For example, looking at the flight data, we see that there are twice as many flights to LA as to Vancouver or New York. This is easy to spot in the first few lines, and we can confirm it over the entire dataset.
Python Code:
The code below is vanilla Python 3, a programming language very popular with data analysts. We'll be working out of a text file named flight_data.csv, which holds the raw data you saw above, and eventually storing the cleaned data in another file named cleaned_flight_data.csv, which has the fixed values for Vancouver. See below:
# Count the number of flights to each destination in the raw file.
data_file = open("flight_data.csv", "r")
counts = {"LAX": 0, "YVR": 0, "JFK": 0}
for line in data_file:
    airport = line[:3]          # the first three characters are the airport code
    if airport in counts.keys():
        counts[airport] += 1    # the header line is skipped automatically
data_file.close()
print(counts)
The output of running this code is:
{'LAX': 898, 'YVR': 449, 'JFK': 449}
Other similar backward-looking conclusions might include how plane size relates to destination, how occupancy relates to plane size alone, and so on. Most backward-looking conclusions can be drawn from the data directly, provided you are aware of the potential fallacies and have verified your data.
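As a sketch of one more backward-looking summary, in the same style as the block above, we could compute the average plane size per destination straight from the raw file (the exact averages will depend on the dataset):
# Another backward-looking summary: average plane size (SEATS) per destination.
data_file = open("flight_data.csv", "r")
seat_totals = {"LAX": 0, "YVR": 0, "JFK": 0}
flight_counts = {"LAX": 0, "YVR": 0, "JFK": 0}
for line in data_file:
    fields = line.strip().split(",")
    airport = fields[0]
    if airport in seat_totals:              # skips the header row
        seat_totals[airport] += int(fields[1])
        flight_counts[airport] += 1
data_file.close()

for airport in seat_totals:
    average_seats = seat_totals[airport] / flight_counts[airport]
    print(airport, "average seats:", round(average_seats, 1))
All of this, of course, assumes the data itself can be trusted.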
Speaking of verifying your data: why is the occupied field for the YVR flights a decimal value instead of an integer number of seats, as it is for the LAX and JFK flights? That is something we have to reason about. It would seem that the Canadian flight logs record the percentage of seats occupied rather than the number of seats occupied. Not fixing this would be a barrier to analyzing the data any further, so it should be fixed right away:
from math import floor

data_file = open("flight_data.csv", "r")
cleaned_data = open("cleaned_flight_data.csv", "w")
for line in data_file:
    airport = line[:3]
    if airport == "YVR":
        # Convert the Canadian occupancy percentage into a seat count.
        fields = line.split(",")
        num_seats = int(fields[1])
        occ_percent = float(fields[2])
        occ_count = floor(num_seats * occ_percent)
        line = "{},{},{},{}".format(fields[0], fields[1], occ_count, fields[3])
    cleaned_data.write(line)
data_file.close()
cleaned_data.close()
We see that any backward-looking conclusion is an observation more than it is analysis. It deliberately does not try to make claims about new incoming data or hypotheticals; those belong in the next section.
What makes a forward-looking conclusion?
While it would feel natural to draw a dichotomy here, it would be inappropriate to do so. Forward-looking (or more specific) conclusions depend heavily on backward-looking conclusions to make predictive or prescriptive claims. The difference between those two is that a prescriptive conclusion advocates for a specific next action, the kind we want for making good, data-driven decisions, while a predictive conclusion claims what is likely to happen. The difficulty with prescription is causality (doing one thing would cause another), while a prediction simply acknowledges a connection of some kind. We go back to XKCD to explore another application of bad statistical thinking:
Of course, despite the ironclad logic, this is wrong. The correlation is present, but no amount of predictive analysis establishes a dependence between any of these variables. Professor Cueball then asserts that Toyota's engineers affect the geology of western Europe, which is yet to be proven [citation needed].
Alright, so we have some flight data, but so what? Well, the businesspeople upstairs have some new money that they would like to invest in the airport by financing additional flights to one of these three destinations. They care about getting the most people through the airport to increase sales of day-old Auntie Anne's. Where would you assign this flight? A weekend flight to JFK? A redeye to LAX? We have to prescribe something, so it is worth looking at the data more closely to make the best recommendation.
Let's start with a look at the occupancy rates for all of the specific flights, that is, every combination of airport and time:
# Group occupancy rates by airport/time combination from the cleaned file.
cleaned_data_file = open("cleaned_flight_data.csv", "r")
data_by_airport_time = {}
for line in cleaned_data_file:
    fields = line.strip().split(",")
    airport = fields[0]
    if airport == "AIRPORT":    # skip the header row
        continue
    time = fields[3]
    key = "{}_{}".format(airport, time)
    if key not in data_by_airport_time:
        data_by_airport_time[key] = []
    occupancy_percent = int(fields[2]) / int(fields[1])
    data_by_airport_time[key].append(occupancy_percent)
cleaned_data_file.close()

# Average the occupancy rate for each combination.
rate_by_airport_time = {}
for key, data_arr in data_by_airport_time.items():
    occupancy_average = sum(data_arr) / len(data_arr)
    rate_by_airport_time[key] = occupancy_average

for key, rate in rate_by_airport_time.items():
    print(key, round(rate, 4))
Our output is:
LAX_WEEKDAY 0.8508
JFK_WEEKDAY 0.83
YVR_WEEKDAY 0.8165
LAX_REDEYE 0.8531
JFK_REDEYE 0.8807
YVR_REDEYE 0.8192
LAX_WEEKEND 0.8508
JFK_WEEKEND 0.8307
YVR_WEEKEND 0.8666
There we go! We can now draw a conclusion about which flight would be best to add to the current schedule. From the looks of it, another redeye to JFK would be the best bet, but this relies on the assumption that the additional supply would attract additional demand. If demand for redeye flights is static, meaning that changing the supply has little effect on it, then adding a flight here would do very little. It is also possible that the current flights, even though they are the fullest, have no remaining demand for additional seats. These aren't questions we can answer with this dataset alone, but maybe you could get some answers from somebody else working at the airport.
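If you do accept the naive assumption that a new flight would fill at the same average rate as the existing JFK redeyes, a back-of-the-envelope estimate of the extra passengers is a few lines of arithmetic. The seat count and weekly frequency below are hypothetical planning numbers, not part of the dataset:
# Back-of-the-envelope prescriptive estimate, under the (strong) assumption
# that an added JFK redeye fills at the historical average rate.
# The seat count and weekly frequency are made-up planning numbers.
jfk_redeye_rate = 0.8807      # average occupancy from the analysis above
seats_per_plane = 250         # hypothetical aircraft for the new route
flights_per_week = 7          # hypothetical: one added redeye per night

extra_passengers = jfk_redeye_rate * seats_per_plane * flights_per_week
print("estimated extra passengers per week:", round(extra_passengers))
An estimate like this is only as good as its assumptions, which is exactly why the demand questions above matter before anyone commits money.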
Robot conclusions
Finally, we acknowledge that machine learning and artificial intelligence exist. Sometimes, the data-driven decisions we want to make are not those we make ourselves, but those that we train our robot friends to make for us. There are many considerations for training a machine learning model or informing an AI's guiding parameters, and they are, unfortunately, mostly out of the scope of this piece. It is noteworthy, however, that the same issues that plague human decision-makers are prevalent with machines too. There are also additional artifacts from the mathematical tools that built these machines that cause additional strife, such as local minima during gradient descent and overfitting. These things must all be kept in mind when deciding where to feed the output of your ML model.
Part 3: Takeaways
Data Visualizations
There is a huge elephant in the room here: sometimes data needs to be communicated from the analyst to other people. The analyst may or may not be the one making decisions, and the decision-maker may or may not be data conscious, so how do we bridge that gap? An extensive blog post about making data-driven decisions? NO! Data visualizations, of course! Here we see the power that even some very simple visualizations of our flight data can give us.
matplotlib
While a full treatment of visualization tools is out of scope, matplotlib has some excellent documentation online, and the very simple operations we use below are a good place to start.
import matplotlib.pyplot as plt

# Re-count the flights per destination, this time from the cleaned file.
data_file = open("cleaned_flight_data.csv", "r")
counts = {"LAX": 0, "YVR": 0, "JFK": 0}
for line in data_file:
    airport = line[:3]
    if airport in counts.keys():
        counts[airport] += 1
data_file.close()

# Plot the counts as a simple bar chart.
names = ["LAX", "YVR", "JFK"]
values = [counts["LAX"], counts["YVR"], counts["JFK"]]
plt.ylabel('#flights')
plt.suptitle('Flights Per Airport')
plt.bar(names, values)
plt.show()
We now see the first block of code visualized, where the doubled number of LAX flights is clear and the scale of the flight counts is visible at a glance.
import matplotlib.pyplot as plt

data_file = open("flight_data.csv", "r")
cleaned_data_file = open("cleaned_flight_data.csv", "r")
raw_data_lines = data_file.readlines()
cleaned_data_lines = cleaned_data_file.readlines()

# Parse the occupancy data from both the raw and cleaned files.
data_by_airport_time_raw = {}
data_by_airport_time_clean = {}
for index in range(1, len(raw_data_lines)):           # start at 1 to skip the header
    raw_line = raw_data_lines[index][:-1].split(",")
    clean_line = cleaned_data_lines[index][:-1].split(",")
    airport = raw_line[0]
    time = raw_line[3]
    key = "{}_{}".format(airport, time)
    if key not in data_by_airport_time_raw.keys():
        data_by_airport_time_raw[key] = []
        data_by_airport_time_clean[key] = []
    raw_occupancy = float(raw_line[2])                # seat counts for LAX/JFK, rates for YVR
    clean_occupancy = int(clean_line[2]) / int(clean_line[1])
    data_by_airport_time_raw[key].append(raw_occupancy)
    data_by_airport_time_clean[key].append(clean_occupancy)
data_file.close()
cleaned_data_file.close()

# Average and plot the raw values to show the data problem.
rate_by_airport_time_raw = {}
for key, data_arr in data_by_airport_time_raw.items():
    occupancy_average = sum(data_arr) / len(data_arr)
    rate_by_airport_time_raw[key] = occupancy_average

plt.figure(figsize=(17, 3))
names = rate_by_airport_time_raw.keys()
values = rate_by_airport_time_raw.values()
plt.ylabel('occupancy count')
plt.suptitle('Raw Airport Occupancy Rate')
plt.bar(names, values)
plt.show()

# Average and plot the cleaned values for the real comparison.
rate_by_airport_time_clean = {}
for key, data_arr in data_by_airport_time_clean.items():
    occupancy_average = sum(data_arr) / len(data_arr)
    rate_by_airport_time_clean[key] = occupancy_average

plt.figure(figsize=(17, 3))
names = rate_by_airport_time_clean.keys()
values = rate_by_airport_time_clean.values()
plt.ylabel('occupancy percent')
plt.suptitle('Fixed Airport Occupancy Percentage')
plt.bar(names, values, align='center')
plt.show()
We see the issue with the original dataset very clearly here, and the conclusion from Part 2 still holds (the JFK redeye has the highest occupancy percentage), but we also see how small the percentage differences really are.
There are many ways to communicate the conclusions you have gathered, and many ways those conclusions can be presented. We want to make sure that in our effort to convince, we're not crossing the line into manipulating the other person. There are some heinous examples of misleading visualizations out there, some of the worst of them posted online and easy to find in aggregator lists. It is worth taking time to see the ways these mislead, as the same mistakes can creep into your own visualizations if you're not careful. I've found this guide from Venngage quite useful in familiarizing myself with the don'ts of data viz.
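As one concrete illustration, using the occupancy rates we computed earlier (this is my own example, not one taken from the linked guide): simply truncating the y-axis turns the small JFK redeye advantage into what looks like a blowout.
import matplotlib.pyplot as plt

# One classic way a chart can mislead: truncating the y-axis.
# The rates below come from the occupancy analysis earlier in the article.
rates = {"LAX_REDEYE": 0.8531, "JFK_REDEYE": 0.8807, "YVR_REDEYE": 0.8192}

fig, (honest, misleading) = plt.subplots(1, 2, figsize=(10, 3))
honest.bar(list(rates.keys()), list(rates.values()))
honest.set_ylim(0, 1)                     # full scale: differences look modest
honest.set_title('Occupancy rate (full axis)')
misleading.bar(list(rates.keys()), list(rates.values()))
misleading.set_ylim(0.81, 0.89)           # truncated scale: same data, inflated difference
misleading.set_title('Occupancy rate (truncated axis)')
plt.show()
The numbers are identical in both panels; only the axis changed.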
And finally, the complete guide, a bit old and extensive but still incredibly relevant as a resource: The Quartz guide to bad data. This piece is HUGE, but if you're looking for something and can't find it anywhere else, it likely lives in the tomes of Quartz.
Resource and Reference Links
I've aggregated the links to most resources throughout the article in order here for reference or for further exploration, which I would highly encourage for any aspiring data scientists or teachers among you.
- https://www.datasciencecentral.com/profiles/blogs/data-fallacies-to-avoid-an-illustrated-collection-of-mistakes
- https://www.merriam-webster.com/dictionary/data
- https://www.merriam-webster.com/dictionary/information
- https://www.google.com/search?q=does+coffee+cause+acne&rlz=1C5CHFA_enUS893US893&oq=does+coffee+cause+acne&aqs=chrome..69i57j0l7.8121j0j7&sourceid=chrome&ie=UTF-8
- https://www.google.com/search?q=does+coffee+cure+acne&rlz=1C5CHFA_enUS893US893&oq=does+coffee+cure+acne&aqs=chrome..69i57.2119j0j7&sourceid=chrome&ie=UTF-8
- https://fivethirtyeight.com/features/you-cant-trust-what-you-read-about-nutrition/
- https://fivethirtyeight.com/features/science-isnt-broken/#part1
- https://projects.fivethirtyeight.com/p-hacking/
- https://www.explainxkcd.com/wiki/index.php/882:_Significant
- https://www.mindtools.com/pages/article/smart-goals.htm
- https://www.census.gov/history/pdf/Article_1_Section_2.pdf
- https://www.census.gov/population/www/censusdata/hiscendata.html
- https://2020census.gov/en/census-data.html
- https://www.scnsoft.com/blog/4-types-of-data-analytics
- https://www.explainxkcd.com/wiki/index.php/687:_Dimensional_Analysis
- https://analythical.com/blog/examples-of-awful-data-visualization
- https://qz.com/572338/the-quartz-guide-to-bad-data/