Lies, Lies and More Data


Data does not lie, but it does need interpretation. There are plenty of ways you can fall victim to misleading conclusions, or worse. We will go through some of the most common ones.

“Just trust the data” is the worst advice you can get. Sure, data doesn’t lie, but it needs interpretation, and that interpretation certainly can, and often does, mislead, misguide, or flat-out lie. From sports to marketing, e-commerce to financial services and, of course, politics, there is no shortage of examples where data is seen to lie.

This blog post is meant for you if you want to avoid being the receiver or creator (yes, creator) of such lies. It is almost certainly relevant to your work and to how you question the world around you, whether you’re a data professional or not.

Let’s begin.

In order to stop data from doing harm, you first need to identify whether it is an honest depiction of reality. Ask whether there is a bias in your data or in your decision-making process. Once you’ve identified that, you can decide how best to react. Here are some of the common lies (‘biases’) to look out for, with potential solutions.

Sampling bias

Have you (un)knowingly considered an unrepresentative selection of a population?

Example: Any survey will almost always be biased in how it collects responses. Online reviews, for example, often come only from people who have had an extreme experience. It is usually after either a terrible or an excellent experience that someone bothers to review a product; someone who had an average experience is far less likely to leave a review at all.

Solution: Consider whether your figures might be unrepresentative. If so, either find a more representative source or put your numbers into perspective by comparing them to other figures. E.g. Swiss Airlines reviews are poor 25% of the time; however, that is based on only 15,000 reviews, compared to the upwards of 3 million people flying with them every year. They are also doing better than most of their competitors, who have a higher ratio of poor reviews.
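A quick sketch of that comparison. The figures below are the hypothetical ones from the example (plus an invented competitor), not real data:

```python
# Hypothetical figures: name -> (poor_reviews, total_reviews, annual_passengers)
airlines = {
    "Swiss Airlines": (3_750, 15_000, 3_000_000),
    "Competitor A": (9_000, 20_000, 2_500_000),
}

for name, (poor, total, passengers) in airlines.items():
    poor_rate = poor / total        # share of reviews that are poor
    coverage = total / passengers   # share of passengers who reviewed at all
    print(f"{name}: {poor_rate:.0%} poor reviews, "
          f"from only {coverage:.1%} of passengers")
```

The raw 25% figure looks alarming until you see how small and self-selected the reviewing population is, and how competitors fare on the same metric.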

Time bias

How might the data change based on the time it was recorded, or as time passes?

Example: Customers’ LTV (LifeTime Value) can look higher the longer ago they were acquired, simply because they’ve had more time to make repeat purchases.

Solution: Make the comparison fair. Depending on what you’re trying to answer, it is common to consider the LTV at a set point in time after the customer was acquired. For example, how much a customer spends in their first 3, 6 or 12 months, rather than how much they have ever spent.
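As a sketch of that windowed comparison (the customer data and the `ltv_within` helper are made up for illustration):

```python
from datetime import date, timedelta

# Hypothetical history: customer -> (acquisition date, [(purchase date, amount), ...])
customers = {
    "alice": (date(2020, 1, 1), [(date(2020, 2, 1), 50), (date(2021, 6, 1), 80)]),
    "bob":   (date(2022, 1, 1), [(date(2022, 2, 1), 60)]),
}

def ltv_within(acquired_on, purchases, months=6):
    """Total spend within `months` months of acquisition (approximated as 30-day months)."""
    cutoff = acquired_on + timedelta(days=30 * months)
    return sum(amount for day, amount in purchases if day < cutoff)

for name, (acquired, purchases) in customers.items():
    all_time = sum(amount for _, amount in purchases)
    print(f"{name}: all-time LTV {all_time}, first-6-months LTV {ltv_within(acquired, purchases)}")
```

On all-time LTV, alice looks more than twice as valuable as bob, but only because she was acquired two years earlier; within a fixed six-month window the two are comparable.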

Availability bias

Is the data you need for a calculation available at the time and place the decision has to be made?

Example: A bank can predict that people with pets are more likely to buy pet insurance. However, if they do not know whether someone has a pet at the point of cold-calling them, the prediction is not helpful.

Solution: Ensure the patterns you identify are based on data that’s available at the time of the decision. If it is not, could there be a proxy? For the example above, if someone used their credit card at a store with “pet” in the name, they likely have a pet.
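That pet proxy could be as crude as a merchant-name check (the transaction data here is invented for the sketch):

```python
# Hypothetical card transactions: customer -> merchant names
transactions = {
    "alice": ["Migros", "Happy Pet Store", "Shell"],
    "bob": ["Coop", "Starbucks"],
}

def likely_has_pet(merchants):
    """Crude proxy: did they shop anywhere with 'pet' in the merchant name?"""
    return any("pet" in merchant.lower() for merchant in merchants)

leads = [name for name, merchants in transactions.items() if likely_has_pet(merchants)]
print(leads)  # → ['alice']
```

The proxy is noisy (“Petersen’s Bakery” would also match), but unlike the true pet-ownership flag, it is actually available at the moment of the call.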

Confirmation bias

Are we being impartial in the questions we ask and in the range of data we consider and share? We look only at what we search for, and we share the data that confirms our story. Often it happens unknowingly:

Example: Googling whether a headache can lead to future heart conditions will show you the outcomes where this is the case. Sharing that without sharing the full picture or the other possibilities is misleading. (I made that example up, then went and tested it; sure enough, some doctors say it does.)

Sometimes these confirmation biases are done knowingly:

Example: In 2020, the then US president boasted on national television that the US had one of the lowest Covid death rates. However, this figure compared deaths to known cases, rather than to population. The underlying data was not wrong. He did, however, fail to share that the only reason there were so many known cases was that the US had one of the highest rates of testing, something he had previously boasted about. It is a case of purposefully misleading people by showing only what confirms one’s own ideas.
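The denominator trick is easy to reproduce. With made-up numbers (not the actual 2020 figures), the same death toll looks very different depending on what you divide it by:

```python
# Hypothetical figures, chosen only to show how the two metrics diverge
deaths = 200_000
known_cases = 20_000_000     # grows with testing, even if the outbreak doesn't
population = 330_000_000

case_fatality = deaths / known_cases   # the flattering metric: deaths per known case
per_capita = deaths / population       # deaths per resident

print(f"Deaths per known case: {case_fatality:.2%}")   # 1.00%
print(f"Deaths per capita: {per_capita:.3%}")          # 0.061%
```

Test more and `known_cases` rises while `deaths` stays put, so the case-fatality figure falls without a single life being saved.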

Solution: Question everything. Be as impartial as possible in the question raised and consider the antithesis too. Narrate your data story by showing different angles. Even better if the different perspectives can back each other up.

Self-reinforcing cycle

A type of confirmation bias with a more self-exacerbating outcome: the result of our analysis reinforces the world around us, which we then continue to measure. You may well want to predict probable events, but the prediction, and the action taken on it, can amplify a bias. Knowing whether this feedback loop has a positive or negative effect is important.

Example: If a bank raises the interest rate for someone because they are deemed less likely to afford repayments, that person now has to repay more, making them even less likely to afford it than someone on a lower rate. The cycle worsens with every iteration.
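A toy simulation (my own illustration, with invented loan terms) shows how the penalty feeds on itself:

```python
def months_to_repay(principal, annual_rate, monthly_payment):
    """Months needed to clear a loan with fixed payments; None if it never clears."""
    months = 0
    while principal > 0:
        principal += principal * (annual_rate / 12)  # interest accrues each month
        principal -= monthly_payment
        months += 1
        if months > 600:  # the payment no longer even covers the interest
            return None
    return months

# Same loan, same monthly payment; only the risk-based rate differs
print(months_to_repay(10_000, 0.05, 300))  # "low-risk" borrower
print(months_to_repay(10_000, 0.15, 300))  # "high-risk" borrower: in debt for longer
```

Push the rate high enough and the payment stops covering the interest at all: the borrower labelled riskiest becomes genuinely unable to repay, confirming the model’s own prediction.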

Solution: Consider each decision carefully to avoid exacerbating a prejudice, be transparent about any predictive models, and hold each other accountable. If there’s a chance a model is harmful or unexplainable, find another approach. (I appreciate that “be careful” is not a great solution for a scarily under-regulated world of data. Thankfully, regulation is slowly making progress too. If you’re interested in reading more on the scary world of opaque models in society, check out “Weapons of Math Destruction” by Cathy O’Neil.)

So What?

These lies could be imparted to you maliciously, or they could creep into your own decisions unknowingly. The consequences could be disastrous either way.

The greatest risk in data usage is not that something breaks, but that it seems to be working fine while actually entirely misrepresenting reality. That’s why these biases need to be considered. If you have no information to base a decision on, you investigate; but if you base decisions and directions on misinformation, you’ll be far down the wrong path before you realise.

Think of how this could affect a business decision of yours:

  • Maybe you spend your entire marketing budget on the worst channel because it looked best on one incomparable metric
  • Maybe you do all the work to identify your perfect lead, only to realise you won’t know the desired attribute until their third purchase
  • Maybe you look at what’s not working in your product by sampling your established customers, rather than those who churned during the sales process

Once you are looking out for them at work, you’ll start seeing examples of these everywhere in your day-to-day life too.

These biases affect consumers of data just as much as its creators. More important than knowing the names and types of biases is knowing that they exist, and always asking:

Why would I not be able to trust this data?

Consider how it’s sourced; how it’s manipulated; how it’s shown.

Is the output going to fairly achieve the desired action?

Will it achieve what I need, and might there be unintended negative side effects?

From a more positive perspective, always strive for data-based decisions in your business that are: Trusted, Fair, Actionable.