This comic was retweeted into my timeline on Twitter (I refuse to call it X).

Here’s the dialog from the comic, for those who don’t want to click through or who need alt text:
Person 1: We need to be data-informed, not data-driven.
Person 2: What’s the difference?
Person 1: It’s a loophole that allows decisions based upon my gut feeling.

While not all retweets are endorsements, I thought I would comment on something I think a lot of data professionals get wrong. I have heard some smart people say things like “Let the data speak for itself” or “Data doesn’t lie”. In the same vein, you will hear Microsoft offer solutions, and consultants like me offer implementation services, to help your organization become data-driven. A more current buzzword would be AI-driven. These organizational goals are not bad; if I thought there was an inherent problem with them, I would not work in this industry. But here is the thing people often fail to consider:

Your data is imperfect. Whether you know it or not, it is likely incomplete, inaccurate, or inappropriate for the scenarios you want to use it for.

When some people read the comic, they think the decisions-based-on-gut-feelings guy is silly and would produce suboptimal results. We have tech and data for a reason, and some people just want to override them for their comfort or benefit.

But when I look at that comic, I see a potentially seasoned professional who wants to combine data with their domain knowledge, either about the industry or about particular products or customers at their organization. They know that data collection is often imperfect and that it’s impossible to collect every bit of data (quantitative and qualitative) to make a 100% perfect decision. People operate on imperfect data all the time. Some people are better at it than others.

We are now training AI and ML models (predictive analytics, LLMs, etc.) to operate on imperfect data. How many times have you talked with a chatbot that could not accommodate your request? How many times has a publicly available chatbot turned racist/bigoted/misogynist, or provided a response that contained a hallucination? How many times have you used a forecasting model for planning, only to get results that seemed out of range? The response from practitioners when this happens is that we just need to tweak the inputs or change the parameters. And this is often true. But a lot of the challenge is in identifying those parameters and providing accurate values. Issues of overfitting or underfitting can sometimes be corrected in the model itself, but that doesn’t magically resolve the issues in the underlying data. Often the training dataset provided is too small or the data values are too similar.
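To make the small-data point concrete, here is a minimal sketch using scikit-learn and synthetic data (the numbers and model choice are purely illustrative, not from any real project). A flexible model fit to a handful of noisy points scores almost perfectly on its own training data and much worse on new data from the same process; tuning parameters moves that trade-off around, but it is not a substitute for more and better data.

```python
# A minimal sketch (synthetic data, scikit-learn) of a model trained on too
# little data: it looks great on the training set and falls apart on new data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)

# Tiny, noisy "training" sample -- the kind of imperfect data models often get.
X_train = rng.uniform(0, 10, size=(8, 1))
y_train = 2.0 * X_train.ravel() + rng.normal(0, 3, size=8)

# A larger holdout drawn from the same underlying process.
X_test = rng.uniform(0, 10, size=(200, 1))
y_test = 2.0 * X_test.ravel() + rng.normal(0, 3, size=200)

# A flexible model "tuned" until it fits the small sample almost perfectly.
model = make_pipeline(PolynomialFeatures(degree=7), LinearRegression())
model.fit(X_train, y_train)

print("train MSE:", round(mean_squared_error(y_train, model.predict(X_train)), 2))
print("test MSE: ", round(mean_squared_error(y_test, model.predict(X_test)), 2))
# Train error is near zero while test error is much larger. Changing the degree
# shifts the trade-off, but only more (and more varied) data fixes the root cause.
```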

In all of these cases, the data and/or the training of the models is imperfect. In many cases, the results from AI are worse than the results from a human who performs the same task. And in some cases, the results from AI can be more harmful (or harmful more often) than results from a human. Job applications get rejected, people get sent to jail, and students get accused of cheating because a model unintentionally (hopefully) encoded an implicit bias against people of color or neurodivergent people. This happens with humans, too, but humans might be more aware of applicable laws in this area, and they can be held responsible for their actions.

I’m a bit amazed at how willing companies are to let AI interface with customers, employees, or the public when evidence of undesirable output and harmful results is readily available. But do I think we should remove or ignore all AI? Not at all. There are many beneficial applications of AI that can improve people’s quality of life, optimize a process, or even just produce something fun.

I do see a place for AI to optimize processes today, but I think the best applications at the moment are those that don’t require a lot of context or nuance. I am also ok with limited applications where the potential harm is far outweighed by the proven good and the users are knowledgeable about the risks and opt in to the experience.

In order to improve AI models, people have to use, test, and iterate on the current versions. But that doesn’t mean they need to implement those versions publicly, or that organizations need to make the AI-driven process the only way a person can interact with them to complete a task.

I think business decision makers and data professionals should more publicly acknowledge that these AI models are imperfectly made by imperfect people who often have blind spots about their data and themselves (or people in general). Technologists are often optimists, which can lead to great innovation and also great harm when consequences are not considered in advance. So for now, in 2024, I’m good with data-informed over data-driven.

In order to support better AI, we need better data and better data-related processes. This involves better data management and governance.
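As a small, hypothetical illustration of what that can look like in practice, here is a sketch of a routine data-quality check in pandas. The table, column names, and rules are made up for this example; the point is simply that profiling completeness and uniqueness, and testing a few explicit rules, makes data problems visible before they feed a report or a model.

```python
# A minimal sketch of a routine data-quality check that supports better
# data management. The table, columns, and rules below are hypothetical.
import pandas as pd

def profile_quality(df: pd.DataFrame) -> pd.DataFrame:
    """Report completeness, uniqueness, and data type per column."""
    return pd.DataFrame({
        "pct_missing": (df.isna().mean() * 100).round(1),
        "pct_unique": (df.nunique() / len(df) * 100).round(1),
        "dtype": df.dtypes.astype(str),
    })

# A small, deliberately imperfect customer extract.
customers = pd.DataFrame({
    "customer_id": [101, 102, 102, 104],  # duplicate ID
    "signup_date": ["2024-01-05", None, "2024-02-30", "2024-03-12"],  # missing and invalid dates
    "region": ["West", "West", None, "Unknown"],
})

print(profile_quality(customers))

# Explicit rules a governance process might define, enforce, and track over time.
checks = {
    "customer_id values are unique": customers["customer_id"].is_unique,
    "signup_date values parse to valid dates": pd.to_datetime(
        customers["signup_date"], errors="coerce").notna().all(),
}
for rule, passed in checks.items():
    print("PASS" if passed else "FAIL", "-", rule)
```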

Better data management and governance are things organizations can address now, with benefits beyond the use of AI. And they require support from many roles outside of AI engineers. So if you are a data engineer, analytics engineer, DBA, or analyst, you can play a role in preparing your organization for successful future AI projects.