GARBAGE IN, GARBAGE OUT: THE ART OF PREPARING DATA BEFORE MODELING

November 7, 2025
Marwa Dridi

When people talk about artificial intelligence, the conversation often jumps quickly to models: deep neural networks, complex architectures, cutting-edge algorithms. Yet, before reaching that stage, there is a crucial phase that often determines 80% of the success of an AI project: preprocessing, cleaning, and understanding the data.

In other words: garbage in, garbage out. If the input data is flawed, the outcome will be too—no matter how powerful the model is.

Why this step is fundamental

A model, no matter how sophisticated, is nothing without quality data. Poorly cleaned, misunderstood, or badly structured data inevitably leads to:

  • false or biased predictions,
  • wasted time and resources,
  • and sometimes completely wrong conclusions.

The secret is simple: make the data speak before modeling.

Understanding data before modeling

Before running an algorithm, it is essential to:

  • perform exploratory data analysis (distributions, correlations, missing values), as in the sketch after this list,
  • detect anomalies or outliers,
  • understand the business context (what the columns represent and how they relate).
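
As a minimal sketch, this first exploratory pass might look something like the following in Python with pandas (the file name, the 3-sigma rule, and the column handling are illustrative assumptions, not prescriptions from this article):

```python
import pandas as pd

# Load the raw dataset (file name is a placeholder).
df = pd.read_csv("dataset.csv")

# Distributions and summary statistics for every numeric column.
print(df.describe())

# Missing values per column, sorted from most to least incomplete.
print(df.isna().sum().sort_values(ascending=False))

# Pairwise correlations between numeric features.
print(df.select_dtypes("number").corr())

# A simple outlier check: flag values more than 3 standard deviations
# from the column mean (one convention among several).
numeric = df.select_dtypes("number")
z_scores = (numeric - numeric.mean()) / numeric.std()
print((z_scores.abs() > 3).sum())
```

Even this handful of lines usually surfaces the missing-value patterns and suspicious outliers that deserve a discussion with domain experts before any modeling.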

👉 And it’s critical to remember that each domain is different:

  • Signal processing has different requirements than image processing,
  • And even within images, medical imaging is very different from facial recognition or industrial vision.

This means that cleaning techniques, transformations, and even evaluation metrics must be adapted to the specific characteristics of each field.

Such understanding allows us to form solid hypotheses and guide the choice of the right model.

Studying data with statistical methods

Understanding data also requires applying robust statistical methods to validate hypotheses before moving into modeling. Techniques such as ANOVA (Analysis of Variance), the Chi-squared test, and various non-parametric tests provide a structured way to evaluate relationships and patterns in the data.
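
As a hedged sketch, here is how such tests can be run with scipy.stats; the synthetic groups and the small contingency table are purely illustrative stand-ins for real variables:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Three synthetic groups standing in for, e.g., a metric split by category.
group_a = rng.normal(loc=10.0, scale=2.0, size=50)
group_b = rng.normal(loc=10.5, scale=2.0, size=50)
group_c = rng.normal(loc=12.0, scale=2.0, size=50)

# One-way ANOVA: are the group means significantly different?
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)
print(f"ANOVA: F={f_stat:.2f}, p={p_anova:.4f}")

# Chi-squared test of independence on a small contingency table
# (rows and columns stand in for two categorical variables).
contingency = np.array([[30, 10], [20, 25]])
chi2, p_chi2, dof, _ = stats.chi2_contingency(contingency)
print(f"Chi-squared: chi2={chi2:.2f}, p={p_chi2:.4f}")

# Non-parametric alternative when normality is doubtful:
# Mann-Whitney U test between two groups.
u_stat, p_mw = stats.mannwhitneyu(group_a, group_c)
print(f"Mann-Whitney U: U={u_stat:.1f}, p={p_mw:.4f}")
```

A low p-value from the ANOVA or chi-squared test suggests the observed differences are unlikely to be pure noise, which is exactly the kind of question spelled out below.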

These methods help address fundamental questions:

  • Are the differences we observe truly significant, or just random noise?
  • Which variables genuinely hold explanatory power?
  • Do the statistical assumptions required by certain models actually hold?

This systematic process—observe, hypothesize, test, and then model—forms the backbone of any reliable and scientifically sound project.

Cleaning and transforming: A subtle art

Data cleaning is not just about removing missing values. It also includes:

  • smart imputation,
  • normalization and standardization,
  • feature engineering (creating meaningful new variables),
  • dimensionality reduction.

When done properly, this work can completely reshape how the problem is framed and solved.
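
As an illustration, a minimal scikit-learn pipeline can chain several of these steps; the library choice and the tiny synthetic matrix are assumptions made for this sketch, not a prescribed implementation:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Small synthetic matrix with a missing value, standing in for real features.
X = np.array([
    [1.0, 200.0, 3.0],
    [2.0, np.nan, 1.0],
    [1.5, 180.0, 2.5],
    [3.0, 220.0, 0.5],
])

# Imputation, standardization, then dimensionality reduction in one chained step.
preprocess = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),  # imputation instead of dropping rows
    ("scale", StandardScaler()),                   # normalization / standardization
    ("reduce", PCA(n_components=2)),               # dimensionality reduction
])

X_clean = preprocess.fit_transform(X)
print(X_clean.shape)  # (4, 2): same rows, fewer features
```

Wrapping the steps in a single pipeline also guarantees that the exact same transformations are applied at training and prediction time, which removes a common source of leakage.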

Hypotheses and simplicity before complexity

Many fall into the trap of “the more complex, the better.” Yet, a simple model fed with well-prepared data can outperform a poorly fed deep learning network.

A logistic regression, a decision tree, or an optimized random forest can deliver excellent results—while being fast, stable, and interpretable.
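
To make that concrete, here is a small hedged sketch of such a baseline: a standardized logistic regression evaluated with cross-validation on a dataset bundled with scikit-learn (the dataset choice is illustrative, not taken from this article):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A small, well-prepared tabular dataset bundled with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)

# Simple, interpretable baseline: standardize, then logistic regression.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation gives a quick, honest estimate of performance.
scores = cross_val_score(baseline, X, y, cv=5)
print(f"Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```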

That’s why it is crucial to pose clear hypotheses and test them methodically before diving into deep learning.

Conclusion: Garbage In, Garbage Out

The true strength of AI lies not only in the power of its algorithms, but also in our ability to let the data speak.
Extracting useful information, understanding structure and biases, forming hypotheses, and validating them statistically—this is the fundamental art.

A model, no matter how advanced, is only as good as the quality of its data.

And this principle applies far beyond artificial intelligence:

  • In everyday life, if you feed your mind with poor information, you’ll make poor decisions.
  • In sports, weak preparation leads to weak performance, even with talent.
  • In cooking, bad ingredients produce a bad dish, no matter how perfect the recipe.

👉 Garbage In, Garbage Out is a universal rule: the quality of the output always depends on the quality of the input.

About the author

R&D Project Manager | France
I hold a PhD in Applied Mathematics and specialize in modeling and analyzing complex problems. Currently, I am involved in managing research projects at Sogeti, a company within Capgemini.
