GARBAGE IN, GARBAGE OUT: THE ART OF PREPARING DATA BEFORE MODELING

November 7, 2025
Marwa Dridi

When people talk about artificial intelligence, the conversation often jumps quickly to models: deep neural networks, complex architectures, cutting-edge algorithms. Yet, before reaching that stage, there is a crucial phase that often determines 80% of the success of an AI project: preprocessing, cleaning, and understanding the data.

In other words: garbage in, garbage out. If the input data is flawed, the outcome will be too—no matter how powerful the model is.

Why this step is fundamental

A model, no matter how sophisticated, is nothing without quality data. Poorly cleaned, misunderstood, or badly structured data inevitably leads to:

  • false or biased predictions,
  • wasted time and resources,
  • and sometimes completely wrong conclusions.

The secret is simple: make the data speak before modeling.

Understanding data before modeling

Before running an algorithm, it is essential to:

  • perform exploratory data analysis (distributions, correlations, missing values), as in the sketch after this list,
  • detect anomalies or outliers,
  • understand the business context (what the columns represent and how they relate).
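
As a minimal sketch, this first exploratory pass might look something like the following in Python with pandas (the file name, the 3-sigma rule, and the column handling are illustrative assumptions, not prescriptions from this article):

```python
import pandas as pd

# Load the raw dataset (file name is a placeholder).
df = pd.read_csv("dataset.csv")

# Distributions and summary statistics for every numeric column.
print(df.describe())

# Missing values per column, sorted from most to least incomplete.
print(df.isna().sum().sort_values(ascending=False))

# Pairwise correlations between numeric features.
print(df.select_dtypes("number").corr())

# A simple outlier check: flag values more than 3 standard deviations
# from the column mean (one convention among several).
numeric = df.select_dtypes("number")
z_scores = (numeric - numeric.mean()) / numeric.std()
print((z_scores.abs() > 3).sum())
```

Even this handful of lines usually surfaces the missing-value patterns and suspicious outliers that deserve a discussion with domain experts before any modeling.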

👉 And it’s critical to remember that each domain is different:

  • Signal processing has different requirements than image processing,
  • And even within images, medical imaging is very different from facial recognition or industrial vision.

This means that cleaning techniques, transformations, and even evaluation metrics must be adapted to the specific characteristics of each field.

Such understanding allows us to form solid hypotheses and guide the choice of the right model.

Studying data with statistical methods

Understanding data also requires applying robust statistical methods to validate hypotheses before moving into modeling. Techniques such as ANOVA (Analysis of Variance), the Chi-squared test, and various non-parametric tests provide a structured way to evaluate relationships and patterns in the data.
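
As a hedged sketch, here is how such tests can be run with scipy.stats; the synthetic groups and the small contingency table are purely illustrative stand-ins for real variables:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Three synthetic groups standing in for, e.g., a metric split by category.
group_a = rng.normal(loc=10.0, scale=2.0, size=50)
group_b = rng.normal(loc=10.5, scale=2.0, size=50)
group_c = rng.normal(loc=12.0, scale=2.0, size=50)

# One-way ANOVA: are the group means significantly different?
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)
print(f"ANOVA: F={f_stat:.2f}, p={p_anova:.4f}")

# Chi-squared test of independence on a small contingency table
# (rows and columns stand in for two categorical variables).
contingency = np.array([[30, 10], [20, 25]])
chi2, p_chi2, dof, _ = stats.chi2_contingency(contingency)
print(f"Chi-squared: chi2={chi2:.2f}, p={p_chi2:.4f}")

# Non-parametric alternative when normality is doubtful:
# Mann-Whitney U test between two groups.
u_stat, p_mw = stats.mannwhitneyu(group_a, group_c)
print(f"Mann-Whitney U: U={u_stat:.1f}, p={p_mw:.4f}")
```

A low p-value from the ANOVA or chi-squared test suggests the observed differences are unlikely to be pure noise, which is exactly the kind of question spelled out below.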

These methods help address fundamental questions:

  • Are the differences we observe truly significant, or just random noise?
  • Which variables genuinely hold explanatory power?
  • Do the statistical assumptions required by certain models actually hold?

This systematic process—observe, hypothesize, test, and then model—forms the backbone of any reliable and scientifically sound project.

Cleaning and transforming: A subtle art

Data cleaning is not just about removing missing values. It also includes:

  • smart imputation,
  • normalization and standardization,
  • feature engineering (creating meaningful new variables),
  • dimensionality reduction.

When done properly, this work can completely reshape how the problem is framed and solved.
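
As an illustration, a minimal scikit-learn pipeline can chain several of these steps; the library choice and the tiny synthetic matrix are assumptions made for this sketch, not a prescribed implementation:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Small synthetic matrix with a missing value, standing in for real features.
X = np.array([
    [1.0, 200.0, 3.0],
    [2.0, np.nan, 1.0],
    [1.5, 180.0, 2.5],
    [3.0, 220.0, 0.5],
])

# Imputation, standardization, then dimensionality reduction in one chained step.
preprocess = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),  # imputation instead of dropping rows
    ("scale", StandardScaler()),                   # normalization / standardization
    ("reduce", PCA(n_components=2)),               # dimensionality reduction
])

X_clean = preprocess.fit_transform(X)
print(X_clean.shape)  # (4, 2): same rows, fewer features
```

Wrapping the steps in a single pipeline also guarantees that the exact same transformations are applied at training and prediction time, which removes a common source of leakage.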

Hypotheses and simplicity before complexity

Many fall into the trap of “the more complex, the better.” Yet, a simple model fed with well-prepared data can outperform a poorly fed deep learning network.

A logistic regression, a decision tree, or an optimized random forest can deliver excellent results—while being fast, stable, and interpretable.
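
To make that concrete, here is a small hedged sketch of such a baseline: a standardized logistic regression evaluated with cross-validation on a dataset bundled with scikit-learn (the dataset choice is illustrative, not taken from this article):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# A small, well-prepared tabular dataset bundled with scikit-learn.
X, y = load_breast_cancer(return_X_y=True)

# Simple, interpretable baseline: standardize, then logistic regression.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross-validation gives a quick, honest estimate of performance.
scores = cross_val_score(baseline, X, y, cv=5)
print(f"Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```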

That’s why it is crucial to pose clear hypotheses and test them methodically before diving into deep learning.

Conclusion: Garbage In, Garbage Out

The true strength of AI lies not only in the power of its algorithms, but also in our ability to let the data speak.
Extracting useful information, understanding structure and biases, forming hypotheses, and validating them statistically—this is the fundamental art.

A model, no matter how advanced, is only as good as the quality of its data.

And this principle applies far beyond artificial intelligence:

  • In everyday life, if you feed your mind with poor information, you’ll make poor decisions.
  • In sports, weak preparation leads to weak performance, even with talent.
  • In cooking, bad ingredients produce a bad dish, no matter how perfect the recipe.

👉 Garbage In, Garbage Out is a universal rule: the quality of the output always depends on the quality of the input.

About the author

R&D Project Manager | France
I hold a PhD in Applied Mathematics and specialize in modeling and analyzing complex problems. Currently, I am involved in managing research projects at Sogeti, a company within Capgemini.
