The foundation of every successful data science project is clean, consistent data. Yet in the fast-paced process of analysis, even seasoned Python users sometimes stumble into preventable data pre-processing pitfalls. Whether you’re shaping data for a machine learning model or preparing it for visualization, knowing what mistakes to watch for can save you countless hours of frustration and lead to far more robust results.
Let’s break down five of the most common (and often deadly) data cleaning mistakes in Python, complete with practical tips and illustrative examples so you can keep your workflows rock solid and efficient.
One of the first issues you'll run into in any real-world dataset is missing data. In Python, especially with tools like pandas, replacing or removing NaNs is easy: df.dropna() or df.fillna(0) does it in one line. But easy doesn't mean correct.
Automatically dropping rows with missing values can drastically shrink your dataset or, when the missingness is correlated with other attributes, introduce bias. Filling values with the mean or zero can distort distributions, especially in columns with outliers, and makes little sense for non-numeric columns.
Consider this snippet:
# Too hasty with missing value treatment
import pandas as pd
df = pd.read_csv('survey.csv')
df = df.dropna() # Danger: goodbye valuable data!
If 30% of rows are missing just a single optional field—say, age—you'd lose 30% of your data. If the missing ages are mostly in a specific demographic, the result is a dataset that no longer accurately represents the population.
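A safer habit is to quantify the missingness before deciding anything, and to impute rather than drop when the field is optional. A minimal sketch using the same survey.csv and its age column (median imputation is just one reasonable choice, not the only one):
# Inspect missingness, then impute instead of discarding rows
import pandas as pd
from sklearn.impute import SimpleImputer
df = pd.read_csv('survey.csv')
print(df.isnull().sum()) # How many values are missing in each column?
imputer = SimpleImputer(strategy='median')
df[['age']] = imputer.fit_transform(df[['age']])
This keeps every row while making the fill strategy an explicit, reviewable decision rather than a side effect of dropna().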
Check df.isnull().sum() or df.info() to see patterns of missingness, and reach for sklearn.impute.SimpleImputer (as in the sketch above) or domain-specific logic for informed filling.
Data gathered from multiple sources rarely fits neatly into one format. Dates, categories, and string encodings are especially prone to subtle, hard-to-trace errors.
Columns that should hold dates or numbers often arrive as plain strings (dtype: object), disabling numeric and chronological operations. A classic Python issue:
# Date times imported as strings, causing issues
import pandas as pd
df = pd.read_csv('sales.csv')
df['created_at'].min() # Only finds the minimum string, not chronological min
Checking df.dtypes quickly exposes columns that should be numeric but aren't. Apply pd.to_datetime(), pd.to_numeric(), and category conversions as soon as you import data. Normalize category columns with .str.lower().str.strip(), and replace synonyms or typos with a consistent value. When reading files, specify the encoding argument (encoding='utf-8' or encoding='cp1252') so text is decoded correctly. For the created_at column above:
).df['created_at'] = pd.to_datetime(df['created_at'], errors='coerce')
df = df.dropna(subset=['created_at']) # Remove rows where dates couldn't parse
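The same habit pays off for numeric and free-text category columns; a short sketch (the amount and region column names are illustrative):
# Coerce a numeric column and normalize a messy category column
df['amount'] = pd.to_numeric(df['amount'], errors='coerce')
df['region'] = df['region'].str.lower().str.strip()
df['region'] = df['region'].replace({'ny': 'new york', 'n.y.': 'new york'}) # Merge synonyms and typos
df['region'] = df['region'].astype('category')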
A little attention here prevents hours debugging weird analytics later.
Outliers are the wildcards of data cleaning—sometimes they signal data entry errors; other times, they are the very events worth studying!
Automated scripts that eliminate values outside a certain range without considering context can strip data of both errors and important signals.
A health dataset has a blood pressure column. Some values are recorded as 400, potentially a data error (units or input issue). Others may be edge cases, like hypertensive emergencies. Blanket removal of all values above 200 might erase real, rare patients who would be essential in medical studies.
# Don't just drop anything >200 without context
bp_outliers = df[df['blood_pressure'] > 200]
print(bp_outliers) # Investigate: are these errors or medically relevant cases?
Use df.describe() and visualizations like box plots or histograms to uncover distribution details and spot outliers. When outliers turn out to be valid, they can reshape the business insights your data drives.
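When in doubt, flag outliers instead of deleting them, so the decision stays visible. A minimal sketch using the common 1.5x IQR rule (the threshold is a convention, not a law):
# Flag candidate outliers for review rather than silently dropping them
q1 = df['blood_pressure'].quantile(0.25)
q3 = df['blood_pressure'].quantile(0.75)
iqr = q3 - q1
df['bp_outlier'] = (df['blood_pressure'] < q1 - 1.5 * iqr) | (df['blood_pressure'] > q3 + 1.5 * iqr)
print(df.loc[df['bp_outlier'], 'blood_pressure']) # Review these before excluding anything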
Duplicate data is pervasive: data entry errors, web scraping, and system glitches all introduce it. While Python lets you call df.drop_duplicates() in an instant, the real danger lies in misunderstanding where duplicates come from, or how best to resolve them.
A retail database might have multiple rows for the same customer order due to repeated system submission. Dropping all but one row works only if every column matches; otherwise, information may be lost.
# Problematic: Dropping all duplicates based only on 'order_id'
df = df.drop_duplicates(subset=['order_id']) # Could lose different addresses or notes attached to split-row orders
If columns like 'delivery_notes' differ between rows, blindly dropping duplicates either loses data or fails to reconcile conflicting info.
Use df.duplicated(subset=key_cols, keep=False) to flag true duplicates. Sometimes marking rows with a flag such as is_duplicate for downstream analysis is preferable to outright removal. Here's how you might aggregate potentially useful fields before deduplication:
# Collapse split rows into one record per order, keeping all distinct notes
def collapse_order_notes(notes):
    return '; '.join(sorted(set(x for x in notes if pd.notnull(x))))

rollup = df.groupby('order_id').agg({
    'customer_id': 'first',
    'total_amount': 'sum',
    'delivery_notes': collapse_order_notes
}).reset_index()
This protects important ancillary data.
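And when removal isn't appropriate at all, a simple marker keeps the ambiguity visible downstream (a one-line sketch reusing the order_id key):
# Mark every row that shares an order_id with another row
df['is_duplicate'] = df.duplicated(subset=['order_id'], keep=False)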
Many powerful algorithms require numerical inputs, not direct string labels or categories. Encoding categorical columns is a crucial step, but rushing or choosing the wrong method can degrade model performance and introduce bugs.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# Label encoding ignores category meaning
le = LabelEncoder()
df['city_code'] = le.fit_transform(df['city']) # Problem: Model interprets numbers mathematically
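If a column genuinely is ordinal, make the order explicit rather than trusting arbitrary integer codes. A sketch assuming a size column with S/M/L values (the column is illustrative):
# Encode an ordered category so 0 < 1 < 2 mirrors S < M < L
df['size'] = pd.Categorical(df['size'], categories=['S', 'M', 'L'], ordered=True)
df['size_code'] = df['size'].cat.codes # S -> 0, M -> 1, L -> 2; values outside the list become -1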
Use plain integer codes (0, 1, 2) only if the labels have a natural order (e.g., size: S < M < L); otherwise, opt for one-hot or other encodings. pd.get_dummies(df, drop_first=True, dummy_na=False) covers most cases, and for large-cardinality features, consider hashing or target encoding. Fit encoders on the training data and reuse the fitted objects (fit/transform) so model deployment gets identical mappings, a classic trap when new, unseen categories appear in real-world inputs. For a high-cardinality column like city, grouping rare values first keeps the feature count manageable:
city_counts = df['city'].value_counts()
# Only encode cities appearing at least 10 times
common_cities = city_counts[city_counts >= 10].index
df['city'] = df['city'].apply(lambda x: x if x in common_cities else 'Other')
df = pd.get_dummies(df, columns=['city'], drop_first=True)
This keeps feature size practical and models robust.
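To keep training and deployment mappings identical, fit the encoder once and reuse the fitted object; a minimal sketch with scikit-learn's OneHotEncoder, where handle_unknown='ignore' stops unseen categories from crashing inference (the toy frames are illustrative):
# Fit on training data only, then transform whatever arrives later
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
train = pd.DataFrame({'city': ['paris', 'berlin', 'paris']})
incoming = pd.DataFrame({'city': ['berlin', 'oslo']}) # 'oslo' never appeared in training
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train[['city']])
X_incoming = enc.transform(incoming[['city']]) # 'oslo' becomes an all-zero row instead of an error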
Data cleaning in Python requires respect for subtle detail as well as speed. By steering clear of mechanical or context-free cleaning, you elevate your data science and analytics work far above the average. Review missing values with intention, bring consistency to formats, treat outliers as signals not just noise, scrutinize duplicates, and think tactically about category encoding.
Equipped with these lessons and a critical eye for your data, you’ll spend less time backtracking, minimize embarrassing errors in production, and build a reputation for engineering data pipelines that analysts trust. And in the ever-growing field of data science, becoming the person whose data is truly ready for insight is a real superpower.