Data cleaning is the unglamorous part of analytics that determines whether your insights are trustworthy. In real organisations, data rarely arrives in a neat table with perfect columns and consistent values. It comes from web forms, CRMs, billing systems, spreadsheets, surveys, and manual uploads—often all at once. If you are learning through a data analyst course in Chennai, you will quickly notice that most project time goes into preparing data before any dashboard, model, or report can be built.
Below are the most common messy data problems you will face, along with practical fixes that work in everyday business settings.
1) Missing, incomplete, and “unknown” values
What it looks like
Missing data appears as blank cells, nulls, “NA”, “N/A”, “-”, or even “0” used as a placeholder. In customer datasets, it often shows up in phone numbers, location fields, age, income, or product details. In operational data, it can appear as missing timestamps, unfilled status fields, or partial address information.
Why it happens
- Optional form fields users skip
- System integrations that fail to sync all fields
- Legacy systems with incomplete records
- Data entry teams using shortcuts
How to fix it
- Standardise missing markers: Convert “NA”, “-”, and blanks into a single missing value format.
- Decide on a strategy per column, as in the sketch after this list:
  - Drop rows only when missingness is small and random.
  - Impute values when missingness is meaningful but manageable (e.g., median for salary, mode for city).
  - Create an “Unknown” category for categorical fields where missing is informative (e.g., “Lead Source = Unknown”).
- Track missingness as a metric: Add a “data completeness score” for key fields so business teams can improve data capture over time.
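A minimal pandas sketch of these steps, assuming a toy customer table; the column names (city, salary, lead_source), the markers treated as missing, and the choice of median imputation are illustrative assumptions, not a prescription:

```python
import numpy as np
import pandas as pd

# Toy customer table with the usual mix of missing-value markers (illustrative)
df = pd.DataFrame({
    "city":        ["Chennai", "NA", "", "Madurai"],
    "salary":      [52000, None, 61000, 0],        # 0 used here as a placeholder
    "lead_source": ["Web", "-", "Referral", None],
})

# 1) Standardise all missing markers into a single representation (NaN)
df = df.replace(["NA", "N/A", "-", ""], np.nan)
df["salary"] = df["salary"].replace(0, np.nan)     # only if 0 is a known placeholder

# 2) Track completeness per field before imputing anything
completeness = df.notna().mean().round(2)          # share of non-missing values
print(completeness)

# 3) Apply a per-column strategy
df["salary"] = df["salary"].fillna(df["salary"].median())  # median for numeric
df["lead_source"] = df["lead_source"].fillna("Unknown")    # missing is informative
df = df.dropna(subset=["city"])                            # drop only small, random gaps
```

Note the order: the completeness score is computed before imputation, because after filling, every column looks complete and the original capture problem disappears from view.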
2) Inconsistent formats and incorrect data types
What it looks like
- Dates stored as both “08/01/2026” and “2026-01-08”
- Phone numbers with country codes, spaces, or missing digits
- Numbers stored as text (e.g., “1,200” or “₹1200”)
- Mixed casing and spelling (“Chennai”, “chennai”, “CHENNAI”)
Why it happens
Different sources follow different rules. Spreadsheets allow free-form entry, while databases enforce types—until someone exports and edits the file manually.
How to fix it
- Define a standard format: For example, ISO date format (YYYY-MM-DD) and E.164 phone format (+91XXXXXXXXXX).
- Parse and convert types early: Convert currencies to numeric by stripping symbols and commas. Convert dates using explicit parsing rules (don’t rely on auto-detection).
- Normalise text fields: Trim extra spaces, convert to consistent case, and map common variations (“TN” → “Tamil Nadu”).
These steps are foundational skills in any data analyst course in Chennai because they prevent downstream chart errors and incorrect aggregations. The sketch below shows one way to apply them in pandas.
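The sample values mirror the examples above; treating “08/01/2026” as DD/MM/YYYY is an assumption to confirm with the source system, and the TN mapping is a stand-in for a fuller lookup table:

```python
import pandas as pd

df = pd.DataFrame({
    "signup_date": ["08/01/2026", "2026-01-08"],
    "amount":      ["1,200", "₹1200"],
    "city":        ["  chennai", "CHENNAI"],
    "state":       ["TN", "Tamil Nadu"],
})

# Parse dates with explicit formats instead of auto-detection;
# values that do not match a format become NaT rather than a silent guess
ddmmyyyy = pd.to_datetime(df["signup_date"], format="%d/%m/%Y", errors="coerce")
iso      = pd.to_datetime(df["signup_date"], format="%Y-%m-%d", errors="coerce")
df["signup_date"] = ddmmyyyy.fillna(iso)

# Strip currency symbols, thousands separators, and whitespace, then convert
df["amount"] = pd.to_numeric(
    df["amount"].str.replace(r"[₹,\s]", "", regex=True), errors="coerce"
)

# Normalise text: trim, consistent case, map known variations
df["city"] = df["city"].str.strip().str.title()
df["state"] = df["state"].replace({"TN": "Tamil Nadu"})
```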
3) Duplicate records and conflicting entries
What it looks like
Duplicates are not always exact copies. You might see the same customer twice with slightly different names (“S. Kumar” vs “S Kumar”), multiple emails, or two addresses. In sales and marketing data, duplicates inflate lead counts and confuse conversion metrics.
Why it happens
- Users submit forms multiple times
- CRM imports run repeatedly
- Different departments maintain separate lists
- Matching rules are too weak (or missing)
How to fix it
- Start with exact duplicates: Remove rows identical across key columns.
- Use a “unique key” approach: If a stable ID exists (customer_id, invoice_id), enforce uniqueness and investigate collisions.
- Apply fuzzy matching for entities: Match on combinations like name + phone, or email + company. Use similarity thresholds carefully and validate with samples (see the sketch after this list).
- Choose a survivorship rule: When duplicates conflict, decide which source wins (latest timestamp, most complete record, or system of record). Document this rule so it stays consistent.
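One way to combine these ideas in pandas, using Python’s standard-library difflib for the fuzzy step; the 0.85 threshold, the phone-based blocking, and the “latest timestamp wins” survivorship rule are all assumptions to tune and document for your own data:

```python
import difflib
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    "name":       ["S. Kumar", "S Kumar", "A. Devi"],
    "phone":      ["+919800000001", "+919800000001", "+919800000002"],
    "email":      ["sk@example.com", "sk@example.com", "ad@example.com"],
    "updated_at": pd.to_datetime(["2026-01-05", "2026-01-08", "2026-01-02"]),
})

# 1) Exact duplicates first: rows identical across the key columns
df = df.drop_duplicates(subset=["name", "phone", "email"])

# 2) Fuzzy candidates: same phone (blocking), similar name; review a sample
def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

candidates = []
for phone, grp in df.groupby("phone"):
    for (i, a), (j, b) in combinations(grp["name"].items(), 2):
        if similar(a, b):
            candidates.append((i, j))   # row pairs to review or merge

# 3) Survivorship: latest record per phone wins (document whichever rule you use)
df = (df.sort_values("updated_at", ascending=False)
        .drop_duplicates(subset=["phone"], keep="first"))
```

For large datasets, dedicated libraries such as rapidfuzz scale far better than pairwise difflib comparisons, but the block-then-compare pattern stays the same.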
4) Outliers, impossible values, and business-rule violations
What it looks like
- Negative quantities, or impossible ages like 250
- Revenue values that jump by 100x due to an extra zero
- Timestamps in the future
- Conversion rates above 100%
Why it happens
Human entry mistakes, unit mismatches (grams vs kilograms), system bugs, or partial imports.
How to fix it
- Use rule-based validation first: Define acceptable ranges (age 0–100, discount 0–100%, order quantity > 0).
- Detect outliers with context: Statistical methods help, but business understanding matters more. A “high” purchase might be valid for enterprise customers.
- Flag instead of delete: Create a validation column (Valid/Invalid) and keep records for audit until the business confirms what to do, as in the sketch after this list.
- Log assumptions: If you cap values, convert units, or remove records, record the reason and the rule so results can be reproduced.
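A minimal sketch of rule-based validation with flagging in pandas; the ranges and rule names are illustrative stand-ins for your actual business rules:

```python
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "age":      [34, 250, 41, 28],
    "quantity": [2, 1, -3, 5],
    "discount": [10, 105, 0, 15],   # percent
})

# Business rules as named, reusable checks (ranges are illustrative)
rules = {
    "age_in_range":      df["age"].between(0, 100),
    "quantity_positive": df["quantity"] > 0,
    "discount_in_range": df["discount"].between(0, 100),
}

# Flag instead of delete: keep every row, mark validity, list the failed rules
checks = pd.DataFrame(rules)
df["is_valid"] = checks.all(axis=1)
df["failed_rules"] = checks.apply(
    lambda row: ", ".join(name for name, ok in row.items() if not ok), axis=1
)
```

The failed_rules column doubles as the audit log the last point asks for: it records exactly which rule each record violated, so decisions can be reproduced later.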
Conclusion
Real-world data cleaning is less about perfection and more about reliability. The goal is to build datasets that are consistent, traceable, and fit for decision-making. By standardising missing values, enforcing formats, resolving duplicates, and validating against business rules, you turn messy inputs into analysis-ready assets. If you practise these workflows while doing a data analyst course in Chennai, you will be better prepared for actual job datasets—where the ability to clean data well is often what separates a good analyst from a great one.