Data quality checks every analyst should automate (even in spreadsheets)

If you have ever built a dashboard that “looked wrong” minutes before a meeting, you already know the real problem is rarely the chart. It is the data feeding it. The good news is you do not need a full data engineering stack to reduce errors. A few repeatable, automated checks inside Excel or Google Sheets can catch most issues early, and they take minutes to maintain once set up. This is also why many learners in a data analytics course in Kolkata spend time on data quality habits, not just formulas and visuals.

1. Completeness checks: required fields and missing values

Start by deciding which fields must never be blank. Examples: Customer ID, Order Date, SKU, Quantity, and Net Amount. Create a simple “DQ_Required” tab with a list of required columns and your table name/range.

Practical spreadsheet automations:

  • Missing required values: Use COUNTBLANK(range) or COUNTIF(range,"") to count blanks per column.
  • Row-level flags: Add a helper column like DQ_MissingFlag using IF(OR(ISBLANK(A2),ISBLANK(C2)),"FAIL","OK").
  • Conditional formatting: Highlight blanks in required columns so issues are visible immediately.

The goal is not to clean every blank automatically. It is to ensure blanks are deliberate and explained (for example, a delivery date may be blank because the order is not shipped yet). Your automation should separate “expected blanks” from “data loss”.
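The same row-level flag logic can be sketched outside the spreadsheet. Below is a minimal Python illustration of the completeness check, assuming hypothetical column names (customer_id, order_date, quantity); the list of required fields plays the role of the "DQ_Required" tab.

```python
# Completeness check: flag each row FAIL if any required field is blank.
# Column names here are hypothetical, standing in for your real schema.
REQUIRED = ["customer_id", "order_date", "quantity"]

def completeness_flags(rows, required=REQUIRED):
    """Return an 'OK'/'FAIL' flag per row, like the DQ_MissingFlag column."""
    flags = []
    for row in rows:
        missing = [col for col in required if not str(row.get(col, "")).strip()]
        flags.append("FAIL" if missing else "OK")
    return flags

orders = [
    {"customer_id": "C1", "order_date": "2024-05-01", "quantity": 2},
    {"customer_id": "",   "order_date": "2024-05-02", "quantity": 1},   # data loss
    {"customer_id": "C3", "order_date": "2024-05-03", "quantity": ""},  # data loss
]
print(completeness_flags(orders))  # ['OK', 'FAIL', 'FAIL']
```

Note that a field like delivery date would stay out of REQUIRED, because a blank there can be an expected blank rather than data loss.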

2. Validity checks: types, ranges, and allowed values

A value can be present and still be wrong. Validity checks confirm that values fit rules you expect.

Common validity rules you can automate:

  • Date logic: No future dates for completed transactions; no dates before a system launch.
  • Numeric ranges: Quantity must be positive; discount percentage should be between 0 and 100.
  • Allowed values: Status must be one of a defined list (e.g., New, Shipped, Cancelled).

Spreadsheet approaches that work well:

  • Data validation lists: Restrict entry to an approved list for fields like status, region, or channel.
  • Type checks: ISNUMBER() and ISTEXT() detect values stored in the wrong format; DATEVALUE() converts text dates into real dates (and errors when it cannot).
  • Rule flags: Use IF(AND(value>=min,value<=max),"OK","FAIL") for quick pass/fail.

These checks are easy to scale. Keep thresholds (min/max, allowed categories) in a small configuration table so you update rules once, not across many formulas. This is exactly the sort of repeatable pattern you would practise in a data analytics course in Kolkata, but you can implement it immediately at work.
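The configuration-table pattern can be sketched in Python as well: thresholds and allowed categories live in one small structure, and the rule flag reads from it. The column names, bounds, and status values below are hypothetical.

```python
# Validity rules driven by a small configuration table (RULES), mirroring
# the spreadsheet pattern: update min/max in one place, not in every formula.
RULES = {
    "quantity": {"min": 1, "max": 10_000},
    "discount_pct": {"min": 0, "max": 100},
}
ALLOWED_STATUS = {"New", "Shipped", "Cancelled"}

def validity_flag(row):
    """Return 'OK' only if every configured rule passes for this row."""
    for col, rule in RULES.items():
        v = row.get(col)
        if not isinstance(v, (int, float)) or not (rule["min"] <= v <= rule["max"]):
            return "FAIL"
    if row.get("status") not in ALLOWED_STATUS:
        return "FAIL"
    return "OK"

print(validity_flag({"quantity": 3, "discount_pct": 15, "status": "Shipped"}))   # OK
print(validity_flag({"quantity": -2, "discount_pct": 15, "status": "Shipped"}))  # FAIL
```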

3. Uniqueness and key integrity: duplicates and broken relationships

Duplicates are one of the fastest ways to break reporting. If an Order ID is duplicated unintentionally, totals inflate. If a Customer ID is inconsistent, segmentation fails.

Automations to set up:

  • Duplicate ID detection: COUNTIF(ID_range, ID_cell)>1 can flag duplicates at row level.
  • Composite keys: Sometimes one column is not enough. For example, Invoice ID + Line Number should be unique. Create a combined key column using concatenation, then run duplicate checks on that key.
  • Referential integrity (lookup validity): If you have an Orders table and a Customers table, every Customer ID in Orders should exist in Customers. Use XLOOKUP() or MATCH() and flag #N/A as “FAIL”.

A simple “Orphans count” metric (how many IDs do not match) is a powerful control. Even if you cannot fix the source system, you can prevent broken relationships from silently corrupting your analysis.
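The duplicate, composite-key, and orphans checks above can be illustrated together in a short Python sketch. Table and column names (invoice_id, line_no, customer_id) are hypothetical stand-ins for your real keys.

```python
from collections import Counter

def duplicate_keys(rows, key_cols):
    """Flag each row whose composite key (e.g. invoice_id + line_no) repeats,
    like COUNTIF(...)>1 on a concatenated key column."""
    keys = [tuple(r[c] for c in key_cols) for r in rows]
    counts = Counter(keys)
    return [counts[k] > 1 for k in keys]

def orphan_count(orders, customers, fk="customer_id", pk="customer_id"):
    """Referential integrity: count order rows whose customer does not exist."""
    known = {c[pk] for c in customers}
    return sum(1 for o in orders if o[fk] not in known)

invoices = [
    {"invoice_id": "INV1", "line_no": 1},
    {"invoice_id": "INV1", "line_no": 2},
    {"invoice_id": "INV1", "line_no": 2},  # duplicate composite key
]
orders = [{"customer_id": "C1"}, {"customer_id": "C9"}]
customers = [{"customer_id": "C1"}]
print(duplicate_keys(invoices, ["invoice_id", "line_no"]))  # [False, True, True]
print(orphan_count(orders, customers))                      # 1
```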

4. Consistency and reconciliation: cross-field logic and control totals

Consistency checks compare fields to each other and to known totals. They catch issues that single-column checks miss.

Useful consistency checks:

  • Cross-field rules: Start Date should be before End Date; Discounted Price should not exceed List Price; Tax should align with tax rules for a state.
  • Standardisation: Trim extra spaces, remove hidden characters, and normalise case so “Kolkata ” and “kolkata” do not split into separate categories. Functions like TRIM() and CLEAN() help.
  • Control totals: Maintain a small reconciliation block: total rows, total revenue, total quantity, and count of distinct customers. Compare today’s numbers to yesterday’s or last refresh.

A practical method is to store prior refresh totals in a “DQ_Log” sheet. If row count drops by 40% overnight, your spreadsheet should show a red “Investigate” status before you update any pivots or charts.
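The DQ_Log comparison can be sketched as a single status function. The 40% threshold comes from the example above; the control-total names and threshold parameter are otherwise hypothetical.

```python
def refresh_status(current, previous, max_row_drop=0.4):
    """Compare today's control totals to the prior refresh.
    Returns 'Investigate' when row count drops by more than max_row_drop
    (40% here, matching the example in the text), else 'OK'."""
    if previous["rows"] == 0:
        return "Investigate"  # no baseline to compare against
    drop = (previous["rows"] - current["rows"]) / previous["rows"]
    return "Investigate" if drop > max_row_drop else "OK"

previous = {"rows": 10_000, "revenue": 1_250_000.0}
current = {"rows": 5_500, "revenue": 700_000.0}
print(refresh_status(current, previous))  # Investigate (45% row drop)
```

In a real DQ_Log you would store and compare several totals (rows, revenue, quantity, distinct customers), each with its own tolerance.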

5. Freshness and auditability: make data issues traceable

Even clean data becomes risky if it is stale or if nobody knows when it was refreshed.

Easy automations:

  • Last refresh timestamp: Store a timestamp when data was last imported/refreshed.
  • Change detection: Compare key metrics (row count, sum of amount, distinct IDs) to the previous run.
  • Issue summary dashboard: Create a small DQ panel with counts of FAILs by category (missing, invalid, duplicates, orphans). Keep it visible next to your main dashboard.

This is where spreadsheets become surprisingly strong. With one DQ dashboard tab, you can communicate data health clearly to stakeholders, rather than arguing about why a number “feels off”.
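A DQ summary panel boils down to a count of FAILs by category plus a refresh timestamp. A minimal sketch, assuming the earlier checks emit (category, flag) pairs:

```python
from collections import Counter
from datetime import datetime, timezone

def dq_panel(issues):
    """Summarise FAIL counts by category for a small DQ dashboard panel.
    `issues` is a list of (category, flag) pairs from the earlier checks."""
    fails = Counter(cat for cat, flag in issues if flag == "FAIL")
    return {
        "refreshed_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "fails": dict(fails),
        "status": "Investigate" if fails else "OK",
    }

issues = [("missing", "FAIL"), ("missing", "OK"),
          ("duplicates", "FAIL"), ("orphans", "OK")]
panel = dq_panel(issues)
print(panel["fails"])   # {'missing': 1, 'duplicates': 1}
print(panel["status"])  # Investigate
```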

Conclusion

Automating data quality checks is not extra work; it is the work that protects everything downstream. Start small: required fields, validity ranges, duplicates, lookup integrity, and control totals. Put the checks in a dedicated tab, make the results visible, and log outcomes each refresh. Over time, you will spend less time firefighting and more time analysing. And if you are building these habits while learning through a data analytics course in Kolkata, you will find they transfer directly to real reporting environments where trust in data matters as much as the insights.