Establishing a data validation plan before beginning modeling is essential. It enables validation to be performed through automated checks (“tests”) rather than relying on SME review. Time for validation should be explicitly included in project planning.
Use basic data visualizations to perform quick “sniff tests” (e.g., cycle counts, trends, and distributions).
Instrument simple KPIs (e.g., total tonnage or output) to verify that aggregated values align with source systems.
Automate validation checks using notebooks and the SDK to streamline testing and support pipeline updates by multiple contributors.
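Automated checks of this kind can be as simple as assert-style tests in a notebook. A minimal sketch, assuming a pandas DataFrame of cycle-level data (the column names and values below are illustrative, not from any real pipeline):

```python
import pandas as pd

# Hypothetical cycle-level data; column names are illustrative only.
df = pd.DataFrame({
    "machine_id": ["M1", "M1", "M2"],
    "cycle_start": pd.to_datetime(["2024-01-01 00:00",
                                   "2024-01-01 01:00",
                                   "2024-01-01 00:00"]),
    "output_units": [100, 98, 105],
})

def check_no_missing_keys(frame, key_cols):
    """Fail fast if any key column contains nulls."""
    missing = frame[key_cols].isna().sum()
    assert (missing == 0).all(), f"Null keys found:\n{missing[missing > 0]}"

def check_positive(frame, col):
    """Output metrics should be strictly positive."""
    assert (frame[col] > 0).all(), f"Non-positive values in {col}"

check_no_missing_keys(df, ["machine_id", "cycle_start"])
check_positive(df, "output_units")
passed = True  # reaching this line means all checks passed
```

Checks written this way can be re-run by any contributor after a pipeline change, which is what makes them preferable to one-off SME review.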

Typical Validation Questions
Data completeness and integrity
How does missing data manifest? Are there detectable gaps, such as skipped timestamps or sequence numbers?
Are key columns ever empty (e.g., foreign keys), or does data always flow with complete fields?
Which columns should always be populated, and which are expected to be optional or intermittently blank?
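One concrete way to answer the gap question is to compare consecutive timestamps against the expected cadence. A sketch, assuming hourly readings (the data below is invented and deliberately missing the 02:00 reading):

```python
import pandas as pd

# Illustrative hourly readings with one missing hour (02:00).
ts = pd.to_datetime(["2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 03:00"])
readings = pd.DataFrame({"timestamp": ts}).sort_values("timestamp")

# Detect gaps larger than the expected cadence.
expected = pd.Timedelta(hours=1)
deltas = readings["timestamp"].diff().dropna()
gaps = deltas[deltas > expected]
gap_count = len(gaps)  # one gap: 01:00 -> 03:00
```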
KPI reconciliation
Does a KPI (e.g., total downtime) reconcile with source data for edge cases such as highest and lowest downtime days?
Does aggregating a full time period (e.g., one month) match expected totals from the source system?
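Both reconciliation questions can be scripted. A sketch in plain Python, assuming daily downtime values from the model and a monthly total reported independently by the source system (all numbers are illustrative):

```python
# Daily downtime minutes from the model vs. a total reported by the source system.
model_daily_downtime = {"2024-01-01": 45.0, "2024-01-02": 0.0, "2024-01-03": 120.0}
source_month_total = 165.0  # value pulled from the source system's own report

model_total = sum(model_daily_downtime.values())
tolerance = 0.01 * source_month_total  # allow 1% for rounding differences
reconciles = abs(model_total - source_month_total) <= tolerance

# Edge cases: spot-check the highest-downtime day against the source directly.
worst_day = max(model_daily_downtime, key=model_daily_downtime.get)
```

Spot-checking the extreme days (highest and lowest downtime) catches sign errors and double-counting that a matching grand total can hide.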
Asset and model alignment
Does machine type correctly map to an actual asset type on the plant floor?
If synthetic machines were created for modeling purposes, has Product confirmed this is necessary and justified by customer requirements?
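The mapping question can be checked by diffing the model's machine types against the asset types confirmed on the plant floor. A sketch with invented values; anything left unmapped is either a data error or a synthetic machine needing Product sign-off:

```python
# Machine-type values found in the model vs. asset types confirmed on the floor.
model_types = {"M1": "Press", "M2": "Oven", "M9": "VirtualAggregator"}
floor_types = {"Press", "Oven", "Conveyor"}

# Machines whose type does not correspond to any real asset type.
unmapped = {m: t for m, t in model_types.items() if t not in floor_types}
```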
Cycle and process definition
For discrete manufacturing, does the cycle boundary align with a discrete machine operation signal?
For continuous processes, is the cycle boundary based on a clearly defined time interval agreed upon with the customer?
Do cycle running time and downtime sum correctly to expected totals (e.g., 24 hours where applicable)?
If using in-cycle downtime, are Net vs. Gross cycle times handled correctly?
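The time-accounting questions above reduce to simple arithmetic checks. A sketch, assuming one day of data for one machine (hours are illustrative):

```python
# One day of cycle data for a single machine (hours); values are illustrative.
running_hours = 18.5
downtime_hours = 5.5
day_total_ok = abs((running_hours + downtime_hours) - 24.0) < 1e-6

# Net vs. gross cycle time when downtime occurs inside a cycle:
gross_cycle_hours = 2.0          # wall-clock duration of the cycle
in_cycle_downtime_hours = 0.25   # stoppage inside that cycle
net_cycle_hours = gross_cycle_hours - in_cycle_downtime_hours
```

If in-cycle downtime is modeled, downstream KPIs must consistently use either net or gross cycle time; mixing the two inflates or deflates totals.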
Downtime and operational logic
When machines stop, are downtime events correctly created? Are sensor readings correctly associated with downtime periods?
Are running time and downtime correctly used to compute Availability KPIs?
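The Availability computation itself is worth asserting on directly. A sketch using the standard ratio of running time to total time (values are illustrative):

```python
# Availability = running time / (running time + downtime).
running_hours = 18.5
downtime_hours = 5.5
availability = running_hours / (running_hours + downtime_hours)

# Sanity check: availability must be a proportion in [0, 1].
assert 0.0 <= availability <= 1.0
```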
Throughput and performance
Are machine speed and throughput being calculated correctly?
Are output units consistently positive and logically valid?
Are speed and output metrics correctly used in Performance KPI calculations?
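A Performance check can combine the positivity test with a plausibility bound: actual output divided by the theoretical output at ideal speed should rarely exceed 1. A sketch with invented numbers; the ideal-rate parameter is an assumption, not from any real configuration:

```python
# Performance = actual output / theoretical output at ideal speed.
ideal_rate_units_per_hour = 60.0
running_hours = 18.5
actual_output_units = 999.0

theoretical_output = ideal_rate_units_per_hour * running_hours
performance = actual_output_units / theoretical_output

# Output must be positive, and performance well above 1
# usually indicates a wrongly configured ideal rate.
assert actual_output_units > 0
assert performance <= 1.05, "Check the configured ideal rate"
```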
Output modeling
Is the cycle output field properly populated and used as the primary output metric? If not, why?
Are output calculations appropriate for both discrete and continuous processes?
Quality and defects
Are defects or losses tracked and attributable to specific machines?
If multiple defect classes exist, are they modeled separately and clearly?
Is total output always greater than or equal to defects, so that good-output calculations can never go negative?
Are outputs and defects correctly combined into a Quality KPI or equivalent metric?
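The defect guard and the Quality ratio can be tested together. A sketch using the conventional good-units-over-total-units definition (counts are illustrative):

```python
total_output = 1000
defects = 12

# Guard against negative good output before computing the KPI.
assert defects <= total_output, "Defects exceed total output"
quality = (total_output - defects) / total_output  # good units / total units
```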
Data hygiene and usability
Have intermediate or technical fields been removed from final outputs (e.g., unnecessary timestamps or calculation columns)?
Are column names clear, user-friendly, and consistent (e.g., avoiding cryptic abbreviations or underscores)?
Are key business fields clearly labeled to be discoverable by non-technical users (e.g., prefixed with “Quality” or “Loss” where appropriate)?
Are units consistently applied to relevant fields?
Have all required partitions been properly defined and implemented?
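Several of the hygiene questions can also be automated as a final-output lint. A sketch with hypothetical column names and a hypothetical list of technical fields that should never reach end users:

```python
# Final output columns; names are illustrative.
final_columns = ["Machine", "Quality Loss (units)", "Availability (%)", "Cycle Start"]
technical_columns = {"_ingest_ts", "calc_tmp"}  # should never reach final outputs

def hygiene_issues(columns):
    """Flag leaked technical fields and cryptic underscore names."""
    issues = []
    for col in columns:
        if col in technical_columns:
            issues.append(f"technical field leaked: {col}")
        if "_" in col:
            issues.append(f"cryptic name: {col}")
    return issues

issues = hygiene_issues(final_columns)  # empty list means the checks pass
```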