General Modeling Principles


Tag Data

Pipeline designs often combine high-volume, near real-time sources (tags) with other data sources. For tag data, consider the following:

  • Upstream aggregation: Determine whether time-based aggregation can be performed earlier in the pipeline to reduce record volume and improve performance.

  • Topic segmentation by machine: Evaluate whether to split tag data into a separate topic per machine:

    • Pros: Simplifies modeling and can improve processing efficiency

    • Cons: Increases operational overhead, especially if tags are frequently reassigned

  • Cycle boundary definition: Identify whether the cycle boundary is event-based or time-window-based. The optimal boundary should reflect natural process delineations.

  • Aggregation timing: Evaluate how early in the pipeline aggregation can occur. Earlier aggregation, combined with a smaller late-arriving data window, can simplify the pipeline and improve performance.
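
As an illustration of the upstream aggregation item above, the following is a minimal sketch of time-based (tumbling window) aggregation in plain Python. The TagReading type, the tag name "press1.temperature", the aggregate_by_window function, and the 60-second window are all hypothetical choices made for this example, not part of any specific pipeline product. It shows how raw readings might be reduced to one summary record per tag and window before they reach downstream stages.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TagReading:
    tag: str          # hypothetical tag name, e.g. "press1.temperature"
    timestamp: float  # epoch seconds
    value: float


def aggregate_by_window(readings, window_seconds=60):
    """Reduce raw readings to one summary record per (tag, window)."""
    buckets = defaultdict(list)
    for r in readings:
        # Align each reading to the start of its fixed-size window.
        window_start = int(r.timestamp // window_seconds) * window_seconds
        buckets[(r.tag, window_start)].append(r.value)

    return [
        {
            "tag": tag,
            "window_start": window_start,
            "count": len(values),
            "mean": sum(values) / len(values),
            "min": min(values),
            "max": max(values),
        }
        for (tag, window_start), values in sorted(buckets.items())
    ]


# Three raw readings collapse into two window-level records.
readings = [
    TagReading("press1.temperature", 0.0, 70.0),
    TagReading("press1.temperature", 20.0, 72.0),
    TagReading("press1.temperature", 65.0, 80.0),
]
summary = aggregate_by_window(readings, window_seconds=60)
```

This sketch assumes all readings for a window are available when the function runs. In a streaming pipeline, the late-arriving data window determines how long a bucket must stay open before its summary can be emitted, which is why a smaller window keeps this step simpler.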

Tag Data Modeling Checklist:

Transactional (Database) Data

For transactional data sources, consider the following:

  • Late-arriving data window: Assess the expected lateness of records. If the window is large, merge these records with high-volume tag data as late in the modeling process as possible.

  • Time semantics: Determine whether each record represents a point in time, a time range, or has no time component. If it represents a range, consider decomposing it into point-in-time events (e.g., downtime start/stop). Refer to ETL best practices for examples.

  • Data mutability: Identify whether records may change over time (e.g., upserts or deletes). If so, ensure each record has a unique ID and an operation indicator (U/D).

  • Joins and enrichment: Evaluate whether “left join” logic is required. If possible, push joins upstream into the source query. If not, consider the Lookup table design pattern within the pipeline.

  • Multi-machine associations: When attaching data to multiple machines, assess the trade-off between duplicating records per machine (higher message volume) and using forward fill on a non-machine partition (more stateful processing).
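
The time semantics and data mutability items above can be sketched together in plain Python. This is an illustrative example under stated assumptions: the change-record layout (op, id, record), the field names (machine, start, stop), and the functions apply_changes and to_point_events are hypothetical, and "U" is treated here as an upsert and "D" as a delete. The sketch first collapses an ordered change stream into the current state per record ID, then decomposes each remaining downtime range into start and stop point-in-time events.

```python
def apply_changes(changes):
    """Collapse an ordered change stream into current state, keyed by record ID."""
    state = {}
    for change in changes:
        if change["op"] == "U":
            # Upsert: insert the record or overwrite the previous version.
            state[change["id"]] = change["record"]
        elif change["op"] == "D":
            # Delete: drop the record if it is present.
            state.pop(change["id"], None)
    return state


def to_point_events(record_id, record):
    """Split one time-range record into a start event and a stop event."""
    return [
        {"id": record_id, "machine": record["machine"],
         "event": "downtime_start", "timestamp": record["start"]},
        {"id": record_id, "machine": record["machine"],
         "event": "downtime_stop", "timestamp": record["stop"]},
    ]


# Hypothetical change stream: record 1 is corrected, record 2 is deleted.
changes = [
    {"op": "U", "id": 1, "record": {"machine": "press1", "start": 100, "stop": 160}},
    {"op": "U", "id": 2, "record": {"machine": "press2", "start": 120, "stop": 130}},
    {"op": "U", "id": 1, "record": {"machine": "press1", "start": 100, "stop": 180}},
    {"op": "D", "id": 2, "record": None},
]

current = apply_changes(changes)
events = sorted(
    (e for rid, rec in current.items() for e in to_point_events(rid, rec)),
    key=lambda e: e["timestamp"],
)
```

Without a unique ID, the corrected version of record 1 could not be matched to the version it replaces, and without the operation indicator the deletion of record 2 could not be distinguished from an update. Once ranges are expressed as point events, they can be ordered on the same timeline as tag data.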