Tag Data
Pipeline designs often combine high-volume, near real-time sources (tags) with other data sources. For tag data, consider the following:
Upstream aggregation: Determine whether time-based aggregation can be performed earlier in the pipeline to reduce record volume and improve performance.
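As an illustration of upstream time-based aggregation, the sketch below buckets raw tag readings into tumbling windows and emits one aggregate record per tag per window. The record shape (`tag`, `ts`, `value`) and the window size are assumptions for the example, not a prescribed schema:

```python
from collections import defaultdict
from statistics import mean

def aggregate_tags(records, window_seconds=60):
    """Group raw tag readings into tumbling time windows and emit one
    aggregate per (tag, window) - a sketch of a pre-aggregation step
    that reduces record volume before the rest of the pipeline."""
    buckets = defaultdict(list)
    for rec in records:  # rec: {"tag": str, "ts": epoch seconds, "value": float}
        window_start = rec["ts"] - (rec["ts"] % window_seconds)
        buckets[(rec["tag"], window_start)].append(rec["value"])
    return [
        {"tag": tag, "window_start": ws, "count": len(vals),
         "min": min(vals), "max": max(vals), "avg": mean(vals)}
        for (tag, ws), vals in sorted(buckets.items())
    ]
```

Three raw readings for one tag across two windows collapse to two aggregate records, and the reduction grows with the sampling rate.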
Topic segmentation by machine:
Pros: Simplifies modeling and can improve processing efficiency.
Cons: Increases operational overhead, especially if tags are frequently reassigned.
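The trade-off above can be made concrete with a small routing function. The topic naming scheme and the tag-to-machine assignment map are hypothetical; the point is that reassigning a tag only means updating the map, while consumers must cope with topic churn:

```python
def topic_for(record, assignments, default_topic="tags.unassigned"):
    """Route a tag record to a per-machine topic using a tag->machine
    assignment map (names are illustrative). Segmentation keeps each
    topic's data model simple, but the map becomes operational state
    that must be kept current as tags are reassigned."""
    machine = assignments.get(record["tag"])
    return f"tags.{machine}" if machine else default_topic
```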
Cycle boundary definition: Identify whether the cycle boundary is event-based or time-window-based. The optimal boundary should reflect natural process delineations.
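An event-based boundary can be sketched as a stream partitioner that cuts whenever a boundary signal arrives; a time-window-based boundary would instead cut on fixed timestamps. The `cycle_start` signal name is a stand-in for whatever marks the natural process delineation:

```python
def split_cycles(events, boundary_signal="cycle_start"):
    """Partition an ordered event stream into cycles using an
    event-based boundary (a hypothetical 'cycle_start' signal).
    Each cycle begins at a boundary event and runs until the next one."""
    cycles, current = [], []
    for ev in events:
        if ev["signal"] == boundary_signal and current:
            cycles.append(current)
            current = []
        current.append(ev)
    if current:
        cycles.append(current)
    return cycles
```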
Aggregation timing: Evaluate how early in the pipeline aggregation can be pushed. Earlier aggregation, combined with a smaller late-arriving data window, can simplify the pipeline and improve performance.
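The interaction between aggregation timing and the late-arriving data window can be sketched with a watermark check: a window is only closed and emitted once the watermark has passed its end plus an allowed-lateness margin. Real stream processors (e.g., Flink or Beam) manage this state internally; this is a minimal model of the idea, with illustrative names:

```python
def emit_closed_windows(open_windows, watermark, lateness_seconds=120):
    """Close and emit aggregation windows once the watermark exceeds the
    window end plus an allowed-lateness margin. A smaller margin lets
    windows close (and aggregates emit) sooner, at the cost of dropping
    records that arrive later than the margin allows."""
    closed, still_open = [], {}
    for window_end, agg in open_windows.items():
        if watermark >= window_end + lateness_seconds:
            closed.append((window_end, agg))
        else:
            still_open[window_end] = agg
    return closed, still_open
```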
Tag Data Modeling Checklist:

Transactional (Database) Data
For transactional data sources, consider the following:
Late-arriving data window: Assess the expected lateness of records. If the window is large, defer merging with high-volume tag data until as late in the modeling process as possible.
Time semantics: Determine whether each record represents a point in time, a time range, or has no time component. If it represents a range, consider decomposing it into point-in-time events (e.g., downtime start/stop). Refer to ETL best practices for examples.
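Decomposing a range into point-in-time events can be as simple as splitting each record into a start and a stop event. The field names (`kind`, `start`, `end`) are illustrative, echoing the downtime start/stop example above:

```python
def decompose_range(record):
    """Split a time-range record (e.g., a downtime interval) into two
    point-in-time events that can be merged with a point-in-time tag
    stream. The shared id lets downstream logic re-pair the events."""
    return [
        {"ts": record["start"], "event": record["kind"] + "_start", "id": record["id"]},
        {"ts": record["end"],   "event": record["kind"] + "_stop",  "id": record["id"]},
    ]
```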
Data mutability: Identify whether records may change over time (e.g., upserts or deletes). If so, ensure each record has a unique ID and an operation indicator (e.g., U for upsert, D for delete).
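With a unique ID and an op indicator in place, mutable records reduce to replaying a change stream onto keyed state. A minimal sketch, assuming the U/D convention above:

```python
def apply_changes(state, changes):
    """Replay a change stream onto keyed state. Each change carries a
    unique id and an op indicator: 'U' upserts the record's data,
    'D' deletes it (idempotently, if the id is already gone)."""
    for change in changes:
        if change["op"] == "U":
            state[change["id"]] = change["data"]
        elif change["op"] == "D":
            state.pop(change["id"], None)
    return state
```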
Joins and enrichment: Evaluate whether “left join” logic is required. If possible, push joins upstream into the source query. If not, consider the Lookup table design pattern within the pipeline.
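When the join cannot be pushed into the source query, the Lookup table pattern keeps a small reference dataset in the pipeline and enriches each record against it. A sketch, with a plain dict standing in for the cached lookup table and the field names chosen for illustration:

```python
def enrich(records, lookup, key="machine_id", default=None):
    """Left-join-style enrichment: attach reference data from an
    in-pipeline lookup table. Records with no match keep flowing with
    a default value, mirroring left-join semantics."""
    for rec in records:
        yield {**rec, "machine_info": lookup.get(rec[key], default)}
```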
Multi-machine associations: When attaching data to multiple machines, assess trade-offs between duplicating records per machine (higher message volume) versus using forward fill on a non-machine partition (increased stateful processing).
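The duplication side of this trade-off can be sketched as a fan-out step: one source record becomes one copy per associated machine, so each per-machine partition is self-contained at the cost of message volume. The field name `machine_id` is illustrative:

```python
def fan_out(record, machine_ids):
    """Duplicate one record per associated machine so every per-machine
    partition carries its own copy - trading higher message volume for
    avoiding the stateful forward fill the alternative design needs."""
    return [{**record, "machine_id": m} for m in machine_ids]
```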