Data Modeling at Sight Machine


Data Modeling is the series of design, organization, and configuration decisions that turn data received from clients into a form that satisfies defined use cases and supports Data Scientists.

Key Terms

Asset Isolation: Separating assets in the pipeline so that the process for each asset is clear. When done incorrectly, fanouts appear in the pipeline, indicating that it is organized around something other than the asset itself.

Template Usage: Generalizing a repeatable process into a reusable operator. When done properly, data processes are easier to identify, similar data can be integrated quickly, and the canvas is more readable.
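A minimal sketch of the idea in Python: a repeatable process (field renaming plus rescaling) is generalized into one reusable operator and then configured for two similar assets. The function and field names here are illustrative assumptions, not Sight Machine APIs.

```python
# Template usage sketch: one parameterized operator reused across
# similar assets. Names and fields are hypothetical examples.

def make_normalize_operator(field_map, scale):
    """Build an operator that renames fields and rescales values."""
    def operator(record):
        return {new: record[old] * scale for old, new in field_map.items()}
    return operator

# The same template, configured per asset instead of copy-pasted:
press_op = make_normalize_operator({"temp_c": "temperature"}, scale=1.0)
oven_op = make_normalize_operator({"tmp": "temperature"}, scale=1.0)

print(press_op({"temp_c": 21.5}))  # {'temperature': 21.5}
print(oven_op({"tmp": 180.0}))     # {'temperature': 180.0}
```

Because each configured operator is just the template plus parameters, similar data can be integrated by supplying a new configuration rather than a new process.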

Data Dictionary Instrumentation: Using data dictionaries to configure pipelines with the required topics, tags, and values to run operators. These should be used in every pipeline, though design may vary by use case and client need.
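To make the concept concrete, here is a minimal sketch assuming a simple dict-based dictionary format (the schema, topic name, and tag fields are hypothetical, not Sight Machine's actual format). The dictionary maps each incoming topic to the tags an operator needs to run.

```python
# Data dictionary sketch: configuration that tells operators which
# topics, tags, and values to use. Structure is a hypothetical example.

DATA_DICTIONARY = {
    "plant_a/press_1": {
        "tags": {"temp_c": {"field": "temperature", "unit": "C"}},
    },
}

def tags_for_topic(topic):
    """Look up the tag configuration that drives an operator."""
    entry = DATA_DICTIONARY.get(topic)
    if entry is None:
        raise KeyError(f"topic {topic!r} missing from data dictionary")
    return entry["tags"]

print(tags_for_topic("plant_a/press_1"))
```

Driving operators from a lookup like this keeps pipeline behavior in configuration, so a design can vary by use case and client without changing pipeline code.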

Canvas Readability/Organization: Ensuring pipelines are easy to read and understand. Operations should be intuitively named, and data grouped in the same areas of the canvas should generally be processed in similar ways.

Tag Triage: Ensuring all tags are defined in a data dictionary.

RTFs/KPIs: Runtime fields (RTFs) are computed at runtime using a SQL-like formula syntax instead of being precomputed in the pipeline. They appear as normal fields across the platform and help reduce pipeline complexity and restreams. KPIs (Key Performance Indicators) represent key production outcomes and, once configured, can be used across various use cases.
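A minimal sketch of the runtime-field idea: the value is computed on read from a formula rather than precomputed and stored by the pipeline. The formula registry, field names, and KPI (scrap rate) below are illustrative assumptions, not Sight Machine's formula syntax.

```python
# Runtime-field sketch: derived values are evaluated on demand, so
# changing a formula does not require restreaming the pipeline.
# All names here are hypothetical examples.

RTF_FORMULAS = {
    # Hypothetical KPI-style field: scrap as a fraction of total output.
    "scrap_rate": lambda rec: rec["scrap_count"] / rec["total_count"],
}

def get_field(record, name):
    """Return a stored field, or evaluate a runtime field on demand."""
    if name in record:
        return record[name]
    return RTF_FORMULAS[name](record)

cycle = {"total_count": 200, "scrap_count": 5}
print(get_field(cycle, "scrap_rate"))  # 0.025
```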

Performance and Storage Usage: Monitoring and optimizing factors such as savepoint storage, late arriving windows, store/retrieve bottlenecks, Kafka topic size, and high-cardinality partitions.
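As one concrete check from this list, here is a minimal sketch for flagging high-cardinality partition keys; the record shape and field names are illustrative assumptions.

```python
# Partition-key cardinality sketch: a key whose distinct-value count
# grows with the record count (e.g. a serial number) is a poor
# partition choice. Field names are hypothetical examples.

def partition_cardinality(records, key):
    """Count the distinct values a candidate partition key takes."""
    return len({r[key] for r in records})

records = [
    {"machine": "press_1", "serial": "A001"},
    {"machine": "press_1", "serial": "A002"},
    {"machine": "press_2", "serial": "A003"},
]
print(partition_cardinality(records, "machine"))  # 2: bounded, safe key
print(partition_cardinality(records, "serial"))   # 3: grows per record
```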