- Print
- DarkLight
Aggregate
- Print
- DarkLight
Stateful Operator that aggregates multiple records into one record based on an explicit end of window / bin flag, boundary_field
. A function is applied to the values of each record field to determine the value in the output record.
This operator is a variation of the Reduce Operator. The difference between the Reduce Operator and the Aggregate Operator is the Reduce Operator groups output by changes of the values of a set of primary keys, primary_key_fields
, where as the Aggregate Operator groups based on a boolean boundary flag, boundary_flag
, becoming true. Additionally, the Aggregate Operator has functionality to emit output a window with retractions before it closes if the emit_window
option is set to each_update.
Input Data
A single field on the input stream must be of data type boolean
for use as a boundary flag to indicate the end of a window / bin.
Configuration Settings
The following settings are supported:
boundary_field (required): The name of the
boolean
field indicating the end of a window / bin.emit_window (required): Flag indicating if operator should emit an output each watermark update, or when a window is marked closed on a watermark update. The field must be one of the following values:
when_complete: Output window result on boundary field being true and the pipeline state is being synchronized. This option only outputs a record for the window once all the data up to the boundary flag for the window is processed. This is useful if working state is not needed. Complete cycle records vs partial cycle records.
each_update: Output window current result on change each pipeline state synchronization. This option outputs partial window states and may issue retractions and updates each state synchronization. This may impact the pipeline performance downstream. This option is useful for things like KPIs where we group the data by date, but want a latest to date update.
fields (required): (ONE_OR_MORE) A list of window function transforms and its parameters for the input record to the per window output record. See each of the window function descriptions below for details on their output behavior. Each field configuration consists of at least:
function (required): Name of the function to use to calculate the output value in the window, e.g.
mean
orfirst
. Acceptable functions are listed below.to_field (required): Name of the Output-Record field that will hold the result of the function. Values of
to_field
must be unique in the output record.*: Each window function can define additional required and optional parameters. See the function descriptions below for details.
Output Data
The format of the output record will primarily be decided by the fields-window-function description. The record annotations of the input fields are copied to the output fields.
A default-aggregation annotation is added to any fields calculated from an aggregation function that has an equivalent function in the data visualization application.
Examples
PER-MACHINE CYCLES
The operator reports the minimum and maximum temperature for each cycle, where cycles are terminated when the cycle_end
field is true
. The timezone
and cycle_end
fields are not included in the output. Since cycle_end
is not used as the input to a function, it will be added to the configuration when auto-draft is run.
Input
Data
timestamp
next_timestamp
timezone
machine
temperature
cycle_end
2019-01-01 11:00:00
2019-01-01 11:00:10
America/Detroit
can
72.0
false
2019-01-01 11:00:10
2019-01-01 11:00:20
America/Detroit
can
73.0
false
2019-01-01 11:00:20
2019-01-01 11:00:30
America/Detroit
can
72.0
false
2019-01-01 11:00:30
2019-01-01 11:00:40
America/Detroit
can
73.0
true
2019-01-01 13:00:00
2019-01-01 13:00:10
America/Detroit
can
73.0
false
2019-01-01 13:00:10
2019-01-01 13:00:20
America/Detroit
can
85.0
false
2019-01-01 13:00:20
2019-01-01 13:00:30
America/Detroit
can
93.0
false
2019-01-01 13:00:30
2019-01-01 13:00:40
America/Detroit
can
86.0
true
Record Schema
Field Name
Field Data Type
Field Annotations
original-name
stream-type
stat-type
measurement-scale
unit-of-measure
timestamp
Instant
next_timestamp
Instant
timezone
ZoneId
machine
Text
temperature
double
temperature
sensor_values
Continuous
thermodynamic_temperature
celsius
cycle_end
boolean
Configuration
{ "partition_by": ["machine"], "boundary_field": "cycle_end", "emit_window": "when_complete", "fields": [ { "function": "min", "from_field": "temperature", "to_field": "min_temperature" }, { "function": "max", "from_field": "temperature", "to_field": "max_temperature" }, { "function": "first", "from_field": "timestamp", "to_field": "start_time" }, { "function": "last", "from_field": "next_timestamp", "to_field": "end_time" }, { "function": "ignore", "from_field": "timezone", "to_field": "timezone" } ] }
Output
Data
timestamp
min_temperature
max_temperature
start_time
end_time
2019-01-01 11:00:00
72.0
73.0
2019-01-01 11:00:00
2019-01-01 11:00:40
2019-01-01 13:00:00
73.0
93.0
2019-01-01 13:00:00
2019-01-01 13:00:40
Record Schema
Field Name
Field Data Type
Field Annotations
original-name
stream-type
stat-type
measurement-scale
unit-of-measure
default-aggregation
timestamp
Instant
min_temperature
double
temperature
sensor_values
Continuous
thermodynamic_temperature
celsius
min
max_temperature
double
temperature
sensor_values
Continuous
thermodynamic_temperature
celsius
max
start_time
Instant
first
end_time
Instant
last
Aggregate Window Functions
SUM
Description
Sum of non-null values in the window. Set to 0 if no non-null values present.
Field Configuration Settings
The following settings are supported:
function (required): This must be set to sum.
to_field (required): Name of the Output-Record field that will hold the result of the function.
from_field (required): Name of the Input-Record field that the function will read for inputs. The data type of the named field must be
double
.
FIRST
Description
Returns the first value in the window. Ignores null values by default.
Field Configuration Settings
The following settings are supported:
function (required): This must be set to first.
to_field (required): Name of the Output-Record field that will hold the result of the function.
from_field (required): Name of the Input-Record field that the function will read for inputs.
include_nulls (optional): Boolean setting that causes the function to include nulls when choosing the first value in the window. Defaults to
false
if not provided.
LAST
Description
Returns the last value in the window. Ignores null values by default.
Field Configuration Settings
The following settings are supported:
function (required): This must be set to last.
to_field (required): Name of the Output-Record field that will hold the result of the function.
from_field (required): Name of the Input-Record field that the function will read for inputs.
include_nulls (optional): Boolean setting that causes the function to include nulls when choosing the last value in the window. Defaults to
false
if not provided.
MIN
Description
Minimum value in the window. Nulls sort low.
Field Configuration Settings
The following settings are supported:
function (required): This must be set to min.
to_field (required): Name of the Output-Record field that will hold the result of the function.
from_field (required): Name of the Input-Record field that the function will read for inputs.
MAX
Description
Maximum value in the window. Nulls sort low.
Field Configuration Settings
The following settings are supported:
function (required): This must be set to max.
to_field (required): Name of the Output-Record field that will hold the result of the function.
from_field (required): Name of the Input-Record field that the function will read for inputs.
MEAN
Description
Mean of non-null values in the window. Set to null if no non-null values present.
Field Configuration Settings
The following settings are supported:
function (required): This must be set to mean.
to_field (required): Name of the Output-Record field that will hold the result of the function.
from_field (required): Name of the Input-Record field that the function will read for inputs. The data type of the named field must be
double
.
WEIGHTED MEAN
Description
Weighted mean of the non-null values in the window. To compute this value, the product of each value and its corresponding weight is summed and then divided by the sum of all the weights. Weights must be 0 or greater, or the function raises a warning marker indicating that a negative weight was used, then treats that weight as 0. If any weight is infinite or if all the weights are 0, the result is null. If one or more values are positive infinity and there are no negative infinity values, the result is positive infinity. The same is true for negative infinities. If there are both positive infinity and negative infinity values in the window (with an associated weight that is positive), the result is null.
Field Configuration Settings
The following settings are supported:
function (required): This must be set to weighted_mean.
to_field (required): Name of the Output-Record field that will hold the result of the function.
value_field (required): Name of the Input-Record field that the function will read for value inputs. The data type of the named field must be
double
.weight_field (required): Name of the Input-Record field that the function will read for weight inputs. The data type of the named field must be
double
. Weights should be >= 0.
STANDARD DEVIATION
Description
Estimate of population standard deviation from non-null values in the window. Set to null if no non-null values are present, or if one or more values are too large to store.
This function applies Bessel's correction to the result to account for bias in the sample. As a special case, if there is exactly one non-null input value then the result will be zero instead of infinity.
The standard deviation function restricts the absolute value of its input to be less than 1012. Any window containing a value larger than 1012 will produce a null result. Additionally, this function cannot store values smaller than 10-15. Such values will be treated as if they were zero instead.
Field Configuration Settings
The following settings are supported:
function (required): This must be set to standard_deviation.
to_field (required): Name of the Output-Record field that will hold the result of the function.
from_field (required): Name of the Input-Record field that the function will read for inputs. The data type of the named field must be
double
.
UNION
Description
Union of values in all the arrays in the window, in ascending order. The result is an array of distinct non-null elements found in all the arrays in the window.
Field Configuration Settings
The following settings are supported:
function (required): This must be set to union.
to_field (required): Name of the Output-Record field that will hold the result of the function.
from_field (required): Name of the Input-Record field that the function will read for inputs. This field must be an array type.
UNIQUE
Description
All distinct values in the window in ascending order. This function was previously known as set
.
Field Configuration Settings
The following settings are supported:
function (required): This must be set to unique.
to_field (required): Name of the Output-Record field that will hold the result of the function.
from_field (required): Name of the Input-Record field that the function will read for inputs.
IGNORE
Description
Produces no output. The from_field
specified for this field will be ignored by autodraft.
Field Configuration Settings
The following settings are supported:
function (required): This must be set to ignore.
to_field (required): Unused, since this function does not produce any output, but must be unique. Setting the value equal to
from_field
is recommended.from_field (required): Name of the field on the input record to ignore.