Aggregate

Print
Share
Dark
Light

Aggregate

Print
Share
Dark
Light

Article summary

Did you find this summary helpful?

Thank you for your feedback

Stateful Operator that aggregates multiple records into one record based on an explicit end of window / bin flag, boundary_field. A function is applied to the values of each record field to determine the value in the output record.

This operator is a variation of the Reduce Operator. The difference between the Reduce Operator and the Aggregate Operator is the Reduce Operator groups output by changes of the values of a set of primary keys, primary_key_fields, where as the Aggregate Operator groups based on a boolean boundary flag, boundary_flag, becoming true. Additionally, the Aggregate Operator has functionality to emit output a window with retractions before it closes if the emit_window option is set to each_update.

Input Data

A single field on the input stream must be of data type boolean for use as a boundary flag to indicate the end of a window / bin.

Configuration Settings

The following settings are supported:

boundary_field (required): The name of the boolean field indicating the end of a window / bin.
emit_window (required): Flag indicating if operator should emit an output each watermark update, or when a window is marked closed on a watermark update. The field must be one of the following values:
- when_complete: Output window result on boundary field being true and the pipeline state is being synchronized. This option only outputs a record for the window once all the data up to the boundary flag for the window is processed. This is useful if working state is not needed. Complete cycle records vs partial cycle records.
- each_update: Output window current result on change each pipeline state synchronization. This option outputs partial window states and may issue retractions and updates each state synchronization. This may impact the pipeline performance downstream. This option is useful for things like KPIs where we group the data by date, but want a latest to date update.
fields (required): (ONE_OR_MORE) A list of window function transforms and its parameters for the input record to the per window output record. See each of the window function descriptions below for details on their output behavior. Each field configuration consists of at least:
- function (required): Name of the function to use to calculate the output value in the window, e.g. mean or first. Acceptable functions are listed below.
- to_field (required): Name of the Output-Record field that will hold the result of the function. Values of to_field must be unique in the output record.
- *: Each window function can define additional required and optional parameters. See the function descriptions below for details.

Output Data

The format of the output record will primarily be decided by the fields-window-function description. The record annotations of the input fields are copied to the output fields.

A default-aggregation annotation is added to any fields calculated from an aggregation function that has an equivalent function in the data visualization application.

Examples

PER-MACHINE CYCLES

The operator reports the minimum and maximum temperature for each cycle, where cycles are terminated when the cycle_end field is true. The timezone and cycle_end fields are not included in the output. Since cycle_end is not used as the input to a function, it will be added to the configuration when auto-draft is run.

Input

Data

timestamp
next_timestamp
timezone
machine
temperature
cycle_end
2019-01-01 11:00:00
2019-01-01 11:00:10
America/Detroit
can
72.0
false
2019-01-01 11:00:10
2019-01-01 11:00:20
America/Detroit
can
73.0
false
2019-01-01 11:00:20
2019-01-01 11:00:30
America/Detroit
can
72.0
false
2019-01-01 11:00:30
2019-01-01 11:00:40
America/Detroit
can
73.0
true
2019-01-01 13:00:00
2019-01-01 13:00:10
America/Detroit
can
73.0
false
2019-01-01 13:00:10
2019-01-01 13:00:20
America/Detroit
can
85.0
false
2019-01-01 13:00:20
2019-01-01 13:00:30
America/Detroit
can
93.0
false
2019-01-01 13:00:30
2019-01-01 13:00:40
America/Detroit
can
86.0
true

Record Schema

Field Name
Field Data Type
Field Annotations
original-name
stream-type
stat-type
measurement-scale
unit-of-measure
timestamp
Instant
next_timestamp
Instant
timezone
ZoneId
machine
Text
temperature
double
temperature
sensor_values
Continuous
thermodynamic_temperature
celsius
cycle_end
boolean

Configuration

{
  "partition_by": ["machine"],
  "boundary_field": "cycle_end",
  "emit_window": "when_complete",
  "fields": [
    {
      "function": "min",
      "from_field": "temperature",
      "to_field": "min_temperature"
    },
    {
      "function": "max",
      "from_field": "temperature",
      "to_field": "max_temperature"
    },
    {
      "function": "first",
      "from_field": "timestamp",
      "to_field": "start_time"
    },
    {
      "function": "last",
      "from_field": "next_timestamp",
      "to_field": "end_time"
    },
    {
      "function": "ignore",
      "from_field": "timezone",
      "to_field": "timezone"
    }
  ]
}

Output

Data

timestamp
min_temperature
max_temperature
start_time
end_time
2019-01-01 11:00:00
72.0
73.0
2019-01-01 11:00:00
2019-01-01 11:00:40
2019-01-01 13:00:00
73.0
93.0
2019-01-01 13:00:00
2019-01-01 13:00:40

Record Schema

Field Name
Field Data Type
Field Annotations
original-name
stream-type
stat-type
measurement-scale
unit-of-measure
default-aggregation
timestamp
Instant
min_temperature
double
temperature
sensor_values
Continuous
thermodynamic_temperature
celsius
min
max_temperature
double
temperature
sensor_values
Continuous
thermodynamic_temperature
celsius
max
start_time
Instant
first
end_time
Instant
last

Aggregate Window Functions

SUM

Description

Sum of non-null values in the window. Set to 0 if no non-null values present.

Field Configuration Settings

The following settings are supported:

function (required): This must be set to sum.
to_field (required): Name of the Output-Record field that will hold the result of the function.
from_field (required): Name of the Input-Record field that the function will read for inputs. The data type of the named field must be double.

FIRST

Description

Returns the first value in the window. Ignores null values by default.

Field Configuration Settings

The following settings are supported:

function (required): This must be set to first.
to_field (required): Name of the Output-Record field that will hold the result of the function.
from_field (required): Name of the Input-Record field that the function will read for inputs.
include_nulls (optional): Boolean setting that causes the function to include nulls when choosing the first value in the window. Defaults to false if not provided.

LAST

Description

Returns the last value in the window. Ignores null values by default.

Field Configuration Settings

The following settings are supported:

function (required): This must be set to last.
to_field (required): Name of the Output-Record field that will hold the result of the function.
from_field (required): Name of the Input-Record field that the function will read for inputs.
include_nulls (optional): Boolean setting that causes the function to include nulls when choosing the last value in the window. Defaults to false if not provided.

MIN

Description

Minimum value in the window. Nulls sort low.

Field Configuration Settings

The following settings are supported:

function (required): This must be set to min.
to_field (required): Name of the Output-Record field that will hold the result of the function.
from_field (required): Name of the Input-Record field that the function will read for inputs.

MAX

Description

Maximum value in the window. Nulls sort low.

Field Configuration Settings

The following settings are supported:

function (required): This must be set to max.
to_field (required): Name of the Output-Record field that will hold the result of the function.
from_field (required): Name of the Input-Record field that the function will read for inputs.

MEAN

Description

Mean of non-null values in the window. Set to null if no non-null values present.

Field Configuration Settings

The following settings are supported:

function (required): This must be set to mean.
to_field (required): Name of the Output-Record field that will hold the result of the function.
from_field (required): Name of the Input-Record field that the function will read for inputs. The data type of the named field must be double.

WEIGHTED MEAN

Description

Weighted mean of the non-null values in the window. To compute this value, the product of each value and its corresponding weight is summed and then divided by the sum of all the weights. Weights must be 0 or greater, or the function raises a warning marker indicating that a negative weight was used, then treats that weight as 0. If any weight is infinite or if all the weights are 0, the result is null. If one or more values are positive infinity and there are no negative infinity values, the result is positive infinity. The same is true for negative infinities. If there are both positive infinity and negative infinity values in the window (with an associated weight that is positive), the result is null.

Field Configuration Settings

The following settings are supported:

function (required): This must be set to weighted_mean.
to_field (required): Name of the Output-Record field that will hold the result of the function.
value_field (required): Name of the Input-Record field that the function will read for value inputs. The data type of the named field must be double.
weight_field (required): Name of the Input-Record field that the function will read for weight inputs. The data type of the named field must be double. Weights should be >= 0.

STANDARD DEVIATION

Description

Estimate of population standard deviation from non-null values in the window. Set to null if no non-null values are present, or if one or more values are too large to store.

This function applies Bessel's correction to the result to account for bias in the sample. As a special case, if there is exactly one non-null input value then the result will be zero instead of infinity.

The standard deviation function restricts the absolute value of its input to be less than 10¹². Any window containing a value larger than 10¹² will produce a null result. Additionally, this function cannot store values smaller than 10^-15. Such values will be treated as if they were zero instead.

Field Configuration Settings

The following settings are supported:

function (required): This must be set to standard_deviation.
to_field (required): Name of the Output-Record field that will hold the result of the function.
from_field (required): Name of the Input-Record field that the function will read for inputs. The data type of the named field must be double.

UNION

Description

Union of values in all the arrays in the window, in ascending order. The result is an array of distinct non-null elements found in all the arrays in the window.

Field Configuration Settings

The following settings are supported:

function (required): This must be set to union.
to_field (required): Name of the Output-Record field that will hold the result of the function.
from_field (required): Name of the Input-Record field that the function will read for inputs. This field must be an array type.

UNIQUE

Description

All distinct values in the window in ascending order. This function was previously known as set.

Field Configuration Settings

The following settings are supported:

function (required): This must be set to unique.
to_field (required): Name of the Output-Record field that will hold the result of the function.
from_field (required): Name of the Input-Record field that the function will read for inputs.

IGNORE

Description

Produces no output. The from_field specified for this field will be ignored by autodraft.

Field Configuration Settings

The following settings are supported:

function (required): This must be set to ignore.
to_field (required): Unused, since this function does not produce any output, but must be unique. Setting the value equal to from_field is recommended.
from_field (required): Name of the field on the input record to ignore.

What's Next

Dynamic Operators

Table of contents

Input Data
Configuration Settings
Output Data
Aggregate Window Functions

timestamp	next_timestamp	timezone	machine	temperature	cycle_end
2019-01-01 11:00:00	2019-01-01 11:00:10	America/Detroit	can	72.0	false
2019-01-01 11:00:10	2019-01-01 11:00:20	America/Detroit	can	73.0	false
2019-01-01 11:00:20	2019-01-01 11:00:30	America/Detroit	can	72.0	false
2019-01-01 11:00:30	2019-01-01 11:00:40	America/Detroit	can	73.0	true
2019-01-01 13:00:00	2019-01-01 13:00:10	America/Detroit	can	73.0	false
2019-01-01 13:00:10	2019-01-01 13:00:20	America/Detroit	can	85.0	false
2019-01-01 13:00:20	2019-01-01 13:00:30	America/Detroit	can	93.0	false
2019-01-01 13:00:30	2019-01-01 13:00:40	America/Detroit	can	86.0	true

Field Name	Field Data Type	Field Annotations
Field Name	Field Data Type	original-name	stream-type	stat-type	measurement-scale	unit-of-measure
timestamp	Instant
next_timestamp	Instant
timezone	ZoneId
machine	Text
temperature	double	temperature	sensor_values	Continuous	thermodynamic_temperature	celsius
cycle_end	boolean

timestamp	min_temperature	max_temperature	start_time	end_time
2019-01-01 11:00:00	72.0	73.0	2019-01-01 11:00:00	2019-01-01 11:00:40
2019-01-01 13:00:00	73.0	93.0	2019-01-01 13:00:00	2019-01-01 13:00:40