Aggregate

Prev Next

Stateful Operator that aggregates multiple records into one record based on an explicit end of window / bin flag, boundary_field. A function is applied to the values of each record field to determine the value in the output record.

This operator is a variation of the Reduce Operator. The difference between the Reduce Operator and the Aggregate Operator is the Reduce Operator groups output by changes of the values of a set of primary keys, primary_key_fields, where as the Aggregate Operator groups based on a boolean boundary flag, boundary_flag, becoming true. Additionally, the Aggregate Operator has functionality to emit output a window with retractions before it closes if the emit_window option is set to each_update.

Input Data

A single field on the input stream must be of data type boolean for use as a boundary flag to indicate the end of a window / bin.

Configuration Settings

The following settings are supported:

  • boundary_field (required): The name of the boolean field indicating the end of a window / bin.

  • emit_window (required): Flag indicating if operator should emit an output each watermark update, or when a window is marked closed on a watermark update. The field must be one of the following values:

    • when_complete: Output window result on boundary field being true and the pipeline state is being synchronized. This option only outputs a record for the window once all the data up to the boundary flag for the window is processed. This is useful if working state is not needed. Complete cycle records vs partial cycle records.

    • each_update: Output window current result on change each pipeline state synchronization. This option outputs partial window states and may issue retractions and updates each state synchronization. This may impact the pipeline performance downstream. This option is useful for things like KPIs where we group the data by date, but want a latest to date update.

  • fields (required): (ONE_OR_MORE) A list of window function transforms and its parameters for the input record to the per window output record. See each of the window function descriptions below for details on their output behavior. Each field configuration consists of at least:

    • function (required): Name of the function to use to calculate the output value in the window, e.g. mean or first. Acceptable functions are listed below.

    • to_field (required): Name of the Output-Record field that will hold the result of the function. Values of to_field must be unique in the output record.

    • *: Each window function can define additional required and optional parameters. See the function descriptions below for details.

Output Data

The format of the output record will primarily be decided by the fields-window-function description. The record annotations of the input fields are copied to the output fields.

A default-aggregation annotation is added to any fields calculated from an aggregation function that has an equivalent function in the data visualization application.

Examples

PER-MACHINE CYCLES

The operator reports the minimum and maximum temperature for each cycle, where cycles are terminated when the cycle_end field is true. The timezone and cycle_end fields are not included in the output. Since cycle_end is not used as the input to a function, it will be added to the configuration when auto-draft is run.

Input

Data

timestamp

next_timestamp

timezone

machine

temperature

cycle_end

2019-01-01 11:00:00

2019-01-01 11:00:10

America/Detroit

can

72.0

false

2019-01-01 11:00:10

2019-01-01 11:00:20

America/Detroit

can

73.0

false

2019-01-01 11:00:20

2019-01-01 11:00:30

America/Detroit

can

72.0

false

2019-01-01 11:00:30

2019-01-01 11:00:40

America/Detroit

can

73.0

true

2019-01-01 13:00:00

2019-01-01 13:00:10

America/Detroit

can

73.0

false

2019-01-01 13:00:10

2019-01-01 13:00:20

America/Detroit

can

85.0

false

2019-01-01 13:00:20

2019-01-01 13:00:30

America/Detroit

can

93.0

false

2019-01-01 13:00:30

2019-01-01 13:00:40

America/Detroit

can

86.0

true

Record Schema

Field Name

Field Data Type

Field Annotations

original-name

stream-type

stat-type

measurement-scale

unit-of-measure

timestamp

Instant

next_timestamp

Instant

timezone

ZoneId

machine

Text

temperature

double

temperature

sensor_values

Continuous

thermodynamic_temperature

celsius

cycle_end

boolean

Configuration
{
  "partition_by": ["machine"],
  "boundary_field": "cycle_end",
  "emit_window": "when_complete",
  "fields": [
    {
      "function": "min",
      "from_field": "temperature",
      "to_field": "min_temperature"
    },
    {
      "function": "max",
      "from_field": "temperature",
      "to_field": "max_temperature"
    },
    {
      "function": "first",
      "from_field": "timestamp",
      "to_field": "start_time"
    },
    {
      "function": "last",
      "from_field": "next_timestamp",
      "to_field": "end_time"
    },
    {
      "function": "ignore",
      "from_field": "timezone",
      "to_field": "timezone"
    }
  ]
}

Output

Data

timestamp

min_temperature

max_temperature

start_time

end_time

2019-01-01 11:00:00

72.0

73.0

2019-01-01 11:00:00

2019-01-01 11:00:40

2019-01-01 13:00:00

73.0

93.0

2019-01-01 13:00:00

2019-01-01 13:00:40

Record Schema

Field Name

Field Data Type

Field Annotations

original-name

stream-type

stat-type

measurement-scale

unit-of-measure

default-aggregation

timestamp

Instant

min_temperature

double

temperature

sensor_values

Continuous

thermodynamic_temperature

celsius

min

max_temperature

double

temperature

sensor_values

Continuous

thermodynamic_temperature

celsius

max

start_time

Instant

first

end_time

Instant

last

Aggregate Window Functions

SUM

Description

Sum of non-null values in the window. Set to 0 if no non-null values present.

Field Configuration Settings

The following settings are supported:

  • function (required): This must be set to sum.

  • to_field (required): Name of the Output-Record field that will hold the result of the function.

  • from_field (required): Name of the Input-Record field that the function will read for inputs. The data type of the named field must be double.

FIRST

Description

Returns the first value in the window. Ignores null values by default.

Field Configuration Settings

The following settings are supported:

  • function (required): This must be set to first.

  • to_field (required): Name of the Output-Record field that will hold the result of the function.

  • from_field (required): Name of the Input-Record field that the function will read for inputs.

  • include_nulls (optional): Boolean setting that causes the function to include nulls when choosing the first value in the window. Defaults to false if not provided.

LAST

Description

Returns the last value in the window. Ignores null values by default.

Field Configuration Settings

The following settings are supported:

  • function (required): This must be set to last.

  • to_field (required): Name of the Output-Record field that will hold the result of the function.

  • from_field (required): Name of the Input-Record field that the function will read for inputs.

  • include_nulls (optional): Boolean setting that causes the function to include nulls when choosing the last value in the window. Defaults to false if not provided.

MIN

Description

Minimum value in the window. Nulls sort low.

Field Configuration Settings

The following settings are supported:

  • function (required): This must be set to min.

  • to_field (required): Name of the Output-Record field that will hold the result of the function.

  • from_field (required): Name of the Input-Record field that the function will read for inputs.

MAX

Description

Maximum value in the window. Nulls sort low.

Field Configuration Settings

The following settings are supported:

  • function (required): This must be set to max.

  • to_field (required): Name of the Output-Record field that will hold the result of the function.

  • from_field (required): Name of the Input-Record field that the function will read for inputs.

MEAN

Description

Mean of non-null values in the window. Set to null if no non-null values present.

Field Configuration Settings

The following settings are supported:

  • function (required): This must be set to mean.

  • to_field (required): Name of the Output-Record field that will hold the result of the function.

  • from_field (required): Name of the Input-Record field that the function will read for inputs. The data type of the named field must be double.

WEIGHTED MEAN

Description

Weighted mean of the non-null values in the window. To compute this value, the product of each value and its corresponding weight is summed and then divided by the sum of all the weights. Weights must be 0 or greater, or the function raises a warning marker indicating that a negative weight was used, then treats that weight as 0. If any weight is infinite or if all the weights are 0, the result is null. If one or more values are positive infinity and there are no negative infinity values, the result is positive infinity. The same is true for negative infinities. If there are both positive infinity and negative infinity values in the window (with an associated weight that is positive), the result is null.

Field Configuration Settings

The following settings are supported:

  • function (required): This must be set to weighted_mean.

  • to_field (required): Name of the Output-Record field that will hold the result of the function.

  • value_field (required): Name of the Input-Record field that the function will read for value inputs. The data type of the named field must be double.

  • weight_field (required): Name of the Input-Record field that the function will read for weight inputs. The data type of the named field must be double. Weights should be >= 0.

STANDARD DEVIATION

Description

Estimate of population standard deviation from non-null values in the window. Set to null if no non-null values are present, or if one or more values are too large to store.

This function applies Bessel's correction to the result to account for bias in the sample. As a special case, if there is exactly one non-null input value then the result will be zero instead of infinity.

The standard deviation function restricts the absolute value of its input to be less than 1012. Any window containing a value larger than 1012 will produce a null result. Additionally, this function cannot store values smaller than 10-15. Such values will be treated as if they were zero instead.

Field Configuration Settings

The following settings are supported:

  • function (required): This must be set to standard_deviation.

  • to_field (required): Name of the Output-Record field that will hold the result of the function.

  • from_field (required): Name of the Input-Record field that the function will read for inputs. The data type of the named field must be double.

UNION

Description

Union of values in all the arrays in the window, in ascending order. The result is an array of distinct non-null elements found in all the arrays in the window.

Field Configuration Settings

The following settings are supported:

  • function (required): This must be set to union.

  • to_field (required): Name of the Output-Record field that will hold the result of the function.

  • from_field (required): Name of the Input-Record field that the function will read for inputs. This field must be an array type.

UNIQUE

Description

All distinct values in the window in ascending order. This function was previously known as set.

Field Configuration Settings

The following settings are supported:

  • function (required): This must be set to unique.

  • to_field (required): Name of the Output-Record field that will hold the result of the function.

  • from_field (required): Name of the Input-Record field that the function will read for inputs.

IGNORE

Description

Produces no output. The from_field specified for this field will be ignored by autodraft.

Field Configuration Settings

The following settings are supported:

  • function (required): This must be set to ignore.

  • to_field (required): Unused, since this function does not produce any output, but must be unique. Setting the value equal to from_field is recommended.

  • from_field (required): Name of the field on the input record to ignore.