Aggregation

Aggregation is used to express data/information in a summary form. An aggregation operation groups the values of multiple rows based on the distinct criteria selected, performs a calculation on the values in those groups, and then returns a summary of the values in each group.

Three different types of aggregations are available:

Dedicated Aggregators

Deephaven provides several predefined aggregation functions:

Function

Description

Table firstBy(String... groupByColumns)

Find each distinct value or set of values for the column(s) listed in the groupByColumns argument.

Group the rows containing the same distinct value.

Return the first row for each group.

(See example table  t1 below)

Table lastBy(String... groupByColumns)

Find each distinct value or set of values for the column(s) listed in the groupByColumns argument.

Group the rows containing the same distinct value.

Return the last row for each group.

(See example tables t2 and t3 below)

Table sumBy(String... groupByColumns)

Find each distinct value or set of values for the column(s) listed in the groupByColumns argument.

Group the rows containing the same distinct value.

For each group, calculate the sum of the data in each of the other columns.

(See example table t4 below)

Table avgBy(String... groupByColumns)

Find each distinct value or set of values for the column(s) listed in the groupByColumns argument.

Group the rows containing the same distinct value.

For each group, calculate the average (mean) of the data in each of the other columns.

(See example table t5 below)

Table stdBy(String... groupByColumns)

Find each distinct value or set of values for the column(s) listed in the groupByColumns argument.

Group the rows containing the same distinct value.

For each group, calculate the standard deviation of the data in each of the other columns.

Table varBy(String... groupByColumns)

Find each distinct value or set of values for the column(s) listed in the groupByColumns argument.

Group the rows containing the same distinct value.

For each group, calculate the variance of the data in each of the other columns.

Table medianBy(String... groupByColumns)

Find each distinct value or set of values for the column(s) listed in the groupByColumns argument.

Group the rows containing the same distinct value.

For each group, calculate the median of the data in each of the other columns.

Table minBy(String... groupByColumns)

Find each distinct value or set of values for the column(s) listed in the groupByColumns argument.

Group the rows containing the same distinct value.

For each group, calculate the minimum of the data in each of the other columns.

(See example t6 below)

Table maxBy(String... groupByColumns)

Find each distinct value or set of values for the column(s) listed in the groupByColumns argument.

Group the rows containing the same distinct value.

For each group, calculate the maximum of the data in each of the other columns.

Table countBy(String countColumn, String... groupByColumns)

Find each distinct value or set of values for the column(s) listed in the groupByColumns argument.

Group the rows containing the same distinct value.

Count the number of rows in each group.

(See example t7below)

Table countBy(String countColumn)

Calculates the number of values in the column indicated in the countColumn argument.

Equivalent to by().count(countColumn).

(See example t8below)

Table headBy (n, groupByColumns)

Find each distinct value or set of values for the column(s) listed in the groupByColumns argument.

Group the rows containing the same distinct value.

Return the first n rows for each group.

(See example table t9 below)

Table tailBy (n, groupByColumns)

Find each distinct value or set of values for the column(s) listed in the groupByColumns argument.

Group the rows containing the same distinct value.

Return the last n rows for each group.

(See example table t10 below)

Note:

  • It is the user's responsibility to ensure the aggregation formula is compatible with the data type of each column being aggregated. For example, if you perform an .avgBy on a column containing strings, the query will throw an error.
  • If multiple aggregations are applied to one column without renaming the output columns, the query will throw an error; e.g, ".update("X = 42").by(AggCombo(AggSum("X"),AggAvg("X")))."

  • If there is no value in the groupByColumns argument, for methods other than countBy, grouping is not performed. countBy with no arguments will fail. 
  • If multiple values are used in the groupByColumns argument, each distinct value set is used for grouping.

Example Set 1

source

A

B

C

1

10

AAPL

2

20

IBM

3

30

C

4

40

AAPL

5

50

AA

 

t1=source.firstBy()

A

B

C

1

10

AAPL

 

t2=source.lastBy()

A

B

C

5

50

AA

 

t3=source.lastBy(C)

C

A

B

AAPL

4

40

IBM

2

20

C

3

30

AA

5

50

Example Set 2

apples

Name

Color

Price

Quantity

Gala

red

1.25

500

Fuji

red

1.35

380

Granny Smith

green

1.85

500

Honey Crisp

red

3.25

80

Golden

green

1.25

370


t4=apples.view("Color","Quantity").sumBy("Color")

Color

Quantity

red

960

green

870


t5=apples.view("Color","Price").avgBy("Color")

Color

Price

red

1.95

green

1.55


t6=apples.minBy("Price")

Price

Name

Color

Quantity

1.25

Gala

green

370

1.35

Fuji

red

380

1.85

GrannySmith

green

500

3.25

HoneyCrisp

red

80


t7=apples.countBy("Count", "Color")

Color

Count

red

3

green

2


t8=apples.countBy("Name")

Name

5

Example Set 3

The table below shows example queries where the table is aggregated by Sym and then simplified to show only two columns. Finally, the first four rows for each Sym is shown under headBy().  The last four rows for each Sym is shown under tailBy().

headBy()

tailBy()

t9 = db.t("LearnDeephaven", "StockTrades")
   .where("Date=`2017-08-25`")
   .view("Sym", "Size")
   .headBy(4, "Sym")

t10 = db.t("LearnDeephaven", "StockTrades")
   .where("Date=`2017-08-25`")
   .view("Sym", "Size")
   .tailBy(4, "Sym")

Combined Aggregations

Deephaven also allows you to perform multiple aggregations at once using the following functions:

Function

Description

ComboBy AggArray

Creates an array of all the values in the source column(s) and then aggregates the results into their own column(s).

ComboBy AggAvg

Computes an average value in the source column(s) and then aggregates the results into their own column(s).

ComboBy AggWAvg

Computes a weighted average in the source column(s) and then aggregates the results into their own column(s).

ComboBy AggCount

Counts the number of rows in the source column(s) and then aggregates the results into their own column(s).

ComboBy AggCountDistinct

Creates a distinct count aggregation. The output column contains the number of distinct values for the input column in that group.

ComboBy AggDistinct

Creates a distinct aggregation. The output column contains a DbArrayBase with the distinct values for the input column within the group. Null values are ignored. Similar to AggArray, but returns a Set.

ComboBy AggFirst

Groups rows containing the same distinct value in the source column(s), returns the first row, and then aggregates the results into their own column(s).

ComboBy AggLast

Groups rows containing the same distinct value in the source column(s), returns the last row, and then aggregates the results into their own column(s).

ComboBy AggMax

Computes a maximum value in the source column(s) and then aggregates the results into their own column(s).

ComboBy AggMed

Computes a median value in the source column(s) and then aggregates the results into their own column(s).

ComboBy AggMin

Computes a minimum value in the source column(s) and then aggregates the results into their own column(s).

ComboBy AggPct

Computes the designated percentile of values in the source column(s) and then aggregates the results into their own column(s).

ComboBy AggStd

Computes standard deviations of values in the source column(s) and then aggregates the results into their own column(s).

ComboBy AggSum

Computes a total sum of values in the source column(s) and aggregates then the results into their own column(s).

ComboBy AggWSum

Computes a weighted sum in the source column(s) and then aggregates the results into their own column(s).

ComboBy AggVar

Computes the variance of the data of values in the source column(s) and then aggregates the results into their own column(s).

Syntax

resultTable = sourceTable.by(AggCombo(ComboBy... Aggregators), String... groupByColumns);

Each aggregation must be defined by a ComboBy method signature. These are created using methods in the ComboAggregateFactory class. There is one method for each type of aggregation.

  • ComboBy AggMin(String...)
  • ComboBy AggMax(String...)
  • ComboBy AggSum(String...)
  • ComboBy AggVar(String...)
  • ComboBy AggAvg(String...)
  • ComboBy AggWAvg(weight, String...)
  • ComboBy AggWSum (weight, String...)
  • ComboBy AggStd(String...)
  • ComboBy AggFirst(String...)
  • ComboBy AggLast(String...)
  • ComboBy AggMed(String...)
  • ComboBy AggPct(percentile, String...)
  • ComboBy AggArray(String...)
  • ComboBy AggCount(String resultColumn)

The arguments for these methods include the name of the source column(s) for the calculations and the name(s) for the columns to be used in the new table.

See also: Class ComboAggregateFactory

Example

For this example, assume there is a table containing data regarding the trade of securities for a single day. The goal is to calculate the earliest timestamp and lowest price in the table. The appropriate ComboBy method follows:

AggMin("Timestamp", "Price")

Because the resulting values are aggregated, it is important to rename the resulting columns with the aggregated title. This is demonstrated in the following:

AggMin("EarliestTimestamp=Timestamp", "EarliestPrice=Price")

The syntax shown above applies to all of the ComboBy methods except AggCount(). The count is based on the number of rows rather than any particular source column, so the only argument to AggCount() is the desired result column name. For example, one could use  AggCount("Count") or AggCount("Total").

Passing multiple ComboBys provides a concise and expressive syntax for aggregation. The example below demonstrates the complete syntax in which the AggMin() and AggCount() calls from above would be used to compute the earliest timestamp, lowest price, and total number of rows by USym and Exchange from the "StockTrades" table:

from deephaven import *  # ComboAggregateFactory module imported as caf

resultTable = db.t("LearnDeephaven", "StockTrades").by(caf.AggCombo(
    caf.AggMax("Timestamp"),
    caf.AggMin("EarliestTimestamp=Timestamp", "LowestPrice=Last"),
    caf.AggCount("Count")
    ), "Exchange", "Sym")
resultTable = db.t("LearnDeephaven", "StockTrades").by(AggCombo(
    AggMax("Timestamp"), 
    AggMin("EarliestTimestamp=Timestamp", "LowestPrice=Last"), 
    AggCount("Count")
), "Exchange", "Sym")

Combined Aggregations Example

Imagine working with the following clip of trades1 table, which contains records of option trades in companies ABC and XYZ:

trades1

Timestamp

USym

Sym

Size

Price

12:01:34

ABC

ABC_2016_02_19_100.00_P

10

0.30

12:01:48

ABC

ABC_2016_02_19_90.00_P

32

0.05

12:01:54

XYZ

XYZ_2016_07_15_45.00_C

12

0.15

12:02:14

ABC

ABC_2016_02_19_110.00_C

15

0.04

12:02:37

XYZ

XYZ_2016_07_15_47.00_C

20

0.01

For each underlying USym, you want to calculate the first and last trade times, the total size traded, and the average trade size. This requires calculating minima, maxima, sums, and averages — four different types of aggregations. You could do all those calculations using minBy(), maxBy(), sumBy(), and avgBy() in four separate steps, creating four separate tables. Then, you'd need to join the results. The following example query shows that lengthy and expensive procedure:

trades1 = db.t("LearnDeephaven", "StockTrades").where("Date = `2017-08-25`")
firstTrades = trades1.view("USym", "Timestamp").minBy("USym")
lastTrades = trades1.view("USym", "Timestamp").maxBy("USym")
totalSize = trades1.view("USym", "Size").sumBy("USym")
avgSize = trades1.view("USym", "Size").avgBy("USym")
tradeStats = firstTrades.view("USym", "FirstTrade=Timestamp")\
    .naturalJoin(lastTrades, "USym", "LastTrade=Timestamp")\
    .naturalJoin(totalSize, "USym", "Size")\
    .naturalJoin(avgSize, "USym", "AvgSize=Size")
trades1 = db.t("LearnDeephaven", "StockTrades").where("Date = `2017-08-25`")
firstTrades = trades1.view("USym", "Timestamp").minBy("USym")
lastTrades = trades1.view("USym", "Timestamp").maxBy("USym")
totalSize = trades1.view("USym", "Size").sumBy("USym")
avgSize = trades1.view("USym", "Size").avgBy("USym")
tradeStats = firstTrades.view("USym", "FirstTrade=Timestamp")\
    .naturalJoin(lastTrades, "USym", "LastTrade=Timestamp")\
    .naturalJoin(totalSize, "USym", "Size")\
    .naturalJoin(avgSize, "USym", "AvgSize=Size")

Compare that to the following, which is produced more efficiently, and with simpler, more compact code, using the combined aggregation syntax:

from deephaven import *  # ComboAggregateFactory module imported as caf

tradeStats = trades1.by(caf.AggCombo(
    caf.AggMin("FirstTrade=Timestamp"),
    caf.AggMax("LastTrade=Timestamp"),
    caf.AggSum("Size"),
    caf.AggAvg("AvgSize=Size")),
    "USym")
tradeStats = trades1.by(AggCombo(
    AggMin("FirstTrade=Timestamp"),
    AggMax("LastTrade=Timestamp"),
    AggSum("Size"),
    AggAvg("AvgSize=Size")),
    "USym")

Either way, the resulting tradeStats table contains the same data:

tradeStats

USym

FirstTrade

LastTrade

Size

AvgSize

ABC

12:01:34

12:02:14

57

19

XYZ

12:01:54

12:02:37

32

16

User-defined Aggregators

You can define your own aggregations using the following syntax:

Table applyToAllBy(String formulaColumn,String... groupByColumns)

groupByColumns groups data by columns and applies formulaColumn to each of the columns not altered by the grouping operation.

formulaColumn represents the formula applied to each column, and uses the parameter "each" to refer to each column to which it is being applied. "Each" is an array of the values grouped by the groupByColumns.

Note:  It is user's responsibility to ensure the aggregation formula is compatible with the data type for each column being aggregated.

Example

table

A

B

C

1

10

AAPL

2

20

IBM

3

30

C

4

40

AAPL

5

50

AA

applyToAllBy("`The second element is` + each[1]" ,"Parity=A%2")

Parity

A

B

C

1

The second element is 3

The second element is 30

The second element is C

0

The second element is 4

The second element is 40

The second element is AAPL

Example 2

The example query below opens a table and selects values

In the t2 query, the applyToAllBy method is used to aggregate the Price and Size columns based on Sym and then apply a common formula to manipulate the values in the Last and Size column.

The t3 query shows the same process using the by and updateView methods instead of applyToAllBy.  However, the Last and Size columns are manipulated individually with their own formula.

In this case, the t2 and t3 tables will be identical.

t = db.t("LearnDeephaven", "StockTrades").where("Date=`2017-08-21`")
t2 = t.select("Sym", "Last", "Size").applyToAllBy("each+1", "Sym")
t3 = t.select("Sym", "Last", "Size").by("Sym").updateView("Last=Last+1", "Size=Size+1")


Last Updated: 16 February 2021 18:07 -04:00 UTC    Deephaven v.1.20200928  (See other versions)

Deephaven Documentation     Copyright 2016-2020  Deephaven Data Labs, LLC     All Rights Reserved