Aggregation

Aggregation is used to express data/information in a summary form. An aggregation operation groups the values of multiple rows based on the distinct criteria selected, performs a calculation on the values in those groups, and then returns a summary of the values in each group.

Three different types of aggregations are available:

Dedicated Aggregators
Combined Aggregators
User-defined Aggregators

Dedicated Aggregators

Deephaven provides several predefined aggregation functions:

Function	Description
`Table firstBy(String... groupByColumns)`	Find each distinct value or set of values for the column(s) listed in the `groupByColumns` argument. Group the rows containing the same distinct value. Return the first row for each group. (See example table `t1` below)
`Table lastBy(String... groupByColumns)`	Find each distinct value or set of values for the column(s) listed in the `groupByColumns` argument. Group the rows containing the same distinct value. Return the last row for each group. (See example tables `t2` and `t3` below)
`Table sumBy(String... groupByColumns)`	Find each distinct value or set of values for the column(s) listed in the `groupByColumns` argument. Group the rows containing the same distinct value. For each group, calculate the sum of the data in each of the other columns. (See example table `t4` below)
`Table avgBy(String... groupByColumns)`	Find each distinct value or set of values for the column(s) listed in the `groupByColumns` argument. Group the rows containing the same distinct value. For each group, calculate the average (mean) of the data in each of the other columns. (See example table t5 below)
`Table stdBy(String... groupByColumns)`	Find each distinct value or set of values for the column(s) listed in the `groupByColumns` argument. Group the rows containing the same distinct value. For each group, calculate the standard deviation of the data in each of the other columns.
`Table varBy(String... groupByColumns)`	Find each distinct value or set of values for the column(s) listed in the `groupByColumns` argument. Group the rows containing the same distinct value. For each group, calculate the variance of the data in each of the other columns.
`Table medianBy(String... groupByColumns)`	Find each distinct value or set of values for the column(s) listed in the `groupByColumns` argument. Group the rows containing the same distinct value. For each group, calculate the median of the data in each of the other columns.
`Table minBy(String... groupByColumns)`	Find each distinct value or set of values for the column(s) listed in the `groupByColumns` argument. Group the rows containing the same distinct value. For each group, calculate the minimum of the data in each of the other columns. (See example `t6` below)
`Table maxBy(String... groupByColumns)`	Find each distinct value or set of values for the column(s) listed in the `groupByColumns` argument. Group the rows containing the same distinct value. For each group, calculate the maximum of the data in each of the other columns.
`Table countBy(String countColumn, String... groupByColumns)`	Find each distinct value or set of values for the column(s) listed in the `groupByColumns` argument. Group the rows containing the same distinct value. Count the number of rows in each group. (See example `t7`below)
`Table countBy(String countColumn)`	Calculates the number of values in the column indicated in the `countColumn` argument. Equivalent to `by().count(countColumn)`. (See example `t8`below)
`Table headBy (n, groupByColumns)`	Find each distinct value or set of values for the column(s) listed in the `groupByColumns` argument. Group the rows containing the same distinct value. Return the first `n` rows for each group. (See example table `t9` below)
`Table tailBy (n, groupByColumns)`	Find each distinct value or set of values for the column(s) listed in the `groupByColumns` argument. Group the rows containing the same distinct value. Return the last `n` rows for each group. (See example table `t10` below)

Note:

It is the user's responsibility to ensure the aggregation formula is compatible with the data type of each column being aggregated. For example, if you perform an .avgBy on a column containing strings, the query will throw an error.
If multiple aggregations are applied to one column without renaming the output columns, the query will throw an error; e.g, ".update("X = 42").by(AggCombo(AggSum("X"),AggAvg("X")))."
If there is no value in the groupByColumns argument, for methods other than countBy, grouping is not performed. countBy with no arguments will fail.
If multiple values are used in the groupByColumns argument, each distinct value set is used for grouping.

Example Set 1

source

A	B	C
1	10	AAPL
2	20	IBM
3	30	C
4	40	AAPL
5	50	AA

t1=source.firstBy()

A	B	C
1	10	AAPL

t2=source.lastBy()

A	B	C
5	50	AA

t3=source.lastBy(C)

C	A	B
AAPL	4	40
IBM	2	20
C	3	30
AA	5	50

Example Set 2

apples

Name	Color	Price	Quantity
Gala	red	1.25	500
Fuji	red	1.35	380
Granny Smith	green	1.85	500
Honey Crisp	red	3.25	80
Golden	green	1.25	370

t4=apples.view("Color","Quantity").sumBy("Color")

Color	Quantity
red	960
green	870

t5=apples.view("Color","Price").avgBy("Color")

Color	Price
red	1.95
green	1.55

t6=apples.minBy("Price")

Price	Name	Color	Quantity
1.25	Gala	green	370
1.35	Fuji	red	380
1.85	GrannySmith	green	500
3.25	HoneyCrisp	red	80

t7=apples.countBy("Count", "Color")

Color	Count
red	3
green	2

t8=apples.countBy("Name")

Name
5

Example Set 3

The table below shows example queries where the table is aggregated by Sym and then simplified to show only two columns. Finally, the first four rows for each Sym is shown under headBy(). The last four rows for each Sym is shown under tailBy().

headBy()	tailBy()
t9 = db.t("LearnDeephaven", "StockTrades") .where("Date=`2017-08-25`") .view("Sym", "Size") .headBy(4, "Sym")	t10 = db.t("LearnDeephaven", "StockTrades") .where("Date=`2017-08-25`") .view("Sym", "Size") .tailBy(4, "Sym")

Combined Aggregations

Deephaven also allows you to perform multiple aggregations at once using the following functions:

Function	Description
ComboBy AggArray	Creates an array of all the values in the source column(s) and then aggregates the results into their own column(s).
ComboBy AggAvg	Computes an average value in the source column(s) and then aggregates the results into their own column(s).
ComboBy AggWAvg	Computes a weighted average in the source column(s) and then aggregates the results into their own column(s).
ComboBy AggCount	Counts the number of rows in the source column(s) and then aggregates the results into their own column(s).
ComboBy AggCountDistinct	Creates a distinct count aggregation. The output column contains the number of distinct values for the input column in that group.
ComboBy AggDistinct	Creates a distinct aggregation. The output column contains a `DbArrayBase` with the distinct values for the input column within the group. Null values are ignored. Similar to AggArray, but returns a Set.
ComboBy AggFirst	Groups rows containing the same distinct value in the source column(s), returns the first row, and then aggregates the results into their own column(s).
ComboBy AggLast	Groups rows containing the same distinct value in the source column(s), returns the last row, and then aggregates the results into their own column(s).
ComboBy AggMax	Computes a maximum value in the source column(s) and then aggregates the results into their own column(s).
ComboBy AggMed	Computes a median value in the source column(s) and then aggregates the results into their own column(s).
ComboBy AggMin	Computes a minimum value in the source column(s) and then aggregates the results into their own column(s).
ComboBy AggPct	Computes the designated percentile of values in the source column(s) and then aggregates the results into their own column(s).
ComboBy AggStd	Computes standard deviations of values in the source column(s) and then aggregates the results into their own column(s).
ComboBy AggSum	Computes a total sum of values in the source column(s) and aggregates then the results into their own column(s).
ComboBy AggWSum	Computes a weighted sum in the source column(s) and then aggregates the results into their own column(s).
ComboBy AggVar	Computes the variance of the data of values in the source column(s) and then aggregates the results into their own column(s).

Syntax

resultTable = sourceTable.by(AggCombo(ComboBy... Aggregators), String... groupByColumns);

Each aggregation must be defined by a ComboBy method signature. These are created using methods in the ComboAggregateFactory class. There is one method for each type of aggregation.

ComboBy AggMin(String...)
ComboBy AggMax(String...)
ComboBy AggSum(String...)
ComboBy AggVar(String...)
ComboBy AggAvg(String...)
ComboBy AggWAvg(weight, String...)
ComboBy AggWSum (weight, String...)
ComboBy AggStd(String...)
ComboBy AggFirst(String...)
ComboBy AggLast(String...)
ComboBy AggMed(String...)
ComboBy AggPct(percentile, String...)
ComboBy AggArray(String...)
ComboBy AggCount(String resultColumn)

The arguments for these methods include the name of the source column(s) for the calculations and the name(s) for the columns to be used in the new table.

See also: Class ComboAggregateFactory

Example

For this example, assume there is a table containing data regarding the trade of securities for a single day. The goal is to calculate the earliest timestamp and lowest price in the table. The appropriate ComboBy method follows:

AggMin("Timestamp", "Price")

Because the resulting values are aggregated, it is important to rename the resulting columns with the aggregated title. This is demonstrated in the following:

AggMin("EarliestTimestamp=Timestamp", "EarliestPrice=Price")

The syntax shown above applies to all of the ComboBy methods except AggCount(). The count is based on the number of rows rather than any particular source column, so the only argument to AggCount() is the desired result column name. For example, one could use AggCount("Count") or AggCount("Total").

Passing multiple ComboBys provides a concise and expressive syntax for aggregation. The example below demonstrates the complete syntax in which the AggMin() and AggCount() calls from above would be used to compute the earliest timestamp, lowest price, and total number of rows by USym and Exchange from the "StockTrades" table:

from deephaven import *  # ComboAggregateFactory module imported as caf

resultTable = db.t("LearnDeephaven", "StockTrades").by(caf.AggCombo(
    caf.AggMax("Timestamp"),
    caf.AggMin("EarliestTimestamp=Timestamp", "LowestPrice=Last"),
    caf.AggCount("Count")
    ), "Exchange", "Sym")

resultTable = db.t("LearnDeephaven", "StockTrades").by(AggCombo(
    AggMax("Timestamp"), 
    AggMin("EarliestTimestamp=Timestamp", "LowestPrice=Last"), 
    AggCount("Count")
), "Exchange", "Sym")

Combined Aggregations Example

Imagine working with the following clip of trades1 table, which contains records of option trades in companies ABC and XYZ:

trades1

Timestamp	USym	Sym	Size	Price
12:01:34	ABC	ABC_2016_02_19_100.00_P	10	0.30
12:01:48	ABC	ABC_2016_02_19_90.00_P	32	0.05
12:01:54	XYZ	XYZ_2016_07_15_45.00_C	12	0.15
12:02:14	ABC	ABC_2016_02_19_110.00_C	15	0.04
12:02:37	XYZ	XYZ_2016_07_15_47.00_C	20	0.01

For each underlying USym, you want to calculate the first and last trade times, the total size traded, and the average trade size. This requires calculating minima, maxima, sums, and averages — four different types of aggregations. You could do all those calculations using minBy(), maxBy(), sumBy(), and avgBy() in four separate steps, creating four separate tables. Then, you'd need to join the results. The following example query shows that lengthy and expensive procedure:

trades1 = db.t("LearnDeephaven", "StockTrades").where("Date = `2017-08-25`")
firstTrades = trades1.view("USym", "Timestamp").minBy("USym")
lastTrades = trades1.view("USym", "Timestamp").maxBy("USym")
totalSize = trades1.view("USym", "Size").sumBy("USym")
avgSize = trades1.view("USym", "Size").avgBy("USym")
tradeStats = firstTrades.view("USym", "FirstTrade=Timestamp")\
    .naturalJoin(lastTrades, "USym", "LastTrade=Timestamp")\
    .naturalJoin(totalSize, "USym", "Size")\
    .naturalJoin(avgSize, "USym", "AvgSize=Size")

trades1 = db.t("LearnDeephaven", "StockTrades").where("Date = `2017-08-25`")
firstTrades = trades1.view("USym", "Timestamp").minBy("USym")
lastTrades = trades1.view("USym", "Timestamp").maxBy("USym")
totalSize = trades1.view("USym", "Size").sumBy("USym")
avgSize = trades1.view("USym", "Size").avgBy("USym")
tradeStats = firstTrades.view("USym", "FirstTrade=Timestamp")\
    .naturalJoin(lastTrades, "USym", "LastTrade=Timestamp")\
    .naturalJoin(totalSize, "USym", "Size")\
    .naturalJoin(avgSize, "USym", "AvgSize=Size")

Compare that to the following, which is produced more efficiently, and with simpler, more compact code, using the combined aggregation syntax:

from deephaven import *  # ComboAggregateFactory module imported as caf

tradeStats = trades1.by(caf.AggCombo(
    caf.AggMin("FirstTrade=Timestamp"),
    caf.AggMax("LastTrade=Timestamp"),
    caf.AggSum("Size"),
    caf.AggAvg("AvgSize=Size")),
    "USym")

tradeStats = trades1.by(AggCombo(
    AggMin("FirstTrade=Timestamp"),
    AggMax("LastTrade=Timestamp"),
    AggSum("Size"),
    AggAvg("AvgSize=Size")),
    "USym")

Either way, the resulting tradeStats table contains the same data:

tradeStats

USym	FirstTrade	LastTrade	Size	AvgSize
ABC	12:01:34	12:02:14	57	19
XYZ	12:01:54	12:02:37	32	16

User-defined Aggregators

You can define your own aggregations using the following syntax:

Table applyToAllBy(String formulaColumn,String... groupByColumns)

groupByColumns groups data by columns and applies formulaColumn to each of the columns not altered by the grouping operation.

formulaColumn represents the formula applied to each column, and uses the parameter "each" to refer to each column to which it is being applied. "Each" is an array of the values grouped by the groupByColumns.

Note: It is user's responsibility to ensure the aggregation formula is compatible with the data type for each column being aggregated.

Example

table

A	B	C
1	10	AAPL
2	20	IBM
3	30	C
4	40	AAPL
5	50	AA

applyToAllBy("`The second element is` + each[1]" ,"Parity=A%2")

Parity	A	B	C
1	The second element is 3	The second element is 30	The second element is C
0	The second element is 4	The second element is 40	The second element is AAPL

Example 2

The example query below opens a table and selects values

In the t2 query, the applyToAllBy method is used to aggregate the Price and Size columns based on Sym and then apply a common formula to manipulate the values in the Last and Size column.

The t3 query shows the same process using the by and updateView methods instead of applyToAllBy. However, the Last and Size columns are manipulated individually with their own formula.

In this case, the t2 and t3 tables will be identical.

t = db.t("LearnDeephaven", "StockTrades").where("Date=`2017-08-21`")
t2 = t.select("Sym", "Last", "Size").applyToAllBy("each+1", "Sym")
t3 = t.select("Sym", "Last", "Size").by("Sym").updateView("Last=Last+1", "Size=Size+1")

Last Updated: 16 February 2021 18:07 -04:00 UTC Deephaven v.1.20200928 (See other versions)