Aggregation
Aggregation is used to express data/information in a summary form. An aggregation operation groups the values of multiple rows based on the distinct criteria selected, performs a calculation on the values in those groups, and then returns a summary of the values in each group.
Three different types of aggregations are available:
Dedicated Aggregators
Deephaven provides several predefined aggregation functions:
| Function | Description | 
|---|---|
| Find each distinct value or set of values for the column(s) listed in the  Group the rows containing the same distinct value. Return the first row for each group. (See example table   | |
| Find each distinct value or set of values for the column(s) listed in the  Group the rows containing the same distinct value. Return the last row for each group. | |
| Find each distinct value or set of values for the column(s) listed in the  Group the rows containing the same distinct value. For each group, calculate the sum of the data in each of the other columns. (See example table  | |
| Find each distinct value or set of values for the column(s) listed in the  Group the rows containing the same distinct value. For each group, calculate the average (mean) of the data in each of the other columns. (See example table t5 below) | |
| Find each distinct value or set of values for the column(s) listed in the  Group the rows containing the same distinct value. For each group, calculate the standard deviation of the data in each of the other columns. | |
| Find each distinct value or set of values for the column(s) listed in the  Group the rows containing the same distinct value. For each group, calculate the variance of the data in each of the other columns. | |
| Find each distinct value or set of values for the column(s) listed in the  Group the rows containing the same distinct value. For each group, calculate the median of the data in each of the other columns. | |
| Find each distinct value or set of values for the column(s) listed in the  Group the rows containing the same distinct value. For each group, calculate the minimum of the data in each of the other columns. (See example  | |
| Find each distinct value or set of values for the column(s) listed in the  Group the rows containing the same distinct value. For each group, calculate the maximum of the data in each of the other columns. | |
| Find each distinct value or set of values for the column(s) listed in the  Group the rows containing the same distinct value. Count the number of rows in each group. (See example  | |
| Calculates the number of values in the column indicated in the  Equivalent to  (See example  | |
| 
 | Find each distinct value or set of values for the column(s) listed in the  Group the rows containing the same distinct value. Return the first  (See example table  | 
| 
 | Find each distinct value or set of values for the column(s) listed in the  Group the rows containing the same distinct value. Return the last  (See example table  | 
Note:
- It is the user's responsibility to ensure the aggregation formula is compatible with the data type of each column being aggregated. For example, if you perform an .avgByon a column containing strings, the query will throw an error.
- 
                                            If multiple aggregations are applied to one column without renaming the output columns, the query will throw an error; e.g, " .update("X = 42").by(AggCombo(AggSum("X"),AggAvg("X")))."
- If there is no value in the groupByColumnsargument, for methods other thancountBy, grouping is not performed.countBywith no arguments will fail.
- If multiple values are used in the groupByColumnsargument, each distinct value set is used for grouping.
| 
 
 | 
 
 | 
 
 | 
 
 | 
Example Set 2
apples
| Name | Color | Price | Quantity | 
|---|---|---|---|
| Gala | red | 1.25 | 500 | 
| Fuji | red | 1.35 | 380 | 
| Granny Smith | green | 1.85 | 500 | 
| Honey Crisp | red | 3.25 | 80 | 
| Golden | green | 1.25 | 370 | 
t4=apples.view("Color","Quantity").sumBy("Color")
| Color | Quantity | 
|---|---|
| red | 960 | 
| green | 870 | 
t5=apples.view("Color","Price").avgBy("Color")
| Color | Price | 
|---|---|
| red | 1.95 | 
| green | 1.55 | 
| Price | Name | Color | Quantity | 
|---|---|---|---|
| 1.25 | Gala | green | 370 | 
| 1.35 | Fuji | red | 380 | 
| 1.85 | GrannySmith | green | 500 | 
| 3.25 | HoneyCrisp | red | 80 | 
t7=apples.countBy("Count", "Color")
| Color | Count | 
|---|---|
| red | 3 | 
| green | 2 | 
| Name | 
|---|
| 5 | 
The table below shows example queries where the table is aggregated by Sym and then simplified to show only two columns. Finally, the first four rows for each Sym is shown under headBy().  The last four rows for each Sym is shown under tailBy().
| headBy() | tailBy() | 
|---|---|
| 
 | 
 | 
|   |   | 
Combined Aggregations
Deephaven also allows you to performing multiple aggregations at once using the following functions:
| Function | Description | 
|---|---|
| ComboBy AggArray | Creates an array of all the values in the source column(s) and then aggregates the results into their own column(s). | 
| ComboBy AggAvg | Computes an average value in the source column(s) and then aggregates the results into their own column(s). | 
| ComboBy AggWAvg | Computes a weighted average in the source column(s) and then aggregates the results into their own column(s). | 
| ComboBy AggCount | Counts the number of rows in the source column(s) and then aggregates the results into their own column(s). | 
| ComboBy AggFirst | Groups rows containing the same distinct value in the source column(s), returns the first row, and then aggregates the results into their own column(s). | 
| ComboBy AggLast | Groups rows containing the same distinct value in the source column(s), returns the last row, and then aggregates the results into their own column(s). | 
| ComboBy AggMax | Computes a maximum value in the source column(s) and then aggregates the results into their own column(s). | 
| ComboBy AggMed | Computes a median value in the source column(s) and then aggregates the results into their own column(s). | 
| ComboBy AggMin | Computes a minimum value in the source column(s) and then aggregates the results into their own column(s). | 
| ComboBy AggPct | Computes the designated percentile of values in the source column(s) and then aggregates the results into their own column(s). | 
| ComboBy AggStd | Computes standard deviations of values in the source column(s) and then aggregates the results into their own column(s). | 
| ComboBy AggSum | Computes a total sum of values in the source column(s) and aggregates then the results into their own column(s). | 
| ComboBy AggVar | Computes the variance of the data of values in the source column(s) and then aggregates the results into their own column(s). | 
Syntax
resultTable = sourceTable.by(AggCombo(ComboBy... Aggregators), String... groupByColumns);
Each aggregation must be defined by a ComboBy method signature. These are created using methods in the ComboAggregateFactory class. There is one method for each type of aggregation.
- ComboBy AggMin(String...)
- ComboBy AggMax(String...)
- ComboBy AggSum(String...)
- ComboBy AggVar(String...)
- ComboBy AggAvg(String...)
- ComboBy AggStd(String...)
- ComboBy AggFirst(String...)
- ComboBy AggLast(String...)
- ComboBy AggMed(String...)
- ComboBy AggPct(percentile, String...)
- ComboBy AggArray(String...)
- ComboBy AggCount(String resultColumn)
The arguments for these methods include the name of the source column(s) for the calculations and the name(s) for the columns to be used in the new table.
Example
For this example, assume there is a table containing data regarding the trade of securities for a single day (March 15, 2016). The goal is to calculate the earliest timestamp and lowest price in the table. The appropriate ComboBy method follows: 
AggMin("Timestamp", "Price")
Because the resulting values are aggregated, it is important to rename the resulting columns with the aggregated title. This is demonstrated in the following:
AggMin("EarliestTimestamp=Timestamp", "EarliestPrice=Price")
The syntax shown above applies to all of the ComboBy methods except AggCount(). The count is based on the number of rows rather than any particular source column, so the only argument to AggCount() is the desired result column name. For example, one could use  AggCount("Count") or AggCount("Total").
Passing multiple ComboBys provides a concise and expressive syntax for aggregation. The example below demonstrates the complete syntax in which the AggMin() and AggCount() calls from above would be used to compute the earliest timestamp, lowest price, and total number of rows by USym and Expiry from the sourceTable
from deephaven import *  # ComboAggregateFactory module imported as caf
resultTable = sourceTable.by(caf.AggCombo(
    caf.AggMax("Timestamp"),
    caf.AggMin("EarliestTimestamp=Timestamp", "LowestPrice=Price"),
    caf.AggCount("Count")
    ), "USym", "Expiry")
resultTable = sourceTable.by(AggCombo(
    AggMax("Timestamp"),
    AggMin("EarliestTimestamp=Timestamp", "LowestPrice=Price"),
    AggCount("Count")
    ), "USym", "Expiry")
Combined Aggregations Example
Imagine working with the following clip of trades1 table, which contains records of option trades in companies ABC and XYZ:
trades1
                                    
| Timestamp | USym | Sym | Size | Price | 
|---|---|---|---|---|
| 12:01:34 | ABC | ABC_2016_02_19_100.00_P | 10 | 0.30 | 
| 12:01:48 | ABC | ABC_2016_02_19_90.00_P | 32 | 0.05 | 
| 12:01:54 | XYZ | XYZ_2016_07_15_45.00_C | 12 | 0.15 | 
| 12:02:14 | ABC | ABC_2016_02_19_110.00_C | 15 | 0.04 | 
| 12:02:37 | XYZ | XYZ_2016_07_15_47.00_C | 20 | 0.01 | 
For each underlying USym, you want to calculate the first and last trade times, the total size traded, and the average trade size. This requires calculating minima, maxima, sums, and averages — four different types of aggregations. You could do all those calculations using minBy(), maxBy(), sumBy(), and avgBy() in four separate steps, creating four separate tables. Then, you'd need to join the results. The following query shows that lengthy and expensive procedure:
firstTrades = trades1.view("USym", "Timestamp").minBy("USym")
lastTrades = trades1.view("USym", "Timestamp").maxBy("USym")
totalSize = trades1.view("USym", "Size").sumBy("USym")
avgSize = trades1.view("USym", "Size").avgBy("USym")
tradeStats = firstTrades.view("USym", "FirstTrade=Timestamp")\
    .naturalJoin(lastTrades, "USym", "LastTrade=Timestamp")\
    .naturalJoin(totalSize, "USym", "Size")\
    .naturalJoin(avgSize, "USym", "AvgSize=Size")
Compare that to the following, which is produced more efficiently, and with simpler, more compact code, using the combined aggregation syntax:
tradeStats = trades1.by(caf.AggCombo(
    caf.AggMin("FirstTrade=Timestamp"),
    caf.AggMax("LastTrade=Timestamp"),
    caf.AggSum("Size"),
    caf.AggAvg("AvgSize=Size")),
    "USym")
Either way, the resulting tradeStats table contains the same data:
tradeStats
                                    
| USym | FirstTrade | LastTrade | Size | AvgSize | 
|---|---|---|---|---|
| ABC | 12:01:34 | 12:02:14 | 57 | 19 | 
| XYZ | 12:01:54 | 12:02:37 | 32 | 16 | 
User-defined Aggregators
You can define your own aggregations using the following syntax:
Table applyToAllBy(String formulaColumn,String... groupByColumns)
groupByColumns groups data by columns and applies formulaColumn to each of the columns not altered by the grouping operation.
formulaColumn represents the formula applied to each column, and uses the parameter "each" to refer to each column to which it is being applied. "Each" is an array of the values grouped by the groupByColumns.
Note: It is user's responsibility to ensure the aggregation formula is compatible with the data type for each column being aggregated.
Example
table
                                    
| A | B | C | 
|---|---|---|
| 1 | 10 | AAPL | 
| 2 | 20 | IBM | 
| 3 | 30 | C | 
| 4 | 40 | AAPL | 
| 5 | 50 | AA | 
applyToAllBy("`The second element is` + each[1]" ,"Parity=A%2")
                                    
| Parity | A | B | C | 
|---|---|---|---|
| 1 | The second element is 3 | The second element is 30 | The second element is C | 
| 0 | The second element is 4 | The second element is 40 | The second element is AAPL | 
Example 2
The example query below opens a table and selects values
In the t2 query, the applyToAllBy method is used to aggregate the Price and Size columns based on Sym and then apply a common formula  to manipulate the values in the Last and Size column. 
The t3 query shows the same process using the by and updateView methods instead of applyToAllBy.  However, the Last and Size columns are manipulated individually with their own  formula.
 In this case, the t2 and t3 tables will be identical.
t = db.t("LearnDeephaven", "StockTrades").where("Date=`2017-08-21`")
t2 = t.select("Sym", "Last", "Size").applyToAllBy("each+1", "Sym")
t3 = t.select("Sym", "Last", "Size").by("Sym").updateView("Last=Last+1", "Size=Size+1")
Last Updated: 28 February 2020 12:20 -05:00 UTC Deephaven v.1.20200121 (See other versions)
Deephaven Documentation Copyright 2016-2020 Deephaven Data Labs, LLC All Rights Reserved