Grouping

Deephaven offers three methods for grouping (or ungrouping) data in tables.

  • by
  • ungroup
  • byExternal

See also Aggregation.

by

Syntax

by(String… groupColumns)

The by method groups column content into arrays. If no arguments are defined in the method, the content of each column is grouped into its own array.

The method is best explained by operating in three stages:

  1. Evaluate the columns defined by groupColumns against the source table (similar to how a select statement would)
  2. Create a table containing all the distinct rows from the previous step.
  3. For each column from the source table not listed in the by method, add a column with the same name, containing an array with one element for each row that maps to the given distinct grouping row.

Example 1

The goal for this example is to create a new table with the same number of columns as the source table, but instead of having five rows of data for each column, the content in the rows for each column will be grouped into an array using the by method with no arguments. The query follows:

t=source.by()

When this query runs, the t table is generated. There are no arguments to the by method, so the content in the rows for each column is grouped into an array.

source

A

B

C

1

10

AAPL

2

20

IBM

3

30

C

4

40

AAPL

5

50

AA

 

t

A

B

C

[1,2,3,4,5]

[10,20,30,40,50]

[AAPL,IBM,C,AAPL,AA]

Example 2

The goal for this example is to group the content based on the distinct values in column C. The query follows:

t2=source.by(C)

When this query runs, the following t2 table is generated. There are four distinct values in column C, so the final table has four rows. In the source table, there are two rows in column C that have the same value (AAPL), so there are two values in each array for that row in the t2 table. Each of the other rows in column C of the source table was unique, so the respective arrays for columns A and B in the t2 table each have only one value.

source

A

B

C

1

10

AAPL

2

20

IBM

3

30

C

4

40

AAPL

5

50

AA

 

t2=source.by(C)

C

A

B

AAPL

[1,4]

[10,40]

IBM

[2]

[20]

C

[3]

[30]

AA

[5]

[50]

Example 3

The goal for this example is to group the content based on whether the values in column A are even or odd. The query follows:

t3=source.by("Parity=A%2")

When this query runs, the following t3 table is generated. A new column titled Parity is created in the t3 table to hold the modulus (remainder) values of 1 or 0. The values in column A are then evaluated for those moduli and then grouped into the corresponding arrays for each column. In the source table, there are three rows with a modulus value of 1 in column A, and two rows with a modulus value of 0 in column A. Therefore, the t3 table has only two rows of data - one for odd values and one for even values.

source

A

B

C

1

10

AAPL

2

20

IBM

3

30

C

4

40

AAPL

5

50

AA

 

t3=source.by("Parity=A%2")

Parity

A

B

C

1

[1,3,5]

[10,30,50]

[AAPL,C,AA]

0

[2,4]

[20,40]

[IBM,AAPL]

ungroup

Syntax

ungroup()

ungroup is the inverse of the by method. The ungroup method unwraps content from an array and builds a brand new data structure to hold the unwrapped content. There are two forms of ungroup:

  1. No argument:  If no argument is presented, all array columns from the source table will be unwrapped into separate rows. Consequently, all array columns in each row must have the same number of values.
  2. Explicit array column list:  Only columns specified will be unwrapped.

Example 4

The goal for this example is to unwrap all of the arrays for each column. Because we want to unwrap them all, no argument is needed. The query follows:

t4=source.ungroup()

When this query runs, the following t4 table is generated. In the first row of the source table, columns A and B both have three values in their respective arrays. Therefore, Deephaven unwraps those arrays into three rows in the t4 table - one row for each corresponding value in the arrays. In the second row of the source table, columns A and B both have two values in their respective arrays. Therefore, Deephaven unwraps those arrays into two additional rows in the t4 table, resulting in five total rows for the table.

source

A

B

C

1

[10,11,12]

[AAPL,SPY,GS]

2

[20,30]

[IBM,C]

 

t4=source.ungroup()

A

B

C

1

10

AAPL

1

11

SPY

1

12

GS

2

20

IBM

2

30

C

Note:  The same result could have been accomplished by including columns B and C in the ungroup argument, i.e., t4=source.ungroup("B","C")

Example 5

The goal for this example is to  unwrap one array column while leaving the other array column intact. The query follows:

t5=source.ungroup("B")

When this query runs, the following t5 table is generated. We are only unwrapping column B, so we need to add new rows for each value in each array.  Column B has three values in the array for the first row in the source table. So, Deephaven unwraps that array into three separate rows in the t5 table - one row for each corresponding value in the array. However, we did not tell Deephaven to ungroup the arrays in column C, so column C maintains its respective arrays for the corresponding values in column B. In the second row of the source table, column B has two values in its array. So, Deephaven unwraps that array into two additional rows in the t5 table, while maintaining the array in column C for those respective rows.

source

A

B

C

1

[10,11,12]

[AAPL,SPY,GS]

2

[20,30]

[IBM,C]

 

t5=source.ungroup("B")

A

B

C

1

10

[AAPL,SPY,GS]

1

11

[AAPL,SPY,GS]

1

12

[AAPL,SPY,GS]

2

20

[IBM,C]

2

30

[IBM,C]

byExternal

Syntax

byExternal(String… columnNames)

byExternal divides a single table into multiple tables, which are defined by unique key-value pairs in a TableMap. The keys are determined by the columnNames used in the argument.

For example, assume you have a table with five columns and 100 rows. One of the columns is named USym, and it contains five distinct values (AAPL, GOOG, MSFT, BABA and TSLA). For each distinct value in the USym column, there are 20 rows of data.

The byExternal method can be used to create five different tables based on the five unique key-value pairs defined in the TableMap. The query follows:

tableByUSym = table.byExternal("USym")

One table would be created for each of the key-value pairs, (i.e., USym-AAPL, USym-GOOG, USym-MSFT, USym-BABA, and USym-TSLA). Each distinct USym has 20 rows of data, so each table would have 20 rows.

In essence, this would be equivalent to performing individual where filters for each unique value in the USym column, and then saving the results of each to its own variable. However, using the byExternal method is far more efficient and faster because it eliminates the need to repeatedly pass over the data for each of the values of interest.

To retrieve the contents of a given table, you would use the get method as shown in the following:

t1=tableByUsym.get("AAPL")
t2=tableByUsym.get("GOOG")
t3=tableByUsym.get("MSFT")
...

To see all of the keys created in the TableMap, you would use the TableMap getKeySet() method as shown below:

  tKeys=tableByUSym.getKeySet()

Note: this method returns an array of keys.

Using byExternal on multiple columns

You can also create a TableMap using more than one column in the table. In this case, the number of tables created would be based on the number of all possible combinations of key-pairs in the TableMap. For example, if one of the columns in the previous example was titled Date and there were three different dates in that column, there would be three more additional key-pairs. Each key-pair for the Date column would be paired with each key-pair for the USym column, so there would be 15 different tables created in this example.

Using our previous example, the following code would generate tables for each unique combination of values in both the Date and USym columns,  e.g., ["2017-04-09", "TSLA"].

tableByUSym = table.byExternal("Date", "USym")

To open a specific resulting table in the multi-column TableMap, you would then use the SmartKey object to define a specific combination, and then use the .get method as shown below:

import com.fishlib.datastructures.util.SmartKey; 
key1 = new SmartKey("2017-04-08", "GOOG")        
key2 = new SmartKey("2017-04-09", "TSLA")
GOOG08 = curveLogByUsym.get(key1)    
Spy09 = curveLogByUsym.get(key2)


Last Updated: 23 September 2019 12:17 -04:00 UTC    Deephaven v.1.20181212  (See other versions)

Deephaven Documentation      Copyright 2016-2019  Deephaven Data Labs, LLC      All Rights Reserved