Selection
Deephaven offers several methods for selecting (or deselecting) columns of data in a query. These methods also enable users to eliminate or modify existing columns of data, or create new columns of data. The first five methods are select
, view
, update
, updateView
and lazyUpdate
.
Before deciding on the selection method to employ in your query, you should consider two questions:
- Do you need to use only some of the columns or all of the columns from the source table?
- Will the analyses run faster by storing all of the information in memory, or will it be more efficient to dynamically compute cell values as they are needed?
An example for the first question follows:
If your source table has 50 columns of data, but you only need a subset of those columns for your analysis, using the select
or view
methods will allow you to specifically name the columns you want to keep in the resulting table. The select
and view
methods also allow you to modify or add new columns to the resulting table.
Once you have your initial column set pared down, you may want to modify a column or add more columns to the resulting table. This is when the update
, updateView,
and lazyUpdate
methods come into play. When using one of these three selection methods, all of the columns from the source table are automatically included in the resulting table.
select |
view
|
Use when you want to keep only some of the columns from the source table in the resulting table. |
|
update |
updateView
|
lazyUpdate
|
Use when you want to keep all columns from the source table in the resulting table. |
We now have two methods that enable us to keep only some the columns in a table, and three methods that enable us to include all columns of data in a table. Now we need to consider the second question noted above:
Will the analyses run faster by storing all of the information in memory, or will it be more efficient to dynamically compute cell values as they are needed?
When using data from large datasets, the method in which you access and store that data can have a big impact on efficiency and speed.
When select
and update
are used in a query, Deephaven will access and evaluate the data requested and then store it in memory. Storing the data is more efficient when the content is expensive to evaluate and/or when the data is accessed multiple times. However, keep in mind that storing large datasets requires a large amount of memory.
Note: When using select
, the entire dataset requested is stored in memory. When using update
, only the "updated" portion of the dataset is stored in memory.
With view
and updateView
, the data being requested is not stored in memory. Rather, Deephaven stores a formula that recalculates each value as it is needed. These values are calculated as needed and never saved. Therefore, when the formula is fast to compute or if you are only accessing a small portion of the data, storing the results as a formula is often the best option. Memory is not allocated to store new columns, so less memory is needed.
The lazyUpdate
method computes column formulas on demand and will defer computation until it is required. Because it caches the results for the set of input values, the same value will never be computed twice. For small sets of unique values, this uses less memory than an update
, and requires less computation than an updateView
. However, if you have many unique values, an update
will be more efficient because the lazyUpdate
stores the formula inputs and result in a map, whereas the update
stores the values more compactly in an array.
select |
view
|
|
Use when you want to keep only some of the columns from the source table in the resulting table. |
update |
updateView
|
lazyUpdate
|
Use when you want to keep all columns from the source table in the resulting table. |
The content requested is evaluated once and saved into memory. |
The content exists only as formulas that refer to existing data. Values are calculated on the fly and never saved. |
The content is evaluated on demand and when you access a cell, the formula inputs are used to store the values in the cache so the same value is not computed twice. |
|
A large amount of memory is required for storing large datasets. |
Memory is not allocated to store new columns, so less memory is needed. |
Because the results are saved in a cache, this method uses less memory than an |
|
If the content is expensive to evaluate and accessed many times, storing the results in memory is often the best option. |
When the formula is fast to compute or if you are only accessing a small portion of the data, storing the results as a formula is often the best option. |
This method defers computation of formulas until they are required. Values are never removed from the |
|
Choosing one pair of methods over the other is not a trivial decision. Depending on the nature of the analysis and the amount of memory you have at your disposal, one method is usually more efficient than the other. As a general rule of thumb, you should start with view
and updateView,
and then switch to select
, update
, and lazyUpdate
to see if it is more efficient.
The following sections detail how to use each of these methods in a query.
select and view
The select
and view
methods create one table with the same number of rows as the source table, and one column for each argument used in the query. Each argument can be the name of an existing column, or a formula to define an existing column or create a new column. If you do not specify a column from the source table in the argument, it will not be copied to the resulting table.
There is only one difference between the select
and view
methods. Data obtained by using the select
method is saved into memory; data obtained by using the view
method is referenced via a formula and is not saved in memory.
Example
Start with the table shown below named source
; we will write a query to create another table with only certain columns from the original table.
source
A |
B |
C |
D |
---|---|---|---|
1 |
10 |
AAPL |
Lucy |
2 |
20 |
IBM |
Ricky |
3 |
30 |
C |
Fred |
The query follows. Note: The select
method is used in this example, but the syntax and the resulting data is identical to that when using the view
method.
t=source.select ("A", "B=B+2", "E=C+A")
t
is the name of the variable that will store the new table, source
is the table name, select
is the method. Three arguments describe the columns to include in the result table.
The first argument, "A"
indicates that the resulting table's first column should be column A
from the source
table, with no changes. The second argument, "B=B+2"
, indicates that the resulting table's second column should be column B
from the source
table, plus two. The third argument, "E=C+A"
, tells us to create a new virtual column in the resulting table, column E
, by combining columns C
and A
from the original table. Note: Column C
contains strings (as compared to numeric values), so Deephaven will concatenate the two columns and present them in the new E
column.
When the query runs, the new t
table is created.
|
|
As is demonstrated in the resulting table above, the select
and the view
methods enable you to:
- Choose the specific columns you want to show in the new table
- Manipulate the data you are showing in the new table, or
- Omit columns that you do not need for your analysis
- Create new virtual columns in the resulting table
update, updateView, and lazyUpdate
The update
, updateView
, and lazyUpdate
methods create one table with the same number of rows as the source table. All of the columns in the source table are included in the new table. However, the contents of those columns in the resulting table can be manipulated through arguments in the query. Additional columns can be added if desired.
The primary difference among these methods is how data is stored. Data obtained by using the update
method is saved into memory; data obtained by using the updateView
method is referenced via a formula and is not saved in memory; when you access a cell, data computed by using the lazyUpdate
method is stored in a cache.
Note: The update
method is being used in the following example, but the syntax and the resulting data is identical to that when using the updateView
or lazyUpdate
method.
Example
Starting with the table shown below named source
, we will write a query to create another table that will "update" certain data in the source
table for further analysis.
source
A |
B |
C |
---|---|---|
1 |
10 |
AAPL |
2 |
20 |
IBM |
3 |
30 |
C |
The query follows.
t2=source.update("B=B+2","D=C+A")
t2
is the name of the variable that will store the new table, source
is the table name, and update
is the method.
The first argument,"B=B+2"
, indicates column B
in the resulting table should be increased by two. The second argument,"D=C+A"
, indicates a new virtual column, column D
, should be created in the resulting table by combining the values of column C
and column A
from the source table. Note: Because column C
contains strings, Deephaven will concatenate the values in the two columns and present them in the new D
column.
When the query runs, the new t2
table is created.
|
|
As demonstrated in the resulting table above, the update
and updateView
methods enable you to:
- Include all of the columns from the source table in the resulting table without having to include arguments for each.
- Manipulate column values in the resulting table.
- Create new virtual columns in the resulting table.
selectDistinct
The selectDistinct
method creates a new table with only the columns from the source table that are listed in the argument. The number of rows in the new table's column is determined by the number of distinct values contained in these columns.
Example
Starting with the table shown below, we will then write a query to select the distinct values in a column.
source
A |
B |
---|---|
1 |
apple |
1 |
apple |
2 |
orange |
2 |
orange |
3 |
plum |
3 |
grape |
The query follows:
t=source.selectDistinct("B")
t
is the name of the variable that will store the new table, source
is the table name, and selectDistinct
is the method. The argument, "B"
, indicates the method will use the values from column B
of the source
table.
When the query runs, the new t
table is created.
|
|
Column B
in the source table has only four distinct values (apple, orange, plum and grape), so the new t
table has only four rows - one for each distinct value in column B
.
This example showed the distinct values in only one column. However, the selectDistinct
method can also be used to determine distinct value sets across multiple columns.
Using the same table as before, we will adjust the query to include two columns in the argument:
t2=source.selectDistinct("A","B")
When this new query runs, the new t2
table is created.
source
|
|
When the selectDistinct
method is used on multiple columns, it looks for distinct sets of values in the columns selected. In this example, the first two rows have the same set of values, so only one row is created in the new table to hold this distinct value set. The third and fourth rows also have an identical set of values, so only one row is added in the t2
table with this set of values. Each of the last two rows have their own distinct value sets, so two rows are included in the t2
table.
dropColumns
The dropColumns
method creates a table with the same number of rows as the source table, but omits any columns included in the dropColumns
argument. This method is useful when you only want to eliminate a small number of columns from the source table, or, in cases where you need to add a column for some operation, but then no longer need it after the operation is complete.
Example
Starting with the table shown below named source
, we will then write a query to create another table that will drop (omit) one or more columns.
source
A |
B |
C |
D |
---|---|---|---|
1 |
10 |
AAPL |
Lucy |
2 |
20 |
IBM |
Ricky |
3 |
30 |
C |
Fred |
The query follows:
t=source.dropColumns("B","D")
t
is the name of the variable that will store the new table, source
is the table name, and dropColumn
is the method. The argument ("B","D"
) indicates columns B
and D
should be dropped (not included) in the resulting table.
When the query runs, the new t
table is created.
|
|
moveColumns
The moveColumns
method creates a table with the same number of rows and columns as the source table. However, a column (or a set of columns) can be moved to a different location by assigning it to a specific column index value in the moveColumns
argument.
Example
Starting with the table shown below named source
, we will then write a query to create another table with a column in a different order.
source
A |
B |
C |
D |
---|---|---|---|
1 |
10 |
AAPL |
Lucy |
2 |
20 |
IBM |
Ricky |
3 |
30 |
C |
Fred |
The query follows:
t=source.moveColumns(3,"B")
t
is the name of the variable that will store the new table, source
is the table name, and moveColumn
is the method. The argument (3,"B"
) indicates column B
should be moved to column index number three in the resulting table. Deephaven uses a zero-based index model, so column index number 3
would be the fourth column.
When the query runs, the new t
table is created with column B
now in the fourth position (3rd index).
|
|
Column sets can also be moved using the moveColumns
method. For example, the following query moves the set of columns B
and C
to column index number 2.
t2=source.moveColumns(2,"B","C")
When this query runs, the new t2
table is created with the set of columns B
and C
now in the third and fourth position (column index number 2 and 3 ) respectively.
|
|
moveUpColumns
The moveUpColumns
method creates a table with the same number of rows and columns as the source
table. However, a column (or a set of columns) can be moved to the first column index position in the resulting table by including the column name (or set of column names) in the moveUpColumns
argument.
Unlike the moveColumns
method, the argument for the moveUpColumns
method does not require the column index number.
Example
Starting with the table shown below named source
, we will then write a query to create another table with a different column as the left column.
source
A |
B |
C |
D |
---|---|---|---|
1 |
10 |
AAPL |
Lucy |
2 |
20 |
IBM |
Ricky |
3 |
30 |
C |
Fred |
The query follows:
t=source.moveUpColumns("D")
t
is the name of the variable that will store the new table, source
is the table name, and moveUpColumn
is the method. The argument "D"
indicates column D
should be placed first in order (column index 0) in the resulting table. Deephaven uses a zero-based index model, so column index 0 would be the first column.
When the query runs, the new t
table is created with column D
now in the first position (column index number 0).
|
|
A set of columns can also be moved using the moveUpColumns
method. For example, the following query moves the set of columns B
and C
to the first column index position.
t2=source.moveUpColumns("B","C")
When this query runs, the new t2
table is created with columns B
and C
now in the first and second position (column index 0 and 1) respectively.
|
|
moveDownColumns
The moveDownColumns()
method creates a table with the same number of rows and columns as the source table. However, a column (or a set of columns) can be moved to the right side in the resulting table by including the column name (or set of column names) in the moveDownColumns()
argument.
Unlike the moveColumns()
method, the argument for the moveDownColumns()
method does not require the column index number.
Example
Starting with the table shown below named source, we will then write a query to create another table with a different column as the right column.
source
A |
B |
C |
D |
---|---|---|---|
1 |
10 |
AAPL |
Lucy |
2 |
20 |
IBM |
Ricky |
3 |
30 |
C |
Fred |
The query follows:
t=source.moveDownColumns("C")
t
is the name of the variable that will store the new table, source
is the table name, and moveDownColumns
is the method. The argument "C"
indicates column D
should be placed last in order in the resulting table.
When the query runs, the new t
table is created with column C
now in the last position.
|
|
renameColumns
The renameColumns
method creates a table with the same number of rows and columns as the source table. However, column names in the resulting table can be changed by including them in the renameColumns
argument.
Example
Starting with the table shown below named source
, we will then write a query to change a column name in the resulting table.
source
A |
B |
C |
D |
---|---|---|---|
1 |
10 |
AAPL |
Lucy |
2 |
20 |
IBM |
Ricky |
3 |
30 |
C |
Fred |
The query follows:
t=source.renameColumns("Beta=B")
"t" is the name of the variable that will store the new table, source is the table name, and renameColumn
is the method. The argument,"Beta=B"
indicates the name of column B from the source
table should be renamed to column Beta in the resulting table.
When the query runs, the new t
table is created.
|
|
Multiple columns can be renamed using the renameColumn
method. For example, the following query will rename columns A
and C
in the resulting table:
t2=source.renameColumns("Alpha=A","Charlie=C")
When this query runs, the new t2
table is created.
|
|
Last Updated: 16 February 2021 18:07 -04:00 UTC Deephaven v.1.20200928 (See other versions)
Deephaven Documentation Copyright 2016-2020 Deephaven Data Labs, LLC All Rights Reserved