Filtering

Filtering is a way to exclude data you do not want to see or use in your analysis. Because less data gets loaded, filtering also reduces the processing times for queries so you can work faster.

When writing queries, filtering is accomplished using the where method and other related methods described below.

where()

Syntax

where(One or more logical expression strings)

The where method returns all the rows from the source table for which the logical expressions evaluate to true. Typically these expressions will do things like check for values contained in columns.

Each filter is expressed as a Boolean function in Deephaven's Java extended language. Filters are applied from left to right. Therefore, the order in which they are passed to the function can have a substantial impact on the execution time.

To see how it works, we will look at an example using an imaginary table with data regarding the people on board the Titanic's ill-fated voyage. An excerpt of the table is shown below and includes data such as the passengers' Class (1, 2, 3, or crew), Age (adult or child), Gender (male or female) and Survival (yes or no).

titanic

CLASS

AGE

GENDER

SURVIVAL

1

adult

male

true

1

adult

female

false

1

child

male

true

1

adult

female

true

1

adult

female

true

1

adult

male

false

1

adult

male

false

1

adult

female

true

1

child

female

true

For this example, we will want to filter the table to return rows when the people on board were children. The query follows:

t=titanic.where("AGE=`child`")

In this query, t creates a variable to hold the new table.  titanic is the name of the source table.  where is the method to filter the table.  AGE is the name of the column containing the filter value and child is the filter value. After the query runs, the resulting table will look similar to that shown below. 

t

CLASS

AGE

GENDER

SURVIVAL

1

child

male

true

1

child

female

true

1

child

male

true

1

child

male

true

1

child

female

true

1

child

male

true

2

child

female

true

2

child

female

true

2

child

female

true

The new t table has the same set of columns as the titanic table.  However, the t table is now filtered to contain only those rows that have the value child in the Age column.

Filtering on multiple values

Multiple filters can be used in a query.  For example, we could filter the titanic table to show only the data for (a) children, (b) who were female, and (c) who survived.

These additional filters could be iterated through multiple where clauses in a query as shown below:

t2=titanic.where("Age=`child`")
          .where("Gender=`female`")
          .where("Survival=`true`")

However, the Deephaven Query language allows you to combine multiple where filters into one statement within the argument. The query shown above would then be simplified to the following:

t2=titanic.where("Age=`child`","Gender=`female`","Survival=`true`")

When this query runs, the resulting t2 table would look similar to that shown below. 

t2

CLASS

AGE

GENDER

SURVIVAL

1

child

female

true

2

child

female

true

2

child

female

true

2

child

female

true

2

child

female

true

2

child

female

true

Advanced Filtering Techniques

So far, the examples used for describing the where method have involved a table with only a couple thousand rows of data. Filtering on such a small table is a trivial task for Deephaven. However, when your table size grows to thousands or millions of rows (or more), you will want to ensure you are filtering the data in the most efficient method to reduce compute expense and execution time. 

Filters are applied from left to right. Therefore, the order in which they are passed to the function can have a substantial impact on the execution time.

There are two categories of filters in Deephaven: Match filters and Conditional filters. However, they cannot be used together in the same filter string.

Match Filters

Because match filters benefit from special handling in Deephaven, you should use them whenever possible and before any conditional filters. Match filters often enable Deephaven to perform optimizations that would not be possible with a potentially more expressive conditional filter. For example, grouping information on historical data sources can be used with match filters, but not conditional filters.

There are five kinds of match filters in Deephaven:

  • = (the equal sign)
  • in
  • not in
  • icase in
  • icase not in

Note: match filters must be used individually. They cannot be combined in the same filter string.

=     (the equal sign)

Syntax

.where("columnName=value")

This method returns rows that have a matching value in the specified column.

Example

The following filter returns rows where the value of "child" is included in the column Age.

.where("Age=`child`")

in

Syntax 

.where("columnName in valueList")

This method returns rows that contain a match of one or more values in the specified column.

Example

The following filter returns rows that match the values of either 2 or 3 in the column Class.

.where("Class in `2`,`3`") 

not in

Syntax

.where("columnName not in valueList")

This method returns rows that do not contain a match of one or more values in specified column.

Example

The following filter returns rows that do not match the value of either 2 or 3 in the column Class.

.where("Class not in `2`,`3`")

icase in

Syntax 

.where("columnName icase in valueList")

This method returns rows that contain a match of one or more values in the specified column regardless of the capitalization of the values.

Example

The following filter returns rows that match the values of either A or b in the column Class, regardless of capitalization.

.where("Class icase in `A`,`b`") 

icase not in

Syntax

.where("columnName icase not in valueList")

This method returns rows that do not contain a match of one or more values in specified column regardless of the capitalization of the values.

Example

The following filter returns rows that do not match the value of either A or b in the column Class, regardless of capitalization.

.where("Class icase not in `A`,`b`")

Using Variables within Match Filters

The right side of a match filter using the "in" statement allows for use of variables as follows:

  • list of variables: A in X,Y,Z  - the filter will return true for all the rows where A is equal to X,Y or Z
  • single variable name: A in X:
    1. If X is a Java array or java.util.Collection, the filter will return true for all the rows where A is equal to one element of X
    2. For all other types of X, the filter will return true for all the rows where A is equal to X

Conversely, the right side of a not in statement allows for use of variables as follows:

  • list of variables: A not in X,Y,Z - the filter will return true for all the rows where A is not equal to X,Y or Z
  • single variable name A not in X:
    1. If X is a Java array or Collection, the filter will return true for all the rows where A is not equal to one element of X
    2. For all other types of X, the filter will return true for all the rows where A is not equal to X

Conditional Filters

Conditional filters can be used to filter data based on formulas other than those included in match filters. Conditional filters are not optimized like match filters.  Therefore they should be placed after match filters in a given where clause. Conditional filters can be any arbitrary Java expression evaluator, including:

  • where.("x==y")
  • where("x>y")
  • where("x<y")
  • startsWith()
  • endsWith()

Examples of conditional filters follow:
Example

.where("B>25")

In the t5 table below, the filter returns rows only if the value of the data in column B is greater than 25.

source

A

B

C

1

10

AAPL

2

20

IBM

3

30

SPY

4

40

AAPL

5

50

AA

 

t5=source.where("B>25")

A

B

C

3

30

SPY

4

40

AAPL

5

50

AA

Example

where("A%2==0")     

In the t6 table below, the filter returns rows only if the value in column A is exactly divisible by 2.

source

A

B

C

1

10

AAPL

2

20

IBM

3

30

SPY

4

40

AAPL

5

50

AA

 

t6=source.where("A%2==0")

A

B

C

2

20

IBM

4

40

AAPL

Example

t7=source.where("A%2==1","C.startsWith(`AA`)")

The example shown in the t7 table below shows a filter using formulas on multiple columns. In this case, the filter will return rows only when

  • the value in column A has a remainder of 1 after being divided by 2, and
  • the value in column C starts with the string AA.

source

A

B

C

1

10

AAPL

2

20

IBM

3

30

SPY

4

40

AAPL

5

50

AA

 

t7=source.where("A%2==1","C.startsWith(`AA`)")

A

B

C

1

10

AAPL

5

50

AA

Note:  Because this is Java code, all the normal constraints of Java code must be respected.  For example, in the "C.startsWith(`AA`)" filter used above, the variable C, may not be NULL otherwise a NullPointerException will result.  You must keep your condition filter code consistent with your data.  If you expect NULL values, you must first check for null values (e.g., "C != null && C.startsWith(`AA`)"). 

Conjunctive and Disjunctive Filtering

When filtering on multiple columns in a table, your query can be written so they work on a conjunctive basis or a disjunctive basis.

Conjunctive

In conjunctive filtering, all filters within a where() clause are evaluated.  For example, two filters are working conjunctively in the following: 

.where("B>25","C.startsWith(`AA`)"

In this case, the filter returns rows only when the value in column B is greater than 25, and the value in column C starts with the string AA.

source

A

B

C

1

10

AAPL

2

20

IBM

3

30

SPY

4

40

AAPL

5

50

AA

 

t8=source.where("B>25","C.startsWith(`AA`)"

A

B

C

4

40

AAPL

5

50

AA

The following queries all use equivalent variations of conjunctive filtering

t2=t1.where("A in firstSet", "B in secondSet")

Assuming firstSet and secondSet are collections, this could also be written with a condition filter as:

t3=t1.where("firstSet.contains(A)", "secondSet.contains(B)")

or even:

t4=t1.where("firstSet.contains(A) && secondSet.contains(B)")

t2 will evaluate, using an optimized match filter, "A in firstSet" and then only for values which match the A filter, it will evaluate "B in secondSet".

t3 is logically the same, but instead of using an optimized match filter, it will evaluate the A column for each row and pass it to the condition filter "firstSet.contains(A)".  For rows that pass this filter, it will evaluate the B column, and pass those values to "secondSet.contains(B)"

t4 will produce the same output as t3, but the A and B columns are evaluated and passed into the filter expression ("firstSet.contains(A) && secondSet.contains(B)").

Disjunctive

In disjunctive filtering, each filter clause is evaluated independently, and results are presented if any of the filters return results. 

For example, two filters are working disjunctively in this clause: 

.where("B>25 || C.startsWith(`AA`)"

In this case, the filter will present rows only when the value in column B is greater than 25, or the value in column C starts with the string AA, or both.

source

A

B

C

1

10

AAPL

2

20

IBM

3

30

SPY

4

40

AAPL

5

50

AA

 

t9=source.where("B>25 || C.startsWith(`AA`)"

A

B

C

1

10

AAPL

3

30

SPY

4

40

AAPL

5

50

AA

The following syntax can also be used to indicate disjunctive filtering:

.whereOneOf("B>25", "C.startsWith(`AA`)"

Partitioning and Grouping Columns

When you partition a hard drive, you are dividing the entire hard drive into multiple, logical, independent volumes. In a high-end data system like Deephaven, data is also divided into multiple, logical and independent partitions, which improves the speed, performance and efficiency of the data analysis.

For example, if a dataset includes information about every cell phone call made in a given city over 10 years, a logical way to partition that data would be by a date range. Instead of having one massive file that contained every record for that time period, 10 smaller partitions could be created to hold the data related to one year of calls. If data was needed to analyze only a specific range of time, the entire data set would not need to be analyzed again.  Rather, the analysis could be performed on one or more of the smaller partitions, which would take considerably less time.

The Deephaven query language is designed to make filtering on partitioning columns highly efficient. Therefore, when possible, one should filter first on partitioning columns. 

Grouping Columns

Grouping columns determine how table data is organized on disk such that each unique value, and its related row data, are located in contiguous blocks of rows. These have persistent indexes that the Deephaven query engine can leverage to vastly improve the efficiency of match filters.

If your analyses require filtering on the data contained in both partitioning columns and grouping columns, the most efficient method would be to first filter on partitioning columns, and then separately filter on grouping columns and/or any remaining columns as needed.

To determine the partitioning and grouping columns in your dataset, you will need to review the metadata for your applicable table(s).  One way this can be accomplished is by using the getMeta method, which will return a table with column details.

tMeta=db.t("<namespace>", "<tablename>").getMeta()

Composing Complex Where Clauses

When building a query that applies multiple filters to the same data, the best approach is to chain multiple filter clauses within the same where method.

When you use any filter on a partitioning column or when you specifically use a match filter on a grouping column, that filter should be an independent clause.

If you have a complex formula, then it is best to combine the clauses if and only if they reference the same column(s).

Multiple clauses can be combined into single where method calls or separated into multiple where method calls. The results and processing effort are the same, but multiple where clauses may improve readability of the query.

Order matters and you should generally put the more selective or more efficient filters first, although the exact performance may vary depending on the filters themselves or the data being filtered. In many cases, doing so allows the query engine to avoid overhead.

The following where clauses illustrate these differences: 

Quality

Query

Discussion

Good

where("Date=`2011-03-21`","USym==`AAPL`", "BidMkt > 100")

The partitioning column (Date) is evaluated first, followed by the USym filter.

Bad

where("USym==`AAPL`", "Date=`2011-03-21`", "BidMkt > 100")

The USym filter is evaluated first, without having first pruned off irrelevant partitions. Every row of every partition's USym column must be examined, which takes longer and uses more computing power.

Bad

where("Date=`2011-03-21`", "BidMkt > 100","USym==`AAPL`")

The partitioning column (Date) is evaluated first, which is good.  However, the USym filter can be executed more efficiently on grouped data, so that filter should be placed ahead of the BidMkt filter

Good

where("Date=`2011-03-21`", "BidMkt > 100 && BidMkt < 200")

The partitioning column (Date) is evaluated first, and there is only a single formula evaluation for the conditions on BidMkt.

Less Good

where("Date=`2011-03-21`", "BidMkt > 100", "BidMkt < 200")

The partitioning column (Date) is evaluated first, but two formulas must be evaluated (one for BidMkt > 100,  and one for BidMkt < 200) rather than one.

Less Good

where("Date=`2011-03-21`").where("BidMkt > 100 && BidMkt < 200")

The partitioning column (Date) is evaluated first, but an intermediate table is created by the secondary where clause.

Good

where("Date=`2011-03-21`", "USym == `AAPL`", "BidMkt > 100 && BidMkt < 200")

The partitioning column (Date) is evaluated first, and the USym filter and the BidMkt filter are independently evaluated in the best order.

Bad

where("Date=`2011-03-21`", "USym == `AAPL` && BidMkt > 100 && BidMkt > 200")

The partitioning column (Date) is evaluated first, but "USym == AAPL" cannot be applied as a match filter, because it is part of a more complex formula.

whereIn and whereNotIn

The whereIn and whereNotIn methods enable you to filter one table based on the contents of another table, which may or may not contain ticking data.

A where clause is evaluated only when a row in the filtered table ticks. whereIn and whereNotIn are  evaluated whenever either table changes. Join expressions, such as join and naturalJoin are also evaluated when either table changes. 

Unlike naturalJoin, whereIn can be used when there are more than one matching value in the right table for values in the left table. This is true of join as well, but whereIn is faster to return matching rows than join.

Also, whereIn only provides filtering, and does not allow adding columns from the right table. In some cases it may be desirable to use whereIn to filter and then join to add columns from the right table. This provides similar performance to naturalJoin while still allowing multiple matches from the right table.

To demonstrate, we will imagine there is a usymsOfInterest table. It contains a USym column. Its contents can vary, depending on which symbols we are currently interested in.

usymsOfInterest

USym

AAPL

IBM

SPY

AAPL

AA

If you have another table, of stock quotes (that might include thousands of USyms), you might want to filter that table to only show the USyms that you care about — the ones in your "usymsOfInterest" table. You can do this with whereIn():

interestingStockQuotes = stockQuotes.whereIn(usymsOfInterest, "USym")

This will filter the stockQuotes table based on USym.

Simlarly, whereNotIn, can be used to process and anti-join and return rows from the left table that have no match in the right table.

onlyValid=tableToFilter.whereIn(validUsyms, "USym")
onlyInvalid=tableToFilter.whereNotIn(validUsyms, "USym")

The onlyValid table contains rows with a USym value that is in the list of validUsyms, the onlyInvalid table contains only rows with a USym value that is not in the list of validUsyms.

You may specify more than one column to filter, and the names need not be the same. For example, if we wanted to match on USym and Expiry, but the tableToFilter table had a column named Maturity instead, we could use the following construct:

onlyValid=tableToFilter.whereIn(validUsyms, "USym", "Maturity=Expiry")

whereIn is not appropriate for all situations. The purpose of whereIn is to enable more efficient filtering for a set that changes infrequently. Any time the right table (in this example validUsyms) ticks, all rows of the left table (in this example tableToFilter) must be re-evaluated. If you have a right table that often ticks, you should use a naturalJoin instead.

Head & Tail Filtering

Head and tail filters are used to return the first or last set of rows of a table by specifying the number of rows desired or by specifying the percent of the table. See also: Dedicated Aggregators.

The following filters are used to return a specific number of rows:

head()

tail() 

The following filters are used to return a specific percentage of rows:

headPct()

tailPct()

For example,

table.head(10)   // returns the first 10 rows of a table

table.tail(20)   // returns the last 20 rows of a table

table.headPct(0.25)   // returns the first 25% of rows contained in a table

table.tailPct(0.5)    // returns the last 50% of rows contained in a table

source

A

B

C

1

10

AAPL

2

20

IBM

3

30

SPY

4

40

AAPL

5

50

AA

t10=source.head(3)

A

B

C

1

10

AAPL

2

20

IBM

3

30

SPY

t11=source.tailPct(0.4)

A

B

C

4

40

AAPL

5

50

AA

If a filter operation is the cause of a failed query, the exception in the Deephaven interface will prompt you to clear filters with the accompanying Clear Filtering button or access to the right-click menu.

DownsampledWhereFilter

The DownsampledWhereFilter enables users to downsample time series data by calculating the bin intervals for values, and then using upperBin and lastBy to select the last row for each bin. Note: The column containing the data to be binned must be sorted for the method to work. A sample query using this method follows:

from deephaven import *

downsampledX = x.where(DownsampledWhereFilter("Timestamp", 5*dbtu.MINUTE))
import com.illumon.iris.db.v2.select.DownsampledWhereFilter
downsampledX = x.where(new DownsampledWhereFilter("Timestamp", 5 * MINUTE))

The default for this method is to downsample the bins based on upperBin and lastBy. However, you can downsample the bin based on lowerBin and firstBy by adding a third argument to the DownsampledWhereFilter method. An example follows with the third argument highlighted:

from deephaven import *

downsampledX = x.where(DownsampledWhereFilter("Timestamp", 5*dbtu.MINUTE,
                                              DownsampledWhereFilter.SampleOrder.LOWERFIRST))
import com.illumon.iris.db.v2.select.DownsampledWhereFilter
downsampledX = x.where(new DownsampledWhereFilter("Timestamp", 5 * MINUTE,
    DownsampledWhereFilter.SampleOrder.LOWERFIRST))

LOWERFIRST is the constant for lowerBin/firstBy.

UPPERLAST is the constant for upperBin/lastBy.

Either constant works in this query. However, if the third argument is not present, the downsampling will occur on an upperBin/lastBy basis.

Examples

All the following examples will give you the same result of trades after or including 11am:

p=db.i("SystemEquity","TradeData").where("Date=`2012-02-15`", "Timestamp>='2012-02-15T11:00:00 NY'")

p=db.i("SystemEquity","TradeData").where("Date=`2012-02-15`", "Timestamp+'1:00:00'>='2012-02-15T10:00:00 NY'")

p=db.i("SystemEquity","TradeData").where("Date=`2012-02-15`", "Timestamp+'T1h'>='2012-02-15T10:00:00 NY'")

p=db.i("SystemEquity","TradeData").where("Date=`2012-02-15`", "hourOfDay(Timestamp, TZ_NY)>=11")

Note: The ticks around the timestamp are regular ticks, not backticks.


Last Updated: 08 April 2020 10:40 -04:00 UTC    Deephaven v.1.20200121  (See other versions)

Deephaven Documentation     Copyright 2016-2020  Deephaven Data Labs, LLC     All Rights Reserved