Downsampling Temporal Data

Binning Intervals

Time-series data can be downsampled by calculating a bin interval for each row and then using firstBy or lastBy to select the first or last row in each bin.

For example, using a table named x that contains a DBDateTime column called Timestamp, you can get five-minute interval samples from the data using the following:

downsampledx = x.updateView("TimeBin=upperBin(Timestamp, 5 * MINUTE)").lastBy("TimeBin").dropColumns("TimeBin")

This is a little tedious: you have to add the TimeBin column in order to run the lastBy, and then drop it because you don't want it in your result set. An alternative is to use DownsampledWhereFilter, which performs the same downsampling with a simpler query.

See DownsampledWhereFilter for details.
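
As an illustration, the five-minute sample above could be expressed with a single where clause. This is only a sketch: the package path and the two-argument constructor (timestamp column name, bin size in nanoseconds) are assumptions, so consult the DownsampledWhereFilter javadoc for the exact signature.

import com.illumon.iris.db.v2.select.DownsampledWhereFilter   // assumed package location

// Keep one row per five-minute bin, selected directly by the filter (no helper column needed).
downsampledx = x.where(new DownsampledWhereFilter("Timestamp", 5 * MINUTE))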

Downsampler

Sampling time-series data at a fixed time interval is a common way to reduce storage requirements or increase processing speed. While it is possible to perform this operation with a combination of more basic Deephaven operations, the Downsampler utility covers many common applications.

The Downsampler takes a time-series table as input and produces a downsampled version. Other than the timestamp and key columns, each output column is produced by an aggregate: typically last, first, min, max, sum, avg, std, var, or array. All aggregates accept match pairs to support renaming the source column (e.g., "Volume=LastSize" when a sum aggregate is applied).

Timestamps from a specified column are grouped into "bins" using a method determined by timeBinMode (the default, UPPER, would put all times between, for example, 3:20 and 3:25 into the 3:25 bin when five-minute intervals are used). For some applications, it is useful to have every time bin represented even if it contains no data; in that case, set the allBins option. Sometimes, as with stock quote data, it is desirable to carry the prior bin's value forward instead of producing null when there is no new data. To get this behavior, specify the relevant column(s) in the maintainState option.
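
For example, to label each bin by its start time rather than its end time, the mode can be switched to LOWER. This is only a sketch following the builder pattern shown in the examples below; tradeTable and downsampledLower are placeholder names.

import com.illumon.iris.downsampling.Downsampler

// Assign a 3:23 row to the 3:20 bin instead of the 3:25 bin.
downsampledLower = Downsampler.builder(db, tradeTable, "Timestamp", "00:05:00", "Symbol")
   .last("Price")
   .timeBinMode("LOWER")
   .execute()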

The Downsampler can operate in online mode (the default), in which case the downsampled table updates whenever the source changes. Online mode supports only single-threaded operation; if live updates aren't needed, online mode can be turned off and large downsampling jobs can be run with multiple threads in parallel.
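
For example, a large historical table might be downsampled offline with several worker threads. This sketch uses the online and numThreads setters documented below; histTable and downsampledHist are placeholder names.

import com.illumon.iris.downsampling.Downsampler

// Batch (non-updating) downsample using four worker threads.
downsampledHist = Downsampler.builder(db, histTable, "Timestamp", "00:05:00", "Sym")
   .last("Bid","Ask")
   .online(false)
   .numThreads(4)
   .execute()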

Under normal conditions, the downsampled table "ticks" every time a new row is added to the source table, because the last time bin is always "active" (i.e., a row at 3:23 affects the 3:25 bin). This causes a large volume of table updates in the downsampled table, at least one for each new row in the source. The excludeLastBin option eliminates this behavior by excluding the last time bin from the output until it can no longer change (assuming timestamps increase monotonically). This is useful when logging the downsampled output to another table.
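
For instance, a downsample that feeds a logger might withhold the active bin so that each bin is emitted exactly once. This sketch simply adds the excludeLastBin setter, documented below, to a builder call like the quote example that follows; downsampledForLogging is a placeholder name.

import com.illumon.iris.downsampling.Downsampler

// Emit each bin only after its interval has closed, so it cannot change again.
downsampledForLogging = Downsampler.builder(db, quoteTable, "Timestamp", "00:05:00", "Sym")
   .last("Bid","BidSize","Ask","AskSize")
   .excludeLastBin(true)
   .execute()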

Typical examples

Downsample quotes, filling in "empty" time bins and maintaining state:

import com.illumon.iris.downsampling.Downsampler
downsampledQuotes = Downsampler.builder(db, quoteTable, "Timestamp", "00:05:00", "Sym")
   .allBins(true)
   .last("Bid","BidSize","Ask","AskSize")
   .maintainState("Bid","BidSize","Ask","AskSize")
   .execute()

Downsample trades, calculating Volume, High and Low per time bin:

import com.illumon.iris.downsampling.Downsampler
downsampledTrades = Downsampler.builder(db, tradeTable, "Timestamp", "00:05:00", "Symbol")
   .last("Price","Size")
   .sum("Volume=Size")
   .min("Low=Price")
   .max("High=Price")
   .execute()

| Setter Method | Type | Req? | Default | Description |
| --- | --- | --- | --- | --- |
| timeBinMode | String or Downsampler.TimeBinMode | No | UPPER | How to assign source data to time bins (UPPER/LOWER). |
| allBins | boolean | No | true | Whether to generate all time bins in the source data range, even if no data matches the bin. |
| maintainStateColumns | String... | No | N/A | Columns for which to maintain state from one time bin to the next if no new data is present for that bin (only applicable when allBins is true). |
| byStrategy | Table.ByStrategy or String | No | DEFAULT | Strategy to use when generating aggregates. |
| ajStrategy | Table.JoinStrategy or String | No | DEFAULT | Strategy to use for "aj" joins (when allBins=true, for maintain-state columns). |
| naturalJoinStrategy | Table.JoinStrategy or String | No | DEFAULT | Strategy to use when joining data to time bins (when allBins=true). |
| setCalendar | String | No | N/A | Calendar used to filter time bins. |
| online | boolean | No | true | Whether the output table should update when the source table ticks. |
| numThreads | int | No | 1 | Number of threads to use for parallel downsampling. Using more threads may improve performance for large source tables. |
| excludeLastBin | boolean | No | false | Whether to exclude the last/active time bin from the output. |
| logger | Logger | No | Default system logger | Override the logger used for reporting progress as the downsampler proceeds. |
| aggregates | List<ComboAggregateFactory.ComboBy> | No | N/A | Add a set of aggregate columns to produce in the output. |
| aggregate | AggType, column | No | N/A | Add an aggregate column to produce in the output table. |
| last | String... | No | N/A | Add a set of columns to produce with the "lastBy" operator. |
| first | String... | No | N/A | Add a set of columns to produce with the "firstBy" operator. |
| min | String... | No | N/A | Add a set of columns to produce with the "minBy" operator. |
| max | String... | No | N/A | Add a set of columns to produce with the "maxBy" operator. |
| sum | String... | No | N/A | Add a set of columns to produce with the "sumBy" operator. |
| std | String... | No | N/A | Add a set of columns to produce with the "stdBy" operator. |
| var | String... | No | N/A | Add a set of columns to produce with the "varBy" operator. |
| array | String... | No | N/A | Add a set of columns to produce with the "arrayBy" operator. |

