Downsampling Temporal Data

Binning Intervals

Time-series data can be downsampled by calculating a bin interval for each row and then using firstBy or lastBy to select the first or last row in each bin.

For example, using a table named x that contains a DBDateTime column called Timestamp, you can get five-minute interval samples from the data using the following:

downsampledx = x.updateView("TimeBin=upperBin(Timestamp, 5 * MINUTE)").lastBy("TimeBin").dropColumns("TimeBin")

This is a little tedious: you have to add the TimeBin column in order to run the lastBy, and then drop it because you don't want it in your result set. An alternative is to use DownsampledWhereFilter, which performs the same downsampling with a simpler query.

See DownsampledWhereFilter for details.
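
As an illustration, the five-minute sample above could be expressed with a single where clause. This is only a sketch: the package path and the two-argument constructor (timestamp column name, bin size in nanoseconds) are assumptions, so consult the DownsampledWhereFilter javadoc for the exact signature.

import com.illumon.iris.db.v2.select.DownsampledWhereFilter   // assumed package location

// Keep one row per five-minute bin, selected directly by the filter (no helper column needed).
downsampledx = x.where(new DownsampledWhereFilter("Timestamp", 5 * MINUTE))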

Downsampler

Sampling time-series data at a fixed time interval is a common way to reduce storage requirements or increase processing speed. While it is possible to perform this operation with a combination of more basic Deephaven operations, the Downsampler utility covers many common applications.

The Downsampler takes a time-series table as input and produces a downsampled version. Other than the timestamp and key columns, each output column is produced by an aggregate: typically last, first, min, max, sum, avg, std, var, or array. All aggregates accept match pairs to support renaming the source column (e.g., "Volume=LastSize" when a sum aggregate is applied).

Timestamps from a specified column are grouped into "bins" using a method determined by timeBinMode (the default, UPPER, would put all times between, for example, 3:20 and 3:25 into the 3:25 bin when five-minute intervals are used). For some applications, it is useful to have every time bin represented even if it contains no data; in that case, set the allBins option. Sometimes, as with stock quote data, it is desirable to carry the prior bin's value forward instead of producing null when there is no new data. To get this behavior, specify the relevant column(s) in the maintainState option.
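
For example, to label each bin by its start time rather than its end time, the mode can be switched to LOWER. This is only a sketch following the builder pattern shown in the examples below; tradeTable and downsampledLower are placeholder names.

import com.illumon.iris.downsampling.Downsampler

// Assign a 3:23 row to the 3:20 bin instead of the 3:25 bin.
downsampledLower = Downsampler.builder(db, tradeTable, "Timestamp", "00:05:00", "Symbol")
   .last("Price")
   .timeBinMode("LOWER")
   .execute()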

The Downsampler can operate in online mode (the default), in which case the downsampled table updates whenever the source changes. Online mode supports only single-threaded operation; if live updates aren't needed, online mode can be turned off and large downsampling jobs can be run with multiple threads in parallel.
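
For example, a large historical table might be downsampled offline with several worker threads. This sketch uses the online and numThreads setters documented below; histTable and downsampledHist are placeholder names.

import com.illumon.iris.downsampling.Downsampler

// Batch (non-updating) downsample using four worker threads.
downsampledHist = Downsampler.builder(db, histTable, "Timestamp", "00:05:00", "Sym")
   .last("Bid","Ask")
   .online(false)
   .numThreads(4)
   .execute()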

Under normal conditions, the downsampled table "ticks" every time a new row is added to the source table, because the last time bin is always "active" (i.e., a row at 3:23 affects the 3:25 bin). This causes a large volume of table updates in the downsampled table, at least one for each new row in the source. The excludeLastBin option eliminates this behavior by excluding the last time bin from the output until it can no longer change (assuming timestamps increase monotonically). This is useful when logging the downsampled output to another table.
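
For instance, a downsample that feeds a logger might withhold the active bin so that each bin is emitted exactly once. This sketch simply adds the excludeLastBin setter, documented below, to a builder call like the quote example that follows; downsampledForLogging is a placeholder name.

import com.illumon.iris.downsampling.Downsampler

// Emit each bin only after its interval has closed, so it cannot change again.
downsampledForLogging = Downsampler.builder(db, quoteTable, "Timestamp", "00:05:00", "Sym")
   .last("Bid","BidSize","Ask","AskSize")
   .excludeLastBin(true)
   .execute()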

Typical examples

Downsample quotes, filling in "empty" time bins and maintaining state:

import com.illumon.iris.downsampling.Downsampler
downsampledQuotes = Downsampler.builder(db, quoteTable, "Timestamp", "00:05:00", "Sym")
   .allBins(true)
   .last("Bid","BidSize","Ask","AskSize")
   .maintainState("Bid","BidSize","Ask","AskSize")
   .execute()

Downsample trades, calculating Volume, High and Low per time bin:

import com.illumon.iris.downsampling.Downsampler
downsampledTrades = Downsampler.builder(db, tradeTable, "Timestamp", "00:05:00", "Symbol")
   .last("Price","Size")
   .sum("Volume=Size")
   .min("Low=Price")
   .max("High=Price")
   .execute()

| Setter Method | Type | Req? | Default | Description |
| --- | --- | --- | --- | --- |
| timeBinMode | String or Downsampler.TimeBinMode | No | UPPER | How to assign source data to time bins (UPPER/LOWER). |
| allBins | boolean | No | true | Whether to generate all time bins in the source data range, even if no data matches the bin. |
| maintainStateColumns | String... | No | N/A | Columns for which to maintain state from one time bin to the next if no new data is present for that bin (only applicable when allBins is true). |
| byStrategy | Table.ByStrategy or String | No | DEFAULT | Strategy to use when generating aggregates. |
| ajStrategy | Table.JoinStrategy or String | No | DEFAULT | Strategy to use for "aj" joins (when allBins=true, for maintain-state columns). |
| naturalJoinStrategy | Table.JoinStrategy or String | No | DEFAULT | Strategy to use when joining data to time bins (when allBins=true). |
| setCalendar | String | No | N/A | Calendar used to filter time bins. |
| online | boolean | No | true | Whether the output table should update when the source table ticks. |
| numThreads | int | No | 1 | Number of threads to use for parallel downsampling. Using more threads may improve performance for large source tables. |
| excludeLastBin | boolean | No | false | Whether to exclude the last/active time bin from the output. |
| logger | Logger | No | Default system logger | Override the logger used for reporting progress as the downsampler proceeds. |
| aggregates | List<ComboAggregateFactory.ComboBy> | No | N/A | Add a set of aggregate columns to produce in the output. |
| aggregate | AggType, column | No | N/A | Add an aggregate column to produce in the output table. |
| last | String... | No | N/A | Add a set of columns to produce with the "lastBy" operator. |
| first | String... | No | N/A | Add a set of columns to produce with the "firstBy" operator. |
| min | String... | No | N/A | Add a set of columns to produce with the "minBy" operator. |
| max | String... | No | N/A | Add a set of columns to produce with the "maxBy" operator. |
| sum | String... | No | N/A | Add a set of columns to produce with the "sumBy" operator. |
| std | String... | No | N/A | Add a set of columns to produce with the "stdBy" operator. |
| var | String... | No | N/A | Add a set of columns to produce with the "varBy" operator. |
| array | String... | No | N/A | Add a set of columns to produce with the "arrayBy" operator. |

