Downsampling Temporal Data
Binning Intervals
Downsampling time series data can be done by calculating the bin intervals for values and then using firstBy
or lastBy
to select the first or last row for each bin.
For example, using a table named x that contains a DBDateTime
column called Timestamp
, you can get five-minute interval samples from the data using the following:
downsampledx = x.updateView("TimeBin=upperBin(Timestamp, 5 * MINUTE)").lastBy("TimeBin").dropColumns("TimeBin")
This is a little tedious in that you have to add the TimeBin
column to get the lastBy
, and then you drop it because you don't want it as part of your result set. An alternate way to do this is to use DownsampledWhereFilter
, which does the same downsampling, but with a more simple query.
Downsampler
Sampling time-series data on some fixed-time interval is a common method used to reduce the required storage space or for increasing processing speed. While it is possible to perform this operation using a combination of more basic Deephaven operations, there is an existing utility that covers many common applications.
The Downsampler takes a time-series table as input and produces a downsampled version. Other than the timestamp and key columns, each output column is produced by an aggregate: typically either last, min, max, sum, first, last, std, var, avg,
or array
. All aggregates will take match pairs to support renaming the source column (i.e. "Volume=LastSize
" when a sum aggregate is applied).
Timestamps from a specified column are grouped into "bins" using a method determined by the timeBinMode
(defaults to UPPER
, which would put all times between, for example, 3:20 and 3:25 into the 3:25 time bin if 5 minute intervals are used). For some applications, it is useful to have every "time bin" represented, even if there is no data. In this case, one can set the allBins
option. Sometimes, as in stock quote data, if there is no new data, it is desirable that the value of the prior bin should be used instead of null. To produce this behavior, specify the relevant column(s) in the maintainState
option.
The Downsamples
can operate in online mode (default is on), in which case the downsampled table is updated whenever the source changes. If this behavior isn't needed, it is possible to turn off online mode, and run large downsampling jobs using multiple threads in parallel. Online mode supports only single-threaded operation.
Under normal conditions, every time a new row is added to the source table, the downsampled table will "tick". This is because the last time bin is always "active" (i.e., a row at 3:23 will affect the 3:25 bin). This causes a large volume of table updates in the downsampled table (at least one for each new row in the source). Using the excludeLastBin
option, the last time bin can be excluded from the output until it can no longer change (assuming timestamps increase monotonically), one can eliminate this behavior. This is useful when logging the downsampled output to another table.
Typical examples
Downsample quotes, while filling in "empty" time bins, and maintaining state.
import com.illumon.iris.downsampling.Downsampler
downsampledQuotes = Downsampler.builder(db, quoteTable, "Timestamp", "00:05:00", "Sym")
.allBins(true)
.last("Bid","BidSize","Ask","AskSize")
.maintainState("Bid","BidSize","Ask","AskSize")
.execute()
Downsample trades, calculating Volume, High and Low per time bin:
import com.illumon.iris.downsampling.Downsampler
downsampledTrades = Downsampler.builder(db, tradeTable, "Timestamp", "00:05:00", "Symbol")
.last("Price","Size")
.sum("Volume=Size")
.min("Low=Price")
.max("High=Price")
.execute()
Last Updated: 16 February 2021 18:07 -04:00 UTC Deephaven v.1.20200928 (See other versions)
Deephaven Documentation Copyright 2016-2020 Deephaven Data Labs, LLC All Rights Reserved