Downsample Imports

Downsample imports are used to import a downsampled version of data from an existing time-series table into an intraday partition. The target table must already exist. Unlike other styles of import, Deephaven provides no "schema creator" tool, as the target table typically takes a very similar or identical form to the source table.

Quickstart

The following is an example of how to run a downsample import from the command line (this document also describes how to import via a script or a Deephaven UI Import Job). This example assumes a typical Deephaven installation and that the target table already exists.

sudo -u dbmerge iris_exec downsample_import -- \
-ns TargetNamespace -tn TargetQuoteTable \
-sn SourceNamespace -st SourceQuoteTable -sp 2018-09-26 \
-tc Timestamp -p 00:00:10 -kc Sym -last Bid,Ask,BidSize,AskSize \
-dp localhost/2018-09-26

This example downsamples the 2018-09-26 partition of the table SourceNamespace.SourceQuoteTable and stores the result in the target table (TargetNamespace.TargetQuoteTable). The Timestamp column is used to bin the source data into 10-second intervals for each key. "Sym" is identified as the key column, and the last values in each time interval are taken from the Bid, Ask, BidSize, and AskSize columns.

Import Query

When Import - Downsample is selected, the Persistent Query Configuration Editor window shows the following options:

To proceed with creating a query to import downsampled data, you will need to select a merge server and enter the desired value for Memory (Heap) Usage (GB).

Options available in the Show Advanced Options section of the panel are typically not used when importing or merging data. To learn more about this section, please refer to the Persistent Query Configuration Viewer/Editor.

The Access Control tab presents a panel with the same options as all other configuration types, and gives the query owner the ability to authorize Admin and Viewer Groups for this query. For more information, please refer to Access Control.

Clicking the Scheduling tab presents a panel with the same scheduling options as all other configuration types. For more information, please refer to Scheduling.

Clicking the Downsample Import Settings tab presents a panel with the relevant options:

Downsample Import Settings

  • Target Namespace: This is the namespace into which you want to import the data.
  • Target Table: This is the table into which you want to import the data. The schema of this table must be consistent with the output of the downsampling process (as specified by the other options on this panel).
  • Output Mode: This determines what happens if data is found in the fully-specified partition for the data. The fully-specified partition includes both the internal partition (unique for the import job) and the column partition (usually the date).
    • Safe - if existing data is found in the fully-specified partition, the import job will fail.
    • Append - if existing data is found in the fully-specified partition, data will be appended to it.
    • Replace - if existing data is found in the fully-specified partition, it will be replaced. This does not replace all data for a column partition value, just the data in the fully-specified partition.
  • Strict Import: has no meaning (at present) for this type of import.
  • Downsample Parameters
    • By Strategy: The Deephaven "byStrategy" to use when performing the downsample aggregate operations.
    • AJ Strategy: The Deephaven join strategy to use when joining source data to time bins (applicable only when generating all time bins).
    • Natural Join Strategy: The Deephaven join strategy to use when filtering time bins.
    • Number Threads: The number of threads to use when downsampling. For large downsampling jobs where "generate all time bins" is true, this can improve performance by parallelizing the downsample time-join on a per-key basis.
    • Partition Formula: Specifies which source column partition to import. This style of import supports only single-partition imports. A typical value for this might be currentDateNy().
    • Source Namespace: Namespace of the table that will be downsampled.
    • Source Table: Name of the table to downsample. This table must contain the timestamp and key columns specified in this panel. For performance reasons, it is recommended that large source tables be grouped by the key column(s).
    • Timestamp Column: Name of the column in the source table to use for generating time bins. Must be of type DBDateTime.
    • Period: Interval to use for generating time bins, in HH:MM:SS format. Examples: 00:05:00 for 5-minute bins, 00:01:00 for 1-minute bins.
    • Calendar: The name of a calendar to use for filtering time stamps.
    • Time Bin Mode: How to assign timestamps in the source data to time bins. Given a time of 01:32:34 and a period of 5 minutes, UPPER would assign this time to the 01:35:00 bin, while LOWER would assign it to the 01:30:00 bin. UPPER is the more typical choice.
    • Generate All Time Bins: Whether or not to generate time bins for the entire time range spanned by the underlying data, even if no data falls into the bin. Setting this to true requires one or more joins and hence will cause the downsampling process to be significantly slower.
    • Key Columns: Optional columns to use for grouping the source data. The downsampling will produce a row for each (time bin, key) combination. For financial series, the instrument identifier is a typical choice for this.
    • Aggregate Columns: Comma-separated list of aggregate/data column names desired in the downsampler output. These may match columns in the source table, or rename them using the OutputColumn=OriginalColumn format. The latter would typically be used for additional aggregates of source columns (for example, Volume=Size used in combination with a Sum aggregate on trade data). A short example follows this list.
    • Aggregates: Comma-separated list of aggregates to use for each column specified in the Aggregate Columns field. The number of aggregates must exactly match the number of aggregate columns. Legal values for aggregates are: Last, First, Sum, Min, Max, Avg, Std, Var, Array (Count is not available).
    • Maintain State Columns: Comma-separated list of columns for which to maintain state from one time bin to the next when a bin contains no data while generating all time bins. This would typically be used for Last aggregate columns where the value is "good until changed" (such as Bid and Ask in quote data).
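
For example, mirroring the quickstart above, one might enter the following field values to take the last quote values in each bin (the column names are illustrative):

Aggregate Columns: Bid,Ask,BidSize,AskSize
Aggregates: Last,Last,Last,Last

To additionally sum a Size column into a renamed Volume output column, Aggregate Columns could instead be Bid,Ask,Volume=Size with Aggregates Last,Last,Sum, as in the renaming example described above.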

Importing Using Builder

Downsample imports may be performed from inside a Deephaven Groovy or Python script. This permits more elaborate logic with respect to existing data. These scripts may be executed as a persistent query or from the command line using the iris_exec "run_local_script" tool. All imports should be performed as the dbmerge user (or from a persistent query, running on a merge server).

The DownsampleImport class provides a static builder method, which produces an object used to set parameters for the import. The builder returns a DownsampleImport object from the build() method. Imports are executed via the run() method and, if successful, return the number of rows imported. All other parameters and options for the import are configured via the setter methods described below. The typical pattern for executing a downsample import is:

nRows = DownsampleImport.builder(<db>, <namespace>, <table>, <timestamp column>, <period>, <key column>...)
  .set<option>(<option value>)
  ...
  .build()
  .run()
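
For example, a minimal Groovy sketch corresponding to the quickstart above might look like the following. The namespaces, table names, and columns are the same placeholders used in the quickstart; retrieving the source table with db.t and a Date filter is an assumption and should be adapted to wherever the source data actually lives.

// Source data for one date (assumed to be available as a historical table)
source = db.t("SourceNamespace", "SourceQuoteTable").where("Date=`2018-09-26`")

nRows = DownsampleImport.builder(db, "TargetNamespace", "TargetQuoteTable", "Timestamp", "00:00:10", "Sym")
    .setSourceTable(source)                              // table to downsample (required)
    .setDestinationPartitions("localhost/2018-09-26")    // internal partition / column partition
    .addLastColumns("Bid", "Ask", "BidSize", "AskSize")  // last value per 10-second bin and Sym
    .build()
    .run()

println("Imported " + nRows + " rows")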

Required Builder Parameters

  • db (Database, required): The Deephaven database (typically "db" in a Python/Groovy script).
  • namespace (String, required): The target table namespace.
  • table (String, required): The target table name.
  • timestamp column (String, required): Name of the time column in the source table used for time sampling. The output time bin column will have the same name unless it is renamed by specifying this as a pair (e.g., "TimeBin=Timestamp", where Timestamp is the input column and TimeBin is the output column).
  • period (String, required): Time interval for sampling in HH:MM:SS format.
  • key column... (String(s), required): Zero or more "key columns", used to group the data. There will be one row in the output for each distinct key value per time interval.

Import Options

  • setDestinationPartitions (String, optional*): The destination partition(s), e.g., "localhost/2018-05-01" for a single-partition import or "localhost" for a multi-partition import.
  • setDestinationDirectory (File | String, optional*): The destination directory, e.g., /tmp/mylocation. (*Either destination partition(s) or a destination directory must be provided.)
  • setOutputMode (ImportOutputMode | String, optional, default SAFE): Enumeration with the options ImportOutputMode.SAFE, ImportOutputMode.REPLACE, and ImportOutputMode.APPEND. May also be specified as a String (e.g., "SAFE", "REPLACE", "APPEND").
  • setStrict (boolean, optional, default true): If true, will allow target columns that are missing from the source to remain null, and will allow the import to continue when data conversions fail. In most cases, this should be set to false only when developing the import process for a new data source.
  • setPartitionColumn (String, optional): Column name used to choose the partition into which each source row is imported.
  • setSourceTable (Table, required): Source table to downsample.
  • setTimeBinMode (String | Downsampler.TimeBinMode, optional, default UPPER): Must be LOWER or UPPER. Indicates how to generate the time bins from source timestamps: LOWER maps a timestamp to the closest previous time that falls on an even time interval; UPPER maps it to the next.
  • setAllBins (boolean, optional, default false): Indicates whether or not to generate all time bins over the interval covered by the source table. If false, only intervals for which source data exists will be generated.
  • setMaintainStateColumns (String..., optional): Indicates a subset of one or more output columns that should "maintain state", i.e., keep the same value from one time interval to the next even if no new data is observed. Typically used with setAllBins(true).
  • setByStrategy (String | Table.ByStrategy, optional, default DEFAULT): The Deephaven "byStrategy" to use when performing the downsample aggregate operations.
  • setAjStrategy (String | Table.JoinStrategy, optional, default DEFAULT): The Deephaven join strategy to use when joining source data to time bins (applicable only when setAllBins(true)).
  • setNaturalJoinStrategy (String | Table.JoinStrategy, optional, default DEFAULT): The Deephaven join strategy to use when filtering time bins.
  • setCalendar (String, optional): A Deephaven calendar to use for filtering output time bins.
  • setNumThreads (int, optional, default 1): Maximum number of threads to use when downsampling. Downsampling can be parallelized to some extent if key columns are present.
  • setLogger (Logger, optional): Deephaven logger object for logging progress. If not set, a default logger is used.
  • addAggregates (ComboAggregateFactory.ComboBy..., optional): Adds the given aggregates to the downsampling process for generating output data columns. It is typically easier to use the methods for specific operations (addLastColumns, etc.).
  • addAggregate (AggType, String; optional): Adds the given aggregate to the downsampling process for generating output data columns. It is typically easier to use the methods for specific operations (addLastColumns, etc.).
  • addLastColumns (String..., optional): Adds the given columns to the downsample output, to be generated by a lastBy operation from the source data.
  • addFirstColumns (String..., optional): Adds the given columns to the downsample output, to be generated by a firstBy operation from the source data.
  • addMinColumns (String..., optional): Adds the given columns to the downsample output, to be generated by a minBy operation from the source data.
  • addMaxColumns (String..., optional): Adds the given columns to the downsample output, to be generated by a maxBy operation from the source data.
  • addSumColumns (String..., optional): Adds the given columns to the downsample output, to be generated by a sumBy operation from the source data.
  • addStdColumns (String..., optional): Adds the given columns to the downsample output, to be generated by a stdBy operation from the source data.
  • addVarColumns (String..., optional): Adds the given columns to the downsample output, to be generated by a varBy operation from the source data.
  • addAvgColumns (String..., optional): Adds the given columns to the downsample output, to be generated by an avgBy operation from the source data.
  • addArrayColumns (String..., optional): Adds the given columns to the downsample output, to be generated by an arrayBy operation from the source data.
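
Building on the earlier sketch, these options can be combined. For example, an illustrative (not definitive) sketch that generates every 10-second bin across the day, carries the last Bid/Ask forward through empty bins, and parallelizes the work might look like:

nRows = DownsampleImport.builder(db, "TargetNamespace", "TargetQuoteTable", "Timestamp", "00:00:10", "Sym")
    .setSourceTable(source)                               // same source table as the earlier sketch
    .setDestinationPartitions("localhost/2018-09-26")
    .addLastColumns("Bid", "Ask", "BidSize", "AskSize")
    .setAllBins(true)                                     // emit a row for every time bin, even empty ones
    .setMaintainStateColumns("Bid", "Ask")                // carry last Bid/Ask forward through empty bins
    .setNumThreads(4)                                     // parallelize the time-join per key
    .setOutputMode("REPLACE")                             // replace any data already in this partition
    .build()
    .run()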

Import from Command Line

Downsample imports from the command line work in a very similar way to the scripted imports. The primary difference is that since the source table is specified by name (instead of a Table object), you must also supply a source partition (i.e., Date) from which to import.

The syntax for running downsample imports from the command line is described below. You may also issue the import command without arguments to see them described in the console.

Command Reference

iris_exec downsample_import <launch args> -- <downsample import args>

-dd or --destinationDirectory <path>
-dp or --destinationPartition <internal partition name / partitioning value> | <internal partition name>
-pc or --intradayPartition <partition column name>
-ns or --namespace <namespace>
-tn or --tableName <name>
-om or --outputMode <import behavior>
-rc or --relaxedChecking <TRUE or FALSE>
-sn or --sourceNamespace <source namespace>
-st or --sourceTable <source table>
-tc or --timestampColumn <timestamp column>
-p or --period <period>
-kc or --keyColumns <key columns>
-sp or --sourcePartition <source partition>
-timeBinMode <LOWER or UPPER>
-allBins <true or false>
-maintainStateColumns <maintain state columns>
-byStrategy <by strategy>
-ajStrategy <aj strategy>
-naturalJoinStrategy <natural join strategy>
-last <last columns>
-first <first columns>
-sum <sum columns>
-min <min columns>
-max <max columns>
-std <standard deviation columns>
-var <variance columns>
-avg <average columns>
-array <array columns>

Downsample Import Arguments

-dd

-dp

-pc

Either a destination directory, specific partition, or internal partition plus a partition column must be provided. A directory can be used to write a new set of table files to a specific location on disk, where they can later be read with TableTools. A destination partition is used to write to intraday locations for existing tables. The internal partition value is used to separate data on disk; it does not need to be unique for a table. The name of the import server is a common value for this. The partitioning value is a string data value used to populate the partitioning column in the table during the import. This value must be unique within the table.

In summary, there are three ways to specify destination table partition(s):

  • Destination directory (e.g., -dd /db/Intraday/<namespace>/<table>/localhost/<date>/<table>/)
  • Internal partition and destination partition (e.g., -dp localhost/2018-01-01)
  • Internal partition and partition column - for multi-partition import (e.g., -dp localhost -pc Date)

-ns

namespace (required): Namespace in which to find the target table.

-tn

tableName (required): Name of the target table.

-om

outputMode (optional): SAFE (default), REPLACE, or APPEND. SAFE checks whether the target table already contains data matching the provided partitioning value; if it does, the import is aborted. When developing an import process, REPLACE should be used, because failed import attempts will often write some data to the table, causing the next attempt with SAFE to abort. APPEND should normally be used only when you are running multiple imports to load a set of data to the same table at one time, possibly from multiple different sources, and the resultant data needs to be kept together as part of one logical partition.

-rc

relaxedChecking (optional): Defaults to FALSE. Presently ignored by Downsampler.

-sn

sourceNamespace (required): Namespace in which to find the source table.

-st

sourceTable (required): Name of the source table.

-tc

timestampColumn (required): Name of the column in the source table to use for time sampling. The output time bin column will be the same as this unless it is renamed by specifying this as a pair (i.e. "TimeBin=Timestamp", where Timestamp is the input column and TimeBin is the output column).

-p

period (required): Time interval for sampling, specified in HH:MM:SS format.

-kc

keyColumns (optional): One or more comma-delimited "key columns", used to group the data. There will be one row in the output for each distinct key value per time interval.

-sp

sourcePartition (optional): Source column partition to downsample, typically a date. This is optional but highly recommended for partitioned tables.

-timeBinMode

timeBinMode (optional): Must be LOWER or UPPER, default is UPPER. This indicates how to generate the time bins from source timestamps. LOWER indicates to map a timestamp to the first previous time that falls on an even time interval, UPPER the next.

-allBins

allBins (optional): Must be true or false, default is false. Indicates whether or not to generate all time bins over the interval covered by the source table. If false, only intervals for which source data exists will be generated.

-maintainStateColumns

maintainStateColumns (optional): A comma-delimited subset of one or more output columns that should "maintain state", or maintain the same value from one time interval to the next even if no new data is observed. Typically used with -allBins true.

-byStrategy

byStrategy (optional): The Deephaven "byStrategy" to use when performing the downsample aggregate operations.

-ajStrategy

ajStrategy (optional): The Deephaven join strategy to use when joining source data to time bins (applicable only when -allBins is true).

-naturalJoinStrategy

naturalJoinStrategy (optional): The Deephaven join strategy to use when filtering time bins.

-calendar

calendar (optional): A Deephaven calendar name to use for filtering output time bins.

-last

last (optional): One or more comma delimited column names, to be aggregated by a lastBy operation from the source data.

-first

first (optional): One or more comma delimited column names, to be aggregated by a firstBy operation from the source data.

-min

min (optional): One or more comma delimited column names, to be aggregated by a minBy operation from the source data.

-max

max (optional): One or more comma delimited column names, to be aggregated by a maxBy operation from the source data.

-sum

sum (optional): One or more comma delimited column names, to be aggregated by a sumBy operation from the source data.

-std

std (optional): One or more comma delimited column names, to be aggregated by a stdBy operation from the source data.

-var

var (optional): One or more comma delimited column names, to be aggregated by a varBy operation from the source data.

-avg

avg (optional): One or more comma delimited column names, to be aggregated by an avgBy operation from the source data.

-array

array (optional): One or more comma delimited column names, to be aggregated by an arrayBy operation from the source data.
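
Putting several of these options together, an extended, illustrative version of the quickstart that generates every time bin and carries the last quote forward through empty bins might look like:

sudo -u dbmerge iris_exec downsample_import -- \
-ns TargetNamespace -tn TargetQuoteTable \
-sn SourceNamespace -st SourceQuoteTable -sp 2018-09-26 \
-tc Timestamp -p 00:00:10 -kc Sym -last Bid,Ask,BidSize,AskSize \
-allBins true -maintainStateColumns Bid,Ask -timeBinMode UPPER \
-om REPLACE -dp localhost/2018-09-26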

