Introduction

Deephaven has two main ways to ingest data: batch or streaming. Batch, or static, data is imported in relatively large chunks, usually on some schedule. Streaming, or "ticking", data arrives continuously as events occur.

Importing batch data into Deephaven is fairly easy, as Deephaven provides built-in tools that simplify most of the required tasks. Setting up a new streaming data source is more complicated: while Deephaven provides tools to help with this, there are more tasks involved, and a data-source-specific executable is needed to feed new events to Deephaven processes. (See Streaming Data below for further information.)

There are multiple options for configuring Deephaven to import batch data from a new source. All of the details of these options are explained in this documentation. However, all batch imports follow these steps:

  1. Create a schema for the new table, either via the Schema Editor UI or by directly creating XML files, and then deploy the schema.
  2. Create and schedule tasks to load data into the new table.

The following are the typical steps in adding a streaming data source to Deephaven (assuming you use the standard binary log format and location):

  1. Create and deploy a schema for the table with a section defining the Logger and the Listener.
  2. Generate the logger and listener classes.
  3. Create a logger application that will use the generated logger to stream data to Deephaven.
  4. Restart the necessary processes to pick up the new schema and configuration information.

Generally speaking, importing either batch or streaming data will also require tasks to merge and validate newly loaded (intraday) data.

Deephaven Tables & Schemas

Every table in Deephaven must have a unique name within a namespace. Each table is either a System table or a User table. System tables are defined by XML schema files, which contain detailed information about each column, any import sources, and the storage strategy. User tables are created directly from raw data and have no externally defined schema.

This documentation is primarily concerned with System tables, which are the mechanism by which data may be reliably imported, validated and shared system-wide. Depending on the type and layout of the source data, a suitable schema may only need the correct column names and data types, or it may also need additional import instructions that govern how data will be transformed and otherwise handled during the import process.

Whether a table receives data from batch imports, streaming sources (via a Logger), or both, the basic requirements of the schema are the same. At a minimum, a schema must contain the column names and data types, plus some basic information about how the table's data will be stored. These requirements are detailed in the Schema documentation.
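
For illustration, a minimal schema for a date-partitioned table might look like the sketch below. The namespace, table, and column names are hypothetical, and the full set of available elements and attributes is described in the Schema documentation.

  <Table name="Orders" namespace="ExampleNamespace" storageType="NestedPartitionedOnDisk">
    <Partitions keyFormula="__PARTITION_AUTOBALANCE_SINGLE__" />
    <Column name="Date" dataType="String" columnType="Partitioning" />
    <Column name="Timestamp" dataType="DateTime" />
    <Column name="Symbol" dataType="String" columnType="Grouping" />
    <Column name="Quantity" dataType="int" />
    <Column name="Price" dataType="double" />
  </Table>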

Intraday and Historical Data

There are two main categories of data in Deephaven: intraday and historical. Intraday data is appended to a table and stored in the order in which it was received. Historical data is partitioned, usually by date, into sets that will no longer have new data appended; because it can be organized by criteria other than arrival time, it can be accessed more efficiently. The merge process reads intraday data, reorganizes it, and writes it out as historical data. All new data coming into Deephaven is first written as intraday data. The typical flow is to import or stream new data into intraday and then, usually nightly, merge it to historical. Once the merged historical data has been validated, the intraday copies can be removed.

Deephaven is designed so each enterprise can customize its own installation, including the data being imported and the authorization required to access that data. The following diagram is a very generalized view of the data import process:

Although merging data is recommended, it is not required. Merged data can be more efficient to store and is faster to retrieve for queries; however, it is still possible to use older data that has been left unmerged in intraday storage.

Note: Imported data is generally available immediately when using intraday queries (db.i).
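
For example, in a Deephaven console or RunAndDone script (where the db variable is provided), today's intraday data can be read with db.i, while merged historical data is read with db.t. The namespace and table names below are hypothetical:

  // Intraday: today's data, available as soon as it is imported or streamed.
  Table liveOrders = db.i("ExampleNamespace", "Orders").where("Date=currentDateNy()");

  // Historical: a previously merged date.
  Table pastOrders = db.t("ExampleNamespace", "Orders").where("Date=`2020-06-01`");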

Importing Batch Data

Data Types and Sources

Deephaven currently provides built-in support for importing the following data types and sources: CSV, JDBC, XML, JSON, and Deephaven binary log files.

Additionally, data can be imported from various commercial data sources.

Import Mapping

In the simplest cases, there will be a one-to-one match of columns from a data import source to the Deephaven table into which data is being imported, and the data values will all "fit" directly into the table columns with no conversion or translation needed. In other cases, things may not be so simple.

When source and destination column names do not match, or data needs to be manipulated during import, the ImportSource section of the schema provides instructions that the importer uses to drive the import process.

Schema files created with the Schema Editor from sample data will have an ImportSource block defined, and any needed transformation rules may be created automatically. Details of the syntax and of the mapping and transformation capabilities available during import can be found in the Schema documentation.
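
As a sketch, an ImportSource block that maps differently named source columns onto table columns might look like the following (placed inside the Table element of the schema). The column and source names here are hypothetical, and the full set of supported attributes, such as formulas, transforms, and default values, is covered in the Schema documentation.

  <ImportSource name="IrisCSV" type="CSV">
    <ImportColumn name="Symbol" sourceName="ticker" />
    <ImportColumn name="Quantity" sourceName="qty" sourceType="String" default="0" />
  </ImportSource>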

Batch Import Processes

Deephaven provides batch data import tools that can read data from CSV, JDBC, XML, JSON, and binary log file sources and write it to intraday tables. These tools can be used from the command line, or through scripts or import queries in the Deephaven UI. These methods are designed for any size of data and for standardized importing of data shared by multiple users. For temporary import of a single (typically small) dataset, see Importing Data without Schemas.

The preferred method of importing batch data is to use persistent queries through the Deephaven console. In addition to providing a graphical user interface for creating and editing import tasks, the query-based import methods include task scheduling, dependency chaining, and a single location to view the history and status of imports that have been run. A persistent query import can be created by clicking the New button on the right side of the Deephaven interface, which opens the Persistent Query Configuration Editor. After selecting one of the Import configuration types, such as Import - CSV or Import - XML, the Settings tab will present options appropriate to the chosen type.

If you wish to control the scheduling of imports or customize them beyond what is possible in persistent import queries, you may run them either via the command line or via a "Batch Query (RunAndDone)" script. The latter may be executed via a persistent query or via a command-line "run_local_script" command.
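
As an illustrative sketch only, a command-line CSV import might look something like the following; the exact script name and flags vary by version and are documented in the CSV import chapter, and the namespace, table name, and paths shown here are hypothetical.

  /usr/illumon/latest/bin/iris_exec csv_import -- \
    -ns ExampleNamespace \
    -tn Orders \
    -sf /path/to/orders-2020-06-01.csv \
    -dp localhost/2020-06-01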

For more details on importing via any of these methods, see the chapter addressing the specific type of data you wish to import.

Streaming Data

Streaming data is appended to an intraday table row by row as it arrives, and is immediately delivered to users via intraday queries (db.i). There is an expectation that the quality of this newest data may be lower than that of historical data, but low-latency access to information is generally more important for ticking data sources.

One important difference between batch and streaming data is that there are no general formats for streaming data. CSV, JDBC, JSON, and others are well-known standards for describing and packaging sets of batch data, but because ticking data formats have no such accepted standards, there is almost always some custom work involved in adding a ticking data source to Deephaven.

Adding a new streaming source to Deephaven requires creating a "Binary Logger". The logger is a program responsible for writing row-oriented binary logs in a format that Deephaven can read; these binary logs serve as an intermediate format that can be replayed into the database in a recovery scenario. Logger classes are typically generated directly from the target table schema in Java or C# (logging with C++ classes is also possible), and are then used in customer code; a logger can be written in any language, but must always write in the format defined by the table schema. The binary logs are read by a Deephaven "tailer" process and sent to a Data Import Server, which writes them to intraday tables using a "Listener" that is also generated from the table schema.
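
As a rough Java sketch, customer code that uses a generated logger might look like the following. The OrdersLogger class, its init and log(...) signatures, and the log directory are hypothetical placeholders; the real class is generated from the table schema, and its setup is described in the Streaming Data documentation.

  // Hypothetical sketch: OrdersLogger stands in for a logger class generated from the
  // ExampleNamespace.Orders schema, so its methods and arguments are placeholders.
  public class OrdersLoggerApp {
      public static void main(String[] args) throws Exception {
          OrdersLogger logger = new OrdersLogger();                     // generated from the table schema
          logger.init("/var/log/deephaven/binlogs");                    // write binary logs where the tailer is watching (assumed path)
          logger.log(System.currentTimeMillis(), "AAPL", 100, 192.10);  // one call per event/row, matching the schema's columns
          logger.shutdown();                                            // flush and close so remaining rows are shipped
      }
  }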

See Streaming Data for details on this process.

Merging and Validating Data

All new data added to a Deephaven installation is first stored in the Intraday database. Typically this data is periodically reorganized and merged into the Historical database, where it is stored for long-term use. The easiest way to accomplish a merge is via the "Data Merge" persistent query type. These merge queries take advantage of the same scheduling, execution history, and dependency chaining as other persistent queries.

The data merge step by itself only handles reorganizing the data and copying it from intraday to historical. It does not validate or remove the intraday version of the partition. Validation of merged data, which may optionally include removal of the source intraday data, can be accomplished via a "Data Validation" persistent query, which is dependent on the success of a merge query. As with other import-related tasks, validation may also be done via the command line or manual scripting.

A typical data "lifecycle" consists of some form of data ingestion into intraday, followed by a merge and then validation. After successful validation, the intraday data is deleted, including the directory that contained the intraday partition.

See Merging for further information.

