Batch Data Introduction
Deephaven offers specialized utilities for importing a number of common data formats, including CSV, XML, and JSON. Traditional relational data sources can also be imported directly via the JDBC importer. Other data sources must either be transformed into a known format or handled by implementing a Custom Importer.
These importers have many similar aspects, so we'll discuss them collectively here and then list the specific requirements of each as needed. Although readCsv is available for simple imports of CSV data into memory tables in a Deephaven console (a minimal sketch follows the list below), these importer classes should be used when you are importing from a non-CSV source or need finer-grained control, such as:
- control of the output data type of a column (e.g., values such as order IDs that look like numbers but should be read as text)
- auto-mapping or custom mapping of source column name to target column name
- the ability to handle invalid rows (such as extraneous header rows)
- automatic processing of all files in a directory that share a given name prefix, in order
- the ability to import data from different sources and formats into a single table
- the ability to use custom Java code to transform source data during import
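For comparison, the simple console path looks like this. This is a minimal sketch assuming the legacy TableTools.readCsv overload that takes a file path; the class names follow the com.illumon.iris packages used elsewhere on this page, and the file path is hypothetical.

import com.illumon.iris.db.tables.Table;
import com.illumon.iris.db.tables.utils.TableTools;

// Read a CSV straight into an in-memory table. Column types are inferred
// from the data - exactly the behavior the importer classes let you override.
Table names = TableTools.readCsv("/data/sample/csv_names.csv");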
Schema Requirements and Options
To import data into a system table using any import source, you must first create a table in Deephaven with a schema definition that either matches the source structure, or includes additional metadata to describe how to map columns from the source to the target table. Schemas can be created by hand or through a "schema inference" tool provided by Deephaven.
A simple schema file generated from a CSV data sample might look like the following:
<Table name="CsvNames" namespace="Import" storageType="NestedPartitionedOnDisk">
<Partitions keyFormula="__PARTITION_AUTOBALANCE_SINGLE__"/>
<Column name="Date" intradayType="none" dataType="String" columnType="Partitioning"/>
<Column name="FirstName" dataType="String" columnType="Normal" />
<Column name="LastName" dataType="String" columnType="Grouping" />
<Column name="Gender" dataType="String" columnType="Normal" />
</Table>
Note: columnType="Normal"
is the default setting and can be omitted.
The storageType listed in the schema for CSV, JDBC, JSON, and XML table imports must be NestedPartitionedOnDisk. In addition, the schema must include exactly one string column defined as the partitioning column. This column is used to partition separate import events of data to the table. If an import would replace an existing partition (i.e., there is already data in the table with this partitioning value), the importer will, by default, abort the import. However, optional import instructions can direct the importer to append to or replace existing data rather than aborting.
Once defined, the schema(s) must be deployed before they can be used. See Deploying Schemas to Deephaven.
Because the schema example noted above has no additional metadata to direct import activities, it expects a source with FirstName, LastName, and Gender columns. Matching of column names is case- and whitespace-sensitive. If a single-partition import is run, the values for the Date column (the partitioning column in this case) are provided as one of the arguments to the importer, not taken from the source data. For multi-partition imports, data from the source can be used to indicate how blocks of data should be partitioned during import.
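For example, a single-partition CSV import into the table above might be launched from the command line as follows. This invocation is a sketch: the csv_import flag names shown here (-ns for namespace, -tn for table name, -sf for source file, -dp for destination partition) are assumptions to be checked against the importer documentation for your version, and the paths are hypothetical.

iris_exec csv_import -- -ns Import -tn CsvNames -sf /data/sample/csv_names.csv -dp localhost/2021-02-16

Here, 2021-02-16 becomes the Date partitioning value for every imported row; it does not come from the CSV itself.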
The following is an example of a schema file that was created using the schema creator to inspect an XML file. There are two main sections to this schema: the ImportSource block near the top, and the Column details towards the bottom. The Column details define the properties of each column in the table, while the ImportSource section provides instructions related to importing data from a particular format of source - in this case, an XML file.
<Table name="OrderInfo" namespace="Sample" storageType="NestedPartitionedOnDisk">
<ImportSource name="IrisXML" type="XML">
<ImportColumn name="source_Date" sourceName="Date" sourceType="String" formula="source_Date.isEmpty() ? null :DBTimeUtils.convertDateTime(source_Date.replace(" ","T").replace("Z"," UTC"))" />
<ImportColumn name="Max_value" sourceName="Max-value" />
<ImportColumn name="Product_code" sourceName="Product-code" />
<ImportColumn name="BackOrder" default="null" />
<ImportColumn name="Macro_value" sourceName="Macro-value" />
</ImportSource>
<!-- Directives for overnight merge operations. Everything in one storage partition daily, chosen round-robin. -->
<Partitions keyFormula="__PARTITION_AUTOBALANCE_SINGLE__" multiday="0,1,2,3,4,5" />
<Column name="Date" dataType="java.lang.String" columnType="Partitioning" />
<Column name="source_Date" dataType="com.illumon.iris.db.tables.utils.DBDateTime" columnType="Normal" />
<Column name="Max_value" dataType="double" columnType="Normal" />
<Column name="Product_code" dataType="java.lang.String" columnType="Grouping" />
<Column name="Quantity" dataType="long" columnType="Normal" />
<Column name="BackOrder" dataType="long" columnType="Normal" />
<Column name="Warranty" dataType="char" columnType="Normal" />
<Column name="Comments" dataType="java.lang.String" columnType="Normal" />
<Column name="Macro_value" dataType="java.lang.String" columnType="Normal" />
</Table>
The schema generators "legalize" column names when reading details of a data set. If a source column name is valid for a column name in Deephaven, and is not in conflict with the partitioning column name, it will be used directly as the column name in the schema. If there is some problem or conflict, the schema generator will modify the name to make it valid, and will also add mapping instructions for the importer to use later. This can be seen in several columns of the preceding sample schema. For example, the column name Max-value in the source file is not valid for a column name (hyphens are not allowed). Therefore, the schema generator renamed Max-value
to Max_value
, and added an ImportColumn
entry with sourceName="Max-value"
to map Max-value
from the source file to Max_value
in the Deephaven table.
Besides mapping different source and target names, ImportColumn entries are used by the schema creators to add default values for sparse primitive columns (e.g., numeric columns that are sometimes empty) and to add formulas needed to parse or convert source data to Deephaven types.
The BackOrder column is an example where the source data was sometimes empty, so the schema creator added a default of "null". Without the default, the import would fail when the importer attempted to parse an empty value into a long.
The source_Date column provides an example of two types of ImportColumn operations: it is renamed, from Date in the source file to source_Date, and it has a formula that reformats its string representation of date/time data into a form that can be passed to convertDateTime to convert it to Deephaven's DBDateTime type. Note that the type of the source_Date column is DBDateTime, to match the output of convertDateTime.
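The formula's effect can be sketched as plain Java. The sample value below is hypothetical; DBTimeUtils and DBDateTime are the classes referenced in the schema above.

import com.illumon.iris.db.tables.utils.DBDateTime;
import com.illumon.iris.db.tables.utils.DBTimeUtils;

// "2020-09-28 14:30:00Z" -> "2020-09-28T14:30:00 UTC" -> DBDateTime
String source_Date = "2020-09-28 14:30:00Z";   // hypothetical source value
DBDateTime converted = source_Date.isEmpty()
        ? null
        : DBTimeUtils.convertDateTime(source_Date.replace(" ", "T").replace("Z", " UTC"));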
Caching
The batch data importers provide additional caching to improve efficiency when importing complex data. For columns that have a formula or a transform, or that use a complex type other than String, the importer will by default set up a cache to avoid recalculating or reconverting frequently repeated source values. The cache maintains a map of source values to calculated values, and uses the cached result when a previously seen source value appears again in a later row. This behavior can be controlled through optional importer instructions in the schema.
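Conceptually, the cache behaves like a memoized conversion function, as in the following sketch of the idea (illustrative only, not Deephaven's internal implementation):

import java.util.HashMap;
import java.util.Map;
import com.illumon.iris.db.tables.utils.DBDateTime;
import com.illumon.iris.db.tables.utils.DBTimeUtils;

// Each distinct source string is converted once; repeated values hit the map.
Map<String, DBDateTime> cache = new HashMap<>();

DBDateTime convertCached(String source) {
    return cache.computeIfAbsent(source,
            s -> DBTimeUtils.convertDateTime(s.replace(" ", "T").replace("Z", " UTC")));
}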