Batch Data Introduction
Deephaven offers specialized utilities for importing a number of common data formats, including CSV, XML, and JSON. Traditional relational data sources can also be imported directly via the JDBC importer. Other data sources must either be transformed into a known format or handled by implementing a Custom Importer.
These importers have many similar aspects, so we'll discuss them collectively here and then list the specific requirements of each as needed. Although readCsv is available for simple imports of CSV data into memory tables in a Deephaven console (a minimal sketch follows the list below), these importer classes should be used when you are importing from a non-CSV source or need finer-grained control, such as:
- control of the output data type of a column (e.g., values such as order IDs that look like numbers but should be read as text)
- auto-mapping or custom mapping of source column name to target column name
- the ability to handle invalid rows (such as extraneous header rows)
- automatic processing of all files in a directory that share a given name prefix, in order
- the ability to import data from different sources and formats into a single table
- the ability to use custom Java code to transform source data during import
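For comparison, the simple console path looks like this. This is a minimal sketch assuming the legacy TableTools.readCsv overload that takes a file path; the class names follow the com.illumon.iris packages used elsewhere on this page, and the file path is hypothetical.

import com.illumon.iris.db.tables.Table;
import com.illumon.iris.db.tables.utils.TableTools;

// Read a CSV straight into an in-memory table. Column types are inferred
// from the data - exactly the behavior the importer classes let you override.
Table names = TableTools.readCsv("/data/sample/csv_names.csv");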
Schema Requirements and Options
To import data into a system table using any import source, you must first create a table in Deephaven with a schema definition that either matches the source structure, or includes additional metadata to describe how to map columns from the source to the target table. Schemas can be created by hand or through a "schema inference" tool provided by Deephaven.
A simple schema file generated from a CSV data sample might look like the following:
<Table name="CsvNames" namespace="Import" storageType="NestedPartitionedOnDisk">
<Partitions keyFormula="__PARTITION_AUTOBALANCE_SINGLE__"/>
<Column name="Date" intradayType="none" dataType="String" columnType="Partitioning"/>
<Column name="FirstName" dataType="String" columnType="Normal" />
<Column name="LastName" dataType="String" columnType="Grouping" />
<Column name="Gender" dataType="String" columnType="Normal" />
</Table>
Note: columnType="Normal"
is the default setting and can be omitted.
The storageType listed in the schema for CSV, JDBC, JSON, and XML table imports must be NestedPartitionedOnDisk. In addition, the schema must include exactly one string column defined as the partitioning column. This column is used to partition separate import events of data to the table. If an import would replace an existing partition (i.e., there is already data in the table with this partitioning value), the importer will, by default, abort the import. However, optional import instructions can direct the importer to append to or replace existing data rather than aborting.
Once defined, the schema(s) must be deployed before they can be used. See Deploying Schemas to Deephaven.
Because the schema example noted above has no additional metadata to direct import activities, it expects a source with FirstName, LastName, and Gender columns. Matching of column names is case- and whitespace-sensitive. If a single-partition import is run, the values for the Date column (the partitioning column in this case) are provided as one of the arguments to the importer, not taken from the source data. For multi-partition imports, data from the source can be used to indicate how blocks of data should be partitioned during import.
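For example, a single-partition CSV import into the table above might be launched from the command line as follows. This invocation is a sketch: the csv_import flag names shown here (-ns for namespace, -tn for table name, -sf for source file, -dp for destination partition) are assumptions to be checked against the importer documentation for your version, and the paths are hypothetical.

iris_exec csv_import -- -ns Import -tn CsvNames -sf /data/sample/csv_names.csv -dp localhost/2021-02-16

Here, 2021-02-16 becomes the Date partitioning value for every imported row; it does not come from the CSV itself.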
The following is an example of a schema file that was created using the schema creator to inspect an XML file. There are two main sections to this schema: the ImportSource block near the top, and the Column details towards the bottom. The Column details define the properties of each column in the table, while the ImportSource section provides instructions related to importing data from a particular format of source - in this case, an XML file.
<Table name="OrderInfo" namespace="Sample" storageType="NestedPartitionedOnDisk">
<ImportSource name="IrisXML" type="XML">
<ImportColumn name="source_Date" sourceName="Date" sourceType="String" formula="source_Date.isEmpty() ? null :DBTimeUtils.convertDateTime(source_Date.replace(" ","T").replace("Z"," UTC"))" />
<ImportColumn name="Max_value" sourceName="Max-value" />
<ImportColumn name="Product_code" sourceName="Product-code" />
<ImportColumn name="BackOrder" default="null" />
<ImportColumn name="Macro_value" sourceName="Macro-value" />
</ImportSource>
<!-- Directives for overnight merge operations. Everything in one storage partition daily, chosen round-robin. -->
<Partitions keyFormula="__PARTITION_AUTOBALANCE_SINGLE__" multiday="0,1,2,3,4,5" />
<Column name="Date" dataType="java.lang.String" columnType="Partitioning" />
<Column name="source_Date" dataType="com.illumon.iris.db.tables.utils.DBDateTime" columnType="Normal" />
<Column name="Max_value" dataType="double" columnType="Normal" />
<Column name="Product_code" dataType="java.lang.String" columnType="Grouping" />
<Column name="Quantity" dataType="long" columnType="Normal" />
<Column name="BackOrder" dataType="long" columnType="Normal" />
<Column name="Warranty" dataType="char" columnType="Normal" />
<Column name="Comments" dataType="java.lang.String" columnType="Normal" />
<Column name="Macro_value" dataType="java.lang.String" columnType="Normal" />
</Table>
The schema generators "legalize" column names when reading details of a data set. If a source column name is valid for a column name in Deephaven, and is not in conflict with the partitioning column name, it will be used directly as the column name in the schema. If there is some problem or conflict, the schema generator will modify the name to make it valid, and will also add mapping instructions for the importer to use later. This can be seen in several columns of the preceding sample schema. For example, the column name Max-value in the source file is not valid for a column name (hyphens are not allowed). Therefore, the schema generator renamed Max-value
to Max_value
, and added an ImportColumn
entry with sourceName="Max-value"
to map Max-value
from the source file to Max_value
in the Deephaven table.
Besides mapping different source and target names, ImportColumn entries are used by the schema creators to add default values for sparse primitive columns (e.g., numeric columns that are sometimes empty) and to add formulas needed to parse or convert source data to Deephaven types.
The BackOrder column is an example where the source data was sometimes empty, so the schema creator added a default of "null". Without the default, the import would fail when the importer attempted to parse an empty value into a long.
The source_Date column provides an example of two types of ImportColumn operations: it is renamed, from Date in the source file to source_Date, and it has a formula that reformats its string representation of date/time data into a form that can be passed to convertDateTime to convert it to Deephaven's DBDateTime type. Note that the type of the source_Date column is DBDateTime, to match the output of convertDateTime.
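The formula's effect can be sketched as plain Java. The sample value below is hypothetical; DBTimeUtils and DBDateTime are the classes referenced in the schema above.

import com.illumon.iris.db.tables.utils.DBDateTime;
import com.illumon.iris.db.tables.utils.DBTimeUtils;

// "2020-09-28 14:30:00Z" -> "2020-09-28T14:30:00 UTC" -> DBDateTime
String source_Date = "2020-09-28 14:30:00Z";   // hypothetical source value
DBDateTime converted = source_Date.isEmpty()
        ? null
        : DBTimeUtils.convertDateTime(source_Date.replace(" ", "T").replace("Z", " UTC"));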
Caching
The batch data importers provide additional caching to improve efficiency when importing complex data. For columns that have a formula or a transform, or that use a complex type other than String, the importer will by default set up a cache to avoid recalculating or reconverting frequently repeated source values. The cache maintains a map of source values to calculated values, and uses the cached result when a previously seen source value appears again in a later row. This behavior can be controlled through optional importer instructions in the schema.
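Conceptually, the cache behaves like a memoized conversion function, as in the following sketch of the idea (illustrative only, not Deephaven's internal implementation):

import java.util.HashMap;
import java.util.Map;
import com.illumon.iris.db.tables.utils.DBDateTime;
import com.illumon.iris.db.tables.utils.DBTimeUtils;

// Each distinct source string is converted once; repeated values hit the map.
Map<String, DBDateTime> cache = new HashMap<>();

DBDateTime convertCached(String source) {
    return cache.computeIfAbsent(source,
            s -> DBTimeUtils.convertDateTime(s.replace(" ", "T").replace("Z", " UTC")));
}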