Appendix B: CSV, JDBC, XML and JSON Data Import Classes
The CSV File Importer Class, the JDBC Importer Class, the XML Importer Class, and the JSON Importer Class are used to import data from CSV files, JDBC data sources, XML files, and JSON files respectively.
These classes have many similar aspects, so we'll discuss them collectively and then list specific individual requirements for each as needed. Although readCsv is available for simple imports of CSV data into memory tables in a Deephaven console, these importer classes should be used when you are importing from a non-CSV source, or need/want something more fine-grained, such as:
- control of the output data type of a column (e.g., a column such as an order ID whose values are text but might look like numbers when read naively)
- auto-mapping or custom mapping of source column name to target column name
- the ability to handle invalid rows (such as extraneous header rows)
- something that automatically uses a directory name and a file prefix, and grabs all the files in order and processes them for you
- the ability to import data from different sources and formats into a single table
- the ability to use custom Java code to transform source data during import
Quickstart
The importers may easily be invoked via the command line using iris_exec, as follows. Run the importer without arguments for a help page describing the arguments (e.g., iris_exec csv_import). The arguments are also described in detail in this document.
iris_exec csv_import|jdbc_import|xml_import|json_import [<launch args>] -- <import args>
Schema Requirements and Options
To import data using the CsvFileImporter, JdbcImporter, and XmlFileImporter classes, you must first create a table in Deephaven with a schema definition that either matches the source structure, or includes additional metadata to describe how to map columns from the source to the target table. When importing CSV files, the first row of each file must contain the column names (i.e., the table's header). Multiple CSV files can have different column names, and/or different orders to the columns, as long as appropriate columns exist to populate required columns in the target table. XML files may have named values or positional values. If positional values are being used, they must match the order of columns in the table schema. Named values work similarly to the header-provided column names of a CSV file, and the importer will match values by name, even if the position of the value in the file varies.
See also: Schema Creation Tools
A simple schema file might look like this:
<Table name="CsvNames5" namespace="Import" storageType="NestedPartitionedOnDisk">
<Partitions keyFormula="__PARTITION_AUTOBALANCE_SINGLE__"/>
<Column name="Date" intradayType="none" dataType="String" columnType="Partitioning"/>
<Column name="FirstName" dataType="String" columnType="Normal" />
<Column name="LastName" dataType="String" columnType="Grouping" />
<Column name="Gender" dataType="String" columnType="Normal" />
</Table>
The storageType listed in the schema for CSV, JDBC, and XML table imports must be NestedPartitionedOnDisk. In addition, the schema must include one (and only one) string column defined as the partitioning column. This column will be used to partition separate import events of data to the table. If an import is started that would replace an existing partition (i.e., there is already data in the table with this partitioning value), the importer will, by default, abort the import. However, optional import instructions can direct the importer to append to or replace existing data rather than aborting.
Once defined, the schema can be used to create the table by running the DatabaseTools class or the deploy_schema script to read and process the schema file(s).
Because the schema example noted above has no additional metadata to direct import activities, it expects a source with FirstName, LastName, and Gender columns. Matching of column names is case- and whitespace-sensitive. The values for the Date column (the partitioning column in this case) will be provided as one of the arguments to the importer if a single-partition import is run, and not from data in the source. For multi-partition imports, data from the source can be used to indicate how blocks of data should be partitioned during import.
Caching
The CSV and XML importers provide some additional caching to improve efficiency when importing complex data. For columns that have a formula, or a transform, or that use a complex type other than String, the importer will automatically and by default set up a cache to save recalculating or reconverting frequently repeated source values. The caching maintains a map of source values to calculated values, and will use a cached result when a previously seen source value is seen again for a later row. This behavior can be controlled through optional importer instructions in the schema. See details in the Entry Type/Attribute table in ImportSource XML.
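The following Java sketch illustrates the idea of this per-column cache. It is illustrative only, not the importer's actual code, and the class and method names are hypothetical:

import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical illustration of the importer's per-column caching concept:
// a map from previously seen source values to their computed results.
public class ColumnConversionCache<S, T> {
    private final Map<S, T> cache = new HashMap<>();
    private final Function<S, T> convert; // the column's formula, transform, or type conversion

    public ColumnConversionCache(Function<S, T> convert) {
        this.convert = convert;
    }

    public T lookup(S sourceValue) {
        // Recompute only for source values not seen on earlier rows;
        // repeated values reuse the cached result.
        return cache.computeIfAbsent(sourceValue, convert);
    }
}

For example, a cache wrapping an expensive string-to-long conversion could be constructed as new ColumnConversionCache<String, Long>(Long::parseLong).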
Specific Requirements for CsvFileImporter
The CsvFileImporter class (com.illumon.iris.importers.CsvFileImporter) takes the following arguments:
-dd or --destinationDirectory <path>
-dp or --destinationPartition <internal partition name / partitioning value> | <internal partition name>
-pc or --intradayPartition <partition column name>
-ns or --namespace <namespace>
-tn or --tableName <name>
-fd or --delimiter <delimiter character>
-ff or --fileFormat <format name>
-om or --outputMode <import behavior>
-rc or --relaxedChecking <TRUE or FALSE>
-sd or --sourceDirectory <path>
-sf or --sourceFile <exact file name>
-sg or --sourceGlob <file name pattern>
-sn or --sourceName <ImportSource name>
-cv or --constantColumnValue <constant column value>
| Parameter | Description |
|---|---|
| -dd, -dp, -pc | Either a destination directory, specific partition, or internal partition plus a partition column must be provided. A directory can be used to write a new set of table files to a specific location on disk, where they can later be read with TableTools. A destination partition is used to write to intraday locations for existing tables. The internal partition value is used to separate data on disk; it does not need to be unique for a table. The name of the import server is a common value for this. The partitioning value is a string data value used to populate the partitioning column in the table during the import. This value must be unique within the table. In summary, there are three ways to specify destination table partition(s): a destination directory, an internal partition plus a partitioning value, or an internal partition plus a partition column. |
| -fd | Field delimiter (optional): Allows specification of a character other than the file format default as the field delimiter. If a delimiter is specified, it overrides the default delimiter of the selected file format. |
| -sd, -sf, -sg | sourceDirectory, sourceFile, and sourceGlob are all optional. If none of these are provided, the system will attempt to do a multi-file import (see below). Otherwise, sourceDirectory will be used in conjunction with sourceFile or sourceGlob. If sourceDirectory is not provided, but sourceFile is, then sourceFile will be used as a fully qualified file name. If sourceDirectory is not provided, but sourceGlob is, then sourceDirectory will default to the configured log file directory from the prop file being used. |
| -cv | A literal value to use for the import column with sourceType="CONSTANT", if the destination schema requires it. |
For the example schema used earlier, one might execute CsvFileImporter with the following program arguments:
-dp Server1/2016-12-07 -tn CsvNames5 -ns Import -sf ./NewData.csv
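Combining these arguments with the iris_exec syntax shown in the Quickstart, the same import could be launched from the command line (launch args omitted):

iris_exec csv_import -- -dp Server1/2016-12-07 -tn CsvNames5 -ns Import -sf ./NewData.csv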
One special case when importing CSV data is columns in the target table with a type of Boolean. By default, the CSV importer will attempt to interpret string data from the source: 0, F, f, or any case of false will be written as a Boolean false; similarly, 1, T, t, or any case of true will be written as a Boolean true. Any other value will result in a conversion failure that may be continuable if, for instance, a default is set for the column. To convert from other representations of true/false, for example from foreign-language strings, a formula or transform would be needed, along with sourceType="String" to ensure the reading of the data from the CSV field is handled correctly.
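As a sketch of the interpretation rules just described (illustrative only; this is not the importer's actual implementation):

public class BooleanInterpretation {
    // Mirrors the rules described above: 0/F/f/false -> false,
    // 1/T/t/true -> true, anything else is a conversion failure.
    static Boolean interpret(String s) {
        if (s.equals("0") || s.equals("F") || s.equals("f") || s.equalsIgnoreCase("false")) {
            return Boolean.FALSE;
        }
        if (s.equals("1") || s.equals("T") || s.equals("t") || s.equalsIgnoreCase("true")) {
            return Boolean.TRUE;
        }
        throw new IllegalArgumentException("Cannot interpret \"" + s + "\" as Boolean");
    }
}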
Specific Requirements for JdbcImporter
The JdbcImporter (com.illumon.iris.importers.JdbcImporter) is similar to the CsvFileImporter, and supports the same syntax and features for automatic column mapping and custom import specification through an ImportSource block. The main difference is that while the CSV importer takes a CSV file as its data source, the JDBC importer pulls data from a SQL query.
The JdbcImporter class takes the following arguments:
-dd or --destinationDirectory <path>
-dp or --destinationPartition <internal partition name / partitioning value>
-pc or --intradayPartition <partition column name>
-ns or --namespace <namespace>
-tn or --tableName <name>
-om or --outputMode <import behavior>
-jd or --jdbcDriver <class name>
-cu or --connectionUrl <JDBC URL>
-jq or --jdbcQuery <query string>
-rc or --relaxedChecking <TRUE or FALSE>
-sn or --sourceName <ImportSource name>
| Parameter | Description |
|---|---|
| -dd, -dp, -pc | Either a destination directory, specific partition, or internal partition plus a partition column must be provided. A directory can be used to write a new set of table files to a specific location on disk, where they can later be read with TableTools. A destination partition is used to write to intraday locations for existing tables. The internal partition value is used to separate data on disk; it does not need to be unique for a table. The name of the import server is a common value for this. The partitioning value is a string data value used to populate the partitioning column in the table during the import. This value must be unique within the table. In summary, there are three ways to specify destination table partition(s): a destination directory, an internal partition plus a partitioning value, or an internal partition plus a partition column. |
The schema of the target table will be used to select default setters for transformation of incoming data if there are no ImportColumns to provide other instructions. For example, if the source column is a string, but the Deephaven table column is designated as integer, the importer will attempt to cast the data as integer.
The driver will need to be available from the classpath of the importer. Examples of driver classes are:
- com.microsoft.sqlserver.jdbc.SQLServerDriver for Microsoft SQL Server
- com.mysql.jdbc.Driver for MySQL
The connectionUrl will include things like the server name, port, authentication details, etc., which will vary depending on the type of data source being used. An example of a SQL Server URL for a local, default instance, using integrated Windows security, might be something like the following:
jdbc:sqlserver://localhost:1433;databaseName=Sample;integratedSecurity=true
Other sources may be a bit different. For example, a sample MySQL connectionUrl follows:
jdbc:mysql://localhost:3306/Sample?user=root&password=secret
The SQL query format will also vary depending on specific language capabilities and expectations of the data source. However, the resultant columns should match the column names in the target table. Most SQL dialects will allow columns to be calculated or labeled as part of the query, and the importer will match columns based on the resultant labels (see the sketch below). Rather than just directly matching result set column names to target table column names, another option is to add transformation metadata to the Deephaven schema. This allows the user to specify options like a different source column name, a different source column data type, whether a column can be null, and even custom transform functions to use during the import. See the general importer topic for a sample of schema for an import target table.
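To make the labeling concrete, here is a minimal, self-contained JDBC sketch showing how a connection URL pairs with a query whose AS labels match target table column names. It is illustrative only (not the importer's internal code), and the employees table and emp_* columns are hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class JdbcLabelExample {
    public static void main(String[] args) throws SQLException {
        // Hypothetical MySQL URL in the same form as the sample above.
        String url = "jdbc:mysql://localhost:3306/Sample?user=root&password=secret";
        // The AS labels are what get matched against target table column names.
        String query = "SELECT emp_first AS FirstName, emp_last AS LastName, emp_gender AS Gender"
                + " FROM employees";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(query)) {
            while (rs.next()) {
                // Result set columns are addressable by their labels.
                System.out.println(rs.getString("FirstName") + " " + rs.getString("LastName"));
            }
        }
    }
}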
In addition to the column translation features of the CSV importer, the JDBC importer will perform special handling when the source column is a date type (JDBC DateTime, Date, or Timestamp) and the destination is a DBDateTime. This handling will attempt a direct conversion between date values. (Note that DBDateTime is restricted to dates in the range of 09/25/1677 to 04/11/2262.)
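That window follows from representing an instant as nanoseconds since the epoch in a signed 64-bit long, which can hold roughly 292 years on either side of 1970. A sketch of such a conversion with overflow detection, assuming that representation (illustrative only, not the importer's code):

import java.sql.Timestamp;

public class TimestampNanosExample {
    // Converts a JDBC Timestamp to nanoseconds since the epoch, throwing on
    // overflow -- i.e., for instants outside the ~1677 to ~2262 window that a
    // signed 64-bit nanosecond count can represent.
    static long toNanosSinceEpoch(Timestamp ts) {
        long seconds = Math.floorDiv(ts.getTime(), 1000L); // whole seconds since epoch
        return Math.addExact(Math.multiplyExact(seconds, 1_000_000_000L), (long) ts.getNanos());
    }

    public static void main(String[] args) {
        System.out.println(toNanosSinceEpoch(Timestamp.valueOf("2016-12-07 00:00:00")));
    }
}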
Specific Requirements for XmlFileImporter
The XmlFileImporter class (com.illumon.iris.importers.XmlFileImporter) takes the following arguments:
-dd or --destinationDirectory <path>
-dp or --destinationPartition <internal partition name / partitioning value> | <internal partition name>
-pc or --intradayPartition <partition column name>
-ns or --namespace <namespace>
-tn or --tableName <name>
-fd or --fieldDelimiter
-xi or --startIndex
-xd or --startDepth
-xm or --maxDepth
-xt or --elementType
-ev or --useElementValues
-av or --useAttributeValues
-pv or --namedValues
-om or --outputMode <import behavior>
-rc or --relaxedChecking <TRUE or FALSE>
-sd or --sourceDirectory <path>
-sf or --sourceFile <exact file name>
-sg or --sourceGlob <file name pattern>
-sn or --sourceName <ImportSource name>
| Parameter | Description |
|---|---|
| -dd, -dp, -pc | Either a destination directory, specific partition, or internal partition plus a partition column must be provided. A directory can be used to write a new set of table files to a specific location on disk, where they can later be read with TableTools. A destination partition is used to write to intraday locations for existing tables. The internal partition value is used to separate data on disk; it does not need to be unique for a table. The name of the import server is a common value for this. The partitioning value is a string data value used to populate the partitioning column in the table during the import. This value must be unique within the table. In summary, there are three ways to specify destination table partition(s): a destination directory, an internal partition plus a partitioning value, or an internal partition plus a partition column. |
| -fd | Field delimiter (optional): Allows specification of a character to be used when parsing string representations of long or double arrays. |
| -xi | Start Index: Starting from the root of the document, the index (1 being the first top-level element in the document after the root) of the element under which data can be found. |
| -xd | Start Depth: Under the element indicated by Start Index, how many levels of first children to traverse to find an element that contains data to import. |
| -xm | Max Depth: Starting from Start Depth, how many levels of element paths to traverse and concatenate to provide a list that can be selected under Element Name. |
| -xt | Element Type: The name or path of the element that will contain data elements. |
| -ev | Element Values: Indicates that field values will be taken from element values, e.g., <Price>10.25</Price>. |
| -av | Attribute Values: Indicates that field values will be taken from attribute values, e.g., <Record ID="XYZ" Price="10.25" />. |
| -pv | Positional Values: When omitted, field values within the document will be named; e.g., a value called Price might be contained in an element named Price, or an attribute named Price. When this option is included, field names (column names) will be taken from the table schema, and the data values will be parsed into them by matching the position of the value with the position of the column in the schema. |
| -sd, -sf, -sg | sourceDirectory, sourceFile, and sourceGlob are all optional. If none of these are provided, the system will attempt to do a multi-file import (see below). Otherwise, sourceDirectory will be used in conjunction with sourceFile or sourceGlob. If sourceDirectory is not provided, but sourceFile is, then sourceFile will be used as a fully qualified file name. If sourceDirectory is not provided, but sourceGlob is, then sourceDirectory will default to the configured log file directory from the prop file being used. |
An example execution of the XmlFileImporter might include program arguments like this:
-dp Server1/2016-12-07 -tn XmlNames5 -ns Import -sf ./NewData.xml -xi 1 -xd 1 -xm 2 -xt DataRecord -ev
One special case when importing XML data is columns in the target table with a type of Boolean. By default, the XML importer will attempt to interpret string data from the source: 0, F, f, or any case of false will be written as a Boolean false; similarly, 1, T, t, or any case of true will be written as a Boolean true. Any other value will result in a conversion failure that may be continuable if, for instance, a default is set for the column. To convert from other representations of true/false, for example from foreign-language strings, a formula or transform would be needed (as in the CSV sketch above), along with sourceType="String" to ensure the reading of the data from the XML value is handled correctly.
Specific Requirements for JsonFileImporter
The input file should contain plain JSON objects, not comma-separated objects or an array. For example, the following is a legal file with three objects, which will result in three rows: {a:1} {a:2} {a:3}
You may have multiple JSON objects on a single line, and the JSON objects may span multiple lines.
Nested objects are recursively expanded, with the column name for each field derived from the full "path" to that field. The path separator can be specified with the -fps argument. Array elements are numbered with a zero-based index. For example, with the path separator "_", the following maps to a schema with the columns "a", "b_c", "b_d", "e0", "e1", and "e2", and a row with the values 1, 2, 3, 4, 5, and 6:
{ "a":1, "b":{ "c":2, "d":3 }, "e":[4,5,6] }
The JsonFileImporter class (com.illumon.iris.importers.JsonFileImporter) takes the following arguments:
-dd or --destinationDirectory <path>
-dp or --destinationPartition <internal partition name / partitioning value> | <internal partition name>
-pc or --intradayPartition <partition column name>
-ns or --namespace <namespace>
-tn or --tableName <name>
-om or --outputMode <import behavior>
-rc or --relaxedChecking <TRUE or FALSE>
-sd or --sourceDirectory <path>
-sf or --sourceFile <exact file name>
-sg or --sourceGlob <file name pattern>
-sn or --sourceName <ImportSource name>
-fps or --filePathSeparator <file path separator>
-mi or --maxInferItems <max infer items>
-cv or --constantColumnValue <constant column value>
| Parameter | Description |
|---|---|
| -dd, -dp, -pc | Either a destination directory, specific partition, or internal partition plus a partition column must be provided. A directory can be used to write a new set of table files to a specific location on disk, where they can later be read with TableTools. A destination partition is used to write to intraday locations for existing tables. The internal partition value is used to separate data on disk; it does not need to be unique for a table. The name of the import server is a common value for this. The partitioning value is a string data value used to populate the partitioning column in the table during the import. This value must be unique within the table. In summary, there are three ways to specify destination table partition(s): a destination directory, an internal partition plus a partitioning value, or an internal partition plus a partition column. |
| -ns | Namespace (required): Namespace in which to find the target table. |
| -mi | Specifies how many JSON items to examine in the source file(s) prior to import, in order to infer the "columns" which exist in the source data, and validate against the destination table. This inference step is necessary because JSON permits missing values. By default, the importer will read all items. For large files, it may improve performance to use a smaller value, if you know your source data does not contain missing values. |
| -cv | A literal value to use for the import column with sourceType="CONSTANT", if the destination schema requires it. |