Appendix B: CSV, JDBC, XML and JSON Data Import Classes 

The CSV File Importer Class, the JDBC Importer Class, the XML Importer Class and JSON Importer Class are used to import data from CSV files, JDBC data sources, XML files and JSON files respectively.

These classes have many similar aspects, so we'll discuss them collectively and then list the specific requirements for each as needed. Although readCsv is available for simple imports of CSV data into memory tables in a Deephaven console, these importer classes should be used when you need something more fine-grained, are using a non-CSV source, or need/want any of the following:

  • control of the output data type of a column (e.g., a value you know is text, such as various types of order IDs, might look like a number when read automatically)
  • auto-mapping or custom mapping of source column name to target column name
  • the ability to handle invalid rows (such as extraneous header rows)
  • the ability to point at a directory and a file prefix and have all matching files found, ordered, and processed automatically
  • the ability to import data from different sources and formats into a single table
  • the ability to use custom Java code to transform source data during import

Quickstart

The importers may easily be invoked via the command line using iris_exec, as follows. Run the importer without arguments for a help page describing the arguments (e.g., iris_exec csv_import). The arguments are also described in detail in this document.

iris_exec csv_import|jdbc_import|xml_import|json_import [<launch args>] -- <import args>
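
For example, a single-partition CSV import might be run like this (the namespace, table, partition, and file path are illustrative):

iris_exec csv_import -- -ns Import -tn CsvNames5 -dp localhost/2019-01-15 -sf /data/NewData.csv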

Schema Requirements and Options

To import data using the CsvFileImporter, JdbcImporter, XmlFileImporter, and JsonFileImporter classes, you must first create a table in Deephaven with a schema definition that either matches the source structure, or includes additional metadata to describe how to map columns from the source to the target table. When importing CSV files, the first row of each file must contain the column names (i.e., the table's header). Multiple CSV files can have different column names, and/or different column orders, as long as appropriate columns exist to populate required columns in the target table. XML files may have named values or positional values. If positional values are being used, they must match the order of columns in the table schema. Named values work similarly to the header-provided column names of a CSV file, and the importer will match values by name, even if the position of the value in the file varies.

See also: Schema Creation Tools

A simple schema file might look like this:

<Table name="CsvNames5" namespace="Import" storageType="NestedPartitionedOnDisk">
   <Partitions keyFormula="__PARTITION_AUTOBALANCE_SINGLE__"/>
   <Column name="Date" intradayType="none" dataType="String" columnType="Partitioning"/>
   <Column name="FirstName" dataType="String" columnType="Normal" />
   <Column name="LastName" dataType="String" columnType="Grouping" />
   <Column name="Gender" dataType="String" columnType="Normal" />
</Table>

The storageType listed in the schema for CSV, JDBC, XML, and JSON table imports must be NestedPartitionedOnDisk. In addition, the schema must include one (and only one) String column defined as the partitioning column. This column will be used to partition separate import events of data to the table. If an import is started that would replace an existing partition (i.e., there is already data in the table with this partitioning value), the importer will, by default, abort the import. However, optional import instructions can direct the importer to append to or replace existing data rather than aborting.

Once defined, the schema can be used to create the table by running the DatabaseTools class or the deploy_schema script to read and process the schema file(s).
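
For illustration only, assuming the deploy_schema script accepts the path to the saved schema file (the script's exact location and arguments vary by installation; see the Schema Creation Tools documentation):

deploy_schema /path/to/Import.CsvNames5.schema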

Because the schema example noted above has no additional metadata to direct import activities, it expects a source with FirstName, LastName, and Gender columns. Matching of column names is case-sensitive and whitespace-sensitive. The values for the Date column (the partitioning column in this case) will be provided as one of the arguments to the importer if a single-partition import is run, not from data in the source. For multi-partition imports, data from the source can be used to indicate how blocks of data should be partitioned during import.

Caching

The CSV and XML importers provide some additional caching to improve efficiency when importing complex data. For columns that have a formula, or a transform, or that use a complex type other than String, the importer will automatically and by default set up a cache to save recalculating or reconverting frequently repeated source values. The caching maintains a map of source values to calculated values, and will use a cached result when a previously seen source value is seen again for a later row. This behavior can be controlled through optional importer instructions in the schema. See details in the Entry Type/Attribute table in ImportSource XML.

Specific Requirements for CsvFileImporter

The CsvFileImporter class (com.illumon.iris.importers.CsvFileImporter) main method takes the following arguments:

-dd or --destinationDirectory <path>
-dp or --destinationPartition <internal partition name / partitioning value> | <internal partition name>
-pc or --intradayPartition <partition column name>
-ns or --namespace <namespace>
-tn or --tableName <name>
-fd or --delimiter <delimiter character>
-ff or --fileFormat <format name>
-om or --outputMode <import behavior>
-rc or --relaxedChecking <TRUE or FALSE>
-sd or --sourceDirectory <path>
-sf or --sourceFile <exact file name>
-sg or --sourceGlob <file name pattern>
-sn or --sourceName <ImportSource name>
-cv or --constantColumnValue <constant column value>

Parameter Value

Description

-dd

-dp

-pc

Either a destination directory, specific partition, or internal partition plus a partition column must be provided. A directory can be used to write a new set of table files to a specific location on disk, where they can later be read with TableTools. A destination partition is used to write to intraday locations for existing tables. The internal partition value is used to separate data on disk; it does not need to be unique for a table. The name of the import server is a common value for this. The partitioning value is a string data value used to populate the partitioning column in the table during the import. This value must be unique within the table.

In summary, there are three ways to specify destination table partition(s):

  • Destination directory (e.g., -dd /db/Intraday/<namespace>/<table>/localhost/<date>/<table>/)
  • Internal partition and destination partition (e.g., -dp localhost/2018-01-01)
  • Internal partition and partition column - for multi-partition import (e.g., -dp localhost -pc Date)

-ns

Namespace (required): Namespace in which to find the target table.

-tn

tableName (required): Name of the target table.

-fd

Field delimiter (optional): Allows specification of a character other than the file format default as the field delimiter. If delimiter is specified, fileFormat is ignored. This must be a single character.

-ff

fileFormat (optional): The Apache Commons CSV parser is used to parse the file itself. The following formats are supported (see the example after this list):

  • DEFAULT – default format if none is specified; comma-separated field values, newline row terminators, double-quotes around field values that contain embedded commas, newline characters, or double-quotes.
  • TRIM - Similar to DEFAULT, but will trim all white space around values.
  • EXCEL – The Microsoft Excel CSV format
  • MYSQL – The MySQL CSV format
  • RFC4180 – IETF RFC 4180 MIME text/csv format
  • TDF – Tab-delimited format
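
For instance, a tab-delimited file could be imported with -ff TDF, while a file using some other single-character delimiter could instead supply -fd (the paths and names here are illustrative):

iris_exec csv_import -- -ns Import -tn CsvNames5 -dp localhost/2019-01-15 -sf /data/NewData.tsv -ff TDF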

-om

outputMode (optional): SAFE (default), REPLACE, or APPEND. SAFE checks whether the target table already contains data matching the provided partitioning value; if it does, the import is aborted. When developing an import process, REPLACE should be used, because failed import attempts will often write some data to the table, causing the next attempt with SAFE to abort. APPEND should normally be used only when you are running multiple imports to load a set of data to the same table at one time, possibly from multiple different sources, and the resultant data needs to be kept together as part of one logical partition.

-rc

relaxedChecking (optional): Defaults to FALSE. If TRUE, will allow target columns that are missing from the source CSV to remain null, and will allow import to continue when data conversions fail. In most cases, this should be set to TRUE only when developing the import process for a new data source.

-sd

-sf

-sg

sourceDirectory, sourceFile, and sourceGlob are all optional. If none of these are provided, the system will attempt to do a multi-file import (see below). Otherwise, sourceDirectory will be used in conjunction with sourceFile or sourceGlob. If sourceDirectory is not provided, but sourceFile is, then sourceFile will be used as a fully qualified file name. If sourceDirectory is not provided, but sourceGlob is, then sourceDirectory will default to the configured log file directory from the prop file being used.
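
As an illustration (hypothetical directory and pattern), the following would import every matching file in a directory, replacing any data already in the target partition:

iris_exec csv_import -- -ns Import -tn CsvNames5 -dp localhost/2019-01-15 -sd /data/incoming -sg "NewData-*.csv" -om REPLACE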

-sn

sourceName: Specific ImportSource to use. If not specified, the importer will use the first ImportSource block that it finds that matches the type of the import (CSV, JDBC, XML, or JSON).

-tr

Trim (optional): Similar to the TRIM file format, but adds leading/trailing whitespace trimming to any format. For a comma-delimited file with extra whitespace, -ff TRIM is sufficient; for a file using a delimiter other than a comma, -tr would be used in addition to -ff or -fd.

-cv

constantColumnValue: A literal value to use for the import column with sourceType="CONSTANT", if the destination schema requires it.

For the example schema used earlier, one might execute CsvFileImporter with the following program arguments:

-dp Server1/2016-12-07 -tn CsvNames5 -ns Import -sf ./NewData.csv

One special case when importing CSV data is columns in the target table with a type of Boolean. By default, the CSV importer will attempt to interpret string data from the source file, with 0, F, f, or any case of false being written as a Boolean false, and 1, T, t, or any case of true being written as a Boolean true. Any other value will result in a conversion failure, which may be continuable if, for instance, a default is set for the column. To convert from other representations of true/false, for example from foreign-language strings, a formula or transform would be needed, along with sourceType="String" to ensure the reading of the data from the CSV field is handled correctly.
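
As a sketch of what such a mapping might look like in an ImportSource block (the column and source names here are hypothetical, and the exact formula conventions are described in the ImportSource XML documentation), a German-language ja/nein column could be converted with a formula:

<ImportSource name="IrisCSV" type="CSV">
   <!-- Read the source field as a String, then evaluate a Java expression to produce the Boolean -->
   <ImportColumn name="Approved" sourceName="Genehmigt" sourceType="String" formula='Genehmigt.equalsIgnoreCase("ja")' />
</ImportSource>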

Specific Requirements for JdbcImporter

The JdbcImporter (com.illumon.iris.importers.JdbcImporter) is similar to the CsvFileImporter, and supports the same syntax and features for automatic column mapping and custom import specification through an ImportSource block. The main difference is that while the CSV importer takes a CSV file as its data source, the JDBC importer pulls data from a SQL query.

The JdbcImporter class main method takes the following arguments:

-dd or --destinationDirectory <path>
-dp or --destinationPartition <internal partition name / partitioning value>
-pc or --intradayPartition <partition column name>
-ns or --namespace <namespace>
-tn or --tableName <name>
-om or --outputMode <import behavior>
-jd or --jdbcDriver <class name>
-cu or --connectionUrl <JDBC URL>
-jq or --jdbcQuery <query string>
-rc or --relaxedChecking <TRUE or FALSE>
-sn or --sourceName <ImportSource name>

Parameter Value

Description

-dd

-dp

-pc

Either a destination directory, specific partition, or internal partition plus a partition column must be provided. A directory can be used to write a new set of table files to a specific location on disk, where they can later be read with TableTools. A destination partition is used to write to intraday locations for existing tables. The internal partition value is used to separate data on disk; it does not need to be unique for a table. The name of the import server is a common value for this. The partitioning value is a string data value used to populate the partitioning column in the table during the import. This value must be unique within the table.

In summary, there are three ways to specify destination table partition(s):

  • Destination directory (e.g., -dd /db/Intraday/<namespace>/<table>/localhost/<date>/<table>/)
  • Internal partition and destination partition (e.g., -dp localhost/2018-01-01)
  • Internal partition and partition column - for multi-partition import (e.g., -dp localhost -pc Date)

-ns

Namespace (required): Namespace in which to find the target table.

-tn

tableName (required): Name of the target table.

-om

outputMode (optional): SAFE (default), REPLACE, or APPEND. SAFE checks whether the target table already contains data matching the provided partitioning value; if it does, the import is aborted. When developing an import process, REPLACE should be used, because failed import attempts will often write some data to the table, causing the next attempt with SAFE to abort. APPEND should normally be used only when you are running multiple imports to load a set of data to the same table at one time, possibly from multiple different sources, and the resultant data needs to be kept together as part of one logical partition.

-jd

jdbcDriver - class name of the driver to load to connect to the data source. See below for more details.

-cu

connectionUrl - information needed to find and connect to the data source. See below for more details.

-jq

jdbcQuery - the full query string to execute against the data source. In most cases, this will need to be delimited with double-quotes.

-rc

relaxedChecking (optional): Defaults to FALSE. If TRUE, this will allow target columns that are missing from the source query results to remain null, and will allow import to continue when data conversions fail. In most cases, this should be set to TRUE only when developing the import process for a new data source.

-sn

sourceName: Specific ImportSource to use. If not specified, the importer will use the first ImportSource block that it finds that matches the type of the import (CSV, JDBC, XML, or JSON).

The schema of the target table will be used to select default setters for transformation of incoming data if there are no ImportColumns to provide other instructions. For example, if the source column is a string, but the Deephaven table column is designated as integer, the importer will attempt to cast the data as integer.

The driver will need to be available from the classpath of the importer. Examples of driver classes are:

  • com.microsoft.sqlserver.jdbc.SQLServerDriver for Microsoft SQL Server
  • com.mysql.jdbc.Driver for MySQL

The connectionUrl will include things like server name, port, authentication, etc., that will vary depending on the type of data source being used. An example of a SQL Server URL for a local, default instance, using integrated Windows security, might be something like the following:

jdbc:sqlserver://localhost:1433;databaseName=Sample;integratedSecurity=true

Other sources may be a bit different. For example, a sample MySQL connectionUrl follows:

jdbc:mysql://localhost:3306/Sample?user=root&password=secret

The SQL query format will also vary depending on specific language capabilities and expectations of the data source. However, the resultant columns should match the column names in the target table. Most SQL dialects will allow columns to be calculated or labeled as part of the query, and the importer will match columns based on the resultant labels. Rather than just directly matching result set column names to target table column names, another option is to add transformation metadata to the Deephaven schema. This allows the user to specify options like a different source column name, a different source column data type, whether a column can be null, and even custom transform functions to use during the import. See the general importer topic for a sample of schema for an import target table.
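
For example, a query can alias its result columns so the labels line up with the target table's column names; a complete invocation against the sample MySQL URL above might look like this (the source table and column names are illustrative):

iris_exec jdbc_import -- -ns Import -tn CsvNames5 -dp localhost/2019-01-15 -jd com.mysql.jdbc.Driver -cu "jdbc:mysql://localhost:3306/Sample?user=root&password=secret" -jq "SELECT first_name AS FirstName, last_name AS LastName, gender AS Gender FROM people"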

In addition to the column translation features of the CSV importer, the JDBC importer will perform special handling when the source column is a date type (JDBC DateTime, Date, or Timestamp) and the destination is a DBDateTime. This handling will attempt a direct conversion between date values. (Note that DBDateTime is restricted to dates in the range of 09/25/1677 to 04/11/2262.)

Specific Requirements for XmlFileImporter

The XmlFileImporter class (com.illumon.iris.importers.XmlFileImporter) main method takes the following arguments:

-dd or --destinationDirectory <path>
-dp or --destinationPartition <internal partition name / partitioning value> | <internal partition name>
-pc or --intradayPartition <partition column name>
-ns or --namespace <namespace>
-tn or --tableName <name>
-fd or --fieldDelimiter
-xi or --startIndex
-xd or --startDepth
-xm or --maxDepth
-xt or --elementType
-ev or --useElementValues
-av or --useAttributeValues
-pv or --namedValues
-om or --outputMode <import behavior>
-rc or --relaxedChecking <TRUE or FALSE>
-sd or --sourceDirectory <path>
-sf or --sourceFile <exact file name>
-sg or --sourceGlob <file name pattern>
-sn or --sourceName <ImportSource name>

Parameter Value

Description

-dd

-dp

-pc

Either a destination directory, specific partition, or internal partition plus a partition column must be provided. A directory can be used to write a new set of table files to a specific location on disk, where they can later be read with TableTools. A destination partition is used to write to intraday locations for existing tables. The internal partition value is used to separate data on disk; it does not need to be unique for a table. The name of the import server is a common value for this. The partitioning value is a string data value used to populate the partitioning column in the table during the import. This value must be unique within the table.

In summary, there are three ways to specify destination table partition(s):

  • Destination directory (e.g., -dd /db/Intraday/<namespace>/<table>/localhost/<date>/<table>/)
  • Internal partition and destination partition (e.g., -dp localhost/2018-01-01)
  • Internal partition and partition column - for multi-partition import (e.g., -dp localhost -pc Date)

-ns

Namespace (required): Namespace in which to find the target table.

-tn

tableName (required): Name of the target table.

-fd

Field delimiter (optional): Allows specification of a character to be used when parsing string representations of long or double arrays.

-xi

Start Index: Starting from the root of the document, the index (1 being the first top-level element in the document after the root) of the element under which data can be found.

-xd

Start Depth: Under the element indicated by Start Index, how many levels of first children to traverse to find an element that contains data to import.

-xm

Max Depth: Starting from Start Depth, how many levels of element paths to traverse and concatenate to provide a list that can be selected under Element Name.

-xt

Element Type: The name or path of the element that will contain data elements.

-ev

Element Values: Indicates that field values will be taken from element values (e.g., <Price>10.25</Price>).

-av

Attribute Values: Indicates that field values will be taken from attribute values (e.g., <Record ID="XYZ" Price="10.25" />).

-pv

Positional Values: When omitted, field values within the document are named (e.g., a value called Price might be contained in an element named Price, or an attribute named Price). When this option is included, field names (column names) are taken from the table schema, and the data values are parsed into them by matching the position of the value with the position of the column in the schema.

-om

outputMode (optional): SAFE (default), REPLACE, or APPEND. SAFE checks whether the target table already contains data matching the provided partitioning value; if it does, the import is aborted. When developing an import process, REPLACE should be used, because failed import attempts will often write some data to the table, causing the next attempt with SAFE to abort. APPEND should normally be used only when you are running multiple imports to load a set of data to the same table at one time, possibly from multiple different sources, and the resultant data needs to be kept together as part of one logical partition.

-rc

relaxedChecking (optional): Defaults to FALSE. If TRUE, this will allow target columns that are missing from the source XML to remain null, and will allow import to continue when data conversions fail. In most cases, this should be set to TRUE only when developing the import process for a new data source.

-sd
-sf
-sg

sourceDirectory, sourceFile, and sourceGlob are all optional. If none of these are provided, the system will attempt to do a multi-file import (see below). Otherwise, sourceDirectory will be used in conjunction with sourceFile or sourceGlob. If sourceDirectory is not provided, but sourceFile is, then sourceFile will be used as a fully qualified file name. If sourceDirectory is not provided, but sourceGlob is, then sourceDirectory will default to the configured log file directory from the prop file being used.

-sn

sourceName: Specific ImportSource to use. If not specified, the importer will use the first ImportSource block that it finds that matches the type of the import (CSV, JDBC, XML, or JSON).

An example execution of the XmlFileImporter might include program arguments like this:

-dp Server1/2016-12-07 -tn XmlNames5 -ns Import -sf ./NewData.xml -xi 1 -xd 1 -xm 2 -xt DataRecord -ev
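
Those arguments would fit a hypothetical document shaped like the following: the first element under the root (-xi 1) is traversed one level of first children (-xd 1) to reach the element containing the DataRecord data elements (-xt DataRecord), whose child element values (-ev) supply the fields:

<Import>
   <Batch>
      <Records>
         <DataRecord>
            <FirstName>Amanda</FirstName>
            <LastName>Jones</LastName>
            <Gender>Female</Gender>
         </DataRecord>
      </Records>
   </Batch>
</Import>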

One special case when importing XML data is columns in the target table with a type of Boolean. By default, the XML importer will attempt to interpret string data from the source document, with 0, F, f, or any case of false being written as a Boolean false, and 1, T, t, or any case of true being written as a Boolean true. Any other value will result in a conversion failure, which may be continuable if, for instance, a default is set for the column. To convert from other representations of true/false, for example from foreign-language strings, a formula or transform would be needed, along with sourceType="String" to ensure the reading of the data from the XML value is handled correctly.

Specific Requirements for JsonFileImporter

The input file should contain plain JSON objects, not comma-separated values or an array. For example, the following is a legal file with three objects, which will result in three rows: {a:1} {a:2} {a:3}

You may have multiple JSON objects on a single line, and/or the JSON objects may span multiple lines.

Nested objects are recursively expanded, with the column name for each field derived from the full "path" to that field. The path separator can be specified with the -fps argument. Array elements are numbered with a zero-based index.

For example, the following maps to a schema with the columns "a", "b_c", "b_d", "e0", "e1", and "e2", and a row with the values 1, 2, 3, 4, 5, 6 (the path separator used here is "_"):

{ "a":1, "b":{ "c":2, "d":3 }, "e":[4,5,6] }

The JsonFileImporter class (com.illumon.iris.importers.JsonFileImporter) main method takes the following arguments:

-dd or --destinationDirectory <path>
-dp or --destinationPartition <internal partition name / partitioning value> | <internal partition name>
-pc or --intradayPartition <partition column name>
-ns or --namespace <namespace>
-tn or --tableName <name>
-om or --outputMode <import behavior>
-rc or --relaxedChecking <TRUE or FALSE>
-sd or --sourceDirectory <path>
-sf or --sourceFile <exact file name>
-sg or --sourceGlob <file name pattern>
-sn or --sourceName <ImportSource name>
-fps or --filePathSeparator <file path separator>
-mi or --maxInferItems <max infer items>
-cv or --constantColumnValue <constant column value>

Parameter Value

Description

-dd

-dp

-pc

Either a destination directory, specific partition, or internal partition plus a partition column must be provided. A directory can be used to write a new set of table files to a specific location on disk, where they can later be read with TableTools. A destination partition is used to write to intraday locations for existing tables. The internal partition value is used to separate data on disk; it does not need to be unique for a table. The name of the import server is a common value for this. The partitioning value is a string data value used to populate the partitioning column in the table during the import. This value must be unique within the table.

In summary, there are three ways to specify destination table partition(s):

  • Destination directory (e.g., -dd /db/Intraday/<namespace>/<table>/localhost/<date>/<table>/)
  • Internal partition and destination partition (e.g., -dp localhost/2018-01-01)
  • Internal partition and partition column - for multi-partition import (e.g., -dp localhost -pc Date)

-ns

Namespace (required): Namespace in which to find the target table.

-tn

tableName (required): Name of the target table.

-om

outputMode (optional): SAFE (default), REPLACE, or APPEND. SAFE checks whether the target table already contains data matching the provided partitioning value; if it does, the import is aborted. When developing an import process, REPLACE should be used, because failed import attempts will often write some data to the table, causing the next attempt with SAFE to abort. APPEND should normally be used only when you are running multiple imports to load a set of data to the same table at one time, possibly from multiple different sources, and the resultant data needs to be kept together as part of one logical partition.

-rc

relaxedChecking (optional): Defaults to FALSE. If TRUE, will allow target columns that are missing from the source JSON to remain null, and will allow import to continue when data conversions fail. In most cases, this should be set to TRUE only when developing the import process for a new data source.

-sd

-sf

-sg

sourceDirectory, sourceFile, and sourceGlob are all optional. If none of these are provided, the system will attempt to do a multi-file import (see below). Otherwise, sourceDirectory will be used in conjunction with sourceFile or sourceGlob. If sourceDirectory is not provided, but sourceFile is, then sourceFile will be used as a fully qualified file name. If sourceDirectory is not provided, but sourceGlob is, then sourceDirectory will default to the configured log file directory from the prop file being used.

-sn

sourceName: Specific ImportSource to use. If not specified, the importer will use the first ImportSource block that it finds that matches the type of the import (CSV/JDBC/XML/JSON).

-fps

filePathSeparator (optional): The separator used when deriving target column names from the "path" of nested JSON fields, as described above (an example invocation follows this table).

-mi

maxInferItems (optional): Specifies how many JSON items to examine in the source file(s) prior to import, in order to infer the "columns" present in the source data and validate them against the destination table. This inference step is necessary because JSON permits missing values. By default, the importer will read all items. For large files, it may improve performance to use a smaller value, if you know your source data does not contain missing values.

-cv

constantColumnValue: A literal value to use for the import column with sourceType="CONSTANT", if the destination schema requires it.
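
For the nested-object example above, the importer might be run with program arguments like these (the partition, table name, and file name are illustrative):

-dp localhost/2019-01-15 -ns Import -tn JsonData -sf ./NewData.json -fps _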

 

