Importing JSON Files

JSON can contain elaborate structure. The JSON import tools provided by Deephaven can interpret most JSON files, but they map the data to a single table in a specific way. A more elaborate mapping may require writing a custom import job or pre-processing the source data.

The input file should contain plain JSON objects, not comma-separated objects or an array. For example, the following is a legal file with three objects, which will result in three rows:

{a:1} {a:2} {a:3}
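By contrast, the same data expressed as a top-level array (or as comma-separated objects) is not a supported input:

[{a:1}, {a:2}, {a:3}]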

You may have multiple JSON objects on a single line, and/or the JSON objects may span multiple lines.

Nested objects are recursively expanded, with the column name for each field derived from the full "path" to that field. The path separator can be specified with the -fps argument. Array elements are numbered with a zero-based index.

For example, the following maps to a schema with the columns "a", "b_c", "b_d", "e0", "e1", and "e2", and a row with the values 1, 2, 3, 4, 5, and 6 (the path separator used here is "_"):

{ "a":1, "b":{ "c":2, "d":3 }, "e":[4,5,6] }

Quickstart

Here is an example of how to generate a schema from a JSON data file, deploy it, and import a file from the command line. These commands assume a typical Deephaven installation and a sample file located at /data/sample.json.

Generate a schema from a sample JSON data file:

iris_exec json_schema_creator -- -ns JSONExampleNamespace -tn JSONExampleTableName -sf /data/sample.json

The schema file will be generated in the JSONExampleNamespace directory.

ls -l JSONExampleNamespace/*.schema

Deploy the schema:

sudo -u irisadmin service iris deploy_schema -f <fully qualified schema file name>

Import a single data file into the specified Intraday partition:

sudo -u dbmerge /usr/illumon/latest/bin/iris_exec json_import -- -ns JSONExampleNamespace -tn JSONExampleTableName -sf /data/sample.json -dp localhost/2018-09-26

This example will generate a table schema from sample.json with the namespace "JSONExampleNamespace" and table name "JSONExampleTableName". The default behavior creates a directory for the namespace in the current directory and places the output schema file inside this directory. Once deployed, JSON files matching the structure of sample.json can be imported.
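To verify the import, the new intraday partition can be queried from a Deephaven console. The following is a minimal sketch, assuming the legacy db.i intraday accessor and the partition value used above:

t = db.i("JSONExampleNamespace", "JSONExampleTableName").where("Date=`2018-09-26`")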

Schema Inference

The JSON schema creator infers the structure of a single table by recursively "flattening" the JSON objects in the specified file; type inference is then run on the flattened values using the same logic as the CSV schema creator. The input-format and column-naming rules are the same as described in the introduction above.


Command Reference

iris_exec json_schema_creator <launch args> -- <schema creator args>

The following arguments are available when running the JSON schema creator:

-ns or --namespace <namespace>
-tn or --tableName <name>
-sp or --schemaPath <path>
-sf or --sourceFile <file name or file path and name>
-pc or --partitionColumn
-gc or --groupingColumn
-spc or --sourcePartitionColumn
-sn or --sourceName <name for ImportSource block>
-sl or --skipLines <integer value>
-lp or --logProgress
-bf or --bestFit
-om or --outputMode
-fps or --fieldPathSeparator

-ns
Namespace (required): Namespace in which to find the target table.

-tn
tableName (required): Name of the target table.

-sp
schemaPath (optional): Path to which the schema file will be written. If not specified, this defaults to the current working directory, creating or using a subdirectory that matches the namespace.

-sf
sourceFile (required): Name of the JSON file to read.

-pc
partitionColumn (optional): Name for the partitioning column if the schema is being generated. If this is not provided, the importer will default to Date for the partitioning column name. Any existing column from the source that matches the name of the partitioning column will be renamed to source_[original column name].

-spc
sourcePartitionColumn (optional): Column to use for multi-partition imports. For example, if the partitioning column is "Date" and you want to enable multi-partition imports based on the column source_date, specify "source_date" with this option. Note: this is the column name in the data source, not the Deephaven column name.

-gc
groupingColumn (optional): Column name that should be marked as columnType="Grouping" in the schema. If multiple grouping columns are needed, the generated schema should be manually edited to add the Grouping designation to the additional columns.

-sn
sourceName (optional): Name to use for the generated ImportSource block in the schema. If not provided, the default of IrisJSON will be used.

-sl
skipLines (optional): Number of lines to skip from the beginning of the file before parsing.

-lp
logProgress (optional): If present, additional informational logging will be provided with progress updates during the parsing process.

-bf
bestFit (optional): If present, the schema creator will attempt to use the smallest numeric types that fit the data. For example, for integer values, short will be used if all values are in the range of -32768 to 32767; if larger values are seen, the type will widen to int and eventually long. The default behavior, without -bf, is to use long for all integer columns and double for all floating-point columns.

-om
outputMode (optional): Either SAFE (default) or REPLACE. When SAFE, the schema creator will exit with an error rather than overwrite an existing schema file. When set to REPLACE, a pre-existing file will be overwritten.

-fps
fieldPathSeparator (optional): A character or string to use when generating column names from nested JSON data. Defaults to underscore ("_").
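For example, the following command (hypothetical namespace, table, and file names) combines several of these flags, producing a schema with a Date partitioning column, a Sym grouping column, best-fit numeric types, and overwrite of any existing schema file:

iris_exec json_schema_creator -- -ns ExampleNS -tn Trades -sf /data/trades.json -pc Date -gc Sym -bf -om REPLACE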

Import Query

There is presently no support for JSON imports via a persistent query.

Importing Using Builder

JSON imports may be performed from inside a Deephaven Groovy or Python script. This permits more elaborate logic with respect to existing data. These scripts may be executed as a persistent query or from the command line using the iris_exec run_local_script tool. All imports should be performed as the dbmerge user (or from a persistent query, running on a merge server).

Example

The following scripts import a single JSON file to a specified partition, first in Python and then in Groovy.

Python:

from deephaven import *

rows = JsonImport.Builder("Test", "Sample")\
    .setSourceFile("/db/TempFiles/dbquery/staging/data1.json")\
    .setDestinationPartitions("localhost/2018-04-01")\
    .setOutputMode("REPLACE")\
    .build()\
    .run()

print("Imported {} rows.".format(rows))
import com.illumon.iris.importers.util.JsonImport
import com.illumon.iris.importers.ImportOutputMode

rows = new JsonImport.Builder("Test","Sample")
    .setSourceFile("/db/TempFiles/dbquery/staging/data1.json")
    .setDestinationPartitions("localhost/2018-04-01")
    .setOutputMode(ImportOutputMode.REPLACE)
    .build()
    .run()

println "Imported " + rows + " rows."

Import API Reference

The JSON import class provides a static builder method, which produces an object used to set parameters for the import. The builder returns an import object from the build() method. Imports are executed via the run() method and, if successful, return the number of rows imported. All other parameters and options for the import are configured via the setter methods described below. The general pattern when scripting an import is:

nRows = JsonImport.builder(<namespace>,<table>)
    .set<option>(<option value>)
    …
    .build()
    .run()

JSON Import Options

Each entry below shows the option setter, its type, whether it is required, and its default value.

setDestinationPartitions (String; optional*; no default)
The destination partition(s), e.g., "localhost/2018-05-01" for a single-partition import, or "localhost" for a multi-partition import.

setDestinationDirectory (File | String; optional*; no default)
The destination directory, e.g., /tmp/mylocation.

setOutputMode (ImportOutputMode | String; optional; default SAFE)
Enumeration with the options ImportOutputMode.SAFE, ImportOutputMode.REPLACE, and ImportOutputMode.APPEND. May also be specified as a String (e.g., "SAFE", "REPLACE", "APPEND").

setStrict (boolean; optional; default true)
If false, will allow target columns that are missing from the source to remain null, and will allow the import to continue when data conversions fail. In most cases, this should be set to false only when developing the import process for a new data source.

setPartitionColumn (String; optional; no default)
Column name used to choose the partition into which each source row is imported.

setSourceName (String; optional; no default)
Specific ImportSource to use (from the table schema). If not specified, the importer will use the first ImportSource block it finds that matches the type of the import (CSV/XML/JSON/JDBC).

setSourceDirectory (String; optional**; no default)
Directory from which to read the source file(s).

setSourceFile (String; optional**; no default)
Source file name (either a full path on the server filesystem, or relative to the specified source directory).

setSourceGlob (String; optional**; no default)
Wildcard expression matching the source file(s).

setMaxInferItems (Long; optional; default no limit)
Maximum number of JSON objects to examine when inferring which columns are present in the file. Inference is necessary because JSON has no internal "schema".

setColumnNames (String | List<String>; optional; no default)
Column names to extract from the source data.

setFieldPathSeparator (String; optional; default "_")
String to use when generating column names from nested JSON data. Defaults to the underscore character ("_").

setConstantColumnValue (String; optional; no default)
A String to materialize as the source column when an ImportColumn is defined with a sourceType of CONSTANT.

* Either a destination directory, specific partition, or internal partition plus a partition column must be provided. A directory can be used to write a new set of table files to a specific location on disk, where they can later be read with TableTools. A destination partition is used to write to intraday locations for existing tables. The internal partition value is used to separate data on disk; it does not need to be unique for a table. The name of the import server is a common value for this. The partitioning value is a string data value used to populate the partitioning column in the table during the import. This value must be unique within the table. In summary, there are three ways to specify destination table partition(s):

  • Destination directory (e.g., .setDestinationDirectory("/db/Intraday/<namespace>/<table>/localhost/<date>/<table>/"))
  • Internal partition and destination partition (e.g., .setDestinationPartitions("localhost/2018-04-01"))
  • Internal partition and partition column - for multi-partition import (e.g., .setDestinationPartitions("localhost").setPartitionColumn("Date"))

** The sourceDirectory parameter will be used in conjunction with sourceFile or sourceGlob. If sourceDirectory is not provided, but sourceFile is, then sourceFile will be used as a fully qualified file name. If sourceDirectory is not provided, but sourceGlob is, then sourceDirectory will default to the configured log file directory from the prop file being used.

Import from Command Line

JSON imports can also be performed directly from the command line, using the iris_exec tool.

Command Reference

iris_exec json_import <launch args> -- <json import args>

-dd or --destinationDirectory <path>
-dp or --destinationPartition <internal partition name / partitioning value> | <internal partition name>
-pc or --intradayPartition <partition column name>
-ns or --namespace <namespace>
-tn or --tableName <name>
-om or --outputMode <import behavior>
-rc or --relaxedChecking <TRUE or FALSE>
-sd or --sourceDirectory <path>
-sf or --sourceFile <exact file name>
-sg or --sourceGlob <file name pattern>
-sn or --sourceName <ImportSource name>
-fps or --filePathSeparator <file path separator>
-mi or --maxInferItems <max infer items>
-cv or --constantColumnValue <constant column value>

JSON Import Arguments

-dd, -dp, -pc
Either a destination directory, a specific partition, or an internal partition plus a partition column must be provided. A directory can be used to write a new set of table files to a specific location on disk, where they can later be read with TableTools. A destination partition is used to write to intraday locations for existing tables. The internal partition value is used to separate data on disk; it does not need to be unique for a table. The name of the import server is a common value for this. The partitioning value is a string data value used to populate the partitioning column in the table during the import. This value must be unique within the table.

In summary, there are three ways to specify destination table partition(s):

  • Destination directory (e.g., -dd /db/Intraday/<namespace>/<table>/localhost/<date>/<table>/)
  • Internal partition and destination partition (e.g., -dp localhost/2018-01-01)
  • Internal partition and partition column - for multi-partition import (e.g., -dp localhost -pc Date)

-ns
Namespace (required): Namespace in which to find the target table.

-tn
tableName (required): Name of the target table.

-om
outputMode (optional): SAFE (default), REPLACE, or APPEND. SAFE checks whether the target table already contains data matching the provided partitioning value; if it does, the import is aborted. When developing an import process, REPLACE should be used, because failed import attempts will often write some data to the table, causing the next attempt with SAFE to abort. APPEND should normally be used only when you are running multiple imports to load a set of data to the same table at one time, possibly from multiple different sources, and the resultant data needs to be kept together as part of one logical partition.

-rc
relaxedChecking (optional): Defaults to FALSE. If TRUE, will allow target columns that are missing from the source JSON to remain null, and will allow the import to continue when data conversions fail. In most cases, this should be set to TRUE only when developing the import process for a new data source.

-sd, -sf, -sg
sourceDirectory, sourceFile, and sourceGlob are all optional. If none of these is provided, the system will attempt to do a multi-file import (see below). Otherwise, sourceDirectory will be used in conjunction with sourceFile or sourceGlob. If sourceDirectory is not provided but sourceFile is, then sourceFile will be used as a fully qualified file name. If sourceDirectory is not provided but sourceGlob is, then sourceDirectory will default to the configured log file directory from the prop file being used.

-sn
sourceName (optional): Specific ImportSource to use. If not specified, the importer will use the first ImportSource block it finds that matches the type of the import (CSV/JDBC/XML/JSON).

-fps
Specifies a character or string to use when generating column names from nested JSON data. Defaults to underscore ("_").

-mi
maxInferItems (optional): How many JSON items to examine in the source file(s) prior to import, in order to infer the "columns" that exist in the source data and validate them against the destination table. This inference step is necessary because JSON permits missing values. By default, the importer will read all items. For large files, it may improve performance to use a smaller value, if you know your source data does not contain missing values.

-cv
constantColumnValue (optional): A literal value to use for the import column with sourceType="CONSTANT", if the destination schema requires it.
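For example, the following command (hypothetical staging directory; namespace and table from the Quickstart above) performs a multi-partition import of all staged JSON files:

sudo -u dbmerge /usr/illumon/latest/bin/iris_exec json_import -- -ns JSONExampleNamespace -tn JSONExampleTableName -sd /data/staging -sg "*.json" -dp localhost -pc Date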

