Tables & Schemas
Overview
A Deephaven table schema is an XML document that describes a table. A schema must include a table's namespace, name, column names and data types. In addition to specifying the structure of the data, a schema can include directives controlling how data is imported and stored.
Schema files can be created by directly constructing XML files (e.g., using a text editor), or can be generated using built-in tools. In most cases, the easiest way to create or edit a schema file is by using the Deephaven Schema Editor. The Schema Editor itself can be used to create a new schema from scratch, edit an existing schema, or derive a new schema from example data. This tool is covered in detail in the Schema Editor section.
If a schema is created manually, this is typically done by picking an existing schema file to use as a starting point, and then copying, pasting, and modifying it to create a suitable layout for the new table. This does not require any special tools, but it can be tedious for complex tables, or when many tables are needed. To simplify this process, Deephaven includes tools that can analyze data sets and infer column details and import instructions, and that can also generate and validate sections and entries within schema files. The schema creation tools make schema creation faster and easier, and help avoid issues related to formatting and syntax. These tools are described briefly in the section on Schema Inference, and in detail in the chapter covering each specific data type (CSV, XML, etc.).
Once a schema is defined, it must be deployed to take effect. If the table is intended for use by streaming data, an additional step is required to generate loggers and listeners. This is discussed further in Deploying Schemas to Deephaven.
This part of the documentation covers general schema requirements and features. Additional optional sections of schema are detailed in Importing Batch Data and Streaming Data.
Basic Table Attributes
Metadata is imported to the database using XML-based schema files that contain definitions like the following examples:
<Table namespace="Ise" name="Quotes" storageType="NestedPartitionedOnDisk" defaultMergeFormat="DeephavenV1">
<Partitions keyFormula="${autobalance_by_first_grouping_column}"/>
<Column name="Date" dataType="String" columnType="Partitioning" />
<Column name="Ticker" dataType="String" columnType="Grouping" />
<Column name="Size" dataType="int" columnType="Normal" />
<Column name="Price" dataType="double" columnType="Normal" />
</Table>
<Table namespace="Korea" name="Securities" storageType="SplayedOnDisk">
<Column name="UnderlyingId" dataType="int" columnType="Grouping" />
<Column name="Id" dataType="int" columnType="Normal" />
<Column name="Ticker" dataType="String" columnType="Normal" />
<Column name="SecurityType" dataType="String" columnType="Normal" />
<Column name="Expiration" dataType="DateTime" columnType="Normal" />
</Table>
Each schema file contains definitions for exactly one table, and must follow the naming convention <namespace>.<table>.schema. For example, the schemas above would be named Ise.Quotes.schema and Korea.Securities.schema.
The following three attributes are necessary for all Table elements:

| Attribute | Description |
|---|---|
| namespace | The system namespace to which this table belongs (e.g., Ise or Korea). |
| name | The name of this table (e.g., Quotes or Securities). |
| storageType | Specifies the way data for this table will be organized on disk. Options include NestedPartitionedOnDisk (for partitioned tables) and SplayedOnDisk, as shown in the examples above. |
| defaultMergeFormat | Specifies the format in which merged (historical) data will be written: DeephavenV1 (as in the example above) or Parquet. |
Merge Attributes
When merging data to historical partitions, Deephaven supports writing to traditional Deephaven files as well as writing to Parquet files. When writing to Parquet, a default compression codec can be chosen by adding a MergeAttributes element with an appropriate Parquet-supported codec. This codec can be changed by specifying a different one in the merge operation. If no codec is chosen, then the SNAPPY codec is used. This element does not apply to merges in Deephaven format. See Parquet Codecs for further details.
For example, to use the uncompressed codec:
<MergeAttributes format="Parquet" codec="UNCOMPRESSED" />
Columns
The <Column> elements contain the detailed information to define each table's columns, including the column's name, data type, and role as a partitioning or grouping column. At minimum, these elements must contain the following attributes.
| Attribute | Description |
|---|---|
| name | Required – The name of the column as it will be stored in the database. |
| dataType | Required – The type of the data stored in the column (e.g., String, int, long, double, DateTime, StringSet, or an array type such as byte[]). Note: Deephaven also supports other friendly names for some types; for instance, Integer will be automatically treated as int. This treatment does not apply, however, to array types: an array of 32-bit values must be declared as int[]. |
| columnType | Optional – The role of the column in the table: Normal (the default), Grouping, or Partitioning. See Partition Key Formulas for details on the function of Partitioning and Grouping columns. |
| objectCodec | Optional – A Java class which controls how the column is stored on disk. See Column Codecs below for details on the provided codecs. |
| objectCodecArguments | Optional – Argument(s) provided to the objectCodec. |
| symbolTable | Optional – For String columns only, this specifies whether or not a symbol table will be used to store the data. If it is not provided, a system default is used. |
| encoding | Optional – For String columns only, this determines the encoding used to store the value. If it is not provided, a default encoding is used. |
Column Codecs
Each codec is mapped to a single column data type and may be used only with that type. The codec does not affect how the data appears in the Deephaven console. Custom codecs may be provided on the Deephaven classpath by implementing the com.illumon.util.codec.ObjectCodec<T> interface. The following codecs are provided with Deephaven. For columns of these types, it is much more space efficient to use one of these codecs than to rely on default Java serialization.
| Class Name | Column Type | Arguments |
|---|---|---|
| com.illumon.util.codec.BigDecimalCodec | java.math.BigDecimal | The precision argument specifies the maximum number of decimal digits that may be stored. The scale argument specifies the maximum number of digits after the decimal point. If precision is specified but scale is not, a scale of zero is inferred. If precision is specified, a space-efficient fixed-size encoding is used; otherwise, a variable-size encoding is used, with a maximum precision of 1000. Null values are permitted. |
| com.illumon.util.codec.BigIntegerCodec | java.math.BigInteger | If specified, precision is used to specify the maximum number of decimal digits storable by this column. This codec uses the same encoding as BigDecimalCodec with a scale of zero. Null values are permitted. |
| com.illumon.util.codec.ByteArrayCodec | byte[] | If size is specified, values are encoded as fixed-size values; if not, a variable-size encoding is used. A flag indicating nullability may also be specified (defaults to nullable). Specifying notnull disallows null values, as in the objectCodecArguments="32,notnull" example below. |
The following is an example of a simple schema; additional elements will be added in later sections.
<Table name="TestTable" namespace="TestNamespace" storageType="NestedPartitionedOnDisk">
<Partitions keyFormula="${autobalance_single}"/>
<Column name="Partition" dataType="String" columnType="Partitioning" />
<Column name="Alpha" dataType="String" columnType="Grouping"/>
<Column name="Bravo" dataType="int" />
<Column name="Charlie" dataType="double" />
<Column name="Delta" dataType="double" />
<Column name="Echo" dataType="StringSet" />
<Column name="Foxtrot" dataType="java.math.BigDecimal" objectCodec="com.illumon.util.codec.BigDecimalCodec" objectCodecArguments="13,7"/>
<Column name="Golf" dataType="java.math.BigInteger" objectCodec="com.illumon.util.codec.BigIntegerCodec" objectCodecArguments="13"/>
<Column name="Hotel" dataType="byte[]" objectCode="com.illumon.util.codec.ByteArrayCodec" objectCodecArguments="32,notnull"/>
</Table>
Partitions
Deephaven has two distinct uses of the term "partition" as it applies to schemas. Both are required for partitioned tables (i.e., those using the NestedPartitionedOnDisk storage type).
- A partitioning column is a String column that must contain an identifiable value for each set of data appended to the table. This is required for all partitioned tables.
- The historical storage partitions, along with the partitioning key formula, are used to distribute table data across multiple storage locations as the data is merged into the historical database. This is required for all partitioned tables.
Typically, some form of date is used for the partitioning column value, but any string that meaningfully partitions the data can be used. Table data is associated together on disk by the partitioning column value, so queries will be more efficient if the partitioning value is a criterion commonly used for including or excluding data in a query.
Partitioning columns are specified by using the columnType="Partitioning" attribute in a column definition. Currently, partitioned tables must have exactly one such column. Further information on column definitions is provided below.
When writing merged historical table data to disk, Deephaven will place partitions of the input data in a directory identified by the namespace, storage partition, column partition, and table name.
Partition Key Formulas
Historical storage partitions correspond to directories available for historical storage on the disk; for scaling and performance benefits, these are normally mapped to different physical storage devices, whether local or network-connected.
Historical storage partitions are specified using the Partitions element, which uses the keyFormula attribute to map table rows to the available storage partitions. Specifically, this formula is used by merge operations in an updateView operation, the result of which is used to assign rows to a storage partition.
The following QueryScope parameters are available as input to the formula, in addition to the columns of the table:
| Parameter | Description |
|---|---|
| __WRITABLE_PARTITIONS__ | An array of partition name strings corresponding to the writable storage partition directories found on disk. |
| __NUM_WRITABLE_PARTITIONS__ | The number of available partitions (i.e., the length of __WRITABLE_PARTITIONS__). |
| __NAMESPACE__ | A string representing the namespace. |
| __TABLE_NAME__ | A string representing the table name. |
| __NS_TN__ | A string representing the namespace and table name, joined with a slash as a separator (e.g., Ise/Quotes). |
| __PARTITIONING_COLUMN_VALUE__ | The partitioning column value for this merge operation. |
| __DAY_OF_YEAR__ | The day of the year; only available when the partitioning column value can be parsed as a date. |
| __PARTITION_AUTOBALANCE_SINGLE__ | A complete formula that automatically chooses a single partition for all of the data in a merge operation. |
In addition, the following substitutions will be applied, in order to make writing appropriate partitioning formulas easier:
| Token | Substitution Description |
|---|---|
| ${autobalance_single} | A complete formula that chooses a single partition for all input data, attempting to achieve balanced partition usage over long time horizons. This is the preferred way to write a formula equivalent to the __PARTITION_AUTOBALANCE_SINGLE__ parameter described above. |
| ${autobalance_by_first_grouping_column} | A complete formula that uses the first grouping column (if there is one) in order to spread data across all available partitions, attempting to achieve balanced partition usage over long time horizons. Assuming the first grouping column is the most significant, this keeps all grouped data together. Falls back to the same behavior as ${autobalance_single} if there are no grouping columns. |
| ${autobalance_by_all_grouping_columns} | A complete formula that uses all grouping columns (if there are any) in order to spread data across all available partitions, attempting to achieve balanced partition usage over long time horizons. This effectively spreads grouped data across multiple partitions, which may be useful for queries. Falls back to the same behavior as ${autobalance_single} if there are no grouping columns. |
| ${first_grouping_column} | A string representing the name of the first grouping column (i.e., the first column definition with the attribute columnType="Grouping"). |
| ${all_grouping_columns} | An array of strings representing the names of all grouping columns (i.e., the column definitions with the attribute columnType="Grouping"). |
Conventions and Recommendations
Some common conventions follow:
- Large tables: Tables with large amounts of data per column partition typically try to balance across all available storage partitions while keeping related data together, often using a grouping column as part of the formula. The easiest way to achieve this is to use the following formula, relying on token substitution:
  ${autobalance_by_first_grouping_column}
  For an expanded example, the Ise/Quotes table above might use the following formula (shown in schema XML just after this list):
  __WRITABLE_PARTITIONS__[abs(Objects.hash(__NS_TN__, Ticker) + __DAY_OF_YEAR__) % __NUM_WRITABLE_PARTITIONS__]
  This serves to keep data with the same value for the Ticker column together, while spreading data across all available partitions and taking the day of year, namespace, and table name into account in order to approximate long-term load balancing.
- Small tables: Tables with small amounts of data per column partition often simply use ${autobalance_single}, keeping data in a relatively compact format while utilizing all historical partitions over longer time horizons in order to approximate load balancing.
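In schema XML, the expanded Ise/Quotes formula above would be placed directly in the table's Partitions element, for example:
<Partitions keyFormula="__WRITABLE_PARTITIONS__[abs(Objects.hash(__NS_TN__, Ticker) + __DAY_OF_YEAR__) % __NUM_WRITABLE_PARTITIONS__]"/>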
Schema Inference
Schema inference from source data can be performed from the command line or through the Schema Editor interface. The schema generator tools and the command lines used to invoke them are listed below. You may use iris_exec to invoke any of the schema creators without arguments to display the available options. (Note: To learn more about iris_exec commands, please refer to Running Local Scripts.) The arguments are also described in detail in the section covering each specific data type (CSV/JDBC/XML/JSON).
Command Line Interface
iris_exec csv_schema_creator <launch args> -- <schema creator args>
iris_exec jdbc_schema_creator <launch args> -- <schema creator args>
iris_exec xml_schema_creator <launch args> -- <schema creator args>
iris_exec json_schema_creator <launch args> -- <schema creator args>
The table schema files created from these processes also support merge operations, and allow specification of a grouping column and/or the partitioning column name. After a schema file is generated, it is highly recommended the user review the schema and the log file, and update details as needed. You must deploy your schema and generate loggers/listeners. See below for more details on the contents of a generated schema file and the types of changes that might be needed.
CSV
CSV refers to a variety of text file data formats. Although they are called "comma-separated", the delimiter could in fact be some other character (such as a tab or semicolon). Other details of formatting are similarly flexible. There is no single official standard for CSV formatting, so Deephaven provides flexibility when defining how to interpret and transform text data from CSV files into structured storage in Deephaven tables.
Because CSV is a text file format that provides no metadata, the schema generation tool must analyze the data that it finds in a sample data file. Typically column names are derived from column headers, but there are options for CSV files with and without headers. The schema generation process can be simplified and accelerated somewhat by using a sample of data, as long as there are enough rows to clearly demonstrate the formats and ranges of data values that will be included in normal imports.
For non-String columns, the schema generator will scan all the values in the supplied CSV file to ensure the inferred data type remains valid for all values in the column. As an example, the first value for a column might be 4. This would cause the schema generator to tentatively categorize this column as an integer type (short or long, depending on settings). If a 5.5 was encountered in a subsequent row, the generator would "promote" the column to a floating point datatype. If another value in the column included non-numeric characters, such as XYZ, the generator would then mark the column as a String column. Once a column has been designated as a String column, no other checking is done, as Strings are always valid to accept any data from a CSV.
If a column is empty for all rows, it is marked as a String and a warning message is written to the log. In this case, the administrator should examine the generated schema and validate the generated XML before adding it to Deephaven. Date/time values are also a special case, where the schema generator will attempt to match the format of the date/time string to a known format for which it can provide a conversion formula. Like the earlier example with numbers and Strings, if a date/time column has multiple different formats in it, it will end up being marked as a String column, since no single formula could be found that matched all of the column's data.
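The promotion logic can be pictured as a simple per-column state machine. The following minimal Java sketch illustrates the idea; it is not the actual schema generator code, and the real tool distinguishes more types (such as date/times and different integer widths):

public class ColumnTypeInference {
    enum InferredType { INTEGER, FLOATING_POINT, STRING }

    // Promote the current type guess based on one more observed value; once STRING, always STRING.
    static InferredType promote(InferredType current, String value) {
        if (current == InferredType.STRING || value == null || value.isEmpty()) {
            return current;
        }
        try {
            Long.parseLong(value);
            return current;                         // value fits the current numeric guess
        } catch (NumberFormatException notInteger) {
            try {
                Double.parseDouble(value);
                return InferredType.FLOATING_POINT; // promote integer guess to floating point
            } catch (NumberFormatException notNumeric) {
                return InferredType.STRING;         // non-numeric data forces a String column
            }
        }
    }

    public static void main(String[] args) {
        InferredType guess = InferredType.INTEGER;
        for (String value : new String[]{"4", "5.5", "XYZ"}) {
            guess = promote(guess, value);
            System.out.println(value + " -> " + guess);
        }
    }
}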
See Importing from CSV for details.
JDBC
Java Database Connectivity is a standard that describes how data store drivers provide common Java interfaces to a wide variety of databases and other data storage mechanisms. Unlike CSV, JDBC provides data type information for columns, and the ability to filter and manipulate data using a SQL query.
In most cases it is easier to generate schema files from JDBC than from CSV because of the result set metadata that JDBC data sources can provide. The default behavior of the JDBC schema generator is to request data types of the result set columns from the data source. These data types are then mapped to Deephaven data types to provide the schema details.
For cases where the data source cannot provide column data type information, or where the source database is not ideally configured to store the data (e.g., numbers or date/times in String fields), the JDBC schema generator provides options to scan data as is done in the CSV schema generator, or to use a combination of data scanning and metadata.
See Importing from JDBC for details.
XML
XML (eXtensible Markup Language) is a container format that can contain sequential and/or hierarchical data. Like CSV, XML by itself does not provide or guarantee a particular layout to the data. Simpler data sets that use XML typically have one element per record, are not hierarchical, and use either attributes or element values to store record values. More complex XML feeds may include metadata, common data updates, and hierarchically related records. Deephaven provides "generic" XML data import capabilities that can accommodate most of the simpler forms. The more complex XML data feeds, including those that include data that must be imported to multiple tables in a single operation, require custom importers that are specific to the data format encapsulated within the XML (e.g., FIXML).
Although XML documents can include metadata, there is no standard as to how such column data types and other such information should be formatted. Therefore, for generic XML imports, Deephaven treats XML files similarly to how it handles CSV files. To find column data types, the schema generation tool analyzes all data that it finds in a sample data file. In addition, if named (rather than positional) values are used, the schema generator will also scan all importable elements to find column names that will be needed for the table.
See Importing XML Files for details.
JSON
JSON is a hierarchical format representing data objects and arrays. In order to convert this format to a Deephaven table, JSON data is recursively converted into a stream of lines equivalent to a CSV file (with one "line" per top-level JSON object). The type inference then operates in the same way as it does on CSV data.
Column names for a table inferred from JSON are derived from the field names contained in the JSON objects. Since there is no "header" in JSON, all or a large portion of the data is explored to enumerate all columns (since any one JSON object may be missing a given field). JSON array elements are converted into column names by combining the field name with a numeric suffix reflecting the position in the array.
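As a hypothetical illustration (the field names below are invented, and the exact numeric suffix used for array positions is chosen by the schema creator), a top-level record such as:
{"sym": "ABC", "bid": 1.25, "sizes": [100, 200]}
would produce columns derived from sym and bid, plus one column per array position derived from sizes.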
See Importing JSON Files for details.
Converters and Adding Custom Converters
The schema creators have a number of built-in converters to recognize various date/time formats. Unrecognized formats will result in a date/time field being treated as String. If ambiguous data is included, such that multiple converters match all the data values, the column will also be treated as String, and a warning message will be written to the log.
| Built-in Converter | Matching Data |
|---|---|
| ISO format with date only | 2017-08-30 |
| ISO UTC date/time with optional sub-second precision (no T) | 2017-08-30 11:59:59.000Z |
| ISO UTC date/time with optional sub-second precision (with T) | 2017-08-30T11:59:59.000Z |
| ISO UTC two digit offset date/time with millisecond precision | 2017-08-30T11:59:59.000+04 |
| ISO UTC four digit offset date/time with millisecond precision | 2017-08-30T11:59:59.000+0400 |
| ISO UTC two digit offset date/time with microsecond precision | 2017-08-30T11:59:59.000123+04 |
| ISO UTC four digit offset date/time with microsecond precision | 2017-08-30T11:59:59.000123+0400 |
| ISO time zone date/time with optional sub-second precision | 2017-08-30T11:59:59.000 NY |
| Exchange timestamp (will be converted to NY time) | 2017-08-30T11:59:59.000 |
| Timestamp UTC with optional sub-second precision | 2017-08-30 11:59:59 or 2017-08-30 11:59:59.123456 |
| SQL Server default format | AUG 30 2017 11:59AM |
Converters for other formats can be included by adding their details to a property file used by the schema creator process. Here is an example converter that could be added in a property file:
SchemaCreator.converter.name.1=USDateTimeSeconds1
SchemaCreator.converter.pattern.1=^((0?\\d)|(1[0-2]))\\/(([0-2]?\\d)|(3[0-1]))\\/\\d{4} (([0-1]?\\d)|(2[0-3])):[0-5]\\d:[0-5]\\d$
SchemaCreator.converter.formula.1=new DBDateTime(new java.text.SimpleDateFormat("MM/dd/yyyy HH:mm:ss").parse($TARGETNAME$).getTime()*1000000)
SchemaCreator.converter.sourceType.1=String
SchemaCreator.converter.targetType.1=DateTime
There are five components that comprise the definition for a custom converter. All five are required. Note the format of the property names: SchemaCreator.converter.<attribute name>.<unique index of the converter>. The index value must match across the five properties. The values do not have to start at 1, or be sequential, but it is important to ensure index numbers across multiple property files do not collide.
These five pieces can appear in any order and do not have to be contiguous, but all five pieces must be present if any one of them is present. If an incomplete converter entry is detected when a schema creator process starts, the process will fail and will log details about which components of the definition are missing.
- name - The name attribute is mainly for convenience in identifying a particular converter.
- pattern - The pattern is a regular expression string used to match data the converter can process. It must be very specific so this converter does not overlap any other converters, and so this formula is not given any data that it cannot process. The example regex limits the month portion of strings it will match to numbers less than 13, which helps avoid matches to non-US date strings where the day would be written before the month. When trying to find a converter to use for complex data, the schema generator will try to find one and only one converter whose pattern matches all the non-empty values from the sample data. Note that all backslashes in any property values must be escaped by doubling them. In this example, the regex will match date/time strings like this one: 01/11/1975 11:10:23
- formula - The formula contains a Java expression that will be used to convert the source data to the target data type. It must include the $TARGETNAME$ token, which will be replaced with the column name by the schema generator. In this case, a Java SimpleDateFormat is being used to parse the date/time string into a Java Date. Then the .getTime() method is used to get an epoch milliseconds value for the Date. Deephaven uses epoch nanoseconds, so this value is then multiplied by one million and passed to the DBDateTime creator method to return a DBDateTime.
- sourceType - The sourceType controls how the importer will interpret a value before passing it to the formula. It will usually be String, but if the source value was already an epoch value, for instance, then the source type might be long instead of String.
- targetType - The targetType must match the type produced by the formula. This is used by the schema generator to set the appropriate column type for columns that match the regex pattern.
ImportSource XML in a Schema File
Additional metadata to control and modify the import process is added in an ImportSource XML block of the schema. This is demonstrated below:
<Table name="CsvNames5" namespace="Import" storageType="NestedPartitionedOnDisk">
<ImportSource name="JDBC1" type="JDBC">
<ImportColumn name="FirstName" sourceName="first_name" />
<ImportColumn name="LastName" sourceName="last_name" />
<ImportColumn name="Gender" sourceName="gender" />
<ImportColumn name="Race" sourceName="race" />
<ImportColumn name="FirstNameLength" sourceName="first_name" formula="FirstNameLength.length()" sourceType="String" />
</ImportSource>
<Partitions keyFormula="__PARTITION_AUTOBALANCE_SINGLE__"/>
<Column name="Date" intradayType="none" dataType="String" columnType="Partitioning" />
<Column name="FirstName" dataType="String" columnType="Normal" />
<Column name="LastName" dataType="String"columnType="Grouping" />
<Column name="Gender" dataType="String" columnType="Normal" />
<Column name="Race" dataType="String" columnType="Normal" />
<Column name="FirstNameLength" dataType="int" columnType="Normal" />
</Table>
Multiple ImportSource blocks can be added to allow various import instructions for different sources or types of import. The type attribute is required; currently supported values are "CSV", "JDBC", "JSON", and "XML". The name attribute is optional. When the import runs, if no ImportSource name is specified, the importer will use the first ImportSource block that matches its type. If no matching block is found (for type, and, if specified, for name), then the import will attempt to auto-match columns as it would if no ImportSource blocks existed.
To support multi-partition imports, you must have an ImportColumn for the partition column indicating how to partition each row (e.g., <ImportColumn name="Date" sourceName="trade_date"/>, where trade_date is the CSV/JDBC column name).
An ImportSource block can contain multiple types of entries, including ImportColumn, Imports, maxError, arrayDelimiter and ImportTransform. These options are described below:
| Entry Type / Attribute | Description |
|---|---|
| ImportColumn | Each entry describes how to map values for one column from the source data to the target table's schema. The attributes below apply to ImportColumn entries. |
| name | The name of the target column in the Deephaven table, which must exactly match the name attribute of one of this schema's Column elements. This attribute is required; all others are optional. |
| sourceName | The name of the column in the source data whose values should be mapped into the target column specified by name. This is used for mapping from a different column name in the source table, such as when a source name is not a valid Deephaven column name. Also use this attribute on the partition column when configuring multi-partition imports. |
| sourceType | The data type of the source data. This is used when data types must be translated during the import process. If it is not specified, the target column's data type is assumed. |
| nullable | Indicates whether null values will be accepted for this column. Note: for primitive types, which do not natively support null values, the database uses reserved values (from QueryConstants) to represent null. |
| formula | Allows specification of Java code to do simple manipulations of data based on the source value itself. In the example noted earlier, the FirstNameLength column is calculated with the formula FirstNameLength.length(). |
| transform | Similar to formula, transform creates a new function during the import using the details of the ImportColumn element. Transforms create Java lambda functions that are then applied to the input value. An example of a built-in transform function is dbDateTimeFromLong. |
| default | The value to use when import or parsing of the source value fails or the source is null. The default value must be type compatible with the target column type; for boolean columns, this would be lowercase "true" or "false". The literal "null" may be used as a default, and will result in an appropriately typed null value. |
| class | The full name of a custom class to use as the field writer for the column. Custom field writer classes must derive from the source-specific FieldWriter class; see Custom FieldWriter Classes below. |
| cacheable | (CSV only) If this attribute is set to "false", caching will be disabled for this column. If this attribute is not present, or is set to "true", caching will be enabled, assuming the column meets one of the other criteria for caching (complex non-String column type, has a formula, or has a transform). Caching is useful for values that require conversion or calculation work during the import, but may not be helpful if values in the column are rarely duplicated. |
| cacheSize | (CSV only) Allows overriding the cache size for columns that will be cached. The default size is 1024 entries. |
| (transform attributes) | Other attributes as needed to control transform functions (e.g., transformPrecision for the dbDateTimeFromLong transform). |
| Imports | The packages to import. A default set of packages is imported to support built-in functions and import tasks, but additional packages can be specified to provide other functions or support custom transforms. Imports can be just the qualified package/class name, or the full import statement. Besides the internal imports used by the field writers, a set of standard imports (including java.time.* and java.time.format.DateTimeFormatter) is automatically added in the field writer context. |
| maxError | The maximum number of allowed parsing errors before failing the import. The default is 0. Allowed errors that will increment maxError are things like numeric conversions, overflows, and mismatched values (e.g., a Boolean value being set from a String column that doesn't match any known version of True or False). |
| arrayDelimiter | A string to be used as a delimiter when parsing array data. The default delimiter is a comma. |
| ImportTransform | See Custom Transforms below. The attributes below apply to ImportTransform entries. |
| name (ImportTransform) | The name of a custom transform function; this is the name sought when an ImportColumn specifies a transform. |
| Body (ImportTransform) | The Java code that will be compiled as part of the setter for an ImportColumn that uses this transform. |
Custom Transforms
Custom transforms can be added to the ImportSource block as ImportTransform elements. These elements must:
- Have a Body element that contains the text of the transform function
- Have a name attribute that matches the name used in the transform attribute of ImportColumns that use the transform
- Implement a getFunction function that returns a function matching the specification of the interface
At runtime, getFunction is passed the ImportColumnDefinition of the column that is being set, and a boolean value indicating whether strict (true) or permissive (false) data checking has been selected. The ImportColumnDefinition allows getFunction to access details like the target column name, the target column type, and any ImportColumn details by using getImportColumn(), which provides access to functions such as getAttributeValue().
The following shows an example of a custom transform element that could be specified in the schema (Note: this is actually the built-in dbDateTimeFromLong transform):
<ImportTransform name="Sample">
<Body>
getFunction(ImporterColumnDefinition col, boolean strict)
{
String precision = col.getImportColumn().getAttributeValue("transformPrecision");
switch (precision) {
case "millis":
return (long value) -> new DBDateTime(DBTimeUtils.millisToNanos(value));
case "micros":
return (long value) -> new DBDateTime(DBTimeUtils.microsToNanos(value));
case "nanos": return DBDateTime::new;
default: return null;
}
</Body>
</ImportTransform>
When compiling the transform function, the target data type and, if defined, the source data type are used to find a suitable Java functional interface with which to declare the getFunction. In the case of the example above, the target column type is DBDateTime, and the sourceType="long" attribute is added to the ImportColumn. This combination of source and target types is then used to create and execute the resultant function as a LongFunction<DBDateTime>.
A much simpler example follows:
<ImportTransform name="Sample">
<Body>
getFunction(ImporterColumnDefinition column, Boolean strict) {
return (long value) -> (2 * value);
}
</Body>
</ImportTransform>
If this is used for a column whose type is long, this will be mapped to a LongUnaryOperator, which takes a long and returns a long. If the sourceType is not specified, the importer will attempt to use the target column type for matching a function interface.
If the function being returned from getFunction has a declared return type, this return type must exactly match the target column type for which the transform will be used. For example, a transform that will provide values for a Long column and that has a declared return type must be declared Long, not long. However, the simpler, and recommended, implementation is to include only the transform body in the return from getFunction, so the system can perform implicit type conversion when compiling the transform.
Custom transforms can be added from the Schema Editor File menu using the View/Edit Schema XML option.
Primitive Types and Boxed Types in Transforms
Not every combination of primitive types is supported by the Lambda interfaces used to pass transform functions into field writers. The supported combinations for unboxed primitive types are any combination of source and target of int, double, and long.
Therefore, a transform that reads a long value from the source and transforms it to a double value in the target table would use the primitive types of long and double.
However, a transform that reads a byte value and writes a short value would not be able to use primitive types. The importer will map this transform as a Function<Byte, Short>. Note the capitalization: Byte and Short are the boxed (class-based) versions of the primitive types byte and short. This means there is extra overhead in the transform process to use the larger class-based versions of the data types.
Note: This concern with boxing of primitive types is only applicable to transforms. If the data can be implicitly converted (e.g., mapping a short to an int), or converted with a formula, then no boxing occurs. If the translation is too complex for a formula, another option for such primitive types is to write a custom FieldWriter. Since the FieldWriter classes are not interfaced through Lambda functions, they have no limitations on which primitive data types they can use for source or target.
Another thing to be aware of is that primitive types that have a "null" default will be set to values from com.illumon.iris.db.tables.utils.QueryConstants when their sources are null. While these may be boxed when they are passed into a transform, they will not be automatically converted to true null values. If special handling is needed for null values in the transform, either check for equality to the QueryConstants values, or use a boxed type for the sourceType property of the column so these values will be preserved as null.
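As a minimal sketch of the first approach (checking against QueryConstants), the following hypothetical transform doubles a long value while passing the reserved null constant through unchanged. The transform name and logic are illustrative only, and the fully qualified QueryConstants name is used in case it is not part of the default imports:

<ImportTransform name="NullAwareDoubler">
    <Body>
        getFunction(ImporterColumnDefinition column, boolean strict) {
            // long -> long is a supported primitive combination; reserved nulls are passed through unchanged
            return (long value) -> value == com.illumon.iris.db.tables.utils.QueryConstants.NULL_LONG
                    ? com.illumon.iris.db.tables.utils.QueryConstants.NULL_LONG
                    : 2 * value;
        }
    </Body>
</ImportTransform>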
Array Types
The importers support long[] and double[] array types. From the source, these are expected to be strings (either a string field from JDBC or a column value in CSV). The string is formatted as a bracketed list of values. Brackets can be square brackets [ ], parentheses ( ), or curly braces { }. The delimiter can be specified with arrayDelimiter; the default delimiter is a comma. Empty arrays [] are also supported.
An example array of double strings follows: [1.5, 6.4, 8.0, 2.2322, 88, 16.02]
Custom FieldWriter Classes
Custom FieldWriter classes can be used where a field writer needs to use values from multiple source columns, or where it is more desirable to package transform logic into a FieldWriter class rather than in an in-line transform or imported code. Custom field writers must extend the source-specific FieldWriter class (i.e., CsvFieldWriter for CSV imports or JdbcFieldWriter for JDBC imports). They must implement a specific constructor, and must also implement the processField method.
A custom field writer class project will need the following Deephaven jars added as dependencies:
IO-<version_number>.jar
Base-<version_number>.jar
Ingesters-<version_number>.jar
DataObjects-<version_number>.jar
BinaryStore-<version_number>.jar
IrisDataObjects-<version_number>.jar
An example of a custom field writer for JDBC follows:
package io.deephaven.fieldwriters;

import java.io.IOException;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.TimeZone;

import com.fishlib.io.logger.Logger;
import com.illumon.iris.binarystore.RowSetter;
import com.illumon.iris.importers.ImporterColumnDefinition;
import com.illumon.iris.importers.JdbcFieldWriter;

/**
 * Sample custom field writer class that takes FirstName and LastName and writes out FirstName + ' ' + LastName
 */
public class FullNameWriter extends JdbcFieldWriter {
    private final RowSetter setter;
    private final ResultSet rs;
    private final ImporterColumnDefinition column;

    // This is the constructor format that is required for custom JdbcFieldWriters
    public FullNameWriter(Logger log, boolean strict, ImporterColumnDefinition column,
                          final TimeZone sourceTimeZone, RowSetter setter, ResultSet rs, String delimiter) {
        super(log, column.getName(), delimiter, sourceTimeZone);
        this.setter = setter;
        this.column = column;
        this.rs = rs;
    }

    @Override
    public void processField() throws IOException {
        try {
            setter.set(rs.getString("FirstName") + " " + rs.getString("LastName"));
        } catch (SQLException e) {
            throw new RuntimeException("Failed to concatenate FirstName and LastName for target field: " + column.getName());
        }
    }
}
A custom field writer for CSV would be similar, but would not include ResultSet or TimeZone in its constructor arguments, and would require a CSVRecord as an argument to processField.
import java.io.IOException;

import com.fishlib.io.logger.Logger;
import com.illumon.iris.binarystore.RowSetter;
import com.illumon.iris.importers.CsvFieldWriter;           // package assumed to match JdbcFieldWriter above
import com.illumon.iris.importers.ImporterColumnDefinition;
import org.apache.commons.csv.CSVRecord;

public class CsvMultiColumnExample extends CsvFieldWriter {
    private final RowSetter setter;
    private final ImporterColumnDefinition column;

    // This is the constructor format that is required for custom CsvFieldWriters
    public CsvMultiColumnExample(Logger log, boolean strict, ImporterColumnDefinition column, RowSetter setter, String delimiter) {
        super(log, column.getName(), delimiter);
        this.setter = setter;
        this.column = column;
    }

    @Override
    public void processField(CSVRecord record) throws IOException {
        setter.set(record.get("FirstName") + " " + record.get("LastName"));
    }
}
Internally, XML uses CSV import processing, so the CSV custom field writer structure would also be usable for XML imports.
To use the custom field writer in the table, the class attribute of the ImportColumn is set, and an Imports entry will be needed to make the custom field writer class available to the import process:
<Imports>io.deephaven.fieldwriters.FullNameWriter</Imports>
...
<ImportColumn name="FullName" class="io.deephaven.fieldwriters.FullNameWriter" sourceName="FirstName" />
Note: Even though it is not used, sourceName is required due to a known issue in ImportColumn processing. Currently, this attribute must exist, and must indicate some other column that does exist in the data source as its value.
This can then be used to provide data for the FullName column in the table:
<Column name="FullName" dataType="String" columnType="Normal" />
Typically, the custom field writer class will be compiled and packaged in a jar that should be placed under /etc/sysconfig/illumon.d/java_lib on servers where data imports that need it will be run.
JdbcFieldWriters must extend JdbcFieldWriter and implement a constructor that accepts:
- Logger.class - logger instance passed down from main
- ImporterColumnDefinition.class - column definition of the column to which this field writer will be attached (including ImportColumn and target information)
- RowSetter.class - row setter for the target column
- ResultSet.class - JDBC result set with the current row. The field writer should only read values from result set metadata and current row data, not change the position of the result set or close it.
- String.class - delimiter value used for array parsing functions
- Boolean.class - strict value to indicate whether data checking is strict (true) or relaxed (false)
CsvFieldWriters must extend CsvFieldWriter and implement a constructor that accepts:
- Logger.class - logger instance passed down from main
- ImporterColumnDefinition.class - column definition of the column to which this field writer will be attached (including ImportColumn and target information)
- RowSetter.class - row setter for the target column
- String.class - delimiter value used for array parsing functions
- Boolean.class - strict value to indicate whether data checking is strict (true) or relaxed (false)
Built-in Functions and Transforms
Some functions and transforms are built into the importer framework and can be used when designing ImportSource entries. Transforms often require other attributes to control their behavior. Functions can be used in formula attributes; transforms are used in transform attributes.
- dbDateTimeFromLong (transform) - takes a long input (offset from Unix epoch) and returns a DBDateTime. The attribute transformPrecision should be set to "millis", "micros", or "nanos" to indicate the units of the long offset value. For example, transformPrecision="millis" means that the long value from the source is the number of milliseconds from Unix epoch. Requires sourceType="long" to match the expected input of the transform.
- new DBDateTime (function) - this is the constructor for DBDateTime objects. By default it takes a long offset of nanoseconds from Unix epoch. Additionally, DBTimeUtils.millisToNanos() and DBTimeUtils.microsToNanos() are available for converting other precisions of offsets to nanoseconds. Standard Java date and time parsing and conversion functions can be used to create a long Unix epoch offset from other formats, the result of which can then be passed to this constructor to store as a DBDateTime value. java.time.* and java.time.format.DateTimeFormatter are included in the standard imports. ZonedDateTime.parse can be used with various formats to parse strings into ZonedDateTimes, and the ZonedDateTime.toInstant().toEpochMilli() call can then be used, for example, to convert to long milliseconds to pass to this constructor (see the example after this list).
- dbDateTimeFromTimestamp (function) - takes a Java Timestamp and returns a DBDateTime. Used internally, automatically, when importing JDBC datetime types to DBDateTime destination columns.
- DBTimeUtils.convertDateTime (function) - takes a String and returns a DBDateTime. Expects a string of the form YYYY-MM-DDThh:mm:ss.nnnnnnnnn TZ. One use of this function would be to convert dates that had originally been exported to CSV from a Deephaven system.
- enumFormatter (transform) - takes an int input and returns a String, based on a lookup from an enum defined in the system columns file. The attribute transformColumnSet is the String column set name to match in the columns file. The attribute transformEnum is the String name of the enum under the column set to match in the columns file. Requires sourceType="int" to match the expected input of the transform.
- accountingInt (function) - takes a String and converts it to an int. Converts enclosing parentheses to a negative value. Removes thousands separators. Expects US number formatting (comma for thousands separator). Requires sourceType="String" to match the expected input of the function.
- accountingLong (function) - takes a String and converts it to a long. Converts enclosing parentheses to a negative value. Removes thousands separators. Expects US number formatting (comma for thousands separator). Requires sourceType="String" to match the expected input of the function.
- accountingDouble (function) - takes a String and converts it to a double. Converts enclosing parentheses to a negative value. Removes thousands separators. Expects US number formatting (comma for thousands separator). Requires sourceType="String" to match the expected input of the function.
- stripFirstCharacter (function) - takes a String and returns a String. Strips the first character, assuming the String has at least one character. No override of sourceType is needed because input and output types match.
- parseBoolean (function) - takes a String and returns a Boolean. Single characters T/F, 1/0, and t/f are converted to Boolean true or false. The strings "true" and "false" (case-insensitive) are converted to Boolean true or false. Used internally, automatically, when reading a CSV to a destination column that is Boolean. Requires sourceType="String" for explicit use in custom formulae.
- getLongArray (function) - takes a String and returns an array of long values. Two arguments: the first is the String of array values, the second is a String delimiter used to parse the values. The input string is expected to be enclosed, normally by square brackets. Only single-dimensional lists (arrays) are supported; no matrices. Used internally, automatically, when importing to a long[] column. Requires sourceType="String" for explicit use in custom formulae.
- getDoubleArray (function) - takes a String and returns an array of double values. Two arguments: the first is the String of array values, the second is a String delimiter used to parse the values. The input string is expected to be enclosed, normally by square brackets. Only single-dimensional lists (arrays) are supported; no matrices. Used internally, automatically, when importing to a double[] column. Requires sourceType="String" for explicit use in custom formulae.
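As an illustration of combining the date/time helpers above, a hypothetical ImportColumn (the column and source names are invented) might parse an ISO-8601 string into a column declared with dataType="DateTime" using a formula like the following; note that the target column name refers to the incoming value, as in the FirstNameLength example earlier:

<ImportColumn name="Timestamp" sourceName="event_time" sourceType="String"
    formula="new DBDateTime(ZonedDateTime.parse(Timestamp, DateTimeFormatter.ISO_ZONED_DATE_TIME).toInstant().toEpochMilli() * 1000000)" />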
CopyTable Schemas
One table layout may be used for multiple tables. When this is required, it is not necessary to replicate the entire source schema definition for each new table. Instead, a schema definition can use the CopyTable element instead of the Table element, referring to another source table for its definition. The CopyTable element then uses the source table's schema for all definitions, including columns, imports, loggers, and listeners.
The schema file naming convention of <namespace>.<table>.schema also applies to CopyTable schema files. The namespace and table in the CopyTable schema filename should be those of the duplicate table, not the source table.
A CopyTable definition contains the following attributes:
- namespace - the duplicate table's namespace
- sourceNamespace - the source table's namespace
- name - the duplicate table's name
- sourceName - the source table's name
Note: the previous CopyTable syntax of namespaceSource and nameSource has been deprecated.
For example, the following would use the definition of the table ExampleTable from the namespace ExampleNamespace for a table called ExampleTableCopy in the namespace CopyTableNamespace:
<CopyTable namespace="CopyTableNamespace" sourceNamespace="ExampleNamespace" name="ExampleTableCopy" sourceName="ExampleTable" />
A duplicate table functions in the same manner as any other Deephaven table. Data can be imported through the importer classes or real-time ingestion, partitions will be generated in the same ways, intraday data can be merged to the historical database, and queries will work as expected.
Loggers and listeners are not generated for CopyTable definitions, as the loggers and listeners from the source table will be used for the duplicate table.
Note: Chained CopyTable definitions are not supported at this time; a CopyTable must refer to a fully defined Table schema.
Managing Schema Files
Once schemas have been defined or changed, they must be made visible to Deephaven (deployed), and if any custom loggers and listeners are defined, these must be generated.
The logger/listener generation can be done from deployed schemas, or from another location prior to schema deployment so the generated loggers can be integrated with the customer application before being deployed to Deephaven.
Two scripts are provided for managing schema files, and are described below:
- dhconfig - both deploys schemas to a Deephaven database, making their described tables available for use, and exports schemas from etcd to a specified directory. See: Configuration Tools.
- generate_loggers_listeners - generates the logger and listener classes from schemas. For more information on loggers and listeners, see Streaming Data.
These scripts can be used during the initial setup of a new Deephaven system, as well as to update configuration when schemas are added or modified. Rerunning the scripts is safe, even if nothing has changed since the last time they were run.
Schema Service Types
Deephaven installations use etcd to store and access schema files; this allows schema files to be easily distributed to the processes that need them.
Some standalone processes such as standalone logging programs not running on a Deephaven server will not use the etcd installation to access schema files. These programs will have access to the schema files locally, and are not discussed here.
Schema Locations and Properties
On a Deephaven server, editing a schema file directly has no effect on the system. Edited schema files must be deployed before the system will use them. See: Deploying Schemas to Deephaven.
The following property controls aspects of schema behavior:
- SchemaConfig.defaultListenerPackagePrefix - defines the default listener package prefix used if one is not defined in a schema. Any generated listeners will use this value followed by the namespace converted to lowercase if they don't have their own package defined in their schema file. The default value is: SchemaConfig.defaultListenerPackagePrefix=com.illumon.iris.intraday.gen
Deploying Schemas to Deephaven
Once schemas are defined, or after they are changed, they must be made available to the Deephaven database. This is done using the dhconfig script. The script uses the parameters defined at the command line to find the schemas. The Schema Editor also has an option to deploy schemas.
Until this step is done, schema changes are not visible to Deephaven.
Use the following to call the schema deployment script (proper authentication is required):
dhconfig schemas import <options>
As it runs, the script shows details about how Java was started and the property files read. Following this, output similar to the following should appear. This indicates all of the tables that were found and deployed.
For example, on most Deephaven installations, the following command will redeploy the LearnDeephaven schemas:
/usr/illumon/latest/bin/dhconfig schemas import --directory /etc/sysconfig/illumon.d/schema/LearnDeephaven <auth options>
This will result in the following output:
Updating System.LearnDeephaven.EODTrades
Updating System.LearnDeephaven.StockQuotes
Updating System.LearnDeephaven.StockTrades
The following options are available for use with dhconfig schemas import:
| Option | Description |
|---|---|
| --directory | Specify the directory from which schema files will be read. |
|  | Ignore any errors and continue to attempt to deploy schema files. If this option is not specified, then the tool stops as soon as any error occurs. |
|  | Specify a schema or table definition file to import. This option can be specified multiple times to process multiple files at once. This must be the full path, not a relative one. |
| --overwrite | Overwrite existing schemas when importing; overwrite files when exporting. |
|  | Prints help for a schemas command. |
|  | Specify a private key file to use for authentication. |
| --namespace | Specify a namespace to process. If this is specified, then only files for the specified namespace will be imported; to process multiple namespaces at one time, this argument can be specified more than once. If this is not specified, all namespaces will be processed. |
|  | Specify a password for the given user. |
|  | Specify the namespace set to import. This allows the processing of either System or User namespace files. If it is not specified, both System and User namespace set files will be processed. |
|  | Attempt to update all schemas. If this is not specified, then existing schemas will be skipped. |
|  | Print the full text of exceptions to stderr. |
Migration Mode
When running the legacy deploy_schema tool in migration mode, if directories or files are not specified, the tool will instead use several properties to locate and process schema files. This is intended to be run one time, to import schema files from their legacy locations into etcd. Once this migration is complete, these properties will no longer be used by Deephaven to find schema files.
These legacy schema location properties include a list of locations (directories or JAR files) in which schemas can be found, delimited by semicolons or spaces. When looking for schema files, Deephaven (including the deployment and logger/listener generation scripts) will search all subdirectories of the specified locations for files with the .schema extension, as well as looking within JAR files in those directories and subdirectories for .schema files.
The schema location properties allow the following substitutions:
- <devroot> - refers to the installation directory of the Deephaven software, usually /usr/illumon/latest.
- <workspace> - refers to the workspace of the process.
The following properties specify the locations of schema files during migration mode:
- SchemaConfig.allowedNamespaces - If this property is set when deploying schemas, then schema files will be imported only for the namespaces specified, and all specified namespaces will be created even if no schema files are found. The value is a comma-delimited list of namespaces.
- The SchemaConfig.resourcePath is a multi-valued property prefix which tells the system where to find schema files. You may add any suffix to the property name to add new paths without overwriting any other configured SchemaConfig.resourcePath values. You may add more than one directory by using a semicolon separator; you may also use wildcard matching with an asterisk * (note that when using wildcards, your paths must end with a trailing slash). A hypothetical example of adding another location follows this list. The standard entries are:
  - SchemaConfig.resourcePath.irisInternal - The configured schema directory for internal Deephaven schemas.
  - SchemaConfig.resourcePath.customer - By default, this path is the customer's schema directory: /etc/sysconfig/illumon.d/schema. If schemas will be placed in additional locations, you may freely add to this property, or create some other suffix with the additional locations.
  - SchemaConfig.resourcePath.plugins - By default, this is a list of schemas for Deephaven plugins. If new plugins are added which have schema files, you may freely add to this property, or create some other suffix for the new plugin schema locations. If this property does not exist (on older installations), you should add it with a value of /etc/sysconfig/illumon.d/plugins/*/schema/ (note the trailing slash).
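For instance, a hypothetical additional entry (the property suffix and the second directory below are placeholders, not defaults) could add another schema location alongside the standard customer directory:

SchemaConfig.resourcePath.extraSchemas=/etc/sysconfig/illumon.d/schema;/db/TempFiles/extra-schema/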
Deploying Schemas During Intraday Data Ingestion
Schema changes made while intraday data is being written to a table (for instance, through the tailer and data import server) will require special attention as Deephaven does not automatically handle schema changes while data is being ingested. This can be accomplished using the following procedure.
- First, ensure the tailer is not sending data for the changed table by examining its log file (usually /var/log/deephaven/tailer/LogtailerMain1.log.current), and stop the tailer.
- Move or rename any existing binary log files for that table for the day so they won't be found by the tailer. For example, if the UpdatePerformanceLog schema was being changed:
  sudo mkdir /db/TempFiles/save
  sudo mv /var/log/deephaven/binlogs/perflogs/DbInternal.UpdatePerformanceLog*.bin.* /db/TempFiles/save
- Assuming the table's internal partition is determined by the tailer configuration, update the tailer's XML configuration file and change (or add) the internalPartitionSuffix attribute for the changed table so that new intraday data will be written to a different location. If the internal partition is instead determined by the internalPartitionRegex attribute, update the logger to ensure it's logging with a different filename. For example: internalPartitionSuffix="TableVersion2"
- If the new schema includes listener changes, the Data Import Server will need to be restarted so it can use the new listener classes.
- Finally, once the new schema is deployed, the tailer can be restarted.
Note that queries may not find the previously written intraday data for the day after this procedure is complete, but merges will find and write it correctly.
Exporting Schemas From Deephaven
The dhconfig tool exports schemas from etcd to a specified directory as ".schema" files, named <namespace>.<table name>.schema. These files can then be examined and edited, and re-imported with the dhconfig script. It needs to be run from an account with access to etcd, such as dbquery, or an exception will occur indicating that it cannot access an etcd directory. It also has an option to simply list schemas rather than export them.
The following options are available:
| Option | Description |
|---|---|
| --directory | Specify the directory to which exported schema files will be written. |
|  | Instead of exporting the schemas, list them. |
| --namespace | Specify the namespace to export or list. This argument can be specified multiple times to export or list multiple namespaces at once. If it is not specified, all namespaces will be exported or listed. |
| --overwrite | If specified, existing schema files will be overwritten; otherwise a schema will not be exported if its schema file already exists in the specified directory. This argument is ignored when only listing schemas. |
|  | Specify the namespace set to export (System or User). If it is not specified, then both System and User schemas are exported. |
For example, if the user is logged in as dbquery on most Deephaven installations, the following command will export the LearnDeephaven schema files to the directory /tmp/export, overwriting any files that are already there.
/usr/illumon/latest/bin/dhconfig schemas export --directory /tmp/export --namespace LearnDeephaven --overwrite