Tables & Schemas
Overview
A Deephaven table schema is an XML document that describes a table. A schema must include a table's namespace, name, column names and data types. In addition to specifying the structure of the data, a schema can include directives controlling how data is imported and stored.
Schema files can be created by directly constructing XML files (e.g., using a text editor), or can be generated using built-in tools. In most cases, the easiest way to create or edit a schema file is by using the Deephaven Schema Editor. The Schema Editor itself can be used to create a new schema from scratch, edit an existing schema, or derive a new schema from example data. This tool is covered in detail in the Schema Editor section.
If a schema is created manually, this is typically done by picking an existing schema file to use as a starting point, and then copying, pasting, and modifying it to create a suitable layout for the new table. This does not require any special tools, but it can be tedious for complex tables, or when many tables are needed. To simplify this process, Deephaven includes tools that can analyze data sets and infer column details and import instructions, and that can also generate and validate sections and entries within schema files. The schema creation tools make schema creation faster and easier, and help avoid issues related to formatting and syntax. These tools are described briefly in the section on Schema Inference, and in detail in the chapter covering each specific data type (CSV, XML, etc.).
Once a schema is defined, it must be deployed to take effect. If the table is intended for use by streaming data, an additional step is required to generate loggers and listeners. This is discussed further in Deploying Schemas to Deephaven.
This part of the documentation covers general schema requirements and features. Additional optional sections of schema are detailed in Importing Batch Data and Streaming Data.
Basic Table Attributes
Metadata is imported to the database using XML-based schema files that contain definitions like the following examples:
<Table namespace="Ise" name="Quotes" storageType="NestedPartitionedOnDisk" defaultMergeFormat="DeephavenV1">
<Partitions keyFormula="${autobalance_by_first_grouping_column}"/>
<Column name="Date" dataType="String" columnType="Partitioning" />
<Column name="Ticker" dataType="String" columnType="Grouping" />
<Column name="Size" dataType="int" columnType="Normal" />
<Column name="Price" dataType="double" columnType="Normal" />
</Table>
<Table namespace="Korea" name="Securities" storageType="SplayedOnDisk">
<Column name="UnderlyingId" dataType="int" columnType="Grouping" />
<Column name="Id" dataType="int" columnType="Normal" />
<Column name="Ticker" dataType="String" columnType="Normal" />
<Column name="SecurityType" dataType="String" columnType="Normal" />
<Column name="Expiration" dataType="DateTime" columnType="Normal" />
</Table>
Each schema file contains definitions for exactly one table, and must follow the naming convention <namespace>.<table>.schema. For example, the schemas above would be named Ise.Quotes.schema and Korea.Securities.schema.
The following three attributes are necessary for all Table elements:

| Attribute | Description |
|---|---|
| namespace | The system namespace to which this table belongs (e.g., Ise or Korea). |
| name | The name of this table (e.g., Quotes or Securities). |
| storageType | Specifies the way data for this table will be organized on disk. Options include NestedPartitionedOnDisk (for partitioned tables) and SplayedOnDisk, as shown in the examples above. |
| defaultMergeFormat | Specifies the format in which merged (historical) data will be written: DeephavenV1 (as in the example above) or Parquet. |
Merge Attributes
When merging data to historical partitions, Deephaven supports writing to traditional Deephaven files as well as writing to Parquet files. When writing to Parquet, a default compression codec can be chosen by adding a MergeAttributes element with an appropriate Parquet-supported codec. This codec can be changed by specifying a different one in the merge operation. If no codec is chosen, then the SNAPPY codec is used. This element does not apply to merges in Deephaven format. See Parquet Codecs for further details.
For example, to use the uncompressed codec:
<MergeAttributes format="Parquet" codec="UNCOMPRESSED" />
Columns
The <Column> elements contain the detailed information to define each table's columns, including the column's name, data type, and role as a partitioning or grouping column. At minimum, these elements must contain the following attributes.
| Attribute | Description |
|---|---|
| name | Required – The name of the column as it will be stored in the database. |
| dataType | Required – The type of the data stored in the column (e.g., String, int, long, double, DateTime, StringSet, or an array type such as byte[]). Note: Deephaven also supports other friendly names for some types; for instance, Integer will be automatically treated as int. This treatment does not apply, however, to array types: an array of 32-bit values must be declared as int[]. |
| columnType | Optional – The role of the column in the table: Normal (the default), Grouping, or Partitioning. See Partition Key Formulas for details on the function of Partitioning and Grouping columns. |
| objectCodec | Optional – A Java class which controls how the column is stored on disk. See Column Codecs below for details on the provided codecs. |
| objectCodecArguments | Optional – Argument(s) provided to the objectCodec. |
| symbolTable | Optional – For String columns only, this specifies whether or not a symbol table will be used to store the data. If it is not provided, a system default is used. |
| encoding | Optional – For String columns only, this determines the encoding used to store the value. If it is not provided, a default encoding is used. |
Column Codecs
Each codec is mapped to a single column data type and may be used only with that type. The codec does not affect how the data appears in the Deephaven console. Custom codecs may be provided on the Deephaven classpath by implementing the com.illumon.util.codec.ObjectCodec<T> interface. The following codecs are provided with Deephaven. For columns of these types, it is much more space efficient to use one of these codecs than to rely on default Java serialization.
| Class Name | Column Type | Arguments |
|---|---|---|
| com.illumon.util.codec.BigDecimalCodec | java.math.BigDecimal | The precision argument specifies the maximum number of decimal digits that may be stored. The scale argument specifies the maximum number of digits after the decimal point. If precision is specified but scale is not, a scale of zero is inferred. If precision is specified, a space-efficient fixed-size encoding is used; otherwise, a variable-size encoding is used, with a maximum precision of 1000. Null values are permitted. |
| com.illumon.util.codec.BigIntegerCodec | java.math.BigInteger | If specified, precision is used to specify the maximum number of decimal digits storable by this column. This codec uses the same encoding as BigDecimalCodec with a scale of zero. Null values are permitted. |
| com.illumon.util.codec.ByteArrayCodec | byte[] | If size is specified, values are encoded as fixed-size values; if not, a variable-size encoding is used. A flag indicating nullability may also be specified (defaults to nullable). Specifying notnull disallows null values, as in the objectCodecArguments="32,notnull" example below. |
The following is an example of a simple schema; additional elements will be added in later sections.
<Table name="TestTable" namespace="TestNamespace" storageType="NestedPartitionedOnDisk">
<Partitions keyFormula="${autobalance_single}"/>
<Column name="Partition" dataType="String" columnType="Partitioning" />
<Column name="Alpha" dataType="String" columnType="Grouping"/>
<Column name="Bravo" dataType="int" />
<Column name="Charlie" dataType="double" />
<Column name="Delta" dataType="double" />
<Column name="Echo" dataType="StringSet" />
<Column name="Foxtrot" dataType="java.math.BigDecimal" objectCodec="com.illumon.util.codec.BigDecimalCodec" objectCodecArguments="13,7"/>
<Column name="Golf" dataType="java.math.BigInteger" objectCodec="com.illumon.util.codec.BigIntegerCodec" objectCodecArguments="13"/>
<Column name="Hotel" dataType="byte[]" objectCode="com.illumon.util.codec.ByteArrayCodec" objectCodecArguments="32,notnull"/>
</Table>
Partitions
Deephaven has two distinct uses of the term "partition" as it applies to schemas. Both are required for partitioned tables (i.e., those using the NestedPartitionedOnDisk storage type).
- A partitioning column is a String column that must contain an identifiable value for each set of data appended to the table. This is required for all partitioned tables.
- The historical storage partitions, along with the partitioning key formula, are used to distribute table data across multiple storage locations as the data is merged into the historical database. This is required for all partitioned tables.
Typically, some form of date is used for the partitioning column value, but any string that meaningfully partitions the data can be used. Table data is associated together on disk by the partitioning column value, so queries will be more efficient if the partitioning value is a criterion commonly used for including or excluding data in a query.
Partitioning columns are specified by using the columnType="Partitioning" attribute in a column definition. Currently, partitioned tables must have exactly one such column. Further information on column definitions is provided below.
When writing merged historical table data to disk, Deephaven will place partitions of the input data in a directory identified by the namespace, storage partition, column partition, and table name.
Partition Key Formulas
Historical storage partitions correspond to directories available for historical storage on the disk; for scaling and performance benefits, these are normally mapped to different physical storage devices, whether local or network-connected.
Historical storage partitions are specified using the Partitions element, which uses the keyFormula attribute to map table rows to the available storage partitions. Specifically, this formula is used by merge operations in an updateView operation, the result of which is used to assign rows to a storage partition.
The following QueryScope parameters are available as input to the formula, in addition to the columns of the table:
| Parameter | Description |
|---|---|
| __WRITABLE_PARTITIONS__ | An array of partition name strings corresponding to the writable storage partition directories found on disk. |
| __NUM_WRITABLE_PARTITIONS__ | The number of available partitions (i.e., the length of __WRITABLE_PARTITIONS__). |
| __NAMESPACE__ | A string representing the namespace. |
| __TABLE_NAME__ | A string representing the table name. |
| __NS_TN__ | A string representing the namespace and table name, joined with a slash as a separator (e.g., Ise/Quotes). |
| __PARTITIONING_COLUMN_VALUE__ | The partitioning column value for this merge operation. |
| __DAY_OF_YEAR__ | The day of the year; only available when the partitioning column value can be parsed as a date. |
| __PARTITION_AUTOBALANCE_SINGLE__ | A complete formula that automatically chooses a single partition for all of the data in a merge operation. |
In addition, the following substitutions will be applied, in order to make writing appropriate partitioning formulas easier:
| Token | Substitution Description |
|---|---|
| ${autobalance_single} | A complete formula that chooses a single partition for all input data, attempting to achieve balanced partition usage over long time horizons. This is the preferred way to write a formula equivalent to the __PARTITION_AUTOBALANCE_SINGLE__ parameter described above. |
| ${autobalance_by_first_grouping_column} | A complete formula that uses the first grouping column (if there is one) in order to spread data across all available partitions, attempting to achieve balanced partition usage over long time horizons. Assuming the first grouping column is the most significant, this keeps all grouped data together. Falls back to the same behavior as ${autobalance_single} if there are no grouping columns. |
| ${autobalance_by_all_grouping_columns} | A complete formula that uses all grouping columns (if there are any) in order to spread data across all available partitions, attempting to achieve balanced partition usage over long time horizons. This effectively spreads grouped data across multiple partitions, which may be useful for queries. Falls back to the same behavior as ${autobalance_single} if there are no grouping columns. |
| ${first_grouping_column} | A string representing the name of the first grouping column (i.e., the first column definition with the attribute columnType="Grouping"). |
| ${all_grouping_columns} | An array of strings representing the names of all grouping columns (i.e., the column definitions with the attribute columnType="Grouping"). |
Conventions and Recommendations
Some common conventions follow:
- Large tables: Tables with large amounts of data per column partition typically try to balance across all available storage partitions while keeping related data together, often using a grouping column as part of the formula. The easiest way to achieve this is to use the following formula, relying on token substitution:
  ${autobalance_by_first_grouping_column}
  For an expanded example, the Ise/Quotes table above might use the following formula (shown in schema XML just after this list):
  __WRITABLE_PARTITIONS__[abs(Objects.hash(__NS_TN__, Ticker) + __DAY_OF_YEAR__) % __NUM_WRITABLE_PARTITIONS__]
  This serves to keep data with the same value for the Ticker column together, while spreading data across all available partitions and taking the day of year, namespace, and table name into account in order to approximate long-term load balancing.
- Small tables: Tables with small amounts of data per column partition often simply use ${autobalance_single}, keeping data in a relatively compact format while utilizing all historical partitions over longer time horizons in order to approximate load balancing.
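In schema XML, the expanded Ise/Quotes formula above would be placed directly in the table's Partitions element, for example:
<Partitions keyFormula="__WRITABLE_PARTITIONS__[abs(Objects.hash(__NS_TN__, Ticker) + __DAY_OF_YEAR__) % __NUM_WRITABLE_PARTITIONS__]"/>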
Schema Inference
Schema inference from source data can be performed from the command line or through the Schema Editor interface. The schema generator tools and the command lines used to invoke them are listed below. You may use iris_exec to invoke any of the schema creators without arguments to display the available options. (Note: To learn more about iris_exec commands, please refer to Running Local Scripts.) The arguments are also described in detail in the section covering each specific data type (CSV/JDBC/XML/JSON).
Command Line Interface
iris_exec csv_schema_creator <launch args> -- <schema creator args>
iris_exec jdbc_schema_creator <launch args> -- <schema creator args>
iris_exec xml_schema_creator <launch args> -- <schema creator args>
iris_exec json_schema_creator <launch args> -- <schema creator args>
The table schema files created from these processes also support merge operations, and allow specification of a grouping column and/or the partitioning column name. After a schema file is generated, it is highly recommended the user review the schema and the log file, and update details as needed. You must deploy your schema and generate loggers/listeners. See below for more details on the contents of a generated schema file and the types of changes that might be needed.
CSV
CSV refers to a variety of text file data formats. Although they are called "comma-separated", the delimiter could in fact be some other character (such as a tab or semicolon). Other details of formatting are similarly flexible. There is no single official standard for CSV formatting, so Deephaven provides flexibility when defining how to interpret and transform text data from CSV files into structured storage in Deephaven tables.
Because CSV is a text file format that provides no metadata, the schema generation tool must analyze the data that it finds in a sample data file. Typically column names are derived from column headers, but there are options for CSV files with and without headers. The schema generation process can be simplified and accelerated somewhat by using a sample of data, as long as there are enough rows to clearly demonstrate the formats and ranges of data values that will be included in normal imports.
For non-String columns, the schema generator will scan all the values in the supplied CSV file to ensure the inferred data type remains valid for all values in the column. As an example, the first value for a column might be 4. This would cause the schema generator to tentatively categorize this column as an integer type (short or long, depending on settings). If a 5.5 was encountered in a subsequent row, the generator would "promote" the column to a floating point datatype. If another value in the column included non-numeric characters, such as XYZ, the generator would then mark the column as a String column. Once a column has been designated as a String column, no other checking is done, as Strings are always valid to accept any data from a CSV.
If a column is empty for all rows, it is marked as a String and a warning message is written to the log. In this case, the administrator should examine the generated schema and validate the generated XML before adding it to Deephaven. Date/time values are also a special case, where the schema generator will attempt to match the format of the date/time string to a known format for which it can provide a conversion formula. Like the earlier example with numbers and Strings, if a date/time column has multiple different formats in it, it will end up being marked as a String column, since no single formula could be found that matched all of the column's data.
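The promotion logic can be pictured as a simple per-column state machine. The following minimal Java sketch illustrates the idea; it is not the actual schema generator code, and the real tool distinguishes more types (such as date/times and different integer widths):

public class ColumnTypeInference {
    enum InferredType { INTEGER, FLOATING_POINT, STRING }

    // Promote the current type guess based on one more observed value; once STRING, always STRING.
    static InferredType promote(InferredType current, String value) {
        if (current == InferredType.STRING || value == null || value.isEmpty()) {
            return current;
        }
        try {
            Long.parseLong(value);
            return current;                         // value fits the current numeric guess
        } catch (NumberFormatException notInteger) {
            try {
                Double.parseDouble(value);
                return InferredType.FLOATING_POINT; // promote integer guess to floating point
            } catch (NumberFormatException notNumeric) {
                return InferredType.STRING;         // non-numeric data forces a String column
            }
        }
    }

    public static void main(String[] args) {
        InferredType guess = InferredType.INTEGER;
        for (String value : new String[]{"4", "5.5", "XYZ"}) {
            guess = promote(guess, value);
            System.out.println(value + " -> " + guess);
        }
    }
}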
See Importing from CSV for details.
JDBC
Java Database Connectivity is a standard that describes how data store drivers provide common Java interfaces to a wide variety of databases and other data storage mechanisms. Unlike CSV, JDBC provides data type information for columns, and the ability to filter and manipulate data using a SQL query.
In most cases it is easier to generate schema files from JDBC than from CSV because of the result set metadata that JDBC data sources can provide. The default behavior of the JDBC schema generator is to request data types of the result set columns from the data source. These data types are then mapped to Deephaven data types to provide the schema details.
For cases where the data source cannot provide column data type information, or where the source database is not ideally configured to store the data (e.g., numbers or date/times in String fields), the JDBC schema generator provides options to scan data as is done in the CSV schema generator, or to use a combination of data scanning and metadata.
See Importing from JDBC for details.
XML
XML (eXtensible Markup Language) is a container format that can contain sequential and/or hierarchical data. Like CSV, XML by itself does not provide or guarantee a particular layout to the data. Simpler data sets that use XML typically have one element per record, are not hierarchical, and use either attributes or element values to store record values. More complex XML feeds may include metadata, common data updates, and hierarchically related records. Deephaven provides "generic" XML data import capabilities that can accommodate most of the simpler forms. The more complex XML data feeds, including those that include data that must be imported to multiple tables in a single operation, require custom importers that are specific to the data format encapsulated within the XML (e.g., FIXML).
Although XML documents can include metadata, there is no standard as to how such column data types and other such information should be formatted. Therefore, for generic XML imports, Deephaven treats XML files similarly to how it handles CSV files. To find column data types, the schema generation tool analyzes all data that it finds in a sample data file. In addition, if named (rather than positional) values are used, the schema generator will also scan all importable elements to find column names that will be needed for the table.
See Importing XML Files for details.
JSON
JSON is a hierarchical format representing data objects and arrays. In order to convert this format to a Deephaven table, JSON data is recursively converted into a stream of lines equivalent to a CSV file (with one "line" per top-level JSON object). The type inference then operates in the same way as it does on CSV data.
Column names for a table inferred from JSON are derived from the field names contained in the JSON objects. Since there is no "header" in JSON, all or a large portion of the data is explored to enumerate all columns (since any one JSON object may be missing a given field). JSON array elements are converted into column names by combining the field name with a numeric suffix reflecting the position in the array.
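As a hypothetical illustration (the field names below are invented, and the exact numeric suffix used for array positions is chosen by the schema creator), a top-level record such as:
{"sym": "ABC", "bid": 1.25, "sizes": [100, 200]}
would produce columns derived from sym and bid, plus one column per array position derived from sizes.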
See Importing JSON Files for details.
Converters and Adding Custom Converters
The schema creators have a number of built-in converters to recognize various date/time formats. Unrecognized formats will result in a date/time field being treated as String. If ambiguous data is included, such that multiple converters match all the data values, the column will also be treated as String, and a warning message will be written to the log.
| Built-in Converter | Matching Data |
|---|---|
| ISO format with date only | 2017-08-30 |
| ISO UTC date/time with optional sub-second precision (no T) | 2017-08-30 11:59:59.000Z |
| ISO UTC date/time with optional sub-second precision (with T) | 2017-08-30T11:59:59.000Z |
| ISO UTC two digit offset date/time with millisecond precision | 2017-08-30T11:59:59.000+04 |
| ISO UTC four digit offset date/time with millisecond precision | 2017-08-30T11:59:59.000+0400 |
| ISO UTC two digit offset date/time with microsecond precision | 2017-08-30T11:59:59.000123+04 |
| ISO UTC four digit offset date/time with microsecond precision | 2017-08-30T11:59:59.000123+0400 |
| ISO time zone date/time with optional sub-second precision | 2017-08-30T11:59:59.000 NY |
| Exchange timestamp (will be converted to NY time) | 2017-08-30T11:59:59.000 |
| Timestamp UTC with optional sub-second precision | 2017-08-30 11:59:59 or 2017-08-30 11:59:59.123456 |
| SQL Server default format | AUG 30 2017 11:59AM |
Converters for other formats can be included by adding their details to a property file used by the schema creator process. Here is an example converter that could be added in a property file:
SchemaCreator.converter.name.1=USDateTimeSeconds1
SchemaCreator.converter.pattern.1=^((0?\\d)|(1[0-2]))\\/(([0-2]?\\d)|(3[0-1]))\\/\\d{4} (([0-1]?\\d)|(2[0-3])):[0-5]\\d:[0-5]\\d$
SchemaCreator.converter.formula.1=new DBDateTime(new java.text.SimpleDateFormat("MM/dd/yyyy HH:mm:ss").parse($TARGETNAME$).getTime()*1000000)
SchemaCreator.converter.sourceType.1=String
SchemaCreator.converter.targetType.1=DateTime
There are five components that comprise the definition for a custom converter. All five are required. Note the format of the property names: SchemaCreator.converter.<attribute name>.<unique index of the converter>. The index value must match across the five properties. The values do not have to start at 1, or be sequential, but it is important to ensure index numbers across multiple property files do not collide.
These five pieces can appear in any order and do not have to be contiguous, but all five pieces must be present if any one of them is present. If an incomplete converter entry is detected when a schema creator process starts, the process will fail and will log details about which components of the definition are missing.
- name - The name attribute is mainly for convenience in identifying a particular converter.
- pattern - The pattern is a regular expression string used to match data the converter can process. It must be very specific so this converter does not overlap any other converters, and so this formula is not given any data that it cannot process. The example regex limits the month portion of strings it will match to numbers less than 13, which helps avoid matches to non-US date strings where the day would be written before the month. When trying to find a converter to use for complex data, the schema generator will try to find one and only one converter whose pattern matches all the non-empty values from the sample data. Note that all backslashes in any property values must be escaped by doubling them. In this example, the regex will match date/time strings like this one: 01/11/1975 11:10:23
- formula - The formula contains a Java expression that will be used to convert the source data to the target data type. It must include the $TARGETNAME$ token, which will be replaced with the column name by the schema generator. In this case, a Java SimpleDateFormat is being used to parse the date/time string into a Java Date. Then the .getTime() method is used to get an epoch milliseconds value for the Date. Deephaven uses epoch nanoseconds, so this value is then multiplied by one million and passed to the DBDateTime creator method to return a DBDateTime.
- sourceType - The sourceType controls how the importer will interpret a value before passing it to the formula. It will usually be String, but if the source value was already an epoch value, for instance, then the source type might be long instead of String.
- targetType - The targetType must match the type produced by the formula. This is used by the schema generator to set the appropriate column type for columns that match the regex pattern.
ImportSource XML in a Schema File
Additional metadata to control and modify the import process is added in an ImportSource XML block of the schema. This is demonstrated below:
<Table name="CsvNames5" namespace="Import" storageType="NestedPartitionedOnDisk">
<ImportSource name="JDBC1" type="JDBC">
<ImportColumn name="FirstName" sourceName="first_name" />
<ImportColumn name="LastName" sourceName="last_name" />
<ImportColumn name="Gender" sourceName="gender" />
<ImportColumn name="Race" sourceName="race" />
<ImportColumn name="FirstNameLength" sourceName="first_name" formula="FirstNameLength.length()" sourceType="String" />
</ImportSource>
<Partitions keyFormula="__PARTITION_AUTOBALANCE_SINGLE__"/>
<Column name="Date" intradayType="none" dataType="String" columnType="Partitioning" />
<Column name="FirstName" dataType="String" columnType="Normal" />
<Column name="LastName" dataType="String"columnType="Grouping" />
<Column name="Gender" dataType="String" columnType="Normal" />
<Column name="Race" dataType="String" columnType="Normal" />
<Column name="FirstNameLength" dataType="int" columnType="Normal" />
</Table>
Multiple ImportSource blocks can be added to allow various import instructions for different sources or types of import. The type attribute is required; currently supported values are "CSV", "JDBC", "JSON", and "XML". The name attribute is optional. When the import runs, if no ImportSource name is specified, the importer will use the first ImportSource block that matches its type. If no matching block is found (for type, and, if specified, for name), then the import will attempt to auto-match columns as it would if no ImportSource blocks existed.
To support multi-partition imports, you must have an ImportColumn for the partition column indicating how to partition each row (e.g., <ImportColumn name="Date" sourceName="trade_date"/>, where trade_date is the CSV/JDBC column name).
An ImportSource block can contain multiple types of entries, including ImportColumn, Imports, maxError, arrayDelimiter and ImportTransform. These options are described below:
| Entry Type / Attribute | Description |
|---|---|
| ImportColumn | Each entry describes how to map values for one column from the source data to the target table's schema. The attributes below apply to ImportColumn entries. |
| name | The name of the target column in the Deephaven table, which must exactly match the name attribute of one of this schema's Column elements. This attribute is required; all others are optional. |
| sourceName | The name of the column in the source data whose values should be mapped into the target column specified by name. This is used for mapping from a different column name in the source table, such as when a source name is not a valid Deephaven column name. Also use this attribute on the partition column when configuring multi-partition imports. |
| sourceType | The data type of the source data. This is used when data types must be translated during the import process. If it is not specified, the target column's data type is assumed. |
| nullable | Indicates whether null values will be accepted for this column. Note: for primitive types, which do not natively support null values, the database uses reserved values (from QueryConstants) to represent null. |
| formula | Allows specification of Java code to do simple manipulations of data based on the source value itself. In the example noted earlier, the FirstNameLength column is calculated with the formula FirstNameLength.length(). |
| transform | Similar to formula, transform creates a new function during the import using the details of the ImportColumn element. Transforms create Java lambda functions that are then applied to the input value. An example of a built-in transform function is dbDateTimeFromLong. |
| default | The value to use when import or parsing of the source value fails or the source is null. The default value must be type compatible with the target column type; for boolean columns, this would be lowercase "true" or "false". The literal "null" may be used as a default, and will result in an appropriately typed null value. |
| class | The full name of a custom class to use as the field writer for the column. Custom field writer classes must derive from the source-specific FieldWriter class; see Custom FieldWriter Classes below. |
| cacheable | (CSV only) If this attribute is set to "false", caching will be disabled for this column. If this attribute is not present, or is set to "true", caching will be enabled, assuming the column meets one of the other criteria for caching (complex non-String column type, has a formula, or has a transform). Caching is useful for values that require conversion or calculation work during the import, but may not be helpful if values in the column are rarely duplicated. |
| cacheSize | (CSV only) Allows overriding the cache size for columns that will be cached. The default size is 1024 entries. |
| (transform attributes) | Other attributes as needed to control transform functions (e.g., transformPrecision for the dbDateTimeFromLong transform). |
| Imports | The packages to import. A default set of packages is imported to support built-in functions and import tasks, but additional packages can be specified to provide other functions or support custom transforms. Imports can be just the qualified package/class name, or the full import statement. Besides the internal imports used by the field writers, a set of standard imports (including java.time.* and java.time.format.DateTimeFormatter) is automatically added in the field writer context. |
| maxError | The maximum number of allowed parsing errors before failing the import. The default is 0. Allowed errors that will increment maxError are things like numeric conversions, overflows, and mismatched values (e.g., a Boolean value being set from a String column that doesn't match any known version of True or False). |
| arrayDelimiter | A string to be used as a delimiter when parsing array data. The default delimiter is a comma. |
| ImportTransform | See Custom Transforms below. The attributes below apply to ImportTransform entries. |
| name (ImportTransform) | The name of a custom transform function; this is the name sought when an ImportColumn specifies a transform. |
| Body (ImportTransform) | The Java code that will be compiled as part of the setter for an ImportColumn that uses this transform. |
Custom Transforms
Custom transforms can be added to the ImportSource block as ImportTransform elements. These elements must:
- Have a Body element that contains the text of the transform function
- Have a name attribute that matches the name used in the transform attribute of ImportColumns that use the transform
- Implement a getFunction function that returns a function matching the specification of the interface
At runtime, getFunction is passed the ImportColumnDefinition of the column that is being set, and a boolean value indicating whether strict (true) or permissive (false) data checking has been selected. The ImportColumnDefinition allows getFunction to access details like the target column name, the target column type, and any ImportColumn details by using getImportColumn(), which provides access to functions such as getAttributeValue().
The following shows an example of a custom transform element that could be specified in the schema (Note: this is actually the built-in dbDateTimeFromLong transform):
<ImportTransform name="Sample">
<Body>
getFunction(ImporterColumnDefinition col, boolean strict)
{
String precision = col.getImportColumn().getAttributeValue("transformPrecision");
switch (precision) {
case "millis":
return (long value) -> new DBDateTime(DBTimeUtils.millisToNanos(value));
case "micros":
return (long value) -> new DBDateTime(DBTimeUtils.microsToNanos(value));
case "nanos": return DBDateTime::new;
default: return null;
}
</Body>
</ImportTransform>
When compiling the transform function, the target data type and, if defined, the source data type are used to find a suitable Java functional interface with which to declare the getFunction. In the case of the example above, the target column type is DBDateTime, and the sourceType="long" attribute is added to the ImportColumn. This combination of source and target types is then used to create and execute the resultant function as a LongFunction<DBDateTime>.
A much simpler example follows:
<ImportTransform name="Sample">
<Body>
getFunction(ImporterColumnDefinition column, Boolean strict) {
return (long value) -> (2 * value);
}
</Body>
</ImportTransform>
If this is used for a column whose type is long, this will be mapped to a LongUnaryOperator, which takes a long and returns a long. If the sourceType is not specified, the importer will attempt to use the target column type for matching a function interface.
If the function being returned from getFunction has a declared return type, this return type must exactly match the target column type for which the transform will be used. For example, a transform that will provide values for a Long column and that has a declared return type must be declared Long, not long. However, the simpler, and recommended, implementation is to include only the transform body in the return from getFunction, so the system can perform implicit type conversion when compiling the transform.
Custom transforms can be added from the Schema Editor File menu using the View/Edit Schema XML option.
Primitive Types and Boxed Types in Transforms
Not every combination of primitive types is supported by the Lambda interfaces used to pass transform functions into field writers. The supported combinations for unboxed primitive types are any combination of source and target of int, double, and long.
Therefore, a transform that reads a long value from the source and transforms it to a double value in the target table would use the primitive types of long and double.
However, a transform that reads a byte value and writes a short value would not be able to use primitive types. The importer will map this transform as a Function<Byte, Short>. Note the capitalization: Byte and Short are the boxed (class-based) versions of the primitive types byte and short. This means there is extra overhead in the transform process to use the larger class-based versions of the data types.
Note: This concern with boxing of primitive types is only applicable to transforms. If the data can be implicitly converted (e.g., mapping a short to an int), or converted with a formula, then no boxing occurs. If the translation is too complex for a formula, another option for such primitive types is to write a custom FieldWriter. Since the FieldWriter classes are not interfaced through Lambda functions, they have no limitations on which primitive data types they can use for source or target.
Another thing to be aware of is that primitive types that have a "null" default will be set to values from com.illumon.iris.db.tables.utils.QueryConstants when their sources are null. While these may be boxed when they are passed into a transform, they will not be automatically converted to true null values. If special handling is needed for null values in the transform, either check for equality to the QueryConstants values, or use a boxed type for the sourceType property of the column so these values will be preserved as null.
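As a minimal sketch of the first approach (checking against QueryConstants), the following hypothetical transform doubles a long value while passing the reserved null constant through unchanged. The transform name and logic are illustrative only, and the fully qualified QueryConstants name is used in case it is not part of the default imports:

<ImportTransform name="NullAwareDoubler">
    <Body>
        getFunction(ImporterColumnDefinition column, boolean strict) {
            // long -> long is a supported primitive combination; reserved nulls are passed through unchanged
            return (long value) -> value == com.illumon.iris.db.tables.utils.QueryConstants.NULL_LONG
                    ? com.illumon.iris.db.tables.utils.QueryConstants.NULL_LONG
                    : 2 * value;
        }
    </Body>
</ImportTransform>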
Array Types
The importers support long[] and double[] array types. From the source, these are expected to be strings (either a string field from JDBC or a column value in CSV). The string is formatted as a bracketed list of values. Brackets can be square brackets [ ], parentheses ( ), or curly braces { }. The delimiter can be specified with arrayDelimiter; the default delimiter is a comma. Empty arrays [] are also supported.
An example array of double strings follows: [1.5, 6.4, 8.0, 2.2322, 88, 16.02]
Custom FieldWriter Classes
Custom FieldWriter classes can be used where a field writer needs to use values from multiple source columns, or where it is more desirable to package transform logic into a FieldWriter class rather than in an in-line transform or imported code. Custom field writers must extend the source-specific FieldWriter class (i.e., CsvFieldWriter for CSV imports or JdbcFieldWriter for JDBC imports). They must implement a specific constructor, and must also implement the processField method.
A custom field writer class project will need the following Deephaven jars added as dependencies:
IO-<version_number>.jar
Base-<version_number>.jar
Ingesters-<version_number>.jar
DataObjects-<version_number>.jar
BinaryStore-<version_number>.jar
IrisDataObjects-<version_number>.jar
An example of a custom field writer for JDBC follows:
package io.deephaven.fieldwriters;

import java.io.IOException;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.TimeZone;

import com.fishlib.io.logger.Logger;
import com.illumon.iris.binarystore.RowSetter;
import com.illumon.iris.importers.ImporterColumnDefinition;
import com.illumon.iris.importers.JdbcFieldWriter;

/**
 * Sample custom field writer class that takes FirstName and LastName and writes out FirstName + ' ' + LastName
 */
public class FullNameWriter extends JdbcFieldWriter {
    private final RowSetter setter;
    private final ResultSet rs;
    private final ImporterColumnDefinition column;

    // This is the constructor format that is required for custom JdbcFieldWriters
    public FullNameWriter(Logger log, boolean strict, ImporterColumnDefinition column,
                          final TimeZone sourceTimeZone, RowSetter setter, ResultSet rs, String delimiter) {
        super(log, column.getName(), delimiter, sourceTimeZone);
        this.setter = setter;
        this.column = column;
        this.rs = rs;
    }

    @Override
    public void processField() throws IOException {
        try {
            setter.set(rs.getString("FirstName") + " " + rs.getString("LastName"));
        } catch (SQLException e) {
            throw new RuntimeException("Failed to concatenate FirstName and LastName for target field: " + column.getName());
        }
    }
}
A custom field writer for CSV would be similar, but would not include ResultSet or TimeZone in its constructor arguments, and would require a CSVRecord as an argument to processField.
import java.io.IOException;

import com.fishlib.io.logger.Logger;
import com.illumon.iris.binarystore.RowSetter;
import com.illumon.iris.importers.CsvFieldWriter;           // package assumed to match JdbcFieldWriter above
import com.illumon.iris.importers.ImporterColumnDefinition;
import org.apache.commons.csv.CSVRecord;

public class CsvMultiColumnExample extends CsvFieldWriter {
    private final RowSetter setter;
    private final ImporterColumnDefinition column;

    // This is the constructor format that is required for custom CsvFieldWriters
    public CsvMultiColumnExample(Logger log, boolean strict, ImporterColumnDefinition column, RowSetter setter, String delimiter) {
        super(log, column.getName(), delimiter);
        this.setter = setter;
        this.column = column;
    }

    @Override
    public void processField(CSVRecord record) throws IOException {
        setter.set(record.get("FirstName") + " " + record.get("LastName"));
    }
}
Internally, XML uses CSV import processing, so the CSV custom field writer structure would also be usable for XML imports.
To use the custom field writer in the table, the class attribute of the ImportColumn is set, and an Imports entry will be needed to make the custom field writer class available to the import process:
<Imports>io.deephaven.fieldwriters.FullNameWriter</Imports>
...
<ImportColumn name="FullName" class="io.deephaven.fieldwriters.FullNameWriter" sourceName="FirstName" />
Note: Even though it is not used, sourceName is required due to a known issue in ImportColumn processing. Currently, this attribute must exist, and must indicate some other column that does exist in the data source as its value.
This can then be used to provide data for the FullName column in the table:
<Column name="FullName" dataType="String" columnType="Normal" />
Typically, the custom field writer class will be compiled and packaged in a jar that should be placed under /etc/sysconfig/illumon.d/java_lib on servers where data imports that need it will be run.
JdbcFieldWriters must extend JdbcFieldWriter and implement a constructor that accepts:
- Logger.class - logger instance passed down from main
- ImporterColumnDefinition.class - column definition of the column to which this field writer will be attached (including ImportColumn and target information)
- RowSetter.class - row setter for the target column
- ResultSet.class - JDBC result set with the current row. The field writer should only read values from result set metadata and current row data, not change the position of the result set or close it.
- String.class - delimiter value used for array parsing functions
- Boolean.class - strict value to indicate whether data checking is strict (true) or relaxed (false)
CsvFieldWriters must extend CsvFieldWriter and implement a constructor that accepts:
- Logger.class - logger instance passed down from main
- ImporterColumnDefinition.class - column definition of the column to which this field writer will be attached (including ImportColumn and target information)
- RowSetter.class - row setter for the target column
- String.class - delimiter value used for array parsing functions
- Boolean.class - strict value to indicate whether data checking is strict (true) or relaxed (false)
Built-in Functions and Transforms
Some functions and transforms are built into the importer framework and can be used when designing ImportSource entries. Transforms often require other attributes to control their behavior. Functions can be used in formula attributes; transforms are used in transform attributes.
- dbDateTimeFromLong (transform) - takes a long input (offset from Unix epoch) and returns a DBDateTime. The attribute transformPrecision should be set to "millis", "micros", or "nanos" to indicate the units of the long offset value. For example, transformPrecision="millis" means that the long value from the source is the number of milliseconds from Unix epoch. Requires sourceType="long" to match the expected input of the transform.
- new DBDateTime (function) - this is the constructor for DBDateTime objects. By default it takes a long offset of nanoseconds from Unix epoch. Additionally, DBTimeUtils.millisToNanos() and DBTimeUtils.microsToNanos() are available for converting other precisions of offsets to nanoseconds. Standard Java date and time parsing and conversion functions can be used to create a long Unix epoch offset from other formats, the result of which can then be passed to this constructor to store as a DBDateTime value. java.time.* and java.time.format.DateTimeFormatter are included in the standard imports. ZonedDateTime.parse can be used with various formats to parse strings into ZonedDateTimes, and the ZonedDateTime.toInstant().toEpochMilli() call can then be used, for example, to convert to long milliseconds to pass to this constructor (see the example after this list).
- dbDateTimeFromTimestamp (function) - takes a Java Timestamp and returns a DBDateTime. Used internally, automatically, when importing JDBC datetime types to DBDateTime destination columns.
- DBTimeUtils.convertDateTime (function) - takes a String and returns a DBDateTime. Expects a string of the form YYYY-MM-DDThh:mm:ss.nnnnnnnnn TZ. One use of this function would be to convert dates that had originally been exported to CSV from a Deephaven system.
- enumFormatter (transform) - takes an int input and returns a String, based on a lookup from an enum defined in the system columns file. The attribute transformColumnSet is the String column set name to match in the columns file. The attribute transformEnum is the String name of the enum under the column set to match in the columns file. Requires sourceType="int" to match the expected input of the transform.
- accountingInt (function) - takes a String and converts it to an int. Converts enclosing parentheses to a negative value. Removes thousands separators. Expects US number formatting (comma for thousands separator). Requires sourceType="String" to match the expected input of the function.
- accountingLong (function) - takes a String and converts it to a long. Converts enclosing parentheses to a negative value. Removes thousands separators. Expects US number formatting (comma for thousands separator). Requires sourceType="String" to match the expected input of the function.
- accountingDouble (function) - takes a String and converts it to a double. Converts enclosing parentheses to a negative value. Removes thousands separators. Expects US number formatting (comma for thousands separator). Requires sourceType="String" to match the expected input of the function.
- stripFirstCharacter (function) - takes a String and returns a String. Strips the first character, assuming the String has at least one character. No override of sourceType is needed because input and output types match.
- parseBoolean (function) - takes a String and returns a Boolean. Single characters T/F, 1/0, and t/f are converted to Boolean true or false. The strings "true" and "false" (case-insensitive) are converted to Boolean true or false. Used internally, automatically, when reading a CSV to a destination column that is Boolean. Requires sourceType="String" for explicit use in custom formulae.
- getLongArray (function) - takes a String and returns an array of long values. Two arguments: the first is the String of array values, the second is a String delimiter used to parse the values. The input string is expected to be enclosed, normally by square brackets. Only single-dimensional lists (arrays) are supported; no matrices. Used internally, automatically, when importing to a long[] column. Requires sourceType="String" for explicit use in custom formulae.
- getDoubleArray (function) - takes a String and returns an array of double values. Two arguments: the first is the String of array values, the second is a String delimiter used to parse the values. The input string is expected to be enclosed, normally by square brackets. Only single-dimensional lists (arrays) are supported; no matrices. Used internally, automatically, when importing to a double[] column. Requires sourceType="String" for explicit use in custom formulae.
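As an illustration of combining the date/time helpers above, a hypothetical ImportColumn (the column and source names are invented) might parse an ISO-8601 string into a column declared with dataType="DateTime" using a formula like the following; note that the target column name refers to the incoming value, as in the FirstNameLength example earlier:

<ImportColumn name="Timestamp" sourceName="event_time" sourceType="String"
    formula="new DBDateTime(ZonedDateTime.parse(Timestamp, DateTimeFormatter.ISO_ZONED_DATE_TIME).toInstant().toEpochMilli() * 1000000)" />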
CopyTable Schemas
One table layout may be used for multiple tables. When this is required, it is not necessary to replicate the entire source schema definition for each new table. Instead, a schema definition can use the CopyTable element instead of the Table element, referring to another source table for its definition. The CopyTable element then uses the source table's schema for all definitions, including columns, imports, loggers, and listeners.
The schema file naming convention of <namespace>.<table>.schema also applies to CopyTable schema files. The namespace and table in the CopyTable schema filename should be those of the duplicate table, not the source table.
A CopyTable definition contains the following attributes:
- namespace - the duplicate table's namespace
- sourceNamespace - the source table's namespace
- name - the duplicate table's name
- sourceName - the source table's name
Note: the previous CopyTable syntax of namespaceSource and nameSource has been deprecated.
For example, the following would use the definition of the table ExampleTable from the namespace ExampleNamespace for a table called ExampleTableCopy in the namespace CopyTableNamespace:
<CopyTable namespace="CopyTableNamespace" sourceNamespace="ExampleNamespace" name="ExampleTableCopy" sourceName="ExampleTable" />
A duplicate table functions in the same manner as any other Deephaven table. Data can be imported through the importer classes or real-time ingestion, partitions will be generated in the same ways, intraday data can be merged to the historical database, and queries will work as expected.
Loggers and listeners are not generated for CopyTable definitions, as the loggers and listeners from the source table will be used for the duplicate table.
Note: Chained CopyTable definitions are not supported at this time; a CopyTable must refer to a fully defined Table schema.
Managing Schema Files
Once schemas have been defined or changed, they must be made visible to Deephaven (deployed), and if any custom loggers and listeners are defined, these must be generated.
The logger/listener generation can be done from deployed schemas, or from another location prior to schema deployment so the generated loggers can be integrated with the customer application before being deployed to Deephaven.
Two scripts are provided for managing schema files, and are described below:
- dhconfig - both deploys schemas to a Deephaven database, making their described tables available for use, and exports schemas from etcd to a specified directory. See: Configuration Tools.
- generate_loggers_listeners - generates the logger and listener classes from schemas. For more information on loggers and listeners, see Streaming Data.
These scripts can be used during the initial setup of a new Deephaven system, as well as to update configuration when schemas are added or modified. Rerunning the scripts is safe, even if nothing has changed since the last time they were run.
Schema Service Types
Deephaven installations use etcd to store and access schema files; this allows schema files to be easily distributed to the processes that need them.
Some standalone processes such as standalone logging programs not running on a Deephaven server will not use the etcd installation to access schema files. These programs will have access to the schema files locally, and are not discussed here.
Schema Locations and Properties
On a Deephaven server, editing a schema file directly has no effect on the system. Edited schema files must be deployed before the system will use them. See: Deploying Schemas to Deephaven.
The following property controls aspects of schema behavior:
- SchemaConfig.defaultListenerPackagePrefix - defines the default listener package prefix used if one is not defined in a schema. Any generated listeners will use this value followed by the namespace converted to lowercase if they don't have their own package defined in their schema file. The default value is: SchemaConfig.defaultListenerPackagePrefix=com.illumon.iris.intraday.gen
Deploying Schemas to Deephaven
Once schemas are defined, or after they are changed, they must be made available to the Deephaven database. This is done using the dhconfig script. The script uses the parameters defined at the command line to find the schemas. The Schema Editor also has an option to deploy schemas.
Until this step is done, schema changes are not visible to Deephaven.
Use the following to call the schema deployment script (proper authentication is required):
dhconfig schemas import <options>
As it runs, the script shows details about how Java was started and the property files read. Following this, output similar to the following should appear. This indicates all of the tables that were found and deployed.
For example, on most Deephaven installations, the following command will redeploy the LearnDeephaven schemas:
/usr/illumon/latest/bin/dhconfig schemas import --directory /etc/sysconfig/illumon.d/schema/LearnDeephaven <auth options>
This will result in the following output:
Updating System.LearnDeephaven.EODTrades
Updating System.LearnDeephaven.StockQuotes
Updating System.LearnDeephaven.StockTrades
The following options are available for use with dhconfig schemas import:
| Option | Description |
|---|---|
| --directory | Specify the directory from which schema files will be read. |
|  | Ignore any errors and continue to attempt to deploy schema files. If this option is not specified, then the tool stops as soon as any error occurs. |
|  | Specify a schema or table definition file to import. This option can be specified multiple times to process multiple files at once. This must be the full path, not a relative one. |
| --overwrite | Overwrite existing schemas when importing; overwrite files when exporting. |
|  | Prints help for a schemas command. |
|  | Specify a private key file to use for authentication. |
| --namespace | Specify a namespace to process. If this is specified, then only files for the specified namespace will be imported; to process multiple namespaces at one time, this argument can be specified more than once. If this is not specified, all namespaces will be processed. |
|  | Specify a password for the given user. |
|  | Specify the namespace set to import. This allows the processing of either System or User namespace files. If it is not specified, both System and User namespace set files will be processed. |
|  | Attempt to update all schemas. If this is not specified, then existing schemas will be skipped. |
|  | Print the full text of exceptions to stderr. |
Migration Mode
When running the legacy deploy_schema tool in migration mode, if directories or files are not specified, the tool will instead use several properties to locate and process schema files. This is intended to be run one time, to import schema files from their legacy locations into etcd. Once this migration is complete, these properties will no longer be used by Deephaven to find schema files.
These legacy schema location properties include a list of locations (directories or JAR files) in which schemas can be found, delimited by semicolons or spaces. When looking for schema files, Deephaven (including the deployment and logger/listener generation scripts) will search all subdirectories of the specified locations for files with the .schema extension, as well as looking within JAR files in those directories and subdirectories for .schema files.
The schema location properties allow the following substitutions:
- <devroot> - refers to the installation directory of the Deephaven software, usually /usr/illumon/latest.
- <workspace> - refers to the workspace of the process.
The following properties specify the locations of schema files during migration mode:
- SchemaConfig.allowedNamespaces - If this property is set when deploying schemas, then schema files will be imported only for the namespaces specified, and all specified namespaces will be created even if no schema files are found. The value is a comma-delimited list of namespaces.
- The SchemaConfig.resourcePath is a multi-valued property prefix which tells the system where to find schema files. You may add any suffix to the property name to add new paths without overwriting any other configured SchemaConfig.resourcePath values. You may add more than one directory by using a semicolon separator; you may also use wildcard matching with an asterisk * (note that when using wildcards, your paths must end with a trailing slash). A hypothetical example of adding another location follows this list. The standard entries are:
  - SchemaConfig.resourcePath.irisInternal - The configured schema directory for internal Deephaven schemas.
  - SchemaConfig.resourcePath.customer - By default, this path is the customer's schema directory: /etc/sysconfig/illumon.d/schema. If schemas will be placed in additional locations, you may freely add to this property, or create some other suffix with the additional locations.
  - SchemaConfig.resourcePath.plugins - By default, this is a list of schemas for Deephaven plugins. If new plugins are added which have schema files, you may freely add to this property, or create some other suffix for the new plugin schema locations. If this property does not exist (on older installations), you should add it with a value of /etc/sysconfig/illumon.d/plugins/*/schema/ (note the trailing slash).
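For instance, a hypothetical additional entry (the property suffix and the second directory below are placeholders, not defaults) could add another schema location alongside the standard customer directory:

SchemaConfig.resourcePath.extraSchemas=/etc/sysconfig/illumon.d/schema;/db/TempFiles/extra-schema/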
Deploying Schemas During Intraday Data Ingestion
Schema changes made while intraday data is being written to a table (for instance, through the tailer and data import server) will require special attention as Deephaven does not automatically handle schema changes while data is being ingested. This can be accomplished using the following procedure.
- First, ensure the tailer is not sending data for the changed table by examining its log file (usually /var/log/deephaven/tailer/LogtailerMain1.log.current), and stop the tailer.
- Move or rename any existing binary log files for that table for the day so they won't be found by the tailer. For example, if the UpdatePerformanceLog schema was being changed:
  sudo mkdir /db/TempFiles/save
  sudo mv /var/log/deephaven/binlogs/perflogs/DbInternal.UpdatePerformanceLog*.bin.* /db/TempFiles/save
- Assuming the table's internal partition is determined by the tailer configuration, update the tailer's XML configuration file and change (or add) the internalPartitionSuffix attribute for the changed table so that new intraday data will be written to a different location. If the internal partition is instead determined by the internalPartitionRegex attribute, update the logger to ensure it's logging with a different filename. For example: internalPartitionSuffix="TableVersion2"
- If the new schema includes listener changes, the Data Import Server will need to be restarted so it can use the new listener classes.
- Finally, once the new schema is deployed, the tailer can be restarted.
Note that queries may not find the previously written intraday data for the day after this procedure is complete, but merges will find and write it correctly.
Exporting Schemas From Deephaven
The dhconfig tool exports schemas from etcd to a specified directory as ".schema" files, named <namespace>.<table name>.schema. These files can then be examined and edited, and re-imported with the dhconfig script. It needs to be run from an account with access to etcd, such as dbquery, or an exception will occur indicating that it cannot access an etcd directory. It also has an option to simply list schemas rather than export them.
The following options are available:
| Option | Description |
|---|---|
| --directory | Specify the directory to which exported schema files will be written. |
|  | Instead of exporting the schemas, list them. |
| --namespace | Specify the namespace to export or list. This argument can be specified multiple times to export or list multiple namespaces at once. If it is not specified, all namespaces will be exported or listed. |
| --overwrite | If specified, existing schema files will be overwritten; otherwise a schema will not be exported if its schema file already exists in the specified directory. This argument is ignored when only listing schemas. |
|  | Specify the namespace set to export (System or User). If it is not specified, then both System and User schemas are exported. |
For example, if the user is logged in as dbquery on most Deephaven installations, the following command will export the LearnDeephaven schema files to the directory /tmp/export, overwriting any files that are already there.
/usr/illumon/latest/bin/dhconfig schemas export --directory /tmp/export --namespace LearnDeephaven --overwrite