Data Routing Service Configuration via YAML

Data is central to Deephaven. There are many ways to configure the storage, ingestion, and retrieval of the data. The Data Routing Service is a central API for managing the configuration of a Deephaven system. The YAML configuration file format centralizes the information governing the locations, servers, and services that determine how data is handled. Because the information is stored in one place, the entire configuration can be viewed at a glance. This also makes it easier to make changes and understand the implications.

To enable YAML-based data routing, you will need to set these properties:

  • DataRoutingService.default=YAML
  • DataRoutingService.configFile=routing_service.yml

These values should be set in the IRIS-CONFIG.prop file located in the /etc/sysconfig/illumon.d/resources/ directory.

YAML File Format

Related Link
The full YAML specification can be found at:
http://yaml.org/spec/1.2/spec.html.

The YAML data format is designed to be human readable. However, there are some aspects that are not obvious to the unfamiliar reader. The following pointers will make the rest of this document easier to understand.

  • YAML leans heavily on maps and lists.
  • A line beginning with a string followed by a colon (e.g., filters:) indicates the name of a name-value pair (the value can be a complex type). This is used in "map" sections.
  • A line beginning with a dash (e.g., - name) indicates an item in a list (or sequence). Each item is often itself a map; placing a name: element inside the item avoids an unnecessary extra map level.
  • Anchors are defined by &identifier, and they are global to the file. Anchors might refer to complex data.
  • Aliases refer to existing anchors (e.g., *identifier).
  • Aliased maps can be spliced in (e.g., <<: *LAS-default). In that case, all the items defined by the map with anchor LAS-default are duplicated in the map containing the <<: directive, as shown in the sketch below.
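
The following minimal sketch (with hypothetical names such as example_port and svc1) illustrates these constructs together:

 anchors:
  - &example_port 22000          # anchor on a scalar value
  - &LAS-default                 # anchor on a map of default values
     port: *example_port         # alias referring to the scalar anchor
     throttleKbps: -1
 services:
  - name: svc1                   # list item that is itself a map
    <<: *LAS-default             # splice in every entry of the LAS-default map
  - name: svc2
    <<: *LAS-default
    port: 22001                  # keys defined directly in the map override spliced-in values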

The Data Routing Service needs certain data to create a system configuration. It looks for that data under defined sections in a single YAML document (the YAML file format allows for multiple "documents" embedded in one file, but this is not supported in the Deephaven Data Routing Configuration file).

Data Types

Related Link
More information about data types can be found at:
http://yaml.org/spec/1.2/spec.html#id2759963.

Only a few of the possible data types are used by Deephaven and mentioned in this document:

  • List (or sequence) - consecutive items beginning with "- "
  • Map - set of "name: value" pairs
  • Scalars - single values, mainly integer, floating point, boolean, and string

The YAML parser will guess at the data type for a scalar, and it cannot always be correct. The main opportunity for confusion is with strings that can be interpreted as other data types. Any value can be clarified as a string by enclosing it in quotation marks (e.g., "8" puts the number in string format).
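
For example, with a hypothetical port setting:

 tailerPort: 22011     # unquoted, parsed as an integer
 tailerPort: "22011"   # quoted, parsed as a string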

Sections

The Data Routing Configuration file contains a single YAML document with defined sections.

The document must contain a map key called "config", which must have the following sections:

  • storage
  • dataImportServers
  • logAggregatorServers
  • tableDataServices

Optional sections may also be included in the YAML file: sections that define anchors and default values (e.g., "anchors"), and sections that consolidate default data in one location (e.g., "default"). Each section is discussed below.
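
Putting this together, the overall document is a single "config" map containing the required sections; the comments in this skeleton are placeholders:

 config:
   storage:
     # list of storage instances
   dataImportServers:
     # map of Data Import Server configurations
   logAggregatorServers:
     # ordered list of Log Aggregator Server configurations
   tableDataServices:
     # map of table data service definitions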

Anchors

Anchors can be defined and then referenced throughout the data routing configuration file. These can represent strings, numbers, lists, maps, and more. In this case, we use them to define names for machine roles and for port values. In a later section, they define the default set of properties in a DIS.

The syntax &identifier defines an anchor, which allows "identifier" to be referenced elsewhere in the document. This is often useful to concentrate default data in one place to avoid duplication. Note: because anchors are global, the names must be unique.

The example "anchors" section below effectively creates a map of the cluster, defines default values for ports that were previously defined in properties files, and defines default values for DIS configurations that will be referenced later in the document (see Data Import Servers below for a detailed explanation of these values):

anchors:
 # This is effectively a map of the cluster and defines anchors for hosts
 - &localhost   "localhost"
 - &ddl_infra   "192.168.0.140"
 - &ddl_query   "192.168.0.140"
 - &ddl_query1  "192.168.0.141"
 - &ddl_query2  "192.168.0.142"
 - &ddl_dis     "192.168.0.140"
 - &ddl_rta     "192.168.0.140"
 - &ddl_merge   "192.168.0.140"
 # Define aliases for the default port values
 - &default_tailerPort              22011
 - &default_lasPort                 22020
 - &default_tableDataPort           22015
 - &default_tableDataCacheProxyPort 22016
 - &default_localTableDataPort      22014
 # Define a DIS configuration instance that contains all the default values.
 # Instances below can splice in these defaults and override individual values.
 - &DIS_default
    tailerPort: *default_tailerPort
    throttleKbps: -1
    storage: default
    definitionsStorage: default
    tableDataPort: *default_tableDataPort

Note: Anchors can be defined anywhere in the YAML file. However, the anchors must be defined earlier in the file than where they are used. For example, the hosts and default ports could also be defined in their own sections (e.g., "hosts" and "defaultPorts") rather than consolidated into one section.
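
For example, a sketch of that alternative layout, using the same anchors as above:

 hosts:
  - &localhost "localhost"
  - &ddl_dis   "192.168.0.140"
 defaultPorts:
  - &default_lasPort    22020
  - &default_tailerPort 22011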

Storage

This required section defines locations where Deephaven data will be stored. This will include the default database root and any alternate locations used by additional data import servers.

Note: Many components still get the root location from properties (e.g., OnDiskDatabase.rootDirectory).

For example:

storage:
  - name: default
    dbRoot: /db

The value of this section must be a list. Each entry in the list is a map with the following keys:

  • name: [String] - Other parts of the configuration will refer to a storage instance by this name.
  • dbRoot: [String] - This refers to an existing directory.

See also Import-driven lastBy Queries.
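
For example, a hypothetical second storage instance for a lastBy DIS alongside the default (the name lastby and its path are assumptions for this sketch):

 storage:
   - name: default
     dbRoot: /db
   - name: lastby                            # hypothetical alternate storage instance
     dbRoot: /db/dataImportServers/lastby    # must be an existing directory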

Data Import Servers

This section supports Data Import Server (DIS) and Tailer processes.

The DIS process will start with a process name (e.g., db_dis) matching a data import server configuration. The configuration names can be set to match the process names in use.

The Tailer process uses this configuration section to determine where to send the data it tails. Note that a given table might have multiple DIS destinations.

The two consumers have slightly different needs; for example, the host name is used by the Tailer to connect but is not needed by the DIS process itself.

The following example defines two data import servers. Note that the defaults defined in the previous section are imported into each map entry:

dataImportServers:
  db_dis:
    <<: *DIS_default # import all defaults from the DIS_default anchor
    host: *ddl_dis   # reference the address defined above for "ddl_dis"
    userIntradayDirectoryName: "IntradayUser"
    filters: {namespaceSet: System}
    webServerParameters:
      enabled: true
      port: 8084
      authenticationRequired: false
      sslRequired: false
  db_rta:
    <<: *DIS_default
    host: *ddl_dis
    filters: {namespaceSet: User}

The value of this section must be a map. The keys of the map will be used as Data Import Server names. The value for each key is also a map:

  • host: [String] - The Tailer will connect to this Data Import Server on the given host and port.
  • tailerPort: [int] - The Data Import Server will receive Tailer connections on this port.
  • throttleKbps: [int] - (Optional) If omitted or set to -1, there is no throttling.
  • storage: [String] - This must be the name of a storage instance defined in the storage section. This Data Import Server will write data in the location specified by that storage instance.
  • definitionsStorage: [String] - (Optional) If a Data Import Server is configured with storage other than the default, table definitions generally must still be read from the default storage. This must be the name of a storage instance defined in the storage section.
  • userIntradayDirectoryName: [String] - (Optional) Intraday user data will be stored in this folder under the defined storage. If not specified, the default is "Users".
  • filters: [filter definition] - (Optional) This filter determines the tables for which Tailers will send data to this Data Import Server.
  • tableDataPort: [int] - (Optional) This port will be used to publish table data. If not set, or set to -1, this import server will not publish table data.
  • webServerParameters: [map] - (Optional) This defines an optional web server for internal status. Its keys are:
      • enabled: [boolean] - (Optional) If set to true, a web server will be created.
      • port: [int] - Required if enabled is true.
      • authenticationRequired: [boolean] - (Optional) Defaults to true.
      • sslRequired: [boolean] - (Optional) Defaults to true. If authenticationRequired is true, then sslRequired must also be true.
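
As a sketch of how these keys combine with the defaults from the anchors section, a hypothetical additional DIS entry (the names db_dis_lastby and lastby are assumptions) might splice in DIS_default and override individual values:

 dataImportServers:
   db_dis_lastby:
     <<: *DIS_default      # splice in the default values defined above
     host: *ddl_dis
     tailerPort: 22021     # override the default tailer port for this instance
     storage: lastby       # name of a storage instance from the storage section
     filters: {namespaceSet: System, namespace: Order}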

Log Aggregator Servers

This section defines Log Aggregator Servers. Unlike the way a Tailer uses DIS entries, only one LAS will be selected for a given table location. Because the first matching entry wins, a section with a specific filter can override a following section with more general filters.

The following example shows one service for user data (RTA) and a local service for everything else:

logAggregatorServers: !!omap
  - rta:
      port: *default_lasPort
      host: *ddl_rta
      filters:
        - namespaceSet: User

  - log_aggregator_service: # default service
      port: *default_lasPort
      host: *localhost

The value for this section must be an ordered list; the "!!omap" tag ensures that the data type is an ordered list. Consumers of the Log Aggregator Service will send data to the first server in the list whose filter matches.

Each item in the list is a map, where the key is taken to be the name of a Log Aggregator Server.

The value of each item is a map:

  • host: [String] - Clients of this Log Aggregator Server will connect to the given host and port. This is often localhost.
  • port: [int] - The Log Aggregator Service of this name will listen on this port.
  • filters: [filter definition] - (Optional) This filter defines whether data for a given table should be sent to this server.

Table Data Services

This is the most important section and requires careful setup.

This section supports both providers and consumers of the TableDataService protocol. Providers include Data Import Servers, the Table Data Cache Proxy and the Local Table Data Service. All Data Import Servers are implicitly Table Data Service (TDS) providers.

Any TDS that will be used by a consumer process (query server, merge process, etc.) must have filters configured so that any table location will be provided by exactly one source, either because of filter exclusions or because of local availability of data.

A table location is identified by namespace, tableName, internalPartition, and columnPartition. It is common for one TDS to provide locations for certain column partition values (e.g., the current date and future dates) and another to provide locations for the other column partition values.

For example:

tableDataServices:
  # db_dis - data import server implies table data source
  # db_rta - data import server implies table data source

  # local, with a storage named above
  local:
    storage: default

  # Configuration for the LocalTableDataServer named db_ltds, and define
  # the data published by that service
  db_ltds:
    host: *ddl_dis
    port: *default_localTableDataPort
    storage: default

  # Proxies combine other services and create new services.
  # Configuration for the TableDataCacheProxy named db_tdcp, and define
  # the data published by that service.
  # There is typically one such service on each worker machine.
  db_tdcp:
    host: localhost
    port: *default_tableDataCacheProxyPort
    sources:
      # SYSTEM_INTRADAY tables for "current date" (and future), minus Order tables handled by Simple_LastBy
      - name: [db_dis, db_dis_backup] # the array defines a failover group of equivalent sources
        filters: {whereTableKey: "NamespaceSet = `System` && Namespace != `Order`", whereLocationKey: "ColumnPartition >= currentDateNy()"}
      # LTDS for SYSTEM_INTRADAY past dates, minus Order tables handled by Simple_LastBy
      - name: db_ltds
        filters: {whereTableKey: "NamespaceSet = `System` && Namespace != `Order`", whereLocationKey: "ColumnPartition < currentDateNy()"}
      # all user data
      - name: db_rta
        filters: {namespaceSet: User}
      # only Orders data
      - name: Simple_LastBy
        filters: {whereTableKey: "NamespaceSet = `System` && Namespace == `Order`"}

  # TDS failover groups. These are treated as equivalent sources,
  # e.g., in the case of data recovery.
  system_dis_tds:
    sources:
      # any source whose name is an array defines a failover group;
      # all entries in the group must have host and port, and should have identical filters.
      - name: [db_dis, db_dis2]

The value for this section must be a map. The key of each entry will be used as the name of the table data service.

The value of each item is also a map:

  • host: [String] - (Optional) Host and port define the address of a remote table data service. Both must be set, or neither.
  • port: [int] - (Optional) See host; both must be set, or neither.
  • storage: [String] - (Optional) Either storage or sources must be specified. If present, this must be the name of a defined storage instance. This is valid for a local table data service, or a local table data proxy.
  • sources: [list] - (Optional) Either storage or sources must be specified. Sources defines a list of other table data services that will be published as a new table data service. Each source is a map:
      • name: [String or List] - A string value refers to a configured table data service. A list indicates multiple configured table data services that are deemed to be equivalent. This can be used for redundancy or failover.
      • filters: [filter definition] - (Optional) In a composed table data service, it is essential that data is segmented to be non-overlapping via filters and data layout. A given table location should not be served by multiple table data services in a group.

Note: If sources are present, the host and port are required.

Data Filters

Filters specify which services apply to a given data location, or which locations apply to a given service. In the "filters" section, which may be used within many of the primary sections discussed above, either a single filter or an array of filters may be defined. A location will be accepted if any defined filter accepts it.

There are two ways to specify these filters: using Query Language or Attribute Values. The filter attributes for the two modes are mutually exclusive.

Query Language Filters

Filter attributes whereTableKey and whereLocationKey contain boolean clauses in the Deephaven query language. These clauses operate on a table with columns representing a table location:

  • NamespaceSet (String) - NamespaceSet is User or System, and divides tables and namespaces between System and User.
  • Namespace (String)
  • TableName (String)
  • Online (Boolean) - Online tables are those that are expected to change, or tick. This includes system intraday tables and all user tables.
  • Offline (Boolean) - Offline tables are those that are expected to be historical and unchanging. This is system data for past dates, and all user tables. The Online and Offline categories both include user data, so Offline is not the same as !Online.
  • InternalPartition (String)
  • ColumnPartition (String)

whereLocationKey clauses apply to the Location Key (InternalPartition and ColumnPartition) associated with a given Table Key.

Examples

The following example filters to System namespaces except Order, for the current date and the future:

filters: {whereTableKey: "NamespaceSet = `System` && Namespace != `Order`", whereLocationKey: "ColumnPartition >= currentDateNy()"}

The next example filters to the same tables, but for all dates before the current date:

filters: {whereTableKey: "NamespaceSet = `System` && Namespace != `Order`", whereLocationKey: "ColumnPartition < currentDateNy()"}

Unlike the first two filters, the following example includes all locations for these tables:

filters: {whereTableKey: "NamespaceSet = `System` && Namespace == `Order`"}

Attribute Values Filter

This type of filter allows you to stipulate specific values for the named attributes. Since multiple filters can be specified disjunctively, you can build an inclusive filter by specifying the parts you want included in separate filters.

The attributes for a filter are:

  • namespaceSet - "User" or "System".
  • namespace - The table namespace must match this value.
  • tableName - The table name must match this value.
  • online - true or false. online==false means historical system data, or any user data. online==true means intraday system data, or any user data.
  • class - This specifies a fully qualified class name which can be used to evaluate locations. This class must implement DataRoutingService.LocationFilter or DataRoutingService.TableFilter, or both (which is defined as DataRoutingService.Filter).

A filter may define zero or more of these fields. A value of "*" is the same as not specifying the attribute. A filter will accept a location if all specified fields match.

Examples

There are several ways to specify filters in the YAML format, as illustrated in the examples below.

Example 1: Inline Map Format

sources:
 - name: example1a
   filters: {namespaceSet: System, online: false } # (system and offline) or user

Example 2: Map Elements on Separate Lines

 - name: example1b
   filters:
     namespaceSet: System
     namespace: "*"
     tableName: "*"
     online: false

Example 3: An Array of Filters. Each filter can be inline or on multiple lines.

 - name: example2
   # any of these filters
   filters:
     - {namespaceSet: System, online: true, class: com.illumon.iris.db.v2.locations.FilteredTableDataService$SystemIntradayActiveLocationFilterClass}
     - {namespaceSet: User}
     - {namespace: Example2Namespace, tableName: "*"}

Example 4: An Empty Filter

 - name: everything2
   filters: {}

Example 5: No Filter (same as an empty filter)

 - name: everything1

YAML File Validation Tool

Deephaven includes a tool to validate data routing service configuration files before putting them on a system. It tests for various common errors in the YAML file and is performed by invoking the Java class VerifyDataRoutingConfiguration. This can be accomplished in one of two ways: via the validate_routing_yml script, or directly via java.

Running either command should return 0 for a successful parse, and non-zero otherwise. On failure, the parsing error or IOException will be printed.

validate_routing_yml script

A script named validate_routing_yml is provided in /usr/illumon/latest/bin. This script takes a YAML filename to be validated as a parameter.

Example Command:

/usr/illumon/latest/bin/validate_routing_yml /etc/sysconfig/illumon.d/resources/routing_service.yml

Example Output:

Data Routing Configuration file "/etc/sysconfig/illumon.d/resources/routing_service.yml" parsed successfully

Java

Note that the location specified for the workspace must be a directory, or creatable as a directory, in which the user has write permission.

Command:

java -Dworkspace=/tmp/foo -Ddevroot=/tmp/foo \
  -DConfiguration.rootFile=iris-defaults.prop -Dconfiguration.quiet=true \
  -cp "/usr/illumon/latest/java_lib/*" \
  com.illumon.iris.db.v2.configuration.VerifyDataRoutingConfiguration \
  /etc/sysconfig/illumon.d/resources/routing_service.yml

Example Output:

Loading iris-defaults.prop
Configuration: workspace is /tmp/foo/
Configuration: devroot is /tmp/foo/
Configuration: Configuration.rootFile is iris-defaults.prop
Data Routing Configuration file "/etc/sysconfig/illumon.d/resources/routing_service.yml" parsed successfully

Example Deephaven Data Routing Configuration File

The following example shows a sample YAML configuration file for a three-node cluster, with a lastBy DIS and shared System (historical) data.

