Appendix A: Import/Merge Scripts

The import and merge scripts are designed to automate the day-to-day data ingestion operations of a production environment. The assumption is that a number of batch data sources must be repeatedly imported, validated, and merged. Without these scripts, a user would need to manually specify many required parameters and environment variables. The scripts are driven by XML configuration files; see the XML Files section below.

Environment Variables (Host Configuration File)

Several basic environment variables must be defined for the merge/import scripts to work correctly. This is normally done in the host configuration file, with a merge_import entry within the $proc case statement. Each variable should be defined with an export statement, since the variables are used in subshells. The variables are described below, with typical values.
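For reference, a minimal sketch of such a merge_import entry is shown below. The surrounding case statement and the specific values are illustrative assumptions based on the variable descriptions that follow, not a verbatim host configuration file.

case $proc in
    merge_import)
        # Export each variable so it is visible to the subshells the scripts create.
        export JAVA_HOME=/usr/java/latest-x64
        # Keep any ILLUMON_DEVROOT already set; otherwise apply the default.
        : ${ILLUMON_DEVROOT:="/db/Users/dbmerge/dev"}
        export ILLUMON_DEVROOT
        export ILLUMON_DBROOT=/db
        export ILLUMON_TEMP_PATH=$ILLUMON_DBROOT/TempFiles
        export TEMP_PATH_USER=$ILLUMON_TEMP_PATH/$LOGNAME
        ;;
esac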

JAVA_HOME defines where the Java installation can be found; java should be found under $JAVA_HOME/bin. For example:

export JAVA_HOME=/usr/java/latest-x64

ILLUMON_DEVROOT defines the development root directory. Scripts to be run should be found here, and jars will be added to the classpath. The form below keeps any value already set and otherwise applies the default:

: ${ILLUMON_DEVROOT:="/db/Users/dbmerge/dev"}
export ILLUMON_DEVROOT

ILLUMON_DBROOT specifies the root directory for database files (not the installation, but the data).

export ILLUMON_DBROOT=/db

ILLUMON_TEMP_PATH defines the directory under which log files will be created, in a subdirectory named for the user defined by LOGNAME.

export ILLUMON_TEMP_PATH=$ILLUMON_DBROOT/TempFiles

TEMP_PATH_USER is used as a directory for temporary files.

export TEMP_PATH_USER=$ILLUMON_TEMP_PATH/$LOGNAME

DB_RPM_DEPLOYED is used to help determine library paths; in most installations, it should be True. If it is not True, the scripts make no assumptions about the existence of specific directories within the installation, and installation-specific shared libraries will not be available.

export DB_RPM_DEPLOYED=True

ILLUMON_DB_INST_ROOT refers to the directory in which the Deephaven software was installed.

export ILLUMON_DB_INST_ROOT=/usr/illumon/latest

ILLUMON_CONFIG_ROOT refers to the directory in which the customer configuration was installed.

export ILLUMON_CONFIG_ROOT=/etc/sysconfig/illumon.d

STATUS_ROOT specifies the directory to which status files will be written.

export STATUS_ROOT=/db/Users/dbmerge/dailystatus

ILLUMON_PROP_FILES refers to the jar or directory containing the property files to be used for this import or merge; it is added to the classpath used by the import and merge programs to search for resources. If it is not defined, the resource search will use the standard classpath as generated by the scripts (override directories, database jar files, the database etc directory, and the development root's etc directory).

export ILLUMON_PROP_FILES=\
$ILLUMON_CONFIG_ROOT/java_lib/illumon-config-demo.jar

ILLUMON_CONFIG_FILES_DIR refers to the directory containing the required configuration files.

export ILLUMON_CONFIG_FILES_DIR=$ILLUMON_CONFIG_ROOT/java_lib

If defined, ILLUMON_OVERRIDE_DIRS contains a colon-delimited list of directories that can contain overrides of the XML configuration files.

export ILLUMON_OVERRIDE_DIRS=\
$ILLUMON_CONFIG_ROOT/override:/db/Users/dbmerge/override

PYTHON_LOGGING defines the level of information that is written to the scripts' log files. Valid values are ERROR, WARNING, INFO, and DEBUG.

export PYTHON_LOGGING=INFO

CONFIGFILE defines the specific property file to be used for merges and imports.

export CONFIGFILE=db-nodename.prop

EXTRA_ARGS defines additional arguments to be passed to the import and merge classes.

export EXTRA_ARGS="-DWAuthenticationClientManager.defaultPrivateKeyFile=/etc/sysconfig/illumon.d/auth/priv-merge.base64.txt"

db_merge_import_base.sh

This is the base script that is usually called to perform a merge/import. It sources the host configuration file and calls db_ingest_daily, passing the parameters straight through. Its advantage over calling db_ingest_daily directly is that it first sources the host configuration file, which should define the basic parameters described above.

db_merge_import_base.sh <NTHREADS> <MODE> <EMAIL> <Optional DATE>

  • NTHREADS specifies the maximum number of threads allowed for the operations to be performed. Note: not all import processes support multi-threading.
  • MODE specifies the mode to run, as defined in the db_dailymerge.xml file.
  • EMAIL specifies the email to which details will be sent.

An optional DATE in the format YYYY-MM-DD overrides the default date taken from the XML files. The date can also be specified as a start:end range in the format YYYY-MM-DD:YYYY-MM-DD, or as multiple dates separated by commas.
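For example (the thread count, mode, address, and dates are illustrative values only):

db_merge_import_base.sh 4 UsInternal [email protected] 2016-09-26
db_merge_import_base.sh 4 UsInternal [email protected] 2016-09-23:2016-09-26
db_merge_import_base.sh 4 UsInternal [email protected] 2016-09-23,2016-09-26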

db_ingest_daily.py

db_ingest_daily.py is the primary script used for nightly merges, and is generally the one invoked from cron for that purpose. It is usually called from db_merge_import_base.sh to ensure that it has everything it needs from the host configuration, but if the environment is defined correctly, it can also be run interactively. The parameters are the same as for db_merge_import_base.sh.
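For instance, a nightly run might be scheduled with a crontab entry along the following lines; the schedule and the script path are assumptions for illustration:

# Run the UsInternal mode with 10 threads at 00:30 on weekdays (hypothetical schedule and path)
30 0 * * 1-5 /db/Users/dbmerge/dev/bin/db_merge_import_base.sh 10 UsInternal [email protected]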

db_ingest_daily.py <NTHREADS> <MODE> <EMAIL> <Optional DATE>

For example:

$ILLUMON_DEVROOT/bin/db_ingest_daily.py 10 UsInternal [email protected]

db_ingester_v2.py

db_ingester_v2.py analyzes the db_merge*.xml files and performs the actual merges; a method within this file is called by db_ingest_daily.py, but the script can also be invoked interactively. In that case, the parameters that would otherwise come from the db_dailymerge.xml file must be specified manually.

db_ingester_v2.py <NTHREADS> <DATE> [runners=<runners>] [namespaces=<namespaces>] [feeds=<feeds>] [ifeeds=<intradayFeeds>] [logDirSuffix=<suffix>] [initialPort=<port>] [explicitTests=<testNames>] [dateOrderNormal=<boolean>] [emailWarning=<boolean>]

For example:

$ILLUMON_DEVROOT/bin/db_ingester_v2.py 1 2016-09-26 namespaces=DbInternal feeds=PersistentQueryStateLog ifeeds=PersistentQueryStateLog logDirSuffix=_2016-09-26_DbInternal runners=v2merge,verify_simple,verify_full,cleanup

XML Files

Several levels of definitions, spread across two kinds of XML files, are used to define repeatable merge and import operations.

  1. The db_merge* XML files define parameters at the table and feed levels.
    1. The lowest level is the table, which is defined as the actual table name and its associated parameters.
    2. Tables are grouped into feeds. In many cases, there will be a one-to-one correspondence between a table and a feed, but one feed can contain zero to many tables. Zero tables is allowed because the feed may simply run a script. Feeds contain additional parameters.
    3. Feeds are grouped into namespaces.
  2. The db_dailymerge.xml file groups the feeds and namespaces into groups for merge and import operations.
    1. Namespaces are grouped into namespace_lists.
    2. Feeds are grouped into feed_lists.
    3. Modes are used to specify namespace lists and feed lists, which are processed based on the supplied details. These details include which operations to run (e.g., merge, import, validate).

db_merge* XML Files

One or more XML files with names matching db_merge*.xml contain the details of the merge/import namespaces, feeds, and tables, along with various required options. Each of these files should be a fully parsable, self-contained XML file and requires the sections and parameters listed below; a minimal example file follows the list. The root element must exist, but its name is not important.

  1. <import_merge> is the root element.
  2. <defaults> specifies all the default values for the feed and table entries which will follow. The syntax for the table-level and feed-level defaults is the same. Every optional parameter within the table and feed sections requires an entry within the <defaults> section. See the Default Values section for available defaults. For example, to specify the default for the runOnce value:
    <runOnce value="False" />
  3. <holidays> associates holiday files with names that can be specified in each namespace or feed; a namespace will not be run on a holiday. It can be specified whether or not weekends are considered holidays (the default is "False" if this is not specified).
    <holidayFile name="USHolidays" file="USNYSE.txt" weekends="True" />
  4. One or more namespace entries is used to specify namespaces. Each namespace must include a name and may optionally specify a holiday file on which its feeds should not be run.
    <namespace name="DbInternal" holidays="USHolidays" >
    1. Zero or more additional namespaces can be specified. These are additional namespaces for which the defined operations will be performed when the primary namespace is specified in a namespace list (see the db_dailymerge.xml file details).
      <additionalNamespace name="DbInternalV2" />
    2. One or more feeds should be specified in each namespace. Each feed entry must contain the following elements:
      1. name specifies the name of the feed. This will be used when defining feed lists.
      2. program specifies the fully-qualified class name, which will be loaded and run for this feed. The specified class must include a main method.
      3. args specifies the arguments that will be passed to the specified program. This is optional; if it is blank, then no arguments will be passed to the program.
    3. In addition, each feed allows for the following optional properties. Most of these properties will take their values from the default values if not defined in the <feed> section.
      1. customPropFile indicates a property file to be used. This does not require a default.
      2. holidays can be used to override the namespace-level holiday, if any. This does not require a default.
      3. runOnce, isIntraday, nThreadForFeed, logImportGc, skipSecMaster, importRamGB, verifyRamGB, extraJvmArgs - all of these can be optionally specified. If they are not specified, the values from the <defaults> section will be used.
      4. An example is: <importRamGB>1</importRamGB>
    4. Within each feed, zero or more table elements can be specified. Each table must include the following elements:
      1. name - the name of the table.
      2. validator - a validator class for the table.
    5. Each table element may contain the following optional parameters. Unless otherwise specified, each of these elements must include an entry in the defaults section.
      1. sortColumn - specifies the sortColumn to be passed into the Java program. This is optional and does not have a default.
      2. mergeHeapGB, queryHeapMB, queryTimeoutMS, ignoreFailuresForCleanup, lowHeapUsageMode, isParMerge, nParThread, directMemory, futureLookingDays, logMergeGC, numaNode, safeEmptyMerge
      3. tableDef - specifies the tableDef parameter to be passed into the Java program.
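Putting these pieces together, a minimal db_merge*.xml file might look like the following sketch. The importer and validator class names are hypothetical placeholders, and the element shapes follow the descriptions above rather than a verbatim production file:

<import_merge>
    <defaults>
        <!-- Every optional feed- and table-level parameter needs an entry here;
             see the Default Values section. Only two are shown. -->
        <runOnce value="False" />
        <mergeHeapGB value="8" />
    </defaults>
    <holidays>
        <holidayFile name="USHolidays" file="USNYSE.txt" weekends="True" />
    </holidays>
    <namespace name="DbInternal" holidays="USHolidays" >
        <feed>
            <name>PersistentQueryStateLog</name>
            <!-- com.example.ExampleImporter is a hypothetical importer class -->
            <program>com.example.ExampleImporter</program>
            <args>_NAMESPACE_ _FEEDNAME_ _YYYY_-_MM_-_DD_</args>
            <importRamGB>1</importRamGB>
            <table>
                <name>PersistentQueryStateLog</name>
                <!-- com.example.ExampleValidator is a hypothetical validator class -->
                <validator>com.example.ExampleValidator</validator>
            </table>
        </feed>
    </namespace>
</import_merge>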

Default Values

Two types of default values are provided: table-level defaults, which can be overridden per table, and feed-level defaults, which can be overridden per feed. An example <defaults> section combining both types appears after the two lists below.

Table-Level Default Values

  • mergeHeapGB - specifies the GB of RAM to be used for a merge operation. It should be a positive number.
  • queryHeapMB - specifies the queryHeapMB parameter to be passed into the Java program, which is used when performing operations remotely. It should be a positive number.
  • queryTimeoutMS - specifies the queryTimeoutMS parameter to be passed into the Java program, which is used when performing operations remotely. It should be a positive integer.
  • ignoreFailuresForCleanup - if this is True, then logs are not created for this table. Valid values are "True" and "False".
  • lowHeapUsageMode - specifies the lowHeapUsageMode parameter to be passed into the Java program. A true value allows merges using less physical memory at the cost of speed. Valid values are "True" and "False".
  • isParMerge - if True, specifies that a default number of threads (six) will be used for the operation. Valid values are "True" and "False".
  • nParThread - allows specifications of the exact number of threads for a merge process, overriding the isParMerge value. It should be an integer; values less than 1 are ignored.
  • directMemory - if greater than 0, this sets the data buffer pool to the specified size in gigabytes.
  • futureLookingDays - if greater than 0, adds the validation.futureLookingDays parameter to the validator (for verify_simple and verify_full operations), which may be used in various data validation algorithms. It should be an integer.
  • logMergeGC - if True, generates a detailed log file for the merge operation. Valid values are "True" and "False".
  • numaNode - if this value is greater than zero, the JVM invocation is prefixed with the following command, binding the operation to a specific CPU. This should be an integer.
    numactl --cpunodebind=<numaNode> --membind=<numaNode>
  • safeEmptyMerge - Unused but retained for backwards compatibility.

Feed-Level Default Values

  • runOnce - if multiple dates are specified for the merge or import, the operation normally runs once for each date. runOnce causes the operation to only run one time, not for each day. Valid values are "True" and "False".
  • isIntraday - used in the generation of import logs and threads; an intraday feed is not imported (i.e., import operations are skipped). Valid values are "True" and "False".
  • nThreadForFeed - the maximum number of threads that can be created for an operation, used within the importer or merge class. This should be a positive integer.
  • logImportGc - specifies to create a detailed import log file. Valid values are "True" and "False".
  • importRamGB - specifies the GB of RAM used for import operations. It should be a positive number and will likely need tuning per feed.
  • verifyRamGB - specifies the GB of RAM used for verification operations. It should be a positive number and will likely need tuning per feed.
  • extraJvmArgs - specifies extra JVM arguments to be passed to the Java virtual machine.
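As a concrete illustration, a <defaults> section covering the values described above might look like this; all of the values shown are illustrative assumptions, not recommended settings:

<defaults>
    <!-- Table-level defaults (illustrative values) -->
    <mergeHeapGB value="8" />
    <queryHeapMB value="2048" />
    <queryTimeoutMS value="60000" />
    <ignoreFailuresForCleanup value="False" />
    <lowHeapUsageMode value="False" />
    <isParMerge value="False" />
    <nParThread value="0" />
    <directMemory value="0" />
    <futureLookingDays value="0" />
    <logMergeGC value="False" />
    <numaNode value="0" />
    <safeEmptyMerge value="False" />
    <!-- Feed-level defaults (illustrative values) -->
    <runOnce value="False" />
    <isIntraday value="False" />
    <nThreadForFeed value="1" />
    <logImportGc value="False" />
    <importRamGB value="4" />
    <verifyRamGB value="4" />
    <extraJvmArgs value="" />
</defaults>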

Argument Substitution

In the db_merge XML files, each feed contains an <args> specification, used to pass arguments to the Java class running the operation (for example, CsvFileImporter). These arguments can contain tokens that will be substituted with the appropriate values as follows (a substitution example appears after this list):

  • _NTHREADS_ - the number of threads available for this operation
  • _YYYY_ - the four-digit year
  • _MM_ - the two-digit month (01-12)
  • _DD_ - the two-digit day (01-31)
  • _NAMESPACE_ - the namespace
  • _FEEDNAME_ - the feed name
  • _HOSTNAME_ - the name of the server on which the import is being run
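As a concrete illustration, a hypothetical <args> entry such as:

<args>_NAMESPACE_ _FEEDNAME_ _YYYY_-_MM_-_DD_ _NTHREADS_</args>

would expand, for a four-thread run of the DbInternal namespace's PersistentQueryStateLog feed on 2016-09-26, to the following arguments passed to the program:

DbInternal PersistentQueryStateLog 2016-09-26 4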

Holiday Files

Holiday schedules are defined in files. At the current time, holiday files are maintained by Deephaven.

db_dailymerge*.xml Files

One or more files with the name db_dailymerge*.xml contain the list of namespaces, feeds and modes. Each of these files should be a fully parsable, self-contained XML file, and must contain the following XML elements. The root element must exist, but the name is not important.

  1. namespaces - this section contains a named list of namespace_list elements, each of which contains one or more namespace_name entries. Each namespace_name entry should correspond with a namespace from a db_merge*.xml file. These named lists will be used within modes, which will run for all the namespaces defined in the list.
    <namespace_list name="NAMESPACES_INTERNAL" >
       <namespace_name>DbInternal</namespace_name>
    </namespace_list>
  2. feeds - this section contains a named list of feed_list elements, each of which contains one or more feed names. Each feed_name entry should correspond with a feed from a db_merge*.xml file. These named feed lists will be used within modes.
    <feed_list name="FEEDS_USINTERNAL" >
        <feed_name>PersistentQueryStateLog</feed_name>
        <feed_name>PersistentQueryConfigurationLogV2</feed_name>
    </feed_list>
  3. runners_default - a list of the runners that will be run if no runners are specified in the modes below. See the Runner Operations section for details on allowed runners. For example:
    <runners_default>import,v2merge,verify_simple,cleanup</runners_default>
  4. modes - this section contains a list of modes, which define the namespaces, feeds, and other parameters as follows.
    1. Each mode element must include name and numEntries attributes.
      <mode name="UsInternal" numEntries="1" >
    2. Based on the numEntries value, there must be additional sub-elements named modeEntry<number>. Each of these should define values for the following attributes. When the mode is run, each entry is processed: the specified feeds within the specified namespaces are run using the specified runners.
      1. initial_port - the port to be used for the operation. This is required.
      2. namespaces - a namespace_list name from the namespaces list defined above. This is required.
      3. feeds - a feed_list name from the feeds list defined above. This is required.
      4. date - the date on which to run. Allowed values are currently Today, Yesterday, or LastFriday. This is required.
      5. ifeeds - an optional intraday feed, containing a feed_list name from the feeds list.
      6. runners - an optional list of the runners. If it is not supplied, the runners_default list will be used. See the Runner Operations section for details on allowed runners.
      7. script - an optional script to be run.

For example:

<mode name="UsInternal" numEntries="1" >
<modeEntry1 initial_port="38000" namespaces="NAMESPACES_INTERNAL" feeds="FEEDS_USINTERNAL" date="Yesterday" />
</mode>
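Tying these pieces together, a minimal db_dailymerge*.xml file might look like the following sketch. The root element name (daily_merge) is an arbitrary illustrative choice, since only the existence of a root element is required:

<daily_merge>
    <namespaces>
        <namespace_list name="NAMESPACES_INTERNAL" >
            <namespace_name>DbInternal</namespace_name>
        </namespace_list>
    </namespaces>
    <feeds>
        <feed_list name="FEEDS_USINTERNAL" >
            <feed_name>PersistentQueryStateLog</feed_name>
        </feed_list>
    </feeds>
    <runners_default>import,v2merge,verify_simple,cleanup</runners_default>
    <modes>
        <mode name="UsInternal" numEntries="1" >
            <modeEntry1 initial_port="38000" namespaces="NAMESPACES_INTERNAL" feeds="FEEDS_USINTERNAL" date="Yesterday" />
        </mode>
    </modes>
</daily_merge>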

Runner Operations

The following are the supported runner values for the runners_default element and the runners attribute; an example specification follows the list.

  1. import - perform an import operation using the <program> specified in the feed. If the logImportGc parameter is true, then an additional detailed log will be created for the import.
  2. v2merge - perform a merge using the standard merge class (com.illumon.iris.importers.MergeIntradayData), to move the specified data from the intraday database to the historical database. A standard merge will fail if the destination partition already exists.
  3. v2merge_forced - perform a merge using the standard merge class. A forced merge will overwrite the destination partition if it already exists.
  4. verify_simple - run a data verification using the standard validation class (com.illumon.iris.validation.RunDataQualityTests), using the LOCAL_SIMPLE mode.
  5. verify_full - run a data verification using the standard validation class, using the REMOTE_FULL mode.
  6. cleanup - cleans up intraday data. This removes the intraday data and will usually be run after a successful merge. A standard cleanup operation looks for a successful merge log and does not clean up if it doesn't find one.
  7. cleanup_forced - cleans up intraday data, deleting the specified partition's files regardless of whether a successful merge log is found.

