Appendix A: Import/Merge Scripts
The import and merge scripts are designed to automate the day-to-day data ingestion operations of a production environment. The assumption is that a number of batch data sources need to be repeatedly imported, validated, and merged. Without these scripts, a user would need to manually specify many required parameters and environment variables. The scripts are driven by XML files; see XML Files below.
Environment Variables (Host Configuration File)
Several basic environment variables must be defined for the merge/import scripts to work correctly; this will normally be done in the host configuration file, with a merge_import entry within the $proc case statement (a sketch of a complete entry appears at the end of this list). These environment variables should be defined with the export statement, since they will be used in subshells. They are as follows, with typical values.
JAVA_HOME defines where the Java installation can be found; java should be found under $JAVA_HOME/bin. For example:
export JAVA_HOME=/usr/java/latest-x64
ILLUMON_DEVROOT defines the development root directory. Scripts to be run should be found here, and jars will be added to the classpath.
: ${ILLUMON_DEVROOT:="/db/Users/dbmerge/dev"}
export ILLUMON_DEVROOT
ILLUMON_DBROOT specifies the root directory for database files (not the installation, but the data).
export ILLUMON_DBROOT=/db
ILLUMON_TEMP_PATH defines the directory under which log files will be created, under the user defined by LOGNAME.
export ILLUMON_TEMP_PATH=$ILLUMON_DBROOT/TempFiles
TEMP_PATH_USER is used as a directory for temporary files.
export TEMP_PATH_USER=$ILLUMON_TEMP_PATH/$LOGNAME
DB_RPM_DEPLOYED is used to help determine library paths; in most installations, it should be true. If it is not true, no assumptions are made about the existence of specific directories within the installation, and installation-specific shared libraries will not be available.
export DB_RPM_DEPLOYED=True
ILLUMON_DB_INST_ROOT refers to the directory in which the Deephaven software was installed.
export ILLUMON_DB_INST_ROOT=/usr/illumon/latest
ILLUMON_CONFIG_ROOT refers to the directory in which the customer configuration was installed.
export ILLUMON_CONFIG_ROOT=/etc/sysconfig/illumon.d
STATUS_ROOT will be used to write status files.
export STATUS_ROOT=/db/Users/dbmerge/dailystatus
ILLUMON_PROP_FILES refers to the jar or directory containing the property files to be used for this import or merge; it is added to the classpath used by the import and merge programs to search for resources. If this is not defined, then the resource search will use the standard classpath as generated by the scripts (override directories, database jar files, the database etc directory, and the development root's etc directory).
export ILLUMON_PROP_FILES=\
$ILLUMON_CONFIG_ROOT/java_lib/illumon-config-demo.jar
ILLUMON_CONFIG_FILES_DIR refers to the directory containing the required configuration files.
export ILLUMON_CONFIG_FILES_DIR=$ILLUMON_CONFIG_ROOT/java_lib
ILLUMON_OVERRIDE_DIRS, if defined, contains a colon-delimited list of directories that can contain overrides of the XML configuration files.
export ILLUMON_OVERRIDE_DIRS=\
$ILLUMON_CONFIG_ROOT/override:/db/Users/dbmerge/override
PYTHON_LOGGING defines the level of information that is written to the scripts' log files. Valid values are ERROR, WARNING, INFO, and DEBUG.
export PYTHON_LOGGING=INFO
CONFIGFILE defines the specific property file to be used for merges and imports.
export CONFIGFILE=db-nodename.prop
EXTRA_ARGS defines additional arguments to be passed to the import and merge classes.
export EXTRA_ARGS="-DWAuthenticationClientManager.defaultPrivateKeyFile=/etc/sysconfig/illumon.d/auth/priv-merge.base64.txt"
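Taken together, the merge_import entry in the host configuration file might look like the following sketch. The values are the typical examples from above, and the surrounding case statement layout is an assumption; adjust both for the actual installation.
# Hypothetical merge_import entry in a host configuration file.
case $proc in
    merge_import)
        export JAVA_HOME=/usr/java/latest-x64
        : ${ILLUMON_DEVROOT:="/db/Users/dbmerge/dev"}
        export ILLUMON_DEVROOT
        export ILLUMON_DBROOT=/db
        export ILLUMON_TEMP_PATH=$ILLUMON_DBROOT/TempFiles
        export TEMP_PATH_USER=$ILLUMON_TEMP_PATH/$LOGNAME
        export DB_RPM_DEPLOYED=True
        export ILLUMON_DB_INST_ROOT=/usr/illumon/latest
        export ILLUMON_CONFIG_ROOT=/etc/sysconfig/illumon.d
        export STATUS_ROOT=/db/Users/dbmerge/dailystatus
        export PYTHON_LOGGING=INFO
        export CONFIGFILE=db-nodename.prop
        ;;
esac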
db_merge_import_base.sh
This is the base script that will usually be called to perform a merge/import. It sources the host configuration file, which should define the basic parameters described above, and then calls db_ingest_daily, passing the parameters straight through.
db_merge_import_base.sh <NTHREADS> <MODE> <EMAIL> <Optional DATE>
NTHREADS specifies the maximum number of threads allowed for the operations to be performed. Note: not all import processes support multi-threading.
MODE specifies the mode to be run, from the db_dailymerge.xml file.
EMAIL specifies the email address to which details will be sent.
An optional DATE in the format YYYY-MM-DD overrides the default date, which is otherwise taken from the XML files. The date can also be specified as a start and end date in the format YYYY-MM-DD:YYYY-MM-DD, or as multiple dates separated by commas.
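For example (the script location is assumed to mirror the db_ingest_daily.py example below; the email address is illustrative):
# Run the UsInternal mode with 10 threads, using the default date from the XML files.
$DEVROOT/bin/db_merge_import_base.sh 10 UsInternal [email protected]
# The same run, restricted to an explicit start:end date range.
$DEVROOT/bin/db_merge_import_base.sh 10 UsInternal [email protected] 2016-09-20:2016-09-26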
db_ingest_daily.py
db_ingest_daily.py is the primary script used for nightly merges, and will generally be the one invoked from cron. It will usually be called from db_merge_import_base.sh to ensure it has everything it needs from the host configuration, but if the environment is defined correctly, it can also be run interactively. The parameters are the same as for db_merge_import_base.sh.
db_ingest_daily <NTHREADS> <MODE> <EMAIL> <Optional DATE>
For example:
$DEVROOT/bin/db_ingest_daily.py 10 UsInternal [email protected]
db_ingester_v2.py
db_ingester_v2.py analyzes the db_merge*.xml files and performs the actual merges; a method within this file is called by db_ingest_daily.py, but it can also be called interactively. When called interactively, the parameters that would otherwise come from the db_dailymerge.xml file must be specified manually.
db_ingester_v2.py <NTHREADS> <DATE> [runners=<runners>] [namespaces=<namespaces>] [feeds=<feeds>] [ifeeds=<intradayFeeds>] [logDirSuffix=<suffix>] [initialPort=<port>] [explicitTests=<testNames>] [dateOrderNormal=<boolean>] [emailWarning=<boolean>]
$DEVROOT/bin/db_ingester_v2.py 1 2016-09-26 namespaces=DbInternal feeds=PersistentQueryStateLog ifeeds=PersistentQueryStateLog logDirSuffix=_2016-09-26_DbInternal runners=v2merge,verify_simple,verify_full,cleanup
XML Files
There are several levels of definitions and XML files used to define repeatable merge and import operations.
- The db_merge* XML files define parameters at the table and feed levels.
  - The lowest level is the table, which is defined as the actual table name and its associated parameters.
  - Tables are grouped into feeds. In many cases, there will be a one-to-one correspondence between a table and a feed, but one feed can contain zero to many tables. Zero tables is allowed because a feed may simply run a script. Feeds contain additional parameters.
  - Feeds are grouped into namespaces.
- The db_dailymerge.xml file groups the feeds and namespaces for merge and import operations.
  - Namespaces are grouped into namespace_lists.
  - Feeds are grouped into feed_lists.
  - Modes specify namespace lists and feed lists, which are processed based on the supplied details. These details include which operations to run (e.g., merge, import, validate).
db_merge* XML Files
One or more XML files with the name db_merge*.xml contain the details of the merge/import namespaces, tables, feeds, and various required options. Each of these files should be a fully parsable, self-contained XML file, and requires the following three sections and parameters. The root element must exist, but its name is not important.
- <import_merge> is the root element.
- <defaults> specifies all the default values for the feed and table entries that follow. The syntax for table-level and feed-level defaults is the same. Every optional parameter within the table and feed sections requires an entry within the <defaults> section. See the Default Values section for available defaults. For example, to specify the default for the runOnce value:
<runOnce value="False" />
- <holidays> associates holiday files with names that can be specified in each namespace or feed; a namespace will not be run on a holiday. Whether weekends are considered holidays can also be specified (the default is "False" if not specified). For example:
<holidayFile name="USHolidays" file="USNYSE.txt" weekends="True" />
- One or more namespace entries are used to specify namespaces. Each namespace must include a name and may optionally specify a holiday file on which its feeds should not be run. For example:
<namespace name="DbInternal" holidays="USHolidays" >
- Zero or more additional namespaces can be specified. These are additional namespaces for which the defined operations will be performed when the primary namespace is specified in a namespace list (see the db_dailymerge.xml file details). For example:
<additionalNamespace name="DbInternalV2" />
- One or more feeds should be specified in each namespace. Each feed entry must contain the following elements:
  - name specifies the name of the feed. This will be used when defining feed lists.
  - program specifies the fully-qualified class name, which will be loaded and run for this feed. The specified class must include a main method.
  - args specifies the arguments that will be passed to the specified program. This is optional; if it is blank, no arguments will be passed to the program.
- In addition, each feed allows the following optional properties. Most of these properties will take their values from the default values if not defined in the <feed> section.
  - customPropFile indicates a property file to be used. This does not require a default.
  - holidays can be used to override the namespace-level holiday, if any. This does not require a default.
  - runOnce, isIntraday, nThreadForFeed, logImportGc, skipSecMaster, importRamGBDefault, verifyRamGBDefault, and extraJvmArgs can all be optionally specified. If they are not specified, the values from the <defaults> section will be used. An example is:
<importRamGB>1</importRamGB>
- Within each feed, zero or more table elements can be specified. Each table must include the following elements:
  - name - the name of the table.
  - validator - a validator class for the table.
- Each table element may contain the following optional parameters. Unless otherwise specified, each of these elements must include an entry in the defaults section.
  - sortColumn specifies the sortColumn to be passed into the Java program. This is optional and does not have a default.
  - mergeHeapGB, queryHeapMB, queryTimeoutMS, ignoreFailuresForCleanup, lowHeapUsageMode, isParMerge, nParThread, directMemory, futureLookingDays, logMergeGC, numaNode, and safeEmptyMerge can all be optionally specified; if they are not, the values from the <defaults> section are used.
  - tableDef specifies the tableDef parameter to be passed into the Java program.
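Taken together, a db_merge*.xml file might be laid out as in the following sketch. The nesting follows the descriptions above, but the <feed> and <table> tag names and the placeholder values are assumptions for illustration, not taken from a real feed definition.
<import_merge>
    <defaults>
        <runOnce value="False" />
        <!-- ...an entry for every optional table- and feed-level parameter... -->
    </defaults>
    <holidays>
        <holidayFile name="USHolidays" file="USNYSE.txt" weekends="True" />
    </holidays>
    <namespace name="DbInternal" holidays="USHolidays" >
        <additionalNamespace name="DbInternalV2" />
        <feed>
            <name>PersistentQueryStateLog</name>
            <program><!-- fully-qualified class with a main method --></program>
            <args><!-- arguments passed to the program --></args>
            <importRamGB>1</importRamGB>
            <table>
                <name>PersistentQueryStateLog</name>
                <validator><!-- validator class for the table --></validator>
            </table>
        </feed>
    </namespace>
</import_merge>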
Default Values
Two types of default values are provided. Table-level default values can be changed for every table, while feed-level default values can be changed on a per-feed basis.
Table-Level Default Values
- mergeHeapGB specifies the GB of RAM to be used for a merge operation. It should be a positive number.
- queryHeapMB specifies the queryHeapMB parameter to be passed into the Java program, which is used when performing operations remotely. It should be a positive number.
- queryTimeoutMS specifies the queryTimeoutMS parameter to be passed into the Java program, which is used when performing operations remotely. It should be a positive integer.
- ignoreFailuresForCleanup - if this is True, then logs are not created for this table. Valid values are "True" and "False".
- lowHeapUsageMode specifies the lowHeapUsageMode parameter to be passed into the Java program. A true value allows merges to use less physical memory at the cost of speed. Valid values are "True" and "False".
- isParMerge - if True, specifies that a default number of threads (six) will be used for the operation. Valid values are "True" and "False".
- nParThread allows specification of the exact number of threads for a merge process, overriding the isParMerge value. It should be an integer; values less than 1 are ignored.
- directMemory - if greater than 0, this sets the data buffer pool to the specified size in gigabytes.
- futureLookingDays - if greater than 0, adds the validation.futureLookingDays parameter to the validator (for verify_simple and verify_full operations), which may be used in various data validation algorithms. It should be an integer.
- logMergeGC - if True, generates a detailed log file for the merge operation. Valid values are "True" and "False".
- numaNode - if this value is greater than zero, it adds the following parameters to the JVM call, binding the operation to a specific CPU. It should be an integer.
numactl --cpunodebind=<numaNode> --membind=<numaNode>
- safeEmptyMerge - unused but retained for backwards compatibility.
Feed-level default values
runOnce
- if multiple dates are specified for the merge or import, the operation normally runs once for each date. runOnce causes the operation to only run one time, not for each day. Valid values are "True
" and "False
".isIntraday
- used in the generation of import logs and threads; an intraday feed is not imported (i.e., import operations are skipped). Valid values are "True
" and "False
".nThreadForFeed
- the maximum number of threads that can be created for an operation, used within the importer or merge class. This should be a positive integer.logImportGc
- specifies to create a detailed import log file. Valid values are "True
" and "False
".importRamGB
- specifies the GB of RAM used for import operations. It should be a positive number and is likely to need tuning per feed.verifyRamGB
- specifies the GB of RAM used for verification operations. It should be a positive number and is likely to need tuning per feed.extraJvmArgs
- specified extra JVM arguments to be passed to the Java virtual machine.
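As an illustration, a <defaults> section covering several of the values above might look like this. It follows the value attribute syntax shown for runOnce earlier (applying that syntax to the other parameters is an assumption), and the numbers are arbitrary examples, not recommendations.
<defaults>
    <!-- Table-level defaults -->
    <mergeHeapGB value="4" />
    <queryHeapMB value="1024" />
    <queryTimeoutMS value="60000" />
    <lowHeapUsageMode value="False" />
    <isParMerge value="False" />
    <!-- Feed-level defaults -->
    <runOnce value="False" />
    <isIntraday value="False" />
    <nThreadForFeed value="1" />
    <importRamGB value="2" />
    <verifyRamGB value="2" />
</defaults>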
Argument Substitution
In the db_merge XML files, each feed contains an <args> specification, used to pass arguments to the Java class running the operation (for example, CsvFileImporter). These arguments can contain tokens that will be substituted with the appropriate values, as follows:
| Token | Substituted value |
| --- | --- |
| _NTHREADS_ | The number of threads available for this operation |
| _YYYY_ | The four-digit year |
| _MM_ | The two-digit month (01-12) |
| _DD_ | The two-digit day (01-31) |
| _NAMESPACE_ | The namespace |
| _FEEDNAME_ | The feed name |
| _HOSTNAME_ | The name of the server on which the import is being run |
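For example, a feed's <args> entry might use the tokens as follows; the flag names are hypothetical and depend entirely on what the specified program expects:
<args>-namespace _NAMESPACE_ -feed _FEEDNAME_ -date _YYYY_-_MM_-_DD_ -nthreads _NTHREADS_</args>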
Holiday Files
Holiday schedules are defined in holiday files. At the current time, these files are maintained by Deephaven.
db_dailymerge*.xml Files
One or more files with the name db_dailymerge*.xml contain the list of namespaces, feeds, and modes. Each of these files should be a fully parsable, self-contained XML file, and must contain the following XML elements. The root element must exist, but its name is not important.
- namespaces - this section contains a list of named namespace_list elements, each of which contains one or more namespace_name entries. Each namespace_name entry should correspond with a namespace from a db_merge*.xml file. These named lists will be used within modes, which will run for all the namespaces defined in the list. For example:
<namespace_list name="NAMESPACES_INTERNAL" >
<namespace_name>DbInternal</namespace_name>
</namespace_list>
- feeds - this section contains a list of named feed_list elements, each of which contains one or more feed_name entries. Each feed name should correspond with a feed from a db_merge*.xml file. These named feed lists will be used within modes. For example:
<feed_list name="FEEDS_USINTERNAL" >
<feed_name>PersistentQueryStateLog</feed_name>
<feed_name>PersistentQueryConfigurationLogV2</feed_name>
</feed_list>
- runners_default - a list of the runners that will be run if no runners are specified in the modes below. See the Runner Operations section for details on allowed runners. For example:
<runners_default>import,v2merge,verify_simple,cleanup</runners_default>
- modes - this section contains a list of modes, which define the namespaces, feeds, and other parameters as follows.
  - Each mode element must include name and numEntries tags. For example:
<mode name="UsInternal" numEntries="1" >
  - Based on the numEntries value, there must be additional sub-elements, named modeEntry<number>. When the mode is run, each of these entries will be run: the specified feeds within the specified namespaces will be processed using the specified runners. Each entry should define values for the following tags:
    - initial_port - the port to be used for the operation. This is required.
    - namespaces - a namespace_list name from the namespaces list defined above. This is required.
    - feeds - a feed_list name from the feeds list defined above. This is required.
    - date - the date on which to run. Currently allowed values are Today, Yesterday, and LastFriday. This is required.
    - ifeeds - an optional intraday feed, containing a feed_list name from the feeds list.
    - runners - an optional list of the runners. If it is not supplied, the runners_default list will be used. See the Runner Operations section for details on allowed runners.
    - script - an optional script to be run.
For example:
<mode name="UsInternal" numEntries="1" >
<modeEntry1 initial_port="38000" namespaces="NAMESPACES_INTERNAL" feeds="FEEDS_USINTERNAL" date="Yesterday" />
</mode>
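Assembled into a single file, a minimal db_dailymerge*.xml might look like the following sketch. The wrapper elements (namespaces, feeds, modes) and the root element name are assumptions based on the descriptions above.
<daily_merge>
    <namespaces>
        <namespace_list name="NAMESPACES_INTERNAL" >
            <namespace_name>DbInternal</namespace_name>
        </namespace_list>
    </namespaces>
    <feeds>
        <feed_list name="FEEDS_USINTERNAL" >
            <feed_name>PersistentQueryStateLog</feed_name>
        </feed_list>
    </feeds>
    <runners_default>import,v2merge,verify_simple,cleanup</runners_default>
    <modes>
        <mode name="UsInternal" numEntries="1" >
            <modeEntry1 initial_port="38000" namespaces="NAMESPACES_INTERNAL" feeds="FEEDS_USINTERNAL" date="Yesterday" />
        </mode>
    </modes>
</daily_merge>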
Runner Operations
The following are the supported runner values for the runners_default and runners tags.
- import - perform an import operation using the <program> specified in the feed. If the logImportGc parameter is true, then an additional detailed log will be created for the import.
- v2merge - perform a merge using the standard merge class (com.illumon.iris.importers.MergeIntradayData), to move the specified data from the intraday database to the historical database. A standard merge will fail if the destination partition already exists.
- v2merge_forced - perform a merge using the standard merge class. A forced merge will overwrite the destination partition if it already exists.
- verify_simple - run a data verification using the standard validation class (com.illumon.iris.validation.RunDataQualityTests), using the LOCAL_SIMPLE mode.
- verify_full - run a data verification using the standard validation class, using the REMOTE_FULL mode.
- cleanup - cleans up intraday data. This removes the intraday data and will usually be run after a successful merge. A standard cleanup operation looks for a successful merge log and does not clean up if it doesn't find one.
- cleanup_forced - cleans up intraday data, deleting the specified partition's files regardless of whether a successful merge log is found.
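For example, to re-merge a date whose destination partition already exists and then remove the intraday data unconditionally, a mode entry could override runners_default with the forced variants (the attribute layout follows the mode example above):
<modeEntry1 initial_port="38000" namespaces="NAMESPACES_INTERNAL" feeds="FEEDS_USINTERNAL" date="Yesterday" runners="v2merge_forced,verify_simple,cleanup_forced" />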