Deephaven Operations Guide

Routine Procedures and Operations for Deephaven System Administrators

This document provides procedures for system administrators to meet the SLA requirements for reliable and high-performing Deephaven production deployments.

Deephaven System Overview

A production-deployed Deephaven system comprises various Network, Storage and Application components.

  • The Network services include IP subnets and associated services (DNS, NTP, etc.).
  • The Storage components consist of Intraday data on high-speed local SSD volumes and Historical data on NFS mounts exported from a highly available storage system.
  • The Application components are deployed across multiple nodes, where each node is an x86_64 (64-bit) Linux physical or virtual server.

For more details on deployment architecture, please refer to Scaling to Multiple Servers.

List of Deephaven Services

A Deephaven system comprises the following processes.

Process Name             Run User    Log Prefix
authentication_server    irisadmin   AuthenticationServer
db_acl_write_server      irisadmin   DbAclWriteServer
db_tdcp                  dbquery     TableDataCacheProxy
configuration_server     irisadmin   ConfigurationServer
iris_controller          irisadmin   PersistentQueryController
log_aggregator_service   irisadmin   LogAggregatorService
tailer1                  irisadmin   LogtailerMain
db_dis                   dbmerge     DataImportServer
db_merge_server          dbmerge     RemoteQueryDispatcher
db_ltds                  dbquery     LocalTableDataServer
db_query_server          dbquery     RemoteQueryDispatcher
web_api_service          dbquery     WebServer

Other Processes          Run User   Executable
client_update_service    lighttpd   /sbin/lighttpd
MariaDB                  mysql      /usr/libexec/mysqld

Troubleshooting Procedures

Deephaven is a highly robust and fault-tolerant system. In the event of temporary errors, the system is designed to retry or restart components and recover without much intervention from administrators and operators.

However, even in the most controlled IT environments, failures and incidents do occur. In the event of an outage or non-recoverable error of any component making up the Deephaven system, administrators can perform the basic system checks described in this guide. These include network connectivity checks, firewall rules, configuration file settings, etc.

Administrators should be familiar with the components of the Deephaven system and how to perform basic component checks, including how to stop and start services, view log files and other component diagnostics.

Typical incidents are often caused by file permissions or other system settings that are easily resolved. If you are unable to resolve any incident with basic checks or restarts, the Deephaven Support Team is available via live chat or telephone Monday through Friday, 8 a.m. to 5:30 p.m. Eastern Time.

To improve incident resolution times, please be sure your team knows how to contact Deephaven Support. Getting Help provides detailed instructions on how to contact the Deephaven Support team via email, live chat, telephone or links within Deephaven.

Prerequisites For Performing Admin Tasks

Deephaven Operating System Users

When Deephaven is installed, three Linux users are created, each with the minimal operating system permissions needed to run its respective processes.

irisadmin

OS user for running Deephaven Admin processes

dbquery

OS user for database query (read) processes

dbmerge

OS user for database import and merge (read-write) processes

SSH Terminal Access

Certain administrative tasks described in this guide require SSH terminal access to the Deephaven compute nodes. As an administrator, please ensure you have SSH access to each server node in the Deephaven cluster using your own operating system user and not any of the users created during the installation.

Sudo Access

Many administrative tasks and Deephaven commands will also require sudo for the admin user, e.g., to mount or unmount NFS file shares, edit configuration files, or set folder permissions.

Please update the /etc/sudoers file to ensure the administration user can run all commands on the system.

Deephaven Console Access

Some administrative tasks, such as adding end-users, viewing query logs or performance metrics stored in the database are done using the Deephaven Console Client GUI.

As an administrator, please ensure you have the Deephaven Console installed on your workstation. For more details on installing the Deephaven Console, please refer to Installing the Launcher.

Network Services

This section describes some basic network checks administrators should verify for a healthy cluster.

Basic Network Service Checks

System administrators need to be familiar with common Linux utilities to configure and verify that the Deephaven cluster network and associated services are functioning properly. Please refer to the Linux man pages for details on any Linux commands and services.

  • Check that all servers in the Deephaven cluster have their clocks synchronized using an NTP service.
  • A Deephaven cluster requires high speed (Gigabit) IP network connectivity from clients and between cluster nodes. Tools such as ping, traceroute and iperf can be used to check bandwidth limits and network latency between clients and server nodes.
  • Deephaven processes require connectivity to various ports on cluster nodes. The Deephaven configuration files specify the host names or IP addresses and ports used by the various Deephaven components. Administrators need to check that DNS is working correctly if hostnames or fully qualified domain names are used in config files instead of IP addresses. DNS tools such as ping, dig, nslookup, etc. can be used to verify nodes in the Deephaven cluster are able to communicate with each other.
  • The Deephaven Console Client GUI installed on end-users' workstations also requires IP network connectivity to a range of ports on the server nodes. Administrators need to check network routes and firewall settings to ensure the specified ports are open between server processes and clients. Diagnostic tools such as netstat, netcat and nmap can be used to verify port accessibility between nodes and from clients.
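These checks can be scripted for routine use. The sketch below assumes bash (for its /dev/tcp feature) and verifies DNS resolution and TCP reachability for one host and port; the example call uses localhost and the default query port, so substitute your own cluster values:

```shell
# Verify DNS resolution and TCP reachability for one node:port pair.
check_node() {
  local host=$1 port=$2
  if ! getent hosts "$host" > /dev/null; then
    echo "DNS FAILED: $host"
    return 1
  fi
  # bash's /dev/tcp pseudo-device attempts a TCP connection
  if (echo > "/dev/tcp/$host/$port") 2>/dev/null; then
    echo "OPEN: $host:$port"
  else
    echo "CLOSED: $host:$port"
  fi
}

check_node localhost 22013
```

A cron job can loop this over every node and port in the cluster and alert on any DNS FAILED or CLOSED result.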

Table of Deephaven Process Ports

Deephaven processes listen on various TCP ports and port ranges. All ports are configurable.

TCP ports can be configured in the property files:

/etc/sysconfig/illumon.d/resources/*.prop

The default Deephaven ports and port ranges follow:

TCP Ports

22013, 22012, 8084
  Component: Remote Query Dispatcher (RQD)
  Process: db_query_server
  Properties:
    RemoteQueryDispatcherParameters.queryPort=22013
    RemoteQueryDispatcher.workerPort=22012
    RemoteQueryDispatcher.webPort=8084

30002, 30003, 8085
  Component: Remote Merge Dispatcher (RMD)
  Process: db_merge_server
  Properties:
    RemoteQueryDispatcherParameters.queryPort=30002
    RemoteQueryDispatcher.workerPort=30003
    RemoteQueryDispatcher.webPort=8085

23000-24999
  Component: RQD Workers
  Process: RemoteQueryDispatcher_worker_<number>
  Properties:
    RemoteQueryDispatcher.workerServerPorts=23000-23999
    RemoteQueryDispatcher.workerServerWebsocketPorts=24000-24999

32000-32999, 25000-25999
  Component: RMD Workers
  Process: RemoteQueryDispatcher_worker_<number>
  Properties:
    RemoteQueryDispatcher.workerServerPorts=32000-32999
    RemoteQueryDispatcher.workerServerWebsocketPorts=25000-25999

22021, 22015
  Component: Data Import Server
  Process: db_dis
  Configuration (routing_service.yml):
    dataImportServers:
      db_dis:
        tailerPort: 22021
        tableDataPort: 22015

22020
  Component: Log Aggregator Service
  Process: log_aggregator_service
  Configuration (routing_service.yml):
    logAggregatorServers:
      - log_aggregator_service:
          port: 22020

22014
  Component: Local Table Data Server
  Process: db_ltds
  Configuration (routing_service.yml):
    tableDataServices:
      db_ltds:
        port: 22014

22016
  Component: Table Data Cache Proxy
  Process: db_tdcp
  Configuration (routing_service.yml):
    tableDataServices:
      db_tdcp:
        port: 22016

22023
  Component: Configuration Server (Centralized Schema Service, Data Routing Service and Configuration Service)
  Process: configuration_server
  Property: configuration.server.port=22023

9030, 9031
  Component: User Authentication Server
  Process: authentication_server
  Properties:
    authentication.server.port.plaintext=9030
    authentication.server.port.ssl=9031

9040, 9041
  Component: User Access Control Server
  Process: db_acl_write_server
  Properties:
    dbaclwriter.port=9040
    dbaclwriter.ssl.port=9041

20126
  Component: Persistent Query Controller
  Process: iris_controller
  Property: PersistentQueryController.port=20126

22021 (outbound)
  Component: Log Tailer
  Process: tailer1..tailerN
  Configuration: determined by the db_dis entry above

80/443
  Component: Client Update Service
  Process: client_update_service
  Configuration (client-update-service.conf):
    server.port = 80

8123
  Component: Web API Service
  Process: web_api_service
  Property: Webapi.server.port=8123

3306
  Component: MySQL (MariaDB) Database
  Process: mariadb_server
  Configuration (/etc/my.cnf):
    port=3306 # default

2812
  Component: M/Monit Daemon
  Process: monit
  Configuration (/etc/monitrc):
    set httpd port 2812
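To audit which ports are actually configured on a host, the property files named above can be grepped. A minimal sketch; the pattern is a loose heuristic and will not catch ports set in routing_service.yml:

```shell
# Print unique port-related property assignments from Deephaven .prop files.
list_ports() {
  grep -rhoE '[A-Za-z._]*[Pp]ort[A-Za-z]*=[0-9-]+' "$@" 2>/dev/null | sort -u
}

list_ports /etc/sysconfig/illumon.d/resources/*.prop
```

Comparing this output against the table above is a quick way to spot non-default port assignments before diagnosing firewall issues.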

Startup/Shutdown of Deephaven Processes

All Deephaven processes are started and stopped with Monit. Monit is a utility for managing and monitoring processes, programs, files, directories and filesystems on a Unix system. (Refer to https://mmonit.com/monit for more information.) 

Monit Service

The Monit service itself can be checked with the following command:

sudo systemctl status monit

If Monit is not running, it can be started with the following command:

sudo systemctl start monit

To ensure Monit starts up whenever the system restarts, use the following command:

sudo systemctl enable monit

All Monit configuration files for the Deephaven processes are located in:

/etc/sysconfig/illumon.d/monit

Deephaven Services

If any of the Deephaven processes terminates unexpectedly, Monit will restart the process automatically.

You can check which processes are running with the following Monit command:

sudo monit summary

You can check the status of all processes with the following Monit command:

sudo monit status

You can check the status of individual processes with the following Monit command:

sudo monit status <process name>

For example:

sudo monit status iris_controller
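Monit output can also feed simple alerting scripts. The sketch below extracts the names of non-running processes from monit summary output; the line format parsed here is an assumption based on Monit 5.x and may differ in your version:

```shell
# Print names of monit-managed processes whose status is not "Running".
# Assumes summary lines like: Process 'iris_controller'    Running
not_running() {
  awk -F"'" '/^Process/ {
    name = $2
    status = $3
    gsub(/^[ \t]+|[ \t]+$/, "", status)   # trim surrounding whitespace
    if (status != "Running") print name
  }'
}

# Example with canned input; on a live host: sudo monit summary | not_running
printf "Process 'iris_controller'    Running\nProcess 'db_dis'    Does not exist\n" | not_running
# prints: db_dis
```

Any output from this pipeline is a candidate for an alert, since Monit should normally bring failed processes back by itself.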

Starting and Stopping Deephaven Services

If a configuration file has been updated, the associated Deephaven processes typically need to be restarted for the changes to take effect. One exception is the Deephaven Controller Process, which allows various properties to be edited without a restart.

Starting/Stopping Deephaven Processes

When a configuration file has been updated that requires a restart of the associated Deephaven processes, use the following commands:

To stop all the configured Deephaven processes, use the following command:

sudo monit stop all

Alternatively, individual Deephaven processes can be stopped and started with the following:

sudo monit stop <process name>
sudo monit start <process name>

For example:

sudo monit stop authentication_server
sudo monit start authentication_server

See also: List of Deephaven Services

Configuration Files

Configuration-as-Code

Configuration-as-Code is an industry best practice of storing application configuration as code within the source code repository. It is an important prerequisite for reproducible and auditable configurations, and it also makes configurations testable and easier to reuse.

This is especially important with distributed systems. Maintaining multiple services across a network of distributed systems, and keeping all their settings and configurations in sync is very hard to achieve when you have to manually configure everything by clicking through UI configuration screens.

It is therefore recommended to add the Deephaven configuration files to a version control system (VCS) such as Git. This allows updates to configuration files to be tested in a separate environment before deploying any configuration changes to your production Deephaven hosts.
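As a concrete sketch of this practice, the configuration tree can be placed under git; the directory, commit message, and committer identity below are illustrative, and on a real host the commands need sufficient privileges:

```shell
# Initialize a git repository over a configuration directory and commit a baseline.
baseline_config() {
  local dir=$1
  git -C "$dir" init -q
  git -C "$dir" add -A
  git -C "$dir" -c user.email=ops@example.invalid -c user.name=ops \
      commit -qm "baseline Deephaven configuration"
}

# On a Deephaven host, e.g.: baseline_config /etc/sysconfig/illumon.d
```

From there, git diff shows exactly what changed before a restart, and git log provides the audit trail.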

Deephaven Configuration Files

The Deephaven configuration files are installed on the Deephaven server in several subdirectories under:

/etc/sysconfig/deephaven/

SymLinks

/etc/sysconfig/illumon -> /etc/sysconfig/illumon.confs/illumon.iris.hostconfig

/etc/sysconfig/illumon.confs -> /etc/sysconfig/deephaven/illumon.confs.latest

/etc/sysconfig/illumon.d -> /etc/sysconfig/deephaven/illumon.d.latest

The following section provides a brief summary of the configuration files.

M/Monit Configurations

/etc/sysconfig/illumon.d/monit/*.conf

Cron Configuration

/etc/cron.d/illumonRestart.cron

Deephaven Process Configurations

/etc/sysconfig/illumon.d/resources/*.prop

Authentication Configurations

/etc/sysconfig/illumon.d/auth/*

/etc/sysconfig/illumon.d/trust/*

Schema Configurations

/etc/sysconfig/illumon.d/schema/*/*.schema

Editing Configuration Files

Deephaven can be customized by editing the configuration files. Property files (.prop) are plain text and can be edited with any local text editor such as vi or emacs. When a configuration file is changed, the affected Deephaven processes need to be restarted.

Editing Property Files

Property files are stored in etcd.

  1. Export the property file from etcd
  2. Edit the text file locally
  3. Import the edited file into etcd

Edit iris-endpoints.prop

$ sudo -u irisadmin /usr/illumon/latest/bin/dhconfig --export iris-endpoints.prop

$ edit iris-endpoints.prop

$ sudo -u irisadmin /usr/illumon/latest/bin/dhconfig --import iris-endpoints.prop

Edit iris-environment.prop

$ sudo -u irisadmin /usr/illumon/latest/bin/dhconfig --export iris-environment.prop

$ edit iris-environment.prop

$ sudo -u irisadmin /usr/illumon/latest/bin/dhconfig --import iris-environment.prop

Truststore Configuration File Permissions

The truststore files are located in: /etc/sysconfig/illumon.d/trust/

File owner and group should match the following table:

File Name            Owner      Group      Permissions
truststore-iris.p12  irisadmin  irisadmin  -r--r--r--
truststore-iris.pem  irisadmin  irisadmin  -r--r--r--

Keystore and Authentication Configuration File Permissions

The authentication configuration files require more restrictive permissions.

These files are located in: /etc/sysconfig/illumon.d/auth/

File owner and group should match the following table:

Auth File Name                     Owner      Group      Permissions
keystore.authserver.p12            irisadmin  irisadmin  -r--------
keystore.configuration_server.p12  irisadmin  irisadmin  -r--------
webServices-keystore.p12           irisadmin  irisadmin  -r--------
priv-iris.base64.txt               irisadmin  irisadmin  -r--------
priv-merge.base64.txt              dbmerge    dbmerge    -r--------
priv-superuser.base64.txt          irisadmin  irisadmin  -r--------
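The ownership and permission requirements in the tables above can be verified with a short script; a minimal sketch using GNU stat, with expected values written as owner:group:octal-mode (-r-------- is 400, -r--r--r-- is 444):

```shell
# Compare a file's owner, group, and mode against an expected "owner:group:mode" spec.
check_perm() {
  local file=$1 expected=$2
  local actual
  actual=$(stat -c '%U:%G:%a' "$file") || return 1
  if [ "$actual" = "$expected" ]; then
    echo "OK $file"
  else
    echo "MISMATCH $file: $actual (expected $expected)"
  fi
}

# On a Deephaven host, e.g.:
#   check_perm /etc/sysconfig/illumon.d/auth/keystore.authserver.p12 irisadmin:irisadmin:400
#   check_perm /etc/sysconfig/illumon.d/trust/truststore-iris.p12 irisadmin:irisadmin:444
```

Running this over both directories after an upgrade catches the permission problems that, as noted earlier, cause many typical incidents.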

Log Files

Some Deephaven processes record events, errors and other application messages in their process log files. These are plain text files that can be viewed or searched with grep, tail and other text processing utilities.

Log Search Term Definitions

General

  • FATAL - Any trace logged at log level "FATAL" indicates that the process is beginning to shut down due to an error.
  • ERROR - The rate of "ERROR" level trace may be interesting, but many traces logged at this level are not critical.

Lifecycle Events

  • <process name> INITIALIZING - The process has begun its initialization.
  • <process name> RUNNING - Initialization has completed, and the process is available.
  • <process name> ALIVE - The process continues to be available; logged on a configurable cycle.
  • <process name> SHUTTING_DOWN - The process is beginning to shut down.

Process Specific

For db_dis:

  • Rejecting and closing channel - This should occur seldom, if at all. This message generally points to a configuration issue to which DBAs should be alerted.
  • DataImportStreamProcessor-.+WARN (regex) - A high rate of warning trace messages from this component (other than during tailer restarts) may point to a system issue.
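The terms above translate directly into a periodic grep. A minimal sketch; the example path follows the log layout described later in this guide:

```shell
# Scan a Deephaven process log for fatal, shutdown, and known-problem messages.
scan_log() {
  grep -nE 'FATAL|SHUTTING_DOWN|Rejecting and closing channel' "$1"
}

# On a Deephaven host, e.g.:
#   scan_log /var/log/deephaven/dis/DataImportServer.log.current
```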

Startup Script Log Files

These capture the stdout and stderr streams from Deephaven production processes.

This file appears in the process’s log directory, with name format process_name.log.yyyy-MM-dd, where the date is the process startup date.

This file is not rotated. Be careful with automated log cleanup for long-running processes to ensure this file is not deleted while still in use.

Application Log Files

By default, application log files for Deephaven processes are written to the following locations:

/var/log/deephaven/<Application Log Directory>/<Log Prefix>.log.current

where the Application Log Directory and Log Prefix are different for each application.

For example, to view the current log file for the Authentication Server:

cat /var/log/deephaven/authentication_server/AuthenticationServer.log.current

These log locations are defined by properties in the Deephaven property files and can be changed. The root logging directory is defined by the logroot property, which has the following default value:

logroot=/var/log/deephaven

Each process can define its own log directory using the form logDir.<process name>=<log directory>; the <logroot> token is substituted with the value of the logroot property. For example, the following property defines the authentication server's log directory:

logDir.authentication_server=<logroot>/authentication_server

The following property defines a default log directory for process names which don't have their own definitions:

defaultLogDir=<logroot>/misc

The location can be changed by passing the logDir parameter into the JVM when a process is started. For example, passing the following parameter to a process changes the application log file location to the defined value:

-DlogDir=/db/TempFiles/logs

A new log file is created every 15 minutes and the .current file is a hard link to the current log file. The initial log file will have a timestamp matching the time the process started and subsequent log files will have the 15-minute timestamp interval.

The actual log file name has a timestamp suffix. For example:

AuthenticationServer.log.2018-01-30-170000.000+0000

For example, the following files were created over a one-hour period:

AuthenticationServer.log.2018-01-30-170000.000+0000
AuthenticationServer.log.2018-01-30-171500.000+0000
AuthenticationServer.log.2018-01-30-173000.000+0000
AuthenticationServer.log.2018-01-30-174500.000+0000

To follow (tail) the current log file in a terminal, use tail command with -F, so it will retry when the new file is created at the 15-minute interval. For example:

cd /var/log/deephaven/authentication_server
tail -F AuthenticationServer.log.current

Query Worker Log Samples

Deephaven workers do not produce the files described in this section. Worker logs are sent to the Process Event Log table directly or via the RemoteQueryDispatcher, depending on the type of log event.

This has implications for monitoring system processes running as workers. Such monitoring must either be done by querying the process event log table, or by consuming binary logs that are written by the LogAggregatorService as an intermediate step.

By default, Query Workers are configured to write logs via the LogAggregationService and DataImportServer to the DbInternal/ProcessEventLog table in Deephaven.

To have Query Workers write their logs to plain text files instead, the following properties can be set in /etc/sysconfig/illumon.d/resources/iris-common.prop:

RemoteQueryDispatcher.writeDatabaseProcessLogs=false
RemoteQueryProcessor.sendLogsToSystemOut=true

The following sample Query Worker log entries illustrate what might be monitored. Note that monitoring worker logs is generally not advisable, because much of the content may result from user code in the query itself. However, monitoring for the expected output from well-defined queries does make sense.

[2018-02-20T23:04:19.476898-0500] - WORKER_1 - ConnectionMonitor: registering monitored connection job:1744982467/AuthServer_Client/10.128.0.38:42466->10.128.0.38:9030/CommandConnection
[2018-02-20T23:04:19.484982-0500] - WORKER_1 - Successful authentication worker8.internal/10.128.0.38 using delegate token {-1691233174684810865, service
: DelegatedAuthentication, origin: worker8.internal/10.128.0.38, tokenContext: {iris}}
[2018-02-20T23:04:19.486329-0500] - WORKER_1 - Serializing result of type java.lang.Boolean
[2018-02-20T23:04:19.486838-0500] - WORKER_1 - Sending serialized result, length=47 bytes

Java Garbage Collector Logs

Deephaven Java processes are configured to record JVM Garbage Collection (GC) statistics. For query workers, the GC statistics are recorded to the Process Event Logs. Other processes record GC statistics to the logs directory. For example, the Data Import Server's GC log can be found at:

/var/log/deephaven/dis/db_dis.log.gc_detailed.<datetime>

where <datetime> is: <YYYY-MM-DD_HH-MM-SS>

For example:

/var/log/deephaven/dis/db_dis.log.gc_detailed.2018-02-01_13-00-42
/var/log/deephaven/dis/db_dis.log.gc_detailed.2018-02-01_20-10-06
/var/log/deephaven/dis/db_dis.log.gc_detailed.2018-02-05_10-23-15
/var/log/deephaven/dis/db_dis.log.gc_detailed.2018-02-06_18-38-27

The GC logs can be very useful to understand JVM memory usage or diagnose any memory problems.

GC Log Patterns

With default Deephaven JVM start-up parameters, there are four GC log events that should be monitored:

  1. Parallel young generation collections that occur very frequently (more often than once per 10 seconds), or with significant real time duration (longer than 0.2 seconds):

    [GC (Allocation Failure) 2018-02-21T04:28:31.559+0000: 14.240: [ParNew: 27679K->16000K(144384K), 0.0226034 secs]2018-02-21T04:28:31.582+0000: 14.263: [CMS: 143620K->141747K(320896K), 0.4811685 secs] 168566K->141747K(465280K), [Metaspace: 36101K->36101K(1083392K)], 0.5040269 secs] [Times: user=0.55 sys=0.01, real=0.50 secs]

  2. Full GCs, which usually result from failures of the concurrent garbage collector or other near out-of-memory conditions:

    2018-02-21T04:31:33.822+0000: 196.503: [Full GC (System.gc()) 2018-02-21T04:31:33.822+0000: 196.503: [CMS: 532358K->532883K(707840K), 0.6859044 secs] 566077K->532883K(852352K), [Metaspace: 36261K->36261K(1083392K)], 0.6864226 secs] [Times: user=0.70 sys=0.02, real=0.68 secs]

    Note that in this case, "System.gc()" indicates the collection was invoked deliberately by a user, although this is not guaranteed, since Java's built-in libraries sometimes use the same mechanism.
  3. Concurrent mode failures in the old generation collector:

    (concurrent mode failure): 658018K->180450K(672472K), 0.5520093 secs] 786112K->180450K(816856K), [Metaspace: 36011K->36011K(1083392K)], 0.5647268 secs] [Times: user=0.69 sys=0.03, real=0.57 secs]

  4. It may also be worth monitoring for frequency of old generation collections (CMS) to detect cases when a process is using more CPU than needed because it needs a larger heap:

    2018-02-21T04:36:27.709+0000: 76.357: [GC (CMS Initial Mark) [1 CMS-initial-mark: 219607K(601812K)] 258670K(872660K), 0.0010968 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]

    You should be notified if any non-worker process is starting CMS more often than once per minute.
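The patterns above can be tallied mechanically for alerting; a minimal sketch that counts the alarming events in one GC log file (rate thresholds, such as more than one CMS initial mark per minute, would be applied by the monitoring system):

```shell
# Tally GC events worth alerting on in a detailed GC log file.
gc_alerts() {
  printf 'full_gc=%s\n' "$(grep -c 'Full GC' "$1")"
  printf 'concurrent_mode_failure=%s\n' "$(grep -c 'concurrent mode failure' "$1")"
  printf 'cms_initial_mark=%s\n' "$(grep -c 'CMS Initial Mark' "$1")"
}

# On a Deephaven host, e.g.:
#   gc_alerts /var/log/deephaven/dis/db_dis.log.gc_detailed.2018-02-01_13-00-42
```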

Stats Driver Logs

If the Stats Driver property is enabled for a given Deephaven process, various Deephaven process statistics are recorded in the logs directory. 

For example, the Data Import Server's stats log can be found at:

/var/log/deephaven/dis/db_dis.log.stats.current

The <file>.stats.current file is a hard link to the current stats file. A new stats file is created every hour.

These stats files are intended for debugging, together with the following utility script:

/usr/illumon/latest/bin/stats

  • Allows users to select individual columns of the stat logs and pipe them to stdout.
  • Allows users to order and filter the stat logs columns output to meet their needs.
  • Creates the output in CSV format to be used in monitoring applications.

The Stats Driver property is enabled by default for Deephaven processes. To turn off statistics, disable the property in the process configuration file as follows:

statsdriver.enabled=false

The application statistics log contains entries of the performance metrics kept for each process. These are recorded on disk in CSV format for easy import into a multitude of external performance monitoring tools. As noted under the Deephaven Performance section, the statistics are also available in the DbInternal database.

The entries are in CSV format and have the following headers: Stat,IntervalName,NowSec,NowString,AppNowSec,AppNowString,TypeTag,Name,N,Sum,Last,Min,Max,Avg,Sum2,Stdev

This data should be imported into a query-able monitoring tool to write alarms against.
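Given the header above, individual metrics can also be sliced with standard tools; a sketch that prints the Name and Avg columns, with field positions taken from the documented header order:

```shell
# Print "Name Avg" pairs from a Deephaven stats CSV file.
# Name is field 8 and Avg is field 14 per the documented header order.
stats_name_avg() {
  awk -F',' 'NR > 1 { print $8, $14 }' "$1"
}

# On a Deephaven host, e.g.:
#   stats_name_avg /var/log/deephaven/dis/db_dis.log.stats.current
```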

Binary Log Files

Several Deephaven processes write binary log files, which contain data that will be written to tables in the DbInternal namespace. The locations of these log files are driven by properties similar to the application log files, including the logroot property.

When writing a binary log file, the application will determine the location using a three-tiered search, using the namespace and table name; the most specific result found is used as the location.

binaryLogDir.<namespace>.<table name>
binaryLogDir.<namespace>
binaryLogDir

If no location is defined for the table or namespace, the location defined by the binaryLogDir property will be used.

The default location for binary logs uses the same log root as the application logs (/var/log/deephaven), but defines a different subdirectory:

binaryLogDir=<logroot>/binlogs

Process event logs (see below) have their own subdirectory:

binaryLogDir.DbInternal.ProcessEventLog=<logroot>/binlogs/pel

Performance logs have a different subdirectory.

binaryLogDir.DbInternal.QueryOperationPerformanceLog=<logroot>/binlogs/perflogs
binaryLogDir.DbInternal.QueryPerformanceLog=<logroot>/binlogs/perflogs
binaryLogDir.DbInternal.UpdatePerformanceLog=<logroot>/binlogs/perflogs

Log File Maintenance

Customers should create a job to archive these logs based on their own internal retention policies. Deephaven recommends retaining at least a few weeks of logs in case issue troubleshooting is needed.
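As a sketch of such a retention job, a cron-driven find can compress and then expire dated log files. The root directory, age thresholds, and filename pattern below are placeholders to adapt to your policy; note that plain gzip refuses to compress a file with a second hard link, which protects the .current link targets described above:

```shell
# Compress dated logs older than $2 days; delete archives older than $3 days.
rotate_logs() {
  local root=$1 compress_days=$2 retain_days=$3
  # gzip skips files that have another hard link (e.g., a .current link target)
  find "$root" -type f -name '*.log.2*' ! -name '*.gz' \
       -mtime +"$compress_days" -exec gzip {} \;
  find "$root" -type f -name '*.gz' -mtime +"$retain_days" -delete
}

# From cron on a Deephaven host, e.g.:
#   rotate_logs /var/log/deephaven 7 30
```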

Deephaven Database Event Logs

Some Deephaven components can be configured to log their events to Deephaven database tables instead of text files. This dramatically improves the ability to search through logs and find related events, especially for query workers, where a user would otherwise need access to the server running the worker to view its log. The full power of the Deephaven query language is available for these table-stored event logs, instead of only basic text processing utilities like grep, awk, and sed.

For example, if you want to see the logs for worker_9, you can use the following query to view them. (Note: You will need appropriate permissions to do so.)

t=db.i("DbInternal","ProcessEventLog").where("Date=currentDateNy()","Process=`worker_9`")

Process Event Logs

Table Namespace: DbInternal

Table Name:

ProcessEventLog

Description:

Stores log records for configured Deephaven processes. Rows include the log entry, log level, date, time, etc. Non-privileged users can only see records for which the EffectiveUser matches their username.

Sample Queries

// Show all events for today
today = new Date().format("yyyy-MM-dd")
t=db.i("DbInternal", "ProcessEventLog").where("Date=`" + today + "`")

// Show all ERROR Level events for today
t=db.i("DbInternal", "ProcessEventLog").where("Date=`" + currentDateNy() + "`").where("Level=`ERROR`")

For more details on the ProcessEventLog table, please refer to DbInternal tables.

Audit Event Logs

Table Namespace: DbInternal

Table Name:

AuditEventLog

Description:

Captures all Audit events such as user login attempts, process startup events, etc. Non-privileged users can only see records for which the EffectiveUser matches their user name.

Sample Queries

// Show all Audit events for today
today = new Date().format("yyyy-MM-dd")
t=db.i("DbInternal",  "AuditEventLog").where("Date=`" + today + "`")

For more details on the AuditEventLog table, please refer to DbInternal Tables.

Viewing Binary Log Files with iriscat

Because Deephaven binary log files are not plain text, an alternative way of viewing their records is the iriscat utility. This can come in handy when system errors prevent you from querying the database using the Deephaven Console GUI client.

To view binary files using iriscat, first make sure that JAVA_HOME is set:

export JAVA_HOME=/usr/java/latest

Once JAVA_HOME is set, you can run iriscat using this command:

/usr/illumon/latest/bin/iriscat <path to binary file>

For example:

/usr/illumon/latest/bin/iriscat
/var/log/deephaven/binlogs/test.file.bin.2018-02-02.141436.113-0800

This will display the contents of the file in CSV format. To view the contents as formatted text instead, use the -l flag. For example:

/usr/illumon/latest/bin/iriscat -l <path to binary file>

Here is the full list of options for the iriscat utility:

iriscat Option   Description
-C <columns>     Names of columns to display, comma-separated
-d               Show debugging information (row flags, column metadata)
-e <endRow>      Index of the last row displayed
-F <format>      Format date time columns, comma-separated (e.g., "Column=HH:mm:ss.SSS", or "Column" for ISO8601 microsecond)
-l               Format output as if it were a log file (useful for ProcessEventLog)
-s <startRow>    Index of the first row displayed
-T <timeZone>    Time zone for time columns (e.g. DST or CST)

Contributing Applications, Daemons and Services

Deephaven includes a few third-party components. For example, Monit has already been discussed in the Starting/Stopping Deephaven Services section.

These third-party components are installed on the system as prerequisites prior to performing the Deephaven installation. Examples of these components include lighttpd, python, and mariadb/MySQL. Please refer to the Deephaven Server Installation Guide for a complete list of third-party Deephaven package dependencies.

All of the license agreements, maintenance, troubleshooting and run-books for these external contributing applications, daemons and services are described in their respective and official online documentation.

Security and Access Control

Each Deephaven end-user requires a unique private key, or username and password. Users are created in the Deephaven Console.  Only users with Deephaven admin rights can create new users.

To access the ACL Editor from the Deephaven Console, click the Advanced button and select ACL Editor. (Note: A user must be a member of the acl-editors group to view or open the ACL Editor.)

Enter a new panel title if desired, or click OK to accept the default.

The ACL DB Editor can then be used to add or update Users, Groups and Table permissions.

Please refer to Permissions for additional information about the ACL Editor.

MySQL vs. LDAP

By default, Deephaven user permissions are stored in a local MariaDB (MySQL) database. Alternatively, the Authentication Server allows the use of Lightweight Directory Access Protocol (LDAP) to validate usernames and passwords. This is convenient for customers who already have users in a directory server such as Microsoft Active Directory.

Please refer to the LDAP Configuration Guide for instructions on how to configure Deephaven to use LDAP instead of MariaDB (MySQL).

Metrics and Monitoring

Understanding the state of your Deephaven deployment is essential for ensuring the reliability and stability of the system. Metrics provide the raw data measurements of system usage for an external monitoring system that can gather and visualize metrics, and send out alerts.

Monitoring

Monitoring is the process of collecting, aggregating and analyzing metrics. Raw metric data from various parts of the system should be collected into a monitoring system such as Grafana, DataDog, New Relic, etc.

The monitoring system is responsible for aggregation, visualization, and initiating alerts when the values meet certain thresholds.

Deephaven Process Monitoring

Basic monitoring for all Deephaven processes can be implemented by examining each server's process list for the Deephaven processes listed above, or by using Monit output to see process health. In general, Monit will attempt to restart any failed processes until successful.

Alarms should be set up for any process that is not running when it should be; combine these alarms with examination of the application logs to determine the root cause.

Deephaven Performance and Event Information

The Deephaven DbInternal namespace provides several tables containing performance and event data that can be used for monitoring purposes. Queries can be written to analyze Deephaven's internal state and performance by using these tables. The performance records are recorded on disk in the binary logs listed below and are then processed into the database. Although some of the Deephaven processes do not run under the irisadmin account (for example the RemoteQueryDispatcher and worker processes), their binary logs are still written to the irisadmin directory as they are sent through the Log Aggregator Service.

Binary Logs

Each entry below lists the table name, the processes that write it, and the binary log name pattern.

  • AuditEventLog - written by AuthenticationServer, DbAclWriteServer, PersistentQueryController, RemoteQueryDispatcher, and Workers (RemoteQueryProcessor):
    /var/log/deephaven/binlogs/DbInternal.AuditEventLog.System.<internal partition>.<date>.bin.<date.timestamp>
  • PersistentQueryConfigurationLogV2 - written by PersistentQueryController:
    /var/log/deephaven/binlogs/DbInternal.PersistentQueryConfigurationLogV2.System.<internal partition>.<date>.bin.<date.timestamp>
  • PersistentQueryStateLog - written by PersistentQueryController:
    /var/log/deephaven/binlogs/DbInternal.PersistentQueryStateLog.System.<internal partition>.<date>.bin.<date.timestamp>
  • ProcessEventLog - written by RemoteQueryDispatcher and Workers (RemoteQueryProcessor):
    /var/log/deephaven/binlogs/pel/DbInternal.ProcessEventLog.System.<internal partition>.<date>.bin.<date.timestamp>
  • QueryOperationPerformanceLog - written by Workers (RemoteQueryProcessor):
    /var/log/deephaven/binlogs/perflogs/DbInternal.QueryOperationPerformanceLog.System.<internal partition>.<date>.bin.<date.timestamp>
  • QueryPerformanceLog - written by Workers (RemoteQueryProcessor):
    /var/log/deephaven/binlogs/perflogs/DbInternal.QueryPerformanceLog.System.<internal partition>.<date>.bin.<date.timestamp>
  • UpdatePerformanceLog - written by Workers (RemoteQueryProcessor):
    /var/log/deephaven/binlogs/perflogs/DbInternal.UpdatePerformanceLog.System.<internal partition>.<date>.bin.<date.timestamp>
  • WorkspaceData - written by the WorkspaceHelper worker (RemoteQueryProcessor):
    /var/log/deephaven/binlogs/DbInternal.WorkspaceData.System.<internal partition>.<date>.bin.<date.timestamp>

Log-Related Properties

Logging of the main processes is handled through a standard set of properties. Each property is defined as <property prefix>.<property name suffix>. The property prefixes enable default values to be supplied for all processes, with overrides for individual processes. Each property can be defined multiple times; when a process starts, the most specific matching value is found and used. Prefixes are defined and searched in the following order:

  1. Process name - the name of the process, as defined by the process.name property passed to the process. For example, the query server is usually named db_query_server. This is the most specific level at which a property can be defined.
  2. Main class name - the main class that is used for the process, defined by the call to the JVM. For example, the query server runs the RemoteQueryDispatcher class.
  3. IrisLogDefaults - if a property is not found with the process name or main class name prefix, the prefix IrisLogDefaults will be used.

An example of searching for a property is given after the list of properties presented below.

The following is a list of all the properties that can be defined for the processes. The default value is in iris-defaults.prop and can be changed in any configuration by updating the default configuration file.

Property Name Suffix

Explanation

Default Value

aliveMessageSeconds

If defined to a value greater than zero, an alive message will be written to the event logs periodically, with the interval between messages defined (in seconds) by this value.

0 (no keepalive message)

binaryLogTimeZone

Defines a time zone ID to be used in the creation of binary log filenames.

<none - uses the system's default time zone>

captureLog4j

Defines whether to capture Log4j output and send it to the text logs, including the ProcessEventLog if it is configured for the process. Some third-party libraries write logs to Log4j, so it is normally useful to capture this output.

true

captureSyserr

Defines whether or not to capture system error writes from the process and write it to the event logs, including the ProcessEventLog if it is configured for the process.

false

captureSysout

Defines whether or not to capture system output writes from the process and write it to the event logs, including the ProcessEventLog if it is configured for the process.

false

logLevel

Defines the level of messages that are written to the event logs. This applies to both the text logs and the ProcessEventLog for each process. See below for a list of log levels.

INFO

useLogAggregatorService

Whether to write audit event logs and process event logs through the log aggregator service. This must be set to true if multiple instances of the same class will run on a server.

false, except for the query workers and remote query dispatcher

useMainClassNameForLogs

If this is true, the class name is used when writing log files; otherwise the process name is used. While the class name is usually sufficient, setting this to false can be useful when the same class runs for multiple processes, such as the RemoteQueryDispatcher, which is used for both the db_query_server and db_merge_server processes.

true

writeDatabaseAuditLogs

Defines whether or not to write audit events to Deephaven. A standard set of audit events is provided; see the Standard Log Messages section below.

false, except for the authentication server, Deephaven controller, ACL write server, query workers and remote query dispatcher

writeDatabaseProcessLogs

Defines whether or not to write process events to Deephaven. A standard set of audit events is provided; see the Standard Log Messages section below.

false, except for the query workers and remote query dispatcher

For example, the query server will look at the following properties to determine whether to write database audit logs. Once a matching property is found, that value is used. This allows the property to be defined at the individual process level (e.g., db_query_server vs. db_merge_server), at the class level (in this case the RemoteQueryDispatcher class runs both the query and merge servers), and at a default level.

  • db_query_server.writeDatabaseAuditLogs
  • RemoteQueryDispatcher.writeDatabaseAuditLogs
  • IrisLogDefaults.writeDatabaseAuditLogs
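As a hedged illustration of this layering (the property names come from the list above; the values and the choice of file are hypothetical), a configuration might disable audit logging by default while enabling it for the query server:

```
# Hypothetical prop-file excerpt: the IrisLogDefaults default is overridden
# at the more specific process-name level for db_query_server.
IrisLogDefaults.writeDatabaseAuditLogs=false
db_query_server.writeDatabaseAuditLogs=true
```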

By default, all binary logs are written to the binlogs subdirectory under the root directory, usually resolving to /var/log/deephaven/binlogs. See Log Files for details on changing this location.

Log Levels

The following log levels can be defined in the logLevel property:

  • FATAL - Any message logged at this log level should indicate that the process is beginning to shutdown due to an error.
  • ERROR - A message of this level indicates that an error has occurred, but these errors may not be critical. Examples of error-level messages include failed connections and authentication issues.
  • WARN - These are warning messages which may indicate a possible issue with the application, but not generally a problem that directly affects operation.
  • INFO - These indicate informational-level messages, typically for operations such as successful connections to remote servers.
  • DEBUG - This log level is for debugging-level messages. Messages of this level will not usually be visible in production logs.

Standard Log Messages

All processes will have a standard set of messages that are logged. They are defined as follows.

Text Messages

The following text messages will be written. If the process is configured to write to the Process Event Log, then they will also be written there.

  • <process name> INITIALIZING - The process has begun its initialization.
  • <process name> RUNNING - Initialization has completed, and the process has entered normal running state.
  • <process name> ALIVE - If the aliveMessageSeconds property indicates to write alive messages, then this message will be written to the event logs at the configured interval. It means the process continues to be available.
  • <process name> SHUTTING_DOWN - This indicates the process is terminating.

Audit Event Log Messages

If a process is configured to write to the Audit Event Log, then a standard set of audit events will be written. Some processes provide additional audit events. Following is a list of the standard values in the Event field; the values have the same meaning as the text entries with the same names explained above. Alive messages are not written to the Audit Event Log.

  • INITIALIZING
  • RUNNING
  • SHUTTING_DOWN

Data Import Server Log Message Definitions

  • Rejecting and closing channel - This should seldom occur, if at all. This message generally points to a configuration issue to which DBAs should be alerted.
  • DataImportStreamProcessor-.+WARN - (regex) If a high rate of warning messages is seen from this component (other than during tailer restarts), it may point to a system issue.

Monitoring Process Startup Logs

The application startup logs contain the initial start of the Deephaven Java services for each process. Because these commands write to standard out, the logs contain only the start-up commands and any failures that occur before the process is actually running.

Alarms should be raised if the logs contain any message containing the word 'FATAL' or 'ERROR'.

Any change to these logs indicates that a process has been restarted. Several alarms over a short period indicate repeated process start failures.
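As a sketch of such an alarm (a hypothetical helper, not part of Deephaven; log paths vary by installation), a shell check might look like:

```shell
# Hypothetical helper: print an ALERT line if a startup log contains any
# FATAL or ERROR messages. The caller passes the log file path.
check_startup_log() {
  if grep -qE 'FATAL|ERROR' "$1"; then
    echo "ALERT: startup failure messages in $1"
  fi
}

# Example invocation (path is illustrative):
# check_startup_log /var/log/deephaven/misc/db_query_server.log.current
```

A monitoring agent can run a check like this on a schedule and feed the ALERT lines into its alerting pipeline.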

Common Startup and Shutdown Logging

Most Deephaven processes use an internal library to log their lifecycle, which can be monitored externally. Monitoring tasks can look for the following strings:

  • EventLogProcessName INITIALIZING - This is logged very early in startup, as the logging framework is initialized.
  • ProcessMainClassName starting up - This is also logged very early, as the process environment is initialized.
  • EventLogProcessName RUNNING - This is logged when process-specific initialization is complete and the process is ready.
  • ProcessMainClassName shutting down - This is logged when the environment shutdown is triggered.
  • EventLogProcessName SHUTTING_DOWN - This is logged as one of the initial operations during shutdown processing.

Shutdown processing has a little more structure, should it be necessary to monitor it. Each process has three ordered groups of shutdown tasks, denoted first, middle, and last. The logging around this process looks like:

  • Initiating shutdown processing
  • Starting to invoke FIRST shutdown tasks
  • Done invoking FIRST shutdown tasks
  • Starting to invoke MIDDLE shutdown tasks
  • Done invoking MIDDLE shutdown tasks
  • Starting to invoke LAST shutdown tasks
  • Done invoking LAST shutdown tasks
  • Finished shutdown processing

During shutdown processing, Deephaven processes attempt to log as much usable information as possible in order to report on exceptional conditions or errors during cleanup. This becomes error-prone toward the end, as the logger framework itself must be shut down during the last phase, which can truncate some of the final messages.
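These lifecycle strings lend themselves to simple log scraping. For instance (a hypothetical helper, not part of Deephaven), the most recent lifecycle state recorded in an event log can be extracted with:

```shell
# Hypothetical helper: report the last lifecycle event (INITIALIZING, RUNNING,
# or SHUTTING_DOWN) seen in a process event log.
last_lifecycle_state() {
  grep -oE 'INITIALIZING|RUNNING|SHUTTING_DOWN' "$1" | tail -n 1
}
```

A process whose last recorded state is INITIALIZING for an extended period (i.e., it never reached RUNNING) is a candidate for an alarm.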

Monitoring Process Run Logs

The application run log contains entries describing the current activity of the given process. Entries in the run logs are categorized as DEBUG, INFO, WARNING, ERROR, or FATAL.

Alarms should be raised if the logs contain:

  1. Any message containing the word 'FATAL'
  2. Any message containing the word 'shutdown'

Statistics that may be useful:

  1. Log entries per time interval
  2. ERROR log entries per time interval
    1. Ratio with total log entries
  3. Startup/Shutdown messages
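As one example of such a statistic (a hypothetical helper), the ratio of ERROR entries to total entries in a run log can be computed with awk:

```shell
# Hypothetical helper: print "<errors>/<total>" for a run log - the number of
# ERROR entries over the total number of log entries.
error_ratio() {
  awk '/ERROR/ { e++ } { t++ } END { if (t > 0) printf "%d/%d\n", e, t }' "$1"
}
```

Run over fixed time windows (e.g., the lines logged in the last five minutes), this gives the ERROR-per-interval ratio suggested above.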

Monitoring the Data Buffer Pool

The DataImportServer, LocalTableDataServer, TableDataCacheProxy, and query worker processes (including those used for merge) all operate around an internal pool of 64KB binary buffers used to read, write, and cache binary data. While buffer size is technically configurable, it must be globally consistent for the entire data pipeline, and Deephaven has found that 64KB strikes a good balance between read/write throughput and read/cache amplification.

This pool is highly configurable.

See the JavaDocs (com.illumon.iris.db.util.caching.redesign.DataBufferConfiguration) for documentation of the options available.

Most processes only adjust the total size of the pool, which is controlled by one of the following properties (or by the Data Memory Ratio field for a Persistent Query):

  • DataBufferPool.sizeInBytes (deprecated in Deephaven v1.20200331 and after)
  • DataBufferConfiguration.poolSize - allows more fluent specification, e.g., 24g, rather than simply a raw number of bytes
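For instance, a configuration entry using the newer property might look like the following (a sketch; the file in which this is set depends on your deployment, and the 24g value is the example size mentioned above):

```
# Give the process a 24 GB data buffer pool using the fluent form.
DataBufferConfiguration.poolSize=24g
```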

Pool sizing is very important for cache performance in workers, especially for the merge operations that re-write intraday data in a historical format.

Pool sizing is even more important for DataImportServer instances. A DIS must be able to allocate one buffer per column file for each open partition (i.e., partitions with a connected tailer). If a DIS cannot make progress because it has completely exhausted the buffer pool, and it cannot free up any buffer space with a synchronous cleanup, the process will terminate.

Key log lines to look for:

  • AutoReclaimingObjectPool-DataBufferPool: Failed to take() an item for thread=<thread name>, initiating synchronous cleanup
    This means a thread was unable to acquire a buffer in a lock-free manner, and will wait and attempt to reclaim space on the current thread. It generally means that the pool is under pressure, and may indicate that the configuration should be changed to increase the pool size or the frequency and aggressiveness of concurrent cleanup.
  • AutoReclaimingObjectPool-DataBufferPool: Unable to take() an item for thread=<thread name>, yielding (<#>/<#>)
    This means that a thread was unable to acquire a buffer even after performing synchronous cleanup. This means that all buffers are actively in use and none could be freed. The thread will yield the CPU in order to allow other threads to make progress and hopefully release one or more buffers.
  • AutoReclaimingObjectPool-DataBufferPool: Unable to take() an item for thread=<thread name> after <#> yields
    This means that a thread was unable to acquire a buffer after multiple attempts to synchronously clean up and yield the CPU. This is a fatal error, and will be followed by an ObjectPoolExhaustedError with message AutoReclaimingObjectPool-DataBufferPool: Yield limit encountered - no pooled objects available!
    Subsequent threads that might encounter this scenario during shutdown will instead throw an ObjectPoolExhaustedException with message AutoReclaimingObjectPool-DataBufferPool: Yield limit previously encountered - no pooled objects available!
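A monitoring task can count these escalating messages over time (a hypothetical helper; the quoted substrings come from the log lines above):

```shell
# Hypothetical helper: summarize data buffer pool pressure in a log file, from
# mild (synchronous cleanup) through yields to fatal pool exhaustion.
pool_pressure_summary() {
  printf 'cleanup=%s yield=%s exhausted=%s\n' \
    "$(grep -c 'initiating synchronous cleanup' "$1")" \
    "$(grep -c 'yielding (' "$1")" \
    "$(grep -c 'Yield limit encountered' "$1")"
}
```

A rising cleanup count is an early warning to revisit pool configuration; any nonzero exhausted count corresponds to the fatal condition described above.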

Monitoring TableDataException Logs

Many errors in reading data will result in a log message including a TableDataException. While these will often occur in workers, it may be worthwhile to monitor for them in all system processes that are part of the read flow in the data pipeline, i.e., DataImportServer, LocalTableDataServer, TableDataCacheProxy.

Generally these will include the name of the service that could not provide data, an explanation of the exception, and details that may aid in investigations. They almost always point to a problem in the system, either due to data buffer pool exhaustion, invalid caches due to data deletion outside of Deephaven-internal mechanisms, software issues, or underlying issues with hardware or file systems.

The key to investigating these issues and reaching a speedy resolution is checking the logs of the process that originated the chain of errors. Often for intraday data reads, this is the DataImportServer.

Resolving cache invalidity issues, which often show up as failures to read an offset past the end of a file, usually requires a restart of the impacted processes. This typically means the DIS or LTDS serving the data, any intervening TDCPs, and the query worker(s) accessing the data.

An example of such an error follows:

Caused by: com.illumon.iris.db.v2.locations.TableDataException: RemoteTableDataService-Remote_db_ltds: Received rejection for request (requestKey=12279): FileDataBufferStore[OffsetFileAccessor[LocalFileAccessor[/db/Intraday/Namespace/TableName/IntradayPartition/2019-04-08/TableName/Symbol.dat],131072]]: Read failure: startOffset=0, minimumLength=6872, localSize=6872, size=6872, bufferIndex=0, minimumLimit=6872, sizeImpliedLimit=6872, requiredLimit=6872
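A simple sweep for these exceptions across the read-path logs could be sketched as follows (a hypothetical helper; pass it the DIS, LTDS, TDCP, and worker log files to scan):

```shell
# Hypothetical helper: print the total number of TableDataException
# occurrences across the given log files.
count_table_data_exceptions() {
  grep -hc 'TableDataException' "$@" | awk '{ s += $1 } END { print s + 0 }'
}
```

Comparing per-process counts can also help locate the process that originated the chain of errors, per the investigation advice above.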

Monitoring Schemas

At startup, the DIS reads schemas from the configuration service, and loads listener classes needed to parse binary log data received from tailers. If the schema or a listener for a tailed data stream changes while the DIS is running, that stream will stop receiving updates. The details of the problem will be listed in the DIS and tailer logs, and will usually be in the form of IllegalStateExceptions, RuntimeExceptions, or IllegalArgumentExceptions stating that a column did not match a name, did not match a type, or was not found, or that the version number of a binary log file did not match the version numbers of available listener classes.

Correcting such a situation may require:

  • Deploying updated schema
  • Regenerating loggers and listeners
  • Updating remote loggers to use the latest schemas and logger classes
  • Restarting loggers
  • Restarting the DIS process(es)

The most common scenario is that schemas must be deployed, the DIS needs to be restarted, or the schema change needs to be managed so that it does not happen intra-partition.

Log messages related to this will generally include text along the lines of:

  • Channel handling error while processing stream: java.lang.IllegalArgumentException: Unknown column
    This means that a column in the data received from the tailer is not found in the schema that the DIS has available to it. Likely needs to be resolved with a schema deployment.
  • Channel handling error while processing stream: java.lang.RuntimeException: Can not find schema Listener element
    This means the DIS cannot find the listener code; either it was not deployed, or the DIS must be restarted to find it.
  • Channel handling error while processing stream: java.lang.IllegalStateException:
    Often this is going to be an error wherein some binary log files for a partition were written with a newer schema version, and cannot be imported to the same partition without manual intervention.

That said, not all channel handling errors are severe; the impact depends on the nature of the error and the importance of the data.

Metrics

Host Metrics

The Operating System records metrics about the usage of the host's underlying physical resources, such as disk space, CPU load, swap usage, etc.

Host metrics are important to evaluate the health or performance of an individual machine, regardless of the Deephaven components and services. The host metrics that should be collected are mainly comprised of usage or performance of the operating system or hardware, such as:

  • CPU
  • Memory
  • Disk space
  • Processes

Collecting these metrics helps determine a node's health and ability to perform meaningful work in your Deephaven cluster.

General Host Metrics
  1. General System Health
    1. CPU errors
    2. Memory errors
    3. SAR data collection or similar for historical performance trending
    4. Status of all disk drives and/or built-in RAID including battery cache
  2. CPU Utilization
    1. Usage break down: %user, %system, %idle, %iowait
    2. Any blocked processes or any in IO Wait
    3. System load
  3. Memory/Swap Utilization
    1. Any swap utilization should raise an alarm
  4. Disk Utilization
    1. Root disk (/)
    2. General Deephaven database disk (/db)
    3. Deephaven Intraday disk utilization (/db/Intraday)
    4. Deephaven Historical disk utilization (/db/Systems)
    5. Data Import Server (DIS) or any servers performing batch data imports.
    6. All hosts in the Deephaven cluster or storage server itself
  5. Network Utilization
    1. Bandwidth utilization of network interfaces

Deephaven Metrics

Deephaven also records many metrics and events about the internal workings of the system. These metrics cover performance, audit events, system state, etc.

Deephaven metrics are stored in the Deephaven database and can be queried just like any other Deephaven table.

Performance Metrics

Deephaven Query Performance Metrics are recorded in the following tables:

  • QueryPerformanceLog
  • QueryOperationPerformanceLog
  • UpdatePerformanceLog

Table Namespace

DbInternal

Table Names

QueryPerformanceLog, QueryOperationPerformanceLog, UpdatePerformanceLog

Description

Records a large variety of performance metrics about Deephaven queries, including timing measures, memory usage, duration, as well as information about the client.

Sample Queries

// Show QueryPerformanceLog records for today for specified user
today = new Date().format("yyyy-MM-dd")
user = "charlie"
t=db.i("DbInternal",  "QueryPerformanceLog").where("Date=`" + today + "`").where("PrimaryEffectiveUser=`" + user + "`")


// Show QueryOperationPerformanceLog records for today for specified user
today = new Date().format("yyyy-MM-dd")
user = "charlie"
t=db.i("DbInternal",  "QueryOperationPerformanceLog").where("Date=`" + today + "`").where("PrimaryEffectiveUser=`" + user + "`")


// Show UpdatePerformanceLog records for today for specified user
today = new Date().format("yyyy-MM-dd")
user = "charlie"
t=db.i("DbInternal",  "UpdatePerformanceLog").where("Date=`" + today + "`").where("PrimaryEffectiveUser=`" + user + "`")

Status Metrics

Deephaven Status and Configuration logs are recorded in the following tables:

  • PersistentQueryConfigurationLogV2
  • PersistentQueryStateLog

Table Namespace

DbInternal

Table Name

PersistentQueryConfigurationLog, PersistentQueryConfigurationLogV2, PersistentQueryStateLog

Description

Records configuration and state of Persistent Queries along with any exceptions or errors the query may have encountered.

Sample Queries

// Show PersistentQueryStateLog records for today for specified query
today = new Date().format("yyyy-MM-dd")
queryName = "RevertHelper"
t=db.i("DbInternal",  "PersistentQueryStateLog").where("Date=`" + today + "`").where("Name=`" + queryName + "`")

Alerts

Alerting is part of the monitoring system that performs actions based on changes in metric values. The main purpose of alerting is to bring human attention to investigate the current status of your systems. Monitoring systems integrate with Slack, Email, SMS and other communication mechanisms.

Critical Email Alerts

Deephaven does some internal monitoring and will send email alerts for certain critical errors, such as when garbage collection is taking too long or a reporting job is hung.

Administrators should configure the appropriate email address to which these critical errors are sent. This could be an internal group email address, or the email can be submitted directly into your monitoring system.

Edit the following properties in the iris-common.prop file:

Prop File Path

/etc/sysconfig/illumon.d/resources/iris-common.prop

Property

smtp.mx.domain= (e.g., mycompany.com)

critEmail= (e.g., [email protected])
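Filled in with the example values above, the two entries in iris-common.prop might read:

```
# Illustrative values - use your own mail domain and alert address.
smtp.mx.domain=mycompany.com
[email protected]
```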

Administrating and Monitoring Web Server

The Data Import Server (DIS) is used to ingest real-time data into Deephaven. At any given time, it can be handling many different real-time data feeds, and it can be useful to be able to view these feeds and where they originated from. The Remote Query Dispatcher (RQD) launches all query workers, and handles connections from various clients such as Deephaven controllers and consoles. Visibility into these operations can help an administrator understand current load.

Both the DIS and RQD can provide their current status through built-in Web servers. These options are turned off by default and must be enabled by changing the appropriate properties (see the Properties section below for details on how to do this).

To connect to a Web service, connect to the appropriate port on the server. If the Web service is configured to require SSL, use "https"; otherwise use "http". Depending on the configuration, a password may be required.

For example, if the Data Import Server is running on dataimportserver.host with SSL on port 8086 (its default), its Web server can be accessed from a browser with:

https://dataimportserver.host:8086/

If the Remote Query Dispatcher is running on remotequerydispatcher.host on port 8084 (its default), its Web server can be accessed from a browser with:

https://remotequerydispatcher.host:8084/

Authentication

The Web servers can be optionally configured to require authentication; if this is not turned on, then anybody with access to the server and port can see the details.

If authentication is turned on, then the Web service will use the Deephaven Authentication Server to determine if the user is an authorized Deephaven user. Once this is confirmed, the user's ACLs will be checked.

Configuring ACLs

The ACLs used to validate whether an authenticated user is allowed to access a Web server can be configured with the Deephaven ACL Editor. If a user is a member of the appropriate group or is a superuser, then the user can access the Web server. A user can be allowed to view all Web servers or only a specific one by making them a member of one or more of the following groups.

  • iris-Webviewers - a member of this group can access any of the available Web servers.
  • <Service name>.iris-Webviewers - a member of this group can access that service's Web server.
    • DataImportServer.iris-Webviewers
    • RemoteQueryDispatcher.iris-Webviewers

Web Servers

Each Web server provides different status pages. It will first show a home page, and the user can click on the available options to see further details.

Data Import Server (DIS)

The Data Import Server provides two options: Server Configuration and Tailer Connections.

Server Configuration

The example screenshot below shows the configuration with which the DIS was started.

The following columns are displayed.

  • Field Name - the name of the server configuration field
  • Value - the value for the field
  • Description - a brief description of the field, if available
Tailer Connections

The example screenshot below shows the current tailer connections (i.e., every tailer connection for which this DIS is processing data). Every connection is specific to a namespace/table/column partition value/internal partition value.

The following columns are displayed for each tailer connection.

  • Namespace - the connection's namespace
  • Table Name - the connection's table name
  • Column Partition -  the connection's column partition value, frequently the date
  • Internal Partition - the connection's internal partition value
  • Tailer Address - the IP and port of the remote tailer
  • State - the current state of the connection:
    • NEW - a new connection has started but is not running
    • REGISTERING - a new connection is registering
    • INITIALIZING - a new connection is initializing
    • PROCESSING - the connection is processing data
  • Time Started - the time this connection was started

Remote Query Dispatcher

The Remote Query Dispatcher provides three options: Server Configuration, Usage and Job Queue.

Server Configuration

This shows the configuration with which the Remote Query Dispatcher was started.

The following columns are displayed.

  • Field Name - the name of the server configuration field
  • Value - the value for the field
  • Description - a brief description of the field, if available
Usage

This shows the active client connections to the dispatcher. These will usually indicate Deephaven controller or console processes.

The following columns are displayed.

  • Client Host - the host name of the client
  • Active Connections - the number of active connections for this host
  • Total Connections - the total number of connections received for this host
  • Scheduled Jobs - the number of jobs (workers) scheduled (running) for this host
  • Waiting Jobs - the number of jobs (workers) waiting to be started for this host
  • Last New Job Scheduled At - the last time a new job was scheduled
Job Queue

This displays the active jobs (i.e., worker processes).

The following columns are displayed.

  • Job ID - the internal dispatcher job ID
  • Description - a brief description of the job
    • A description that starts with "Console-" indicates it's a console
    • A description that starts with "PersistentQuery-" indicates that it is a persistent query and is followed by the persistent query name
  • Client Host - the host of the client that started the job
  • User Name - the user that is running the query
  • Heap (GB) - the job's heap usage in GB
  • Duration(s) - the maximum duration of the query in seconds
  • Priority - the job's priority (a lower number indicates a higher priority)
  • Worker ID - the worker ID of this job
  • Worker PID - the PID of this worker, if available
  • Status - the job's current status
    • Assigned - the job is assigned to a worker but is not yet running
    • Cancelled - the job has been cancelled but has not yet stopped
    • Finished - the job has completed
    • Scheduled - the job is actively running
    • Waiting - waiting to be started

Web Services Properties

For the Remote Query dispatcher, the following properties define how the web server runs:

Property Name

Meaning

Default Value

RemoteQueryDispatcher.webserver.enabled

If this is defined as false, no Web server will be created for the Remote Query Dispatcher.

false

RemoteQueryDispatcher.webserver.port

The port to be used for the Jetty server for the Remote Query Dispatcher.

8084

RemoteQueryDispatcher.webserver.sslRequired

If this is defined as true, an SSL certificate will be required and used for the Remote Query Dispatcher's Web services, and the connection will be over https.

true

RemoteQueryDispatcher.webserver.authenticationRequired

If this is true, users will be required to authenticate to access the Remote Query Dispatcher's Web services. If authentication is required, then sslRequired must be true (i.e., credentials are not allowed to be sent over an insecure connection).

true
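For example, enabling the dispatcher's status pages on the default port might look like this in a prop file (a sketch using the properties above, with SSL and authentication left at their defaults of true):

```
RemoteQueryDispatcher.webserver.enabled=true
RemoteQueryDispatcher.webserver.port=8084
```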

The following parameters apply to the Remote Query Dispatcher web servers.

Property Name

Meaning

Default Value

RemoteQueryDispatcher.tls.keystore

If SSL is being used, the keystore to be searched to find the certificate for the encrypted https connection.

<none>

RemoteQueryDispatcher.tls.passphrase

If SSL is being used, the passphrase used to verify the integrity of the keystore. This string is base64 encoded to prevent accidental exposure. Alternatively (and recommended), the passphrase can instead be stored in an external file; see RemoteQueryDispatcher.tls.passphrase.file.

<none>

RemoteQueryDispatcher.tls.passphrase.file

If SSL is being used, the passphrase to be used for the encrypted https connection. The string in the file is base64 encoded to prevent accidental exposure; the file name and path are not. Using this property instead of RemoteQueryDispatcher.tls.passphrase is recommended.

<none>

The Data Import Server web server is configured through the data routing service YML configuration file. The following parameters apply to the Data Import Server web servers. Note that the “db_dis” prefix is a default; this prefix must match the DIS name in the data routing service YML configuration file. See Data Routing Service Configuration via YAML for further details.

| Property Name | Meaning | Default Value |
| --- | --- | --- |
| db_dis.tls.keystore | If SSL is being used, the keystore to be searched to find the certificate for the encrypted https connection. | <none> |
| db_dis.tls.passphrase | If SSL is being used, the passphrase used to verify the integrity of the keystore. This string is base64 encoded to prevent accidental exposure. Alternatively (and recommended), the passphrase can instead be stored in an external file; see db_dis.tls.passphrase.file. | <none> |
| db_dis.tls.passphrase.file | If SSL is being used, a file containing the passphrase for the encrypted https connection. The string in the file is base64 encoded to prevent accidental exposure; the file name and path are not. Using this property instead of db_dis.tls.passphrase is recommended. | <none> |
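Both passphrase properties expect the passphrase itself to be base64 encoded. A minimal sketch of preparing a passphrase file for the .tls.passphrase.file variant follows; the path /tmp/dis_tls_passphrase and the passphrase "changeit" are illustrative examples only, not Deephaven defaults:

```shell
# Encode the keystore passphrase and store it in a file that the
# db_dis.tls.passphrase.file property can point at.
# The path and passphrase below are examples only.
printf '%s' 'changeit' | base64 > /tmp/dis_tls_passphrase
chmod 600 /tmp/dis_tls_passphrase   # restrict read access to the owner
cat /tmp/dis_tls_passphrase         # prints the encoded value: Y2hhbmdlaXQ=
```

Keeping the passphrase in a file with restricted permissions avoids exposing it in property files that may be widely readable or checked into version control.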

Performing a Thread Dump

There are times during incident management when the Deephaven Support team may ask you to perform a thread dump of a particular Deephaven process. Taking a thread dump only pauses the process's threads momentarily, and it captures valuable diagnostic information for use in troubleshooting.

Performing a Thread Dump with jstack

To perform a Thread Dump, simply run:

sudo jstack -F <pid> > <file-path>

where pid is the process ID of the Deephaven process, and file-path is the path of the file to which the thread dump will be written.

For example, to perform a thread dump of Worker 1, use ps, grep and awk to capture the process ID. Note the bracketed first character in the grep pattern, which prevents the grep command itself from appearing in the results:

sudo jstack -F $(ps -ef | grep '[w]orker_1' | awk '{print $2}' | head -n 1) > /tmp/threadDumpWorker1.txt
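One subtlety with ps | grep pipelines: the grep process itself contains the search pattern on its command line, so a plain grep worker_1 can capture the wrong PID. Writing the pattern as grep '[w]orker_1' avoids this, because the literal text [w]orker_1 does not match the regular expression [w]orker_1 (which only matches worker_1). A self-contained demonstration, with canned lines standing in for real ps -ef output:

```shell
# Two canned lines: a real worker process and the grep command itself.
# The pattern [w]orker_1 matches "worker_1" but not the literal "[w]orker_1".
printf 'dbquery 123 java worker_1\nroot 999 grep [w]orker_1\n' \
  | grep '[w]orker_1'
# prints only: dbquery 123 java worker_1
```

An equivalent alternative is pgrep -f worker_1, which by design never matches its own process.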

Performing a Thread Dump with kill -3 (SIGQUIT)

In cases where you want the thread dump to go to the process logs, kill -3 (SIGQUIT) can be used; the JVM writes the thread dump to the standard output stream of the process.

To perform a Thread Dump with kill -3, simply run:

sudo kill -3 <pid>

where pid is the process ID of the Deephaven process.

For example, to perform a thread dump of Worker 1, use ps, grep and awk to capture the process ID, again bracketing the first character of the grep pattern so the grep command does not match itself:

sudo kill -3 $(ps -ef | grep '[w]orker_1' | awk '{print $2}' | head -n 1)

By default, Worker logs are sent to the Deephaven database along with stderr and stdout, where stderr is redirected to the stdout for the process.

To view stdout logs for Worker 1, the following query can be used:

t = db.i("DbInternal", "ProcessEventLog").where("Date=currentDateNy()").where("Process=`worker_1`").where("Level=`STDOUT`")

Finding Errors

Persistent query processes record their activity and state in the ProcessEventLog and the PersistentQueryStateLog tables.  When errors or exceptions occur, these tables can be queried to find error messages or queries that have failed.

Querying the PersistentQueryStateLog Table

The PersistentQueryStateLog is updated every time a persistent query changes state, and can be used to see both the current state of any persistent query as well as all the historical states. The status of failed queries could be:

  • Failed: The persistent query failed before or during initialization.
  • Error: An error occurred after the query was initialized (e.g., processing incremental updates)
  • Disconnected: The worker process disconnected from the dispatcher.  Often this results from a problem that terminates the JVM without giving the worker a chance to notify the dispatcher (e.g., OutOfMemoryError or hotspot errors in native code).

The following example shows all times a persistent query entered a failure state on the current date:

pqsAllFailedStates = db.i("DbInternal","PersistentQueryStateLog").where("Date=currentDateNy()", "Status in `Failed`, `Error`, `Disconnected`")

Adding a lastBy() call to the query shows the most recent state of any persistent query that attempted to start on the current date:

pqslAllCurrentStates = db.i("DbInternal","PersistentQueryStateLog").where("Date=currentDateNy()").lastBy("SerialNumber")

Combining the above queries indicates which persistent queries have entered a failure state on the current date:

pqslCurrentFailedStates = db.i("DbInternal","PersistentQueryStateLog").where("Date=currentDateNy()").lastBy("SerialNumber").where("Status in `Failed`, `Error`, `Disconnected`")

Querying the ProcessEventLog Table

To find all FATAL Level messages in the ProcessEventLog, the following query can be used:

t = db.i("DbInternal","ProcessEventLog").where("Date=currentDateNy()").where("Level==`FATAL`")

To find specific errors or error messages for a specific query worker, first query the PersistentQueryStateLog table to determine the worker name and worker host and use these values to query the ProcessEventLog.  For example:

workerHost="hostname"
workerId="worker_123"
errorText="Problem maintaining subscription"
t = db.i("DbInternal","ProcessEventLog")
    .where("Date=currentDateNy()")
    .where("Host==workerHost")
    .where("Process==workerId")
    .where("LogEntry.contains(errorText)")

Hold your mouse over the Log Entry to view the full exception message and stack trace from the results of the query.

Viewing Persistent Query Exceptions in the Console

When a persistent query fails with a Java exception, you can view the exception message and stack trace in the Deephaven console.

To view the exception message, click the Query Config button to open the Query Config panel. Then right-click the failed query in the "Exception Details" column and select Show Exceptions from the drop-down menu.

Deephaven Upgrades

Deephaven is designed to facilitate upgrades and rollbacks.  It is important to ensure you complete a backup of the Deephaven database and all configuration files before performing an upgrade.

A first-time install of Deephaven requires two RPM packages:

  1. Database RPM
  2. Config RPM

Your Deephaven account representative will provide you with the latest versions of these two packages. Please refer to the Deephaven Server Installation Guide for additional details on a first-time installation.

Subsequent installation or upgrades require installing only the Database RPM. The release notes will instruct Administrators if any changes or edits to configuration files are required.

To upgrade Deephaven to the latest software release, first copy the new packages onto your Deephaven Linux hosts, then run the following command to update the Deephaven database:

sudo yum localinstall illumon-db-<VERSION>.rpm -y

The installer will not delete or overwrite any existing files. It simply lays down the new files and updates a symbolic link called latest in the /usr/illumon folder.

If new schemas or internal database changes are part of a new release, administrators will also be required to run a dhconfig command (usually of the form dhconfig schemas import <options>) to create or update table definitions. Please check the respective release notes for any additional information or instructions specific to the new release.

If there are any issues or problems with the new release, updating the latest symlink will revert the system binaries back to the previous release.
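As a concrete illustration of the rollback mechanism, the following sketch re-points a latest symlink between two release directories. It uses a scratch directory rather than /usr/illumon, and the release directory names are hypothetical; substitute the actual version directories present on your hosts:

```shell
# Scratch directory stands in for /usr/illumon; release names are hypothetical.
BASE=$(mktemp -d)
mkdir -p "$BASE/illumon-db-1.20200928" "$BASE/illumon-db-1.20210101"

# After an upgrade, 'latest' points at the new release.
ln -sfn "$BASE/illumon-db-1.20210101" "$BASE/latest"

# Roll back by re-pointing 'latest' at the previous release.
ln -sfn "$BASE/illumon-db-1.20200928" "$BASE/latest"
readlink "$BASE/latest"   # now resolves to the previous release directory
```

The -n flag matters here: it makes ln replace the symlink itself rather than creating a new link inside the directory the old symlink points to.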

If any changes were made to the database or config files, these will need to be restored if a rollback is required.

Deephaven Hotfixes

A hotfix is a single minor release that addresses a problem in an existing Deephaven release, typically one specific to a particular customer situation.

If a hotfix is applicable to your current Deephaven deployment, your Deephaven account representative will provide you with the hotfix installation package.

Copy the new packages onto your Deephaven Linux hosts and run the following command to apply the hotfix:

sudo yum localinstall illumon-db-<VERSION>.rpm -y

The installation procedure for a hotfix is similar to regular Deephaven upgrades. The installer will not delete or overwrite any existing files. It simply lays down the new files and updates a symbolic link called latest in the /usr/illumon folder.

The Release Notes section of the Deephaven documentation includes the full version log for active Deephaven versions.

Custom Libraries, Overrides and Hotfixes

Any custom or third-party libraries and resources, such as JARs, class files, property files, etc., can be added to designated directories to override and/or extend your Deephaven installation.

The following directories are placed at the front of the CLASSPATH of Deephaven processes. Add any custom or third-party resources to the appropriate directory:

  • /etc/sysconfig/illumon.d/java_lib - For any user and/or third-party Java libraries (JARs). Any JARs placed in this directory will be found on the classpath of Deephaven processes.
  • /etc/sysconfig/illumon.d/override - For class files, properties or other Java resources that should override those contained in the java_lib JARs.
  • /etc/sysconfig/illumon.d/hotfixes - This path contains any JAR files that should be used instead of those contained in the java_lib directories.

Restart Affected Processes

After a Deephaven hotfix package has been applied or any custom libraries, overrides or custom hotfixes have been placed in designated directories, the affected Deephaven processes need to be restarted.

In most cases, a restart of all Deephaven processes on the server will be required. This is accomplished using the following command:

sudo monit restart all

In cases where the hotfix or custom extensions are applicable to a specific Deephaven process, only that process needs to be restarted. Please refer to the Startup/Shutdown of Deephaven Processes section for instructions on restarting individual Deephaven processes.

Troubleshooting Custom Hotfixes and Extensions

In some cases, third party libraries, custom extensions or hotfixes can cause conflicts or errors with existing Deephaven code libraries. Refer to the Log files for any errors or warnings once processes have been restarted with custom extensions.

Debugging Process Configuration

Use the sudo jinfo <pid> command to print Java configuration information for a running Java process, including its CLASSPATH, system properties and JVM flags.

Debugging SSL/TLS Connections

Understanding SSL/TLS connection problems can sometimes be difficult, especially when it is not clear what messages are actually being sent and received.

Adding -Djavax.net.debug=all to your Java launch parameters enables detailed certificate and handshake logging, which can help troubleshoot or diagnose SSL/TLS connections.


Last Updated: 16 February 2021 18:07 -04:00 UTC    Deephaven v.1.20200928  (See other versions)

Deephaven Documentation     Copyright 2016-2020  Deephaven Data Labs, LLC     All Rights Reserved