Bulk Data Ingestion
Batch imports that are executed via Persistent Queries can be "duplicated" across a range of inputs, for example to perform a one-time load or backfill of historical data when the job normally imports one day at a time. This feature is known as Bulk Data Ingestion.
Bulk data ingestion enables an administrator to easily import, merge, and validate data for multiple partition values. Instead of creating individual queries for each dataset, the bulk ingestion process enables users to add Bulk Copy options to an existing Import, Data Merge and/or Data Validation query configuration.
Users start with an existing persistent query - typically an Import query used for day-to-day data ingestion. Bulk Copy options/parameters can then be added to the query configuration to specify the range of partitions that need to be imported - typically a range of dates. Once these options are entered, Deephaven automatically creates individual, transient (short-lived) queries - one for each data partition - that will run when resources permit to import each individual dataset. Once each dataset has been imported, the transient query used to import that dataset will be erased.
Additional options are also available for simultaneously creating the respective Data Merge and Validation queries, which will also run in order (as resources permit) following the completion of the data import process. The bulk merge and bulk validate queries are dependent on the earlier queries in the chain. For example, a bulk merge query for a given partition value will only run if the bulk import query successfully completed, and a bulk validate query will run only if the bulk merge query successfully completed.
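The dependency chain described above can be illustrated with a minimal Python sketch. This is not Deephaven's scheduler or API: the `run_chain` function, the `STAGES` list, and the `runners` callables are hypothetical names used only to show that each stage for a partition runs only if the previous stage for that partition succeeded.

```python
from typing import Callable, Dict, List

# Order of the chain described in the text: import, then merge, then validate.
STAGES = ["import", "merge", "validate"]


def run_chain(partition: str, runners: Dict[str, Callable[[str], bool]]) -> List[str]:
    """Run the configured stages for one partition value, in order.

    Each runner is a hypothetical stand-in for a transient query and
    returns True on success. The chain stops at the first stage that is
    missing or fails, so later stages never run without their predecessor.
    Returns the list of stages that completed successfully.
    """
    completed: List[str] = []
    for stage in STAGES:
        runner = runners.get(stage)
        if runner is None:
            break  # stage not configured; nothing downstream can run
        if not runner(partition):
            break  # stage failed; skip everything that depends on it
        completed.append(stage)
    return completed


# Simulated runners: the merge fails for one partition value.
runners = {
    "import": lambda p: True,
    "merge": lambda p: p != "2017-01-02",
    "validate": lambda p: True,
}

for p in ["2017-01-01", "2017-01-02"]:
    print(p, run_chain(p, runners))
```

In this simulation the first partition completes all three stages, while the second stops after the import because its merge fails, so its validation never runs.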
For example, to load 100 days' worth of CSV data in bulk (e.g., 100 files), the existing day-to-day CSV import query for that data could be chosen as the base query. The range of partitions, 100 different business dates in this case, is then added to the Bulk Copy options for the query. Once the new parameters are entered, Deephaven generates 100 individual transient queries to handle the import of those 100 datasets. These queries run whenever resources are available, thereby automating the entire process of importing 100 days' worth of data. Furthermore, merge and data validation queries for each dataset can also be generated in the same process.
Configuring the Bulk Copy Configuration
Open the Query Config panel in Deephaven. Right-clicking on any Import, Data Merge, or Validate query will generate the menu shown below. Select Bulk Copy Configuration.
This opens the Bulk Copy Dialog window, which is shown below.
To begin, choose either Date Partitioning or Other Partitioning.
If Date Partitioning is chosen, the following options are presented:
- Start Date: this is the starting date for the bulk operation, in yyyy-MM-dd format. When selected, Start Date will display a date chooser.
- End Date: this is the ending date for the bulk operation, in yyyy-MM-dd format. When selected, End Date will display a date chooser.
- Business Days Only: if selected, then only business days will be used for the bulk operation. Queries will not be created for weekends and holidays.
- Business Calendar: if the Business Days Only checkbox is selected, then a Business Calendar must also be chosen.
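As a rough illustration of how the date options above translate into partition values, the following Python sketch expands a start and end date into one yyyy-MM-dd value per day. It is not Deephaven code. The `date_partitions` function and the `holidays` argument are hypothetical; the sketch assumes the end date is inclusive, treats Saturday and Sunday as weekends, and uses a plain set of dates in place of a real Business Calendar.

```python
from datetime import date, timedelta
from typing import FrozenSet, List


def date_partitions(
    start: str,
    end: str,
    business_days_only: bool = False,
    holidays: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Return one yyyy-MM-dd partition value per day from start to end.

    Assumptions (not taken from the product): the end date is inclusive,
    weekends are Saturday and Sunday, and `holidays` stands in for the
    holidays defined by the chosen Business Calendar.
    """
    first = date.fromisoformat(start)
    last = date.fromisoformat(end)
    if last < first:
        raise ValueError("End Date must not be before Start Date")

    partitions: List[str] = []
    day = first
    while day <= last:
        iso = day.isoformat()
        is_weekend = day.weekday() >= 5  # Monday is 0, so 5 and 6 are Sat/Sun
        if not (business_days_only and (is_weekend or iso in holidays)):
            partitions.append(iso)
        day += timedelta(days=1)
    return partitions


# One week starting on Sunday 2017-01-01, with a hypothetical holiday on the Monday.
print(date_partitions("2017-01-01", "2017-01-07"))
print(date_partitions("2017-01-01", "2017-01-07", business_days_only=True,
                      holidays=frozenset({"2017-01-02"})))
```

With Business Days Only unselected, every calendar day in the range yields a partition value; with it selected, weekend days and the listed holidays are skipped, so no query would be created for them.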
Depending on the type of query configuration originally chosen (Import, Data Merge, or Validate), the following additional buttons may appear in the Bulk Copy Dialog window:
- Edit Bulk Query: Selecting this option allows you to edit the parameters for the bulk version of the underlying Import, Data Merge or Validate query. Edits are typically made to change the scheduling or other parameters. However, edits apply only to the new bulk queries being processed. The underlying query that was copied earlier is not affected.
- Create Bulk Merge Query: If the original configuration is an Import query, selecting Create Bulk Merge Query allows the user to create a set of Data Merge queries that will run for the same partition values used for the bulk import process. Each Data Merge query in this set will run only when its respective Import query successfully completes. When selected, the Persistent Query Configuration Editor will open and present options to configure the bulk merge query. If a Data Merge query already exists that uses this namespace/table, the settings for that Data Merge query will be copied into the base settings for the new bulk merge query. Otherwise, the settings used for the new bulk merge query will be based on the original Import query. In either case, the settings can be edited by selecting Edit Configuration in the Bulk Copy Dialog window.
- Delete Bulk Merge Query: This option appears only if a bulk merge query has been created. Selecting Delete Bulk Merge Query will delete the bulk merge query associated with this Bulk Import process. Note: If there is a corresponding bulk validate query for the bulk merge query, both will be deleted when Delete Bulk Merge Query is selected.
- Create Bulk Validate Query: If the original configuration is a Data Merge query, selecting Create Bulk Validate Query allows the user to create a set of Validate queries that will run for the same partition values used for the bulk merge process. Each Validate query in this set will run only when its respective Data Merge query successfully completes. When selected, the Persistent Query Configuration Editor will open and present options to configure the bulk validate query. If a Validate query already exists that uses this namespace/table, the settings for that Validate query will be copied into the base settings for the new bulk validate query. Otherwise, the settings used for the new bulk validate query will be based on the original Data Merge query. In either case, the settings can be edited by selecting Edit Configuration in the Bulk Copy Dialog window.
- Delete Bulk Validate Query: This option appears only if a bulk validate query has been created. Selecting Delete Bulk Validate Query will delete the bulk validate query associated with this bulk merge process.
- Delete Merge & Validate Queries: This button is only shown when bulk merge and bulk validate queries have been created. Selecting it removes both. A bulk validate query cannot run without a corresponding bulk merge query.
If Other Partitioning is chosen, the following options are presented:
When Other Partitioning is selected, the user must enter a list of partition values in the empty text field shown. Values can be typed directly into the window, with each partition value on a new line. Alternatively, Select File can be used to choose and open a file. The file's contents will be read into the window and used to determine the partitioning values. Each line in the file is treated as a separate column partition value.
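The one-value-per-line format can be sketched in Python as follows. This is only an illustration of the input format, not how the Bulk Copy Dialog is implemented. The `parse_partition_values` function is a hypothetical name, and trimming whitespace and skipping blank lines are assumptions rather than documented behavior.

```python
from typing import List


def parse_partition_values(text: str) -> List[str]:
    """Split text into partition values, one per line.

    Assumption: surrounding whitespace is trimmed and blank lines are
    ignored. The source only states that each line is one value.
    """
    values: List[str] = []
    for line in text.splitlines():
        value = line.strip()
        if value:
            values.append(value)
    return values


# The same function works for typed input or for the contents of a file,
# e.g. parse_partition_values(open("partitions.txt").read()) for a
# hypothetical file named partitions.txt.
print(parse_partition_values("NYSE\n\n  LSE \nTSE\n"))
```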
Depending on the type of query configuration originally chosen (Import, Data Merge, or Validate), the following additional buttons may appear in the Bulk Copy Dialog window:
- Edit Bulk Query: Selecting this option allows you to edit the parameters for the bulk version of the underlying Import, Data Merge or Validate query. Edits are typically made to change the scheduling or other parameters. However, edits apply only to the new bulk queries being processed. The underlying query that was copied earlier is not affected.
- Create Bulk Merge Query: If the original configuration is an Import query, selecting Create Bulk Merge Query allows the user to create a set of Data Merge queries that will run for the same partition values used for the bulk import process. Each Data Merge query in this set will run only when its respective Import query successfully completes. When selected, the Persistent Query Configuration Editor will open and present options to configure the bulk merge query. If a Data Merge query already exists that uses this namespace/table, the settings for that Data Merge query will be copied into the base settings for the new bulk merge query. Otherwise, the settings used for the new bulk merge query will be based on the original Import query. In either case, the settings can be edited by selecting Edit Configuration in the Bulk Copy Dialog window.
- Delete Bulk Merge Query: This option appears only if a bulk merge query has been created. Selecting Delete Bulk Merge Query will delete the bulk merge query associated with this Bulk Import process. Note: If there is a corresponding bulk validate query for the bulk merge query, both will be deleted when Delete Bulk Merge Query is selected.
- Create Bulk Validate Query: If the original configuration is a Data Merge query, selecting Create Bulk Validate Query allows the user to create a set of Validate queries that will run for the same partition values used for the bulk merge process. Each Validate query in this set will run only when its respective Data Merge query successfully completes. When selected, the Persistent Query Configuration Editor will open and present options to configure the bulk validate query. If a Validate query already exists that uses this namespace/table, the settings for that Validate query will be copied into the base settings for the new bulk validate query. Otherwise, the settings used for the new bulk validate query will be based on the original Data Merge query. In either case, the settings can be edited by selecting Edit Configuration in the Bulk Copy Dialog window.
- Delete Bulk Validate Query: This option appears only if a bulk validate query has been created. Selecting Delete Bulk Validate Query will delete the bulk validate query associated with this bulk merge process.
- Delete Merge & Validate Queries: This button is only shown when bulk merge and bulk validate queries have been created. Selecting it removes both. A bulk validate query cannot run without a corresponding bulk merge query.
When a Bulk Import, Data Merge or Validate query is saved, each query is given the name of its base query, with a "-" and the partitioning value appended. For example, a Bulk Import query might be named MyImport-2017-01-01.
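The naming rule can be expressed as a one-line Python helper. The `bulk_query_name` function is a hypothetical illustration of the convention, not part of any Deephaven API.

```python
def bulk_query_name(base_name: str, partition_value: str) -> str:
    """Build a bulk query name: base query name, a hyphen, then the partition value."""
    return f"{base_name}-{partition_value}"


# Names for three date partitions of a base query called MyImport.
for partition in ["2017-01-01", "2017-01-02", "2017-01-03"]:
    print(bulk_query_name("MyImport", partition))
```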
Last Updated: 06 July 2020 14:54 -04:00 UTC Deephaven v.1.20200928
Deephaven Documentation Copyright 2016-2020 Deephaven Data Labs, LLC All Rights Reserved