Importing Data from Quandl

About Quandl

Quandl is an exchange for various types of data, mostly financial. Quandl provides an API that allows access to data, metadata, and some catalog information for these data sets, and supports retrieval of data in CSV, JSON, and XML formats. Some of the data sets are free, while others are paid subscription items. Data are organized in two formats: time-series metrics and datatables. At this time, all of Quandl's data is snapshot updated. Much of the data is updated daily, but, depending on the dataset, update frequency ranges from hourly to annually.

Time-Series

Time-series metrics account for the majority of data available through Quandl. Time-series data sets have a date or date/time value for each row, and only numeric values for the other fields. For the most part, they are vertically partitioned, with a single metric per data set. Horizontally partitioned sets such as per-symbol instrument data also exist. The API catalog provides information about time-series data sets, the publishers that exist, and data sets available from each publisher. Metadata for time-series is a bit limited, often providing only the names of columns and not the data types contained therein.
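
For illustration, the header of a WIKI time-series dataset has the shape below: a Date column followed by only numeric columns. (The exact column names are specific to the WIKI datasets and are shown only as an example; other publishers use different columns.)

Date,Open,High,Low,Close,Volume,Ex-Dividend,Split Ratio,Adj. Open,Adj. High,Adj. Low,Adj. Close,Adj. Volume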

Datatables

Datatables are standard-looking tables. Like time-series data, these have a date or date/time value for each row, but these also contain text values in their fields, such as ticker symbols. The datatables API supports different filtering and downsampling options that are defined per table. The API also provides fairly rich metadata that includes the data type of each column. However, datatables are not listed in the Quandl catalog API.

Connecting to Quandl

Quandl's API is web-hosted and can be accessed with any HTTP client, including a regular Web browser. Requests are submitted as HTTP GET requests that include the specific data set and options being requested, and an API key parameter for authorization. An API key can be obtained by registering at www.quandl.com, and is a prerequisite to using Quandl data import features in Deephaven. The API key parameter must always be included in API requests; if it is invalid or empty, the Quandl API will reject the request with an HTTP status code of 400. Even sample data from premium content channels requires a valid API key, which is then used to associate subscriptions with paid content.
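
For example, a CSV request for a datatable follows Quandl's v3 URL conventions and might look like the following (the key is a placeholder and must be replaced with your own):

https://www.quandl.com/api/v3/datatables/WIKI/PRICES.csv?date.gte=20100101&api_key=<your_api_key>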

Working with Quandl Data in Deephaven

Using Quandl data in Deephaven is similar to using CSV data. Deephaven has tools that can quickly create in-memory tables from Quandl data, or import Quandl data into persistent namespace tables.

In-memory Tables

Deephaven's Quandl functions are contained in the QuandlHelpers class. This class is not imported into the Deephaven Groovy session by default, so using these methods requires either explicit imports or fully qualified method names.

Explicit import:

import static com.illumon.iris.quandl.QuandlHelpers.*

quandlAPIKey("4qxnjGszorJKZ565sJP7")

Note:  "4qxnjGszorJKZ565sJP7" is not a valid Quandl API Key.  Replace with your own.

Fully-qualified:

com.illumon.iris.quandl.QuandlHelpers.quandlAPIKey("4qxnjGszorJKZ565sJP7")

Note:  "4qxnjGszorJKZ565sJP7" is not a valid Quandl API Key.  Replace with your own.

There are three main commands for creating in-memory tables from Quandl data:

  • getQuandlTimeSeries – connects to the time-series API and creates a table from the retrieved data
  • getQuandlTable – connects to the datatables API and creates a table from the retrieved data
  • getQuandl – tries both APIs – datatables first, then time-series if datatables doesn't have the requested data set

Each of the methods throws an IOException, which is normally caused by a problem parsing CSV data downloaded from Quandl. If a method is used through the Deephaven console, the console itself will set up a handler for the IOException. If a method is used in Java code, it must be called from within a try/catch block to ensure the IOException is caught if thrown. RuntimeExceptions normally result from other problems that can occur during the download and parsing of Quandl data (e.g., a timeout, or an incorrect property value).
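
For example, a Groovy or Java caller might wrap the call as follows (a minimal sketch; how the exception is handled is up to the caller):

import static com.illumon.iris.quandl.QuandlHelpers.*

try {
    // Download the WIKI/PRICES datatable into an in-memory table
    myTable = getQuandlTable("WIKI", "PRICES")
} catch (IOException e) {
    // Normally indicates a problem parsing the CSV downloaded from Quandl
    println("Quandl import failed: " + e.getMessage())
}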

Each of the methods noted above has several forms in which it can be used. The simplest form takes two arguments: a Quandl database name, and a Quandl dataset or datatable name. (Note: These names are shown in Quandl's online data catalog at www.quandl.com). For instance, using the database "WIKI" and the datatable "PRICES" provides free end-of-day stock data for 3,000 U.S. stocks.
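
For example, the following creates an in-memory table from the free WIKI/PRICES datatable (getQuandl could be used the same way, and would find the data through the datatables API):

myTable=getQuandlTable("WIKI", "PRICES")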

A slightly more complex form for calling these functions takes three or more arguments. Like the simplest form, the first two arguments are the Quandl database name and the Quandl dataset name. However, the last argument is a variable argument (vararg), and it can be used to include one or more Quandl API arguments (separated by commas) to limit the scope of the data requested from Quandl. See example below:

myTable=getQuandlTable("WIKI", "PRICES", "date.gte=20100101", "qopts.export=true")

In this case, "date.gte=20100101" tells the Quandl API to filter the WIKI/PRICES datatable so that only rows with dates greater than or equal to January 1, 2010 are included. The next argument, "qopts.export=true", tells the Quandl API to package the results in a zip file for download. (Note: Deephaven automatically handles checking for the zipped results to be ready for download, downloading them, and unpacking them.) The option to zip results is only available for datatables, not for time series. Also, for large datatables, if the results are not zipped, only the first 10,000 rows will be returned.

Another argument that can be passed to these methods is a StatusCallback object. This progress parameter allows the getQuandl methods to report status as data is downloaded and parsed; it can be used to poll status, or to register a callback that updates a status message and/or a progress bar. If a version of a method that includes a StatusCallback parameter is called with the parameter value explicitly set to null, progress update info messages will be logged to the console log. If a version of the method that does not include a StatusCallback parameter is used, no progress messages will be logged. A custom StatusCallback object can also be created by the calling application and passed in, to allow logging and/or other progress tracking.

When using getQuandl commands, the Quandl API key can be provided using the following methods:

  • Set it in a configuration (.prop) file: The setting Quandl.APIKey can be included in a configuration file.
  • Set it before calling getQuandl: The quandlAPIKey method of the QuandlHelpers class can be used to set or retrieve the current API key being used by the class.
  • Specify it as part of the optional arguments to a getQuandl call: Including api_key=[your_api_key] as one of the arguments will result in this API key being used for the call.
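
For example, the configuration-file setting is a single property line, while the other two methods are used from the Groovy session (all key values shown are placeholders):

Quandl.APIKey=your_api_key

quandlAPIKey("your_api_key")

myTable=getQuandlTable("WIKI", "PRICES", "api_key=your_api_key")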

The order of priority for an API key is:

  1. (Highest) API key set by optional argument
  2. API key set through quandlAPIKey property
  3. API key set in configuration file
  4. If none of these is set, an empty API key will be used, and the request will most likely fail with an HTTP 400 status code

For time-series data, the dataset (second) argument to the commands also accepts a comma-separated list of dataset names, as well as wildcard patterns (using * as a wildcard character).

For example, the following query will download the end-of-day time-series datasets for Apple, IBM, and Google, and merge them into a single table. When providing a list or pattern of datasets, a QuandlDataset column will be added to the table, and populated with the name of the dataset that provided the rows. Rows that came from the WIKI/IBM dataset will have IBM in their QuandlDataset column. Note: This column is only added when a list or pattern is used for datasets. A single dataset is returned exactly as seen on Quandl's site, with no extra column.

myTable=getQuandlTimeSeries("WIKI", "AAPL, IBM, GOOG")
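
The QuandlDataset column can then be used like any other Deephaven column; for example, to keep only the rows that came from the WIKI/IBM dataset:

ibmRows=myTable.where("QuandlDataset=`IBM`")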

Similarly, the following query will download and concatenate all time-series WIKI datasets that start with AT:

myTable=getQuandlTimeSeries("WIKI", "AT*")

For the Quandl WIKI time-series datasets, all the datasets contain the same columns in the same order, so there are no compatibility concerns between the different datasets. However, other Quandl databases offer datasets with disparate column sets. If a selection includes datasets whose schema differs from that of the first dataset downloaded, the command will fail.

Another command that makes it easier to work with time-series data is getQuandlDatasets. For example, the following command returns a table of dataset IDs and their respective descriptions for the YALE database (Yale University Department of Economics):

yaleCatalog=getQuandlDatasets("YALE")

Or, with an API Key:

yaleCatalog=getQuandlDatasets("YALE","*","api_key=4qxnjGszorJKZ565sJP7")

Note:  "4qxnjGszorJKZ565sJP7"is not a valid Quandl API Key. Replace with your own.

Persistent Table Import

The getQuandl commands import data to in-memory tables. Importing data to persistent tables can be accomplished through the QuandlImporter and SchemaFromQuandl classes (com.illumon.iris.importers.QuandlImporter and com.illumon.iris.utils.SchemaFromQuandl).

QuandlImporter assumes (like other importers) that the target namespace and table already exist, and can then:

  • Request and download zipped or non-zipped Quandl data files
  • Extract zipped data files
  • Import data to the Intraday space for the Deephaven table

To help with creation of the schemata for Quandl data, the SchemaFromQuandl class downloads Quandl datasource metadata and creates a schema file to match. 

The table schema created from this process also supports merge operations, and allows specification of a grouping column and/or the partitioning column name. The user must copy the schema file to an appropriate location in their Deephaven installation and run Database Tools to create a table definition from the schema. The user should also review the schema file to see if there are any changes they might want to make to import options or physical partitioning.

The following arguments are available when running the QuandlImporter class:

-qd or --quandlDatabase <database>

-qt or --quandlTable <table or dataset name>

-ak or --apiKey <API key>

-aa or --apiArgs <"name=value ...">

-dd or --destinationDirectory <path>

-dp or --destinationPartition <internal partition name / partitioning value>

-ns or --namespace <namespace>

-tn or --tableName <name>

-om or --outputMode <import behavior>

-dt <download timeout, in seconds>

-wd or --workingDirectory <path>

-kt or --keepTempFiles

-lp or --logProgress

Parameter descriptions:

-qd

The name of the Quandl database from which to download (required).

-qt

The name of the Quandl datatable or dataset from which to download (required).

-ak

The API key to use when connecting to the Quandl API. Like the getQuandl commands described earlier, the QuandlImporter class will use an API key from the Quandl.APIKey configuration setting, if present and if no key is specified using the -ak argument.

-aa

Additional optional arguments to pass to the Quandl API - a space-delimited, double-quoted list of name=value arguments.

-dd or -dp

Either a destination directory or partition must be provided. A directory can be used to write a new set of table files to a specific location on disk, where they can later be read with TableTools. A destination partition is used to write to intraday locations for existing tables. The internal partition value is used to separate data on disk; it does not need to be unique for a table. The name of the import server is a common value for this. The partitioning value is a string data value used to populate the partitioning column in the table during the import. This value must be unique within the table.

-ns

The namespace in which to find the target table (required).

-tn

The name of the target table (optional). If not specified, it will be derived from the Quandl database and dataset or datatable names; e.g., WIKI and PRICES would result in a table called WikiPrices. The derived name must still match the name of an existing table definition.

-om

outputMode (optional)

  • SAFE checks whether the target table already contains data matching the provided partitioning value. If it does, the import is aborted.
  • REPLACE should be used when developing an import process because failed import attempts will often write some data to the table, causing the next attempt with SAFE to abort.
  • APPEND should normally be used only when you are running multiple imports to load a set of data to the same table at one time, possibly from multiple different sources, and the resultant data needs to be kept together as part of one logical partition.

-dt

Download timeout (optional) - more precisely, the amount of time, in seconds, to wait for the Quandl server to generate a zip file when qopts.export=true has been specified. The default is five minutes. Generation of some large table result files can take as long as 20 minutes, while simpler results often take only a few seconds. This setting only applies to datatables imports, since dataset files are never zipped.

-wd

Path in which to store data files downloaded from Quandl (optional). If this option is not used, data files will be downloaded and processed in memory. This option can help conserve memory when downloading large amounts of data, or on systems with little memory available for the import process.

-kt

When used with -wd, instructs the import process to leave temporary files in the working directory after importing them. By default, these temporary files are deleted after import.

-lp

Enables additional progress logging during the import operation. Useful to monitor progress for large, long-running imports.

The following arguments are available when running the SchemaFromQuandl class:

-qd or --quandlDatabase <database>

-qt or --quandlTable <table or dataset name>

-sp or --schemaPath <path>

-ak or --apiKey <API key>

-gc or --groupingColumn <column>

-pc or --partitionColumn <column>

-dc or --datasetColumn <column>

-ns or --namespace <namespace>

-tn or --tableName <name>

Parameter descriptions:

-qd

The name of the Quandl database from which to download (required).

-qt

The name of the Quandl datatable or dataset from which to download (required).

-sp

An optional path to which the schema file will be written, if a schema file is being generated. If not specified, this will default to the current working directory and will create or use a sub-directory that matches the namespace.

-ak

The API key to use when connecting to the Quandl API. Like the getQuandl commands described earlier, the SchemaFromQuandl class will use an API key from the Quandl.APIKey configuration setting, if present and if no key is specified using the -ak argument.

-gc

Optional column to mark as the grouping column if schema is being generated.

-pc

Optional name for the partitioning column if schema is being generated. If this is not provided, the importer will default to Date for the partitioning column. Any existing column from the source that matches the name of the partitioning column will be renamed to source_[original column name]. (See: Importing Data > Schemas > Partitions)

-dc

Optional name to use for the column that will designate the dataset name from which rows were imported. This applies only to Quandl time-series data sources. If no name is provided, the importer will add a column called "QuandlDataset" to store this value.

-ns

The namespace in which to find the target table (required).

-tn

The name of the target table (optional). If not specified, it will be derived from the Quandl database and dataset or datatable names; e.g., WIKI and PRICES would result in a table called WikiPrices.


The following is a sample sequence of command-line steps to import more than six years of EOD data for 3,000 U.S. stocks (about 5.1M rows):

  1. Create the table schema file:

sudo java -cp "/usr/illumon/latest/java_lib/*:/etc/sysconfig/illumon.d/resources/" -server -Xmx4096m -DConfiguration.rootFile=IRIS-CONFIG.prop -Dworkspace=/tmp/viewer_vncgui1\:1 -Ddevroot=/usr/illumon/latest -DLoggerFactory.teeOutput=true -DWAuthenticationClientManager.defaultPrivateKeyFile=/etc/sysconfig/illumon.d/auth/priv-iris.base64.txt -Dprocess.name=quandlschema -DIrisConsole.password=false com.illumon.iris.utils.SchemaFromQuandl -ns Quandl -qd WIKI -qt PRICES -gc ticker -pc data_set -ak <insert API Key>

  2. Copy the resultant Quandl.WikiPrices.schema file to a path under the SchemaConfig.resourcePath:

sudo mkdir /etc/sysconfig/illumon.d/schema/Quandl

sudo cp Quandl/Quandl.WikiPrices.schema /etc/sysconfig/illumon.d/schema/Quandl/

  3. Import the schema file to create the table definition:

sudo /usr/illumon/latest/bin/create_iris_database

  4. Run the Quandl importer to download and import data to the new table:

sudo java -cp "/usr/illumon/latest/java_lib/*:/etc/sysconfig/illumon.d/resources/" -server -Xmx4096m -DConfiguration.rootFile=IRIS-CONFIG.prop -Dworkspace=/tmp/viewer_vncgui1\:1 -Ddevroot=/usr/illumon/latest -DLoggerFactory.teeOutput=true -DWAuthenticationClientManager.defaultPrivateKeyFile=/etc/sysconfig/illumon.d/auth/priv-iris.base64.txt -Dprocess.name=quandlimport -DIrisConsole.password=false com.illumon.iris.importers.QuandlImporter -dp IrisDB/1 -ns Quandl -qd WIKI -qt PRICES -ak <insert API Key> -om REPLACE -aa "date.gte=20100101 qopts.export=true"

Note:  The Quandl import process may take significant time. For large datatables requested with qopts.export=true, generating the zipped export alone can take up to 20 minutes (see the -dt argument above).

