Class XmlSchemaCreator

java.lang.Object
com.illumon.iris.importers.XmlSchemaCreator

public class XmlSchemaCreator
extends Object
Reads an XML file, attempts to infer column data types, and creates the corresponding schema and importer instructions. Also legalizes column names and adds ImportColumn entries that translate the original column names to the legalized ones.
  • Constructor Summary

    Constructors 
    Constructor Description
    XmlSchemaCreator​(com.fishlib.io.logger.Logger log, StatusCallback progress)  
  • Method Summary

    Modifier and Type Method Description
    static List<String> getElementTypeNames​(File sourceFile, int startIndex, int startDepth, int maxDepth)
    Returns slash-delimited path-qualified element names based on the starting index, depth, and max-depth in the XML document.
    static com.illumon.iris.importers.CsvImporterHelperXml getInitializedCsvImporterHelperXml​(File sourceFile, String elementType, int startIndex, int startDepth, int maxDepth, boolean useElementValues, boolean useAttributeValues, boolean namedValues, int startColumnIndex, int startColumnDepth, String columnNameElement, com.fishlib.io.logger.Logger log)
    Returns an import helper that can provide CSV records to parse, and column details from an XML file.
    String getTableSchema​(String namespace, String table, String groupingColumn, String partitionColumn, String sourceName, String sourcePartitionColumn, File sourceFile, String elementType, boolean bestFit, int startIndex, int startDepth, int maxDepth, boolean useElementValues, boolean useAttributeValues, boolean namedValues, int startColumnIndex, int startColumnDepth, String columnNameElement, boolean logProgress)
    Get an XML String of table schema based on a file and user-provided options.
    String getTableSchema​(String namespace, String table, String groupingColumn, String partitionColumn, String sourceName, String sourcePartitionColumn, File sourceFile, String elementType, boolean bestFit, int startIndex, int startDepth, int maxDepth, boolean useElementValues, boolean useAttributeValues, boolean namedValues, int startColumnIndex, int startColumnDepth, String columnNameElement, boolean logProgress, int maxRows)
    Get an XML String of table schema based on a file and user-provided options.
    String getTableSchema​(String namespace, String table, String groupingColumn, String partitionColumn, String sourceName, String sourcePartitionColumn, File sourceFile, String elementType, boolean bestFit, int startIndex, int startDepth, int maxDepth, boolean useElementValues, boolean useAttributeValues, boolean namedValues, int startColumnIndex, int startColumnDepth, String columnNameElement, boolean logProgress, int maxRows, CasingStyle casingStyle, String replacement)
    Get an XML String of table schema based on a file and user-provided options.
    static void main​(String... args)
    Regular main entry point, used when this module is called from a java command line, or from an IntelliJ run configuration.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

  • Method Details

    • getElementTypeNames

      public static List<String> getElementTypeNames​(@NotNull File sourceFile, int startIndex, int startDepth, int maxDepth)
      Returns slash-delimited path-qualified element names based on the starting index, depth, and max-depth in the XML document.
      Parameters:
      sourceFile - File pointing to XML document to read.
      startIndex - From the root element, how many elements to skip before starting.
      startDepth - From the start index element, how many levels to descend before beginning to enumerate.
      maxDepth - From the starting depth, how many further levels to recurse while enumerating.
      Returns:
      A List of Strings of the qualified element paths.
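      The idea of slash-delimited, path-qualified element names can be illustrated with a small stand-alone sketch that uses only the JDK DOM parser. It is not this class's implementation: the class ElementPathSketch and the method elementPaths are hypothetical, the sketch always starts at the root element (it does not model startIndex or startDepth), and its maxDepth handling is an assumption about how depth limiting might behave.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical illustration only; not part of com.illumon.iris.importers.
final class ElementPathSketch {

    /** Returns distinct slash-delimited element paths below the root, recursing at most maxDepth further levels. */
    static List<String> elementPaths(String xml, int maxDepth) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Set<String> paths = new LinkedHashSet<>(); // keeps first-seen order, drops repeats
            collect(doc.getDocumentElement(), "", 0, maxDepth, paths);
            return new ArrayList<>(paths);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    private static void collect(Element parent, String prefix, int depth, int maxDepth, Set<String> paths) {
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip text, comments, and other non-element nodes
            }
            String path = prefix.isEmpty() ? child.getNodeName() : prefix + "/" + child.getNodeName();
            paths.add(path);
            if (depth < maxDepth) {
                collect((Element) child, path, depth + 1, maxDepth, paths);
            }
        }
    }

    public static void main(String[] args) {
        String xml = "<Response><Records><Record><Id>1</Id></Record>"
                + "<Record><Id>2</Id></Record></Records></Response>";
        System.out.println(elementPaths(xml, 1)); // prints [Records, Records/Record]
    }
}
```

      A path such as Records/Record from this kind of listing is the sort of value one would expect to pass as elementType to the other methods of this class, although the exact format accepted there is not stated in this page.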
    • getInitializedCsvImporterHelperXml

      public static com.illumon.iris.importers.CsvImporterHelperXml getInitializedCsvImporterHelperXml​(@NotNull File sourceFile, String elementType, int startIndex, int startDepth, int maxDepth, boolean useElementValues, boolean useAttributeValues, boolean namedValues, int startColumnIndex, int startColumnDepth, String columnNameElement, com.fishlib.io.logger.Logger log)
      Returns an import helper that can provide CSV records to parse, and column details from an XML file.
      Parameters:
      sourceFile - File object pointing to the XML file to be analyzed.
      elementType - A string element name to match when finding elements to import from the XML
      startIndex - Number of elements after the root element to start looking for elements to import
      startDepth - How far under the element obtained from startIndex to start looking for elements to import
      maxDepth - How far to recurse into import elements when searching for values to import
      useElementValues - Whether to use values that are stored as the contents of elements
      useAttributeValues - Whether to use values that are stored as attributes
      namedValues - True to use values by name, false to use values by position
      startColumnIndex - Number of elements after the root element to start looking for the element that contains column names
      startColumnDepth - How far under the element obtained from startColumnIndex to start looking for the element that contains column names
      columnNameElement - The name of the element that contains column names
      log - An Iris logger
      Returns:
      A CsvImporterHelperXml import helper initialized from the XML file.
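      The difference between useElementValues and useAttributeValues can be shown with a short stand-alone sketch that reads one record element into named values. This is an illustration under assumptions, not the helper's actual logic: XmlRecordSketch, parseRoot, and readRecord are hypothetical names, and the sketch only covers the by-name case (namedValues set to true) at a single level of depth.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical illustration only; not part of com.illumon.iris.importers.
final class XmlRecordSketch {

    /** Parses an XML string and returns its root element. */
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Collects column-name to value pairs from one record element. */
    static Map<String, String> readRecord(Element record, boolean useElementValues, boolean useAttributeValues) {
        Map<String, String> values = new LinkedHashMap<>();
        if (useAttributeValues) {
            // Values stored as attributes, e.g. <Record id="7">
            NamedNodeMap attrs = record.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node attr = attrs.item(i);
                values.put(attr.getNodeName(), attr.getNodeValue());
            }
        }
        if (useElementValues) {
            // Values stored as element contents, e.g. <Price>1.5</Price>
            NodeList children = record.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    values.put(child.getNodeName(), child.getTextContent().trim());
                }
            }
        }
        return values;
    }

    public static void main(String[] args) {
        Element record = parseRoot("<Record id=\"7\"><Price>1.5</Price></Record>");
        System.out.println(readRecord(record, true, false)); // element contents only
        System.out.println(readRecord(record, false, true)); // attributes only
    }
}
```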
    • getTableSchema

      public String getTableSchema​(String namespace, String table, String groupingColumn, String partitionColumn, String sourceName, String sourcePartitionColumn, File sourceFile, String elementType, boolean bestFit, int startIndex, int startDepth, int maxDepth, boolean useElementValues, boolean useAttributeValues, boolean namedValues, int startColumnIndex, int startColumnDepth, String columnNameElement, boolean logProgress)
      Get an XML String of table schema based on a file and user-provided options. This method is public so other applications, like the Schema Editor, can use it.
      Parameters:
      namespace - Namespace to use for the new schema.
      table - Table name to use for the new schema.
      groupingColumn - Optional single column name to mark as a Grouping column.
      partitionColumn - Which column to use as the Partitioning column.
      sourceName - Name to use for the XML ImportSource.
      sourcePartitionColumn - Column name in the source data to use for multi-partition imports
      sourceFile - File object pointing to the XML file to be analyzed.
      elementType - A string element name to match when finding elements to import from the XML
      bestFit - Whether to try to use smaller types (true), like int and float, or just to use bigger types, like long and double.
      startIndex - Number of elements after the root element to start looking for elements to import
      startDepth - How far under the element obtained from startIndex to start looking for elements to import
      maxDepth - How far to recurse into import elements when searching for values to import
      useElementValues - Whether to use values that are stored as the contents of elements
      useAttributeValues - Whether to use values that are stored as attributes
      namedValues - True to use values by name, false to use values by position
      startColumnIndex - Number of elements after the root element to start looking for the element that contains column names
      startColumnDepth - How far under the element obtained from startColumnIndex to start looking for the element that contains column names
      columnNameElement - The name of the element that contains column names
      logProgress - Whether to update the log with progress percentages.
      Returns:
      A String with the XML of the derived table schema and XML import instructions.
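      The effect of bestFit can be pictured with a small stand-alone sketch of type inference over sampled column values. The rules below are assumptions chosen for illustration (in particular, float is picked only when every value survives a round trip through float); they are not taken from this class, and TypeInferenceSketch and inferType are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of com.illumon.iris.importers.
final class TypeInferenceSketch {

    /** Returns "int", "long", "float", "double", or "String" for a column's sampled values. */
    static String inferType(List<String> values, boolean bestFit) {
        if (values.isEmpty()) {
            return "String"; // nothing to infer from
        }
        boolean allIntegral = true;
        boolean fitsInt = true;
        boolean fitsFloat = true;
        for (String raw : values) {
            String v = raw.trim();
            try {
                long l = Long.parseLong(v);
                if (l < Integer.MIN_VALUE || l > Integer.MAX_VALUE) {
                    fitsInt = false;
                }
            } catch (NumberFormatException e) {
                allIntegral = false;
            }
            try {
                double d = Double.parseDouble(v);
                // Assumed rule: float is acceptable only if no precision is lost.
                if ((double) (float) d != d) {
                    fitsFloat = false;
                }
            } catch (NumberFormatException e) {
                return "String"; // at least one value is not numeric
            }
        }
        if (allIntegral) {
            return bestFit && fitsInt ? "int" : "long";
        }
        return bestFit && fitsFloat ? "float" : "double";
    }

    public static void main(String[] args) {
        System.out.println(inferType(Arrays.asList("1", "2"), true));   // prints int
        System.out.println(inferType(Arrays.asList("1", "2"), false));  // prints long
        System.out.println(inferType(Arrays.asList("1.5", "2"), false)); // prints double
    }
}
```

      With bestFit set to false the sketch always chooses the wider type, which is the safer choice when the sampled rows may not represent the full range of the data.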
    • getTableSchema

      public String getTableSchema​(String namespace, String table, String groupingColumn, String partitionColumn, String sourceName, String sourcePartitionColumn, File sourceFile, String elementType, boolean bestFit, int startIndex, int startDepth, int maxDepth, boolean useElementValues, boolean useAttributeValues, boolean namedValues, int startColumnIndex, int startColumnDepth, String columnNameElement, boolean logProgress, int maxRows)
      Get an XML String of table schema based on a file and user-provided options. This method is public so other applications, like the Schema Editor, can use it.
      Parameters:
      namespace - Namespace to use for the new schema.
      table - Table name to use for the new schema.
      groupingColumn - Optional single column name to mark as a Grouping column.
      partitionColumn - Which column to use as the Partitioning column.
      sourceName - Name to use for the XML ImportSource.
      sourcePartitionColumn - Column name in the source data to use for multi-partition imports
      sourceFile - File object pointing to the XML file to be analyzed.
      elementType - A string element name to match when finding elements to import from the XML
      bestFit - Whether to try to use smaller types (true), like int and float, or just to use bigger types, like long and double.
      startIndex - Number of elements after the root element to start looking for elements to import
      startDepth - How far under the element obtained from startIndex to start looking for elements to import
      maxDepth - How far to recurse into import elements when searching for values to import
      useElementValues - Whether to use values that are stored as the contents of elements
      useAttributeValues - Whether to use values that are stored as attributes
      namedValues - True to use values by name, false to use values by position
      startColumnIndex - Number of elements after the root element to start looking for the element that contains column names
      startColumnDepth - How far under the element obtained from startColumnIndex to start looking for the element that contains column names
      columnNameElement - The name of the element that contains column names
      logProgress - Whether to update the log with progress percentages.
      maxRows - A maximum number of rows to read, rather than reading the whole file. A value of zero or less means to read the whole file.
      Returns:
      A String with the XML of the derived table schema and XML import instructions.
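      The maxRows convention described above (zero or less means the whole file) can be expressed in a few lines. The sketch below applies that convention to an in-memory list; MaxRowsSketch and rowsToSample are hypothetical names, and how this class actually limits its reading is not shown in this page.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of com.illumon.iris.importers.
final class MaxRowsSketch {

    /** Returns the rows to analyze: all of them when maxRows <= 0, otherwise at most the first maxRows. */
    static <T> List<T> rowsToSample(List<T> rows, int maxRows) {
        if (maxRows <= 0 || maxRows >= rows.size()) {
            return rows; // no limit requested, or the limit exceeds the available rows
        }
        return rows.subList(0, maxRows);
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("r1", "r2", "r3");
        System.out.println(rowsToSample(rows, 2)); // prints [r1, r2]
        System.out.println(rowsToSample(rows, 0)); // prints [r1, r2, r3]
    }
}
```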
    • getTableSchema

      public String getTableSchema​(String namespace, String table, String groupingColumn, String partitionColumn, String sourceName, String sourcePartitionColumn, File sourceFile, String elementType, boolean bestFit, int startIndex, int startDepth, int maxDepth, boolean useElementValues, boolean useAttributeValues, boolean namedValues, int startColumnIndex, int startColumnDepth, String columnNameElement, boolean logProgress, int maxRows, CasingStyle casingStyle, String replacement)
      Get an XML String of table schema based on a file and user-provided options. This method is public so other applications, like the Schema Editor, can use it.
      Parameters:
      namespace - Namespace to use for the new schema.
      table - Table name to use for the new schema.
      groupingColumn - Optional single column name to mark as a Grouping column.
      partitionColumn - Which column to use as the Partitioning column.
      sourceName - Name to use for the XML ImportSource.
      sourcePartitionColumn - Column name in the source data to use for multi-partition imports
      sourceFile - File object pointing to the XML file to be analyzed.
      elementType - A string element name to match when finding elements to import from the XML
      bestFit - Whether to try to use smaller types (true), like int and float, or just to use bigger types, like long and double.
      startIndex - Number of elements after the root element to start looking for elements to import
      startDepth - How far under the element obtained from startIndex to start looking for elements to import
      maxDepth - How far to recurse into import elements when searching for values to import
      useElementValues - Whether to use values that are stored as the contents of elements
      useAttributeValues - Whether to use values that are stored as attributes
      namedValues - True to use values by name, false to use values by position
      startColumnIndex - Number of elements after the root element to start looking for the element that contains column names
      startColumnDepth - How far under the element obtained from startColumnIndex to start looking for the element that contains column names
      columnNameElement - The name of the element that contains column names
      logProgress - Whether to update the log with progress percentages.
      maxRows - A maximum number of rows to read, rather than reading the whole file. A value of zero or less means to read the whole file.
      casingStyle - CasingStyle to apply to column names. A value of None or null leaves casing unchanged.
      replacement - Character, or empty String, to use as the replacement for spaces and hyphens in source column names.
      Returns:
      A String with the XML of the derived table schema and XML import instructions.
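      How casingStyle and replacement might combine when legalizing a source column name is sketched below. This is a guess at the general shape, not the class's algorithm: ColumnNameSketch, legalize, and the SketchCasing enum (with only NONE, UPPER, and LOWER) are hypothetical stand-ins, and the real CasingStyle type may offer different values.

```java
import java.util.Locale;
import java.util.regex.Matcher;

// Hypothetical illustration only; not part of com.illumon.iris.importers.
final class ColumnNameSketch {

    /** Stand-in for the real CasingStyle type, whose values are not listed in this page. */
    enum SketchCasing { NONE, UPPER, LOWER }

    /** Replaces spaces and hyphens with the replacement text, then applies the requested casing. */
    static String legalize(String sourceName, SketchCasing casing, String replacement) {
        // quoteReplacement guards against '$' or '\' in the replacement being read as regex syntax.
        String name = sourceName.trim().replaceAll("[ -]", Matcher.quoteReplacement(replacement));
        if (casing == null || casing == SketchCasing.NONE) {
            return name; // null or NONE: leave casing unchanged
        }
        return casing == SketchCasing.UPPER
                ? name.toUpperCase(Locale.ROOT)
                : name.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(legalize("Trade Date", SketchCasing.NONE, "_")); // prints Trade_Date
        System.out.println(legalize("bid-size", SketchCasing.UPPER, ""));   // prints BIDSIZE
    }
}
```

      When a name changes in this way, the generated schema needs a mapping from the source name to the legalized one, which is the role the class description assigns to the ImportColumn entries.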
    • main

      public static void main​(String... args)
      Regular main entry point, used when this module is called from a java command line, or from an IntelliJ run configuration.
      Parameters:
      args - Varargs list of arguments in Apache CLI format