Class CsvSchemaCreator

java.lang.Object
com.illumon.iris.importers.CsvSchemaCreator

public class CsvSchemaCreator
extends Object
Reads a CSV file and attempts to infer column data types and create appropriate schema and importer instructions. Also legalizes column names and adds corresponding ImportColumn entries for translation of column names.
  • Constructor Summary

    Constructors 
    Constructor Description
    CsvSchemaCreator​(com.fishlib.io.logger.Logger log, StatusCallback progress)  
  • Method Summary

    Modifier and Type Method Description
    static CsvImporterHelper getInitializedCsvImporterHelper​(File sourceFile, String fileFormat, char delimiter, int skipHeaderLines, int skipFooterLines, boolean trim, boolean noHeader, List<String> columnNames, com.fishlib.io.logger.Logger log)
    Sets up and returns a CsvImporterHelper to provide column details and record parsing capabilities.
    String getTableSchema​(String namespace, String table, String groupingColumn, String partitionColumn, String sourceName, String sourcePartitionColumn, String fileFormat, char delimiter, int skipHeaderLines, int skipFooterLines, File sourceFile, boolean bestFit, boolean trim, boolean noHeader, List<String> columnNames, boolean logProgress, int maxRows)
    Get an XML String of table schema based on a file and user-provided options.
    String getTableSchema​(String namespace, String table, String groupingColumn, String partitionColumn, String sourceName, String sourcePartitionColumn, String fileFormat, char delimiter, int skipHeaderLines, int skipFooterLines, File sourceFile, boolean bestFit, boolean trim, boolean noHeader, List<String> columnNames, boolean logProgress, int maxRows, CasingStyle casingStyle, String replacement)
    Get an XML String of table schema based on a file and user-provided options.
    static void main​(String... args)
    Regular main entry point, used when this module is called from a java command line, or from an IntelliJ run configuration.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

  • Method Details

    • getInitializedCsvImporterHelper

      public static CsvImporterHelper getInitializedCsvImporterHelper​(@NotNull File sourceFile, String fileFormat, char delimiter, int skipHeaderLines, int skipFooterLines, boolean trim, boolean noHeader, List<String> columnNames, com.fishlib.io.logger.Logger log)
      Sets up and returns a CsvImporterHelper to provide column details and record parsing capabilities.
      Parameters:
      sourceFile - File object pointing to the CSV file to be analyzed.
      fileFormat - Apache CSV Parser file format name.
      delimiter - Single character delimiter.
      skipHeaderLines - Number of lines to skip at the top of the file before trying to read the header row.
      skipFooterLines - Number of lines to skip from the end of the file.
      trim - Whether to trim whitespace around values between delimiters.
      noHeader - Indicates that the source file does not include a header row with column names.
      log - An Iris event logger object.
      Returns:
      A CsvImporterHelper that can process the passed sourceFile.
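The options above can be illustrated with a small, self-contained sketch. It is not the CsvImporterHelper implementation: the class HeaderOptionsSketch and its methods are hypothetical, the split is naive (no quoting support, unlike a real CSV parser), and the idea that columnNames supplies the names when noHeader is true is an assumption, since this page does not document that parameter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of com.illumon.iris.importers.
class HeaderOptionsSketch {

    /** Drops skipHeaderLines lines from the top and skipFooterLines lines from the bottom. */
    static List<String> usableLines(List<String> lines, int skipHeaderLines, int skipFooterLines) {
        int from = Math.min(Math.max(skipHeaderLines, 0), lines.size());
        int to = Math.max(from, lines.size() - Math.max(skipFooterLines, 0));
        return new ArrayList<>(lines.subList(from, to));
    }

    /** Splits one line on a single-character delimiter, optionally trimming each value. */
    static List<String> split(String line, char delimiter, boolean trim) {
        List<String> values = new ArrayList<>();
        // Pattern.quote keeps delimiters such as '|' from being read as regex syntax;
        // the -1 limit keeps trailing empty values.
        for (String value : line.split(Pattern.quote(String.valueOf(delimiter)), -1)) {
            values.add(trim ? value.trim() : value);
        }
        return values;
    }

    /** Assumption: supplied names are used when noHeader is set; otherwise the first usable line is the header. */
    static List<String> columnNames(List<String> usable, char delimiter, boolean trim,
                                    boolean noHeader, List<String> suppliedNames) {
        if (noHeader) {
            return suppliedNames;
        }
        if (usable.isEmpty()) {
            throw new IllegalArgumentException("No header row left after skipping lines");
        }
        return split(usable.get(0), delimiter, trim);
    }
}
```

With skipHeaderLines = 1 and skipFooterLines = 1, a leading comment line and a trailing totals line are both dropped before the header row is read.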
    • getTableSchema

      public String getTableSchema​(String namespace, String table, String groupingColumn, String partitionColumn, String sourceName, String sourcePartitionColumn, String fileFormat, char delimiter, int skipHeaderLines, int skipFooterLines, File sourceFile, boolean bestFit, boolean trim, boolean noHeader, List<String> columnNames, boolean logProgress, int maxRows)
      Get an XML String of table schema based on a file and user-provided options. This method is public so other applications, like the Schema Editor, can use it.
      Parameters:
      namespace - Namespace to use for the new schema.
      table - Table name to use for the new schema.
      groupingColumn - Optional single column name to mark as a Grouping column.
      partitionColumn - Which column to use as the Partitioning column.
      sourceName - Name to use for the CSV InputSource.
      sourcePartitionColumn - Column name in the source data to use for multi-partition imports.
      fileFormat - Apache CSV Parser file format name.
      delimiter - Single character delimiter.
      skipHeaderLines - Number of lines to skip at the top of the file before trying to read the header row.
      skipFooterLines - Number of lines to skip from the end of the file.
      sourceFile - File object pointing to the CSV file to be analyzed.
      bestFit - Whether to try the smallest types that fit the data (true), like int and float, or always to use the larger types, like long and double (false).
      trim - Whether to trim whitespace around values between delimiters.
      noHeader - Indicates that the CSV does not include a header with column names.
      logProgress - Whether to update the log with progress percentages.
      maxRows - A maximum number of rows to read, rather than reading the whole file. A value of zero or less means to read the whole file.
      Returns:
      A String with the XML of derived table schema and CSV import instructions.
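To make the bestFit and maxRows options concrete, here is a minimal sketch of how a column type could be inferred from sampled values. The class TypeInferenceSketch and its rules are hypothetical; the actual inference logic of CsvSchemaCreator is not described on this page and may differ, for example in how it handles dates, booleans, or empty values. The float check used here (the value prints the same as a float and as a double) is one possible heuristic, chosen only for illustration.

```java
import java.util.List;

// Hypothetical illustration only; not the CsvSchemaCreator inference logic.
class TypeInferenceSketch {

    /** Returns "int", "long", "float", "double", or "String" for one column's values. */
    static String inferType(List<String> values, boolean bestFit, int maxRows) {
        // Zero or less means read every row, mirroring the maxRows description.
        int limit = maxRows <= 0 ? values.size() : Math.min(maxRows, values.size());
        if (limit == 0) {
            return "String";
        }
        boolean allIntegral = true;
        boolean fitsInt = true;
        boolean allNumeric = true;
        boolean fitsFloat = true;
        for (int i = 0; i < limit; i++) {
            String value = values.get(i).trim();
            try {
                long parsed = Long.parseLong(value);
                fitsInt &= parsed >= Integer.MIN_VALUE && parsed <= Integer.MAX_VALUE;
            } catch (NumberFormatException e) {
                allIntegral = false;
            }
            try {
                double parsed = Double.parseDouble(value);
                // Heuristic: treat the value as float-safe if narrowing does not change its printed form.
                fitsFloat &= Float.toString((float) parsed).equals(Double.toString(parsed));
            } catch (NumberFormatException e) {
                allNumeric = false;
            }
        }
        if (allIntegral) {
            return bestFit && fitsInt ? "int" : "long";
        }
        if (allNumeric) {
            return bestFit && fitsFloat ? "float" : "double";
        }
        return "String";
    }
}
```

Note that a positive maxRows trades accuracy for speed: a value further down the file that does not fit the inferred type is never seen by the sampler.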
    • getTableSchema

      public String getTableSchema​(String namespace, String table, String groupingColumn, String partitionColumn, String sourceName, String sourcePartitionColumn, String fileFormat, char delimiter, int skipHeaderLines, int skipFooterLines, File sourceFile, boolean bestFit, boolean trim, boolean noHeader, List<String> columnNames, boolean logProgress, int maxRows, CasingStyle casingStyle, String replacement)
      Get an XML String of table schema based on a file and user-provided options. This method is public so other applications, like the Schema Editor, can use it.
      Parameters:
      namespace - Namespace to use for the new schema.
      table - Table name to use for the new schema.
      groupingColumn - Optional single column name to mark as a Grouping column.
      partitionColumn - Which column to use as the Partitioning column.
      sourceName - Name to use for the CSV InputSource.
      sourcePartitionColumn - Column name in the source data to use for multi-partition imports.
      fileFormat - Apache CSV Parser file format name.
      delimiter - Single character delimiter.
      skipHeaderLines - Number of lines to skip at the top of the file before trying to read the header row.
      skipFooterLines - Number of lines to skip from the end of the file.
      sourceFile - File object pointing to the CSV file to be analyzed.
      bestFit - Whether to try the smallest types that fit the data (true), like int and float, or always to use the larger types, like long and double (false).
      trim - Whether to trim whitespace around values between delimiters.
      noHeader - Indicates that the CSV does not include a header with column names.
      logProgress - Whether to update the log with progress percentages.
      maxRows - A maximum number of rows to read, rather than reading the whole file. A value of zero or less means to read the whole file.
      casingStyle - CasingStyle to apply to column names. None or null leaves the casing unchanged.
      replacement - Character, or empty String, to use as the replacement for spaces and hyphens in source column names.
      Returns:
      A String with the XML of derived table schema and CSV import instructions.
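The effect of casingStyle and replacement on a source column name can be sketched as follows. This is an illustration under stated assumptions, not the library's legalization code: the class ColumnNameSketch and the enum Casing are hypothetical stand-ins (the real CasingStyle constants are not listed on this page), and the sketch covers only space and hyphen replacement plus simple upper or lower casing.

```java
import java.util.Locale;
import java.util.regex.Matcher;

// Hypothetical illustration only; not the CsvSchemaCreator legalization logic.
class ColumnNameSketch {

    // Hypothetical stand-in for CasingStyle; the real constants may differ.
    enum Casing { NONE, UPPER, LOWER }

    /** Replaces spaces and hyphens, then applies the requested casing (null or NONE = unchanged). */
    static String legalize(String sourceName, Casing casing, String replacement) {
        // "[ -]" matches a single space or hyphen; quoteReplacement keeps '$' and '\' literal.
        String name = sourceName.replaceAll("[ -]", Matcher.quoteReplacement(replacement));
        if (casing == null || casing == Casing.NONE) {
            return name;
        }
        return casing == Casing.UPPER
                ? name.toUpperCase(Locale.ROOT)
                : name.toLowerCase(Locale.ROOT);
    }
}
```

For example, under these assumptions a source column named "Trade Date" with replacement "_" and no casing change becomes "Trade_Date", while an empty replacement would yield "TradeDate".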
    • main

      public static void main​(String... args)
      Regular main entry point, used when this module is called from a java command line, or from an IntelliJ run configuration.
      Parameters:
      args - Varargs list of arguments in Apache CLI format.