Class CsvSchemaCreator

java.lang.Object
com.illumon.iris.importers.CsvSchemaCreator

public class CsvSchemaCreator extends Object
Reads a CSV file, attempts to infer column data types, and creates an appropriate schema and importer instructions. Also legalizes column names and adds corresponding ImportColumn entries to translate the source column names.
  • Constructor Summary

    Constructors
    Constructor
    Description
    CsvSchemaCreator(com.fishlib.io.logger.Logger log, StatusCallback progress)
     
  • Method Summary

    Modifier and Type
    Method
    Description
    static CsvImporterHelper
    getInitializedCsvImporterHelper(File sourceFile, String fileFormat, char delimiter, int skipHeaderLines, int skipFooterLines, boolean trim, boolean noHeader, List<String> columnNames, com.fishlib.io.logger.Logger log)
    Sets up and returns a CsvImporterHelper to provide column details and record parsing capabilities.
    String
    getTableSchema(String namespace, String table, String groupingColumn, String partitionColumn, String sourceName, String sourcePartitionColumn, String fileFormat, char delimiter, int skipHeaderLines, int skipFooterLines, File sourceFile, boolean bestFit, boolean trim, boolean noHeader, List<String> columnNames, boolean logProgress, int maxRows)
    Get an XML String of table schema based on a file and user-provided options.
    String
    getTableSchema(String namespace, String table, String groupingColumn, String partitionColumn, String sourceName, String sourcePartitionColumn, String fileFormat, char delimiter, int skipHeaderLines, int skipFooterLines, File sourceFile, boolean bestFit, boolean trim, boolean noHeader, List<String> columnNames, boolean logProgress, int maxRows, CasingStyle casingStyle, String replacement)
    Get an XML String of table schema based on a file and user-provided options.
    static void
    main(String... args)
    Regular main entry point, used when this module is called from a Java command line or from an IntelliJ run configuration.

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Constructor Details

    • CsvSchemaCreator

      public CsvSchemaCreator(@NotNull com.fishlib.io.logger.Logger log, StatusCallback progress)
  • Method Details

    • getInitializedCsvImporterHelper

      public static CsvImporterHelper getInitializedCsvImporterHelper(@NotNull File sourceFile, String fileFormat, char delimiter, int skipHeaderLines, int skipFooterLines, boolean trim, boolean noHeader, List<String> columnNames, com.fishlib.io.logger.Logger log)
      Sets up and returns a CsvImporterHelper to provide column details and record parsing capabilities.
      Parameters:
      sourceFile - File object pointing to the CSV file to be analyzed.
      fileFormat - Apache CSV Parser file format name.
      delimiter - Single character delimiter.
      skipHeaderLines - Number of lines to skip at the top of the file before trying to read the header row.
      skipFooterLines - Number of lines to skip from the end of the file.
      trim - Whether to trim data around values between delimiters.
      noHeader - Indicates that the source file does not include a header row with column names.
      log - An Iris event logger object.
      Returns:
      A CsvImporterHelper that can process the passed sourceFile.
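
To make the interplay of skipHeaderLines, skipFooterLines, and noHeader concrete, here is a minimal sketch of how the lines of a file might be divided under those options. It is an illustration only, not the CsvImporterHelper implementation; the class SkipLinesSketch, its nested Split type, and the split method are hypothetical names introduced for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not part of com.illumon.iris.importers.
class SkipLinesSketch {
    /** Holds the header line (null when there is none) and the remaining data lines. */
    static final class Split {
        final String header;
        final List<String> data;

        Split(String header, List<String> data) {
            this.header = header;
            this.data = data;
        }
    }

    /**
     * Skips leading and trailing lines, then treats the first remaining line
     * as the header unless noHeader is set.
     */
    static Split split(List<String> lines, int skipHeaderLines, int skipFooterLines, boolean noHeader) {
        // Clamp both counts so oversized or negative values cannot go out of range.
        int start = Math.min(Math.max(skipHeaderLines, 0), lines.size());
        int end = Math.max(start, lines.size() - Math.max(skipFooterLines, 0));
        String header = null;
        if (!noHeader && start < end) {
            header = lines.get(start);
            start++;
        }
        return new Split(header, new ArrayList<>(lines.subList(start, end)));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "# exported file", "Sym,Price", "AAA,1.5", "BBB,2.5", "2 rows");
        // One comment line at the top and one summary line at the bottom are skipped.
        Split s = split(lines, 1, 1, false);
        System.out.println("header: " + s.header);
        System.out.println("data:   " + s.data);
    }
}
```

In this sketch the header row is read only after the leading lines have been skipped, which matches the wording of the skipHeaderLines description above. How the real helper uses columnNames when noHeader is true is not shown here.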
    • getTableSchema

      public String getTableSchema(String namespace, String table, String groupingColumn, String partitionColumn, String sourceName, String sourcePartitionColumn, String fileFormat, char delimiter, int skipHeaderLines, int skipFooterLines, File sourceFile, boolean bestFit, boolean trim, boolean noHeader, List<String> columnNames, boolean logProgress, int maxRows)
      Get an XML String of table schema based on a file and user-provided options. This method is public so other applications, like the Schema Editor, can use it.
      Parameters:
      namespace - Namespace to use for the new schema.
      table - Table name to use for the new schema.
      groupingColumn - Optional single column name to mark as a Grouping column.
      partitionColumn - Which column to use as the Partitioning column.
      sourceName - Name to use for the CSV InputSource.
      sourcePartitionColumn - Column name in the source data to use for multi-partition imports.
      fileFormat - Apache CSV Parser file format name.
      delimiter - Single character delimiter.
      skipHeaderLines - Number of lines to skip at the top of the file before trying to read the header row.
      skipFooterLines - Number of lines to skip from the end of the file.
      sourceFile - File object pointing to the CSV file to be analyzed.
      bestFit - Whether to try to use smaller types (true), like int and float, or just to use bigger types, like long and double.
      trim - Whether to trim data around values between delimiters.
      noHeader - Indicates that the CSV does not include a header with column names.
      logProgress - Whether to update the log with progress percentages.
      maxRows - A maximum number of rows to read, rather than reading the whole file. A value of zero or less means to read the whole file.
      Returns:
      A String with the XML of derived table schema and CSV import instructions.
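
The bestFit option can be illustrated with a small type-inference sketch. This is an assumption about the general idea (pick int or float when every value fits, otherwise long or double), not the inference logic CsvSchemaCreator actually uses; the class BestFitSketch, the method inferType, and the specific fitting rules are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; the real inference rules may differ.
class BestFitSketch {
    /**
     * Returns a type name for a column of raw CSV values.
     * Assumed rules: all-integral columns become int (bestFit and in int range) or long;
     * other numeric columns become float (bestFit and within float magnitude) or double;
     * anything else, including an empty column, is a String.
     */
    static String inferType(List<String> values, boolean bestFit) {
        if (values.isEmpty()) {
            return "String";
        }
        boolean allIntegral = true;
        boolean fitsInt = true;
        boolean fitsFloat = true;
        for (String raw : values) {
            String v = raw.trim();
            if (allIntegral) {
                try {
                    long l = Long.parseLong(v);
                    if (l < Integer.MIN_VALUE || l > Integer.MAX_VALUE) {
                        fitsInt = false;
                    }
                } catch (NumberFormatException e) {
                    allIntegral = false;
                }
            }
            try {
                double d = Double.parseDouble(v);
                if (Math.abs(d) > Float.MAX_VALUE) {
                    fitsFloat = false;
                }
            } catch (NumberFormatException e) {
                return "String"; // one non-numeric value makes the whole column a String
            }
        }
        if (allIntegral) {
            return (bestFit && fitsInt) ? "int" : "long";
        }
        return (bestFit && fitsFloat) ? "float" : "double";
    }

    public static void main(String[] args) {
        List<String> quantities = Arrays.asList("1", "2", "3");
        System.out.println(inferType(quantities, true));
        System.out.println(inferType(quantities, false));
    }
}
```

With bestFit set to false, the sketch always chooses the wider type, which mirrors the parameter description: smaller types are only attempted when bestFit is true.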
    • getTableSchema

      public String getTableSchema(String namespace, String table, String groupingColumn, String partitionColumn, String sourceName, String sourcePartitionColumn, String fileFormat, char delimiter, int skipHeaderLines, int skipFooterLines, File sourceFile, boolean bestFit, boolean trim, boolean noHeader, List<String> columnNames, boolean logProgress, int maxRows, CasingStyle casingStyle, String replacement)
      Get an XML String of table schema based on a file and user-provided options. This method is public so other applications, like the Schema Editor, can use it.
      Parameters:
      namespace - Namespace to use for the new schema.
      table - Table name to use for the new schema.
      groupingColumn - Optional single column name to mark as a Grouping column.
      partitionColumn - Which column to use as the Partitioning column.
      sourceName - Name to use for the CSV InputSource.
      sourcePartitionColumn - Column name in the source data to use for multi-partition imports.
      fileFormat - Apache CSV Parser file format name.
      delimiter - Single character delimiter.
      skipHeaderLines - Number of lines to skip at the top of the file before trying to read the header row.
      skipFooterLines - Number of lines to skip from the end of the file.
      sourceFile - File object pointing to the CSV file to be analyzed.
      bestFit - Whether to try to use smaller types (true), like int and float, or just to use bigger types, like long and double.
      trim - Whether to trim data around values between delimiters.
      noHeader - Indicates that the CSV does not include a header with column names.
      logProgress - Whether to update the log with progress percentages.
      maxRows - A maximum number of rows to read, rather than reading the whole file. A value of zero or less means to read the whole file.
      casingStyle - If not null, the CasingStyle to apply to column names. None or null means no change to casing.
      replacement - Character, or empty String, used to replace spaces and hyphens in source column names.
      Returns:
      A String with the XML of derived table schema and CSV import instructions.
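
The replacement parameter can be pictured with the following sketch, which substitutes every space or hyphen in a source column name. It is a hypothetical illustration of that one rule only: the class ColumnNameSketch and the method replaceSeparators are not part of the library, and casing via CasingStyle and any other legalization steps are deliberately left out.

```java
import java.util.regex.Matcher;

// Hypothetical illustration; not the library's legalization routine.
class ColumnNameSketch {
    /**
     * Replaces each space or hyphen in a source column name with the given
     * replacement, which may be the empty String to drop those characters.
     */
    static String replaceSeparators(String sourceName, String replacement) {
        // quoteReplacement keeps characters such as '$' in the replacement literal.
        return sourceName.replaceAll("[ -]", Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(replaceSeparators("Trade Date", "_"));
        System.out.println(replaceSeparators("bid-size", ""));
    }
}
```

Because the translated name differs from the name in the file, a mapping from the new column name back to the source name is still needed at import time, which is the role the class description assigns to the generated ImportColumn entries.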
    • main

      public static void main(String... args)
      Regular main entry point, used when this module is called from a Java command line or from an IntelliJ run configuration.
      Parameters:
      args - Varargs list of arguments in Apache CLI format.