Package com.illumon.iris.importers
Class CsvSchemaCreator
java.lang.Object
com.illumon.iris.importers.CsvSchemaCreator
public class CsvSchemaCreator extends Object
Reads a CSV file and attempts to infer column data types and create appropriate schema and importer instructions.
Also legalizes column names and adds corresponding ImportColumn entries for translation of column names.
-
Constructor Summary
CsvSchemaCreator(com.fishlib.io.logger.Logger log, StatusCallback progress)
-
Method Summary
static CsvImporterHelper getInitializedCsvImporterHelper(File sourceFile, String fileFormat, char delimiter, int skipHeaderLines, int skipFooterLines, boolean trim, boolean noHeader, List<String> columnNames, com.fishlib.io.logger.Logger log)
Sets up and returns a CsvImporterHelper to provide column details and record parsing capabilities.

String getTableSchema(String namespace, String table, String groupingColumn, String partitionColumn, String sourceName, String sourcePartitionColumn, String fileFormat, char delimiter, int skipHeaderLines, int skipFooterLines, File sourceFile, boolean bestFit, boolean trim, boolean noHeader, List<String> columnNames, boolean logProgress, int maxRows)
Get an XML String of table schema based on a file and user-provided options.

String getTableSchema(String namespace, String table, String groupingColumn, String partitionColumn, String sourceName, String sourcePartitionColumn, String fileFormat, char delimiter, int skipHeaderLines, int skipFooterLines, File sourceFile, boolean bestFit, boolean trim, boolean noHeader, List<String> columnNames, boolean logProgress, int maxRows, CasingStyle casingStyle, String replacement)
Get an XML String of table schema based on a file and user-provided options.

static void main(String... args)
Regular main entry point, used when this module is called from a java command line, or from an IntelliJ run configuration.
-
Constructor Details
-
Method Details
-
getInitializedCsvImporterHelper
public static CsvImporterHelper getInitializedCsvImporterHelper(@NotNull File sourceFile, String fileFormat, char delimiter, int skipHeaderLines, int skipFooterLines, boolean trim, boolean noHeader, List<String> columnNames, com.fishlib.io.logger.Logger log)
Sets up and returns a CsvImporterHelper to provide column details and record parsing capabilities.
Parameters:
sourceFile - File object pointing to the CSV file to be analyzed.
fileFormat - Apache CSV Parser file format name.
delimiter - Single character delimiter.
skipHeaderLines - Number of lines to skip at the top of the file before trying to read the header row.
skipFooterLines - Number of lines to skip from the end of the file.
trim - Whether to trim whitespace around values between delimiters.
noHeader - Indicates that the source file does not include a header row with column names.
log - An Iris event logger object.
Returns:
A CsvImporterHelper that can process the passed sourceFile.
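The following is a minimal sketch of how the skipHeaderLines, skipFooterLines, delimiter, and trim options described above could be interpreted. It uses only the Java standard library and does not call CsvImporterHelper. The class CsvLineOptionsSketch and its methods are hypothetical names introduced for illustration, and the splitting shown here ignores quoting, which a real CSV parser would handle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of com.illumon.iris.importers.
class CsvLineOptionsSketch {

    /** Drops skipHeaderLines lines from the top and skipFooterLines lines from the end. */
    static List<String> applySkips(List<String> lines, int skipHeaderLines, int skipFooterLines) {
        int from = Math.min(Math.max(skipHeaderLines, 0), lines.size());
        int to = Math.max(from, lines.size() - Math.max(skipFooterLines, 0));
        return new ArrayList<>(lines.subList(from, to));
    }

    /** Splits one line on a single-character delimiter, optionally trimming each value. */
    static List<String> splitLine(String line, char delimiter, boolean trim) {
        List<String> values = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= line.length(); i++) {
            // A value ends at each delimiter and at the end of the line.
            if (i == line.length() || line.charAt(i) == delimiter) {
                String value = line.substring(start, i);
                values.add(trim ? value.trim() : value);
                start = i + 1;
            }
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("# exported by tool", "sym, price", "AAPL, 10.5", "END");
        // Skip one comment line at the top and one marker line at the end.
        for (String line : applySkips(lines, 1, 1)) {
            System.out.println(splitLine(line, ',', true));
        }
    }
}
```

In this sketch the first line left after skipping would be treated as the header row unless noHeader is set.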
-
getTableSchema
public String getTableSchema(String namespace, String table, String groupingColumn, String partitionColumn, String sourceName, String sourcePartitionColumn, String fileFormat, char delimiter, int skipHeaderLines, int skipFooterLines, File sourceFile, boolean bestFit, boolean trim, boolean noHeader, List<String> columnNames, boolean logProgress, int maxRows)
Gets an XML String of the table schema, based on a source file and user-provided options. This method is public so other applications, like the Schema Editor, can use it.
Parameters:
namespace - Namespace to use for the new schema.
table - Table name to use for the new schema.
groupingColumn - Optional single column name to mark as a Grouping column.
partitionColumn - Which column to use as the Partitioning column.
sourceName - Name to use for the CSV InputSource.
sourcePartitionColumn - Column name in the source data to use for multi-partition imports.
fileFormat - Apache CSV Parser file format name.
delimiter - Single character delimiter.
skipHeaderLines - Number of lines to skip at the top of the file before trying to read the header row.
skipFooterLines - Number of lines to skip from the end of the file.
sourceFile - File object pointing to the CSV file to be analyzed.
bestFit - If true, tries to use smaller types such as int and float; if false, uses larger types such as long and double.
trim - Whether to trim whitespace around values between delimiters.
noHeader - Indicates that the CSV does not include a header with column names.
logProgress - Whether to update the log with progress percentages.
maxRows - Maximum number of rows to read, rather than reading the whole file. A value of zero or less means to read the whole file.
Returns:
A String with the XML of the derived table schema and CSV import instructions.
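To illustrate the bestFit option, here is a minimal sketch of one way a column type could be inferred from sample values. It is an assumption about the general technique, not the inference logic CsvSchemaCreator actually uses. BestFitSketch and inferType are hypothetical names, only integral, floating-point, and String columns are considered, and the float check looks at magnitude only and ignores precision.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the inference code used by CsvSchemaCreator.
class BestFitSketch {

    /** Returns "int", "long", "float", "double", or "String" for a column of non-null text values. */
    static String inferType(List<String> values, boolean bestFit) {
        if (values.isEmpty()) {
            return "String";
        }
        boolean allIntegral = true;
        boolean fitsInt = true;
        boolean fitsFloat = true;
        for (String v : values) {
            if (allIntegral) {
                try {
                    long l = Long.parseLong(v);
                    if (l < Integer.MIN_VALUE || l > Integer.MAX_VALUE) {
                        fitsInt = false;
                    }
                } catch (NumberFormatException e) {
                    allIntegral = false;
                }
            }
            try {
                double d = Double.parseDouble(v);
                // Magnitude check only; loss of precision is not detected here.
                if (Math.abs(d) > Float.MAX_VALUE) {
                    fitsFloat = false;
                }
            } catch (NumberFormatException e) {
                return "String"; // any non-numeric value makes the column a String
            }
        }
        if (allIntegral) {
            return bestFit && fitsInt ? "int" : "long";
        }
        return bestFit && fitsFloat ? "float" : "double";
    }

    public static void main(String[] args) {
        List<String> sizes = Arrays.asList("100", "250", "75");
        System.out.println(inferType(sizes, true));
        System.out.println(inferType(sizes, false));
    }
}
```

With bestFit set to false the sketch always chooses the wider type, which is the safer choice when only the first maxRows rows of a large file are sampled.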
-
getTableSchema
public String getTableSchema(String namespace, String table, String groupingColumn, String partitionColumn, String sourceName, String sourcePartitionColumn, String fileFormat, char delimiter, int skipHeaderLines, int skipFooterLines, File sourceFile, boolean bestFit, boolean trim, boolean noHeader, List<String> columnNames, boolean logProgress, int maxRows, CasingStyle casingStyle, String replacement)
Gets an XML String of the table schema, based on a source file and user-provided options. This method is public so other applications, like the Schema Editor, can use it.
Parameters:
namespace - Namespace to use for the new schema.
table - Table name to use for the new schema.
groupingColumn - Optional single column name to mark as a Grouping column.
partitionColumn - Which column to use as the Partitioning column.
sourceName - Name to use for the CSV InputSource.
sourcePartitionColumn - Column name in the source data to use for multi-partition imports.
fileFormat - Apache CSV Parser file format name.
delimiter - Single character delimiter.
skipHeaderLines - Number of lines to skip at the top of the file before trying to read the header row.
skipFooterLines - Number of lines to skip from the end of the file.
sourceFile - File object pointing to the CSV file to be analyzed.
bestFit - If true, tries to use smaller types such as int and float; if false, uses larger types such as long and double.
trim - Whether to trim whitespace around values between delimiters.
noHeader - Indicates that the CSV does not include a header with column names.
logProgress - Whether to update the log with progress percentages.
maxRows - Maximum number of rows to read, rather than reading the whole file. A value of zero or less means to read the whole file.
casingStyle - CasingStyle to apply to column names. None or null leaves the casing unchanged.
replacement - Character, or empty String, used to replace spaces and hyphens in source column names.
Returns:
A String with the XML of the derived table schema and CSV import instructions.
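As a rough illustration of the replacement parameter, the sketch below replaces spaces and hyphens in source column names and records which names changed, which is the kind of source-to-target mapping an ImportColumn entry would need to carry. ColumnNameReplacementSketch and its methods are hypothetical names. The legalization performed by CsvSchemaCreator may apply further rules, and casingStyle is not modeled here.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;

// Hypothetical illustration only; the real legalization rules may be broader.
class ColumnNameReplacementSketch {

    /** Replaces every space or hyphen with the replacement text (which may be empty). */
    static String replaceSeparators(String sourceName, String replacement) {
        // quoteReplacement keeps characters such as '$' from being treated as group references.
        return sourceName.replaceAll("[ -]", Matcher.quoteReplacement(replacement));
    }

    /** Maps each source name that changed to its new name, preserving input order. */
    static Map<String, String> renamedColumns(List<String> sourceNames, String replacement) {
        Map<String, String> renames = new LinkedHashMap<>();
        for (String name : sourceNames) {
            String legal = replaceSeparators(name, replacement);
            if (!legal.equals(name)) {
                renames.put(name, legal);
            }
        }
        return renames;
    }

    public static void main(String[] args) {
        List<String> header = Arrays.asList("Sym", "Trade Date", "bid-size");
        System.out.println(renamedColumns(header, "_"));
    }
}
```

Only names that actually change appear in the map, so columns that are already legal would need no translation entry.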
-
main
public static void main(String... args)
Regular main entry point, used when this module is called from a Java command line or from an IntelliJ run configuration.
Parameters:
args - Varargs list of arguments in Apache CLI format.
-