Package com.illumon.iris.importers
Class XmlSchemaCreator
java.lang.Object
com.illumon.iris.importers.XmlSchemaCreator
public class XmlSchemaCreator extends Object
Reads an XML file and attempts to infer column data types and create appropriate schema and importer instructions.
Also legalizes column names and adds corresponding ImportColumn entries for translation of column names.
-
Constructor Summary
Constructor Description
XmlSchemaCreator(com.fishlib.io.logger.Logger log, StatusCallback progress)
-
Method Summary
Modifier and Type Method Description

static List<String> getElementTypeNames(File sourceFile, int startIndex, int startDepth, int maxDepth)
Returns slash-delimited path-qualified element names based on the starting index, depth, and max-depth in the XML document.

static com.illumon.iris.importers.CsvImporterHelperXml getInitializedCsvImporterHelperXml(File sourceFile, String elementType, int startIndex, int startDepth, int maxDepth, boolean useElementValues, boolean useAttributeValues, boolean namedValues, int startColumnIndex, int startColumnDepth, String columnNameElement, com.fishlib.io.logger.Logger log)
Returns an import helper that can provide CSV records to parse, and column details from an XML file.

String getTableSchema(String namespace, String table, String groupingColumn, String partitionColumn, String sourceName, String sourcePartitionColumn, File sourceFile, String elementType, boolean bestFit, int startIndex, int startDepth, int maxDepth, boolean useElementValues, boolean useAttributeValues, boolean namedValues, int startColumnIndex, int startColumnDepth, String columnNameElement, boolean logProgress)
Get an XML String of table schema based on a file and user-provided options.

String getTableSchema(String namespace, String table, String groupingColumn, String partitionColumn, String sourceName, String sourcePartitionColumn, File sourceFile, String elementType, boolean bestFit, int startIndex, int startDepth, int maxDepth, boolean useElementValues, boolean useAttributeValues, boolean namedValues, int startColumnIndex, int startColumnDepth, String columnNameElement, boolean logProgress, int maxRows)
Get an XML String of table schema based on a file and user-provided options.

String getTableSchema(String namespace, String table, String groupingColumn, String partitionColumn, String sourceName, String sourcePartitionColumn, File sourceFile, String elementType, boolean bestFit, int startIndex, int startDepth, int maxDepth, boolean useElementValues, boolean useAttributeValues, boolean namedValues, int startColumnIndex, int startColumnDepth, String columnNameElement, boolean logProgress, int maxRows, CasingStyle casingStyle, String replacement)
Get an XML String of table schema based on a file and user-provided options.

static void main(String... args)
Regular main entry point, used when this module is called from a java command line, or from an IntelliJ run configuration.
-
Constructor Details
-
Method Details
-
getElementTypeNames
public static List<String> getElementTypeNames(@NotNull File sourceFile, int startIndex, int startDepth, int maxDepth)
Returns slash-delimited path-qualified element names based on the starting index, depth, and max-depth in the XML document.
Parameters:
sourceFile - File pointing to the XML document to read.
startIndex - From the root element, how many elements to skip before starting.
startDepth - From the start index element, how many levels to descend before beginning to enumerate.
maxDepth - From the starting depth, how many further levels to recurse while enumerating.
Returns:
A List of Strings of the qualified element paths.
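To make "slash-delimited path-qualified element names" concrete, here is a minimal sketch that builds such paths with the standard JDK DOM parser. It is not the implementation of getElementTypeNames. The class ElementPathSketch and its method elementPaths are hypothetical, the sketch takes an XML String instead of a File, it omits startIndex and startDepth, and whether the real method includes the root element name in each path is not stated here.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical illustration only; not part of com.illumon.iris.importers.
class ElementPathSketch {

    /** Returns the distinct element paths found below the root, at most maxDepth levels deep. */
    static List<String> elementPaths(String xml, int maxDepth) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        Set<String> paths = new LinkedHashSet<>(); // keeps first-seen order, drops duplicates
        collect(root, "", 1, maxDepth, paths);
        return new ArrayList<>(paths);
    }

    // Depth-first walk that appends "/childName" to the parent's path.
    private static void collect(Element parent, String prefix, int depth, int maxDepth, Set<String> out) {
        if (depth > maxDepth) {
            return;
        }
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) {
                continue; // skip text, comments, and other non-element nodes
            }
            String path = prefix.isEmpty() ? child.getNodeName() : prefix + "/" + child.getNodeName();
            out.add(path);
            collect((Element) child, path, depth + 1, maxDepth, out);
        }
    }

    public static void main(String[] args) {
        String xml = "<root><order><item/><item/></order><customer/></root>";
        System.out.println(elementPaths(xml, 2)); // prints [order, order/item, customer]
    }
}
```

In this sketch, lowering maxDepth to 1 for the same document yields only the top-level names order and customer.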
-
getInitializedCsvImporterHelperXml
public static com.illumon.iris.importers.CsvImporterHelperXml getInitializedCsvImporterHelperXml(@NotNull File sourceFile, String elementType, int startIndex, int startDepth, int maxDepth, boolean useElementValues, boolean useAttributeValues, boolean namedValues, int startColumnIndex, int startColumnDepth, String columnNameElement, com.fishlib.io.logger.Logger log)
Returns an import helper that can provide CSV records to parse, and column details from an XML file.
Parameters:
sourceFile - File object pointing to the XML file to be analyzed.
elementType - A string element name to match when finding elements to import from the XML.
startIndex - Number of elements after the root element to start looking for elements to import.
startDepth - How far under the element obtained from startIndex to start looking for elements to import.
maxDepth - How far to recurse into import elements when searching for values to import.
useElementValues - Whether to use values that are stored as the contents of elements.
useAttributeValues - Whether to use values that are stored as attributes.
namedValues - True to use values by name, false to use values by position.
startColumnIndex - Number of elements after the root element to start looking for the element that contains column names.
startColumnDepth - How far under the element obtained from startColumnIndex to start looking for the element that contains column names.
columnNameElement - The name of the element that contains column names.
log - An Iris logger.
Returns:
An import helper class.
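The useElementValues and useAttributeValues flags distinguish values stored as element contents from values stored as attributes. The sketch below shows that distinction for a single record element, using the standard JDK DOM parser and values keyed by name. It is an assumption-laden illustration, not the helper's actual logic: ValueSourceSketch and recordValues are hypothetical names, and positional (non-named) values, maxDepth recursion, and column-name elements are left out.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical illustration only; not part of com.illumon.iris.importers.
class ValueSourceSketch {

    /** Reads name/value pairs from one record element, from attributes and/or child elements. */
    static Map<String, String> recordValues(String xml, boolean useElementValues, boolean useAttributeValues) {
        Element record;
        try {
            record = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        Map<String, String> values = new LinkedHashMap<>();
        if (useAttributeValues) {
            // Values stored as attributes of the record element, e.g. <trade id="7">.
            NamedNodeMap attributes = record.getAttributes();
            for (int i = 0; i < attributes.getLength(); i++) {
                Node attribute = attributes.item(i);
                values.put(attribute.getNodeName(), attribute.getNodeValue());
            }
        }
        if (useElementValues) {
            // Values stored as the contents of child elements, e.g. <price>10.5</price>.
            NodeList children = record.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    values.put(child.getNodeName(), child.getTextContent().trim());
                }
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String xml = "<trade id=\"7\" sym=\"ABC\"><price>10.5</price><size>100</size></trade>";
        System.out.println(recordValues(xml, true, false));
        System.out.println(recordValues(xml, false, true));
    }
}
```

With both flags true, the sketch merges the two sources into one record, so the example element contributes four named values.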
-
getTableSchema
public String getTableSchema(String namespace, String table, String groupingColumn, String partitionColumn, String sourceName, String sourcePartitionColumn, File sourceFile, String elementType, boolean bestFit, int startIndex, int startDepth, int maxDepth, boolean useElementValues, boolean useAttributeValues, boolean namedValues, int startColumnIndex, int startColumnDepth, String columnNameElement, boolean logProgress)
Get an XML String of table schema based on a file and user-provided options. This method is public so other applications, like the Schema Editor, can use it.
Parameters:
namespace - Namespace to use for the new schema.
table - Table name to use for the new schema.
groupingColumn - Optional single column name to mark as a Grouping column.
partitionColumn - Which column to use as the Partitioning column.
sourceName - Name to use for the CSV InputSource.
sourcePartitionColumn - Column name in the source data to use for multi-partition imports.
sourceFile - File object pointing to the XML file to be analyzed.
elementType - A string element name to match when finding elements to import from the XML.
bestFit - Whether to try to use smaller types (true), like int and float, or just to use bigger types, like long and double.
startIndex - Number of elements after the root element to start looking for elements to import.
startDepth - How far under the element obtained from startIndex to start looking for elements to import.
maxDepth - How far to recurse into import elements when searching for values to import.
useElementValues - Whether to use values that are stored as the contents of elements.
useAttributeValues - Whether to use values that are stored as attributes.
namedValues - True to use values by name, false to use values by position.
startColumnIndex - Number of elements after the root element to start looking for the element that contains column names.
startColumnDepth - How far under the element obtained from startColumnIndex to start looking for the element that contains column names.
columnNameElement - The name of the element that contains column names.
logProgress - Whether to update the log with progress percentages.
Returns:
A String with the XML of derived table schema and CSV import instructions.
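The bestFit option chooses between smaller types (int, float) and bigger types (long, double). The following sketch shows one way such a choice could be made from sampled string values. It is a simplified assumption, not the inference XmlSchemaCreator actually performs: BestFitSketch and inferType are hypothetical names, only numeric and String columns are considered, and "fits in a float" is judged by magnitude alone, ignoring precision.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of com.illumon.iris.importers.
class BestFitSketch {

    /** Picks a column type name for the sampled values; bestFit prefers int/float over long/double. */
    static String inferType(List<String> values, boolean bestFit) {
        if (values.isEmpty()) {
            return "String"; // nothing to infer from
        }
        boolean allIntegral = true;
        boolean fitsInt = true;
        boolean fitsFloat = true;
        for (String raw : values) {
            String value = raw.trim();
            try {
                long asLong = Long.parseLong(value);
                if (asLong < Integer.MIN_VALUE || asLong > Integer.MAX_VALUE) {
                    fitsInt = false;
                }
                continue; // integral value handled
            } catch (NumberFormatException notIntegral) {
                allIntegral = false;
            }
            try {
                double asDouble = Double.parseDouble(value);
                // Assumption: only the magnitude decides whether float is wide enough.
                if (Math.abs(asDouble) > Float.MAX_VALUE) {
                    fitsFloat = false;
                }
            } catch (NumberFormatException notNumeric) {
                return "String"; // any non-numeric value makes the column a String
            }
        }
        if (allIntegral) {
            return bestFit && fitsInt ? "int" : "long";
        }
        return bestFit && fitsFloat ? "float" : "double";
    }

    public static void main(String[] args) {
        System.out.println(inferType(Arrays.asList("1", "2", "3"), true));  // prints int
        System.out.println(inferType(Arrays.asList("1", "2", "3"), false)); // prints long
        System.out.println(inferType(Arrays.asList("1.5", "2"), true));     // prints float
    }
}
```

The trade-off this illustrates: smaller types save space but can be too narrow if later data exceeds what the sampled rows contained, which is why the bigger types are the conservative choice.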
-
getTableSchema
public String getTableSchema(String namespace, String table, String groupingColumn, String partitionColumn, String sourceName, String sourcePartitionColumn, File sourceFile, String elementType, boolean bestFit, int startIndex, int startDepth, int maxDepth, boolean useElementValues, boolean useAttributeValues, boolean namedValues, int startColumnIndex, int startColumnDepth, String columnNameElement, boolean logProgress, int maxRows)
Get an XML String of table schema based on a file and user-provided options. This method is public so other applications, like the Schema Editor, can use it.
Parameters:
namespace - Namespace to use for the new schema.
table - Table name to use for the new schema.
groupingColumn - Optional single column name to mark as a Grouping column.
partitionColumn - Which column to use as the Partitioning column.
sourceName - Name to use for the CSV InputSource.
sourcePartitionColumn - Column name in the source data to use for multi-partition imports.
sourceFile - File object pointing to the XML file to be analyzed.
elementType - A string element name to match when finding elements to import from the XML.
bestFit - Whether to try to use smaller types (true), like int and float, or just to use bigger types, like long and double.
startIndex - Number of elements after the root element to start looking for elements to import.
startDepth - How far under the element obtained from startIndex to start looking for elements to import.
maxDepth - How far to recurse into import elements when searching for values to import.
useElementValues - Whether to use values that are stored as the contents of elements.
useAttributeValues - Whether to use values that are stored as attributes.
namedValues - True to use values by name, false to use values by position.
startColumnIndex - Number of elements after the root element to start looking for the element that contains column names.
startColumnDepth - How far under the element obtained from startColumnIndex to start looking for the element that contains column names.
columnNameElement - The name of the element that contains column names.
logProgress - Whether to update the log with progress percentages.
maxRows - A maximum number of rows to read, rather than reading the whole file. A value of zero or less means to read the whole file.
Returns:
A String with the XML of derived table schema and CSV import instructions.
-
getTableSchema
public String getTableSchema(String namespace, String table, String groupingColumn, String partitionColumn, String sourceName, String sourcePartitionColumn, File sourceFile, String elementType, boolean bestFit, int startIndex, int startDepth, int maxDepth, boolean useElementValues, boolean useAttributeValues, boolean namedValues, int startColumnIndex, int startColumnDepth, String columnNameElement, boolean logProgress, int maxRows, CasingStyle casingStyle, String replacement)
Get an XML String of table schema based on a file and user-provided options. This method is public so other applications, like the Schema Editor, can use it.
Parameters:
namespace - Namespace to use for the new schema.
table - Table name to use for the new schema.
groupingColumn - Optional single column name to mark as a Grouping column.
partitionColumn - Which column to use as the Partitioning column.
sourceName - Name to use for the CSV InputSource.
sourcePartitionColumn - Column name in the source data to use for multi-partition imports.
sourceFile - File object pointing to the XML file to be analyzed.
elementType - A string element name to match when finding elements to import from the XML.
bestFit - Whether to try to use smaller types (true), like int and float, or just to use bigger types, like long and double.
startIndex - Number of elements after the root element to start looking for elements to import.
startDepth - How far under the element obtained from startIndex to start looking for elements to import.
maxDepth - How far to recurse into import elements when searching for values to import.
useElementValues - Whether to use values that are stored as the contents of elements.
useAttributeValues - Whether to use values that are stored as attributes.
namedValues - True to use values by name, false to use values by position.
startColumnIndex - Number of elements after the root element to start looking for the element that contains column names.
startColumnDepth - How far under the element obtained from startColumnIndex to start looking for the element that contains column names.
columnNameElement - The name of the element that contains column names.
logProgress - Whether to update the log with progress percentages.
maxRows - A maximum number of rows to read, rather than reading the whole file. A value of zero or less means to read the whole file.
casingStyle - CasingStyle to apply to column names. None or null leaves the casing unchanged.
replacement - Character, or empty String, to use as the replacement for spaces or hyphens in source column names.
Returns:
A String with the XML of derived table schema and CSV import instructions.
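The replacement parameter controls what spaces and hyphens in source column names become. Below is a minimal sketch of that one substitution. It is not the library's legalization routine: ColumnNameSketch and replaceSeparators are hypothetical names, and CasingStyle handling and any other legalization rules are deliberately left out because they are not described here.

```java
import java.util.regex.Matcher;

// Hypothetical illustration only; not part of com.illumon.iris.importers.
class ColumnNameSketch {

    /** Replaces each space or hyphen in a source column name with the given replacement. */
    static String replaceSeparators(String sourceName, String replacement) {
        // quoteReplacement keeps characters such as '$' in the replacement literal.
        return sourceName.trim().replaceAll("[ \\-]", Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        System.out.println(replaceSeparators("Order Date", "_")); // prints Order_Date
        System.out.println(replaceSeparators("unit-price", ""));  // prints unitprice
    }
}
```

Passing an empty String removes the separators entirely, which matches the "character, or empty String" wording of the parameter description.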
-
main
Regular main entry point, used when this module is called from a java command line, or from an IntelliJ run configuration.
Parameters:
args - Varargs list of arguments in Apache CLI format
-