Qore CsvUtil Module Reference  1.7
CsvUtil Module

Introduction to the CsvUtil Module

The CsvUtil module provides functionality for parsing CSV-like files.

To use this module, use "%requires CsvUtil" in your code.

All the public symbols in the module are defined in the CsvUtil namespace

Currently the module provides the following classes:

Furthermore, the following specialized classes are implemented based on the above and are provided for convenience and backwards-compatibility:

Note that the CsvIterator class can be used to parse arbitrary text data; the field separator character can be specified in the constructor, as well as the quote character and end of line sequence. See the constructor documentation for more information.

Examples:

#!/usr/bin/env qore
%new-style
%strict-args
%require-types
%enable-all-warnings
%requires CsvUtil
FileInputStream input("example-file.csv");
CsvIterator i(input);
FileOutputStream output("example-file-copy.csv");
CsvWriter writer(output, ("headers": ("cc", "serno", "desc", "received")));
while (i.next()) {
printf("%d: %y\n", i.index(), i.getValue());
writer.writeLine(i.getValue());
}

If "example-file.csv" is:

UK,1234567890,"Sony, Xperia S",31052012
UK,1234567891,"Sony, Xperia S",31052012
UK,1234567892,"Sony, Xperia S",31052012
UK,1234567893,"Sony, Xperia S",31052012

The data is read verbatim, each value is returned as a string, header names are generated numerically; the output is:

1: {0: "UK", 1: "1234567890", 2: "Sony, Xperia S", 3: "31052012"}
2: {0: "UK", 1: "1234567891", 2: "Sony, Xperia S", 3: "31052012"}
3: {0: "UK", 1: "1234567892", 2: "Sony, Xperia S", 3: "31052012"}
4: {0: "UK", 1: "1234567893", 2: "Sony, Xperia S", 3: "31052012"}

Also the "example-file-copy.csv" will contain data from the original file formatted as CSV.

If header names are provided and field types are specified, the output looks different:

#!/usr/bin/env qore
%new-style
%strict-args
%require-types
%enable-all-warnings
%requires CsvUtil
FileInputStream input("example-file.csv");
CsvIterator i(input, NOTHING, ("headers": ("cc", "serno", "desc", "received"), "fields": ("serno": "int", "received": ("type": "date", "format": "DDMMYYYY"))));
while (i.next())
printf("%d: %y\n", i.index(), i.getValue());

Now the hash keys in each record returned are those given in the constructor, and the fields "serno" and "received" are given other data types; this produces:

1: {cc: "UK", serno: 1234567890, desc: "Sony, Xperia S", received: 2012-05-31 00:00:00 Thu +02:00 (CEST)}
2: {cc: "UK", serno: 1234567891, desc: "Sony, Xperia S", received: 2012-05-31 00:00:00 Thu +02:00 (CEST)}
3: {cc: "UK", serno: 1234567892, desc: "Sony, Xperia S", received: 2012-05-31 00:00:00 Thu +02:00 (CEST)}
4: {cc: "UK", serno: 1234567893, desc: "Sony, Xperia S", received: 2012-05-31 00:00:00 Thu +02:00 (CEST)}

Use the "header_lines" and "header_names" options to automatically read the header names from the file if present. Use the "fields" option to describe the fields and perform transformations on the data read. For more information, see the CsvFileIterator class.

Multi-type support

The CsvUtil module supports multi-type records, meaning that different input lines may have different structures, and input record type resolution is performed at run-time using predefined rules. CsvUtil classes have two constructors, a backwards-compatible constructor variant, which accepts field definition in options and supports only a single record type, and a new constructor avariant, which ccepts field specifications and options passed as separate parameters.

When used with multi-type definitions, the CsvUtil::AbstractCsvIterator::getValue() method returns records in an extended form. In this case, the return value of CsvUtil::AbstractCsvIterator::getValue() is a hash with the following keys:

  • "type": the resolved record type name
  • "record": a hash of record values

As this is not backwards compatible, this mode is enabled using the "extended_record" option.

If no record type resolution rules or logic is defined, then record types are resolved automatically based on their unique number of fields. If the number of fields is not unique (i.e. two or more records have the same number of fields), then a rule must exist to resolve the record type.

Typically the value of a particular field determines the record type, however even multiple fields could be used. Record type detection configuration is supplied by the "value" (field value equality test) or "regex" (regular expression test) keys in the field specification hash for the record in question. If multiple fields in a record definintion have "value" or "regex" keys, then all fields must match the input data in order for the input line to match the record.

The above record type resolution logic is executed in AbstractCsvIterator::identifyTypeImpl(), which executes any "regex" or "value" tests on the input line in the order of the field definitions in the record description hash.

Record type resolution is performed as follow:

  • "value": Matches the full value of the field; if an integer "value" value is used, then integer comparisons are done, otherwise string comparisons are performed.
  • "regex": Matches the full value of the field with regular expression.

When there are no record-matching keys in the field hashes for any record and the input record field number are not unique, then AbstractCsvIterator::identifyTypeImpl() must be overridden in a subclass to provide custom record matching logic.

Note
  • It is an error to have both "regex" and "value" keys in a field specification hash
  • If multiple fields have configuration for input line matching (i.e. "regex" and "value" keys), then all fields with this configuration must match for the record to be matched
  • since version 1.5.
#!/usr/bin/env qore
%new-style
%strict-args
%require-types
%enable-all-warnings
%requires CsvUtil
const spec = (
"header": (
"type": ("type": "int", "value": 1),
"invoice_no": "string",
"date": ("type": "date", "format": "YYYYMMDD"),
),
"item": (
"type": ("type": "int", "value": 2),
"item_no": "string",
"item": "string",
"pcs": "int",
"price": "number",
),
"footer": (
"type": ("type": "int", "value": 3),
"total": "number",
),
);
const opts = {};
FileInputStream input("example-multi-file.csv");
CsvIterator i(input, NOTHING, spec, ("extended_record": True) + opts);
FileOutputStream output("example-multi-file-copy.csv");
CsvWriter writer(output, spec, ("write-headers": False, "optimal-quotes" : True, "quote_escape" : "\"") + opts);
while (i.next()) {
printf("%d: %y\n", i.index(), i.getValue());
writer.writeLine(i.getValue());
}

If "example-multi-file.csv" is:

1,2016-01,20160401
2,123,"Sony, Xperia S",1,100
2,124,"Nokia, Lumia",2,150
3,250

The data is read verbatim, each value is returned as a string, header names are generated numerically; the output is:

1: {type: "header", record: {type: 1, invoice_no: "2016-01", date: 2016-04-01 00:00:00.000000 Fri +02:00 (CEST)}}
2: {type: "item", record: {type: 2, item_no: "123", item: "Sony, Xperia S", pcs: 1, price: 100}}
3: {type: "item", record: {type: 2, item_no: "124", item: "Nokia, Lumia", pcs: 2, price: 150}}
4: {type: "footer", record: {type: 3, total: 250}}

Also the "example-multi-file-copy.csv" will contain data from the original file formatted as CSV.

Release Notes

Version 1.7

Version 1.6.2

  • implemented the number_format option to allow numbers with alternative decimal separators to be parsed and generated (issue 2806)

Version 1.6.1

Version 1.6

Version 1.5.1

Version 1.5

  • fixed a bug handling the global option "eol"
  • converted to new-style
  • if "headers" are not given in the CsvUtil::AbstractCsvWriter::constructor() but "fields" are, then set the headers from the field descriptions automatically
  • added write() methods returning the generated strings to the CsvUtil::CsvStringWriter class for API compatibility with the corresponding FixedLengthDataWriter methods
  • implemented support for SQLStatement as an iterator source for CsvUtil::AbstractCsvWriter::write()
  • implemented the "datamap" and "info_log" options for CSV generation
  • implemented options with underscores to replace options with dashes:
    • CsvUtil::AbstractCsvWriter
      • "date-format" is now "date_format"
      • "optimal-quotes" is now "optimal_quotes"
      • "verify-columns" is now "verify_columns"
      • "write-headers" is now "write_headers"
    • CsvUtil::AbstractCsvIterator
      • "ignore-empty" is now "ignore_empty"
      • "ignore-whitespace" is now "ignore_whitespace"
      • "header-names" is now "header_names"
      • "header-lines" is now "header_lines"
      • "verify-columns" is now "verify_columns"
  • implemented multi-type line support
  • two constructors for backward CsvUtil compatability and similarity with FixedLengthUtil API
  • implemented new options "header_reorder"; CsvUtil::AbstractCsvIterator "date_format","extended_record"
  • implemented options for field specification: "index", "default", "header", "value", "regex"
  • fixed a UTC bug for default date 1970-01-01Z

Version 1.4

  • fixed the "format" field option when used with "*date" field types
  • implemented the "tolwr" parser option
  • changed the default field type when parsing and generating CSV files from "string" to "*string"

Version 1.3

  • added the "write-headers" option to CsvUtil::AbstractCsvWriter and subclasses to enable headers to be suppressed
  • added the "optimal-quotes" option to CsvUtil::AbstractCsvWriter and subclasses to enable more efficient csv output (now the default output option); to revert back to the previous behavior (where all fields are quoted regardless of data type or content), set to False in the constructor

Version 1.2

Version 1.1

Version 1.0

  • initial version of module