Introduction to the CsvUtil Module

The CsvUtil module provides functionality for parsing CSV-like files.

To use this module, use "%requires CsvUtil" in your code.

All the public symbols in the module are defined in the CsvUtil namespace

Currently the module provides the following classes:

AbstractCsvIterator: base abstract iterator class for iterating line-based CSV data
AbstractCsvWriter: a base class for new CSV writer implementations
CsvIterator: iterator class allowing for generic CSV input data to be processed line by line on a record basis
CsvReadDataProvider: a data provider for reading and searching CSV files
CsvReadDataProviderFactory: a data provider factory for reading and searching CSV files
CsvWriter: generic stream-based CSV writer

Furthermore, the following specialized classes are implemented based on the above and are provided for convenience and backwards-compatibility:

CsvDataIterator: iterator class allowing for CSV string data to be processed line by line on a record basis
CsvFileIterator: iterator class allowing for CSV files to be processed line by line on a record basis
CsvFileWriter: CSV file writer
CsvStringWriter: CSV in-memory writer

Note that the CsvIterator class can be used to parse arbitrary text data; the field separator character can be specified in the constructor, as well as the quote character and end of line sequence. See the constructor documentation for more information.

Examples:

#!/usr/bin/env qore
 
%new-style
%strict-args
%require-types
%enable-all-warnings
 
%requires CsvUtil
 
FileInputStream input("example-file.csv");
CsvIterator i(input);
FileOutputStream output("example-file-copy.csv");
CsvWriter writer(output, ("headers": ("cc", "serno", "desc", "received")));
 
while (i.next()) {
    printf("%d: %y\n", i.index(), i.getValue());
    writer.writeLine(i.getValue());
}

If "example-file.csv" is:

UK,1234567890,"Sony, Xperia S",31052012
UK,1234567891,"Sony, Xperia S",31052012
UK,1234567892,"Sony, Xperia S",31052012
UK,1234567893,"Sony, Xperia S",31052012

The data is read verbatim, each value is returned as a string, header names are generated numerically; the output is:

1: {0: "UK", 1: "1234567890", 2: "Sony, Xperia S", 3: "31052012"}
2: {0: "UK", 1: "1234567891", 2: "Sony, Xperia S", 3: "31052012"}
3: {0: "UK", 1: "1234567892", 2: "Sony, Xperia S", 3: "31052012"}
4: {0: "UK", 1: "1234567893", 2: "Sony, Xperia S", 3: "31052012"}

Also the "example-file-copy.csv" will contain data from the original file formatted as CSV.

If header names are provided and field types are specified, the output looks different:

#!/usr/bin/env qore
 
%new-style
%strict-args
%require-types
%enable-all-warnings
 
%requires CsvUtil
 
FileInputStream input("example-file.csv");
CsvIterator i(input, NOTHING, ("headers": ("cc", "serno", "desc", "received"), "fields": ("serno": "int", "received": ("type": "date", "format": "DDMMYYYY"))));
while (i.next())
    printf("%d: %y\n", i.index(), i.getValue());

Now the hash keys in each record returned are those given in the constructor, and the fields "serno" and "received" are given other data types; this produces:

1: {cc: "UK", serno: 1234567890, desc: "Sony, Xperia S", received: 2012-05-31 00:00:00 Thu +02:00 (CEST)}
2: {cc: "UK", serno: 1234567891, desc: "Sony, Xperia S", received: 2012-05-31 00:00:00 Thu +02:00 (CEST)}
3: {cc: "UK", serno: 1234567892, desc: "Sony, Xperia S", received: 2012-05-31 00:00:00 Thu +02:00 (CEST)}
4: {cc: "UK", serno: 1234567893, desc: "Sony, Xperia S", received: 2012-05-31 00:00:00 Thu +02:00 (CEST)}

Use the "header_lines" and "header_names" options to automatically read the header names from the file if present. Use the "fields" option to describe the fields and perform transformations on the data read. For more information, see the CsvFileIterator class.

Multi-type support

The CsvUtil module supports multi-type records, meaning that different input lines may have different structures, and input record type resolution is performed at run-time using predefined rules. CsvUtil classes have two constructors, a backwards-compatible constructor variant, which accepts field definition in options and supports only a single record type, and a new constructor avariant, which ccepts field specifications and options passed as separate parameters.

When used with multi-type definitions, the CsvUtil::AbstractCsvIterator::getValue() method returns records in an extended form. In this case, the return value of CsvUtil::AbstractCsvIterator::getValue() is a hash with the following keys:

"type": the resolved record type name
"record": a hash of record values

As this is not backwards compatible, this mode is enabled using the "extended_record" option.

If no record type resolution rules or logic is defined, then record types are resolved automatically based on their unique number of fields. If the number of fields is not unique (i.e. two or more records have the same number of fields), then a rule must exist to resolve the record type.

Typically the value of a particular field determines the record type, however even multiple fields could be used. Record type detection configuration is supplied by the "value" (field value equality test) or "regex" (regular expression test) keys in the field specification hash for the record in question. If multiple fields in a record definintion have "value" or "regex" keys, then all fields must match the input data in order for the input line to match the record.

The above record type resolution logic is executed in AbstractCsvIterator::identifyTypeImpl(), which executes any "regex" or "value" tests on the input line in the order of the field definitions in the record description hash.

Record type resolution is performed as follow:

"value": Matches the full value of the field; if an integer "value" value is used, then integer comparisons are done, otherwise string comparisons are performed.
"regex": Matches the full value of the field with regular expression.

When there are no record-matching keys in the field hashes for any record and the input record field number are not unique, then AbstractCsvIterator::identifyTypeImpl() must be overridden in a subclass to provide custom record matching logic.

Note

It is an error to have both "regex" and "value" keys in a field specification hash
If multiple fields have configuration for input line matching (i.e. "regex" and "value" keys), then all fields with this configuration must match for the record to be matched
since version 1.5.

#!/usr/bin/env qore
 
%new-style
%strict-args
%require-types
%enable-all-warnings
 
%requires CsvUtil
 
const Spec = {
    "header": {
        "type": {"type": "int", "value": 1},
        "invoice_no": "string",
        "date": {"type": "date", "format": "YYYYMMDD"},
    },
    "item": {
        "type": {"type": "int", "value": 2},
        "item_no": "string",
        "item": "string",
        "pcs": "int",
        "price": "number",
    },
    "footer": {
        "type": {"type": "int", "value": 3},
        "total": "number",
    },
};
 
const Opts = {};
 
FileInputStream input("example-multi-file.csv");
CsvIterator i(input, NOTHING, Spec, {"extended_record": True} + Opts);
 
FileOutputStream output("example-multi-file-copy.csv");
CsvWriter writer(output, Spec, {"write-headers": False, "optimal-quotes": True, "quote_escape": "\"") + Opts);
while (i.next()) {
    printf("%d: %y\n", i.index(), i.getValue());
    writer.writeLine(i.getValue());
}

If "example-multi-file.csv" is:

1,2016-01,20160401
2,123,"Sony, Xperia S",1,100
2,124,"Nokia, Lumia",2,150
3,250

The data is read verbatim, each value is returned as a string, header names are generated numerically; the output is:

1: {type: "header", record: {type: 1, invoice_no: "2016-01", date: 2016-04-01 00:00:00.000000 Fri +02:00 (CEST)}}
2: {type: "item", record: {type: 2, item_no: "123", item: "Sony, Xperia S", pcs: 1, price: 100}}
3: {type: "item", record: {type: 2, item_no: "124", item: "Nokia, Lumia", pcs: 2, price: 150}}
4: {type: "footer", record: {type: 3, total: 250}}

Also the "example-multi-file-copy.csv" will contain data from the original file formatted as CSV.

Release Notes

Version 1.7.2

updated read and write data providers to provide verbose option support (issue 4139)
implemented support for the missing bool and *bool types (issue 4118)

Version 1.7.1

fixed a bug where CSV iteration would fail if a quoted field had a line break in it (issue 4079)

Version 1.7

added data provider API support (issue 3545)

Version 1.6.2

implemented the number_format option to allow numbers with alternative decimal separators to be parsed and generated (issue 2806)

Version 1.6.1

implemented the CsvUtil::csvutil_set_global_compat_force_empty_string() function and the compat_force_empty_string CSV parsing option to force "*string" fields with no value to return an empty string when parsing rather than NOTHING for backwards compatibility with very early versions of CsvUtil (issue 2476)

Version 1.6

added support for streams; the following stream-based classes have been added:
- CsvIterator: provides a more generic interface than CsvDataIterator and CsvFileIterator
- CsvWriter: provides a more generic interface than CsvStringWriter and CsvFileWriter
fixed a bug in an error message validating input data (issue 1062)
improved fields option documentation and added an exception when detected headers do not match the fields option (issue 2179)

Version 1.5.1

fixed a bug in AbstractCsvIterator::identifyTypeImpl() generating an error message (issue 1355)

Version 1.5

fixed a bug handling the global option "eol"
converted to new-style
if "headers" are not given in the CsvUtil::AbstractCsvWriter::constructor() but "fields" are, then set the headers from the field descriptions automatically
added write() methods returning the generated strings to the CsvUtil::CsvStringWriter class for API compatibility with the corresponding FixedLengthDataWriter methods
implemented support for SQLStatement as an iterator source for CsvUtil::AbstractCsvWriter::write()
implemented the "datamap" and "info_log" options for CSV generation
implemented options with underscores to replace options with dashes:
- CsvUtil::AbstractCsvWriter
  - "date-format" is now "date_format"
  - "optimal-quotes" is now "optimal_quotes"
  - "verify-columns" is now "verify_columns"
  - "write-headers" is now "write_headers"
- CsvUtil::AbstractCsvIterator
  - "ignore-empty" is now "ignore_empty"
  - "ignore-whitespace" is now "ignore_whitespace"
  - "header-names" is now "header_names"
  - "header-lines" is now "header_lines"
  - "verify-columns" is now "verify_columns"
implemented multi-type line support
two constructors for backward CsvUtil compatability and similarity with FixedLengthUtil API
implemented new options "header_reorder"; CsvUtil::AbstractCsvIterator "date_format","extended_record"
implemented options for field specification: "index", "default", "header", "value", "regex"
fixed a UTC bug for default date 1970-01-01Z

Version 1.4

fixed the "format" field option when used with "*date" field types
implemented the "tolwr" parser option
changed the default field type when parsing and generating CSV files from "string" to "*string"

Version 1.3

added the "write-headers" option to CsvUtil::AbstractCsvWriter and subclasses to enable headers to be suppressed
added the "optimal-quotes" option to CsvUtil::AbstractCsvWriter and subclasses to enable more efficient csv output (now the default output option); to revert back to the previous behavior (where all fields are quoted regardless of data type or content), set to False in the constructor

Version 1.2

fixed CsvUtil::CsvDataIterator::next() when header_lines > 0 and working with empty input data
implemented support for the "*int", "*float", "*number", and "*date" types
implemented support for allowing subclasses of CsvUtil::CsvFileIterator to implement support for other custom types
fixed "date" type handling with empty input; now returns 1970-01-01Z (use "*date" to map empty input to NOTHING)
added the CsvUtil::CsvStringWriter, CsvUtil::AbstractCsvWriter, and CsvUtil::CsvFileWriter classes
if "headers" are not given in the CsvUtil::CsvFileIterator::constructor() but "fields" are, then set the headers from the field descriptions automatically

Version 1.1

bug fixes to header and fields option processing
fixed CsvUtil::CsvFileIterator::index() to return the line index
added CsvUtil::CsvFileIterator::lineNumber() to return the current line number in the file

Version 1.0

initial version of module

Table of Contents