Qore FixedLengthUtil Module Reference  1.5

Introduction to the FixedLengthUtil Module

The FixedLengthUtil module provides functionality for parsing and generating data with fixed-length lines. Such data has at least one line (record) type, and each line type is described as a sequence of data items with fixed lengths.

To use this module, use "%requires FixedLengthUtil" in your code.

All the public symbols in the module are defined in the FixedLengthUtil namespace.

Currently the module provides the following classes:

Furthermore, the following specialized classes are implemented based on the above and are provided for convenience and backwards-compatibility:

Global Options

Valid options are:

  • "date_format": the default date format for "date" fields (see date formatting for the value in this case)
  • "encoding": the output encoding for strings parsed or returned
  • "eol": the end of line characters for parsing or generation
  • "file_flags": additional writer File Open Constants; Qore::O_WRONLY | Qore::O_CREAT are used by default. Use e.g. Qore::O_EXCL to avoid overwriting an existing target file, or Qore::O_TRUNC to replace any existing file
  • "ignore_empty": if True then ignore empty lines
  • "number_format": the default number format for "float" or "number" fields (see Qore::parse_number() and Qore::parse_float() for the value in these cases)
  • "timezone": a string giving a time zone region name or an integer offset in seconds east of UTC
  • "truncate": controls whether to truncate an output field value if it is longer than its specified length. Default is "False".
  • "tab2space": Controls whether to replace tabs with spaces and its value determines how many spaces to output in place of one tab character.
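
As a rough illustration of how these options fit together, the following sketch builds one options hash using only the keys listed above. The values are arbitrary example choices, and the variable name is hypothetical. This reference does not spell out which options are honored by readers and which by writers, so treat the combination as an assumption.

```qore
# illustrative only: every value below is an example choice, not a module default
hash<auto> example_options = {
    "date_format": "YYYYMMDD",   # default mask for "date" fields
    "encoding": "UTF-8",         # output encoding for strings
    "eol": "\r\n",               # end-of-line characters
    "file_flags": O_EXCL,        # writers: fail rather than overwrite an existing file
    "ignore_empty": True,        # skip empty lines
    "timezone": "Europe/Prague", # region name; an integer offset in seconds east of UTC also works
    "truncate": True,            # cut output values that exceed the field length
    "tab2space": 4,              # write 4 spaces in place of each tab character
};
```

A hash like this would be passed as the options argument when creating an iterator or writer, as in the reading and writing examples later in this document.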

Specification Hash

The fixed-length specification is a hash where each key is the name of a record and each value is a record description hash describing that record; see the following example:

# the following spec is suitable for input and output
const Specs = (
    "header": (
        "flow_type": ("length": 3, "type": "string", "value": "001"),
        "record_type": ("length": 3, "type": "int", "padding_char": "0"),
        "number_of_records": ("length": 8, "type": "int", "padding_char": "0"),
    ),
    "line": (
        "flow_type": ("length": 3, "type": "string"),
        "record_type": ("length": 3, "type": "int", "padding_char": "0"),
        "processing_id": ("length": 10, "type": "int", "padding_char": "0"),
        "processing_name": ("length": 10, "type": "string"),
        "po_number": ("length": 10, "type": "int", "padding_char": "0"),
    ),
    "trailer": (
        "flow_type": ("length": 3, "type": "string", "value": "003"),
        "record_type": ("length": 3, "type": "int", "padding_char": "0"),
        "number_of_records": ("length": 8, "type": "int", "padding_char": "0"),
    ),
);

In the example above, "header", "line", and "trailer" are record names, and the values of each key are record description hashes.

Record Description Hash

Each record will have a number of fields described in the record description hash. The record description hash keys represent the names of the fields, and the values are field specification hashes.

In the "header" record in the example above, the fields are "flow_type", "record_type", and "number_of_records", and the values of each of those keys are field specification hashes for the given fields. As the "header" and "trailer" records have equal line lengths (14 characters each), extra configuration is required to resolve the record type; in the example above this is configured using the "value" key of the field specification hashes for the "flow_type" fields.

Field Specification Hash

The field specification hash has the following format:

Key | Type | Description
length | integer | the size of the field in bytes
type | string | the type of data bound to the field; see Field Data Types
format | string | a date mask if the type of the field is "date"; see date formatting for more information
timezone | string | overrides the global timezone for the current "date" field
padding | string | the padding of the field, "left" or "right"; used only in writers; if not given, then the default padding depends on the field's type: "int" fields get left padding (right justification) and all others get right padding (left justification)
padding_char | string | a string of size 1 to use for padding; the default is " " (space); used only in writers
value | string | the value to compare to input data when determining the record type; if "value" is defined for a field, then "regex" cannot be defined
regex | string | the regular expression to apply to input data lines when determining the record type
default | string | in writers, the value to output when no value is specified for the field in the record data
truncate | boolean | controls whether to truncate an output field value if it is longer than the specified length; the default is "False"
tab2space | integer | controls whether to replace tabs with spaces; its value determines how many spaces to output in place of one tab character
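
To make the keys above more concrete, here is a minimal sketch of a specification with a single record that uses most of them. The record name, field names, and values are hypothetical and chosen only for illustration.

```qore
# hypothetical record: names and values are illustrative only
hash<auto> example_specs = {
    "invoice": {
        # fixed marker used to identify the record type on input
        "record_type": {"length": 3, "type": "string", "value": "INV"},
        # zero-padded integer; "int" fields are left-padded unless "padding" says otherwise
        "invoice_no": {"length": 8, "type": "int", "padding_char": "0"},
        # over-long names are cut to 20 bytes; each tab is written as one space
        "customer": {"length": 20, "type": "string", "truncate": True, "tab2space": 1},
        # per-field date mask and time zone override
        "issued": {"length": 8, "type": "date", "format": "YYYYMMDD", "timezone": "UTC"},
        # written as "EUR" when the record data has no "currency" value
        "currency": {"length": 3, "type": "string", "default": "EUR"},
        # explicit left padding (right justification) for a non-"int" field
        "amount": {"length": 12, "type": "number", "padding": "left"},
    },
};
```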

Field Data Types

The following values can be used as a field type:

  • "date"
  • "float"
  • "int"
  • "number"
  • "string"

Record Type Resolution

If no record type resolution rules or logic is defined, then record types are resolved automatically based on their unique line lengths. If the record line lengths are not unique (i.e. two or more records have the same number of characters), then a rule must exist to resolve the record type.
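
To check whether a specification can rely on unique line lengths, it can help to add up the field lengths of each record. The following sketch does this in plain Qore for the example specification from the Specification Hash section; it does not use the module itself, and reduces the field hashes to their "length" keys for brevity.

```qore
#!/usr/bin/env qore
%new-style
%require-types
%strict-args

# field lengths copied from the Specification Hash example
hash<auto> specs = {
    "header": {
        "flow_type": {"length": 3},
        "record_type": {"length": 3},
        "number_of_records": {"length": 8},
    },
    "line": {
        "flow_type": {"length": 3},
        "record_type": {"length": 3},
        "processing_id": {"length": 10},
        "processing_name": {"length": 10},
        "po_number": {"length": 10},
    },
    "trailer": {
        "flow_type": {"length": 3},
        "record_type": {"length": 3},
        "number_of_records": {"length": 8},
    },
};

# sum the field lengths of each record and print the total
foreach hash<auto> rec in (specs.pairIterator()) {
    int total = 0;
    foreach hash<auto> field in (rec.value.iterator()) {
        total += field.length;
    }
    printf("%s: %d\n", rec.key, total);
}
```

The sums are 3 + 3 + 8 = 14 for "header" and "trailer" and 3 + 3 + 10 + 10 + 10 = 36 for "line", so only "line" can be identified by length alone; "header" and "trailer" need a resolution rule.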

Typically the value of the first field determines the record type; however, any field in the record can be used to determine the record type, and multiple fields can also be used. Record type detection configuration is supplied by the "value" (field value equality test) or "regex" (regular expression test) keys in the field specification hash for the record in question. If multiple fields in a record definition have "value" or "regex" keys, then all fields must match the input data in order for the input line to match the record.

The above record type resolution logic is executed in FixedLengthAbstractIterator::identifyTypeImpl(), which executes any "regex" or "value" tests on the input line in the order of the field definitions in the record description hash.

Record type resolution is performed as follows:

  • "value": Matches the full value of the field; if the "value" key is assigned an integer, then integer comparisons are done, otherwise string comparisons are performed.
  • "regex": Matches the input line string starting at the first character in the field to the rest of the line (i.e. not truncated for the current record); this enables regular expression matching against multiple columns if needed.

When there are no record-matching keys in the field hashes for any record and the input record character lengths are not unique, then FixedLengthAbstractIterator::identifyTypeImpl() must be overridden in a subclass to provide custom record matching logic.

Note
  • It is an error to have both "regex" and "value" keys in a field specification hash.
  • If multiple fields have configuration for input line matching (i.e. "regex" or "value" keys), then all fields with this configuration must match for the record to be matched.
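
The rules above can be sketched with a specification in which two records have the same line length (12 characters) and are told apart by one "value" test and one "regex" test. The record names, field names, and patterns are hypothetical.

```qore
# hypothetical records of equal length: 1 + 3 + 8 = 12 characters each
hash<auto> example_specs = {
    "payment": {
        # equality test on the full value of the 1-character field
        "kind": {"length": 1, "type": "string", "value": "P"},
        "ref": {"length": 3, "type": "string"},
        "amount": {"length": 8, "type": "int", "padding_char": "0"},
    },
    "refund": {
        # the regex is applied from the first character of "kind" to the end of
        # the line, so this pattern also constrains the following "ref" column
        "kind": {"length": 1, "type": "string", "regex": "^R[A-Z]{3}"},
        "ref": {"length": 3, "type": "string"},
        "amount": {"length": 8, "type": "int", "padding_char": "0"},
    },
};
```

With this configuration, a line such as "PABC00000100" would be expected to resolve to "payment" and "RXYZ00000100" to "refund"; a line matching neither rule would not match any record.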

Fixed Length Data Format

Input and output data are formatted as hashes with two mandatory keys:

  • "type": a string giving the name of the record type
  • "record": a hash of the line data mapping field names to values

For example:

{"type": "type1", "record": {"col1": 11111, "col2": "bb"}}

Example of reading:

#!/usr/bin/env qore
%new-style
%enable-all-warnings
%require-types
%strict-args
%requires FixedLengthUtil

hash<auto> specs = {
    "type1": {
        "col1": {"length": 5, "type": "int"},
        "col2": {"length": 2, "type": "string"},
    },
    "type2": {
        "col3": {"length": 1, "type": "string"},
        "col4": {"length": 3, "type": "string"},
        "col5": {
            "length": 8,
            "type": "date",
            "format": "DDMMYYYY",
            # "timezone": "Europe/Prague", # use global if omitted
        },
    },
};
hash<auto> global_options = {
    "encoding" : "UTF-8",
    "eol" : "\n",
    "ignore_empty": True,
    "timezone" : "Europe/Prague", # used if not overridden in a date field specification
};
# "file" (the input file path) and operation_with_hash() (application-specific
# processing) are placeholders that must be defined by the caller
FixedLengthFileIterator i(file, specs, global_options);
while (i.next()) {
    operation_with_hash(i.getValue());
}

Example of writing:

#!/usr/bin/env qore
%new-style
%enable-all-warnings
%require-types
%strict-args
%requires FixedLengthUtil

list<hash<auto>> data = (
    {"type": "type1", "record": {"col1": 11111, "col2": "bb"}},
    {"type": "type2", "record": {"col3": "c", "col4": "ddd", "col5": "31122014"}},
    {"type": "type1", "record": {"col1": 22222, "col2": "gg"}},
);
hash<auto> specs = {
    "type1": {
        "col1": {"length": 5, "type": "int"},
        "col2": {"length": 2, "type": "string"},
    },
    "type2": {
        "col3": {"length": 1, "type": "string"},
        "col4": {"length": 3, "type": "string"},
        "col5": {"length": 8, "type": "date", "format": "DDMMYYYY", "timezone": "Europe/Prague"},
    },
};
hash<auto> global_options = {
    "eol": "\n",
};
# "file" (the output file path) is a placeholder that must be defined by the caller
FixedLengthFileWriter w(file, specs, global_options);
w.write(data);
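
For experimenting without files, a round trip through an in-memory string can be sketched as follows. This assumes that the module's FixedLengthDataWriter and FixedLengthDataIterator classes, which are not described in this section, take the specification and options as shown and that write() returns the generated string; please check the class documentation before relying on these signatures.

```qore
#!/usr/bin/env qore
%new-style
%require-types
%strict-args
%requires FixedLengthUtil

# two record types with different line lengths (7 and 4 characters),
# so no "value" or "regex" rules are needed
hash<auto> specs = {
    "type1": {
        "col1": {"length": 5, "type": "int"},
        "col2": {"length": 2, "type": "string"},
    },
    "type2": {
        "col3": {"length": 1, "type": "string"},
        "col4": {"length": 3, "type": "string"},
    },
};
list<hash<auto>> data = (
    {"type": "type1", "record": {"col1": 11111, "col2": "bb"}},
    {"type": "type2", "record": {"col3": "c", "col4": "ddd"}},
);
hash<auto> opts = {"eol": "\n"};

# assumption: FixedLengthDataWriter::write() returns the formatted text as a string
FixedLengthDataWriter w(specs, opts);
string text = w.write(data);

# assumption: FixedLengthDataIterator takes the input string as its first argument
FixedLengthDataIterator i(text, specs, opts);
while (i.next()) {
    # each value is a hash with "type" and "record" keys
    printf("%y\n", i.getValue());
}
```

If the assumptions hold, each printed hash should correspond to one element of the original data list.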

Release Notes

Version 1.5

  • updated with initial support for generic expressions (issue 4538)

Version 1.4

  • added support for resolving locations with the FileLocationHandler module (issue 4456)

Version 1.3

  • added support for generic record search operators and options (issue 4430)

Version 1.2.1

  • updated read and write data providers to provide verbose option support (issue 4139)

Version 1.1

Version 1.0.1

  • fixes and improvements to errors and exceptions (issue 1828)

Version 1.0

  • initial version of module