Qore Programming Language Reference Manual  0.9.2
Strings and Character Encoding

Overview

The Qore language is character-encoding aware. All strings are assumed to have the default character encoding unless the program explicitly specifies another encoding for certain objects and operations. Every Qore string has a character encoding ID attached to it, so when another encoding is required, Qore will attempt an encoding translation.

Qore uses the operating system's iconv library functions to perform any encoding conversions.

Qore supports character encodings that are backwards compatible with 7-bit ASCII. This includes all ISO-8859-* character encodings, UTF-8, KOI8-R, KOI8-U, and KOI7, among others (see the table below: Character Encodings Known to Qore).

However, multibyte character encodings are currently only properly supported for UTF-8. For UTF-8 strings, length(), index(), rindex(), substr(), reverse(), the splice operator, print formatting (regarding field lengths) in functions and methods taking format strings, and regular expression operators and functions all work with character offsets, which may differ from byte offsets. For all character encodings other than UTF-8, a relationship of 1 byte = 1 character is assumed.
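The difference between character offsets and byte offsets can be seen with a short script. This is a minimal sketch, assuming that the script file is saved as UTF-8 and that the default encoding of the process is UTF-8; the sample string is arbitrary, and strlen() (not mentioned above) is used here to get the byte length.

```qore
%new-style

# assumption: this file is saved as UTF-8 and the default encoding is UTF-8
string s = "žluťoučký";

# length() counts characters, strlen() counts bytes
printf("characters: %d\n", length(s));
printf("bytes:      %d\n", strlen(s));

# index() returns a character offset and substr() takes character offsets
int pos = index(s, "ť");
printf("character offset of 'ť': %d\n", pos);
printf("3 characters from there: %s\n", substr(s, pos, 3));
```

With a UTF-8 string containing accented characters, the byte count reported by strlen() is larger than the character count reported by length().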

Qore will accept any encoding name given to it, even if it is not a known encoding name or alias. In this case, Qore will tag the strings with this encoding and pass the user-defined encoding name to the iconv library when encodings must be converted. This allows programmers to use encodings known to the system's iconv library but unknown to Qore. Qore will then assume that the strings are backwards compatible with ASCII, meaning that one character is represented by one byte and that the strings are null-terminated.

Note that when Qore matches an encoding name to a code or alias in the following table, the comparison is not case-sensitive.

Character Encodings Known to Qore

Code | Aliases | Description
ISO-8859-1 | ISO88591, ISO8859-1, ISO-88591, ISO8859P1, ISO81, LATIN1, LATIN-1 | latin-1, Western European character set
ISO-8859-2 | ISO88592, ISO8859-2, ISO-88592, ISO8859P2, ISO82, LATIN2, LATIN-2 | latin-2, Central European character set
ISO-8859-3 | ISO88593, ISO8859-3, ISO-88593, ISO8859P3, ISO83, LATIN3, LATIN-3 | latin-3, Southern European character set
ISO-8859-4 | ISO88594, ISO8859-4, ISO-88594, ISO8859P4, ISO84, LATIN4, LATIN-4 | latin-4, Northern European character set
ISO-8859-5 | ISO88595, ISO8859-5, ISO-88595, ISO8859P5, ISO85 | Cyrillic character set
ISO-8859-6 | ISO88596, ISO8859-6, ISO-88596, ISO8859P6, ISO86 | Arabic character set
ISO-8859-7 | ISO88597, ISO8859-7, ISO-88597, ISO8859P7, ISO87 | Greek character set
ISO-8859-8 | ISO88598, ISO8859-8, ISO-88598, ISO8859P8, ISO88 | Hebrew character set
ISO-8859-9 | ISO88599, ISO8859-9, ISO-88599, ISO8859P9, ISO89, LATIN5, LATIN-5 | latin-5, Turkish character set
ISO-8859-10 | ISO885910, ISO8859-10, ISO-885910, ISO8859P10, ISO810, LATIN6, LATIN-6 | latin-6, Nordic character set
ISO-8859-11 | ISO885911, ISO8859-11, ISO-885911, ISO8859P11, ISO811 | Thai character set
ISO-8859-13 | ISO885913, ISO8859-13, ISO-885913, ISO8859P13, ISO813, LATIN7, LATIN-7 | latin-7, Baltic rim character set
ISO-8859-14 | ISO885914, ISO8859-14, ISO-885914, ISO8859P14, ISO814, LATIN8, LATIN-8 | latin-8, Celtic character set
ISO-8859-15 | ISO885915, ISO8859-15, ISO-885915, ISO8859P15, ISO815, LATIN9, LATIN-9 | latin-9, Western European with euro symbol
ISO-8859-16 | ISO885916, ISO8859-16, ISO-885916, ISO8859P16, ISO816, LATIN10, LATIN-10 | latin-10, Southeast European character set
KOI7 | n/a | Russian: Kod Obmena Informatsiey, 7 bit characters
KOI8-R | KOI8R | Russian: Kod Obmena Informatsiey, 8 bit
KOI8-U | KOI8U | Ukrainian: Kod Obmena Informatsiey, 8 bit
US-ASCII | ASCII, USASCII | 7-bit ASCII character set
UTF-8 | UTF8 | variable-width universal character set
UTF-16 | UTF16 | variable-width universal character set based on a fundamental 2-byte character encoding; not backwards-compatible with ASCII and therefore not supported universally in Qore; it's recommended to convert these strings to UTF-8 in Qore; do not use UTF-16 as the default character encoding in Qore
UTF-16BE | UTF16BE | variable-width universal character set based on a fundamental 2-byte character encoding with big-endian encoding; not backwards-compatible with ASCII and therefore not supported universally in Qore; it's recommended to convert these strings to UTF-8 in Qore; do not use UTF-16BE as the default character encoding in Qore
UTF-16LE | UTF16LE | variable-width universal character set based on a fundamental 2-byte character encoding with little-endian encoding; not backwards-compatible with ASCII and therefore not supported universally in Qore; it's recommended to convert these strings to UTF-8 in Qore; do not use UTF-16LE as the default character encoding in Qore

UTF-16 Support in Qore

UTF-16 is currently not well supported in Qore, because Qore's string support is based on the assumption that all strings are backwards-compatible with ASCII. UTF-16 is not, because of its minimum 2-byte character width and the possibility of embedded null bytes.

It's possible to generate string data in UTF-16 encoding using Qore::convert_encoding(). Note, however, that all strings generated this way will have a BOM (byte order mark) at the beginning of the string data (this is added by libiconv).

The following classes support parsing UTF-16 data by converting it to UTF-8 and processing the UTF-8 data:

The following classes support processing UTF-16 data natively:

Many string operations on UTF-16 data will provide invalid results due to the embedded nulls.

Bug:
With the exception of the classes above that explicitly support UTF-16 data, BOMs are ignored and all UTF-16 data is assumed to be big-endian. Little-endian UTF-16-encoded data, even with a correct BOM, will not be processed correctly in Qore; in this case, use the UTF-16LE encoding specifically.

Default Character Encoding

The default character encoding for Qore is determined by environment variables.

First, the QORE_CHARSET environment variable is checked. If it is set, then this character encoding will be the default character encoding for the process. If not, then the LANG environment variable is checked. If a character encoding is specified in the LANG environment variable, then it will be used as the default character encoding. Otherwise, if no character encoding can be derived from the environment, UTF-8 is assumed.
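The lookup order described above can be observed from within a script. The following is a minimal sketch; it assumes that the ENV constant (the process environment as a hash) and the %n format specifier behave as in current Qore releases, and it only prints the inputs next to the result rather than reimplementing the lookup.

```qore
%new-style

# the two environment variables consulted, in order of precedence
printf("QORE_CHARSET: %n\n", ENV.QORE_CHARSET);
printf("LANG:         %n\n", ENV.LANG);

# the encoding Qore actually selected for this process
printf("default encoding: %s\n", get_default_encoding());
```

Running the script once normally and once with QORE_CHARSET set in the environment (for example QORE_CHARSET=ISO-8859-1) should show the variable taking precedence over LANG.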

Character encodings are automatically converted by the Qore language when necessary. Encoding conversion errors will cause a Qore exception to be thrown. The character encoding conversions supported by Qore depend on the operating system's iconv library function.
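As an illustration of a conversion error, the sketch below tries to convert a string containing a euro sign to ISO-8859-1, which has no euro sign. It assumes that the script file is saved as UTF-8 and that the default encoding is UTF-8. Whether the conversion actually fails depends on the system's iconv library (some implementations substitute a replacement character instead), so the exception branch is not guaranteed to run.

```qore
%new-style

# assumption: this file is saved as UTF-8 and the default encoding is UTF-8
string price = "price: 10 €";

try {
    # ISO-8859-1 cannot represent the euro sign
    string latin1 = convert_encoding(price, "ISO-8859-1");
    printf("converted to %s\n", get_encoding(latin1));
} catch (hash ex) {
    # err and desc are the standard exception hash keys
    printf("conversion failed: %s: %s\n", ex.err, ex.desc);
}
```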

Note
The get_default_encoding() function will return the default encoding for the Qore process.

Character Encoding Usage Examples

The following is a non-exhaustive list of examples in Qore where character encoding processing is performed.

Character encoding conversions can be performed explicitly with the convert_encoding() function, and the encoding attached to a string can be checked with the get_encoding() function. If you have a string with an incorrect encoding tag and want to change the tag (without changing the actual bytes of the string), use the force_encoding() function.
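The difference between converting and retagging can be sketched as follows. The example assumes that the script file is saved as UTF-8 and that the default encoding is UTF-8; the sample string and the target encodings are arbitrary, and strlen() is used only to show the byte length.

```qore
%new-style

# assumption: this file is saved as UTF-8 and the default encoding is UTF-8
string utf8 = "café";
printf("%s: %d bytes\n", get_encoding(utf8), strlen(utf8));

# convert_encoding() rewrites the bytes and tags the result with the new encoding
string latin1 = convert_encoding(utf8, "ISO-8859-1");
printf("%s: %d bytes\n", get_encoding(latin1), strlen(latin1));

# force_encoding() changes only the tag; the bytes are left as they are
string retagged = force_encoding(latin1, "ISO-8859-15");
printf("%s: %d bytes\n", get_encoding(retagged), strlen(retagged));
```

The byte length changes after convert_encoding() because the accented character takes two bytes in UTF-8 and one in ISO-8859-1, while force_encoding() leaves the length unchanged.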

get_default_encoding() returns the default encoding for the Qore process.

The Qore::SQL::Datasource, Qore::SQL::DatasourcePool, and Qore::SQL::SQLStatement classes will translate character encodings to the encoding required by the database if necessary as well (this is actually the responsibility of the DBI driver for the database in question).

The Qore::File and Qore::Socket classes translate character encodings to the encoding specified for the object if necessary, as well as tagging strings received or read with the object's encoding.

The Qore::HTTPClient class will translate character encodings to the encoding specified for the object if necessary, as well as tag strings received with the object's encoding. Additionally, if an HTTP server response specifies a specific encoding to use, the encoding of strings read from the server will be automatically set to this encoding as well.