Overview

The Qore language is character-encoding aware. All strings are assumed to have the default character encoding unless the program explicitly specifies another encoding for certain objects and operations. Every Qore string has a character encoding ID attached to it, so when another encoding is required, Qore attempts an encoding translation.

Qore uses the operating system's iconv library functions to perform any encoding conversions.

Qore supports character encodings that are backwards compatible with 7-bit ASCII. This includes all ISO-8859-* character encodings, UTF-8, KOI8-R, KOI8-U, and KOI7, among others (see Character Encodings Known to Qore below).

However, multibyte character encodings are currently only properly supported for UTF-8. For UTF-8 strings, length(), index(), rindex(), substr(), reverse(), the splice operator, print formatting functions and methods taking format strings (with regard to field lengths), and regular expression operators and functions all work with character offsets, which may differ from byte offsets. For all character encodings other than UTF-8, a relationship of 1 byte = 1 character is assumed.
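The difference between character offsets and byte offsets can be illustrated with a short sketch. This is Python rather than Qore, used only to show the concept; the sample string and variable names are assumptions made for the example.

```python
# Illustration in Python (not Qore): character offsets vs. byte offsets.
text = "na\u00efve caf\u00e9"            # contains two non-ASCII characters
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("iso-8859-1")

char_count = len(text)                  # number of characters
utf8_byte_count = len(utf8_bytes)       # number of bytes in UTF-8
latin1_byte_count = len(latin1_bytes)   # single-byte encoding: bytes == characters

# Offset of the same substring, measured in characters and in UTF-8 bytes.
needle = "caf\u00e9"
char_index = text.index(needle)
byte_index = utf8_bytes.index(needle.encode("utf-8"))

print(char_count, utf8_byte_count, latin1_byte_count)
print(char_index, byte_index)
```

Each of the two accented characters occupies two bytes in UTF-8, so the byte count and the byte offset are larger than their character-based counterparts, while the ISO-8859-1 form keeps one byte per character.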

Qore will accept any encoding name given to it, even if it is not a known encoding name or alias. In this case, Qore will tag the strings with this encoding and pass this user-defined encoding name to the iconv library when encodings must be converted. This allows programmers to use encodings known by the system's iconv library but unknown to Qore. For such encodings, Qore will assume that the strings are backwards compatible with ASCII, meaning that one character is represented by one byte and that the strings are null-terminated.

Note that when Qore matches an encoding name to a code or alias in the following table, the comparison is not case-sensitive.

Character Encodings Known to Qore

Code	Aliases	Description
`ISO-8859-1`	`ISO88591`, `ISO8859-1`, `ISO-88591`, `ISO8859P1`, `ISO81`, `LATIN1`, `LATIN-1`	latin-1, Western European character set
`ISO-8859-2`	`ISO88592`, `ISO8859-2`, `ISO-88592`, `ISO8859P2`, `ISO82`, `LATIN2`, `LATIN-2`	latin-2, Central European character set
`ISO-8859-3`	`ISO88593`, `ISO8859-3`, `ISO-88593`, `ISO8859P3`, `ISO83`, `LATIN3`, `LATIN-3`	latin-3, Southern European character set
`ISO-8859-4`	`ISO88594`, `ISO8859-4`, `ISO-88594`, `ISO8859P4`, `ISO84`, `LATIN4`, `LATIN-4`	latin-4, Northern European character set
`ISO-8859-5`	`ISO88595`, `ISO8859-5`, `ISO-88595`, `ISO8859P5`, `ISO85`	Cyrillic character set
`ISO-8859-6`	`ISO88596`, `ISO8859-6`, `ISO-88596`, `ISO8859P6`, `ISO86`	Arabic character set
`ISO-8859-7`	`ISO88597`, `ISO8859-7`, `ISO-88597`, `ISO8859P7`, `ISO87`	Greek character set
`ISO-8859-8`	`ISO88598`, `ISO8859-8`, `ISO-88598`, `ISO8859P8`, `ISO88`	Hebrew character set
`ISO-8859-9`	`ISO88599`, `ISO8859-9`, `ISO-88599`, `ISO8859P9`, `ISO89`, `LATIN5`, `LATIN-5`	latin-5, Turkish character set
`ISO-8859-10`	`ISO885910`, `ISO8859-10`, `ISO-885910`, `ISO8859P10`, `ISO810`, `LATIN6`, `LATIN-6`	latin-6, Nordic character set
`ISO-8859-11`	`ISO885911`, `ISO8859-11`, `ISO-885911`, `ISO8859P11`, `ISO811`	Thai character set
`ISO-8859-13`	`ISO885913`, `ISO8859-13`, `ISO-885913`, `ISO8859P13`, `ISO813`, `LATIN7`, `LATIN-7`	latin-7, Baltic rim character set
`ISO-8859-14`	`ISO885914`, `ISO8859-14`, `ISO-885914`, `ISO8859P14`, `ISO814`, `LATIN8`, `LATIN-8`	latin-8, Celtic character set
`ISO-8859-15`	`ISO885915`, `ISO8859-15`, `ISO-885915`, `ISO8859P15`, `ISO815`, `LATIN9`, `LATIN-9`	latin-9, Western European with euro symbol
`ISO-8859-16`	`ISO885916`, `ISO8859-16`, `ISO-885916`, `ISO8859P16`, `ISO816`, `LATIN10`, `LATIN-10`	latin-10, Southeast European character set
`KOI7`	n/a	Russian: Kod Obmena Informatsiey, 7 bit characters
`KOI8-R`	`KOI8R`	Russian: Kod Obmena Informatsiey, 8 bit
`KOI8-U`	`KOI8U`	Ukrainian: Kod Obmena Informatsiey, 8 bit
`US-ASCII`	`ASCII`, `USASCII`	7-bit ASCII character set
`UTF-8`	`UTF8`	variable-width universal character set
`UTF-16`	`UTF16`	variable-width universal character set based on a fundamental 2-byte character encoding; not backwards-compatible with ASCII and therefore not supported universally in Qore; it's recommended to convert these strings to UTF-8 in Qore; do not use UTF-16 as the default character encoding in Qore
`UTF-16BE`	`UTF16BE`	variable-width universal character set based on a fundamental 2-byte character encoding with big-endian encoding; not backwards-compatible with ASCII and therefore not supported universally in Qore; it's recommended to convert these strings to UTF-8 in Qore; do not use UTF-16BE as the default character encoding in Qore
`UTF-16LE`	`UTF16LE`	variable-width universal character set based on a fundamental 2-byte character encoding with little-endian encoding; not backwards-compatible with ASCII and therefore not supported universally in Qore; it's recommended to convert these strings to UTF-8 in Qore; do not use UTF-16LE as the default character encoding in Qore
`WINDOWS-874`	`WINDOWS874`, `CP-874`, `CP874`	Windows 874: character encoding for Latin/Thai, very similar to ISO-8859-11
`WINDOWS-936`	`WINDOWS936`, `CP-936`, `CP936`	Windows 936: character encoding for simplified Chinese
`WINDOWS-1250`	`WINDOWS1250`, `CP-1250`, `CP1250`	Windows 1250: character encoding for Central/Eastern European languages
`WINDOWS-1251`	`WINDOWS1251`, `CP-1251`, `CP1251`	Windows 1251: character encoding for Cyrillic: Russian, Ukrainian, Belarusian, Bulgarian, Serbian Cyrillic, Macedonian, and others
`WINDOWS-1252`	`WINDOWS1252`, `CP-1252`, `CP1252`	Windows 1252: character encoding for Western European languages: Spanish, French, German
`WINDOWS-1253`	`WINDOWS1253`, `CP-1253`, `CP1253`	Windows 1253: character encoding for Greek
`WINDOWS-1254`	`WINDOWS1254`, `CP-1254`, `CP1254`	Windows 1254: character encoding for Turkish
`WINDOWS-1255`	`WINDOWS1255`, `CP-1255`, `CP1255`	Windows 1255: character encoding for Hebrew
`WINDOWS-1256`	`WINDOWS1256`, `CP-1256`, `CP1256`	Windows 1256: character encoding for Arabic
`WINDOWS-1257`	`WINDOWS1257`, `CP-1257`, `CP1257`	Windows 1257: character encoding for Baltic languages
`WINDOWS-1258`	`WINDOWS1258`, `CP-1258`, `CP1258`	Windows 1258: character encoding for Vietnamese
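
The name matching described above (case-insensitive comparison against codes and aliases, with unknown names accepted as given) might be sketched as follows. This is an illustration in Python, not Qore's implementation; `KNOWN_ENCODINGS` is a small excerpt of the table and `resolve_encoding` is a hypothetical helper.

```python
# Hypothetical excerpt of the table above: code -> aliases.
KNOWN_ENCODINGS = {
    "ISO-8859-1": ["ISO88591", "ISO8859-1", "ISO-88591", "ISO8859P1",
                   "ISO81", "LATIN1", "LATIN-1"],
    "KOI8-R": ["KOI8R"],
    "UTF-8": ["UTF8"],
}

# Build a lookup keyed by upper-cased names so matching ignores case.
_LOOKUP = {}
for code, aliases in KNOWN_ENCODINGS.items():
    for name in [code, *aliases]:
        _LOOKUP[name.upper()] = code


def resolve_encoding(name):
    """Return (encoding_name, is_known).

    Known codes and aliases map to their canonical code. Unknown names are
    returned unchanged, mirroring the rule that user-defined names are
    accepted and handed to the conversion library as given.
    """
    code = _LOOKUP.get(name.upper())
    if code is not None:
        return code, True
    return name, False


print(resolve_encoding("latin1"))   # known alias, different case
print(resolve_encoding("EUC-JP"))   # not in the excerpt: passed through
```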

UTF-16 Support in Qore

UTF-16 is currently not well supported in Qore, because Qore's string support is based on the assumption that all strings are backwards-compatible with ASCII. UTF-16 is not, due to its minimum 2-byte character width and the possibility of embedded null bytes.

It's possible to generate string data in UTF-16 encoding (using Qore::convert_encoding()). Note, however, that all strings so generated will be tagged with a BOM (byte order mark) at the beginning of the string data (this is performed by libiconv).
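
A comparable effect can be observed with Python's codecs, shown here only as an analogy: the generic UTF-16 codec prepends a byte order mark, while the codecs with an explicit byte order do not. This sketch does not call libiconv or Qore; it assumes that Python's behavior is a reasonable stand-in for illustrating the extra two bytes.

```python
import codecs

text = "Qore"
generic = text.encode("utf-16")        # generic UTF-16: BOM is prepended
big_endian = text.encode("utf-16-be")  # explicit byte order: no BOM

# The BOM is one of two 2-byte sequences, depending on byte order.
has_bom = generic[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)
extra_bytes = len(generic) - len(big_endian)

print(has_bom, extra_bytes)
```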

The following classes support parsing UTF-16 data by converting it to UTF-8 and processing the UTF-8 data:

The following classes support processing UTF-16 data natively:

Many string operations on UTF-16 data will provide invalid results due to the embedded nulls.
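
To see why embedded nulls are a problem, consider a routine that treats the first null byte as the end of the string. The sketch below is Python, and `c_strlen` is a hypothetical helper that mimics C-style null-terminated length calculation; it is not a Qore function.

```python
def c_strlen(data: bytes) -> int:
    """Length up to (not including) the first null byte, C-style."""
    pos = data.find(b"\x00")
    return len(data) if pos == -1 else pos


utf8 = "Qore".encode("utf-8")          # b"Qore"
utf16_be = "Qore".encode("utf-16-be")  # b"\x00Q\x00o\x00r\x00e"
utf16_le = "Qore".encode("utf-16-le")  # b"Q\x00o\x00r\x00e\x00"

print(c_strlen(utf8), c_strlen(utf16_be), c_strlen(utf16_le))
```

For ASCII-range text, every UTF-16 code unit contains a zero byte, so a null-terminated view of the data ends almost immediately.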

Bug: With the exception of the classes above that explicitly support UTF-16 data, BOMs are ignored and all UTF-16 data is assumed to be big-endian. Little-endian UTF-16-encoded data, even with a correct BOM, will not be processed correctly in Qore (in this case, use the UTF-16LE encoding specifically).
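
The consequence of reading little-endian data as big-endian can be shown with a small Python sketch. It only models the byte-order mix-up described above; it is not a reproduction of Qore's internal behavior.

```python
original = "Qore"
payload_le = original.encode("utf-16-le")   # little-endian code units, no BOM

# Interpreting the same bytes with the wrong byte order swaps the two bytes
# of every code unit, producing unrelated characters.
misread = payload_le.decode("utf-16-be")

# Naming the byte order explicitly recovers the original text.
correct = payload_le.decode("utf-16-le")

print(misread == original, correct == original)
```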

Default Character Encoding

The default character encoding for Qore is determined by environment variables.

First, the QORE_CHARSET environment variable is checked. If it is set, then this character encoding will be the default character encoding for the process. If not, then the LANG environment variable is checked. If a character encoding is specified in the LANG environment variable, then it will be used as the default character encoding. Otherwise, if no character encoding can be derived from the environment, UTF-8 is assumed.
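
The lookup order above can be sketched as a small function. This is a Python illustration with a hypothetical helper name; it assumes that LANG follows the common `language_TERRITORY.codeset@modifier` form, which the text above does not spell out, and it takes the environment as a plain dictionary so that it can be tried without changing real environment variables.

```python
def default_encoding(env):
    """Pick a default encoding following the order described above."""
    # 1. QORE_CHARSET wins if it is set.
    qore_charset = env.get("QORE_CHARSET")
    if qore_charset:
        return qore_charset

    # 2. Otherwise, look for a codeset in LANG (assumed format:
    #    language_TERRITORY.codeset@modifier).
    lang = env.get("LANG", "")
    if "." in lang:
        codeset = lang.split(".", 1)[1].split("@", 1)[0]
        if codeset:
            return codeset

    # 3. Nothing usable in the environment: assume UTF-8.
    return "UTF-8"


print(default_encoding({"QORE_CHARSET": "ISO-8859-2", "LANG": "en_US.UTF-8"}))
print(default_encoding({"LANG": "de_DE.ISO-8859-15@euro"}))
print(default_encoding({"LANG": "C"}))
```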

Character encodings are automatically converted by the Qore language when necessary. Encoding conversion errors will cause a Qore exception to be thrown. The character encoding conversions supported by Qore depend on the operating system's iconv library function.

Note: The get_default_encoding() function will return the default encoding for the Qore process.

Character Encoding Usage Examples

The following is a non-exhaustive list of examples in Qore where character encoding processing is performed.

Character encoding conversions can be explicitly performed with the convert_encoding() function, and the encoding attached to a string can be checked with the get_encoding() function. If you have a string tagged with an incorrect encoding and want to change the encoding tag of the string (without changing the actual bytes of the string), use the force_encoding() function.
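
The distinction between converting a string and merely changing its encoding tag can be illustrated as follows. The sketch is in Python and does not call the Qore functions named above; decoding bytes with a given encoding name stands in for "the tag", which is an assumption made for the illustration.

```python
text = "caf\u00e9"

# Conversion: the text stays the same, the bytes change.
latin1_bytes = text.encode("iso-8859-1")                        # b"caf\xe9"
utf8_bytes = latin1_bytes.decode("iso-8859-1").encode("utf-8")  # b"caf\xc3\xa9"

# Wrong tag: UTF-8 bytes interpreted as ISO-8859-1 give garbled text.
mis_tagged_view = utf8_bytes.decode("iso-8859-1")

# Changing only the tag: the bytes are untouched, but interpreting them
# with the correct encoding yields the intended text again.
retagged_view = utf8_bytes.decode("utf-8")

print(latin1_bytes, utf8_bytes)
print(mis_tagged_view != text, retagged_view == text)
```

Conversion is the right tool when the tag is correct and a different representation is needed; changing the tag is the right tool when the bytes are already correct and only the label is wrong.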

get_default_encoding() returns the default encoding for the Qore process.

The Qore::SQL::Datasource, Qore::SQL::DatasourcePool, and Qore::SQL::SQLStatement classes will also translate character encodings to the encoding required by the database if necessary (this is actually the responsibility of the DBI driver for the database in question).

The Qore::File and Qore::Socket classes translate character encodings to the encoding specified for the object if necessary, as well as tagging strings received or read with the object's encoding.

The Qore::HTTPClient class will translate character encodings to the encoding specified for the object if necessary, as well as tag strings received with the object's encoding. Additionally, if an HTTP server response specifies a specific encoding to use, the encoding of strings read from the server will be automatically set to this encoding as well.
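
As a rough illustration of honoring a server-specified encoding, the sketch below extracts a charset from a response header and falls back to an object-level encoding otherwise. It is written in Python, and it assumes that the encoding is announced through the `charset` parameter of the `Content-Type` header; the text above does not say which mechanism Qore::HTTPClient uses. `charset_from_content_type` is a hypothetical helper.

```python
from email.message import Message


def charset_from_content_type(header_value, fallback):
    """Return the charset named in a Content-Type value, else the fallback."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset in lower case, or None.
    return msg.get_content_charset() or fallback


print(charset_from_content_type("text/html; charset=ISO-8859-2", "UTF-8"))
print(charset_from_content_type("application/json", "UTF-8"))
```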
