The Slogan Handbook - Reference (transcoder)

transcoder


transcoder(codec, @optional eol_style = 'crlf, error_handling_mode = 'ignore)

Return a transcoder encapsulating codec, eol_style, and error_handling_mode.

In many cases, the underlying stream is organized as a sequence of bytes, but these bytes may have to be treated as encodings for characters. In this case, a textual stream may be created with a transcoder to decode bytes to characters (for input) or encode characters to bytes (for output). A transcoder encapsulates a codec that determines how characters are represented as bytes.

Codec must be one of:

'ISO_8859_1

ISO-8859-1 character encoding. Each character is encoded by a single byte. Only Unicode characters with a code in the range 0 to 255 are allowed.

'ASCII

ASCII character encoding. Each character is encoded by a single byte. In principle, only Unicode characters with a code in the range 0 to 127 are allowed, but most types of streams treat this exactly like ISO-8859-1.

'UTF_8

UTF-8 character encoding. Each character is encoded by a sequence of one to four bytes. The minimum length UTF-8 encoding is used. If a BOM is needed at the beginning of the stream then it must be explicitly written.

'UTF_16

UTF-16 character encoding. Each character is encoded by one or two 16 bit integers (2 or 4 bytes). The 16 bit integers may be encoded using little-endian encoding or big-endian encoding. If the stream is a reader and the first two bytes read are a BOM ("Byte Order Mark" character with hexadecimal code FEFF), then the BOM will be discarded and the endianness will be set accordingly; otherwise the endianness depends on the operating system and how the Slogan runtime was compiled. If the stream is a writer, then a BOM will be output at the beginning of the stream and the endianness depends on the operating system and how the Slogan runtime was compiled.

'UTF_16LE

UTF-16 character encoding with little-endian endianness. It is like UTF-16 except the endianness is set to little-endian and there is no BOM processing. If a BOM is needed at the beginning of the stream then it must be explicitly written.

'UTF_16BE

UTF-16 character encoding with big-endian endianness. It is like UTF-16LE except the endianness is set to big-endian.

'UCS_2

UCS-2 character encoding. Each character is encoded by a 16 bit integer (2 bytes). The 16 bit integers may be encoded using little-endian encoding or big-endian encoding. If the stream is a reader and the first two bytes read are a BOM ("Byte Order Mark" character with hexadecimal code FEFF), then the BOM will be discarded and the endianness will be set accordingly; otherwise the endianness depends on the operating system and how the Slogan runtime was compiled. If the stream is a writer, then a BOM will be output at the beginning of the stream and the endianness depends on the operating system and how the Slogan runtime was compiled.

'UCS_2LE

UCS-2 character encoding with little-endian endianness. It is like UCS-2 except the endianness is set to little-endian and there is no BOM processing. If a BOM is needed at the beginning of the stream then it must be explicitly written.

'UCS_2BE

UCS-2 character encoding with big-endian endianness. It is like UCS-2LE except the endianness is set to big-endian.

'UCS_4

UCS-4 character encoding. Each character is encoded by a 32 bit integer (4 bytes). The 32 bit integers may be encoded using little-endian encoding or big-endian encoding. If the stream is a reader and the first four bytes read are a BOM ("Byte Order Mark" character with hexadecimal code FEFF), then the BOM will be discarded and the endianness will be set accordingly; otherwise the endianness depends on the operating system and how the Slogan runtime was compiled. If the stream is a writer, then a BOM will be output at the beginning of the stream and the endianness depends on the operating system and how the Slogan runtime was compiled.

'UCS_4LE

UCS-4 character encoding with little-endian endianness. It is like UCS-4 except the endianness is set to little-endian and there is no BOM processing. If a BOM is needed at the beginning of the stream then it must be explicitly written.

'UCS_4BE

UCS-4 character encoding with big-endian endianness. It is like UCS-4LE except the endianness is set to big-endian.

In addition to the above, these codecs are also supported:

UTF / UTF-fallback-ASCII / UTF-fallback-ISO-8859-1 / UTF-fallback-UTF-16 / UTF-fallback-UTF-16LE / UTF-fallback-UTF-16BE

These encodings combine the UTF-8 and UTF-16 encodings. When one of these character encodings is used for an output stream, characters will be encoded using the UTF-8 encoding. The first character, if there is one, is prefixed with a UTF-8 BOM (the three byte sequence EF BB BF in hexadecimal). When one of these character encodings is used for an input stream, the character encoding depends on the first few bytes. If the first bytes of the stream are a UTF-16LE BOM (FF FE in hexadecimal), a UTF-16BE BOM (FE FF in hexadecimal), or a UTF-8 BOM (EF BB BF in hexadecimal), then the BOM is discarded and the remaining bytes of the stream are decoded using the corresponding character encoding. If a BOM is not present, then the stream is decoded using the fallback encoding specified. The encoding UTF is a synonym for UTF-fallback-UTF-8. Note that the UTF character encoding for input will correctly handle streams produced using the encodings UTF, UTF-8, UTF-16, and ASCII, and, if an explicit BOM is output, the encodings UTF-16LE and UTF-16BE.
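The BOM detection and fallback rule for the UTF family can be sketched as follows. This is an illustration in Python under stated assumptions, not the Slogan implementation: the function names `decode_utf_fallback` and `encode_utf` are hypothetical, and the whole input is assumed to be available as one bytes object rather than as a stream.

```python
import codecs

# (BOM bytes, Python codec used for the bytes that follow the BOM)
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]

def decode_utf_fallback(data, fallback="utf-8"):
    """Decode by BOM when one is present, otherwise with the fallback codec."""
    for bom, codec in BOMS:
        if data.startswith(bom):
            # The BOM is discarded; only the remaining bytes are decoded.
            return data[len(bom):].decode(codec)
    return data.decode(fallback)

def encode_utf(text):
    """Encode as UTF-8, writing a BOM only if there is a first character."""
    if not text:
        return b""
    return codecs.BOM_UTF8 + text.encode("utf-8")

print(decode_utf_fallback(b"\xff\xfeh\x00i\x00"))      # UTF-16LE BOM found
print(decode_utf_fallback(b"caf\xe9", "latin-1"))      # no BOM: fallback used
print(decode_utf_fallback(encode_utf("round trip")))   # UTF-8 BOM found
```

With the default `fallback="utf-8"` the function mirrors the plain UTF codec, since UTF is described as a synonym for UTF-fallback-UTF-8.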

Eol_style determines how line endings are recognized. Supported eol styles are:

'lf

line-feed character

'cr

carriage-return character

'crlf

carriage return followed by line feed

'none

no line endings are recognized

In addition to the codec and eol_style, a transcoder encapsulates just one other piece of information: an error-handling mode that determines what happens if a decoding or encoding error occurs, i.e., if a sequence of bytes cannot be converted to a character with the encapsulated codec in the input direction, or a character cannot be converted to a sequence of bytes with the encapsulated codec in the output direction. The error-handling mode should be one of 'ignore, 'raise, and 'replace. If the error-handling mode is 'ignore, the offending sequence of bytes or the character is ignored. If the error-handling mode is 'raise, an exception with condition type i/o-decoding or i/o-encoding is raised; in the input direction, the stream is positioned beyond the sequence of bytes. If the error-handling mode is 'replace, a replacement character or character encoding is produced. In the input direction the replacement character is U+FFFD, while in the output direction the replacement is either the encoding of U+FFFD for Unicode codecs or the encoding of the question-mark character (?) for other codecs.
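The three modes behave much like the error handlers of Python's codecs, which makes them easy to try out. The sketch below is an analogy, not Slogan code: the `MODES` table mapping Slogan mode names to Python's `ignore`, `strict`, and `replace` handlers is an assumption, as are the helper names `decode` and `encode`. It covers input decoding and output to a non-Unicode codec only.

```python
# Python error handlers assumed to play the role of the Slogan modes.
MODES = {"ignore": "ignore", "raise": "strict", "replace": "replace"}

def decode(data, codec, mode):
    """Input direction: bytes to characters under the given mode."""
    return data.decode(codec, errors=MODES[mode])

def encode(text, codec, mode):
    """Output direction: characters to bytes under the given mode."""
    return text.encode(codec, errors=MODES[mode])

bad_input = b"a\xffb"    # the byte FF can never occur in valid UTF-8
bad_output = "a\u00e9b"  # code 233 has no ASCII encoding

# 'ignore drops the offending bytes or character.
assert decode(bad_input, "utf-8", "ignore") == "ab"
assert encode(bad_output, "ascii", "ignore") == b"ab"

# 'replace substitutes U+FFFD on input and "?" on non-Unicode output.
assert decode(bad_input, "utf-8", "replace") == "a\ufffdb"
assert encode(bad_output, "ascii", "replace") == b"a?b"

# 'raise signals an error instead of producing a result.
try:
    decode(bad_input, "utf-8", "raise")
except UnicodeDecodeError as err:
    print("decoding error:", err.reason)
```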

Examples:


let t = transcoder('UTF_8, 'lf, 'raise)
transcoder_codec(t)
// UTF_8
transcoder_eol_style(t)
// lf
transcoder_error_handling_mode(t)
// raise

Also see: