A.4.11 String Encoding
Facilities for encoding, decoding, and converting
strings in various character encoding schemes are provided by packages
Strings.UTF_Encoding, Strings.UTF_Encoding.Conversions, Strings.UTF_Encoding.Strings,
Strings.UTF_Encoding.Wide_Strings, and Strings.UTF_Encoding.Wide_Wide_Strings.
Static Semantics
The encoding library
packages have the following declarations:
package Ada.Strings.UTF_Encoding
   with Pure is
   --  Declarations common to the string encoding packages
   type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE);
   subtype UTF_String is String;
   subtype UTF_8_String is String;
   subtype UTF_16_Wide_String is Wide_String;
   Encoding_Error : exception;
   BOM_8 : constant UTF_8_String :=
             Character'Val(16#EF#) &
             Character'Val(16#BB#) &
             Character'Val(16#BF#);
   BOM_16BE : constant UTF_String :=
             Character'Val(16#FE#) &
             Character'Val(16#FF#);
   BOM_16LE : constant UTF_String :=
             Character'Val(16#FF#) &
             Character'Val(16#FE#);
   BOM_16 : constant UTF_16_Wide_String :=
             (1 => Wide_Character'Val(16#FEFF#));
   function Encoding (Item    : UTF_String;
                      Default : Encoding_Scheme := UTF_8)
      return Encoding_Scheme;
end Ada.Strings.UTF_Encoding;
package Ada.Strings.UTF_Encoding.Conversions
   with Pure is
   --  Conversions between various encoding schemes
   function Convert (Item          : UTF_String;
                     Input_Scheme  : Encoding_Scheme;
                     Output_Scheme : Encoding_Scheme;
                     Output_BOM    : Boolean := False)
      return UTF_String;
   function Convert (Item          : UTF_String;
                     Input_Scheme  : Encoding_Scheme;
                     Output_BOM    : Boolean := False)
      return UTF_16_Wide_String;
   function Convert (Item          : UTF_8_String;
                     Output_BOM    : Boolean := False)
      return UTF_16_Wide_String;
   function Convert (Item          : UTF_16_Wide_String;
                     Output_Scheme : Encoding_Scheme;
                     Output_BOM    : Boolean := False)
      return UTF_String;
   function Convert (Item          : UTF_16_Wide_String;
                     Output_BOM    : Boolean := False)
      return UTF_8_String;
end Ada.Strings.UTF_Encoding.Conversions;
package Ada.Strings.UTF_Encoding.Strings
   with Pure is
   --  Encoding / decoding between String and various encoding schemes
   function Encode (Item          : String;
                    Output_Scheme : Encoding_Scheme;
                    Output_BOM    : Boolean := False)
      return UTF_String;
   function Encode (Item          : String;
                    Output_BOM    : Boolean := False)
      return UTF_8_String;
   function Encode (Item          : String;
                    Output_BOM    : Boolean := False)
      return UTF_16_Wide_String;
   function Decode (Item         : UTF_String;
                    Input_Scheme : Encoding_Scheme)
      return String;
   function Decode (Item : UTF_8_String)
      return String;
   function Decode (Item : UTF_16_Wide_String)
      return String;
end Ada.Strings.UTF_Encoding.Strings;
package Ada.Strings.UTF_Encoding.Wide_Strings
   with Pure is
   --  Encoding / decoding between Wide_String and various encoding schemes
   function Encode (Item          : Wide_String;
                    Output_Scheme : Encoding_Scheme;
                    Output_BOM    : Boolean := False)
      return UTF_String;
   function Encode (Item          : Wide_String;
                    Output_BOM    : Boolean := False)
      return UTF_8_String;
   function Encode (Item          : Wide_String;
                    Output_BOM    : Boolean := False)
      return UTF_16_Wide_String;
   function Decode (Item         : UTF_String;
                    Input_Scheme : Encoding_Scheme)
      return Wide_String;
   function Decode (Item : UTF_8_String)
      return Wide_String;
   function Decode (Item : UTF_16_Wide_String)
      return Wide_String;
end Ada.Strings.UTF_Encoding.Wide_Strings;
package Ada.Strings.UTF_Encoding.Wide_Wide_Strings
   with Pure is
   --  Encoding / decoding between Wide_Wide_String and various encoding schemes
   function Encode (Item          : Wide_Wide_String;
                    Output_Scheme : Encoding_Scheme;
                    Output_BOM    : Boolean := False)
      return UTF_String;
   function Encode (Item          : Wide_Wide_String;
                    Output_BOM    : Boolean := False)
      return UTF_8_String;
   function Encode (Item          : Wide_Wide_String;
                    Output_BOM    : Boolean := False)
      return UTF_16_Wide_String;
   function Decode (Item         : UTF_String;
                    Input_Scheme : Encoding_Scheme)
      return Wide_Wide_String;
   function Decode (Item : UTF_8_String)
      return Wide_Wide_String;
   function Decode (Item : UTF_16_Wide_String)
      return Wide_Wide_String;
end Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
The type Encoding_Scheme defines encoding schemes.
UTF_8 corresponds to the UTF-8 encoding scheme defined by Annex D of
ISO/IEC 10646. UTF_16BE corresponds to the UTF-16 encoding scheme defined
by Annex C of ISO/IEC 10646 in 8-bit, big-endian order; and UTF_16LE
corresponds to the UTF-16 encoding scheme in 8-bit, little-endian order.
The subtype UTF_String is used to represent a String
of 8-bit values containing a sequence of values encoded in one of three
ways (UTF-8, UTF-16BE, or UTF-16LE). The subtype UTF_8_String is used
to represent a String of 8-bit values containing a sequence of values
encoded in UTF-8. The subtype UTF_16_Wide_String is used to represent
a Wide_String of 16-bit values containing a sequence of values encoded
in UTF-16.
The BOM_8, BOM_16BE, BOM_16LE, and BOM_16 constants
correspond to values used at the start of a string to indicate the encoding.
Each of the Encode functions takes a String, Wide_String,
or Wide_Wide_String Item parameter that is assumed to be an array of
unencoded characters. Each of the Convert functions takes a UTF_String,
UTF_8_String, or UTF_16_Wide_String Item parameter that is assumed to contain
characters whose position values correspond to a valid encoding sequence
according to the encoding scheme required by the function or specified
by its Input_Scheme parameter.
Each of the Convert and Encode functions returns
a UTF_String, UTF_8_String, or UTF_16_Wide_String value whose characters have
position values that correspond to the encoding of the Item parameter
according to the encoding scheme required by the function or specified
by its Output_Scheme parameter. For UTF_8, no overlong encoding is returned.
A BOM is included at the start of the returned string if the Output_BOM
parameter is set to True. The lower bound of the returned string is 1.
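The effect of Output_BOM and the lower-bound rule can be illustrated with a short program. This is a minimal sketch, not part of the package definitions: the procedure name Encode_Demo and the renamings UTF and UTF_Strings are hypothetical names chosen for the example.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Strings;

procedure Encode_Demo is
   --  Hypothetical local renamings, used only to shorten the example.
   package UTF         renames Ada.Strings.UTF_Encoding;
   package UTF_Strings renames Ada.Strings.UTF_Encoding.Strings;

   --  "caf" followed by the character at position 16#E9#, built from a
   --  position value so the example does not depend on the encoding of
   --  this source file.
   Plain : constant String := "caf" & Character'Val (16#E9#);

   --  The expected type (a subtype of String) selects the UTF-8 overload.
   Bare     : constant UTF.UTF_8_String := UTF_Strings.Encode (Plain);
   With_BOM : constant UTF.UTF_8_String :=
     UTF_Strings.Encode (Plain, Output_BOM => True);
begin
   --  16#E9# needs two bytes in UTF-8, so 4 characters become 5 bytes.
   Ada.Text_IO.Put_Line
     ("Bare length    :" & Integer'Image (Bare'Length));

   --  The three-byte BOM_8 is prepended when Output_BOM is True.
   Ada.Text_IO.Put_Line
     ("With_BOM length:" & Integer'Image (With_BOM'Length));

   --  The lower bound of the result is 1.
   Ada.Text_IO.Put_Line
     ("Lower bound    :" & Integer'Image (With_BOM'First));

   Ada.Text_IO.Put_Line
     ("Starts with BOM: " &
      Boolean'Image (With_BOM (1 .. UTF.BOM_8'Length) = UTF.BOM_8));
end Encode_Demo;
```

Because two Encode overloads differ only in their result type, the declared type of the object receiving the result is what selects between UTF-8 and UTF-16 output.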
Each of the Decode functions takes a UTF_String,
UTF_8_String, or UTF_16_Wide_String Item parameter which is assumed to contain
characters whose position values correspond to a valid encoding sequence
according to the encoding scheme required by the function or specified
by its Input_Scheme parameter, and returns the corresponding String,
Wide_String, or Wide_Wide_String value. The lower bound of the returned
string is 1.
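As an illustration of Encode and Decode working as inverses, the following sketch round-trips a Wide_Wide_String through both UTF-8 and UTF-16. The procedure name Round_Trip_Demo and the renamings UTF and UTF_WWS are hypothetical; the code point 16#1F600# is chosen only because it lies outside the 16-bit range.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Round_Trip_Demo is
   --  Hypothetical local renamings, used only to shorten the example.
   package UTF     renames Ada.Strings.UTF_Encoding;
   package UTF_WWS renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   --  'A' followed by code point 16#1F600#, which needs four bytes in
   --  UTF-8 and a surrogate pair (two code units) in UTF-16.
   Original : constant Wide_Wide_String :=
     "A" & Wide_Wide_Character'Val (16#1F600#);

   --  The declared result type selects the Encode overload.
   As_UTF_8  : constant UTF.UTF_8_String       := UTF_WWS.Encode (Original);
   As_UTF_16 : constant UTF.UTF_16_Wide_String := UTF_WWS.Encode (Original);

   --  The argument type selects the Decode overload.
   From_8  : constant Wide_Wide_String := UTF_WWS.Decode (As_UTF_8);
   From_16 : constant Wide_Wide_String := UTF_WWS.Decode (As_UTF_16);
begin
   --  1 byte for 'A' plus 4 bytes for 16#1F600#.
   Ada.Text_IO.Put_Line
     ("UTF-8 length :" & Integer'Image (As_UTF_8'Length));

   --  1 code unit for 'A' plus 2 for the surrogate pair.
   Ada.Text_IO.Put_Line
     ("UTF-16 length:" & Integer'Image (As_UTF_16'Length));

   Ada.Text_IO.Put_Line
     ("Round trip via UTF-8 : " & Boolean'Image (From_8 = Original));
   Ada.Text_IO.Put_Line
     ("Round trip via UTF-16: " & Boolean'Image (From_16 = Original));
end Round_Trip_Demo;
```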
For each of the Convert and Decode functions, an
initial BOM in the input that matches the expected encoding scheme is
ignored, and a different initial BOM causes Encoding_Error to be propagated.
The exception Encoding_Error
is also propagated in the following situations:
By a Convert or Decode function when a UTF encoded
string contains an invalid encoding sequence.
By a Convert or Decode function when the expected
encoding is UTF-16BE or UTF-16LE and the input string has an odd length.
By a Decode function yielding a String when the
decoding of a sequence results in a code point whose value exceeds 16#FF#.
By a Decode function yielding a Wide_String when
the decoding of a sequence results in a code point whose value exceeds
16#FFFF#.
By an Encode function taking a Wide_String as input
when an invalid character appears in the input. In particular, the characters
whose position is in the range 16#D800# .. 16#DFFF# are invalid because
they conflict with UTF-16 surrogate encodings, and the characters whose
position is 16#FFFE# or 16#FFFF# are also invalid because they conflict
with BOM codes.
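Two of the situations listed above can be demonstrated by handling Encoding_Error around a Decode call. This is a minimal sketch: the procedure Encoding_Error_Demo, the helper function Decode_Or, and the renamings are hypothetical names introduced for the example.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Strings;

procedure Encoding_Error_Demo is
   --  Hypothetical local renamings, used only to shorten the example.
   package UTF         renames Ada.Strings.UTF_Encoding;
   package UTF_Strings renames Ada.Strings.UTF_Encoding.Strings;

   --  Hypothetical helper: decode UTF-8 into a String, returning
   --  Fallback when Encoding_Error is propagated.
   function Decode_Or
     (Item : UTF.UTF_8_String; Fallback : String) return String is
   begin
      return UTF_Strings.Decode (Item);
   exception
      when UTF.Encoding_Error =>
         return Fallback;
   end Decode_Or;

   --  UTF-8 for code point 16#E9#: representable in String.
   Fits : constant UTF.UTF_8_String :=
     Character'Val (16#C3#) & Character'Val (16#A9#);

   --  UTF-8 for code point 16#20AC#: a valid sequence, but the code
   --  point exceeds 16#FF#, so it cannot be decoded into a String.
   Too_Large : constant UTF.UTF_8_String :=
     Character'Val (16#E2#) & Character'Val (16#82#) &
     Character'Val (16#AC#);

   --  A continuation byte with no lead byte: an invalid sequence.
   Malformed : constant UTF.UTF_8_String := (1 => Character'Val (16#80#));
begin
   Ada.Text_IO.Put_Line
     ("Fits decodes     : " &
      Boolean'Image (Decode_Or (Fits, "") = (1 => Character'Val (16#E9#))));
   Ada.Text_IO.Put_Line
     ("Too_Large result : " & Decode_Or (Too_Large, "<Encoding_Error>"));
   Ada.Text_IO.Put_Line
     ("Malformed result : " & Decode_Or (Malformed, "<Encoding_Error>"));
end Encoding_Error_Demo;
```

The same Too_Large input decodes without error through the Wide_Strings or Wide_Wide_Strings packages, since 16#20AC# fits in both result types.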
function Encoding (Item : UTF_String;
Default : Encoding_Scheme := UTF_8)
return Encoding_Scheme;
Inspects a UTF_String
value to determine whether it starts with a BOM for UTF-8, UTF-16BE,
or UTF-16LE. If so, returns the scheme corresponding to the BOM; otherwise,
returns the value of Default.
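A short sketch of the Encoding function's behavior with and without a BOM follows. The procedure name Detect_Encoding_Demo, the renaming UTF, and the sample strings are assumptions made for the example.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;

procedure Detect_Encoding_Demo is
   --  Hypothetical local renaming, used only to shorten the example.
   package UTF renames Ada.Strings.UTF_Encoding;

   --  'A' in UTF-16LE (bytes 16#41#, 16#00#) preceded by the UTF-16LE BOM.
   Marked : constant UTF.UTF_String :=
     UTF.BOM_16LE & Character'Val (16#41#) & Character'Val (16#00#);

   --  No BOM: Encoding returns the value of Default.
   Unmarked : constant UTF.UTF_String := "plain text";
begin
   --  Prints UTF_16LE: the BOM determines the result.
   Ada.Text_IO.Put_Line
     (UTF.Encoding_Scheme'Image (UTF.Encoding (Marked)));

   --  Prints UTF_8: no BOM, so the default value of Default is returned.
   Ada.Text_IO.Put_Line
     (UTF.Encoding_Scheme'Image (UTF.Encoding (Unmarked)));

   --  Prints UTF_16BE: no BOM, so the explicit Default is returned.
   Ada.Text_IO.Put_Line
     (UTF.Encoding_Scheme'Image
        (UTF.Encoding (Unmarked, Default => UTF.UTF_16BE)));
end Detect_Encoding_Demo;
```

Note that Encoding inspects only the BOM. It does not validate the rest of Item, so a later Convert or Decode call can still propagate Encoding_Error.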
function Convert (Item : UTF_String;
Input_Scheme : Encoding_Scheme;
Output_Scheme : Encoding_Scheme;
Output_BOM : Boolean := False) return UTF_String;
Returns the value
of Item (originally encoded in UTF-8, UTF-16LE, or UTF-16BE as specified
by Input_Scheme) encoded in one of these three schemes as specified by
Output_Scheme.
function Convert (Item : UTF_String;
Input_Scheme : Encoding_Scheme;
Output_BOM : Boolean := False)
return UTF_16_Wide_String;
Returns the value
of Item (originally encoded in UTF-8, UTF-16LE, or UTF-16BE as specified
by Input_Scheme) encoded in UTF-16.
function Convert (Item : UTF_8_String;
Output_BOM : Boolean := False)
return UTF_16_Wide_String;
Returns the value
of Item (originally encoded in UTF-8) encoded in UTF-16.
function Convert (Item : UTF_16_Wide_String;
Output_Scheme : Encoding_Scheme;
Output_BOM : Boolean := False) return UTF_String;
Returns the value
of Item (originally encoded in UTF-16) encoded in UTF-8, UTF-16LE, or
UTF-16BE as specified by Output_Scheme.
function Convert (Item : UTF_16_Wide_String;
Output_BOM : Boolean := False) return UTF_8_String;
Returns the value
of Item (originally encoded in UTF-16) encoded in UTF-8.
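The Convert overloads described above can be compared side by side in a small program. This is a sketch under stated assumptions: the procedure name Convert_Demo and the renamings UTF and Conv are hypothetical, and the two-byte input is simply the UTF-8 encoding of code point 16#E9#.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Conversions;

procedure Convert_Demo is
   --  Hypothetical local renamings, used only to shorten the example.
   package UTF  renames Ada.Strings.UTF_Encoding;
   package Conv renames Ada.Strings.UTF_Encoding.Conversions;

   --  UTF-8 for code point 16#E9#.
   Source : constant UTF.UTF_8_String :=
     Character'Val (16#C3#) & Character'Val (16#A9#);

   --  Byte-oriented UTF-16BE: each 16-bit code unit becomes two
   --  Characters, high-order byte first.
   As_BE : constant UTF.UTF_String :=
     Conv.Convert (Source,
                   Input_Scheme  => UTF.UTF_8,
                   Output_Scheme => UTF.UTF_16BE);

   --  UTF-16 held in a Wide_String: one Wide_Character per code unit.
   As_Wide : constant UTF.UTF_16_Wide_String := Conv.Convert (Source);

   --  Back to UTF-8 from the Wide_String form.
   Back : constant UTF.UTF_8_String := Conv.Convert (As_Wide);
begin
   --  Two bytes with positions 0 and 233 (16#00#, 16#E9#).
   Ada.Text_IO.Put_Line
     ("UTF-16BE bytes:" &
      Integer'Image (Character'Pos (As_BE (1))) &
      Integer'Image (Character'Pos (As_BE (2))));

   --  One code unit with position 233 (16#E9#).
   Ada.Text_IO.Put_Line
     ("UTF-16 unit   :" &
      Integer'Image (Wide_Character'Pos (As_Wide (1))));

   Ada.Text_IO.Put_Line
     ("Back to UTF-8 : " & Boolean'Image (Back = Source));
end Convert_Demo;
```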
function Encode (Item : String;
Output_Scheme : Encoding_Scheme;
Output_BOM : Boolean := False) return UTF_String;
Returns the value
of Item encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Output_Scheme.
function Encode (Item : String;
Output_BOM : Boolean := False) return UTF_8_String;
Returns the value
of Item encoded in UTF-8.
function Encode (Item : String;
Output_BOM : Boolean := False) return UTF_16_Wide_String;
Returns the value
of Item encoded in UTF-16.
function Decode (Item : UTF_String;
Input_Scheme : Encoding_Scheme) return String;
Returns the result
of decoding Item, which is encoded in UTF-8, UTF-16LE, or UTF-16BE as
specified by Input_Scheme.
function Decode (Item : UTF_8_String) return String;
Returns the result
of decoding Item, which is encoded in UTF-8.
function Decode (Item : UTF_16_Wide_String) return String;
Returns the result
of decoding Item, which is encoded in UTF-16.
function Encode (Item : Wide_String;
Output_Scheme : Encoding_Scheme;
Output_BOM : Boolean := False) return UTF_String;
Returns the value
of Item encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Output_Scheme.
function Encode (Item : Wide_String;
Output_BOM : Boolean := False) return UTF_8_String;
Returns the value
of Item encoded in UTF-8.
function Encode (Item : Wide_String;
Output_BOM : Boolean := False) return UTF_16_Wide_String;
Returns the value
of Item encoded in UTF-16.
function Decode (Item : UTF_String;
Input_Scheme : Encoding_Scheme) return Wide_String;
Returns the result
of decoding Item, which is encoded in UTF-8, UTF-16LE, or UTF-16BE as
specified by Input_Scheme.
function Decode (Item : UTF_8_String) return Wide_String;
Returns the result
of decoding Item, which is encoded in UTF-8.
function Decode (Item : UTF_16_Wide_String) return Wide_String;
Returns the result
of decoding Item, which is encoded in UTF-16.
function Encode (Item : Wide_Wide_String;
Output_Scheme : Encoding_Scheme;
Output_BOM : Boolean := False) return UTF_String;
Returns the value
of Item encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Output_Scheme.
function Encode (Item : Wide_Wide_String;
Output_BOM : Boolean := False) return UTF_8_String;
Returns the value
of Item encoded in UTF-8.
function Encode (Item : Wide_Wide_String;
Output_BOM : Boolean := False) return UTF_16_Wide_String;
Returns the value
of Item encoded in UTF-16.
function Decode (Item : UTF_String;
Input_Scheme : Encoding_Scheme) return Wide_Wide_String;
Returns the result
of decoding Item, which is encoded in UTF-8, UTF-16LE, or UTF-16BE as
specified by Input_Scheme.
function Decode (Item : UTF_8_String) return Wide_Wide_String;
Returns the result
of decoding Item, which is encoded in UTF-8.
function Decode (Item : UTF_16_Wide_String) return Wide_Wide_String;
Returns the result
of decoding Item, which is encoded in UTF-16.
Implementation Advice
If an implementation supports other encoding schemes,
another similar child of Ada.Strings should be defined.
NOTE A BOM (Byte-Order Mark, code
position 16#FEFF#) can be included in a file or other entity to indicate
the encoding; it is skipped when decoding. Typically, only the first
line of a file or other entity contains a BOM. When decoding, the Encoding
function can be called on the first line to determine the encoding; this
encoding will then be used in subsequent calls to Decode to convert all
of the lines to an internal format.
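The workflow described in the note might be sketched as follows. The two constants stand in for lines read from a file; the procedure name Decode_Lines_Demo, the renamings, and the sample data are all assumptions made for the example, and file handling is deliberately omitted.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Strings;

procedure Decode_Lines_Demo is
   --  Hypothetical local renamings, used only to shorten the example.
   package UTF         renames Ada.Strings.UTF_Encoding;
   package UTF_Strings renames Ada.Strings.UTF_Encoding.Strings;

   --  Stand-ins for two lines read from a UTF-8 file. Only the first
   --  line carries a BOM.
   Line_1 : constant UTF.UTF_String :=
     UTF.BOM_8 & "caf" & Character'Val (16#C3#) & Character'Val (16#A9#);
   Line_2 : constant UTF.UTF_String := "ok";

   --  Determine the scheme once, from the first line.
   Scheme : constant UTF.Encoding_Scheme := UTF.Encoding (Line_1);

   --  Reuse that scheme for every line. The BOM on Line_1 matches the
   --  scheme and is therefore skipped by Decode.
   Text_1 : constant String := UTF_Strings.Decode (Line_1, Scheme);
   Text_2 : constant String := UTF_Strings.Decode (Line_2, Scheme);
begin
   Ada.Text_IO.Put_Line
     ("Scheme       : " & UTF.Encoding_Scheme'Image (Scheme));

   --  Four characters: the BOM is gone and the two-byte sequence has
   --  become the single character at position 16#E9#.
   Ada.Text_IO.Put_Line
     ("Line 1 length:" & Integer'Image (Text_1'Length));
   Ada.Text_IO.Put_Line
     ("Line 2       : " & Text_2);
end Decode_Lines_Demo;
```

Decoding into String is used here only to keep the sketch short. Input that may contain code points above 16#FF# would need the Wide_Strings or Wide_Wide_Strings package instead.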
Ada 2005 and 2012 Editions sponsored in part by Ada-Europe