A.4.11 String Encoding
{
AI05-0137-2}
Facilities for encoding, decoding, and converting strings in various
character encoding schemes are provided by packages Strings.UTF_Encoding,
Strings.UTF_Encoding.Conversions, Strings.UTF_Encoding.Strings, Strings.UTF_Encoding.Wide_Strings,
and Strings.UTF_Encoding.Wide_Wide_Strings.
Static Semantics
{
AI05-0137-2}
The encoding library packages have the following declarations:
-- Declarations common to the string encoding packages
type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE);
subtype UTF_String is String;
subtype UTF_8_String is String;
subtype UTF_16_Wide_String is Wide_String;
Encoding_Error : exception;
BOM_8 : constant UTF_8_String :=
Character'Val(16#EF#) &
Character'Val(16#BB#) &
Character'Val(16#BF#);
BOM_16BE : constant UTF_String :=
Character'Val(16#FE#) &
Character'Val(16#FF#);
BOM_16LE : constant UTF_String :=
Character'Val(16#FF#) &
Character'Val(16#FE#);
BOM_16 : constant UTF_16_Wide_String :=
(1 => Wide_Character'Val(16#FEFF#));
function Encoding (Item : UTF_String;
Default : Encoding_Scheme := UTF_8)
return Encoding_Scheme;
end Ada.Strings.UTF_Encoding;
-- Conversions between various encoding schemes
function Convert (Item : UTF_String;
Input_Scheme : Encoding_Scheme;
Output_Scheme : Encoding_Scheme;
Output_BOM : Boolean := False)
return UTF_String;
function Convert (Item : UTF_String;
Input_Scheme : Encoding_Scheme;
Output_BOM : Boolean := False)
return UTF_16_Wide_String;
function Convert (Item : UTF_8_String;
Output_BOM : Boolean := False)
return UTF_16_Wide_String;
function Convert (Item : UTF_16_Wide_String;
Output_Scheme : Encoding_Scheme;
Output_BOM : Boolean := False)
return UTF_String;
function Convert (Item : UTF_16_Wide_String;
Output_BOM : Boolean := False)
return UTF_8_String;
end Ada.Strings.UTF_Encoding.Conversions;
-- Encoding / decoding between String and various encoding schemes
function Encode (Item : String;
Output_Scheme : Encoding_Scheme;
Output_BOM : Boolean := False)
return UTF_String;
function Encode (Item : String;
Output_BOM : Boolean := False)
return UTF_8_String;
function Encode (Item : String;
Output_BOM : Boolean := False)
return UTF_16_Wide_String;
function Decode (Item : UTF_String;
Input_Scheme : Encoding_Scheme)
return String;
function Decode (Item : UTF_8_String)
return String;
function Decode (Item : UTF_16_Wide_String)
return String;
end Ada.Strings.UTF_Encoding.Strings;
-- Encoding / decoding between Wide_String and various encoding schemes
function Encode (Item : Wide_String;
Output_Scheme : Encoding_Scheme;
Output_BOM : Boolean := False)
return UTF_String;
function Encode (Item : Wide_String;
Output_BOM : Boolean := False)
return UTF_8_String;
function Encode (Item : Wide_String;
Output_BOM : Boolean := False)
return UTF_16_Wide_String;
function Decode (Item : UTF_String;
Input_Scheme : Encoding_Scheme)
return Wide_String;
function Decode (Item : UTF_8_String)
return Wide_String;
function Decode (Item : UTF_16_Wide_String)
return Wide_String;
end Ada.Strings.UTF_Encoding.Wide_Strings;
-- Encoding / decoding between Wide_Wide_String and various encoding schemes
function Encode (Item : Wide_Wide_String;
Output_Scheme : Encoding_Scheme;
Output_BOM : Boolean := False)
return UTF_String;
function Encode (Item : Wide_Wide_String;
Output_BOM : Boolean := False)
return UTF_8_String;
function Encode (Item : Wide_Wide_String;
Output_BOM : Boolean := False)
return UTF_16_Wide_String;
function Decode (Item : UTF_String;
Input_Scheme : Encoding_Scheme)
return Wide_Wide_String;
function Decode (Item : UTF_8_String)
return Wide_Wide_String;
function Decode (Item : UTF_16_Wide_String)
return Wide_Wide_String;
end Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
{
AI05-0137-2}
{
AI05-0262-1}
The type Encoding_Scheme defines encoding schemes. UTF_8 corresponds
to the UTF-8 encoding scheme defined by Annex D of ISO/IEC 10646. UTF_16BE
corresponds to the UTF-16 encoding scheme defined by Annex C of ISO/IEC
10646 in 8-bit, big-endian order; and UTF_16LE corresponds to the UTF-16
encoding scheme in 8-bit, little-endian order.
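As an illustration of how the three schemes differ, the following sketch encodes one character (U+20AC, the euro sign) with each scheme and prints the position value of every element of the result. The procedure name Show_Schemes and the helper Dump are hypothetical and not part of the standard; only the UTF_Encoding declarations shown above are assumed.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;              use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Strings;

procedure Show_Schemes is
   package Enc renames Ada.Strings.UTF_Encoding.Wide_Strings;

   --  U+20AC EURO SIGN as a one-element Wide_String.
   Euro : constant Wide_String := (1 => Wide_Character'Val (16#20AC#));

   --  Hypothetical helper: print the position value (in decimal) of
   --  each 8-bit element of an encoded string.
   procedure Dump (Label : String; Item : UTF_String) is
   begin
      Ada.Text_IO.Put (Label & ":");
      for C of Item loop
         Ada.Text_IO.Put (Integer'Image (Character'Pos (C)));
      end loop;
      Ada.Text_IO.New_Line;
   end Dump;
begin
   --  Element values by the definition of each scheme:
   Dump ("UTF_8   ", Enc.Encode (Euro, UTF_8));     --  16#E2# 16#82# 16#AC#
   Dump ("UTF_16BE", Enc.Encode (Euro, UTF_16BE));  --  16#20# 16#AC#
   Dump ("UTF_16LE", Enc.Encode (Euro, UTF_16LE));  --  16#AC# 16#20#
end Show_Schemes;
```

The two UTF-16 results contain the same 16-bit code unit split into two 8-bit elements; only the order of the elements differs.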
{
AI05-0137-2}
The subtype UTF_String is used to represent a String of 8-bit values
containing a sequence of values encoded in one of three ways (UTF-8,
UTF-16BE, or UTF-16LE). The subtype UTF_8_String is used to represent
a String of 8-bit values containing a sequence of values encoded in UTF-8.
The subtype UTF_16_Wide_String is used to represent a Wide_String of
16-bit values containing a sequence of values encoded in UTF-16.
{
AI05-0137-2}
{
AI05-0262-1}
The BOM_8, BOM_16BE, BOM_16LE, and BOM_16 constants correspond to values
used at the start of a string to indicate the encoding.
{
AI05-0262-1}
{
AI05-0269-1}
Each of the Encode functions takes a String, Wide_String, or Wide_Wide_String
Item parameter that is assumed to be an array of unencoded characters.
Each of the Convert functions takes a UTF_String, UTF_8_String, or UTF_16_Wide_String
Item parameter that is assumed to contain characters whose position values
correspond to a valid encoding sequence according to the encoding scheme
required by the function or specified by its Input_Scheme parameter.
{
AI05-0137-2}
{
AI05-0262-1}
{
AI05-0269-1}
Each of the Convert and Encode functions returns a UTF_String, UTF_8_String,
or UTF_16_Wide_String value whose characters have position values that correspond
to the encoding of the Item parameter according to the encoding scheme
required by the function or specified by its Output_Scheme parameter.
For UTF_8, no overlong encoding is returned. A BOM is included at the
start of the returned string if the Output_BOM parameter is set to True.
The lower bound of the returned string is 1.
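The effect of Output_BOM and the lower-bound rule can be sketched as follows. The procedure name Show_BOM is hypothetical, and the assertions are only checked when assertion checking is enabled (for example with a compiler switch or a suitable Assertion_Policy); they restate the rules above rather than add new ones.

```ada
with Ada.Strings.UTF_Encoding;         use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Strings;

procedure Show_BOM is
   package Enc renames Ada.Strings.UTF_Encoding.Strings;

   --  The result subtype selects the UTF-8 or the UTF-16 overload.
   Plain  : constant UTF_8_String := Enc.Encode ("abc");
   Marked : constant UTF_8_String :=
     Enc.Encode ("abc", Output_BOM => True);
   Wide   : constant UTF_16_Wide_String :=
     Enc.Encode ("abc", Output_BOM => True);
begin
   --  Without a BOM, pure ASCII input is unchanged in UTF-8.
   pragma Assert (Plain = "abc");

   --  With Output_BOM => True the matching BOM comes first.
   pragma Assert (Marked = BOM_8 & "abc");
   pragma Assert (Wide = BOM_16 & Wide_String'("abc"));

   --  The lower bound of every result is 1.
   pragma Assert (Marked'First = 1 and then Wide'First = 1);
   null;
end Show_BOM;
```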
{
AI05-0137-2}
{
AI05-0262-1}
Each of the Decode functions takes a UTF_String, UTF_8_String, or UTF_16_Wide_String
Item parameter that is assumed to contain characters whose position
values correspond to a valid encoding sequence according to the encoding
scheme required by the function or specified by its Input_Scheme parameter,
and returns the corresponding String, Wide_String, or Wide_Wide_String
value. The lower bound of the returned string is 1.
{
AI05-0137-2}
{
AI05-0262-1}
For each of the Convert and Decode functions, an initial BOM in the input
that matches the expected encoding scheme is ignored, and a different
initial BOM causes Encoding_Error to be propagated.
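A minimal sketch of this BOM rule, assuming a conforming implementation: a leading UTF-8 BOM is skipped when decoding as UTF-8, while a leading UTF-16BE BOM in input declared as UTF-8 should propagate Encoding_Error. The procedure name Show_BOM_Check is hypothetical.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;         use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Strings;

procedure Show_BOM_Check is
   package Enc renames Ada.Strings.UTF_Encoding.Strings;
begin
   --  A BOM that matches the expected scheme is ignored.
   pragma Assert (Enc.Decode (BOM_8 & "abc", UTF_8) = "abc");

   --  A BOM for a different scheme is an error. The call is made in
   --  the statement part so that the handler below can catch it.
   begin
      Ada.Text_IO.Put_Line (Enc.Decode (BOM_16BE & "abc", UTF_8));
   exception
      when Encoding_Error =>
         Ada.Text_IO.Put_Line ("Encoding_Error: BOM does not match UTF_8");
   end;
end Show_BOM_Check;
```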
{
AI05-0137-2}
The exception Encoding_Error is also propagated in the following situations:
{
AI12-0088-1}
By a Convert or Decode function when a UTF encoded string contains an
invalid encoding sequence.
To be honest: {
AI12-0088-1}
An overlong encoding is not invalid for the purposes of this check, and
this does not depend on the character set version in use. Some recent
character set standards declare overlong encodings to be invalid; it
would be unnecessary and unfriendly to users for Convert or Decode to
raise an exception for an overlong encoding.
{
AI12-0088-1}
By a Convert or Decode function when the expected encoding is UTF-16BE
or UTF-16LE and the input string has an odd length.
{
AI05-0262-1}
By a Decode function yielding a String when the decoding of a sequence
results in a code point whose value exceeds 16#FF#.
By a Decode function yielding a Wide_String when
the decoding of a sequence results in a code point whose value exceeds
16#FFFF#.
{
AI05-0262-1}
By an Encode function taking a Wide_String as input when an invalid character
appears in the input. In particular, the characters whose position is
in the range 16#D800# .. 16#DFFF# are invalid because they conflict with
UTF-16 surrogate encodings, and the characters whose position is 16#FFFE#
or 16#FFFF# are also invalid because they conflict with BOM codes.
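One of the situations listed above can be sketched as follows: the same UTF-8 input decodes successfully to a Wide_String, but decoding it to a String propagates Encoding_Error because the code point exceeds 16#FF#. The procedure name Show_Range_Check and the package renamings are hypothetical.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;              use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Strings;
with Ada.Strings.UTF_Encoding.Wide_Strings;

procedure Show_Range_Check is
   package To_Str  renames Ada.Strings.UTF_Encoding.Strings;
   package To_Wide renames Ada.Strings.UTF_Encoding.Wide_Strings;

   --  UTF-8 encoding of U+20AC EURO SIGN.
   Euro_UTF_8 : constant UTF_8_String :=
     Character'Val (16#E2#) & Character'Val (16#82#) & Character'Val (16#AC#);

   --  16#20AC# fits in a Wide_Character, so this succeeds.
   W : constant Wide_String := To_Wide.Decode (Euro_UTF_8);
begin
   pragma Assert
     (W'Length = 1 and then Wide_Character'Pos (W (1)) = 16#20AC#);

   --  16#20AC# exceeds 16#FF#, so decoding to String is an error.
   begin
      Ada.Text_IO.Put_Line (To_Str.Decode (Euro_UTF_8));
   exception
      when Encoding_Error =>
         Ada.Text_IO.Put_Line ("Encoding_Error: code point exceeds 16#FF#");
   end;
end Show_Range_Check;
```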
{
AI05-0137-2}
function Encoding (Item : UTF_String;
Default : Encoding_Scheme := UTF_8)
return Encoding_Scheme;
{
AI05-0137-2}
{
AI05-0269-1}
Inspects a UTF_String value to determine whether it starts with a BOM
for UTF-8, UTF-16BE, or UTF-16LE. If so, returns the scheme corresponding
to the BOM; otherwise, returns the value of Default.
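The behavior of Encoding can be sketched with a few assertions; they are only checked when assertion checking is enabled. The procedure name Show_Encoding is hypothetical.

```ada
with Ada.Strings.UTF_Encoding; use Ada.Strings.UTF_Encoding;

procedure Show_Encoding is
begin
   --  A leading BOM determines the result.
   pragma Assert
     (Encoding (BOM_16LE & "a" & Character'Val (0)) = UTF_16LE);
   pragma Assert (Encoding (BOM_8 & "abc") = UTF_8);

   --  Without a BOM, the value of Default is returned.
   pragma Assert (Encoding ("plain text") = UTF_8);
   pragma Assert
     (Encoding ("plain text", Default => UTF_16BE) = UTF_16BE);
   null;
end Show_Encoding;
```

Encoding inspects only the start of the string; it does not validate the remainder, so a later call to Convert or Decode can still propagate Encoding_Error.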
{
AI05-0137-2}
function Convert (Item : UTF_String;
Input_Scheme : Encoding_Scheme;
Output_Scheme : Encoding_Scheme;
Output_BOM : Boolean := False)
return UTF_String;
Returns the value
of Item (originally encoded in UTF-8, UTF-16LE, or UTF-16BE as specified
by Input_Scheme) encoded in one of these three schemes as specified by
Output_Scheme.
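As a sketch of this form of Convert, the following converts the UTF-8 encoding of U+00E9 (Latin small letter e with acute) to UTF-16BE with a leading BOM. The procedure name Show_Convert is hypothetical, and the assertion is only checked when assertion checking is enabled.

```ada
with Ada.Strings.UTF_Encoding;             use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Conversions;

procedure Show_Convert is
   package Conv renames Ada.Strings.UTF_Encoding.Conversions;

   --  U+00E9 in UTF-8 is the two elements 16#C3# 16#A9#.
   E_Acute_UTF_8 : constant UTF_String :=
     Character'Val (16#C3#) & Character'Val (16#A9#);

   As_16BE : constant UTF_String :=
     Conv.Convert (E_Acute_UTF_8,
                   Input_Scheme  => UTF_8,
                   Output_Scheme => UTF_16BE,
                   Output_BOM    => True);
begin
   --  Big-endian order: the high-order element 16#00# comes first.
   pragma Assert
     (As_16BE =
        BOM_16BE & Character'Val (16#00#) & Character'Val (16#E9#));
   null;
end Show_Convert;
```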
{
AI05-0137-2}
function Convert (Item : UTF_String;
Input_Scheme : Encoding_Scheme;
Output_BOM : Boolean := False)
return UTF_16_Wide_String;
Returns the value
of Item (originally encoded in UTF-8, UTF-16LE, or UTF-16BE as specified
by Input_Scheme) encoded in UTF-16.
{
AI05-0137-2}
function Convert (Item : UTF_8_String;
Output_BOM : Boolean := False)
return UTF_16_Wide_String;
Returns the value
of Item (originally encoded in UTF-8) encoded in UTF-16.
{
AI05-0137-2}
function Convert (Item : UTF_16_Wide_String;
Output_Scheme : Encoding_Scheme;
Output_BOM : Boolean := False)
return UTF_String;
Returns the value
of Item (originally encoded in UTF-16) encoded in UTF-8, UTF-16LE, or
UTF-16BE as specified by Output_Scheme.
{
AI05-0137-2}
function Convert (Item : UTF_16_Wide_String;
Output_BOM : Boolean := False)
return UTF_8_String;
Returns the value
of Item (originally encoded in UTF-16) encoded in UTF-8.
{
AI05-0137-2}
function Encode (Item : String;
Output_Scheme : Encoding_Scheme;
Output_BOM : Boolean := False)
return UTF_String;
{
AI05-0262-1}
Returns the value of Item encoded in UTF-8, UTF-16LE, or UTF-16BE as
specified by Output_Scheme.
{
AI05-0137-2}
function Encode (Item : String;
Output_BOM : Boolean := False)
return UTF_8_String;
Returns the value
of Item encoded in UTF-8.
{
AI05-0137-2}
function Encode (Item : String;
Output_BOM : Boolean := False)
return UTF_16_Wide_String;
Returns the value
of Item encoded in UTF-16.
{
AI05-0137-2}
function Decode (Item : UTF_String;
Input_Scheme : Encoding_Scheme)
return String;
Returns the result
of decoding Item, which is encoded in UTF-8, UTF-16LE, or UTF-16BE as
specified by Input_Scheme.
{
AI05-0137-2}
function Decode (Item : UTF_8_String)
return String;
Returns the result
of decoding Item, which is encoded in UTF-8.
{
AI05-0137-2}
function Decode (Item : UTF_16_Wide_String)
return String;
Returns the result
of decoding Item, which is encoded in UTF-16.
{
AI05-0137-2}
function Encode (Item : Wide_String;
Output_Scheme : Encoding_Scheme;
Output_BOM : Boolean := False)
return UTF_String;
{
AI05-0262-1}
Returns the value of Item encoded in UTF-8, UTF-16LE, or UTF-16BE as
specified by Output_Scheme.
{
AI05-0137-2}
function Encode (Item : Wide_String;
Output_BOM : Boolean := False)
return UTF_8_String;
Returns the value
of Item encoded in UTF-8.
{
AI05-0137-2}
function Encode (Item : Wide_String;
Output_BOM : Boolean := False)
return UTF_16_Wide_String;
Returns the value
of Item encoded in UTF-16.
{
AI05-0137-2}
function Decode (Item : UTF_String;
Input_Scheme : Encoding_Scheme)
return Wide_String;
Returns the result
of decoding Item, which is encoded in UTF-8, UTF-16LE, or UTF-16BE as
specified by Input_Scheme.
{
AI05-0137-2}
function Decode (Item : UTF_8_String)
return Wide_String;
Returns the result
of decoding Item, which is encoded in UTF-8.
{
AI05-0137-2}
function Decode (Item : UTF_16_Wide_String)
return Wide_String;
Returns the result
of decoding Item, which is encoded in UTF-16.
{
AI05-0137-2}
function Encode (Item : Wide_Wide_String;
Output_Scheme : Encoding_Scheme;
Output_BOM : Boolean := False)
return UTF_String;
{
AI05-0262-1}
Returns the value of Item encoded in UTF-8, UTF-16LE, or UTF-16BE as
specified by Output_Scheme.
{
AI05-0137-2}
function Encode (Item : Wide_Wide_String;
Output_BOM : Boolean := False)
return UTF_8_String;
Returns the value
of Item encoded in UTF-8.
{
AI05-0137-2}
function Encode (Item : Wide_Wide_String;
Output_BOM : Boolean := False)
return UTF_16_Wide_String;
Returns the value
of Item encoded in UTF-16.
{
AI05-0137-2}
function Decode (Item : UTF_String;
Input_Scheme : Encoding_Scheme)
return Wide_Wide_String;
Returns the result
of decoding Item, which is encoded in UTF-8, UTF-16LE, or UTF-16BE as
specified by Input_Scheme.
{
AI05-0137-2}
function Decode (Item : UTF_8_String)
return Wide_Wide_String;
Returns the result
of decoding Item, which is encoded in UTF-8.
{
AI05-0137-2}
function Decode (Item : UTF_16_Wide_String)
return Wide_Wide_String;
Returns the result
of decoding Item, which is encoded in UTF-16.
Implementation Advice
{
AI05-0137-2}
If an implementation supports other encoding schemes, another similar
child of Ada.Strings should be defined.
Implementation Advice: If an implementation
supports other string encoding schemes, a child of Ada.Strings similar
to UTF_Encoding should be defined.
NOTE {
AI05-0137-2}
A BOM (Byte-Order Mark, code position 16#FEFF#) can be included in a
file or other entity to indicate the encoding; it is skipped when decoding.
Typically, only the first line of a file or other entity contains a BOM.
When decoding, the Encoding function can be called on the first line
to determine the encoding; this encoding will then be used in subsequent
calls to Decode to convert all of the lines to an internal format.
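The usage pattern described in this note might look like the sketch below. The file name "input.txt" and the procedure name Decode_Lines are hypothetical. The sketch also assumes that Ada.Text_IO.Get_Line splits the file at suitable boundaries, which is realistic for UTF-8 input but not in general for UTF-16 files, where a line terminator occupies two 8-bit elements.

```ada
with Ada.Text_IO;                                use Ada.Text_IO;
with Ada.Strings.UTF_Encoding;                   use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Decode_Lines is
   package Dec renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   File   : File_Type;
   Scheme : Encoding_Scheme := UTF_8;
   First  : Boolean := True;
begin
   Open (File, In_File, "input.txt");  --  hypothetical file name
   while not End_Of_File (File) loop
      declare
         Line : constant UTF_String := Get_Line (File);
      begin
         --  Only the first line is inspected for a BOM.
         if First then
            Scheme := Encoding (Line, Default => UTF_8);
            First  := False;
         end if;

         --  Decode skips a matching BOM on the first line.
         declare
            Text : constant Wide_Wide_String := Dec.Decode (Line, Scheme);
         begin
            Put_Line ("characters on line:" & Natural'Image (Text'Length));
         end;
      end;
   end loop;
   Close (File);
end Decode_Lines;
```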
Extensions to Ada 2005
{
AI05-0137-2}
The packages Strings.UTF_Encoding, Strings.UTF_Encoding.Conversions,
Strings.UTF_Encoding.Strings, Strings.UTF_Encoding.Wide_Strings, and
Strings.UTF_Encoding.Wide_Wide_Strings are new.
Wording Changes from Ada 2012
{
AI12-0088-1}
Corrigendum: Added the omitted rule that Convert routines make the
same checks on input as Decode routines.
Ada 2005 and 2012 Editions sponsored in part by Ada-Europe