Annotated Ada Reference Manual (Ada 202y Draft 1)

A.4.11 String Encoding

1/3
{AI05-0137-2} Facilities for encoding, decoding, and converting strings in various character encoding schemes are provided by packages Strings.UTF_Encoding, Strings.UTF_Encoding.Conversions, Strings.UTF_Encoding.Strings, Strings.UTF_Encoding.Wide_Strings, and Strings.UTF_Encoding.Wide_Wide_Strings. 

Static Semantics

2/3
{AI05-0137-2} The encoding library packages have the following declarations:
3/5
{AI05-0137-2} {AI12-0414-1} package Ada.Strings.UTF_Encoding
  with Pure is
4/3
   -- Declarations common to the string encoding packages
   type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE);
5/3
   subtype UTF_String is String;
6/3
   subtype UTF_8_String is String;
7/3
   subtype UTF_16_Wide_String is Wide_String;
8/3
   Encoding_Error : exception;
9/3
   BOM_8    : constant UTF_8_String :=
                Character'Val(16#EF#) &
                Character'Val(16#BB#) &
                Character'Val(16#BF#);
10/3
   BOM_16BE : constant UTF_String :=
                Character'Val(16#FE#) &
                Character'Val(16#FF#);
11/3
   BOM_16LE : constant UTF_String :=
                Character'Val(16#FF#) &
                Character'Val(16#FE#);
12/3
   BOM_16   : constant UTF_16_Wide_String :=
               (1 => Wide_Character'Val(16#FEFF#));
13/3
   function Encoding (Item    : UTF_String;
                      Default : Encoding_Scheme := UTF_8)
      return Encoding_Scheme;
14/3
end Ada.Strings.UTF_Encoding;
15/5
{AI05-0137-2} {AI12-0414-1} package Ada.Strings.UTF_Encoding.Conversions
   with Pure is
16/3
   -- Conversions between various encoding schemes
   function Convert (Item          : UTF_String;
                     Input_Scheme  : Encoding_Scheme;
                     Output_Scheme : Encoding_Scheme;
                     Output_BOM    : Boolean := False) return UTF_String;
17/3
   function Convert (Item          : UTF_String;
                     Input_Scheme  : Encoding_Scheme;
                     Output_BOM    : Boolean := False)
      return UTF_16_Wide_String;
18/3
   function Convert (Item          : UTF_8_String;
                     Output_BOM    : Boolean := False)
      return UTF_16_Wide_String;
19/3
   function Convert (Item          : UTF_16_Wide_String;
                     Output_Scheme : Encoding_Scheme;
                     Output_BOM    : Boolean := False) return UTF_String;
20/3
   function Convert (Item          : UTF_16_Wide_String;
                     Output_BOM    : Boolean := False) return UTF_8_String;
21/3
end Ada.Strings.UTF_Encoding.Conversions;
22/5
{AI05-0137-2} {AI12-0414-1} package Ada.Strings.UTF_Encoding.Strings
   with Pure is
23/3
   -- Encoding / decoding between String and various encoding schemes
   function Encode (Item          : String;
                    Output_Scheme : Encoding_Scheme;
                    Output_BOM    : Boolean  := False) return UTF_String;
24/3
   function Encode (Item       : String;
                    Output_BOM : Boolean  := False) return UTF_8_String;
25/3
   function Encode (Item       : String;
                    Output_BOM : Boolean  := False)
      return UTF_16_Wide_String;
26/3
   function Decode (Item         : UTF_String;
                    Input_Scheme : Encoding_Scheme) return String;
27/3
   function Decode (Item : UTF_8_String) return String;
28/3
   function Decode (Item : UTF_16_Wide_String) return String;
29/3
end Ada.Strings.UTF_Encoding.Strings;
30/5
{AI05-0137-2} {AI12-0414-1} package Ada.Strings.UTF_Encoding.Wide_Strings
   with Pure is
31/3
   -- Encoding / decoding between Wide_String and various encoding schemes
   function Encode (Item          : Wide_String;
                    Output_Scheme : Encoding_Scheme;
                    Output_BOM    : Boolean  := False) return UTF_String;
32/3
   function Encode (Item       : Wide_String;
                    Output_BOM : Boolean  := False) return UTF_8_String;
33/3
   function Encode (Item       : Wide_String;
                    Output_BOM : Boolean  := False)
      return UTF_16_Wide_String;
34/3
   function Decode (Item         : UTF_String;
                    Input_Scheme : Encoding_Scheme) return Wide_String;
35/3
   function Decode (Item : UTF_8_String) return Wide_String;
36/3
   function Decode (Item : UTF_16_Wide_String) return Wide_String;
37/3
end Ada.Strings.UTF_Encoding.Wide_Strings;
38/5
{AI05-0137-2} {AI12-0414-1} package Ada.Strings.UTF_Encoding.Wide_Wide_Strings
   with Pure is
39/3
   -- Encoding / decoding between Wide_Wide_String and various encoding schemes
   function Encode (Item          : Wide_Wide_String;
                    Output_Scheme : Encoding_Scheme;
                    Output_BOM    : Boolean  := False) return UTF_String;
40/3
   function Encode (Item       : Wide_Wide_String;
                    Output_BOM : Boolean  := False) return UTF_8_String;
41/3
   function Encode (Item       : Wide_Wide_String;
                    Output_BOM : Boolean  := False)
      return UTF_16_Wide_String;
42/3
   function Decode (Item         : UTF_String;
                    Input_Scheme : Encoding_Scheme) return Wide_Wide_String;
43/3
   function Decode (Item : UTF_8_String) return Wide_Wide_String;
44/3
   function Decode (Item : UTF_16_Wide_String) return Wide_Wide_String;
45/3
end Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
46/3
{AI05-0137-2} {AI05-0262-1} The type Encoding_Scheme defines encoding schemes. UTF_8 corresponds to the UTF-8 encoding scheme defined by Annex D of ISO/IEC 10646. UTF_16BE corresponds to the UTF-16 encoding scheme defined by Annex C of ISO/IEC 10646 in 8-bit, big-endian order; and UTF_16LE corresponds to the UTF-16 encoding scheme in 8-bit, little-endian order.
47/3
{AI05-0137-2} The subtype UTF_String is used to represent a String of 8-bit values containing a sequence of values encoded in one of three ways (UTF-8, UTF-16BE, or UTF-16LE). The subtype UTF_8_String is used to represent a String of 8-bit values containing a sequence of values encoded in UTF-8. The subtype UTF_16_Wide_String is used to represent a Wide_String of 16-bit values containing a sequence of values encoded in UTF-16.
48/3
{AI05-0137-2} {AI05-0262-1} The BOM_8, BOM_16BE, BOM_16LE, and BOM_16 constants correspond to values used at the start of a string to indicate the encoding.
49/3
{AI05-0262-1} {AI05-0269-1} Each of the Encode functions takes a String, Wide_String, or Wide_Wide_String Item parameter that is assumed to be an array of unencoded characters. Each of the Convert functions takes a UTF_String, UTF_8_String, or UTF_16_Wide_String Item parameter that is assumed to contain characters whose position values correspond to a valid encoding sequence according to the encoding scheme required by the function or specified by its Input_Scheme parameter.
50/3
{AI05-0137-2} {AI05-0262-1} {AI05-0269-1} Each of the Convert and Encode functions returns a UTF_String, UTF_8_String, or UTF_16_Wide_String value whose characters have position values that correspond to the encoding of the Item parameter according to the encoding scheme required by the function or specified by its Output_Scheme parameter. For UTF_8, no overlong encoding is returned. A BOM is included at the start of the returned string if the Output_BOM parameter is set to True. The lower bound of the returned string is 1.
51/3
{AI05-0137-2} {AI05-0262-1} Each of the Decode functions takes a UTF_String, UTF_8_String, or UTF_16_Wide_String Item parameter that is assumed to contain characters whose position values correspond to a valid encoding sequence according to the encoding scheme required by the function or specified by its Input_Scheme parameter, and returns the corresponding String, Wide_String, or Wide_Wide_String value. The lower bound of the returned string is 1.
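The following is a minimal sketch, not part of the Reference Manual, of how the Encode and Decode rules above fit together for String. The procedure name Round_Trip_Demo, the renamings UTF and UTF_Str, and the sample text are illustrative assumptions. Declaring the result as UTF_8_String or String is what selects among the overloaded functions.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Strings;

procedure Round_Trip_Demo is
   --  Local renamings, chosen here only for brevity.
   package UTF     renames Ada.Strings.UTF_Encoding;
   package UTF_Str renames Ada.Strings.UTF_Encoding.Strings;

   --  Four unencoded Latin-1 characters; the last has position 16#E9#.
   Plain : constant String := "caf" & Character'Val (16#E9#);

   --  The String result type selects the UTF-8 overload of Encode.
   Encoded : constant UTF.UTF_8_String := UTF_Str.Encode (Plain);

   --  The same text preceded by BOM_8.
   With_BOM : constant UTF.UTF_8_String :=
     UTF_Str.Encode (Plain, Output_BOM => True);

   --  An initial BOM that matches the expected scheme is ignored.
   Decoded : constant String := UTF_Str.Decode (With_BOM);
begin
   Ada.Text_IO.Put_Line
     ("Plain length    :" & Natural'Image (Plain'Length));
   Ada.Text_IO.Put_Line
     ("Encoded length  :" & Natural'Image (Encoded'Length));
   Ada.Text_IO.Put_Line
     ("With BOM length :" & Natural'Image (With_BOM'Length));
   Ada.Text_IO.Put_Line
     ("Lower bound     :" & Integer'Image (Encoded'First));
   Ada.Text_IO.Put_Line
     ("Round trip ok   : " & Boolean'Image (Decoded = Plain));
end Round_Trip_Demo;
```

The character at position 16#E9# needs two bytes in UTF-8, so the encoded string is one element longer than the input, and BOM_8 adds three more.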
52/3
{AI05-0137-2} {AI05-0262-1} For each of the Convert and Decode functions, an initial BOM in the input that matches the expected encoding scheme is ignored, and a different initial BOM causes Encoding_Error to be propagated.
53/3
{AI05-0137-2} The exception Encoding_Error is also propagated in the following situations:
54/4
{AI12-0088-1} By a Convert or Decode function when a UTF encoded string contains an invalid encoding sequence.
54.a/4
To be honest: {AI12-0088-1} An overlong encoding is not invalid for the purposes of this check, and this does not depend on the character set version in use. Some recent character set standards declare overlong encodings to be invalid; it would be unnecessary and unfriendly to users for Convert or Decode to raise an exception for an overlong encoding. 
55/4
{AI12-0088-1} By a Convert or Decode function when the expected encoding is UTF-16BE or UTF-16LE and the input string has an odd length.
56/3
{AI05-0262-1} By a Decode function yielding a String when the decoding of a sequence results in a code point whose value exceeds 16#FF#.
57/3
By a Decode function yielding a Wide_String when the decoding of a sequence results in a code point whose value exceeds 16#FFFF#.
58/3
{AI05-0262-1} By an Encode function taking a Wide_String as input when an invalid character appears in the input. In particular, the characters whose position is in the range 16#D800# .. 16#DFFF# are invalid because they conflict with UTF-16 surrogate encodings, and the characters whose position is 16#FFFE# or 16#FFFF# are also invalid because they conflict with BOM codes. 
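As an illustration of one of the situations listed above (a Decode function yielding a String when a code point exceeds 16#FF#), the sketch below encodes the euro sign (code point 16#20AC#) from a Wide_Wide_String and then attempts to decode it into a String. The procedure name and the renamings are assumptions made for this example only.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Strings;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Encoding_Error_Demo is
   --  Local renamings, chosen here only for brevity.
   package UTF         renames Ada.Strings.UTF_Encoding;
   package UTF_Strings renames Ada.Strings.UTF_Encoding.Strings;
   package UTF_WWS     renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   --  One character whose code point (16#20AC#) exceeds 16#FF#.
   Euro : constant Wide_Wide_String :=
     (1 => Wide_Wide_Character'Val (16#20AC#));

   --  Encoding from Wide_Wide_String is valid for this code point.
   Encoded : constant UTF.UTF_8_String := UTF_WWS.Encode (Euro);
begin
   declare
      --  Decoding to String cannot represent 16#20AC#, so
      --  Encoding_Error is expected to be propagated here.
      Narrow : constant String := UTF_Strings.Decode (Encoded);
   begin
      Ada.Text_IO.Put_Line
        ("Unexpectedly decoded" & Natural'Image (Narrow'Length)
         & " character(s)");
   end;
exception
   when UTF.Encoding_Error =>
      Ada.Text_IO.Put_Line
        ("Encoding_Error: code point does not fit in Character");
end Encoding_Error_Demo;
```

Decoding the same input with the Wide_Strings or Wide_Wide_Strings package would succeed, since 16#20AC# does not exceed 16#FFFF#.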
59/3
{AI05-0137-2} function Encoding (Item    : UTF_String;
                   Default : Encoding_Scheme := UTF_8)
   return Encoding_Scheme;
60/3
{AI05-0137-2} {AI05-0269-1} Inspects a UTF_String value to determine whether it starts with a BOM for UTF-8, UTF-16BE, or UTF-16LE. If so, returns the scheme corresponding to the BOM; otherwise, returns the value of Default.
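A short sketch of how Encoding might be called, with and without a BOM in the input. The procedure name, the renaming UTF, and the sample strings are assumptions introduced for illustration.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;

procedure Detect_Scheme_Demo is
   --  Local renaming, chosen here only for brevity.
   package UTF renames Ada.Strings.UTF_Encoding;

   --  "h" in UTF-16LE (low byte first), preceded by the UTF-16LE BOM.
   With_BOM : constant UTF.UTF_String :=
     UTF.BOM_16LE & "h" & Character'Val (0);

   --  No BOM: Encoding cannot tell the scheme and returns Default.
   No_BOM : constant UTF.UTF_String := "hello";
begin
   --  BOM present: the scheme is taken from the BOM.
   Ada.Text_IO.Put_Line
     (UTF.Encoding_Scheme'Image (UTF.Encoding (With_BOM)));

   --  No BOM, Default left at UTF_8.
   Ada.Text_IO.Put_Line
     (UTF.Encoding_Scheme'Image (UTF.Encoding (No_BOM)));

   --  No BOM, caller supplies a different Default.
   Ada.Text_IO.Put_Line
     (UTF.Encoding_Scheme'Image
        (UTF.Encoding (No_BOM, Default => UTF.UTF_16BE)));
end Detect_Scheme_Demo;
```

Under the rule above, the three calls return UTF_16LE, UTF_8, and UTF_16BE respectively. Encoding only inspects the start of the string; it does not validate the remainder.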
61/3
{AI05-0137-2} function Convert (Item          : UTF_String;
                  Input_Scheme  : Encoding_Scheme;
                  Output_Scheme : Encoding_Scheme;
                  Output_BOM    : Boolean := False) return UTF_String;
62/3
Returns the value of Item (originally encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Input_Scheme) encoded in one of these three schemes as specified by Output_Scheme.
63/3
{AI05-0137-2} function Convert (Item          : UTF_String;
                  Input_Scheme  : Encoding_Scheme;
                  Output_BOM    : Boolean := False)
   return UTF_16_Wide_String;
64/3
Returns the value of Item (originally encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Input_Scheme) encoded in UTF-16.
65/3
{AI05-0137-2} function Convert (Item          : UTF_8_String;
                  Output_BOM    : Boolean := False)
   return UTF_16_Wide_String;
66/3
Returns the value of Item (originally encoded in UTF-8) encoded in UTF-16.
67/3
{AI05-0137-2} function Convert (Item          : UTF_16_Wide_String;
                  Output_Scheme : Encoding_Scheme;
                  Output_BOM    : Boolean := False) return UTF_String;
68/3
Returns the value of Item (originally encoded in UTF-16) encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Output_Scheme.
69/3
{AI05-0137-2} function Convert (Item          : UTF_16_Wide_String;
                  Output_BOM    : Boolean := False) return UTF_8_String;
70/3
Returns the value of Item (originally encoded in UTF-16) encoded in UTF-8.
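The Convert functions described above can be sketched as follows. This example is an illustration under stated assumptions (the procedure name, the renamings, and the sample text "Ada" are not from the Reference Manual); it converts a UTF-8 string to UTF-16BE bytes with a BOM, and to a UTF_16_Wide_String without one.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Conversions;

procedure Convert_Demo is
   --  Local renamings, chosen here only for brevity.
   package UTF  renames Ada.Strings.UTF_Encoding;
   package Conv renames Ada.Strings.UTF_Encoding.Conversions;

   --  Pure ASCII, so the UTF-8 form is the text itself.
   Source : constant UTF.UTF_8_String := "Ada";

   --  Byte-oriented UTF-16BE: BOM_16BE plus two bytes per character.
   As_16BE : constant UTF.UTF_String :=
     Conv.Convert (Item          => Source,
                   Input_Scheme  => UTF.UTF_8,
                   Output_Scheme => UTF.UTF_16BE,
                   Output_BOM    => True);

   --  16-bit UTF-16: one Wide_Character per character of this text.
   As_Wide : constant UTF.UTF_16_Wide_String := Conv.Convert (Source);
begin
   Ada.Text_IO.Put_Line
     ("UTF-16BE bytes     :" & Natural'Image (As_16BE'Length));
   Ada.Text_IO.Put_Line
     ("Starts with BOM    : "
      & Boolean'Image (As_16BE (1 .. 2) = UTF.BOM_16BE));
   Ada.Text_IO.Put_Line
     ("UTF-16 wide length :" & Natural'Image (As_Wide'Length));
end Convert_Demo;
```

The slice As_16BE (1 .. 2) relies on the rule that the lower bound of the returned string is 1.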
71/3
{AI05-0137-2} function Encode (Item          : String;
                 Output_Scheme : Encoding_Scheme;
                 Output_BOM    : Boolean  := False) return UTF_String;
72/3
{AI05-0262-1} Returns the value of Item encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Output_Scheme.
73/3
{AI05-0137-2} function Encode (Item       : String;
                 Output_BOM : Boolean  := False) return UTF_8_String;
74/3
Returns the value of Item encoded in UTF-8.
75/3
{AI05-0137-2} function Encode (Item       : String;
                 Output_BOM : Boolean  := False) return UTF_16_Wide_String;
76/3
Returns the value of Item encoded in UTF-16.
77/3
{AI05-0137-2} function Decode (Item         : UTF_String;
                 Input_Scheme : Encoding_Scheme) return String;
78/3
Returns the result of decoding Item, which is encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Input_Scheme.
79/3
{AI05-0137-2} function Decode (Item : UTF_8_String) return String;
80/3
Returns the result of decoding Item, which is encoded in UTF-8.
81/3
{AI05-0137-2} function Decode (Item : UTF_16_Wide_String) return String;
82/3
Returns the result of decoding Item, which is encoded in UTF-16.
83/3
{AI05-0137-2} function Encode (Item          : Wide_String;
                 Output_Scheme : Encoding_Scheme;
                 Output_BOM    : Boolean  := False) return UTF_String;
84/3
{AI05-0262-1} Returns the value of Item encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Output_Scheme.
85/3
{AI05-0137-2} function Encode (Item       : Wide_String;
                 Output_BOM : Boolean  := False) return UTF_8_String;
86/3
Returns the value of Item encoded in UTF-8.
87/3
{AI05-0137-2} function Encode (Item       : Wide_String;
                 Output_BOM : Boolean  := False) return UTF_16_Wide_String;
88/3
Returns the value of Item encoded in UTF-16.
89/3
{AI05-0137-2} function Decode (Item         : UTF_String;
                 Input_Scheme : Encoding_Scheme) return Wide_String;
90/3
Returns the result of decoding Item, which is encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Input_Scheme.
91/3
{AI05-0137-2} function Decode (Item : UTF_8_String) return Wide_String;
92/3
Returns the result of decoding Item, which is encoded in UTF-8.
93/3
{AI05-0137-2} function Decode (Item : UTF_16_Wide_String) return Wide_String;
94/3
Returns the result of decoding Item, which is encoded in UTF-16.
95/3
{AI05-0137-2} function Encode (Item          : Wide_Wide_String;
                 Output_Scheme : Encoding_Scheme;
                 Output_BOM    : Boolean  := False) return UTF_String;
96/3
{AI05-0262-1} Returns the value of Item encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Output_Scheme.
97/3
{AI05-0137-2} function Encode (Item       : Wide_Wide_String;
                 Output_BOM : Boolean  := False) return UTF_8_String;
98/3
Returns the value of Item encoded in UTF-8.
99/3
{AI05-0137-2} function Encode (Item       : Wide_Wide_String;
                 Output_BOM : Boolean  := False) return UTF_16_Wide_String;
100/3
Returns the value of Item encoded in UTF-16.
101/3
{AI05-0137-2} function Decode (Item         : UTF_String;
                 Input_Scheme : Encoding_Scheme) return Wide_Wide_String;
102/3
Returns the result of decoding Item, which is encoded in UTF-8, UTF-16LE, or UTF-16BE as specified by Input_Scheme.
103/3
{AI05-0137-2} function Decode (Item : UTF_8_String) return Wide_Wide_String;
104/3
Returns the result of decoding Item, which is encoded in UTF-8.
105/3
{AI05-0137-2} function Decode (Item : UTF_16_Wide_String) return Wide_Wide_String;
106/3
Returns the result of decoding Item, which is encoded in UTF-16.

Implementation Advice

107/3
{AI05-0137-2} If an implementation supports other encoding schemes, another similar child of Ada.Strings should be defined.
107.a.1/3
Implementation Advice: If an implementation supports other string encoding schemes, a child of Ada.Strings similar to UTF_Encoding should be defined.
108/3
NOTE   {AI05-0137-2} A BOM (Byte-Order Mark, code position 16#FEFF#) can be included in a file or other entity to indicate the encoding; it is skipped when decoding. Typically, only the first line of a file or other entity contains a BOM. When decoding, the Encoding function can be called on the first line to determine the encoding; this encoding will then be used in subsequent calls to Decode to convert all of the lines to an internal format. 
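The usage pattern in this note can be sketched as below. For brevity the "file" is modelled as two hypothetical in-memory lines rather than real file input; the procedure name, the renamings, and the line contents are assumptions. The scheme is determined once from the first line and then reused for every call to Decode.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Decode_Lines_Demo is
   --  Local renamings, chosen here only for brevity.
   package UTF     renames Ada.Strings.UTF_Encoding;
   package UTF_WWS renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   --  Hypothetical input: only the first line carries a BOM.
   First_Line  : constant UTF.UTF_String := UTF.BOM_8 & "first";
   Second_Line : constant UTF.UTF_String := "second";

   --  Determine the scheme once, from the first line.
   Scheme : constant UTF.Encoding_Scheme := UTF.Encoding (First_Line);

   --  The matching BOM on the first line is skipped by Decode.
   Line_1 : constant Wide_Wide_String :=
     UTF_WWS.Decode (First_Line, Scheme);

   --  Later lines have no BOM and are decoded with the same scheme.
   Line_2 : constant Wide_Wide_String :=
     UTF_WWS.Decode (Second_Line, Scheme);
begin
   Ada.Text_IO.Put_Line
     ("Scheme        : " & UTF.Encoding_Scheme'Image (Scheme));
   Ada.Text_IO.Put_Line
     ("Line 1 length :" & Natural'Image (Line_1'Length));
   Ada.Text_IO.Put_Line
     ("Line 2 length :" & Natural'Image (Line_2'Length));
end Decode_Lines_Demo;
```

Decoding to Wide_Wide_String is used here as the internal format because it can hold any code point; a program that only expects Latin-1 text could use the Strings package instead and handle Encoding_Error.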

Extensions to Ada 2005

108.a/3
{AI05-0137-2} The packages Strings.UTF_Encoding, Strings.UTF_Encoding.Conversions, Strings.UTF_Encoding.Strings, Strings.UTF_Encoding.Wide_Strings, and Strings.UTF_Encoding.Wide_Wide_Strings are new. 

Wording Changes from Ada 2012

108.b/4
{AI12-0088-1} Corrigendum: Corrected the omission of the requirement that Convert routines make the same checks on their input as Decode routines.

Ada 2005 and 2012 Editions sponsored in part by Ada-Europe