W3C Unicode

Unicode in XML and other Markup Languages

Unicode Technical Report #20

W3C Note 18 February 2002

Revision (Unicode):
6
This version:
http://www.unicode.org/unicode/reports/tr20/tr20-6.html
http://www.w3.org/TR/2002/NOTE-unicode-xml-20020218
Latest version:
http://www.unicode.org/unicode/reports/tr20/
http://www.w3.org/TR/unicode-xml/
Previous version:
http://www.unicode.org/unicode/reports/tr20/tr20-5.html
http://www.w3.org/TR/2000/NOTE-unicode-xml-20001215/
Date (Unicode):
2002-02-18
Authors:
Martin Dürst (duerst@w3.org) and Asmus Freytag (asmus@unicode.org)

Summary

This document contains guidelines on the use of the Unicode Standard in conjunction with markup languages such as XML.

Status of this document (common)

This is a Technical Report published jointly by the Unicode Technical Committee and by the W3C Internationalization Working Group/Interest Group (W3C Members only) in the context of the W3C Internationalization Activity.

The base version of the Unicode Standard for this document is Version 3.2. For more information about versions of the Unicode Standard, see http://www.unicode.org/unicode/standard/versions/. Both the Unicode Standard and markup technologies are evolving. When appropriate, a new version of this document may be published.

Please mail corrigenda and other comments to the authors.

Status of this document (Unicode Consortium)

This document has been reviewed by Unicode members and other interested parties, and has been approved by the Unicode Technical Committee as a Unicode Technical Report. It is a stable document and may be used as reference material or cited as a normative reference from another document.

A Unicode Technical Report (UTR) may contain either informative material or normative specifications, or both. Each UTR may specify a base version of the Unicode Standard. In that case, conformance to the UTR requires conformance to that version or higher.

A list of current Unicode Technical Reports is found on http://www.unicode.org/unicode/reports/

Status of this document (W3C)

This Note has been endorsed by the W3C Internationalization Working Group/Interest Group, but has not been reviewed or endorsed by W3C Members.

A list of current W3C Technical Reports can be found at http://www.w3.org/TR/.

Table of Contents

  1. Introduction
    1.1 Notation
  2. General Considerations
    2.1 Linearity versus Structure
    2.2 Overlap of Control Code and Markup Semantics
    2.3 Markup and Styling
    2.4 Coincidence of Markup and Functions
    2.5 Extensibility of Markup
  3. Suitability of Characters
    3.1 Characters not Suitable for Use With Markup
    3.2 Format Characters Suitable for Use With Markup
    3.3 Line and Paragraph Separator
    3.4 Bidi Embedding Controls
    3.5 Deprecated Formatting Characters
    3.6 Byte Order Mark
    3.7 Interlinear Annotation Characters
    3.8 Object Replacement Character
    3.9 Musical Controls
    3.10 Language Tag Characters
  4. Characters with Compatibility Mappings
    4.1 Overview
    4.2 Generating New Text
    4.3 List item Marker Characters
    4.4 Fractions
    4.5 Squared or Horizontal
    4.6 Superscripts and Subscripts
    4.7 Other Characters Marked <compat>
  5. Versioning
  6. Conformance
  7. References
  8. Acknowledgements
  9. Change History
  10. Copyright

1. Introduction

The Unicode Standard [Unicode] defines the universal character set. Its primary goal is to provide an unambiguous encoding of the content of plain text, ultimately covering all languages in the world. Currently in its third major version, Unicode contains a large number of characters covering most of the currently used scripts in the world. It also contains additional characters for interoperability with older character encodings, and characters with control-like functions included primarily to provide unambiguous interpretation of plain text. Unicode provides specifications for use of all of these characters.

For document and data interchange, the Internet and the World Wide Web increasingly make use of marked-up text such as HTML and XML. In many instances, markup provides features that are the same as, or essentially similar to, those provided by format characters in the Unicode Standard for use in plain text. Compatibility characters are another special category of characters provided by Unicode. While there may be valid reasons to support these characters and their specifications in plain text, their use in marked-up text can conflict with the rules of the markup language. Formatting characters are discussed in chapters 2 and 3, compatibility characters in chapter 4.

Issues resulting from canonical equivalences and Normalization [Normalization] as well as the interaction of character encoding and methods of escaping characters in markup are discussed in the Character Model for the World Wide Web [Charmod].

The issues of using Unicode characters with marked-up text depend to some degree on the rules of the markup language in question and the set of elements it contains. In a narrow sense, this document concerns itself only with XML, and to some extent HTML. However, much of the general information presented here should be useful in a broader context, including some page layout languages.

Note: Many of the recommendations of this report depend on the availability of particular markup. Where possible, appropriate DTDs or Schemas should be used or designed to make such markup available, or the DTDs or Schemas used should be appropriately extended. The current version of this document makes no specific recommendations for the design of DTDs or schemas, or for the use of particular DTDs or Schemas, but the information presented here may be useful to designers of DTDs and Schemas, and to people selecting DTDs or Schemas for their applications. The recommendations of this report do not apply in the case of XML used for blind data transport and similar cases.

1.1 Notation

This report uses XML [XML] as a prominent and general example of markup. The XML namespace notation [Namespace] is used to indicate that a certain element is taken from a specific markup language. As an example, the prefix 'xhtml:' indicates that this element is taken from [XHTML]. This means that the examples containing the namespace prefix 'xhtml:' are assumed to include a namespace declaration of xmlns:xhtml="..." 

Characters are denoted using the notation used in the Unicode Standard, i.e. an optional U+ followed by their hexadecimal number, using at least 4 digits, such as "U+1234" or "U+10FFFD". In XML or HTML this could be expressed as "&#x1234;" or "&#x10FFFD;".
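The two notations above can be produced mechanically from a code point. The following is a minimal sketch in Python; the function names u_plus and char_ref are illustrative only and do not come from the Unicode Standard or from XML.

```python
def u_plus(cp: int) -> str:
    """Format a code point in U+ notation, padded to at least four hex digits."""
    return f"U+{cp:04X}"


def char_ref(cp: int) -> str:
    """Format a code point as a hexadecimal numeric character reference."""
    return f"&#x{cp:X};"


if __name__ == "__main__":
    for cp in (0x1234, 0x10FFFD):
        print(u_plus(cp), char_ref(cp))
```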

2. General Considerations

There are several general points to consider when looking at the interaction between character encoding and markup. 

2.1 Linearity versus Structure

Encoding text as a sequence of characters without further information leads to a linear sequence, commonly called plain text. Character follows character, without any particular structure. Markup, on the other hand, defines a hierarchical structure for the text or data. In the case of XML and most other, similar markup languages, the markup defines a tree structure. While this tree structure is linearized for transmission in the XML document, once the document has been parsed, the tree is available directly.

Operations that are easy to perform on trees are often difficult to perform on linear sequences and vice versa. By separating functionality between character encoding and markup appropriately, the architecture becomes simpler, more powerful and longer-lasting.

In particular, operations on hierarchical structures can easily make sure that information is kept in context. Attributes assigned to parts of a document are moved together with the associated part of the document. Assigning an attribute to a part of a document limits the scope of the attribute to that part of the document. Performing the same operations on linear sequences of characters using control codes to set attributes and to delimit their scope requires much more work and is error prone. Locating the start or end of a span of text of the same attribute requires scanning backwards and forwards for the embedded delimiter or control code. Moving or editing text often results in mismatched control codes, so that an attribute might suddenly apply to text it was not intended for.

2.2 Overlap of Control Code and Markup Semantics

When markup is not available, plain text may require control characters. This is usually the case where plain text must contain some scoping or attribute information in order to be legible, i.e. to be able to transmit the same content between originator and receiver. Many of these control characters have direct equivalents in particular markup languages, since markup handles these concerns efficiently. If both characters and their markup equivalents may be present in the same text, the question of priority is raised. Therefore it is important to identify and resolve these ambiguities at the time markup is first applied.

2.3 Markup and Styling

Besides the basic character encoding and text markup there is a third contributor to text functionality, namely styling. Markup is concerned with the logical structure of the text or data, e.g. to indicate sections, subsections, and headers in a document, or to indicate the various fields of an address record. Styling is used to present the information in various ways, e.g. in different fonts, different type styles (italic, bold), different colors, etc. Some character codes do not encode a generic character, but a styled character. Where these characters are used, styling information is frozen, i.e. it is no longer possible to alter the appearance of the text by applying style information. However, there are many examples where a historically free stylistic variation has over time become a semantic distinction that is properly encoded as plain text. Sometimes what is a free variation in some contexts implies strict semantic differentiation in others. In all such instances, altering the appearance of the text by styling information would irreparably alter the content of the text.

2.4 Coincidence of Markup and Functions

Dealing with various functionalities on the markup level has the additional advantage that in most cases, text portions that need some particular attribute (or styling) are actually those text portions identified by markup. A paragraph may be in French, a citation may need a bidi embedding, a keyword may be in italics, a list number may be circled, etc. This makes it very efficient to associate those attributes with markup.

However, where local or point-like functionality is needed, markup is not very efficient and its main benefit, easy manipulation of scope, is not required. On the contrary, the intrusion of markup in the middle of words can make search or sort operations more difficult. For these cases expressing the information as character codes is not only a viable, but often the preferred alternative, which needs to be considered in the design of markup languages.

2.5 Extensibility of Markup

Character encoding works with a range of integers used as character codes. This is extremely efficient, but has some limitations. Markup, on the other hand, is much more extensible. Using technologies such as XML Namespaces [Namespace], various vocabularies can be mixed.

3. Suitability of Characters in Markup

This section discusses the suitability of Unicode characters for use in markup, with particular emphasis on format characters, as well as characters that have been deprecated.

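There are characters which are unsuitable in the context of markup in XML/HTML and whose use is discouraged, because one or more of the following conditions apply: they are deprecated in the Unicode Standard; they are unsupportable without additional data; they are difficult to handle because they are stateful; they are better handled by markup; or they conflict with equivalent markup.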

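Section 3.1 provides a list of such characters. Section 3.2 provides a list of format characters that are suitable for use with markup. Sections 3.3 through 3.10 discuss in more detail the following points for the discouraged characters: short description, reason for inclusion, problems when used in markup, problems with other uses, replacement markup, and what to do if detected.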

3.1 Characters not Suitable for use With Markup

The following table contains the characters currently considered not suitable for use with markup in XML or HTML. (See however the note in the Introduction.) They may also be unsuitable for other markup or page layout languages. For determining possible conflict this report uses the markup available in HTML.

Table 3.1 Characters not suitable for use with markup

Code points       Names/Description                                  Short Comment
U+2028..U+2029    Line and paragraph separator                       Use <xhtml:br />, <xhtml:p></xhtml:p>, or equivalent
U+202A..U+202E    Bidi embedding controls (LRE, RLE, LRO, RLO, PDF)  Strongly discouraged in [HTML 4.0]
U+206A..U+206B    Activate/Inhibit Symmetric swapping                Deprecated in Unicode
U+206C..U+206D    Activate/Inhibit Arabic form shaping               Deprecated in Unicode
U+206E..U+206F    Activate/Inhibit National digit shapes             Deprecated in Unicode
U+FFF9..U+FFFB    Interlinear annotation characters                  Use ruby markup [Ruby]
U+FEFF            Byte order mark / ZWNBSP                           Use only as byte order mark. Use U+2060 Word Joiner instead of using U+FEFF as ZWNBSP
U+FFFC            Object replacement character                       Use markup, e.g. HTML <object> or HTML <img>
U+1D173..U+1D17A  Scoping for Musical Notation                       Use an appropriate markup language
U+E0000..U+E007F  Language Tag code points                           Use xhtml:lang or xml:lang

Except for the Line and Paragraph Separators and the Byte Order Mark, it is acceptable for browsers and similar user agents to ignore the presence of discouraged characters in HTML or XML. It is up to authoring tools to ensure proper conversion between these characters and equivalent markup where it exists.
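An authoring tool could locate the discouraged characters with a simple range lookup. The following Python sketch is only an illustration: the name find_discouraged and the short labels are assumptions made for this example, and the ranges are those of Table 3.1, with the musical controls range as given in Section 3.9. It reports every occurrence, including a U+FEFF at the head of a string, which a caller would normally treat as a byte order mark.

```python
# Ranges of discouraged characters; labels are shortened for illustration.
DISCOURAGED = [
    (0x2028, 0x2029, "line/paragraph separator"),
    (0x202A, 0x202E, "bidi embedding control"),
    (0x206A, 0x206F, "deprecated formatting character"),
    (0xFFF9, 0xFFFB, "interlinear annotation character"),
    (0xFEFF, 0xFEFF, "byte order mark / ZWNBSP"),
    (0xFFFC, 0xFFFC, "object replacement character"),
    (0x1D173, 0x1D17A, "musical scoping control"),
    (0xE0000, 0xE007F, "language tag character"),
]


def find_discouraged(text):
    """Return (index, code point, label) for each discouraged character."""
    hits = []
    for index, ch in enumerate(text):
        cp = ord(ch)
        for low, high, label in DISCOURAGED:
            if low <= cp <= high:
                hits.append((index, f"U+{cp:04X}", label))
                break
    return hits


if __name__ == "__main__":
    for hit in find_discouraged("first line\u2028second line\uFFFC"):
        print(hit)
```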

3.2 Format Characters Suitable for Use with Markup

The following table contains format characters that do not exhibit the problems discussed at the start of this section. Despite their apparent relation to or similarity with characters in table 3.1, they are considered suitable for use with markup. It is not acceptable for user agents to ignore the characters in table 3.2. For a description of these characters see [Unicode], including the recent updates [Unicode31] and [Unicode32].

Table 3.2: Some characters that affect text format but are suitable for use with markup

Code points       Names/Description                                                      Short Comment
U+00A0            No-break space                                                         In Latin-1
U+00AD            Soft Hyphen                                                            In Latin-1
U+034F            Combining Grapheme Joiner
U+070C            Syriac Abbreviation Mark (SAM)
U+0F0C            Tibetan tsheg mark
U+180B..U+180E    Mongolian Variation Selectors (FVS1..FVS3), Mongolian Vowel Separator  Required for Mongolian
U+200C..U+200D    Zero-width Joiners (ZWNJ and ZWJ)                                      Required for Persian, among others
U+200E..U+200F    Implicit directional marks (LRM and RLM)                               LRM and RLM are allowed
U+2011            Non-breaking Hyphen
U+202F            Narrow No-break Space
U+2060            Word Joiner                                                            Use instead of ZWNBSP
U+2061            Function Application                                                   Mathematical use
U+2062            Invisible Times                                                        Mathematical use
U+2063            Invisible Comma                                                        Mathematical use
U+2FF0..U+2FFB    Ideographic character description                                      Graphic characters (not controls)
U+303E            Ideographic variation indicator                                        Graphic character (not a control)
U+FE00..U+FE0F    Variation Selectors                                                    Not graphic characters

3.3 Line and Paragraph Separator, U+2028..U+2029

Short description: The line and paragraph separator provide unambiguous means to denote hard line breaks and paragraph delimiters in plain text.

Reason for inclusion: These characters were introduced into the Unicode Standard to overcome the ambiguous and widely divergent use of control codes for this purpose. See Unicode Technical Report #13, Unicode Newline Guidelines [UAX13].

Problems when used in markup: Including these characters in markup text does not work where it would duplicate the existing markup commands for delimiting paragraphs and lines.

Problems with other uses: The separator characters can also be problematic when used in plain text, because legacy data is usually converted code point for code point into Unicode, and all receivers of Unicode plain text effectively have to be able to interpret the existing use of control codes for this purpose. As a result, fewer Unicode implementations support these characters than would be the case otherwise.

Replacement markup: In HTML, use <xhtml:br /> instead of U+2028 and surround paragraphs by <xhtml:p> and </xhtml:p> instead of separating them with U+2029.

What to do if detected: In a browser context, treat as whitespace. When received in an editing context, replace the character by the corresponding markup. 
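In an editing context, the replacement described above might be sketched as follows. This Python example is illustrative only: the function name separators_to_markup is hypothetical, empty paragraphs are simply dropped, and the input is assumed to be plain text with no existing markup.

```python
from xml.sax.saxutils import escape

LINE_SEPARATOR = "\u2028"
PARAGRAPH_SEPARATOR = "\u2029"


def separators_to_markup(text: str) -> str:
    """Wrap paragraphs in <xhtml:p> and turn line separators into <xhtml:br />."""
    out = []
    for paragraph in text.split(PARAGRAPH_SEPARATOR):
        if not paragraph:
            continue  # assumption: empty paragraphs carry no content
        # Escape markup-significant characters in the plain-text content.
        lines = [escape(line) for line in paragraph.split(LINE_SEPARATOR)]
        out.append("<xhtml:p>" + "<xhtml:br />".join(lines) + "</xhtml:p>")
    return "".join(out)


if __name__ == "__main__":
    print(separators_to_markup("one\u2028two\u2029three & four"))
```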

3.4 Bidi Embedding Controls (LRE, RLE, LRO, RLO, PDF), U+202A .. U+202E

Short description: The bidi embedding controls are required to supplement the Unicode Bidirectional Algorithm in plain text.

Reason for inclusion: The Unicode Bidirectional Algorithm unambiguously resolves the display direction for bidirectional text. It does so by assigning all characters directional categories and then resolving these in context. In a small number of circumstances this implicit method does not produce satisfactory results and embedding controls are needed to ensure that sender and receiver agree on the display direction for a given text. See Unicode Technical Report #9, The Bidirectional Algorithm [UAX 9].

Problems when used in markup: These characters duplicate available markup, which is better suited to handle the stateful nature of their effect. 

Problems with other uses: The embedding controls introduce a state into the plain text, which must be maintained when editing or displaying the text. Processes that are modifying the text without being aware of this state may inadvertently affect the rendering of large portions of the text, for example by removing a PDF.

Replacement markup: The following table gives the replacement markup:

Unicode   Equivalent markup                                   Comment
RLO       <xhtml:bdo dir="rtl">
LRO       <xhtml:bdo dir="ltr">
PDF       </xhtml:bdo>                                        when used to terminate RLO or LRO only, otherwise ignore
RLE       dir="rtl" attribute on block or inline element
LRE       dir="ltr" attribute on block or inline element

For details on bidi markup, please see Section 8.2 of HTML [HTML 4.0-8.2]. The text of HTML 4.0 gives this recommendation:

Using HTML directionality markup with Unicode characters. Authors and designers of authoring software should be aware that conflicts can arise if the dir attribute is used on inline elements (including BDO) concurrently with the corresponding [UNICODE] formatting characters. Preferably one or the other should be used exclusively. The markup method offers a better guarantee of document structural integrity and alleviates some problems when editing bidirectional HTML text with a simple text editor, but some software may be more apt at using the [UNICODE] characters. If both methods are used, great care should be exercised to insure proper nesting of markup and directional embedding or override, otherwise, rendering results are undefined.

This document goes beyond HTML and recommends that only the markup should be used.

What to do if detected: In a browser context, ignore. When received in an editing context, replace the characters by the appropriate markup. 
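The conversion in an editing context needs a stack, because the controls are stateful and may nest. The following Python sketch is one possible approach under stated assumptions: the function name is hypothetical, embeddings (LRE, RLE) are carried by an xhtml:span element because the replacement markup only asks for a dir attribute on some block or inline element, a PDF without a matching opener is dropped, and any scope still open at the end of the text is closed there.

```python
from xml.sax.saxutils import escape

# Opening controls mapped to (element, direction).
# Using xhtml:span for embeddings is an assumption of this sketch.
OPENERS = {
    "\u202A": ("xhtml:span", "ltr"),  # LRE
    "\u202B": ("xhtml:span", "rtl"),  # RLE
    "\u202D": ("xhtml:bdo", "ltr"),   # LRO
    "\u202E": ("xhtml:bdo", "rtl"),   # RLO
}
PDF = "\u202C"


def bidi_controls_to_markup(text: str) -> str:
    """Replace bidi embedding controls in plain text by nested markup."""
    out = []
    stack = []  # elements opened so far and not yet closed
    for ch in text:
        if ch in OPENERS:
            element, direction = OPENERS[ch]
            out.append(f'<{element} dir="{direction}">')
            stack.append(element)
        elif ch == PDF:
            if stack:
                out.append(f"</{stack.pop()}>")
            # A PDF with nothing to terminate is ignored.
        else:
            out.append(escape(ch))
    while stack:  # close scopes left open at the end of the text
        out.append(f"</{stack.pop()}>")
    return "".join(out)


if __name__ == "__main__":
    print(bidi_controls_to_markup("abc \u202EDEF\u202C ghi"))
```

Because the element end tag now delimits the scope, later edits that move or delete text cannot leave an unmatched PDF behind.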

3.5 Deprecated Formatting Characters, U+206A..U+206F

Short description: These characters are deprecated. They were originally intended to allow explicit activation of contextual shaping, numeric digit rendering and symmetric swapping.

Reason for inclusion: These characters were retained from draft versions of ISO 10646.

Problems when used in markup: The processing model for these characters is not supported in markup.

Problems with other uses: The Unicode Standard requires that symmetric swapping, contextual shaping, and alternate digit shapes are enabled by default and no longer supports inhibiting any of them by use of these character codes. The most likely effect of their occurrence in generated text would be that of a 'garbage' character.

Conversion for use with markup: Apply the appropriate conversion to bring the data stream in line with the Unicode text model for bidirectional text and cursively-connected scripts.

What to do if detected: When received by a browser as part of marked up text, they may be ignored. When received in an editing context, they may be removed, possibly with a warning. Alternatively, an appropriate conversion from the legacy text model may be provided. This will most likely be limited to applications directly interfacing with and knowledgeable of the particular legacy implementation that inspired these characters.

3.6 Byte Order Mark, ZWNBSP, U+FEFF

Short description: U+FEFF has two functions. It is formally known as zero width no-break space (ZWNBSP) and can act as a word joiner, but its primary use is as a byte order mark, indicating in a file signature that a file is in a Unicode encoding form and of a particular byte order. The use as word joiner is deprecated for use in new data as of [Unicode32] in favor of U+2060 WORD JOINER. The use as byte order mark remains unaffected.

Reason for inclusion: Originally included in Unicode for the sole purpose of indicating byte order or use in file signatures, the character acquired the ZWNBSP semantics as part of the merger between ISO/IEC 10646 and Unicode. When used as a byte order mark the character is placed at the beginning of a file. If a recipient views it as FEFF then the byte order of sender and receiver matches. If the recipient views it as FFFE (a non-character code point) then the sender used the opposite byte order from the recipient, and the recipient needs to convert the byte order or refuse to read the file. When used as a ZWNBSP the character is intended to prevent breaks between adjacent characters. This function is now provided by U+2060 WORD JOINER (WJ), making it unnecessary to insert U+FEFF in the middle of a file. For more information see [Unicode] and [Unicode32].

Problems when used in markup: Using U+FEFF as ZWNBSP makes it impossible to distinguish it from the case where a byte order mark was left in the middle of a file inadvertently due to incorrect splicing.

Problems with other uses: The use of byte order mark as ZWNBSP is also problematic when used in plain text, and is not intended for that purpose. The use of U+FEFF in file signatures to indicate byte order is the only recommended use of this character.

Replacement markup: None.

What to do if detected: When received by a browser as part of marked-up text, treat depending on location. At the head of a file, treat as byte order mark (i.e. as part of the character encoding, not as part of the parsed character stream, see e.g. Section 4.3.3 of [XML 1.0]). Otherwise, assume it is older data using it as ZWNBSP. When receiving plain text in an editing environment, editors may take one or more of several actions: replace ZWNBSP in the middle of a file with U+2060 WORD JOINER, or issue a warning to the user.
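The location-dependent treatment can be sketched in a few lines of Python. The function name normalize_feff is hypothetical. The sketch drops a leading U+FEFF as a byte order mark and replaces any later occurrence by U+2060 WORD JOINER, the character this section names as the successor for the ZWNBSP function; the returned count lets the caller decide whether to warn the user.

```python
BOM = "\uFEFF"
WORD_JOINER = "\u2060"


def normalize_feff(text: str):
    """Return (cleaned text, number of in-text U+FEFF characters replaced)."""
    if text.startswith(BOM):
        text = text[1:]  # at the head of the data: a byte order mark, not content
    replaced = text.count(BOM)
    return text.replace(BOM, WORD_JOINER), replaced


if __name__ == "__main__":
    cleaned, replaced = normalize_feff("\uFEFFnon\uFEFFbreaking")
    print(repr(cleaned), replaced)
```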

3.7 Interlinear Annotation Characters, U+FFF9..U+FFFB

Short description: The interlinear annotation characters are used to delimit interlinear annotations in certain circumstances. They are intended to provide text anchors and delimiters for interlinear annotation for in-process use and are not intended for interchange.

Reason for inclusion: The interlinear annotation characters were included in Unicode only in order to reserve code points for very frequent application-internal use. The interlinear annotation characters are used to delimit interlinear annotations in contexts where other delimiters are not available, and where non-textual means exist to carry formatting information. Many text-processing applications store the text and the associated markup (or in some cases styling information) of a document in separate structures. The actual text is kept in a single linear structure; additional information is kept separately with pointers to the appropriate text positions. This is called out-of-band information. The overall implementation makes sure that these two structures are kept in sync. If the text contains interlinear annotations, it is extremely helpful for implementations to have delimiters in the text itself; even though delimiters are not otherwise used for style markup. With this method, and unlike the case of the object replacement character, all textual information can remain in the standard text stream, but any additional formatting information is kept separately. In addition, the Interlinear Annotation Anchor serves as a placeholder for formatting information for the whole annotation object, the same way a paragraph mark can be a placeholder to attach paragraph formatting information.

Problems when used in markup: Including interlinear annotation characters in marked-up text does not work because the additional formatting information (how to position the annotation,...) is not available.

Problems with other uses: The interlinear annotation characters are also problematic when used in plain text, and are not intended for that purpose. In particular, on older display systems that simply ignore or replace the Interlinear Annotation Characters, the meaning of the text may be changed.

Replacement markup: The markup to be used in place of the Interlinear Annotation Characters depends on the formatting and nature of the interlinear annotation in question. For ruby, please see [Ruby].

What to do if detected: When received by a browser as part of marked-up text, they may be ignored. When receiving plain text in an editing environment, editors may take one or more of several actions: remove U+FFF9 together with all characters from U+FFFA through the following U+FFFB; ignore U+FFF9 and turn U+FFFA and U+FFFB into "[" and "]" respectively; issue a warning to the user; or tentatively convert into appropriate ruby markup for further editing and formatting by the user.
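The editing actions listed above can be sketched with one regular expression. The Python below is illustrative only: the function names are hypothetical, nested annotations are not handled, no escaping of markup-significant characters is performed, and the ruby output assumes the simple ruby, rb and rt elements of [Ruby] under the xhtml: prefix.

```python
import re

# U+FFF9 anchor, annotated text, U+FFFA separator, annotation, U+FFFB terminator.
ANNOTATION = re.compile(
    "\uFFF9([^\uFFF9-\uFFFB]*)\uFFFA([^\uFFF9-\uFFFB]*)\uFFFB"
)


def drop_annotations(text: str) -> str:
    """Keep the annotated text; remove the annotation and all three controls."""
    return ANNOTATION.sub(lambda m: m.group(1), text)


def bracket_annotations(text: str) -> str:
    """Drop U+FFF9 and show the annotation in square brackets."""
    return ANNOTATION.sub(lambda m: f"{m.group(1)}[{m.group(2)}]", text)


def annotations_to_ruby(text: str) -> str:
    """Tentatively convert each annotation to simple ruby markup."""
    return ANNOTATION.sub(
        lambda m: (
            f"<xhtml:ruby><xhtml:rb>{m.group(1)}</xhtml:rb>"
            f"<xhtml:rt>{m.group(2)}</xhtml:rt></xhtml:ruby>"
        ),
        text,
    )


if __name__ == "__main__":
    sample = "\uFFF9base\uFFFAnote\uFFFB!"
    print(drop_annotations(sample))
    print(bracket_annotations(sample))
    print(annotations_to_ruby(sample))
```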

3.8 Object Replacement Character, U+FFFC

Short description: The object replacement character is used to stand in place of an object (e.g. an image) included in a text.

Reason for inclusion: The object replacement character was included in Unicode only in order to reserve a codepoint for a very frequent application-internal use. Many text-processing applications store the text and the associated markup (or in some cases styling information) of a document in separate structures. The actual text is kept in a single linear structure; additional information is kept separately with pointers to the appropriate text positions. The overall implementation makes sure that these two structures are kept in sync. If the text contains objects such as images, it is extremely helpful for implementations to have a sentinel in the text itself; any additional information is kept separately.

Problems when used in markup: Including an object replacement character in markup text does not work because the additional information (what object to include,...) is not available.

Problems with other uses: The object replacement character is also problematic when used in plain text, because there is no way in plain text to provide the actual object information or a reference to it.

Replacement markup: The markup to be used in place of the Object Replacement Character depends on the object in question and the markup context it is used in. Typical cases are <xhtml:img src='...' />, <xhtml:object ...>, or <xhtml:applet ...>. These constructs allow providing all additional information needed to identify and use the object in question.

What to do if detected: Browsers may ignore this character. When received in an editing context, if the actual object is accessible, editors may either replace the character by the appropriate markup for that object, or otherwise remove it, ideally providing a warning.

3.9 Musical Controls, U+1D173..U+1D17A

Short description: A series of characters for controlling scope in musical notation.

Reason for inclusion: These characters designate the start and end of common musical constructs. Full musical layout depends on additional information, for example pitch, that cannot be encoded using Unicode. However, many musical symbols may be depicted in isolation (and without assigning pitch) as part of a textual discussion of music. Plain text use of Unicode characters is primarily intended for this latter purpose. The scoping operators can be used to support limited renderings of beams, slurs, phrases, etc. in this context. However, in the context of markup languages, musical scoring calls for a dedicated markup language (analogous to MathML) which would be expected to contain markup for these constructs.

Problems when used in markup: These characters duplicate information that can in principle be expressed in markup.

Problems with other uses: Their special code range allows them to be easily filtered, but applications that do not expect them will treat them as garbage characters.

Replacement markup: Replace with equivalent markup if available.

What to do if detected: Browsers may ignore these characters. When received in an editing context, editors may remove or replace them by equivalent markup.

3.10 Language Tag Characters, U+E0000 .. U+E007F

Short description: A series of characters for expressing language tags, based on existing standards for language tags using the rules in [Unicode31].

Reason for inclusion: These characters allow in-band language tagging in situations where full markup is not available, while allowing easy filtering by applications that do not support them. They were solely included for the benefit of those Internet protocols, such as ACAP, which require a standard mechanism for marking language in UTF-8 strings, and at the same time to avoid the use of other tagging schemes that relied on specific details of the encoding form used.

Problems when used in markup: These characters duplicate information that can be expressed in markup.

Problems with other uses: Their special code range allows them to be easily filtered, but applications that do not expect them will treat them as garbage characters.

Replacement markup: Replace with equivalent language markup. XML and XHTML have the xml:lang attribute. HTML has the lang attribute. These attributes follow different scoping rules than the tag characters, therefore this replacement will generally not be a simple 1:1 substitution.

What to do if detected: Browsers may ignore these characters. When received in an editing context, editors may remove or replace them by equivalent markup.
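As a rough illustration of the replacement, the sketch below decodes tag characters and emits xml:lang markup. It rests on several assumptions that are not part of this report: the function name is hypothetical, U+E0001 is taken as the LANGUAGE TAG introducer, U+E0020..U+E007E as tag characters mapping to ASCII by subtracting 0xE0000, and U+E007F as CANCEL TAG; the language is carried on an xhtml:span element; and the input is plain text. Because the scoping rules differ, a real converter would have to reconcile these spans with the surrounding element structure.

```python
from xml.sax.saxutils import escape, quoteattr

LANGUAGE_TAG = 0xE0001
CANCEL_TAG = 0xE007F


def language_tags_to_markup(text: str) -> str:
    """Replace in-band language tags by xml:lang on a span (a sketch)."""
    out = []
    span_open = False
    i, n = 0, len(text)
    while i < n:
        cp = ord(text[i])
        if cp == LANGUAGE_TAG:
            i += 1
            tag = []
            while i < n and 0xE0020 <= ord(text[i]) <= 0xE007E:
                tag.append(chr(ord(text[i]) - 0xE0000))  # tag character -> ASCII
                i += 1
            if span_open:  # a new tag ends the scope of the previous one
                out.append("</xhtml:span>")
                span_open = False
            if tag:
                lang = "".join(tag)
                out.append(f"<xhtml:span xml:lang={quoteattr(lang)}>")
                span_open = True
        elif cp == CANCEL_TAG:
            if span_open:
                out.append("</xhtml:span>")
                span_open = False
            i += 1
        elif 0xE0000 <= cp <= 0xE007F:
            i += 1  # stray tag character outside a tag: dropped
        else:
            out.append(escape(text[i]))
            i += 1
    if span_open:
        out.append("</xhtml:span>")
    return "".join(out)


if __name__ == "__main__":
    tag = "\U000E0001" + "".join(chr(0xE0000 + ord(c)) for c in "fr")
    print(language_tags_to_markup("Hello " + tag + "bonjour\U000E007F again"))
```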

4. Characters with Compatibility Mappings

The Unicode Standard provides compatibility mappings for a number of characters. Compatibility mappings indicate a relationship to another character, but the exact nature of the relationship varies. In some cases the relationship means "is based on"; in other cases it denotes a property. When plain text is marked up, it may make sense to map some of these characters to their compatibility equivalents and suitable markup. It is important to understand the nature of the distinctions between characters and their compatibility equivalents and the contexts in which these distinctions matter. It is never advisable to apply compatibility mappings indiscriminately. This section provides guidance on when and how to apply compatibility mappings in the case of importing text from non-XML (non-marked-up) sources. The section is organized by the "compatibility tag" associated with each compatibility mapping.

4.1 Overview

The following table gives an overview of the various compatibility characters, organized by "compatibility tag". The first column contains the tag value of the "compatibility tag" from the Unicode Character Database [UnicodeData]. Although these tags use "<" and ">", they do not appear as such in markup and should not be confused with XML tags. Code range indicates which code points the entry applies to. The entry in the Action column summarizes the recommended action to be taken whenever markup is first applied to non-XML text. Each entry indicates whether the characters can be substituted using the compatibility equivalent according to Normalization Form KC of [UAX 15], can be replaced by equivalent markup where available, or should be retained. For some cases, instead of or in addition to markup, style information [CSS2] is needed. Description and usage provides additional information. Sections 4.3 through 4.6 provide additional information for some of these sets of compatibility characters including detailed recommended actions.

Table 4.1 Characters with compatibility mappings

Tag value | Code range | Action | Description and usage
<circled> | all | retain | Circled letters and digits used for list item markers
<compat> | 2100..2101 | retain | Variant letter forms that are used as symbols
<compat> | 2105..2106 | retain | Variant letter forms that are used as symbols
<compat> | 2121 | retain** | For use as single code point in vertical layout
<compat> | 2160..2175 | retain** | For use as single code point in vertical layout
<compat> | 3131..318E | retain | Compatibility Hangul Jamo. These do not conjoin
<compat> | 2002..200A | retain | Fixed width spaces
<compat> | 3200..3229 | use list item marker style | Parenthesized characters used as list item markers
<compat> | 322A..3243 | retain** | Parenthesized characters used as symbols in vertical layout
<compat> | 2474..249B | use list item marker style | Parenthesized or dotted number used as list item marker
<compat> | 249C..24B5 | use list item marker style or normalize* | Parenthesized letters used as list item markers
<compat> | 32C0..32CB | retain** | String used as single code point in vertical layout
<compat> | all other | retain | Maintain; semantic distinctions apply
<final> | all | normalize* | Arabic Presentation forms
<font> | all | retain | Variant letter forms that are used as symbols
<fraction> | all | normalize* | As long as fraction slash is supported!
<initial> | all | normalize* | Arabic Presentation forms
<isolated> | all | normalize* | Arabic Presentation forms
<medial> | all | normalize* | Arabic Presentation forms
<narrow> | all | retain | Half-width characters
<no-break> | all | retain | The compatibility mapping is merely a way to indicate the equivalent character that is not non-breaking. The distinction must be preserved
<small> | all | retain | Precise usage unknown. Maintain, but do not generate
<square> | 3300..3357 | retain*** | Single display cell cluster containing multiple lines of kana for vertical layout
<square> | 3358..337D | retain*** | For use as single code point in vertical layout
<square> | 33E0..33FE | retain*** | For use as single code point in vertical layout
<square> | all other | retain*** | Variant letter form used as symbol in vertical layout
<super> | all | use <sup> markup | Superscripted characters
<sub> | all | use <sub> markup | Subscripted characters
<vertical> | all | normalize* | East Asian Presentation forms
<wide> | all | retain | Full-width characters

Notes:

*Use Normalization Form KC for these particular characters.
**Some symbols used in vertical layout may also become accessible via list item marker style(s).
***At the time of this writing there is no appropriate markup for squared kana clusters or horizontal-in-vertical symbols.
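
As an illustration of applying the table selectively rather than normalizing everything, the following Python sketch reads the compatibility tag of each character and applies Normalization Form KC only for the tags whose Action is "normalize*" for all code points. The function names and the tag set are assumptions made for this example, and the range-specific <compat> rows are deliberately left out. The sketch also normalizes one character at a time, which is a simplification. Note that the Unicode Character Database spells one of the tags <noBreak>.

```python
import unicodedata

# Tags for which Table 4.1 lists "normalize*" for all code points.
# (Assumption for this sketch: range-specific <compat> rows are not handled.)
NORMALIZE_TAGS = {"<final>", "<initial>", "<isolated>", "<medial>",
                  "<fraction>", "<vertical>"}

def compat_tag(ch):
    """Return the compatibility tag of ch (e.g. '<font>'), or None."""
    decomp = unicodedata.decomposition(ch)
    if decomp.startswith("<"):
        return decomp.split(" ", 1)[0]
    return None  # no mapping, or a canonical (untagged) decomposition

def normalize_selectively(text):
    """Apply NFKC only to characters whose tag is in NORMALIZE_TAGS."""
    out = []
    for ch in text:
        if compat_tag(ch) in NORMALIZE_TAGS:
            out.append(unicodedata.normalize("NFKC", ch))
        else:
            out.append(ch)  # retained, e.g. <font>, <wide>, <circled>
    return "".join(out)
```

With this sketch, a single-character fraction is replaced by its fraction slash sequence, while a <font> symbol such as U+210E or a <wide> letter such as U+FF21 passes through unchanged.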

4.2 Generating New Text

Presentation forms and characters for which adequate representation exists as marked up text should never be entered into new data. However, many of the characters with a <font> tag are suitable for new data, as long as they are used in the manner intended, that is, as symbols, with definite semantic differentiation between the different forms. They must not be used to create the appearance of styled text. Conversely, text styles should not be used to carry essential semantic distinctions, for example in mathematics.

For example, to write hello, one should use <i>hello</i> and not the sequence of Unicode characters U+210E, U+212F, U+2113, U+2113, U+2134. Conversely, to indicate Planck's constant one should use U+210E and not <i>h</i>.
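
The following Python sketch shows why the distinction has to be made by the author rather than by a normalizer. It assumes only the standard unicodedata module; the helper name font_tagged is made up for this example. Normalization Form KC maps every <font> character to its plain letter, so it would erase the symbol U+210E just as readily as it would undo the discouraged styled sequence.

```python
import unicodedata

def font_tagged(text):
    """Return the characters in text whose compatibility tag is <font>."""
    return [ch for ch in text
            if unicodedata.decomposition(ch).startswith("<font>")]

styled = "\u210E\u212F\u2113\u2113\u2134"  # discouraged way to write "hello"
planck = "\u210E"                          # PLANCK CONSTANT used as a symbol

print(len(font_tagged(styled)))                        # 5
print(unicodedata.normalize("NFKC", styled))           # hello
print(unicodedata.normalize("NFKC", planck))           # h (distinction lost)
print(unicodedata.normalize("NFC", planck) == planck)  # True (symbol kept)
```

The character data alone cannot tell the symbol use from the styled use; only the context can.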

4.3 List Item Marker Characters

Short description: Characters with a <circled> tag, or characters with a <compat> tag and a compatibility mapping to a parenthesized string.

Reason for inclusion: They are most frequently used for marking enumerated list items, but the characters with a <circled> tag often occur as dingbats or footnote markers in tables. The same characters are used in regular text when citing an item from a numbered list.

Problems when used in markup: These characters do not cause undue interaction with markup.

Problems with other uses: None

Replacement markup: List item style. When generating marked up text, these characters occur only internally in the user agent, when list item styles are rendered. When marking up plain text data, they could be converted to suitable list item styles, if such use can be properly inferred. However, it is often necessary to refer to numbered or lettered list item markers in the text.

Compatibility mappings of the form (n) or (n.) can be kept as single characters, or replaced by list item marker styles. A conversion to list item marker styles allows a simple extension of the set to arbitrary numbers. This is in contrast to circled characters: support for properly generating arbitrary circled numbers is not commonly available, so conversion to list item marker styles does not easily allow an extension of the set of accessible circled numbers.

What to do if detected: No action needs to be taken by browsers. When received in an editing context, substitution of a list item marker style may be appropriate. However, the same characters are very often used as dingbat-like symbols in tables, or may appear in general text, when referring to an item from a list. Therefore the user must have the choice of whether to replace the character.

4.4 Fractions

Short description: Single character fractions such as ½ or ¼.

Reason for inclusion: Subsets of these occur in practically all legacy character sets.

Problems when used in markup: The repertoire is limited to a few common fractions. When used with more general methods of generating fractions, such as MathML [MathML], the usual problem of dual representation arises.

Problems with other uses: Other than normalization issues, these characters present no undue problems in plain text. Where fraction slash is supported, these can be expressed by substituting their compatibility mappings.

Replacement markup: MathML.

What to do if detected: No action needs to be taken by browsers or editors, except when converting plain text to MathML.
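
For the case of converting plain text to MathML, the compatibility mapping already supplies the numerator and denominator, separated by U+2044 FRACTION SLASH. The Python sketch below splits the mapping at that character. The function name is hypothetical, and the MathML namespace declaration is omitted for brevity.

```python
import unicodedata

FRACTION_SLASH = "\u2044"

def fraction_to_mathml(ch):
    """Return a MathML fragment for a single-character fraction, else None."""
    parts = unicodedata.decomposition(ch).split()
    if not parts or parts[0] != "<fraction>":
        return None
    mapping = "".join(chr(int(cp, 16)) for cp in parts[1:])
    numerator, slash, denominator = mapping.partition(FRACTION_SLASH)
    if not (slash and numerator and denominator):
        return None  # e.g. U+215F FRACTION NUMERATOR ONE has no denominator
    return f"<mfrac><mn>{numerator}</mn><mn>{denominator}</mn></mfrac>"

print(fraction_to_mathml("\u00BD"))  # <mfrac><mn>1</mn><mn>2</mn></mfrac>
```

Outside such a conversion the character can simply be left in place, as recommended above.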

4.5 Squared or Horizontal

Short description: Characters that are symbols composed of groups of, typically, kana or Latin letters and digits plus slash, for use in a single display cell in vertical display of text.

Reason for inclusion: Many existing character sets contain these as precomposed characters since, for simple implementations, this is the only way to support the common practice of placing metric units and other abbreviations in a single character cell in vertical text layout.

Problems when used in markup: Proposed markup, including CSS styling, would be able to express an unbounded set of these abbreviations, obviating the need to catalogue them in the character encoding standard and making them more directly accessible to text-based processing, for example searching.

Problems with other uses: The repertoire of these legacy characters is limited; many more combinations are in actual use than are accounted for in character sets. Pre-composed symbols do not make their text content available to search engines. They also require re-encoding for text laid out horizontally.

Replacement markup: The style property text-combine in the CSS3 module: text Working Draft [CSS3Text].

What to do if detected: No action required. (Subject to change pending the outcome of current proposals.)
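
To make the searching problem concrete, the Python sketch below reads the text content hidden inside a squared character from its compatibility mapping. It only inspects the character and leaves the text unchanged, in line with "No action required". The function name is an assumption made for this example.

```python
import unicodedata

def squared_text_content(ch):
    """Return the letters a <square> character stands for, or None."""
    parts = unicodedata.decomposition(ch).split()
    if not parts or parts[0] != "<square>":
        return None
    return "".join(chr(int(cp, 16)) for cp in parts[1:])

print(squared_text_content("\u3391"))  # kHz (U+3391 SQUARE KHZ)
```

A search for "kHz" would not match the single code point U+3391 unless the search tool performs a lookup of this kind.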

4.6 Superscripts and Subscripts

Short description: Mainly superscript and subscript digits, but also signs, parentheses, and some letters.

Reason for inclusion: Superscript and subscript characters occur in many legacy character sets, including Latin-1. Their use in pure plain text is common in databases, for example for metric units in part descriptions (viz. cm²) or for (usually simplified) formulae such as occur in titles of scientific publications. Superscripted and subscripted letters and digits are common in some forms of phonetic or phonemic transcription.

Problems when used in markup: Using these characters directly in markup provides an alternate representation compared to marked up text, leading to different treatment by search engines. However, when superscripts and subscripts are to reflect semantic distinctions, it is easier to work with these meanings encoded in text rather than markup, for example, in phonetic or phonemic transcription.

Problems with other uses: none

Replacement markup: <xhtml:sup> and <xhtml:sub>, or <mathml:msup> and <mathml:msub>.

What to do if detected: No action required by browsers. When received in an editing context, substitute the corresponding markup.
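
The substitution in an editing context might be sketched in Python as follows. The function names are assumptions made for this example, and the output uses unprefixed <sup> and <sub> elements. Adjacent characters with the same tag are grouped into one element. Because modifier letters used in phonetic transcription also carry a <super> tag, a real editor would apply such a function only where the user confirms that the characters are purely presentational.

```python
import unicodedata
from itertools import groupby
from xml.sax.saxutils import escape

TAG_TO_ELEMENT = {"<super>": "sup", "<sub>": "sub"}

def _element_for(ch):
    """Return 'sup', 'sub', or None according to the compatibility tag of ch."""
    tag = unicodedata.decomposition(ch).split(" ", 1)[0]
    return TAG_TO_ELEMENT.get(tag)

def to_sup_sub_markup(text):
    """Replace runs of superscript/subscript characters by <sup>/<sub> markup."""
    out = []
    for element, run in groupby(text, key=_element_for):
        chunk = "".join(run)
        if element is None:
            out.append(escape(chunk))  # ordinary text, XML-escaped
        else:
            base = escape(unicodedata.normalize("NFKC", chunk))
            out.append(f"<{element}>{base}</{element}>")
    return "".join(out)

print(to_sup_sub_markup("cm\u00B2"))   # cm<sup>2</sup>
print(to_sup_sub_markup("H\u2082O"))   # H<sub>2</sub>O
```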

4.7 Other Characters Marked <compat>

Short description: The <compat> label was given to a set of compatibility characters whose further classification was not settled at the time the standard was created. The largest components are list item marker characters.

Reason for inclusion: These characters occur in many legacy character sets.

Problems when used in markup: none. There usually is no equivalent markup.

Problems with other uses: none

Replacement markup: none.

What to do if detected: No action required.

5. Versioning

This report will be updated by the Unicode Technical Committee in cooperation with the W3C Internationalization WG whenever the tables of characters in this document need to be updated as a result of the addition of characters to the Unicode Standard, as a result of a revised determination of the suitability of a given character for use with markup, or when additional background information or recommendations become available.

Each report carries a revision number, which may be used to refer to a specific version of the report. Older versions of the report will remain available. Each version of this report specifies the underlying version of the Unicode Standard.

For more information on the Unicode Standard and its versions, see [UnicodeVersions].

6. Conformance

In the context of the Unicode Standard, the material in this technical report is informative. However, other documents, particularly markup language specifications, may specify conformance including normative references to this document. Such references may have to be updated as a result of future updates to this report as discussed in Section 5.

7. References

[Charmod]
Martin J. Dürst, François Yergeau, Richard Ishida, Misha Wolf, Asmus Freytag, Tex Texin, Eds., Character Model for the World Wide Web, W3C Working Draft 20-Dec-2001, <http://www.w3.org/TR/charmod>.
[CharReq]
Martin J. Dürst, Requirements for String Identity and Character Indexing Definitions for the WWW, W3C Working Draft 10-July-1998, <http://www.w3.org/TR/WD-charreq>.
[CSS2]
Bert Bos, Håkon Wium Lie, Chris Lilley, Ian Jacobs, Eds., Cascading Style Sheets, level 2 (CSS2 Specification), W3C Recommendation 12-May-1998, <http://www.w3.org/TR/REC-CSS2/>.
[CSS3Text]
Michel Suignard, Chris Lilley, Eds., CSS3 module: text, W3C Working Draft 17 May 2001, <http://www.w3.org/TR/css3-text/>.
[HTML 4.0]
Dave Raggett, Arnaud Le Hors, Ian Jacobs, Eds., HTML 4.01 Specification, W3C Recommendation 18-Dec-1997 (revised on 24-Dec-1999), <http://www.w3.org/TR/REC-html40/>.
[HTML 4.0 - 8.2]
Section 8.2 of [HTML 4.0], Specifying the direction of text and tables: the dir attribute, <http://www.w3.org/TR/html40/struct/dirlang.html#h-8.2>.
[MathML]
David Carlisle, Patrick Ion, Robert Miner, Nico Poppelier, Eds., Mathematical Markup Language (MathML) Version 2.0, W3C Recommendation 21-Feb-2001, <http://www.w3.org/TR/REC-MathML/>.
[Namespace]
Tim Bray, Dave Hollander, Andrew Layman, Namespaces in XML, W3C Recommendation 14-Jan-1999, <http://www.w3.org/TR/REC-xml-names/>.
[Ruby]
Marcin Sawicki, Michel Suignard, Masayasu Ishikawa, Martin Dürst, Tex Texin, Eds., Ruby Annotation, W3C Recommendation 31 May 2001, <http://www.w3.org/TR/ruby/>.
[Unicode]
The Unicode Standard, Version 3.0, Addison Wesley, Reading MA, 2000, ISBN: 0-201-61633-5.
[Unicode31]
Unicode Standard Annex #27 Unicode 3.1, The Unicode Consortium, 2001.
[Unicode32]
Unicode Standard Annex #28 Unicode 3.2, The Unicode Consortium, 2002.
[UnicodeData]
Unicode Character Database, <http://www.unicode.org/Public/UNIDATA/UnicodeCharacterDatabase.html>.
[UnicodeFormat]
Unicode Character Database Format, <http://www.unicode.org/Public/UNIDATA/UnicodeData.html>.
[UnicodeVersions]
Versions of the Unicode Standard, <http://www.unicode.org/unicode/standard/versions/>.
[UAX 9]
Mark Davis, Unicode Standard Annex #9, The Bidirectional Algorithm, <http://www.unicode.org/unicode/reports/tr9/>.
[UAX 13]
Mark Davis, Unicode Standard Annex #13, Unicode Newline Guidelines, <http://www.unicode.org/unicode/reports/tr13/>.
[UAX 15]
Mark Davis, Martin Dürst, Unicode Standard Annex #15, Unicode Normalization Forms, <http://www.unicode.org/unicode/reports/tr15/>.
[XHTML]
Steven Pemberton, et al., XHTML1.0: The Extensible HyperText Markup Language - A Reformulation of HTML 4.0 in XML 1.0, W3C Recommendation, <http://www.w3.org/TR/xhtml1/>.
[XML 1.0]
Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, Eds., Extensible Markup Language (XML) 1.0 (Second Edition), W3C Recommendation 6-October-2000, <http://www.w3.org/TR/REC-xml>.

8. Acknowledgements

Mark Davis (mark.davis@us.ibm.com) and Hideki Hiura (hideki.hiura@eng.sun.com) contributed to the early drafts.

9. Change History (last changes first)

Changes from http://www.unicode.org/unicode/reports/tr20/tr20-5.html: Updated reference from Unicode 3.0 to 3.1 and 3.2 where appropriate. Added sections 3.6 and 3.9. Minor wording fixes in sections 2.3, 3.1, 3.2, 3.6, 3.10, 4.3, 4.5 and 5. (AF/MJD)

Changes from http://www.unicode.org/unicode/reports/tr20/tr20-4.html: Added a note to the introduction to limit the scope. Reorganized section 3 and clarified the language. Renamed some sections and tables. Updated the document to prepare for publication as Unicode Technical Report and W3C Note (AF/MJD). Minor editorial changes to the text, added section 4.7, fixed some dates, plus a few typos. (AF)

Changes from http://www.unicode.org/unicode/reports/tr20/tr20-3.html: Minor editorial changes to the introduction, fixed some references, links, and dates, plus a few typos. (AF/MJD)

Changes from http://www.unicode.org/unicode/reports/tr20/tr20-2.html: Added sections 2.1-2.6 (MJD), sections 3.1-3.5, and 3.8, as well as sections 4.4-4.6 and 8 (AF). Edited text for publication as DRAFT Unicode Technical Report. (AF)

Changes from http://www.unicode.org/unicode/reports/tr20/tr20-1.html: Completed references, linked TOC. Various wording changes. Added W3C WD stylesheet, logo, copyright, status of this document. Streamlined authors' section. (MJD) Added material on compatibility characters. (AF)

Changes from the initial draft: Fixed the header. Fixed the numbering. Fixed the title. Put references to final version of data files based on naming conventions. Minor wording changes. Added proposed language on annotation characters to match example on FFFC. Posted for internal review by UTC and W3C. (AF)

10. Copyright

Copyright © 1999-2002 Unicode®, Inc. and W3C® (MIT, INRIA, Keio), All Rights Reserved.

This document is available under the W3C Document License or the Unicode License. Documents available from the W3C have additional warranties, liability, and trademark policies associated with them. The Unicode License specifies warranty/liability and trademark terms including:

The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.