Unicode Character Database

Extended Character Properties

Revision: 3.1.1
Authors: Mark Davis
Date: 2001-07-12
This Version: http://www.unicode.org/Public/3.1-Update1/PropList-3.1.1.html
Previous Version: n/a
Latest Version: http://www.unicode.org/Public/UNIDATA/PropList.html


Summary

This document describes the format and content of the PropList.txt data file in the Unicode Character Database (UCD).

Status

This file and the files described herein are part of the Unicode Character Database and are governed by the UCD Terms of Use given below.

For general information on file formats and table formats, and the implications of normative vs informative properties, see UnicodeCharacterDatabase.html.

Warning: the information in this file does not completely describe the use and interpretation of Unicode character properties and behavior. It must be used in conjunction with the data in the other files in the UCD, and relies on the notation and definitions supplied in The Unicode Standard. All chapter references are to Version 3.1.0 of the standard.


Introduction

PropList.txt contains extended properties that supplement the General Category property described in UnicodeData.html. Unlike the derived properties, the properties in PropList.txt cannot be derived directly from UnicodeData.txt or other data files of the UCD. These properties are listed in the following table.

Property Value | N/I | Definition and Usage
White_space | N | Space characters and those format control characters (such as TAB, CR and LF) which should be treated by programming languages as "white space" for the purpose of parsing elements. Note: ZERO WIDTH SPACE and ZERO WIDTH NO-BREAK SPACE are not included, since their functions are restricted to line-break control. Their names are unfortunately misleading in this respect. Note: There are other senses of "whitespace" that encompass a different set of characters.
Bidi_Control | N | Those format control characters which have specific functions in the Bidirectional Algorithm.
Join_Control | N | Those format control characters which have specific functions for control of cursive joining and ligation.
ASCII_Hex_Digit | N | ASCII characters commonly used for the representation of hexadecimal numbers.
Dash | I | Those punctuation characters explicitly called out as dashes in the Unicode Standard, plus compatibility equivalents to those. Most of these have the Pd General Category, but some have the Sm General Category because of their use in mathematics.
Hyphen | I | Those dashes used to mark connections between pieces of words, plus the Katakana middle dot. The Katakana middle dot functions like a hyphen, but is shaped like a dot rather than a dash.
Quotation_Mark | I | Those punctuation characters that function as quotation marks.
Terminal_Punctuation | I | Those punctuation characters that generally mark the end of textual units.
Other_Math | I | Math characters that do not have the Sm General Category.
Hex_Digit | I | Characters commonly used for the representation of hexadecimal numbers, plus their compatibility equivalents.
Other_Alphabetic | I | Alphabetic characters that do not have L as their major class for the General Category (Lu, Ll, Lt, Lm, Lo).
Ideographic | I | Characters considered to be CJKV (Chinese, Japanese, Korean, and Vietnamese) ideographs.
Diacritic | I | Characters that linguistically modify the meaning of another character to which they apply. Some diacritics are not combining characters, and some combining characters are not diacritics.
Extender | I | Characters whose principal function is to extend the value or shape of a preceding alphabetic character. Typical of these are length and iteration marks.
Other_Lowercase | I | Lowercase characters that do not have the Ll General Category.
Other_Uppercase | I | Uppercase characters that do not have the Lu General Category.
Noncharacter_Code_Point | N | Code points that are explicitly defined as illegal for the encoding of characters. See Unicode 3.1 for more information.
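
The properties in the table above are supplied as data in PropList.txt. As a rough illustration of how such data might be consumed, the following Python sketch reads property assignments into a lookup table. It assumes a line layout that this document does not itself specify: a code point or a `first..last` range in hexadecimal, a semicolon, the property name, and an optional comment introduced by `#`. The sample lines and the helper names `parse_prop_list` and `has_property` are hypothetical and exist only for this example; consult the data file itself for the authoritative layout and contents.

```python
from collections import defaultdict

# Illustrative sample in the ASSUMED layout; not a complete copy of PropList.txt.
SAMPLE = """\
# Comment lines start with '#'
0009..000D    ; White_space # Cc   [5] <control>..<control>
0020          ; White_space # Zs       SPACE
0030..0039    ; ASCII_Hex_Digit # Nd  [10] DIGIT ZERO..DIGIT NINE
200C..200D    ; Join_Control # Cf   [2] ZERO WIDTH NON-JOINER..ZERO WIDTH JOINER
"""


def parse_prop_list(text):
    """Map each property name to a list of inclusive (first, last) ranges."""
    props = defaultdict(list)
    for raw in text.splitlines():
        # Drop the trailing comment, then skip blank and comment-only lines.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        code_points, name = (field.strip() for field in line.split(";", 1))
        # A single code point has no ".." part, so it becomes a one-element range.
        first, _, last = code_points.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        props[name].append((start, end))
    return dict(props)


def has_property(props, name, ch):
    """Return True if character ch falls in any range recorded for the property."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in props.get(name, ()))


props = parse_prop_list(SAMPLE)
print(has_property(props, "White_space", "\t"))       # TAB is listed in the sample
print(has_property(props, "White_space", "\u200b"))   # ZERO WIDTH SPACE is not
```

A linear scan over ranges keeps the sketch short; an implementation handling the full file would more likely sort the ranges and use binary search, or expand them into a bit set.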


UCD Terms of Use

Disclaimer

The Unicode Character Database is provided as is by Unicode, Inc. No claims are made as to fitness for any particular purpose. No warranties of any kind are expressed or implied. The recipient agrees to determine applicability of information provided. If this file has been purchased on magnetic or optical media from Unicode, Inc., the sole remedy for any claim will be exchange of defective media within 90 days of receipt.

This disclaimer is applicable for all other data files accompanying the Unicode Character Database, some of which have been compiled by the Unicode Consortium, and some of which have been supplied by other sources.

Limitations on Rights to Redistribute This Data

Recipient is granted the right to make copies in any form for internal distribution and to freely use the information supplied in the creation of products supporting the Unicode™ Standard. The files in the Unicode Character Database can be redistributed to third parties or other organizations (whether for profit or not) as long as this notice and the disclaimer notice are retained. Information can be extracted from these files and used in documentation or programs, as long as there is an accompanying notice indicating the source.

