Character Repertoires Background Survey

Printer Working Group Draft, December, 2002

Editors:
Elliott Bradshaw, Oak Technology Imaging Group

Abstract

The Character Repertoires activity within the PWG seeks to agree on a set of repertoires to be used in a wide range of printing products.

The current document summarizes existing work in this area.

Status of this Document

This document is informative only. It has not been reviewed by PWG Members nor approved. It is not a stable document and may not be cited as a normative reference from another document.

Public discussion of Character Repertoires takes place on the mailing list: cr@pwg.org (archive). To subscribe send an email to majordomo@pwg.org with the words subscribe cr in the body. You must be subscribed to the mailing list to post there. Please report errors in this document to one of the editors listed above or on the mailing list.

A list of current PWG Standards and other technical documents can be found at http://www.pwg.org/standards.html.

Glossary

TBD...

Alphabet
...
Character
...
Character set
...
Codepoint
...
Repertoire
...

Bluetooth Basic Printing Profile

Section 12.2.3 of [BPP] summarizes the Character Repertoires Supported field, with values taken from this table:

Bit Number  Character Repertoire  Description
Bit0        ISO-8859-1            Latin alphabet No. 1
Bit1        ISO-8859-2            Latin alphabet No. 2
Bit2        ISO-8859-3            Latin alphabet No. 3
Bit3        ISO-8859-4            Latin alphabet No. 4
Bit4        ISO-8859-5            Latin/Cyrillic alphabet
Bit5        ISO-8859-6            Latin/Arabic alphabet
Bit6        ISO-8859-7            Latin/Greek alphabet
Bit7        ISO-8859-8            Latin/Hebrew alphabet
Bit8        ISO-8859-9            Latin alphabet No. 5
Bit9        ISO-8859-10           Latin alphabet No. 6
Bit10       ISO-8859-13           Latin alphabet No. 7
Bit11       ISO-8859-14           Latin alphabet No. 8
Bit12       ISO-8859-15           Latin alphabet No. 9
Bit13       GB_2312-80            Chinese (People’s Republic of China)
Bit14       Shift_JIS             Japanese
Bit15       KS_C_5601-1987        Korean
Bit16       Big5                  Chinese (Taiwan)
Bit17       TIS-620               Thai
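
As an illustration of how a client might interpret this field, the following Python sketch converts between a bit mask and the repertoire names in the table. It assumes the field has already been read into an integer with Bit0 as the least significant bit; the wire format of the field is not covered in this survey, and the names REPERTOIRE_BITS, decode_repertoires, and encode_repertoires are hypothetical helpers, not part of [BPP].

```python
# Repertoire names in bit order, copied from the table above.
# Assumption: Bit0 is the least significant bit of an integer mask.
REPERTOIRE_BITS = [
    "ISO-8859-1", "ISO-8859-2", "ISO-8859-3", "ISO-8859-4", "ISO-8859-5",
    "ISO-8859-6", "ISO-8859-7", "ISO-8859-8", "ISO-8859-9", "ISO-8859-10",
    "ISO-8859-13", "ISO-8859-14", "ISO-8859-15", "GB_2312-80", "Shift_JIS",
    "KS_C_5601-1987", "Big5", "TIS-620",
]


def decode_repertoires(mask):
    """Return the names of all repertoires whose bit is set in mask."""
    return [name for bit, name in enumerate(REPERTOIRE_BITS) if mask & (1 << bit)]


def encode_repertoires(names):
    """Return the integer mask with one bit set per repertoire name."""
    mask = 0
    for name in names:
        mask |= 1 << REPERTOIRE_BITS.index(name)
    return mask
```

For example, a printer that supports only ISO-8859-1 and Shift_JIS would, under these assumptions, advertise a mask with Bit0 and Bit14 set.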

Most of these repertoires were originally defined outside Unicode. We therefore rely on published tables that map each character set into Unicode; each table yields the list of Unicode values needed to support that repertoire.

Latin/European

[ISO-8859] defines various Latin-based alphabets (each up to 256 characters in size), while [Unicode-8859] is a set of mapping tables from ISO codes to Unicode values.
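
To show how a repertoire could be derived from such a mapping table, here is a minimal Python sketch. It assumes each data line holds a hexadecimal ISO code and a hexadecimal Unicode value separated by whitespace, with "#" starting a comment. The SAMPLE text is an excerpt written for illustration rather than a copy of a published file, and parse_mapping and repertoire are hypothetical helper names.

```python
# Illustrative excerpt in the assumed format: ISO code, Unicode value, comment.
SAMPLE = """\
# Example lines only; not a complete mapping table
0x41\t0x0041\t#\tLATIN CAPITAL LETTER A
0xA0\t0x00A0\t#\tNO-BREAK SPACE
0xA1\t0x0104\t#\tLATIN CAPITAL LETTER A WITH OGONEK
"""


def parse_mapping(text):
    """Parse mapping-table text into a dict of {iso_code: unicode_value}."""
    mapping = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not data:
            continue
        fields = data.split()
        if len(fields) < 2:
            continue  # code listed without a Unicode value: treat as undefined
        mapping[int(fields[0], 16)] = int(fields[1], 16)
    return mapping


def repertoire(mapping, include_controls=False):
    """Return the sorted Unicode values covered by a mapping."""
    def is_control(cp):
        return cp < 0x20 or 0x7F <= cp <= 0x9F
    return sorted(cp for cp in set(mapping.values())
                  if include_controls or not is_control(cp))
```

Control codes are excluded by default on the assumption that a printing repertoire is concerned with printable characters; pass include_controls=True to keep them.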

Microsoft

As part of its OpenType specification, Microsoft defines the WGL4.0 character set [WGL4.0-desc], which is expressed in terms of Unicode [WGL4.0-data]. It has 652 characters, covering many of the characters from the ISO Latin sets as well as quite a few symbols. Any Microsoft client is likely to assume these characters are available.

World Wide Web Consortium

[XHTML-chars] defines a number of predefined character entities, organized into groups, for a total of 254 entries.

Summary of Non-Asian Characters

You can compare the ISO-8859, Microsoft, and XHTML repertoires side by side here.

Asian

These are the relevant fields in the [Unihan] database:

For Thai, use ISO-8859-11, which is equivalent to TIS 620-2533 (1990) with the addition of 0xA0 NO-BREAK SPACE.
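
One way to check this relationship is to compare two decoding tables byte by byte. The Python sketch below uses the standard library codecs named "tis_620" and "iso8859_11"; it assumes those codec tables reflect the standards, which should be confirmed against the standards themselves. The helper names decode_byte and differing_bytes are hypothetical.

```python
def decode_byte(value, codec):
    """Decode one byte with the given codec; return None if it is undefined."""
    try:
        return bytes([value]).decode(codec)
    except UnicodeDecodeError:
        return None


def differing_bytes(codec_a, codec_b, start=0xA0, stop=0xFF):
    """List the byte values in [start, stop] that the two codecs decode differently."""
    return [b for b in range(start, stop + 1)
            if decode_byte(b, codec_a) != decode_byte(b, codec_b)]
```

Calling differing_bytes("tis_620", "iso8859_11") lists the positions in the upper half where the two tables disagree. If the codec tables follow the description above, only 0xA0 should appear.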

Issues

  1. How do we reference ISO-8859? Is there a version online, or does every reader need to buy it from ISO? If readers must buy it, we should list exactly what they need to buy.
  2. What about other ISO-8859 components?
  3. Does the presence of a codepoint with right-to-left property imply that bidi processing is required in the printer?
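
Issue 3 depends first on knowing which repertoires contain right-to-left codepoints at all. The Python sketch below lists them using the Unicode bidirectional class. It assumes the standard library codecs approximate the repertoires, and that the classes "R" and "AL" are the ones of interest; rtl_characters and codec_repertoire are hypothetical helpers. It does not settle whether bidi processing belongs in the printer.

```python
import unicodedata

# Bidirectional classes treated as right-to-left for this check (assumption):
# "R" covers Hebrew-style letters, "AL" covers Arabic letters.
RTL_CLASSES = {"R", "AL"}


def codec_repertoire(codec):
    """Return the set of characters a single-byte codec can decode."""
    chars = set()
    for value in range(256):
        try:
            chars.add(bytes([value]).decode(codec))
        except UnicodeDecodeError:
            pass  # byte is undefined in this character set
    return chars


def rtl_characters(codec):
    """Return the sorted right-to-left characters in a codec's repertoire."""
    return sorted(ch for ch in codec_repertoire(codec)
                  if unicodedata.bidirectional(ch) in RTL_CLASSES)
```

Under these assumptions, rtl_characters("iso8859_1") is empty, while the Hebrew and Arabic sets ("iso8859_8" and "iso8859_6") return non-empty lists.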

References

[BPP]
"Bluetooth Basic Printing Profile", Bluetooth SIG, October 5, 2001. Available at: http://www.bluetooth.com/pdf/Basic_Printing_Profile_0_95a.pdf
[ISO-8859]
...purchase each alphabet online at http://www.iso.org.
[Unicode-8859]
Mapping tables from 8859 alphabets to Unicode.  
http://www.unicode.org/Public/MAPPINGS/ISO8859/
[Unihan]
Asian property database for Unicode; includes mappings from other alphabets. A very large file; zip form available at http://www.unicode.org/Public/UNIDATA/Unihan.zip.
[WGL4.0-desc]
Description of Microsoft's character set standard which "includes characters required by Western, Central, and Eastern European writing systems, as well as characters required by Greek and Turkish." http://www.microsoft.com/typography/unicode/cscp.htm
[WGL4.0-data]
Unicode values for WGL4.0. http://www.microsoft.com/typography/OTSPEC/WGL4.htm.
[XHTML-chars]
Predefined character entities in XHTML. http://www.w3.org/TR/xhtml-modularization/dtd_module_defs.html#a_xhtml_character_entities