Character Repertoires Implementor's Guide

Printer Working Group Draft, January 13, 2003

Editors:
Elliott Bradshaw, Oak Technology Imaging Group

Abstract

When sending a job to a printer, a print client (PC or other device) needs to make sure the printer can print the characters in the job.  On PCs and similar devices, clients traditionally use font downloading to supply characters that the printer may not have.  On smaller devices, including PDAs, set-top boxes, etc., this will often not be an option.

This document provides guidance for implementors of printers and printing clients, including summaries and references to existing standards, recommended practices, and recommendations for future standards.

Status of this Document

This document is informative only. It has been neither reviewed nor approved by PWG Members. It is not a stable document and may not be cited as a normative reference from another document.

Public discussion of Character Repertoires takes place on the mailing list: cr@pwg.org (archive). To subscribe send an email to majordomo@pwg.org with the words subscribe cr in the body. You must be subscribed to the mailing list to post there. Please report errors in this document to one of the editors listed above or on the mailing list.

Revision History

January 13, 2003:

General Approach

There is very little new material in this document.  Rather, it is an attempt to summarize a complex subject, provide a conceptual framework, and bring together references so that a non-specialist can quickly find what is needed for managing printable characters.  [The present author often feels, while surfing the web, that he is rediscovering what was well-known in a different time and place.]

A second goal is to clarify areas where more standards work is needed.

We assume the reader has some familiarity with Internet technologies such as Unicode, MIME, and XML.  Older technologies are used only as needed for specific applications, and can usually be mapped into or associated with corresponding Internet technologies.  This approach has two principal advantages:

Primary References

We take technical material from these general areas:

Useful background reading can be found at:

Unicode

In this document we rely heavily on the Unicode scheme for organizing characters.  Much of the following material is excerpted from [Unicode-principles].

Unicode is a widely-adopted, worldwide character encoding standard.  For each character it defines:

Some examples include:

The actual appearance of the character on paper or screen is called a "glyph", and varies based on device, font, etc.  Unicode does not define glyphs, although it does give examples.

Character encodings define how these numeric values are represented in bits.  Unicode defines three encodings:

In order to print successfully, a client needs to know both which characters (code points) are available and which encodings can be used.
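As a small illustration of the difference between a code point and its encoded form, the following Python sketch encodes one character in UTF-8, UTF-16, and UTF-32, the three encoding forms described in the Unicode Standard.  The function name is ours, and the choice of big-endian byte order without a byte order mark is an assumption made only to keep the output easy to read.

```python
def encoded_forms(ch):
    """Return the code point of ch and its bytes (as hex) in three encodings."""
    return {
        "code point": "U+%04X" % ord(ch),
        "UTF-8": ch.encode("utf-8").hex(),
        "UTF-16": ch.encode("utf-16-be").hex(),   # big-endian, no BOM
        "UTF-32": ch.encode("utf-32-be").hex(),   # big-endian, no BOM
    }


forms = encoded_forms("\u20ac")  # EURO SIGN
for label, value in forms.items():
    print(label, value)
```

The same character occupies three, two, and four octets respectively, which is why a client has to know the accepted encoding as well as the available code points.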

There are many character sets that are not based on Unicode, and several of these are important for printing existing documents.  Fortunately, nearly all have published mappings into Unicode.  Therefore, knowing what Unicode characters are available, a client can deduce which characters are available from an alternate character set.  In addition, the client needs to find out whether the printer can accept characters encoded in the alternate character set, or whether the client must map them to Unicode.
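The deduction described above can be sketched in Python.  This is only an illustration under stated assumptions: it handles single-byte character sets only, it relies on the mapping tables built into Python's codecs rather than on any mapping endorsed by this Guide, and the function names are hypothetical.

```python
import unicodedata


def unicode_repertoire(charset):
    """Unicode characters reachable from a single-byte charset.

    Control characters (general category Cc) are left out, since they
    are not printable.
    """
    chars = set()
    for byte in range(256):
        try:
            ch = bytes([byte]).decode(charset)
        except UnicodeDecodeError:
            continue  # this byte value is unassigned in the charset
        if unicodedata.category(ch) != "Cc":
            chars.add(ch)
    return chars


def available_from(charset, printer_chars):
    """Split a charset's repertoire into characters the printer has and lacks."""
    wanted = unicode_repertoire(charset)
    return wanted & printer_chars, wanted - printer_chars


# Assumption for the example: the printer supports exactly the
# ISO 8859-1 repertoire, and the document is in ISO 8859-15.
printer_chars = unicode_repertoire("iso-8859-1")
have, lack = available_from("iso-8859-15", printer_chars)
```

Here `lack` contains the characters, such as the euro sign, that ISO 8859-15 defines but ISO 8859-1 does not.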

Of course a printer may be configured to accept other character sets, but not those based on Unicode.  However, such a printer is outside the scope of this Guide.

In summary, before printing a job a client needs to determine this information about the printer:

It may also be that some characters are conditionally available, e.g. only when certain fonts are selected.  This topic is reserved for future work, and is not considered in this Guide.  In fact, one recommendation is that a printer implement a system default font that can be used to render its full character set, and that this font be used as a fall-through to handle missing characters in other fonts.

Terminology

charset
A method of converting a sequence of octets into a sequence of characters.  This is the sense in which the term is used in the MIME registry.  See [RFC-2278] and [XML-Japanese] for discussions of the complexities of this term.

We use the term "repertoire" in two ways:

repertoire
(1) The complete set of characters defined in a given named character set, such as ISO 8859-1.
(2) The subset of characters defined in Unicode 3.2 that are needed for an exact mapping to a smaller character set, such as ISO 8859-1.

[Issue: should we use the term "character collection" instead?]

For purposes of this document we focus primarily on the second definition.  We rely on Unicode for the definition of characters, and on various repertoires to tell which Unicode characters are actually present.

Examples of "charsets":

Examples of "repertoires":

Internet Charsets

Historically, the Internet community created standards for charsets based on the need to agree on coding schemes for email using MIME.  These MIME definitions have been incorporated into HTTP, XML, and most other web-based specifications.

The IANA registry (long) of charsets is available at [IANA-charsets].  Every registered charset contains at least:

In some cases, an alternate "preferred MIME name" is given.  In those cases that is the name we use.

In MIME and HTTP headers, the charset is indicated with the "charset" parameter [Issue: verify this].

In XML, the charset may be indicated with a text declaration containing an encoding declaration (see [XML] Section 4.3), e.g.:
    <?xml encoding='UTF-8'?>

Printing languages based on XML may therefore use an XML text declaration to choose a non-Unicode charset, if this charset is supported by the printer.
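A receiver might read the declared charset along the following lines.  This is a minimal sketch, not a conforming XML parser: it assumes the data starts with the declaration and is in an ASCII-compatible encoding (so it ignores byte order marks and UTF-16), and the function name and the `default` parameter are ours.

```python
import re

# Optional version, then the encoding name in single or double quotes.
_TEXT_DECL = re.compile(
    rb"""<\?xml(?:\s+version\s*=\s*(["'])[^"']*\1)?"""
    rb"""\s+encoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\2"""
)


def declared_charset(data, default="UTF-8"):
    """Return the charset named in a leading XML declaration, else default."""
    match = _TEXT_DECL.match(data)
    if match is None:
        return default
    return match.group(3).decode("ascii")


name = declared_charset(b"<?xml encoding='UTF-8'?>")
other = declared_charset(b'<?xml version="1.0" encoding="Shift_JIS"?><doc/>')
```

A printer that finds a charset it does not support would then have to reject the job or fall back according to its own policy, which this sketch does not address.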

Microsoft Codepages

As a practical matter one can't ignore the influence of Microsoft on printing applications.  Microsoft has converted to a Unicode-centric approach to their codepages, and each of their codepages is based on a published standard.  However, in some cases Microsoft has added [Issue: and changed/removed?] characters.

Discussion of Referenced Character Sets

Latin

[ISO-8859] defines various Latin-based alphabets (each up to 256 characters in size), while [Unicode-8859] is a set of mappings from ISO codes to Unicode code points.

In the XHTML community, [XHTML-chars] defines a number of pre-defined character entities, in these groups:

for a total of 253 entries.

Microsoft has registered these codepages with IANA:

In addition, as part of their OpenType specification, Microsoft defines the WGL4.0 character set, which is expressed in terms of Unicode (see [WGL4.0-desc] and [WGL4.0-data]).  It has 652 characters, containing many of the characters from the ISO Latin sets, as well as quite a few symbols.  

Thai is handled as a Latin alphabet, using ISO-8859-11.  [Issue: There is apparently some controversy about this.]

You can compare the ISO-8859, XHTML, and Microsoft repertoires side by side at [PWG-Latin-table].

Asian (CJK)

Normative references to Asian character encoding definitions are given in [IANA-charsets].  In general, mapping these to Unicode is difficult, due to ambiguity in some of the characters (see [XML-Japanese] for discussion of this).

If a printer implements a specific Asian charset, we recommend that it do both of these:

If a client has text in an Asian charset (e.g. Shift_JIS), it should use that charset directly if the printer supports it.  Otherwise, it should use one of the common mappings to convert to Unicode.  This Guide does not define which of the common mappings is the preferred one.
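The following Python sketch illustrates both points: the client-side choice between sending text unmapped and converting it, and the fact that common mappings disagree.  It uses two codecs shipped with Python, "shift_jis" and "cp932" (the Microsoft variant), purely as examples of differing mappings; the function name and its parameters are hypothetical.

```python
def prepare_text(data, source_charset, printer_charsets, mapping_codec):
    """Return (charset label, bytes) to send for text held in source_charset."""
    if source_charset in printer_charsets:
        return source_charset, data           # printer takes it unmapped
    text = data.decode(mapping_codec)         # the client's chosen mapping
    return "UTF-8", text.encode("utf-8")


wave = b"\x81\x60"                        # one double-byte Shift_JIS character
jis_mapping = wave.decode("shift_jis")    # U+301C WAVE DASH
ms_mapping = wave.decode("cp932")         # U+FF5E FULLWIDTH TILDE

direct = prepare_text(wave, "Shift_JIS", {"Shift_JIS", "UTF-8"}, "shift_jis")
mapped = prepare_text(wave, "Shift_JIS", {"UTF-8"}, "cp932")
```

Because the two mappings yield different code points for the same octets, a printer whose repertoire was built from one mapping may lack a character produced by the other.  Sending the original charset, when the printer accepts it, avoids the question.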

Specific CJK repertoires are:

Another source of mappings is the Unihan database published by the Unicode Consortium [Unihan].  However, it is not easy to determine exactly which Unihan tag to use in these various cases.

Microsoft publishes their CJK codepages, with Unicode mappings:

Named Character Repertoires

The PWG will define a standard set of repertoire names to be used for printing capabilities.  The draft version of this list is:

PWG Character Repertoire  Based on IANA Charset  Description                           Reference Location
ISO-8859-1                ISO-8859-1             Latin alphabet No. 1                  [RFC-1345]
ISO-8859-2                ISO-8859-2             Latin alphabet No. 2                  [RFC-1345]
ISO-8859-3                ISO-8859-3             Latin alphabet No. 3                  [RFC-1345]
ISO-8859-4                ISO-8859-4             Latin alphabet No. 4                  [RFC-1345]
ISO-8859-5                ISO-8859-5             Latin/Cyrillic alphabet               [RFC-1345]
ISO-8859-6                ISO-8859-6             Latin/Arabic alphabet                 [RFC-1345]
ISO-8859-7                ISO-8859-7             Latin/Greek alphabet                  [RFC-1345]
ISO-8859-8                ISO-8859-8             Latin/Hebrew alphabet                 [RFC-1345]
ISO-8859-9                ISO-8859-9             Latin alphabet No. 5                  [RFC-1345]
ISO-8859-10               ISO-8859-10            Latin alphabet No. 6                  [RFC-1345]
ISO-8859-13               ISO-8859-13            Latin alphabet No. 7                  http://www.iana.org/assignments/charset-reg/iso-8859-13
ISO-8859-14               ISO-8859-14            Latin alphabet No. 8                  http://www.iana.org/assignments/charset-reg/iso-8859-14
ISO-8859-15               ISO-8859-15            Latin alphabet No. 9                  http://www.iana.org/assignments/charset-reg/ISO-8859-15
ISO-8859-16               ISO-8859-16            Latin alphabet No. 10                 ??? Could use http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-16.TXT
GB_2312-80                GB_2312-80             Chinese (People’s Republic of China)  [RFC-1345]
Shift_JIS                 Shift_JIS              Japanese                              [JIS X 0201] and [JIS X 0208]
KS_C_5601-1987            KS_C_5601-1987         Korean                                [RFC-1345]
Big5                      Big5                   Chinese (Taiwan)                      [Big5]
TIS-620                   TIS-620                Thai                                  [TIS-620]

Note that the XHTML predefined character entities are not shown in this table.  They should be supported implicitly by any printer processing an XHTML-based language.

[Issues:

-how should we handle Microsoft code pages?  Should a printer reference them directly?  Should a printer add in characters from "similar" MS codepages, e.g. from windows-1251 when doing Latin/Cyrillic?

-how should we handle characters in WGL4.0?  A few of these are symbols that don't show up in ISO-8859.

]

Determining A Printer's Supported Repertoires

Capability Queries for Supported Repertoires

Various protocols provide a way for a client to find out information about a printer's capabilities.  These protocols should be extended to define how the client can learn what repertoires are available in a printer.  

The fundamental semantic unit for getting this capability is an attribute named "repertoires-supported" on the Printer object.  The value is a comma-separated string containing the PWG names of the supported repertoires, including any implicitly-supported repertoires as listed below.  Various protocols may map these names to other forms of representation.  For example, the Bluetooth Basic Printing Profile uses bits in a bitmap, while the Printer MIB uses string names with no punctuation.
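A client could handle the comma-separated value roughly as follows.  This is a sketch only: the function names are ours, the tolerance for whitespace around commas is an assumption, and names are compared exactly because this Guide does not say whether repertoire names are case-sensitive.

```python
def parse_repertoires_supported(value):
    """Split a repertoires-supported value into a list of repertoire names."""
    return [name.strip() for name in value.split(",") if name.strip()]


def supports(value, repertoire):
    """True if the named repertoire appears in the attribute value."""
    return repertoire in parse_repertoires_supported(value)


# Hypothetical attribute value returned by a printer.
reported = "ISO-8859-1, ISO-8859-15,Shift_JIS"
names = parse_repertoires_supported(reported)
```

A protocol that carries the same information as a bitmap or as unpunctuated names would need its own conversion to and from this list; no such conversion is shown here because this Guide does not define one.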

In addition, a protocol may provide a mechanism for discovering particular charsets that may be sent directly.  The repertoires-supported attribute does not necessarily reflect characters available in non-Unicode charsets.

Queries associating available repertoires with fonts, charsets, PDLs, etc. are reserved for future study. 

Implicitly-Supported Repertoires

If a printer uses a protocol that supports a repertoire capability query, the client should use it.  When that is not possible, a client may make the following assumptions:

Determining a Printer's Supported Charsets

Most printing languages define a default charset.  Languages based on XHTML specify that a printer must support UTF-8 (an encoding of Unicode) in addition to any other charsets it accepts.

Based on the repertoires defined above, a printer may always use the Unicode codepoints corresponding to those repertoires.  However, most of these repertoires originated with some non-Unicode encoding, and there may be problems mapping to Unicode.

A printer may choose to implement the original, non-Unicode charset based on the repertoires listed above.  This is not likely to be useful for Latin codings, but may be especially useful for Shift_JIS.

[Issue: how should a client learn which charsets are available?]

Recommendations for the Printer Implementor

  1. Always implement Unicode UTF-8, in addition to any other character encoding schemes.
  2. Implement characters described by the rules in "Implicitly-Supported Repertoires," above.
  3. Make supported characters available in all fonts, using a system font fall-through if needed.
  4. Print a recognizable "missing character" symbol (for example an empty rectangle) for any character not supported.
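Recommendations 3 and 4 amount to a simple lookup order, sketched below in Python.  The representation of a font as a dictionary with a name and a set of characters is hypothetical, as is the choice of U+25A1 WHITE SQUARE to stand in for the empty rectangle.

```python
MISSING = "\u25a1"  # WHITE SQUARE, standing in for the "empty rectangle"


def resolve_glyph_source(ch, selected_font, system_font):
    """Return (font name, character) that would actually be rendered.

    Tries the selected font first, then falls through to the system
    font, and finally to the missing-character symbol.
    """
    for font in (selected_font, system_font):
        if ch in font["chars"]:
            return font["name"], ch
    return system_font["name"], MISSING


# Hypothetical fonts: a small decorative font and a broader system font.
script = {"name": "Script", "chars": set("ABCabc")}
system = {"name": "System", "chars": set("ABCabc\u20ac")}

from_selected = resolve_glyph_source("A", script, system)
from_fallback = resolve_glyph_source("\u20ac", script, system)
not_available = resolve_glyph_source("\u4e2d", script, system)
```

With this arrangement the set of printable characters is the same whichever font is selected, which is what lets a client reason about repertoires without asking about individual fonts.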

Recommendations for the Client Implementor

  1. If the printer provides a query mechanism to obtain supported repertoires and charsets, use it to find out what the printer can handle.
  2. Otherwise, follow the guidelines in "Implicitly-Supported Repertoires," above.
  3. If the source document is not in Unicode, decide whether or not to map it to Unicode.  Usually, if the printer can handle the original charset it is best to send it unmapped.
  4. If the document contains characters that won't print, decide whether to alert the user, map them to some other characters, let the printer handle them, etc.
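Step 4 can be sketched as a check followed by a policy decision.  The three policy names ("pass", "alert", "substitute"), the function names, and the use of "?" as a last-resort replacement are all assumptions made for this example; the set of printable characters would come from the printer query or from the implicit assumptions in steps 1 and 2.

```python
def unprintable(text, printer_chars):
    """Characters in text the printer is not known to support, in first-seen order."""
    seen = []
    for ch in text:
        if ch not in printer_chars and ch not in seen:
            seen.append(ch)
    return seen


def apply_policy(text, printer_chars, policy, substitutes=None):
    """Return (text to send, list of unsupported characters found)."""
    missing = unprintable(text, printer_chars)
    if not missing or policy == "pass":
        return text, missing            # let the printer handle them
    if policy == "alert":
        raise ValueError("unprintable characters: %r" % missing)
    if policy == "substitute":
        table = {ord(ch): (substitutes or {}).get(ch, "?") for ch in missing}
        return text.translate(table), missing
    raise ValueError("unknown policy: %r" % policy)


# Hypothetical printer repertoire and substitution table.
sent, missing = apply_policy("5 \u20ac ok", set("5 ok"), "substitute",
                             {"\u20ac": "EUR"})
```

Which policy is appropriate depends on the application; a client with no user interface, for example, cannot usefully choose "alert".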

Recommendations for Standards Work

This section is directed at the Printer Working Group, with suggestions for standards that need to be developed.

  1. Adopt a standard set of character repertoire names.
  2. Define the rules for implicitly supported repertoires.
  3. Define the semantics of a query mechanism to determine which repertoires and charsets are available in a printer.
  4. Agree on and publish normative references for mapping between other schemes and Unicode.

Issues

  1. How do we reference ISO-8859?  Is there a version online, or does every reader need to buy it from ISO?  If so, we should list exactly what they need to buy.  
  2. What about other ISO-8859 components?

Acknowledgements

This Guide was prepared by the PWG Character-Repertoires Working Group, with input and assistance from:

We also thank the authors of the original material cited in the references.

References

[Big5]
"Chinese for Taiwan Multi-byte set. PCL Symbol Set Id: 18T", but where is this?
[BPP]
"Bluetooth Basic Printing Profile", Bluetooth SIG, October 5, 2001. Available at: http://www.bluetooth.com/pdf/Basic_Printing_Profile_0_95a.pdf
[IANA-charsets]
http://www.iana.org/assignments/character-sets.
[ISO-8859]
...purchase each alphabet online at http://www.iso.org.
[JIS X 0201]
Japanese Industrial Standards Committee. 7-bit and 8-bit coded character sets for information interchange, JIS X 0201:1997, Japanese Standards Association, 1997.
[JIS X 0208]
Japanese Industrial Standards Committee. 7-bit and 8-bit double byte coded KANJI sets for information interchange, JIS X 0208:1997, Japanese Standards Association, 1997.
[Lunde]
CJKV Information Processing, Ken Lunde.  O'Reilly Press, 1999.
[Microsoft-codepages]
http://www.microsoft.com/globaldev/reference/cphome.asp.
[PWG-Latin-table]
ftp://ftp.pwg.org/pub/pwg/Character-Repertoires/CRsummary.html.
[RFC-1345]
Character Mnemonics and Character Sets, June 1992.  ftp://ftp.rfc-editor.org/in-notes/rfc1345.txt
[RFC-2278]
IANA Charset Registration Procedures.  ftp://ftp.rfc-editor.org/in-notes/rfc2278.txt.
[TIS-620]
???. Maybe http://www.nectec.or.th/it-standards/std620/std620.htm (in Thai)
[Unicode-8859]
Mapping tables from 8859 alphabets to Unicode.  
http://www.unicode.org/Public/MAPPINGS/ISO8859/
[Unicode-principles]
The Unicode® Standard: A Technical Introduction. 
http://www.unicode.org/standard/principles.html
[Unihan]
Asian property database for Unicode; includes mappings from other character sets.  A very large file; zip form available at http://www.unicode.org/Public/UNIDATA/Unihan.zip.
[WGL4.0-data]
Unicode values for WGL4.0. http://www.microsoft.com/typography/OTSPEC/WGL4.htm.
[WGL4.0-desc]
Description of Microsoft's character set standard which "includes characters required by Western, Central, and Eastern European writing systems, as well as characters required by Greek and Turkish." http://www.microsoft.com/typography/unicode/cscp.htm
[XHTML-chars]
Predefined character entities in XHTML. http://www.w3.org/TR/xhtml-modularization/dtd_module_defs.html#a_xhtml_character_entities
[XML]
Extensible Markup Language (XML) 1.0 (Second Edition), October, 2000.  http://www.w3.org/TR/REC-xml.
[XML-Japanese]
XML Japanese Profile, April, 2000.  http://www.w3.org/TR/japanese-xml/.