Handling special characters and names

General

This document describes how support for foreign names was implemented in our phone book and how it works.

HTML

Implementation

Names can be written as Unicode numeric character references, either decimal (encoded as &#...;) or hexadecimal (encoded as &#x...;). Some background information on how this was done:

  • An example of Unicode decimal values: &#21776;&#29233;&#27946; = 唐爱洪 (Aihong Tang)
    • A new field was added to our PhoneBook: UnicodeName. This is a MySQL VARCHAR(40) with utf8_unicode_ci collation.
    • The HTML header was changed from content="text/html; charset=windows-1252" to content="text/html; charset=utf-8"
    • The MySQL connection from the client must specify the encoding for the data to be transferred correctly. mysql_set_charset() is only available in PHP 5.2.3 and later, so as a workaround mysql_query("SET CHARACTER SET utf8") is issued after mysql_select_db() instead.
  • With these changes applied, the database can be read and the result displayed. One more trick was used for portability (and to make cutting and pasting the HTML source easier): the UTF-8 characters are converted to HTML entities. The PHP htmlentities() function was used for this.
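
The relation between a name and its decimal or hexadecimal encoding can be checked with a few lines of code. The following is a minimal sketch in Python (not the PHP used by the phone book); the helper names to_decimal_entities and to_hex_entities are hypothetical and exist only for this illustration.

```python
import html


def to_decimal_entities(text: str) -> str:
    """Encode every non-ASCII character as a decimal reference (&#NNNN;).

    ASCII characters are passed through unchanged, so this does NOT
    escape HTML special characters such as & or <.
    """
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in text)


def to_hex_entities(text: str) -> str:
    """Encode every non-ASCII character as a hexadecimal reference (&#xHHHH;)."""
    return "".join(c if ord(c) < 128 else f"&#x{ord(c):X};" for c in text)


name = "唐爱洪"  # the example name used in this document

decimal_form = to_decimal_entities(name)
hex_form = to_hex_entities(name)

print(decimal_form)  # &#21776;&#29233;&#27946;
print(hex_form)      # &#x5510;&#x7231;&#x6D2A;

# Decoding either form with the standard library gives back the original name.
assert html.unescape(decimal_form) == name
assert html.unescape(hex_form) == name
```

Either form can be sent as the encoded name; a browser displays both the same way.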

How to enter the encoded names: simply send an email with your encoded name (Chinese, Korean, Japanese, Russian, etc.) to Liz Mogavero, and she can add it as-is to the UnicodeName field in the phone book.

Results

Here are examples (thanks to the beta testers for sending me their names in their original language)

Possible problems

The font may appear broken on Linux or other systems if you do not have support for Asian fonts. In such a case, the following may help (you need to log in as root or ask your system administrator to install the proper fonts):

% yum install fonts-chinese fonts-japanese fonts-korean

Restart Firefox, or whatever your preferred browser is, and all should be fine afterward (if not, change browser ;O ).

 

TeX/LaTeX

LaTeX also accepts Unicode characters, but rendering them requires some additional font packages to be installed. The input encoding is declared in the preamble with:

\usepackage[utf8]{inputenc}

If your LaTeX compiler complains about utf8, try utf8x instead.

LaTeX has supported UTF-8 in its base package since March 2004 (although the support remained experimental). Before that, UTF-8 was already available in the form of Dominique Unruh's package, which covered far more characters but was rather resource hungry. Font support also exists for many languages; see the internationalization page for examples.

Following the encoding approach above, one can generate the LaTeX names by starting with the \usepackage line as indicated and then placing the characters as-is in the document.
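
As an illustration of that workflow, here is a minimal sketch in Python that decodes an entity-encoded name and places the characters as-is into a small LaTeX source. The function name latex_source_for is hypothetical and this is not how the Author tool works. The sketch only produces the text of the file: whether it compiles still depends on the font packages for the script in question being installed, and LaTeX special characters in the name (such as & or %) are not escaped.

```python
import html


def latex_source_for(encoded_name: str) -> str:
    """Return a small LaTeX document containing the decoded name as-is.

    encoded_name may contain decimal (&#NNNN;) or hexadecimal (&#xHHHH;)
    references, or already-decoded characters.
    """
    name = html.unescape(encoded_name)  # "&#21776;" -> "唐", and so on
    lines = [
        r"\documentclass{article}",
        r"\usepackage[utf8]{inputenc}",  # the declaration shown above
        r"\begin{document}",
        name,                            # characters dumped as-is
        r"\end{document}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    source = latex_source_for("&#21776;&#29233;&#27946;")
    print(source)
```

When saving the result to a file, it should be written as UTF-8 (for example with open(path, "w", encoding="utf-8")) so that it matches the declared input encoding.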

This encoding has not yet been tested in our Author tool (but it should be possible).