The character set problem

In this section, we discuss the history of character sets as well as problems that can occur in conjunction with them and how to solve them.

Alphabet soup: Databases and character sets

Expensive computers and equipment that filled up entire rooms are ancient history. What was originally available to a small elite group and only in English is now available to almost everyone and in every language. Yet the vast number of languages, fonts and, thus, character sets that deviate from the original English standards continue to cause problems.

This article explains how the MySQL database platform deals with the problem and how to prevent and solve tiresome problems with umlauts and other special characters.

The origins and structure of character sets

In the beginning, there was only US-ASCII. Or almost. Since the 1960s, the smallest unit of memory used by computers has been eight bits (8 bits = 1 byte). Eight bits can be used to express the numbers 0 through 255. For computers to display text, these numbers are assigned to text characters. To ensure compatibility between the various systems used at the time, the ASCII format was established as the standard in 1969. Characters were assigned to the first 7-bit combinations – and the first character set or «charset» was born. The number 65 therefore corresponds to the letter «A», for example.

To leave space for developments and new characters (e.g. the Euro symbol), only half of the 256 characters were permanently defined in US-ASCII. These characters included all English characters. Symbols, umlauts and special characters in other languages therefore had to be added in the second half of the 256-character set. However, because there are far more than 256 different text characters in the world (Chinese characters, for example), there are now many different extensions for all possible languages. The most important character sets for Europe belong to the ISO-8859 family. It was the early 90s before a concept was finally introduced that was able to map all character sets: Unicode. However, Unicode has still not been consistently implemented in all operating systems and applications.

Image 1: Character maps with examples from the character sets ISO-8859-1, ISO-8859-6 and UTF-8.

The most common character sets

The following sections briefly cover the most common character sets used for Western European languages.

ISO-8859-1 or Latin-1

This character set uses a fixed length of eight bits to encode the characters. The encodable characters with the first seven bits (0-127) correspond to the US-ASCII character set. The second half (128-255) contains the most common characters for Western Europe, the Americas, Australia and parts of Africa. It does not include the Euro symbol.

ISO-8859-15 or Latin-9

This character set uses a fixed length of eight bits to encode the characters. The encodable characters with the first seven bits (0-127) correspond to the US-ASCII character set. The second half (128–255) contains the common characters for Western Europe, including the Euro symbol, and full support for French, English (US), Australia and parts of Africa.

UTF-8

This character set was developed for the purpose of developing a standard character set for all languages. For this reason, the encoding is a bit more complex.

UTF-8 uses a variable length for storing the characters. The first seven bits corresponds to US-ASCII. This makes it possible to store US-ASCII characters in one byte. All other characters, such as umlauts, require 2 or more bytes (ISO-8859-x requires only 1 byte for this).

UTF-8 supports nearly all characters used around the world.

Problems

This basic US-ASCII character set and many different language-specific extensions can now be used to map most of the characters in existence. However, this requires that the character set in which text is written has been defined. If this information is not available, the computer will not know whether to display the character corresponding to the number 196 as an ISO-8859-1 «Ä» or an ISO-8859-6 «?». MS-DOS users have probably encountered the problem when the text border in applications is not always displayed properly as a border but, instead, as a rectangle of gibberish, for example. Unicode is also affected by the same problem. If it is unclear whether the text was written in Unicode, an «Ä» will suddenly become a hieroglyph of two strange characters (because Unicode encodes all non-US-ASCII characters in 2 or more bytes).

Text in databases

Databases can also store text. This is even the main function of database-driven websites. To solve the alphabet soup problem, databases have a feature that allows the appropriate charset to be defined for each text field. This ensures that the correct characters are displayed when writing the text or reading it later. The following dump illustrates this:

/*!40101 SET NAMES utf8 */;
CREATE TABLE `user` (
`Id` INT NOT NULL AUTO_INCREMENT,
`User` VARCHAR(255) CHARSET=latin1 NOT NULL default ,
`Occupation` VARCHAR(255) NOT NULL default ,
PRIMARY KEY (`Id`,`User`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_bin COMMENT='Users';

This simple database saves three fields in a table:

Field name	Type	Character set
Id	INT (number)	None (number)
User	VARCHAR (text)	Latin-1 (specific only for this field)
Occupation	VARCHAR (text)	(UTF-8 standard for this table)

The default character set for this table is UTF-8. (DEFAULT CHARSET=utf8 on the last line.)
The «User» field, however, is set to latin-1. (CHARSET=latin1 on the fourth line)
The entire dump itself is saved in UTF-8. (Defined by the «SET NAMES utf8» statement in the comment, first line)
The collation defines the collation sequence within the alphabet and not the character set, as is often assumed. (COLLATE=utf8_bin on the last line.)

For this dump, the «User» field was therefore converted from the database-internal representation in latin-1 to UTF-8. Latin-1 can be fully mapped in UTF-8. The«User» field is converted from UTF-8 back into the original Latin-1 character set when reading the dump.

Accessing text in databases: the database library

The text in databases can therefore be stored with a fixed character set. A program library, also referred to as a “client library” or simply a “library”, controls the exchange of data between the database and the application.

Image 2: Schematic sequence and interaction between website components and DB.

This program library can either pass on the data 1:1 as it is stored in the database or convert the data from one charset to another (e.g. Latin-1 to UTF-8). If the program that is using the library does not define the desired behavior, you must assume that a behavior will be selected arbitrarily. If the character set to be used for transferring the data has not been defined, it is impossible to ensure that text will be displayed or saved correctly.

In MySQL, the following statement is used to define the character set used for the connection:

/*!40101 SET NAMES utf8 */;

Your PHP code might look similar to this:

$DB->query("SET NAMES 'utf8'");

or:

mysql_query("SET NAMES 'utf8'",$connection);

The web applikation

If the entire process so far has been handled with the correct character set by the database using the library, it is up to the web application to transfer the data to the browser with the correct charset. The character set used in the generated HTML must either be declared in a META tag in the HTML header or HTTP header when transferring the data. Here, too, a false declaration can result in alphabet soup for any non-US-ASCII characters such as umlauts.

Character set definition in the HTTP header

The charset is usually specified in the HTTP header when transferring the data. UTF-8 is defined as the charset in the following HTTP header:

Content-Type: text/html; charset=utf-8

In your PHP application, the PHP code used to define this charset might look something like this:

header('Content-Type: text/html; charset=utf-8');

Defining the charset in HTML

The charset can also be defined directly with the http-eqiv meta-tag. The HTML code might look something like this:

<head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-15">
</head>

Defining the charset for forms

Conversely, the acceptable charset must be clearly defined when transferring text in forms to your applications. Otherwise, the text sent by the browser may end up with a random character set. If your web application does not carefully analyze or convert the characters in such case, it is simply a matter of chance whether the character string in the database will actually be written in the same character set defined in the database.

An example of a form with a precisely defined charset for the text fields:

<form action="action.php" accept-charset="ISO-8859-15">
Name: <input type="text" name="name" />
Nachname: <input type="text" name="nachname" />
<input type="submit" value="Senden!" />
</form>

Error cases

There is nearly an infinite number of scenarios that produce alphabet soup. However, there are three main categories:

The declared character set and the one actually used in the database do not match

This data will end up in the database if, for example, a browser uses UTF-8 encoding to transfer the data from a text field on a form due to a missing accept-charset="ISO-8859-15" statement and the application writes the data unverified to a database field declared as latin-1. This works fine as long as the charset of the database connection remains the same. If it is not defined, problems may suddenly occur when changing providers, DB versions or updating the server.

Data is converted twice or incorrectly

Text data from a form is transferred with the UTF-8 character set (as defined in the HTML) and is supposed to be written to a field declared as UTF-8 in the database. However, no charset or the wrong charset has been defined with SET NAMES for the database connection. This causes UTF-8 to be converted from a different character set to UTF-8 again. The worst case scenario is that all non-US-ASCII characters are destroyed.

This can also happen when outputting text from the database. In a database declared as Latin-1, UTF-8 text is stored incorrectly. If, for example, the database connection is set so that all text is output in UTF-8, the library attempts to convert what it thinks is Latin-1 text into UTF-8 again. The result here is also alphabet soup.

The charset for displaying data in the browser is not set or is set incorrectly.

This error is the easiest to correct. The data is stored and output with a consistently defined charset, but the charset being used is not defined in the HTTP or HTML header. In such case, most browsers attempt to guess the charset with varying degrees of success. Because the data in the application and in the database are stored consistently, this problem can be fixed with a simple PHP or HTML statement (see above).

Summary: best practices

In summary, the following guidelines make for a life free of charset-related problems when working with databases:

Declare a character set for every text field, table or database. Never leave the charset undefined for any text fields.
For each MySQL connection, declare the character set with SET NAMES.
Use the HTTP or HTML header to declare the charset of your pages.
Declare the acceptable character sets for each form. Be aware that a browser may still send data with the wrong character set.
Make sure that you always use SET NAMES to specify a single character set both when reading and generating dumps.

Troubleshooting: When the alphabet soup has already been served

The starting point for a rescue attempt is to analyze the problem: You need to know which errors there are in your application or in the data stored in the database. If the error is only found in one of the possible places, follow these steps to identify and solve the problem.

Browser is using the wrong charset

First check whether the correct charset is selected in your browser. In most browsers you can view and manually change the charset that is used. Is the selected charset the one you intended? For European websites, the most common and useful charsets belong to the ISO-8859 family, in particular ISO-8859-1, ISO-8859-15 and UTF-8. Hostpoint website is displayed with the wrong charset: ISO-8859-1 (latin-1) instead of UTF-8.

Image 3: Hostpoint website is displayed with the wrong charset: ISO-8859-1 (latin-1) instead of UTF-8.

Image 4: Hostpoint website is displayed with the correct charset: UTF-8.

Tip: UTF-8 uses 2 bytes to store umlauts and special characters. If two incorrect characters are displayed instead of umlauts (see Fig. 3), it is likely that the data or this part of the data is stored in UTF-8, but a character set from the ISO-8859 family is being used to display it in the browser.

If no characters or comma-like characters are displayed in place of umlauts, you probably have the reverse case: Characters from the ISO-8859 character sets are displayed using UTF-8 in the browser.

Solution: Make sure that the correct charset is specified in the HTTP header and HTML code.

Please use this form only to provide feedback on the above guide.
For support requests please use this form instead.