Intended audience: XHTML/HTML coders (using editors or scripting), script developers (PHP, JSP, etc.), Web project managers, and anyone who is looking for information about how to deal with character encodings in forms.
What is the best way to deal with encoding issues in forms that may use multiple languages and scripts?
The best way to deal with encoding issues in (X)HTML forms is to serve all your pages in UTF-8. UTF-8 can represent the characters of the widest range of languages. Browsers send back form data in the same encoding as the page containing the form, so the user can fill in data in whatever language and script they need to.
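To make the effect of the page encoding concrete, here is a minimal Python sketch of how the same field value is percent-encoded for submission under two different page encodings. The helper name encode_form and the field name "name" are made up for illustration, and the standard library's urlencode only approximates what a browser does.

```python
from urllib.parse import urlencode, parse_qs


def encode_form(fields, page_encoding):
    """Percent-encode form fields roughly as a browser would for a page
    served in page_encoding (hypothetical helper, for illustration only)."""
    return urlencode(fields, encoding=page_encoding)


# The same user input, submitted from a UTF-8 page and from a Latin-1 page.
utf8_body = encode_form({"name": "Zoë"}, "utf-8")
latin1_body = encode_form({"name": "Zoë"}, "iso-8859-1")

print(utf8_body)    # name=Zo%C3%AB
print(latin1_body)  # name=Zo%EB

# A receiving script that assumes UTF-8 recovers the original text
# only from the UTF-8 submission.
decoded = parse_qs(utf8_body, encoding="utf-8", errors="strict")
print(decoded)      # {'name': ['Zoë']}
```

The two request bodies differ byte for byte, which is why the receiving script needs to know (or enforce) the encoding of the form page.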
A few details help this approach work well. First, tell the browser that the form page is in UTF-8; there are various ways to declare the encoding of a page. Declaring the encoding matters for any page, but even more so when the form page itself contains no characters outside US-ASCII while your users may type other characters into it.
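As one possible illustration, the following Python sketch serves a form page and declares UTF-8 both in the HTTP Content-Type header and in a meta element. It assumes a WSGI-style server; the names form_page_app and FORM_PAGE, the action URL /submit and the field name comment are hypothetical.

```python
# An ASCII-only form page: without an explicit declaration the browser
# would have no hint that submitted data should be UTF-8.
FORM_PAGE = """<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Feedback</title>
</head>
<body>
<form method="post" action="/submit">
<p><input type="text" name="comment"> <input type="submit"></p>
</form>
</body>
</html>
"""


def form_page_app(environ, start_response):
    """Minimal WSGI application that serves the form page as UTF-8."""
    body = FORM_PAGE.encode("utf-8")
    start_response("200 OK", [
        # Declaration 1: the charset parameter of the HTTP header.
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    # Declaration 2 is the meta element inside the page itself.
    return [body]
```

To try it locally, the application can be passed to wsgiref.simple_server.make_server from the Python standard library.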
Second, it may be a good idea for the script that receives the form data to check that the data really is UTF-8, in case something went wrong (for example, the user changed the encoding). Checking is possible because UTF-8 multi-byte sequences follow a very specific byte pattern that text in other encodings is unlikely to match by accident. If non-UTF-8 data is received, the script should send back an error message.
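A receiving script could perform this check along the following lines. This is only a sketch in Python: the function name decode_field and the wording of the error message are assumptions, and it relies on Python's built-in strict UTF-8 decoder rather than on a hand-written pattern.

```python
ERROR_MESSAGE = (
    "The submitted data is not valid UTF-8. "
    "Please reload the form and try again."
)


def decode_field(raw: bytes):
    """Return (text, None) if raw is valid UTF-8, else (None, error message).

    Python's strict decoder rejects truncated sequences, overlong
    encodings and encoded surrogates.
    """
    try:
        return raw.decode("utf-8", errors="strict"), None
    except UnicodeDecodeError:
        return None, ERROR_MESSAGE


# A field submitted as UTF-8 decodes cleanly ...
print(decode_field("Zoë".encode("utf-8")))  # ('Zoë', None)

# ... while the same text submitted as ISO-8859-1 is rejected.
text, error = decode_field(b"Zo\xeb")
print(text is None, error == ERROR_MESSAGE)  # True True
```

Returning the error alongside the text leaves it to the calling script to decide how the message is sent back to the user.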
As an example, in Perl, a regular expression testing for UTF-8 may look as follows:
$field =~
  m/\A(
     [\x09\x0A\x0D\x20-\x7E]            # ASCII
   | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
   |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
   | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
   |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
   |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
   | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
   |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
  )*\z/x;
This expression can be adapted to other programming languages. It takes care of various issues, such as illegal overlong encodings and illegal use of surrogates. It will return true if $field is UTF-8, and false otherwise.
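As an illustration of such an adaptation, the same expression might be written in Python roughly as follows. The names UTF8_PATTERN and is_utf8 are made up for this sketch. It works on raw bytes, and Python's \Z plays the role of Perl's \z.

```python
import re

# Byte-level translation of the Perl expression; re.VERBOSE allows the
# same layout with whitespace and comments.
UTF8_PATTERN = re.compile(rb"""
    \A(?:
        [\x09\x0A\x0D\x20-\x7E]            # ASCII
      | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
      |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
      |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
      |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
      |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*\Z
""", re.VERBOSE)


def is_utf8(field: bytes) -> bool:
    """Return True if field matches the UTF-8 byte pattern above."""
    return UTF8_PATTERN.match(field) is not None


print(is_utf8("Zoë".encode("utf-8")))  # True
print(is_utf8(b"Zo\xeb"))              # False: ISO-8859-1 bytes
print(is_utf8(b"\xc0\x80"))            # False: overlong encoding
print(is_utf8(b"\xed\xa0\x80"))        # False: encoded surrogate
```

Like the Perl original, the first alternative accepts only tab, line feed, carriage return and the printable ASCII range, so input containing other control characters (for example a NUL byte) is rejected even though it is technically well-formed UTF-8.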
Content first published 2003-06-09. Last substantive update 2007-10-26 13:14 GMT. This version 2010-08-20 11:10 GMT
For the history of document changes, search for qa-forms-utf-8 in the i18n blog.
Copyright © 2003-2010 W3C® (MIT, ERCIM, Keio, Beihang), All Rights Reserved. W3C liability, trademark, document use and software licensing rules apply. Your interactions with this site are in accordance with our public and Member privacy statements.