Multiple Encodings on One Server

Character encoding must be correct; otherwise some characters on a web page look like gibberish. This document offers a simple solution for using multiple encodings on a single Apache web server. This way, you can move smoothly from the old ISO-8859-1 charset to modern UTF-8.

For users at Helia, this provides a workaround for the legacy misconfiguration on myy. Myy is now testing the setup below server-wide. This setup has been in production on Haaga-Helia myy for over a year now.

© 2006 Tero Karvinen www.iki.fi/karvinen

Hindi translation (pdf) contributed by Priyanka Warade.


Short Version for Gurus

AddDefaultCharset Off

The line above was for busy gurus. The rest of us can use the step-by-step guide below.


Character Encodings

If you have a byte with a value above 127, it could mean many things, such as “ä”, “ñ” or “¤”, depending on the encoding. A character encoding defines which character a byte (or a sequence of bytes) represents.
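To make the ambiguity concrete, here is a minimal Python sketch. Python is not part of the setup described in this document, and the byte value 0xA4 and the three encodings are chosen here only for illustration; the helper name decode_byte is hypothetical.

```python
def decode_byte(value, codec):
    """Interpret a single byte value under the given character encoding."""
    return bytes([value]).decode(codec)


# The same byte, 0xA4 (164), read with three different single-byte encodings.
for codec in ("iso-8859-1", "iso-8859-15", "cp437"):
    print(codec, "->", decode_byte(0xA4, codec))
```

The byte itself never changes; only the table used to look it up does, which is why a page shown with the wrong encoding looks broken.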


UTF-8 – the Best Encoding

The recommended character encoding is UTF-8, an encoding of Unicode. It allows writing Finnish and Chinese characters in the same document. It also removes a lot of guesswork from programs that have to determine the encoding. The most used characters, such as A-Z, are represented by a single byte. Some less common national characters, such as ä or ö, are represented by two bytes. If given a choice, use UTF-8.
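The byte counts can be checked with a short Python sketch. The helper name utf8_length and the sample characters are assumptions made for this illustration.

```python
def utf8_length(character):
    """Return how many bytes the character takes when encoded as UTF-8."""
    return len(character.encode("utf-8"))


# An ASCII letter, two Finnish letters and one Chinese character.
for character in ("A", "ä", "ö", "中"):
    encoded = character.encode("utf-8")
    print(character, utf8_length(character), encoded.hex())
```

ASCII letters take one byte, ä and ö take two, and the Chinese character takes three, yet all of them can sit in the same document.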


Other encodings

ISO-8859-1 used to be a popular character set in the western world. ISO-8859-1 is a single-byte encoding, so it has room for only 256 characters, and other national character sets need their own encodings. There are many encodings similar to ISO-8859-1.
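The limitation of a single-byte encoding can be sketched in Python as follows. The function name fits_in_latin1 and the sample strings are hypothetical, chosen only for this example.

```python
def fits_in_latin1(text):
    """Return True if every character of text exists in ISO-8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        # At least one character has no byte assigned in ISO-8859-1.
        return False


print(fits_in_latin1("päivää"))  # Finnish text
print(fits_in_latin1("中文"))    # Chinese text
```

Finnish text fits, but Chinese text does not, so a page mixing the two cannot be saved as ISO-8859-1.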


Character Encoding Should be in Document

The best place for the character encoding is the document itself. This way, documents can be copied to different web servers and are still shown correctly. For example, this document defines its character encoding at the beginning of the HTML source:

<head>
	<title>Multiple Encodings on One Server</title>
	<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
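A declaration like the one above can be read back programmatically. The following is a rough Python sketch using only the standard library; the names MetaCharsetFinder and declared_charset are hypothetical, and it handles only the http-equiv form shown above, not every way a browser detects an encoding.

```python
from email.message import Message
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Remembers the charset of the first http-equiv Content-Type meta tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        content = attributes.get("content")
        if (attributes.get("http-equiv") or "").lower() == "content-type" and content:
            # Let the stdlib header parser split "text/html; charset=utf-8".
            header = Message()
            header["Content-Type"] = content
            self.charset = header.get_content_charset()


def declared_charset(html_text):
    """Return the charset declared in the HTML text, or None if there is none."""
    finder = MetaCharsetFinder()
    finder.feed(html_text)
    return finder.charset


sample = """<head>
<title>Multiple Encodings on One Server</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>"""
print(declared_charset(sample))
```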


Problem

If you save a document from OpenOffice.org Writer, the character encoding UTF-8 is used and explicitly stated at the beginning of the document – just the way it should be. However, when a document using UTF-8 is published on a web server that forces ISO-8859-1 encoding, some characters are broken. Depending on your settings, either all national characters are broken, or only some special characters (e.g. smart quotes). The web server sends the encoding in an HTTP header, which is not visible on the page and overrides the setting in the document.
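The breakage can be reproduced in a few lines of Python. This is only a sketch of what a browser effectively does when it trusts the server's header; the function name misread_as_latin1 is made up for this example.

```python
def misread_as_latin1(text):
    """Encode text as UTF-8, then decode the bytes as if they were ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")


# The document was saved as UTF-8, but the reader is told it is ISO-8859-1.
print(misread_as_latin1("päivää"))
```

Each ä is stored as two bytes in UTF-8, and each of those bytes is shown as a separate ISO-8859-1 character, so "ä" turns into "Ã¤". Plain ASCII letters survive unchanged, which is why only national and special characters look broken.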


Solution

Because the encoding is written in the documents, the web server should not send any hidden, overriding settings.

Create a .htaccess file in your public_html directory. Copy and paste the text below into a text file and use any file transfer program to copy it to the web server.

# $HOME/public_html/.htaccess - Always use encoding from document
# http://www.iki.fi/karvinen/multiple_encodings_on_one_server.html
AddDefaultCharset Off

You must have “show hidden files” enabled to see the .htaccess file, because its name starts with a dot. On the command line, you can use ‘ls -a’ to see hidden files.


Testing

The simplest test is to open a previously broken page in your web directory. Remember to hold the Shift key when pressing the reload button to make sure that the page is really reloaded.

If that fixed it for you, well done. Enjoy your new web site.


Advanced Stuff

If you have a command line and curl available, you can check that the server does not return any encoding in the HTTP header:

$ curl -sI http://www.iki.fi/karvinen/
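If curl is not at hand, the same check can be sketched in Python with the standard library. The function names charset_in_header and server_charset are hypothetical; the sketch assumes the server answers HEAD requests.

```python
from email.message import Message
from urllib.request import Request, urlopen


def charset_in_header(content_type):
    """Return the charset named in a Content-Type header value, or None."""
    header = Message()
    header["Content-Type"] = content_type
    return header.get_content_charset()


def server_charset(url):
    """Send a HEAD request and return the charset the server announces, or None."""
    request = Request(url, method="HEAD")
    with urlopen(request, timeout=10) as response:
        return response.headers.get_content_charset()


# A server that forces an encoding, and one that leaves it to the document.
print(charset_in_header("text/html; charset=ISO-8859-1"))
print(charset_in_header("text/html"))
```

Calling server_charset("http://www.iki.fi/karvinen/") should return None once the server no longer forces an encoding.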

If you are administering a web site, you can fix it for all users by making the setting in the global configuration file. If you are editing the file with nano, use Ctrl-W to find “AddDefaultCharset” and change its value to “Off”.

The location of the configuration file depends on the distribution:

  • Ubuntu and Apache 2: /etc/apache2/apache2.conf
  • Debian and Apache 1.3: /etc/apache/httpd.conf
  • Red Hat, Fedora: /etc/httpd/conf/httpd.conf

You can also ask your system administrator to make this change for you.
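Before editing, it can help to see what a configuration file currently sets. The sketch below is an assumption-laden Python helper, not an Apache tool: the name default_charset_settings is made up, and it looks only at the text it is given, without following any included files.

```python
def default_charset_settings(config_text):
    """Return the value of every active AddDefaultCharset line in config_text."""
    values = []
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # Lines starting with a hash are comments.
        words = stripped.split()
        # Directive names are matched without regard to case.
        if len(words) >= 2 and words[0].lower() == "adddefaultcharset":
            values.append(words[1])
    return values


example = """# Example configuration text, made up for this sketch
#AddDefaultCharset UTF-8
AddDefaultCharset ISO-8859-1
"""
print(default_charset_settings(example))
```

To check a real server, read one of the files listed above, for example with open("/etc/apache2/apache2.conf").read(), and pass the text to the function. Any value other than Off means the server still overrides the encoding declared in the documents.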


Links

“AddDefaultCharset should only be used when all of the text resources to which it applies are known to be in that character encoding and it is too inconvenient to label their charset individually.” The Apache Software Foundation 2006: Apache Core Features. http://httpd.apache.org/docs/2.0/mod/core.html#adddefaultcharset

Karvinen, Tero 2003: UTF-8 in Apache. http://www.iki.fi/karvinen/linux/doc/apache-xhtml-utf-8.txt


