People of the world: migrate to UTF-8!

The Unicode issue is usually irrelevant in countries which use Latin letters only, but is very relevant to the rest of the world.

Every time I saw 'Unicode support added' in OS/software changelogs I thought "I couldn't care less", but that was stupid. Once 100% of software speaks Unicode (or more precisely UTF-8, the most popular Unicode encoding), the world will be a happier place to live in (though it will still burn because of global warming). Here's how I migrated:

So today I became a UTF-8 freak, and this meant I had to go through a few steps. I believe that trying to get UTF-8 support 5 years ago would've been hell; today both Linux and KDE/GNOME seem to have very good UTF-8 support. So where were we? The steps:

1. X support: I manually added the line "export LC_CTYPE=en_US.UTF-8" to /etc/X11/Xsession (any nicer place to put it?) so that everything running under X knows it should use UTF-8. Note the upper-case US: locale names are case-sensitive in that part. I also exported LC_ALL=en_US.UTF-8, which isn't mandatory: LC_ALL overrides every other LC_* variable, so setting either one is enough. Then I restarted X and ran the 'locale' command to make sure it took effect.
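A quick way to confirm the setting took effect is to ask for the active character map instead of eyeballing the full 'locale' output. This is only a sketch, assuming a POSIX shell and the standard locale utility; is_utf8_codeset is a made-up helper name, not something from the original setup.

```shell
# Hypothetical helper: succeeds when the given codeset name means UTF-8.
# Accepts both common spellings ("UTF-8" and "utf8"), in any letter case.
is_utf8_codeset() {
    case "$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')" in
        UTF-8|UTF8) return 0 ;;
        *) return 1 ;;
    esac
}

# `locale charmap` prints the codeset selected by the current LC_CTYPE.
charmap=$(locale charmap 2>/dev/null || true)

if is_utf8_codeset "$charmap"; then
    echo "UTF-8 locale is active"
else
    echo "not UTF-8 yet (charmap: ${charmap:-unknown})"
fi
```

Run it in a fresh terminal after restarting X; a shell that was started before the change still carries the old environment.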

2. Converting file names to UTF-8: I had to convert all the file names from ISO-8859-8 (the Hebrew encoding) to UTF-8. I used the convmv script (available through apt-get/yum in popular distros). Without --notest, convmv only prints what it would rename, so it's worth a dry run first. After that, simply running the following line did the trick:

convmv -f iso8859-8 -t utf-8 -r --notest /path

3. Converting data to UTF-8: iconv -f iso8859-8 -t utf-8 should do the trick, but I haven't needed it yet. I had some ID3 tags to convert, which could be done automagically by a small script using some id3tag tool + iconv, but I was lazy and re-typed them manually. <ashamed>
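For plain text files, the iconv step could look roughly like this. It's a sketch on a throwaway file in a scratch directory, assuming an iconv that knows ISO-8859-8 (glibc's does); the file names old.txt and new.txt are made up for the example.

```shell
# Work in a scratch directory so nothing real is touched.
tmp=$(mktemp -d)

# The word "shalom" in ISO-8859-8: one byte per Hebrew letter
# (0xF9 0xEC 0xE5 0xED, written as octal escapes for portability).
printf '\371\354\345\355\n' > "$tmp/old.txt"

# iconv writes to stdout, so convert into a new file and swap it in
# afterwards; redirecting onto the input file would truncate it.
iconv -f ISO-8859-8 -t UTF-8 "$tmp/old.txt" > "$tmp/new.txt"

# Dump the result: the same four letters now take two bytes each
# (d7 a9 d7 9c d7 95 d7 9d), followed by the newline (0a).
od -An -tx1 "$tmp/new.txt"
```

Remove the scratch directory when you're done. For real files, keep the original around until you've looked at the converted copy.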

4. Console support: Actually I was about to give up on console support, but it was so easy I couldn't resist:

  • Set a Unicode console font
  • Run the unicode_start command
    (in Debian both are configurable in /etc/console-tools/config)
  • Added 'export LC_CTYPE=en_US.UTF-8' to my ~/.bashrc, and checked later with the 'locale' command after a console login.

5. Testing: how do I test whether what I'm reading is indeed UTF-8?

  • Unset LC_CTYPE and LC_ALL, so the locale command shows no sign of Unicode.
  • From this terminal start a new xterm and check the locale command again; that xterm should now be Unicode-disabled.
  • If it's text, look at it through hexdump; if it's a file name, look at it through stat. In UTF-8, Hebrew letters (like Cyrillic, Greek and Arabic ones) take two bytes each. The first byte doesn't name a language; it holds the high bits of the code point, so within one script it repeats itself a lot (0xD7 for every Hebrew letter). Other scripts, such as CJK, take three bytes per character.
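If staring at hex dumps gets old, the check can be automated by letting iconv read the bytes as UTF-8 and seeing whether it complains. This is a sketch rather than a bulletproof detector; is_utf8 is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds if stdin is well-formed UTF-8.
# iconv exits non-zero when it meets a byte sequence that is
# invalid in the declared source encoding.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# A Hebrew shin encoded as UTF-8 (0xD7 0xA9) passes...
printf '\327\251' | is_utf8 && echo "valid UTF-8"

# ...while the same letter as a lone ISO-8859-8 byte (0xF9) does not.
printf '\371' | is_utf8 || echo "not UTF-8"
```

The same helper works on a file (is_utf8 < file) or on a file name piped in with printf '%s' "$name". Keep in mind that pure ASCII is also valid UTF-8, so a pass only means something when the input contains non-Latin letters.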

Hurray. Encoding problems no more.
