Unicode spam

I got home from a nice dinner with friends, discovered that it was ten degrees cooler in my bedroom than in the rest of the apartment, and decided that was a sign that I should settle in with my laptop. I felt like hacking on something but didn't feel like tackling any of the work-related stuff that I'm "supposed" to be doing. So I decided to start poking at my mdfrm program to improve its UTF-8 handling.

I wrote mdfrm eons ago in Perl when I was using qmail. It's a replacement for the venerable frm program (originally part of elm), which shows you a summary of From and Subject of all your new mail. I run it constantly to see what mail I have pending, and none of the replacements in various other packages quite do what I want.

mdfrm is great, but it's showing its age. Among other things, I wrote it when I was still using a C locale for everything, and it had special-case decoding only of ISO 8859-15 in RFC 2047 encoding. Some time back, I switched to using UTF-8 across the board to match the current trend on Linux in general, and now mdfrm has been spitting out ISO 8859-15, creating invalid UTF-8, and thereby mangling its output (particularly in the presence of spam).

So, the first step was to research Perl encoding libraries, which have come a long way. The Encode module (with all its subclasses) comes with Perl these days and can convert from all sorts of character sets into UTF-8. Even better, I discovered that one of the encodings it supports out of the box is RFC 2047 encoding. So problem solved for anyone who uses MIME properly; if I see an RFC 2047-encoded string, I just pass it through the appropriate decode command and I get UTF-8 back.
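For illustration, here is a minimal sketch of that RFC 2047 decoding step. It is written in Python with the standard email.header module rather than the Perl Encode module that mdfrm uses, and the helper name decode_rfc2047 and the sample subject are made up for the example.

```python
from email.header import decode_header, make_header


def decode_rfc2047(raw: str) -> str:
    """Decode RFC 2047 encoded-words in a header value to a Unicode string.

    decode_header() splits the value into (bytes, charset) chunks and
    make_header() reassembles them into a Header that converts to str.
    """
    return str(make_header(decode_header(raw)))


if __name__ == "__main__":
    # Hypothetical Subject header tagged as ISO 8859-15, Q-encoded.
    subject = "=?ISO-8859-15?Q?Caf=E9?="
    print(decode_rfc2047(subject))
```

Headers that contain no encoded-words pass through unchanged, so the same call can be applied to every From and Subject line.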

Now, a lot of my spam is in Asian character sets, so even with that fix, I was getting a bunch of square blobs. I use the Neep font by Jim Knoble, which has been extended in Debian to include the basics of Unicode but which doesn't have any CJK (Asian -- Chinese/Japanese/Korean) characters. Worse, the square blocks also weren't aligned properly; Perl (via format output) clearly thought they were wider than xterm thought they were. Besides, square blocks are lame; other X programs can use font sets and pull characters not found in a primary font from another font. I wanted xterm to do the same. So I started doing research.

Turns out, xterm can, and even tries to do so automatically. But this is very poorly documented unless you know exactly what to search for in the xterm manual page. xterm uses two fonts: a regular font and a wide font. By default, for wide characters, it tries to find a font double the width of the current font. In the man page, it just talks about "wide characters," but in practice what that means is CJK characters. Now, Neep doesn't provide a wide variant, so the autodetection doesn't work, but it turns out that you can set a wide font separately from the main font using the -fw command-line option and the corresponding wideFont X resource.

Setting that appropriately, I can now see Asian characters when the From and Subject use RFC 2047 encoding! The misc-fixed font works just fine as a wide font with Neep as the narrow font.

But the alignment is still broken. So I need something that will tell me, in mdfrm, what characters take up two character cells and what characters take up only one. This sounds like something that I should be able to get from a Unicode property, and indeed, Unicode Standard Annex #11 has exactly the properties I want. Unfortunately, this isn't one of the classes supported in Perl 5.8 according to perlunicode.

A Google search later, I turned up Unicode::EastAsianWidth, which seemed to do exactly what I wanted. Hm, not packaged for Debian. Well, I can fix that later. However, more fatally, it turns out that it just doesn't, er, work. Later inspection and comparison with the data file reveals that for some reason the module is missing vast swaths of wide character blocks, including the entire main CJK block. Weird. I suppose this may be due to it not having been updated in years.

It turns out that all the CJK characters I get in spam are concentrated in a few large blocks, so right now I'm just hard-coding character ranges into the new version of mdfrm based on the Annex #11 data file. Maybe at some point I'll do something more thorough, but this works. After lots of fighting with Perl formats and then with printf, I gave up on trying to do alignment with Perl's built-in facilities and just wrote the alignment code by hand.
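The hard-coded-ranges approach with hand-written alignment can be sketched roughly as follows. This is Python, not the Perl in mdfrm, and the ranges are an illustrative handful of large wide blocks, not the list mdfrm actually uses. The names WIDE_RANGES, cell_width, display_width, and fit are all invented for the example.

```python
# Illustrative code point ranges whose characters take two terminal cells.
# This is an assumption for the sketch, not a complete Annex #11 table.
WIDE_RANGES = [
    (0x3041, 0x30FF),  # Hiragana and Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7A3),  # Hangul Syllables
    (0xFF01, 0xFF60),  # Fullwidth ASCII variants
]


def cell_width(ch: str) -> int:
    """Return 2 for characters in a wide range, otherwise 1."""
    cp = ord(ch)
    return 2 if any(lo <= cp <= hi for lo, hi in WIDE_RANGES) else 1


def display_width(text: str) -> int:
    """Number of terminal cells the string occupies."""
    return sum(cell_width(ch) for ch in text)


def fit(text: str, cells: int) -> str:
    """Truncate or pad text so it occupies exactly `cells` terminal cells."""
    out = []
    used = 0
    for ch in text:
        w = cell_width(ch)
        if used + w > cells:
            break  # the next character would overflow the column
        out.append(ch)
        used += w
    return "".join(out) + " " * (cells - used)


if __name__ == "__main__":
    for sender in ["Alice Example", "\u65e5\u672c\u8a9e\u306e\u540d\u524d"]:
        print(fit(sender, 10) + "| subject goes here")
```

The point of padding by cell count rather than character count is that the separator lands in the same column for both lines. The sketch ignores combining characters, which occupy zero cells. Python's unicodedata.east_asian_width could replace the hand-maintained table.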

That left the problem that most of my mail is spam and spammers don't use MIME, or standards in general. Instead, most of my non-English mail is in a hodgepodge of native Asian encodings. Enter Encode::Guess, which looked quite promising. Unfortunately, due to how it works, there are some character sets that it will think are always good, so if you include them in the guess list, you get ambiguous results. It doesn't do any character frequency analysis, just looks for characters that are completely invalid for a given encoding, so an encoding like ISO 8859-15 will almost always succeed since nearly every code point is a valid ISO 8859-15 character.

To make a long story short, after a lot of fiddling, I worked out a calling sequence that tries various encodings from strictest to loosest, and now I have something that seems to work. I can recognize Korean spam, ISO-2022-JP with its ESC codes for shifting, and a few other common encodings even without RFC 2047 tagging. Alignment now works (after adding a few more character ranges), and after discovering an xterm segfault bug that I should report, I got xterm working with a proper alternate font. So now, when I run frm, I see Asian and Cyrillic characters mixed in with Western European characters, which makes the spam much more fun and avoids all the alignment and ugly display problems.
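The general idea of trying encodings from strictest to loosest might look something like the sketch below. It is Python using built-in codecs, not the Encode::Guess calling sequence from mdfrm, which isn't shown here. The candidate list, its order, and the name guess_decode are assumptions made for the example.

```python
# Candidate encodings, ordered from most to least restrictive.
# ISO 8859-15 comes last because it accepts any byte sequence.
CANDIDATES = ["iso-2022-jp", "utf-8", "euc-kr", "iso-8859-15"]


def guess_decode(raw: bytes):
    """Return (encoding_name, text) for the first candidate that decodes cleanly."""
    for name in CANDIDATES:
        # ISO-2022-JP is 7-bit and relies on ESC sequences to shift character
        # sets, so only consider it when an ESC byte is actually present.
        if name == "iso-2022-jp" and b"\x1b" not in raw:
            continue
        try:
            return name, raw.decode(name)  # strict: any invalid byte raises
        except UnicodeDecodeError:
            continue
    # Not reached with the list above, since ISO 8859-15 never fails.
    raise ValueError("no candidate encoding matched")


if __name__ == "__main__":
    samples = [
        "\u3053\u3093\u306b\u3061\u306f".encode("iso-2022-jp"),
        "\ud55c\uad6d\uc5b4".encode("euc-kr"),
        "caf\u00e9".encode("iso-8859-15"),
    ]
    for raw in samples:
        print(guess_decode(raw))
```

Like the approach described above, this does no character frequency analysis, so a short Western European string with adjacent high-bit bytes can still be misread as a multibyte encoding. The ordering only reduces that ambiguity.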

I'm probably going to release the new version of mdfrm sometime this weekend under a different name and keep the old version around unchanged, since the new version requires Encode and Perl 5.8 and won't work at all for people who aren't using UTF-8.

Next for Unicode conversion is to change the declared character set for all of my web pages to UTF-8 so that I can start using non-ISO-8859-1 characters on my web pages.

Posted: 2007-07-06 23:40

Last spun 2022-02-06 from thread modified 2013-01-04