CharacterSet

As stated on WhatIsGlobalization, computers have historically been used mostly in the United States, and so most of them have used English. The most popular character sets, ASCII and EBCDIC, are based on the English alphabet, which is in fact the Latin alphabet. A CharacterSet is an array of glyphs, or pictures, keyed by a character code. For instance, the character code 65 in ASCII maps to the glyph 'A'. ASCII (ISO 646) has only 128 positions (7 bits) in the array; EBCDIC, although it uses 8-bit codes, covers a similarly limited repertoire. This is barely enough to accommodate the upper- and lower-case alphabets, the variety of punctuation we use, and a small set of control characters needed for electronic transmission. ISO 646 did allow 12 code positions to be adapted to local usage; however, there are vastly more glyphs in the world than that.
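
To make the mapping concrete, here is a tiny Perl sketch (Perl is the language used in the discussion further down this page); ord and chr convert between character codes and characters:

    # A character set maps numeric codes to characters.
    print ord('A'), "\n";   # 65
    print chr(66), "\n";    # B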

The solution taken by engineers using the IBM PC architecture was "code pages". A code page is also an array of glyphs, just like ASCII, but 256 characters (8 bits, one byte) long. The additional 128 characters map to localized characters, such as the following variations on the letter 'A' used in various European languages: ÀÁÂÃÄÅ. However, each code page maps to one particular locale's requirements: code page 437 (CP437) is US English, CP860 is Portuguese, and CP863 is Canadian French. This system works well for many scripts throughout Europe, North America and South America.
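
As a small illustration of how locale-dependent this is, the following Perl sketch (assuming the standard Encode module, which ships tables for these code pages) decodes the same byte under three different code pages and reports which Unicode character it becomes:

    use Encode qw(decode);

    my $byte = "\x8E";   # one byte from the "upper" 128 positions
    for my $cp (qw(cp437 cp860 cp863)) {
        my $char = decode($cp, $byte);            # same byte, different code page
        printf "%s maps 0x8E to U+%04X\n", $cp, ord($char);
    }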

However, for scripts with a very large number of glyphs, such as Japanese with its thousands of characters, the single-byte character set (SBCS) system of code pages is wholly inadequate. The solution chosen was a multi-byte character set (MBCS), such as CP932 for Japanese. In this model, a number of byte values in the base page are marked as "shift" (lead) bytes; a shift byte indicates that the next byte indexes into an extension page for the actual glyph.

MBCS is very difficult to program for because a byte taken in isolation may be a whole character, the first half of one, or the second half of one. It is no longer trivial to do string operations such as finding the fourth character in a string; longer, error-prone string-manipulation routines must be used every time. Since it is second nature for many native English-speaking programmers to write programs as if they only had to deal with SBCS, much code has to be adapted, greatly increasing internationalization costs.
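
A short Perl sketch of the problem, using the Encode module's CP932 tables: the byte-oriented view and the character-oriented view of the same Japanese string disagree, so naive byte indexing lands in the middle of characters.

    use Encode qw(decode);

    my $bytes = "\x93\xFA\x96\x7B";        # Shift-JIS (CP932) bytes for 日本
    my $chars = decode('cp932', $bytes);   # decode into a character string

    print length($bytes), "\n";            # 4 bytes
    print length($chars), "\n";            # 2 characters

    # Byte indexing splits a character in half...
    printf "byte 1 is 0x%02X\n", ord(substr($bytes, 1, 1));   # 0xFA, a trail byte
    # ...while indexing the decoded string works per character.
    printf "char 1 is U+%04X\n", ord(substr($chars, 1, 1));   # U+672C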

Unicode

To address this problem, as well as the growing need for a universal character set as economic globalization became prevalent, The Unicode Standard [ISBN 0201616335] was developed. Unicode provides a unique code point for each character in most modern and ancient scripts. (A code point identifies the underlying abstract character rather than any particular glyph.) Code points range from 0 to hexadecimal 10FFFF; excluding the surrogate range D800-DFFF, that gives 1,112,064 possible values in the Unicode space. How these code points are stored is left to a number of character encodings.
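
A brief Perl sketch of that separation between code points and encodings (assuming the standard Encode module): the same code point, U+20AC EURO SIGN, is stored as a different byte sequence by each encoding.

    use Encode qw(encode);

    my $char = chr(0x20AC);                      # the code point U+20AC
    for my $enc ('UTF-8', 'UTF-16BE', 'UTF-32BE') {
        my $octets = encode($enc, $char);
        printf "%-8s %d bytes: %s\n", $enc, length($octets),
               join ' ', map { sprintf '%02X', ord } split //, $octets;
    }
    # UTF-8    3 bytes: E2 82 AC
    # UTF-16BE 2 bytes: 20 AC
    # UTF-32BE 4 bytes: 00 00 20 AC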

UTF-8, a byte-oriented encoding, is probably the main Unicode character encoding used on the web. Its key properties are that plain ASCII text is unchanged (every ASCII file is already valid UTF-8), that each code point takes between one and four bytes, and that lead bytes are always distinguishable from continuation bytes, so the start of each character can be found by examining any single byte.
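
A short Perl demonstration of those properties (again assuming the standard Encode module): ASCII text survives UTF-8 encoding byte-for-byte, while a non-ASCII letter becomes a multi-byte sequence.

    use Encode qw(encode);

    print "ASCII unchanged\n" if encode('UTF-8', 'CharacterSet') eq 'CharacterSet';
    printf "U+00D6 takes %d bytes\n", length(encode('UTF-8', chr(0xD6)));   # 2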

Many modern languages and platforms, such as Windows NT, Windows 2000 and Java, use UTF-16 internally.

For more about Unicode, see WikiPedia:Unicode, WikiPedia:UTF-8 and WikiPedia:UTF-16.


There are some errors in the UTF-16 part:

-- HelmutLeitner

I'll address these in order, Helmut:

The main benefit of UTF-16 over UTF-8 is that it stores characters between U+0800 and U+FFFF, which covers most Asian scripts, in two bytes rather than UTF-8's three. This is a score for perceived globalization, even if the space saving itself is pretty irrelevant in this day and age.

The claim of "simple pattern-matching" is made against the specifics of the encoding, namely that one can tell by looking at a single byte (or byte pair, for UTF-16) whether one is at the start of a new code point or not, and hence one does not need to worry about explicitly finding code point start and end points when searching for patterns. This is not the case for other encoding schemes previously proposed, where for instance the last two bytes of a three-byte character could be a valid encoding for another code point.

-- ChrisPurcell

I have to agree on the issue of bugs with UTF-16 surrogate pairs; last year when I was writing a page-dump processing program in C# I had a whole series of problems with surrogate pairs in Mono's class library. They just plain didn't work all through the whole XML stack: the pairs were combined in the wrong way, detected in the wrong way, broke at buffer boundaries, etc. I'm not sure anything was actually right. :) Apparently those functions had just never actually gotten exercised before, so we got to be the guinea pigs finding and fixing the bugs.

Unfortunately UTF-8 isn't a magic bullet there either; MySQL?'s Unicode support for instance only allows a 3-byte subset of UTF-8, so you can only store 4-byte characters in a field with raw binary collation...

-- BrionVibber

Chris, I hope you will not be offended by what I say now. It is nothing personal but a deep frustration over the insensitivity and silliness of the majority of English speakers and programmers towards foreign languages and their characters. Take it as part of a rant. "Why should I discuss this here with you, in a system that you maintain and that does neither support any form of Unicode nor foreign characters properly? As you can see from ÖsterreichSeite? (http://www.usemod.com/cgi-bin/mb.pl?ÖsterreichSeite). It doesn't link. It doesn't sort. Search for Österreich finds only one page instead of two." -- HelmutLeitner

I may be missing the thrust of your argument, but Chris is working to make MeatballWiki as Unicode as possible given the crappy state of the art specifically because we want to fully include Meatball's audience of non-English speakers. -- SunirShah

Thank you for bringing this problem to my attention, Helmut. I can tell you immediately why it sorts where it does: Perl is sorting the strings as if they were encoded with ISO-8859, and the first byte of the two-byte UTF-8-encoded Ö just happens to be ISO-8859-encoded Ã, which of course sorts between A and B. I can't tell you when I'll fix that, but I will.
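
A minimal sketch of one way around that (not necessarily how the script will end up doing it): decode the titles from UTF-8 first, then sort the resulting character strings with Unicode::Collate, which ships with recent Perls and implements the Unicode collation algorithm rather than a bytewise comparison.

    use Encode qw(decode);
    use Unicode::Collate;

    # Raw UTF-8 byte strings, as they might come back from the page store.
    my @raw = ("\xC3\x96sterreichSeite", 'Zebra', 'Apple');

    my @bytewise = sort @raw;    # compares raw byte values, not letters

    # Decode to characters, then sort with the Unicode collation algorithm.
    my $collator = Unicode::Collate->new;
    my @collated = $collator->sort(map { decode('UTF-8', $_) } @raw);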

I can't tell you why it doesn't link, unfortunately. As far as I understood the UM engine, if it allowed the page to be created, it would form links. All I can think of off the top of my head is that Perl isn't seeing a word boundary before the Ö. Again, I can't tell you when I'll fix that, but I will.

I'm not sure which pages you were expecting Österreich to match. If this one, that's a bug in MySQL?'s text matching that can't cope with nested single quotes around a single word, and it affects all-ASCII search strings too. I can't promise you that I'll fix MySQL?, and I hope you don't expect it of me.

Finally, I think the answer to the question, why should you discuss this here with me, in a system that I maintain and that does neither support any form of Unicode nor foreign characters properly, is precisely the same reason why I discuss this here with you, in a language that you persistently use incorrect grammar in: because we AssumeGoodFaith, that despite the mistakes on both sides, we are trying to communicate, learn, and improve. To build something together. I hope we continue to do that. -- ChrisPurcell

Chris, my rant and anger are over, so I don't feel a need to go into details. Just a few points. :-) (1) I'm fully aware that my English is faulty. If I were to participate in a grammar discussion at all, I would assume that I'm the learner and you are the expert, and I would hesitate to insist on my opinions. (2) My ProWiki engine has fully supported foreign characters since 2001 (Unicode since 2004), and I can assure you that Perl doesn't need to know which encoding is used. (3) Of course in searching for "Österreich" I would expect to find this CharacterSet page. If searching (one of the 8 fundamental wiki features) broke in the transition to MySQL?, you probably should not have done it (from the perspective of a foreign user). Of course, an English programmer will see this as a minor issue of no importance. *But*, at the moment this engine is unusable for foreign language projects! BTW WikiPedia handles foreign characters, MySQL? and Unicode seemingly without problems. -- HelmutLeitner

Excellent! I wasn't aware you were an expert in wikis and Unicode. My apologies if I was asserting untruths about Perl: I couldn't find anything to contradict them online. Perhaps you could help me here? I wish to know what regular expressions and/or character set methodology you use to get Perl to pattern-match non-ASCII UTF-8 page titles. The best I could achieve was (?:[A-Z]|\xc3[\x80-\x9e]) for upper-case letters like Ö, and (?:[a-z]|\xc3[\x9f-\xbf]) for lower-case. This is hardly what I would call "not needing to know what encoding is used", so I assume you have a better alternative? (I'm ignoring WikiPedia in this case, because it doesn't use CamelCase.) -- ChrisPurcell
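
For comparison, a sketch of the character-level route (not necessarily what ProWiki does): if the text is decoded from UTF-8 into Perl's internal character strings first, the Unicode properties \p{Lu} and \p{Ll} match upper- and lower-case letters directly, and no hand-built byte ranges are needed. The pattern below is a simplified WikiWord matcher, not the real UseModWiki link pattern.

    use Encode qw(decode);

    my $raw   = "\xC3\x96sterreichSeite";   # UTF-8 bytes as read from disk
    my $title = decode('UTF-8', $raw);      # now a string of characters

    # Simplified CamelCase page-name pattern using Unicode character properties.
    print "looks like a WikiWord\n" if $title =~ /^(?:\p{Lu}\p{Ll}+){2,}$/;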

Well, we've put a lot of work into that; we finally finished transitioning some of our largest sites (especially en.wikipedia.org) to Unicode only in mid-2005. MySQL? didn't support Unicode at all when we started, and we had to have it treat everything as binary data to use UTF-8. This required a lot of fudging and transformation to get the MySQL?-based search to work; incorrect case-folding and word-boundary detection had to be worked around by rolling things into ASCII-friendly garbage. (We now use a custom search backend based on Apache Lucene.) MySQL? today supports Unicode, but only a subset, so we still can't transition to "native" Unicode support in MySQL?... But it's probably enough for most people (who don't have a wiki in ancient Gothic script or one with dozens of pages on obscure rare Chinese characters...)

PHP unfortunately is still a pretty Unicode-hostile environment, and we had to write a bunch of annoying special-case code. Maybe Perl's better there, maybe not, but you probably do have to be careful about code that assumes locale encoding or other 8-bit-isms. -- BrionVibber

Brion, that's interesting and surprising! I have always admired MediaWiki for its Unicode handling and assumed that the switch to MySQL? and PHP had to do with the better Unicode support those environments offer. I did not imagine that you had to work around such obstacles and rough edges. I wouldn't argue now that a Perl/filesystem approach is better; it is just simpler and lower-level, so when you have problems it is pretty clear where they come from and how to work around them. -- HelmutLeitner

Chris, I'll publish the ProWiki source in two weeks (hopefully on Mar15, sourceforge, under the GPL), so you can draw from it. The foreign-language parts are just a few lines of code. I also wouldn't say that I'm an expert on these issues, because I'm not generally interested in theory or in every issue. I'm just pragmatic in my approach: supporting communities with sufficiently working code in the simplest way possible. -- HelmutLeitner

I've fixed the "doesn't link" problem for ÖsterreichSeite? by changing \b (Perl's word-boundary match) to (?<![A-Za-z])(?<!\xc3[\x80-\xbf]), i.e. I've had to hard-code what a word-boundary is in UTF-8. Not sure about the sorting problem, will need actual Perl support for that somehow.

Update: I've solved the problems in a "better" way in an experimental version of the script, using the advice on [Unicode-processing issues in Perl]. I've actually hit upon that page several times, but it was only this time around that I had sufficient knowledge of all the issues (e.g. Perl's \x{1f} actually emits illegal UTF-8 even though \x{100} and upwards don't, "for backwards compatibility") to be able to get it all working. You're exceptionally lucky if you only need a few lines of code to make Perl unicode-safe, as far as I can see. Using a database, or using CGI.pm, will hurt you.
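
For anyone hitting the same wall, the general shape of that approach (a sketch only, not the actual changes to the script, and the 'username' parameter is just an illustration): decode every byte string once at the input boundary and encode once on output, so that everything in between works on characters.

    use CGI;
    use Encode qw(decode);

    binmode STDOUT, ':encoding(UTF-8)';   # encode all output on the way out

    my $q   = CGI->new;
    print $q->header(-charset => 'UTF-8');

    my $raw = $q->param('username');      # CGI.pm hands back raw octets
    my $username = defined $raw ? decode('UTF-8', $raw) : '';

    # From here on, length(), regexes, sort and \b all operate on characters.
    print "Hello, $username\n";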

I've also fixed the sorting problem. However, I'm not willing to use the experimental site until I've checked all the bugs have been worked out. For example, thanks to a CGI/Unicode compatibility bug (CGI does not treat input as UTF-8), the Username field was silently corrupting names like TëstÜser? — but only when one did a preview. -- ChrisPurcell


CategoryGlobalization
