ASCII, EBCDIC and UTF-8

2009 January 14
by r.claypool

For a blog named techencoder, it seems appropriate to start with a post about a few important character encodings that our computers use.  I’ll dive into higher-level abstractions and upcoming technologies soon, but a basic understanding of this topic is something every programmer should have.

Background  …

Not too long ago, hardware resources were far more expensive than they are today, and data structures could not afford to waste them. This explains why early character encodings were designed to use no more than 8 bits per character (ASCII uses only 7), even though that is not sufficient to communicate in all the world’s languages. Eight bits were a necessary limitation because the cost of memory (and bandwidth and storage) was exceedingly high, but today we have enough hardware resources to use a modern encoding that is not so restrictive.  Let’s briefly look at two legacy encodings that you should know something about and another that I recommend using for the foreseeable future.

The Three Amigos …

EBCDIC, ASCII and UTF-8 allow for up to 256, 128 and 1,112,064 characters respectively.  UTF-8 can encode every valid Unicode code point (U+0000 through U+10FFFF, minus the 2,048 surrogate code points reserved for UTF-16), so suffice it to say there is enough space for English, Chinese, Klingon and any other language you might encounter.
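To make those sizes concrete, here is a small Python sketch. The constant names and the `utf8_length` helper are made up for illustration; the arithmetic follows from the bit widths of the two legacy encodings and from the Unicode range that UTF-8 is allowed to encode.

```python
# Code point capacity of each encoding (names here are illustrative only).
ASCII_CODE_POINTS = 2 ** 7            # 7-bit code: 128 values
EBCDIC_CODE_POINTS = 2 ** 8           # 8-bit code: 256 values

# Unicode spans U+0000..U+10FFFF. The 2,048 surrogates (U+D800..U+DFFF)
# are reserved for UTF-16 and are not valid in UTF-8.
UTF8_CODE_POINTS = 0x110000 - 0x800   # 1,112,064 values


def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))


# UTF-8 is variable width: one to four bytes per character.
samples = {
    "A": "\u0041",             # Latin capital A
    "e-acute": "\u00e9",       # Latin small e with acute
    "zhong": "\u4e2d",         # CJK ideograph
    "emoji": "\U0001f600",     # grinning face, outside the 16-bit range
}
lengths = {name: utf8_length(ch) for name, ch in samples.items()}

print(ASCII_CODE_POINTS, EBCDIC_CODE_POINTS, UTF8_CODE_POINTS)
print(lengths)
```

The variable width is what lets UTF-8 stay compact for English text while still reaching the whole Unicode range.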

  • EBCDIC (1963) is a rare find outside IBM mainframes. The System/360 series was first to use this encoding and subsequent machines from IBM have continued to use it internally. Their hardware or software translates EBCDIC to another encoding when interfaced with another system, so EBCDIC is essentially dead to most of us.
  • ASCII (1963) is a widely used international standard, but it is dated. Until the middle of 2008, it was the dominant encoding on the Internet, and some older or poorly implemented programs still (wrongly) assume that files are encoded in ASCII without checking the file’s Content-Type. Modern encodings such as UTF-8 are designed to be backward compatible with ASCII, so this is not a problem as long as the text is plain English. Non-English and multilingual documents will usually not render correctly when read as ASCII.
    Unicode and the Universal Character Set were standardized in the early 1990s and produced several different encodings.  The one most in use today is UTF-8.
  • UTF-8 (1993) is another widely used international standard. It is based on the Unicode standard and is quickly replacing ASCII (see graph, left). It can encode characters in most of the world’s writing systems and it is the encoding you should use whenever possible. The Internet Engineering Task Force (IETF) is one among many organizations that recommend it.
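The relationship between the three encodings is easy to see in code. This is a rough Python sketch, not a definitive test: it assumes Python's built-in `cp037` codec as a stand-in for EBCDIC (EBCDIC has many code pages; 037 is just one common variant), and `try_ascii` is a hypothetical helper written for this example.

```python
from typing import Optional

text = "Hello"

# Plain English text is byte-for-byte identical in ASCII and UTF-8.
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
backward_compatible = ascii_bytes == utf8_bytes

# EBCDIC assigns completely different byte values to the same letters.
# Assumption: code page 037 stands in for "EBCDIC" here.
ebcdic_bytes = text.encode("cp037")
round_trip = ebcdic_bytes.decode("cp037")


def try_ascii(s: str) -> Optional[bytes]:
    """Encode as ASCII, or return None if the text does not fit."""
    try:
        return s.encode("ascii")
    except UnicodeEncodeError:
        return None


# Non-English text does not fit in ASCII, but UTF-8 handles it.
cafe = "caf\u00e9"                    # "cafe" with an acute accent on the e
cafe_ascii = try_ascii(cafe)          # None: the accented e has no ASCII code
cafe_utf8 = cafe.encode("utf-8")      # the accented e becomes two bytes

print(backward_compatible, ebcdic_bytes.hex(), round_trip)
print(cafe_ascii, cafe_utf8)
```

The ASCII and UTF-8 bytes match, which is why ASCII-only programs keep working on English text, while the EBCDIC bytes share nothing with either and must be translated at the system boundary.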

Action Plan …

There are dozens or hundreds of other character encodings around the world, but a little knowledge of these three is probably all you will need. Check out the links in this article and keep these things in mind:

  • Your programs should read and honor the declared Content-Type of input.
  • Your programs should default to UTF-8 output.
  • Your programs should explicitly state the encoding that was used. Include a Content-Type in the file’s header.
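Here is one way those three rules might look in Python. Treat it as a sketch under assumptions: the function names are invented for this example, the UTF-8 fallback for input with no declared charset is my own choice, and the header parsing leans on the standard library's `email.message.Message`, which understands Content-Type parameters.

```python
from email.message import Message
from typing import Tuple


def charset_from_content_type(header: str, default: str = "utf-8") -> str:
    """Extract the declared charset from a Content-Type header value.

    Falls back to `default` when no charset parameter is present
    (the fallback is an assumption of this sketch).
    """
    msg = Message()
    msg["Content-Type"] = header
    return msg.get_content_charset(failobj=default)


def decode_body(body: bytes, content_type: str) -> str:
    """Rule 1: honor the declared Content-Type when reading input."""
    return body.decode(charset_from_content_type(content_type))


def encode_response(text: str) -> Tuple[str, bytes]:
    """Rules 2 and 3: write UTF-8 and say so explicitly."""
    return "text/plain; charset=utf-8", text.encode("utf-8")


# Input declared as Latin-1: byte 0xE9 is an accented e in that encoding.
incoming = decode_body(b"caf\xe9", "text/plain; charset=ISO-8859-1")

# Output is always UTF-8, with a Content-Type that states it.
content_type, payload = encode_response(incoming)
print(content_type, payload)
```

A real program would also need to decide what to do when the declared charset is unknown or the bytes do not match it; `bytes.decode` raises `LookupError` or `UnicodeDecodeError` in those cases.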

Do you have anything to add?  Just leave a comment.

Happy Programming!

2 Responses
  1. 2009 October 27

    W3C has a very good tutorial on this subject.

  2. 2010 January 28

    Google has an update on the growth of Unicode.

This work by Robert Claypool is licensed under a Creative Commons Attribution 3.0 United States.