Perspectives
Mar 25, 2024

What is character encoding? Exploring Unicode, UTF-8, ASCII, and more…

As Seen In: Lokalise

Ilya Krukowski

In this article, we’ll explore different types of character encoding used in the world of information technology. We’ll delve into why these approaches to encoding are crucial, explain how they function, and highlight the key differences between ASCII, UTF-8, UTF-16, and the Unicode Standard. Understanding these differences is essential for correctly handling text in any software application.

Let’s dive in!

Why do we need character encoding?

We humans communicate using natural languages like English, French, and Japanese. Naturally, when we use computers for our daily tasks, we expect to continue using these languages. For instance, you might start your day by opening a work chat to discuss the day’s tasks with colleagues located around the globe. However, computers don’t inherently “understand” any of these natural languages because, at their core, they operate on a relatively simple principle.

CPUs and binary format

Inside your computer, one or more central processing units (CPUs) execute instructions in binary format. Picture a long tape filled with commands: fetch this number, add it to another number, double the result, and so on. While CPUs can “skip” to different parts of this tape for specific instructions, the fundamental process is straightforward and has remained largely unchanged for the past 50 years.

What exactly is “binary format”? Simply put, it’s all about zeros and ones (representing false and true, or no and yes). It’s crucial to remember that, despite the complexity of modern systems and the advent of artificial intelligence, the underlying hardware still operates on binary logic.

This presents a challenge: humans don’t naturally use binary to communicate, and computers don’t natively understand our languages. Hence, the need for character encoding arises—to translate our text into a format that computers can process and understand. While converting text might seem straightforward (humanity has used systems like Morse code for ages, after all), several nuances make it a complex endeavor.

Now, let’s jump into how various encoding types function.

ASCII

Forty years ago, the IT world was simpler. Hardware wasn’t as powerful, software wasn’t as complex, and personal computers were a luxury in many countries. Programmers primarily used English characters to write their programs. Additionally, many significant research and development efforts in information technology were taking place in the US. This led to a focus on how to accurately represent and encode Latin characters and symbols like dots, exclamation marks, and brackets.

How ASCII works

ASCII (the American Standard Code for Information Interchange) was first introduced in the 1960s for use with teletypes. Its concept is straightforward: assign numbers to each Latin character and some special characters. For instance, we agree that the number 65 represents “A”, 66 represents “B”, and so forth. But what about binary format? These numbers can easily be converted from decimal to binary and back, simplifying the encoding process.

ASCII uses numbers ranging from 0 to 127 to encode various characters. The codes from 0 to 31 are reserved for “special” characters, which have unique meanings for computers. Some of these are inspired by typewriter functions: backspace, new line, and carriage return. The codes from 32 to 127 represent letters and common symbols. For example, 32 represents a “space”, 40 is “(“, 70 is “F”, and 102 is “f”. Lowercase and uppercase letters have different codes to distinguish them. If you’re curious, a complete list of ASCII codes is available on Wikipedia.
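If you want to check these mappings yourself, here is a quick sketch in Python, which exposes the same numeric codes through its built-in ord() and chr() functions:

print(ord("A"))   # 65
print(ord(" "))   # 32, the space character
print(ord("("))   # 40
print(ord("F"))   # 70
print(chr(102))   # 'f', going from a code back to a character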

It’s all about bits

Okay, so now we know that ASCII uses numbers from 0 to 127. Know what that means? It means that any Latin character or special symbol can be encoded using only 7 bits of information. To put it simply, one bit can contain either 0 or 1 (since we’re working with binary format), and this is the smallest piece of information a computer can process. Since every bit has only 2 possible values, 7 bits give us 2 ** 7 = 128 different combinations; and because we count from 0, not from 1, the largest value is (2 ** 7) - 1 = 127. If you aren’t into the technical details, just take my word for it: numbers from 0 to 127 fit into 7 bits, and this is quite important.

While programmers could have kept things as they were, the number 7 is not too convenient: it’s odd rather than even, and, more importantly, computers of that era already worked perfectly well with 8 bits of information at a time. You probably also know that 8 bits make 1 byte. Therefore, it made perfect sense to use not 7, but 8 bits for ASCII encoding.

What would this mean? Let’s return to our formula and slightly modify it: (2 ** 8) - 1 = 255. We can see that by adding only a single bit of information, we have expanded the range of numbers significantly! Now we can use numbers from 0 to 255. Of course, developers noticed that while the numbers 0 to 127 were already reserved for “well-known” characters, the codes from 128 to 255 weren’t occupied at all. Thus, theoretically, these “free” codes could be used to fulfill your wildest fantasies…
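If you’d like to verify the arithmetic, here is a short sketch that uses nothing but Python’s built-in exponentiation and bin() function:

print(2 ** 7 - 1)         # 127, the largest value that fits into 7 bits
print(2 ** 8 - 1)         # 255, the largest value that fits into 8 bits (1 byte)
print(bin(65))            # 0b1000001, the code for 'A' needs only 7 bits
print(format(65, "08b"))  # 01000001, the same value padded to a full byte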

Extra codes for everyone

There’s a problem, however: people tend to have different fantasies. Thus, initially there was no agreement between PC manufacturers on what these “extra” codes should represent. For example, the IBM PC used codes 128 to 255 to display various lines, a square root symbol, and certain characters with diacritical marks like é. But computers produced in other countries might interpret these codes differently.

Now suppose you’re based in the US many moons ago and use an IBM PC to compose a text document. You type “regular” Latin characters, plus some characters with diacritics; for instance, the word “café” appears in the text. Then you send this document over to a colleague based somewhere in the USSR, who opens it on a computer from another manufacturer that treats the “extra” codes differently. What would happen in this case?

Well, all the “standard” Latin characters should be displayed properly, but there are no guarantees for the “fancy” ones. In other words, “café” might turn to something like “cafу”. What is this weird character that looks like a Latin “y”? This is actually a Cyrillic letter pronounced as “U”, and it’s used in Russian, Ukrainian, and some other languages.

The problem lies in the fact that computers produced in the late USSR treated the “extra” characters differently: specifically, they were used to represent Cyrillic characters and other extra symbols! To be precise, these PCs used a different encoding type called KOI-8, but it was still ASCII-compliant and followed the agreements for codes 0–127.

But this means that we have no idea how certain codes will be interpreted on different machines! This is surely a problem, and to overcome it specialists introduced a new approach.

ANSI code pages

To address the challenge of how “extra” codes are interpreted on different machines, engineers at the American National Standards Institute (ANSI) suggested standardizing these interpretations. They noted that manufacturers across various countries used codes 128–255 to denote characters specific to their local alphabets. This observation led to the introduction of “code pages”—a straightforward concept.

Essentially, we have a consensus on what codes 0–127 represent, so those remain unchanged. However, the meanings of codes 128–255 vary depending on the code page selected on a given computer. A code page acts as a reference table, defining what each code signifies. One of the most recognized code pages is CP-1252 (or Windows-1252), launched with Windows 3.1 in 1992. It supports the English alphabet and those of several European languages, such as Spanish, French, and Portuguese. For example, code 241 corresponds to the Spanish letter “ñ”.

For countries using Cyrillic alphabets (Russia, Ukraine, Serbia, and some others), CP-1251 was introduced; in Greece, CP-1253 was used; and so on. Simply put, the meaning of the “extra” codes changed based on which code page was currently enabled on your PC. Sound good? Well, not quite.
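A quick way to see code pages in action is to decode the very same byte with several of them. The sketch below uses Python’s built-in codecs, where "cp1252", "cp1251", and "cp1253" are simply the codec names for the Windows code pages mentioned above:

raw = bytes([241])           # a single "extra" code, 241 (hexadecimal F1)
print(raw.decode("cp1252"))  # ñ on the Western European code page
print(raw.decode("cp1251"))  # с (Cyrillic "es") on the Cyrillic code page
print(raw.decode("cp1253"))  # ρ (Greek rho) on the Greek code page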

Problems with code pages

The first problem is that only a single code page may be active at a time. This means that working with multiple languages (say, Greek and Serbian) in the same document is not really possible unless you resort to some awkward workarounds.

The second problem was specific to Asian countries whose writing systems contain many complex characters. In China, for example, there are literally thousands of them! As you can imagine, it is utterly impossible to represent all of them with a single code page when only 128 “extra” codes are available. To overcome this, local manufacturers had to use more complex approaches: some characters were encoded with 8 bits, while others occupied 16 bits (2 bytes). But this, in turn, led to other problems. For instance, if you are not sure how many bits each character occupies, how are you supposed to move backwards through your text? Of course, there were various solutions, but all in all the situation was becoming overly complex.
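To get a feel for this, here is a small sketch that uses Python’s gbk codec, one of the multi-byte encodings later used for Chinese, purely as an illustration of mixed character widths (the exact encodings used back then varied by vendor and era):

text = "hi中文"             # two Latin letters followed by two Chinese characters
encoded = text.encode("gbk")
print(len(text))            # 4 characters
print(len(encoded))         # 6 bytes: Latin letters take 1 byte each, Chinese characters take 2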

Moreover, the internet became more widespread, and people from different countries were exchanging text documents. It was high time to introduce a new, more reliable standard.

The Unicode standard

In the early 1990s, specialists introduced a groundbreaking standard called Unicode. Its mission was simple yet revolutionary: to devise a universal set of characters understandable by any computer worldwide, regardless of location or language. This innovation meant that, for the first time, a computer in America could seamlessly communicate with one in Japan. Unicode served as a global translator, enabling the sharing of texts in any language, from English and Russian to Arabic and even emojis, without confusion.

Representing characters

The concept behind Unicode is more sophisticated than that of ASCII due to the complexity of languages worldwide. Let’s talk a bit more about this as it is really important.

It’s clear to us humans that the Latin character “A” is different from “B” or “C”, and also it’s different from “a”. Moreover, it’s different from the Cyrillic character “А” even though they look identical. However, the base meaning of “A” remains unchanged, whether it’s in semibold or italics, and regardless of the font used. For instance, “hello” means the same in Arial or Helvetica.

But does this mean that we can safely ignore all the “stylings” applied to the letters? No! After all, we can clearly understand that “e” is not the same as “é”. In German, there’s a character written as ß. Can we say that this is just a “beautiful letter B”? Once again, no! This character, in fact, means “ss”. In Latvian there’s a character “ķ”. Can we say that this is just a fancy “k”? No way Jose. In Latvian, there’s a clear distinction between “k” and “ķ”. For example, the word “cat” translates to “kaķis”, not “kakis” (which sounds quite funny to locals). Pretty complex, right? And now think about all the languages used in Eastern countries…

Why am I talking about this? Because we have to understand that in many cases, information about how a letter looks and what “extra” features it contains is absolutely crucial. In other words, the task that Unicode inventors faced is much more complex than you might have thought.

How Unicode works

Unicode represents characters, symbols, and icons with “code points”. A code point looks like this: U+1234, where “U” stands for “Unicode” and the number is in hexadecimal format, easily convertible to binary for computer processing. The code points can be found on Unicode’s official website.

For example, U+1F60A means “a smiling face” 😊, whereas a Latin “A” is encoded as U+0041. But wait, at the beginning of this article, we learned that in ASCII “A” is represented by 65. Does this mean that Unicode is fully incompatible with good old ASCII? Well, don’t forget that in Unicode we’re dealing with hexadecimal numbers, whereas in ASCII we’re talking about decimals. If you convert 41 from hexadecimal to decimal, you’ll find that it equals 65! So, in fact, many “common” characters in Unicode have the same codes as they had in ASCII. We’ll talk a bit more about that in a moment.
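You can verify this hexadecimal/decimal relationship with a few Python built-ins; ord() returns a character’s code point, while hex(), int(), and chr() convert back and forth:

print(ord("A"))       # 65, the same value as in ASCII
print(hex(ord("A")))  # 0x41, the number behind U+0041
print(int("41", 16))  # 65, converting hexadecimal 41 to decimal
print(chr(0x1F60A))   # 😊, the character at code point U+1F60A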

For now it seems like Unicode is not too different from ASCII because, once again, we have a list of codes representing various characters. However, in Unicode it’s also possible to “modify” base characters by appending diacritical marks to them. For instance, we can take a regular Latin “e” (U+0065) and append a combining acute accent (U+0301), which results in “é”. Moreover, it’s possible to add multiple diacritics to a single character, thus “constructing” almost any combination you wish! Many typical combinations are also available premade under a single code: for example, “é” can be represented simply as U+00E9.
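Here is a minimal sketch of the two ways to build “é”, using Python’s standard unicodedata module to normalize the combined form into the premade one:

import unicodedata

combined = "e\u0301"    # 'e' followed by the combining acute accent U+0301
premade = "\u00e9"      # the ready-made 'é', U+00E9
print(combined, premade)        # both render as é
print(combined == premade)      # False: the underlying code point sequences differ
print(unicodedata.normalize("NFC", combined) == premade)  # True after normalization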

Currently, the Unicode code space provides 1,114,112 possible code points, and only a fraction of them have been assigned so far. In practice, this means you can represent virtually any character and any language with it, including fictional scripts from Star Trek or Lord of the Rings.

However, there’s something important to mention: the Unicode standard itself does not state how exactly these code points should be represented in our computers. In other words, how should these hexadecimal numbers be stored in memory or transferred via the internet? That’s a good question, and people have introduced a handful of different character encoding types that implement the Unicode Standard.

UTF-16 encoding

UTF-16 (which grew out of the earlier fixed-width UCS-2) was the first encoding suggested to represent Unicode texts. Its concept is really simple: store every code point as 16 bits (2 bytes) of information. That, in fact, is why the encoding is called UTF-16.

Endianness

Let’s take the English word “cat”. It has the following code points:

U+0063 U+0061 U+0074

To store these code points in a computer’s memory, we use the following approach:

00 63 00 61 00 74

Once again, these hexadecimal numbers can be easily converted to binary. Also, every such number perfectly fits into 16 bits. But here’s the deal: no one actually says that we must write all these numbers from left to right. In fact, we can invert them:

63 00 61 00 74 00

These two approaches are known as “big-endian” and “little-endian” — these terms were taken from Jonathan Swift’s Gulliver’s Travels. You can read a bit more about it on Wikipedia as it’s a pretty funny story.

From a theoretical perspective, it makes no difference whether we write the bytes from left to right or from right to left. However, some computer architectures historically use “big-endian” ordering, whereas others use “little-endian”. Don’t ask me why; it’s a long story that is not directly related to today’s topic. The fact is that different CPUs might use different “endianness” (though there are also processors that can work in both modes).

But if we have two ways of representing UTF-16-encoded strings, how are we supposed to know which endianness is used in each case? To carry this information, specialists introduced the “byte order mark” (or simply BOM), placed at the beginning of the text, with two possible values: FE FF and FF FE (the latter means the bytes have to be swapped). Once again, these are hexadecimal values.
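Python’s codecs make both the endianness and the BOM easy to observe. In the sketch below, utf-16-be and utf-16-le are the explicit big-endian and little-endian variants, while plain utf-16 prepends a BOM and uses the platform’s native byte order (little-endian on most machines):

print("cat".encode("utf-16-be").hex(" "))  # 00 63 00 61 00 74
print("cat".encode("utf-16-le").hex(" "))  # 63 00 61 00 74 00
print("cat".encode("utf-16").hex(" "))     # ff fe 63 00 61 00 74 00 on a little-endian machine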

Too many zeros!

Okay, we’ve solved the “endianness” problem, but one might ask another question: Why do I need to store all those extra zeros? Let’s take a look at our “cat” string again:

00 63 00 61 00 74

For every character, there are two zeros that seem quite redundant. In other words, if I could somehow get rid of those, I’d use half the space to store my string!

This is especially important for people who primarily use Latin characters (remember that programs are also usually written in English). Why? Well, I’ve already mentioned that Unicode is ASCII-compatible: it uses the same numbers to represent Latin characters, with “A” being 65, or U+0041. And since all Latin characters, plus the special characters, fit into the numbers from 0 to 127, they can easily be stored in a single byte of information, as I’ve already explained.

So, for people living in the US, Great Britain, and some other countries, it seems like we’re occupying double the space to store unneeded zeros!

Other UTF-16 issues

The third problem is that while Unicode respects ASCII, the UTF-16 encoding is not readily compatible with it. This is precisely because it uses 16 bits of information, whereas ASCII operates with only 8 bits. So even if “A” is represented by the same number in Unicode and in ASCII, that number is stored differently. In ASCII, you’d just say 65, or 41 in hexadecimal. In UTF-16, you have to say 0041 because two bytes of data must be used. Clearly, 41 is not the same as 0041, which meant that old documents still encoded in ASCII had to be converted to the new standard, and honestly, people were not very eager to do that.
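The mismatch is easy to demonstrate: the same letter produces different bytes under the two encodings. A quick Python sketch:

print("A".encode("ascii").hex())      # 41: one byte, exactly as in the ASCII days
print("A".encode("utf-16-be").hex())  # 0041: two bytes, which old ASCII-only tools can't handle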

The fourth concern is that a fixed 16 bits lets us encode only 2 ** 16 = 65,536 different values (codes 0 through 65,535). While this is quite generous, it’s nowhere near the million-plus code points that Unicode provides. (Modern UTF-16 works around this limit with “surrogate pairs” that combine two 16-bit units per character, but that adds yet more complexity.)

So, due to all these issues, UTF-16 was not too popular and people mostly ignored Unicode back in the day. Still, many understood that ASCII and code pages cause too many problems and thus a new encoding standard was introduced.

UTF-8 encoding

UTF-8 was designed in the early 1990s by Ken Thompson and Rob Pike, and the form we use today was standardized in 2003. Like UTF-16, it also implements the Unicode Standard, but in a somewhat different way. It’s crucial to clarify that UTF-8 is not synonymous with Unicode; rather, UTF-8 is a method for encoding the vast number of characters defined by the Unicode Standard.

So, how does UTF-8 work? As the number 8 implies, the data is stored in octets or, basically, bytes (as you remember, 1 byte equals 8 bits). However, a variable number of octets is used depending on the actual code point, in contrast to UTF-16 as described above, where every code point took two octets. Specifically, code points from U+0000 to U+007F (0 to 127 in decimal) occupy only a single byte. This solves the problem of the “extra” zeros found in UTF-16, and it makes UTF-8 compatible with standard ASCII characters from 0 to 127. Well, yeah, people in other countries still had to prepare their documents for the new standard, but unfortunately there wasn’t much to be done about it.

Characters with code points from U+0080 to U+07FF use two bytes, those from U+0800 to U+FFFF use three, and code points up to U+10FFFF require four bytes. This means that UTF-8 can encode every one of Unicode’s 1,114,112 possible code points, and the original design could stretch even further, although the standard now caps it at U+10FFFF. Not bad, eh?
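Here is a short Python sketch showing how the number of UTF-8 bytes grows with the code point:

for ch in ("A", "é", "中", "😊"):
    encoded = ch.encode("utf-8")
    print(ch, f"U+{ord(ch):04X}", len(encoded), "byte(s)")
# A U+0041 1 byte(s)
# é U+00E9 2 byte(s)
# 中 U+4E2D 3 byte(s)
# 😊 U+1F60A 4 byte(s)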

For this reason, UTF-8 has become immensely popular over the past couple of decades. I think it can even be considered a de facto standard nowadays.

Other encodings for Unicode

While UTF-8 and UTF-16 are among the most popular encodings for Unicode, they’re not the only ones out there. For instance, UTF-7 is a lesser-known encoding that was designed for systems only capable of handling 7 bits of data. Unlike UTF-8, UTF-7 ensures the highest bit of every byte is always zero, making it a workaround for older messaging systems that might strip higher-order bits.

Another example is UTF-32, which represents each code point with 4 bytes (32 bits) of data. This straightforward approach means that even the simplest Latin characters consume significantly more space compared to their ASCII representation, making UTF-32 less space-efficient for texts primarily using characters from the ASCII range.
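For comparison, here is a quick sketch of how much space the same word takes in UTF-8 versus UTF-32 (Python’s utf-32-be codec is used so that no BOM is added):

word = "cat"
print(len(word.encode("utf-8")))      # 3 bytes: one per character, same as ASCII
print(len(word.encode("utf-32-be")))  # 12 bytes: four per character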

Encoding, encoding everywhere!

So, now you understand that there are many different encoding types out there and that it’s really important to specify which one is being used for your data. Even “plain text” on your computer is still encoded somehow. If you have textual information but don’t know its encoding, that information is pretty much useless because it’s not clear how to interpret it.

For example, on web pages we use a special meta tag to define the document’s encoding:

<meta charset="utf-8">

If the encoding is absent or defined incorrectly, it can lead to various issues. For example, if you take a UTF-8 string and interpret it as Windows-1252, then guess what? Accented characters turn into garbage, and bytes that can’t be mapped at all end up displayed as question marks or the � replacement symbol. This is because there is no clean one-to-one mapping between UTF-8 byte sequences and the codes of a single-byte code page.
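You can reproduce this kind of mix-up in a few lines of Python; the sketch below takes UTF-8 bytes and deliberately decodes them with the wrong encoding:

data = "café".encode("utf-8")                  # b'caf\xc3\xa9'
print(data.decode("cp1252"))                   # cafÃ©: the wrong code page turns é into garbage
print(data.decode("ascii", errors="replace"))  # caf??: strict ASCII can't map the extra bytes at all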

By the way, you might wonder: if the HTML document reports its own encoding, then how are we supposed to start reading this document? In other words, to read the document, we need to know the encoding, but to know the encoding, we have to start reading it. This is a funny paradox, but in practice the meta tag is placed at the beginning of the document, so your browser can start reading it with a basic ASCII-compatible encoding, because the HTML preamble typically contains only Latin characters and special symbols like brackets. Then, after finding the information about the encoding, it simply switches to it automatically.

If the encoding is not stated at all, some browsers will try to “guess” it, which sometimes leads to unexpected results. That’s why, my friend, we should always declare our encoding and respect it.

Conclusion

Through this exploration, we’ve underscored the critical role of text encoding in information technology. We’ve delved into how ASCII, Unicode, UTF-8, UTF-16, and UTF-32 function and their distinguishing features. This journey through the world of encoding has been extensive, and I appreciate your company along the way.

Thank you for joining me, and I look forward to our next learning adventure.