What the UTF? ASCII Here. Part II

UTF-8

In this edition of my dissertation on ASCII I am going to delve into the background of Unicode. This is the second of two parts of my presentation on ASCII (part I here). ASCII data is something we encounter every single day, but most of us don’t really appreciate the complexity behind it or what’s happening behind the scenes. If you think this isn’t important, you’d be wrong. ASCII data moves a lot of important information on the factory floor, and that’s the reason our ASCII to PLC Gateway is so popular. Our customers also move a lot of ASCII data over EtherNet/IP™ and ProfiNet IO. ASCII still rules, though software programmers may not want to hear that.

I’d venture many software programmers read my blog. But I’d also venture that of the thousands or tens of thousands of programmers working in the industrial automation industry, only a small handful, if that, would be able to develop a good internationally capable automation application. Ask them to support Japanese, Malaysian or Indonesian and they’d be lost. The reason is that they don’t understand the Unicode character set.

Last month I talked about code pages, which specified what the character values from 128 to 255 looked like, and the mess that came of that. Everybody had a code page for their particular language implementation. I described how there were thousands and thousands of these code pages and how it sort of worked. For example, if your Greek document was never transported outside Greece, it would work, since everyone there used the same code page. But once the Internet happened and some of those Greek documents ended up in Holland, the text looked like something you wrote in Chinese in a drunken stupor.
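Here is a minimal sketch of that code page problem. The choice of cp1253 (Windows Greek) and cp1252 (Windows Western European) is my own assumption for illustration; any two code pages that disagree above 127 would show the same thing.

```python
# The same three bytes, read under two different code pages.
raw = bytes([0xE1, 0xE2, 0xE3])

greek = raw.decode("cp1253")    # Windows Greek code page
western = raw.decode("cp1252")  # Windows Western European code page

print(greek)    # αβγ
print(western)  # áâã
```

The bytes never changed. Only the table used to interpret them did, which is why the document looked fine in Greece and wrong in Holland.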

Few people appreciate how hard it is to encode characters in a computer. Linguists might, but they’d be the only ones. Do you know that in some languages there are letters that change shape when they appear at the end of a word? Tell me, is that two different letters? Or is it the same letter? In Arabic they consider that the same letter. In Hebrew, they consider it a new letter. A proverbial Tower of Babel.

Let’s look at the core of Unicode (www.unicode.org). In the Unicode representation, every symbol is represented by something called a code point, written in the form U+xxxx where xxxx is a hexadecimal value. The English A has been assigned U+0041. The Limbu script, with letters I can’t pronounce, has its own block of code points starting at U+1900. The incredibly persistent people at the Unicode Consortium have tediously mapped every single letter and symbol of every language into a code point for years now. They have tables and tables of mappings on their website. It’s fascinating to read, that is if you’re a fan of the most boring movie of all time (An Affair to Remember with Cary Grant and Deborah Kerr, in my opinion).
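If you want to see code points for yourself, here is a short sketch using Python’s standard library. The helper name code_point is hypothetical, made up for this example; ord and unicodedata.name are the real built-ins doing the work.

```python
import unicodedata

def code_point(ch: str) -> str:
    """Format a single character in the U+XXXX notation (at least four hex digits)."""
    return f"U+{ord(ch):04X}"

for ch in ["A", "\u00e9", "\u20ac"]:
    # Prints the code point followed by the official Unicode character name.
    print(code_point(ch), unicodedata.name(ch))

# U+0041 LATIN CAPITAL LETTER A
# U+00E9 LATIN SMALL LETTER E WITH ACUTE
# U+20AC EURO SIGN
```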

These Unicode code points are mappings and only mappings. The tables don’t say anything about how these code points are stored in memory. They don’t describe if they are big-endian (most significant byte first) or little-endian (least significant byte first) or how many bytes they occupy. This is where things got really crazy. They invented a byte order mark to precede a string of code points. An “FE FF” is the standard indicating the string is big-endian. If you read a string with two leading bytes of “FF FE”, you would know it is little-endian.
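A byte order mark can be checked in a few lines. This is a sketch that assumes the data is UTF-16; the function name detect_utf16_order is hypothetical, while the codecs.BOM_UTF16_BE and codecs.BOM_UTF16_LE constants come from Python’s standard library.

```python
import codecs

def detect_utf16_order(data: bytes) -> str:
    """Return 'big-endian', 'little-endian', or 'unknown' based on a leading BOM.

    Assumes UTF-16 input; other encodings have their own marks.
    """
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "little-endian"
    return "unknown"

# The letter A (U+0041) stored both ways, each with its byte order mark.
big = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")
little = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")

print(big.hex(" "), detect_utf16_order(big))        # fe ff 00 41 big-endian
print(little.hex(" "), detect_utf16_order(little))  # ff fe 41 00 little-endian
```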

Great, huh?

Americans for the most part didn’t like it. All our strings were much longer but didn’t encode any additional information. No juice for the squeeze to implement these sophisticated code mappings, you might say. With two-byte code units, the string HELLO went from 5 bytes to 12 bytes: a two-byte byte order mark, plus a zero byte in front of each of the five letters. We lost 7 bytes of memory and gained nothing. For a long time it was just ignored by American programmers.
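The arithmetic is easy to check. This sketch assumes the two-byte encoding in question is UTF-16 with a leading byte order mark, which is my reading of the 12-byte figure.

```python
import codecs

text = "HELLO"

ascii_bytes = text.encode("ascii")
# "utf-16-be" writes no byte order mark, so prepend one explicitly.
utf16_bytes = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

print(len(ascii_bytes), ascii_bytes.hex(" "))
# 5 48 45 4c 4c 4f
print(len(utf16_bytes), utf16_bytes.hex(" "))
# 12 fe ff 00 48 00 45 00 4c 00 4c 00 4f
```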

But eventually American programmers looked at this and did what we always do. We made things easier for us. We embraced the UTF-8 standard, where our most used characters (00 to 7F) would still be a single 8-bit byte, but all those “funny” characters at 80 and above would use two or more bytes (up to four). Surprise! Our standard ASCII strings that we’ve used for the last forty years are identical to their UTF-8 encoding. Only the funny characters at 80 and above have to change, which is a very small percentage of our text strings.
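Here is a small sketch of how UTF-8 sizes work out in practice. The helper utf8_length and the sample characters are my own picks for illustration; code points below 80 hex take one byte, and everything from 80 hex up takes two to four.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 needs for a single character."""
    return len(ch.encode("utf-8"))

samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
for ch in samples:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", utf8_length(ch), encoded.hex(" "))

# U+0041 1 41
# U+00E9 2 c3 a9
# U+20AC 3 e2 82 ac
# U+1F600 4 f0 9f 98 80

# Plain ASCII text is byte-for-byte identical in UTF-8.
same = "HELLO".encode("utf-8") == "HELLO".encode("ascii")
print(same)  # True
```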

Worked great for us, but not so well for the rest of the universe. If you’re encoding Klingon letters you’ll have to use multiple bytes and work hard at the translation, but that’s not our problem.

There’s more to the UTF standards and I’ll talk about that in my next ASCII column.

ABOUT THE AUTHOR

John Rinaldi is owner and CEO of Real Time Automation (RTA) in Pewaukee, WI. Rinaldi founded Real Time Automation in 1989, a company dedicated to making industrial networking simple. With a focus on simplicity, US support, fast service, expert consulting and tailoring for specific customer applications, RTA has become a leading supplier of networking technology worldwide.

RTA is focused on moving your data. Rinaldi is not only a recognized expert in industrial networks and an automation strategist, but a speaker, blogger, author of more than 30 articles on industrial networking and author of four books.

Control engineers, system integrators and distributors use Real Time Automation products to move data around the factory floor. ASCII products used for moving barcode, weigh scale and RF data are only one segment of a large product portfolio. Learn more about RTA by signing up for our unique industry newsletter and follow RTA on LinkedIn. Contact Rinaldi on LinkedIn here.
