In my previous articles, I described how the Unicode Consortium is assigning a code point to every character on the planet. Yes, that’s right: every symbol for every letter in every writing system gets a code point that looks like U+0639, where U+ means Unicode and 0639 is a hexadecimal number identifying the letter (in this case, the Arabic letter ain). Don’t assume that code points are limited to 16 bits, or 65,536 values. The Unicode code space runs from U+0000 to U+10FFFF, which leaves room for more than a million code points.
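To make the U+ notation concrete, here is a minimal Python sketch that prints the code point and official Unicode name for a few characters. The helper name describe is made up for this example; ord and unicodedata.name are standard Python.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the U+XXXX notation and the Unicode name of one character."""
    # ord() gives the code point as a plain integer; format it as hex.
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

print(describe("\u0639"))      # the Arabic letter used as the example above
print(describe("A"))           # a plain ASCII letter has a code point too
print(describe("\U0001F600"))  # a code point that does not fit in 16 bits
```

Note that the last character has a five-digit hexadecimal code point, which is why assuming 16 bits gets you into trouble.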
Here’s the next point: a code point has nothing to do with computer memory or storage. It’s just a number. A Unicode code point says nothing about how to store the character in memory or how to send it in an email. Encodings do that.
There are many different encodings that specify how to store a code point in memory. UTF-8 (Unicode Transformation Format, 8-bit) is the most widely used. UTF-8 stores the ASCII characters exactly the way we’ve stored them forever, one byte each, and it keeps the nice property that a zero byte only ever means the NUL character, so zero still works as a string terminator. If your strings are all standard 7-bit ASCII data, you don’t have to change a thing: they are already valid UTF-8.
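A short Python sketch shows the difference between a code point and its encodings: the same code point, U+0639, turns into different bytes depending on the encoding you pick, while plain ASCII text comes out byte-for-byte identical in ASCII and UTF-8. The helper show_bytes is a name invented for this illustration.

```python
def show_bytes(text: str, encoding: str) -> str:
    """Encode text and return the resulting bytes as space-separated hex."""
    return text.encode(encoding).hex(" ")

ain = "\u0639"  # one code point, several possible byte layouts
for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    print(f"{encoding:10} {show_bytes(ain, encoding)}")

# Plain ASCII data is already valid UTF-8: the bytes are identical.
ascii_text = "PLC"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print(same_bytes)

# Non-ASCII characters never produce a zero byte in UTF-8,
# so zero-terminated strings keep working.
has_zero = 0 in "caf\u00e9 \u0639".encode("utf-8")
print(has_zero)
```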
But there are other encodings. UTF-16 uses 16-bit units: one unit for most characters and a pair of units (a surrogate pair) for code points above U+FFFF. The older UCS-2 standard always uses exactly 2 bytes per character, so it cannot represent those higher code points at all (yes, that makes it different from UTF-16). There’s something called UTF-7, which uses only 7-bit characters so the high bit is always zero, for those systems that use the high bit for some other purpose. And there are probably others that I haven’t run across. The point here is that you can’t transmit a string or process an incoming string unless you know its encoding. That’s why you will occasionally see an email message or some other string that contains a long series of question marks. That usually means that the programmer didn’t bother to detect the encoding designation and interpret the string properly.
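Here is a rough Python sketch of what goes wrong when the encoding is ignored. It is only an illustration with made-up sample strings, not a description of what any particular mail program does. It also shows a character that needs two 16-bit units in UTF-16, which a fixed 2-byte scheme like UCS-2 has no way to store.

```python
# A code point above U+FFFF takes two 16-bit units (4 bytes) in UTF-16.
face = "\U0001F600"
utf16 = face.encode("utf-16-be")
print(len(utf16), utf16.hex(" "))

# Forcing non-ASCII text through an ASCII-only path: every character
# that cannot be represented is replaced by a question mark.
arabic = "\u0639\u0631\u0628\u064a"
lossy = arabic.encode("ascii", errors="replace")
print(lossy)

# Decoding bytes with the wrong encoding does not fail, it just
# produces the wrong characters.
wire = "caf\u00e9".encode("utf-8")
wrong = wire.decode("latin-1")  # receiver guessed Latin-1
right = wire.decode("utf-8")    # receiver used the declared encoding
print(wrong, right)
```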
In an email there is an indicator of the form:
Content-Type: text/plain; charset="UTF-8"
in the header of the email that explains to the receiver how to decode it. When a programmer ignores that kind of information, “???????????????” is what you’ll see.
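As a sketch of doing it properly, the following Python uses the standard email package to read the charset from the header and decode the body with it. The message itself, including the address and subject, is invented for the example.

```python
from email import message_from_bytes

# A made-up message as it might arrive off the wire (raw bytes).
raw = (
    b"From: plc@example.com\r\n"
    b"Subject: encoding test\r\n"
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: text/plain; charset="UTF-8"\r\n'
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    + "\u0639\u0631\u0628\u064a\r\n".encode("utf-8")
)

msg = message_from_bytes(raw)

# Read the charset parameter from the Content-Type header.
charset = msg.get_content_charset()

# Get the body as raw bytes, then decode it with the declared charset.
# Falling back to ASCII when no charset is declared is an assumption
# made for this sketch.
body = msg.get_payload(decode=True).decode(charset or "ascii")

print(charset)
print(body.strip())
```

The library lowercases the charset name, so compare it case-insensitively if you handle it yourself.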
For websites, it gets a little trickier. A web page can declare its encoding in an HTML meta tag (or the server can send it in the HTTP Content-Type header), but you can’t count on the declaration being there or being right. When it’s missing, the browser makes its best guess. It has some heuristics about how often certain byte patterns appear in a certain language and encoding, and it uses them to decide what characters to display. It works more often than not.
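A heavily simplified Python sketch of the first half of that process follows: look for a declared charset near the top of the page and fall back to a default when there is none. Real browsers do considerably more than this. The function name sniff_charset, the regular expression, the 1024-byte window and the windows-1252 fallback are all assumptions made for the illustration, and no statistical guessing is attempted.

```python
import re

# Matches both <meta charset="utf-8"> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
META_CHARSET = re.compile(
    rb"<meta[^>]+charset=[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE
)

def sniff_charset(page: bytes, fallback: str = "windows-1252") -> str:
    """Return the charset declared near the top of the page, else fallback."""
    match = META_CHARSET.search(page[:1024])  # only inspect the head
    if match:
        return match.group(1).decode("ascii").lower()
    return fallback

page = b'<html><head><meta charset="utf-8"></head><body>caf\xc3\xa9</body></html>'
encoding = sniff_charset(page)
text = page.decode(encoding, errors="replace")
print(encoding)
print(text)
```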
And that, my friends, concludes my three-part series on ASCII encoding. I hope the next time you are working with ASCII strings, you’ll make sure to communicate what encoding you’re using and use the proper encodings to decode strings you receive. If you are looking for a device to move ASCII data in and out of a PLC, please visit our website – you’ll find out why we at RTA are known as the ASCII guys.
If you’d like to read a very well-written article on this subject, Joel on Software – ASCII Encodings is a great resource for you.
ABOUT THE AUTHOR
John Rinaldi is owner and CEO of Real Time Automation (RTA) in Pewaukee, WI. Rinaldi founded Real Time Automation in 1989, a company dedicated to making industrial networking simple. With a focus on simplicity, US support, fast service, expert consulting and tailoring for specific customer applications, RTA has become a leading supplier of networking technology worldwide.
RTA is focused on moving your data. Rinaldi is not only a recognized expert in industrial networks and an automation strategist, but a speaker, blogger, author of more than 30 articles on industrial networking and author of four books.
Control engineers, system integrators and distributors use Real Time Automation products to move data around the factory floor. ASCII products for moving barcode, scale and RF data are only one segment of a large product portfolio. Learn more about RTA by signing up for our unique industry newsletter and follow RTA on LinkedIn. Contact Rinaldi on LinkedIn here.