In today’s software we see information in text form everywhere. Any text that isn’t delivered as media such as images or video is stored in the computer as numbers, and the encoding is what tells us how those numbers should be read. This becomes much more evident when sharing text across languages, where a person’s system may not support the encoding used. This is why we as developers should familiarize ourselves with the tools we have for encoding and how best to use them.
Some basic history and Unicode
The early encodings for text on computers used 7 or 8 bits per character. ASCII itself is a 7-bit encoding with 128 characters, and the ANSI, OEM, and many similar code pages extend it to 8 bits. If you used DOS before Windows, this is what you were seeing at the time. But using only 8 bits of data per character limits you to a maximum of 256 characters. This meant that most of the world’s languages could not work with the primary English encodings, as there wasn’t enough space left for the letters of their language. Because of that, many languages created their own encodings.
Since the introduction of Unicode there have been great strides towards standardization in both encodings and character sets. The most widely accepted and supported encoding is now UTF-8, a multi-byte character encoding that uses anywhere from 1 byte to 4 bytes per character. This gives us both the flexibility and efficiency we want in an encoding. UTF-8 also has no endianness. Endianness (little endian or big endian) is the name we give to the order in which the bytes of a multi-byte unit are stored: least significant byte first, or most significant byte first. UTF-8 is defined as a sequence of single bytes, so there is only one order to read it in.
Unicode has a variety of encodings available: UTF-7, UTF-8, UCS-2, UTF-16, UCS-4, and UTF-32. UCS stands for Universal Character Set and UTF stands for UCS Transformation Format. UCS-2 and UCS-4 use a fixed 2 or 4 bytes per character, so the number of characters in a string is simply its byte length divided by that width (UTF-32 is fixed-width in the same way as UCS-4). In other words, writing code for them is much easier because byte offsets map directly onto characters. The variable-width formats are different: in UTF-8 a lead byte can be followed by up to three continuation bytes, and UTF-16 uses either one or two 2-byte units per character, so you need to distinguish between byte length and character count (and, once combining marks are involved, grapheme count). This kind of issue doesn’t crop up that much in Ruby, but working in lower-level languages like Rust or C will make you really pay attention to this detail. Rust provides a character-boundary method (str::is_char_boundary) for checking as you go byte by byte, or you could use a grapheme iterator, available from a crate such as unicode-segmentation, if you prefer.
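To make the difference between byte length, character count, and grapheme count concrete, here is a minimal Ruby sketch. The helper name text_lengths is made up for this illustration; bytesize, length, and grapheme_clusters are standard String methods (grapheme_clusters needs Ruby 2.5 or later).

```ruby
# Hypothetical helper: report three different "lengths" for the same string.
def text_lengths(str)
  {
    bytes: str.bytesize,                     # raw storage size in bytes
    code_points: str.length,                 # what String#length counts
    graphemes: str.grapheme_clusters.length  # user-perceived characters
  }
end

# Four characters that need 1, 2, 3, and 4 bytes in UTF-8:
# "a", "ł" (U+0142), "€" (U+20AC), and an emoji (U+1F600).
["a", "\u0142", "\u20AC", "\u{1F600}"].each do |ch|
  puts "U+#{ch.ord.to_s(16).upcase}: #{ch.bytesize} byte(s)"
end

# "e" followed by a combining acute accent:
# 3 bytes, 2 code points, but only 1 grapheme.
lengths = text_lengths("e\u0301")
puts lengths[:bytes]
puts lengths[:code_points]
puts lengths[:graphemes]
```

In day-to-day Ruby you rarely need all three, but it helps to know which one a given method is counting before slicing or truncating text.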
UTF-7 and UTF-8 are endianless, whereas UCS-2, UTF-16, UCS-4, and UTF-32 have endianness. The Unicode standard provides a byte order mark, or BOM for short: the code point U+FEFF written as the first few bytes of the text, which indicates the order in which the bytes of each unit are stored. For example, little-endian UTF-16 starts with the marker 0xFF 0xFE, meaning the least significant byte of each 2-byte unit comes first. UTF-16BE flips those two bytes to indicate the opposite order. BOMs are an excellent feature, as they not only indicate byte order but also give us a huge hint at the encoding being used in case we didn’t know. UTF-32 follows the same principle, with 0xFF 0xFE 0x00 0x00 given for little endian order and the same bytes in reverse order for big endian.
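The byte order marks described above are easy to see from Ruby by encoding U+FEFF itself. The sketch below also includes a small BOM sniffer; both bom_bytes and sniff_bom are hypothetical helpers written for this example, not part of Ruby.

```ruby
# Bytes that U+FEFF (the byte order mark) becomes in a given encoding.
def bom_bytes(encoding_name)
  "\uFEFF".encode(encoding_name).bytes
end

# Hypothetical helper: guess an encoding name from a leading BOM, or nil.
# UTF-32 is checked first because the UTF-32LE mark begins with the same
# two bytes as the UTF-16LE mark.
def sniff_bom(data)
  bytes = data.bytes.first(4)
  return "UTF-32LE" if bytes == [0xFF, 0xFE, 0x00, 0x00]
  return "UTF-32BE" if bytes == [0x00, 0x00, 0xFE, 0xFF]
  return "UTF-8"    if bytes.first(3) == [0xEF, 0xBB, 0xBF]
  return "UTF-16LE" if bytes.first(2) == [0xFF, 0xFE]
  return "UTF-16BE" if bytes.first(2) == [0xFE, 0xFF]
  nil
end

p bom_bytes("UTF-16LE")  # => [255, 254]
p bom_bytes("UTF-16BE")  # => [254, 255]
p bom_bytes("UTF-32LE")  # => [255, 254, 0, 0]
p bom_bytes("UTF-32BE")  # => [0, 0, 254, 255]

p sniff_bom("\uFEFFhi".encode("UTF-16LE"))  # => "UTF-16LE"
p sniff_bom("hi")                           # => nil
```

A BOM is only a hint. Text without one can still be in any of these encodings, and UTF-16LE text whose first character after the BOM is U+0000 would be mistaken for UTF-32LE by a sniffer like this.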
Now we have a basic understanding of the most well-supported encodings from the Unicode standards. But even so, encodings of many other kinds still exist, and when our system tries to read characters without having the correct encoding we often get gibberish, commonly called mojibake, a Japanese word meaning roughly “character transformation”. So let’s get familiar with what Ruby has to offer for working with encodings.
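Mojibake is easy to reproduce on purpose, which also shows how it can sometimes be undone. This is a minimal sketch; garble and ungarble are names invented for the example, and ISO-8859-1 stands in for whatever wrong encoding a system might assume.

```ruby
# Label UTF-8 bytes as ISO-8859-1 (the wrong encoding), then transcode to
# UTF-8 so the misread characters can be displayed. This is how mojibake arises.
def garble(utf8_text)
  utf8_text.dup.force_encoding("ISO-8859-1").encode("UTF-8")
end

# Undo the damage: transcode back to the ISO-8859-1 bytes and relabel them
# as the UTF-8 they really were.
def ungarble(garbled)
  garbled.encode("ISO-8859-1").force_encoding("UTF-8")
end

garbled = garble("łał")
puts garbled.length              # 5 characters instead of the original 3
puts ungarble(garbled) == "łał"  # true
```

The round trip only works here because every byte value maps to some character in ISO-8859-1. With other wrong guesses, bytes can be lost or replaced along the way and the original text cannot be recovered.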
Ruby encoding methods
There are two primary methods used when converting or representing a string’s encoding:
String#force_encoding is a way of saying that we know the bytes for the characters are correct and we simply want to define how those bytes are to be interpreted as characters.
String#encode will transcode the bytes themselves, from whatever the string is currently encoded as to our target encoding. Here’s an example:
x = "łał"
x.encoding # => #<Encoding:UTF-8>
x.encode("UTF-16") # => "\uFEFF\u0142a\u0142"
x # => "łał"
x.bytes # => [197, 130, 97, 197, 130]
x.encode("UTF-16").bytes # => [254, 255, 1, 66, 0, 97, 1, 66]
x.force_encoding("ASCII-8BIT").bytes # => [197, 130, 97, 197, 130]
x # => "\xC5\x82a\xC5\x82"
The string above contains some non-ASCII characters. When we use the encode method, it returns a copy of the string with the underlying bytes changed, without modifying the original string. To change the original you may use String#encode!. With x.encode("UTF-16").bytes you can see how the bytes differ from the standard UTF-8 bytes. Notice the \uFEFF at the beginning after we called encode? That is the BOM, and it shows up as the bytes 254, 255. Can you guess which endian it picked from that? Pretty easy, right?
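The last part uses ASCII-8BIT, which is Ruby’s binary encoding. Any string given this encoding is treated as raw binary data. Note that force_encoding, unlike encode, changes the string in place, which is why x displays as escaped bytes afterwards.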
Here’s an example of the byte size being different from the character length.
"łał".size # => 3
"łał".bytesize # => 5
Different operating systems have different default character encodings, so programming languages need to support these. Ruby has the method Encoding.default_external, which reports the default encoding Ruby picked up from the current operating system’s environment.
Encoding.default_external # => #<Encoding:UTF-8>
Ruby uses UTF-8 as the default encoding for source files. Data read from files is tagged with the default external encoding, and if an internal encoding is set, Ruby will transcode the input from the external encoding to the internal one as it reads. You may set that globally with Encoding.default_internal (it is nil by default, which means no transcoding happens). Otherwise you can use specific IO encodings in your Ruby code.
File.open(filename, 'r:UTF-8', &:read)
# or
File.open(filename, external_encoding: "ASCII-8BIT", internal_encoding: "ASCII-8BIT") do |f|
  f.read
end
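To see the difference between tagging data with an external encoding and transcoding it to an internal one, here is a self-contained sketch that writes a small ISO-8859-2 file and reads it back both ways. The helper name read_both_ways and the choice of ISO-8859-2 are assumptions made for the example.

```ruby
require "tempfile"

# Hypothetical helper: write text to a temporary file as ISO-8859-2 bytes,
# then read it back with and without an internal encoding.
def read_both_ways(text)
  Tempfile.create("encoding-demo") do |file|
    file.binmode                             # write the bytes exactly as given
    file.write(text.encode("ISO-8859-2"))
    file.flush

    # External encoding only: the bytes are left alone and labelled ISO-8859-2.
    tagged = File.read(file.path, encoding: "ISO-8859-2")
    # External:internal pair: the bytes are transcoded to UTF-8 while reading.
    transcoded = File.read(file.path, encoding: "ISO-8859-2:UTF-8")

    [tagged, transcoded]
  end
end

tagged, transcoded = read_both_ways("łał")
p tagged.encoding      # => #<Encoding:ISO-8859-2>
p tagged.bytes         # => [179, 97, 179]
p transcoded.encoding  # => #<Encoding:UTF-8>
p transcoded.bytes     # => [197, 130, 97, 197, 130]
```

Setting the encodings per IO like this keeps the decision next to the code that reads the file, rather than relying on process-wide defaults.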
When you use String#encode or String#encode! you can skip the parameter to have it use the encoding set in Encoding.default_internal (if none is set, the string is left as it is). These methods can also take two encodings as the first parameters: the encoding to transcode to, followed by the encoding to transcode from.
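Here is a short sketch of the two-encoding form of String#encode, along with the standard undef, invalid, and replace options for characters the target cannot represent. The helper names latin2_to_utf8 and to_ascii_lossy are invented for the example.

```ruby
# Transcode bytes known to be ISO-8859-2 into UTF-8.
# Argument order is destination first, then source.
def latin2_to_utf8(bytes)
  bytes.encode("UTF-8", "ISO-8859-2")
end

# Transcode to US-ASCII, substituting "?" for anything it cannot represent
# instead of raising Encoding::UndefinedConversionError.
def to_ascii_lossy(text)
  text.encode("US-ASCII", undef: :replace, invalid: :replace, replace: "?")
end

# 0xB3 is "ł" in ISO-8859-2; String#b gives us an untagged binary copy.
legacy = "\xB3a\xB3".b

p latin2_to_utf8(legacy)  # => "łał"
p to_ascii_lossy("łał")   # => "?a?"
```

Passing the source encoding explicitly is handy when the string arrived as binary data and you know what it really is, since it avoids a separate force_encoding step.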
You can find encodings with Encoding.find, and after setting an encoding you can check it for validity.
Encoding.find("UTF-32") # => #<Encoding:UTF-32 (dummy)>
"łał".valid_encoding? # => true
"łał".force_encoding("UTF-16").valid_encoding? # => false
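A validity check is most useful when paired with a recovery step. The sketch below uses String#scrub (available since Ruby 2.1) to replace broken byte sequences; clean_utf8 is a hypothetical helper name, and treating all input as UTF-8 is an assumption of the example.

```ruby
# Hypothetical helper: return text that is safe to treat as UTF-8,
# replacing any broken byte sequence with "?".
def clean_utf8(raw)
  text = raw.dup.force_encoding("UTF-8")
  text.valid_encoding? ? text : text.scrub("?")
end

p clean_utf8("abc\xFF".b)  # => "abc?"
p clean_utf8("łał".b)      # => "łał"
```

Scrubbing loses information, so it is best kept for display or logging. When the real encoding is known, transcoding from it is the better fix.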
Ruby code files can have what’s called a magic comment at the top to indicate the encoding:
# encoding: UTF-8. If you’re working on a web application with a database, it’s also recommended to set the encoding for the database to
If you’re writing code that produces or evaluates web content or email content, then encodings are involved here as well. Quick tip for web pages: it’s best to set a meta tag immediately after the opening <head> tag.
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
In email you would use a header of
Content-Type: text/plain; charset="UTF-8"
A few low-level insights for Ruby
To get the clearest picture of what’s going on in Ruby when working with encoding, reading the source code will tell all… provided you understand C. Many methods check for ASCII compatibility first for performance reasons. Simple actions like merging two strings with different encodings may not matter too much if both contain only ASCII-compatible content. This isn’t to say that other encodings aren’t supported, only that when prioritizing performance, the most common character encoding comes first.
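The effect of that ASCII check is visible from plain Ruby when concatenating strings with different encodings. In this sketch try_concat is a hypothetical helper; ascii_only? and Encoding::CompatibilityError are part of Ruby.

```ruby
# Hypothetical helper: concatenate two strings, returning the error class
# name instead of raising when their encodings are incompatible.
def try_concat(left, right)
  left + right
rescue Encoding::CompatibilityError => e
  e.class.name
end

latin2_ascii = "abc".encode("ISO-8859-2")  # only ASCII characters
latin2_wide  = "ł".encode("ISO-8859-2")    # contains a non-ASCII byte

p latin2_ascii.ascii_only?         # => true
p latin2_wide.ascii_only?          # => false
p try_concat(latin2_ascii, "łał")  # => "abcłał"
p try_concat(latin2_wide, "łał")   # => "Encoding::CompatibilityError"
```

The first concatenation works because an ASCII-only string means the same thing in both encodings, so Ruby can safely adopt the other string’s encoding for the result.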
Lately, I’ve been integrating Ruby’s encoding support to Rust with the library Rutie. Rutie allows you to write Rust that works in Ruby and Ruby that works in Rust. This has led me to both read and write a lot of lower-level code which now allows me to give you some tips if you should do so.
If you’re writing a Ruby extension or a library/application that is integrated with the Ruby runtime through its C API, then it’s helpful to know that internally the identity values for encodings are generated at compile time. So some under-the-hood things in Ruby differ between Rubies on different systems, but this isn’t evident because at compile time everything is made to work well with itself. What this means, though, is that pre-built binaries of Ruby may have some subtle differences from what the system they’re installed on would have set. So the best bet for having Ruby work well when integrating with its C API is to compile Ruby for the system it’s on. This isn’t always necessary, as it varies from binary to binary. So for your continuous integration servers, keep an eye out for situations where “it works on my machine” for each Ruby version, but specific ones fail on the server. Change those to compile Ruby and this may very well fix your issue.
The encoding support in Rutie is hot off the press. I’m still learning many of the intricacies of Ruby’s encoding-specific C API support, but I find it doesn’t just work the way most everything else does. Ruby is enabled with only a few core encodings on the C API per operating system. Within Ruby itself, though, all supported encodings are loaded and work. It needs initialization to get full encoding support from the C API side, and that looks like:
ruby_init_loadpath();
rb_require("enc/encdb");
rb_require("enc/trans/transdb");
And it’s that easy. Even in Rust.
There are plenty of great sources to read up on this subject. Check out The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), which has a great overview of the history of encodings. There is also a very helpful online book on Unicode: Programming with Unicode.
Most libraries for detecting encoding in Ruby aren’t maintained anymore. This is probably a good sign, as working with encodings has gotten easier in Ruby over the years and perhaps the tools we have are pretty good. One that does seem worthwhile to look at is rchardet. It did have a recent update, so I can see the core maintainer is still here with us and involved.
Working with encoding support is fun when you know what you’re working with. But when you have to work with something obscure, that can be unpleasant. When working with systems that may integrate with alternative encodings, please be sure to record what the encoding is for the medium you receive if it’s not one of the globally accepted standards such as UTF. It’s better to have this info available at all times when processing text.
Move what you can to UTF-8 encoding if it’s not too much trouble and that should pay off well in the long run for many future possibilities. Unicode is a superb standard brought about from times of great need. And it’s a blessing we all needed.