Chapter 9 Code set conversion

omniORB supports full code set negotiation, used to select and translate between different character code sets when transmitting chars, strings, wchars and wstrings. The support is mostly transparent to application code, but there are a number of options that can be selected. This chapter covers the options, and also gives some pointers about how to implement your own code sets, in case the ones that come with omniORB are not sufficient.

9.1 Native code sets

For the ORB to know how to handle strings and wstrings given to it by the application, it must know what code set they are represented with, so it can properly translate them if need be. The defaults are ISO 8859-1 (Latin 1) for char and string, and UTF-16 for wchar and wstring. Different code sets can be chosen at initialisation time with the nativeCharCodeSet and nativeWCharCodeSet parameters. The supported code sets are printed out at initialisation time if the ORB traceLevel is 15 or greater.

For many applications, the defaults are fine. The most common non-default choice is to set the native char code set to UTF-8, allowing the full Unicode range to be supported in strings.
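As a rough illustration of selecting UTF-8 at initialisation time, the sketch below builds an option table in the { name, value } layout accepted by omniORB's extended CORBA::ORB_init. The helper option_value is hypothetical and exists only so the table can be inspected without the ORB. The string "UTF-8" is assumed to be the name the code set is registered under; the names printed at traceLevel 15 are the authoritative list. The same parameter can normally also be given on the command line as -ORBnativeCharCodeSet UTF-8.

```cpp
#include <cassert>
#include <cstring>

// Option table in the { name, value } layout taken by omniORB's
// four-argument CORBA::ORB_init; a { 0, 0 } entry terminates it.
static const char* orb_options[][2] = {
  { "nativeCharCodeSet", "UTF-8" },  // assumed registered name of the code set
  { 0, 0 }
};

// Hypothetical helper (not part of omniORB): return the value set for
// `name` in an option table, or a null pointer if it is absent.
const char* option_value(const char* options[][2], const char* name) {
  assert(options != 0 && name != 0);
  for (int i = 0; options[i][0] != 0; ++i) {
    if (std::strcmp(options[i][0], name) == 0) return options[i][1];
  }
  return 0;
}

// In a real program, which needs the omniORB headers and libraries:
//   CORBA::ORB_var orb = CORBA::ORB_init(argc, argv, "omniORB4", orb_options);
```

Adding a second row for nativeWCharCodeSet would follow the same pattern.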

Note that the default for wchar is always UTF-16, even on Unix platforms where wchar is a 32-bit type. Select the UCS-4 code set to represent characters outside the basic multilingual plane without having to use UTF-16 surrogates¹.
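For readers who do need to know what surrogates are, the following sketch shows how a single Unicode code point is encoded in UTF-16. It is standard Unicode arithmetic, not omniORB code, and the function name encode_utf16 is made up for this example. Code points above U+FFFF need two 16-bit units, which is the case that choosing UCS-4 avoids.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encode one Unicode code point as UTF-16 code units.
std::vector<std::uint16_t> encode_utf16(std::uint32_t cp) {
  // Valid code points are at most U+10FFFF and exclude the surrogate range.
  assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

  std::vector<std::uint16_t> units;
  if (cp < 0x10000) {
    // Basic multilingual plane: one unit, the value itself.
    units.push_back(static_cast<std::uint16_t>(cp));
  } else {
    // Supplementary planes: split the 20-bit offset into two 10-bit halves.
    std::uint32_t v = cp - 0x10000;
    units.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));    // high surrogate
    units.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
  }
  return units;
}
```

With UCS-4 as the native wchar code set, each such character occupies a single 32-bit element in application code instead.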

9.2 Default code sets

The way code set conversion is meant to work in CORBA communication is that each client and server has a native code set that it uses for character data in application code, and supports a number of transmission code sets that it uses for communication. When a client connects to a server, the client picks one of the server’s transmission code sets to use for the interaction. For that to work, the client plainly has to know the server’s supported transmission code sets.
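The client's choice can be pictured with a deliberately simplified sketch. It assumes code sets are identified by name and that the client takes the first of the server's transmission code sets that it also supports. The function name and the selection rule are illustrative assumptions; the full negotiation rules in the CORBA specification are more involved and also take the native code sets into account.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Simplified model of the client's choice: walk the server's transmission
// code sets in the order given and return the first one the client also
// supports. An empty string means there is no code set in common.
std::string pick_transmission_code_set(
    const std::vector<std::string>& server_tcs,
    const std::vector<std::string>& client_tcs) {
  // A client always supports at least one transmission code set.
  assert(!client_tcs.empty());
  for (const std::string& cs : server_tcs) {
    if (std::find(client_tcs.begin(), client_tcs.end(), cs) != client_tcs.end())
      return cs;
  }
  return std::string();
}
```

Note that the function cannot do anything useful when the server's list is empty, which is exactly the situation the rest of this section deals with.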

Code set information from servers is embedded in IORs. A client with an IOR from a server should therefore know what transmission code sets the server supports. This approach can fail for two reasons:

A corbaloc URI (see chapter 8) does not contain any code set information.
Some badly-behaved servers that do support code set conversion fail to put code set information in their IORs.

Similarly, a badly-behaved client may fail to specify code set information when it first uses a connection to a server, meaning that the server does not know the transmission code set used by the client.

The CORBA standard says that if a server has not specified transmission code set information, clients must assume that it supports only ISO-8859-1 for char and string, and does not support wchar and wstring at all. Similarly, a server that does not receive code set information from a client must assume that the client is using ISO-8859-1 for char and string, and will not send wchar and wstring. In practice, this means that DATA_CONVERSION or BAD_PARAM exceptions are thrown when the application tries to transmit data that these assumptions rule out.

To avoid these issues, omniORB allows you to configure default code sets that are assumed to be supported by the other end of the communication, if they are not otherwise known. Set defaultCharCodeSet for char and string data, and defaultWCharCodeSet for wchar and wstring data.
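The fallback described here can be modelled in a few lines. The sketch below is not omniORB's implementation: the function name is hypothetical, an empty advertised list stands for "no code set information received", and an empty configured_default stands for "defaultCharCodeSet not set". In that last case it falls back to the ISO-8859-1 assumption required by the standard.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Decide which char code sets the other end is taken to support.
//   advertised:         code sets learned from the IOR or the connection
//                       (empty if the peer supplied none)
//   configured_default: models the defaultCharCodeSet parameter
//                       (empty if not configured)
std::vector<std::string> assumed_char_code_sets(
    const std::vector<std::string>& advertised,
    const std::string& configured_default) {
  std::vector<std::string> result(advertised);
  if (result.empty()) {
    // Nothing known about the peer: use the configured default if there
    // is one, otherwise the assumption mandated by the CORBA standard.
    result.push_back(configured_default.empty() ? std::string("ISO-8859-1")
                                                : configured_default);
  }
  assert(!result.empty());  // the peer is always assumed to support something
  return result;
}
```

The same shape would apply to defaultWCharCodeSet, except that with no default configured the peer is assumed not to support wchar and wstring at all.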

9.3 Code set library

To save space in the main ORB core library, most of the code set implementations are in a separate library named omniCodeSets4. To use the extra code sets, you must link your application with that library. On most platforms, if you are using dynamic linking, specifying the omniCodeSets4 library in the link command is sufficient to have it initialised, and for the code sets to be available. With static linking, or platforms with less intelligent dynamic linkers, you must force the linker to initialise the library. You do that by including the omniORB4/optionalFeatures.h header. By default, that header enables several optional features. Look at the file contents to see how to turn off particular features.

9.4 Implementing new code sets

It is quite easy to implement new code sets, if you need support for code sets (or marshalling formats) that do not come with the omniORB distribution. There are extensive comments in the headers and ORB code that explain how to implement a code set; this section just serves to point you in the right direction.

The main definitions for the code set support are in include/omniORB4/codeSets.h. That defines a set of base classes used to implement code sets, plus some derived classes that use look-up tables to convert simple 8-bit and 16-bit code sets to Unicode.

When sending or receiving string data, there are a total of four code sets in action: a native char code set, a transmission char code set, a native wchar code set, and a transmission wchar code set. The native code sets are as described above; the transmission code sets are the ones selected to communicate with a remote machine. They are responsible for understanding the GIOP marshalling formats, as well as the code sets themselves. Each of the four code sets has an object associated with it which contains methods for converting data.

There are two ways in which a string/wstring can be transmitted or received. If the transmission code set in action knows how to deal directly with the native code set (the trivial case being that they are the same code set, but more complex cases are possible too), the transmission code set object can directly marshal or unmarshal the data into or out of the application buffer. If the transmission code set does not know how to handle the native code set, it converts the string/wstring into UTF-16, and passes that to the native code set object (or vice-versa). All code set implementations must therefore know how to convert to and from UTF-16.
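The two paths can be sketched with a toy model. Everything in it is hypothetical: the types NativeLatin1 and TransmitUtf8, their member functions, and the use of a name string to identify the native code set are stand-ins, not the classes declared in codeSets.h. The sketch also leaves out GIOP details such as the length prefix and terminating null, and handles only characters in the basic multilingual plane.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

typedef std::vector<std::uint16_t> Utf16;  // intermediate representation
typedef std::vector<std::uint8_t>  Wire;   // marshalled bytes (simplified)

// Hypothetical native code set object: it only knows UTF-16 conversion.
struct NativeLatin1 {
  const char* name() const { return "ISO-8859-1"; }
  Utf16 to_utf16(const std::string& s) const {
    Utf16 out;
    for (unsigned char c : s) out.push_back(c);  // Latin-1 maps 1:1 to U+0000..U+00FF
    return out;
  }
};

// Hypothetical transmission code set object for UTF-8.
struct TransmitUtf8 {
  // Direct path: succeeds only for a native code set this object can
  // handle itself, here just UTF-8, where the bytes are copied as-is.
  bool fast_marshal(const std::string& native_name, const std::string& s,
                    Wire& out) const {
    if (native_name != "UTF-8") return false;
    out.assign(s.begin(), s.end());
    return true;
  }
  // Indirect path: encode from the UTF-16 intermediate form.
  void marshal_utf16(const Utf16& u, Wire& out) const {
    for (std::uint16_t c : u) {
      assert(c < 0xD800 || c > 0xDFFF);  // surrogates are outside this sketch
      if (c < 0x80) {
        out.push_back(c);
      } else if (c < 0x800) {
        out.push_back(0xC0 | (c >> 6));
        out.push_back(0x80 | (c & 0x3F));
      } else {
        out.push_back(0xE0 | (c >> 12));
        out.push_back(0x80 | ((c >> 6) & 0x3F));
        out.push_back(0x80 | (c & 0x3F));
      }
    }
  }
};

// Try the direct path first; otherwise go through UTF-16.
Wire marshal_string(const NativeLatin1& ncs, const TransmitUtf8& tcs,
                    const std::string& s) {
  Wire out;
  if (!tcs.fast_marshal(ncs.name(), s, out))
    tcs.marshal_utf16(ncs.to_utf16(s), out);
  return out;
}
```

Unmarshalling would mirror this in the opposite direction: the transmission code set object either fills the application buffer directly or produces UTF-16 for the native code set object to convert.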

With this explanation, the classes in codeSets.h should be easy to understand. The next place to look is in the various existing code set implementations, which are files of the form cs-*.cc in the src/lib/omniORB/orbcore and src/lib/omniORB/codesets directories. Note how all the 8-bit code sets (the ISO 8859-* family) consist entirely of data and no code, since they are driven by look-up tables.
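To give a feel for why a table-driven code set needs no code of its own, here is a minimal sketch of the idea. The class TableCodeSet8 is hypothetical and does not reproduce the table layout used in the cs-*.cc files; in particular, the std::map used for the reverse direction is a simplification chosen for brevity. The only per-code-set input is the data passed to the constructor.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical table-driven 8-bit code set. All behaviour is generic;
// a particular code set is defined purely by its table contents.
class TableCodeSet8 {
public:
  // Start from the Latin-1 identity mapping, then apply the overrides
  // that distinguish this code set from Latin-1.
  explicit TableCodeSet8(const std::map<std::uint8_t, std::uint16_t>& overrides) {
    for (int i = 0; i < 256; ++i) to_utf16_[i] = static_cast<std::uint16_t>(i);
    for (const auto& kv : overrides) to_utf16_[kv.first] = kv.second;
    // Build the reverse table from the finished forward table.
    for (int i = 0; i < 256; ++i)
      from_utf16_[to_utf16_[i]] = static_cast<std::uint8_t>(i);
  }

  // 8-bit character to UTF-16: a single array look-up.
  std::uint16_t to_utf16(std::uint8_t c) const { return to_utf16_[c]; }

  // True if the UTF-16 unit has a representation in this code set.
  bool can_represent(std::uint16_t u) const { return from_utf16_.count(u) != 0; }

  // UTF-16 to 8-bit character; the caller must check can_represent first.
  std::uint8_t from_utf16(std::uint16_t u) const {
    assert(can_represent(u));
    return from_utf16_.find(u)->second;
  }

private:
  std::uint16_t to_utf16_[256];
  std::map<std::uint16_t, std::uint8_t> from_utf16_;
};
```

For instance, constructing it with the single override 0xA4 to U+20AC reproduces one of the differences between ISO 8859-15 and Latin-1, the position of the euro sign; a complete ISO 8859-15 table would list the remaining differences as well.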

1: If you have no idea what this means, don’t worry—you’re better off not knowing unless you really have to.