Converting character widths

From LSDevLinux

Within FieldWorks, we often have to convert between different Unicode representations of text. The following are methods for doing this.

Converting to UTF-8

UTF-8 is the default encoding used by most Linux applications and libraries. 8-bit string types include:

  • glibmm - Glib::ustring
  • (potentially) char, uchar

The UnicodeString8 class can be used to convert 32- or 16-bit strings to 8-bit. The class inherits from std::string and can be constructed with a string and a length.

#include "UnicodeString8.h"                  // In Src/Generic
...
UnicodeString8 unistring(text, text_length); // std::string
unistring.c_str();                           // C-style UTF-8 string

Converting to UTF-16

UTF-16 is the default encoding on Windows. 16-bit character types include:

  • wchar_t on Windows
  • OLECHAR (Win32Base/include/Types.h)
  • ICU UChar
  • FieldWorks wchar (Generic/common.h) (just in FW, or everywhere?)
  • FieldWorks achar (Generic/common.h)
  • FieldWorks WCHAR (Win32Base/include/Types.h)

Note that you can get the length (in UTF-16 code units) of a UTF-16 wchar* string using ICU's UnicodeString::length():

int length = UnicodeString(some_wcharstar_string).length();

Converting to UTF-16 from platform wchar_t with OleStringLiteral

wchar_t (and wide literals such as L"baz") is 16 bits on Windows but 32 bits on UNIX. Platform-independent code that leaves the 16-bit Windows wchar_t untouched while converting the 32-bit UNIX wchar_t to 16-bit can be written with OleStringLiteral.

OleStringLiteral converts a 32-bit wchar_t* or string literal to UTF-16 once and caches the result. Through an implicit conversion operator it yields the converted string, so it can be given in place of its string literal wherever an OLE string is needed.

#include ".../FieldWorks/Src/Generic/OleStringLiteral.h"

static const OleStringLiteral literal1 = L"baz";
static const OleStringLiteral literal2 = _T("baz");
const wchar_t* platformDependentBitlengthVariable = L"baz baz baz"; // 16-bit on Windows, 32-bit on UNIX
static const OleStringLiteral always16bitVariable = platformDependentBitlengthVariable;

Note that OleStringLiteral is designed to be used in static variables, to gain the benefit of caching the converted string. Since string literals are static anyway, this should not be at all restrictive.

Converting to UTF-16 using UniStr

A UniStr object can be constructed from a pointer to 8-, 16-, or 32-bit data.

Converting to UTF-32

32-bit character types include:

  • wchar_t on UNIX
  • ICU UChar32
  • FieldWorks wwchar (Generic/common.h) (just in FW, or everywhere?)
  • FieldWorks TCHAR (Win32Base/include/Types.h) - platform-dependent. Types.h defines "typedef wchar_t TCHAR;", so on UNIX builds TCHAR is 32-bit. On Windows, TCHAR is WCHAR (16-bit) if UNICODE is defined, and char otherwise.


From UTF-8

From UTF-16