SAP CHARACTER SETS

ABAP Character Set
Application Server ABAP supports only Unicode systems in the current release.

A Unicode system is an AS ABAP that is based on Unicode character representation with a code page for Unicode and also on a corresponding operating system and database.

A non-Unicode system is an AS ABAP with code pages for single-byte code and double-byte code. Non-Unicode systems are no longer supported in the current release.
Unicode (ISO/IEC 10646) with the character set UCS covers all existing characters. For the Unicode character set, there are different Unicode character representations, such as UTF-8, in which a character occupies between one and four bytes, UTF-16, in which a character occupies two or four bytes, or UCS-2, in which every character occupies exactly two bytes.
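
The byte counts mentioned above can be checked outside ABAP. The following is a minimal Python sketch, not ABAP code; the helper name `encoded_sizes` and the sample characters are chosen here purely for illustration.

```python
# Illustrative sketch (Python, not ABAP): how many bytes one character
# occupies in UTF-8 and in UTF-16. The sample characters are arbitrary.
samples = {
    "A": "LATIN CAPITAL LETTER A",
    "\u00e4": "LATIN SMALL LETTER A WITH DIAERESIS",
    "\u20ac": "EURO SIGN",
    "\U0001F47D": "EXTRATERRESTRIAL ALIEN (outside the BMP)",
}


def encoded_sizes(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for a single character."""
    # "utf-16-le" is used so that no byte order mark is added to the result.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for ch, name in samples.items():
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"U+{ord(ch):04X} {name}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

Characters of the Basic Multilingual Plane always take two bytes in UTF-16, which is why they can also be represented in UCS-2. Only the last sample needs four bytes in UTF-16.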

UTF-16 is the system code page of a Unicode system. It covers all characters of the Unicode standard.

The ABAP programming language supports the character representation UCS-2, which represents a subset of the characters represented by UTF-16. It covers the Basic Multilingual Plane (BMP) of the Unicode standard but not the characters of the surrogate area.
The restriction to UCS-2 in ABAP means that a character is always assumed to have a length of two bytes. Every valid UTF-16 encoded character string is also a valid UCS-2 encoded string (potentially representing different characters), but not every valid UCS-2 encoded string is a valid UTF-16 encoded string, because high and low surrogates can occur that are not part of a surrogate pair. This generally causes problems only if character strings are truncated in the middle of a character representation from the UTF-16 surrogate area, or if individual characters of character sets are compared in character string processing. Transformations of strings to external formats that expect valid Unicode characters, such as XML, can also raise exceptions.
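
The truncation problem can be illustrated with a small Python sketch (not ABAP). It splits a string into 16-bit code units, which is roughly how a UCS-2 based view sees the data, and checks whether a sequence of such units is still valid UTF-16. The function names `utf16_units` and `is_valid_utf16` are hypothetical helpers introduced only for this illustration.

```python
# Illustrative sketch (Python, not ABAP): a string cut in the middle of a
# surrogate pair is still a sequence of 16-bit units, but not valid UTF-16.

def utf16_units(text: str) -> list[int]:
    """Split text into its 16-bit UTF-16 code units."""
    raw = text.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def is_valid_utf16(units: list[int]) -> bool:
    """Return True if every surrogate in units is part of a high/low pair."""
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed directly by a low surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                return False
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate without a preceding high surrogate is invalid.
            return False
        else:
            i += 1
    return True


units = utf16_units("a\U0001F47Db")   # 'a', high surrogate, low surrogate, 'b'
print(len(units))                     # number of 16-bit units, not of characters
print(is_valid_utf16(units))          # complete string
print(is_valid_utf16(units[:2]))      # cut after the high surrogate
```

A truncation that is safe for UTF-16 consumers would have to move the cut position so that it does not fall between the two halves of a pair.
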
To be used in a Unicode system, an ABAP program must have the ABAP language version ABAP_STANDARD. Programs with the obsolete language version ABAP_NON_UNICODE can no longer be used in a Unicode system.

Latest notes:

The attribute CHARSIZE of system class CL_ABAP_CHAR_UTILITIES contains the number of bytes occupied by a character in the current system.

For regular expressions in PCRE syntax, it can be defined whether valid UTF-16 character strings are expected or not.

The Unicode version used for an AS ABAP can be seen in transaction SM51 -> Release Notes.

Before Unicode, SAP used different code pages for representing the characters of different scripts, such as the single-byte code pages ASCII and EBCDIC, or double-byte code pages:

ASCII (American Standard Code for Information Interchange) encodes each character with one byte. This means that a maximum of 256 characters can be represented (strictly speaking, standard ASCII encodes each character using only 7 bits and can therefore represent only 128 characters; the extension to 8 bits was introduced in ISO-8859). Examples of common code pages are ISO-8859-1 for Western European scripts and ISO-8859-5 for Cyrillic scripts.

EBCDIC (Extended Binary Coded Decimal Interchange Code) also encodes each character using one byte and can therefore also represent 256 characters. For example, EBCDIC 0697/0500 is an IBM format that was used on the AS/400 platform (now known as IBM System i) for Western European scripts.

Double-byte code pages require 1 to 2 bytes per character. As a result, up to 65536 characters can be represented, of which only 10000 to 15000 characters are normally used. For example, the code page SJIS is used for Japanese and BIG5 for traditional Chinese scripts. Using these character sets, all languages could be covered individually in one AS ABAP. Problems generally occurred when texts from different incompatible character sets were mixed in a central system. The exchange of data between systems with incompatible character sets could also lead to problems.
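
The incompatibility between such code pages can be made concrete with a short Python sketch (not ABAP). It assumes that Python's codecs `latin-1`, `iso8859_5`, `cp500`, and `shift_jis` are acceptable stand-ins for ISO-8859-1, ISO-8859-5, EBCDIC code page 500, and SJIS; the helper `reinterpret` is a hypothetical name used only here.

```python
# Illustrative sketch (Python, not ABAP): the same bytes mean different
# characters in different legacy code pages.

def reinterpret(text: str, written_as: str, read_as: str) -> str:
    """Encode text with one code page and decode the same bytes with another."""
    return text.encode(written_as).decode(read_as)


# The same letter has different byte values in ASCII and EBCDIC.
print("A".encode("ascii").hex(), "A".encode("cp500").hex())

# A Western European character written as ISO-8859-1 but read as ISO-8859-5
# turns into a Cyrillic letter.
print(reinterpret("\u00e4", "latin-1", "iso8859_5"))

# In a double-byte code page, a Latin letter takes one byte and a
# Japanese character takes two bytes.
print(len("A".encode("shift_jis")), len("\u65e5".encode("shift_jis")))
```

No error is raised in the second case, which is what made mixed texts in a central system hard to detect: the bytes are valid in both code pages and only the meaning changes.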

In earlier non-Unicode systems, the system code pages were defined in the database table TCPDB. In non-Unicode single code page systems there was only one system code page. In the obsolete MDMP systems, there were multiple system code pages.

Before Unicode support, many ABAP programming techniques expected one character to correspond to one byte. Therefore, before a non-Unicode system is converted to Unicode, ABAP programs must be changed wherever an explicit or implicit assumption is made about the internal length of a character. This mainly affects the following:

Character string processing and byte string processing

Access to structures

The latter is affected because flat structures in a program of the obsolete ABAP language version ABAP_NON_UNICODE have been handled like character-like data objects and some programming techniques have used this as well. The structural fragment view can be used to handle structures.

Before a program is switched to Unicode, the ABAP language version ABAP_STANDARD or higher must be configured in the program properties. For these versions, the Unicode checks are also executed in non-Unicode systems. The transaction UCCHECK supports the activation of these checks for existing programs. The program RSUNISCAN_FINAL can also be used instead of transaction UCCHECK.

Example:
The UTF-8 representation of the Unicode character EXTRATERRESTRIAL ALIEN is converted to its UTF-16 representation and stored in the text field surrogate_pair. Although the Unicode character EXTRATERRESTRIAL ALIEN is not contained in the Basic Multilingual Plane (BMP) of the Unicode standard, its UTF-16 representation (a surrogate pair) can still be stored as an ABAP character string. However, almost every string operation in ABAP handles the string simply as two UCS-2 characters with string length 2. This can cause problems when the data is to be interpreted as UTF-16 outside ABAP. The two offset/length accesses produce text fields of length 1 with the hexadecimal content 3DD8 and 7DDC. Since high and low surrogates can only be part of a surrogate pair and cannot appear on their own in a valid UTF-16 string, the text fields contain invalid UTF-16 strings. In a regular expression in PCRE syntax that is introduced with (*UTF), valid UTF-16 strings are expected and an exception occurs.
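
The byte values described above can be checked independently of ABAP. The following Python sketch is not the ABAP example itself. It assumes that the UTF-8 input is the byte sequence F09F91BD (the UTF-8 encoding of U+1F47D) and that the hexadecimal values 3DD8 and 7DDC reflect little-endian byte order, which is inferred here from the values. The helper `decodes_as_utf16` is a hypothetical name.

```python
# Illustrative sketch (Python, not ABAP): the surrogate pair of
# EXTRATERRESTRIAL ALIEN and its two halves as UTF-16 little-endian bytes.

utf8_bytes = bytes.fromhex("F09F91BD")   # assumed UTF-8 input
alien = utf8_bytes.decode("utf-8")       # one Unicode character, U+1F47D
utf16 = alien.encode("utf-16-le")        # four bytes: a surrogate pair

# Split into 16-bit units, which a UCS-2 based view treats as two characters.
code_units = [utf16[i:i + 2] for i in range(0, len(utf16), 2)]
print(len(code_units))
print([unit.hex().upper() for unit in code_units])


def decodes_as_utf16(data: bytes) -> bool:
    """Return True if data is a valid UTF-16 little-endian byte sequence."""
    try:
        data.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False


print(decodes_as_utf16(utf16))                          # complete pair
print([decodes_as_utf16(unit) for unit in code_units])  # each half on its own
```

A strict UTF-16 decoder rejects each half on its own, which mirrors the exception described for a PCRE regular expression introduced with (*UTF).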