ABAP_REGEX - Incompatibilities Between POSIX and PCRE
This topic lists all features of POSIX regular expressions that cannot be reused directly in PCRE and instead require some migration effort, namely rewriting the regular expressions.

Migrating Patterns
For the most part the features supported by PCRE form a superset of the features supported by POSIX. There are however some key differences and missing features, which are outlined in the following sections.

Fundamental Differences
Both PCRE and POSIX use a regex-directed, backtracking algorithm, meaning both implementations will in most cases yield the same result. There is however a crucial difference: PCRE will always return the leftmost match, while POSIX aims to return the leftmost longest match, meaning that if multiple possible matches start at the same offset, the longest of those is returned.
If you are making use of the leftmost longest matching rule in POSIX, you may need to reorder or rewrite parts of your regular expression to achieve the same results in PCRE.

ABAP_EXAMPLE_VX5
PCRE stops after finding the first (leftmost) match, while POSIX also tries the other match starting at the same position and, as it is longer, considers it the better match.
ABEXA 01482
To also return the longest match in the PCRE case, the example above can be rewritten as follows, reordering the alternations:
ABEXA 01483
However the different matching strategies do not only affect alternations introduced by |, but all cases where multiple matches start at the same location, for example using the ? quantifier:
ABEXA 01484
In this case, a look-ahead assertion can be used to also return the longest match in the PCRE case:
ABEXA 01485
ABAP_EXAMPLE_END
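
The difference between the two matching rules can be sketched with a small alternation. The snippet below is only an illustration and not one of the original examples: the pattern, the input text, and the variable names are assumptions, and the statements are meant to be placed inside a method or report on a release that supports the pcre parameter.

```abap
" Assumed sample input; any text where two alternatives start at the
" same offset would do.
DATA(text) = `foobar`.

" POSIX applies the leftmost-longest rule, so the longer alternative wins.
DATA(posix_match) = match( val = text regex = `foo|foobar` ) ##regex_posix.

" PCRE tries the alternatives from left to right and stops at the first
" one that succeeds.
DATA(pcre_match) = match( val = text pcre = `foo|foobar` ).

" Putting the longer alternative first restores the POSIX result in PCRE.
DATA(pcre_reordered) = match( val = text pcre = `foobar|foo` ).

ASSERT posix_match    = `foobar`.
ASSERT pcre_match     = `foo`.
ASSERT pcre_reordered = `foobar`.
```

Reordering works here because one alternative is a prefix of the other. For patterns where the longest match depends on later parts of the expression, reordering alone may not be enough.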

Significance of Whitespaces in Patterns
By default, PCRE syntax is compiled in extended mode on AS ABAP: most unescaped whitespace (blanks and line breaks) in the pattern is ignored outside character classes. To include whitespace in a pattern, it must be escaped. To match whitespace explicitly in PCRE's extended mode, there are the following options:

Escape the whitespace in the pattern. The pattern Hello\ World matches Hello World.

Match all whitespace using the special character \s. Hello\sWorld matches Hello World. The same applies to Hello \s World, which might be more readable.
While the extended mode allows you to write more readable regular expressions, it can be a bit confusing at first, especially when migrating POSIX regular expressions. The extended mode of PCRE can be switched off as follows:

By passing ABAP_FALSE to the parameter EXTENDED when creating a PCRE regular expression with method CREATE_PCRE of class CL_ABAP_REGEX.

By using the special character (?-x) in the pattern itself. This also works for the addition PCRE in statements and the parameter pcre in string functions.

ABAP_EXAMPLE_VX5
The extended mode for PCRE is enabled when using parameter pcre in the following function. This means that whitespace characters are treated as not significant when the pattern is evaluated. The PCRE regular expression does not match the string Hello World.
ABEXA 01486
The string HelloWorld, however, is matched by PCRE but not by POSIX:
ABEXA 01487
Finally, the following example shows how the extended mode can be switched off in built-in string functions:
ABEXA 01488
ABAP_EXAMPLE_END
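
As a compact summary of the options described in this section, the following sketch checks one input against several spellings of the pattern. It is an illustration under assumptions, not one of the original examples: the literals are made up, and the call to CREATE_PCRE uses only the parameters PATTERN and EXTENDED.

```abap
" Extended mode is on by default: the unescaped blank in the pattern is
" ignored, so the pattern behaves like 'HelloWorld'.
ASSERT NOT matches( val = `Hello World` pcre = `Hello World` ).
ASSERT matches( val = `HelloWorld` pcre = `Hello World` ).

" Option 1: escape the blank.
ASSERT matches( val = `Hello World` pcre = `Hello\ World` ).

" Option 2: use \s; the surrounding blanks are still ignored.
ASSERT matches( val = `Hello World` pcre = `Hello \s World` ).

" Switch extended mode off inside the pattern.
ASSERT matches( val = `Hello World` pcre = `(?-x)Hello World` ).

" Switch extended mode off when creating the regex object.
DATA(regex) = cl_abap_regex=>create_pcre( pattern  = `Hello World`
                                          extended = abap_false ).
ASSERT regex->create_matcher( text = `Hello World` )->match( ) = abap_true.
```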

Comments
In the extended mode of PCRE, comments can be placed behind an unescaped #. To include the character # in a pattern in PCRE's extended mode, it must be escaped: The pattern Hello\#World matches Hello#World.
The extended mode of PCRE can be switched off as explained in the preceding topic.

ABAP_EXAMPLE_VX5
The extended mode for PCRE is enabled when using parameter pcre in the following function. This means that the character # introduces a comment. The first PCRE regular expression does not match the string Hello#World. The POSIX regular expression matches the string, as do the second and third PCRE regular expressions, in which # is escaped or the extended mode is switched off.
ABEXA 01559
ABAP_EXAMPLE_END
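
The behavior described above can be sketched as follows. The literals are assumptions chosen to mirror the description, and the snippet is not the original example.

```abap
" Unescaped # starts a comment in extended mode, so the pattern is
" effectively 'Hello' and does not match the whole string.
ASSERT NOT matches( val = `Hello#World` pcre = `Hello#World` ).

" Escaping # makes it a literal character again.
ASSERT matches( val = `Hello#World` pcre = `Hello\#World` ).

" Switching extended mode off has the same effect.
ASSERT matches( val = `Hello#World` pcre = `(?-x)Hello#World` ).

" POSIX has no extended mode; # is always a literal character.
ASSERT matches( val = `Hello#World` regex = `Hello#World` ) ##regex_posix.
```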

Unicode Handling
For the representation of character strings, the ABAP programming language supports the two-byte Unicode character representation UCS-2. The system code page of an AS ABAP is UTF-16, which supports all characters of the Unicode standard. UCS-2 is a subset of UTF-16 that covers the so-called Basic Multilingual Plane (BMP) of the Unicode standard. In UTF-16, the characters of the other Unicode planes are encoded as surrogate pairs in the surrogate area.
POSIX regular expressions always assume UCS-2 and handle characters that are represented by surrogate pairs as two separate characters, which might lead to unexpected results. Unlike POSIX, PCRE can handle character strings as either UCS-2 or UTF-16. This can be configured in different ways depending on the type of regular expression operation performed:

| Operation | Description | Default Behavior |
| --- | --- | --- |
| Methods of system classes CL_ABAP_REGEX and CL_ABAP_MATCHER | Unicode handling is controlled by parameter UNICODE_HANDLING of factory method CREATE_PCRE. The following values can be passed: STRICT: handle character string as UTF-16, raise an exception upon encountering invalid UTF-16 (broken surrogate pairs); IGNORE: handle character string as UTF-16, ignore invalid UTF-16, parts of the input that are not valid UTF-16 cannot be matched in any way; RELAXED: handle character string as UCS-2, special character \C is enabled in patterns, but matching surrogate pairs by their Unicode code point is no longer possible | STRICT |
| Addition PCRE of statements FIND and REPLACE; argument pcre of built-in functions for strings | No addition exists to control Unicode handling. Instead, the syntax (*UTF) can be specified at the start of the pattern to switch on the strict mode (see above) | Without (*UTF), the relaxed mode (see above) is used; the special character \C cannot be used, however |
The following table gives a quick overview of which Unicode mode to use when migrating a pattern from POSIX to PCRE:

| Operation | Handle Input as UCS-2 or UTF-16? | Accept Invalid UTF-16? | Action |
| --- | --- | --- | --- |
| Methods of system classes CL_ABAP_REGEX and CL_ABAP_MATCHER | UTF-16 | Yes | Set UNICODE_HANDLING to IGNORE |
| Methods of system classes CL_ABAP_REGEX and CL_ABAP_MATCHER | UTF-16 | No | Set UNICODE_HANDLING to STRICT (default) |
| Methods of system classes CL_ABAP_REGEX and CL_ABAP_MATCHER | UCS-2 (ABAP default) | - | Set UNICODE_HANDLING to RELAXED |
| Statements and built-in functions | UTF-16 | Yes | This cannot be achieved with the addition PCRE of statements and the argument pcre of built-in functions; use objects of CL_ABAP_REGEX |
| Statements and built-in functions | UTF-16 | No | Add syntax (*UTF) to the pattern |
| Statements and built-in functions | UCS-2 (ABAP default) | - | No action required, relaxed mode is default |

ABAP_EXAMPLE_VX5
The special character . matches two UCS-2 characters in the first two replacements, even though they form a surrogate pair for a single UTF-16 character. The third replacement uses (*UTF) at the beginning of a PCRE regular expression, so only the UTF-16 character is matched and replaced.
ABEXA 01490
ABAP_EXAMPLE_END
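
A minimal sketch of the three cases is shown below. It is not the original example. It assumes that a character outside the BMP (here U+1F600) can be produced by converting its UTF-8 bytes with CL_ABAP_CONV_CODEPAGE; any other way of obtaining a string that contains a surrogate pair would work as well. The results in the comments restate the behavior described above and have not been verified separately.

```abap
" Assumption: build a string holding one surrogate pair from UTF-8 bytes.
DATA(utf8) = CONV xstring( `F09F9880` ).
DATA(text) = cl_abap_conv_codepage=>create_in( )->convert( source = utf8 ).

" POSIX always works on UCS-2: each surrogate is replaced separately,
" giving two replacement characters.
DATA(posix) = replace( val = text regex = `.` with = `x` occ = 0 ) ##regex_posix.

" PCRE without (*UTF) runs in relaxed mode and behaves the same way.
DATA(relaxed) = replace( val = text pcre = `.` with = `x` occ = 0 ).

" PCRE with (*UTF) treats the surrogate pair as one character,
" giving a single replacement character.
DATA(strict) = replace( val = text pcre = `(*UTF).` with = `x` occ = 0 ).
```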

Matching Uppercase and Lowercase Letters
PCRE does not directly support the POSIX syntax \u and \l to match an uppercase and lowercase letter respectively. This includes the corresponding negations \U and \L.
As an alternative, PCRE's \p{xx} and \P{xx} syntax can be used to match characters having certain Unicode character properties:

| Description | POSIX Syntax | PCRE Syntax |
| --- | --- | --- |
| uppercase letter | \u | \p{Lu} |
| not an uppercase letter | \U | \P{Lu} |
| lowercase letter | \l | \p{Ll} |
| not a lowercase letter | \L | \P{Ll} |

ABAP_EXAMPLE_VX5
The following replacements yield the same result.
ABEXA 01493
ABAP_EXAMPLE_END
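
For illustration, the following sketch replaces every uppercase letter once with the POSIX syntax and once with the Unicode property syntax of PCRE. The input text and variable names are assumptions.

```abap
DATA(text) = `Hello World`.

" POSIX: \u matches an uppercase letter.
DATA(posix) = replace( val = text regex = `\u` with = `X` occ = 0 ) ##regex_posix.

" PCRE: \p{Lu} matches a character with the property "uppercase letter".
DATA(pcre) = replace( val = text pcre = `\p{Lu}` with = `X` occ = 0 ).

ASSERT pcre  = `Xello Xorld`.
ASSERT posix = pcre.
```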

Matching All Unicode Characters
While PCRE supports most of the named sets available in the POSIX syntax, there is one exception: [[:unicode:]], which matches any character whose code is greater than 255.
Depending on the context, there are different ways to achieve the same behavior in PCRE:

| POSIX Syntax | PCRE Syntax | Description |
| --- | --- | --- |
| [[:unicode:]] | [^\x{00}-\x{ff}] | a standalone [[:unicode:]] can be replaced by the negation of the range of characters from 0x00 to 0xff |
| [^[:unicode:]] | [\x{00}-\x{ff}] | similarly, a standalone [^[:unicode:]] can be replaced by the range of characters from 0x00 to 0xff |
| [[:unicode:]...] | [\x{100}-\x{ffff}...] | if [[:unicode:]] is used in conjunction with other elements in a character class, the range of characters has to be specified explicitly (not by negation); when the regular expression is to be executed in a non-UTF-16 context (UNICODE_HANDLING is set to RELAXED), this is the character range from 0x100 to 0xffff |
| [[:unicode:]...] | [\x{100}-\x{10ffff}...] | in a UTF-16 context (UNICODE_HANDLING is set to STRICT or IGNORE), this range becomes 0x100 to 0x10ffff |
| [^[:unicode:]...] | [^\x{100}-\x{ffff}...] | similarly, when [[:unicode:]] is used in conjunction with other elements in a negated character class, the range from 0x100 to 0xffff for a non-UTF-16 context has to be specified explicitly |
| [^[:unicode:]...] | [^\x{100}-\x{10ffff}...] | in a UTF-16 context, this range becomes 0x100 to 0x10ffff |
Alternatively, if you only care about the character range from 0 to 127, or the negation thereof, you can use the POSIX named set [[:ascii:]] available in PCRE. Using PCRE's negative POSIX named set syntax ([[:^ascii:]]), you can match non-ASCII characters. The negative POSIX named set syntax can also be used in negated character classes, allowing for a lot of flexibility.

ABAP_EXAMPLE_VX5
The following searches yield the same result.
ABEXA 01494
ABAP_EXAMPLE_END
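
The standalone case from the first table row can be sketched as follows. The input is an assumption: it contains ö (code 0xF6, not greater than 255) and € (code 0x20AC, greater than 255), so only the latter should be found.

```abap
DATA(text) = `Köln €`.

" POSIX: named set for characters with a code greater than 255.
DATA(posix_offset) = find( val = text regex = `[[:unicode:]]` ) ##regex_posix.

" PCRE: negation of the range 0x00 to 0xff.
DATA(pcre_offset) = find( val = text pcre = `[^\x{00}-\x{ff}]` ).

" Both searches report the offset of the euro sign.
ASSERT pcre_offset  = 5.
ASSERT posix_offset = pcre_offset.
```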

Word Anchors
PCRE does not directly support the POSIX syntax \< and \> to match the start and end of a word respectively. As an alternative, the word anchor \b (which matches both the start and the end of a word) can be used in conjunction with a look-ahead or look-behind assertion. Alternatively, a special character set can be used.

| Description | POSIX Syntax | PCRE Syntax |
| --- | --- | --- |
| start of word | \< | \b(?=\w) or [[:<:]] |
| end of word | \> | \b(?<=\w) or [[:>:]] |

ABAP_EXAMPLE_VX5
The following replacements yield the same result.
ABEXA 01495
ABAP_EXAMPLE_END
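
The substitutions from the table can be tried out with a sketch like the one below, which replaces the first and the last letter of every word. The input text and the variable names are assumptions.

```abap
DATA(text) = `one two`.

" Start of word followed by one word character.
DATA(posix_start)  = replace( val = text regex = `\<\w` with = `_` occ = 0 ) ##regex_posix.
DATA(pcre_start_a) = replace( val = text pcre = `\b(?=\w)\w` with = `_` occ = 0 ).
DATA(pcre_start_b) = replace( val = text pcre = `[[:<:]]\w` with = `_` occ = 0 ).

ASSERT pcre_start_a = `_ne _wo`.
ASSERT pcre_start_b = pcre_start_a.
ASSERT posix_start  = pcre_start_a.

" One word character followed by the end of the word.
DATA(posix_end)  = replace( val = text regex = `\w\>` with = `_` occ = 0 ) ##regex_posix.
DATA(pcre_end_a) = replace( val = text pcre = `\w\b(?<=\w)` with = `_` occ = 0 ).
DATA(pcre_end_b) = replace( val = text pcre = `\w[[:>:]]` with = `_` occ = 0 ).

ASSERT pcre_end_a = `on_ tw_`.
ASSERT pcre_end_b = pcre_end_a.
ASSERT posix_end  = pcre_end_a.
```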

Migrating Replacement Strings
Apart from referring to the content of a capture group by its number ($1, $2, $3, ...), the replacement string syntax and capabilities of PCRE are quite different from those of POSIX.

Substituting the Whole Match
POSIX offers both $0 and $& as placeholders for the whole match in the replacement string. PCRE only supports the former syntax $0, with the latter syntax $& raising an exception. If you are using $& in your POSIX replacement strings, simply replace it with $0 when migrating to PCRE.

ABAP_EXAMPLE_VX5
The following replacements yield the same result.
ABEXA 01496
ABAP_EXAMPLE_END
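
A minimal sketch of this migration step, with an assumed input and pattern:

```abap
DATA(text) = `abc`.

" POSIX: $& stands for the whole match.
DATA(posix) = replace( val = text regex = `b` with = `[$&]` ) ##regex_posix.

" PCRE: only $0 is supported for the whole match.
DATA(pcre) = replace( val = text pcre = `b` with = `[$0]` ).

ASSERT pcre  = `a[b]c`.
ASSERT posix = pcre.
```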

Substituting Parts Around the Match
POSIX supports $` and $' as placeholders for the text in front of and after the match respectively. PCRE does not offer any directly equivalent functionality. If your pattern makes use of these POSIX features, you can however try to emulate them, for example by introducing additional capture groups.
There are however limitations to this approach. If your pattern or replacement string is more complex, you may have to either perform the replacement manually (using string operations and the offset and length obtained from the match), or keep your POSIX pattern with the ##regex_posix pragma.

ABAP_EXAMPLE_VX5
The following replacements yield the same result.
ABEXA 01497
ABAP_EXAMPLE_END
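
One possible emulation with additional capture groups is sketched below. The input, the patterns, and the variable names are assumptions, and the approach only works for simple cases like this one, where the surrounding text can be captured by extending the pattern to the start or end of the string.

```abap
DATA(text) = `abc`.

" Text after the match: POSIX $' versus a capture group behind the match.
" The PCRE pattern also consumes the rest of the text, so the replacement
" has to write it back ($1) before repeating it.
DATA(posix_after) = replace( val = text regex = `b` with = `$'` ) ##regex_posix.
DATA(pcre_after)  = replace( val = text pcre = `b(.*)` with = `$1$1` ).

ASSERT pcre_after  = `acc`.
ASSERT posix_after = pcre_after.

" Text in front of the match: POSIX $` versus a capture group in front.
DATA(posix_before) = replace( val = text regex = `b` with = '$`' ) ##regex_posix.
DATA(pcre_before)  = replace( val = text pcre = `^(.*?)b` with = `$1$1` ).

ASSERT pcre_before  = `aac`.
ASSERT posix_before = pcre_before.
```

Because the PCRE patterns match more text than the original POSIX pattern, offsets, match lengths, and multiple occurrences (occ = 0) behave differently, which is one of the limitations mentioned above.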