ABAP_REGEX - Incompatibilities Between POSIX and PCRE
This topic lists all features of POSIX regular expressions that cannot be reused directly in PCRE and must instead be migrated by rewriting the regular expression.
Migrating Patterns
For the most part, the features supported by PCRE form a superset of the features supported by POSIX. There are, however, some key differences and missing features, which are outlined in the following sections.
Fundamental Differences
Both PCRE and POSIX use a regex-directed backtracking algorithm, so both implementations yield the same result in most cases. There is, however, a crucial difference: PCRE always returns the leftmost match, while POSIX aims to return the leftmost longest match. That is, if multiple possible matches start at the same offset, POSIX returns the longest of them. If you rely on the leftmost longest matching rule in POSIX, you may need to reorder or rewrite parts of your regular expression to achieve the same results in PCRE.
Example
PCRE stops after finding the first (leftmost) match, while POSIX also tries the other match starting at the same position and, because it is longer, considers it the better match.
[Example: ABEXA 01482]
To also return the longest match in the PCRE case, the example above can be rewritten as follows, reordering the alternatives:
[Example: ABEXA 01483]
However, the different matching strategies do not only affect alternations introduced by |, but all cases where multiple matches start at the same location, for example when using the ? quantifier:
[Example: ABEXA 01484]
In this case, a look-ahead assertion can be used to also return the longest match in the PCRE case:
[Example: ABEXA 01485]
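The behavior described in this example can be sketched as follows. This is not the original example code: the program name, the input strings, and the patterns are chosen here purely for illustration, and the sketch assumes a release in which the built-in function match accepts the parameter pcre.

```abap
REPORT zdemo_pcre_leftmost. " hypothetical program name

DATA(text) = `abcdef`.

" PCRE tries the alternatives from left to right and stops at the
" first one that matches, so the shorter alternative wins here.
DATA(first_alt) = match( val = text pcre = `abc|abcdef` ).
ASSERT first_alt = `abc`.

" Putting the longer alternative first yields the longest match.
DATA(reordered) = match( val = text pcre = `abcdef|abc` ).
ASSERT reordered = `abcdef`.

" The same effect occurs with the ? quantifier: (ab)? greedily takes
" ab, after which (abcd)? can only match the empty string.
DATA(greedy) = match( val = `abcd` pcre = `(ab)?(abcd)?` ).
ASSERT greedy = `ab`.

" A negative look-ahead keeps (ab)? from taking ab when cd follows,
" so (abcd)? can match the whole string.
DATA(lookahead) = match( val = `abcd` pcre = `(ab(?!cd))?(abcd)?` ).
ASSERT lookahead = `abcd`.
```

Under the leftmost longest rule described above, the POSIX counterparts of the first and third pattern would be expected to return the longer matches without any rewriting.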
Significance of Whitespaces in Patterns
By default, PCRE syntax is compiled in extended mode on AS ABAP: most unescaped whitespace (blanks and line breaks) in the pattern is ignored outside character classes. To explicitly match whitespace in PCRE's extended mode, there are the following options:
- Escape the whitespace in the pattern. The pattern Hello\ World matches Hello World.
- Match any whitespace using the special character \s. Hello\sWorld matches Hello World. The same applies to Hello \s World, which might be more readable.
While the extended mode allows you to write more readable regular expressions, it can be confusing at first, especially when migrating POSIX regular expressions. The extended mode of PCRE can be switched off as follows:
- By passing ABAP_FALSE to the parameter EXTENDED when creating a PCRE regular expression with method CREATE_PCRE of class CL_ABAP_REGEX.
- By using the special character (?-x) in the pattern itself. This also works for the addition PCRE in statements and the parameter pcre in string functions.
Example
The extended mode for PCRE is enabled when the parameter pcre is used in the following function. This means that whitespace characters in the pattern are not significant when the pattern is evaluated. The PCRE regular expression does not match the string Hello World.
[Example: ABEXA 01486]
The string HelloWorld, however, is matched by PCRE but not by POSIX:
[Example: ABEXA 01487]
Finally, the following example shows how the extended mode can be switched off in built-in string functions:
[Example: ABEXA 01488]
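A minimal sketch of the whitespace options described in this section, using the predicate function matches and the class-based variant. The program name is hypothetical and the code is an illustration under the stated defaults, not the original example.

```abap
REPORT zdemo_pcre_whitespace. " hypothetical program name

" Extended mode (default): the blank in the pattern is ignored, so
" the pattern is effectively HelloWorld.
ASSERT NOT matches( val = `Hello World` pcre = `Hello World` ).
ASSERT matches( val = `HelloWorld` pcre = `Hello World` ).

" An escaped blank or \s matches the blank explicitly.
ASSERT matches( val = `Hello World` pcre = `Hello\ World` ).
ASSERT matches( val = `Hello World` pcre = `Hello \s World` ).

" (?-x) switches the extended mode off inside the pattern.
ASSERT matches( val = `Hello World` pcre = `(?-x)Hello World` ).

" Class-based variant: pass ABAP_FALSE to EXTENDED of CREATE_PCRE.
DATA(regex) = cl_abap_regex=>create_pcre( pattern  = `Hello World`
                                          extended = abap_false ).
DATA(whole_match) = regex->create_matcher( text = `Hello World` )->match( ).
ASSERT whole_match = abap_true.
```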
Comments
In the extended mode of PCRE, comments can be placed after an unescaped #. To include the character # in a pattern in PCRE's extended mode, it must be escaped: the pattern Hello\#World matches Hello#World. The extended mode of PCRE can be switched off as explained in the preceding section.
Example
The extended mode for PCRE is enabled when the parameter pcre is used in the following function. This means that the character # introduces a comment. The first PCRE regular expression does not match the string Hello#World. A POSIX regular expression matches the string, as do the second and third PCRE regular expressions, in which # is escaped or the extended mode is switched off.
[Example: ABEXA 01559]
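The three PCRE cases named in this example might look as follows. This is an illustrative sketch with a hypothetical program name, not the original example code.

```abap
REPORT zdemo_pcre_comment. " hypothetical program name

" The unescaped # starts a comment, so the pattern is effectively
" Hello and cannot match the whole string.
DATA(unescaped) = xsdbool( matches( val = `Hello#World` pcre = `Hello#World` ) ).
ASSERT unescaped = abap_false.

" Escaping # makes it a literal character.
DATA(escaped) = xsdbool( matches( val = `Hello#World` pcre = `Hello\#World` ) ).
ASSERT escaped = abap_true.

" With the extended mode switched off, # is an ordinary character.
DATA(not_extended) = xsdbool( matches( val = `Hello#World` pcre = `(?-x)Hello#World` ) ).
ASSERT not_extended = abap_true.
```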
Unicode Handling
For the representation of character strings, the ABAP programming language supports the two-byte Unicode character representation UCS-2. The system code page of an AS ABAP is UTF-16, which supports all characters of the Unicode standard. UCS-2 is a subset of UTF-16 that covers the so-called Basic Multilingual Plane (BMP) of the Unicode standard. In UTF-16, the characters of the other Unicode planes are encoded as surrogates (surrogate pairs) in the surrogate area.
POSIX regular expressions always assume UCS-2 and handle characters that are represented by surrogate pairs as two separate characters, which might lead to unexpected results. Unlike POSIX, PCRE can handle character strings as either UCS-2 or UTF-16. This can be configured in different ways depending on the type of regular expression operation performed:

Operation: Methods of system classes CL_ABAP_REGEX and CL_ABAP_MATCHER
Description: Unicode handling is controlled by parameter UNICODE_HANDLING of factory method CREATE_PCRE. The following values can be passed:
  STRICT - handle character string as UTF-16, raise an exception upon encountering invalid UTF-16 (broken surrogate pairs)
  IGNORE - handle character string as UTF-16, ignore invalid UTF-16; parts of the input that are not valid UTF-16 cannot be matched in any way
  RELAXED - handle character string as UCS-2; the special character \C is enabled in patterns, but matching surrogate pairs by their Unicode code point is no longer possible
Default Behavior: STRICT

Operation: Addition PCRE of statements FIND and REPLACE; argument pcre of built-in functions for strings
Description: No addition exists to control Unicode handling. Instead, the syntax (*UTF) can be specified at the start of the pattern to switch on the strict mode (see above).
Default Behavior: Without (*UTF), the relaxed mode (see above) is used; the special character \C, however, cannot be used.

The following table gives a quick overview of which Unicode mode to use when migrating a pattern from POSIX to PCRE:

Operation | Handle Input as UCS-2 or UTF-16? | Accept Invalid UTF-16? | Action
Methods of system classes CL_ABAP_REGEX and CL_ABAP_MATCHER | UTF-16 | Yes | Set UNICODE_HANDLING to IGNORE
Methods of system classes CL_ABAP_REGEX and CL_ABAP_MATCHER | UTF-16 | No | Set UNICODE_HANDLING to STRICT (default)
Methods of system classes CL_ABAP_REGEX and CL_ABAP_MATCHER | UCS-2 (ABAP default) | - | Set UNICODE_HANDLING to RELAXED
Statements and built-in functions | UTF-16 | Yes | This cannot be achieved with the addition PCRE of statements and the argument pcre of built-in functions; use objects of CL_ABAP_REGEX
Statements and built-in functions | UTF-16 | No | Add syntax (*UTF) to the pattern
Statements and built-in functions | UCS-2 (ABAP default) | - | No action required, relaxed mode is default
Example
The special character . matches two UCS-2 characters in the first two replacements, even though they form a surrogate pair for a single UTF-16 character. The third replacement uses (*UTF) at the beginning of a PCRE regular expression, so that only the UTF-16 character is matched and replaced.
[Example: ABEXA 01490]
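The effect of (*UTF) on a surrogate pair can be sketched as shown below. The sketch is illustrative only: the program name is hypothetical, and it assumes that CL_ABAP_CODEPAGE=>CONVERT_FROM with its default code page UTF-8 is available to build a string that contains the surrogate pair for U+1F600.

```abap
REPORT zdemo_pcre_utf16. " hypothetical program name

" Assumption: convert_from interprets the bytes as UTF-8 by default.
" F0 9F 98 80 is U+1F600, stored in ABAP as a surrogate pair.
DATA(smiley) = cl_abap_codepage=>convert_from(
                 source = CONV xstring( 'F09F9880' ) ).
ASSERT strlen( smiley ) = 2.

" Relaxed mode (default for the pcre argument): each half of the
" surrogate pair is matched as a separate character.
DATA(relaxed) = replace( val = smiley pcre = `.` with = `x` occ = 0 ).
ASSERT relaxed = `xx`.

" (*UTF) switches on strict UTF-16 handling: the pair is one character.
DATA(strict) = replace( val = smiley pcre = `(*UTF).` with = `x` occ = 0 ).
ASSERT strict = `x`.
```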
Matching Uppercase and Lowercase Letters
PCRE does not directly support the POSIX syntax \u and \l to match an uppercase and a lowercase letter respectively. This includes the corresponding negations \U and \L. As an alternative, PCRE's \p{xx} and \P{xx} syntax can be used to match characters that have certain Unicode character properties:

Description | POSIX Syntax | PCRE Syntax
uppercase letter | \u | \p{Lu}
not an uppercase letter | \U | \P{Lu}
lowercase letter | \l | \p{Ll}
not a lowercase letter | \L | \P{Ll}
Example
The following replacements yield the same result.
[Example: ABEXA 01493]
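As an illustration of the PCRE column of the table (not the original example), the following sketch masks characters by their Unicode property. Program name and input are hypothetical.

```abap
REPORT zdemo_pcre_case. " hypothetical program name

" \p{Lu} replaces the POSIX syntax \u: every uppercase letter is masked.
DATA(upper_masked) = replace( val = `Hello World` pcre = `\p{Lu}`
                              with = `_` occ = 0 ).
ASSERT upper_masked = `_ello _orld`.

" \P{Ll} replaces the POSIX syntax \L: everything that is not a
" lowercase letter (here H, the blank, and W) is masked.
DATA(non_lower_masked) = replace( val = `Hello World` pcre = `\P{Ll}`
                                  with = `_` occ = 0 ).
ASSERT non_lower_masked = `_ello__orld`.
```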
Matching All Unicode Characters
While PCRE supports most of the named sets available in the POSIX syntax, there is one exception: [[:unicode:]], which matches any character whose code is greater than 255. Depending on the context, there are different ways to achieve the same behavior in PCRE:

POSIX Syntax | PCRE Syntax | Description
[[:unicode:]] | [^\x{00}-\x{ff}] | A standalone [[:unicode:]] can be replaced by the negation of the range of characters from 0x00 to 0xff.
[^[:unicode:]] | [\x{00}-\x{ff}] | Similarly, a standalone [^[:unicode:]] can be replaced by the range of characters from 0x00 to 0xff.
[[:unicode:]...] | [\x{100}-\x{ffff}...] | If [[:unicode:]] is used in conjunction with other elements in a character class, the range of characters has to be specified explicitly (not by negation). When the regular expression is to be executed in a non-UTF-16 context (UNICODE_HANDLING is set to RELAXED), this is the character range from 0x100 to 0xffff.
[[:unicode:]...] | [\x{100}-\x{10ffff}...] | In a UTF-16 context (UNICODE_HANDLING is set to STRICT or IGNORE), this range becomes 0x100 to 0x10ffff.
[^[:unicode:]...] | [^\x{100}-\x{ffff}...] | Similarly, when [[:unicode:]] is used in conjunction with other elements in a negated character class, the range from 0x100 to 0xffff for a non-UTF-16 context has to be specified explicitly.
[^[:unicode:]...] | [^\x{100}-\x{10ffff}...] | In a UTF-16 context, this range becomes 0x100 to 0x10ffff.

Alternatively, if you only care about the character range from 0 to 127, or the negation thereof, you can use the POSIX named set [[:ascii:]], which is available in PCRE. Using PCRE's negative POSIX named set syntax ([[:^ascii:]]), you can match non-ASCII characters. The negative POSIX named set syntax can also be used in negated character classes, which allows for a lot of flexibility.
Example
The following searches yield the same result.
[Example: ABEXA 01494]
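A small sketch of the PCRE replacements for [[:unicode:]], using the built-in function find in its default relaxed mode. The program name and the input string are hypothetical. The euro sign (U+20AC) at offset 3 is the only character with a code greater than 255.

```abap
REPORT zdemo_pcre_unicode_set. " hypothetical program name

DATA(text) = `abc€def`.

" Standalone [[:unicode:]]: negate the range 0x00 to 0xff.
DATA(by_negation) = find( val = text pcre = `[^\x{00}-\x{ff}]` ).
ASSERT by_negation = 3.

" [[:unicode:]] combined with another element (here the letter e):
" the range has to be given explicitly, 0x100 to 0xffff in relaxed mode.
DATA(by_range) = find( val = text pcre = `[\x{100}-\x{ffff}e]` ).
ASSERT by_range = 3.

" If only the ASCII boundary matters, the negative named set suffices.
DATA(non_ascii) = find( val = text pcre = `[[:^ascii:]]` ).
ASSERT non_ascii = 3.
```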
Word Anchors
PCRE does not directly support the POSIX syntax \< and \> to match the start and end of a word respectively. As an alternative, the word anchor \b (which matches both the start and the end of a word) can be used in conjunction with a look-ahead or look-behind assertion. Alternatively, a special character set can be used.

Description | POSIX Syntax | PCRE Syntax
start of word | \< | \b(?=\w) or [[:<:]]
end of word | \> | \b(?<=\w) or [[:>:]]
Example
The following replacements yield the same result.
[Example: ABEXA 01495]
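The PCRE alternatives from the table can be tried out with a sketch like the following, which searches for cat only at the start or only at the end of a word. It is an illustration with a hypothetical program name and input, not the original example.

```abap
REPORT zdemo_pcre_word_anchor. " hypothetical program name

" Offsets: concat occupies 0 to 5, the blank is at 6, cat starts at 7.
DATA(text) = `concat cat`.

" Start of word: the cat inside concat (offset 3) does not qualify.
DATA(start_assert) = find( val = text pcre = `\b(?=\w)cat` ).
DATA(start_set)    = find( val = text pcre = `[[:<:]]cat` ).
ASSERT start_assert = 7.
ASSERT start_set = 7.

" End of word: the cat at the end of concat qualifies first.
DATA(end_assert) = find( val = text pcre = `cat\b(?<=\w)` ).
DATA(end_set)    = find( val = text pcre = `cat[[:>:]]` ).
ASSERT end_assert = 3.
ASSERT end_set = 3.
```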
Migrating Replacement Strings
Apart from referring to the content of a capture group by its number ($1, $2, $3, ...), the replacement string syntax and capabilities of PCRE are quite different from those of POSIX.
Substituting the Whole Match
POSIX offers both $0 and $& as placeholders for the whole match in the replacement string. PCRE only supports the former syntax, $0; the latter syntax, $&, raises an exception. If you are using $& in your POSIX replacement strings, simply replace it with $0 when migrating to PCRE.
Example
The following replacements yield the same result.
[Example: ABEXA 01496]
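As a brief, illustrative sketch (hypothetical program name and input), the placeholder $0 in a PCRE replacement string stands for the whole match:

```abap
REPORT zdemo_pcre_whole_match. " hypothetical program name

" $0 inserts the whole match, here the letter b, into the replacement.
DATA(bracketed) = replace( val = `abc` pcre = `b` with = `[$0]` ).
ASSERT bracketed = `a[b]c`.
```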
Substituting Parts Around the Match
POSIX supports $` and $' as placeholders for the text in front of and after the match respectively. PCRE does not offer any directly equivalent functionality. If your pattern makes use of these POSIX features, you can try to emulate them, for example by introducing additional capture groups. There are, however, limitations to this approach. If your pattern or replacement string is more complex, you may have to either perform the replacement manually (using string operations and the offset and length obtained from the match) or keep your POSIX pattern together with the pragma ##regex_posix.
Example
The following replacements yield the same result.
[Example: ABEXA 01497]
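One possible way to emulate the POSIX placeholder for the text in front of the match is sketched below. It is only an illustration under simplifying assumptions (a single match in a single-line string); the program name, input, and pattern are hypothetical. The pattern is widened to cover the whole string, so that the text before and after the match is available as capture groups.

```abap
REPORT zdemo_pcre_prematch. " hypothetical program name

" Goal: replace c in abcde with the text in front of it (ab), which
" POSIX would express with the placeholder for the text before the match.
" (.*?) captures the text before the first c, (.*) the text after it.
" The replacement rebuilds the string as: before + before + after.
DATA(emulated) = replace( val  = `abcde`
                          pcre = `^(.*?)c(.*)$`
                          with = `$1$1$2` ).
ASSERT emulated = `ababde`.
```

Because the widened pattern consumes the whole string, this approach handles only one occurrence per call, which is one of the limitations mentioned above.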