Operator Reference
tuple_regexp_match (Operator)
tuple_regexp_match: Extract substrings using regular expressions.
Signature
tuple_regexp_match( : : Data, Expression : Matches)
Description
tuple_regexp_match applies the regular expression in Expression to one or more input strings in Data, and in each case returns the first matching substring in Matches.
Normally, one output string is returned for each input string, the output
string being empty if no match was found. However, if the regular
expression contains capturing groups (see below), the behavior depends
on the number of input strings: If there is only a single input string,
the result is a tuple of all captured submatches.
If there are multiple input strings, the output strings represent
the matched pattern of the first capturing group.
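The return convention described above can be sketched in Python, whose standard re module is used here only as a stand-in for the Perl-compatible syntax. The function regexp_match_sketch is a hypothetical helper written for illustration and is not part of HALCON. How a failed match is reported when the expression contains capturing groups is an assumption, because the description does not cover that case.

```python
import re

def regexp_match_sketch(data, expression):
    """Emulate the documented return convention (illustrative only)."""
    strings = [data] if isinstance(data, str) else list(data)
    pattern = re.compile(expression)

    if pattern.groups == 0:
        # No capturing groups: one output per input, '' if nothing matched.
        out = []
        for s in strings:
            m = pattern.search(s)
            out.append(m.group(0) if m else '')
        return out

    if len(strings) == 1:
        # Single input string: return all captured submatches.
        m = pattern.search(strings[0])
        if m is None:
            # Assumption: one empty string per group when nothing matches.
            return [''] * pattern.groups
        return [g if g is not None else '' for g in m.groups()]

    # Multiple input strings: only the first capturing group of each match.
    out = []
    for s in strings:
        m = pattern.search(s)
        out.append((m.group(1) or '') if m else '')
    return out
```

Called with the inputs from the example section of this page, the sketch yields the same values that are listed there, e.g. the single string 'mydir/img001.bmp' with two groups gives both submatches, while two input strings give one submatch each.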
A summary of regular expression syntax is provided here. Basically, each character in the regular expression represents a literal to match, except for the following symbols which have a special meaning (the described syntax is compatible with Perl):
^          Matches start of string
$          Matches end of string (a trailing newline is allowed)
.          Matches any character except newline
[...]      Matches any character literal listed in the brackets.
           If the first character is a '^', this matches any character
           except those in the list. You can use the '-' character as
           in '[A-Z0-9]' to select character ranges. Other characters
           lose their special meaning in brackets, except '\'.
           Within these brackets it is possible to use the following
           POSIX character classes (note that the additional brackets
           are needed):
             [:alnum:]   alphabetic and numeric characters
             [:alpha:]   alphabetic characters
             [:blank:]   space and tab
             [:cntrl:]   control characters
             [:digit:]   digits
             [:graph:]   non-blank characters (i.e., excluding spaces and control characters)
             [:lower:]   lowercase alphabetic characters
             [:print:]   like [:graph:] but including spaces
             [:punct:]   punctuation characters
             [:space:]   all whitespace characters ([:blank:], newline, ...)
             [:upper:]   uppercase alphabetic characters
             [:xdigit:]  digits allowed in hexadecimal numbers (0-9a-fA-F)
*          Allows 0 or more repetitions of preceding literal or group
+          Allows 1 or more repetitions of preceding literal or group
?          Allows 0 or 1 repetitions of preceding literal or group
{n,m}      Allows n to m repetitions of preceding literal or group
{n}        Allows exactly n repetitions of preceding literal or group
|          Separates alternative matching expressions
( )        Groups a subpattern and creates a capturing group.
           The substrings captured by this group will be stored separately.
(?: )      Groups a subpattern without creating a capturing group
(?= )      Positive lookahead (required condition to the right of the match)
(?! )      Negative lookahead (forbidden condition to the right of the match)
(?<= )     Positive lookbehind (required condition to the left of the match)
(?<! )     Negative lookbehind (forbidden condition to the left of the match)
\          Escapes any special symbol to treat it as a literal.
           Attention: Some host languages like HDevelop and C/C++ already
           use the backslash as a general escape character. In this case,
           the backslash must be doubled in the string literal: '\\.'
           matches a literal dot, while '\\\\' matches a literal backslash.
           Furthermore, there are some special codes:
\d         Matches a digit (Negation: \D)
\w         Matches a letter, digit or underscore (Negation: \W)
\s         Matches a white space character (Negation: \S)
\b         Matches a word boundary (Negation: \B)
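A few of the constructs summarized above can be tried out with the following sketch. It uses Python's standard re module as a stand-in for the Perl-compatible syntax, so it illustrates the pattern language rather than the HALCON operator itself. The file name is invented for the example, and POSIX character classes are left out because Python's re module does not support them.

```python
import re

# Stand-in for the Perl-compatible syntax; this is not a HALCON call.
text = 'img_0042.png'  # hypothetical input string

# \d with an {n} quantifier: exactly four digits.
digits = re.search(r'\d{4}', text).group(0)

# Capturing vs. non-capturing groups: only ( ) is stored separately.
m = re.search(r'(?:img|pic)_(\d+)\.(png|bmp)', text)
captured = m.groups()

# Lookbehind and lookahead: digits between '_' and '.', excluding both.
between = re.search(r'(?<=_)\d+(?=\.)', text).group(0)

# Negated character class anchored at the start: everything before the first '.'.
stem = re.search(r'^[^.]+', text).group(0)

# '$' also matches directly before a trailing newline.
ext = re.search(r'png$', 'img_0042.png\n').group(0)

print(digits, captured, between, stem, ext)
```

Raw string literals (the r prefix) are used so that each backslash reaches the regular expression engine unchanged, which sidesteps the host-language escaping issue mentioned in the table.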
The repeat quantifiers listed in the table above are greedy by default, i.e., they attempt to maximize the length of the match. Appending '?' to a quantifier makes it attempt to find a minimal match instead, e.g., '+?'.
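The difference between greedy and minimal matching is easiest to see side by side. The sketch below again relies on Python's re module as a stand-in for the Perl-compatible syntax, and the input string is made up for the illustration.

```python
import re

html = '<b>bold</b> and <i>italic</i>'  # hypothetical input string

# Greedy: '.+' extends as far as possible, up to the last '>'.
greedy = re.search(r'<.+>', html).group(0)

# Minimal: '.+?' stops at the first '>' that completes a match.
lazy = re.search(r'<.+?>', html).group(0)

print(greedy)
print(lazy)
```

With the greedy quantifier the whole string is returned, while the minimal variant returns only the first tag.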
If the specified expression is syntactically incorrect, you will receive an error stating that the value of control parameter 2 is wrong. Additional details are displayed in HDevelop's Output Console and, if set_system('do_low_error', 'true') is set, in a message box.
Furthermore, you can set some options by passing a string tuple for Expression. In this case, the first element is used as the expression, and each additional element is treated as an option.
- 'ignore_case': Perform case-insensitive matching
- 'multiline': '^' and '$' match start and end of individual lines
- 'dot_matches_all': Allow the '.' character to also match newlines
- 'newline_lf', 'newline_crlf', 'newline_cr': Specify the encoding of newlines in the input data. The default is LF on all systems. Although files on Windows usually use CRLF as the line break, the read operators return just '\n' for every line break when reading a file into memory, which is the same as LF.
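The effect of the first three options can be approximated in Python, where comparable behavior is selected through flags of the standard re module. The mapping table and both helper functions below are hypothetical and exist only for this illustration. The newline options are omitted because Python's re module has no direct counterpart for them.

```python
import re

# Hypothetical mapping from the option names above to Python re flags.
OPTION_FLAGS = {
    'ignore_case': re.IGNORECASE,
    'multiline': re.MULTILINE,
    'dot_matches_all': re.DOTALL,
}

def split_expression(expression):
    """Split [pattern, option, ...] into a pattern and combined flags."""
    if isinstance(expression, str):
        return expression, 0
    pattern, *options = expression
    flags = 0
    for opt in options:
        flags |= OPTION_FLAGS[opt]
    return pattern, flags

def first_match(data, expression):
    """Return the first match in a single string, or '' if there is none."""
    pattern, flags = split_expression(expression)
    m = re.search(pattern, data, flags)
    return m.group(0) if m else ''

print(first_match('Hello World', ['world', 'ignore_case']))
print(first_match('one\ntwo', ['^two$', 'multiline']))
```

As in the description above, the first element of the list is treated as the expression and every further element as an option.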
For general information about string operations see Tuple / String Operations.
If the input parameter Data is an empty tuple, the operator returns an empty tuple. If Expression is an empty tuple, an exception is raised.
Unicode code points versus bytes
Regular expression matching operates on Unicode code points. One Unicode code point may be composed of multiple bytes in the UTF-8 string. If regular expression matching should only match on bytes, this operator can be switched to byte mode with set_system('tsp_tuple_string_operator_mode','byte'). If 'filename_encoding' is set to 'locale' (legacy), this operator always uses the byte mode.
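The practical difference between the two modes can be illustrated with Python, which distinguishes text strings (code points) from byte strings. This is only an analogy for the behavior described above and does not call HALCON. The sample word is chosen for the example.

```python
import re

text = 'Gr\u00f6\u00dfe'        # the word 'Groesse' written with o-umlaut and sharp s
raw = text.encode('utf-8')      # the two non-ASCII characters take 2 bytes each

# Code point matching: '.' consumes one whole character.
by_code_point = re.search(r'Gr.', text).group(0)

# Byte matching: '.' consumes a single byte and splits the encoded o-umlaut.
by_byte = re.search(rb'Gr.', raw).group(0)

print(len(text), len(raw))
print(repr(by_code_point), repr(by_byte))
```

In the byte-wise case the match ends in the middle of a multi-byte character, so the result is no longer valid UTF-8 on its own. This is the kind of effect to keep in mind before switching to byte mode.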
HDevelop In-line Operation
HDevelop provides an in-line operation for tuple_regexp_match, which can be used in an expression in the following syntax:
Matches := regexp_match(Data, Expression)
Execution Information
- Multithreading type: independent (runs in parallel even with exclusive operators).
- Multithreading scope: global (may be called from any thread).
- Processed without parallelization.
Parameters
Data (input_control)  string(-array) → (string)
Input strings to match.
Expression (input_control)  string(-array) → (string)
Regular expression.
Default: '.*'
Suggested values: '.*', 'ignore_case', 'multiline', 'dot_matches_all', 'newline_lf', 'newline_crlf', 'newline_cr'
Matches (output_control)  string(-array) → (string)
Found matches.
Example (HDevelop)
tuple_regexp_match ('abba', 'a*b*', Result)
* Returns 'abb'
tuple_regexp_match ('abba', 'b*a*', Result)
* Returns 'a'
tuple_regexp_match ('abba', 'b+a*', Result)
* Returns 'bba'
tuple_regexp_match ('abba', '.a', Result)
* Returns 'ba'
tuple_regexp_match ('abba', '[ab]*', Result)
* Returns 'abba'
tuple_regexp_match (['img123','img124'], 'img(.*)', Result)
* Returns ['123','124']
tuple_regexp_match ('mydir/img001.bmp', 'img(.*)\\.(.*)', Result)
* Returns ['001','bmp']
Alternatives
See also
tuple_regexp_replace, tuple_regexp_test, tuple_regexp_select
References
Perl Compatible Regular Expressions (PCRE), http://www.pcre.org/
Module
Foundation