|
PCREPARTIAL(3) PCREPARTIAL(3)
NAME
PCRE - Perl-compatible regular expressions
PARTIAL MATCHING IN PCRE
In normal use of PCRE, if the subject string that is passed to
pcre_exec() or pcre_dfa_exec() matches as far as it goes, but is too
short to match the entire pattern, PCRE_ERROR_NOMATCH is returned.
There are circumstances where it might be helpful to distinguish this
case from other cases in which there is no match.
Consider, for example, an application where a human is required to type
in data for a field with specific formatting requirements. An example
might be a date in the form ddmmmyy, defined by this pattern:
^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
If the application sees the user's keystrokes one by one, and can check
that what has been typed so far is potentially valid, it is able to
raise an error as soon as a mistake is made, possibly beeping and not
reflecting the character that has been typed. This immediate feedback
is likely to be a better user interface than a check that is delayed
until the entire string has been entered.
PCRE supports the concept of partial matching by means of the PCRE_PAR-
TIAL option, which can be set when calling pcre_exec() or
pcre_dfa_exec(). When this flag is set for pcre_exec(), the return code
PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if at any time
during the matching process the last part of the subject string matched
part of the pattern. Unfortunately, for non-anchored matching, it is
not possible to obtain the position of the start of the partial match.
No captured data is set when PCRE_ERROR_PARTIAL is returned.
When PCRE_PARTIAL is set for pcre_dfa_exec(), the return code
PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end of
the subject is reached, there have been no complete matches, but there
is still at least one matching possibility. The portion of the string
that provided the partial match is set as the first matching string.
Using PCRE_PARTIAL disables one of PCRE's optimizations. PCRE remembers
the last literal byte in a pattern, and abandons matching immediately
if such a byte is not present in the subject string. This optimization
cannot be used for a subject string that might match only partially.
RESTRICTED PATTERNS FOR PCRE_PARTIAL
Because of the way certain internal optimizations are implemented in
the pcre_exec() function, the PCRE_PARTIAL option cannot be used with
all patterns. These restrictions do not apply when pcre_dfa_exec() is
used. For pcre_exec(), repeated single characters such as
a{2,4}
and repeated single metasequences such as
\d+
are not permitted if the maximum number of occurrences is greater than
one. Optional items such as \d? (where the maximum is one) are permit-
ted. Quantifiers with any values are permitted after parentheses, so
the invalid examples above can be coded thus:
(a){2,4}
(\d)+
These constructions run more slowly, but for the kinds of application
that are envisaged for this facility, this is not felt to be a major
restriction.
If PCRE_PARTIAL is set for a pattern that does not conform to the
restrictions, pcre_exec() returns the error code PCRE_ERROR_BADPARTIAL
(-13).
EXAMPLE OF PARTIAL MATCHING USING PCRETEST
If the escape sequence \P is present in a pcretest data line, the
PCRE_PARTIAL flag is used for the match. Here is a run of pcretest that
uses the date example quoted above:
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
data> 25jun04\P
0: 25jun04
1: jun
data> 25dec3\P
Partial match
data> 3ju\P
Partial match
data> 3juj\P
No match
data> j\P
No match
The first data string is matched completely, so pcretest shows the
matched substrings. The remaining four strings do not match the com-
plete pattern, but the first two are partial matches. The same test,
using DFA matching (by means of the \D escape sequence), produces the
following output:
re> /^?(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)$/
data> 25jun04\P\D
0: 25jun04
data> 23dec3\P\D
Partial match: 23dec3
data> 3ju\P\D
Partial match: 3ju
data> 3juj\P\D
No match
data> j\P\D
No match
Notice that in this case the portion of the string that was matched is
made available.
MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()
When a partial match has been found using pcre_dfa_exec(), it is possi-
ble to continue the match by providing additional subject data and
calling pcre_dfa_exec() again with the PCRE_DFA_RESTART option and the
same working space (where details of the previous partial match are
stored). Here is an example using pcretest, where the \R escape
sequence sets the PCRE_DFA_RESTART option and the \D escape sequence
requests the use of pcre_dfa_exec():
re> /^?(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)$/
data> 23ja\P\D
Partial match: 23ja
data> n05\R\D
0: n05
The first call has "23ja" as the subject, and requests partial match-
ing; the second call has "n05" as the subject for the continued
(restarted) match. Notice that when the match is complete, only the
last part is shown; PCRE does not retain the previously partially-
matched string. It is up to the calling program to do that if it needs
to.
This facility can be used to pass very long subject strings to
pcre_dfa_exec(). However, some care is needed for certain types of pat-
tern.
1. If the pattern contains tests for the beginning or end of a line,
you need to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropri-
ate, when the subject string for any call does not contain the begin-
ning or end of a line.
2. If the pattern contains backward assertions (including \b or \B),
you need to arrange for some overlap in the subject strings to allow
for this. For example, you could pass the subject in chunks that were
500 bytes long, but in a buffer of 700 bytes, with the starting offset
set to 200 and the previous 200 bytes at the start of the buffer.
3. Matching a subject string that is split into multiple segments does
not always produce exactly the same result as matching over one single
long string. The difference arises when there are multiple matching
possibilities, because a partial match result is given only when there
are no completed matches in a call to fBpcre_dfa_exec(). This means
that as soon as the shortest match has been found, continuation to a
new subject segment is no longer possible. Consider this pcretest
example:
re> /dog(sbody)?/
data> do\P\D
Partial match: do
data> gsb\R\P\D
0: g
data> dogsbody\D
0: dogsbody
1: dog
The pattern matches the words "dog" or "dogsbody". When the subject is
presented in several parts ("do" and "gsb" being the first two) the
match stops when "dog" has been found, and it is not possible to con-
tinue. On the other hand, if "dogsbody" is presented as a single
string, both matches are found.
Because of this phenomenon, it does not usually make sense to end a
pattern that is going to be matched in this way with a variable repeat.
4. Patterns that contain alternatives at the top level which do not all
start with the same pattern item may not work as expected. For example,
consider this pattern:
1234|3789
If the first part of the subject is "ABC123", a partial match of the
first alternative is found at offset 3. There is no partial match for
the second alternative, because such a match does not start at the same
point in the subject string. Attempting to continue with the string
"789" does not yield a match because only those alternatives that match
at one point in the subject are remembered. The problem arises because
the start of the second alternative matches within the first alterna-
tive. There is no problem with anchored patterns or patterns such as:
1234|ABCD
where no string can be a partial match for both alternatives.
Last updated: 16 January 2006
Copyright (c) 1997-2006 University of Cambridge.
PCREPARTIAL(3)
|