--- embedaddon/pcre/doc/html/pcrepartial.html 2012/02/21 23:05:52 1.1.1.1
+++ embedaddon/pcre/doc/html/pcrepartial.html 2012/02/21 23:50:25 1.1.1.2
@@ -14,24 +14,24 @@ man page, in case the conversion went wrong.

PARTIAL MATCHING IN PCRE

-In normal use of PCRE, if the subject string that is passed to
-pcre_exec() or pcre_dfa_exec() matches as far as it goes, but is
-too short to match the entire pattern, PCRE_ERROR_NOMATCH is returned. There
-are circumstances where it might be helpful to distinguish this case from other
-cases in which there is no match.
+In normal use of PCRE, if the subject string that is passed to a matching
+function matches as far as it goes, but is too short to match the entire
+pattern, PCRE_ERROR_NOMATCH is returned. There are circumstances where it might
+be helpful to distinguish this case from other cases in which there is no
+match.

Consider, for example, an application where a human is required to type in data
@@ -50,42 +50,41 @@ long and is not all available at once.

PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and
-PCRE_PARTIAL_HARD options, which can be set when calling pcre_exec() or
-pcre_dfa_exec(). For backwards compatibility, PCRE_PARTIAL is a synonym
-for PCRE_PARTIAL_SOFT. The essential difference between the two options is
-whether or not a partial match is preferred to an alternative complete match,
-though the details differ between the two matching functions. If both options
+PCRE_PARTIAL_HARD options, which can be set when calling any of the matching
+functions. For backwards compatibility, PCRE_PARTIAL is a synonym for
+PCRE_PARTIAL_SOFT. The essential difference between the two options is whether
+or not a partial match is preferred to an alternative complete match, though
+the details differ between the two types of matching function. If both options
are set, PCRE_PARTIAL_HARD takes precedence.

-Setting a partial matching option for pcre_exec() disables the use of any
-just-in-time code that was set up by calling pcre_study() with the
+Setting a partial matching option disables the use of any just-in-time code
+that was set up by studying the compiled pattern with the
PCRE_STUDY_JIT_COMPILE option. It also disables two of PCRE's standard
-optimizations. PCRE remembers the last literal byte in a pattern, and abandons
-matching immediately if such a byte is not present in the subject string. This
+optimizations. PCRE remembers the last literal data unit in a pattern, and
+abandons matching immediately if it is not present in the subject string. This
optimization cannot be used for a subject string that might match only
partially. If the pattern was studied, PCRE knows the minimum length of a
matching string, and does not bother to run the matching function on shorter
strings. This optimization is also disabled for partial matching.

-PARTIAL MATCHING USING pcre_exec()
+PARTIAL MATCHING USING pcre_exec() OR pcre16_exec()

-A partial match occurs during a call to pcre_exec() when the end of the
-subject string is reached successfully, but matching cannot continue because
-more characters are needed. However, at least one character in the subject must
-have been inspected. This character need not form part of the final matched
-string; lookbehind assertions and the \K escape sequence provide ways of
-inspecting characters before the start of a matched substring. The requirement
-for inspecting at least one character exists because an empty string can always
-be matched; without such a restriction there would always be a partial match of
-an empty string at the end of the subject.
+A partial match occurs during a call to pcre_exec() or
+pcre16_exec() when the end of the subject string is reached successfully,
+but matching cannot continue because more characters are needed. However, at
+least one character in the subject must have been inspected. This character
+need not form part of the final matched string; lookbehind assertions and the
+\K escape sequence provide ways of inspecting characters before the start of a
+matched substring. The requirement for inspecting at least one character exists
+because an empty string can always be matched; without such a restriction there
+would always be a partial match of an empty string at the end of the subject.

-If there are at least two slots in the offsets vector when pcre_exec()
-returns with a partial match, the first slot is set to the offset of the
-earliest character that was inspected when the partial match was found. For
-convenience, the second offset points to the end of the subject so that a
-substring can easily be identified.
+If there are at least two slots in the offsets vector when a partial match is
+returned, the first slot is set to the offset of the earliest character that
+was inspected. For convenience, the second offset points to the end of the
+subject so that a substring can easily be identified.
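
The offsets described in this hunk can be used roughly as follows. This is a minimal sketch, not code from PCRE: the helper name copy_partial is hypothetical, and it assumes the caller already holds an offsets vector whose first two slots were filled in by a matching call that reported a partial match.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of PCRE): copies the text between
   ovector[0] (earliest inspected character) and ovector[1] (end of the
   subject) into out as a NUL-terminated string.
   Returns the number of characters copied, or -1 if the offsets are
   inconsistent or out is too small. */
int copy_partial(const char *subject, const int *ovector,
                 char *out, size_t outsize)
{
    int start = ovector[0];
    int end = ovector[1];
    size_t n;

    if (start < 0 || end < start)
        return -1;
    n = (size_t)(end - start);
    if (n + 1 > outsize)
        return -1;
    memcpy(out, subject + start, n);
    out[n] = '\0';
    return (int)n;
}
```

For the subject "The date is 23ja" with offsets 12 and 16, the helper copies "23ja".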

For the majority of patterns, the first offset identifies the start of the
@@ -105,13 +104,14 @@ What happens when a partial match is identified depend
partial matching options are set.


-PCRE_PARTIAL_SOFT with pcre_exec()
+PCRE_PARTIAL_SOFT WITH pcre_exec() OR pcre16_exec()

-If PCRE_PARTIAL_SOFT is set when pcre_exec() identifies a partial match,
-the partial match is remembered, but matching continues as normal, and other
-alternatives in the pattern are tried. If no complete match can be found,
-pcre_exec() returns PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH.
+If PCRE_PARTIAL_SOFT is set when pcre_exec() or pcre16_exec()
+identifies a partial match, the partial match is remembered, but matching
+continues as normal, and other alternatives in the pattern are tried. If no
+complete match can be found, PCRE_ERROR_PARTIAL is returned instead of
+PCRE_ERROR_NOMATCH.

This option is "soft" because it prefers a complete match over a partial match.
@@ -134,22 +134,25 @@ example, there are two partial matches, because "dog"
matches the second alternative.)


-PCRE_PARTIAL_HARD with pcre_exec()
+PCRE_PARTIAL_HARD WITH pcre_exec() OR pcre16_exec()

-If PCRE_PARTIAL_HARD is set for pcre_exec(), it returns
-PCRE_ERROR_PARTIAL as soon as a partial match is found, without continuing to
-search for possible complete matches. This option is "hard" because it prefers
-an earlier partial match over a later complete match. For this reason, the
-assumption is made that the end of the supplied subject string may not be the
-true end of the available data, and so, if \z, \Z, \b, \B, or $ are
-encountered at the end of the subject, the result is PCRE_ERROR_PARTIAL.
+If PCRE_PARTIAL_HARD is set for pcre_exec() or pcre16_exec(),
+PCRE_ERROR_PARTIAL is returned as soon as a partial match is found, without
+continuing to search for possible complete matches. This option is "hard"
+because it prefers an earlier partial match over a later complete match. For
+this reason, the assumption is made that the end of the supplied subject string
+may not be the true end of the available data, and so, if \z, \Z, \b, \B,
+or $ are encountered at the end of the subject, the result is
+PCRE_ERROR_PARTIAL, provided that at least one character in the subject has
+been inspected.

-Setting PCRE_PARTIAL_HARD also affects the way pcre_exec() checks UTF-8
-subject strings for validity. Normally, an invalid UTF-8 sequence causes the
-error PCRE_ERROR_BADUTF8. However, in the special case of a truncated UTF-8
-character at the end of the subject, PCRE_ERROR_SHORTUTF8 is returned when
+Setting PCRE_PARTIAL_HARD also affects the way UTF-8 and UTF-16
+subject strings are checked for validity. Normally, an invalid sequence
+causes the error PCRE_ERROR_BADUTF8 or PCRE_ERROR_BADUTF16. However, in the
+special case of a truncated character at the end of the subject,
+PCRE_ERROR_SHORTUTF8 or PCRE_ERROR_SHORTUTF16 is returned when
PCRE_PARTIAL_HARD is set.
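
To make "a truncated character at the end of the subject" concrete for UTF-8, here is a small standalone sketch. It is not PCRE's validity check and the function name utf8_truncated_tail is hypothetical; it only inspects the last few bytes of a buffer and reports how many of them belong to a multi-byte sequence that was cut off.

```c
#include <stddef.h>

/* Hypothetical helper (not part of PCRE): returns the number of bytes at
   the end of s that form an incomplete UTF-8 sequence, or 0 if the buffer
   does not end in a truncated sequence. It does not validate the rest of
   the buffer. */
size_t utf8_truncated_tail(const unsigned char *s, size_t len)
{
    size_t back;

    for (back = 1; back <= 4 && back <= len; back++) {
        unsigned char c = s[len - back];
        size_t need;

        if ((c & 0xC0) == 0x80)
            continue;                 /* continuation byte: keep looking back */
        if ((c & 0x80) == 0x00)
            return 0;                 /* ASCII byte: nothing is truncated */
        if ((c & 0xE0) == 0xC0)
            need = 2;                 /* lead byte of a 2-byte sequence */
        else if ((c & 0xF0) == 0xE0)
            need = 3;                 /* lead byte of a 3-byte sequence */
        else if ((c & 0xF8) == 0xF0)
            need = 4;                 /* lead byte of a 4-byte sequence */
        else
            return 0;                 /* invalid lead byte, not a truncation */
        return back < need ? back : 0;
    }
    return 0;
}
```

An application feeding data in segments could use a check of this kind to hold back the trailing bytes until the next segment arrives, rather than passing a split character to the matcher.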


@@ -169,23 +172,23 @@ if the pattern is made ungreedy the result is differen
   /dog(sbody)??/
 
-In this case the result is always a complete match because pcre_exec()
-finds that first, and it never continues after finding a match. It might be
-easier to follow this explanation by thinking of the two patterns like this:
+In this case the result is always a complete match because that is found first,
+and matching never continues after finding a complete match. It might be easier
+to follow this explanation by thinking of the two patterns like this:
   /dog(sbody)?/    is the same as  /dogsbody|dog/
   /dog(sbody)??/   is the same as  /dog|dogsbody/
 
-The second pattern will never match "dogsbody" when pcre_exec() is
-used, because it will always find the shorter match first.
+The second pattern will never match "dogsbody", because it will always find the
+shorter match first.
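
The ordered-alternatives view of the two patterns can be modelled with a toy matcher. This is a deliberately simplified sketch, not PCRE's algorithm: it handles only literal alternatives anchored at the start of the subject, and the names toy_match and toy_result are hypothetical. It shows why the order of alternatives decides the outcome under the hard option, while the soft option still prefers a complete match.

```c
#include <stddef.h>
#include <string.h>

enum toy_result { TOY_NOMATCH, TOY_PARTIAL, TOY_MATCH };

/* Toy model: tries literal alternatives in order, anchored at the start
   of the subject. A complete match always ends the search. A partial
   match (the subject ran out while still agreeing with an alternative)
   is returned at once when hard is non-zero, and otherwise remembered
   while later alternatives are tried. */
enum toy_result toy_match(const char *const *alts, size_t nalts,
                          const char *subject, int hard)
{
    size_t slen = strlen(subject);
    int saw_partial = 0;
    size_t i;

    for (i = 0; i < nalts; i++) {
        size_t alen = strlen(alts[i]);

        if (alen <= slen && strncmp(subject, alts[i], alen) == 0)
            return TOY_MATCH;
        if (slen > 0 && slen < alen && strncmp(subject, alts[i], slen) == 0) {
            if (hard)
                return TOY_PARTIAL;
            saw_partial = 1;
        }
    }
    return saw_partial ? TOY_PARTIAL : TOY_NOMATCH;
}
```

With the alternatives ordered as "dogsbody", "dog", the subject "dog" gives TOY_PARTIAL in hard mode and TOY_MATCH in soft mode. With the order reversed, both modes give TOY_MATCH, because the complete match is found first.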

-PARTIAL MATCHING USING pcre_dfa_exec()
+PARTIAL MATCHING USING pcre_dfa_exec() OR pcre16_dfa_exec()

-The pcre_dfa_exec() function moves along the subject string character by
-character, without backtracking, searching for all possible matches
-simultaneously. If the end of the subject is reached before the end of the
-pattern, there is the possibility of a partial match, again provided that at
-least one character has been inspected.
+The DFA functions move along the subject string character by character, without
+backtracking, searching for all possible matches simultaneously. If the end of
+the subject is reached before the end of the pattern, there is the possibility
+of a partial match, again provided that at least one character has been
+inspected.

When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if there
@@ -196,16 +199,16 @@ partial match was found is set as the first matching s
at least two slots in the offsets vector.

-Because pcre_dfa_exec() always searches for all possible matches, and
-there is no difference between greedy and ungreedy repetition, its behaviour is
-different from pcre_exec when PCRE_PARTIAL_HARD is set. Consider the
-string "dog" matched against the ungreedy pattern shown above:
+Because the DFA functions always search for all possible matches, and there is
+no difference between greedy and ungreedy repetition, their behaviour is
+different from the standard functions when PCRE_PARTIAL_HARD is set. Consider
+the string "dog" matched against the ungreedy pattern shown above:

   /dog(sbody)??/
 
-Whereas pcre_exec() stops as soon as it finds the complete match for
-"dog", pcre_dfa_exec() also finds the partial match for "dogsbody", and
-so returns that when PCRE_PARTIAL_HARD is set.
+Whereas the standard functions stop as soon as they find the complete match for
+"dog", the DFA functions also find the partial match for "dogsbody", and so
+return that when PCRE_PARTIAL_HARD is set.


PARTIAL MATCHING AND WORD BOUNDARIES

@@ -217,23 +220,19 @@ results. Consider this pattern:
This matches "cat", provided there is a word boundary at either end. If the
subject string is "the cat", the comparison of the final "t" with a following
-character cannot take place, so a partial match is found. However,
-pcre_exec() carries on with normal matching, which matches \b at the end
-of the subject when the last character is a letter, thus finding a complete
-match. The result, therefore, is not PCRE_ERROR_PARTIAL. The same thing
-happens with pcre_dfa_exec(), because it also finds the complete match.
+character cannot take place, so a partial match is found. However, normal
+matching carries on, and \b matches at the end of the subject when the last
+character is a letter, so a complete match is found. The result, therefore, is
+not PCRE_ERROR_PARTIAL. Using PCRE_PARTIAL_HARD in this case does yield
+PCRE_ERROR_PARTIAL, because then the partial match takes precedence.

-

-Using PCRE_PARTIAL_HARD in this case does yield PCRE_ERROR_PARTIAL, because
-then the partial match takes precedence.
-


FORMERLY RESTRICTED PATTERNS

For releases of PCRE prior to 8.00, because of the way certain internal
optimizations were implemented in the pcre_exec() function, the
PCRE_PARTIAL option (predecessor of PCRE_PARTIAL_SOFT) could not be used with
all patterns. From release 8.00 onwards, the restrictions no longer apply, and
-partial matching with pcre_exec() can be requested for any pattern.
+partial matching with can be requested for any pattern.

Items that were formerly restricted were repeated single characters and
@@ -265,22 +264,21 @@ that uses the date example quoted above:
The first data string is matched completely, so pcretest shows the matched
substrings. The remaining four strings do not match the complete pattern, but
the first two are partial matches. Similar output is obtained
-when pcre_dfa_exec() is used.
+if DFA matching is used.

If the escape sequence \P is present more than once in a pcretest data line, the PCRE_PARTIAL_HARD option is set for the match.

-MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()
+MULTI-SEGMENT MATCHING WITH pcre_dfa_exec() OR pcre16_dfa_exec()

-When a partial match has been found using pcre_dfa_exec(), it is possible
-to continue the match by providing additional subject data and calling
-pcre_dfa_exec() again with the same compiled regular expression, this
-time setting the PCRE_DFA_RESTART option. You must pass the same working
-space as before, because this is where details of the previous partial match
-are stored. Here is an example using pcretest, using the \R escape
-sequence to set the PCRE_DFA_RESTART option (\D specifies the use of
-pcre_dfa_exec()):
+When a partial match has been found using a DFA matching function, it is
+possible to continue the match by providing additional subject data and calling
+the function again with the same compiled regular expression, this time setting
+the PCRE_DFA_RESTART option. You must pass the same working space as before,
+because this is where details of the previous partial match are stored. Here is
+an example using pcretest, using the \R escape sequence to set the
+PCRE_DFA_RESTART option (\D specifies the use of the DFA matching function):

     re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
   data> 23ja\P\D
@@ -297,33 +295,35 @@ program to do that if it needs to.
 

You can set the PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with
PCRE_DFA_RESTART to continue partial matching over multiple segments. This
-facility can be used to pass very long subject strings to
-pcre_dfa_exec().
+facility can be used to pass very long subject strings to the DFA matching
+functions.

-MULTI-SEGMENT MATCHING WITH pcre_exec()
+MULTI-SEGMENT MATCHING WITH pcre_exec() OR pcre16_exec()

-From release 8.00, pcre_exec() can also be used to do multi-segment
-matching. Unlike pcre_dfa_exec(), it is not possible to restart the
-previous match with a new segment of data. Instead, new data must be added to
-the previous subject string, and the entire match re-run, starting from the
-point where the partial match occurred. Earlier data can be discarded. It is
-best to use PCRE_PARTIAL_HARD in this situation, because it does not treat the
-end of a segment as the end of the subject when matching \z, \Z, \b, \B,
-and $. Consider an unanchored pattern that matches dates:
+From release 8.00, the standard matching functions can also be used to do
+multi-segment matching. Unlike the DFA functions, it is not possible to
+restart the previous match with a new segment of data. Instead, new data must
+be added to the previous subject string, and the entire match re-run, starting
+from the point where the partial match occurred. Earlier data can be discarded.
+
+
+It is best to use PCRE_PARTIAL_HARD in this situation, because it does not
+treat the end of a segment as the end of the subject when matching \z, \Z,
+\b, \B, and $. Consider an unanchored pattern that matches dates:

     re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
   data> The date is 23ja\P\P
   Partial match: 23ja
 
At this stage, an application could discard the text preceding "23ja", add on
-text from the next segment, and call pcre_exec() again. Unlike
-pcre_dfa_exec(), the entire matching string must always be available, and
+text from the next segment, and call the matching function again. Unlike the
+DFA matching functions the entire matching string must always be available, and
the complete matching process occurs for each call, so more memory and more
processing time is needed.

Note: If the pattern contains lookbehind assertions, or \K, or starts
-with \b or \B, the string that is returned for a partial match will include
+with \b or \B, the string that is returned for a partial match includes
characters that precede the partially matched string itself, because these must
be retained when adding on more characters for a subsequent matching attempt.
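
The "discard earlier text, add the next segment, match again" step can be sketched as plain buffer handling. This is an illustrative sketch only: the function name carry_over and its parameters are hypothetical, and keep_from is assumed to be the first offset reported for the partial match, which already covers any lookbehind or \b context that must be retained.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of PCRE): keeps buf[keep_from..*len) at
   the front of buf, appends the next segment after it, and updates *len.
   Returns 0 on success, or -1 if keep_from is out of range or the result
   would not fit in cap bytes. The caller then re-runs the match on buf. */
int carry_over(char *buf, size_t *len, size_t cap, size_t keep_from,
               const char *segment, size_t seglen)
{
    size_t kept;

    if (keep_from > *len)
        return -1;
    kept = *len - keep_from;
    if (kept + seglen > cap)
        return -1;
    memmove(buf, buf + keep_from, kept);   /* regions may overlap */
    memcpy(buf + kept, segment, seglen);
    *len = kept + seglen;
    return 0;
}
```

Starting from "The date is 23ja" with a partial match at offset 12, appending the segment "n12." leaves "23jan12." in the buffer, which is the subject for the next matching call.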

@@ -369,14 +369,14 @@ longer possible. Consider again this pcretest e
    0: dogsbody
    1: dog
-The first data line passes the string "dogsb" to pcre_exec(), setting the
-PCRE_PARTIAL_SOFT option. Although the string is a partial match for
-"dogsbody", the result is not PCRE_ERROR_PARTIAL, because the shorter string
-"dog" is a complete match. Similarly, when the subject is presented to
-pcre_dfa_exec() in several parts ("do" and "gsb" being the first two) the
-match stops when "dog" has been found, and it is not possible to continue. On
-the other hand, if "dogsbody" is presented as a single string,
-pcre_dfa_exec() finds both matches.
+The first data line passes the string "dogsb" to a standard matching function,
+setting the PCRE_PARTIAL_SOFT option. Although the string is a partial match
+for "dogsbody", the result is not PCRE_ERROR_PARTIAL, because the shorter
+string "dog" is a complete match. Similarly, when the subject is presented to
+a DFA matching function in several parts ("do" and "gsb" being the first two)
+the match stops when "dog" has been found, and it is not possible to continue.
+On the other hand, if "dogsbody" is presented as a single string, a DFA
+matching function finds both matches.

Because of these problems, it is best to use PCRE_PARTIAL_HARD when matching
@@ -390,10 +390,9 @@ multi-segment data. The example above then behaves dif
   data> gsb\R\P\P\D
   Partial match: gsb
-4. Patterns that contain alternatives at the top level which do not all
-start with the same pattern item may not work as expected when
-PCRE_DFA_RESTART is used with pcre_dfa_exec(). For example, consider this
-pattern:
+4. Patterns that contain alternatives at the top level which do not all start
+with the same pattern item may not work as expected when PCRE_DFA_RESTART is
+used. For example, consider this pattern:

   1234|3789
 
@@ -409,8 +408,8 @@ patterns or patterns such as:
   1234|ABCD
where no string can be a partial match for both alternatives. This is not a
-problem if pcre_exec() is used, because the entire match has to be rerun
-each time:
+problem if a standard matching function is used, because the entire match has
+to be rerun each time:
     re> /1234|3789/
   data> ABC123\P\P
@@ -419,7 +418,7 @@ each time:
    0: 3789
 
Of course, instead of using PCRE_DFA_RESTART, the same technique of re-running
-the entire match can also be used with pcre_dfa_exec(). Another
+the entire match can also be used with the DFA matching functions. Another
possibility is to work with two buffers. If a partial match at offset n
in the first buffer is followed by "no match" when PCRE_DFA_RESTART is used on
the second buffer, you can then try a new match starting at offset n+1 in
@@ -436,9 +435,9 @@ Cambridge CB2 3QH, England.


REVISION

-Last updated: 26 August 2011
+Last updated: 21 January 2012
-Copyright © 1997-2011 University of Cambridge.
+Copyright © 1997-2012 University of Cambridge.

Return to the PCRE index page.