Question
<table((?!</table>).)*</table>
matches all my table tags. However,
<table(.(?!</table>))*</table>
does not. The second one seems to make sense if I try to write out the expression in words, but I can't make sense of the first.
What is the difference ?
For reference, I got the term "Tempered Greedy Token" from here: [Tempered Greedy Token Solution](http://www.rexegg.com/regex- quantifiers.html#tempered_greed)
Answer
What is a Tempered Greedy Token?
The rexegg.com [tempered greedy token](http://www.rexegg.com/regex- quantifiers.html#tempered_greed) reference is quite concise:
In
(?:(?!{END}).)*
, the*
quantifier applies to a dot, but it is now a tempered dot. The negative lookahead(?!{END})
asserts that what follows the current position is not the string{END}
. Therefore, the dot can never match the opening brace of{END}
, guaranteeing that we won't jump over the{END}
delimiter.
That is it: a tempered greedy token is a kind of a negated character class for a character sequence (cf. negated character class for a single character ).
NOTE : The difference between a tempered greedy token and a negated
character class is that the former does not really match the text other than
the sequence itself, but a single character that does not start that
sequence. I.e. (?:(?!abc|xyz).)+
won't
match def
in defabc
, but will match def
and bc
, because a
starts
the forbidden abc
sequence, and bc
does not.
It consists of:
(?:...)*
- a quantified non-capturing group (it may be a capturing group, but it makes no sense to capture each individual character) (a*
can be+
, it depends on whether an empty string match is expected)(?!...)
- a negative lookahead that actually imposes a restriction on the value to the right of the current location.
- (or any (usually single) character) a consuming pattern.
However, we can always further temper the token by using alternations in the
negative lookahead (e.g. (?!{(?:END|START|MID)})
) or by replacing the all-
matching dot with a negated character class (e.g. (?:(?!START|END|MID)[^<>])
when trying to match text only inside tags).
Consuming part placement
Note there is no mentioning of a construction where a consuming part (the dot
in the original tempered greedy token) is placed before the lookahead.
Avinash's answer is explaining that part clearly: (.(?!</table>))*
first
matches any character (but a newline without a DOTALL modifier) and then
checks if it is not followed with </table>
resulting in a failure to match
e
in <table>table</table>
. * The
consuming part (the.
) must be placed after the tempering
lookahead.
When should we use a tempered greedy token?
Rexegg.com gives an idea:
- When we want to match a block of text between Delimiter 1 and Delimiter 2 with no Substring 3 in-between (e.g.
{START}(?:(?!{(?:MID|RESTART)}).)*?{END}
- When we want to match a block of text containing a specific pattern inside without overflowing subsequent blocks (e.g. instead of lazy dot matching as in
<table>.*?chair.*?</table>
, we'd use something like<table>(?:(?!chair|</?table>).)*chair(?:(?!<table>).)*</table>
). - When we want to match the shortest window possible between 2 strings. Lazy matching won't help when you need to get
abc 2 xyz
fromabc 1 abc 2 xyz
(seeabc.*?xyz
andabc(?:(?!abc).)*?xyz
).
Performance Issue
Tempered greedy token is resource-consuming as a lookahead check is performed after each character matched with the consuming pattern. Unrolling the loop technique can significantly increase tempered greedy token performance.
Say, we want to match abc 2 xyz
in abc 1 abc 2 xyz 3 xyz. Instead of
checking each character between abc
and xyz
with
abc(?:(?!abc|xyz).)*xyz
, we can skip all
characters that are not a
or x
with [^ax]*
, and then match all a
that
are not followed with bc
(with a(?!bc)
) and all x
that are not followed
with yz
(with x(?!yz)
):
abc[^ax]*(?:a(?!bc)[^ax]*|x(?!yz)[^ax]*)*xyz
.