Tempered Greedy Token - What is different about placing the dot before the negative lookahead?

ghz 1years ago ⋅ 3133 views

Question

<table((?!</table>).)*</table>

matches all my table tags. However,

<table(.(?!</table>))*</table>

does not. The second one seems to make sense if I try to write out the expression in words, but I can't make sense of the first.

What is the difference ?

For reference, I got the term "Tempered Greedy Token" from here: [Tempered Greedy Token Solution](http://www.rexegg.com/regex- quantifiers.html#tempered_greed)


Answer

What is a Tempered Greedy Token?

The rexegg.com [tempered greedy token](http://www.rexegg.com/regex- quantifiers.html#tempered_greed) reference is quite concise:

In (?:(?!{END}).)*, the * quantifier applies to a dot, but it is now a tempered dot. The negative lookahead (?!{END}) asserts that what follows the current position is not the string {END}. Therefore, the dot can never match the opening brace of {END}, guaranteeing that we won't jump over the {END} delimiter.

That is it: a tempered greedy token is a kind of a negated character class for a character sequence (cf. negated character class for a single character ).

NOTE : The difference between a tempered greedy token and a negated character class is that the former does not really match the text other than the sequence itself, but a single character that does not start that sequence. I.e. (?:(?!abc|xyz).)+ won't match def in defabc, but will match def and bc, because a starts the forbidden abc sequence, and bc does not.

It consists of:

  • (?:...)* - a quantified non-capturing group (it may be a capturing group, but it makes no sense to capture each individual character) (a * can be +, it depends on whether an empty string match is expected)
  • (?!...) - a negative lookahead that actually imposes a restriction on the value to the right of the current location
  • . - (or any (usually single) character) a consuming pattern.

However, we can always further temper the token by using alternations in the negative lookahead (e.g. (?!{(?:END|START|MID)})) or by replacing the all- matching dot with a negated character class (e.g. (?:(?!START|END|MID)[^<>]) when trying to match text only inside tags).

Consuming part placement

Note there is no mentioning of a construction where a consuming part (the dot in the original tempered greedy token) is placed before the lookahead. Avinash's answer is explaining that part clearly: (.(?!</table>))* first matches any character (but a newline without a DOTALL modifier) and then checks if it is not followed with </table> resulting in a failure to match e in <table>table</table>. * The consuming part (the.) must be placed after the tempering lookahead.

When should we use a tempered greedy token?

Rexegg.com gives an idea:

  • When we want to match a block of text between Delimiter 1 and Delimiter 2 with no Substring 3 in-between (e.g. {START}(?:(?!{(?:MID|RESTART)}).)*?{END}
  • When we want to match a block of text containing a specific pattern inside without overflowing subsequent blocks (e.g. instead of lazy dot matching as in <table>.*?chair.*?</table>, we'd use something like <table>(?:(?!chair|</?table>).)*chair(?:(?!<table>).)*</table>).
  • When we want to match the shortest window possible between 2 strings. Lazy matching won't help when you need to get abc 2 xyz from abc 1 abc 2 xyz (see abc.*?xyz and abc(?:(?!abc).)*?xyz).

Performance Issue

Tempered greedy token is resource-consuming as a lookahead check is performed after each character matched with the consuming pattern. Unrolling the loop technique can significantly increase tempered greedy token performance.

Say, we want to match abc 2 xyz in abc 1 abc 2 xyz 3 xyz. Instead of checking each character between abc and xyz with abc(?:(?!abc|xyz).)*xyz, we can skip all characters that are not a or x with [^ax]*, and then match all a that are not followed with bc (with a(?!bc)) and all x that are not followed with yz (with x(?!yz)): abc[^ax]*(?:a(?!bc)[^ax]*|x(?!yz)[^ax]*)*xyz.