Question
Im using this regex to get the contents of a tag in a file.
var regex = new RegExp("<tag:main>((?:.|\\s)*)</tag:main>");
This causes the v8 engine to hang indefinitely.
Now, if I use new RegExp("<tag:main>([\s\S]*)</tag:main>")
, all is good.
Anyone have an idea why the first one takes too long?
Answer
This catastrophically backtracks on long sequences of spaces that occur after
the last closing </tag:main>
tag. Consider the case where the subject string
ends with 100 spaces. First it matches them all with the .
on the left of
the alternation. That fails because there's no closing tag, so it tries
matching the last character with the \s
instead. That fails too, so it tries
matching the second-to-last space as a \s
and the last space as a .
. That
fails (still no closing tag) so it tries the last space as a \s
. When that
fails it matches the third-to-last space as a \s
and tries all 4 ways to
match the last two spaces. When that fails it tries the fourth-to-last space
as a \s
and all 8 ways on the last 3 spaces. Then 16, 32 etc. The universe
ends before it gets to the 100th-to-last space.
Different VMs have different reactions to regexp matches that take forever because of catastrophic backtracking. Some will simply report 'no match'. In V8 it's like writing any other infinite or near-infinite loop.
Using non-greedy *
will do what you want (you want to stop at the first
</tag:main>
, not the last), but will still do catastrophic backtracking for
long strings of spaces where the closing sequence is missing.
Making sure the same characters in the inner bracket can't match both sides of
the alternation will reduce the problem from an exponential one to one that is
linear in the length of the string. Use a character class instead of an
alternation or put \n
on the right hand side of the alternation bar. \n
is
disjoint with .
so if you hit a long sequence of spaces the regexp engine
doesn't try all left-right-left etc. combinations before terminating.