Javascript regex hangs (using v8)

ghz 1years ago ⋅ 10037 views

Question

Im using this regex to get the contents of a tag in a file.

var regex = new RegExp("<tag:main>((?:.|\\s)*)</tag:main>");

This causes the v8 engine to hang indefinitely.

Now, if I use new RegExp("<tag:main>([\s\S]*)</tag:main>"), all is good.

Anyone have an idea why the first one takes too long?


Answer

This catastrophically backtracks on long sequences of spaces that occur after the last closing </tag:main> tag. Consider the case where the subject string ends with 100 spaces. First it matches them all with the . on the left of the alternation. That fails because there's no closing tag, so it tries matching the last character with the \s instead. That fails too, so it tries matching the second-to-last space as a \s and the last space as a .. That fails (still no closing tag) so it tries the last space as a \s. When that fails it matches the third-to-last space as a \s and tries all 4 ways to match the last two spaces. When that fails it tries the fourth-to-last space as a \s and all 8 ways on the last 3 spaces. Then 16, 32 etc. The universe ends before it gets to the 100th-to-last space.

Different VMs have different reactions to regexp matches that take forever because of catastrophic backtracking. Some will simply report 'no match'. In V8 it's like writing any other infinite or near-infinite loop.

Using non-greedy * will do what you want (you want to stop at the first </tag:main>, not the last), but will still do catastrophic backtracking for long strings of spaces where the closing sequence is missing.

Making sure the same characters in the inner bracket can't match both sides of the alternation will reduce the problem from an exponential one to one that is linear in the length of the string. Use a character class instead of an alternation or put \n on the right hand side of the alternation bar. \n is disjoint with . so if you hit a long sequence of spaces the regexp engine doesn't try all left-right-left etc. combinations before terminating.