Python Regex Catastrophic Backtracking in URL handling

I want to write a regex to capture URLs in a text. The problem is that any decent regex I use for capturing URLs encounters catastrophic backtracking on some URLs.

I have tried the "diegoperini" regex from here and have also read other questions and answers here, here, and here. However, none of them solved my problem.

I also have three other regexes:

Regex:
SIMPLE_URL_REGEX = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*(),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
WEB_URL_REGEX = r"""(?i)\b((?:https?:(?:/{1,3}|[a-z0-9%])|[a-z0-9.\-]+[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)/)(?:[^\s()<>{}\[\]]+|\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\))+(?:\([^\s()]*?\([^\s()]+\)[^\s()]*?\)|\([^\s]+?\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])|(?:(?<!@)[a-z0-9]+(?:[.\-][a-z0-9]+)*[.](?:com|net|org|edu|gov|mil|aero|asia|biz|cat|coop|info|int|jobs|mobi|museum|name|post|pro|tel|travel|xxx|ac|ad|ae|af|ag|ai|al|am|an|ao|aq|ar|as|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|cr|cs|cu|cv|cx|cy|cz|dd|de|dj|dk|dm|do|dz|ec|ee|eg|eh|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|io|iq|ir|is|it|je|jm|jo|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mk|ml|mm|mn|mo|mp|mq|mr|ms|mt|mu|mv|mw|mx|my|mz|na|nc|ne|nf|ng|ni|nl|no|np|nr|nu|nz|om|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|Ja|sk|sl|sm|sn|so|sr|ss|st|su|sv|sx|sy|sz|tc|td|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|ye|yt|yu|za|zm|zw)\b/?(?!@)))"""
ANY_URL_REGEX = r"""(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'\".,<>?«»“”‘’]))"""

The simple URL regex does not get trapped in any of the cases I tried, but it also does not perform very well and misses many URLs; the other two perform better but get trapped in some cases.

Now, one part of my problem was the encoding of non-ASCII URLs, which was apparently solved by decoding the text like this:

from urllib.parse import unquote  # on Python 2: from urllib import unquote

try:
    # Percent-decode the text, treating it as UTF-8 encoded bytes
    meta = unquote(meta.encode('utf-8')).decode('utf-8')
except TypeError:
    # Fall back to decoding the string directly
    meta = unquote(meta)

But some time later another problematic URL came up, something like this one:

https://www.example.net/ar/book//%DA%A9-%D8%B3-(%D9%81-%DB%8C-%DB%8C-%DB%8C-%DB%8C-%DB%8C-%DB%8C-%D9%85)

Such URLs are rare, but when they appear they trigger very inefficient backtracking. This backtracking causes the program to stop responding indefinitely. (As I have read here, the problem is that the regex module does not release the GIL.)

Considering all this information, I have two questions:

  • First, is it possible to have / is there a regex pattern for matching URLs that performs reasonably and avoids catastrophic backtracking completely?
  • Second, if there is no such regex, is there another way to catch the cases where the regex gets trapped and either throw an exception or bypass it some other way?

Answer

You're facing a common issue with regular expressions, particularly when dealing with URL matching, where catastrophic backtracking can occur for certain inputs. To address your two questions:

1. Is it possible to have a regex pattern for matching URLs that performs reasonably and avoids catastrophic backtracking completely?

In short, it's difficult to completely avoid catastrophic backtracking when matching something as complex as a URL with a backtracking regex engine. The risk comes from nested or overlapping quantifiers: a URL pattern has to allow many optional components (protocol, domain, path, query parameters, balanced parentheses, etc.), and when several quantified sub-patterns can match the same characters, a non-matching input can force the engine to try an exponential number of ways to split the text before giving up.
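To see the effect in isolation, here is a minimal illustration using a textbook pathological pattern (not a URL regex): the trailing 'b' guarantees a failed match, and each extra 'a' roughly doubles the work the engine does before it can report that failure.

import re
import time

# Nested quantifiers over the same characters force the engine to try every
# way of splitting the run of 'a's before it can conclude there is no match.
pattern = re.compile(r'(a+)+$')

for n in (18, 20, 22):  # keep n small: the runtime grows roughly exponentially
    start = time.perf_counter()
    pattern.match('a' * n + 'b')
    print(n, 'characters:', round(time.perf_counter() - start, 3), 'seconds')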

However, there are ways to mitigate the issue:

  • Limit your regex complexity: a simpler regex that covers the most common URL formats is far less likely to backtrack catastrophically, at the cost of missing some exotic URLs.

  • Use a regex engine that gives you control over backtracking: the third-party regex module (an alternative to the built-in re module) supports possessive quantifiers, atomic groups, and a matching timeout, which can keep a pathological input from hanging your program.

Here's an example of a simpler URL matching regex, which might strike a balance between accuracy and performance:

import re

# A simpler regex for matching URLs
SIMPLE_URL_REGEX = r'https?://[a-zA-Z0-9.-]+(?:/[^\s]*)?'

# Example usage
text = "Here is a link: https://www.example.com/test/page, and another: http://example.org"
matches = re.findall(SIMPLE_URL_REGEX, text)
print(matches)

This regex matches URLs that start with http or https, followed by a domain and an optional path. It contains no nested quantifiers, so it cannot backtrack catastrophically, but it is not precise: it stops at a port number (the ':' is not in the domain character class) and it greedily swallows trailing punctuation, such as the comma after /test/page in the example above.

For comprehensive matching of URLs, libraries like urllib or specialized URL parsers might be better options.

2. If there’s no perfect regex, how can we handle catastrophic backtracking or bypass problematic URLs?

If you're still facing catastrophic backtracking with complex URLs, here are some strategies to handle them:

a. Use Python’s regex module:

The regex module (which can be installed via pip install regex) is not automatically immune to catastrophic backtracking, but it gives you tools that the built-in re module lacks for controlling it, and recent versions accept a timeout argument so a runaway match raises TimeoutError instead of hanging the program.

Here’s how you might use it:

import regex as re

# Drop-in usage: the same simple pattern works unchanged with the regex module
SIMPLE_URL_REGEX = r'https?://[a-zA-Z0-9.-]+(?:/[^\s]*)?'

text = "Here is a link: https://www.example.com/test/page"
matches = re.findall(SIMPLE_URL_REGEX, text)
print(matches)

The regex module is largely a drop-in replacement for re and is worth trying for your use case. It also supports features such as possessive quantifiers and atomic groups, which forbid backtracking into parts of the match the engine has already committed to.
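For example, here is a possessive variant of the simple pattern above; this is a sketch rather than a complete URL grammar, but it shows the syntax.

import regex

# Possessive quantifiers (++, *+) never give back characters they have consumed,
# so the engine cannot retry different splits of the same text.
# (The built-in re module only gained these in Python 3.11.)
POSSESSIVE_URL_REGEX = r'https?://[a-zA-Z0-9.-]++(?:/[^\s]*+)?'

text = "Here is a link: https://www.example.com/test/page"
print(regex.findall(POSSESSIVE_URL_REGEX, text))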

b. Limit your regex search to smaller chunks:

Instead of searching the entire text at once, you can split the text into smaller chunks or lines and apply the regex to each chunk individually. This limits how much text a single match attempt has to cover, and it means a problematic URL can only affect the chunk that contains it.

Here’s an example of how you might split the input into chunks:

import re

SIMPLE_URL_REGEX = r'https?://[a-zA-Z0-9.-]+(?:/[^\s]*)?'

# Split text into smaller chunks (e.g., sentences or paragraphs)
text = "Some large text with URLs"
chunks = text.split("\n")  # You could split by other delimiters as needed

matches = []
for chunk in chunks:
    matches.extend(re.findall(SIMPLE_URL_REGEX, chunk))

print(matches)

This way, a problematic URL only stalls the chunk that contains it rather than the entire search, and combined with a timeout (next section) you can skip just that chunk.

c. Use a timeout and a try-except block to bypass problematic chunks:

Note that catastrophic backtracking does not raise re.error; that exception only signals an invalid pattern at compile time. With the built-in re module a trapped match simply keeps running, so there is nothing for a try-except to catch. What you can catch is a timeout: recent versions of the regex module accept a timeout argument (in seconds) on their matching functions and raise TimeoutError when it is exceeded, which lets you skip that chunk of text or log an error and move on.

Here's an example:

import regex

SIMPLE_URL_REGEX = r'https?://[a-zA-Z0-9.-]+(?:/[^\s]*)?'

text = "Some large text with problematic URLs"

try:
    # Abort any match attempt that runs for more than 2 seconds
    matches = regex.findall(SIMPLE_URL_REGEX, text, timeout=2)
    print(matches)
except TimeoutError as e:
    print(f"Regex timed out: {e}")
    matches = []  # Handle the error or fall back to a simpler approach

In the example above, if the pattern starts backtracking catastrophically on some URL, the timeout expires, the TimeoutError is caught by the except block, and you can decide how to handle the issue (e.g., by skipping that portion of the text or using a simpler matching approach).
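If you have to stay on the built-in re module, the only reliable way to bound a match is to run it somewhere you can abandon, for example in a separate process with a join timeout. A rough sketch under that assumption (the 5-second limit is arbitrary):

import re
from multiprocessing import Process, Queue

SIMPLE_URL_REGEX = r'https?://[a-zA-Z0-9.-]+(?:/[^\s]*)?'

def find_urls(text, queue):
    queue.put(re.findall(SIMPLE_URL_REGEX, text))

if __name__ == '__main__':
    text = "Some large text with problematic URLs"

    queue = Queue()
    worker = Process(target=find_urls, args=(text, queue))
    worker.start()
    worker.join(timeout=5)      # give the match at most 5 seconds

    if worker.is_alive():       # still running: assume the regex is trapped
        worker.terminate()
        worker.join()
        matches = []            # skip this chunk of text
    else:
        matches = queue.get()

    print(matches)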

d. Use URL Parsing Libraries:

For more reliable and accurate URL matching, you might consider using a dedicated URL parsing library instead of regular expressions. Here’s an example using urllib from Python's standard library:

from urllib.parse import urlparse

text = "Here is a link: https://www.example.com/test/page"

# Try to parse URLs manually
matches = []
for word in text.split():
    try:
        result = urlparse(word)
        if result.scheme and result.netloc:
            matches.append(word)
    except ValueError:
        continue

print(matches)

While this approach doesn't use regex and therefore avoids catastrophic backtracking entirely, it only finds whitespace-delimited tokens that carry an explicit scheme, and punctuation that trails a URL in running text stays attached to the result.
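A small refinement, since URLs in prose often drag punctuation along with them, is to trim common trailing characters from each token before parsing; a sketch (the punctuation set is an assumption):

from urllib.parse import urlparse

def extract_urls(text):
    """Return whitespace-separated tokens that parse as absolute http(s) URLs."""
    urls = []
    for word in text.split():
        candidate = word.rstrip('.,;:!?)"\'')  # punctuation that often trails a URL
        parsed = urlparse(candidate)
        if parsed.scheme in ('http', 'https') and parsed.netloc:
            urls.append(candidate)
    return urls

print(extract_urls("See https://www.example.com/test/page, or http://example.org."))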

Conclusion:

  • Option 1: Try using the regex module, which gives you possessive quantifiers, atomic groups, and a matching timeout.
  • Option 2: Use a simpler regex and handle text in smaller chunks.
  • Option 3: Bound each match with a timeout (the regex module's timeout argument, or a separate process when using re) and catch it to skip problematic chunks.
  • Option 4: Consider using URL parsing libraries like urllib for more accurate URL detection without regex.

If you continue to have performance issues or need more flexibility, switching to a URL parsing library may be your best bet to avoid the pitfalls of regex-based URL matching.