Regex that only matches text that's not part of HTML markup? (python)

ghz 12 hours ago ⋅ 8 views

How can I make a pattern match only when the matched text is not inside an HTML tag?

Here's my attempt below. Anyone have a better/different approach?

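import re

inputstr = 'mary had a <b class="foo"> little loomb</b>'

rx = re.compile('[aob]')
repl = 'x'

# Splitting on a capturing group keeps the tags in the result, so the
# pieces alternate: text, tag, text, tag, ...
tag_rx = re.compile(r'(<[^>]*>)')

outputstr = ''
i = 0

for astr in tag_rx.split(inputstr):
    i = 1 - i  # 1 for text pieces, 0 for tag pieces

    if i:
        # only text outside of tags gets the substitution
        astr = rx.sub(repl, astr)

    outputstr += astr

print(outputstr)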

output:

mxry hxd x <b class="foo"> little lxxmx</b>
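
A possible single-pass variant of the same idea, as a rough sketch only: put the tag pattern and the target pattern into one alternation and let a replacement function decide what to return. It assumes the same simple tag pattern as the loop above, so it has the same weaknesses; replace_outside_tags is just a hypothetical helper name.

```python
import re

# Same simple tag pattern as in the loop above; it is still fooled by a
# '>' inside a quoted attribute. The tag branch is tried first, so any
# character inside a tag is consumed as part of the tag and never
# reaches the [aob] branch.
combined_rx = re.compile(r'(<[^>]*>)|[aob]')


def replace_outside_tags(text, repl='x'):
    """Replace [aob] with repl, leaving anything matched as a tag untouched.

    Hypothetical helper for illustration only.
    """
    def _sub(match):
        if match.group(1) is not None:
            # The tag branch matched: return the tag unchanged.
            return match.group(1)
        # Otherwise one of the target characters matched outside a tag.
        return repl

    return combined_rx.sub(_sub, text)


print(replace_outside_tags('mary had a <b class="foo"> little loomb</b>'))
```

This avoids the split/rejoin bookkeeping, at the cost of hard-wiring both patterns into one regex.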

Notes:

  • The <[^>]*> pattern to match HTML tags is obviously flawed -- I wrote this quickly and didn't account for the possibility of angle brackets within quoted attributes (e.g. 'next >'). It doesn't account for