Regex that only matches text that's not part of HTML markup? (python)
How can I make a pattern match only where the text is not inside an HTML tag?
Here's my attempt. Does anyone have a better or different approach?
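import re

inputstr = 'mary had a <b class="foo"> little loomb</b>'
rx = re.compile('[aob]')
repl = 'x'
outputstr = ''
i = 0
# Splitting on a capturing group keeps the tags, so text and tags alternate.
for astr in re.compile(r'(<[^>]*>)').split(inputstr):
    i = 1 - i
    if i:
        # Odd-numbered pieces are the text between tags; substitute only there.
        astr = rx.sub(repl, astr)
    outputstr += astr
print(outputstr)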
output:
mxry hxd x <b class="foo"> little lxxmx</b>
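For comparison, the same split-and-alternate idea can be written without the toggle variable. This is only a sketch: the helper name sub_outside_tags and the constant TAG_RX are made up for illustration, and it reuses the same simplistic tag pattern, so it has the same limitations.

```python
import re

# Same simplistic tag pattern as in the attempt above (an assumption, not a real HTML parser).
TAG_RX = re.compile(r'(<[^>]*>)')

def sub_outside_tags(rx, repl, text):
    """Apply rx.sub(repl, ...) only to the pieces of text that lie between tags."""
    parts = TAG_RX.split(text)
    # With one capturing group, split() puts plain text at even indices
    # and the captured tags at odd indices.
    parts[::2] = [rx.sub(repl, part) for part in parts[::2]]
    return ''.join(parts)

result = sub_outside_tags(re.compile('[aob]'), 'x',
                          'mary had a <b class="foo"> little loomb</b>')
print(result)
```

It relies on re.split returning the captured tags at odd positions, so only the even-indexed pieces are rewritten before everything is joined back together.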
Notes:
- The <[^>]*> pattern to match HTML tags is obviously flawed -- I wrote this quickly and didn't account for the possibility of angle brackets within quoted attributes (e.g. ''). It doesn't account for