Question
I know it's possible to match a word and then reverse the matches using other tools (e.g. grep -v). However, is it possible to match lines that do not contain a specific word, e.g. hede, using a regular expression?
Input:
hoho
hihi
haha
hede
Code:
grep "<Regex for 'doesn't contain hede'>" input
Desired output:
hoho
hihi
haha
Answer
The notion that regex doesn't support inverse matching is not entirely true. You can mimic this behavior by using negative look-arounds:
^((?!hede).)*$
The regex above will match any string, or line without a line break, not containing the (sub)string 'hede'. As mentioned, this is not something regex is "good" at (or should do), but still, it is possible.
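As a rough illustration, the pattern can be tried against the input from the question. This is only a sketch: it assumes Python's re module (which supports look-aheads), and the helper name lines_without_hede is made up for this example. With GNU grep, the same pattern would need the -P (PCRE) option, because the default POSIX syntaxes have no look-arounds.

```python
import re

# Pattern from the answer: no character may sit at the start of "hede".
NO_HEDE = re.compile(r'^((?!hede).)*$')

def lines_without_hede(lines):
    """Hypothetical helper: keep the lines that do not contain 'hede'."""
    return [line for line in lines if NO_HEDE.search(line)]

sample = ["hoho", "hihi", "haha", "hede"]
print(lines_without_hede(sample))  # ['hoho', 'hihi', 'haha']
```

Note that an empty line also matches, since the group is allowed to repeat zero times.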
And if you need to match line break chars as well, use the DOT-ALL modifier (the trailing s in the following pattern):
/^((?!hede).)*$/s
or use it inline:
/(?s)^((?!hede).)*$/
(where the /.../ are the regex delimiters, i.e., not part of the pattern)
If the DOT-ALL modifier is not available, you can mimic the same behavior with the character class [\s\S]:
/^((?!hede)[\s\S])*$/
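The three variants can be compared side by side. The sketch below assumes Python's re module, where the DOT-ALL modifier is spelled re.DOTALL (or (?s) inline) and there are no /.../ delimiters; the sample strings are invented for the example.

```python
import re

clean = "hoho\nhihi"   # contains a line break, but no "hede"
dirty = "hoho\nhede"   # contains "hede" on the second line

plain      = re.compile(r'^((?!hede).)*$')             # dot stops at "\n"
dotall     = re.compile(r'^((?!hede).)*$', re.DOTALL)  # trailing s modifier
inline     = re.compile(r'(?s)^((?!hede).)*$')         # inline modifier
char_class = re.compile(r'^((?!hede)[\s\S])*$')        # no modifier needed

# Without DOT-ALL the dot cannot cross the line break, so nothing matches.
print(plain.search(clean))  # None

# The three line-break-aware variants agree with each other.
for pattern in (dotall, inline, char_class):
    print(bool(pattern.search(clean)), bool(pattern.search(dirty)))  # True False
```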
Explanation
A string is just a list of n characters. Before, and after each character, there's an empty string. So a list of n characters will have n+1 empty strings. Consider the string "ABhedeCD":
    ┌──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┬───┬──┐
S = │e1│ A │e2│ B │e3│ h │e4│ e │e5│ d │e6│ e │e7│ C │e8│ D │e9│
    └──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┴───┴──┘

index    0      1      2      3      4      5      6      7
where the e's are the empty strings. The regex (?!hede). looks ahead to see if there's no substring "hede" to be seen, and if that is the case (so something else is seen), then the . (dot) will match any character except a line break. Look-arounds are also called zero-width assertions because they don't consume any characters. They only assert/validate something.
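The "empty strings" and the zero-width nature of the look-ahead can be observed directly. This is a small sketch assuming Python's re module; the offsets it prints are 0-based, so empty string e3 in the diagram corresponds to offset 2.

```python
import re

s = "ABhedeCD"

# An empty pattern matches at every gap: before, between and after the characters.
gaps = [m.start() for m in re.finditer(r'', s)]
print(len(s), len(gaps))  # 8 9

# The look-ahead alone matches (with zero width) at every gap
# where "hede" does NOT start.
allowed = [m for m in re.finditer(r'(?!hede)', s)]
print([m.start() for m in allowed])           # [0, 1, 3, 4, 5, 6, 7, 8]
print(all(m.group() == '' for m in allowed))  # True: nothing is consumed
```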
So, in my example, every empty string is first validated to see if there's no "hede" up ahead, before a character is consumed by the . (dot). The regex (?!hede). will do that only once, so it is wrapped in a group, and repeated zero or more times: ((?!hede).)*. Finally, the start- and end-of-input are anchored to make sure the entire input is consumed: ^((?!hede).)*$
As you can see, the input "ABhedeCD" will fail because on e3, the regex (?!hede) fails (there is "hede" up ahead!).
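To see where the repetition stops, the unanchored group can be run on its own. Again this is only a sketch assuming Python's re module; the second test string "ABhedCD" is an invented counter-example without "hede".

```python
import re

s = "ABhedeCD"

# Without the anchors, the repeated group consumes characters until the
# look-ahead fails, which happens right before "hede" (offset 2, i.e. e3).
prefix = re.match(r'((?!hede).)*', s)
print(prefix.group(0), prefix.end())  # AB 2

# With the anchors, stopping early is not allowed, so the whole match fails.
anchored = re.search(r'^((?!hede).)*$', s)
print(anchored)  # None

# A string without "hede" is consumed completely.
other = re.search(r'^((?!hede).)*$', "ABhedCD")
print(other.group(0))  # ABhedCD
```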