Pandas: How to return rows where a column has a line break/newline (\n) with one of several case-sensitive words coming directly after?
A related question shows how to get a word which follows a newline.
I would now like to return rows where the column has one of several case-sensitive words directly after a newline.
Here is a minimal example:
testdf = pd.DataFrame([
[ ' generates the final summary. \nRESULTS We evaluate the performance of ', ],
[ 'the cat and bat \n\n\nRESULTS\n teamed up to find some food'],
['anthropology with RESULTS pharmacology and biology'],
[ ' generates the final summary. \nMethods We evaluate the performance of ', ],
[ 'the cat and bat \n\n\nMETHODS\n teamed up to find some food'],
['anthropology with METHODS pharmacology and biology'],
[ ' generates the final summary. \nBACKGROUND We evaluate the performance of ', ],
[ 'the cat and bat \n\n\nBackground\n teamed up to find some food'],
['anthropology with BACKGROUND pharmacology and biology'],
])
testdf.columns = ['A']
testdf.head(10)
will return
A
0 generates the final summary. \nRESULTS We evaluate the performance of
1 the cat and bat \n\n\nRESULTS\n teamed up to find some food
2 anthropology with RESULTS pharmacology and biology
3 generates the final summary. \nMethods We evaluate the performance of
4 the cat and bat \n\n\nMETHODS\n teamed up to find some food
5 anthropology with METHODS pharmacology and biology
6 generates the final summary. \nBACKGROUND We evaluate the performance of
7 the cat and bat \n\n\nBackground\n teamed up to find some food
8 anthropology with BACKGROUND pharmacology and biology
And then
listStrings = { '\nRESULTS', '\nMETHODS' , '\nBACKGROUND' }
testdf.loc[testdf.A.apply(lambda x: len(listStrings.intersection(x.split())) >= 1)]
This returns nothing. The desired result is the following rows:
A
0 generates the final summary. \nRESULTS We evaluate the performance of
1 the cat and bat \n\n\nRESULTS\n teamed up to find some food
4 the cat and bat \n\n\nMETHODS\n teamed up to find some food
6 generates the final summary. \nBACKGROUND We evaluate the performance of
These are rows where the word follows a '\n' and matches the case in the given set.
Answer
To filter rows based on case-sensitive words that directly follow a newline (\n), you can use a regular expression to match the desired pattern. Here's a solution:
import pandas as pd
import re
# Create the DataFrame
testdf = pd.DataFrame([
[' generates the final summary. \nRESULTS We evaluate the performance of '],
['the cat and bat \n\n\nRESULTS\n teamed up to find some food'],
['anthropology with RESULTS pharmacology and biology'],
[' generates the final summary. \nMethods We evaluate the performance of '],
['the cat and bat \n\n\nMETHODS\n teamed up to find some food'],
['anthropology with METHODS pharmacology and biology'],
[' generates the final summary. \nBACKGROUND We evaluate the performance of '],
['the cat and bat \n\n\nBackground\n teamed up to find some food'],
['anthropology with BACKGROUND pharmacology and biology'],
], columns=['A'])
# Define the set of case-sensitive words to match
list_strings = {'\nRESULTS', '\nMETHODS', '\nBACKGROUND'}
# Build a regular expression that matches any of the words, leading newline included
pattern = r'(?:' + '|'.join(re.escape(word) for word in list_strings) + r')(?!\S)'
# Filter the rows where the pattern matches
filtered_df = testdf[testdf['A'].str.contains(pattern, regex=True)]
print(filtered_df)
Explanation:
- Regex Pattern Construction:
  - re.escape: Escapes special characters in the words (like \n) so they are matched literally.
  - '|'.join: Joins the escaped words in list_strings with the | (logical OR) operator.
  - (?:...): A non-capturing group around the alternatives. A capturing group would make str.contains emit a UserWarning about match groups.
  - (?!\S): Negative lookahead for a non-whitespace character, so the match must end before whitespace or at the end of the string. For example, '\nRESULTS' will not match inside '\nRESULTSET'.
  - No lookbehind is placed before the group. Each alternative already begins with \n, and a (?<!\S) lookbehind would also require whitespace (or the start of the string) before the newline, so text such as 'summary.\nRESULTS' would be skipped.
- str.contains:
  - Filters rows where the regex pattern is found in column A.
- Case-Sensitivity:
  - Regex matching is case-sensitive by default, so '\nMethods' and '\nBackground' do not match.
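For comparison, the set-intersection attempt in the question returns nothing because str.split() with no arguments splits on any run of whitespace, newlines included, so none of the resulting tokens can start with '\n'. A minimal sketch on one of the sample strings (the variable names here are only for illustration):

```python
text = ' generates the final summary. \nRESULTS We evaluate the performance of '
targets = {'\nRESULTS', '\nMETHODS', '\nBACKGROUND'}

# split() with no arguments treats '\n' as a separator and discards it,
# so the token is 'RESULTS', never '\nRESULTS'.
tokens = text.split()
print('RESULTS' in tokens)
print('\nRESULTS' in tokens)

# The intersection with the newline-prefixed targets is therefore empty.
print(targets.intersection(tokens))

# A plain substring test keeps the newline and does find the target,
# although it has no check on what follows the word.
print(any(t in text for t in targets))
```

This is why the answer searches the whole string with a regular expression instead of comparing whitespace-split tokens.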
Output:
The resulting filtered_df contains the desired rows:
A
0 generates the final summary. \nRESULTS We eval...
1 the cat and bat \n\n\nRESULTS\n teamed up to f...
4 the cat and bat \n\n\nMETHODS\n teamed up to f...
6 generates the final summary. \nBACKGROUND We ...
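If this filter is needed in more than one place, the same idea can be wrapped in a small function that takes the bare words and adds the newline itself. The sketch below is only an illustration: rows_with_word_after_newline is a hypothetical name, not a pandas API, and the demo frame is made up to cover a newline with no space before it, a word that is not after a newline, and a word in the wrong case.

```python
import re

import pandas as pd


def rows_with_word_after_newline(df, column, words):
    """Return rows of df where `column` has one of `words` directly after a newline.

    Hypothetical helper for illustration; matching is case-sensitive.
    """
    # Escape each word so any regex metacharacters are matched literally.
    alternatives = '|'.join(re.escape(w) for w in sorted(words))
    # The newline must come immediately before the word; (?!\S) requires
    # whitespace or the end of the string right after it.
    pattern = r'\n(?:' + alternatives + r')(?!\S)'
    # na=False treats missing values as non-matches.
    mask = df[column].str.contains(pattern, regex=True, na=False)
    return df[mask]


demo = pd.DataFrame({'A': [
    'summary.\nRESULTS We evaluate',   # no space before the newline
    'bat \n\n\nMETHODS\n teamed up',   # word on its own line
    'with RESULTS pharmacology',       # word present, but not after a newline
    'summary. \nMethods We evaluate',  # wrong case
]})

matched = rows_with_word_after_newline(demo, 'A', ['RESULTS', 'METHODS', 'BACKGROUND'])
print(matched.index.tolist())
```

With this demo frame only the first two rows are kept. Passing bare words keeps the newline requirement in one place, so the word list can be reused for other checks.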