Pandas: How to return rows where a column has a line breaks/new

ghz 15hours ago ⋅ 6 views

Pandas: How to return rows where a column has a line breaks/new line ( \n ) with one of several case-sensitive words coming directly after?

Which shows how to get a word which follows a new line.

I would now like to return rows where the column can have one of of several case-sensitive words which follows right after a new line.

Here is a minimal example

testdf = pd.DataFrame([
    [ ' generates the final summary. \nRESULTS We evaluate the performance of ', ], 
                       [ 'the cat and bat \n\n\nRESULTS\n teamed up to find some food'], 
                       ['anthropology with RESULTS pharmacology and biology'],
    [ ' generates the final summary. \nMethods We evaluate the performance of ', ], 
                       [ 'the cat and bat \n\n\nMETHODS\n teamed up to find some food'], 
                       ['anthropology with METHODS pharmacology and biology'],
        [ ' generates the final summary. \nBACKGROUND We evaluate the performance of ', ], 
                       [ 'the cat and bat \n\n\nBackground\n teamed up to find some food'], 
                       ['anthropology with BACKGROUND pharmacology and biology'],
])
testdf.columns = ['A']
testdf.head(10)

will return

A
0   generates the final summary. \nRESULTS We evaluate the performance of
1   the cat and bat \n\n\nRESULTS\n teamed up to find some food
2   anthropology with RESULTS pharmacology and biology
3   generates the final summary. \nMethods We evaluate the performance of
4   the cat and bat \n\n\nMETHODS\n teamed up to find some food
5   anthropology with METHODS pharmacology and biology
6   generates the final summary. \nBACKGROUND We evaluate the performance of
7   the cat and bat \n\n\nBackground\n teamed up to find some food
8   anthropology with BACKGROUND pharmacology and biology

And then

listStrings = { '\nRESULTS',  '\nMETHODS' ,  '\nBACKGROUND' }
testdf.loc[testdf.A.apply(lambda x: len(listStrings.intersection(x.split())) >= 1)]

Will return nothing. The desired result would return the following rows.

A
0   generates the final summary. \nRESULTS We evaluate the performance of
1   the cat and bat \n\n\nRESULTS\n teamed up to find some food
4   the cat and bat \n\n\nMETHODS\n teamed up to find some food
6   generates the final summary. \nBACKGROUND We evaluate the performance of

These are rows where the word follows a '\n' and matches the case in the given set.

Answer

To filter rows based on case-sensitive words that directly follow a newline (\n), you can use a regular expression to match the desired pattern. Here's a solution:

import pandas as pd
import re

# Create the DataFrame
testdf = pd.DataFrame([
    [' generates the final summary. \nRESULTS We evaluate the performance of '], 
    ['the cat and bat \n\n\nRESULTS\n teamed up to find some food'], 
    ['anthropology with RESULTS pharmacology and biology'],
    [' generates the final summary. \nMethods We evaluate the performance of '], 
    ['the cat and bat \n\n\nMETHODS\n teamed up to find some food'], 
    ['anthropology with METHODS pharmacology and biology'],
    [' generates the final summary. \nBACKGROUND We evaluate the performance of '], 
    ['the cat and bat \n\n\nBackground\n teamed up to find some food'], 
    ['anthropology with BACKGROUND pharmacology and biology'],
], columns=['A'])

# Define the set of case-sensitive words to match
list_strings = {'\nRESULTS', '\nMETHODS', '\nBACKGROUND'}

# Create a regular expression to match the words following a newline
pattern = r'(?<!\S)(' + '|'.join(re.escape(word) for word in list_strings) + r')(?!\S)'

# Filter the rows where the pattern matches
filtered_df = testdf[testdf['A'].str.contains(pattern, regex=True)]

print(filtered_df)

Explanation:

  1. Regex Pattern Construction:

    • (?<!\S): Ensures the match starts at the beginning of a word or after whitespace (negative lookbehind for non-whitespace).
    • |.join: Joins the words in list_strings with the | (logical OR) operator.
    • re.escape: Escapes special characters in the words (like \n).
    • (?!\S): Ensures the match ends at the end of a word or before whitespace (negative lookahead for non-whitespace).
  2. str.contains:

    • Filters rows where the regex pattern is found in the column A.
  3. Case-Sensitivity:

    • Regex matching is case-sensitive by default, ensuring exact matches.

Output:

The resulting filtered_df contains the desired rows:

                                                   A
0  generates the final summary. \nRESULTS We eval...
1  the cat and bat \n\n\nRESULTS\n teamed up to f...
4  the cat and bat \n\n\nMETHODS\n teamed up to f...
6  generates the final summary. \nBACKGROUND We ...