How to only keep certain sentences of strings in pandas Dataframe
In my pandas DataFrame, I have 100 news articles under the article column. Each news article is a string. I want to keep only the first 3 sentences of each string, but I don't know how. (I noticed that each sentence is separated by \n.)
Please suggest possible solutions.
The dataframe looks like this:
print("Reading data from csv file")
print(read)
Reading data from csv file
Unnamed: 0 article
0 0 \nChina’s ambassador to the US wants American ...
1 1 \nMissouri has become the first state to file ...
2 2 \nThe US is slamming the Communist Chinese gov...
3 3 \nSecretary of State Mike Pompeo on Thursday r...
4 4 \nThe US — along with Russia, China and India ...
.. ... ...
95 95 \nChina has reported no new deaths from the co...
96 96 \nThe World Health Organization on Tuesday fin...
97 97 \nAfter two months of being shut down due to t...
98 98 \nSome coronavirus patients may suffer neurolo...
99 99 \nChina may be past the worst of the COVID-19 ...
[100 rows x 2 columns]
Answer
To keep only the first 3 sentences of each article in the article column of your pandas DataFrame, split each string on the newline character \n, take the first three pieces, and join them back together. Here's a possible solution:
Steps:
- Split each article into sentences using \n as the delimiter.
- Take the first 3 sentences (or fewer if there are fewer than 3).
- Rejoin those sentences into a single string.
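Before involving pandas, the three steps can be tried on a single string. This is a minimal sketch; the article text below is made up for illustration and only assumes, as in the question, that sentences are separated by \n and that the text starts with \n.

```python
# Hypothetical article text; real articles would come from the article column.
article = "\nFirst sentence.\nSecond sentence.\nThird sentence.\nFourth sentence."

# Step 1: split on newlines. The leading "\n" yields an empty first element.
parts = article.split("\n")

# Step 2: skip that empty element and keep the next three pieces.
first_three = parts[1:4]

# Step 3: join the pieces back into one string.
short_article = "\n".join(first_three)

print(short_article)
```

If an article has fewer than three sentences, the slice simply returns what is available, so no length check is needed.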
Here's the code to do this:
import pandas as pd

# Sample dataframe
data = {
    'Unnamed: 0': [0, 1, 2, 3, 4],
    'article': [
        "\nChina’s ambassador to the US wants American ...",
        "\nMissouri has become the first state to file ...",
        "\nThe US is slamming the Communist Chinese gov...",
        "\nSecretary of State Mike Pompeo on Thursday r...",
        "\nThe US — along with Russia, China and India ...",
    ],
}
df = pd.DataFrame(data)

# Function to get the first 3 sentences
def get_first_3_sentences(article):
    # Split the article by newline
    sentences = article.split("\n")
    # Skip the initial empty string (caused by the leading \n),
    # then join the next 3 sentences with newlines
    return "\n".join(sentences[1:4])

# Apply the function to the 'article' column
df['short_article'] = df['article'].apply(get_first_3_sentences)

# Show the updated dataframe
print(df[['Unnamed: 0', 'short_article']])
Explanation:
- split("\n"): Splits each article into sentences on the newline character (\n).
- sentences[1:4]: Slices the list to get the first three sentences. Because each article starts with \n, the first element of the list is an empty string, so the slice starts at index 1. An article without a leading \n would lose its first sentence here.
- "\n".join(...): Rejoins those sentences with \n as the separator.
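If some articles might not start with \n, or might contain blank lines, a slightly more defensive variant is to drop empty pieces before slicing. This is only a sketch: the helper name first_n_lines and its parameter n are hypothetical, not part of the code above.

```python
def first_n_lines(article, n=3):
    """Return the first n non-empty lines of article, joined by newlines."""
    # Keep only pieces that contain something other than whitespace, so a
    # leading "\n" or a blank line in the text does not use up a slot.
    lines = [line.strip() for line in article.split("\n") if line.strip()]
    return "\n".join(lines[:n])


# Both inputs give the same result, with or without the leading newline.
with_newline = "\nA.\nB.\nC.\nD."
without_newline = "A.\nB.\nC.\nD."
print(first_n_lines(with_newline) == first_n_lines(without_newline))
```

It can be applied to the column the same way, with df['article'].apply(first_n_lines).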
Output:
For the sample data above, where each truncated string holds a single line, the output looks like this:
Unnamed: 0 short_article
0 0 China’s ambassador to the US wants American ...
1 1 Missouri has become the first state to file ...
2 2 The US is slamming the Communist Chinese gov...
3 3 Secretary of State Mike Pompeo on Thursday r...
4 4 The US — along with Russia, China and India ...
On the full articles, the short_article column contains the first three sentences of each article (fewer if an article has fewer than three).
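The same result can probably be reached without a custom function by chaining pandas string methods. The sketch below uses a small made-up DataFrame; the column names match the ones above, and the article texts are illustrative only.

```python
import pandas as pd

# Hypothetical data standing in for the real articles.
df = pd.DataFrame({"article": ["\nA.\nB.\nC.\nD.", "\nOnly one line."]})

df["short_article"] = (
    df["article"]
    .str.strip("\n")   # remove leading/trailing newlines
    .str.split("\n")   # each cell becomes a list of lines
    .str[:3]           # keep at most the first three lines
    .str.join("\n")    # turn each list back into one string
)

print(df["short_article"].tolist())
```

Note that this version only trims newlines at the ends of each article; blank lines in the middle of an article would still count toward the three.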