Pandas: combining results from function on subset of dataframe with the original dataframe
I am new to Pandas so please forgive me inexperience. Nonetheless I have worked on a lot of the parts of my question here.
For simplicity let's take the example from the wiki article on Quantile Normalization:
A 5 4 3
B 2 1 4
C 3 4 6
D 4 2 8
and update it to fit the data structure that I am dealing with:
df = pd.DataFrame({
'gene': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd', 'e', 'e', 'e', 'f', 'f', 'f'],
'rep': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
'val': [5, 4, 3, 2, 1, 4, 3, 4, 6, 4, 2, 8, 0, 1, 0, 0, 2, 4],
'subset':['y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'n', 'n', 'n', 'n', 'n', 'n']
})
gene rep subset val
0 a 1 y 5
1 a 2 y 4
2 a 3 y 3
3 b 1 y 2
4 b 2 y 1
5 b 3 y 4
6 c 1 y 3
7 c 2 y 4
8 c 3 y 6
9 d 1 y 4
10 d 2 y 2
11 d 3 y 8
12 e 1 n 0
13 e 2 n 1
14 e 3 n 0
15 f 1 n 0
16 f 2 n 2
17 f 3 n 4
This flattened structure might seem odd and inefficient (at the very least redundant), but for my particular use case - it is the best option - so please bare with it.
In this example we want to run quantile normalization on the original data (genes a - d), so we grab the subset (take a subset on some meta datakey):
sub = df[df.subset == 'y']
The shape is still off so using the pivot
function as I recently learned from @Wan from my GroupBy question:
piv = sub.pivot(index='gene', columns='rep', values='val')
rep 1 2 3
gene
a 5 4 3
b 2 1 4
c 3 4 6
d 4 2 8
This results in the lost of the other columns which may or may not be relevant for later. Carrying on, using my quantile normalization function that can handle mixed dataframes:
quantile_normalize(piv, [1, 2, 3])
rep 1 2 3
gene
a 5.666667 4.666667 2.000000
b 2.000000 2.000000 3.000000
c 3.000000 4.666667 4.666667
d 4.666667 3.000000 5.666667
which is the expected result from the wiki:
A 5.67 4.67 2.00
B 2.00 2.00 3.00
C 3.00 4.67 4.67
D 4.67 3.00 5.67
neat.
Now my question:
How do I take these values and plug them back into the original data frame?
Answer
To apply the quantile-normalized values back to the original dataframe while keeping the structure intact, you need to map the normalized values from the pivot
back to the original df
where each gene
-rep
pair is located.
Here's a step-by-step approach to achieve this:
1. Quantile Normalization Result (piv
)
You've already obtained the quantile-normalized values in the piv
dataframe, which looks like this:
rep 1 2 3
gene
a 5.666667 4.666667 2.000000
b 2.000000 2.000000 3.000000
c 3.000000 4.666667 4.666667
d 4.666667 3.000000 5.666667
This is the quantile-normalized version of your subset sub
, pivoted on gene
and rep
.
2. Reset the Pivot to Long Form
To bring the normalized data back into a "long" format (as required for updating the original dataframe), you need to melt the piv
dataframe back into a format where each row represents a single gene
, rep
, and normalized value
.
normalized_long = piv.reset_index().melt(id_vars='gene', value_vars=[1, 2, 3], var_name='rep', value_name='normalized_val')
# The `normalized_long` dataframe will look like this:
# gene rep normalized_val
# 0 a 1 5.666667
# 1 b 1 2.000000
# 2 c 1 3.000000
# 3 d 1 4.666667
# 4 a 2 4.666667
# 5 b 2 2.000000
# 6 c 2 4.666667
# 7 d 2 3.000000
# 8 a 3 2.000000
# 9 b 3 3.000000
# 10 c 3 4.666667
# 11 d 3 5.666667
3. Merge the Normalized Values Back to the Original Dataframe
Now that we have the quantile-normalized values in normalized_long
, we can merge these values back into the original df
by matching gene
and rep
.
# Merge the normalized values back into the original dataframe
df_normalized = df.merge(normalized_long, on=['gene', 'rep'], how='left')
# The resulting df will now have the normalized values in the 'normalized_val' column
print(df_normalized)
4. Final Result
The merged dataframe will look like this, with the original data preserved and the normalized values inserted into the new normalized_val
column:
gene rep subset val normalized_val
0 a 1 y 5 5.666667
1 a 2 y 4 4.666667
2 a 3 y 3 2.000000
3 b 1 y 2 2.000000
4 b 2 y 1 2.000000
5 b 3 y 4 3.000000
6 c 1 y 3 3.000000
7 c 2 y 4 4.666667
8 c 3 y 6 4.666667
9 d 1 y 4 4.666667
10 d 2 y 2 3.000000
11 d 3 y 8 5.666667
12 e 1 n 0 NaN
13 e 2 n 1 NaN
14 e 3 n 0 NaN
15 f 1 n 0 NaN
16 f 2 n 2 NaN
17 f 3 n 4 NaN
Explanation:
- The
merge
function ensures that thenormalized_val
fromnormalized_long
is matched bygene
andrep
into the originaldf
. NaN
values innormalized_val
correspond to the rows indf
wheresubset == 'n'
(the ones you didn't include in the quantile normalization).
Conclusion:
This process allows you to apply quantile normalization to a subset of your data and then integrate the results back into the original dataframe while preserving all other columns.