ghz yesterday ⋅ 2 views

Pandas: combining results from function on subset of dataframe with the original dataframe

I am new to Pandas, so please forgive my inexperience. Nonetheless, I have worked through many of the parts of my question here.

For simplicity let's take the example from the wiki article on Quantile Normalization:

A    5    4    3
B    2    1    4
C    3    4    6
D    4    2    8

and update it to fit the data structure that I am dealing with:

import pandas as pd

df = pd.DataFrame({
        'gene': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c', 'd', 'd', 'd', 'e', 'e', 'e', 'f', 'f', 'f'],
        'rep': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
        'val': [5, 4, 3, 2, 1, 4, 3, 4, 6, 4, 2, 8, 0, 1, 0, 0, 2, 4],
        'subset': ['y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'y', 'n', 'n', 'n', 'n', 'n', 'n']
})


   gene  rep subset  val
0     a    1      y    5
1     a    2      y    4
2     a    3      y    3
3     b    1      y    2
4     b    2      y    1
5     b    3      y    4
6     c    1      y    3
7     c    2      y    4
8     c    3      y    6
9     d    1      y    4
10    d    2      y    2
11    d    3      y    8
12    e    1      n    0
13    e    2      n    1
14    e    3      n    0
15    f    1      n    0
16    f    2      n    2
17    f    3      n    4

This flattened structure might seem odd and inefficient (at the very least redundant), but for my particular use case it is the best option, so please bear with it.

In this example we want to run quantile normalization on the original data (genes a - d), so we grab the subset (selected by some metadata key):

sub = df[df.subset == 'y']

The shape is still off, so I use the pivot function, which I recently learned from @Wan in my GroupBy question:

piv = sub.pivot(index='gene', columns='rep', values='val')

rep   1  2  3
gene
a     5  4  3
b     2  1  4
c     3  4  6
d     4  2  8

This results in the loss of the other columns, which may or may not be relevant later. Carrying on, using my quantile normalization function that can handle mixed dataframes:

quantile_normalize(piv, [1, 2, 3])

rep          1         2         3
gene
a     5.666667  4.666667  2.000000
b     2.000000  2.000000  3.000000
c     3.000000  4.666667  4.666667
d     4.666667  3.000000  5.666667

which is the expected result from the wiki:

A    5.67    4.67    2.00
B    2.00    2.00    3.00
C    3.00    4.67    4.67
D    4.67    3.00    5.67

neat.

Now my question:

How do I take these values and plug them back into the original data frame?

Answer

To apply the quantile-normalized values back to the original dataframe while keeping the structure intact, you need to map the normalized values from the pivot back to the original df where each gene-rep pair is located.

Here's a step-by-step approach to achieve this:

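1. Quantile Normalization Result

Calling quantile_normalize(piv, [1, 2, 3]) gives the quantile-normalized table shown below. Note that piv itself holds the raw pivoted values; the normalized values are whatever quantile_normalize returns (assuming it returns a new dataframe rather than modifying piv in place):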

rep          1         2         3
gene
a     5.666667  4.666667  2.000000
b     2.000000  2.000000  3.000000
c     3.000000  4.666667  4.666667
d     4.666667  3.000000  5.666667

This is the quantile-normalized version of your subset sub, pivoted on gene and rep.
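
The question does not show the body of quantile_normalize, so the table above cannot be reproduced from the post alone. For anyone following along, here is a minimal stand-in sketch. The name quantile_normalize_sketch and its single-argument signature are hypothetical (the asker's function also takes a column list), it assumes no missing values, and it gives tied values the lowest shared rank, which is the choice that reproduces the numbers printed in the question.

```python
import numpy as np
import pandas as pd


def quantile_normalize_sketch(wide: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the question's quantile_normalize (not the asker's code)."""
    # Sort every column independently, then average across columns at each rank.
    sorted_vals = np.sort(wide.to_numpy(dtype=float), axis=0)
    rank_means = pd.Series(sorted_vals.mean(axis=1),
                           index=np.arange(1, len(wide) + 1))
    # Ties share the lowest rank (method="min"); assumes there are no NaN values.
    ranks = wide.rank(method="min").astype(int)
    # Replace each rank by the mean value observed at that rank.
    return ranks.apply(lambda col: col.map(rank_means))


# The pivoted subset from the question (genes a-d, reps 1-3).
piv = pd.DataFrame(
    [[5, 4, 3], [2, 1, 4], [3, 4, 6], [4, 2, 8]],
    index=pd.Index(list("abcd"), name="gene"),
    columns=pd.Index([1, 2, 3], name="rep"),
)
normalized = quantile_normalize_sketch(piv)
print(normalized.round(2))
```

Other tie-handling rules (for example averaging the rank means of tied values) would give different numbers for the two tied 4s in rep 2, so treat the choice of method="min" as an assumption made to match the post.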

2. Reset the Pivot to Long Form

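To bring the normalized data back into a "long" format (as required for updating the original dataframe), melt the normalized table so that each row holds a single gene, rep, and normalized value. The melt must be applied to the output of quantile_normalize, not to the raw piv:

# Assumes quantile_normalize returns a new dataframe; if it modifies piv in place, melt piv instead.
normalized = quantile_normalize(piv, [1, 2, 3])
normalized_long = normalized.reset_index().melt(id_vars='gene', value_vars=[1, 2, 3], var_name='rep', value_name='normalized_val')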

# The `normalized_long` dataframe will look like this:
#   gene  rep  normalized_val
# 0    a    1        5.666667
# 1    b    1        2.000000
# 2    c    1        3.000000
# 3    d    1        4.666667
# 4    a    2        4.666667
# 5    b    2        2.000000
# 6    c    2        4.666667
# 7    d    2        3.000000
# 8    a    3        2.000000
# 9    b    3        3.000000
# 10   c    3        4.666667
# 11   d    3        5.666667
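
As an alternative to melt, the same long form can be sketched with stack, which avoids listing the rep columns by hand. This is a different technique from the one in the answer, shown only for comparison. The normalized table is hard-coded from the values printed above, and the name normalized_long_alt is made up for this example.

```python
import pandas as pd

# Normalized table copied from the answer above (values as printed).
normalized = pd.DataFrame(
    [[5.666667, 4.666667, 2.0],
     [2.0, 2.0, 3.0],
     [3.0, 4.666667, 4.666667],
     [4.666667, 3.0, 5.666667]],
    index=pd.Index(list("abcd"), name="gene"),
    columns=pd.Index([1, 2, 3], name="rep"),
)

# stack() moves the 'rep' column labels into the row index,
# giving one entry per (gene, rep) pair.
normalized_long_alt = (
    normalized.stack()
    .rename("normalized_val")
    .reset_index()
)
print(normalized_long_alt.head())
```

The rows come out grouped by gene rather than by rep, which makes no difference to a merge on gene and rep.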

3. Merge the Normalized Values Back to the Original Dataframe

Now that we have the quantile-normalized values in normalized_long, we can merge these values back into the original df by matching gene and rep.

# Merge the normalized values back into the original dataframe
df_normalized = df.merge(normalized_long, on=['gene', 'rep'], how='left')

# The resulting df will now have the normalized values in the 'normalized_val' column
print(df_normalized)

4. Final Result

The merged dataframe will look like this, with the original data preserved and the normalized values inserted into the new normalized_val column:

   gene  rep subset  val  normalized_val
0     a    1      y    5        5.666667
1     a    2      y    4        4.666667
2     a    3      y    3        2.000000
3     b    1      y    2        2.000000
4     b    2      y    1        2.000000
5     b    3      y    4        3.000000
6     c    1      y    3        3.000000
7     c    2      y    4        4.666667
8     c    3      y    6        4.666667
9     d    1      y    4        4.666667
10    d    2      y    2        3.000000
11    d    3      y    8        5.666667
12    e    1      n    0             NaN
13    e    2      n    1             NaN
14    e    3      n    0             NaN
15    f    1      n    0             NaN
16    f    2      n    2             NaN
17    f    3      n    4             NaN

Explanation:

  • The merge function ensures that the normalized_val from normalized_long is matched by gene and rep into the original df.
  • NaN values in normalized_val correspond to the rows in df where subset == 'n' (the ones you didn't include in the quantile normalization).
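
The merge step can be tried end to end with the sketch below. The normalized values are hard-coded from the table in step 2 so that the example runs without the asker's quantile_normalize. The validate argument and the final "combined" column are additions for illustration; whether rows outside the subset should fall back to their raw val is an assumption the question does not settle.

```python
import pandas as pd

df = pd.DataFrame({
    "gene": list("aaabbbcccdddeeefff"),
    "rep": [1, 2, 3] * 6,
    "val": [5, 4, 3, 2, 1, 4, 3, 4, 6, 4, 2, 8, 0, 1, 0, 0, 2, 4],
    "subset": ["y"] * 12 + ["n"] * 6,
})

# Long-form normalized values for genes a-d, copied from step 2.
normalized_long = pd.DataFrame({
    "gene": list("abcd") * 3,
    "rep": [1] * 4 + [2] * 4 + [3] * 4,
    "normalized_val": [5.666667, 2.0, 3.0, 4.666667,
                       4.666667, 2.0, 4.666667, 3.0,
                       2.0, 3.0, 4.666667, 5.666667],
})

# validate="one_to_one" raises MergeError if a (gene, rep) pair
# is duplicated on either side, which would silently add rows otherwise.
df_normalized = df.merge(normalized_long, on=["gene", "rep"],
                         how="left", validate="one_to_one")

# The rows left as NaN should be exactly the ones outside the subset.
unmatched = df_normalized["normalized_val"].isna()
assert (unmatched == (df_normalized["subset"] == "n")).all()

# Optional (assumption): fall back to the raw value where nothing was normalized.
df_normalized["combined"] = df_normalized["normalized_val"].fillna(df_normalized["val"])
print(df_normalized)
```

A left merge keeps the rows of df in their original order, so the result lines up with the table shown in step 4.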

Conclusion:

This process allows you to apply quantile normalization to a subset of your data and then integrate the results back into the original dataframe while preserving all other columns.
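
If an explicit merge feels heavy, a sketch of another route is to let pandas align on a (gene, rep) index during assignment. This is not the method described above, only a possible variant. The normalized values are again hard-coded in place of a real call to quantile_normalize, and the names indexed and result are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "gene": list("aaabbbcccdddeeefff"),
    "rep": [1, 2, 3] * 6,
    "val": [5, 4, 3, 2, 1, 4, 3, 4, 6, 4, 2, 8, 0, 1, 0, 0, 2, 4],
    "subset": ["y"] * 12 + ["n"] * 6,
})
sub = df[df.subset == "y"]
piv = sub.pivot(index="gene", columns="rep", values="val")

# Stand-in for quantile_normalize(piv, [1, 2, 3]): the printed values, hard-coded.
normalized = pd.DataFrame(
    [[5.666667, 4.666667, 2.0],
     [2.0, 2.0, 3.0],
     [3.0, 4.666667, 4.666667],
     [4.666667, 3.0, 5.666667]],
    index=piv.index,
    columns=piv.columns,
)

# Assigning a Series aligns it on the (gene, rep) index;
# pairs missing from the normalized table become NaN.
indexed = df.set_index(["gene", "rep"])
indexed["normalized_val"] = normalized.stack()
result = indexed.reset_index()
print(result)
```

This relies on every (gene, rep) pair being unique in df, the same condition that pivot already requires of the subset.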