What does ValueError: cannot reindex from a duplicate axis mean?
I am getting a ValueError: cannot reindex from a duplicate axis when I try to set an index to a certain value. I tried to reproduce this with a simple example, but I could not do it.
Here is my session inside an ipdb trace. I have a DataFrame with a string index, integer columns, and float values. However, when I try to create a sums row holding the sum of all columns, I get ValueError: cannot reindex from a duplicate axis. I created a small DataFrame with the same characteristics but was not able to reproduce the problem. What could I be missing?
I don't really understand what ValueError: cannot reindex from a duplicate axis means. Knowing that might help me diagnose the problem, and it is the most answerable part of my question.
ipdb> type(affinity_matrix)
<class 'pandas.core.frame.DataFrame'>
ipdb> affinity_matrix.shape
(333, 10)
ipdb> affinity_matrix.columns
Int64Index([9315684, 9315597, 9316591, 9320520, 9321163, 9320615, 9321187, 9319487, 9319467, 9320484], dtype='int64')
ipdb> affinity_matrix.index
Index([u'001', u'002', u'003', u'004', u'005', u'008', u'009', u'010', u'011', u'014', u'015', u'016', u'018', u'020', u'021', u'022', u'024', u'025', u'026', u'027', u'028', u'029', u'030', u'032', u'033', u'034', u'035', u'036', u'039', u'040', u'041', u'042', u'043', u'044', u'045', u'047', u'047', u'048', u'050', u'053', u'054', u'055', u'056', u'057', u'058', u'059', u'060', u'061', u'062', u'063', u'065', u'067', u'068', u'069', u'070', u'071', u'072', u'073', u'074', u'075', u'076', u'077', u'078', u'080', u'082', u'083', u'084', u'085', u'086', u'089', u'090', u'091', u'092', u'093', u'094', u'095', u'096', u'097', u'098', u'100', u'101', u'103', u'104', u'105', u'106', u'107', u'108', u'109', u'110', u'111', u'112', u'113', u'114', u'115', u'116', u'117', u'118', u'119', u'121', u'122', ...], dtype='object')
ipdb> affinity_matrix.values.dtype
dtype('float64')
ipdb> 'sums' in affinity_matrix.index
False
Here is the error:
ipdb> affinity_matrix.loc['sums'] = affinity_matrix.sum(axis=0)
*** ValueError: cannot reindex from a duplicate axis
I tried to reproduce this with a simple example, but I failed:
In [32]: import pandas as pd
In [33]: import numpy as np
In [34]: a = np.arange(35).reshape(5,7)
In [35]: df = pd.DataFrame(a, ['x', 'y', 'u', 'z', 'w'], range(10, 17))
In [36]: df.values.dtype
Out[36]: dtype('int64')
In [37]: df.loc['sums'] = df.sum(axis=0)
In [38]: df
Out[38]:
      10  11  12  13  14  15   16
x      0   1   2   3   4   5    6
y      7   8   9  10  11  12   13
u     14  15  16  17  18  19   20
z     21  22  23  24  25  26   27
w     28  29  30  31  32  33   34
sums  70  75  80  85  90  95  100
Answers
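The error ValueError: cannot reindex from a duplicate axis typically occurs when pandas has to reindex or align a DataFrame along an axis whose labels are not unique. With a repeated label, pandas cannot tell which of the matching rows an old label should map to, so it refuses to proceed.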
Why does this happen?
In your example, the DataFrame affinity_matrix already has duplicate index values. You can see this in the ipdb trace, where the label '047' appears twice in affinity_matrix.index:
u'047', u'047'
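When you attempt to add a new row with the label 'sums' using affinity_matrix.loc['sums'], pandas tries to reindex the DataFrame, encounters these duplicate index values, and raises the error. Your small test DataFrame has only unique labels ('x', 'y', 'u', 'z', 'w'), which is why it did not reproduce the problem.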
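To reproduce the problem in isolation, the toy DataFrame needs a repeated row label. The following is a minimal sketch, assuming a made-up three-row frame (the column names a and b are hypothetical). It triggers the error with an explicit reindex call, because whether the .loc assignment itself raises, and the exact wording of the message, can differ between pandas versions.

```python
import pandas as pd

# Toy frame with a repeated row label, mimicking the two '047' entries.
df = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]},
    index=["046", "047", "047"],
)

print(df.index.is_unique)               # is every row label distinct?
print(df.index[df.index.duplicated()])  # the label(s) that repeat

# Reindexing maps old labels onto new ones, which is ambiguous
# when an old label appears more than once.
try:
    df.reindex(["046", "047", "sums"])
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)
```

Checking df.index.is_unique first is a cheap way to confirm whether duplicate labels are the cause before digging further.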
How to resolve this issue?
You need to make the index values unique before adding the new row. You can check for and handle duplicate index values as follows:
- Check for duplicate index values: use affinity_matrix.index.duplicated() to identify the duplicated labels:
  print(affinity_matrix.index[affinity_matrix.index.duplicated()])
- Drop duplicate index values: if the duplicated rows are not needed, drop them. By default this keeps the first row for each label:
  affinity_matrix = affinity_matrix[~affinity_matrix.index.duplicated()]
- Reset the index: if you do not care about the current index values, replace them with a fresh integer index:
  affinity_matrix = affinity_matrix.reset_index(drop=True)
- Add the new row once the index is unique:
  affinity_matrix.loc['sums'] = affinity_matrix.sum(axis=0)
Updated example:
Here’s how you can apply it in your case:
# Check for duplicate indices
print(affinity_matrix.index[affinity_matrix.index.duplicated()])
# Remove rows with duplicated labels, keeping the first occurrence of each
affinity_matrix = affinity_matrix[~affinity_matrix.index.duplicated()]
# Now you can safely add the 'sums' row
affinity_matrix.loc['sums'] = affinity_matrix.sum(axis=0)
This should resolve the error and allow you to add the new row successfully.
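Note that dropping duplicates discards rows, which changes the totals. If the rows sharing a label both carry data you want to keep, one possible alternative is to combine them before adding the sums row. The sketch below uses a hypothetical three-row frame and assumes that summing rows with the same label is the right way to merge them for your data; that is an assumption, not something the question states.

```python
import pandas as pd

# Hypothetical frame with a repeated row label, as in the question.
df = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]},
    index=["046", "047", "047"],
)

# Option A: keep only the first row for each label (the second '047' is lost).
deduped = df[~df.index.duplicated(keep="first")]

# Option B: merge rows that share a label by summing them (no data is lost).
combined = df.groupby(level=0).sum()

# With unique labels, adding the totals row works.
combined.loc["sums"] = combined.sum(axis=0)
print(combined)
```

With option B the sums row reflects every original row, whereas option A would leave the second '047' row out of the totals.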