Multiple group-by with one common variable with pandas?
I want to mark duplicate values within an id group. For example,
    id  A   B
    i1  a1  b1
    i1  a1  b2
    i1  a2  b2
    i2  a1  b2
should become

    id  A   B   An  Bn
    i1  a1  b1  2   1
    i1  a1  b2  2   2
    i1  a2  b2  1   2
    i2  a1  b2  1   1

where An and Bn count the multiplicity of A and B within each id group.
How can I do this in pandas? I've found groupby, but it was quite messy to keep everything together. I also tried individual group-bys for (id, A) and (id, B). Maybe there is a way of grouping by id first and then working with all the other variables within each id group? (There are many variables and I have a lot of lines!)
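For concreteness, here is a minimal sketch that reproduces the example frame above (the column names simply mirror the tables; the usual pandas import is assumed):

    import pandas as pd

    # The example data from the question; duplicates of A and B exist within each id group.
    df = pd.DataFrame({
        'id': ['i1', 'i1', 'i1', 'i2'],
        'A':  ['a1', 'a1', 'a2', 'a1'],
        'B':  ['b1', 'b2', 'b2', 'b2'],
    })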
I think there is a straightforward way of solving it: as you suggest, you can do the groupby for each pair and then calculate the size of each group, and use transform so you can easily add the results back to the original dataframe:
    import numpy as np

    df['An'] = df.groupby(['id', 'A'])['A'].transform(np.size)
    df['Bn'] = df.groupby(['id', 'B'])['B'].transform(np.size)

which gives

       id   A   B  An  Bn
    0  i1  a1  b1   2   1
    1  i1  a1  b2   2   2
    2  i1  a2  b2   1   2
    3  i2  a1  b2   1   1
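As a side note (not part of the original answer), recent pandas versions also accept the aggregation name as a string, which avoids reaching for numpy; a sketch of the same two lines:

    # Same counts as above, using the string aggregation name instead of np.size.
    df['An'] = df.groupby(['id', 'A'])['A'].transform('size')
    df['Bn'] = df.groupby(['id', 'B'])['B'].transform('size')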
Of course, if there are too many columns to do this by hand, you can loop:

    for col in ['A', 'B']:
        df[col + 'n'] = df.groupby(['id', col])[col].transform(np.size)
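If there are really many columns, one possible sketch (my addition, not from the original answer) builds all the count columns in a single assign call; the value_cols list here is illustrative:

    # Build every '<col>n' count column in one pass; in practice value_cols
    # would be every column except 'id'.
    value_cols = ['A', 'B']
    df = df.assign(**{c + 'n': df.groupby(['id', c])[c].transform('size')
                      for c in value_cols})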
Edit: improving performance for large data. I did this on a large dataset (4 million lines). If you only want to flag duplicates, the duplicated method can also be used to do something similar, but it only marks the occurrences after the first one within each group. Replacing the per-column transform call with a plain size() and assigning the result back through the index is quite fast (though much less elegant):

    for col in ['A', 'B']:
        x = df.groupby(['id', col]).size()
        df.set_index(['id', col], inplace=True)
        df[col + 'n'] = x
        df.reset_index(inplace=True)
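For reference, a minimal sketch of the duplicated variant mentioned above (the A_dup column name is just illustrative); unlike the counts, it only flags repeats:

    # Flags as True every row whose (id, A) pair has already appeared earlier.
    df['A_dup'] = df.duplicated(subset=['id', 'A'], keep='first')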