Code Bug Fix: Python/Pandas – Creating a new Column showing Average of only the Largest value for each group

Original Source Link

I am working with a dataset, and am trying to create a new column that will show the average number per ID label in a row, but only based off the last row, which is the largest number in the ID group. Example is below.

My current dataset:

    ID      Date        DaysInDuration
    NCA   11/19/2019        31                 
    NCA   12/19/2019        62              
    NCA   12/19/2019        92             
    NCA   1/19/2020         120 * Last Row
    DTT   11/19/2019        31                 
    DTT   12/19/2019        62              
    DTT   12/19/2019        92             
    DTT   1/19/2020         100 * Last Row

And I’m trying to create this:

    ID      Date        DaysInDuration          AverageDurColumn *is only based off last row numb 
    NCA   11/19/2019        31                        30
    NCA   12/19/2019        62                        30
    NCA   12/19/2019        92                        30
    NCA   1/19/2020         120 * Last Row            30
    DTT   11/19/2019        31                        25
    DTT   12/19/2019        62                        25
    DTT   12/19/2019        92                        25
    DTT   12/29/2020        100 * Last Row            25

Thank you to all who can assist!

Here is a simple answer for you:

df['answer'] = df.groupby('ID')['DaysInDuration'].transform(lambda x: x.max()/x.count())

I simply transformed your question into "How do I take the maximum value per ID and divide it by the number of records that ID has?"

1.First Group by ID

2.Get the Max Value for Each ID

3.Divide by the number of records for that ID

4.Use transform to apply it to the rows

    ID        Date  DaysInDuration  answer
0  NCA  11/19/2019              31      30
1  NCA  12/19/2019              62      30
2  NCA  12/19/2019              92      30
3  NCA   1/19/2020             120      30
4  DTT  11/19/2019              31      25
5  DTT  12/19/2019              62      25
6  DTT  12/19/2019              92      25
7  DTT   1/19/2020             100      25

We can use GroupBy.transform here with last and size:

grp = df.groupby('ID')
last = grp['DaysInDuration'].transform('last')
n = grp['DaysInDuration'].transform('size')

df['AverageDurColumn'] = last / n

    ID        Date  DaysInDuration  AverageDurColumn
0  NCA  11/19/2019              31              30.0
1  NCA  12/19/2019              62              30.0
2  NCA  12/19/2019              92              30.0
3  NCA   1/19/2020             120              30.0
4  DTT  11/19/2019              31              25.0
5  DTT  12/19/2019              62              25.0
6  DTT  12/19/2019              92              25.0
7  DTT   1/19/2020             100              25.0

Try:

import numpy as np

df["AverageDurColumn"]=np.where(df["ID"].ne(df["ID"].shift(-1)), df["DaysInDuration"], 0)

df=df.set_index("ID")
df["AverageDurColumn"]=df.groupby("ID")["AverageDurColumn"].mean()
df=df.reset_index()

Outputs:

    ID        Date  DaysInDuration  AverageDurColumn
0  NCA  11/19/2019              31                30
1  NCA  12/19/2019              62                30
2  NCA  12/19/2019              92                30
3  NCA   1/19/2020             120                30
4  DTT  11/19/2019              31                25
5  DTT  12/19/2019              62                25
6  DTT  12/19/2019              92                25
7  DTT   1/19/2020             100                25

Another solution in one shot:

df["AverageDurColumn"]=df.groupby("ID").DaysInDuration.transform(lambda s: s.iloc[-1]/s.size)

You can use groupby, apply, and merge:

new_df = df.merge(
  df
  .groupby(['ID'])
  .apply(lambda x: x['DaysInDuration'].max() / len(x['DaysInDuration'])
  .reset_index(),
  how='outer',
  on='ID',
)

Tagged : / /

Leave a Reply

Your email address will not be published. Required fields are marked *