Red Rose Notes: Python: how to obtain nlargest in a multiindex DataFrame/Series

A brief tutorial on how to identify the top n elements of each group in a dataset.

First, I grouped my data as follows:

df_group = df_train.groupby(['SaleCondition', 'Neighborhood'])['SalePrice'].sum()

Result:

# SaleCondition  Neighborhood
# Abnorml        BrDale            288900
#                BrkSide           309600
#                ClearCr           505000
#                CollgCr           562900
#                Crawfor           587000
#                Edwards           831900
#                Gilbert           181000
#                IDOTRR            499887
#                MeadowV            92000
#                Mitchel           417686
#                NAmes            3070950
#                NPkVill           140000
#                NWAmes            993000
#                NoRidge          1603000
#                OldTown          1180680
#                SWISU             489434
#                Sawyer            728300
#                SawyerW           739400
#                Somerst           791552
#                StoneBr           187500
#                Timber            599500
# AdjLand        Edwards           416500
# Alloca         Crawfor           559724
#                Edwards           453970
#                IDOTRR             55993
#                Mitchel           206300
#                OldTown            89471
#                Sawyer            108959
#                SawyerW           534112
# Family         BrDale             88000
#                                   ...   
# Normal         Gilbert         12121140
#                IDOTRR           3148700
#                MeadowV          1583800
#                Mitchel          6527250
#                NAmes           29211643

Then, we keep only the five largest values within each SaleCondition (the first index level):

df_group.groupby(level=0, group_keys=False).nlargest(5)

Result:

# SaleCondition  Neighborhood
# Abnorml        NAmes            3070950
#                NoRidge          1603000
#                OldTown          1180680
#                NWAmes            993000
#                Edwards           831900
# AdjLand        Edwards           416500
# Alloca         Crawfor           559724
#                SawyerW           534112
#                Edwards           453970
#                Mitchel           206300
#                Sawyer            108959
# Family         NAmes             533000
#                Gilbert           484000
#                OldTown           473000
#                NWAmes            404500
#                Crawfor           393500
# Normal         NAmes           29211643
#                CollgCr         25010162
#                NridgHt         12827100
#                OldTown         12518308
#                NWAmes          12403155
# Partial        NridgHt         11525738
#                Somerst          7920842
#                CollgCr          4121804
#                StoneBr          3337049
#                Gilbert          2449366
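
The two steps above can be tried end to end on a small table. This is only a sketch: the column names mirror the ones in this post, but the rows and prices are made up for illustration, and it keeps the top 2 instead of the top 5 so the output stays short.

```python
import pandas as pd

# Hypothetical toy data: same column names as in the post, invented values.
df_train = pd.DataFrame({
    "SaleCondition": ["Normal", "Normal", "Normal", "Normal",
                      "Abnorml", "Abnorml", "Abnorml"],
    "Neighborhood":  ["NAmes", "NAmes", "Gilbert", "OldTown",
                      "NAmes", "Edwards", "Sawyer"],
    "SalePrice":     [100, 150, 300, 200, 80, 120, 50],
})

# Step 1: total SalePrice per (SaleCondition, Neighborhood) pair.
# The result is a Series with a two-level MultiIndex.
df_group = df_train.groupby(["SaleCondition", "Neighborhood"])["SalePrice"].sum()

# Step 2: group again on index level 0 (SaleCondition) and keep the
# 2 largest totals in each group. group_keys=False is meant to stop
# SaleCondition from being added a second time as an extra index level.
top2 = df_group.groupby(level=0, group_keys=False).nlargest(2)

print(top2)
```

Within each SaleCondition the rows come back sorted from largest to smallest, which is the same ordering visible in the result above.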

This should be useful for building Pareto charts.
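
A Pareto chart usually pairs the sorted values with their cumulative share. As a rough sketch of how that share could be computed per group (the Series below is hypothetical and only stands in for the filtered result; the names `top`, `share` and `cum_share` are mine, not from the original data):

```python
import pandas as pd

# Hypothetical values, already sorted from largest to smallest inside
# each SaleCondition, as nlargest would return them.
idx = pd.MultiIndex.from_tuples(
    [("A", "x"), ("A", "y"), ("A", "z"), ("B", "x"), ("B", "y")],
    names=["SaleCondition", "Neighborhood"],
)
top = pd.Series([50, 30, 20, 60, 40], index=idx, name="SalePrice")

# Share of each value within its own SaleCondition.
share = top / top.groupby(level=0).transform("sum")

# Running total of those shares, restarted for every SaleCondition.
cum_share = share.groupby(level=0).cumsum()

print(cum_share)
```

The shares here are relative to the rows that were kept, so if only the top n are retained, the cumulative line reaches 100% of those n rows and not of the whole group.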
