How to groupby in Pandas with a missing group python pandas programming
This is a note intended for my future self. Here’s how to do it:
- Have a list of the values that you expect for each group.
- Iterate over that list, and look up the values using `.loc`.
Say I want to group by months, but not all of the months will have data. Here I’ll iterate over the known month range, and fill in as I go. Starting with a DataFrame contain the columns BPID, CALYR, CALMO, and COMMAMTFYC (with potentially many records (or 0!) per month), we do this:
start = date(2016,1,1) d = start month = [] case_count = [] fyc = [] while d < date.today(): month.append(d) month_data = group.loc[(group.CAL_YR == d.year) & (group.CAL_MO == d.month), :] case_count.append(month_data.shape[0]) fyc.append(month_data.COMM_AMT_FYC.sum()) d = date(d.year+int(np.floor(d.month/12)),((d.month) % 12)+1,1) g = pd.DataFrame({"BP_ID": name, "month": month, "case_count": case_count, "fyc": fyc}))
It seems inefficient, but it should get the job done. It takes about half a second to do for the small dataframes I’m working with. In other words, I’ll do 57K of these in 30min.