I have a dataframe with missing values.
import pandas as pd
import numpy as np
np.random.seed([3,1415])
df = pd.DataFrame(np.random.choice((0, np.nan), (5, 5)))
print(df)
     0    1    2    3    4
0  0.0  NaN  0.0  NaN  0.0
1  0.0  NaN  0.0  NaN  NaN
2  NaN  NaN  0.0  NaN  NaN
3  0.0  NaN  0.0  0.0  0.0
4  0.0  0.0  0.0  0.0  0.0
Question
How do I efficiently fill each missing value with the result of a function that is passed the missing cell's row and column index values?
Suppose my function f is defined as:
f = lambda i, j: i ** 2 - np.sqrt(abs(j))
I expect to get:
     0    1    2         3    4
0  0.0 -1.0  0.0 -1.732051  0.0
1  0.0  0.0  0.0 -0.732051 -1.0
2  4.0  3.0  0.0  2.267949  2.0
3  0.0  8.0  0.0  0.000000  0.0
4  0.0  0.0  0.0  0.000000  0.0
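As a quick sanity check on the expected values, here is a minimal sketch that evaluates f at three of the missing positions. The names `cells` and `values` are only for illustration and are not part of the original post.

```python
import numpy as np

# Same definition as in the question.
f = lambda i, j: i ** 2 - np.sqrt(abs(j))

# (row, column) positions of three cells that are NaN in df.
cells = [(0, 1), (1, 3), (2, 0)]

# Round to six decimals to match the precision pandas prints.
values = {cell: round(float(f(*cell)), 6) for cell in cells}
print(values)
```

These should line up with the entries at row 0 / column 1, row 1 / column 3, and row 2 / column 0 of the expected frame.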
I've created two functions so far that generate this output:
def pir1(df, f):
    dfi = df.stack(dropna=False).index.to_series().unstack()
    return df.combine_first(dfi.applymap(lambda x: f(*x)))
def pir2(df, f):
    dfc = df.copy()
    for i in dfc.index:
        for j in dfc.columns:
            dfv = df.at[i, j]
            dfc.at[i, j] = dfv if pd.notnull(dfv) else f(i, j)
    return dfc
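For comparison, one possible direction is sketched below. It is not one of the two functions above, and it rests on two assumptions: that f is built from operations that accept NumPy arrays (as the example f is), and that the row and column labels are numeric. The name `fill_by_position` is hypothetical, and the frame is written out literally so the snippet does not depend on the random seed.

```python
import numpy as np
import pandas as pd

def fill_by_position(df, f):
    """Hypothetical helper: fill NaNs with f(row_label, column_label)."""
    # Row labels as a column vector, column labels as a row vector,
    # so broadcasting evaluates f over the whole grid in one call.
    i = df.index.values[:, None]
    j = df.columns.values[None, :]
    # Keep existing values; take f(i, j) only where the cell is missing.
    filled = np.where(df.isnull().values, f(i, j), df.values)
    return pd.DataFrame(filled, index=df.index, columns=df.columns)

f = lambda i, j: i ** 2 - np.sqrt(abs(j))

# The example frame from the question, written out explicitly.
df = pd.DataFrame(
    [[0, np.nan, 0, np.nan, 0],
     [0, np.nan, 0, np.nan, np.nan],
     [np.nan, np.nan, 0, np.nan, np.nan],
     [0, np.nan, 0, 0, 0],
     [0, 0, 0, 0, 0]],
    dtype=float,
)

result = fill_by_position(df, f)
print(result)
```

Note that this sketch evaluates f at every cell, not only the missing ones, so it may be a poor fit when f is expensive or cannot be applied to arrays.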
Timing
%%timeit
pir1(df, f)
100 loops, best of 3: 3.74 ms per loop
%%timeit
pir2(df, f)
1000 loops, best of 3: 714 µs per loop
Can anyone improve on these?