Monday, 20 June 2016

Most efficient way to fill missing elements of dataframe with a function of column and row indices

I have a dataframe with missing values.

import pandas as pd
import numpy as np

np.random.seed([3,1415])
df = pd.DataFrame(np.random.choice((0, np.nan), (5, 5)))
print(df)

     0    1    2    3    4
0  0.0  NaN  0.0  NaN  0.0
1  0.0  NaN  0.0  NaN  NaN
2  NaN  NaN  0.0  NaN  NaN
3  0.0  NaN  0.0  0.0  0.0
4  0.0  0.0  0.0  0.0  0.0

Question

How do I efficiently fill each missing value with the result of a function that is passed the missing cell's row and column index values?

Suppose my function f is defined as:

f = lambda i, j: i ** 2 - np.sqrt(abs(j))

I expect to get:

     0    1    2         3    4
0  0.0 -1.0  0.0 -1.732051  0.0
1  0.0  0.0  0.0 -0.732051 -1.0
2  4.0  3.0  0.0  2.267949  2.0
3  0.0  8.0  0.0  0.000000  0.0
4  0.0  0.0  0.0  0.000000  0.0

I've created two functions so far that generate this output:

def pir1(df, f):
    dfi = df.stack(dropna=False).index.to_series().unstack()
    return df.combine_first(dfi.applymap(lambda x: f(*x)))

def pir2(df, f):
    dfc = df.copy()
    for i in dfc.index:
        for j in dfc.columns:
            dfv = df.at[i, j]
            dfc.at[i, j] = dfv if pd.notnull(dfv) else f(i, j)
    return dfc

Timing


%%timeit
pir1(df, f)

100 loops, best of 3: 3.74 ms per loop

%%timeit
pir2(df, f)

1000 loops, best of 3: 714 µs per loop

Can anyone improve on these?
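
One possible direction, offered only as a sketch: if f is built from NumPy-compatible operations (as the example f is), it can be evaluated once on broadcast row and column label arrays instead of once per cell. The helper name fill_by_index is hypothetical, and the frame below is typed in by hand from the printed example rather than regenerated from the seed. Note that this evaluates f at every cell, not only the missing ones, so it assumes f is cheap and safe to call everywhere. It has not been timed against pir1 or pir2.

```python
import numpy as np
import pandas as pd

def fill_by_index(df, f):
    """Fill NaN cells with f(row_label, col_label).

    Assumption: f accepts NumPy arrays and broadcasts over them.
    """
    # Row labels as a column vector, column labels as a row vector,
    # so f(i, j) broadcasts to the full shape of df.
    i = df.index.to_numpy()[:, None]
    j = df.columns.to_numpy()[None, :]
    # Keep existing values; use f(i, j) only where df is missing.
    filled = np.where(df.isna().to_numpy(), f(i, j), df.to_numpy())
    return pd.DataFrame(filled, index=df.index, columns=df.columns)

f = lambda i, j: i ** 2 - np.sqrt(abs(j))

# The example frame from the question, entered explicitly.
nan = np.nan
df = pd.DataFrame([
    [0.0, nan, 0.0, nan, 0.0],
    [0.0, nan, 0.0, nan, nan],
    [nan, nan, 0.0, nan, nan],
    [0.0, nan, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
])

result = fill_by_index(df, f)
print(result)
```

If f cannot be vectorized, this approach does not apply as written and a per-cell loop such as pir2 is still needed.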
