Wrappers for sklearn estimators
===============================

sklearn-xarray provides wrappers that let you use sklearn estimators on
xarray DataArrays and Datasets. The goal is a seamless integration of both
packages: the estimator methods are applied only to the raw data, while the
metadata (the coordinates in xarray) remains untouched wherever possible.

There are two principal data types in xarray: ``DataArray`` and ``Dataset``.
The wrappers provided in this package automatically determine which xarray
type they are dealing with when you call ``fit`` with either a DataArray or
a Dataset as your training data.


Wrapping estimators for DataArrays
----------------------------------

.. py:currentmodule:: sklearn_xarray

First, we look at a basic example that shows how to wrap an estimator from
sklearn for use with a ``DataArray``:

.. doctest::

    >>> from sklearn_xarray import wrap
    >>> from sklearn_xarray.datasets import load_dummy_dataarray
    >>> from sklearn.preprocessing import StandardScaler
    >>>
    >>> X = load_dummy_dataarray()
    >>> Xt = wrap(StandardScaler()).fit_transform(X)

The :py:func:`wrap` function returns an object with the methods that
correspond to the type of the wrapped estimator (e.g. ``predict`` for
classifiers and regressors).

.. note::

    xarray references axes by name rather than by order. Therefore, you can
    specify the ``sample_dim`` parameter of the wrapper to refer to the
    dimension in your data that represents the samples. By default, the
    wrapper assumes that the first dimension of the array is the sample
    dimension.
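For instance, if the samples in your data live along a dimension with a
different name, such as ``time``, you can name that dimension explicitly.
The following is a minimal sketch; the array ``X2`` and the names used here
are illustrative assumptions, not part of the package's own examples:

.. doctest::

    >>> import numpy as np
    >>> import xarray as xr
    >>>
    >>> # hypothetical array whose sample dimension is called 'time'
    >>> X2 = xr.DataArray(
    ...     np.random.random((50, 10)),
    ...     coords={'time': range(50), 'feature': range(10)},
    ...     dims=('time', 'feature')
    ... )
    >>> Xt2 = wrap(StandardScaler(), sample_dim='time').fit_transform(X2)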
When we run the first example, we see that the data in the array has been
scaled, while the coordinates and dimensions have not changed:

.. doctest::

    >>> X  # doctest:+SKIP
    <xarray.DataArray (sample: 100, feature: 10)>
    array([[ 0.565986,  0.196107,  0.935981, ...,  0.702356,  0.806494,  0.801178],
           [ 0.551611,  0.277749,  0.27546 , ...,  0.646887,  0.616391,  0.227552],
           [ 0.451261,  0.205744,  0.60436 , ...,  0.426333,  0.008449,  0.763937],
           ...,
           [ 0.019217,  0.112844,  0.894421, ...,  0.675889,  0.4957  ,  0.740349],
           [ 0.542255,  0.053288,  0.483674, ...,  0.481905,  0.064586,  0.843511],
           [ 0.607809,  0.425632,  0.702882, ...,  0.521591,  0.315032,  0.4258  ]])
    Coordinates:
      * sample   (sample) int32 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ...
      * feature  (feature) int32 0 1 2 3 4 5 6 7 8 9

.. doctest::

    >>> Xt  # doctest:+SKIP
    <xarray.DataArray (sample: 100, feature: 10)>
    array([[ 0.128639, -0.947769,  1.625452, ...,  0.525571,  1.07678 ,  1.062118],
           [ 0.077973, -0.673463, -0.631625, ...,  0.321261,  0.408263, -0.942871],
           [-0.275702, -0.91539 ,  0.492264, ..., -0.491108, -1.729624,  0.931952],
           ...,
           [-1.7984  , -1.227519,  1.483434, ...,  0.428084, -0.016158,  0.849506],
           [ 0.045001, -1.427621,  0.079865, ..., -0.286418, -1.532214,  1.210086],
           [ 0.27604 , -0.176596,  0.828923, ..., -0.140244, -0.651494, -0.249936]])
    Coordinates:
      * sample   (sample) int32 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ...
      * feature  (feature) int32 0 1 2 3 4 5 6 7 8 9


Estimators changing the shape of the data
-----------------------------------------

Many sklearn estimators change the number of features during transformation
or prediction. In that case, the coordinates along the feature dimension no
longer correspond to those of the original array, so the wrapper omits the
coordinates along this dimension. You can specify which dimension is changed
with the ``reshapes`` parameter:

.. doctest::

    >>> from sklearn.decomposition import PCA
    >>> Xt = wrap(PCA(n_components=5), reshapes='feature').fit_transform(X)
    >>> Xt  # doctest:+SKIP
    <xarray.DataArray (sample: 100, feature: 5)>
    array([[ 0.438773, -0.100947,  0.106754,  0.236872, -0.128751],
           [-0.40433 , -0.580941,  0.588425, -0.305739, -0.120676],
           [ 0.343535, -0.334365,  0.659667,  0.111196,  0.308099],
           ...,
           [ 0.519982,  0.38072 ,  0.133793, -0.064086,  0.108029],
           [-0.099056, -0.086161, -0.115271, -0.053594, -0.736321],
           [-0.358513, -0.327132, -0.635314, -0.310221, -0.017318]])
    Coordinates:
      * sample   (sample) int32 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 ...
    Dimensions without coordinates: feature

.. todo:: reshapes dict


Accessing fitted estimators
---------------------------

The ``estimator`` attribute of the wrapper always holds the unfitted
estimator that was passed in initially. After calling ``fit``, the fitted
estimator is stored in the ``estimator_`` attribute:

.. doctest::

    >>> wrapper = wrap(StandardScaler())
    >>> wrapper.fit(X)
    EstimatorWrapper(copy=True, estimator=StandardScaler(), with_mean=True, with_std=True)
    >>> wrapper.estimator_.mean_  # doctest:+SKIP
    array([ 0.46156856,  0.47165326,  0.48397815,  0.48958361,  0.4730579 ,
            0.522414  ,  0.46496134,  0.52299264,  0.48772645,  0.49043086])

The wrapper also exposes the fitted attributes directly:

.. doctest::

    >>> wrapper.mean_  # doctest:+SKIP
    array([ 0.46156856,  0.47165326,  0.48397815,  0.48958361,  0.4730579 ,
            0.522414  ,  0.46496134,  0.52299264,  0.48772645,  0.49043086])


Wrapping estimators for Datasets
--------------------------------

.. py:currentmodule:: sklearn_xarray.dataset

The syntax for Datasets is exactly the same as for DataArrays. Note that the
wrapper fits one estimator for each variable in the Dataset. The fitted
estimators are stored in the ``estimator_dict_`` attribute:

.. doctest::

    >>> from sklearn_xarray.datasets import load_dummy_dataset
    >>>
    >>> X = load_dummy_dataset()
    >>> wrapper = wrap(StandardScaler())
    >>> wrapper.fit(X)
    EstimatorWrapper(copy=True, estimator=StandardScaler(), with_mean=True, with_std=True)
    >>> wrapper.estimator_dict_
    {'var_1': StandardScaler()}

The wrapper also exposes the fitted attributes as dictionaries with one
entry for each variable:

.. doctest::

    >>> wrapper.mean_['var_1']  # doctest:+SKIP
    array([ 0.46156856,  0.47165326,  0.48397815,  0.48958361,  0.4730579 ,
            0.522414  ,  0.46496134,  0.52299264,  0.48772645,  0.49043086])


Wrapping dask-ml estimators
---------------------------

The dask-ml_ package re-implements a number of scikit-learn estimators for
use with dask_ arrays, which can be chunked and larger than memory. You can
wrap these estimators in the same way in order to work with dask-backed
DataArrays and Datasets:

.. doctest::

    >>> from dask_ml.preprocessing import StandardScaler
    >>> import xarray as xr
    >>> import numpy as np
    >>> import dask.array as da
    >>>
    >>> X = xr.DataArray(
    ...     da.from_array(np.random.random((100, 10)), chunks=(10, 10)),
    ...     coords={'sample': range(100), 'feature': range(10)},
    ...     dims=('sample', 'feature')
    ... )
    >>> Xt = wrap(StandardScaler()).fit_transform(X)
    >>> type(Xt.data)
    <class 'dask.array.core.Array'>

.. _dask-ml: http://dask-ml.readthedocs.io/en/latest/index.html
.. _dask: http://dask.pydata.org/en/latest/
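Because the transformation is carried out by dask-ml, the wrapped result
keeps its chunked dask backing and is only evaluated on demand. As a quick
check, you can trigger the computation explicitly with xarray's
``compute()``; this is a minimal sketch, and ``Xt_np`` is an illustrative
name rather than part of the original example:

.. doctest::

    >>> Xt_np = Xt.compute()  # evaluates the lazy dask graph into memory
    >>> type(Xt_np.data)  # doctest:+SKIP
    <class 'numpy.ndarray'>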