Using coordinates as targets
============================

.. py:currentmodule:: sklearn_xarray.target

With sklearn-xarray you can easily point an sklearn estimator to a coordinate
in an xarray DataArray or Dataset in order to use it as a target for
supervised learning. This is achieved with a :py:class:`Target` object:

.. doctest::

    >>> from sklearn_xarray import wrap, Target
    >>> from sklearn_xarray.datasets import load_digits_dataarray
    >>> from sklearn.linear_model import LogisticRegression
    >>>
    >>> X = load_digits_dataarray()
    >>> y = Target(coord='digit')(X)
    >>> X
    <xarray.DataArray (sample: 1797, feature: 64)>
    array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
           [ 0.,  0.,  0., ..., 10.,  0.,  0.],
           [ 0.,  0.,  0., ..., 16.,  9.,  0.],
           ...,
           [ 0.,  0.,  1., ...,  6.,  0.,  0.],
           [ 0.,  0.,  2., ..., 12.,  0.,  0.],
           [ 0.,  0., 10., ..., 12.,  1.,  0.]])
    Coordinates:
      * sample   (sample) int64 0 1 2 3 4 5 6 ... 1790 1791 1792 1793 1794 1795 1796
      * feature  (feature) int64 0 1 2 3 4 5 6 7 8 9 ... 55 56 57 58 59 60 61 62 63
        digit    (sample) int64 0 1 2 3 4 5 6 7 8 9 0 1 ... 7 9 5 4 8 8 4 9 0 8 9 8
    >>> y
    sklearn_xarray.Target with data:
    <xarray.DataArray 'digit' (sample: 1797)>
    array([0, 1, 2, ..., 8, 9, 8])
    Coordinates:
      * sample   (sample) int64 0 1 2 3 4 5 6 ... 1790 1791 1792 1793 1794 1795 1796
        digit    (sample) int64 0 1 2 3 4 5 6 7 8 9 0 1 ... 7 9 5 4 8 8 4 9 0 8 9 8

The target can point to any DataArray or Dataset that contains the specified
coordinate, simply by calling the target with the Dataset/DataArray as an
argument. When you construct a target without specifying a coordinate, the
target data will be the Dataset/DataArray itself.

The Target object can be used as a target for a wrapped estimator in
accordance with sklearn's usual syntax:

.. doctest::

    >>> wrapper = wrap(LogisticRegression())
    >>> wrapper.fit(X, y) # doctest:+ELLIPSIS
    EstimatorWrapper(...)
    >>> wrapper.score(X, y)
    1.0

.. note::

    You don't have to assign the Target to any data; the wrapper's ``fit``
    method will automatically call ``y(X)``.

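
The automatic assignment mentioned in the note can be sketched as follows.
This is a minimal example rather than a definitive recipe: it passes an
unassigned ``Target`` to ``fit`` and assigns it explicitly before scoring,
since only ``fit`` is stated to perform the assignment. The ``max_iter``
value is an arbitrary choice made here to give the solver room to converge.

.. code-block:: python

    from sklearn.linear_model import LogisticRegression
    from sklearn_xarray import wrap, Target
    from sklearn_xarray.datasets import load_digits_dataarray

    X = load_digits_dataarray()

    # Create the target without assigning it to any data yet.
    y = Target(coord="digit")

    # fit() assigns the target to X on its own, i.e. it calls y(X).
    # max_iter=5000 is an assumption of this sketch, not a library default.
    wrapper = wrap(LogisticRegression(max_iter=5000))
    wrapper.fit(X, y)

    # For scoring, the target is assigned explicitly to stay on the safe side.
    accuracy = wrapper.score(X, y(X))

Leaving the target unassigned keeps it independent of any particular dataset,
so the same ``Target`` definition can be reused with other DataArrays that
carry a ``digit`` coordinate.
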
Pre-processing
--------------

In some cases, it is necessary to pre-process the coordinate before it can be
used as a target. For this, the constructor accepts a ``transform_func``
parameter, a callable such as the ``fit_transform`` method of a transformer
in ``sklearn.preprocessing`` (or of any other object implementing the sklearn
transformer interface):

.. doctest::

    >>> from sklearn.neural_network import MLPClassifier
    >>> from sklearn.preprocessing import LabelBinarizer
    >>>
    >>> y = Target(coord='digit', transform_func=LabelBinarizer().fit_transform)(X)
    >>> wrapper = wrap(MLPClassifier())
    >>> wrapper.fit(X, y) # doctest:+ELLIPSIS
    EstimatorWrapper(...)


Indexing
--------

A :py:class:`Target` object can be indexed in the same way as the underlying
coordinate. It interfaces with ``numpy`` by providing an ``__array__`` method
which returns ``numpy.array()`` of the (transformed) coordinate.


Multi-dimensional coordinates
-----------------------------

In some cases, the target coordinates span multiple dimensions, but the
transformer expects a lower-dimensional input. With the ``dim`` parameter of
the :py:class:`Target` class you can specify which of the dimensions to keep.
You can also specify the callable ``reduce_func`` to perform the reduction of
the other dimensions (e.g. ``numpy.mean``). Otherwise, the coordinate will be
reduced to the first element along each dimension that is not ``dim``.


Lazy evaluation
---------------

When you construct a target with a transformer and ``lazy=True``, the
transformation will only be performed when the target's data is actually
accessed. This can significantly improve performance when working with large
datasets in a pipeline, because the target is assigned in each step of the
pipeline.

.. note::

    Indexing a target always performs the transformation, even when
    ``lazy=True`` was set.
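
The lazy behaviour described above can be sketched as follows. This is a
minimal example, assuming that converting the target with ``numpy.asarray``
counts as accessing its data (via the ``__array__`` method mentioned under
Indexing) and therefore triggers the transformation.

.. code-block:: python

    import numpy as np
    from sklearn.preprocessing import LabelBinarizer
    from sklearn_xarray import Target
    from sklearn_xarray.datasets import load_digits_dataarray

    X = load_digits_dataarray()

    # With lazy=True, assigning the target to X should not run the
    # transformer yet.
    y = Target(
        coord="digit",
        transform_func=LabelBinarizer().fit_transform,
        lazy=True,
    )(X)

    # Converting to a NumPy array accesses the data, which is the point at
    # which the binarization is expected to happen.
    y_binarized = np.asarray(y)

In a pipeline, this defers the cost of the transformation to the step that
actually consumes the target values.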