Using coordinates as targets

With sklearn-xarray you can point an sklearn estimator to a coordinate of an xarray DataArray or Dataset and use that coordinate as the target for supervised learning. This is done with a Target object:

>>> from sklearn_xarray import wrap, Target
>>> from sklearn_xarray.datasets import load_digits_dataarray
>>> from sklearn.linear_model import LogisticRegression
>>>
>>> X = load_digits_dataarray()
>>> y = Target(coord='digit')(X)
>>> X
<xarray.DataArray (sample: 1797, feature: 64)>
array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ..., 10.,  0.,  0.],
       [ 0.,  0.,  0., ..., 16.,  9.,  0.],
       ...,
       [ 0.,  0.,  1., ...,  6.,  0.,  0.],
       [ 0.,  0.,  2., ..., 12.,  0.,  0.],
       [ 0.,  0., 10., ..., 12.,  1.,  0.]])
Coordinates:
  * sample   (sample) int64 0 1 2 3 4 5 6 ... 1790 1791 1792 1793 1794 1795 1796
  * feature  (feature) int64 0 1 2 3 4 5 6 7 8 9 ... 55 56 57 58 59 60 61 62 63
    digit    (sample) int64 0 1 2 3 4 5 6 7 8 9 0 1 ... 7 9 5 4 8 8 4 9 0 8 9 8
>>> y
sklearn_xarray.Target with data:
<xarray.DataArray 'digit' (sample: 1797)>
array([0, 1, 2, ..., 8, 9, 8])
Coordinates:
  * sample   (sample) int64 0 1 2 3 4 5 6 ... 1790 1791 1792 1793 1794 1795 1796
    digit    (sample) int64 0 1 2 3 4 5 6 7 8 9 0 1 ... 7 9 5 4 8 8 4 9 0 8 9 8

The target can point to any DataArray or Dataset that contains the specified coordinate: call the target with the Dataset/DataArray as its argument. When you construct a target without specifying a coordinate, the target data will be the Dataset/DataArray itself.

The Target object can be used as a target for a wrapped estimator in accordance with sklearn’s usual syntax:

>>> wrapper = wrap(LogisticRegression())
>>> wrapper.fit(X, y) 
EstimatorWrapper(...)
>>> wrapper.score(X, y)
1.0

Note

You don’t have to assign the Target to any data yourself, because the wrapper’s fit method will automatically call y(X).

Pre-processing

In some cases, the coordinate has to be pre-processed before it can be used as a target. For this, the constructor takes a transform_func parameter: a callable that is applied to the coordinate, such as the fit_transform method of a transformer in sklearn.preprocessing (or of any other object implementing the sklearn transformer interface):

>>> from sklearn.neural_network import MLPClassifier
>>> from sklearn.preprocessing import LabelBinarizer
>>>
>>> y = Target(coord='digit', transform_func=LabelBinarizer().fit_transform)(X)
>>> wrapper = wrap(MLPClassifier())
>>> wrapper.fit(X, y) 
EstimatorWrapper(...)
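
To make the effect of the transform_func above more concrete, the following sketch applies LabelBinarizer().fit_transform to a small, made-up label array (a stand-in for the values of the digit coordinate). It uses only sklearn and numpy and does not involve Target itself:

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Made-up labels standing in for the values of a 'digit' coordinate.
labels = np.array([0, 1, 2, 1, 0])

# fit_transform learns the set of classes and returns one indicator
# column per class, which is the layout a multi-output classifier expects.
binarized = LabelBinarizer().fit_transform(labels)

print(binarized.shape)  # (5, 3): five samples, three classes
print(binarized[1])     # [0 1 0]: the second sample has label 1
```

Any callable with the same "array in, array out" behaviour should be usable in the same place.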

Indexing

A Target object can be indexed in the same way as the underlying coordinate. It also interfaces with numpy by providing an __array__ method, which returns the (transformed) coordinate as a numpy array.
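
As a rough illustration of how an __array__ method lets an object interoperate with numpy, here is a minimal sketch. The class ArrayLike is hypothetical and is not sklearn-xarray's Target; it only shows the general protocol that numpy.asarray relies on:

```python
import numpy as np


class ArrayLike:
    """Hypothetical stand-in for an object that wraps label data."""

    def __init__(self, values):
        self._values = list(values)

    def __getitem__(self, key):
        # Indexing is delegated to the wrapped data.
        return self._values[key]

    def __array__(self, dtype=None, copy=None):
        # numpy calls this hook when the object is passed to
        # np.asarray or np.array.
        return np.array(self._values, dtype=dtype)


t = ArrayLike([3, 1, 4, 1, 5])
arr = np.asarray(t)   # conversion goes through __array__
subset = t[1:3]       # indexing goes through __getitem__
```

Because of this protocol, such an object can be handed to code that expects a plain numpy array.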

Multi-dimensional coordinates

In some cases, the target coordinates span multiple dimensions, but the transformer expects a lower-dimensional input. With the dim parameter of the Target class you can specify which of the dimensions to keep. You can also specify the callable reduce_func to perform the reduction of the other dimensions (e.g. numpy.mean). Otherwise, the coordinate will be reduced to the first element along each dimension that is not dim.
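
The two reduction strategies described above can be sketched with plain xarray operations. This is only an approximation of the behaviour under the assumption of a made-up two-dimensional label array with dimensions sample and time; it is not the library's internal code:

```python
import numpy as np
import xarray as xr

# Made-up labels spanning two dimensions: 3 samples x 4 time steps.
labels = xr.DataArray(
    np.array([[1.0, 1.0, 2.0, 2.0],
              [0.0, 0.0, 0.0, 4.0],
              [3.0, 3.0, 3.0, 3.0]]),
    dims=("sample", "time"),
)

# Without a reduction callable: keep the first element along every
# dimension other than the one to keep ("sample").
first = labels.isel(time=0)

# With a reduction callable such as numpy.mean: collapse the other
# dimension by applying the callable along it.
mean = labels.reduce(np.mean, dim="time")

print(first.values)  # [1. 0. 3.]
print(mean.values)   # [1.5 1.  3. ]
```

In both cases the result has a single sample dimension, which is the shape a typical estimator expects for its targets.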

Lazy evaluation

When you construct a target with a transformer and lazy=True, the transformation will only be performed when the target’s data is actually accessed. This can significantly improve performance when working with large datasets in a pipeline, because the target is assigned in each step of the pipeline.

Note

Indexing a target always triggers the transformation, even if the target was constructed with lazy=True.
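
The idea behind lazy evaluation can be sketched with a small, hypothetical class. LazyLabels is not part of sklearn-xarray, and the caching of the result is an assumption made for this illustration; the sketch only shows why deferring the transformation avoids work when the data is never accessed:

```python
import numpy as np


class LazyLabels:
    """Hypothetical illustration of a deferred transformation."""

    def __init__(self, values, transform_func, lazy=False):
        self._raw = np.asarray(values)
        self._transform = transform_func
        self.calls = 0  # counts how often the transform actually runs
        # Eager mode transforms immediately; lazy mode postpones it.
        self._data = None if lazy else self._apply()

    def _apply(self):
        self.calls += 1
        return self._transform(self._raw)

    @property
    def data(self):
        # Run the transform on first access, then reuse the result
        # (the caching is an assumption of this sketch).
        if self._data is None:
            self._data = self._apply()
        return self._data


eager = LazyLabels([0, 1, 2], lambda v: v * 10)
lazy = LazyLabels([0, 1, 2], lambda v: v * 10, lazy=True)

calls_before_access = (eager.calls, lazy.calls)  # (1, 0)
result = lazy.data                               # transform runs here
```

If a pipeline step creates such an object but never reads its data, the lazy variant skips the transformation entirely.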