Using coordinates as targets
With sklearn-xarray you can easily point an sklearn estimator to a
coordinate in an xarray DataArray or Dataset in order to use it as a target
for supervised learning. This is achieved with a Target object:
>>> from sklearn_xarray import wrap, Target
>>> from sklearn_xarray.datasets import load_digits_dataarray
>>> from sklearn.linear_model import LogisticRegression
>>>
>>> X = load_digits_dataarray()
>>> y = Target(coord='digit')(X)
>>> X
<xarray.DataArray (sample: 1797, feature: 64)>
array([[ 0., 0., 5., ..., 0., 0., 0.],
[ 0., 0., 0., ..., 10., 0., 0.],
[ 0., 0., 0., ..., 16., 9., 0.],
...,
[ 0., 0., 1., ..., 6., 0., 0.],
[ 0., 0., 2., ..., 12., 0., 0.],
[ 0., 0., 10., ..., 12., 1., 0.]])
Coordinates:
* sample (sample) int64 0 1 2 3 4 5 6 ... 1790 1791 1792 1793 1794 1795 1796
* feature (feature) int64 0 1 2 3 4 5 6 7 8 9 ... 55 56 57 58 59 60 61 62 63
digit (sample) int64 0 1 2 3 4 5 6 7 8 9 0 1 ... 7 9 5 4 8 8 4 9 0 8 9 8
>>> y
sklearn_xarray.Target with data:
<xarray.DataArray 'digit' (sample: 1797)>
array([0, 1, 2, ..., 8, 9, 8])
Coordinates:
* sample (sample) int64 0 1 2 3 4 5 6 ... 1790 1791 1792 1793 1794 1795 1796
digit (sample) int64 0 1 2 3 4 5 6 7 8 9 0 1 ... 7 9 5 4 8 8 4 9 0 8 9 8
A target can be assigned to any DataArray or Dataset that contains the specified coordinate by calling the target with that object as its argument. If you construct a target without specifying a coordinate, the target data is the DataArray or Dataset itself.
The Target object can be used as a target for a wrapped estimator in accordance with sklearn’s usual syntax:
>>> wrapper = wrap(LogisticRegression())
>>> wrapper.fit(X, y)
EstimatorWrapper(...)
>>> wrapper.score(X, y)
1.0
Note
You don’t have to assign the Target to any data yourself: the wrapper’s
fit method will automatically call y(X).
Pre-processing
In some cases, the coordinate has to be pre-processed before it can be
used as a target. For this, the constructor takes a transform_func parameter,
a callable that is applied to the coordinate. A typical choice is the
fit_transform method of a transformer from sklearn.preprocessing (or of any
other object implementing the sklearn transformer interface):
>>> from sklearn.neural_network import MLPClassifier
>>> from sklearn.preprocessing import LabelBinarizer
>>>
>>> y = Target(coord='digit', transform_func=LabelBinarizer().fit_transform)(X)
>>> wrapper = wrap(MLPClassifier())
>>> wrapper.fit(X, y)
EstimatorWrapper(...)
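To make the effect of the transformation concrete, the following sketch applies LabelBinarizer().fit_transform to a small label vector outside of any Target. The label values are made up for illustration; only the scikit-learn call itself is taken from the example above.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Made-up class labels standing in for the 'digit' coordinate.
labels = np.array([0, 2, 1, 2])

# fit_transform learns the set of classes and returns one indicator
# column per class (for three or more classes).
one_hot = LabelBinarizer().fit_transform(labels)

print(one_hot.shape)
print(one_hot)
```

Each row contains a single 1 in the column of its class, which is the kind of target layout a multi-output classifier such as MLPClassifier can be trained on.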
Indexing
A Target object can be indexed in the same way as the underlying
coordinate. It also interfaces with numpy by providing an __array__
method, which returns the (transformed) coordinate as a numpy array.
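The general mechanism can be sketched with a small stand-in class. ToyTarget below is hypothetical and is not the sklearn-xarray implementation; it only shows how an object that defines __getitem__ and __array__ can be indexed and then converted by numpy.

```python
import numpy as np


class ToyTarget:
    """Hypothetical stand-in for a target; not the sklearn-xarray class."""

    def __init__(self, values, transform_func=None):
        self.values = np.asarray(values)
        self.transform_func = transform_func

    def __getitem__(self, key):
        # Index the stored values and wrap the result in a new object.
        return ToyTarget(self.values[key], self.transform_func)

    def __array__(self, dtype=None, copy=None):
        # numpy calls this hook when the object is passed to np.asarray().
        data = self.values
        if self.transform_func is not None:
            data = self.transform_func(data)
        return np.asarray(data, dtype=dtype)


toy = ToyTarget([0, 1, 2, 3], transform_func=lambda v: v * 10)
as_array = np.asarray(toy)     # conversion goes through __array__
subset = np.asarray(toy[1:3])  # index first, then convert
```

Because the conversion happens inside __array__, any function that calls np.asarray on its input can consume such an object without knowing about its internals.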
Multi-dimensional coordinates
In some cases, the target coordinates span multiple dimensions, but the
transformer expects a lower-dimensional input. With the dim parameter of
the Target class you can specify which of the dimensions to keep.
You can also specify the callable reduce_func to perform the reduction of
the other dimensions (e.g. numpy.mean). Otherwise, the coordinate will
be reduced to the first element along each dimension that is not dim.
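The two reduction strategies described above can be illustrated with plain numpy. This is only a sketch of the semantics, not the library's code: the array values and the dimension names (sample, time) are made up, and calling the reduction with an axis argument is an assumption.

```python
import numpy as np

# Hypothetical coordinate values spanning two dimensions: (sample, time).
coord_values = np.array([[1, 1, 2],
                         [0, 3, 3],
                         [2, 2, 2],
                         [1, 0, 2]])

# Default behavior described in the text: keep "sample" and take the
# first element along every other dimension (here: "time").
first_element = coord_values[:, 0]

# With a reduction callable such as numpy.mean, the other dimensions
# are aggregated instead. Passing axis=1 is an assumption of this sketch.
mean_reduced = np.mean(coord_values, axis=1)
```

Both results have one value per sample, which is the shape a transformer or estimator expecting a one-dimensional target needs.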
Lazy evaluation
When you construct a target with a transformer and lazy=True, the
transformation will only be performed when the target’s data is actually
accessed. This can significantly improve performance when working with large
datasets in a pipeline, because the target is assigned in each step of the
pipeline.
Note
Indexing a target always performs the transformation, even if the target
was constructed with lazy=True.
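The difference between eager and lazy evaluation can be sketched in plain Python. LazyValue and expensive_transform are hypothetical names used only for this illustration; they do not mirror the internals of Target.

```python
class LazyValue:
    """Hypothetical helper illustrating deferred transformation."""

    def __init__(self, raw, transform_func, lazy=False):
        self.transform_func = transform_func
        self.lazy = lazy
        # Eager mode transforms immediately; lazy mode stores the raw data.
        self._data = raw if lazy else transform_func(raw)

    def get(self):
        # Lazy mode applies the transformation only when data is accessed.
        return self.transform_func(self._data) if self.lazy else self._data


calls = []


def expensive_transform(values):
    calls.append(len(values))  # record each invocation
    return [v * 2 for v in values]


eager = LazyValue([1, 2, 3], expensive_transform, lazy=False)
calls_after_eager = len(calls)       # transform already ran once

deferred = LazyValue([1, 2, 3], expensive_transform, lazy=True)
calls_after_lazy_init = len(calls)   # no additional call yet

result = deferred.get()              # transform runs on access
calls_after_access = len(calls)
```

If an object like this is created or reassigned in several pipeline steps but its data is read only once, the lazy variant runs the transformation once instead of once per step.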