.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_plot_activity_recognition.py:


Activity recognition from accelerometer data
============================================

This demo shows how the **sklearn-xarray** package works with the ``Pipeline``
and ``GridSearchCV`` classes from scikit-learn to provide a metadata-aware,
grid-searchable pipeline mechanism.

The package combines the metadata-handling capabilities of xarray with the
machine-learning framework of sklearn. It enables the user to apply
preprocessing steps group by group, use transformers that change the number
of samples, use metadata directly as labels for classification tasks and more.

The example performs activity recognition from raw accelerometer data with a
Gaussian naive Bayes classifier. It uses the `WISDM`_ activity prediction
dataset, which contains the activities walking, jogging, walking upstairs,
walking downstairs, sitting and standing from 36 different subjects.

.. _WISDM: http://www.cis.fordham.edu/wisdm/dataset.php


.. code-block:: default


    from __future__ import print_function

    import numpy as np

    from sklearn_xarray import wrap, Target
    from sklearn_xarray.preprocessing import Splitter, Sanitizer, Featurizer
    from sklearn_xarray.model_selection import CrossValidatorWrapper
    from sklearn_xarray.datasets import load_wisdm_dataarray

    from sklearn.preprocessing import StandardScaler, LabelEncoder
    from sklearn.decomposition import PCA
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import GroupShuffleSplit, GridSearchCV
    from sklearn.pipeline import Pipeline

    import matplotlib.pyplot as plt


First, we load the dataset and plot an example of one subject performing the
'Walking' activity.

.. tip::

    In the Jupyter notebook version, change the first cell to
    ``%matplotlib notebook`` in order to get an interactive plot that you can
    zoom and pan.


.. code-block:: default


    X = load_wisdm_dataarray()

    # select the 'Walking' samples of subject 1
    X_plot = X[np.logical_and(X.activity == "Walking", X.subject == 1)]
    # keep the first 500 samples and convert from m/s^2 to g
    X_plot = X_plot[:500] / 9.81
    # express the sample coordinate as seconds since the first sample
    X_plot["sample"] = (X_plot.sample - X_plot.sample[0]) / np.timedelta64(1, "s")

    f, axarr = plt.subplots(3, 1, sharex=True)

    axarr[0].plot(X_plot.sample, X_plot.sel(axis="x"), color="#1f77b4")
    axarr[0].set_title("Acceleration along x-axis")

    axarr[1].plot(X_plot.sample, X_plot.sel(axis="y"), color="#ff7f0e")
    axarr[1].set_ylabel("Acceleration [g]")
    axarr[1].set_title("Acceleration along y-axis")

    axarr[2].plot(X_plot.sample, X_plot.sel(axis="z"), color="#2ca02c")
    axarr[2].set_xlabel("Time [s]")
    axarr[2].set_title("Acceleration along z-axis")


.. image:: /auto_examples/images/sphx_glr_plot_activity_recognition_001.png
    :alt: Acceleration along x-axis, Acceleration along y-axis, Acceleration along z-axis
    :class: sphx-glr-single-img


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none


    Text(0.5, 1.0, 'Acceleration along z-axis')


Then we define a pipeline with various preprocessing steps and a classifier.

The preprocessing consists of splitting the data into segments, removing
segments with ``nan`` values and standardizing. Since the accelerometer data is
three-dimensional but the standardizer and classifier expect a one-dimensional
feature vector, we have to vectorize the samples.

Finally, we use PCA and a naive Bayes classifier for classification.


.. code-block:: default


    pl = Pipeline(
        [
            (
                "splitter",
                Splitter(
                    groupby=["subject", "activity"],
                    new_dim="timepoint",
                    new_len=30,
                ),
            ),
            ("sanitizer", Sanitizer()),
            ("featurizer", Featurizer()),
            ("scaler", wrap(StandardScaler)),
            ("pca", wrap(PCA, reshapes="feature")),
            ("cls", wrap(GaussianNB, reshapes="feature")),
        ]
    )


Since we want to use cross-validated grid search to find the best model
parameters, we define a cross-validator.
In order to make sure the model performs subject-independent recognition, we
use a ``GroupShuffleSplit`` cross-validator that ensures that the same subject
will not appear in both the training and the validation set.


.. code-block:: default


    cv = CrossValidatorWrapper(
        GroupShuffleSplit(n_splits=2, test_size=0.5), groupby=["subject"]
    )


The grid search will try different numbers of PCA components to find the best
parameters for this task.

.. tip::

    To use multi-processing, set ``n_jobs=-1``.


.. code-block:: default


    gs = GridSearchCV(
        pl, cv=cv, n_jobs=1, verbose=1, param_grid={"pca__n_components": [10, 20]}
    )


The label to classify is the activity, which we convert to an integer
representation for the classification.


.. code-block:: default


    y = Target(
        coord="activity", transform_func=LabelEncoder().fit_transform, dim="sample"
    )(X)


Finally, we run the grid search and print out the best parameter combination.


.. code-block:: default


    if __name__ == "__main__":  # in order for n_jobs=-1 to work on Windows
        gs.fit(X, y)
        print("Best parameters: {0}".format(gs.best_params_))
        print("Accuracy: {0}".format(gs.best_score_))


.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    Fitting 2 folds for each of 2 candidates, totalling 4 fits
    [Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
    [Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 12.2s finished
    Best parameters: {'pca__n_components': 10}
    Accuracy: 0.6746431870478216


.. note::

    The performance of this classifier is poor; it was chosen for execution
    speed, not accuracy.


.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 0 minutes 17.490 seconds)
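
As a closing illustration of the subject-independent split used by the cross-validator in this example, the following is a minimal NumPy-only sketch of the idea. It is not the scikit-learn or sklearn-xarray implementation: the helper ``group_shuffle_split`` and the toy ``subjects`` array are hypothetical names introduced here, and the sketch only assumes that every sample carries a subject label.

```python
import numpy as np


def group_shuffle_split(groups, test_size=0.5, seed=0):
    """Hypothetical helper: split sample indices so that no group
    (here: subject) ends up in both the training and the test set."""
    rng = np.random.default_rng(seed)
    # shuffle the distinct group labels, not the individual samples
    shuffled = rng.permutation(np.unique(groups))
    n_test = int(np.ceil(test_size * len(shuffled)))
    test_groups = shuffled[:n_test]
    # a sample goes to the test set if its group was drawn for testing
    test_mask = np.isin(groups, test_groups)
    return np.flatnonzero(~test_mask), np.flatnonzero(test_mask)


# toy data (assumption): 6 subjects with 4 samples each
subjects = np.repeat(np.arange(1, 7), 4)
train_idx, test_idx = group_shuffle_split(subjects, test_size=0.5, seed=0)

train_subjects = set(subjects[train_idx].tolist())
test_subjects = set(subjects[test_idx].tolist())
print("Train subjects:", sorted(train_subjects))
print("Test subjects:", sorted(test_subjects))
```

Because whole subjects are assigned to one side of the split, a score computed on the held-out samples reflects how the model behaves on people it has not seen during training.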