The Best Machine Learning Frameworks & Extensions for Scikit-learn

Data formats

In this section, we’ll explore libraries that can be used to process and transform data.

Sklearn-pandas

You can use this package to map ‘DataFrame’ columns to Scikit-learn transformations, and then combine those columns into features.

import pandas as pd
import sklearn.preprocessing
from sklearn_pandas import DataFrameMapper

data = pd.DataFrame({
    'Name': ['Ken', 'Jeff', 'John', 'Mike', 'Andrew', 'Ann', 'Sylvia', 'Dorothy', 'Emily', 'Loyford'],
    'Age': [31, 52, 56, 12, 45, 50, 78, 85, 46, 135],
    'Phone': [52, 79, 80, 75, 43, 125, 74, 44, 85, 45],
    'Uni': ['One', 'Two', 'Three', 'One', 'Two', 'Three', 'One', 'Two', 'Three', 'One']
})

# Map each column to the transformation it needs
mapper = DataFrameMapper([
    ('Uni', sklearn.preprocessing.LabelBinarizer()),
    (['Age'], sklearn.preprocessing.StandardScaler())
])
mapper.fit_transform(data)
mapper.transformed_names_

# Pass df_out=True to get the result back as a pandas DataFrame
mapper = DataFrameMapper([
    ('Uni', sklearn.preprocessing.LabelBinarizer()),
    (['Age'], sklearn.preprocessing.StandardScaler())
], df_out=True)

Sklearn-xarray

This package combines n-dimensional labeled arrays from xarray with Scikit-learn tools. With it, you can:

  • ensure compatibility between Sklearn estimators and xarray DataArrays and Datasets,
  • enable estimators to change the number of samples,
  • use pre-processing transformers.
import numpy as np
import xarray as xr
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn_xarray import wrap

data = np.random.rand(16, 4)
my_xarray = xr.DataArray(data)

# Wrapped estimators accept xarray objects directly
Xt = wrap(StandardScaler()).fit_transform(my_xarray)

# reshapes='feature' tells the wrapper which dimension the estimator changes
pipeline = Pipeline([
    ('pca', wrap(PCA(n_components=50), reshapes='feature')),
    ('cls', wrap(LogisticRegression(), reshapes='feature'))
])
from sklearn_xarray.model_selection import CrossValidatorWrapper
from sklearn.model_selection import GridSearchCV, KFold

# Wrap the cross-validator so it can split xarray objects
cv = CrossValidatorWrapper(KFold())
pipeline = Pipeline([
    ('pca', wrap(PCA(), reshapes='feature')),
    ('cls', wrap(LogisticRegression(), reshapes='feature'))
])
gridsearch = GridSearchCV(
    pipeline, cv=cv, param_grid={'pca__n_components': [20, 40, 60]}
)

Auto-ML

Are there tools and libraries that integrate Sklearn for better Auto-ML? Yes, there are, and here are some examples.

Auto-sklearn

With this, you can perform automated machine learning with Scikit-learn. For the setup, you need to install some dependencies manually, then install the package itself:

$ curl https://raw.githubusercontent.com/automl/auto-sklearn/master/requirements.txt | xargs -n 1 -L 1 pip install
$ pip install auto-sklearn

from autosklearn.classification import AutoSklearnClassifier

# X_train, y_train, X_test come from your own train/test split
cls = AutoSklearnClassifier()
cls.fit(X_train, y_train)
predictions = cls.predict(X_test)

Auto_ViML — “Automatic Variant Interpretable Machine Learning” (pronounced “Auto_Vimal”)

Given a dataset, Auto_ViML tries out different models with varying features. It eventually settles on the best-performing model. It also:

  • helps you clean data by suggesting changes to missing values, formatting, and added variables;
  • classifies variables automatically, whether text, date, or numerical;
  • generates model performance graphs automatically when verbose is set to 1 or 2;
  • lets you use ‘featuretools’ for feature engineering;
  • handles imbalanced data when ‘Imbalanced_Flag’ is set to ‘True’.
from sklearn.model_selection import train_test_split
from autoviml.Auto_ViML import Auto_ViML

# X and y are your feature DataFrame and target Series
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=54)

# Auto_ViML expects DataFrames that include the target column
train, test = X_train.join(y_train), X_val.join(y_val)
model, features, train, test = Auto_ViML(train, "target", test, verbose=2)

TPOT — Tree-based Pipeline Optimization Tool

This is a Python-based Auto-ML tool. It uses genetic programming to optimize machine learning pipelines.

from tpot import TPOTClassifier 
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, train_size=0.75, test_size=0.25, random_state=42)
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_digits_pipeline.py')

Featuretools

This is a tool for automated feature engineering. It works by transforming temporal and relational datasets into feature matrices.

import featuretools as ft

# customers_df, sessions_df, and transactions_df are your own DataFrames
entities = {
    "customers": (customers_df, "customer_id"),
    "sessions": (sessions_df, "session_id", "session_start"),
    "transactions": (transactions_df, "transaction_id", "transaction_time")
}

# Parent-child relationships: (parent entity, parent key, child entity, child key)
relationships = [
    ("sessions", "session_id", "transactions", "session_id"),
    ("customers", "customer_id", "sessions", "customer_id")
]

# Deep Feature Synthesis (DFS) builds the feature matrix automatically
feature_matrix, features_defs = ft.dfs(entities=entities,
                                       relationships=relationships,
                                       target_entity="customers")

Neuraxle

You can use Neuraxle for hyperparameter tuning and AutoML. Install ‘neuraxle’ via pip to start using it.

Apart from hyperparameter tuning, Neuraxle offers:

  • parallel computation and serialization,
  • time series processing, through abstractions that are key to such projects.

To do AutoML with Neuraxle, you need:

  • a defined pipeline,
  • a validation splitter,
  • a scoring metric, defined via the ‘ScoringCallback’,
  • a selected ‘hyperparams’ repository,
  • a selected ‘hyperparams’ optimizer,
  • an ‘AutoML’ loop.

These pieces fit together as sketched below.
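Here is a minimal sketch of such an AutoML loop. It assumes the Neuraxle 0.5-era API, in which AutoML, ValidationSplitter, InMemoryHyperparamsRepository, and RandomSearchHyperparameterSelectionStrategy live in ‘neuraxle.metaopt.auto_ml’ and ScoringCallback in ‘neuraxle.metaopt.callbacks’; these names have moved between releases, so treat the exact imports and arguments as assumptions and check the docs for your version:

from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

from neuraxle.hyperparams.distributions import RandInt
from neuraxle.hyperparams.space import HyperparameterSpace
from neuraxle.metaopt.auto_ml import (AutoML, InMemoryHyperparamsRepository,
                                      RandomSearchHyperparameterSelectionStrategy,
                                      ValidationSplitter)
from neuraxle.metaopt.callbacks import ScoringCallback
from neuraxle.pipeline import Pipeline
from neuraxle.steps.sklearn import SKLearnWrapper

# 1. A defined pipeline, with a hyperparameter space to search over
pipeline = Pipeline([
    SKLearnWrapper(DecisionTreeRegressor(),
                   HyperparameterSpace({'max_depth': RandInt(2, 10)})),
])

# 2.-6. Validation splitter, scoring metric, hyperparams repository,
# hyperparams optimizer, and the AutoML loop itself
auto_ml = AutoML(
    pipeline=pipeline,
    validation_splitter=ValidationSplitter(test_size=0.20),
    scoring_callback=ScoringCallback(mean_squared_error, higher_score_is_better=False),
    hyperparams_optimizer=RandomSearchHyperparameterSelectionStrategy(),
    hyperparams_repository=InMemoryHyperparamsRepository(cache_folder='cache'),
    n_trials=10,
    refit_trial=True,
)

# X_train and y_train are your own training data
auto_ml = auto_ml.fit(X_train, y_train)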

Experimentation frameworks

Now it’s time for a couple of Scikit-learn tools that you can use for machine learning experimentation.

SciKit-Learn Laboratory

SciKit-Learn Laboratory is a command-line tool you can use to run machine learning experiments. To start using it, install `skll` via pip.
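Experiments are described in a configuration file, which you then pass to the run_experiment command. A minimal sketch of what experiment.cfg might contain (the section and field names follow the SKLL docs; the directories, features, and learners here are hypothetical):

[General]
experiment_name = my_experiment
task = evaluate

[Input]
# Hypothetical paths: directories holding your training and test feature files
train_directory = train
test_directory = test
featuresets = [["features.csv"]]
learners = ["RandomForestClassifier", "LogisticRegression"]
label_col = label

[Tuning]
grid_search = true
objectives = ["accuracy"]

[Output]
results = output
predictions = output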

$ run_experiment experiment.cfg

Neptune

Neptune’s Scikit-learn integration lets you log your experiments to Neptune. For instance, you can log the summary of your Scikit-learn regressor.

from neptunecontrib.monitoring.sklearn import log_regressor_summary

# rfr is a trained Scikit-learn regressor, e.g. a fitted RandomForestRegressor
log_regressor_summary(rfr, X_train, X_test, y_train, y_test)
