Automated Machine Learning (AutoML) refers to the process of automating the end-to-end application of machine learning, making it more accessible to non-experts and improving efficiency for data scientists. The goal of AutoML is to automate tasks such as feature engineering, algorithm selection, hyperparameter tuning, and model evaluation and deployment. These tasks often require significant expertise and computational resources.
`auto-sklearn` is an open-source AutoML toolkit built on top of the popular scikit-learn library in Python. It aims to make AutoML readily available to users of scikit-learn. `auto-sklearn` tackles the combined challenges of algorithm selection and hyperparameter tuning by leveraging techniques like Bayesian optimization, meta-learning, and ensemble construction.
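To make the combined problem concrete, the sketch below treats a configuration as an algorithm plus only that algorithm's own hyperparameters. It is not auto-sklearn code. The search space, the `toy_validation_error` function, and the use of plain random search in place of Bayesian optimization are all assumptions chosen to keep the example small and dependency-free.

```python
import random

# Hypothetical joint search space: each algorithm brings its own hyperparameters.
SEARCH_SPACE = {
    "svm": {"C": (0.01, 100.0), "gamma": (0.0001, 1.0)},
    "random_forest": {"n_estimators": (10, 500), "max_depth": (2, 20)},
}


def sample_configuration(rng):
    """Draw the algorithm first, then only the hyperparameters that belong to it."""
    algorithm = rng.choice(sorted(SEARCH_SPACE))
    params = {}
    for name, (low, high) in SEARCH_SPACE[algorithm].items():
        if isinstance(low, int):
            params[name] = rng.randint(low, high)
        else:
            params[name] = rng.uniform(low, high)
    return {"algorithm": algorithm, **params}


def toy_validation_error(config):
    """Stand-in for training and validating a model; NOT a real measurement."""
    if config["algorithm"] == "svm":
        return abs(config["C"] - 10.0) / 100.0 + config["gamma"]
    return 1.0 / config["n_estimators"] + abs(config["max_depth"] - 8) / 50.0


def random_search(n_trials, seed=0):
    """Baseline search: sample configurations independently and keep the best one."""
    rng = random.Random(seed)
    configs = [sample_configuration(rng) for _ in range(n_trials)]
    return min(configs, key=toy_validation_error)


best = random_search(n_trials=50)
print(best["algorithm"], round(toy_validation_error(best), 4))
```

Random search ignores what earlier trials revealed. A model-based optimizer replaces the independent sampling step with proposals informed by the results seen so far.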
Specifically, `auto-sklearn` automates the following key aspects of the ML pipeline:
1. Algorithm Selection: It automatically chooses the most suitable machine learning algorithm (e.g., SVM, Random Forest, Gradient Boosting) from a wide range of scikit-learn models.
2. Hyperparameter Optimization: It tunes the hyperparameters of the candidate algorithms, searching jointly with algorithm selection for a configuration that performs well on validation data within the available time budget.
3. Automated Feature Preprocessing: It includes various preprocessing steps (e.g., scaling, imputation, one-hot encoding) and automatically selects and tunes them.
4. Ensemble Construction: After training multiple models with different algorithms and hyperparameters, `auto-sklearn` constructs a weighted ensemble of the best-performing models to further boost predictive performance. The ensemble often outperforms the best single model, although this is not guaranteed.
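The ensemble step in item 4 can be illustrated with greedy ensemble selection, which repeatedly adds the model that most improves validation accuracy and allows the same model to be picked more than once. The sketch below is a simplified, pure-Python version and not auto-sklearn's implementation. The model names and validation probabilities are made-up toy values.

```python
from collections import Counter


def accuracy(avg_probs, labels):
    """Fraction of samples whose highest-probability class matches the label."""
    hits = sum(
        1
        for probs, label in zip(avg_probs, labels)
        if max(range(len(probs)), key=probs.__getitem__) == label
    )
    return hits / len(labels)


def average(members, predictions):
    """Average the class probabilities of the chosen members, sample by sample."""
    n_samples = len(predictions[members[0]])
    n_classes = len(predictions[members[0]][0])
    return [
        [sum(predictions[m][i][c] for m in members) / len(members)
         for c in range(n_classes)]
        for i in range(n_samples)
    ]


def greedy_ensemble_selection(predictions, labels, ensemble_size):
    """Add one model per round (repeats allowed), keeping the best-scoring addition."""
    members = []
    for _ in range(ensemble_size):
        best_name = max(
            sorted(predictions),
            key=lambda name: accuracy(average(members + [name], predictions), labels),
        )
        members.append(best_name)
    # A model's weight is the share of rounds in which it was selected.
    return {name: count / ensemble_size for name, count in Counter(members).items()}


# Toy validation data: per-sample [P(class 0), P(class 1)] for three hypothetical models.
val_labels = [0, 0, 1, 1]
val_predictions = {
    "a": [[0.9, 0.1], [0.4, 0.6], [0.1, 0.9], [0.2, 0.8]],
    "b": [[0.8, 0.2], [0.9, 0.1], [0.6, 0.4], [0.1, 0.9]],
    "c": [[0.3, 0.7], [0.7, 0.3], [0.4, 0.6], [0.7, 0.3]],
}

weights = greedy_ensemble_selection(val_predictions, val_labels, ensemble_size=3)
print(weights)
```

In this toy data, models `a` and `b` each misclassify a different sample, so averaging them corrects both mistakes. That is the intuition behind combining diverse models instead of keeping only the single best one.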
`auto-sklearn` drives this search with Sequential Model-based Algorithm Configuration (SMAC), a form of Bayesian optimization. It also uses meta-learning to warm-start the optimization with configurations that worked well on previously seen datasets. By automating these steps, `auto-sklearn` reduces the manual effort and expertise needed to build strong machine learning models, allowing users to focus on problem definition and data understanding.
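The general shape of a sequential model-based optimization loop with a warm start can be sketched in a few lines. This is an illustration under strong simplifications, not SMAC: the surrogate here is a crude nearest-neighbour average, whereas SMAC uses a random-forest model. The `toy_loss` function, the single hyperparameter in [0, 1], and the `initial_configs` values standing in for meta-learning suggestions are all hypothetical.

```python
import random


def toy_loss(x):
    """Hypothetical validation loss for a hyperparameter x in [0, 1]; minimum at 0.3."""
    return (x - 0.3) ** 2


def surrogate(x, history, k=3):
    """Crude surrogate: mean loss of the k nearest evaluated points, with the
    distance to the nearest evaluated point as an uncertainty estimate."""
    nearest = sorted(history, key=lambda item: abs(item[0] - x))[:k]
    mean = sum(loss for _, loss in nearest) / len(nearest)
    uncertainty = abs(nearest[0][0] - x)
    return mean, uncertainty


def lower_confidence_bound(x, history, kappa):
    """Acquisition value: low predicted loss or high uncertainty both look attractive."""
    mean, uncertainty = surrogate(x, history)
    return mean - kappa * uncertainty


def smbo(objective, initial_configs, n_iterations, kappa=0.5, seed=0):
    """Evaluate the warm-start configurations, then alternate propose and evaluate."""
    rng = random.Random(seed)
    # Warm start: initial_configs must be non-empty so the surrogate has data.
    history = [(x, objective(x)) for x in initial_configs]
    for _ in range(n_iterations):
        candidates = [rng.random() for _ in range(200)]
        chosen = min(
            candidates, key=lambda x: lower_confidence_bound(x, history, kappa)
        )
        history.append((chosen, objective(chosen)))
    return min(history, key=lambda item: item[1])


best_x, best_loss = smbo(toy_loss, initial_configs=[0.2, 0.8], n_iterations=20)
print(round(best_x, 3), round(best_loss, 5))
```

The warm start matters because the surrogate begins with useful observations. Without it, the first iterations are spent exploring blindly.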
Example Code
```python
import shutil

import autosklearn.classification
import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection

# 1. Load a dataset
X, y = sklearn.datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y, random_state=1
)

# 2. Configure and run auto-sklearn
# For demonstration purposes, we use a small time_left_for_this_task and
# per_run_time_limit. In a real-world scenario, you would typically increase
# these values.
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=60,  # Total time limit for the entire process (seconds)
    per_run_time_limit=10,  # Time limit for each individual model configuration (seconds)
    tmp_folder='/tmp/autosklearn_classification_example_tmp',  # Temporary folder for intermediate files
    # Output folder for models. Only older releases accept output_folder;
    # remove this argument if your version raises a TypeError for it.
    output_folder='/tmp/autosklearn_classification_example_out',
    n_jobs=-1,  # Use all available CPU cores
    seed=42,
)

# 3. Fit the model
# This step involves algorithm selection, hyperparameter tuning, and ensemble building.
automl.fit(X_train, y_train, dataset_name='digits')

# 4. Show the models that make up the final ensemble
print("Models in the final ensemble:\n", automl.show_models())

# 5. Make predictions
y_pred = automl.predict(X_test)

# 6. Evaluate the model
accuracy = sklearn.metrics.accuracy_score(y_test, y_pred)
print(f"Accuracy score: {accuracy}")

# Optional: print summary statistics about the search
print("\nAuto-sklearn run statistics:")
print(automl.sprint_statistics())

# Clean up temporary folders (important in production or repeated runs)
shutil.rmtree('/tmp/autosklearn_classification_example_tmp', ignore_errors=True)
shutil.rmtree('/tmp/autosklearn_classification_example_out', ignore_errors=True)
```








Automated ML with auto-sklearn