Scikit-learn (often abbreviated as `sklearn`) is a free, open-source machine learning library for the Python programming language. It provides classification, regression, and clustering algorithms, including support vector machines, random forests, gradient boosting, k-means, and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
Key aspects of scikit-learn include:
- Simple and Efficient Tools: Provides a consistent API for a wide range of machine learning algorithms, making it easy to apply models without deep statistical knowledge.
- Comprehensive Machine Learning Tasks: Supports major machine learning tasks such as:
  - Classification: Identifying which category an object belongs to (e.g., spam detection, image recognition).
  - Regression: Predicting a continuous value associated with an object (e.g., stock price prediction, house price estimation).
  - Clustering: Grouping similar objects together (e.g., customer segmentation).
  - Dimensionality Reduction: Reducing the number of features (input variables) under consideration (e.g., Principal Component Analysis).
  - Model Selection and Evaluation: Tools for comparing, validating, and choosing parameters and models (e.g., cross-validation, grid search, various metrics).
  - Preprocessing: Tools for transforming raw data into a format suitable for machine learning algorithms (e.g., scaling, normalization, imputation).
- Built on Top of Numerical Libraries: Uses NumPy arrays for data and vectorized operations, and SciPy for sparse matrices and numerical routines, so most heavy computation runs in compiled code rather than pure Python.
- Well-Documented: Comes with extensive and clear documentation, including tutorials and examples, making it accessible for both beginners and experienced practitioners.
- Open Source: Released under a permissive BSD license and maintained by a large community of developers and users, which supports ongoing improvement.
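
The unsupervised tasks in the list above (dimensionality reduction and clustering) can be illustrated with a minimal sketch. It assumes the built-in Iris dataset; the choice of two principal components and three clusters is an illustrative assumption, not a recommendation.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Labels are discarded: both steps below are unsupervised.
X, _ = load_iris(return_X_y=True)

# Dimensionality reduction: project the 4 features onto 2 principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print("Reduced shape:", X_2d.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)

# Clustering: group the samples into 3 clusters.
# (3 is an assumption, chosen because Iris contains three species.)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X_2d)
print("First ten cluster labels:", cluster_labels[:10])
```

Note that `PCA` exposes `fit_transform()` while `KMeans` exposes `fit_predict()`; transformers and predictors follow parallel naming conventions.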
The typical workflow with scikit-learn involves:
1. Loading Data: Importing datasets (either built-in or custom).
2. Preprocessing Data: Cleaning, scaling, or transforming the data to suit the chosen algorithm.
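3. Splitting Data: Dividing the dataset into training and testing sets. Preprocessing steps that learn from the data, such as scaling, are usually fit on the training set only so that no information from the test set leaks into the model.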
4. Model Instantiation: Choosing and initializing a specific machine learning model.
5. Model Training: Fitting the model to the training data.
6. Prediction: Using the trained model to make predictions on new, unseen data (the test set).
7. Model Evaluation: Assessing the model's performance using various metrics.
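
The seven steps above can be sketched compactly, including an explicit preprocessing step. This is one possible arrangement, not the only one: `StandardScaler` and `LogisticRegression` are illustrative choices, and the sketch splits before scaling so that the scaler is fit on training data only, a common precaution rather than something the list prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Step 1: load data.
X, y = load_iris(return_X_y=True)

# Step 3 (done before step 2 here): hold out 20% as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Step 2: learn scaling parameters from the training set only,
# then apply the same transformation to the test set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 4: instantiate a model (illustrative choice).
model = LogisticRegression(max_iter=1000)

# Step 5: train.
model.fit(X_train_scaled, y_train)

# Step 6: predict on the held-out data.
y_pred = model.predict(X_test_scaled)

# Step 7: evaluate.
test_accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {test_accuracy:.3f}")
```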
Scikit-learn's uniform API across different models (e.g., `fit()` for training, `predict()` for making predictions) is a cornerstone of its popularity and ease of use, making it a go-to library for machine learning in Python.
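
To make the uniform API concrete, the following sketch trains three unrelated classifiers with exactly the same calls. The particular models and hyperparameters are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Three different algorithms behind the same estimator interface.
models = {
    "SVC": SVC(kernel="linear"),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
    "KNeighbors": KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)                 # same training call for every model
    scores[name] = model.score(X_test, y_test)  # mean accuracy for classifiers
    print(f"{name}: {scores[name]:.3f}")
```

Because every estimator exposes the same methods, swapping one algorithm for another usually means changing a single line.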
Example Code
```python
from sklearn import datasets
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# 1. Load a sample dataset (Iris dataset)
# The Iris dataset is a classic and easy-to-understand dataset for classification.
iris = datasets.load_iris()
X = iris.data    # Features
y = iris.target  # Target variable (species)

# You can inspect the data (optional)
print("Features (X) shape:", X.shape)
print("Target (y) shape:", y.shape)
print("Feature names:", iris.feature_names)
print("Target names:", iris.target_names)

# 2. Split the dataset into training and testing sets
# We use 80% of the data for training and 20% for testing.
# `random_state` ensures reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set size: {len(X_train)} samples")
print(f"Testing set size: {len(X_test)} samples")

# 3. Choose and instantiate a classifier model (Support Vector Classifier - SVC)
# A Support Vector Machine (SVM) is a powerful and versatile machine learning model.
model = SVC(kernel='linear', random_state=42)

# 4. Train the model using the training data
print("Training the model...")
model.fit(X_train, y_train)
print("Model training complete.")

# 5. Make predictions on the test set
y_pred = model.predict(X_test)

# 6. Evaluate the model's performance
print("\nModel Evaluation:")

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Generate a detailed classification report
# This report shows precision, recall, and f1-score for each class.
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Example of a new prediction (optional)
# Sample values for sepal length, sepal width, petal length, petal width (cm)
new_data = [[5.1, 3.5, 1.4, 0.2]]
predicted_species_index = model.predict(new_data)
predicted_species_name = iris.target_names[predicted_species_index[0]]
print(f"\nPredicted species for new data {new_data}: {predicted_species_name}")
```
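
A single 80/20 split yields only one accuracy estimate. The model selection tools mentioned earlier (cross-validation and grid search) give a more robust picture. The sketch below is a minimal illustration under assumed settings: the 5-fold setup, the scaling step, and the small parameter grid are arbitrary choices, not tuned recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# A pipeline refits the scaler inside each fold, so validation data
# never influences the preprocessing parameters.
pipeline = make_pipeline(StandardScaler(), SVC())

# Cross-validation: five train/validation splits instead of one.
cv_scores = cross_val_score(pipeline, X, y, cv=5)
print(f"5-fold accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# Grid search: try every combination in a small, illustrative grid.
# make_pipeline names the SVC step "svc", hence the "svc__" prefix.
param_grid = {"svc__C": [0.1, 1, 10], "svc__kernel": ["linear", "rbf"]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)
print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

After fitting, `search` itself behaves like an estimator: calling `search.predict()` uses the best parameter combination, refit on all the data passed to `fit()`.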







