Scikit-learn is by far one of the most useful open-source libraries for Python. It was developed by David Cournapeau in 2007 as a Google Summer of Code project and has grown into the most useful and robust machine learning library for Python. The library is primarily written in Python and is built upon SciPy, NumPy, and Matplotlib. Canopy and Anaconda both ship the latest version of scikit-learn.

Scikit-learn

Installation

If you have already installed NumPy and SciPy, the following are the two easiest ways to install scikit-learn.

Using pip

The following command can be used to install it:

pip install -U scikit-learn

Using conda

The following command can be used to install it:

conda install scikit-learn
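
Either way, you can verify that the installation worked by printing the installed version:

python -c "import sklearn; print(sklearn.__version__)"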

Features

  • Supervised learning algorithms
    • Almost all the popular supervised learning algorithms, such as Linear Regression, Support Vector Machines, and Decision Trees, are part of scikit-learn.
  • Unsupervised learning algorithms
    • It also has all the popular unsupervised learning algorithms, from clustering, factor analysis, and PCA to unsupervised neural networks.
  • Clustering
    • Used for grouping unlabeled data.
  • Cross-validation
    • Used to check the accuracy of supervised models on unseen data.
  • Dimensionality reduction
    • Used to reduce the number of attributes in data, which helps with summarisation, visualisation, and feature selection.
  • Ensemble methods
    • As the name suggests, used to combine the predictions of multiple supervised models.
  • Feature extraction
    • Used to extract features from data, for example to define the attributes in image and text data.
  • Feature selection
    • Used to identify the most useful attributes for creating supervised models; see the sketch after this list.
  • Open source
    • It is an open-source library and is also commercially usable under the BSD license.
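
As a quick illustration of the feature-selection capability, here is a minimal sketch that keeps the two most informative features of the iris dataset using SelectKBest; the choice of dataset and scoring function is ours, purely for illustration.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y = True)

# Keep the 2 features with the highest ANOVA F-scores
selector = SelectKBest(score_func = f_classif, k = 2)
X_new = selector.fit_transform(X, y)

print(X.shape)      # (150, 4)
print(X_new.shape)  # (150, 2)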

Modelling Process

Dataset Loading

A collection of data is known as a dataset.

A dataset has the following two components:

  • Features
  • Response

Features − The variables of the data are known as features. They are also called predictors, inputs, or attributes.

  • Feature matrix − It is the collection of features, in case there are more than one.
  • Feature names − It is the list of all the names of the features.

Response − The output variable that basically depends upon the feature variables is known as the response.

  • Response vector − It is used to represent the response column. Generally, we have just one response column.
  • Target names − It represents the possible values taken by the response vector (see the sketch below).
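
To make these terms concrete, here is a minimal sketch that loads the built-in iris dataset and prints its feature matrix, feature names, response vector, and target names; the choice of iris is ours, purely for illustration.

from sklearn.datasets import load_iris

iris = load_iris()

X = iris.data    # feature matrix
y = iris.target  # response vector

print(X.shape)             # (150, 4)
print(iris.feature_names)  # the names of the four features
print(y.shape)             # (150,)
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']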

Splitting the Dataset

We split the dataset into two pieces:

  • Training set
    • used to train the model.
  • Testing set
    • used to evaluate the trained model on unseen data.

Example of Splitting the Dataset

from sklearn.model_selection import train_test_split

# df is assumed to be a pandas DataFrame loaded earlier (e.g. via pd.read_csv),
# with the feature columns first and the response in the last column
X = df.iloc[:, :-1].values
y = df.iloc[:, -1].values

X_train, X_test, y_train, y_test = train_test_split(
   X, y, test_size = 0.3, random_state = 1
)

print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

Output

(105, 3)
(45, 3)
(105,)
(45,)

Tree Algorithm

The decision tree is a powerful non-parametric supervised learning method. In a decision tree, a node represents a feature, a branch represents a decision rule, and each leaf node represents the outcome.

Example of Tree

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Load the iris dataset
X, Y = load_iris(return_X_y = True)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.4, random_state = 0)

# Fit a decision tree classifier on the training split
dtc = DecisionTreeClassifier(random_state = 0)
dtc.fit(X_train, Y_train)

# Evaluate the classifier with 10-fold cross-validation on the full dataset
score = cross_val_score(dtc, X, Y, cv = 10)

print("Accuracy scores: ", score)
print("Mean accuracy score: ", np.mean(score))

Output of Tree

Accuracy scores:  [1. 0.93333333 1. 0.93333333 0.93333333 0.86666667 0.93333333 1. 1. 1.]
Mean accuracy score:  0.96

Gradient Boosting

Gradient boosting builds an ensemble of weak learners (typically shallow decision trees) stage by stage, where each new tree tries to correct the errors of the ones before it.

Example of Gradient Boosting

from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Generate the synthetic Hastie dataset
X, Y = make_hastie_10_2(random_state = 10)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.4, random_state = 0)

# Fit a gradient boosting classifier built from 100 depth-1 trees
gbc = GradientBoostingClassifier(n_estimators = 100, learning_rate = 1.0, max_depth = 1, random_state = 0)
gbc.fit(X_train, Y_train)

# Mean accuracy on the held-out test split
score = gbc.score(X_test, Y_test)
print("Accuracy score: ", score)

Output of Gradient Boosting

Accuracy score:  0.9185416666666667

Dimensionality Reduction using PCA in Sklearn
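
Principal Component Analysis (PCA) projects the data onto the directions of maximum variance, reducing the number of features while keeping as much information as possible. Here is a minimal sketch on the diabetes dataset; the choice of dataset and number of components is ours, purely for illustration.

Example of PCA

from sklearn.datasets import load_diabetes
from sklearn.decomposition import PCA

X, Y = load_diabetes(return_X_y = True)

# Reduce the 10 original features to 2 principal components
pca = PCA(n_components = 2)
X_reduced = pca.fit_transform(X)

print(X.shape)          # (442, 10)
print(X_reduced.shape)  # (442, 2)
print(pca.explained_variance_ratio_)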

Clustering Methods

Clustering is one of the most useful unsupervised ML techniques, used to find patterns of similarity and relationships among data samples.

KMeans

The KMeans algorithm clusters data by repeatedly assigning each sample to its nearest centroid and recomputing the centroids until they stabilise.

Example of Kmeans

from sklearn.cluster import KMeans
from sklearn.datasets import load_diabetes

# Load the diabetes dataset (only the features are used here)
X, Y = load_diabetes(return_X_y = True)

# Fit KMeans with 10 clusters on the first 50 samples
cluster = KMeans(n_clusters = 10)
cluster.fit(X[:50, :])

print("Cluster labels: ", cluster.labels_)

Output of Kmeans

Cluster labels:  [6 0 6 2 0 8 8 5 6 2 8 6 0 6 0 5 3 5 2 2 8 8 2 7 2 6 8 2 4 3 2 4 1 4 4 9 3 2 5 6 5 8 6 9 1 6 2 8 0 1]

Spectral Clustering

Before clustering, spectral clustering performs a dimensionality reduction on the affinity matrix of the samples, which makes it well suited to clusters that are not necessarily convex.

Example of Spectral Clustering

from sklearn.cluster import SpectralClustering
from sklearn.datasets import load_diabetes

X, Y = load_diabetes(return_X_y = True)

# Fit spectral clustering with 10 clusters on the first 50 samples
cluster = SpectralClustering(n_clusters = 10)
cluster.fit(X[:50, :])

print("Cluster labels: ", cluster.labels_)

Output of Spectral Clustering

Cluster labels:  [0 2 0 8 4 3 6 4 9 1 3 0 4 6 2 8 5 4 7 1 7 6 9 5 2 8 3 9 1 3 9 5 0 5 4 5 1 5 8 1 7 3 6 5 0 6 1 3 6 8]

Hierarchical Clustering

Hierarchical (agglomerative) clustering builds nested clusters by successively merging the closest pair of clusters until only the requested number of clusters remains.

Example of Hierarchical Clustering

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import load_diabetes

X, Y = load_diabetes(return_X_y = True)

# Fit agglomerative clustering on the first 50 samples;
# compute_distances = True stores the merge distances needed for a dendrogram
cluster = AgglomerativeClustering(n_clusters = 10, compute_distances = True)
cluster.fit(X[:50, :])

print("Cluster labels: ", cluster.labels_)

Output of Hierarchical Clustering

Cluster labels:  [3 6 3 5 6 0 0 1 3 5 0 2 6 3 6 1 4 1 5 6 0 0 5 9 5 2 0 5 6 4 5 0 8 7 6 7 4 5 1 3 1 0 2 7 8 3 0 0 3 2]
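
Because the model was fitted with compute_distances = True, it exposes a distances_ attribute describing the full merge tree, which is exactly what is needed to draw a dendrogram. Here is a minimal sketch of that follow-up step, adapted from the common linkage-matrix pattern and assuming SciPy and Matplotlib are installed.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram

def plot_dendrogram(model):
    # Count the samples under each internal node of the merge tree
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        count = 0
        for child in merge:
            # Indices below n_samples are leaves (original samples)
            count += 1 if child < n_samples else counts[child - n_samples]
        counts[i] = count
    # SciPy expects the columns: child a, child b, merge distance, sample count
    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]
    ).astype(float)
    dendrogram(linkage_matrix)

plot_dendrogram(cluster)
plt.show()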

If you have any queries regarding this article, or if I have missed something on this topic, please feel free to add it in the comments below for the audience. See you in another article.

To know more about the scikit-learn library, see the scikit-learn article on Wikipedia.

Stay connected, stay safe. Thank you.


