An Illustrated Guide to a Machine Learning Power Tool: Scikit-Learn
2024-06-25 08:13
- SKLearn official site: https://scikit-learn.org/stable/ [2]
- For a quick start with SKLearn, see also the ShowMeAI article and cheat sheet "AI Modeling Tool Cheat Sheet | Scikit-learn Usage Guide" [3]
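- ① Machine learning basics: the definition of machine learning and its four elements (data, task, performance metric, and model), introduced so that the concepts can be mapped onto SKLearn.
- ② SKLearn overview: API design principles; SKLearn's key properties (consistency, inspection, standard classes, composition, and sensible defaults); and the built-in datasets and their storage formats.
- ③ The three core SKLearn APIs: estimators, predictors, and transformers. This part matters most, because real applications are built mainly on the core API.
- ④ The advanced SKLearn APIs: pipelines that cut down on code (Pipeline estimators), ensemble models (Ensemble estimators), multiclass, multilabel, and multioutput classification models (Multiclass and Multioutput estimators), and model selection tools (Model Selection estimators).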
 
1. Introduction to Machine Learning
Definition and components
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E. 
- Data
- Task
- Quality Metric
- Algorithm
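Data
- By concrete data type: structured data and unstructured data.
  - Structured data is data that is logically expressed and implemented as a two-dimensional table.
  - Unstructured data has no predefined structure and is awkward to represent as a two-dimensional database table. It includes images, text, audio, and video.
- By form of representation: raw data and processed data.
- By statistical role: in-sample data and out-of-sample data.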
 
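- Each row (the measurements of one iris flower) is called a "sample".
- A property of a sample, such as sepal length or petal length, is called a "feature".
- The value a feature takes, such as 5.1 and 3.5 for "sample 1", is called a "feature value".
- Information about a sample's outcome, such as Setosa or Versicolor, is called a "class label".
- An example that includes label information is called an "instance", that is, instance = (features, label).
- The process of learning a model from data is called "learning" or "training".
- In the training data each instance is a "training instance", and the whole collection is the "training set".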
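Task
- Supervised learning (labeled)
- Unsupervised learning (unlabeled)
- Semi-supervised learning (partially labeled)
- Reinforcement learning (delayed labels)
Performance metrics
2. SKLearn Data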
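- Supervised learning: classification [8]
- Supervised learning: regression [9]
- Unsupervised learning: clustering [10]
- Unsupervised learning: dimensionality reduction [11]
- Model selection [12]
- Data preprocessing [13]
- Dataset loading module [14]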
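SKLearn default data formats
- Dense data as a NumPy two-dimensional array (ndarray); this is the usual format.
- Sparse data as a SciPy sparse matrix (scipy.sparse). For example, in text analysis, one-hot encoding each word against a 100,000-word dictionary yields a matrix that is almost entirely zeros; an ndarray would waste too much memory here.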
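To make the memory argument concrete, here is a minimal sketch comparing the two formats. The matrix size (1,000 rows by 10,000 columns, one non-zero per row) is an assumption chosen for illustration, not a figure from the article.

```python
import numpy as np
from scipy import sparse

# A mostly-zero matrix, similar in spirit to one-hot encoded text:
# 1,000 rows, 10,000 columns, exactly one non-zero entry per row.
rng = np.random.default_rng(0)
n_rows, n_cols = 1000, 10000
rows = np.arange(n_rows)
cols = rng.integers(0, n_cols, size=n_rows)
values = np.ones(n_rows)

# Compressed Sparse Row storage keeps only the non-zero entries.
X_sparse = sparse.csr_matrix((values, (rows, cols)), shape=(n_rows, n_cols))
X_dense = X_sparse.toarray()

dense_bytes = X_dense.nbytes
sparse_bytes = (X_sparse.data.nbytes
                + X_sparse.indices.nbytes
                + X_sparse.indptr.nbytes)
print("dense bytes: ", dense_bytes)
print("sparse bytes:", sparse_bytes)
```

Most scikit-learn estimators that support sparse input accept X_sparse directly, so the dense copy is only needed for inspection.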
Built-in datasets
# Import the library
from sklearn.datasets import load_iris
iris = load_iris()
# The data is stored in a dictionary-like format; look at the keys of iris.
iris.keys()
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename']) 
# Print the size and names of the iris features, plus the first five samples.
n_samples, n_features = iris.data.shape
print((n_samples, n_features))
print(iris.feature_names)
print(iris.target.shape)
print(iris.target_names)
iris.data[0:5]
(150, 4) 
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
(150,)
['setosa' 'versicolor' 'virginica']
array([[5.1, 3.5, 1.4, 0.2],
[4.9, 3. , 1.4, 0.2],
[4.7, 3.2, 1.3, 0.2],
[4.6, 3.1, 1.5, 0.2],
[5. , 3.6, 1.4, 0.2]])
# Combine X and y into a DataFrame
import pandas as pd
import seaborn as sns
iris_data = pd.DataFrame( iris.data,
columns=iris.feature_names )
iris_data['species'] = iris.target_names[iris.target]
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
pd.concat([iris_data.head(3), iris_data.tail(3)])
# Use Seaborn's pairplot to view the pairwise relationships between features
sns.pairplot( iris_data, hue='species', palette='husl' )
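Ways to load datasets
- Packaged data: for small datasets, use sklearn.datasets.load_*
- Downloaded data: for large datasets, use sklearn.datasets.fetch_*
- Randomly generated data: for quick demonstrations, use sklearn.datasets.make_*
The * stands for a specific dataset name. In an environment such as Jupyter, Tab completion lists the available choices.
- datasets.load_
- datasets.fetch_
- datasets.make_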
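The article demonstrates load_* and fetch_* below; as a complement, here is a short sketch of the make_* family. The sample counts, feature counts, and random_state values are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris, make_blobs, make_classification

# load_*: a small dataset shipped with scikit-learn, returned as (X, y)
X_iris, y_iris = load_iris(return_X_y=True)

# make_*: synthetic data generated on the fly
# Three Gaussian blobs in 2D, handy for clustering demos
X_blobs, y_blobs = make_blobs(n_samples=200, centers=3, n_features=2,
                              random_state=0)

# A random binary classification problem with 5 features
X_clf, y_clf = make_classification(n_samples=100, n_features=5,
                                   n_informative=3, n_redundant=0,
                                   random_state=0)

print(X_iris.shape, y_iris.shape)
print(X_blobs.shape, y_blobs.shape)
print(X_clf.shape, y_clf.shape)
```

The fetch_* functions follow the same calling pattern but download the data on first use, so they need network access.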
load_iris
  from sklearn import datasets 
datasets.load_iris
<function sklearn.datasets.base.load_iris(return_X_y=False)> 
load_digits loads the handwritten digit image dataset
  digits = datasets.load_digits() 
digits.keys()
dict_keys(['data', 'target', 'target_names', 'images', 'DESCR']) 
# California housing dataset
california_housing = datasets.fetch_california_housing()
california_housing.keys()
dict_keys(['data', 'target', 'feature_names', 'DESCR']) 
3. SKLearn Core API
- An estimator is an object that can be fitted to data.
- A predictor is an estimator that can also make predictions.
- A transformer is an estimator that can also transform data.
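Estimators
- ① It takes data as input.
- ② It can estimate parameters.
- Creating an estimator: set a group of hyperparameters, for example
  - fit_intercept=True for linear regression
  - n_clusters=5 for K-means
- Fitting an estimator: requires a training set
  - In supervised learning the pattern is model.fit(X_train, y_train)
  - In unsupervised learning the pattern is model.fit(X_train)
- model.coef_
- model.labels_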
(1) Linear regression
Import LinearRegression from linear_model and create a model object named model with the hyperparameter fit_intercept set to True. (Older scikit-learn releases also accepted normalize=True here; that parameter was removed in version 1.2, so features are now scaled with a separate transformer such as StandardScaler.)
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True)
model
LinearRegression()
Hyperparameters that are not set explicitly keep their default values.
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(10)
y = 2 * x + 1
plt.plot( x, y, 'o' )
Use np.newaxis to add a dimension, turning [1,2,3] into [[1],[2],[3]], which is the shape sklearn expects. Then pass X and y to fit() to fit the parameters of the linear model.
X = x[:, np.newaxis]
model.fit( X, y )
LinearRegression()
The fitted parameters can then be accessed as model.param_, as in the code below.
print( model.coef_ )
print( model.intercept_ )
# Output
# [2.]
# approximately 1.0 (floating-point rounding may show a value such as 0.9999999999999982)
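(2) K-means
Import KMeans from cluster and create a model object named model with the hyperparameter n_clusters set to 3 (chosen for convenience because we know the iris dataset has three classes; in practice other values of n_clusters can be tried).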
  from sklearn.cluster import KMeans 
model = KMeans( n_clusters=3 )
model
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300, 
n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
random_state=None, tol=0.0001, verbose=0)
Note that X = iris.data[:,0:2] in the code below simply selects the first two feature columns.
from sklearn.datasets import load_iris 
iris = load_iris()
X = iris.data[:,0:2]
model.fit(X)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300, 
n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
random_state=None, tol=0.0001, verbose=0)
The fitted parameters can then be accessed as model.param_, as in the code below.
  print( model.cluster_centers_, '\n') 
print( model.labels_, '\n' )
print( model.inertia_, '\n')
print(iris.target)
[[5.77358491 2.69245283]
[6.81276596 3.07446809]
[5.006 3.428 ]]
[2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 0 1 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
1 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 1 0 1 1 1 1
1 1 0 0 1 1 1 1 0 1 0 1 0 1 1 0 0 1 1 1 1 1 0 0 1 1 1 0 1 1 1 0 1 1 1 0 1
1 0]
37.05070212765958
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2]
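- model.cluster_centers_: the cluster centers. Three clusters means three coordinates.
- model.labels_: the cluster label assigned to each sample.
- model.inertia_: the sum of squared distances from every point to its cluster center (smaller is better).
Summary
Every estimator has a fit() method.
# Supervised learning
from sklearn.xxx import SomeModel
# xxx can be linear_model, ensemble, etc.
model = SomeModel( hyperparameter )
model.fit( X, y )
# Unsupervised learning
from sklearn.xxx import SomeModel
# xxx can be cluster, decomposition, etc.
model = SomeModel( hyperparameter )
model.fit( X )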
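Predictors
A predictor adds a predict() function on top of an estimator:
- model.predict(X_test): predictions on new data, used to evaluate how the model generalizes.
- model.predict(X_train): predictions on data the model has already seen, used to confirm how well it fits.
Split the data into a training set (X_train, y_train) and a test set (X_test, y_test), then use the model fitted with fit() on the training set to predict() on the test set.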
  from sklearn.datasets import load_iris 
iris = load_iris()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( iris['data'],
iris['target'],
test_size=0.2 )
print( 'The size of X_train is ', X_train.shape )
print( 'The size of y_train is ', y_train.shape )
print( 'The size of X_test is ', X_test.shape )
print( 'The size of y_test is ', y_test.shape )
The size of X_train is (120, 4)
The size of y_train is (120,)
The size of X_test is (30, 4)
The size of y_test is (30,)
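predict & predict_proba
For classification we want the predicted class and sometimes also the predicted probabilities: the former uses predict(), the latter predict_proba().
# Fit a classifier on the training set first (logistic regression is used here as an example)
from sklearn.linear_model import LogisticRegression
model = LogisticRegression( max_iter=1000 )
model.fit( X_train, y_train )
y_pred = model.predict( X_test )
p_pred = model.predict_proba( X_test )
print( y_test, '\n' )
print( y_pred, '\n' )
print( p_pred )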
score & decision_function
- score() returns the classification accuracy.
- decision_function() returns a score for each instance under each class.
print( model.score( X_test, y_test ) ) 
print( np.sum(y_pred==y_test)/len(y_test) )
decision_score = model.decision_function( X_test )
print( decision_score )
Summary
Every estimator has a fit() method, and every predictor has predict() and score() methods. Not every predictor, however, has predict_proba() and decision_function(); check the official documentation when in doubt (for example, RandomForestClassifier has no decision_function() method).
# Supervised learning
from sklearn.xxx import SomeModel
# xxx can be linear_model, ensemble, etc.
model = SomeModel( hyperparameter )
model.fit( X, y )
y_pred = model.predict( X_new )
s = model.score( X_new, y_new )
# Unsupervised learning
from sklearn.xxx import SomeModel
# xxx can be cluster, decomposition, etc.
model = SomeModel( hyperparameter )
model.fit( X )
idx_pred = model.predict( X_new )
s = model.score( X_new )
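The predictor methods above can be exercised end to end in one short script. This is a sketch that assumes logistic regression as the classifier and a fixed random_state for reproducibility; neither choice comes from the article.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)            # predicted class per sample
p_pred = clf.predict_proba(X_test)      # one probability per class
scores = clf.decision_function(X_test)  # one raw score per class
accuracy = clf.score(X_test, y_test)    # mean accuracy

print(p_pred.shape, scores.shape, accuracy)

# Not every predictor offers every method: check before calling.
print(hasattr(clf, "decision_function"))
print(hasattr(RandomForestClassifier(), "decision_function"))
```

Using hasattr is a quick way to confirm which optional methods a given predictor supports without opening the documentation.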
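Transformers
- In an estimator: fit + predict
- In a transformer: fit + transform
- Encoding categorical variables as numerical variables
- Normalizing or standardizing numerical variables
(1) Encoding categorical variables
- LabelEncoder takes one-dimensional input, such as a 1D ndarray
- OrdinalEncoder takes two-dimensional input, such as a DataFrame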
# First define the list enc used to fit the encoder and the list dec to be transformed.
enc = ['red','blue','yellow','red']
dec = ['blue','blue','red']
# Import LabelEncoder from sklearn.preprocessing and create a transformer named LE; no hyperparameters are needed.
from sklearn.preprocessing import LabelEncoder
LE = LabelEncoder()
print(LE.fit(enc))
print( LE.classes_ )
print( LE.transform(dec) )
LabelEncoder()
['blue' 'red' 'yellow']
[0 0 1]
from sklearn.preprocessing import OrdinalEncoder
OE = OrdinalEncoder()
enc_DF = pd.DataFrame(enc)
dec_DF = pd.DataFrame(dec)
print( OE.fit(enc_DF) )
print( OE.categories_ )
print( OE.transform(dec_DF) )
OrdinalEncoder(categories='auto', dtype=<class 'numpy.float64'>)
[array(['blue', 'red', 'yellow'], dtype=object)]
[[0.]
[0.]
[1.]]
- ① A one-dimensional array already encoded with LabelEncoder
- ② A DataFrame
from sklearn.preprocessing import OneHotEncoder
OHE = OneHotEncoder()
num = LE.fit_transform( enc )
print( num )
OHE_y = OHE.fit_transform( num.reshape(-1,1) )
OHE_y
[1 0 2 1]
<4x3 sparse matrix of type '<class 'numpy.float64'>'
with 4 stored elements in Compressed Sparse Row format>
- Line 3 encodes the list; line 4 prints the result [1 0 2 1] (classes are sorted alphabetically, so blue=0, red=1, yellow=2).
- Line 5 converts it to one-hot form. The output is a sparse matrix: in practice there are usually many categories, so a sparse matrix is used from the start to save memory.
To see the contents of the matrix, use the toarray() function.
OHE_y.toarray()
array([[0., 1., 0.],
[1., 0., 0.],
[0., 0., 1.],
[0., 1., 0.]])
OHE = OneHotEncoder()
OHE.fit_transform( enc_DF ).toarray()
array([[0., 1., 0.],
[1., 0., 0.],
[0., 0., 1.],
[0., 1., 0.]])
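(2) Feature scaling
- Standardization: subtract the feature's mean from each value, then divide by the feature's standard deviation.
- Normalization: subtract the feature's minimum from each value, then divide by the difference between the feature's maximum and minimum.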
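The two definitions above can be checked by hand against the scikit-learn scalers. The sketch below computes both formulas with plain NumPy and compares the results; the sample values are the same six numbers used in this section.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([0, 0.5, 1, 1.5, 2, 100]).reshape(-1, 1)

# Standardization: (x - mean) / standard deviation
# StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default.
manual_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Normalization: (x - min) / (max - min)
manual_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

sk_std = StandardScaler().fit_transform(X)
sk_minmax = MinMaxScaler().fit_transform(X)

print(np.allclose(manual_std, sk_std))
print(np.allclose(manual_minmax, sk_minmax))
```

Note how the single outlier (100) squeezes the other normalized values close to 0; standardization is affected by the outlier as well, through the mean and standard deviation.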
from sklearn.preprocessing import MinMaxScaler 
X = np.array( [0, 0.5, 1, 1.5, 2, 100] )
X_scale = MinMaxScaler().fit_transform( X.reshape(-1,1) )
X_scale
array([[0. ], 
[0.005],
[0.01 ],
[0.015],
[0.02 ],
[1. ]])
from sklearn.preprocessing import StandardScaler 
X_scale = StandardScaler().fit_transform( X.reshape(-1,1) )
X_scale
array([[-0.47424487], 
[-0.46069502],
[-0.44714517],
[-0.43359531],
[-0.42004546],
[ 2.23572584]])
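Note: fit() should only ever be applied to the training set. To transform the test set, use the transformer already fitted on the training set and call transform. Never fit and then transform on the test set: the training and test sets would be transformed by different rules, and what the model learned would no longer apply.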
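A minimal sketch of that rule, assuming StandardScaler and the iris data purely as an example: the scaler learns its mean and scale from the training split only, and the test split reuses those statistics.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

# The test set is scaled with the training mean and scale,
# so its own column means are generally not exactly zero.
print(X_train_scaled.mean(axis=0).round(6))
print(X_test_scaled.mean(axis=0).round(6))
```

Wrapping the scaler and the model in a Pipeline enforces the same rule automatically during cross-validation.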
4. Advanced API
- ensemble.BaggingClassifier
- ensemble.VotingClassifier
- multiclass.OneVsOneClassifier
- multiclass.OneVsRestClassifier
- multioutput.MultiOutputClassifier
- model_selection.GridSearchCV
- model_selection.RandomizedSearchCV
- pipeline.Pipeline
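Ensemble estimators
- AdaBoostClassifier: adaptive boosting classifier
- AdaBoostRegressor: adaptive boosting regressor
- BaggingClassifier: bagging classifier
- BaggingRegressor: bagging regressor
- GradientBoostingClassifier: gradient boosting classifier
- GradientBoostingRegressor: gradient boosting regressor
- RandomForestClassifier: random forest classifier
- RandomForestRegressor: random forest regressor
- VotingClassifier: voting classifier
- VotingRegressor: voting regressor
- Ensembles of homogeneous estimators: RandomForestClassifier
- Ensembles of heterogeneous estimators: VotingClassifier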
from sklearn.datasets import load_iris 
iris = load_iris()
from sklearn.model_selection import train_test_split
from sklearn import metrics
X_train, X_test, y_train, y_test = train_test_split(iris['data'], iris['target'], test_size=0.2)
(1) RandomForestClassifier
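RandomForestClassifier uses the n_estimators hyperparameter to set the number of base estimators, here 4 decision trees (a forest is made of trees); in addition, each tree has a maximum depth of 5 (max_depth=5).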
  from sklearn.ensemble import RandomForestClassifier 
RF = RandomForestClassifier( n_estimators=4, max_depth=5 )
RF.fit( X_train, y_train )
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', 
max_depth=5, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=4,
n_jobs=None, oob_score=False, random_state=None,
verbose=0, warm_start=False)
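Like every estimator, it is fitted with fit(). Next, look at the number of estimators in the random forest and at the estimators themselves.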
  print( RF.n_estimators ) 
RF.estimators_
4 
[DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=705712365, splitter='best'),
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=1026568399, splitter='best'),
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=1987322366, splitter='best'),
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=5,
max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=1210538094, splitter='best')]
print ( "RF - Accuracy (Train): %.4g" % 
metrics.accuracy_score(y_train, RF.predict(X_train)) )
print ( "RF - Accuracy (Test): %.4g" %
metrics.accuracy_score(y_test, RF.predict(X_test)) )
RF - Accuracy (Train): 1 
RF - Accuracy (Test): 0.9667
(2) VotingClassifier
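RandomForestClassifier uses the n_estimators hyperparameter to set the number of trees, whereas VotingClassifier needs each heterogeneous base classifier passed in explicitly.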
  from sklearn.linear_model import LogisticRegression 
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
LR = LogisticRegression( solver='lbfgs', multi_class='multinomial' )
RF = RandomForestClassifier( n_estimators=5 )
GNB = GaussianNB()
Ensemble = VotingClassifier( estimators=[('lr', LR), ('rf', RF), ('gnb', GNB)], voting='hard' )
Ensemble.fit( X_train, y_train )
VotingClassifier(estimators=[('lr', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,intercept_scaling=1, max_iter=100, multi_class='multinomial',n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',tol=0.0001, verbose=0, warm_start=False)), ('rf', ...e, verbose=0,warm_start=False)), ('gnb', GaussianNB(priors=None, var_smoothing=1e-09))],flatten_transform=None, n_jobs=None, voting='hard', weights=None)
print( len(Ensemble.estimators_) ) 
Ensemble.estimators_
3 
[LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,intercept_scaling=1, max_iter=100, multi_class='multinomial',n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',tol=0.0001, verbose=0, warm_start=False),
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',max_depth=None, max_features='auto', max_leaf_nodes=None,min_impurity_decrease=0.0, min_impurity_split... min_samples_leaf=1, min_samples_split=2,min_weight_fraction_leaf=0.0, n_estimator... oob_score=False, random_state=None, verbose=...
warm_start=False),
GaussianNB(priors=None, var_smoothing=1e-09)]
# Fit each base classifier on its own for comparison
LR.fit( X_train, y_train )
RF.fit( X_train, y_train )
GNB.fit( X_train, y_train )
# Evaluate
print ( "LR - Accuracy (Train): %.4g" % metrics.accuracy_score(y_train, LR.predict(X_train)) )
print ( "RF - Accuracy (Train): %.4g" % metrics.accuracy_score(y_train, RF.predict(X_train)) )
print ( "GNB - Accuracy (Train): %.4g" % metrics.accuracy_score(y_train, GNB.predict(X_train)) )
print ( "Ensemble - Accuracy (Train): %.4g" % metrics.accuracy_score(y_train, Ensemble.predict(X_train)) )
print ( "LR - Accuracy (Test): %.4g" % metrics.accuracy_score(y_test, LR.predict(X_test)) )
print ( "RF - Accuracy (Test): %.4g" % metrics.accuracy_score(y_test, RF.predict(X_test)) )
print ( "GNB - Accuracy (Test): %.4g" % metrics.accuracy_score(y_test, GNB.predict(X_test)) )
print ( "Ensemble - Accuracy (Test): %.4g" % metrics.accuracy_score(y_test, Ensemble.predict(X_test)) )
# Output
LR - Accuracy (Train): 0.975
RF - Accuracy (Train): 0.9833
GNB - Accuracy (Train): 0.95
Ensemble - Accuracy (Train): 0.9833
LR - Accuracy (Test): 1
RF - Accuracy (Test): 1
GNB - Accuracy (Test): 1
Ensemble - Accuracy (Test): 1
Multiclass estimators
sklearn.multiclass handles multiclass and multilabel classification problems. Below we use the digits dataset as the example data, first splitting it 80:20 into a training set and a test set.
# Load the data
from sklearn.datasets import load_digits
digits = load_digits()
digits.keys()
# Output
dict_keys(['data', 'target', 'target_names','images', 'DESCR'])
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split( digits['data'], digits['target'], test_size=0.2 )
print( 'The size of X_train is ', X_train.shape )
print( 'The size of y_train is ', y_train.shape )
print( 'The size of X_test is ', X_test.shape )
print( 'The size of y_test is ', y_test.shape )
The size of X_train is (1437, 64) 
The size of y_train is (1437,)
The size of X_test is (360, 64)
The size of y_test is (360,)
fig, axes = plt.subplots( 10, 16, figsize=(8, 8) )
fig.subplots_adjust( hspace=0.1, wspace=0.1 )
for i, ax in enumerate( axes.flat ):
    ax.imshow( X_train[i,:].reshape(8,8), cmap='binary', interpolation='nearest')
    ax.text( 0.05, 0.05, str(y_train[i]),
             transform=ax.transAxes, color='blue')
    ax.set_xticks([])
    ax.set_yticks([])
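(1) Multiclass classification
- One vs One (OvO): one classifier handles digits 0 and 1, one handles digits 0 and 2, one handles digits 1 and 2, and so on. N classes need N(N-1)/2 classifiers.
- One vs All (OvA): train 10 binary classifiers, one per digit; one separates "1" from "not 1", another separates "2" from "not 2", and so on. N classes need N classifiers.
- f1 separates orange samples from green samples
- f2 separates orange samples from purple samples
- f3 separates green samples from purple samples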
from sklearn.multiclass import OneVsOneClassifier 
from sklearn.linear_model import LogisticRegression
ovo_lr = OneVsOneClassifier( LogisticRegression(solver='lbfgs', max_iter=200) )
ovo_lr.fit( X_train, y_train )
OneVsOneClassifier(estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,intercept_scaling=1, max_iter=200, multi_class='warn',n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',tol=0.0001, verbose=0, warm_start=False),n_jobs=None)
print( len(ovo_lr.estimators_) ) 
ovo_lr.estimators_
45 
(LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=200, multi_class='warn', n_jobs=None, penalty='l2', random_state=None, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False),
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=200, multi_class='warn', n_jobs=None, penalty='l2', random_state=None, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False),
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=200, multi_class='warn', n_jobs=None, penalty='l2', random_state=None, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False),
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=200, multi_class='warn', n_jobs=None, penalty='l2', random_state=None, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False),
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
...
print ( "OvO LR - Accuracy (Train): %.4g" % metrics.accuracy_score(y_train, ovo_lr.predict(X_train)) )
print ( "OvO LR - Accuracy (Test): %.4g" % metrics.accuracy_score(y_test, ovo_lr.predict(X_test)) )
# Output
OvO LR - Accuracy (Train): 1
OvO LR - Accuracy (Test): 0.9806
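- Panel 1: the one = orange, the rest = green and purple
- Panel 2: the one = green, the rest = orange and purple
- Panel 3: the one = purple, the rest = orange and green
- f1 predicts the negative class, that is, green and purple
- f2 predicts the negative class, that is, orange and purple
- f3 predicts the positive class, that is, purple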
from sklearn.multiclass import OneVsRestClassifier 
ova_lr = OneVsRestClassifier( LogisticRegression(solver='lbfgs', max_iter=800) )
ova_lr.fit( X_train, y_train )
OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=800, multi_class='warn', n_jobs=None, penalty='l2', random_state=None, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False), n_jobs=None)
print( len(ova_lr.estimators_) ) 
ova_lr.estimators_
10 
[LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=800, multi_class='warn', n_jobs=None, penalty='l2', random_state=None, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False),
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=800, multi_class='warn', n_jobs=None, penalty='l2', random_state=None, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False),
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=800, multi_class='warn', n_jobs=None, penalty='l2', random_state=None, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False),
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=800, multi_class='warn', n_jobs=None, penalty='l2', random_state=None, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False),
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
...
print ( "OvA LR - Accuracy (Train): %.4g" % metrics.accuracy_score(y_train, ova_lr.predict(X_train)) )
print ( "OvA LR - Accuracy (Test): %.4g" % metrics.accuracy_score(y_test, ova_lr.predict(X_test)) )
OvA LR - Accuracy (Train): 0.9993
OvA LR - Accuracy (Test): 0.9639
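(2) Multilabel classification
- Label 1: odd or even
- Label 2: less than or equal to 4, or greater than 4
We build the multilabel target y_train_multilabel as follows (OneVsRestClassifier can also be used for multilabel classification):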
  from sklearn.multiclass import OneVsRestClassifier 
y_train_multilabel = np.c_[y_train%2==0, y_train<=4 ]
print(y_train_multilabel)
[[ True True] [False False] [False False] 
...
[False False] [False False] [False False]]
- [True True]: 4 is even and less than or equal to 4
- [False False]: 5 is not even and is greater than 4
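To see how the two boolean columns are built, here is a small sketch on a hand-picked array of digits. The six digits are illustrative values, not taken from the dataset split.

```python
import numpy as np

# Six example digits (hypothetical, chosen to cover every combination)
y = np.array([4, 5, 2, 7, 3, 8])

# Column 0: is the digit even?  Column 1: is the digit <= 4?
y_multilabel = np.c_[y % 2 == 0, y <= 4]
print(y_multilabel)
```

Each row is one sample and each column is one binary label, which is the target layout OneVsRestClassifier expects for multilabel problems.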
Train the model with y_train_multilabel, as follows:
  ova_ml = OneVsRestClassifier( LogisticRegression(solver='lbfgs', max_iter=800) ) 
ova_ml.fit( X_train, y_train_multilabel )
# Output
OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=800, multi_class='warn', n_jobs=None, penalty='l2', random_state=None, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False), n_jobs=None)
print( len(ova_ml.estimators_) ) 
ova_ml.estimators_
2 
[LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=800, multi_class='warn', n_jobs=None, penalty='l2', random_state=None, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False),
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=800, multi_class='warn', n_jobs=None, penalty='l2', random_state=None, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False)]
fig, axes = plt.subplots( 10, 10, figsize=(8, 8) )
fig.subplots_adjust( hspace=0.1, wspace=0.1 )
for i, ax in enumerate( axes.flat ):
    ax.imshow( X_test[i,:].reshape(8,8), cmap='binary', interpolation='nearest')
    ax.text( 0.05, 0.05, str(y_test[i]), transform=ax.transAxes, color='blue')
    ax.set_xticks([])
    ax.set_yticks([])
print( y_test[:1] ) 
print( ova_ml.predict(X_test[:1,:]) )
[2] 
[[1 1]]
Multioutput estimators
sklearn.multioutput handles multi-output problems.
- MultiOutputRegressor: multi-output regression
- MultiOutputClassifier: multi-output classification
(1) MultiOutputClassifier
from sklearn.multioutput import MultiOutputClassifier 
from sklearn.ensemble import RandomForestClassifier
- Label 1: less than or equal to 4, between 4 and 7, or greater than or equal to 7 (three classes)
- Label 2: the digit itself (ten classes)
y_train_1st = y_train.copy()
y_train_1st[ y_train<=4 ] = 0
y_train_1st[ np.logical_and(y_train>4, y_train<7) ] = 1
y_train_1st[ y_train>=7 ] = 2
y_train_multioutput = np.c_[y_train_1st, y_train]
y_train_multioutput
# Output
array( [[0, 4],
[1, 5],
[2, 7],
[1, 5],
[2, 9],
[2, 9]])
MO = MultiOutputClassifier( RandomForestClassifier(n_estimators=100) ) 
MO.fit( X_train, y_train_multioutput )
# Output
MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None, oob_score=False, random_state=None, verbose=0, warm_start=False), n_jobs=None)
MO.predict( X_test[:5,:] ) 
array([[0, 2],[0, 2],[0, 0],[2, 9],[1, 5]]) 
y_test_1st = y_test.copy() 
y_test_1st[ y_test<=4 ] = 0
y_test_1st[ np.logical_and(y_test>4, y_test<7) ] = 1
y_test_1st[ y_test>=7 ] = 2
y_test_multioutput = np.c_[ y_test_1st, y_test ]
y_test_multioutput[:5]
array([[0, 2],[0, 2],[0, 0],[2, 9],[1, 5]]) 
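Model Selection estimators
- cross_validate: evaluates a model by cross-validation.
- learning_curve: computes the data for a learning curve.
- GridSearchCV: uses cross-validation to search a grid of candidate hyperparameters for the best combination.
- RandomizedSearchCV: uses cross-validation to search a random sample of hyperparameter settings for the best combination.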
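The article's code concentrates on the two search classes; for completeness, here is a minimal sketch of cross_validate. The choice of logistic regression on iris and of 5 folds is an assumption made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation; also record the score on each training fold
cv_results = cross_validate(LogisticRegression(max_iter=1000), X, y,
                            cv=5, return_train_score=True)

print(sorted(cv_results.keys()))
print("mean test score: ", cv_results["test_score"].mean())
print("mean train score:", cv_results["train_score"].mean())
```

A large gap between the mean train and test scores is a common sign of overfitting, which is what hyperparameter search then tries to reduce.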
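This section focuses on GridSearchCV and RandomizedSearchCV. We first review cross-validation (for a fuller explanation see the ShowMeAI article "Illustrated Machine Learning | Model Evaluation Methods and Criteria").
(1) Cross-validation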
from time import time
from scipy.stats import randint
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
X, y = digits.data, digits.target
RFC = RandomForestClassifier(n_estimators=20)
# Randomized search
param_dist = { "max_depth": [3, 5],
"max_features": randint(1, 11),
"min_samples_split": randint(2, 11),
"criterion": ["gini", "entropy"]}
n_iter_search = 20
random_search = RandomizedSearchCV( RFC, param_distributions=param_dist, n_iter=n_iter_search, cv=5 )
start = time()
random_search.fit(X, y)
print("RandomizedSearchCV took %.2f seconds for %d candidate parameter settings." % ((time() - start), n_iter_search))
print( random_search.best_params_ )
print( random_search.best_score_ )
# Grid search
param_grid = { "max_depth": [3, 5],
"max_features": [1, 3, 10],
"min_samples_split": [2, 3, 10],
"criterion": ["gini", "entropy"]}
grid_search = GridSearchCV( RFC, param_grid=param_grid, cv=5 )
start = time()
grid_search.fit(X, y)
print("\nGridSearchCV took %.2f seconds for %d candidate parameter settings." % (time() - start, len(grid_search.cv_results_['params'])))
print( grid_search.best_params_ )
print( grid_search.best_score_ )
RandomizedSearchCV took 3.73 seconds for 20 candidate parameter settings.
{'criterion': 'entropy', 'max_depth': 5, 'max_features': 6, 'min_samples_split': 4}
0.8898163606010017
GridSearchCV took 2.30 seconds for 36 candidate parameter settings.
{'criterion': 'entropy', 'max_depth': 5, 'max_features': 10, 'min_samples_split': 10}
0.8414023372287145
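- The first five lines import the required libraries.
- Next, the data X and y are prepared and a random forest model with 20 decision trees is created.
- The param_dist and param_grid dictionaries define the candidate parameter distributions and the parameter grid for four random forest hyperparameters: maximum tree depth, maximum number of features, minimum samples required to split, and split criterion.
- The random_search lines run the randomized search.
- The grid_search lines run the grid search.
- The first line of each result reports how many candidate settings the search method tried and how long it took.
- The second line reports the best hyperparameter combination.
- The third line reports the best score.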
Pipeline estimators
(1) Pipeline
- If the last estimator is a predictor, the Pipeline is a predictor.
- If the last estimator is a transformer, the Pipeline is a transformer.
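The example that follows builds a pipeline ending in a transformer. For the other case, here is a hedged sketch of a pipeline whose last step is a predictor; the step names "scale" and "clf" and the choice of StandardScaler plus logistic regression are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# The last step is a classifier, so the whole pipeline acts as a predictor.
clf_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
clf_pipe.fit(X_train, y_train)   # fits the scaler, then the classifier
y_pred = clf_pipe.predict(X_test)
test_accuracy = clf_pipe.score(X_test, y_test)

print(test_accuracy)
print(hasattr(clf_pipe, "predict"), hasattr(clf_pipe, "transform"))
```

Calling fit on the pipeline fits the scaler on the training data only and applies the same scaling inside predict, which keeps test data out of the fitting step.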
X = np.array([[50,40,30,5,7,10,9,np.nan,12],
[1.68,1.83,1.77,np.nan,1.9,1.65,1.88,np.nan,1.75]])
X = np.transpose(X)
- SimpleImputer, a transformer that handles missing values.
- MinMaxScaler, a transformer that performs normalization.
from sklearn.pipeline import Pipeline 
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
pipe = Pipeline([
('impute', SimpleImputer(missing_values=np.nan, strategy='mean')),
('normalize', MinMaxScaler())])
Pass Pipeline() a list of (name, estimator) tuples that make up the pipeline. In this example SimpleImputer is named impute and MinMaxScaler is named normalize.
X_proc = pipe.fit_transform( X )
X_impute = SimpleImputer(missing_values=np.nan, strategy='mean').fit_transform( X )
X_impute
# Output
array( [[50, 1.68],
[40, 1.83],
[30, 1.77],
[5, 1.78],
[7, 1.9 ],
[10, 1.65],
[9, 1.88],
[20.375, 1.78],
[12, 1.75 ]])
X_normalize = MinMaxScaler().fit_transform( X_impute ) 
X_normalize
array( [[1., 0.12 ],
[0.77777778, 0.72],
[0.55555556, 0.48],
[0., 0.52],
[0.04444444, 1.],
[0.11111111, 0.],
[0.08888889, 0.92],
[0.34166667, 0.52],
[0.15555556, 0.4 ]])
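(2) FeatureUnion
- The first two columns, IQ and temper, are categorical variables.
- The last two columns, income and height, are numerical variables.
- Every column contains missing values.
d = { 'IQ' : ['high','avg','avg','low','high','avg','high','high',None],
'temper' : ['good', None,'good', 'bad', 'bad','bad', 'bad', None, 'bad'],
'income' : [50,40,30,5,7,10,9,np.nan,12],
'height' : [1.68,1.83,1.77,np.nan,1.9,1.65,1.88,np.nan,1.75]}
X = pd.DataFrame( d )
X
- For categorical variables: select the data → fill with the most frequent value → one-hot encode
- For numerical variables: select the data → fill with the mean → min-max normalize
We first define a custom transformer, DataFrameSelector, that picks the given columns out of a DataFrame.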
from sklearn.base import BaseEstimator, TransformerMixin
class DataFrameSelector( BaseEstimator, TransformerMixin ):
    def __init__( self, attribute_names ):
        self.attribute_names = attribute_names
    def fit( self, X, y=None ):
        return self
    def transform( self, X ):
        return X[self.attribute_names].values
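The full pipeline, full_pipe, runs two pipelines in parallel:
- categorical_pipe handles the categorical variables
  - DataFrameSelector selects the columns
  - SimpleImputer fills None with the most frequent value
  - OneHotEncoder encodes and returns a dense (non-sparse) matrix
- numeric_pipe handles the numerical variables
  - DataFrameSelector selects the columns
  - SimpleImputer fills NaN with the mean
  - MinMaxScaler (the normalize step) normalizes the values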
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
categorical_features = ['IQ', 'temper']
numeric_features = ['income', 'height']
categorical_pipe = Pipeline([
('select', DataFrameSelector(categorical_features)),
('impute', SimpleImputer(missing_values=None, strategy='most_frequent')),
# sparse_output replaces the older sparse argument (scikit-learn 1.2 and later)
('one_hot_encode', OneHotEncoder(sparse_output=False))])
numeric_pipe = Pipeline([
('select', DataFrameSelector(numeric_features)),
('impute', SimpleImputer(missing_values=np.nan, strategy='mean')),
('normalize', MinMaxScaler())])
full_pipe = FeatureUnion( transformer_list=[
('numeric_pipe', numeric_pipe),
('categorical_pipe', categorical_pipe)])
X_proc = full_pipe.fit_transform( X )
print( X_proc )
[[1. 0.12 0. 1. 0. 0. 1. ] 
[0.77777778 0.72 1. 0. 0. 1. 0. ]
[0.55555556 0.48 1. 0. 0. 0. 1. ]
[0. 0.52 0. 0. 1. 1. 0. ]
[0.04444444 1. 0. 1. 0. 1. 0. ]
[0.11111111 0. 1. 0. 0. 1. 0. ]
[0.08888889 0.92 0. 1. 0. 1. 0. ]
[0.34166667 0.52 0. 1. 0. 1. 0. ]
[0.15555556 0.4 0. 1. 0. 1. 0. ]]
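The same preprocessing can also be expressed with ColumnTransformer from sklearn.compose, which selects DataFrame columns by name and so needs no custom selector class. This is an alternative sketch, not the article's method; it assumes missing categorical values are stored as np.nan rather than None, and uses sparse_threshold=0 to force a dense result.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

X_df = pd.DataFrame({
    "IQ": ["high", "avg", "avg", "low", "high", "avg", "high", "high", np.nan],
    "temper": ["good", np.nan, "good", "bad", "bad", "bad", "bad", np.nan, "bad"],
    "income": [50, 40, 30, 5, 7, 10, 9, np.nan, 12],
    "height": [1.68, 1.83, 1.77, np.nan, 1.9, 1.65, 1.88, np.nan, 1.75],
})

numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("normalize", MinMaxScaler()),
])
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("one_hot_encode", OneHotEncoder()),
])

# Each entry is (name, transformer, columns to apply it to).
preprocess = ColumnTransformer(
    transformers=[
        ("numeric", numeric_steps, ["income", "height"]),
        ("categorical", categorical_steps, ["IQ", "temper"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_out = preprocess.fit_transform(X_df)
print(X_out.shape)
print(X_out[0])
```

The output columns follow the order of the transformers list: two scaled numeric columns, then three one-hot columns for IQ and two for temper.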
5. Summary
The five SKLearn principles
(1) Consistency
- Create: model = Constructor(hyperparam)
- Fit:
  - Supervised learning: model.fit(X_train, y_train)
  - Unsupervised learning: model.fit(X_train)
- Predicting labels in supervised learning: y_pred = model.predict(X_test)
- Identifying patterns in unsupervised learning: idx_pred = model.predict(X_test)
- Create: trm = Constructor(hyperparam)
- Fit: trm.fit(X_train)
- Transform: X_trm = trm.transform(X_train)
(2) Inspection
- General form: model.hyperparameter
- Example: SVC.kernel
- General form: model.parameter_
- Example: SVC.support_vectors_
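A short sketch of the inspection principle using SVC: hyperparameters are plain attributes available right after construction, while learned parameters (with a trailing underscore) exist only after fit. The kernel and C values here are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

svc = SVC(kernel="linear", C=1.0)

# Hyperparameters are readable immediately after construction.
print(svc.kernel)
print(svc.get_params()["C"])

# Learned parameters do not exist until the model has been fitted.
fitted_before = hasattr(svc, "support_vectors_")
svc.fit(X, y)
fitted_after = hasattr(svc, "support_vectors_")

print(fitted_before, fitted_after)
print(svc.support_vectors_.shape)
```

get_params() returns every hyperparameter as a dictionary, which is the same interface GridSearchCV uses to set candidate values.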
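(3) Standard classes
(4) Composition
- Any sequence of transformers
- Any sequence of transformers + an estimator
(5) Sensible defaults
The SKLearn workflow
(1) Define the task
(2) Preprocess the data
(3) Train and evaluate
Train by fitting with the estimator's fit() first, then evaluate using the predictor's predict().
(4) Select a model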
References
[1] SKLearn Introduction and Simple Application Cases: https://www.showmeai.tech/article-detail/202
[2] SKLearn official site: https://scikit-learn.org/stable/
[3] AI Modeling Tool Cheat Sheet | Scikit-learn Usage Guide: https://www.showmeai.tech/article-detail/108
[4] Illustrated Machine Learning | Machine Learning Basics: https://www.showmeai.tech/article-detail/185
[5] Illustrated Machine Learning | Model Evaluation Methods and Criteria: https://www.showmeai.tech/article-detail/186
[6] Python Machine Learning Algorithms in Practice: https://www.showmeai.tech/article-detail/201
[7] Machine Learning Evaluation and Metrics: https://www.showmeai.tech/article-detail/186
[8] Supervised learning: classification: https://scikit-learn.org/stable/supervised_learning.html#supervised-learning
[9] Supervised learning: regression: https://scikit-learn.org/stable/supervised_learning.html#supervised-learning
[10] Unsupervised learning: clustering: https://scikit-learn.org/stable/modules/clustering.html#clustering
[11] Unsupervised learning: dimensionality reduction: https://scikit-learn.org/stable/modules/decomposition.html#decompositions
[12] Model selection: https://scikit-learn.org/stable/model_selection.html#model-selection
[13] Data preprocessing: https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing
[14] Dataset loading module: https://scikit-learn.org/stable/datasets.html
[15] Online Jupyter environment: https://jupyter.org/try
[16] seaborn Tools and Data Visualization Tutorial: https://www.showmeai.tech/article-detail/151
[17] Clustering: https://www.showmeai.tech/article-detail/197
