Outlier Detection Algorithms: Unsupervised Methods

2022-10-02 20:17:01  Source: 数学家

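Unsupervised learning algorithms

Unsupervised algorithms apply when the data contains both normal and anomalous samples and none of them are labeled. This kind of anomaly detection is also called outlier detection: the training data contains outliers, that is, observations that lie far from the other (inlier) points. An outlier detection estimator tries to fit the region where the training inliers are concentrated and ignores the deviant observations.

The main algorithms of this type are:

IsolationForest

DBSCAN

Local Outlier Factor (LOF)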

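Data preparation

Note: because we are training models, labels have been added to the data by hand. They are there only so that the models' predictions can be evaluated.

Training set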

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl

# Anonymized data; change the path to use your own copy
data = pd.read_csv("E:/bladed/test26-abnormal-train.csv")

X = data[data.columns[:-1]]  # every column except the last is a feature
y = data['target']           # label column

print(X.shape)
print(y.shape)
print(y.value_counts())  # 1 = normal, -1 = anomalous, ratio 9:1

Test set

test = pd.read_csv("E:/bladed/test26_test_13.csv")
X_test = test[test.columns[:-1]]
y_test = test['target']
print(X_test.shape)
print(y_test.shape)
print(y_test.value_counts())  # 1 = normal, -1 = anomalous

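# Standardize the features
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training set only and reuse it for the test set,
# so that both sets are scaled with the same mean and variance
scaler = StandardScaler().fit(X)
X = scaler.transform(X)
X_test = scaler.transform(X_test)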

1. IsolationForest

1.1 IsolationForest training

# Fit IsolationForest on the training set
from sklearn.ensemble import IsolationForest

IF = IsolationForest(n_estimators=900, max_samples='auto', contamination='auto',
                     max_features=1.0, bootstrap=False, n_jobs=-1,
                     random_state=42, verbose=0, warm_start=False).fit(X)

# Put the predictions next to the true labels for comparison
IF_predict = IF.predict(X)
train_if = pd.DataFrame(IF_predict, columns=["predict"])
train_if["y"] = y
train_if

1.2 IsolationForest training evaluation

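Each cell is written as (actual, predicted), with the anomaly label -1 as the positive class:

              Predicted -1     Predicted 1
Actual -1     TP (-1, -1)      FN (-1, 1)
Actual 1      FP (1, -1)       TN (1, 1)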

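classification_report prints the precision, recall, and F1-score for each class label.

Precision (`precision`): of all samples predicted positive, the fraction that are actually positive.

Recall (`recall`): of all samples that are actually positive, the fraction that are predicted positive.

`F1-score`: the harmonic mean of precision and recall.

It also reports overall micro, macro, and weighted averages.

Micro average: the metric computed globally over all samples pooled together. For a single-label problem like this one it equals the accuracy, which is the row the report shows.

Macro average: the unweighted mean of the per-label metrics.

Weighted average: the mean of the per-label metrics, weighted by each label's support (its number of true samples).

In binary classification, the recall of the positive label is called sensitivity and the recall of the negative label is called specificity.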
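As a minimal sketch of how these quantities fall out of the confusion matrix, the example below uses a small made-up label vector (not the article's data) with -1 taken as the positive class. It computes precision, recall, specificity, and F1 by hand and prints scikit-learn's values for comparison. The toy labels and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Toy labels, made up for illustration: -1 = anomalous (positive class), 1 = normal
y_true = np.array([-1, -1, -1, 1, 1, 1, 1, 1, 1, 1])
y_pred = np.array([-1, -1, 1, -1, 1, 1, 1, 1, 1, 1])

# With labels=[-1, 1], row/column 0 is the anomaly class, so the matrix reads
# [[TP, FN], [FP, TN]] when -1 is the positive class
(tp, fn), (fp, tn) = confusion_matrix(y_true, y_pred, labels=[-1, 1])

precision = tp / (tp + fp)     # correct anomaly predictions / all anomaly predictions
recall = tp / (tp + fn)        # sensitivity: correct anomaly predictions / all true anomalies
specificity = tn / (tn + fp)   # recall of the negative (normal) class
f1 = 2 * precision * recall / (precision + recall)

print("by hand:     ", precision, recall, specificity, f1)
# The same precision, recall and F1 from scikit-learn, with the positive label stated explicitly
print("scikit-learn:", precision_score(y_true, y_pred, pos_label=-1),
      recall_score(y_true, y_pred, pos_label=-1),
      f1_score(y_true, y_pred, pos_label=-1))
```

Passing pos_label=-1 matters: by default the scalar metric functions treat label 1 as positive, which here would describe the normal class instead of the anomalies.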

from sklearn.metrics import confusion_matrix       # confusion matrix
from sklearn.metrics import accuracy_score         # accuracy
from sklearn.metrics import precision_score        # precision
from sklearn.metrics import recall_score           # recall
from sklearn.metrics import f1_score               # F1-score
from sklearn.metrics import classification_report

print("Confusion matrix:")
print(confusion_matrix(y, IF_predict))
print("Normalized confusion matrix:")
print(confusion_matrix(y, IF_predict, normalize='true'))
print("Detailed scores:")
print(classification_report(y, IF_predict))

from sklearn.metrics import roc_curve       # ROC curve
from sklearn.metrics import roc_auc_score   # area under the ROC curve (AUC)

# Higher decision_function values mean "more normal", so label 1 acts as the positive class here
area = roc_auc_score(y, IF.decision_function(X))
FPR, recall, thresholds = roc_curve(y, IF.decision_function(X))

plt.figure()
plt.plot(FPR, recall, color='red', label='ROC curve (area = %0.2f)' % area)
plt.plot([0, 1], [0, 1], color='black', linestyle='--')
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('Recall')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

1.3 IsolationForest test evaluation

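Each cell is written as (actual, predicted), with the anomaly label -1 as the positive class:

              Predicted -1     Predicted 1
Actual -1     TP (-1, -1)      FN (-1, 1)
Actual 1      FP (1, -1)       TN (1, 1)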

IF_test = IF.predict(X_test)                          # predicted labels on the test set
IF_decision_function = IF.decision_function(X_test)   # decision function values on the test set
IF_score_samples = IF.score_samples(X_test)           # per-sample anomaly scores on the test set

# Collect everything in a DataFrame for display
IF_tests = pd.DataFrame(IF_test, columns=["predict"])
IF_tests["y_test"] = y_test
IF_tests["decision_function"] = IF_decision_function
IF_tests["score_samples"] = IF_score_samples
IF_tests
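The three outputs above are closely related. As a minimal sketch on synthetic stand-in data (an assumption, since the article's CSV files are not available), the example below checks that decision_function is score_samples shifted by the fitted offset_, and that predict returns -1 exactly where the decision function is negative. The names X_demo and model are hypothetical.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in data (assumption): a dense cloud plus a few far-away points
rng = np.random.default_rng(42)
X_demo = np.vstack([rng.normal(0, 1, size=(200, 2)),
                    rng.uniform(6, 8, size=(10, 2))])

model = IsolationForest(n_estimators=100, contamination='auto', random_state=42).fit(X_demo)

scores = model.score_samples(X_demo)        # lower = more anomalous
decision = model.decision_function(X_demo)  # score_samples shifted by offset_
pred = model.predict(X_demo)                # -1 where decision < 0, otherwise 1

print("offset_:", model.offset_)            # -0.5 when contamination='auto'
print("decision == scores - offset_:", np.allclose(decision, scores - model.offset_))
print("predict follows the sign rule:", np.array_equal(pred, np.where(decision < 0, -1, 1)))
```

Because of this relation, changing contamination only moves the threshold (offset_); the ranking of samples by score_samples stays the same.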

# Evaluation metrics on the test set

print("Test confusion matrix:")
print(confusion_matrix(y_test, IF_test))
print("Test normalized confusion matrix:")
print(confusion_matrix(y_test, IF_test, normalize='true'))
print("Detailed scores:")
print(classification_report(y_test, IF_test))

from sklearn.metrics import roc_curve       # ROC curve
from sklearn.metrics import roc_auc_score   # area under the ROC curve (AUC)

area = roc_auc_score(y_test, IF.decision_function(X_test))
FPR, recall, thresholds = roc_curve(y_test, IF.decision_function(X_test))

2. DBSCAN

2.1 DBSCAN training

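Take each point as a center and define a neighborhood around it, together with the number of points required inside that neighborhood. If the neighborhood contains at least the required number of points, the point is a core point and belongs to the same cluster as the points in its neighborhood. If it contains fewer, but the point lies inside the neighborhood of a core point, it is a border point. Points that are neither are noise: DBSCAN labels them -1, and they are the ones treated as anomalies here.

Two parameters are set:

eps is the radius of the neighborhood.

min_samples is the minimum number of points in the neighborhood (the point itself included) for a point to count as a core point.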
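To make the roles of eps and min_samples concrete, here is a minimal sketch on synthetic data (an assumption, not the article's data set): two tight groups of points plus three isolated points. The parameter values and variable names are chosen only for this illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two tight groups: every pair inside a group is closer than eps=0.5
blob_a = rng.uniform(-0.15, 0.15, size=(50, 2))
blob_b = rng.uniform(-0.15, 0.15, size=(50, 2)) + 5.0
# Three isolated points with no other point within eps
isolated = np.array([[2.5, 2.5], [-3.0, 4.0], [8.0, -1.0]])
X_demo = np.vstack([blob_a, blob_b, isolated])

db = DBSCAN(eps=0.5, min_samples=5).fit(X_demo)

# Cluster labels are 0, 1, ...; noise points get -1
print("labels and counts:", np.unique(db.labels_, return_counts=True))
print("number of core points:", len(db.core_sample_indices_))
print("labels of the isolated points:", db.labels_[-3:])
```

A smaller eps or a larger min_samples turns more points into noise, which is why the two values have to be tuned together against the scale of the (standardized) features.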

from sklearn.cluster import DBSCAN
DB = DBSCAN(eps=0.9, min_samples=5, metric='euclidean', metric_params=None,
            algorithm='auto', leaf_size=30, p=None, n_jobs=-1).fit(X)

labels = DB.labels_      # cluster labels; -1 marks noise (unclustered) points
labels[labels >= 0] = 1  # map every cluster to 1 (normal); noise stays -1 (anomalous)

# Put the predictions next to the true labels for comparison
train_DB = pd.DataFrame(labels, columns=["predict"])
train_DB["y"] = y
train_DB

train_DB["predict"].value_counts()

print("Confusion matrix:")
print(confusion_matrix(y, labels))
print("Normalized confusion matrix:")
print(confusion_matrix(y, labels, normalize='true'))
print("Detailed scores:")
print(classification_report(y, labels))

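2.2 DBSCAN test evaluation

# DBSCAN has no predict method, so the model is re-fitted on the test set
labels_test = DB.fit_predict(X_test)
labels_test[labels_test >= 0] = 1  # map every cluster to 1 (normal); noise stays -1 (anomalous)

# Evaluation metrics on the test set

print("Test confusion matrix:")
print(confusion_matrix(y_test, labels_test))
print("Test normalized confusion matrix:")
print(confusion_matrix(y_test, labels_test, normalize='true'))
print("Detailed scores:")
print(classification_report(y_test, labels_test))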
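The fit_predict call above clusters the test set from scratch, because scikit-learn's DBSCAN has no predict method. If you would rather reuse the clusters found on the training set, one common workaround is to give each new point the label of its nearest core sample when that sample is within eps, and -1 otherwise. The helper dbscan_assign below is hypothetical (it is not part of scikit-learn and is not the article's method), and the data is synthetic.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def dbscan_assign(db, X_new):
    """Hypothetical helper: label new points using an already fitted DBSCAN model."""
    core_labels = db.labels_[db.core_sample_indices_]          # cluster of each core sample
    nn = NearestNeighbors(n_neighbors=1).fit(db.components_)   # components_ = core samples
    dist, idx = nn.kneighbors(X_new)
    # Within eps of a core sample: inherit its cluster; otherwise treat as noise
    return np.where(dist[:, 0] <= db.eps, core_labels[idx[:, 0]], -1)

# Synthetic training data (assumption): two tight groups
rng = np.random.default_rng(0)
X_train_demo = np.vstack([rng.uniform(-0.15, 0.15, size=(50, 2)),
                          rng.uniform(-0.15, 0.15, size=(50, 2)) + 5.0])
db = DBSCAN(eps=0.5, min_samples=5).fit(X_train_demo)

X_new = np.array([[0.05, 0.05], [5.1, 4.9], [2.5, 2.5]])
new_labels = dbscan_assign(db, X_new)
print("cluster labels:", new_labels)
print("as 1 / -1:", np.where(new_labels >= 0, 1, -1))  # same convention as the article
```

This is only an approximation of what DBSCAN would do if the new points had been part of the fit, since new points can never become core samples or merge clusters.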

3. LOF

3.1 LOF training

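LOF computes a numeric score that reflects how anomalous a sample is. Roughly, the score is:

the average density at the positions of a sample's neighbors, divided by the density at the position of the sample itself. The further this ratio rises above 1, the lower the density around the sample compared with the density around its neighbors, and the more likely the sample is an outlier.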
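As a small illustration of this ratio, the sketch below fits LOF on synthetic data (an assumption: a regular 5x5 grid plus one far-away point) and reads the LOF values from negative_outlier_factor_, which stores them with the sign flipped. The variable names are chosen only for this example.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data (assumption): a regular 5x5 grid plus one point far from it
grid = np.array([[i, j] for i in range(5) for j in range(5)], dtype=float)
X_demo = np.vstack([grid, [[10.0, 10.0]]])

lof = LocalOutlierFactor(n_neighbors=5)
pred = lof.fit_predict(X_demo)               # 1 = inlier, -1 = outlier
lof_values = -lof.negative_outlier_factor_   # flip the sign to get the LOF ratio itself

print("LOF of a grid point:    ", lof_values[12])   # the center of the grid
print("LOF of the far point:   ", lof_values[-1])
print("label of the far point: ", pred[-1])
print("index of the largest LOF:", lof_values.argmax())
```

Grid points sit in surroundings as dense as their own, so their ratio stays close to 1, while the far point's ratio is several times larger.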

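Method               Outlier detection (novelty=False)    Novelty detection (novelty=True)
fit_predict          Available                            Not available
predict              Not available                        Use only on new data
decision_function    Not available                        Use only on new data
score_samples        Use negative_outlier_factor_         Use only on new data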
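The right-hand column of the table can be sketched as follows: with novelty=True the model is fitted once on training data and then predict, decision_function, and score_samples are applied to new points only. The data here is synthetic (an assumption: a regular 5x5 grid as training set and two hand-picked new points), and the names are hypothetical.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic training data (assumption): a regular 5x5 grid
X_train_demo = np.array([[i, j] for i in range(5) for j in range(5)], dtype=float)

# novelty=True: fit on the training data, then score unseen data
lof_novelty = LocalOutlierFactor(n_neighbors=5, novelty=True).fit(X_train_demo)

# New points: one inside the grid, one far away from it
X_new = np.array([[2.1, 2.1], [10.0, 10.0]])

print("predict:          ", lof_novelty.predict(X_new))            # 1 = inlier, -1 = outlier
print("decision_function:", lof_novelty.decision_function(X_new))  # negative = outlier
print("score_samples:    ", lof_novelty.score_samples(X_new))      # negated LOF of each new point
```

Applying these methods to the training data itself would give misleading scores, because each training point would count itself among its own neighbors. That is why the table restricts them to new data.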

from sklearn.neighbors import LocalOutlierFactor
LOF = LocalOutlierFactor(n_neighbors=70, algorithm='auto', leaf_size=5, metric='minkowski', p=2,
                         metric_params=None, contamination='auto', novelty=False, n_jobs=-1)
LOF_predict = LOF.fit_predict(X)  # fits the model and labels the training samples

print("Confusion matrix:")
print(confusion_matrix(y, LOF_predict))
print("Normalized confusion matrix:")
print(confusion_matrix(y, LOF_predict, normalize='true'))
print("Detailed scores:")
print(classification_report(y, LOF_predict))

# negative_outlier_factor_ is higher for more normal samples, so label 1 acts as the positive class
area = roc_auc_score(y, LOF.negative_outlier_factor_)
FPR, recall, thresholds = roc_curve(y, LOF.negative_outlier_factor_)

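3.2 LOF test evaluation

# With novelty=False LOF has no predict method, so the model is re-fitted on the test set
LOF_test = LOF.fit_predict(X_test)

print("Confusion matrix:")
print(confusion_matrix(y_test, LOF_test))
print("Normalized confusion matrix:")
print(confusion_matrix(y_test, LOF_test, normalize='true'))
print("Detailed scores:")
print(classification_report(y_test, LOF_test))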

Summary

The main algorithms of this type, with how each fared on this data:

IsolationForest (best)

DBSCAN (hard to tune)

Local Outlier Factor (LOF) (poor results)
