numpy How to match K-Means cluster labels with true labels in Python

jvidinwx  posted 5 months ago  in  Python

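I have a problem labeling data with the K-Means algorithm. My test sentence is assigned to the correct cluster, but I don't get the correct label. I used numpy to match the clusters with true_test_labels, but K-Means can permute the cluster ids between runs, so the true labels don't line up with the cluster numbers. I need help with this problem. Here is my code: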

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans
from nltk.corpus import stopwords 
from nltk.stem.wordnet import WordNetLemmatizer
import string
import re
import numpy as np
from collections import Counter

stop = set(stopwords.words('indonesian'))
exclude = set(string.punctuation) 
lemma = WordNetLemmatizer()

# Cleaning the text sentences so that punctuation marks, stop words & digits are removed  
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    processed = re.sub(r"\d+","",normalized)
    y = processed.split()
    #print (y)
    return y

path = "coba.txt"

train_clean_sentences = []
# Read the training sentences; the with-block closes the file afterwards
with open(path,'r') as fp:
    for line in fp:
        line = line.strip()
        cleaned = clean(line)
        cleaned = ' '.join(cleaned)
        train_clean_sentences.append(cleaned)

#print(train_clean_sentences)
       
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_clean_sentences)

# Clustering the training 30 sentences with K-means technique
modelkmeans = KMeans(n_clusters=3, init='k-means++', max_iter=200, n_init=100)
modelkmeans.fit(X)

teks_satu = "Aplikasi Machine Learning untuk mengenali daun mangga dengan metode CNN"

test_clean_sentence = []

cleaned_test = clean(teks_satu)
cleaned = ' '.join(cleaned_test)
cleaned = re.sub(r"\d+","",cleaned)
test_clean_sentence.append(cleaned)
    
Test = vectorizer.transform(test_clean_sentence) 

true_test_labels = ['AI','VR','Sistem Informasi']

predicted_labels_kmeans = modelkmeans.predict(Test)
print(predicted_labels_kmeans)

print ("\n-------------------------------PREDICTIONS BY K-Means--------------------------------------")
print ("\nIndex of Virtual Reality : ",Counter(modelkmeans.labels_[5:10]).most_common(1)[0][0])
print ("Index of Machine Learning : ",Counter(modelkmeans.labels_[0:5]).most_common(1)[0][0]) 
print ("Index of Sistem Informasi : ",Counter(modelkmeans.labels_[10:15]).most_common(1)[0][0])
# np.int was removed in NumPy 1.24; use the built-in int on the single predicted cluster id
print ("\n",teks_satu,":",true_test_labels[int(predicted_labels_kmeans[0])],":",predicted_labels_kmeans)

0vvn1miw1#

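I had the same problem: my clustering (kmeans) returned class ids (cluster numbers) that differed from the true classes. As a result, the true labels and the predicted labels did not match. The solution that worked for me was this code (scroll to "Permutation maximizing the sum of the diagonal elements"). Although this approach works well, I think there may be cases where it gives the wrong result.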
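
A permutation that maximizes the sum of the diagonal elements can be found as a linear assignment problem. The following is a minimal sketch, not the linked code: it assumes SciPy 1.4 or later (for the `maximize` argument of `scipy.optimize.linear_sum_assignment`), the helper name `match_clusters_to_labels` is hypothetical, and the 3x3 confusion matrix is made up for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_clusters_to_labels(cm):
    """Return an array whose entry c is the true label assigned to cluster c.

    cm is a square confusion matrix: rows are true labels,
    columns are cluster ids. (Hypothetical helper, for illustration.)
    """
    # One-to-one pairing of rows (true labels) and columns (clusters)
    # that maximizes the total number of matched samples
    true_idx, cluster_idx = linear_sum_assignment(cm, maximize=True)
    mapping = np.empty(cm.shape[1], dtype=int)
    mapping[cluster_idx] = true_idx
    return mapping

# Made-up confusion matrix: rows = true labels, columns = cluster ids
cm = np.array([[ 1, 48,  1],
               [ 2,  0, 48],
               [47,  3,  0]])

cluster_to_label = match_clusters_to_labels(cm)
print(cluster_to_label)

# Translate raw cluster ids into true-label ids
y_pred = np.array([1, 1, 2, 0])
print(cluster_to_label[y_pred])
```

Because every true label is paired with exactly one cluster, the mapping is always a permutation, so two clusters can never end up with the same label. This only applies when the number of clusters equals the number of classes.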

j91ykkif2#

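Here is a concrete example showing how to match KMeans cluster ids with the training data labels. The basic idea is that, if the clustering is correct, the confusion_matrix should have large values on its diagonal. Here is the confusion matrix before the cluster ids are associated with the training labels: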

cm = 
array([[  0, 395,   0,   5,   0],
       [  0,   2,   5, 391,   2],
       [  2,   0,   0,   0, 398],
       [  0,   0, 400,   0,   0],
       [398,   0,   0,   0,   2]])

Now we just need to reorder the confusion matrix so that its large values end up on the diagonal.

cm_argmax = cm.argmax(axis=0)
cm_argmax
y_pred_ = np.array([cm_argmax[i] for i in y_pred])


Here is the new confusion matrix; it looks familiar now, right?

cm_ = 
array([[395,   5,   0,   0,   0],
       [  2, 391,   2,   5,   0],
       [  0,   0, 398,   0,   2],
       [  0,   0,   0, 400,   0],
       [  0,   0,   2,   0, 398]])


You can further verify the result with accuracy_score:

y_pred_ = np.array([cm_argmax[i] for i in y_pred])
accuracy_score(y,y_pred_)
# 0.991


The complete standalone code is here:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import confusion_matrix,accuracy_score
blob_centers = np.array(
    [[ 0.2,  2.3],
     [-1.5 ,  2.3],
     [-2.8,  1.8],
     [-2.8,  2.8],
     [-2.8,  1.3]])
blob_std = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
X, y = make_blobs(n_samples=2000, centers=blob_centers,
                  cluster_std=blob_std, random_state=7)

def plot_clusters(X, y=None):
    plt.scatter(X[:, 0], X[:, 1], c=y, s=1)
    plt.xlabel("$x_1$", fontsize=14)
    plt.ylabel("$x_2$", fontsize=14, rotation=0)

plt.figure(figsize=(8, 4))
plot_clusters(X)
plt.show()

k = 5
kmeans = KMeans(n_clusters=k, random_state=42)
y_pred = kmeans.fit_predict(X)
cm = confusion_matrix(y, y_pred)
cm
cm_argmax = cm.argmax(axis=0)
cm_argmax
y_pred_ = np.array([cm_argmax[i] for i in y_pred])
# Recompute the confusion matrix with the remapped predictions
cm_ = confusion_matrix(y, y_pred_)
cm_
accuracy_score(y,y_pred_)

64jmpszr3#

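Albert G Lieu's solution is great and helped me a lot, but if the argmax of the confusion matrix gives the same result for more than one column, you can end up with duplicate index values.
This part: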

cm_argmax = cm.argmax(axis=0)
cm_argmax
y_pred_ = np.array([cm_argmax[i] for i in y_pred])

should be changed to:

cm_argmax = cm.argmax(axis=0)

# Find the duplicate value
duplicate_value = None
for value in cm_argmax:
    if np.count_nonzero(cm_argmax == value) > 1:
        duplicate_value = value
        break

# Find the missing value
missing_value = None
for i in range(len(cm_argmax)):
    if i not in cm_argmax:
        missing_value = i
        break

# Replace one of the duplicate values with the missing value at the correct index
corrected_cm_argmax = np.copy(cm_argmax)
for i, value in enumerate(cm_argmax):
    if value == duplicate_value:
        corrected_cm_argmax[i] = missing_value
        break

corrected_cm_argmax

# Use the corrected mapping, not the original one that contains the duplicate
y_pred_ = np.array([corrected_cm_argmax[i] for i in y_pred])

wlzqhblo4#

You can assign to each cluster the majority true label among the samples in that cluster.
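
A minimal sketch of this majority-vote idea, assuming you have the cluster id and the known label for each training sample. The helper name `majority_label_per_cluster` and the sample data are made up for illustration; the label strings are borrowed from the question.

```python
from collections import Counter
import numpy as np

def majority_label_per_cluster(cluster_ids, true_labels):
    """Map each cluster id to the most common true label among its members.

    (Hypothetical helper, for illustration.)
    """
    mapping = {}
    for c in np.unique(cluster_ids):
        members = [lab for cid, lab in zip(cluster_ids, true_labels) if cid == c]
        mapping[int(c)] = Counter(members).most_common(1)[0][0]
    return mapping

# Made-up data: cluster ids as KMeans might assign them, plus the known labels
cluster_ids = np.array([2, 2, 2, 0, 0, 2, 1, 1, 1])
true_labels = ['AI', 'AI', 'AI', 'VR', 'VR', 'VR',
               'Sistem Informasi', 'Sistem Informasi', 'Sistem Informasi']

cluster_to_label = majority_label_per_cluster(cluster_ids, true_labels)
print(cluster_to_label)

# Translate a predicted cluster id through the mapping
# instead of indexing a fixed list of label names
new_cluster = 2
print(cluster_to_label[new_cluster])
```

Unlike a one-to-one assignment, this can give the same label to two different clusters, which may or may not be what you want.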
