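I have a problem matching labels to clusters with the KMeans algorithm. My test sentence is assigned to the correct cluster, but I do not get the correct label. I use numpy to index true_test_labels with the predicted cluster, but KMeans can number the clusters in any order, so the cluster index does not correspond to the position of the true label. I need help with this. Here is my code: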
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import string
import re
import numpy as np
from collections import Counter
stop = set(stopwords.words('indonesian'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()
# Cleaning the text sentences so that punctuation marks, stop words & digits are removed
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    processed = re.sub(r"\d+","",normalized)
    y = processed.split()
    #print (y)
    return y
path = "coba.txt"
train_clean_sentences = []
# Read one sentence per line; the with-block closes the file afterwards
with open(path,'r') as fp:
    for line in fp:
        line = line.strip()
        cleaned = clean(line)
        cleaned = ' '.join(cleaned)
        train_clean_sentences.append(cleaned)
#print(train_clean_sentences)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(train_clean_sentences)
# Clustering the training 30 sentences with K-means technique
modelkmeans = KMeans(n_clusters=3, init='k-means++', max_iter=200, n_init=100)
modelkmeans.fit(X)
teks_satu = "Aplikasi Machine Learning untuk mengenali daun mangga dengan metode CNN"
test_clean_sentence = []
cleaned_test = clean(teks_satu)
cleaned = ' '.join(cleaned_test)
cleaned = re.sub(r"\d+","",cleaned)
test_clean_sentence.append(cleaned)
Test = vectorizer.transform(test_clean_sentence)
true_test_labels = ['AI','VR','Sistem Informasi']
predicted_labels_kmeans = modelkmeans.predict(Test)
print(predicted_labels_kmeans)
print ("\n-------------------------------PREDICTIONS BY K-Means--------------------------------------")
print ("\nIndex of Virtual Reality : ",Counter(modelkmeans.labels_[5:10]).most_common(1)[0][0])
print ("Index of Machine Learning : ",Counter(modelkmeans.labels_[0:5]).most_common(1)[0][0])
print ("Index of Sistem Informasi : ",Counter(modelkmeans.labels_[10:15]).most_common(1)[0][0])
# np.int was removed in NumPy 1.24; use the built-in int on the single prediction
print ("\n",teks_satu,":",true_test_labels[int(predicted_labels_kmeans[0])],":",predicted_labels_kmeans)
4 answers
0vvn1miw1#
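I had the same problem: my clustering (KMeans) returned cluster numbers that differed from the true classes, so the true labels and the predicted labels did not match. The solution that worked for me was this code (scroll to "Permutation maximizing the sum of the diagonal elements"). Although this approach works well, I think there may be cases where it gives a wrong result.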
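The linked code is not reproduced here. As a rough illustration of the idea named in its title, the sketch below tries every column order of a small confusion matrix and keeps the one with the largest diagonal sum. The helper name best_permutation and the example matrix are made up for illustration, and the brute-force search is only practical for a handful of clusters.

```python
from itertools import permutations

import numpy as np


def best_permutation(cm):
    """Return the column order that maximizes the diagonal sum of a square matrix."""
    cm = np.asarray(cm)
    n = cm.shape[0]
    rows = np.arange(n)
    # Try every column order; there are n! candidates, so keep n small.
    return max(permutations(range(n)), key=lambda p: cm[rows, list(p)].sum())


# Hypothetical confusion matrix: rows are true classes, columns are cluster ids.
cm = np.array([[0, 5, 0],
               [4, 0, 1],
               [0, 1, 4]])

perm = best_permutation(cm)        # perm[true_class] = matching cluster id
reordered = cm[:, list(perm)]      # columns rearranged so matches sit on the diagonal
cluster_to_class = {cluster: cls for cls, cluster in enumerate(perm)}
print(perm)
print(reordered)
```

With the mapping in cluster_to_class, a predicted cluster id can be translated to a class index before it is used to look up a label.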
j91ykkif2#
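Here is a concrete example showing how to match KMeans cluster ids to the training-data labels. The basic idea is that, assuming the classification is correct, the confusion_matrix should have large values on its diagonal. Here is the confusion matrix before associating the cluster-center ids with the training labels: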
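Now we only need to reorder the confusion matrix so that its large values are back on the diagonal.
Here is the new confusion matrix, which should look familiar now, right?
You can use accuracy_score to further verify the result.
The complete standalone code is here: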
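The listing from this answer is not available here, so what follows is only a minimal sketch of the steps it describes, not the answer's own code. It assumes integer-encoded labels, and the arrays y_true and cluster_ids are hypothetical stand-ins for the training labels and for kmeans.labels_.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical data: integer-encoded true labels and the cluster ids KMeans assigned.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
cluster_ids = np.array([2, 2, 2, 0, 0, 1, 1, 1, 1])

# Rows are true labels, columns are cluster ids.
cm = confusion_matrix(y_true, cluster_ids)

# For each cluster (column), pick the true label (row) with the largest count.
cluster_to_label = cm.argmax(axis=0)

# Translate every cluster id into its matched label.
y_pred = cluster_to_label[cluster_ids]

# After relabeling, the large counts sit on the diagonal.
cm_aligned = confusion_matrix(y_true, y_pred)
print(cm_aligned)
print(accuracy_score(y_true, y_pred))
```

The same cluster_to_label array can be applied to the output of predict on new data, so test predictions are reported in terms of the training labels.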
64jmpszr3#
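Albert G Lieu's solution is good and helped me a lot, but if the confusion matrix gives the same result along some axis, duplicate index values can occur.
This part:
should be changed to: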
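The two snippets this answer refers to are missing, so the following is not the author's replacement. It is only a sketch of the duplicate-index problem and of one possible way around it: a one-to-one matching computed with scipy.optimize.linear_sum_assignment, which is an assumption on my part. The example matrix is hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical confusion matrix: rows are true labels, columns are cluster ids.
cm = np.array([[5, 4, 0],
               [1, 1, 0],
               [0, 0, 5]])

# Column-wise argmax maps clusters 0 and 1 to the same label (0): a duplicate.
naive = cm.argmax(axis=0)
has_duplicates = len(set(naive.tolist())) < cm.shape[1]

# One-to-one matching that maximizes the total count over the matched pairs.
label_idx, cluster_idx = linear_sum_assignment(cm, maximize=True)
cluster_to_label = dict(zip(cluster_idx.tolist(), label_idx.tolist()))

print(naive, has_duplicates)
print(cluster_to_label)
```

Because every label is used exactly once, no two clusters can end up with the same label, at the cost of forcing a match even for a cluster that has no clear majority.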
wlzqhblo4#
You can assign to each cluster the majority true label among its members.
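A minimal sketch of that majority-vote idea, assuming the true label of every training sentence is known. The helper name majority_label_per_cluster and the example data are hypothetical; the label names are the ones from the question.

```python
from collections import Counter


def majority_label_per_cluster(cluster_ids, true_labels):
    """Map each cluster id to the most frequent true label among its members."""
    members = {}
    for cluster, label in zip(cluster_ids, true_labels):
        members.setdefault(cluster, []).append(label)
    return {cluster: Counter(labels).most_common(1)[0][0]
            for cluster, labels in members.items()}


# Hypothetical training data: cluster ids from KMeans and the known labels.
cluster_ids = [2, 2, 2, 0, 0, 1, 1, 1, 1]
true_labels = ['AI', 'AI', 'AI', 'VR', 'VR', 'VR',
               'Sistem Informasi', 'Sistem Informasi', 'Sistem Informasi']

cluster_to_label = majority_label_per_cluster(cluster_ids, true_labels)

# Stand-in for the cluster id predicted for a test sentence.
predicted_cluster = 2
print(cluster_to_label[predicted_cluster])
```

In the question's code, cluster_ids would come from modelkmeans.labels_ and predicted_cluster from modelkmeans.predict(Test), so the lookup replaces indexing true_test_labels directly with the cluster id.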