Mahout k-means clustering gives a strange result: fewer documents shown than in the input

Posted by sqserrrh on 2021-06-03 in Hadoop


Here is my data:

1. 45000 lines (less than 100 words), single file.
2. Key: line ID
3. Value: line (String)

I converted these documents to vectors using the standard mahout CLI (everything ran fine). Parameters:

Number of clusters: 6, Iterations: 10

Result (clusterdump): 155 key:value entries

Can anyone help me with this?
Edit:
Sample data:

No.    data.  
1      The MapReduce implementation of fuzzy k-means looks similar to that of the k-means.  
2      Each entry in the sequence file has a key, which is the identifier of the vector.  
...  
45900   Fuzzy k-means has a parameter, m, called the fuzziness factor

Converted to a sequence file (verified with seqdumper)

<key:No.> <value:data>
...
45900

Vectorization

mahout-distribution-0.8/bin/mahout seq2sparse -i /user/hadoop/book-seq -o /user/hadoop/book-vector -ow -chunk 100 -wt tfidf -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq -n 2 --namedVector

k-means clustering

mahout-distribution-0.8/bin/mahout kmeans -i /user/hadoop/book-vector/tfidf-vectors -c /user/hadoop/book-initial-cluster -o /user/hadoop/book-kmeans-cluster -cd 0.1 -k 6 -x 10 -cl -ow -dm org.apache.mahout.common.distance.CosineDistanceMeasure

Cluster dump

Directory Structure    
ClusteredPoints  
Cluster-0  
Cluster-1  
Cluster-2-final  

mahout-distribution-0.8/bin/mahout clusterdump -i /user/hadoop/book-kmeans-cluster/clusters-2-final -p /user/hadoop/book-kmeans-cluster/clusteredPoints -of TEXT -o clusterdump.txt -dm org.apache.mahout.common.distance.CosineDistanceMeasure

cat clusterdump.txt  
155 Entries

Update:

After vectorization, tfidf-vectors shows only 155 documents instead of ~45000.
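
Since the loss already shows up in the vectorizer output, it may help to count records at each stage and see where the number first drops. The following is only a rough sketch, not a confirmed diagnosis. It assumes that `mahout` is on the PATH (the `MAHOUT_BIN` variable and both function names are made up for illustration), that seqdumper's `-c` flag prints one `Count: N` line per part file, and that seq2sparse wrote its usual `tokenized-documents`, `tf-vectors` and `tfidf-vectors` subdirectories.

```shell
#!/usr/bin/env bash
# Sketch: report how many records each seq2sparse stage holds.
# Assumptions (not from the original post): MAHOUT_BIN points at the mahout
# launcher, and `seqdumper -c` prints lines of the form "Count: N".

MAHOUT_BIN="${MAHOUT_BIN:-mahout}"

# Print the total record count of one sequence-file path.
# Sums every "Count:" line, in case one is printed per part file.
count_records() {
  "$MAHOUT_BIN" seqdumper -i "$1" -c 2>/dev/null \
    | awk '/^Count:/ {n += $2} END {print n + 0}'
}

# Print "<path> <count>" for the input and each intermediate output.
report_stages() {
  local seq_dir="$1"
  local vec_dir="$2"
  local path
  for path in "$seq_dir" \
              "$vec_dir/tokenized-documents" \
              "$vec_dir/tf-vectors" \
              "$vec_dir/tfidf-vectors"; do
    printf '%s %s\n' "$path" "$(count_records "$path")"
  done
}

# Usage with the paths from the question:
#   report_stages /user/hadoop/book-seq /user/hadoop/book-vector
```

If the count is already 155 for `book-seq`, the problem lies in the sequence file conversion rather than in seq2sparse; if it drops between two later stages, the seq2sparse options applied at that step are the first thing to look at.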

Java hadoop machine-learning mahout

Source: https://stackoverflow.com/questions/19571144/mahout-kmeans-clustering-strange-result-showing-less-documents-than-input-docum
