Mahout k-means clustering gives odd results: fewer documents shown than were input

Posted by sqserrrh on 2021-06-03 in Hadoop

Here is my data:

1. 45000 lines (fewer than 100 words each) in a single file.
2. Key: line ID
3. Value: line (String)

I converted these documents to vectors using the standard Mahout CLI (everything ran without errors). Parameters:

Number of clusters: 6, Iterations: 10

Result (clusterdump): 155 key:value entries

Can anyone help me with this?
Edit:
Sample data:

No.    data.  
1      The MapReduce implementation of fuzzy k-means looks similar to that of the k-means.  
2      Each entry in the sequence file has a key, which is the identifier of the vector.  
...  
45900   Fuzzy k-means has a parameter, m, called the fuzziness factor

Converted to a sequence file (verified with seqdumper):

<key:No.> <value:data>
...
45900
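
One way to double-check this step is to compare the number of records in a text dump with the number of distinct keys. The sketch below is only a rough check: it assumes the dump has one record per line in the form `Key: <id>: Value: <text>`, and both the helper name `count_keys` and the file name `book-seq.txt` are hypothetical.

```shell
#!/bin/sh
# Print "<total records> <distinct keys>" for a seqdumper text dump.
# Assumption: one record per line, formatted as "Key: <id>: Value: <text>".
count_keys() {
    dump_file="$1"
    # Count every line that looks like a record.
    total=$(grep -c '^Key: ' "$dump_file")
    # Strip each record down to its key, then count the unique keys.
    distinct=$(grep '^Key: ' "$dump_file" \
        | sed 's/^Key: \([^:]*\): Value: .*/\1/' \
        | sort -u | wc -l | tr -d ' ')
    echo "$total $distinct"
}

# Hypothetical usage, after dumping the sequence file to local text, e.g.
#   mahout-distribution-0.8/bin/mahout seqdumper -i /user/hadoop/book-seq -o book-seq.txt
# count_keys book-seq.txt
```

If the two numbers differ, some line IDs repeat in the sequence file, which would be worth ruling out before looking at the later steps.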

Vectorization

mahout-distribution-0.8/bin/mahout seq2sparse -i /user/hadoop/book-seq -o /user/hadoop/book-vector -ow -chunk 100 -wt tfidf -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq -n 2 --namedVector

k-means clustering

mahout-distribution-0.8/bin/mahout kmeans -i /user/hadoop/book-vector/tfidf-vectors -c /user/hadoop/book-initial-cluster -o /user/hadoop/book-kmeans-cluster -cd 0.1 -k 6 -x 10 -cl -ow -dm org.apache.mahout.common.distance.CosineDistanceMeasure

Cluster dump

Directory structure:
clusteredPoints
clusters-0
clusters-1
clusters-2-final

mahout-distribution-0.8/bin/mahout clusterdump -i /user/hadoop/book-kmeans-cluster/clusters-2-final -p /user/hadoop/book-kmeans-cluster/clusteredPoints -of TEXT -o clusterdump.txt -dm org.apache.mahout.common.distance.CosineDistanceMeasure

cat clusterdump.txt  
155 Entries

Update:

After vectorization, tfidf-vectors contains only 155 documents instead of ~45000.
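
To narrow down where the documents disappear, it may help to count the records at each document-keyed stage of the seq2sparse output. This is a sketch under assumptions, not a confirmed recipe: it assumes `seqdumper` accepts a count-only `-c` option and a directory as input (worth checking against `seqdumper --help` for your version), and that the output directory contains `tokenized-documents`, `tf-vectors` and `tfidf-vectors`. The helper name `count_stages` is hypothetical.

```shell
#!/bin/sh
# Launcher path as used in the question; override MAHOUT if yours differs.
MAHOUT="${MAHOUT:-mahout-distribution-0.8/bin/mahout}"

# Print a record count for each document-keyed stage under a seq2sparse output dir.
# Assumption: seqdumper supports -c (count only); verify with `seqdumper --help`.
count_stages() {
    base="$1"
    for stage in tokenized-documents tf-vectors tfidf-vectors; do
        echo "== $base/$stage"
        "$MAHOUT" seqdumper -i "$base/$stage" -c
    done
}

# The input sequence file can be counted the same way for comparison:
#   "$MAHOUT" seqdumper -i /user/hadoop/book-seq -c
# count_stages /user/hadoop/book-vector
```

The first stage whose count drops from ~45000 to 155 points at the step to investigate. If `seqdumper` rejects a directory, pointing `-i` at an individual part file inside it is the usual fallback.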
