What is the best method or algorithm for comparing two large lists of email addresses?

6ioyuze2, posted on 2021-06-02 in Hadoop

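What is the best method or algorithm for comparing two large lists of email addresses in a short amount of time?
The idea is to detect which addresses from one list can also be found in list B.
The lists are not the same size. I tried a fuzzy checksum, but it only works well when the lists are of equal size (and in my case they are not).
I think this calls for a Hadoop solution, but unfortunately I am a beginner with Hadoop. Does anyone have ideas, examples, solutions, or tutorials?
Thanks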

6ovsh4lw #1

Would this work for you? It should be O(n), assuming hash set lookups take constant time on average.

Create an empty hash set for the intersection, with a hash function that distributes email addresses well
Create an empty hash set for the first difference (addresses only in the first list), with a similar hash function
Create an empty hash set for the second difference (addresses only in the second list), with a similar hash function
Iterate through the first list:
    Add the current element to the first difference hash set
End Iterate
Iterate through the second list:
    If the current element exists in the intersection hash set:
        Remove the current element from the first difference hash set
        Remove the current element from the second difference hash set
    Else If the current element exists in the first difference hash set:
        Remove the current element from the first difference hash set
        Remove the current element from the second difference hash set
        Add the current element to the intersection hash set
    Else:
        Add the current element to the second difference hash set
    End If
End Iterate
Process the intersection hash set as the solution

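The advantage is that it gives you both the intersection and the differences. It can be extended to track the differences between any number of lists.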
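For readers who prefer code over pseudocode, here is a minimal Java sketch of the steps above. The class name `ListComparison` and its field names are hypothetical, chosen only for illustration. It assumes addresses are compared as exact strings (no case folding or trimming), and it relies on `HashSet.remove` returning `true` when the element was present, which folds the "exists in the first difference" check and the removal into one call.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class; the name and fields are illustrative only.
class ListComparison {
    final Set<String> common = new HashSet<>();       // addresses in both lists
    final Set<String> onlyInFirst = new HashSet<>();  // first difference
    final Set<String> onlyInSecond = new HashSet<>(); // second difference

    static ListComparison compare(List<String> first, List<String> second) {
        ListComparison result = new ListComparison();
        // Pass 1: start by treating every address of the first list as unique to it.
        result.onlyInFirst.addAll(first);
        // Pass 2: move addresses that also occur in the second list to the intersection.
        for (String address : second) {
            if (result.common.contains(address)) {
                continue; // already classified as common (duplicate in the second list)
            }
            if (result.onlyInFirst.remove(address)) {
                result.common.add(address);
            } else {
                result.onlyInSecond.add(address);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        ListComparison r = compare(
                Arrays.asList("a@b.com", "1@2.com"),
                Arrays.asList("1@2.com", "asd@f.com", "qwer@ty.com"));
        System.out.println("common: " + r.common);
        System.out.println("only in first: " + r.onlyInFirst);
        System.out.println("only in second: " + r.onlyInSecond);
    }
}
```

Each address is hashed a small constant number of times, so the running time stays linear in the combined length of the two lists on average. The iteration order of a `HashSet` is unspecified, so the printed order of each set may vary.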

bvuwiixz #2

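If you treat each list as a set, the common addresses are given by the set intersection. The "unique" addresses (those that appear in only one of the lists) are given by: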

(set1 ∪ set2) \ (set1 ∩ set2)

This is easy to do in any high-level language such as Java; see CollectionUtils.intersection() from Apache Commons Collections, for example.
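As an illustration of the two set expressions, here is a small sketch that uses only `java.util` rather than Apache Commons. The class `SetOps` and its method names are hypothetical; it assumes both lists have already been loaded into `Set<String>` instances and that addresses are compared as exact strings.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper; uses only java.util, not Apache Commons.
class SetOps {
    // Addresses present in both sets.
    static Set<String> intersection(Set<String> a, Set<String> b) {
        Set<String> result = new HashSet<>(a);
        result.retainAll(b);
        return result;
    }

    // Addresses present in exactly one set: (a union b) minus (a intersect b).
    static Set<String> symmetricDifference(Set<String> a, Set<String> b) {
        Set<String> result = new HashSet<>(a);
        result.addAll(b);                     // a union b
        result.removeAll(intersection(a, b)); // minus the common addresses
        return result;
    }

    public static void main(String[] args) {
        Set<String> set1 = new HashSet<>(Arrays.asList("a@b.com", "1@2.com"));
        Set<String> set2 = new HashSet<>(Arrays.asList("1@2.com", "asd@f.com", "qwer@ty.com"));
        System.out.println("common: " + intersection(set1, set2));
        System.out.println("unique: " + symmetricDifference(set1, set2));
    }
}
```

Both methods copy their first argument before modifying it, so the input sets are left unchanged.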
If the lists are not too large (that is, they fit in memory), you can do the following in memory (Java code):

import java.util.*;

public class CommonEmails {
    public static void main(String[] args) {
        // The two lists are sample data for testing, not part of the algorithm.
        List<String> l1 = Arrays.asList("a@b.com", "1@2.com");
        List<String> l2 = Arrays.asList("1@2.com", "asd@f.com", "qwer@ty.com");
        // Load the first list into a hash set so each lookup is O(1) on average.
        Set<String> s1 = new HashSet<>(l1);
        for (String s : l2) {
            if (s1.contains(s)) System.out.println(s); // prints 1@2.com
        }
    }
}

If you want to use Hadoop, you can find the common addresses like this:

map(list):
    for each mail in list:
        emit(mail, '1')
reduce(mail, list<1>):
    if size(list) > 1:
        emit(mail)
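By calling map on both lists and running reduce on the mappers' output, you get the common elements. This assumes no address is repeated within a single list; otherwise a duplicate would also produce a count greater than 1.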
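To make the data flow concrete, here is a plain-Java simulation of the map, shuffle, and reduce steps. It does not use any Hadoop classes, and the class and method names are hypothetical. It also deviates from the pseudocode in one labeled way: the mapper emits the id of the source list ("A" or "B") instead of '1', so that an address duplicated inside a single list is not reported as common.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Plain-Java simulation of the map/shuffle/reduce flow; no Hadoop classes are used.
class CommonMailSimulation {
    // "Map" phase: emit (mail, listId) for every address of one input list.
    static void map(String listId, List<String> mails, Map<String, List<String>> shuffle) {
        for (String mail : mails) {
            shuffle.computeIfAbsent(mail, k -> new ArrayList<>()).add(listId);
        }
    }

    // "Reduce" phase: keep a mail only if it was emitted from more than one list.
    static List<String> reduce(Map<String, List<String>> shuffle) {
        List<String> common = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : shuffle.entrySet()) {
            Set<String> sources = new HashSet<>(entry.getValue());
            if (sources.size() > 1) {
                common.add(entry.getKey());
            }
        }
        return common;
    }

    static List<String> commonMails(List<String> a, List<String> b) {
        // The TreeMap stands in for the shuffle step, which groups values by key.
        Map<String, List<String>> shuffle = new TreeMap<>();
        map("A", a, shuffle);
        map("B", b, shuffle);
        return reduce(shuffle);
    }

    public static void main(String[] args) {
        List<String> common = commonMails(
                Arrays.asList("a@b.com", "1@2.com", "a@b.com"),
                Arrays.asList("1@2.com", "asd@f.com", "qwer@ty.com"));
        System.out.println(common);
    }
}
```

In a real Hadoop job, the mapper would have to determine which input each record came from (for example, from the input file name), and the framework would perform the grouping that the `TreeMap` simulates here.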
