How does the Hive sentences function break each sentence?

Posted by jqjz2hbq on 2021-06-28 in Hive

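Before posting, I tried the Hive sentences function and did some searching, but I could not get a clear understanding. My question: on which delimiter does the Hive sentences function break each sentence? The Hive manual says "appropriate sentence boundary". What does that mean? Below is an example where I placed a period (.) and an exclamation mark (!) at the same point in the text. I got different results. Can someone explain?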

With a period (.)

select sentences('Tokenizes a string of natural language text into words and sentences. where each sentence is broken at the appropriate sentence boundary and returned as an array of words.') from dummytable

Output: 1 array

[["Tokenizes","a","string","of","natural","language","text","into","words","and","sentences","where","each","sentence","is","broken","at","the","appropriate","sentence","boundary","and","returned","as","an","array","of","words"]]

With an exclamation mark (!)

select sentences('Tokenizes a string of natural language text into words and sentences! where each sentence is broken at the appropriate sentence boundary and returned as an array of words.') from dummytable

Output: 2 arrays

[["Tokenizes","a","string","of","natural","language","text","into","words","and","sentences"],["where","each","sentence","is","broken","at","the","appropriate","sentence","boundary","and","returned","as","an","array","of","words"]]

Hive bigdata

Source: https://stackoverflow.com/questions/41467245/how-hive-sentences-function-breaks-each-sentence

2 Answers

slwdgvem #1

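Understanding what sentences() does should clear up your doubt.
Definition of sentences(str):
Splits str into an array of sentences, where each sentence is an array of words.
Example: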

SELECT sentences('Hello there! I am a UDF.') FROM src LIMIT 1;

[ ["Hello", "there"], ["I", "am", "a", "UDF"] ]

SELECT sentences('review . language') FROM movies;

[["review","language"]]

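An exclamation mark is a punctuation mark that ends a sentence. Periods and question marks are related punctuation marks that also end sentences. However, per the definition of sentences(), unnecessary punctuation, such as periods and commas in English, is automatically stripped. That is why "!" gives us two arrays of words. The behavior comes down to the locale (java.util.Locale).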
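The thread never names the mechanism behind the "appropriate sentence boundary". One plausible reading, stated here as an assumption and not something the thread confirms, is that sentences() relies on Java's locale-aware java.text.BreakIterator: one iterator finds sentence boundaries, a second splits each sentence into words, and tokens that do not start with a letter or digit are dropped. The sketch below reproduces that idea in plain Java so the period and exclamation-mark cases from the question can be tried outside Hive. The class name SentenceSplit and its two helper methods are hypothetical names chosen for illustration, not Hive code.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class, not part of Hive.
class SentenceSplit {

    // Split text into sentences, each returned as a list of words.
    static List<List<String>> sentences(String text, Locale locale) {
        List<List<String>> result = new ArrayList<>();
        BreakIterator sentenceIt = BreakIterator.getSentenceInstance(locale);
        sentenceIt.setText(text);
        int start = sentenceIt.first();
        for (int end = sentenceIt.next(); end != BreakIterator.DONE;
                start = end, end = sentenceIt.next()) {
            result.add(words(text.substring(start, end), locale));
        }
        return result;
    }

    // Split one sentence into words, dropping punctuation and whitespace tokens.
    static List<String> words(String sentence, Locale locale) {
        List<String> out = new ArrayList<>();
        BreakIterator wordIt = BreakIterator.getWordInstance(locale);
        wordIt.setText(sentence);
        int start = wordIt.first();
        for (int end = wordIt.next(); end != BreakIterator.DONE;
                start = end, end = wordIt.next()) {
            String token = sentence.substring(start, end);
            // Keep only tokens that begin with a letter or digit.
            if (Character.isLetterOrDigit(token.charAt(0))) {
                out.add(token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String tail = " where each sentence is broken at the appropriate sentence boundary.";
        // Same text as in the question, once with "." and once with "!".
        System.out.println(sentences("Tokenizes text into words and sentences." + tail, Locale.ENGLISH));
        System.out.println(sentences("Tokenizes text into words and sentences!" + tail, Locale.ENGLISH));
    }
}
```

If this reading is right, the period is not treated as a fixed delimiter: whether it ends a sentence depends on what follows it, whereas "!" ends the sentence on its own. That would be consistent with the two outputs shown in the question.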

o4hqfura #2

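I don't know the actual reason, but I observed that if the period (.) is followed by a space and the next word starts with a capital letter, the split works. Here I changed "where" to "Where". This is not needed for "!", though.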

Tokenizes a string of natural language text into words and sentences. Where each sentence is broken at the appropriate sentence boundary and returned as an array of words.

This gives the output below:

[["Tokenizes","a","string","of","natural","language","text","into","words","and","sentences"],["Where","each","sentence","is","broken","at","the","appropriate","sentence","boundary","and","returned","as","an","array","of","words"]]
