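Before posting, I tried the Hive sentences function and did some searching, but I still don't have a clear understanding. My question: on which delimiters does the Hive sentences function break each sentence? The Hive manual says "appropriate sentence boundary"; what does that mean? Below is an example I tried, placing a period (.) and an exclamation mark (!) at different places in the sentence. I got different results. Can someone explain?
With a period (.):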
select sentences('Tokenizes a string of natural language text into words and sentences. where each sentence is broken at the appropriate sentence boundary and returned as an array of words.') from dummytable
Output: 1 array
[["Tokenizes","a","string","of","natural","language","text","into","words","and","sentences","where","each","sentence","is","broken","at","the","appropriate","sentence","boundary","and","returned","as","an","array","of","words"]]
With an exclamation mark (!):
select sentences('Tokenizes a string of natural language text into words and sentences! where each sentence is broken at the appropriate sentence boundary and returned as an array of words.') from dummytable
Output: 2 arrays
[["Tokenizes","a","string","of","natural","language","text","into","words","and","sentences"],["where","each","sentence","is","broken","at","the","appropriate","sentence","boundary","and","returned","as","an","array","of","words"]]
2 answers
slwdgvem1#
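Understanding what sentences() does should clear up your doubt.
Definition of sentences(str):
Splits str into an array of sentences, where each sentence is an array of words.
Example:
An exclamation mark is a punctuation mark that ends a sentence. Periods and question marks are related punctuation marks that also go at the end of a sentence.
In sentences(), unnecessary punctuation, such as periods and commas in English, is automatically stripped. That is why the version with ! yields two arrays of words. It all comes down to java.util.Locale.java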
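To make the splitting behavior concrete, here is a minimal Java sketch of the two-pass tokenization described above. It assumes that Hive's sentences() relies on java.text.BreakIterator for both sentence and word boundaries; I believe that is the case, but treat it as an assumption rather than a statement about Hive's source. The class SentencesSketch and its sentences method are hypothetical names used only for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for the UDF; this is not Hive code.
class SentencesSketch {

    // Split text into sentences, then split each sentence into words.
    public static List<List<String>> sentences(String text, Locale locale) {
        List<List<String>> result = new ArrayList<>();
        BreakIterator sentenceIt = BreakIterator.getSentenceInstance(locale);
        sentenceIt.setText(text);
        int start = sentenceIt.first();
        for (int end = sentenceIt.next(); end != BreakIterator.DONE;
                start = end, end = sentenceIt.next()) {
            String sentence = text.substring(start, end);
            List<String> words = new ArrayList<>();
            BreakIterator wordIt = BreakIterator.getWordInstance(locale);
            wordIt.setText(sentence);
            int wStart = wordIt.first();
            for (int wEnd = wordIt.next(); wEnd != BreakIterator.DONE;
                    wStart = wEnd, wEnd = wordIt.next()) {
                String token = sentence.substring(wStart, wEnd);
                // Keep tokens that start with a letter or digit;
                // this drops whitespace and punctuation tokens.
                if (Character.isLetterOrDigit(token.charAt(0))) {
                    words.add(token);
                }
            }
            result.add(words);
        }
        return result;
    }

    public static void main(String[] args) {
        String tail = " where each sentence is broken at the appropriate sentence boundary.";
        // Same shape as the two queries in the question, shortened.
        System.out.println(sentences("Tokenizes text into words and sentences." + tail, Locale.US));
        System.out.println(sentences("Tokenizes text into words and sentences!" + tail, Locale.US));
    }
}
```

Under that assumption, the difference in the question comes from the sentence iterator rather than from the punctuation stripping: a period followed by a lowercase word is not treated as the end of a sentence, while an exclamation mark is.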
o4hqfura2#
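I don't know the actual reason, but I observed that after a period (.), if you add a space and capitalize the first letter of the next word, it works. Here I changed "where" to "Where". With !, however, this is not needed.
The output is below: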
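The observation in this answer can also be checked outside Hive. The sketch below is not the output referred to above; it is a small standalone check that counts sentence boundaries with java.text.BreakIterator, on the assumption that Hive's sentences() uses the same mechanism. BoundaryCount and countSentences are hypothetical names, and the input strings are shortened versions of the one in the question.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical helper for illustration; not part of Hive.
class BoundaryCount {

    // Count how many sentences BreakIterator finds in the text.
    public static int countSentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        int count = 0;
        // Each successful next() moves past one sentence.
        while (it.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Period followed by a lowercase word.
        System.out.println(countSentences("words and sentences. where each sentence is broken."));
        // Period followed by a capitalized word.
        System.out.println(countSentences("words and sentences. Where each sentence is broken."));
        // Exclamation mark followed by a lowercase word.
        System.out.println(countSentences("words and sentences! where each sentence is broken."));
    }
}
```

On a standard JDK I would expect the first input to count as one sentence and the other two as two sentences each, which matches what both the question and this answer report from Hive.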