How can I split a string into multiple columns in BigQuery without breaking words apart?

wljmcqd8  posted on 2021-07-24  in  Java

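I'm trying to split a string into two columns, but only if the string's total length is greater than 25 characters. If it is shorter than 25 characters, I want it only in the second column. If the length is greater than 25, I want the first part of the string in column 1 and the second part in column 2. Here's the kicker: I don't want words broken apart. So if the total length is 26, I know I need two columns, but I need to work out where to split the string so that each column contains only whole words.
For example, the string is "Transportation Project Manager". Because it has more than 25 characters, I want the first column to say "Transportation Project" and the second column to say "Manager". "Transportation Project" is fewer than 25 characters, but I want it to stop there because no other whole word fits within the 25-character limit.
Another example: the string is "Caseworker I". Because it is fewer than 25 characters, I want the whole string in column 2.
Thank you for your time!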

gr8qqesn  #1

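To split the string into two columns at a defined maximum length (following the logic you described), we will use a JavaScript user-defined function (UDF) in BigQuery together with the built-in function LENGTH.
First, the string is analyzed. If the character right after the maximum threshold is a space, the string is split at the given maximum length. Otherwise, each character is checked, counting backwards, until a space is found, and the string is split there. This keeps the function from breaking a word, and the first part always stays within the maximum allowed length.
Below is the query with some sample data,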

CREATE TEMP FUNCTION split_str_1(s string,len int64)
RETURNS string
LANGUAGE js AS """
// len_aux: candidate split index; prev: start of the piece being cut
var len_aux = len, prev = 0;

// first part of the string, within the threshold
var output = [];

// the rest of the string, without the first part
var output2 = [];

// if the character right after the threshold is a space, split the string there
if (s[len_aux++] == ' ') {
    output.push(s.substring(prev, len_aux));
    // the remainder starts at the split point (starting at prev = 0 would return the whole string)
    output2.push(s.substring(len_aux, s.length));
}
else {
    // otherwise walk backwards until a space is found and split right after it
    do {
        if (s.substring(len_aux - 1, len_aux) == ' ') {
            output.push(s.substring(prev, len_aux));
            prev = len_aux;
            output2.push(s.substring(prev, s.length));
            break;
        }
        len_aux--;
    } while (len_aux > prev)
}

// outputting the first part of the string
return output[0];
""";

CREATE TEMP FUNCTION split_str_2(s string,len int64)
RETURNS string
LANGUAGE js AS """
// len_aux: candidate split index; prev: start of the piece being cut
var len_aux = len, prev = 0;

// first part of the string, within the threshold
var output = [];

// the rest of the string, without the first part
var output2 = [];

// if the character right after the threshold is a space, split the string there
if (s[len_aux++] == ' ') {
    output.push(s.substring(prev, len_aux));
    // the remainder starts at the split point (starting at prev = 0 would return the whole string)
    output2.push(s.substring(len_aux, s.length));
}
else {
    // otherwise walk backwards until a space is found and split right after it
    do {
        if (s.substring(len_aux - 1, len_aux) == ' ') {
            output.push(s.substring(prev, len_aux));
            prev = len_aux;
            output2.push(s.substring(prev, s.length));
            break;
        }
        len_aux--;
    } while (len_aux > prev)
}

// outputting the second part of the string
return output2[0];
""";

WITH data AS (
SELECT "Trying to split a string with more than 25 characters length" AS str UNION ALL
SELECT "Trying to split"  AS str
)
SELECT  str,
        IF(LENGTH(str)>25, split_str_1(str,25), null) as column_1,
        CASE WHEN LENGTH(str)>25 THEN split_str_2(str,25) ELSE str END AS column_2
FROM data

and the output (not reproduced here).

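Note that there are two JavaScript UDFs. This is because, when the string is longer than 25 characters, the first one returns the first part of the string and the second one returns the second part. Also, the maximum allowed length is passed as a parameter, but it could be defined statically inside the UDF as len=25.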
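The backward scan described above can also be tried outside BigQuery. The following is a minimal standalone sketch, not the UDFs from this answer: the name `splitAtWord` is hypothetical, it returns both parts at once as `[first, rest]`, and, unlike the UDFs, it drops the separating space. Whether the same body can be wired into a single UDF that returns both parts is left as an assumption.

```javascript
// Hypothetical helper (not one of the answer's UDFs): returns [first, rest].
function splitAtWord(s, len) {
  // Strings that already fit are not split; the whole string is the "rest".
  if (s.length <= len) {
    return [null, s];
  }
  // Walk backwards from the threshold until a space is found.
  // i === len is allowed: a space right after the threshold means the
  // first len characters already end on a word boundary.
  for (let i = len; i > 0; i--) {
    if (s[i] === ' ') {
      // Cut at the space and drop the space itself.
      return [s.substring(0, i), s.substring(i + 1)];
    }
  }
  // No space within the threshold: a single long word is left whole.
  return [null, s];
}

console.log(splitAtWord("Trying to split a string with more than 25 characters length", 25));
console.log(splitAtWord("Trying to split", 25));
```

Returning `null` for the first part keeps the short-string rule from the question: the whole string ends up in the second column.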

wh6knrhe  #2

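I think your angle of attack should be to find the first space before the 25th character and split based on that.
Using the phrases from the other submitted answers as sample data: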

with sample_data as(
  select 'Transportation Project Manager' as phrase union all
  select 'Caseworker I'as phrase union all
  select "This's 25 characters long" as phrase union all
  select "This's 25 characters long (not!)" as phrase union all
  select 'Antidisestablishmentarianist' as phrase union all
  select 'Trying to split a string with more than 25 characters in length' as phrase union all
  select 'Trying to split' as phrase
),
temp as (
  select 
    phrase,
    length(phrase) as phrase_len,
    -- Find the first space before the 25th character
    -- by reversing the first 25 characters
    25-strpos(reverse(substr(phrase,1,25)),' ') as first_space_before_25 
  from sample_data
)
select 
  phrase,
  phrase_len,
  first_space_before_25,
  case when phrase_len <= 25 or first_space_before_25 = 25 then null
       when phrase_len > 25 then substr(phrase,1,first_space_before_25)
       else null
  end as col1,
  case when phrase_len <= 25 or first_space_before_25 = 25 then phrase
       when phrase_len > 25 then substr(phrase,first_space_before_25+1, phrase_len)
       else null
  end as col2
from temp


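I think basic SQL string operations get you very close. You may need or want to clean this up, depending on whether you want col2 to start with a space and on your cutoff (you mentioned less than 25 and greater than 25, but not exactly 25).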
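To make the arithmetic in `25-strpos(reverse(substr(phrase,1,25)),' ')` easier to follow, here is a small JavaScript sketch that mirrors it step by step. The function name `firstSpaceBefore` is hypothetical, and the sketch assumes plain ASCII input (it reverses the string character by character).

```javascript
// Hypothetical helper mirroring:
//   limit - STRPOS(REVERSE(SUBSTR(phrase, 1, limit)), ' ')
function firstSpaceBefore(phrase, limit) {
  const head = phrase.substring(0, limit);             // SUBSTR(phrase, 1, limit)
  const reversed = head.split('').reverse().join('');  // REVERSE(...)
  const strpos = reversed.indexOf(' ') + 1;            // STRPOS is 1-based, 0 if absent
  return limit - strpos;
}

const phrase = 'Transportation Project Manager';
const cut = firstSpaceBefore(phrase, 25);
console.log(cut);                       // number of characters kept in col1
console.log(phrase.substring(0, cut));  // SUBSTR(phrase, 1, cut)
console.log(phrase.substring(cut));     // SUBSTR(phrase, cut + 1, ...), keeps the leading space
```

When the first 25 characters contain no space, `strpos` is 0 and the result equals the limit, which is the case the query treats as "do not split".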

jvidinwx  #3

Below is for BigQuery Standard SQL


# standardSQL

SELECT phrase, 
  IF(IFNULL(cut, len ) >= len, NULL, SUBSTR(phrase, 1, cut)) col1, 
  IF(IFNULL(cut, len ) >= len, phrase, SUBSTR(phrase, cut + 1)) col2
FROM (
  SELECT phrase, LENGTH(phrase) len,
    (
      SELECT cut FROM (
        SELECT -1 + SUM(LENGTH(word) + 1) OVER(ORDER BY OFFSET) AS cut
        FROM UNNEST(SPLIT(phrase, ' ')) word WITH OFFSET
      )
      WHERE cut <= 25
      ORDER BY cut DESC
      LIMIT 1
    ) cut
  FROM `project.dataset.table`  
)

You can test and play with the above using the sample data (nicely provided in the other answers), as in the example below


# standardSQL

WITH `project.dataset.table` AS (
  SELECT 'Transportation Project Manager' AS phrase UNION ALL
  SELECT 'Caseworker I' UNION ALL
  SELECT "This's 25 characters long" UNION ALL
  SELECT "This's 25 characters long (not!)" UNION ALL
  SELECT 'Antidisestablishmentarianist' UNION ALL
  SELECT 'Trying to split a string with more than 25 characters in length' UNION ALL
  SELECT 'Trying to split'
)
SELECT phrase, 
  IF(IFNULL(cut, len ) >= len, NULL, SUBSTR(phrase, 1, cut)) col1, 
  IF(IFNULL(cut, len ) >= len, phrase, SUBSTR(phrase, cut + 1)) col2
FROM (
  SELECT phrase, LENGTH(phrase) len,
    (
      SELECT cut FROM (
        SELECT -1 + SUM(LENGTH(word) + 1) OVER(ORDER BY OFFSET) AS cut
        FROM UNNEST(SPLIT(phrase, ' ')) word WITH OFFSET
      )
      WHERE cut <= 25
      ORDER BY cut DESC
      LIMIT 1
    ) cut
  FROM `project.dataset.table`  
)

with output

Row phrase                                                          col1                        col2     
1   Transportation Project Manager                                  Transportation Project      Manager  
2   Caseworker I                                                    null                        Caseworker I     
3   This's 25 characters long                                       null                        This's 25 characters long    
4   This's 25 characters long (not!)                                This's 25 characters long   (not!)   
5   Antidisestablishmentarianist                                    null                        Antidisestablishmentarianist     
6   Trying to split a string with more than 25 characters in length Trying to split a string    with more than 25 characters in length   
7   Trying to split                                                 null                        Trying to split

Note: if you want to get rid of the leading (in col2) and trailing (in col1) spaces, you can add TRIM() to handle that extra logic
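The running-sum idea behind `cut` can be checked with a short JavaScript sketch. The names `cutPoint` and `splitColumns` are hypothetical, and the trimming is added here as one possible way to apply the note above; it is not part of the query.

```javascript
// Hypothetical helper: largest running end-of-word position that is <= limit.
// Mirrors -1 + SUM(LENGTH(word) + 1) OVER (ORDER BY OFFSET), filtered by cut <= limit.
function cutPoint(phrase, limit) {
  let running = 0;
  let best = null;
  for (const word of phrase.split(' ')) {
    running += word.length + 1;    // LENGTH(word) + 1
    const cut = running - 1;       // position of the word's last character
    if (cut <= limit) best = cut;  // cuts only grow, so the last match is the largest
  }
  return best;
}

// Hypothetical helper: applies the same IF(IFNULL(cut, len) >= len, ...) rule.
function splitColumns(phrase, limit) {
  const cut = cutPoint(phrase, limit);
  if (cut === null || cut >= phrase.length) {
    return { col1: null, col2: phrase };
  }
  return {
    col1: phrase.substring(0, cut).trim(),  // SUBSTR(phrase, 1, cut)
    col2: phrase.substring(cut).trim(),     // SUBSTR(phrase, cut + 1), leading space trimmed
  };
}

console.log(splitColumns('Transportation Project Manager', 25));
console.log(splitColumns('Caseworker I', 25));
```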

4jb9z9bj  #4

Wow, this is a great interview question! Here's what I came up with:

WITH sample_data
      AS (
  SELECT 'Transportation Project Manager' AS phrase
   UNION ALL
  SELECT 'Caseworker I' AS phrase
   UNION ALL
  SELECT "This's 25 characters long" AS phrase
   UNION ALL
  SELECT "This's 25 characters long (not!)" AS phrase
   UNION ALL
  SELECT 'Antidisestablishmentarianist' AS phrase
         ),
         unnested_words --Make a dataset with one row per "word" per phrase
      AS (
  SELECT *,
         --To preserve the spaces for character counts, prepend one to every word but the first
         CASE WHEN i = 0 THEN '' ELSE ' ' END || word AS word_with_space
    FROM sample_data
   CROSS
    JOIN UNNEST(SPLIT(phrase, ' ')) AS word WITH OFFSET AS i
         ),
         with_word_length
      AS (
  SELECT *,
         --This doesn't need its own CTE, but done here for clarity
         LENGTH(word_with_space) AS word_length
    FROM unnested_words
         ),
         running_sum --Mark when the total character length exceeds 25
      AS (
  SELECT *,
         SUM(word_length) OVER (PARTITION BY phrase ORDER BY i) <= 25 AS is_first_25
    FROM with_word_length
         ),
         by_subphrase --Make a subphrase of words in the first 25, and one for any others
      AS (
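  SELECT phrase,
         is_first_25,
         --Rejoin the words in their original order, restoring the spaces removed by SPLIT
         ARRAY_TO_STRING(ARRAY_AGG(word ORDER BY i), ' ') AS subphrase
    FROM running_sum
GROUP BY phrase, is_first_25
         ),
         by_phrase --Put subphrases into an array (back to one row per phrase)
      AS (
  --Order the array so the "first 25" subphrase, when present, comes first
  SELECT phrase, ARRAY_AGG(subphrase ORDER BY is_first_25 DESC) AS subphrases FROM by_subphrase GROUP BY 1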
         )
  SELECT phrase,
         --Break the array of subphrases into columns per your rules
         CASE WHEN ARRAY_LENGTH(subphrases) = 1 THEN subphrases[OFFSET(0)] ELSE subphrases[OFFSET(1)] END AS col2, 
         CASE WHEN ARRAY_LENGTH(subphrases) = 1 THEN NULL ELSE subphrases[OFFSET(0)] END AS col1
    FROM by_phrase

Not pretty, but it gets the job done.
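For readers who want to trace the CTEs by hand, here is a rough JavaScript sketch of the same flow: split into words, keep a running character count, and group the words by whether the count is still within the limit. The name `groupByLimit` is hypothetical, and joining each group with a single space is an assumption of this sketch.

```javascript
// Hypothetical helper: returns one or two subphrases, like the subphrases array.
function groupByLimit(phrase, limit) {
  const first = [];  // words whose running length is still <= limit
  const rest = [];   // every other word
  let running = 0;
  phrase.split(' ').forEach((word, i) => {
    // Every word but the first carries one leading space in the count.
    running += word.length + (i === 0 ? 0 : 1);
    (running <= limit ? first : rest).push(word);
  });
  // Drop an empty group so short phrases yield a single subphrase.
  return [first, rest]
    .filter((group) => group.length > 0)
    .map((group) => group.join(' '));
}

console.log(groupByLimit('Transportation Project Manager', 25));
console.log(groupByLimit('Caseworker I', 25));
```

A result with one element corresponds to the `ARRAY_LENGTH(subphrases) = 1` branch of the final SELECT.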
