macOS Terminal: split a CSV file by the values in column 1

yr9zkbsy asked 7 months ago in Mac

I have a two-column spreadsheet (saved as CSV) that looks like this:

COLUMN 1,COLUMN 2
3-Entrepreneurship,"innovation, daily"
,countless
2-Police/Enforcement,"innocent, protect"
2-Bathroom:home room,toilet handle
3-Companies,née dresses
2-Sense of Smell,odorless
3-Entrepreneurship,old ideas
3-Entrepreneurship,¡new income streams!
3-Companies,Zoë’s food store
,many
2-Police/Enforcement,crime
2-Bathroom:home room,bath room
,ring
3-Companies,móvíl résumés
2-Sense of Smell,musty smell
3-Entrepreneurship,good publicity guru!
3-Companies,Señor

The full spreadsheet has 1,000 rows (saved as CSV, with a comma separating the two columns). It contains more column-1 categories than the ones listed here.
As shown, some of the column-2 entries consist of two or three words separated by spaces. They also use commas, apostrophes, and accented characters (and these appear across several categories, not just the one titled 3-Companies).
I want to split the CSV file into separate TXT files according to the values in column 1. The individual files would no longer be spreadsheets, just lists of words.
For example, after splitting:

  • In the file 3-Entrepreneurship.txt:
3-Entrepreneurship
innovation, daily
old ideas
¡new income streams!
good publicity guru!

  • In the file 2-Bathroom:home room.txt:
2-Bathroom:home room
toilet handle
bath room

  • In the file 2-Police/Enforcement.txt:
2-Police/Enforcement
innocent, protect
crime

  • In the file 2-Sense of Smell.txt:
2-Sense of Smell
odorless
musty smell

  • In the file 3-Companies.txt:
3-Companies
née dresses
Zoë’s food store
móvíl résumés
Señor


This is just a sample. The full file has more than 5 categories (in column 1), so there will be more than 5 split files.

Environment: I am using Terminal on macOS 12.6.9. Ideally, I would like to copy and paste a single line of code and have it act on the CSV file in Terminal's working directory (so I don't have to hard-code the file name into the code).
First attempt

I actually asked a different variant of this question here. In that version, column 2 (rather than column 1) was used for the split. That version also did not put the category as the first line of each split TXT file.
I tried to modify it as follows:

tail -n +2 *.csv | sort -t',' -k2 | awk -F',' '$2~/^[[:space:]]*$/{next} {sub(/\x0d$/,"")} $2!=prev{close(out); out=$2".txt"; prev=$2} {print $1 > out}'


However, rather than splitting by column 1 and putting the category name at the top, it ignores column 1 entirely and instead splits on the column-2 values, creating a separate file for each one.
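For example, run against the sample above, it ends up producing one file per column-2 value, each containing a category name instead of a word list, roughly like this:

$ head crime.txt 'musty smell.txt'
==> crime.txt <==
2-Police/Enforcement

==> musty smell.txt <==
2-Sense of Smell
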
Please note: I am not married to this code. Any solution that works on macOS 12.6.9 is fine.

oewdyzsn1#

One awk idea:

awk '
#NR==1 { next }                                                 # uncomment this line if the file *DOES* include the header record "COLUMN 1,COLUMN 2"
       { pos = index($0,",")                                    # find location of first comma

         cat = substr($0,1,pos-1)                               # extract the category
         gsub(/\//,":",cat)                                     # replace "/" with ":" (unix/linux filenames cannot include "/")
         if (! cat) cat = prevcat                               # if cat is empty/blank then use prevcat

         words = substr($0,pos+1)                               # extract the words
         gsub(/^"|"$/,"",words)                                 # strip any leading/trailing double quote

         cats[cat] = (cats[cat] ? cats[cat] : cat) ORS words    # update our categories array (cats[]) with our new data; if cats[cat] is empty (ie, this is a new array entry) then start by adding our cat to the list

         prevcat = cat                                          # update previous cat
       }

END    { for (cat in cats) {                                    # loop through list of categories (ie, the indexes of the cats[] array)
             outfile = cat ".txt"                               # create output file name
             print cats[cat] > outfile                          # print array entry to output file
             close (outfile)                                    # close the output file
         }
       }
' full.csv


**NOTE:** This solution replaces all of the OP's current tail | sort | awk code.

This generates:

$ head -20 [0-9]*.txt
==> 2-Bathroom:home room.txt <==
2-Bathroom:home room
toilet handle
bath room
ring

==> 2-Police:Enforcement.txt <==
2-Police:Enforcement
innocent, protect
crime

==> 2-Sense of Smell.txt <==
2-Sense of Smell
odorless
musty smell

==> 3-Companies.txt <==
3-Companies
née dresses
Zoë’s food store
many
móvíl résumés
Señor

==> 3-Entrepreneurship.txt <==
3-Entrepreneurship
innovation, daily
countless
old ideas
¡new income streams!
good publicity guru!
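
If you don't want to hard-code the file name, one option (a sketch, assuming the awk program above is saved in a file with the hypothetical name split_by_cat.awk, and that the working directory contains exactly one CSV file) is to let the shell glob supply it:

awk -f split_by_cat.awk ./*.csv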

nkhmeac62#

Using any POSIX awk, regardless of how many output files have to be generated, and without storing all of the input in memory, this also correctly handles escaped double quotes in the input (e.g. it turns a,"foo""bar",b into a,foo"bar,b rather than a,foo""bar,b):

$ cat tst.awk
NR > 1 {
    pos = index($0,",")
    tag = substr($0,1,pos-1)    # tag is everything before first ,
    val = substr($0,pos+1)      # val[ue] is everything after it

    gsub(/^"|"$/,"",val)        # strip field-enclosing quotes
    gsub(/""/,"\"",val)         # de-escape CSV quotes so "" -> "

    out = (tag == "" ? prevTag : tag ) ".txt"
    gsub("/","_",out)           # file names cannot contain / so map to _

    if ( !seen[out]++ ) {
        print tag > out
    }
    print val >> out
    close(out)                  # necessary to avoid "too many open files"

    prevTag = tag
}


$ awk -f tst.awk file.csv

$ head *.txt
==> 2-Bathroom:home room.txt <==
2-Bathroom:home room
toilet handle
bath room
ring

==> 2-Police_Enforcement.txt <==
2-Police/Enforcement
innocent, protect
crime

==> 2-Sense of Smell.txt <==
2-Sense of Smell
odorless
musty smell

==> 3-Companies.txt <==
3-Companies
née dresses
Zoë’s food store
many
móvíl résumés
Señor

==> 3-Entrepreneurship.txt <==
3-Entrepreneurship
innovation, daily
countless
old ideas
¡new income streams!
good publicity guru!


Alternatively, you could use the Decorate-Sort-Undecorate idiom for better efficiency, since it does not have to open/close an output file on every write, only once per tag:

$ cat tst.sh
#!/usr/bin/env bash

awk -v OFS=',' '
    NR > 1 {
        pos = index($0,",")
        tag = substr($0,1,pos-1)    # tag is everything before first ,
        val = substr($0,pos+1)      # val[ue] is everything after it

        print NR, (tag == "" ? prevTag : tag), val
        prevTag = tag
    }
' "${@:--}" |
sort -t',' -k2,2 -k1,1n |
awk -F',' '
    $2 != prevTag {
        close(out)
        out = $2 ".txt"
        gsub("/","_",out)       # file names cannot contain / so map to _
        print $2 > out
        prevTag = $2
    }
    {
        sub(/[^,]+,[^,]+,/,"")  # remove the NR and tag
        gsub(/^"|"$/,"")        # strip field-enclosing quotes
        gsub(/""/,"\"")         # de-escape CSV quotes so "" -> "
        print > out
    }
'


You would call it as ./tst.sh file.csv.
One benefit of the DSU script over storing all of the input in awk and processing it in the END section is that, in the above, only sort has to handle the whole input at once rather than awk, and sort is designed to handle arbitrarily large input by using demand paging, etc.
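
For illustration (derived from the sample data above), the decorated records that the first awk stage hands to sort look like this (row number, filled-in tag, original value):

2,3-Entrepreneurship,"innovation, daily"
3,3-Entrepreneurship,countless
4,2-Police/Enforcement,"innocent, protect"

sort then groups the records by tag (field 2), preserving the original row order within each tag (field 1), and the second awk strips the row number and tag back off and removes the CSV quoting.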

gj3fmq9x3#

Here is a Ruby way to do it:

ruby -r csv -e 'BEGIN{h=Hash.new { |x, key| x[key] = [] }; last_col_1=""}
$<.each_line.with_index{|line,i|
    next if i==0
    lc=CSV.parse(line).flatten
    if lc[0].nil? then col_1=last_col_1 else col_1=lc[0] end
    last_col_1=col_1
    h[col_1] << lc[1]
}
h.each{|k,v| File.open("#{k.gsub(/\//,":")}.txt", "w"){|f| f.write "#{k}\n#{v.join("\n")}"} }
' your_file

(Note: in the file names, the character / from the sample input is replaced with :, producing 2-Police:Enforcement.txt, because 2-Police/Enforcement.txt is not a legal file name.)
Producing:

% head *.txt
==> 2-Bathroom:home room.txt <==
2-Bathroom:home room
toilet handle
bath room
ring
==> 2-Police:Enforcement.txt <==
2-Police/Enforcement
innocent, protect
crime
==> 2-Sense of Smell.txt <==
2-Sense of Smell
odorless
musty smell
==> 3-Companies.txt <==
3-Companies
née dresses
Zoë’s food store
many
móvíl résumés
Señor
==> 3-Entrepreneurship.txt <==
3-Entrepreneurship
innovation, daily
countless
old ideas
¡new income streams!
good publicity guru!

cbwuti444#

You can also solve this with Perl:

perl -F, -lane 'if ($. != 1) { $pos = index($_, ","); $cat = substr($_, 0, $pos); $cat =~ s/\//:/g; $cat = $prevcat if $cat eq ""; $words = substr($_, $pos+1); $words =~ s/^"|"$//g; $cats{$cat} = ($cats{$cat} ? $cats{$cat} : $cat) . "\n" . $words; $prevcat = $cat; } END { while (($cat, $words) = each %cats) { open(my $fh, ">", "$cat.txt") or die "Cannot open $cat.txt: $!"; print $fh $words; close($fh); } }' your_data.csv

Files generated:

$ ls -1 *.txt
2-Bathroom:home room.txt
2-Police:Enforcement.txt
2-Sense of Smell.txt
3-Companies.txt
3-Entrepreneurship.txt


Output:

$ head *.txt
==> 2-Bathroom:home room.txt <==
2-Bathroom:home room
toilet handle
bath room
ring

==> 2-Police:Enforcement.txt <==
2-Police:Enforcement
innocent, protect
crime

==> 2-Sense of Smell.txt <==
2-Sense of Smell
odorless
musty smell

==> 3-Companies.txt <==
3-Companies
née dresses
Zoë’s food store
many
móvíl résumés
Señor

==> 3-Entrepreneurship.txt <==
3-Entrepreneurship
innovation, daily
countless
old ideas
¡new income streams!
good publicity guru!

afdcj2ne5#

Python can handle the CSV data correctly. The program below uses a dict (map) to store each value (col 2) in a list associated with the last-seen category (col 1). This last-seen approach lets values with a missing category be associated with the category above them:

import csv

category_values: dict[str, list[str]] = {}

with open("input.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    next(reader)  # discard header

    category = ""
    for row in reader:
        _category, value = row[0], row[1]

        if _category != "" and _category != category:
            category = _category

        if category not in category_values:
            category_values[category] = []

        category_values[category].append(value)

We can inspect that dict with a pair of loops:

for category, values in category_values.items():
    print(category)
    for value in values:
        print(f"  {value}")


I get:

3-Entrepreneurship
  innovation, daily
  countless
  old ideas
  ¡new income streams!
  good publicity guru!
2-Police/Enforcement
  innocent, protect
  crime
2-Bathroom:home room
  toilet handle
  bath room
  ring
3-Companies
  née dresses
  Zoë’s food store
  many
  móvíl résumés
  Señor
2-Sense of Smell
  odorless
  musty smell


Then a similar pair of loops writes the category values to files of their own. I do some basic file-name sanitizing based on the category:

for category, values in category_values.items():
    fname = category.replace("/", "-").replace(":", "-").replace("\\", "-")
    with open(f"output-{fname}.txt", "w", newline="", encoding="utf-8") as f:
        f.write(category + "\n")
        for value in values:
            f.write(value + "\n")


I then get a list of files like:

output-2-Bathroom-home room.txt
output-2-Police-Enforcement.txt
output-2-Sense of Smell.txt
output-3-Companies.txt
output-3-Entrepreneurship.txt


and output-2-Bathroom-home room.txt looks like:

2-Bathroom:home room
toilet handle
bath room
ring
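
Assuming the three snippets above are combined into one script next to input.csv (the name split_categories.py is just a placeholder) and that python3 is installed, it can be run from Terminal with:

python3 split_categories.py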

vybvopom6#

For completeness, you can also do this in plain bash (with only 1,000 lines, performance should not be an issue). Note that, unlike the awk-based answers, this actually creates the file names you asked for, even the one containing a slash (e.g. 2-Police/Enforcement.txt), by creating the corresponding subdirectory. Like the other answers, it would break if the input CSV contained multi-line records. Also, if the first field of the very first data row were empty, the file created would be named just .txt (see the sketch after the explanation below). Put the following into a file (e.g. ~/bin/csv2txt):

#!/usr/bin/env bash

tail -n+2 "$1" | while IFS=, read -r tmp b; do
  a="${tmp:-$a}"
  [[ "$a" == */* ]] && mkdir -p "${a%/*}"
  [[ -f "$a.txt" ]] || printf '%s\n' "$a" > "$a.txt"
  b="${b#\"}"
  printf '%s\n' "${b%\"}" >> "$a.txt"
done

Make it executable:

chmod +x ~/bin/csv2txt


Then:

~/bin/csv2txt file.csv


Or, if ~/bin is already in your PATH:

csv2txt file.csv


Explanation:

  • tail -n+2 "$1" | while IFS=, read -r tmp b; do: we use tail to drop the first line of the CSV file and pipe the remaining lines into a while loop with the Input Field Separator (IFS) set to a comma. For each line, the first field (before the first comma) is stored in tmp and the rest of the line (after the first comma) in b. read's -r option prevents backslashes from escaping any characters.
  • a="${tmp:-$a}": if tmp is not empty we assign it to a; otherwise (e.g. on line 3, countless) we leave a unchanged.
  • [[ "$a" == */* ]] && mkdir -p "${a%/*}": if a contains a slash (e.g. 2-Police/Enforcement), we create the corresponding directory.
  • [[ -f "$a.txt" ]] || printf '%s\n' "$a" > "$a.txt": if the target text file ("$a.txt") does not exist yet, we print $a into it.
  • b="${b#\"}": we strip any leading " from b.
  • printf '%s\n' "${b%\"}" >> "$a.txt": we strip any trailing " from b and append the value to the target text file.
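
If you want to guard against the edge case mentioned above (a value row appearing before any category has been seen, which would otherwise end up in a file literally named .txt), one possible variant, a sketch rather than part of the original answer, is to simply skip such rows:

#!/usr/bin/env bash

tail -n+2 "$1" | while IFS=, read -r tmp b; do
  a="${tmp:-$a}"
  [[ -z "$a" ]] && continue                       # assumption: rows seen before any category should be dropped
  [[ "$a" == */* ]] && mkdir -p "${a%/*}"
  [[ -f "$a.txt" ]] || printf '%s\n' "$a" > "$a.txt"
  b="${b#\"}"
  printf '%s\n' "${b%\"}" >> "$a.txt"
done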

zy1mlcev7#

To correctly handle quoted values that might contain escaped quotes and so on, you need a language with a proper CSV parser.

#!/usr/bin/env python3

import csv

with open('input.csv') as csvin:
    for row in csv.reader(csvin):
        with open(row[0], 'a') as txtout:
            txtout.write(row[1] + '\n')

Here is a slightly embellished version covering more corner cases. I am assuming that the sample rows with an empty cell in the first column should be skipped, and that tags containing a slash should result in subdirectories.

import csv
from pathlib import Path

with open('input.csv') as csvin:
    # skip header
    reader = csv.reader(csvin)
    reader.__next__()
    for row in reader:
        if row[0]:
            if '/' in row[0]:
                Path(row[0]).parent.mkdir(parents=True, exist_ok=True)
            with open(row[0] + '.txt', 'a+') as txtout:
                txtout.write(row[1] + '\n')


Demo: https://ideone.com/7akFvw
If you wanted the first line of each file to contain a header, that would complicate the code (you would need to check whether the file already exists), but I consider the missing header a feature, not a bug.
If you really needed to optimize this for speed, you should keep as many file handles open as possible, falling back once you open more than the operating system allows (typically on the order of 20). Closing and immediately reopening a file tends to be much slower.
