使用bash脚本更新CSV的一些问题

owfi6suc  于 5个月前  发布在  其他
关注(0)|答案(4)|浏览(74)

问题是,如果标题字段由逗号分隔的几个单词组成,则无法正确处理该字段。
Bash任务1
公司ABC有员工的帐户创建不受控制的过程。目前的过程意味着添加姓名,电子邮件和其他个人数据手动到accounts.csv文件没有任何规则。部门负责人已决定根据命名约定的实施来改进它。对于新人来说是个好主意,但如何处理当前用户的列表?您已被要求提供帮助。请开发自动化的方式(Bash脚本),并根据当前的accounts.csv及以下创建新的accounts_new.csv文件。
需要更新列名称。名称格式:名字/姓氏的第一个字母“”和所有其他字母“”。需要用domain @abc更新列电子邮件。电子邮件格式:名字和全名的第一个字母“”。如果两个电子邮件地址在其他方面相同,则应通过在@符号前添加location_id的值来消除歧义。
脚本的名称应该是task1.sh
accounts.csv文件的路径应作为脚本的参数。
done的定义:开发了Bash脚本,该脚本自动创建accounts_new.csv并根据上述规则更新列nameemail
来自accounts.csv的故障数据:

id,location_id,name,title,email,department
3,2,Brenda Brown,Director, Second Career Services,,
4,3,Howard Lader,Manager, Senior Counseling,,
5,4,Kimberly Pesavento,Commercial Director,,
18,11,Dave Genesy,Head of Office,,
22,14,Andres Espinoza,Manager, Commanding Officer,,

字符串
一定是这样的:

*what I need* - '3,2,Brenda Brown,"Director, Second Career Services",[email protected],\n',                 
  what I have -  '3,2,Brenda Brown,Director,[email protected],\n',
  *what I need* -  '4,3,Howard Lader,"Manager, Senior Counseling",[email protected],\n',
  what I have -  '4,3,Howard Lader,Manager,[email protected],\n',
 * what I need* -  '5,4,Kimberly Pesavento,Commercial director,[email protected],\n',
  what I have - '5,4,Kimberly Pesavento,Commercial Director,[email protected],\n',
  *what I need* - '18,11,Dave Genesy,Head of office,[email protected],\n',             
  what I have - '18,11,Dave Genesy,Head of Office,[email protected],\n',
  *what I need* - '22,14,Andres Espinoza,"Manager, Commanding Officer",[email protected],\n',                        
  what I have - '22,14,Andres Espinoza,Manager,[email protected],\n',
#!/bin/bash
function format_name() {
    local first_name last_name
    # Разбиение полного имени на имя и фамилию, приведение к нижнему регистру
    read -r first_name last_name <<<"$(echo "$1" | tr '[:upper:]' '[:lower:]')"
    # Преобразование первой буквы имени и фамилии к верхнему регистру
    first_name=$(tr '[:lower:]' '[:upper:]' <<<"${first_name:0:1}")${first_name:1}
    last_name=$(tr '[:lower:]' '[:upper:]' <<<"${last_name:0:1}")${last_name:1}

    if [[ $last_name == *-* ]]; then
        hyphen_pos=$(expr index "$last_name" "-")
        last_name=${last_name:0:hyphen_pos}$(tr '[:lower:]' '[:upper:]' <<<"${last_name:$hyphen_pos:1}")${last_name:$hyphen_pos+1}
    fi
    echo "$first_name $last_name"
}

function is_exception() {
    local word=$1
    local exceptions=("Office" "for" "and" "the" "of" "new") # Список исключений
    for exception in "${exceptions[@]}"; do
        if [[ $word == $exception ]]; then
            return 0 # Слово является исключением
        fi
    done
    return 1 # Слово не является исключением
}

function format_title() {
    local title="$1"
    local formatted_title=""
    local word

    # Преобразование каждого слова в заголовке
    for word in $title; do
        if is_exception "$word"; then
            # Преобразование слова-исключения к нижнему регистру
            formatted_title+=" $(tr '[:upper:]' '[:lower:]' <<<"$word")"
        else
            # Преобразование первой буквы слова к верхнему регистру
            formatted_title+=" $(tr '[:lower:]' '[:upper:]' <<<"${word:0:1}")${word:1}"
        fi
    done
    echo "${formatted_title:1}" # Удаление начального пробела
}

function generate_email() {
    local alias=$1
    local location_id=$2
    local domain="@abc.com"
    email=${alias}

    # Добавление location_id, если адрес уже встречался
    if [[ ${alias_counts[$alias]} -gt 1 ]]; then
        email=${email}${location_id}
    fi

    email=${email}${domain}
    echo "$email"
}

function create_email_alias() {
    local first_name="${1%% *}"
    local last_name="${1#* }"

    alias=$(echo "${first_name:0:1}${last_name}" | tr '[:upper:]' '[:lower:]')
    echo "$alias"
}

if [ $# -ne 1 ]; then
    echo "Usage: $0 <accounts.csv>"
    exit 1
fi

declare -A alias_counts
alias_array=()

input_file=$1
output_file="accounts_new.csv"

read -r header <"$input_file"
echo "$header" >"$output_file"

skip_header=true

while IFS=',' read -r id location_id name title email department; do
    if $skip_header; then
        skip_header=false
        continue
    fi

    alias=$(create_email_alias "$name")
    alias_array+=("$alias")
    ((alias_counts[$alias]++))
done <"$input_file"

skip_header=true

while IFS=',' read -r id location_id name title email department; do
    if $skip_header; then
        skip_header=false
        continue
    fi

    # Форматирование имени и заголовка
    formatted_name=$(format_name "$name")
    formatted_title=$(format_title "$title")
    formatted_email=$(generate_email "$(create_email_alias "$name")" "$location_id")

    # Запись отформатированных данных в выходной файл
    echo "$id,$location_id,$formatted_name,$formatted_title,$formatted_email,$department" >>"$output_file"
done <"$input_file"

echo "Processing done. $output_file created."
hfwmuf9z

hfwmuf9z1#

不要使用shell读取循环,参见why-is-using-a-shell-loop-to-process-text-considered-bad-practice
这是一个可能的开始:

$ cat tst.sh
#!/usr/bin/env bash

awk '
    BEGIN { FS = OFS = "," }
    NR == 1 {
        for ( i=1; i<=NF; i++ ) {
            tags2fldNrs[$i] = i
            fldNrs2tags[i] = $i
        }
        badFldNrsBeg = tags2fldNrs["title"]
        nf = NF
        next
    }
    {
        badFldNrsEnd = badFldNrsBeg + (NF - nf)

        fldNr = 0
        for ( i=1; i<=NF; i++ ) {
            if ( (i <= badFldNrsBeg) || (badFldNrsEnd < i) ) {
                vals[++fldNr] = $i
            }
            else {
                vals[fldNr] = vals[fldNr] FS $i
            }
        }
        vals[badFldNrsBeg] = "\"" vals[badFldNrsBeg] "\""

        nameFldNr = tags2fldNrs["name"]
        emailFldNr = tags2fldNrs["email"]

        n = split(vals[nameFldNr],name," ")
        vals[emailFldNr] = tolower(substr(name[1],1,1) name[n]) "@abc.com"

        for ( i=1; i<=nf; i++ ) {
            printf "%s%s", vals[i], (i<nf ? OFS : ORS)
        }
    }
' "${@:--}"

字符串

$ ./tst.sh accounts.csv
3,2,Brenda Brown,"Director, Second Career Services",[email protected],
4,3,Howard Lader,"Manager, Senior Counseling",[email protected],
5,4,Kimberly Pesavento,"Commercial Director",[email protected],
18,11,Dave Genesy,"Head of Office",[email protected],
22,14,Andres Espinoza,"Manager, Commanding Officer",[email protected],


你应该做任何你需要做的事情来确保电子邮件地址是唯一的,至少添加代码来确保任何包含逗号或双引号的字段都用双引号括起来,并且字段内的任何双引号都用双引号来转义。有关详细信息,请参阅What's the most robust way to efficiently parse CSV using awk?
FWIW我写了上面的内容,作为OP关于如何使用awk处理此类问题的有用学习经验,但是如果我们可以假设输入总是看起来像OP发布的,那么它可能只是:

$ awk '
    BEGIN { FS=OFS="," }
    NR == 1 { nf = NF; next }
    {
        d=(NF-nf); $4="\""$4; $(4+d)=$(4+d)"\""
        split($3,n," ")
        $(NF-1) = tolower(substr(n[1],1,1) n[2])"@abc.com"
    }
1' accounts.csv
3,2,Brenda Brown,"Director, Second Career Services",[email protected],
4,3,Howard Lader,"Manager, Senior Counseling",[email protected],
5,4,Kimberly Pesavento,"Commercial Director",[email protected],
18,11,Dave Genesy,"Head of Office",[email protected],
22,14,Andres Espinoza,"Manager, Commanding Officer",[email protected],

nzkunb0c

nzkunb0c2#

我没有试图理解或调试你的代码。这里有一个简单得多的解决方案,只使用 * 几乎 * Bash内置。也许特别注意parameter expansions.
(我在理解了这个要求后,更新了这个方法,以便对电子邮件地址进行重复检测,我提出的快速修复方法使用了几个标准的外部实用程序。它可以在原生Bash中使用关联数组来完成,我猜你的代码已经尝试过这样做了。)

#!/bin/bash

capitalize () {
    local sep=''
    local field
    for field in "$@"; do
        field=${field,,}
        printf "%s%s" "$sep" "${field^}"
        sep=' '
    done
    printf '\n'
}

name2e () {
    printf '%0.1s%s\n' "${1,,}" "${2,,}"
}

get_duplicates () {
    cut -d, -f3 "$1" |
    sed 's/^\([^ ]\)[^ ]* /\1/' |
    tr A-Z a-z |
    sort |
    uniq -d
}

duplicates=$(get_duplicates "$1")
# A single newline in a string
nl='
'

first=true
while read -r line; do
    case $first in
        true) echo "$line"; first=false; continue;;
    esac
    for field in id location_id name; do
        printf -v "$field" '%s' "${line%%,*}"
        line=${line#"${!field}",}
    done
    for field in email department; do
        printf -v "$field" '%s' "${line##*,}"
        line=${line%",${!field}"}
    done
    case $email in
        '') email="$(name2e $name)";;
    esac
    case $nl$duplicates$nl in
        *"$nl$email$nl"*)
            email=$email$location_id;;
    esac
    case $line in
        *,*) title="\"$line\"";;
        *) title="$line";;
    esac
    printf "%i,%i,%s,%s,%s,%s\n" \
           "$id" "$location_id" "$(capitalize $name)" \
           "$title" "[email protected]" "$department"
done <"$1"

字符串
这从开头和结尾各去掉两个字段,并假定剩下的就是标题。
这里唯一比较先进的技术是使用printf -v "$field"来创建变量,该变量的名称是$field的值,间接引用${!field}指向另一个方向。
我并不特别喜欢我通过避免引用函数的参数来节省一些时间,但这是一种利用您提供的有限数据解决这个问题的快速方法。希望它至少可以激励您研究更复杂的角落情况,如果您需要处理它们(如“Right Hon. August Prinz von der Wienerschnitzel”和其他Falsehoods Programmers Believe about Names的插图)。
演示:https://ideone.com/mNpGwb

3zwtqj6y

3zwtqj6y3#

使用一种具有CSV解析器的语言,这样的事情就容易多了。
下面是一个Ruby的例子:

ruby -r csv -e '
inp=CSV.parse($<.read)[1..]
dups=inp.map{|row| fn,ln=row[2].downcase.split; "#{fn[0]+ln}@abc.com"}.tally
puts CSV.generate{|csv|
    inp.each{|row| fn,ln=row[2].downcase.split
        email=dups["#{fn[0]+ln}@abc.com"]>1 ? 
            "#{fn[0]+ln+row[1]}@abc.com" : "#{fn[0]+ln}@abc.com"
        csv<<row[0..2]+[row[4].nil? ? row[3] : row[3..4].join(","), 
                            email,nil] 
        }
}' file

字符串
打印(以OP为例):

3,2,Brenda Brown,"Director, Second Career Services",[email protected],
4,3,Howard Lader,"Manager, Senior Counseling",[email protected],
5,4,Kimberly Pesavento,Commercial Director,[email protected],
18,11,Dave Genesy,Head of Office,[email protected],
22,14,Andres Espinoza,"Manager, Commanding Officer",[email protected],


以Tripleee为例:

3,2,Brenda Brown,"Director, Second Career Services",[email protected],
4,3,Howard Lader,"Manager, Senior Counseling",[email protected],
5,4,Kimberly Pesavento,Commercial Director,[email protected],
18,11,Dave Genesy,Head of Office,[email protected],
22,14,Andres Espinoza,"Manager, Commanding Officer",[email protected],
42,17,Danielle Genesy,Master of the Known Universe,[email protected],


Python、Perl或最近的GNU Awk都是类似的。

pbwdgjma

pbwdgjma4#

如果你想使用CSV感知工具,你可以使用Miller
您的示例CSV文件格式不正确,请参阅下图(右侧更正)。在您的文件中,部门信息位于电子邮件字段中。


的数据
从yout示例文件开始

id,location_id,name,title,email,department
3,2,Brenda Brown,Director, Second Career Services,,
4,3,Howard Lader,Manager, Senior Counseling,,
5,4,Kimberly Pesavento,Commercial Director,,
18,11,Dave Genesy,Head of Office,,
22,14,Andres Espinoza,Manager, Commanding Officer,,

字符串
您可以运行

mlr --ho --csv --ragged unsparsify then put '$department=$email;$email="to@set"' then \
remove-empty-columns then put '$email=""' then \
put '
$title_department=sub(joinv([$title,$department],","),",$","");
$email=tolower(substr0($name,0,0).splita($name," ")[-1])."@abc.com"
' then \
cut -o -f id,location_id,name,title_department,email start.csv


得到

3,2,Brenda Brown,"Director, Second Career Services",[email protected]
4,3,Howard Lader,"Manager, Senior Counseling",[email protected]
5,4,Kimberly Pesavento,Commercial Director,[email protected]
18,11,Dave Genesy,Head of Office,[email protected]
22,14,Andres Espinoza,"Manager, Commanding Officer",[email protected]


一些注解:

  • --ragged,因为某些行具有与标题行不同的字段
  • $title_department=sub(joinv([$title,$department],","),",$","")连接两个字段,并删除结尾处的逗号
  • $email=tolower(substr0($name,0,0).splita($name," ")[-1])."@abc.com"获取name的第一个字符,并将其与name字段中的最后一个单词(姓氏)连接起来

相关问题