csv: Repair rows in a TSV file by merging them with the next row

okxuctiv posted 7 months ago in Other

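I have run into a problem with an output file (660 million characters). The process that created this file appears to hold 100,000 characters in its buffer, dump them to the TSV, insert a carriage return, and start a new line. It then places the remaining data on that new line, takes in the next 100,000 characters, and repeats the process.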

1 ......48 fields...CR LF
2 ......48 fields...CR LF
3 ......48 fields...CR LF
4 ......37 fields CR LF
5 ......11 fields CR LF
6 ......48 fields...CR LF
7 ......48 fields...CR LF
8 ......48 fields...CR LF
9 ......17 fields CR LF
10 ......21 fields CR LF

Rows 1-4 contain 100,000 characters, including the CR LF of every row.
Rows 5-9 contain 100,000 characters, including the CR LF of every row.
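
One way to confirm this pattern before repairing anything is to tally how many rows have each field count. The following is only a sketch: the function name `Get-FieldCountTally` is hypothetical, and it assumes that fields are separated by tabs, so that a row's field count is its tab count plus one.

```powershell
# Hypothetical helper (not from the original post): tally how many rows
# have each field count, where field count = number of tabs + 1.
function Get-FieldCountTally {
  param([Parameter(Mandatory)] [string] $Path)

  $tally = @{}
  # switch -File reads the file line by line.
  switch -File $Path {
    default {
      # Remove everything that is not a tab; the remaining length is the tab count.
      $fieldCount = ($_ -replace '[^\t]').Length + 1
      # A missing key yields $null, and 1 + $null evaluates to 1.
      $tally[$fieldCount] = 1 + $tally[$fieldCount]
    }
  }
  $tally
}
```

Calling `Get-FieldCountTally -Path "$pwd/in.csv"` returns a hashtable keyed by field count. In a file damaged as described, most rows should fall under 48, with the split rows showing up under smaller counts.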



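  • The code below performs line-by-line plain-text processing and uses the count of tab characters on each line to decide which lines need to be joined, which effectively removes the extraneous line breaks (CRLF).
  • A switch statement with the -File parameter allows fast (by PowerShell standards) line-by-line reading.
  • A System.IO.StreamWriter instance is used to write the output file.
  • Character-encoding caveats:
      • System.IO.StreamWriter creates BOM-less UTF-8 files by default, though you can control the encoding explicitly via a constructor parameter.
      • If the input file has no BOM, switch -File *always* assumes UTF-8 encoding.[1]
      • If you need to specify a different encoding, use a System.IO.StreamReader instance instead.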
# Specify your input file here.
$inFile = 'in.csv'
# The count of separators (tabs) on each line.
$expectedTabCount = 47 

# Create a writer for the output file (must be different from the input file).
# NOTE: Be sure to use a *full path*, because .NET's working directory
#       usually differs from PowerShell's current location.
$outFileWriter = [System.IO.StreamWriter] "$pwd/out.tsv"

# Initialize helper variables.
$potentiallyCompleteLine = $incompleteLine = $null

# Process the input file line by line, join the split lines as needed,
# and write to the output file.
switch -File $inFile {
  default {
    if ($incompleteLine) {
      # Previous line was incomplete.
      # Join the incomplete line with the current one and write to the file.
      $outFileWriter.WriteLine($incompleteLine + $_)
      $incompleteLine = $null
    } elseif (($_ -replace '[^\t]').Length -lt $expectedTabCount) {
      # Incomplete line
      if ($potentiallyCompleteLine -and $_ -notmatch '\t') {
        # The previous line had the expected count of separators, but was
        # cut in half in the last field.
        $outFileWriter.WriteLine($potentiallyCompleteLine + $_)
        $potentiallyCompleteLine = $null
      } else {
        if ($potentiallyCompleteLine) {
          # Write the previous potentially complete line, which can now be
          # assumed to be complete.
          $outFileWriter.WriteLine($potentiallyCompleteLine)
          $potentiallyCompleteLine = $null
        }
        # Save the current line, which must be joined with the next one.
        $incompleteLine = $_
      }
    } else {
      # Complete line.
      if ($potentiallyCompleteLine) {
        # Write the previous potentially complete line, which can now be
        # assumed to be complete.
        $outFileWriter.WriteLine($potentiallyCompleteLine)
      }
      # This line has the expected number of separators, but the line could still
      # be incomplete if the CRLF was inserted *in the middle of the last field*,
      # so writing must be deferred.
      $potentiallyCompleteLine = $_
    }
  }
}
# Write the last complete line.
if ($potentiallyCompleteLine) { $outFileWriter.WriteLine($potentiallyCompleteLine) }

# Flush any buffered output and release the output file.
$outFileWriter.Close()
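
If the input is not UTF-8 and has no BOM, the reading side can be swapped for a System.IO.StreamReader with an explicit encoding, as mentioned in the caveats. The following is a minimal sketch rather than a drop-in replacement: the function name `Copy-LinesWithEncoding` is hypothetical, UTF-16LE is only an assumed input encoding, and the loop simply copies lines where the join logic would go.

```powershell
# Hypothetical helper (not from the original post): read a file line by line
# with an explicit input encoding and write it back out as BOM-less UTF-8.
function Copy-LinesWithEncoding {
  param(
    [Parameter(Mandatory)] [string] $InPath,    # use full paths for .NET calls
    [Parameter(Mandatory)] [string] $OutPath,   # must differ from $InPath
    # Assumption for illustration: the input is UTF-16LE ("Unicode" in .NET).
    [System.Text.Encoding] $InEncoding = [System.Text.Encoding]::Unicode,
    # $false = do not emit a BOM.
    [System.Text.Encoding] $OutEncoding = [System.Text.UTF8Encoding]::new($false)
  )

  $reader = $writer = $null
  try {
    $reader = [System.IO.StreamReader]::new($InPath, $InEncoding)
    # Second argument $false = overwrite rather than append.
    $writer = [System.IO.StreamWriter]::new($OutPath, $false, $OutEncoding)

    # ReadLine() returns $null at the end of the file.
    while ($null -ne ($line = $reader.ReadLine())) {
      # The per-line join logic from the switch block would go here,
      # with $line taking the place of $_.
      $writer.WriteLine($line)
    }
  } finally {
    # Dispose() also flushes the writer's buffer.
    if ($reader) { $reader.Dispose() }
    if ($writer) { $writer.Dispose() }
  }
}
```

A call might look like `Copy-LinesWithEncoding -InPath "$pwd/in.csv" -OutPath "$pwd/out.tsv"`. The try / finally block ensures both files are closed even if an error occurs partway through.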

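[1] Surprisingly, this also applies to the legacy Windows PowerShell edition, which mostly, but not consistently, defaults to the system's active ANSI code page (whereas the modern, cross-platform PowerShell (Core) 7+ edition now consistently defaults to BOM-less UTF-8). For details on the varying defaults in Windows PowerShell, see the bottom section of this answer.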
