csv: Fix rows in a TSV file by merging each broken row with the next row

okxuctiv  posted 7 months ago  in  Other

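I have run into a problem with an output file (660 million characters). The process that creates this file appears to store 100,000 characters in its buffer, dump them to the TSV, insert a carriage return, and start a new line. It then places the remaining data on that new line, takes in another 100,000 characters, and repeats the process.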

Example:
```
1 ......48 fields...CR LF
2 ......48 fields...CR LF
3 ......48 fields...CR LF
4 ......37 fields CR LF
5 ......11 fields CR LF
6 ......48 fields...CR LF
7 ......48 fields...CR LF
8 ......48 fields...CR LF
9 ......17 fields CR LF
10 ......21 fields CR LF
```

Rows 1-4 contain 100,000 characters, including the CR LF of every row.
Rows 5-9 contain 100,000 characters, including the CR LF of every row.
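The TSV header row has 48 values and 47 separators. The "broken" rows have fewer than 47 separators. I have created a PowerShell script that gives me the line numbers, and I can even edit the file in Notepad++, but more than 5,000 rows in the file would have to be fixed every day! I am looking for a way to script this. The break seems to occur roughly every 190 rows, but that is not constant, because it depends on the character count. I basically need a way to go to character 100,000, remove the carriage return, and join the next row to that row so that they form one complete 48-field row.
I can't remove all carriage returns and line feeds, because that would create a mess. Any ideas on how to find the first row whose separator count does not match, join it to the next row to repair the data row, and then continue to the next occurrence of the problem and fix that too?
I have tried using PowerShell to replace the current line and remove the carriage return, but when the file is updated the carriage return is put back.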

nzrxty8p  #1

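  • The code below performs line-by-line plain-text processing and uses the count of tab characters on each line to decide which lines need to be joined, which effectively removes the extraneous line breaks (CRLF).
  • The switch statement with the -File parameter allows fast (by PowerShell standards) line-by-line reading.
  • A System.IO.StreamWriter instance is used to write the output file.
  • Character-encoding caveats:
      • System.IO.StreamWriter creates BOM-less UTF-8 files by default, although you can control the encoding explicitly via a constructor argument.
      • If the input file has no BOM, switch -File *always* assumes UTF-8 encoding.[1]
      • If you need to specify a different encoding, use a System.IO.StreamReader instance instead.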
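
The tab-counting test from the first bullet can be tried in isolation. What follows is only a minimal sketch: it assumes a hypothetical sample with 3 expected tabs rather than the real 47, and the names `$rows` and `$broken` are illustrative, not part of the answer's script.

```powershell
# Illustration only: real rows would need 47 tabs (48 fields).
$expectedTabCount = 3

# Hypothetical sample: one intact row, then one row split into two halves.
$rows = @(
  "a`tb`tc`td",   # 3 tabs: expected separator count
  "a`tb",         # 1 tab:  first half of a broken row
  "`tc`td"        # 2 tabs: second half of the broken row
)

$lineNumber = 0
$broken = foreach ($row in $rows) {
  $lineNumber++
  # Delete every non-tab character; the remaining length is the tab count.
  $tabCount = ($row -replace '[^\t]').Length
  if ($tabCount -lt $expectedTabCount) {
    [pscustomobject]@{ Line = $lineNumber; Tabs = $tabCount }
  }
}

# Show the line numbers that would need to be joined with a neighbor.
$broken | Format-Table
```

Note that a row cut inside its last field still has the full tab count, so this test alone cannot flag it; the deferred-write logic in the full script exists to handle that case.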
```powershell
# Specify your input file here.
$inFile = 'in.csv'
# The count of separators (tabs) on each line.
$expectedTabCount = 47

# Create a writer for the output file (must be different from the input file).
# NOTE: Be sure to use a *full path*
$outFileWriter = [System.IO.StreamWriter] "$pwd/out.tsv"

# Initialize helper variables.
$potentiallyCompleteLine = $incompleteLine = $null

# Process the input file line by line, join the split lines as needed,
# and write to the output file.
switch -File $inFile {
  default {
    if ($incompleteLine) {
      # Previous line was incomplete.
      # Join the incomplete line with the current one and write to the file.
      $outFileWriter.WriteLine($incompleteLine + $_)
      $incompleteLine = $null
    } elseif (($_ -replace '[^\t]').Length -lt $expectedTabCount) {
      # Incomplete line
      if ($potentiallyCompleteLine -and $_ -notmatch '\t') {
        # The previous line had the expected count of separators, but was
        # cut in half in the last field.
        $outFileWriter.WriteLine($potentiallyCompleteLine + $_)
        $potentiallyCompleteLine = $null
      } else {
        if ($potentiallyCompleteLine) {
          # Write the previous potentially complete line, which can now be
          # assumed to be complete.
          $outFileWriter.WriteLine($potentiallyCompleteLine)
          $potentiallyCompleteLine = $null
        }
        # Save the current line, which must be joined with the next one.
        $incompleteLine = $_
      }
    } else {
      # Complete line.
      if ($potentiallyCompleteLine) {
        # Write the previous potentially complete line, which can now be
        # assumed to be complete.
        $outFileWriter.WriteLine($potentiallyCompleteLine)
      }
      # This line has the expected number of separators, but the line could still
      # be incomplete if the CRLF was inserted *in the middle of the last field*,
      # so writing must be deferred.
      $potentiallyCompleteLine = $_
    }
  }
}
# Write the last complete line.
if ($potentiallyCompleteLine) { $outFileWriter.WriteLine($potentiallyCompleteLine) }
$outFileWriter.Close()
```
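
After a run, it may be worth checking that every row of the output has the expected separator count. The following is a rough sketch of such a check, not part of the original answer: it builds a small hypothetical sample file (`out-sample.tsv`, 3 tabs per row) in place of the real `out.tsv`, and `$badLines` is an illustrative name.

```powershell
# Illustration only: use 47 and the real output path for the actual file.
$expectedTabCount = 3
$outFile = Join-Path ([System.IO.Path]::GetTempPath()) 'out-sample.tsv'

# Hypothetical stand-in for a repaired output file.
Set-Content -LiteralPath $outFile -Value "a`tb`tc`td", "e`tf`tg`th"

# Collect the numbers of all lines whose tab count differs from the expected one.
$n = 0
$badLines = @(
  switch -File $outFile {
    default {
      $n++
      if (($_ -replace '[^\t]').Length -ne $expectedTabCount) { $n }
    }
  }
)

if ($badLines.Count -eq 0) {
  'All rows have the expected separator count.'
} else {
  "Rows with an unexpected separator count: $($badLines -join ', ')"
}

Remove-Item -LiteralPath $outFile
```

The check uses `-ne` rather than `-lt` so that rows with too many tabs, which could result from joining the wrong pair of lines, are reported as well.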

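[1] Surprisingly, this also applies to the legacy Windows PowerShell edition, which mostly, but not consistently, defaults to the system's active ANSI code page (whereas the modern, cross-platform PowerShell (Core) 7+ edition now consistently defaults to BOM-less UTF-8). For details on the varying defaults in Windows PowerShell, see the bottom section of this answer.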
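
For input in an encoding other than UTF-8, reading through System.IO.StreamReader with an explicit encoding might look like the sketch below. It assumes UTF-16LE purely as an example, writes its own small hypothetical sample file so that it is self-contained, and uses `::new()`, which requires PowerShell 5 or later; `$sampleFile` and `$lines` are illustrative names.

```powershell
# Assumption: the input is UTF-16LE; substitute the encoding your file uses.
$encoding = [System.Text.Encoding]::Unicode

# Create a small hypothetical sample file (full path, as .NET APIs require).
$sampleFile = Join-Path ([System.IO.Path]::GetTempPath()) 'sample-utf16.tsv'
[System.IO.File]::WriteAllLines($sampleFile, [string[]] @("a`tb", "c`td"), $encoding)

# Read the file line by line with the explicit encoding.
$reader = [System.IO.StreamReader]::new($sampleFile, $encoding)
$lines = [System.Collections.Generic.List[string]]::new()
try {
  while ($null -ne ($line = $reader.ReadLine())) {
    # The body of the switch statement's default branch would go here,
    # with $line taking the place of $_.
    $lines.Add($line)
  }
} finally {
  # Release the file handle even if processing fails.
  $reader.Dispose()
}

"Read $($lines.Count) line(s)."
Remove-Item -LiteralPath $sampleFile
```

The `try`/`finally` wrapper is a design choice of this sketch: it ensures the file handle is released even when an error interrupts the loop.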
