用Python重组csv的更有效方法

wfypjpf4  于 5个月前  发布在  Python
关注(0)|答案(1)|浏览(63)

为了学习Python,我做了一个重新组织csv的小项目。
1.我想删除以# STN开头的标题之前的所有行
1.然后提取日期YYYYMMDD列,并重新格式化它
1.并提取RH列,该列包含我感兴趣的值
1.要将它们写入一个新的csv文件,我想加载到influxdb中,

BRON: KONINKLIJK NEDERLANDS METEOROLOGISCH INSTITUUT (KNMI)
Opmerking: door stationsverplaatsingen en veranderingen in waarneemmethodieken zijn deze tijdreeksen van dagwaarden mogelijk inhomogeen! Dat betekent dat deze ree
ks van gemeten waarden niet geschikt is voor trendanalyse. Voor studies naar klimaatverandering verwijzen we naar de gehomogeniseerde reeks maandtemperaturen van
De Bilt <http://www.knmi.nl/kennis-en-datacentrum/achtergrond/gehomogeniseerde-reeks-maandtemperaturen-de-bilt> of de Centraal Nederland Temperatuur <http://www.k
nmi.nl/kennis-en-datacentrum/achtergrond/centraal-nederland-temperatuur-cnt>.
SOURCE: ROYAL NETHERLANDS METEOROLOGICAL INSTITUTE (KNMI)
Comment: These time series are inhomogeneous because of station relocations and changes in observation techniques. As a result, these series are not suitable for
trend analysis. For climate change studies we refer to the homogenized series of monthly temperatures of De Bilt <http://www.knmi.nl/kennis-en-datacentrum/achterg
rond/gehomogeniseerde-reeks-maandtemperaturen-de-bilt> or the Central Netherlands Temperature <http://www.knmi.nl/kennis-en-datacentrum/achtergrond/centraal-neder
land-temperatuur-cnt>.
YYYYMMDD  = Datum (YYYY=jaar MM=maand DD=dag) / Date (YYYY=year MM=month DD=day)
DDVEC     = Vectorgemiddelde windrichting in graden (360=noord, 90=oost, 180=zuid, 270=west, 0=windstil/variabel). Zie http://www.knmi.nl/kennis-en-datacentrum/achtergrond/klimatologische-brochures-en-boeken / Vector mean wind direction in degrees (360=north, 90=east, 180=south, 270=west, 0=calm/variable)
FHVEC     = Vectorgemiddelde windsnelheid (in 0.1 m/s). Zie http://www.knmi.nl/kennis-en-datacentrum/achtergrond/klimatologische-brochures-en-boeken / Vector mean windspeed (in 0.1 m/s)
FG        = Etmaalgemiddelde windsnelheid (in 0.1 m/s) / Daily mean windspeed (in 0.1 m/s)
FHX       = Hoogste uurgemiddelde windsnelheid (in 0.1 m/s) / Maximum hourly mean windspeed (in 0.1 m/s)
FHXH      = Uurvak waarin FHX is gemeten / Hourly division in which FHX was measured
FHN       = Laagste uurgemiddelde windsnelheid (in 0.1 m/s) / Minimum hourly mean windspeed (in 0.1 m/s)
FHNH      = Uurvak waarin FHN is gemeten / Hourly division in which FHN was measured
FXX       = Hoogste windstoot (in 0.1 m/s) / Maximum wind gust (in 0.1 m/s)
FXXH      = Uurvak waarin FXX is gemeten / Hourly division in which FXX was measured
TG        = Etmaalgemiddelde temperatuur (in 0.1 graden Celsius) / Daily mean temperature in (0.1 degrees Celsius)
TN        = Minimum temperatuur (in 0.1 graden Celsius) / Minimum temperature (in 0.1 degrees Celsius)
TNH       = Uurvak waarin TN is gemeten / Hourly division in which TN was measured
TX        = Maximum temperatuur (in 0.1 graden Celsius) / Maximum temperature (in 0.1 degrees Celsius)
TXH       = Uurvak waarin TX is gemeten / Hourly division in which TX was measured
T10N      = Minimum temperatuur op 10 cm hoogte (in 0.1 graden Celsius) / Minimum temperature at 10 cm above surface (in 0.1 degrees Celsius)
T10NH     = 6-uurs tijdvak waarin T10N is gemeten / 6-hourly division in which T10N was measured; 6=0-6 UT, 12=6-12 UT, 18=12-18 UT, 24=18-24 UT
SQ        = Zonneschijnduur (in 0.1 uur) berekend uit de globale straling (-1 voor <0.05 uur) / Sunshine duration (in 0.1 hour) calculated from global radiation (-1 for <0.05 hour)
SP        = Percentage van de langst mogelijke zonneschijnduur / Percentage of maximum potential sunshine duration
Q         = Globale straling (in J/cm2) / Global radiation (in J/cm2)
DR        = Duur van de neerslag (in 0.1 uur) / Precipitation duration (in 0.1 hour)
RH        = Etmaalsom van de neerslag (in 0.1 mm) (-1 voor <0.05 mm) / Daily precipitation amount (in 0.1 mm) (-1 for <0.05 mm)
RHX       = Hoogste uursom van de neerslag (in 0.1 mm) (-1 voor <0.05 mm) / Maximum hourly precipitation amount (in 0.1 mm) (-1 for <0.05 mm)
RHXH      = Uurvak waarin RHX is gemeten / Hourly division in which RHX was measured
PG        = Etmaalgemiddelde luchtdruk herleid tot zeeniveau (in 0.1 hPa) berekend uit 24 uurwaarden / Daily mean sea level pressure (in 0.1 hPa) calculated from 24 hourly values
PX        = Hoogste uurwaarde van de luchtdruk herleid tot zeeniveau (in 0.1 hPa) / Maximum hourly sea level pressure (in 0.1 hPa)
PXH       = Uurvak waarin PX is gemeten / Hourly division in which PX was measured
PN        = Laagste uurwaarde van de luchtdruk herleid tot zeeniveau (in 0.1 hPa) / Minimum hourly sea level pressure (in 0.1 hPa)
PNH       = Uurvak waarin PN is gemeten / Hourly division in which PN was measured
VVN       = Minimum opgetreden zicht / Minimum visibility; 0: <100 m, 1:100-200 m, 2:200-300 m,..., 49:4900-5000 m, 50:5-6 km, 56:6-7 km, 57:7-8 km,..., 79:29-30 km, 80:30-35 km, 81:35-40 km,..., 89: >70 km)
VVNH      = Uurvak waarin VVN is gemeten / Hourly division in which VVN was measured
VVX       = Maximum opgetreden zicht / Maximum visibility; 0: <100 m, 1:100-200 m, 2:200-300 m,..., 49:4900-5000 m, 50:5-6 km, 56:6-7 km, 57:7-8 km,..., 79:29-30 km, 80:30-35 km, 81:35-40 km,..., 89: >70 km)
VVXH      = Uurvak waarin VVX is gemeten / Hourly division in which VVX was measured
NG        = Etmaalgemiddelde bewolking (bedekkingsgraad van de bovenlucht in achtsten, 9=bovenlucht onzichtbaar) / Mean daily cloud cover (in octants, 9=sky invisible)
UG        = Etmaalgemiddelde relatieve vochtigheid (in procenten) / Daily mean relative atmospheric humidity (in percents)
UX        = Maximale relatieve vochtigheid (in procenten) / Maximum relative atmospheric humidity (in percents)
UXH       = Uurvak waarin UX is gemeten / Hourly division in which UX was measured
UN        = Minimale relatieve vochtigheid (in procenten) / Minimum relative atmospheric humidity (in percents)
UNH       = Uurvak waarin UN is gemeten / Hourly division in which UN was measured
EV24      = Referentiegewasverdamping (Makkink) (in 0.1 mm) / Potential evapotranspiration (Makkink) (in 0.1 mm)

# STN,YYYYMMDD,DDVEC,FHVEC,   FG,  FHX, FHXH,  FHN, FHNH,  FXX, FXXH,   TG,   TN,  TNH,   TX,  TXH, T10N,T10NH,   SQ,   SP,    Q,   DR,   RH,  RHX, RHXH,   PG,   PX,  PXH,   PN,  PNH,  VVN, VVNH,  VVX, VVXH,   NG,   UG,   UX,  UXH,   UN,  UNH, EV24

  380,20231209,  197,   57,   64,   90,   14,   40,    1,  170,   14,   81,   65,    2,  110,   20,   57,    6,    0,    0,  112,   52,   42,   15,   14,10025,10122,    1, 9934,   17,   57,   14,   75,   10,    8,   85,   93,   18,   73,   24,    2
  380,20231210,  228,   70,   73,   90,   17,   60,    8,  170,    3,   95,   79,   18,  107,   23,   77,    6,    4,    5,  192,   31,   19,    8,   18,10045,10088,   10,10009,   19,   57,   18,   74,   16,    8,   79,   93,   18,   68,    2,    3
  380,20231211,  238,   53,   55,   90,    4,   30,   15,  160,    4,   93,   75,   24,  106,    2,   67,   24,    0,    0,  106,   23,   26,    8,   14,10046,10093,   22,10001,    4,   39,   13,   75,   18,    8,   83,   96,   14,   77,   20,    2
  380,20231212,  190,   32,   34,   50,   21,   30,    1,  100,   21,   91,   62,    2,  112,   14,   48,    6,    2,    2,  165,   19,   42,   32,   21,10029,10089,    1, 9979,   24,   60,    2,   70,    7,    8,   90,   95,    2,   81,   10,    2
  380,20231213,  207,   19,   28,   40,    1,   10,   17,   80,    1,   78,   61,   19,   93,   12,   37,   18,    0,    0,  164,   32,   20,    5,    4, 9997,10077,   24, 9965,    6,   32,   19,   70,   11,    8,   93,   98,   19,   83,   11,    2

字符串
下面的代码可以工作,但我很想知道它是否可以简化(从而可能更有效),而不使用pandas(我不能在需要处理csv的服务器上安装pandas)。

import csv
csv_path = 'etmgeg_380_test.txt'

csv_header_row = []
csv_columns = {
  'YYYYMMDD': {'header': '_time', 'column': None},
  'RH': {'header': '_value', 'column': None},
  'field': {'header': '_field', 'column': None, 'value': 'value'}
  }
keep_c = []
with open(csv_path, newline='') as txtfile:
    reader = csv.reader(txtfile, delimiter=',')
    r = 0
    csv_clean = []
    header_r = 9999
    for row in reader:
        r += 1
        if row and row[0][:1] == "#":
          header_r = r
          c = 0
          for column in row:
            c += 1
            csv_header = column.strip()
            if csv_header in csv_columns.keys():
              csv_header_row.append(csv_columns[csv_header]['header'])
              csv_columns[csv_header]['column'] = c
              keep_c.append(c)
          for key in csv_columns.keys():
            if csv_columns[key]['header'] not in csv_header_row:
              csv_header_row.append(csv_columns[key]['header'])
          print(csv_header_row)
        if row and r > header_r:
          row_clean = []
          c = 0
          for column in row:
            c += 1
            if c in keep_c:
              value = None
              if c == csv_columns['YYYYMMDD']['column']:
                value = column[:4] + '-' + column[4:6] + '-' + column[6:]
              elif c == csv_columns['RH']['column']:
                value = None if column.isspace() else int(column)
              row_clean.append(value)
          for key in csv_columns.keys():
            if csv_columns[key]['column'] is None:
              row_clean.append(csv_columns[key]['value'])
          csv_clean.append(row_clean)

with open (csv_path[:-3] + 'csv', 'w', newline = '') as csvfile:
  writer = csv.DictWriter(csvfile, fieldnames = csv_header_row)
  writer.writeheader()
  for row in csv_clean: writer.writerow(dict(zip(csv_header_row, row)))

mrfwxfqh

mrfwxfqh1#

做得好。我建议从下面的东西开始。这不一定更有效。我认为它更简洁(不依赖于太多的Python oneliner)。
对我来说,两个主要的区别是:

  • 将文件的第一部分作为文本处理,然后在达到真实的数据时切换到CSV模式
  • 强制DictReader使用特定的字段名,并使用输出词典列表的第一行来启动DictWriter

我认为利用DictReader和DictWriter的这两个方面将减少大量的循环和if块。这也意味着你只需要在一个地方声明新的字段名,在read-row循环中。

import csv
from datetime import datetime

new_rows = []

with open("input.txt", newline="") as f:
    # Treat f as txt to start
    line = ""
    for line in f:
        if line.startswith("# STN"):
            # Found header
            break

    fieldnames = [x.strip() for x in line.split(",")]

    next(f)  # discard empty line immediately after header

    # Now f can be treated as CSV
    reader = csv.DictReader(f, fieldnames=fieldnames, skipinitialspace=True)

    for row in reader:
        dt = datetime.strptime(row["YYYYMMDD"], r"%Y%m%d")

        new_rows.append(
            {
                "New_Date": dt.strftime(r"%d/%m/%Y"),
                "RH": row["RH"],
            }
        )

with open("output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=new_rows[0])  # use the first row's dict to set the fieldnames
    writer.writeheader()
    writer.writerows(new_rows)

字符串

相关问题