pandas: Performing sklearn.linear_model ridge regression on xarray data containing NaN elements

ny6fqffe posted 6 months ago in Other

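I am trying to perform multiple linear regression with ridge regularization to determine the relationship between some spatial variables in an xarray Dataset. Because these are observational data, there are occasional NaN values, which sklearn cannot handle natively. I tried data.interpolate_na(fill_value='extrapolate'), but it did not replace all of the NaN values. A workable solution is data.fillna(0), but that is potentially problematic: I would rather not 'invent' data out of whole cloth, so something like interpolation or masking would be better. My code is shown below: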

from sklearn.linear_model import Ridge
import xarray as xr
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import cartopy.crs as ccrs

def detrend_dim(da, dim, deg=1): # removes (temporal) trends from data
    # detrend along a single dimension
    p = da.polyfit(dim=dim, deg=deg)
    fit = xr.polyval(da[dim], p.polyfit_coefficients)
    return da - fit

#loading and preprocessing data
data = xr.open_dataset('reanalysisdata.nc')
data2 = xr.open_dataset('satellitedata.nc')
land = xr.open_dataset('twoptfivedeglandmask.nc')
data2.coords['mask'] = (('lat', 'lon'), land.FRLAND.mean(dim='time').data)
data2 = data2.where(data2.mask == 0)
data.coords['mask'] = (('lat', 'lon'), land.FRLAND.mean(dim='time').data)
data = data.where(data.mask == 0)

#trying to deal with NaN values
data = data.fillna(0)
data2 = data2.fillna(0)

#removing time trends
data['x1'] = detrend_dim(data.x1, 'time')
data['x2'] = detrend_dim(data.x2, 'time')
data['x3'] = detrend_dim(data.x3, 'time')
data2['y'] = detrend_dim(data2.y, 'time')

#removing seasonal cycle
dataM = data.groupby('time.month')
dataU = dataM - dataM.mean()
data2M = data2.groupby('time.month')
data2U = data2M - data2M.mean()

#performing regression
sel = dataU.isel(lev=2)  # select the level once instead of inside the loop
a = np.zeros((72,144,3))
for i in range(len(data.lat)):
    for j in range(len(data.lon)):
        # stack predictors as columns -> shape (n_time, 3);
        # np.array((x1, x2, x3)).reshape(108, 3) would scramble the samples
        X = np.column_stack((sel.x1.values[:,i,j],
                             sel.x2.values[:,i,j],
                             sel.x3.values[:,i,j]))
        a[i,j,:] = Ridge().fit(X, data2U.y.values[:,i,j]).coef_
dataU = data.assign_coords(varname=['x1','x2','x3'])
dataU['multiple_reg_coeff'] = (('lat','lon','varname'), a)

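Do I need to iteratively check each variable for NaNs? There are regressions in sklearn.linear_model that can handle NaN natively, but I understand the math behind them less well than I understand ridge regression.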

yr9zkbsy1#

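sklearn handles NaN values through imputation, as shown in this answer: https://stackoverflow.com/a/33114098/15791525
Although imputing data is not the same as 'inventing' it, you may still want to avoid introducing new values into the dataset. In that case, I suggest discarding the groups of data that contain NaN values, as long as this does not cost you too many samples.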
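As a rough illustration of the imputation route, the sketch below chains `SimpleImputer` and `Ridge` in a pipeline. The data are synthetic stand-ins for a single grid cell (108 time steps, 3 predictors), and the `mean` strategy and `alpha=1.0` are assumptions chosen for the example, not values taken from the question. Note that the imputer only fills NaNs in the predictors; NaNs in the target would still need separate handling.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for one grid cell: 108 time steps, 3 predictors (assumed).
rng = np.random.default_rng(0)
X = rng.normal(size=(108, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=108)

# Knock out some predictor values to mimic missing observations.
X_missing = X.copy()
X_missing[rng.choice(108, size=10, replace=False), 0] = np.nan

# SimpleImputer replaces NaNs in X with the column mean before Ridge is fitted.
model = make_pipeline(SimpleImputer(strategy="mean"), Ridge(alpha=1.0))
model.fit(X_missing, y)

# make_pipeline names each step after its lowercased class name.
coef = model.named_steps["ridge"].coef_
print(coef)
```

Mean imputation pulls the filled values toward the centre of each predictor, so it can bias the coefficients when many values are missing.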
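Dropping incomplete samples could look roughly like the sketch below, which fits Ridge only on the time steps where every predictor and the target are finite. The helper `ridge_coef_dropna` and its `min_samples` threshold are hypothetical names introduced for this example, and the data are synthetic.

```python
import numpy as np
from sklearn.linear_model import Ridge

def ridge_coef_dropna(X, y, min_samples=10, alpha=1.0):
    """Fit Ridge on rows where all predictors and the target are finite.

    Returns NaN coefficients when fewer than `min_samples` rows survive,
    e.g. for grid cells that are entirely masked out.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    valid = np.isfinite(X).all(axis=1) & np.isfinite(y)
    if valid.sum() < min_samples:
        return np.full(X.shape[1], np.nan)
    return Ridge(alpha=alpha).fit(X[valid], y[valid]).coef_

# Synthetic stand-in for one grid cell: 108 time steps, 3 predictors (assumed).
rng = np.random.default_rng(0)
X = rng.normal(size=(108, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=108)
X[5, 1] = np.nan   # one missing predictor value
y[20] = np.nan     # one missing target value

coef = ridge_coef_dropna(X, y)                               # uses 106 rows
all_missing = ridge_coef_dropna(np.full((108, 3), np.nan), y)  # all NaN
```

Inside the question's loop over `lat` and `lon`, a helper like this could stand in for the direct `Ridge().fit(...)` call, with the `fillna(0)` step dropped, so that masked cells end up as NaN in the coefficient array rather than as fitted zeros.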
