Can scipy.stats identify and mask obvious outliers?

Asked by 368yc8dk 8 months ago

Using scipy.stats.linregress I am performing a simple linear regression on some sets of highly correlated x, y experimental data, and initially I visually inspect each x, y scatter plot for outliers. More generally, is there a way to identify and mask outliers?

cmssoen2 #1

The statsmodels package has what you need. Take a look at this little snippet and its output:

# Imports #
import statsmodels.api as smapi
import statsmodels.graphics as smgraphics
# Make data #
x = list(range(30))  # list() so that insert() below also works on Python 3
y = [y*10 for y in x]
# Add outlier #
x.insert(6,15)
y.insert(6,220)
# Make graph #
regression = smapi.OLS(x, y).fit()
figure = smgraphics.regressionplots.plot_fit(regression, 0)
# Find outliers #
test = regression.outlier_test()
outliers = ((x[i],y[i]) for i,t in enumerate(test) if t[2] < 0.5)
print('Outliers:', list(outliers))

Outliers: [(15, 220)]

Edit

With updates to statsmodels, things have changed a bit. Here is a newer snippet that demonstrates the same kind of outlier detection.

# Imports #
from random import random
import statsmodels.api as smapi
from statsmodels.formula.api import ols
import statsmodels.graphics as smgraphics
# Make data #
x = list(range(30))  # list() so that insert() below also works on Python 3
y = [i * (10 + random()) + 200 for i in x]
# Add outlier #
x.insert(6,15)
y.insert(6,220)
# Make fit #
regression = ols("data ~ x", data=dict(data=y, x=x)).fit()
# Find outliers #
test = regression.outlier_test()
outliers = ((x[i], y[i]) for i, t in enumerate(test.iloc[:, 2]) if t < 0.5)  # icol() was removed from pandas
print('Outliers:', list(outliers))
# Figure #
figure = smgraphics.regressionplots.plot_fit(regression, 1)
# Add line #
smgraphics.regressionplots.abline_plot(model_results=regression, ax=figure.axes[0])

Outliers: [(15, 220)]

n53p2ov0 #2

scipy.stats has nothing directly for outliers, so here are some links and a bit of advertising for statsmodels (which is a statistical complement to scipy.stats).
For identifying outliers:
http://jpktd.blogspot.ca/2012/01/influence-and-outlier-measures-in.html
http://jpktd.blogspot.ca/2012/01/anscombe-and-diagnostic-statistics.html
http://statsmodels.sourceforge.net/devel/generated/statsmodels.stats.outliers_influence.OLSInfluence.html
Instead of masking, a better approach is to use a robust estimator:
http://statsmodels.sourceforge.net/devel/rlm.html
An example (unfortunately the plots are currently not displayed): http://statsmodels.sourceforge.net/devel/examples/generated/tut_ols_rlm.html
RLM downweights outliers. The estimation results have a weights attribute, and for outliers the weights are smaller than 1. This can also be used to find outliers. RLM is also more robust when there are several outliers.
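As a minimal sketch of that idea (the toy data, the Huber norm, and the 0.5 weight cutoff are arbitrary choices for this example, not part of the original answer):

import numpy as np
import statsmodels.api as sm

# toy data: a clean line y = 10 * x with one gross outlier
x = np.arange(30, dtype=float)
y = 10.0 * x
y[6] = 500.0  # injected outlier

# robust linear model with the Huber norm; the constant is added explicitly
rlm_results = sm.RLM(y, sm.add_constant(x), M=sm.robust.norms.HuberT()).fit()

# strongly downweighted points are outlier candidates
weights = rlm_results.weights
suspects = np.where(weights < 0.5)[0]  # the 0.5 cutoff is arbitrary for this sketch
print("downweighted points:", [(x[i], y[i], round(weights[i], 3)) for i in suspects])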

68de4m5k #3

"More generally, is there a way to identify and mask outliers?"
There are various outlier-detection algorithms; scikit-learn implements several of them.
[Disclaimer: I'm a scikit-learn contributor.]
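As one concrete illustration (the choice of algorithm is mine, not the answer's), here is a minimal sketch using scikit-learn's RANSACRegressor, whose inlier_mask_ attribute provides exactly the kind of outlier mask the question asks about; the toy data and seed are made up:

import numpy as np
from sklearn.linear_model import RANSACRegressor

# toy data: a noisy line with a few gross outliers injected
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 3.0 * x + 1.0 + rng.normal(scale=0.2, size=x.size)
y[[5, 20, 35]] += 25.0

# RANSAC repeatedly fits on random subsets and keeps the largest consensus set
ransac = RANSACRegressor(random_state=0)
ransac.fit(x.reshape(-1, 1), y)

# the fitted estimator exposes a boolean inlier mask; its complement flags outliers
outlier_mask = ~ransac.inlier_mask_
print("flagged outliers:", list(zip(x[outlier_mask], y[outlier_mask])))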

ttisahbt #4

Even though this question is quite old by now, I will still provide another answer that is based purely on numpy and scipy, although the latter is optional.
Finding outliers in linear regression is a very common yet tricky task. Fortunately, there are so-called measures of influence. Outliers have an unnaturally high influence on the regression, and based on such measures they can be identified and rejected using some rejection rule.
One common influence measure is Cook's distance D, and a less common but more powerful one is the Peña sensitivity S; both are described in the same Wikipedia article (disclaimer: I added the latter). Their basic idea is that each data point is left out once and the change in the regression model is evaluated. Data points whose removal changes the model a lot are likely outliers (the idea behind Cook's distance), and so are data points that are barely affected when other data points are removed (the idea behind the Peña sensitivity).
There is a big risk when many outliers are grouped together, because then they can "mask" each other: if one of them is removed, the model does not change because all the other outliers hold it in place. That is why Cook's distance can fail and should therefore be accompanied by the Peña sensitivity.
So much for the theory behind it; refer to the linked article if needed. As for the implementation, both measures can be computed very efficiently along with the regression itself, as shown below.

Since all the math is already described in the article, I will not go into detail here, except for the general concept of the projection or "hat" matrix H in linear regression. Basically, the model y ~= X @ b (with @ the matrix or "dot" product) can also be formulated in terms of the projection matrix as y ~= H @ y. This is used in the following code, which computes the hat matrix of a linear regression from a Singular Value Decomposition. With this hat matrix, the Cook's distances and the Peña sensitivities are then derived in one go. To make proper use of the SVD, which would otherwise have to be computed multiple times, two helper functions for the regression are defined: one computes the coefficient vector b and the other computes the rank of the matrix X, which equals the number of independent variables used in the regression (so it also works with correlated variables in X):

import numpy as np

# Definition of a regression function that uses a pre-computed Singular Value
# Decomposition (SVD) of the X-matrix

def lstsq_from_svd(
    y_vect: np.ndarray,
    u: np.ndarray,
    s: np.ndarray,
    vt: np.ndarray,
    rcond: float | int = 1e-15,
) -> np.ndarray:
    # all singular values that are numerically small are set to 0 after the inversion
    large_s_idxs = np.where(s >= rcond * s.max())
    inv_s = np.zeros_like(s)
    inv_s[large_s_idxs] = 1.0 / s[large_s_idxs]

    # now, the coefficient vector is computed as V @ inv(S) @ U.T @ y
    return (vt.T * inv_s[np.newaxis, ::]) @ u.T @ y_vect

# Definition of a matrix rank function that uses a pre-computed SVD of the X-matrix

def matrix_rank_from_svd(
    x_shape: tuple[int, int],
    s: np.ndarray,
) -> int:
    # first, the tolerance below which a singular value is considered 0 is evaluated
    # (same behaviour as ``np.linalg.matrix_rank`` for ``tol=None``)
    tol = s.max() * max(x_shape) * np.finfo(s.dtype).eps
    return np.count_nonzero(s > tol)

The following implementation is general, so it works with an arbitrary matrix X in y ~= X @ b and can therefore also be used to find outliers in, say, a sine or cubic polynomial fit. To show why using both measures matters, 4 different scenarios are provided in which the outliers exhibit different levels of leverage on the regression. The expression diag(R) extracts the main diagonal from a square matrix R; conversely, diag(r) turns a vector r into a diagonal matrix with r on the diagonal, just like np.diag. Asserts are used to prove that all the simplifications are valid, and all equations and rejection rules involved are included as comments. In addition, the results were double-checked against the publication.

# Imports

from enum import StrEnum

from matplotlib import pyplot as plt
from scipy.stats import median_abs_deviation

# Specification of the global constants

POLYNOMIAL_DEGREE = 1

# Definition of 4 different outlier scenarios (based on their leverage)

class OutlierScenarios(StrEnum):
    OUTLIER_FREE = "outlier free"
    HIGH_LEVERAGE = "high leverage outliers"  # 3 outliers
    MEDIUM_LEVERAGE = "medium leverage outliers"  # 3 outliers
    LOW_LEVERAGE = "low leverage outliers"  # 3 outliers

scenarios = [member for member in OutlierScenarios]

# "Reading" the base XY-data

xy_data = np.array(
    [
        [0.3899, 0.0],
        [0.0880, -0.3179],
        [-0.6355, 1.0950],
        [-0.5596, -1.8740],
        [0.4437, 0.4282],
        [-0.9499, 0.8956],
        [0.7812, 0.7310],
        [0.5690, 0.5779],
        [-0.8217, 0.0403],
        [-0.2656, 0.6771],
        [-1.1878, 0.5689],
        [-2.2023, -0.2556],
        [0.9863, -0.3775],
        [-0.5186, -0.2959],
        [0.3274, -1.4751],
        [0.2341, -0.2340],
        [0.0215, 0.1184],
        [-1.0039, 0.3148],
        [-0.9471, 1.4435],
        [-0.3744, -0.3510],
        [-1.1859, 0.6232],
        [-1.0559, 0.7990],
        [1.4725, 0.9409],
        [0.0557, -0.9921],
        [-1.2173, 0.2120],
        [-0.412, 0.2379],
        [-1.1283, -1.0078],
    ]
)

# definition of the outlier-containing datasets
outlier_datasets = {
    OutlierScenarios.OUTLIER_FREE: np.array(
        [
            [1.02, 0.72],
            [0.75, 0.42],
            [-0.44, -0.21],
        ]
    ),
    OutlierScenarios.HIGH_LEVERAGE: np.array([[20.0, 5.0]] * 3),
    OutlierScenarios.MEDIUM_LEVERAGE: np.array([[5.0, 5.0]] * 3),
    OutlierScenarios.LOW_LEVERAGE: np.array([[0.5, 5.0]] * 3),
}

# now, the ordinary regression is performed
fig_reg, ax_reg = plt.subplots(nrows=2, ncols=2, sharex="all", sharey="all")
plot_x_vect = np.linspace(start=-2.5, stop=20.5, num=101)

# running the main loop over all scenarios
for idx, scen in enumerate(scenarios):
    # reading the outlier data ...
    outlier_xy_data = outlier_datasets[scen]
    # ... and concatenating them with the base-data
    contam_x_vect = np.concatenate([xy_data[::, 0], outlier_xy_data[::, 0]])
    contam_y_vect = np.concatenate([xy_data[::, 1], outlier_xy_data[::, 1]])

    # now, the X-matrix for the system X @ b ~= y is set up (N = POLYNOMIAL_DEGREE + 1
    # since the constant term also counts)
    x_mat = np.vander(x=contam_x_vect, N=POLYNOMIAL_DEGREE + 1, increasing=True)
    plot_x_mat = np.vander(x=plot_x_vect, N=POLYNOMIAL_DEGREE + 1, increasing=True)

    # the coefficient vector is given by b = pinv(X) @ y where pinv(X) denotes the
    # Moore-Penrose-Pseudoinverse which can be computed form the SVD, i.e.,
    # X = U @ diag(s) @ V.T
    u, s, vt = np.linalg.svd(a=x_mat, full_matrices=False)

    coeff_b_vect = lstsq_from_svd(y_vect=contam_y_vect, u=u, s=s, vt=vt)
    assert np.allclose(
        np.linalg.lstsq(x_mat, contam_y_vect, rcond=None)[0], coeff_b_vect
    ), "SVD-based inversion incorrect"

    # using the same SVD, the hat-matrix H = X @ pinv(X) can be computed. If pinv(X) is
    # the left inverse of X, H is given by H = U @ U.T
    hat_matrix = u @ u.T
    assert np.allclose(
        hat_matrix, x_mat @ np.linalg.pinv(x_mat)
    ), "Hat-matrix simplification incorrect"

    # all that's missing is the residuals now given by r = y - X @ b ...
    resid_vect = (contam_y_vect - x_mat @ coeff_b_vect)
    # ... along with the mean squared error MSE = (r.T @ r) / (n - p) with
    # n as the number of samples and p as the number of effective variables (=rank(X))
    num_samples = x_mat.shape[0]
    num_vars = matrix_rank_from_svd(x_shape=x_mat.shape, s=s)
    mse = np.sum(resid_vect * resid_vect) / (x_mat.shape[0] - num_vars)

    # using the equations given at
    # https://en.wikipedia.org/wiki/Cook's_distance#Relationship_to_other_influence_measures_(and_interpretation)
    # the Cooks's distance and Peña sensitivity can be computed
    # as a basis, the T-matrix is produced as T = H @ diag(r) @ diag(1 / (1 - diag(H)))
    # for all diagonal matrices involved, fast row-wise multiplication is used rather
    # than ``@`` since most of the entries are 0s and the computational cost would thus
    # be high for nothing
    t_mat = hat_matrix * (resid_vect / (1.0 - np.diag(hat_matrix)))[np.newaxis, ::]

    # the Cook's distances are defined as (1 / p / RMSE²) * diag(T.T @ T)
    # since only the main diagonal is required, the columns of T * T can be summed
    # individually instead
    t_mat_squared = t_mat * t_mat
    mse_prefact = 1.0 / (num_vars * mse)
    cooks_d = mse_prefact * np.sum(t_mat_squared, axis=0)
    assert np.allclose(
        cooks_d, mse_prefact * np.diag(t_mat.T @ t_mat)
    ), "Cook simplification incorrect"

    # the Peña sensitivities are defined as
    # (1 / p / RMSE²) * diag(1 / diag(H)) @ diag(T @ T.T)
    # here, the rows of T * T can be summed individually and for the diagonal product,
    # element-wise division is a lot more efficient
    pena_s = mse_prefact * (np.sum(t_mat_squared, axis=1) / np.diag(hat_matrix))
    assert np.allclose(
        pena_s,
        mse_prefact * np.diag(1.0 / np.diag(hat_matrix)) @ np.diag(t_mat @ t_mat.T),
    ), "Peña simplification incorrect"

    # according to the German Wikipedia, outliers can be flagged using (4 / n) as
    # threshold for Cook's distance
    cook_cutoff = 4.0 / x_mat.shape[0]
    cook_outlier_mask = cooks_d > cook_cutoff

    # for the Peña sensitivities, the rejection can be based on robust estimation of
    # the normal distribution they follow and discarding datapoints that are far
    # outside the tails
    pena_median = np.median(pena_s)
    pena_mad = median_abs_deviation(x=pena_s, scale=1.0)
    pena_z_scores = (pena_s - pena_median) / pena_mad
    pena_outlier_mask = np.abs(pena_z_scores) > 4.5

    # Plotting all the results
    ax_idxs = divmod(idx, 2)
    ax_reg[ax_idxs].set_title(f"{scen}", fontsize=18)
    ax_reg[ax_idxs].scatter(xy_data[::, 0], xy_data[::, 1], label="original datapoints")
    ax_reg[ax_idxs].scatter(
        outlier_xy_data[::, 0], outlier_xy_data[::, 1], c="red", label="added set"
    )
    ax_reg[ax_idxs].scatter(
        contam_x_vect[cook_outlier_mask],
        contam_y_vect[cook_outlier_mask],
        s=150,
        c="none",
        ec="black",
        lw=2.0,
        label="flagged by Cook's D",
    )
    ax_reg[ax_idxs].scatter(
        contam_x_vect[pena_outlier_mask],
        contam_y_vect[pena_outlier_mask],
        s=200,
        marker="s",
        c="none",
        ec="black",
        lw=2.0,
        label="flagged by Peña's S",
    )
    ax_reg[ax_idxs].plot(
        plot_x_vect, plot_x_mat @ coeff_b_vect, label="regression line", color="black"
    )

    if idx % 2 == 0:
        ax_reg[ax_idxs].set_ylabel("y", fontsize=16)
    if idx / 2 >= 1:
        ax_reg[ax_idxs].set_xlabel("x", fontsize=16)
    if idx == 0:
        legend = ax_reg[ax_idxs].legend(ncol=2, loc="best", fontsize=14)

plt.show()

This example produces the following plot: [Figure: comparison of outliers flagged by Cook's distance and Peña sensitivity]. As can be seen, neither measure is superior in every scenario, so they should be used in combination.
A few remarks:

  • Changing POLYNOMIAL_DEGREE to 2 will show how outliers are flagged for a parabola fit. Interestingly, setting it to 0 will find the outliers when computing the mean of the y-data, since that is nothing but a constant-only linear regression problem. In this case, however, the Peña sensitivity becomes ill-conditioned.

  • Weighted regression can also be used, but then H = U @ U.T no longer holds, because H = X @ pinv(sqrt(W) @ X) @ sqrt(W), where W is the variance-covariance matrix of the errors in y. The coefficients are then b = pinv(sqrt(W) @ X) @ sqrt(W) @ y. The residuals and the MSE also need to be weighted: r_w = sqrt(W) @ (y - X @ b) and wMSE = (r.T @ W @ r) / (n - p). A minimal sketch of this is given further below.

  • The rejection rule for the Peña sensitivity uses 4.5 times the median absolute deviation (MAD). This means that all S-values more than 3 robust standard deviations away from the median are rejected as outliers, because roughly 1.5 times the MAD resembles the standard deviation of normally distributed data. Basically, any other robust measure of scale can be used to estimate the standard deviation robustly, e.g., the Qn scale or the biweight midvariance, which behave even better for normal distributions.

  • The concept can be extended to any other linear projection y ~= H @ y, e.g., smoothing. Let's take the famous Savitzky-Golay filter as an example. Apart from the boundaries, which require special treatment if one wants to make it truly accurate, its hat matrix H is a Toeplitz matrix that can be computed with the following code. It is easy to see that each of its rows is the kernel shifted by one, just as the filtering method intends:

from scipy.linalg import convolution_matrix
from scipy.signal import savgol_coeffs
# setup of a Savitzky-Golay-filter with 3rd order polynomial and window-length 5 for
# a signal of arbitrary length n
POLYDEGREE = 3
WINDOW_LENGTH = 5
savgol_kernel = savgol_coeffs(
    window_length=WINDOW_LENGTH, polyorder=POLYDEGREE, deriv=0
)
print(savgol_kernel)
# [-0.08571429  0.34285714  0.48571429  0.34285714 -0.08571429]
hat_matrix = convolution_matrix(a=savgol_kernel, n=6, mode="same")
print(hat_matrix)
# [[ 0.48571429  0.34285714 -0.08571429  0.          0.          0.        ]
#  [ 0.34285714  0.48571429  0.34285714 -0.08571429  0.          0.        ]
#  [-0.08571429  0.34285714  0.48571429  0.34285714 -0.08571429  0.        ] # one full kernel
#  [ 0.         -0.08571429  0.34285714  0.48571429  0.34285714 -0.08571429] # one full kernel
#  [ 0.          0.         -0.08571429  0.34285714  0.48571429  0.34285714]
#  [ 0.          0.          0.         -0.08571429  0.34285714  0.48571429]]

The number of variables then has to be computed in a more involved way, using the trace of the hat matrix and of its product with itself.
If you want to go into more detail, you can perform an eigenvalue analysis of the matrices behind Cook's distance and the Peña sensitivity (as mentioned in the Wikipedia article).
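Picking up the weighted-regression remark from above, here is a minimal sketch of how the weighted coefficients and hat matrix could be computed for a diagonal weight matrix W; it simply follows the formulas stated in that remark and ends with a small self-consistency check:

import numpy as np

def weighted_hat_and_coeffs(x_mat, y_vect, w_diag):
    """Weighted regression with a diagonal weight matrix W = diag(w_diag)."""
    sqrt_w = np.sqrt(w_diag)
    # sqrt(W) @ X, computed by scaling the rows of X
    x_weighted = x_mat * sqrt_w[:, np.newaxis]
    # b = pinv(sqrt(W) @ X) @ sqrt(W) @ y
    coeff_b_vect = np.linalg.pinv(x_weighted) @ (sqrt_w * y_vect)
    # H = X @ pinv(sqrt(W) @ X) @ sqrt(W), so that y_hat = H @ y
    hat_matrix = (x_mat @ np.linalg.pinv(x_weighted)) * sqrt_w[np.newaxis, :]
    return hat_matrix, coeff_b_vect

# small self-consistency check on random data: H @ y must equal X @ b
rng = np.random.default_rng(0)
x_mat = np.vander(rng.normal(size=20), N=2, increasing=True)
y_vect = rng.normal(size=20)
w_diag = rng.uniform(0.5, 2.0, size=20)
hat_matrix, coeff_b_vect = weighted_hat_and_coeffs(x_mat, y_vect, w_diag)
assert np.allclose(hat_matrix @ y_vect, x_mat @ coeff_b_vect)
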
Hope this helps!

bcs8qyzn #5

One can also limit the effect of outliers by using scipy.optimize.least_squares. In particular, take a look at the f_scale parameter:
Value of soft margin between inlier and outlier residuals, default is 1.0. This parameter has no effect with loss='linear', but for other loss values it is of crucial importance.
On that page, they compare 3 different fits: plain least_squares and two calls involving f_scale:

res_lsq =     least_squares(fun, x0, args=(t_train, y_train))
res_soft_l1 = least_squares(fun, x0, loss='soft_l1', f_scale=0.1, args=(t_train, y_train))
res_log =     least_squares(fun, x0, loss='cauchy', f_scale=0.1, args=(t_train, y_train))

As can be seen, plain least squares is affected by data outliers much more strongly, and it can be worthwhile to combine different loss functions with different values of f_scale. The possible loss functions are (taken from the documentation):

‘linear’ : Gives a standard least-squares problem.
‘soft_l1’: The smooth approximation of l1 (absolute value) loss. Usually a good choice for robust least squares.
‘huber’  : Works similarly to ‘soft_l1’.
‘cauchy’ : Severely weakens outliers influence, but may cause difficulties in optimization process.
‘arctan’ : Limits a maximum loss on a single residual, has properties similar to ‘cauchy’.

See also the SciPy Cookbook on robust nonlinear regression.
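For reference, here is a minimal self-contained sketch in the spirit of that comparison; the linear model, toy data, injected outliers, and the f_scale value are made up for illustration:

import numpy as np
from scipy.optimize import least_squares

# toy data: a noisy line y = 2 * t + 1 with a few gross outliers injected
rng = np.random.default_rng(0)
t_train = np.linspace(0.0, 10.0, 40)
y_train = 2.0 * t_train + 1.0 + rng.normal(scale=0.3, size=t_train.size)
y_train[[3, 15, 30]] += 15.0

def fun(params, t, y):
    # residuals of the linear model y ~= a * t + b
    return params[0] * t + params[1] - y

x0 = np.array([1.0, 0.0])
res_lsq = least_squares(fun, x0, args=(t_train, y_train))
res_soft_l1 = least_squares(fun, x0, loss='soft_l1', f_scale=1.0, args=(t_train, y_train))

print('plain least squares:', res_lsq.x)
print('robust (soft_l1):   ', res_soft_l1.x)  # typically closer to the true (2, 1)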
