如何在Pandas框架中将重叠的过滤条件识别为行?

yqyhoc1h  于 6个月前  发布在  其他
关注(0)|答案(2)|浏览(42)

下面的例子是一个优化的输出,它产生了一个值列表,我将使用这些值作为过滤器,并应用于另一组数据。
我试图消除那些产生的值/筛选器已经在其他行中覆盖的行,只留下唯一的值/筛选器。例如,如果我应用索引行#1(var_1 > 2,var_1<5,var_2>5,var_2 <9)中的值/筛选器,它将产生与索引行0(1,5,5,9)中的值/筛选器相同的结果。
索引行#0、#2和#3应该是unique_result = True
在尝试#1中,我的代码只是查看值是<=到最小值还是>=到最大值,但正如您所看到的,记录#1显示为True,因为第四个值9大于或等于最大值。
在尝试#2中,我过滤了嵌套框架,首先删除了我正在迭代的行,然后查看该值是否小于过滤后的嵌套框架的最小值。(我只对一个值这样做,因为我得到了我不理解的NaN)。

df = pd.DataFrame(
    {
        "var_1_lower_limit": [1, 2, 1, 3, 2],
        "var_1_upper_limit": [5, 5, 4, 7, 6],
        "var_2_lower_limit": [5, 5, 3, 6, 4],
        "var_2_upper_limit": [9, 9, 8, 9, 7],
    }
)

def find_unique(
    var_1_lower_limit, var_1_upper_limit, var_2_lower_limit, var_2_upper_limit
):
    if (
        var_1_lower_limit <= df_filtered.var_1_lower_limit.min()
        or var_1_upper_limit >= df_filtered.var_1_upper_limit.max()
        or var_2_lower_limit <= df_filtered.var_2_lower_limit.min()
        or var_2_upper_limit >= df_filtered.var_2_upper_limit.max()
    ):
        return True
    else:
        return False

### Attempt #1
df["unique_result"] = df.apply(
    lambda x: find_unique(
        x["var_1_lower_limit"],
        x["var_1_upper_limit"],
        x["var_2_lower_limit"],
        x["var_2_upper_limit"],
    ),
    axis=1,
)

display(df)

### Attempt #2
for i in df.itertuples():
    df_filtered = df.drop(df[df.index == i.Index].index)
    df["low_value_2"] = df_filtered.var_1_lower_limit.apply(
        lambda x: True if x < i.var_1_lower_limit else False
    )

display(df)

字符串
编辑:添加了我希望以x1c 0d1x结束的输出

ovfsdjhp

ovfsdjhp1#

好的,再次感谢@Laurent,他指出了我忽略的一个组合,所以我意识到我需要遍历所有可能的潜在值组合,而不是试图比较过滤器是最小值还是最大值。

import pandas as pd
import itertools
import numpy as np

### On two variables
df = pd.DataFrame({'var_1_lower_limit':[1,2,1,3,2],'var_1_upper_limit':[5,5,4,7,6],'var_2_lower_limit':[5,5,3,6,4],'var_2_upper_limit':[9,9,8,9,7]})
df.sort_values(['var_1_lower_limit','var_1_upper_limit','var_2_lower_limit','var_2_upper_limit'],ascending=True, inplace=True) # I think important to sort here
display(df)

default_possibilities = [0,1,2,3,4,5,6,7,8,9]
possibilities = list(itertools.product(default_possibilities,default_possibilities,default_possibilities,default_possibilities)) #Creates all combinations of possibilities 0-9,0-9,0-9,0-9
print('length of possibilities',len(possibilities))
possibilities = [item for item in possibilities if item[1]>=item[0] and item[3]>=item[2]] # removes combinations that don't make sense such as first value is 5 and second value is 3
print('length of revised possibilities',len(possibilities))

filtering_output = []

# For each combination of potential values, see if it passes a filter
for i in possibilities: 
  a = i[0]
  b = i[1]
  c = i[2]
  d = i[3]
  for z in df.itertuples():
    if z.var_1_lower_limit <=a and z.var_1_upper_limit >=b and z.var_2_lower_limit <=c and z.var_2_upper_limit >=d:
      filtering_output.append({'combination':i,'filter_id':str(z.var_1_lower_limit)+','+str(z.var_1_upper_limit)+','+str(z.var_2_lower_limit)+','+str(z.var_2_upper_limit),'var_1_lower_limit':z.var_1_lower_limit,'var_1_upper_limit':z.var_1_upper_limit, 'var_2_lower_limit':z.var_2_lower_limit,'var_2_upper_limit':z.var_2_upper_limit,'pass_or_fail':'YES'})
    else:
      continue

filtering_output_df = pd.DataFrame(filtering_output) ### This df has all passing combinations
display(filtering_output_df)  

unique_combinations = filtering_output_df.drop_duplicates(subset=['combination'], keep='first') # Since there are combinations that pass multiple filters, reduce to only the first iteration of that combination

final_filters = unique_combinations.drop_duplicates(subset=['var_1_lower_limit','var_1_upper_limit','var_2_lower_limit','var_2_upper_limit'], keep='first') # since df was sorted, keeping only filters that had a passing combination
display(final_filters)

字符串

yx2lnoni

yx2lnoni2#

下面是一种使用Pandas shift的惯用方法:

import pandas as pd

# Sort dataframe and set 'unique' column with 'True' as default value
df = df.sort_values(
    by=[
        "var_1_lower_limit",
        "var_1_upper_limit",
        "var_2_lower_limit",
        "var_2_upper_limit",
    ]
).assign(unique=True)

# Identify filters not contained within others by comparing current row
# with previous one or the next, and set them as 'False'
non_unique_filters = (
    (df["var_1_lower_limit"] >= df["var_1_lower_limit"].shift(-1))
    & (df["var_1_upper_limit"] <= df["var_1_upper_limit"].shift(-1))
    & (df["var_2_lower_limit"] >= df["var_2_lower_limit"].shift(-1))
    & (df["var_2_upper_limit"] <= df["var_2_upper_limit"].shift(-1))
) | (
    (df["var_1_lower_limit"] >= df["var_1_lower_limit"].shift(1))
    & (df["var_1_upper_limit"] <= df["var_1_upper_limit"].shift(1))
    & (df["var_2_lower_limit"] >= df["var_2_lower_limit"].shift(1))
    & (df["var_2_upper_limit"] <= df["var_2_upper_limit"].shift(1))
)
df.loc[non_unique_filters, "unique"] = False

df = df.sort_index()

字符串
然后又道:

print(df)
# Output
   var_1_lower_limit  var_1_upper_limit  var_2_lower_limit  var_2_upper_limit  unique
0                  1                  5                  5                  9    True
1                  2                  5                  5                  9   False
2                  1                  4                  3                  8    True
3                  3                  7                  6                  9    True
4                  2                  6                  4                  7    True

相关问题