pandas: How to find the list of columns with identical values in Python

3okqufwl  posted 6 months ago in Python

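I am trying to find the list of columns in a DataFrame whose values are identical. R has a function for this, whichAreInDouble, and I am trying to implement the same thing in Python.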

df  =   
a b c d e f g h i   
1 2 3 4 1 2 3 4 5  
2 3 4 5 2 3 4 5 6  
3 4 5 6 3 4 5 6 7

It should give me the columns that have identical values, like:

a, e are equal
b,f are equal 
c,g are equal

Answer 1 (cqoc49vn)

Let's try itertools and combinations:

from itertools import combinations

[(i, j) for i,j in combinations(df, 2) if df[i].equals(df[j])]

Output:

[('a', 'e'), ('b', 'f'), ('c', 'g'), ('d', 'h')]
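
For anyone who wants to run the one-liner above end to end, here is a minimal sketch that first rebuilds the question's DataFrame. The variable name equal_pairs is only illustrative; everything else comes from the question and the answer.

```python
from itertools import combinations

import pandas as pd

# The DataFrame from the question.
df = pd.DataFrame(
    [[1, 2, 3, 4, 1, 2, 3, 4, 5],
     [2, 3, 4, 5, 2, 3, 4, 5, 6],
     [3, 4, 5, 6, 3, 4, 5, 6, 7]],
    columns=list("abcdefghi"),
)

# Iterating a DataFrame yields its column labels, so combinations(df, 2)
# produces every unordered pair of column names exactly once.
equal_pairs = [(i, j) for i, j in combinations(df, 2) if df[i].equals(df[j])]

# Print the pairs in the style the question asked for.
for i, j in equal_pairs:
    print(f"{i}, {j} are equal")
```

Note that Series.equals also compares dtypes, so an integer column and a float column holding the same numbers are not reported as equal.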

Answer 2 (xhv8bpkk)

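from itertools import combinations

# chk is the DataFrame to deduplicate
cols_to_remove = set()
for i, j in combinations(chk, 2):
    # keep the earlier column of each identical pair, drop the later one
    if chk[i].equals(chk[j]):
        cols_to_remove.add(j)

chk = chk.drop(columns=list(cols_to_remove))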
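
As a possible alternative to the explicit loop above (a different technique, not the one this answer uses), the frame can be transposed so that DataFrame.duplicated compares columns as rows. This is only a sketch on the question's data; on wide frames or frames with mixed dtypes the transpose may be slow or may change dtypes.

```python
import pandas as pd

# The DataFrame from the question, standing in for chk.
chk = pd.DataFrame(
    [[1, 2, 3, 4, 1, 2, 3, 4, 5],
     [2, 3, 4, 5, 2, 3, 4, 5, 6],
     [3, 4, 5, 6, 3, 4, 5, 6, 7]],
    columns=list("abcdefghi"),
)

# After transposing, each original column is a row, so duplicated()
# flags every column that repeats an earlier one (keep="first").
is_duplicate = chk.T.duplicated()

# Keep only the columns that were not flagged.
deduped = chk.loc[:, ~is_duplicate.to_numpy()]
print(list(deduped.columns))
```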


Answer 3 (4c8rllxm)

Build a dict/map of the column combinations that have identical values.

>>> from itertools import combinations
>>> _dup_ohe_col_pairs = [(i, j) for i,j in combinations(df_dups, 2) if df_dups[i].equals(df_dups[j])]
>>> _dup_ohe_col_pairs = sorted(_dup_ohe_col_pairs, key=lambda x: x[0])

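Don't just pass the column names in _dup_ohe_col_pairs to pandas.DataFrame.drop to remove columns. If a and b have identical values, the list contains the pair ('a', 'b'), and dropping every name that appears in it removes both columns. Now imagine what happens when 3, 4, or 5 columns are identical. Choosing or filtering what to keep from that list is very hard.
Instead, do this: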

# from: https://stackoverflow.com/questions/75257052/getting-unique-values-and-their-following-pairs-from-list-of-tuples/75257487#75257487
def get_unique_to_duplicates_map(data):
    # ecs stands for equivalence classes (https://en.wikipedia.org/wiki/Equivalence_class)
    ecs = []
    
    for a, b in data:
        a_ec = next((ec for ec in ecs if a in ec), None)
        b_ec = next((ec for ec in ecs if b in ec), None)
        if a_ec:
            if b_ec:
                # Found equivalence classes for both elements, everything is okay
                if a_ec is not b_ec:
                    # We only need one of them though
                    ecs.remove(b_ec)
                    a_ec.update(b_ec)
            else:
                # Add the new element to the found equivalence class       
                a_ec.add(b)
        else:              
            if b_ec:
                # Add the new element to the found equivalence class
                b_ec.add(a)
            else:                                                   
                # First time we see either of these: make a new equivalence class 
                ecs.append({a, b})

    # Extract a representative element and construct a dictionary
    out = {
        ec.pop(): ec
        for ec in ecs
    }

    # return it
    return out

>>> _unique_to_dups_map = get_unique_to_duplicates_map(data=_dup_ohe_col_pairs)
>>> _unique_to_dups_map
>>> dropped_to_retained_dict = {v_i: k for k, v in _unique_to_dups_map.items() for v_i in v}
>>> dropped_to_retained_dict = dict(sorted(dropped_to_retained_dict.items(), key=lambda item: item[1]))
>>> dropped_to_retained_dict
>>> df_dups.drop(columns=list(dropped_to_retained_dict), inplace=True)
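
To see the case of three identical columns in isolation, here is a small sketch that reaches the same kind of dropped-to-retained mapping by a simpler route: grouping column names by their contents instead of merging pairs. The data and the names groups and dropped_to_retained are hypothetical, and the sketch assumes the values are hashable and contain no NaN.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical data: a, c and d are identical; b is unique.
df_dups = pd.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6], "c": [1, 2, 3], "d": [1, 2, 3]}
)

# Group column names by their contents; a tuple of the values is hashable.
groups = defaultdict(list)
for col in df_dups.columns:
    groups[tuple(df_dups[col])].append(col)

# Keep the first column of each group and map every other one to it.
dropped_to_retained = {
    dup: cols[0] for cols in groups.values() for dup in cols[1:]
}
print(dropped_to_retained)

# Drop only the duplicates; one representative per group survives.
deduped = df_dups.drop(columns=list(dropped_to_retained))
print(list(deduped.columns))
```

Unlike get_unique_to_duplicates_map, which picks its representative with set.pop() and therefore arbitrarily, this sketch always retains the leftmost column of each group.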

Solution for columns that hold the same values under a different encoding:

It can happen that two columns hold essentially the same values but are encoded differently. For example:

   b  c  d  e  f
1  1  3  4  1  a
2  3  4  5  2  c
3  2  5  6  3  b
4  3  4  5  2  c
5  4  5  6  3  d
6  2  4  5  2  b
7  4  5  6  3  d


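In the example above you can see that, after label encoding, column f holds the same values as column b. So how do you catch duplicate columns like this? Here is how: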

import numpy as np
import pandas as pd
from tqdm.auto import tqdm  # replaces the deprecated tqdm_notebook

# Create an empty DataFrame with the same index as your DataFrame (call it
# train_df); it will hold the factorized version of the original data.
train_enc = pd.DataFrame(index=train_df.index)
# Encode every feature as integer codes, numbered by order of first appearance.
for col in tqdm(train_df.columns):
    train_enc[col] = train_df[col].factorize()[0]
# Find the duplicated columns.
dup_cols = {}
# Start with one feature...
for i, c1 in enumerate(tqdm(train_enc.columns)):
    # ...and compare it with all the remaining features.
    for c2 in train_enc.columns[i + 1:]:
        # Record c2 as a duplicate of c1 unless it was already matched earlier.
        if c2 not in dup_cols and np.all(train_enc[c1] == train_enc[c2]):
            dup_cols[c2] = c1
# Each key of dup_cols is a column identical to the column named in its value.
print(dup_cols)


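The names of the columns that match another column once encoded are printed to standard output.
If you want to drop the duplicate columns, you can do the following: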

train_df.drop(columns=list(dup_cols), inplace=True)
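
To try the encoding approach without a real dataset, here is a minimal sketch that rebuilds the example table above as train_df and runs the same comparison, without the progress bars. It uses Series.equals in place of np.all, which is a substitution on my part and assumes the factorized codes all share one integer dtype.

```python
import pandas as pd

# The example table from this answer.
train_df = pd.DataFrame(
    {
        "b": [1, 3, 2, 3, 4, 2, 4],
        "c": [3, 4, 5, 4, 5, 4, 5],
        "d": [4, 5, 6, 5, 6, 5, 6],
        "e": [1, 2, 3, 2, 3, 2, 3],
        "f": ["a", "c", "b", "c", "d", "b", "d"],
    },
    index=range(1, 8),
)

# factorize() numbers the values by order of first appearance, so two
# columns with the same repetition pattern get identical codes.
train_enc = pd.DataFrame(
    {col: train_df[col].factorize()[0] for col in train_df.columns},
    index=train_df.index,
)

# Map each duplicate column to the earlier column it matches.
dup_cols = {}
cols = list(train_enc.columns)
for i, c1 in enumerate(cols):
    for c2 in cols[i + 1:]:
        if c2 not in dup_cols and train_enc[c1].equals(train_enc[c2]):
            dup_cols[c2] = c1

print(dup_cols)
```

On this table the sketch reports f as a duplicate of b, and it also reports d and e as duplicates of c. Those three columns hold different raw numbers but follow the same pattern, so this method treats any one-to-one relabeling as a duplicate. Check that this is what you want before dropping columns.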
