How to calculate joint probabilities for a CSV file

332nm8kg  posted 7 months ago in Other

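I am trying to compute joint probabilities for a CSV file and write a new CSV file with an extra column holding the joint probability. The problem is that my CSV file contains some NaN values, and I want the rows containing NaN to get a joint probability too. My input CSV file looks like this (this is only a subset):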

Age,Salary
84.0,74198.0
25.5,57881.5
41.0,NaN
57.0,NaN
54.0,40286.0

My CSV output file currently looks like this:

Age,Salary,JointProbability
84.0,74198.0,0.04000000000000001
25.5,57881.5,0.04000000000000001
41.0,,0.0
57.0,,0.0
54.0,40286.0,0.04000000000000001


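Desired CSV output (the probabilities shown here are random; their sum must be 1). I also want the added probability column to be named P(column names of the input CSV file), so that it changes with the input file. I also want NaN to be written out, rather than left blank as in the file above.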

Age,Salary,P(Age,Salary)
84.0,74198.0,0.3
25.5,57881.5,0.1
41.0,NaN,0.1
57.0,NaN,0.2
54.0,40286.0,0.3


Code:

import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv('random_sampled_data.csv')

# Calculate joint probabilities
joint_probabilities = []
total_count = len(df)

# Calculate joint probability for each row
for _, row in df.iterrows():
    
    for column in df.columns:
        probability = len(df[df[column] == row[column]]) / total_count
    joint_probabilities.append(probability)

# Add joint probabilities as a new column to the DataFrame
df['JointProbability'] = joint_probabilities

# Save the updated DataFrame to a new CSV file
df.to_csv('output_with_joint_probabilities.csv', index=False)

Answer #1 (oxiaedzo)

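To compute the joint probability for the CSV file, create a new column holding the probabilities and account for the rows that contain NaN values.

I modified your code as follows: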

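import pandas as pd
import numpy as np

df = pd.read_csv('input_file.csv')

joint_probabilities = []
total_count = len(df)

for _, row in df.iterrows():
    # Multiply the per-column frequencies (this treats the columns as independent)
    probability = 1
    for column in df.columns:
        if pd.isna(row[column]):
            # NaN never compares equal to NaN, so use a fixed placeholder weight
            probability *= 0.1
        else:
            count = len(df[df[column] == row[column]])
            probability *= count / total_count
    joint_probabilities.append(probability)

# Normalize so that the probabilities sum to 1
joint_probabilities = np.array(joint_probabilities)
joint_probabilities /= joint_probabilities.sum()

# Name the new column after the input columns, e.g. P(Age,Salary)
df['P(' + ','.join(df.columns) + ')'] = joint_probabilities

# na_rep='NaN' writes missing values as NaN instead of leaving them blank
df.to_csv('output_file.csv', index=False, na_rep='NaN')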
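For comparison, here is a minimal sketch of a different approach, not what the answer above does: the empirical joint probability, where each row gets the share of rows with the same combination of values. It assumes NaN should count as an ordinary value, which `groupby(..., dropna=False)` allows. The five sample rows come from the question; the variable names (`csv_text`, `group_sizes`, `prob_col`) are illustrative only.

```python
import io
import pandas as pd

# The five sample rows from the question, inlined so the sketch is self-contained
csv_text = """Age,Salary
84.0,74198.0
25.5,57881.5
41.0,NaN
57.0,NaN
54.0,40286.0
"""
df = pd.read_csv(io.StringIO(csv_text))
cols = list(df.columns)

# dropna=False keeps NaN as a group key, so rows with missing values are counted too
group_sizes = df.groupby(cols, dropna=False)[cols[0]].transform("size")

# Column name built from the input columns, as requested in the question
prob_col = "P(" + ",".join(cols) + ")"
df[prob_col] = group_sizes / len(df)

# na_rep="NaN" writes missing values as NaN; the header is quoted by the
# CSV writer because it contains a comma
buffer = io.StringIO()
df.to_csv(buffer, index=False, na_rep="NaN")
output_text = buffer.getvalue()
print(output_text)
```

Because all five sample rows are distinct, each gets 1/5 and the column sums to 1. If rows repeat, the per-row values sum to more than 1, since each distinct combination is listed more than once.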


Answer #2 (zwghvu4y)

Here is another answer:

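import numpy as np
import pandas as pd

df = pd.DataFrame({'A':[1,2,3,3], 'B':[1,2,np.nan,np.nan]})

colNameJoiningKeyName     = 'colJoiningKey'
colNameJointProbabilities = 'JointProbability'

# Build one string key per row; NaN becomes the string 'nan', so rows with
# missing values are counted like any other combination. The '|' separator
# keeps e.g. ('1', '23') and ('12', '3') from producing the same key.
df[colNameJoiningKeyName] = df[[colName for colName in df.columns if colName!=colNameJoiningKeyName]].astype(str).agg('|'.join, axis=1)
# Relative frequency of each distinct key
frequencyDict             = df[colNameJoiningKeyName].value_counts(normalize=True).to_dict()

# Map each row's key to its frequency, then drop the helper column
df[colNameJointProbabilities] = df[colNameJoiningKeyName].map(frequencyDict)
del df[colNameJoiningKeyName]

print(df)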
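To make the counting idea explicit, here is a rough pure-Python sketch of the same calculation on the same four rows. It assumes missing values arrive as float NaN and replaces them with `None` in the key, since NaN does not compare equal to itself. The helper `row_key` is a hypothetical name introduced for this illustration.

```python
import math
from collections import Counter

def row_key(row):
    """Return a hashable key for a row, mapping NaN to None so equal rows match."""
    return tuple(
        None if isinstance(value, float) and math.isnan(value) else value
        for value in row
    )

# Same values as the DataFrame in the answer: A = [1, 2, 3, 3], B = [1, 2, NaN, NaN]
rows = [(1, 1.0), (2, 2.0), (3, float("nan")), (3, float("nan"))]

# Count how often each distinct combination occurs
counts = Counter(row_key(row) for row in rows)

# Each row's probability is the relative frequency of its combination
probs = [counts[row_key(row)] / len(rows) for row in rows]
print(probs)  # [0.25, 0.25, 0.5, 0.5]
```

The per-row values add up to 1.5 here because the (3, NaN) combination appears twice. The probabilities sum to 1 over the distinct combinations, not over the rows.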

