Pandas merge two dataframes with different columns

python pandas dataframe data-munging

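I'm surely missing something simple here. I'm trying to merge two dataframes in pandas that have mostly the same column names, but the right dataframe has some columns that the left doesn't have, and vice versa.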

>df_may

  id  quantity  attr_1  attr_2
0  1        20       0       1
1  2        23       1       1
2  3        19       1       1
3  4        19       0       0

>df_jun

  id  quantity  attr_1  attr_3
0  5         8       1       0
1  6        13       0       1
2  7        20       1       1
3  8        25       1       1

I've tried joining with an outer join:

mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer")

But that yields:

Left data columns not unique: Index([....

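I've also specified a single column to join on (e.g. on = "id"), but that duplicates every column except id, giving attr_1_x, attr_1_y and so on, which is not ideal. I've also passed the entire list of columns (there are many) to on: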

mayjundf = pd.DataFrame.merge(df_may, df_jun, how="outer", on=list(df_may.columns.values))

Which yields:

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

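What am I missing? I'd like to get a df with all rows appended, with attr_1, attr_2 and attr_3 populated where possible and NaN where they don't appear. This seems like a pretty typical data-munging workflow, but I'm stuck.

Thanks in advance.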

EdChum

I think in this case concat is what you want:

In [12]:

pd.concat([df,df1], axis=0, ignore_index=True)
Out[12]:
   attr_1  attr_2  attr_3  id  quantity
0       0       1     NaN   1        20
1       1       1     NaN   2        23
2       1       1     NaN   3        19
3       0       0     NaN   4        19
4       1     NaN       0   5         8
5       0     NaN       1   6        13
6       1     NaN       1   7        20
7       1     NaN       1   8        25

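By passing axis=0 here you are stacking the dfs on top of each other, which I believe is what you want. Columns that are absent from one of the dfs are then filled with NaN for that df's rows.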
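For reference, here is a self-contained sketch of the same idea applied to the frames from the question. The frames are rebuilt by hand, and the names df_may, df_jun and mayjundf are borrowed from the question rather than from the snippet above. On recent pandas versions the column order may differ from the output shown above, and columns that gain NaN values are upcast to float.

```python
import pandas as pd

# Frames rebuilt from the tables in the question
df_may = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "quantity": [20, 23, 19, 19],
    "attr_1": [0, 1, 1, 0],
    "attr_2": [1, 1, 1, 0],
})
df_jun = pd.DataFrame({
    "id": [5, 6, 7, 8],
    "quantity": [8, 13, 20, 25],
    "attr_1": [1, 0, 1, 1],
    "attr_3": [0, 1, 1, 1],
})

# Stack the rows; columns missing from one frame are filled with NaN
mayjundf = pd.concat([df_may, df_jun], axis=0, ignore_index=True)
print(mayjundf)
```

ignore_index=True discards the original row labels and renumbers the result from 0, which avoids repeated index values after stacking.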

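For some reason this didn't work for me. I got pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects

I tried to merge three DFs with different columns this way. Some columns were added and some were lost.

This doesn't seem to work for my current use case either. Some columns get dropped. It seems to be sensitive to which dataframe comes first in the list being concatenated? Oddly, running the example from the official concat docs works fine regardless of order.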

tdy

The accepted answer will break if there are duplicate headers:

InvalidIndexError: Reindexing only valid with uniquely valued Index objects.

For example, here A has three trial columns, which prevents concat:

A = pd.DataFrame([[3, 1, 4, 1]], columns=['id', 'trial', 'trial', 'trial'])
#    id  trial  trial  trial
# 0   3      1      4      1

B = pd.DataFrame([[5, 9], [2, 6]], columns=['id', 'trial'])
#    id  trial
# 0   5      9
# 1   2      6

pd.concat([A, B], ignore_index=True)
# InvalidIndexError: Reindexing only valid with uniquely valued Index objects

To fix this, deduplicate the column names before concat:

parser = pd.io.parsers.base_parser.ParserBase({'usecols': None})

for df in [A, B]:
    df.columns = parser._maybe_dedup_names(df.columns) 

pd.concat([A, B], ignore_index=True)
#    id  trial  trial.1  trial.2
# 0   3      1        4        1
# 1   5      9      NaN      NaN
# 2   2      6      NaN      NaN

Or as a one-liner, though less readable:

pd.concat([df.set_axis(parser._maybe_dedup_names(df.columns), axis=1) for df in [A, B]], ignore_index=True)

Note that for pandas < 1.3.0, use: parser = pd.io.parsers.ParserBase({})
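Since _maybe_dedup_names is a private pandas helper, it may move or disappear between releases. Below is a minimal sketch of the same renaming done by hand. dedup_columns is a hypothetical helper name, not part of pandas, and it assumes the name.N suffix scheme shown above.

```python
import pandas as pd

def dedup_columns(columns):
    """Hypothetical helper: rename repeated names to name.1, name.2, ...

    Assumption: no column is already called e.g. 'trial.1'; such a
    pre-existing name would collide and is not handled here.
    """
    seen = {}
    result = []
    for name in columns:
        count = seen.get(name, 0)
        # First occurrence keeps its name; later ones get a numeric suffix
        result.append(name if count == 0 else f"{name}.{count}")
        seen[name] = count + 1
    return result

A = pd.DataFrame([[3, 1, 4, 1]], columns=["id", "trial", "trial", "trial"])
B = pd.DataFrame([[5, 9], [2, 6]], columns=["id", "trial"])

# set_axis returns a relabelled copy, so A and B themselves stay unchanged
frames = [df.set_axis(dedup_columns(df.columns), axis=1) for df in (A, B)]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```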

normanius

I ran into this problem today with each of concat, append and merge. I got around it by adding a sequentially numbered helper column and then doing an outer join:

# Number the rows 1..len(df1) in df1 and continue the count in df2,
# so every helper value is unique across both frames
df1['helper'] = range(1, len(df1) + 1)
df2['helper'] = range(len(df1) + 1, len(df1) + len(df2) + 1)
df1.merge(df2, on='helper', how='outer')

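What about the accepted answer doesn't work: pd.concat([df,df1], axis=0, ignore_index=True)?

I hit this with non-unique columns. Consider a = pd.DataFrame({'d':[1], 'b':[2]}).rename(columns={'b':'d'}) and b=pd.DataFrame({'d':[4, 6]}); then pd.concat([a, b], axis=0, ignore_index=True) fails. Although some workarounds can be applied, I think it is better to fix the root of the problem and have unique column names (as in my case). Also, I get some warnings when trying to rename a column to a name that already exists.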
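One way to catch this case before calling concat is to check each frame for repeated column names. The sketch below reuses the frames a and b from the comment above; the loop and the printed message are illustrative only.

```python
import pandas as pd

a = pd.DataFrame({"d": [1], "b": [2]}).rename(columns={"b": "d"})
b = pd.DataFrame({"d": [4, 6]})

for label, frame in {"a": a, "b": b}.items():
    if not frame.columns.is_unique:
        # duplicated() flags every repeat after the first occurrence
        dupes = frame.columns[frame.columns.duplicated()].unique().tolist()
        print(f"{label} has duplicate column names: {dupes}")
```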
