
Convert Pandas column containing NaNs to dtype `int`

I read data from a .csv file to a Pandas dataframe as below. For one of the columns, namely id, I want to specify the column type as int. The problem is the id series has missing/empty values.

When I try to cast the id column to integer while reading the .csv, I get:

df= pd.read_csv("data.csv", dtype={'id': int}) 
error: Integer column has NA values

Alternatively, I tried to convert the column type after reading as below, but this time I get:

df= pd.read_csv("data.csv") 
df[['id']] = df[['id']].astype(int)
error: Cannot convert NA to integer

How can I tackle this?

I think that integer values cannot be converted or stored in a series/dataframe if there are missing/NaN values. I believe this is down to NumPy compatibility (I'm guessing here); if you want missing-value compatibility, I would store the values as floats.
see here: pandas.pydata.org/pandas-docs/dev/…; you must have a float dtype when you have missing values (or technically object dtype, but that is inefficient); what is your goal in using an int type?
I believe this is a NumPy issue, not specific to Pandas. It's a shame since there are so many cases when having an int type that allows for the possibility of null values is much more efficient than a large column of floats.
I have a problem with this too. I have multiple dataframes which I want to merge based on a string representation of several "integer" columns. However, when one of those integer columns has a np.nan, the string casting produces a ".0", which throws off the merge. It just makes things slightly more complicated; it would be nice if there were a simple workaround.
@Rhubarb, Optional Nullable Integer Support is now officially added in pandas 0.24.0 (finally!); please find an updated answer below. pandas 0.24.x release notes

jezrael

In version 0.24+, pandas gained the ability to hold integer dtypes with missing values.

Nullable Integer Data Type.

Pandas can represent integer data with possibly missing values using arrays.IntegerArray. This is an extension type implemented within pandas. It is not the default dtype for integers and will not be inferred; you must explicitly pass the dtype into array() or Series:

arr = pd.array([1, 2, np.nan], dtype=pd.Int64Dtype())
pd.Series(arr)

0      1
1      2
2    NaN
dtype: Int64

To convert a column to nullable integers, use:

df['myCol'] = df['myCol'].astype('Int64')

Note that the dtype must be "Int64" and not "int64" (the first 'i' must be capitalized).
df.myCol = df.myCol.astype('Int64') or df['myCol'] = df['myCol'].astype('Int64')
It may be obvious to some, but I think it is still worth noting that you can use any Int width (e.g. Int16, Int32), and indeed probably should if the dataframe is very large, to save memory.
@jezrael, in which cases does this not work? It does not work for me and I have failed to find a generic solution.
I'm getting TypeError: cannot safely cast non-equivalent float64 to int64

The lack of NaN representation in integer columns is a pandas "gotcha".

The usual workaround is to simply use floats.
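
A minimal sketch of that workaround, using the file and column from the question:

import pandas as pd

df = pd.read_csv("data.csv")        # with NaNs present, 'id' is parsed as float64
df['id'] = df['id'].astype(float)   # explicit cast; NaN is a valid float value
print(df['id'].dtype)               # float64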


Are there any other workarounds besides treating them like floats?
@jsc123 you can use the object dtype. This comes with a small health warning but for the most part works well.
Can you provide an example of how to use object dtype? I've been looking through the pandas docs and googling, and I've read it's the recommended method. But, I haven't found an example of how to use the object dtype.
In v0.24, you can now do df = df.astype(pd.Int32Dtype()) (to convert the entire dataFrame, or) df['col'] = df['col'].astype(pd.Int32Dtype()). Other accepted nullable integer types are pd.Int16Dtype and pd.Int64Dtype. Pick your poison.
It is a NaN value, but isnan checking doesn't work on it at all :(
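
For reference, a minimal sketch of the object-dtype workaround asked about above (hypothetical series; non-missing entries stay plain Python ints):

import numpy as np
import pandas as pd

s = pd.Series([1, 2, np.nan], dtype=object)
print(s.dtype)     # object
print(type(s[0]))  # <class 'int'>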
hibernado

My use case is munging data prior to loading into a DB table:

df[col] = df[col].fillna(-1)              # sentinel that can live in an int column
df[col] = df[col].astype(int)             # now safe: no NaNs left
df[col] = df[col].astype(str)             # stringify for the DB load
df[col] = df[col].replace('-1', np.nan)   # reinsert NaN; the column is now object

Remove the NaNs, convert to int, convert to str, and then reinsert the NaNs.

It's not pretty but it gets the job done!


I have been pulling my hair out trying to load serial numbers where some are null and the rest are floats, this saved me.
The OP wants a column of integers. Converting it to string does not meet the condition.
Works only if col doesn't already contain -1. Otherwise, it will mess with the data.
Then how do you get back to int?
This produces a column of strings!! For a solution with current versions of pandas, see stackoverflow.com/questions/58029359/…
mork

It is now possible to create a pandas column containing NaNs as dtype int, since this was officially added in pandas 0.24.0.

pandas 0.24.x release notes. Quote: "Pandas has gained the ability to hold integer dtypes with missing values."
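
A short sketch of the feature, assuming the same data.csv and id column as in the question:

import pandas as pd

df = pd.read_csv("data.csv")
df['id'] = df['id'].astype('Int64')   # nullable integer dtype; missing values show as <NA>
print(df['id'].dtype)                 # Int64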


Abhishek Bhatia

Whether your pandas Series is object dtype or simply float dtype, the method below will work:

df = pd.read_csv("data.csv") 
df['id'] = df['id'].astype(float).astype('Int64')

Thank you @Abhishek Bhatia, this worked for me.
jmenglund

If you absolutely want to combine integers and NaNs in a column, you can use the 'object' data type:

df['col'] = (
    df['col'].fillna(0)
    .astype(int)
    .astype(object)
    .where(df['col'].notnull())
)

This will replace NaNs with an integer (doesn't matter which), convert to int, convert to object and finally reinsert NaNs.


Kamil

I had this problem a few weeks ago with a few discrete features which were formatted as 'object'. This solution seemed to work:

for col in discrete:
    df[col] = pd.to_numeric(df[col], errors='coerce').astype(pd.Int64Dtype())

elomage

You could use .dropna() if it is OK to drop the rows with the NaN values.

df = df.dropna(subset=['id'])

Alternatively, use .fillna() and .astype() to replace the NaN with values and convert them to int.

I ran into this problem when processing a CSV file with large integers, while some of them were missing (NaN). Using float as the type was not an option, because I might lose precision.

My solution was to use str as the intermediate type. Then you can convert the string to int as you please later in the code. I replaced NaN with 0, but you could choose any value.

df = pd.read_csv(filename, dtype={'id':str})
df["id"] = df["id"].fillna("0").astype(int)

For illustration, here is an example of how floats may lose precision:

s = "12345678901234567890"
f = float(s)
i = int(f)
i2 = int(s)
print(f, i, i2)

And the output is:

1.2345678901234567e+19 12345678901234567168 12345678901234567890

Bradon

As of Pandas 1.0.0 you can now use pandas.NA values. This does not force integer columns with missing values to be floats.

When reading in your data all you have to do is:

df= pd.read_csv("data.csv", dtype={'id': 'Int64'})  

Notice the 'Int64' is surrounded by quotes and the I is capitalized. This distinguishes pandas' 'Int64' from NumPy's int64.

As a side note, this will also work with .astype()

df['id'] = df['id'].astype('Int64')

Documentation here https://pandas.pydata.org/pandas-docs/stable/user_guide/integer_na.html


gboffi

If you can modify your stored data, use a sentinel value for missing id. In the common use case, inferred from the column name, id is an integer strictly greater than zero, so you could use 0 as the sentinel value and write:

if row['id']:
    regular_process(row)
else:
    special_process(row)
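
The conversion step itself isn't shown above; a minimal version, assuming 0 never occurs as a legitimate id:

df['id'] = df['id'].fillna(0).astype(int)   # 0 is the sentinel for missing ids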

Corbin

Most solutions here tell you how to use a placeholder integer to represent nulls. That approach isn't helpful if you're uncertain the placeholder integer won't show up in your source data, though. My method will format floats without their decimal values and convert nulls to None. The result is an object column that will look like an integer field with null values when loaded into a CSV.

keep_df[col] = keep_df[col].apply(
    lambda x: None if pandas.isnull(x) else '{0:.0f}'.format(pandas.to_numeric(x))
)

this approach can add a lot of memory overhead, especially on larger dataframes
Alex Metsai

Use .fillna() to replace all NaN values with 0, then convert the column to int using astype(int):

df['id'] = df['id'].fillna(0).astype(int)

Works but I think replacing NaN with 0 changes the meaning of the data.
Monaheng Ramochele
import pandas as pd

df= pd.read_csv("data.csv")
df['id'] = pd.to_numeric(df['id'])

Is there a reason you prefer this formulation over the one proposed in the accepted answer? If so, it'd be useful to edit your answer to provide that explanation, especially since there are ten additional answers competing for attention.
While this code may resolve the OP's issue, it is best to include an explanation as to how/why your code addresses it. That way, future visitors can learn from your post and apply it to their own code. SO is not a coding service, but a resource for knowledge. Also, high-quality, complete answers are more likely to be upvoted. These features, along with the requirement that all posts be self-contained, are some of the strengths of SO as a platform that differentiate it from forums. You can edit to add additional info and/or to supplement your explanations with source documentation.
Mehdi Golzadeh

If you want to use it when you chain methods, you can use assign:

df = df.assign(col=lambda x: x['col'].astype('Int64'))

TWebbs

For anyone needing to have int values within NULL/NaN-containing columns, but working under the constraint of being unable to use the pandas 0.24.0 nullable integer features mentioned in other answers, I suggest converting the columns to object type using DataFrame.where:

df = df.where(pd.notnull(df), None)

This converts all NaNs in the dataframe to None, treating mixed-type columns as objects, but leaving the int values as int, rather than float.
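
A small demonstration on a hypothetical column (on the older pandas versions this answer targets, the affected columns become object dtype, with None in place of NaN):

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1.0, np.nan, 3.0]})
df = df.where(pd.notnull(df), None)   # NaN -> None, dtype -> object
print(df['id'].dtype)                 # object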


Digestible1010101

First you need to specify the newer integer type, Int8 (...Int64), which can handle null integer data (pandas version >= 0.24.0):

df = df.astype('Int8')

But you may want to target only specific columns which have integer data mixed with NaN/nulls:

df = df.astype({'col1': 'Int8', 'col2': 'Int8', 'col3': 'Int8'})

At this point, the NaNs are converted into <NA>, and if you want to change the default null value with df.fillna(), you need to coerce the object datatype on the columns you wish to change; otherwise you will see TypeError: <U1 cannot be converted to an IntegerDtype

You can do this by df = df.astype(object) if you don't mind changing the dtype of every column to object (individually, each value's type is still preserved), or df = df.astype({"col1": object, "col2": object}) if you prefer to target individual columns.

This should help with forcing your integer columns mixed with nulls to stay formatted as integers and change the null values to whatever you like. I can't speak to the efficiency of this method, but it worked for my formatting and printing purposes.
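
A sketch of that sequence under the assumptions above (a single hypothetical col1 holding integer data with NaNs):

df = df.astype({'col1': 'Int8'})       # NaNs become <NA>
df = df.astype({'col1': object})       # coerce so fillna can accept a non-integer
df['col1'] = df['col1'].fillna('n/a')  # 'n/a' is an arbitrary example null marker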


luca992

With pandas versions above 0.24, the Int64 type supports NaN.

You may run into an error if your floats haven't been rounded, floored, or ceilinged first.

df['A'] = np.floor(pd.to_numeric(df['A'], errors='coerce')).astype('Int64')

Source: https://stackoverflow.com/a/67021201/1363742


Neuneck

I ran into this issue working with pyspark. As this is a Python frontend for code running on a JVM, it requires type safety, and using float instead of int is not an option. I worked around the issue by wrapping pd.read_csv in a function that fills user-defined columns with user-defined fill values before casting them to the required type. Here is what I ended up using:

def custom_read_csv(file_path, custom_dtype=None, fill_values=None, **kwargs):
    if custom_dtype is None:
        return pd.read_csv(file_path, **kwargs)
    assert 'dtype' not in kwargs
    df = pd.read_csv(file_path, **kwargs)
    for col, typ in custom_dtype.items():
        # use the caller's fill value for this column, or -1 as the default sentinel
        if fill_values is None or col not in fill_values:
            fill_val = -1
        else:
            fill_val = fill_values[col]
        df[col] = df[col].fillna(fill_val).astype(typ)
    return df

Nikhil Redij

Try this:

df[['id']] = df[['id']].astype(pd.Int64Dtype())

If you print its dtypes, you will see id has dtype Int64 instead of the normal int64.


buhtz

The following solution is the only one that serves my purpose, and I think it is the best solution when using a recent Pandas version.

df['A'] = np.floor(pd.to_numeric(df['A'], errors='coerce')).astype('Int64')

I found the solution on Stack Overflow; see the link below for more information. https://stackoverflow.com/a/67021201/9294498


Please explain your solution.
kamran kausar

First remove the rows that contain NaN. Then do the integer conversion on the remaining rows. At last, insert the removed rows again. Hope it will work.
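
A literal sketch of this recipe on a hypothetical frame (note the caveat in the comments):

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1.0, np.nan, 3.0]})

missing = df[df['id'].isna()]              # set aside the rows with missing ids
present = df.dropna(subset=['id']).copy()
present['id'] = present['id'].astype(int)  # integer conversion on the rest

# Caveat: re-inserting the removed rows upcasts 'id' back to float64,
# so the int dtype only survives while the frames are kept separate.
df = pd.concat([present, missing]).sort_index()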


WolVes

The issue with Int64, like many other solutions here, is that if you have null values, they get replaced with <NA> values, which do not work with pandas' default 'NaN' functions, like isnull() or fillna(). Or, if you convert values to -1, you end up in a situation where you may be deleting your information. My solution is a little lame, but will provide int values alongside np.nan, allowing NaN functions to work without compromising your values.

def to_int(x):
    try:
        return int(x)
    except (TypeError, ValueError):
        return np.nan

df[column] = df[column].apply(to_int)

mqx

I had a similar problem. This was my solution:

def toint(zahl=1.1):
    try:
        zahl = int(zahl)
    except (TypeError, ValueError):
        zahl = np.nan
    return zahl

print(toint(4.776655), toint(np.nan), toint('test'))

4 nan nan

df = pd.read_csv("data.csv")
df['id'] = df['id'].astype(float)
df['id'] = df['id'].apply(toint)   # apply element-wise; toint(df['id']) would turn the whole column into NaN

lassebenninga

Since I didn't see the answer here, I might as well add it:

A one-liner to convert NaNs to an empty string, if for some reason you still can't handle np.nan or pd.NA (like me, when relying on a library with an older version of pandas). Note that NaN-bearing numeric columns are floats, so the sentinel shows up as '-1.0' after the string cast:

df.select_dtypes('number').fillna(-1).astype(str).replace(['-1', '-1.0'], '')


caution with this approach... if any of your data really is -1, it will be overwritten.
Nimantha

I think the approach of @Digestible1010101 is the more appropriate one for pandas 1.2+ versions; something like this should do the job:

df = df.astype({
    'col_1': 'Int64',
    'col_2': 'Int64',
    'col_3': 'Int64',
    'col_4': 'Int64',
})

Justin Malinchak

Assuming your DateColumn, formatted as 3312018.0, should be converted to 03/31/2018 as a string, and some records are missing or 0:

df['DateColumn'] = df['DateColumn'].astype(int)                    # 3312018.0 -> 3312018
df['DateColumn'] = df['DateColumn'].astype(str)                    # 3312018 -> '3312018'
df['DateColumn'] = df['DateColumn'].apply(lambda x: x.zfill(8))    # '3312018' -> '03312018'
df.loc[df['DateColumn'] == '00000000', 'DateColumn'] = '01011980'  # placeholder date for missing/0
df['DateColumn'] = pd.to_datetime(df['DateColumn'], format="%m%d%Y")
df['DateColumn'] = df['DateColumn'].apply(lambda x: x.strftime('%m/%d/%Y'))

KeepLearning

Use pd.to_numeric():

df["DateColumn"] = pd.to_numeric(df["DateColumn"])

simple and clean


If there are NaN values in the column, pd.to_numeric will convert the dtype to float not int because NaN is considered a float.
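
A quick illustration of that caveat, plus the nullable-dtype chain that avoids it (hypothetical series):

import pandas as pd

s = pd.Series(['1', '2', None])
print(pd.to_numeric(s).dtype)                  # float64, because NaN is a float
print(pd.to_numeric(s).astype('Int64').dtype)  # Int64, the nullable integer dtype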