Convert floats to ints in Pandas?

I've been working with data imported from a CSV. Pandas changed some columns to float, so the numbers in these columns are now displayed as floating-point values. However, I need them displayed as integers, or without the decimal point. Is there a way to convert them to integers, or to hide the decimal part?

You can change the type (so long as there are no missing values): df.col = df.col.astype(int)
This question is two questions at the same time, and the title reflects only one of them.
For anyone who hits the above and finds it useful in concept but not working, this is the version that worked for me in Python 3.7.5 with pandas X: df = df.astype(int)

EdChum

To modify the float output do this:

df = pd.DataFrame(range(5), columns=['a'])
df.a = df.a.astype(float)
df

Out[33]:

          a
0 0.0000000
1 1.0000000
2 2.0000000
3 3.0000000
4 4.0000000

pd.options.display.float_format = '{:,.0f}'.format
df

Out[35]:

   a
0  0
1  1
2  2
3  3
4  4
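
Note that the float_format option only changes how floats are rendered; the column itself stays float64. A minimal sketch to check this, assuming a small made-up frame and using pd.option_context (my choice here, so the setting is not left switched on globally):

```python
import pandas as pd

df = pd.DataFrame({'a': [0.0, 1.0, 2.0]})

# Apply the format only inside this block instead of changing the global option.
with pd.option_context('display.float_format', '{:,.0f}'.format):
    rendered = df.to_string()

print(rendered)       # floats are rendered without a decimal part
print(df['a'].dtype)  # the underlying dtype is still float64
```

If the values need to be real integers (for joins, keys, or export), a dtype conversion is still required.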

In the latest version of pandas you need to add copy = False to the arguments of astype to avoid a warning
Is it necessary to do df.a = df.a.astype(float)? Does this make a copy (not sure how the copy param to astype() is used)? Is there any way to update the type "in place"?
@EdChum, is there a way to prevent Pandas from converting types to begin with? For example try DF.({'200': {'#': 354, '%': 0.9971830985915493}, '302': {'#': 1, '%': 0.0028169014084507044}}) Note that the # values get converted to float, and they are rows, not columns. Is that because each is a Series, which can only store a single uniform type?
@alancalvitti what is your intention here: to preserve the values or the dtype? If it's the dtype, then you need to create those columns as dtype object so that mixed types are allowed; otherwise my advice would be to just use float and, when doing comparisons, use np.isclose
@EdChum, the intention is to preserve the input types. So the # above should remain ints, while the % are typically floats.
Jaroslav Bezděk

Use the pandas.DataFrame.astype(<type>) function to manipulate column dtypes.

>>> df = pd.DataFrame(np.random.rand(3,4), columns=list("ABCD"))
>>> df
          A         B         C         D
0  0.542447  0.949988  0.669239  0.879887
1  0.068542  0.757775  0.891903  0.384542
2  0.021274  0.587504  0.180426  0.574300
>>> df[list("ABCD")] = df[list("ABCD")].astype(int)
>>> df
   A  B  C  D
0  0  0  0  0
1  0  0  0  0
2  0  0  0  0

EDIT:

To handle missing values:

>>> df
          A         B     C         D
0  0.475103  0.355453  0.66  0.869336
1  0.260395  0.200287   NaN  0.617024
2  0.517692  0.735613  0.18  0.657106
>>> df[list("ABCD")] = df[list("ABCD")].fillna(0.0).astype(int)
>>> df
   A  B  C  D
0  0  0  0  0
1  0  0  0  0
2  0  0  0  0
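
The all-zero output above follows from the fact that astype(int) drops the fractional part rather than rounding, and every value from np.random.rand lies in [0, 1). A small sketch with made-up values, showing the difference when round() is applied first:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.9, -1.9, 2.4, np.nan])

truncated = s.fillna(0.0).astype(int)        # drops the fractional part
rounded = s.fillna(0.0).round().astype(int)  # rounds to the nearest integer first

print(truncated.tolist())
print(rounded.tolist())
```

Whether filling NaN with 0 is acceptable depends on the data; for identifier columns a 0 may be mistaken for a real value.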

I tried your approach and it gives me a ValueError: Cannot convert NA to integer
@MJP You cannot convert series from float to integer if there are missing values see pandas.pydata.org/pandas-docs/stable/…, you have to use floats
The values aren't missing, but the column doesn't specify a value for each row on purpose. Is there any way to achieve a workaround? Since those values are foreign key ids, I need ints.
I've made an edit in which all NaN's are replaced with a 0.0.
Or better yet, if you are only modifying a CSV, then: df.to_csv("path.csv",na_rep="",float_format="%.0f",index=False) But this will edit all the floats, so it may be better to convert your FK column to a string, do the manipulation, and then save.
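
A quick way to try the to_csv suggestion from the comment above without touching a file: when no path is passed, to_csv returns the text. The frame and the column names (name, fk_id) are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b', 'c'], 'fk_id': [1.0, np.nan, 3.0]})

# With no path given, to_csv returns the CSV text as a string.
csv_text = df.to_csv(na_rep="", float_format="%.0f", index=False)
print(csv_text)
```

As the comment notes, float_format applies to every float column, so genuinely fractional columns would lose their decimals too.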
Jaroslav Bezděk

Considering the following data frame:

>>> df = pd.DataFrame(10*np.random.rand(3, 4), columns=list("ABCD"))
>>> print(df)
...           A         B         C         D
... 0  8.362940  0.354027  1.916283  6.226750
... 1  1.988232  9.003545  9.277504  8.522808
... 2  1.141432  4.935593  2.700118  7.739108

Using a list of column names, change the type for multiple columns with applymap():

>>> cols = ['A', 'B']
>>> df[cols] = df[cols].applymap(np.int64)
>>> print(df)
...    A  B         C         D
... 0  8  0  1.916283  6.226750
... 1  1  9  9.277504  8.522808
... 2  1  4  2.700118  7.739108

Or for a single column with apply():

>>> df['C'] = df['C'].apply(np.int64)
>>> print(df)
...    A  B  C         D
... 0  8  0  1  6.226750
... 1  1  9  9  8.522808
... 2  1  4  2  7.739108
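
As a hedged aside: as far as I know, pandas 2.1 and later deprecate DataFrame.applymap in favor of DataFrame.map, so the first snippet may emit a warning there. A sketch of the same conversion with astype and a dictionary, which does one vectorized cast per column (the values below are shortened copies of the ones above):

```python
import pandas as pd

df = pd.DataFrame({
    'A': [8.36, 1.98, 1.14],
    'B': [0.35, 9.00, 4.93],
    'C': [1.91, 9.27, 2.70],
})

cols = ['A', 'B']
# One vectorized cast per listed column; columns not listed keep their dtype.
converted = df.astype({col: 'int64' for col in cols})
print(converted.dtypes)
```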

What if there is a NaN in the value?
@Zhang18 I tried this solution and in case of NaN you have this error: ValueError: ('cannot convert float NaN to integer', u'occurred at index <column_name>')
@enri: Can try the following code - df['C'] = df['C'].dropna().apply(np.int64)
smci

To convert all float columns to int

>>> df = pd.DataFrame(np.random.rand(5, 4) * 10, columns=list('PQRS'))
>>> print(df)
...     P           Q           R           S
... 0   4.395994    0.844292    8.543430    1.933934
... 1   0.311974    9.519054    6.171577    3.859993
... 2   2.056797    0.836150    5.270513    3.224497
... 3   3.919300    8.562298    6.852941    1.415992
... 4   9.958550    9.013425    8.703142    3.588733

>>> float_col = df.select_dtypes(include=['float64']) # This will select float columns only
>>> # list(float_col.columns.values)

>>> for col in float_col.columns.values:
...     df[col] = df[col].astype('int64')

>>> print(df)
...     P   Q   R   S
... 0   4   0   8   1
... 1   0   9   6   3
... 2   2   0   5   3
... 3   3   8   6   1
... 4   9   9   8   3
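
The loop can also be written as a single astype call with a dictionary built from the selected columns. This is only a sketch; the extra label column is made up to show that non-float columns are left alone.

```python
import pandas as pd

df = pd.DataFrame({'P': [4.39, 0.31], 'Q': [0.84, 9.51], 'label': ['x', 'y']})

# Names of the float columns only.
float_cols = df.select_dtypes(include=['float64']).columns

# Cast just those columns; everything else is passed through unchanged.
df = df.astype({col: 'int64' for col in float_cols})
print(df.dtypes)
```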

Jaroslav Bezděk

This is a quick solution in case you want to convert several columns of your pandas.DataFrame from float to integer, also covering the case where they may contain NaN values.

cols = ['col_1', 'col_2', 'col_3', 'col_4']
for col in cols:
    df[col] = df[col].apply(lambda x: int(x) if x == x else "")

I tried else x and else None, but the result still contained float numbers, so I used else "".
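
For readers wondering about the x == x test: NaN is the only float that is not equal to itself, so the comparison is False exactly for missing values. A small sketch on a made-up Series, which also shows the cost of this approach: the column ends up with object dtype because it mixes int and str.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# x == x is False only for NaN, so NaN cells take the else branch.
as_object = s.apply(lambda x: int(x) if x == x else "")

print(as_object.tolist())
print(as_object.dtype)  # object: the column now mixes int and str
```

My understanding of why else x and else None fell back to floats is that pandas stores the missing entry as NaN again, and a NumPy integer column cannot hold NaN.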


it will apply the "" to all the values in col
It will apply empty string ("") to all the missing values, if that is what is required, but the rest of the values will be integer.
Thanks for this. This worked when .astype() and .apply(np.int64) did not.
This feels hacky, and I see no reason to use it over the many alternatives available.
Thanks, this was the only answer that properly handled NaN and preserves them (as empty string or 'N/A') while converting other values to int.
Community

Expanding on @Ryan G's mention of the pandas.DataFrame.astype(<type>) method, one can use the errors='ignore' argument to convert only those columns that do not produce an error, which notably simplifies the syntax. Obviously, caution should be applied when ignoring errors, but for this task it comes in very handy.

>>> df = pd.DataFrame(np.random.rand(3, 4), columns=list('ABCD'))
>>> df *= 10
>>> print(df)
...           A       B       C       D
... 0   2.16861 8.34139 1.83434 6.91706
... 1   5.85938 9.71712 5.53371 4.26542
... 2   0.50112 4.06725 1.99795 4.75698

>>> df['E'] = list('XYZ')
>>> df = df.astype(int, errors='ignore')
>>> print(df)
...     A   B   C   D   E
... 0   2   8   1   6   X
... 1   5   9   5   4   Y
... 2   0   4   1   4   Z

From pandas.DataFrame.astype docs:

errors : {‘raise’, ‘ignore’}, default ‘raise’
    Control raising of exceptions on invalid data for provided dtype.
    raise : allow exceptions to be raised
    ignore : suppress exceptions. On error return original object
    New in version 0.20.0.
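
One detail worth keeping in mind with any of these astype calls: astype returns a new DataFrame rather than modifying df in place, so the result has to be assigned back. A minimal sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({'A': [2.1, 5.8], 'B': [8.3, 9.7]})

df.astype(int)                     # result discarded: df is unchanged
dtype_before = str(df['A'].dtype)

df = df.astype(int)                # result assigned back: df now holds integers
dtype_after = str(df['A'].dtype)

print(dtype_before, dtype_after)
```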


prashanth

The columns that need to be converted to int can also be specified in a dictionary, as below:

df = df.astype({'col1': 'int', 'col2': 'int', 'col3': 'int'})

Jaroslav Bezděk

>>> import pandas as pd
>>> right = pd.DataFrame({'C': [1.002, 2.003], 'D': [1.009, 4.55], 'key': ['K0', 'K1']})
>>> print(right)
       C      D key
0  1.002  1.009  K0
1  2.003  4.550  K1
>>> right['C'] = right.C.astype(int)
>>> print(right)
   C      D key
0  1  1.009  K0
1  2  4.550  K1

tdy

Use 'Int64' for NaN support

astype(int) and astype('int64') cannot handle missing values (numpy int)

astype('Int64') can handle missing values (pandas int)

df['A'] = df['A'].astype('Int64') # capital I

This assumes you want to keep missing values as NaN. If you plan to impute them, you could fillna first as Ryan suggested.

Examples of 'Int64' (capital I)

If the floats are already rounded, just use astype:

df = pd.DataFrame({'A': [99.0, np.nan, 42.0]})
df['A'] = df['A'].astype('Int64')
#       A
# 0    99
# 1  <NA>
# 2    42

If the floats are not rounded yet, round before astype:

df = pd.DataFrame({'A': [3.14159, np.nan, 1.61803]})
df['A'] = df['A'].round().astype('Int64')
#       A
# 0     3
# 1  <NA>
# 2     2

To read int+NaN data from a file, use dtype='Int64' to avoid the need for converting at all:

csv = io.StringIO('''
id,rating
foo,5
bar,
baz,2
''')
df = pd.read_csv(csv, dtype={'rating': 'Int64'})
#     id  rating
# 0  foo       5
# 1  bar    <NA>
# 2  baz       2
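
The reason for rounding first, as far as I can tell, is that the cast to 'Int64' refuses floats that are not whole numbers instead of silently truncating them. A sketch that records whether the direct cast fails (the exact exception type is an assumption, so both TypeError and ValueError are caught):

```python
import numpy as np
import pandas as pd

s = pd.Series([3.14159, np.nan, 1.61803])

try:
    s.astype('Int64')       # non-integral floats cannot be cast safely
    cast_failed = False
except (TypeError, ValueError):
    cast_failed = True

rounded = s.round().astype('Int64')  # whole-number floats cast cleanly, NaN becomes <NA>
print(cast_failed)
print(rounded)
```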

Notes

'Int64' is an alias for Int64Dtype:

df['A'] = df['A'].astype(pd.Int64Dtype()) # same as astype('Int64')

Sized/signed aliases are available:

alias      lower bound                   upper bound
'Int8'     -128                          127
'Int16'    -32,768                       32,767
'Int32'    -2,147,483,648                2,147,483,647
'Int64'    -9,223,372,036,854,775,808    9,223,372,036,854,775,807
'UInt8'    0                             255
'UInt16'   0                             65,535
'UInt32'   0                             4,294,967,295
'UInt64'   0                             18,446,744,073,709,551,615


Francisco Puga

The question explains that the data comes from a CSV, so I think options for doing the conversion when the data is read, rather than afterwards, are relevant to the topic.

When importing spreadsheets or CSV files into a dataframe, integer-only columns are commonly converted to float, because Excel stores all numerical values as floats and because of how the underlying libraries work.

When the file is read with read_excel or read_csv, there are a couple of options to avoid converting after import:

The dtype parameter accepts a dictionary of column names and target types, like dtype = {"my_column": "Int64"}

The converters parameter can be used to pass a function that makes the conversion, for example replacing NaNs with 0: converters = {"my_column": lambda x: int(x) if x else 0}

The convert_float parameter will convert "integral floats to int (i.e., 1.0 –> 1)", but take care with corner cases like NaNs. This parameter is only available in read_excel
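
A sketch of the first two options with read_csv, reading from an in-memory string instead of a file. The CSV content is made up; my_column is the placeholder name used above.

```python
import io
import pandas as pd

csv_text = "id,my_column\nfoo,5\nbar,\nbaz,2\n"

# Option 1: a nullable integer dtype keeps the gap as <NA>.
df_dtype = pd.read_csv(io.StringIO(csv_text), dtype={"my_column": "Int64"})

# Option 2: a converter receives each raw cell as a string; empty cells become 0.
df_conv = pd.read_csv(
    io.StringIO(csv_text),
    converters={"my_column": lambda x: int(x) if x else 0},
)

print(df_dtype)
print(df_conv)
```

The dtype route preserves the information that a value was missing, while the converter route replaces it with 0, so the choice depends on what a gap means in the data.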

To make the conversion in an existing dataframe, several alternatives have been given in other answers, but since v1.0.0 pandas has an interesting function for these cases: convert_dtypes, which will "Convert columns to best possible dtypes using dtypes supporting pd.NA."

As example:

In [3]: import numpy as np                                                                                                                                                                                         

In [4]: import pandas as pd                                                                                                                                                                                        

In [5]: df = pd.DataFrame( 
   ...:     { 
   ...:         "a": pd.Series([1, 2, 3], dtype=np.dtype("int64")), 
   ...:         "b": pd.Series([1.0, 2.0, 3.0], dtype=np.dtype("float")), 
   ...:         "c": pd.Series([1.0, np.nan, 3.0]), 
   ...:         "d": pd.Series([1, np.nan, 3]), 
   ...:     } 
   ...: )                                                                                                                                                                                                          

In [6]: df                                                                                                                                                                                                         
Out[6]: 
   a    b    c    d
0  1  1.0  1.0  1.0
1  2  2.0  NaN  NaN
2  3  3.0  3.0  3.0

In [7]: df.dtypes                                                                                                                                                                                                  
Out[7]: 
a      int64
b    float64
c    float64
d    float64
dtype: object

In [8]: converted = df.convert_dtypes()                                                                                                                                                                            

In [9]: converted.dtypes                                                                                                                                                                                           
Out[9]: 
a    Int64
b    Int64
c    Int64
d    Int64
dtype: object

In [10]: converted                                                                                                                                                                                                 
Out[10]: 
   a  b     c     d
0  1  1     1     1
1  2  2  <NA>  <NA>
2  3  3     3     3


This is the answer people need to look at if they're using pandas >= 1.0. Thanks so much!
Fellipe Alcantara

Although there are many options here, you can also convert the dtype of specific columns using a dictionary:

Data = pd.read_csv('Your_Data.csv')

Data_2 = Data.astype({"Column a":"int32", "Column_b": "float64", "Column_c": "int32"})

print(Data_2.dtypes) # Check the dtypes of the columns

This is a useful and very fast way to change the dtype of specific columns for quick data analysis.