pandas string data types

Question

I want to specify data types for pandas read_csv. Here's a small example that works when no types are given and then fails once types are specified. Why doesn't the second call work?

import io
import pandas as pd

csv = """foo,1234567,a,1 
foo,2345678,b,3 
bar,3456789,b,5 
"""

df = pd.read_csv(io.StringIO(csv),
        names=["fb", "num", "loc", "x"])

print(df)

df = pd.read_csv(io.StringIO(csv),
        names=["fb", "num", "loc", "x"], 
        dtype=["|S3", "np.int64", "|S1", "np.int8"])

print(df)

I've updated the example to make it much simpler and, hopefully, clearer, at BrenBarn's suggestion. My real dataset is much larger, but I'd like to use this method to specify types for all my data on import.

Have you tried making a simpler dataset and trying with just one or two datatypes to see which one is causing the problem? — BrenBarn, Commented Sep 29, 2013 at 18:06
I'll do that, though the error it throws now suggests (to my novice mind) that I'm not specifying correctly, not that there is a mismatch between my specification and the data. But I'll give it a shot and report back! — Don, Commented Sep 29, 2013 at 18:28
pandas will convert a specified string dtype, like S20 to object dtype which represents string types. Why is that a problem? This is the standard way of representing variable length strings (and is actually more efficient than a fixed S20 dtype) — Jeff, Commented Sep 29, 2013 at 18:43
@Jeff Oh, cool. So if object is more efficient than string_ types, then I'm happy with that piece. I'd like to specify all my integer types at int32 or less rather than int64, though. I guess I can try converting them post-import. — Don, Commented Sep 29, 2013 at 19:02
see docs, basically dtype = { 'column_1' : np.int32, 'column_2' : np.int64 }. You don't need to specify object as that will happen automatically for string-like columns — Jeff, Commented Sep 29, 2013 at 19:49
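The post-import conversion mentioned in the comments above could be sketched roughly as follows. This is only an illustration, not code from the thread: the variable names `csv_text` and `small` are made up here, and it assumes every value fits in the smaller integer type.

```python
import io

import numpy as np
import pandas as pd

csv_text = """foo,1234567,a,1
foo,2345678,b,3
bar,3456789,b,5
"""

# Read without dtype hints; pandas infers 64-bit integers for numeric columns.
df = pd.read_csv(io.StringIO(csv_text), names=["fb", "num", "ab", "x"])

# Convert selected columns after import (assumes the values fit the target types).
small = df.astype({"num": np.int32, "x": np.int8})
print(small.dtypes)
```

Passing the same kind of mapping to `read_csv` through `dtype` avoids the intermediate 64-bit columns.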

Don · Accepted Answer · 2013-09-30 15:40:54Z


As Jeff indicated, my syntax was wrong. The names and types have to be paired in a dict that maps each column name to its dtype. The code below works, but note that you can't specify a fixed string width as a dtype; string columns can only be declared as object.

import io

import numpy as np  # needed for np.int64 and np.int8 below
import pandas as pd

csv = """foo,1234567,a,1
foo,2345678,b,3
bar,3456789,b,5
"""

# dtype takes a dict mapping column names to types, not a list
df = pd.read_csv(io.StringIO(csv),
        names=["fb", "num", "ab", "x"],
        dtype={"fb": object, "num": np.int64, "ab": object, "x": np.int8})
print(df)
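For a larger dataset, where the names and types are kept in two parallel lists, the mapping can be built with `zip`. This is a minimal sketch under that assumption; the `names`, `types`, and `dtype_map` variables are introduced only for illustration.

```python
import io

import numpy as np
import pandas as pd

csv_text = """foo,1234567,a,1
foo,2345678,b,3
bar,3456789,b,5
"""

# Parallel lists: one type per column name, in the same order.
names = ["fb", "num", "ab", "x"]
types = [object, np.int64, object, np.int8]

# Pair each name with its type to get the dict that read_csv expects.
dtype_map = dict(zip(names, types))

df = pd.read_csv(io.StringIO(csv_text), names=names, dtype=dtype_map)
print(df.dtypes)
```

Keeping the two lists the same length matters here, because `zip` silently stops at the shorter one.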

edited Sep 30, 2013 at 15:40

answered Sep 30, 2013 at 0:31



Right, that's why I was asking about the simplification. I was thinking that if you tried to simplify it down you would maybe find out it didn't work at all, even for numeric types (although I didn't know for sure). It still seems lame that you can't specify an actual string dtype though.
– BrenBarn
Commented Sep 30, 2013 at 0:45

pandas doesn't support the internal string types (in fact they are always converted to object).
– Jeff
Commented Sep 30, 2013 at 1:30
