
I want to specify data types for pandas read_csv. Below is a quick example that works without types and then fails once types are specified. Why doesn't the second call work?

import io
import pandas as pd

csv = """foo,1234567,a,1 
foo,2345678,b,3 
bar,3456789,b,5 
"""

df = pd.read_csv(io.StringIO(csv),
        names=["fb", "num", "loc", "x"])

print(df)

df = pd.read_csv(io.StringIO(csv),
        names=["fb", "num", "loc", "x"], 
        dtype=["|S3", "np.int64", "|S1", "np.int8"])

print(df)

I've simplified this example at BrenBarn's suggestion so that it is, hopefully, clearer. My real dataset is much larger, but I'd like to use this method to set the types for all my data on import.

  • Have you tried making a simpler dataset and trying with just one or two datatypes to see which one is causing the problem?
    – BrenBarn
    Commented Sep 29, 2013 at 18:06
  • I'll do that, though the error it throws now suggests (to my novice mind) that I'm not specifying correctly, not that there is a mismatch between my specification and the data. But I'll give it a shot and report back!
    – Don
    Commented Sep 29, 2013 at 18:28
  • pandas will convert a specified string dtype, like S20, to object dtype, which is how it represents strings. Why is that a problem? This is the standard way of representing variable-length strings (and is actually more efficient than a fixed S20 dtype).
    – Jeff
    Commented Sep 29, 2013 at 18:43
  • @Jeff Oh, cool. So if object is more efficient than string_ types, then I'm happy with that piece. I'd like to specify all my integer types as int32 or smaller rather than int64, though. I guess I can try converting them post-import.
    – Don
    Commented Sep 29, 2013 at 19:02
  • See the docs; basically dtype = { 'column_1' : np.int32, 'column_2' : np.int64 }. You don't need to specify object, as that will happen automatically for string-like columns.
    – Jeff
    Commented Sep 29, 2013 at 19:49
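
The post-import conversion Don mentions could look roughly like the sketch below. It assumes the sample data from the question, borrows the column names from the answer, and assumes every value fits in the narrower integer type; `astype` with a dict converts only the listed columns and leaves the rest as read.

```python
import io

import numpy as np
import pandas as pd

csv = """foo,1234567,a,1
foo,2345678,b,3
bar,3456789,b,5
"""

# Read without dtype so pandas infers the column types itself.
df = pd.read_csv(io.StringIO(csv), names=["fb", "num", "ab", "x"])

# Narrow only the integer columns after import.
# Assumption: the values fit in int32 / int8.
df = df.astype({"num": np.int32, "x": np.int8})

print(df.dtypes)
```

Passing the same mapping as `dtype=` to `read_csv` avoids the intermediate wider columns, which may matter for a large file.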

1 Answer


As Jeff indicated, my syntax was wrong. dtype takes a dict mapping each column name to its type, not a list, so the names and types have to be zipped together into a dict. The code below works, but note that you can't specify a fixed-width string dtype; string columns can only be declared as object.

import io

import numpy as np  # needed for np.int64 and np.int8 below
import pandas as pd

csv = """foo,1234567,a,1
foo,2345678,b,3
bar,3456789,b,5
"""

# dtype takes a dict mapping column name -> type
df = pd.read_csv(io.StringIO(csv),
        names=["fb", "num", "ab", "x"],
        dtype={"fb": object, "num": np.int64, "ab": object, "x": np.int8})
print(df)
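
For a larger dataset, where the names and types already exist as two parallel lists, the zipping step could be sketched like this. The `names` and `types` lists here are illustrative, not taken from the real dataset.

```python
import io

import numpy as np
import pandas as pd

csv = """foo,1234567,a,1
foo,2345678,b,3
bar,3456789,b,5
"""

# Illustrative parallel lists: one dtype per column name, in order.
names = ["fb", "num", "ab", "x"]
types = [object, np.int64, object, np.int8]

# Pair each column name with its dtype to build the dict read_csv expects.
dtype_map = dict(zip(names, types))

df = pd.read_csv(io.StringIO(csv), names=names, dtype=dtype_map)
print(df.dtypes)
```

Note that `zip` silently stops at the shorter list, so it is worth checking that both lists have the same length first.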
  • Right, that's why I was asking about the simplification. I was thinking that if you tried to simplify it down you might find out that it didn't work at all, even for numeric types (although I didn't know for sure). It still seems lame that you can't specify an actual string dtype, though.
    – BrenBarn
    Commented Sep 30, 2013 at 0:45
  • pandas doesn't support the internal string types (in fact they are always converted to object).
    – Jeff
    Commented Sep 30, 2013 at 1:30
