I want to specify data types when reading a CSV with pandas read_csv. Here's a quick example of a read that works, followed by the same read failing once I try to specify the types. Why doesn't the latter work?
import io
import pandas as pd
csv = """foo,1234567,a,1
foo,2345678,b,3
bar,3456789,b,5
"""
df = pd.read_csv(io.StringIO(csv),
                 names=["fb", "num", "loc", "x"])  # works: types are inferred
print(df)

df = pd.read_csv(io.StringIO(csv),
                 names=["fb", "num", "loc", "x"],
                 dtype=["|S3", "np.int64", "|S1", "np.int8"])  # fails
print(df)
I've updated this to be much simpler and, hopefully, clearer, following BrenBarn's suggestion. My real dataset is much larger, but I'd like to use this method to specify the types for all of my data on import.
S20 maps to the object dtype, which represents string types. Why is that a problem? This is the standard way of representing variable-length strings (and is actually more efficient than a fixed S20 dtype).

If object is more efficient than the string_ types, then I'm happy with that piece. I'd like to specify all my integer types as int32 or smaller rather than int64, though. I guess I can try converting them post-import.

Pass dtype as a dict mapping column names to types, e.g. dtype = { 'column_1' : np.int32, 'column_2' : np.int64 }. You don't need to specify object, as that will happen automatically for string-like columns.
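Here is a minimal sketch of both suggestions applied to the example data above; the specific column-to-type mapping (int32 for "num", int8 for "x") is an assumption carried over from the original attempt, not something the answer prescribes.

import io
import numpy as np
import pandas as pd

csv = """foo,1234567,a,1
foo,2345678,b,3
bar,3456789,b,5
"""

# Specify the integer dtypes as a dict on import; the string columns
# ("fb" and "loc") are left out and come in as object automatically.
df = pd.read_csv(io.StringIO(csv),
                 names=["fb", "num", "loc", "x"],
                 dtype={"num": np.int32, "x": np.int8})
print(df.dtypes)

# Alternatively, read first and downcast afterwards (the post-import
# conversion mentioned in the comments above).
df2 = pd.read_csv(io.StringIO(csv), names=["fb", "num", "loc", "x"])
df2["num"] = df2["num"].astype(np.int32)
df2["x"] = df2["x"].astype(np.int8)
print(df2.dtypes)

Either way the string columns stay as object; the dict form simply avoids materializing the columns as int64 first and then converting them.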