The pandas I/O API is a set of top level reader
functions accessed like pandas.read_csv()
that generally return a pandas object. The corresponding writer
functions are object methods that are accessed like DataFrame.to_csv()
. Below is a table containing available readers
and writers
.
Format Type | Data Description | Reader | Writer |
---|---|---|---|
text | CSV | read_csv | to_csv |
text | JSON | read_json | to_json |
text | HTML | read_html | to_html |
text | Local clipboard | read_clipboard | to_clipboard |
binary | MS Excel | read_excel | to_excel |
binary | HDF5 Format | read_hdf | to_hdf |
binary | Feather Format | read_feather | to_feather |
binary | Parquet Format | read_parquet | to_parquet |
binary | Msgpack | read_msgpack | to_msgpack |
binary | Stata | read_stata | to_stata |
binary | SAS | read_sas | |
binary | Python Pickle Format | read_pickle | to_pickle |
SQL | SQL | read_sql | to_sql |
SQL | Google Big Query | read_gbq | to_gbq |
Here is an informal performance comparison for some of these IO methods.
Note
For examples that use the StringIO
class, make sure you import it according to your Python version, i.e. from StringIO import StringIO
for Python 2 and from io import StringIO
for Python 3.
The two workhorse functions for reading text files (a.k.a. flat files) are read_csv()
and read_table()
. They both use the same parsing code to intelligently convert tabular data into a DataFrame
object. See the cookbook for some advanced strategies.
The functions read_csv()
and read_table()
accept the following common arguments:
filepath_or_buffer : various
str
, pathlib.Path
, or py._path.local.LocalPath
), URL (including http, ftp, and S3 locations), or any object with a read()
method (such as an open file or StringIO
).sep : str, defaults to ',' for read_csv(), \t for read_table()
None
, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer
. In addition, separators longer than 1 character and different from '\s+'
will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\\r\\t'
.delimiter : str, default None
delim_whitespace : boolean, default False
Specifies whether or not whitespace (e.g. ' '
or '\t'
) will be used as the delimiter. Equivalent to setting sep='\s+'
. If this option is set to True
, nothing should be passed in for the delimiter
parameter.
New in version 0.18.1: support for the Python parser.
header : int or list of ints, default 'infer'
Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed the behavior is identical to header=0
and column names are inferred from the first line of the file, if column names are passed explicitly then the behavior is identical to header=None
. Explicitly pass header=0
to be able to replace existing names.
The header can be a list of ints that specify row locations for a multi-index on the columns e.g. [0,1,3]
. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True
, so header=0 denotes the first line of data rather than the first line of the file.
names : array-like, default None
header=None
. Duplicates in this list will cause a UserWarning
to be issued.index_col : int or sequence or False, default None
DataFrame
. If a sequence is given, a MultiIndex is used. If you have a malformed file with delimiters at the end of each line, you might consider index_col=False
to force pandas to not use the first column as the index (row names).usecols : list-like or callable, default None
Return a subset of the columns. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names
or inferred from the document header row(s). For example, a valid list-like usecols
parameter would be [0, 1, 2]
or ['foo', 'bar', 'baz']
.
Element order is ignored, so usecols=[0, 1]
is the same as [1, 0]
. To instantiate a DataFrame from data
with element order preserved use pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']]
for columns in ['foo', 'bar']
order or pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']]
for ['bar', 'foo']
order.
If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True:
In [1]: data = 'col1,col2,col3\na,b,1\na,b,2\nc,d,3' In [2]: pd.read_csv(StringIO(data)) Out[2]: col1 col2 col3 0 a b 1 1 a b 2 2 c d 3 In [3]: pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ['COL1', 'COL3'])
© 2008–2012, AQR Capital Management, LLC, Lambda Foundry, Inc. and PyData Development Team
Licensed under the 3-clause BSD License.
http://pandas.pydata.org/pandas-docs/version/0.23.4/io.html