Read .CSV and .TSV file using Python Pandas


CSV files and TSV files are very popular because they are easy to read and edit. They are also easy to connect to databases. You can use a CSV or TSV file with Python Pandas to read and manipulate data.

In this blog post, we will cover the following topics

What is a CSV file?

A CSV file is a comma-separated values file. It stores data in a tabular format where each row consists of one or more fields and each column represents a specific field. A CSV file can be opened in any text editor, such as Notepad or TextEdit, but they are best viewed using a spreadsheet application, such as Microsoft Excel or OpenOffice Calc.

What is a TSV file?

A TSV file is a tab-separated values file that is often used by spreadsheet programs to share data between databases. It keeps a data table in which each record is on a separate line and the data columns are separated by tabs.

You can use read_csv() and read_table() function of pandas to read CSV and TSV file. What are Difference between read_csv() and read_table () function?

read_csv() and read_table() function

There isn’t much difference between read_csv() and read_table() function. In fact, the same function is called by the source.

  • Delimiter of read_csv() function is a comma (,) character. Example: col1, col2, col3
  • Delimiter of read_table() function is tab (\t ). Example: col1 \t col2 \t col3

If you need to read a csv file (comma delimited), use read_csv() function.

If you need to read a tsv file (tab delimited), use read_table() function.

If your csv or tsv file is not a comma or a tab, you can set the delimiter with the argument ( sep =”any char” ).

Read CSV file with the header

Example: Reading a CSV File in Python Pandas

read_csv(filepath/filename)

This method returns what is known as a pandas DataFrame object, which essentially is just a two-dimensional table that can be manipulated just like any other dataset in python pandas.

Let’s take a look at an example of how this works. We will use the following CSV file:

H1, H2, H3, H4 A1, B1, C1, D1 A2, B2, C2, D2 A3, B3, C3, D3
import pandas as pd df = pd.read_csv('read_csv_example.csv') print(df)
Code language: Python (python)

The first line imports the pandas library into our environment so that we can use its functions. The second line uses the read_csv method to store the contents of the sample.csv file into the df object (which is short for dataframe). If we print out the df object, we see that it contains the following information:

H1 H2 H3 H4 0 A1 B1 C1 D1 1 A2 B2 C2 D2 2 A3 B3 C3 D3

header=0 indicates that the first row should be used as the column headers, which is what we want in this case. (the same without header argument)

import pandas as pd df1 = pd.read_csv('read_csv_example.csv', header=0) print(df1)
Code language: JavaScript (javascript)

Without this argument, our dataframe would look like this: (the same without header argument)

H1 H2 H3 H4 0 A1 B1 C1 D1 1 A2 B2 C2 D2 2 A3 B3 C3 D3

If setting header =2, data is read from the line specified (2), and lines above it are ignored.

import pandas as pd df2 = pd.read_csv('read_csv_example.csv', header=2) print(df2)
Code language: JavaScript (javascript)

Output

A2 B2 C2 D2 0 A3 B3 C3 D3

Read CSV file without header

A1, B1, C1, D1 A2, B2, C2, D2 A3, B3, C3, D3

You should specify the path as an absolute path or a relative path from the current directory. Place your file in the current directory.

If no arguments are given, the first row becomes the column header.

import pandas as pd0 df = pd0.read_csv('read_csv_no_col.csv') print(df) # A1 B1 C1 D1 #0 A2 B2 C2 D2 #1 A3 B3 C3 D3 print(df.columns) # InIndex(['A1', ' B1', ' C1', ' D1'], dtype='object')
Code language: Python (python)

If header=None argument then a sequential number will assign to the column name.

import pandas as pd1 df_header_none = pd1.read_csv('read_csv_no_col.csv', header=None) print(df_header_none) # 0 1 2 3 #0 A1 B1 C1 D1 #1 A2 B2 C2 D2 #2 A3 B3 C3 D3
Code language: Python (python)

Any value can be set as a column name like Specify as a list or tuple.

You can set a list or tuple to names argument as a column name (names=(‘Col1’, ‘Col2’, ‘Col3’, ‘Col4’))

import pandas as pd2 df_names = pd2.read_csv('read_csv_no_col.csv', names=('Col1', 'Col2', 'Col3', 'Col4')) print(df_names) # Col1 Col2 Col3 Col4 #0 A1 B1 C1 D1 #1 A2 B2 C2 D2 #2 A3 B3 C3 D3
Code language: Python (python)

Select columns to read (usecols Argument)

Use the usecols argument if you only want to read certain columns.

Provide a list of the column numbers that need to be read. Always make use of a list, even if there is just one column available.

# Read columns 1 and 3 df_usecols = pd.read_csv('read_csv_example.csv', header=None, usecols=[1, 3]) print(df_usecols) # Output # 1 3 #0 H2 H4 #1 B1 D1 #2 B2 D2 #3 B3 D3 # Read only column 2 df_usecols = pd.read_csv('read_csv_example.csv', header=None, usecols=[2]) print(df_usecols) # Output # 2 #0 H3 #1 C1 #2 C2 #3 C3
Code language: Python (python)

It is easy to use an anonymous function (also called a “lambda expression”) when leaving out a specific column and reading it. It is easier to specify a few column numbers to read than a large number of column numbers to read, especially if you want to leave out a few columns from a file with a lot of columns and still read it.

df_header_usecols = pd.read_csv('read_csv_example.csv', usecols=lambda x: x not in ['H1']) print(df_header_usecols) # H2 H3 H4 #0 B1 C1 D1 #1 B2 C2 D2 #2 B3 C3 D3
Code language: Python (python)

If you use it together with index_col, you must index_colalso specify the columns specified by usecols

df_index_usecols = pd.read_csv('read_csv_example.csv', index_col=0, usecols=[0, 1, 3]) print(df_index_usecols)
Code language: Python (python)

skiprows argument: skipping lines and reading it

If you pass an integer to skiprows, it will skip that number of lines from the beginning of the file.

Data:

H1, H2, H3, H4 A1, B1, C1, D1 A2, B2, C2, D2 A3, B3, C3, D3
df_none = pd.read_csv('read_csv_example.csv', header=None, skiprows=2) print(df_none) # Output # 0 1 2 3 #0 A2 B2 C2 D2 #1 A3 B3 C3 D3 #You can also specify a list of line numbers to skip, here rows 0 and 2 df_none_skiprows = pd.read_csv('read_csv_example.csv', header=None, skiprows=[0, 2]) print(df_none_skiprows) #Output # 0 1 2 3 #0 A1 B1 C1 D1 #1 A3 B3 C3 D3 # A lambda expression, which is an anonymous function, can be used to read only a certain line. It is easier than giving the line number to skip when you want to read only certain lines from a file with many lines. df_none_skiprows = pd.read_csv('read_csv_example.csv', header=None, skiprows=lambda x: x not in [0, 2]) print(df_none_skiprows) #output # 0 1 2 3 #0 H1 H2 H3 H4 #1 A2 B2 C2 D2
Code language: Python (python)

skipfooter argument : skip from the end of the file

df_none_skipfooter = pd.read_csv('read_csv_example.csv', header=None, skipfooter=1, engine='python') print(df_none_skipfooter) # 0 1 2 3 #0 H1 H2 H3 H4 #1 A1 B1 C1 D1 #2 A2 B2 C2 D2
Code language: Python (python)

nrows argument:

Read only the first few lines (Useful for checking a small amount of data in a large file.)

df_none_nrows = pd.read_csv('read_csv_example.csv', header=None, nrows=2) print(df_none_nrows) # 0 1 2 3 #0 H1 H2 H3 H4 #1 A1 B1 C1 D1
Code language: PHP (php)

Specifying the data type for columns and reading it

If any data type is specified as an argument, all index_col columns including the column specified in are converted to that type and read. For example dtype=str, all columns are cast to strings.

df_str = pd.read_csv('read_csv_example.csv', index_col=0, dtype=str)
Code language: Python (python)

To convert the column type after reading, You use astype() function.

# convert column a to int type df_str_cast = df_str.astype({'H1': int})
Code language: Python (python)

You can also specify the column type in dictionary form as an argument when reading with. Columns other than the specified column will be automatically selected types.

df_str_col = pd.read_csv('read_csv_example.csv', index_col=0, dtype={'H1': str, 'H2': str}) print(df_str_col) print(df_str_col.dtypes)
Code language: Python (python)

You can specify not only the column name but also the column number. Note that when index columns are specified, it is necessary to specify the column numbers including the index columns.

df_str_col_num = pd.read_csv('read_csv_example.csv', index_col=0, dtype={2: str, 3: str}) print(df_str_col_num) print(df_str_col_num.dtypes)
Code language: Python (python)

Handling missing values ​​NaN

Possible values ​​such as empty strings ”, character strings ‘NaN‘, and ‘nan‘, are generally treated as missing values ​​by default.

Specify a value (-) to treat as a missing value, using an argument: na_values

df_nan_set_na = pd.read_csv('read_csv_example.csv', index_col=0, na_values='-') print(df_nan_set_na) print(df_nan_set_na.isnull())
Code language: Python (python)

Specify values ​​to treat as missing values: Arguments: na_values, keep_default_na

Specify a value for the argument: keep_default_na = False, only the value specified is treated as a missing value.

df_nan_set_na_no_keep = pd.read_csv('read_csv_example.csv', index_col=0, na_values=['-', 'NaN', 'null'], keep_default_na=False) print(df_nan_set_na_no_keep) print(df_nan_set_na_no_keep.isnull())
Code language: PHP (php)

Do not treat any values ​​as missing: Argument na_filter

na_filter = False : not treated as missing values.

df_nan_no_filter = pd.read_csv('read_csv_example.csv', index_col=0, na_filter=False) print(df_nan_no_filter) print(df_nan_no_filter.isnull())
Code language: Python (python)

Specifying Encoding

By default utf-8 (if omitted utf-8). encoding argument specifies the encoding.

df_utf-8 = pd.read_csv('header_utf-8.csv', encoding='utf-8')
Code language: JavaScript (javascript)

You can also read a CSV file that has been compressed with zip or the like.

It corresponds only when the CSV file alone is compressed. An error occurs if multiple files are compressed.

df_zip = pd.read_csv('read_csv_example.csv.zip') print(df_zip)
Code language: Python (python)

Reading TSV file

df_tsv = pd.read_table('read_csv_example.tsv', index_col=0) df_tsv_sep = pd.read_csv('read_csv_example.tsv', index_col=0, sep='\t')
Code language: JavaScript (javascript)

Andy Avery

I really enjoy helping people with their tech problems to make life easier, ​and that’s what I’ve been doing professionally for the past decade.

Recent Posts