How to Convert CSV to Avro in Python?


Apache Avro is a language-neutral data serialization system, comparable to tools like Google’s Protocol Buffers (Protobuf) or Apache Thrift. CSV is a very popular format for storing tabular data, but it has some drawbacks. First, CSV files have no built-in schema: every value is plain text, so column names and types have to be known in advance or inferred. Second, CSV has no single enforced standard for delimiters, quoting, or character encoding, so a parser often has to be configured for each file. Finally, CSV is a verbose text format with no built-in compression, so large datasets produce large files that are slow to process.

Avro addresses these problems with a self-describing file format. The schema is stored in the header of the Avro data file itself, so programs can read the file without prior knowledge of its contents and without a separate companion file. A schema file (conventionally with the .avsc extension) is only needed when writing. In addition, Avro encodes records in a compact binary form and supports compression codecs such as deflate, which keeps files small and fast to process.

In this blog post, we will show you how to convert a CSV file to Avro in Python. We start from a CSV file with some sample data. Next, we use the avro library to write the rows of that file to an Avro file. Finally, we use the same library to read the Avro file back and print its contents.
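If you do not have sample data at hand, a small input file can be generated with the standard library. The sketch below is only an illustration: the rows are invented, and the pipe delimiter and Latin-1 encoding are chosen to match what the conversion script in this post expects.

```python
import csv

# Hypothetical sample rows in the order (id1, id2, artist, song).
# The values are invented for illustration only.
SAMPLE_ROWS = [
    ["TR0001", "SO0001", "Example Artist", "First Song"],
    ["TR0002", "SO0002", "Another Artist", "Second Song"],
    ["TR0003", "SO0003", "Third Artist", "Song | With Pipe"],
]


def write_sample_csv(path, rows=SAMPLE_ROWS):
    """Write rows to a pipe-delimited, Latin-1 encoded text file."""
    with open(path, "w", newline="", encoding="latin_1") as f:
        # csv.writer quotes any field that contains the delimiter.
        csv.writer(f, delimiter="|").writerows(rows)


def read_sample_csv(path):
    """Read the file back as a list of rows (lists of strings)."""
    with open(path, newline="", encoding="latin_1") as f:
        return list(csv.reader(f, delimiter="|"))
```

Calling write_sample_csv("data/subset_unique_tracks.txt") would produce a file with the name the script reads, assuming the data/ directory already exists.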

Converting CSV to Avro in Python

To convert a CSV file to Avro in Python, we first need to install the avro library. Assuming that you have already installed Python on your system, you can do this by running the following command:

    pip install avro

The script in this post imports only from the avro package, so no other library is required. Once it has been installed, we can start converting our CSV file to Avro.
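The conversion script loads its schema from data/song.avsc, a file that this post does not show. As a rough sketch of what such a schema could look like, the snippet below writes one with the standard json module. Only the four field names (id1, id2, artist, song) come from the script; the record name, the namespace, and the choice of string for every field are assumptions.

```python
import json

# Assumed schema: the record name, namespace, and "string" types are guesses.
# Only the four field names are taken from the conversion script.
SONG_SCHEMA = {
    "type": "record",
    "name": "Song",
    "namespace": "example.songs",
    "fields": [
        {"name": "id1", "type": "string"},
        {"name": "id2", "type": "string"},
        {"name": "artist", "type": "string"},
        {"name": "song", "type": "string"},
    ],
}


def write_schema(path, schema=SONG_SCHEMA):
    """Serialize the schema dictionary to a JSON .avsc file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(schema, f, indent=2)
```

An Avro schema is plain JSON, so any text editor works just as well. Writing it from Python is merely convenient when the field list is generated from the CSV header.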

    import csv

    import avro.schema
    from avro.datafile import DataFileReader, DataFileWriter
    from avro.io import DatumReader, DatumWriter

    # Load the Avro schema that describes a single song record.
    with open("data/song.avsc") as schema_file:
        schema = avro.schema.parse(schema_file.read())

    # The input is pipe-delimited and Latin-1 encoded. In Python 3 the csv
    # module handles Unicode text directly, so no encoding helpers are needed.
    with open("data/subset_unique_tracks.txt", newline="", encoding="latin_1") as csvfile:
        reader = csv.reader(csvfile, delimiter="|")
        # Avro container files are binary, so the output is opened in "wb" mode.
        writer = DataFileWriter(open("data/songs.avro", "wb"), DatumWriter(),
                                schema, codec="deflate")
        for count, row in enumerate(reader):
            print(count)
            try:
                writer.append({"id1": row[0], "id2": row[1],
                               "artist": row[2], "song": row[3]})
            except IndexError:
                # Rows with fewer than four columns are skipped.
                print("Bad record, skip.")
        # Closing the writer flushes the last block and closes the file.
        writer.close()

    # Uncomment to read and print the data from the Avro file
    # reader = DataFileReader(open("data/songs.avro", "rb"), DatumReader())
    # for song in reader:
    #     print(song)
    # reader.close()
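The script skips malformed rows by catching IndexError while it builds each record. One possible alternative, sketched here with hypothetical helper names (row_to_record and split_rows are not part of the original script), is to check the column count up front so that the mapping from row to record can be tested without touching any Avro file.

```python
# Field names in the column order used by the conversion script.
FIELD_NAMES = ("id1", "id2", "artist", "song")


def row_to_record(row, field_names=FIELD_NAMES):
    """Return a dict for a well-formed row, or None if columns are missing.

    Extra trailing columns are ignored, as in the original script.
    """
    if len(row) < len(field_names):
        return None
    return dict(zip(field_names, row))


def split_rows(rows):
    """Convert rows to records and count the rows that had to be skipped."""
    records, skipped = [], 0
    for row in rows:
        record = row_to_record(row)
        if record is None:
            skipped += 1
        else:
            records.append(record)
    return records, skipped
```

Each dictionary returned by row_to_record has the same shape as the one the script passes to writer.append, so the helper could replace the try/except block if you prefer explicit validation.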


To use:

  • Clone the repo (pip and virtualenv must already be installed)
  • virtualenv env
  • source env/bin/activate
  • pip install -r requirements.txt
  • Unzip data/unique_tracks.zip to the data/ directory
  • In the root directory: python avro-convert.py

In this blog post, we showed you how to convert CSV data into the Apache Avro format in Python. We first installed the avro library, then created a script that loads a schema from an .avsc file, reads a CSV file row by row, and writes each row as a record to a deflate-compressed Avro file. The commented-out lines at the end of the script use the library's DataFileReader to read the new Avro file back and print its contents. Thanks for reading!

Andy Avery
