Data Pipelines With Python Generators
In my last post we learned a little bit about generators. In this post we'll dig a little deeper and see how they can be used to build out a simple data pipeline.
In case you forgot, generators are iterators that use yield to produce one result at a time, which keeps memory usage low. That makes them useful for handling large batches of data and performing a distinct operation on each input.
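For example, a toy generator like this yields one value at a time instead of building the whole list up front:

import sys

def count_up_to(n):
    # Execution pauses at each yield and resumes on the next request,
    # so only one value is ever held in memory at a time.
    i = 1
    while i <= n:
        yield i
        i += 1

for number in count_up_to(3):
    print(number)  # prints 1, then 2, then 3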
Suppose we want to build a simple pipeline that takes a set of first and last names, structured as last name, first name, and reshapes them for insertion into a data warehouse:
Xi, Zhang
O'rourke, Lenny
Moore, Trevor
Rivani, Lisa
The first step in our pipeline simply reads in each row and yields the result:
import csv

def read_csv(file_path):
    with open(file_path) as csvfile:
        csv_reader = csv.reader(csvfile, delimiter=',')
        for row in csv_reader:
            yield row
This will read each CSV row and yield it as a list like ['last name', 'first name']. Now let's make each of those lists all lower case:
def process_name(rows):
    for row in rows:
        # Lower-case each field and strip the stray space left after the comma.
        yield [name.strip().lower() for name in row]
Now we can run the pipeline and see the results:
pipeline = process_name(read_csv('names.csv'))
for result in pipeline:
    print(result)

# We could also do our warehouse insert steps at this point in the pipeline.
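With the sample file above, each result prints as a lower-cased list such as ['xi', 'zhang']. If we wanted to sketch that warehouse step, it could simply be another generator stage tacked onto the end of the pipeline. Here's a rough sketch, using SQLite and a made-up names table as a stand-in for a real warehouse client:

import sqlite3

def insert_names(rows, connection):
    # Insert each processed row, then yield it so later stages can keep consuming.
    cursor = connection.cursor()
    for last_name, first_name in rows:
        cursor.execute(
            "INSERT INTO names (last_name, first_name) VALUES (?, ?)",
            (last_name, first_name),
        )
        yield [last_name, first_name]

conn = sqlite3.connect('warehouse.db')  # stand-in for your warehouse connection
conn.execute("CREATE TABLE IF NOT EXISTS names (last_name TEXT, first_name TEXT)")

pipeline = insert_names(process_name(read_csv('names.csv')), conn)
for result in pipeline:
    print(result)
conn.commit()

Because each stage is a generator, the rows still flow through one at a time: nothing is read from the file or written to the database until the final loop pulls on the pipeline.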