Don’t Slurp: How to Read Files in Python



A few weeks ago, a well-intentioned Python programmer asked a straightforward question in a LinkedIn group for professional Python programmers:
What’s the best way to read a file in Python?

Invariably a few programmers jumped in and told our well-intentioned programmer to just read the whole thing into memory:
f = open('/path/to/file', 'r+')

contents = f.read()

Just to mix things up, someone followed up to demonstrate the exact same technique using ‘with’ (a great improvement, as it ensures the file is properly closed in all cases):
with open('/path/to/file', 'r+') as f:
    contents = f.read()
    # do more stuff


Either implementation boils down to a technique we call “slurping”: reading the entire file into memory in one go. It’s by far the most common way you’ll encounter files being read in the wild, and it also happens to be the wrong way to read a file nearly all of the time, for two reasons:

  1. It’s memory inefficient: the entire file has to be held in RAM at once, no matter how large it is.

  2. It’s slower, because it defers all processing until the whole file has been read into memory rather than doing the work as each line is read; the sketch below illustrates the difference.
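To make the difference concrete, here is a minimal sketch that counts over-long lines both ways (the function names, the path argument, and the 80-character threshold are just placeholders). The slurping version has to hold the entire file in memory before it can count anything; the streaming version holds one line at a time and counts as it reads:

def count_long_lines_slurp(path):
    # slurp: the entire file is read into memory before any work begins
    with open(path) as f:
        contents = f.read()
    return sum(1 for line in contents.splitlines() if len(line) > 80)

def count_long_lines_stream(path):
    # stream: only one line is held in memory at a time, and the counting
    # happens as each line is read
    with open(path) as f:
        return sum(1 for line in f if len(line) > 80)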



A Better Way: Filter


A UNIX filter is a program that reads from stdin and writes to stdout. Filters are usually written so that they can read either from stdin or from one or more files passed on the command line. There are many examples of filters: grep, sed, awk, cut, cat, wc and sh, just to name a few of the most commonly used ones.

One thing nearly all filters have in common is that they are stream processors, meaning that they work on chunks of data as the data flows through the program. Because stdin is line-buffered by default, the natural chunk of data to work on ends up being the line, and so nearly all stream processors operate on their input one line at a time. Python has some syntactic sugar that makes line-by-line stream processing even more straightforward than it would otherwise be:

# a simple filter that prepends line numbers

import sys
lineno = 0
# this reads in one line at a time from stdin
for line in sys.stdin:
    lineno += 1
    print('{:>6} {}'.format(lineno, line[:-1]))

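For comparison, a slurping version of the same filter might look something like this (a sketch, not necessarily the exact lineno-slurp script timed below):

# a slurping version of the line-number filter (a sketch; it assumes all
# of stdin fits comfortably in memory)

import sys

contents = sys.stdin.read()   # read everything before doing any work
lineno = 0
for line in contents.splitlines():
    lineno += 1
    print('{:>6} {}'.format(lineno, line))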
Our stream processor is now more memory efficient than the slurp approach for any file with more than one line, and it emits incremental data to stdout (where another program can start consuming it immediately) rather than waiting until the whole file has been read before any processing starts. To see just how much faster this is, let’s look at the speed of each program over 10 million lines:
$ # slurp version
$ jot 10000000 | time python lineno-slurp > /dev/null
16.42 real 10.63 user 0.46 sys
$ # stream version
$ jot 10000000 | time python lineno-stream > /dev/null
11.52 real 11.48 user 0.02 sys

And of course it’s also more memory efficient. So the moral of the story is that Python makes it simple and elegant to write stream processors over line-oriented data streams. We can just as easily apply the pattern above to an arbitrary number of files:
# a simple filter that prepends line numbers

# import sys EDIT: unused, pointed out in comments here and on HN
for fname in ('file.txt', 'file2.txt'):
    with open(fname, 'r+') as f:
        lineno = 0
        # this reads in one line at a time from the open file
        for line in f:
            lineno += 1
            print('{:>6} {}'.format(lineno, line[:-1]))
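Finally, if you want the full filter behavior described above (read from any files named on the command line, or from stdin when none are given), the standard library’s fileinput module takes care of that plumbing. Here is a rough sketch of the same line-numbering filter written on top of it:

# a sketch of the same filter using the standard library's fileinput
# module, which reads from the files named in sys.argv[1:], or from
# stdin when no file names are given

import fileinput

for line in fileinput.input():
    # filelineno() is the line number within the current file
    print('{:>6} {}'.format(fileinput.filelineno(), line[:-1]))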