Python Data Script Boilerplate Version 2
A while back I wrote up a Python data script boilerplate that crystallized some of the things I found myself doing over and over. And while that boilerplate has served surprisingly well, I’ve found myself regularly making a few changes, so I figure it’s probably time for an update to version two.
I’ll show you the finished product first, and then walk through each chunk, noting what I’ve changed.

The Boilerplate
#!/usr/bin/env python3
"""
A boilerplate script to be customized for data projects.

This script-level docstring will double as the description when the script is
called with the --help or -h option.
"""

# Standard Library imports
import argparse
# import collections
# import csv
# import itertools
import logging

# External library imports
# import pandas as pd
# import numpy as np

# Standard Library from-style imports go here
from pathlib import Path

# External library from-style imports go here
# from matplotlib import pyplot as plt

__version__ = '0.1'

log = logging.getLogger(__name__ if __name__ != '__main__'
                        else Path(__file__).stem)


def manipulate_data(data):
    """This function is where the real work happens (or at least starts).

    Probably you should write some real documentation for it.

    Arguments:
    * data: the data to be manipulated
    """
    log.info("Doing some fun stuff here!")
    return data


def parse_args():
    """Parse command line arguments."""
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('infile', nargs='?', default='-')
    parser.add_argument('-ei', '--infile_encoding', default='utf-8')
    parser.add_argument('-o', '--outfile', default='-')
    parser.add_argument('-eo', '--outfile_encoding', default='utf-8')

    verbosity = parser.add_mutually_exclusive_group()
    verbosity.add_argument('-v', '--verbose', action='store_const',
                           const=logging.DEBUG, default=logging.INFO)
    verbosity.add_argument('-q', '--quiet', dest='verbose',
                           action='store_const', const=logging.WARNING)

    parser.add_argument('--version', action='version',
                        version=f'%(prog)s v{__version__}')

    args = parser.parse_args()
    args.infile = argparse.FileType(encoding=args.infile_encoding)(args.infile)
    args.outfile = argparse.FileType(
        mode='w',
        encoding=args.outfile_encoding,
        # newline='',  # for csvs
    )(args.outfile)
    return args


def read_instream(instream):
    """Convert raw input to a manipulable format.

    Arguments:
    * instream: a file-like object

    Returns: probably a DataFrame
    """
    log.info('Reading Input')
    return instream.read()


def main():
    args = parse_args()
    logging.basicConfig(level=args.verbose)
    data = read_instream(args.infile)
    results = manipulate_data(data)
    print(results, file=args.outfile)


if __name__ == "__main__":
    main()
Walkthrough
The first chunk is pretty self explanatory. It sets the shebang (now explicitly Python 3), gives a docstring that doubles as program help info later, and organizes the imports. This time I’ve included commented imports that I often use, like Pandas and NumPy. I’ve also dropped some imports that became unnecessary since argparse is a bit more sophisticated than it used to be.

#!/usr/bin/env python3
"""
A boilerplate script to be customized for data projects.

This script-level docstring will double as the description when the script is
called with the --help or -h option.
"""

# Standard Library imports
import argparse
# import collections
# import csv
# import itertools
import logging

# External library imports
# import pandas as pd
# import numpy as np

# Standard Library from-style imports go here
from pathlib import Path

# External library from-style imports go here
# from matplotlib import pyplot as plt
The next chunk sets up our module-level info. I’ve added in a version this time around, because you should version things. I’ve also made the log name dependent on whether the script is loaded as a module or not, because the full name may be more helpful if the script ends up as a submodule somewhere, which occasionally happens.

__version__ = '0.1'

log = logging.getLogger(__name__ if __name__ != '__main__'
                        else Path(__file__).stem)
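To make that concrete, here’s a minimal sketch (with hypothetical file and package names) of what the logger ends up being called in each case:

import logging
from pathlib import Path

# Same expression as in the boilerplate, shown with hypothetical names.
log = logging.getLogger(__name__ if __name__ != '__main__'
                        else Path(__file__).stem)

# python munge_data.py             -> __name__ == '__main__',       log.name == 'munge_data'
# from etl import munge_data       -> __name__ == 'etl.munge_data', log.name == 'etl.munge_data'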
The data manipulation function is more-or-less unchanged, since this is where the actual work occurs. In general, you’ll want to rename this function to what it actually does.

def manipulate_data(data):
    """This function is where the real work happens (or at least starts).

    Probably you should write some real documentation for it.

    Arguments:
    * data: the data to be manipulated
    """
    log.info("Doing some fun stuff here!")
    return data
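For example, a renamed version might look like this (a purely hypothetical stand-in that slots into the boilerplate, assuming the data arrives as a pandas DataFrame with region and amount columns):

# Hypothetical replacement for manipulate_data: name it after the real work.
def summarize_sales(data):
    """Aggregate raw sales records into per-region totals."""
    log.info("Summarizing sales by region")
    return data.groupby('region')['amount'].sum()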
The parse_args function is in many ways the star of the show here, and I’m going to break it into different chunks. In the first chunk, we create the parser and add an infile and an outfile argument. We create an optional encoding argument for each of those as well. I’ve changed infile to be a positional argument because that makes it easier to use with make-style workflow tools. We’re taking the infile and outfile arguments as strings, with default values of -; as we’ll see below, this is the least ugly way to make use of argparse’s neat FileType object while still letting the user set the encoding at runtime.

def parse_args():
    """Parse command line arguments."""
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('infile', nargs='?', default='-')
    parser.add_argument('-ei', '--infile_encoding', default='utf-8')
    parser.add_argument('-o', '--outfile', default='-')
    parser.add_argument('-eo', '--outfile_encoding', default='utf-8')

That encoding point is another difference between the old and new version. Previously, encodings were set as script-level constants, which really works against the reusability idea.
In the next chunk we just throw in some handy helpers. First we add mutually exclusive verbose and quiet flags to set the logging level. Then we add in a version flag, because gosh darn are we professional.

    verbosity = parser.add_mutually_exclusive_group()
    verbosity.add_argument('-v', '--verbose', action='store_const',
                           const=logging.DEBUG, default=logging.INFO)
    verbosity.add_argument('-q', '--quiet', dest='verbose',
                           action='store_const', const=logging.WARNING)

    parser.add_argument('--version', action='version',
                        version=f'%(prog)s v{__version__}')
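If it helps to see how those flags resolve, here’s a quick standalone sketch (the same verbosity group, pulled out into its own parser):

import argparse
import logging

parser = argparse.ArgumentParser()
verbosity = parser.add_mutually_exclusive_group()
verbosity.add_argument('-v', '--verbose', action='store_const',
                       const=logging.DEBUG, default=logging.INFO)
verbosity.add_argument('-q', '--quiet', dest='verbose',
                       action='store_const', const=logging.WARNING)

print(parser.parse_args([]).verbose)      # 20, i.e. logging.INFO
print(parser.parse_args(['-v']).verbose)  # 10, i.e. logging.DEBUG
print(parser.parse_args(['-q']).verbose)  # 30, i.e. logging.WARNING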
Now we parse our arguments and convert our input and output files to FileType objects. The great thing about FileTypes is that you can set properties like mode and encoding, and the constructor is smart enough to wrap standard input and output if the provided filename is -. No more messing around with sys.stdin and io objects! It looks a bit odd, because FileType actually creates a new type, which is then instantiated with the path to the file.

I’ll admit that while I included standard input and output in my original boilerplate three years ago, it’s only in the last year or so that I’ve found myself using it a lot. It plays very well with cloud infrastructure, and makes modularity all that much easier. Working with text flows also allows you to use command-line tools like grep and sed, which are often undervalued, especially when working with large files.

    args = parser.parse_args()
    args.infile = argparse.FileType(encoding=args.infile_encoding)(args.infile)
    args.outfile = argparse.FileType(
        mode='w',
        encoding=args.outfile_encoding,
        # newline='',  # for csvs
    )(args.outfile)
    return args
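In isolation, that two-step dance looks like this (with a hypothetical output file name):

import argparse
import sys

# Step 1: FileType(...) builds a callable "type" with the mode and encoding baked in.
utf8_reader = argparse.FileType(encoding='utf-8')
utf8_writer = argparse.FileType(mode='w', encoding='utf-8')

# Step 2: calling that type with a path opens the file...
out = utf8_writer('results.txt')   # hypothetical path; like open('results.txt', 'w', encoding='utf-8')

# ...while calling it with '-' hands back the standard streams instead.
assert utf8_reader('-') is sys.stdin
assert utf8_writer('-') is sys.stdout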
The read_instream function isn’t always one that lives through to the production script. In some cases, the read_instream function is entirely replaced by a pd.read_csv or something like that. If it’s simple enough, I keep it in the main function. But when you do have a complicated few steps to get the data into the right shape, it’s best to segregate them into their own function. The temptation is to put the code getting your data ready for manipulation or analysis in the manipulation or analysis function, but that’s bad design, because it means you spend a lot of time in a function not doing the thing that is the point of that function. If only for mental clarity, keep it separate. Tidy your data here.

def read_instream(instream):
    """Convert raw input to a manipulable format.

    Arguments:
    * instream: a file-like object

    Returns: probably a DataFrame
    """
    log.info('Reading Input')
    return instream.read()
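If the project settles on pandas, for instance, this function often collapses into something like the following sketch (assuming CSV input, and that the commented pandas import in the boilerplate has been uncommented):

import pandas as pd  # i.e. uncomment the external import above

def read_instream(instream):
    """Read CSV data from a file-like object into a DataFrame."""
    log.info('Reading Input')
    return pd.read_csv(instream)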
Finally, we have the standard main function: input, manipulate, output. On a suggestion from Arya McCarthy, I’ve switched to using the print function to print the final results, since print will implicitly handle conversion to a text format, while you have to do that yourself when using outfile.write. Of course, that line will often be replaced with to_csv or something like that.

def main():
    args = parse_args()
    logging.basicConfig(level=args.verbose)
    data = read_instream(args.infile)
    results = manipulate_data(data)
    print(results, file=args.outfile)


if __name__ == "__main__":
    main()
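With pandas in the mix, that last line typically turns into something like this (a sketch, assuming results is a DataFrame or Series; to_csv is happy to write to the already-open file object):

def main():
    args = parse_args()
    logging.basicConfig(level=args.verbose)
    data = read_instream(args.infile)
    results = manipulate_data(data)
    # Hypothetical pandas variant of the output step.
    results.to_csv(args.outfile)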
Why Scripts instead of Notebooks?

I’m not going to re-hash the Unix Philosophy or the overkill question, since I covered those last time. But the question that’s even more pressing now than it was three years ago is: why the heck are we writing scripts instead of doing everything as a Jupyter Notebook?

I guess I’m a bit of a notebook skeptic, even though I use notebooks every day. I recognize that people use them all the time to do large-scale, impressive things in production. Look at Netflix. They’re great for experimentation, they’re great for graphics.

But I just don’t trust them. I don’t trust a cell that can be run out of order, or multiple times, or a variable that you defined and then deleted or redefined, so that it doesn’t match anything on the screen. I don’t like that notebooks don’t work cleanly and directly with version control, and I don’t like that they don’t work cleanly and directly with text streams on the command line. You can’t import them, and 99 percent of them are named Untitled.

Maybe that means I’m just not disciplined enough, and maybe it means I’m a grumpy old man. I can live with that. But scripts have never let me down.

Tell me what you think!

So that’s the new boilerplate. If you use it, or have questions or edits, I’d love to hear from you on Twitter or just email me.