The Big Problem With Stata

29 October 2014

I use Python for almost all my data work, but both in my workplace and my field more generally Stata dominates. People use Stata for a reason¹, and it provides a far wider range of advanced statistical tools than you can find with Python (at least so far), but I hate working in it.

I’ve always found it hard to explain to others just why I hate it so much. You can generally get your problem solved, the help files aren’t terrible, there’s lots of Google-able help online², you can write functions if you want to learn how. And while I find lots of little things annoying (the way you get variable values, for example, or the terrible do-file editor), the big problem was the one other people didn’t understand.

Today, however, I was re-reading some pages about the Unix Philosophy, when I saw something that hit the nail on the head. It’s Rob Pike’s Rule 5:

Rule 5. Data dominates. If you’ve chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming

Stata only has one data structure: the dataset. A dataset is a list of columns of uniform length. You can only have one dataset open at a time.

This is the right data structure for performing the actual analysis of data—say, a regression—and the wrong data structure for literally everything else. The problem is, 90 percent of doing data work is cleaning, aligning, adjusting, aggregating, disaggregating, and generally mucking around with your source data, because source data always comes from people who hate you. And because the data structure is wrong, you’re forced to use algorithms that look like they come from an H.P. Lovecraft story.

Never having seen anything better, most Stata users seem to be resigned to doing things like creating an entire column to store a single number and writing impenetrable loops for simple tasks. Or they use sensible tools to create their datasets (increasingly Python, but also even something like Excel) and then use Stata just for the analysis.

The latter is my approach when I can’t avoid Stata entirely. But I’m really looking forward to the day when I can avoid the fundamentally flawed design of Stata altogether.

In my graduate program, we started learning econometrics with a different statistical program, called SAS. SAS is…SAS is rough.↩︎
I’m looking at you, R.↩︎