So how did my 430 model do?
On both sides the 30-day version was significantly worse than the 4-day.
That makes sense; there was a lot of polling in the final days, and a
good number of voters wait until the end of a campaign to make up their
minds.
On the Democratic side, the race looked like a coin flip, with the
odds very, very slightly in Clinton’s favor. As it turns out, the race is
being called (by Sanders, at least) a virtual
tie, and in fact several delegates were awarded on coin
flips¹. The model overestimated Martin O’Malley;
it seems like none of his people bothered to show up.
On the Republican side, the 430 model far overestimated Trump’s
chances, and underestimated both Cruz and Rubio. It’s a great example of
garbage in, garbage out. The polls overestimated how many of
Trump’s voters would turn out (and how few of Cruz’s would), so the
model did the same. With Rubio, I think, what happened was a little
different: he surged very late and very fast, and there just weren’t
many polls at the right time to catch it.
So, should we change the model at all? Maybe we should assume all
Trump results are inflated (which I feel is probably true) and deflate
them by some amount. That feels pretty arbitrary, though, and we’d
probably overcompensate anyway. Or we could try to see which pollsters
are the best and weight accordingly. That also seems prone to
over-fitting, though, unless you want to go the full 538. So I don’t
think I’m going to add too much sophistication along those lines.
I do think I may fool around both with the time-weighting
function, which was more or less pulled out of a hat, and with the
window. It may be that, instead of looking at a time-window (so, all
polls in the last 30 days), it would be more helpful to look at a
number-of-polls window, still weighted by time-delay. That way, as we
get closer to elections and more polling happens, we zero in
automatically. I’ll probably put out a new version for New Hampshire and
see what that looks like.
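If you want to experiment with that idea before I do, here’s a rough sketch of what the change might look like. The function and column names are mine, not the notebook’s; it assumes a DataFrame with one row per candidate per poll and a date column, like the one the model below builds.

```python
import pandas as pd

def most_recent_polls(frame, n=10):
    """Keep rows from the n most recent distinct poll dates, instead
    of a fixed 30-day window; the time-decay weighting still applies
    afterward. (Sketch only: a real version might key on poll IDs,
    since several polls can share a date.)"""
    keep_dates = frame["date"].drop_duplicates().sort_values().tail(n)
    return frame[frame["date"].isin(keep_dates)]
```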
And you should, too! The notebook is available here,
so feel free to make your own version and let me know on Twitter if you come up with
anything interesting!
¹ The Clinton campaign won 6 of 6 tosses, the odds of
which are only 1 in 64. But surely if the Clinton camp were sharp enough
to pre-supply unfair coins they would have won outright, right? Right?
It’s caucus day in Iowa, so what better time to rip off Nate Silver?
Silver has a well-respected election
forecasting model based on polls, polling firms’ historical house
effects and accuracy, and a few non-poll factors like endorsements.
Probably took a lot of hard work.
But it’s actually pretty easy to build a simple poll-based model
yourself using Python. I’ve thrown one together, which I’m calling the
430 model. Why 430? Because of the 80/20 rule: you can get 80 percent of
the way there with 20 percent of the work, and 430 is about 80 percent
of 538.
Everything below is available in a Jupyter notebook here.
Let’s start with a little setup:
```python
import collections
import datetime

import numpy as np
import pandas as pd
import requests

API_ENDPOINT = "http://elections.huffingtonpost.com/pollster/api/polls"

np.random.seed(2016)
```
So, first, let’s get our polling data, which we can do using the Pollster
API:
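The actual functions live in the notebook; here’s a rough sketch of what they do, relying on the imports and `API_ENDPOINT` from the setup above. Fair warning: the topic slug and JSON field names are reconstructed from memory of the (long-retired) Pollster v1 API, so treat them as assumptions, and the function names are mine.

```python
def get_polls(state, party, pages=5):
    """Pull raw poll JSON from Pollster, one page at a time."""
    topic = "2016-president-{}-primary".format(party)  # assumed slug format
    polls = []
    for page in range(1, pages + 1):
        resp = requests.get(API_ENDPOINT,
                            params={"state": state, "topic": topic, "page": page})
        batch = resp.json()
        if not batch:  # no more pages
            break
        polls.extend(batch)
    return polls

def polls_to_frame(polls):
    """Flatten poll JSON into one row per candidate per poll."""
    rows = []
    for poll in polls:
        end_date = datetime.datetime.strptime(poll["end_date"], "%Y-%m-%d").date()
        for question in poll["questions"]:
            for subpop in question["subpopulations"]:
                for response in subpop["responses"]:
                    rows.append({"date": end_date,
                                 "population": subpop["name"],
                                 "observations": subpop["observations"],
                                 "choice": response["choice"],
                                 "value": response["value"]})
    return pd.DataFrame(rows)
```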
Those two functions will allow us to specify a state, a party (either
gop or dem, per Pollster), and how far back in time we
want to go. What we get back is a Pandas DataFrame where each row has
the result for one candidate for one poll, plus some metadata about the
poll. All we really need for this model is the date, result, and number
of observations, but I’ve also included the population screen in case
you want to, say, restrict to only likely voters.
Now we want to combine those poll results for a point-in-time
estimate of the mean, plus a standard deviation of the estimate. But not
all polls are equally good; we’ll want to be able to weight them
somehow.
We’re going to be lazy, and just weight based on recency. For each
poll, we’ll set a weight of one over the square of the age of the poll
plus one (the plus one is so that we don’t divide by zero). Then we can
create a super-poll, in which we pool all the folks who said they’d vote
for each candidate in any poll, multiplied by the weight of that poll.
This allows us to calculate both the weighted estimate of the mean and
the standard deviation of the estimate:
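Again, the real code is in the notebook; a minimal version consistent with the description above might look like this. I’m reading “the square of the age plus one” as age² + 1, and treating the pooled, weighted respondent count as the sample size for the standard error; the notebook may differ on both.

```python
def superpoll(frame, target_date, window=30):
    """Pool weighted respondents into a single super-poll and return
    each candidate's weighted mean estimate and its standard error."""
    age = (pd.Timestamp(target_date) - pd.to_datetime(frame["date"])).dt.days
    recent = frame[(age >= 0) & (age <= window)].copy()
    weight = 1.0 / (age[recent.index] ** 2 + 1)  # older polls count less
    recent["votes"] = weight * recent["observations"] * recent["value"] / 100.0
    recent["sample"] = weight * recent["observations"]
    pooled_n = recent["sample"].sum()
    estimate = recent.groupby("choice")["votes"].sum() / pooled_n
    sd = np.sqrt(estimate * (1 - estimate) / pooled_n)  # SE of a proportion
    return pd.DataFrame({"estimate": estimate, "sd": sd})
```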
This function allows us to specify a target date, in case we want a
snapshot from earlier in the campaign, and also a window that gives us a
maximum age of polls. That’s so Scott Walker doesn’t show up in our
results even though he’s already dropped out of the race.
Now we can run simulations! All we have to do is draw from the
normal distribution for each candidate and see who gets the highest
percent of the vote:
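Something like this sketch, which keeps to the setup above and assumes the estimate/sd frame returned by the `superpoll` sketch (the notebook’s actual code may be organized differently):

```python
def simulate(superpoll_frame, trials=10000):
    """Count how often each candidate finishes first across random
    draws from each candidate's normal distribution."""
    wins = collections.Counter()
    candidates = superpoll_frame.index
    means = superpoll_frame["estimate"].values
    sds = superpoll_frame["sd"].values
    for _ in range(trials):
        draws = np.random.normal(means, sds)  # one draw per candidate
        wins[candidates[draws.argmax()]] += 1
    return {name: count / trials for name, count in wins.items()}
```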
So, now for the fun part: who wins? Here are the super-poll results on
the Republican side:
| Candidate | Estimate | Standard Deviation |
|---|---|---|
| Donald Trump | 28.2% | 2.0% |
| Ted Cruz | 23.6% | 1.8% |
| Marco Rubio | 17.4% | 1.6% |
| Ben Carson | 7.6% | 1.2% |
| Rand Paul | 4.7% | 0.9% |
| Jeb Bush | 4.1% | 0.9% |
| Mike Huckabee | 3.3% | 0.8% |
| John Kasich | 2.8% | 0.7% |
| Carly Fiorina | 2.4% | 0.7% |
| Chris Christie | 2.1% | 0.6% |
| Rick Santorum | 1.3% | 0.5% |
| Jim Gilmore | 0.1% | 0.2% |
In the simulation, Donald Trump won 96 percent of the time, while Ted
Cruz won 4 percent.
Here are our super-poll results for the Dems:
| Candidate | Estimate | Standard Deviation |
|---|---|---|
| Hillary Clinton | 47.4% | 2.4% |
| Bernie Sanders | 46.0% | 2.4% |
| Martin O’Malley | 3.6% | 0.9% |
Seems pretty close, but bad news, Bernie fans! In the simulation,
Hillary won 66 percent of the time, while Sanders only won 34
percent.
Now, that’s all with a 30-day window. What if we keep it to just the
most recent polls?
Here’s what we get with a 4-day window on the GOP side:
| Candidate | Estimate | Standard Deviation |
|---|---|---|
| Donald Trump | 27.5% | 2.1% |
| Ted Cruz | 23.1% | 2.0% |
| Marco Rubio | 18.1% | 1.9% |
| Ben Carson | 7.5% | 1.3% |
| Rand Paul | 5.1% | 1.1% |
| Jeb Bush | 4.1% | 0.9% |
| Mike Huckabee | 3.5% | 0.9% |
| John Kasich | 2.8% | 0.8% |
| Carly Fiorina | 2.5% | 0.7% |
| Chris Christie | 2.0% | 0.7% |
| Rick Santorum | 1.3% | 0.5% |
Trump still wins 93.6 percent of simulations, but Cruz is up to 6.4
percent, and Rubio makes it onto the board, although his share rounds
to 0.0 percent.
On the Democratic side, things get really interesting:
| Candidate | Estimate | Standard Deviation |
|---|---|---|
| Hillary Clinton | 47.0% | 2.7% |
| Bernie Sanders | 46.9% | 2.7% |
| Martin O’Malley | 3.2% | 1.0% |
In the simulation, O’Malley stuns! Just kidding, Clinton wins 51
percent of the time, and Sanders wins 49 percent. Boy is that going to
be fun.
Obviously we shouldn’t bet the house on these predictions. My
weighting model may be wrong (read: is wrong) or the polls themselves
may be wrong (read: are completely unreliable in recent elections). But
this shows you how simple it really is to get something like this off
the ground.
If you decide to play around with this model (again, you can download
the Jupyter notebook here),
be sure to let me know on Twitter. It would be a lot
of fun to see what people come up with.
The Powerball lottery is in the news because the jackpot is up to
about $1.5 billion, which sources tell me is a lot of money. The classic
stats argument is that you should buy a ticket only if the expected
value is greater than the cost of the ticket, which is $2.
Expected value is the sum of the value of each possible outcome times
the odds of that outcome. In this case, it’s the sum of the value
times the odds for each of the 9 ways to win.
| Outcome | Odds |
|---|---|
| $1,500,000,000 | 1 in 292,201,338.00 |
| $1,000,000 | 1 in 11,688,053.52 |
| $50,000 | 1 in 913,129.18 |
| $100 | 1 in 36,525.17 |
| $100 | 1 in 14,494.11 |
| $7 | 1 in 579.76 |
| $7 | 1 in 701.33 |
| $4 | 1 in 91.98 |
| $4 | 1 in 38.32 |
Now there’s also an optional add-on called PowerPlay, which
will multiply your winnings by a randomly-selected multiplier (except
the jackpot, which remains the same, and the $1 million prize, which
doubles to $2 million). The odds for the multiplier are:
| Multiplier | Odds |
|---|---|
| 5 | 1 in 21.00 |
| 4 | 1 in 14.00 |
| 3 | 1 in 3.23 |
| 2 | 1 in 1.75 |
If you do the math¹, that gives us an expected value of
about $5.11 for a standard ticket and $5.57 for a PowerPlay ticket.
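If you’d rather let Python do the math, here’s a quick check using the two tables above. One caveat: the $5.11 and $5.57 figures work out if the jackpot is taken as roughly $1.4 billion; plug in the full $1.5 billion and the standard-ticket figure comes out closer to $5.45.

```python
# (prize, odds as 1-in-N) for the nine ways to win, from the table above;
# the jackpot here is ~$1.4B, which reproduces the post's figures
PRIZES = [(1400000000, 292201338.00), (1000000, 11688053.52),
          (50000, 913129.18), (100, 36525.17), (100, 14494.11),
          (7, 579.76), (7, 701.33), (4, 91.98), (4, 38.32)]

# PowerPlay multipliers and their 1-in-N odds
MULTIPLIERS = [(5, 21.00), (4, 14.00), (3, 3.23), (2, 1.75)]

ev = sum(prize / odds for prize, odds in PRIZES)
print(round(ev, 2))  # -> 5.11

# PowerPlay: jackpot unchanged, the $1M prize doubles, and every
# other prize gets scaled by the expected multiplier (~2.6)
expected_mult = sum(m / odds for m, odds in MULTIPLIERS)
jackpot, million = PRIZES[0], PRIZES[1]
ev_powerplay = (jackpot[0] / jackpot[1]
                + 2 * million[0] / million[1]
                + expected_mult * sum(p / o for p, o in PRIZES[2:]))
print(round(ev_powerplay, 2))  # -> 5.57
```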
Well, good grief! That means we should all go buy a bunch of tickets,
right? That’s statistics! We even know not to buy PowerPlay, because the
return per dollar isn’t as good!²
Well, no, not really. The problem here is that expected value
is a misleading term. It does not actually tell you what value to expect
for a single ticket.
The expected value here is really just the mean winnings for all
possible tickets. But means aren’t helpful here because the distribution
is so wildly skewed. Almost all of the expected value (about $4.79)
comes from the jackpot, the single exact winning combination.
A better way to get reasonable expectations is with a simulation. I
wrote a quick one in Python, which you can find the code for in this Jupyter
notebook, and I used it to analyze what we can really expect when
playing Powerball.
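The notebook has the real code; a stripped-down sketch of the idea is below. It uses the drawing mechanics behind the odds table above (5 white balls from 69 plus a Powerball from 26); the jackpot amount is just illustrative, and the names are mine.

```python
import numpy as np

rng = np.random.RandomState(2016)

# winnings keyed by (white-ball matches, matched the Powerball?)
PRIZES = {(5, True): 1400000000, (5, False): 1000000,
          (4, True): 50000, (4, False): 100,
          (3, True): 100, (3, False): 7,
          (2, True): 7, (1, True): 4, (0, True): 4}

def random_pick():
    """Five distinct white balls (1-69) plus a Powerball (1-26)."""
    whites = set(rng.choice(69, size=5, replace=False) + 1)
    return whites, rng.randint(1, 27)

def winnings(ticket, drawing):
    """Look up the prize for a ticket against a drawing."""
    (t_whites, t_pb), (d_whites, d_pb) = ticket, drawing
    return PRIZES.get((len(t_whites & d_whites), t_pb == d_pb), 0)

# 100,000 players each buy one random ticket for one random drawing
results = [winnings(random_pick(), random_pick()) for _ in range(100000)]
```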
First I ran a simulation of 100 thousand people buying a single
ticket, with no PowerPlay. How did they turn out?
Well, 95,907 people won exactly jack. 3,763 doubled their money with
a four-dollar win, while 324 won the princely sum of seven dollars. A
full 6 folks won $100. Woohoo! So if you buy a ticket, what’s reasonable
to expect?
Big fat goose egg, that’s what. If you’re really lucky? Four
bucks.
If you add in the PowerPlay, things aren’t much better:
One guy won $400 in 100,000 trials. 96,078 won nothing. Sure, the
prizes are bigger, but in the vast majority of cases, you’re just
throwing away $3 instead of $2.
But wait a minute. What if I buy more tickets to improve my odds?
That should get me closer to the expected value, right? Well, sort of.
Remember, the expected value is dominated by that one winning combo.
I ran the simulation again assuming someone bought 1, 2, 5, 10, 15,
20, 25, 30, 40, and 50 tickets, with 100,000 trials at each ticket
level. Here are the results without PowerPlay:
The dotted line means you break even. The solid blue line is the
median outcome, while the shaded area shows the 5th to the 95th
percentile. The winnings you can reasonably expect certainly go up as
you buy more tickets, but they go up slowly, much more slowly than the
cost of the tickets. Even though you’re getting more wins, you’re
generally getting low-value results and spending a lot on worthless
tickets to make it happen. Even the 95th percentile earns way below the
break-even level once you get past two tickets. Here’s what it looks
like with the PowerPlay:
Basically the same story here. The upside 95th-percentile result is
a bit better, but the more realistic median result isn’t. And “a bit
better” in this case means you would still lose money, just not as
much.
The best median outcome comes from buying one ticket without the
PowerPlay option: you just lose $2 and no more. The best upside comes
with the PowerPlay and 2 tickets, where the 95th percentile spent $6 to
make $8, the only profit anywhere in the 5th-to-95th-percentile range
in the entire simulation.
So what’s the takeaway? First, expected value is overrated for this
kind of situation, because you can end up paying too much attention to
rare, extreme possibilities. Running a Monte Carlo
simulation like the one above gives you more realistic guidance.
Expected value may be appropriate for large organizations that do
so many analyses that the one-in-a-million shot actually shows up every
now and then, but it’s less suited to one-shot decisions.
But second, and more directly, don’t play Powerball. The lottery is
still a tax on people who are bad at math. And people who tell you that
the statistics say otherwise have not thought all the way through their
statistics.
¹ Actually, I’m being a bit lazy here. Alex Tabarrok has
an excellent post
explaining why the expected value is a good bit lower than the simple
analysis suggests, but that’s beside the point I’m making.