Overconfidence
Conflicting state and national polls for the 2020 Democratic presidential nomination are a common and ongoing occurrence. The polls illustrate the substantial theoretical limitations of survey sampling. Pollsters, poll sponsors, and polling aggregators tend to ignore those limitations in favor of assigning the polls a level of accuracy that is unjustified and unrealistic. The 95% confidence interval reported by pollsters is as misleading as it is misunderstood. Here is an example:
In late August, Monmouth University released the results of a national survey with a subsample of 298 respondents identifying themselves as Democrats or leaning Democratic. It showed Elizabeth Warren and Joe Biden tied in the race for the Democratic presidential nomination at 20% each, with Bernie Sanders at 16%. Compared to a Monmouth poll completed in June, Biden had dropped from 32%, while Warren was up from 15% and Sanders was up from 14%.
At the same time, a nationwide YouGov survey including a subsample of 559 Democrats had Biden at 22%, Sanders at 19%, and Warren at 17%. Four polls released later in the week, however, had Biden up by 13 points in three of them and by 18 points in the fourth.
In response to this, Patrick Murray, the director of the Monmouth University Poll, released a statement that included the following:
As other national polls of the 2020 Democratic presidential race have been released this week, it is clear that the Monmouth University Poll published Monday is an outlier. This is a product of the uncertainty that is inherent in the polling process. We tend to focus on the margin of sampling error, but that margin is driven by something called the confidence interval which states that every so often you will naturally have a poll that falls outside the standard margin of error. It occurs very infrequently, but every pollster who has been in this business a while recognizes that outliers happen. This appears to be one of those instances.
Nate Silver followed up on FiveThirtyEight.com with an article titled "How to Handle an Outlier Poll." He writes, in part:
But Murray doesn’t have any real reason to apologize. Outliers are a part of the business. In theory, 1 in 20 polls should fall outside the margin of error as a result of chance alone. One out of 20 might not sound like a lot, but by the time we get to the stretch run of the Democratic primary campaign in January, we’ll be getting literally dozens of new state and national polls every week. Inevitably, some of them are going to be outliers. Not to mention that the margin of error, which traditionally describes sampling error — what you get from surveying only a subset of voters rather than the whole population — is only one of several major sources of error in polls.
Murray and Silver provide nonsensical technical definitions of outliers. It is impossible for the results of a poll to be outside of its own margin of error. It is easy enough to know what they mean by an outlier: simply a poll that differs from other polls. But, more importantly, they are incorrect about the frequency of outliers in individual election polls.
Sampling theory is based on "frequently and independently" repeating the same survey. Using the sample size and the results from each repeated sampling, margins of error and the resulting confidence intervals are calculated for each sample. As the sampling is repeated using the same sample size and methodology, the theory states that, if an infinite number of samples are drawn, the population values should be contained within the confidence intervals at very close to the desired level of confidence, usually 95% for public polls.
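To make that calculation concrete, here is a minimal sketch of the textbook margin of error for a sample proportion, assuming simple random sampling and the usual normal approximation (actual pollsters apply design adjustments that vary). It uses Warren's 20% and the 298-person subsample from the Monmouth survey above:

```python
import math

def margin_of_error(p, n, z=1.96):
    """95% margin of error for a sample proportion p with sample size n."""
    return z * math.sqrt(p * (1 - p) / n)

# Warren at 20% in a subsample of 298 Democratic primary voters
p, n = 0.20, 298
moe = margin_of_error(p, n)
print(f"MOE: +/-{moe:.1%}")                       # roughly +/-4.5 points
print(f"95% CI: {p - moe:.1%} to {p + moe:.1%}")  # roughly 15.5% to 24.5%
```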
Jerzy Neyman pointed out when introducing the concept of confidence intervals that frequentist probability theory is "helpless" in providing the true population values from any single sample, because the values obtained from a single sample provide no information about the actual population values.
Here is a demonstration of sampling error at work. Note that it is unlikely that any of the 50 samples exactly matches the population values. The average of the 50 samples is more likely than any single sample to match the population. This is why polling aggregation works.
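The demonstration can be reproduced with a short simulation. This is an illustrative sketch, not the original demonstration: it assumes a true population value of 20% and borrows the 298-person sample size from the Monmouth subsample above.

```python
import random

TRUE_P = 0.20   # assumed true population value
N = 298         # sample size, borrowed from the Monmouth subsample above
SAMPLES = 50

estimates = []
for _ in range(SAMPLES):
    # one simple random sample of N respondents
    hits = sum(random.random() < TRUE_P for _ in range(N))
    estimates.append(hits / N)

# How many samples land exactly on the population value (to the nearest point)?
exact = sum(round(e * 100) == TRUE_P * 100 for e in estimates)
average = sum(estimates) / SAMPLES

print(f"{exact} of {SAMPLES} samples round to exactly {TRUE_P:.0%}")
print(f"average of all {SAMPLES} samples: {average:.1%}")  # close to 20%
```

Typically only a handful of the 50 samples land on the true value, while their average comes in within a fraction of a point of it.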
Pollsters generally, and incorrectly, assume not only that their poll results reflect the actual population values (which the sampling error demonstration proves to be untrue), but also that the population values fall within the confidence intervals 95% of the time for a single poll. This is why Murray writes that outliers happen "every so often" and "very infrequently" and Silver puts outliers at "1 in 20 polls" (or 5% of the time).
In fact, actual outliers should occur about 25% of the time for individual election polls, or about 1 in 4 polls.
While it would seem that an event with a 95% probability of occurring in the long run would also be very likely to occur in a single event, this is not the case. Calculating for the worst case shows that an event with a 95% probability of occurring in the long run has a worst-case uncertainty of about 29% for a single event, not 5%. It takes an event with a 99.5% probability of occurring in the long run to have a worst-case uncertainty of 5% for a single event. (Hard to believe, but an event with a 71.4% probability of occurring in the long run has a worst-case uncertainty of about 86% for a single event.)
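The formula behind these figures is not named here, but all three match the binary (Shannon) entropy of a single two-outcome event, measured in bits. A minimal sketch, assuming that is the uncertainty measure intended:

```python
import math

def binary_entropy(p):
    """Shannon entropy (in bits) of a single event with probability p."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.95, 0.995, 0.714):
    print(f"P = {p:.1%} -> single-event uncertainty ~ {binary_entropy(p):.1%}")
# P = 95.0%  -> ~28.6%
# P = 99.5%  -> ~4.5%
# P = 71.4%  -> ~86.3%
```

Under this reading, single-event uncertainty peaks at 50/50 odds (a full bit) and only approaches zero as the long-run probability approaches certainty.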
This can be tested empirically by comparing election polls to election results. Our election polling accuracy ratings for over 5,000 final election surveys from 45 pollsters show, on average, that 75% of election poll results are within the 95% confidence intervals when compared to the actual vote totals (the population values), confirming the theory for individual polls. (This paper has election polls within the theoretical margins of error 73% of the time for senatorial polls, 74% of the time for gubernatorial polls, and 88% of the time for presidential polls.)
Monmouth University election poll results have been within their respective theoretical margins of error about 83% of the time, which is above average. But that still means about 1 in 6 election polls from Monmouth fell outside their respective theoretical margins of error when compared to the actual election outcomes.
Our accuracy ratings include sample sizes, which allow the accuracy of any poll to be compared to the accuracy of any other poll. The smaller the sample size of a poll, the wider the confidence interval; conversely, the larger the sample size, the narrower the confidence interval.
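Using the same simple-random-sampling approximation as the sketch above, the effect of sample size on interval width is easy to see; the sample sizes below are the ones that appear in this article:

```python
import math

def moe(p, n, z=1.96):
    """95% margin of error for a proportion p at sample size n."""
    return z * math.sqrt(p * (1 - p) / n)

# Margin of error at p = 50% (the widest case) for this article's sample sizes
for n in (298, 403, 1255):
    print(f"n = {n:>4}: MOE is +/-{moe(0.50, n):.1%}")
# n =  298: +/-5.7%
# n =  403: +/-4.9%
# n = 1255: +/-2.8%
```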
Monmouth polls tend to have smaller sample sizes and, therefore, wider confidence intervals (note the sample size of just 298 Democratic primary voters in the national survey above).
In the 2016 presidential race in Wisconsin, for example, the final Monmouth poll had Clinton at 47% and Trump at 40% with a sample size of 403. When the poll results are compared to the actual Wisconsin results of 46.45% for Clinton and 47.22% for Trump, the Monmouth poll results were within the poll's theoretical margin of error, with an accuracy score of 0.18. The final Marquette Law School poll in Wisconsin had Clinton at 46% and Trump at 40% with a sample size of 1,255. The results from the Marquette Law School poll, because of the larger sample size, fell outside of that poll's theoretical margin of error, but the poll had an accuracy score of 0.16. The Monmouth poll in Wisconsin was not very accurate, but the results were what could be expected from sampling theory with such a small sample size. The Marquette Law School poll was more accurate than the Monmouth poll, but it should have been even more accurate based on its larger sample size.
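The accuracy scores above are this site's own metric, but the within/outside classification can be checked with a standard margin-of-error calculation on the Clinton minus Trump spread. A sketch assuming simple random sampling (an assumption, since the polls' actual designs differ):

```python
import math

def spread_moe(p1, p2, n, z=1.96):
    """95% margin of error on the difference of two proportions
    drawn from the same sample (Clinton minus Trump)."""
    var = (p1 * (1 - p1) + p2 * (1 - p2) + 2 * p1 * p2) / n
    return z * math.sqrt(var)

ACTUAL_SPREAD = 46.45 - 47.22  # Clinton minus Trump, in points

for name, p1, p2, n in [("Monmouth", 0.47, 0.40, 403),
                        ("Marquette", 0.46, 0.40, 1255)]:
    moe = spread_moe(p1, p2, n) * 100
    err = abs((p1 - p2) * 100 - ACTUAL_SPREAD)
    verdict = "within" if err <= moe else "outside"
    print(f"{name}: spread error {err:.1f} pts, MOE {moe:.1f} pts -> {verdict}")
# Monmouth:  spread error 7.8 pts, MOE 9.1 pts -> within
# Marquette: spread error 6.8 pts, MOE 5.1 pts -> outside
```

The larger Marquette sample produces a much tighter interval, which is why a numerically closer result can still land outside its own margin of error.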
For elections from 1978 through 2018, results from most pollsters perform as sampling theory predicts. Looking at all 45 pollsters and constructing a 95% confidence interval for accuracy by pollster shows that the results from only 4 pollsters fall outside of that 95% confidence interval: Harris Interactive, the Trafalgar Group, SurveyMonkey, and We Ask America. In 2014 election polling, for example, SurveyMonkey polls were outside their respective theoretical margins of error for the actual vote about 48% of the time, and that trend continued in 2016.
Silver writes that sampling error "is only one of several major sources of error in polls." Sampling error can be determined from the survey data. In the polls we have tested, the average non-sampling error (including "house effects") accounts for less than one-third of all error, making sampling error the major source of error in those polls. This is reflected in our polling accuracy scores. The average absolute accuracy for the 75% of election polls that fall within their respective 95% confidence intervals is 0.08, while the average absolute accuracy for the 25% of election polls that fall outside their respective 95% confidence intervals is 0.29. Unfortunately, sampling error can be substantial, and there is no way to control for it.
As for Monmouth's election polling accuracy, the average absolute accuracy for the 83% of Monmouth University election polls that fall within their respective 95% confidence intervals is 0.09, and the average absolute accuracy for the 17% of election polls that fall outside of their respective 95% confidence intervals is 0.34.
When looking at 2020 election polls, remember that (1) the results from about 1 in 4 polls on average will fall outside the respective confidence intervals for the population values based on sampling theory alone, (2) the average of all polls should be a better indication of the actual state of the race than any single poll, and (3) the results from most pollsters should be included in polling averages.
Update: Election polls from the University of New Hampshire have been within their respective theoretical margins of error about 72% of the time, which is below average. The average absolute accuracy for the 72% of University of New Hampshire election polls that fall within their respective 95% confidence intervals is 0.11, and the average absolute accuracy for the 28% of election polls that fall outside of their respective 95% confidence intervals is 0.33.