
Examining potential causes of the 2019 Australian federal polling error

Introduction

(note: this piece can get rather technical; I originally intended for it to be released after a series explaining how polling works. If it’s confusing, I suggest waiting for that series; it’ll build up an understanding of polling and how it does/doesn’t work, so hopefully this piece will be easier to follow)

When dealing with polls, you’ll often hear advice to look at an average of all polls, instead of simply relying on a single poll or pollster. Polling averages often outperform individual polls because they effectively combine the samples of multiple polls, reducing the effects of outlier samples/methods and producing an overall lower error than any single pollster could hope to achieve on their own (without prohibitively large and hence expensive sample sizes).

As a result, polling averages have significantly lower sampling error – i.e. the random error which is unavoidable because pollsters (and by extension poll aggregators) only interview a random sample of the voting population. Very roughly speaking, if all polls are about equally accurate and each poll is weighted equally, then a polling average of n completely uncorrelated polls should reduce sampling error by a factor of about 1/√n compared to a single poll. You can think of sampling error as being similar to what happens when you flip a fair coin, say, 2000 times – while on average you would expect to see 1000 heads, getting anywhere between 956 and 1044 heads in that 2000-flip sample (i.e. a proportion of 47.8 – 52.2% heads when the “true” proportion is 50%) shouldn’t surprise you, due to the random chance inherent in the process. Averaging polls is very similar to increasing the number of coin flips: while the theoretical margin of error on the proportion of heads in 2000 coin flips is ± 2.2%, the margin of error on 4000 coin flips is just ± 1.6%.

[Figure: Increasing the number of polls in an average significantly reduces error when polls are uncorrelated. Theoretical average error of a polling average where the sample proportion = 0.5 and the sample size is 2000 in each poll.]
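
To make the arithmetic above concrete, here’s a minimal sketch in R (not taken from the model linked later in this piece) of the 95% margin of error for a proportion, and how it shrinks as uncorrelated polls are averaged:

```r
# 95% margin of error for an estimated proportion p with sample size n
margin_of_error <- function(p, n) 1.96 * sqrt(p * (1 - p) / n)

# A single poll of 2000 "coin flips" with a true proportion of 50%
margin_of_error(0.5, 2000)   # ~0.022, i.e. +/- 2.2%

# Doubling the effective sample size (e.g. averaging two uncorrelated polls)
margin_of_error(0.5, 4000)   # ~0.016, i.e. +/- 1.6%

# Averaging n equally-sized, uncorrelated polls shrinks the error by ~1/sqrt(n)
sapply(1:10, function(n) margin_of_error(0.5, 2000 * n))
```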

However, for a polling average to perform effectively, a key assumption is that the polls in the average have to be relatively uncorrelated, i.e. whether Essential polls overestimate Labor should give you little information about whether Morgan polls are going to overestimate Labor as well. If this assumption is violated, then polling averages are less useful (though still somewhat better than individual polls), as polling errors do not “cancel out” – instead of having overestimates of Labor’s vote cancelled out by underestimates, the pollsters’ overestimates of Labor’s vote end up being averaged into a polling average which also overestimates Labor’s vote.

Sampling bias

In real life, this assumption (as with many other assumptions) does not always hold. Unlike with coin flips, when dealing with human respondents, pollsters have to both find and get people to respond to their surveys (whereas coins don’t have the option of flipping you off instead), meaning that the methods used to do so can produce samples of respondents unrepresentative of the broader population. I’m not going to go through them all, but some fairly well-documented ones:

  • Live interviews may be more prone to social-desirability bias, where respondents are likely to over-report things viewed as socially desirable (e.g. donating to charity) and under-report things viewed as socially undesirable (e.g. drug use).
  • Phone interviews tend to sample a different subset of the population depending on the method and time of day. For example, calls at 10 am on a weekday are more likely to garner responses from non-working people (e.g. housewives, retirees), while robocalls are likely to reach older people who don’t have some form of call-blocking enabled.
  • Online polling is likely to under-sample sections of the population who are less tech-savvy, as well as areas without reliable Internet access (e.g. rural areas).

To minimise these issues, pollsters are often able to weight their samples by certain characteristics to improve their representativeness of the general population – e.g. weighting by age to reduce a bias towards older respondents. However, weighting is not a panacea for sampling bias – in some cases, the sample is unrepresentative of the population in a way which was not accounted for by weighting (e.g. failure to weight by education in the 2016 US Presidential election), or, worse, in a way which cannot be accounted for by weighting.
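
As a hedged illustration of how weighting works (the age groups, population shares and sample counts below are invented for the example), each respondent is scaled so that the weighted sample matches known population benchmarks:

```r
# Hypothetical age-group benchmarks and an age-skewed sample
population_share <- c("18-34" = 0.30, "35-54" = 0.35, "55+" = 0.35)
sample_count     <- c("18-34" = 300,  "35-54" = 600,  "55+"  = 1100)

sample_share <- sample_count / sum(sample_count)

# Each respondent in a group gets weight = population share / sample share,
# so over-represented groups (here, the 55+ group) are downweighted
weights <- population_share / sample_share
round(weights, 2)   # 18-34: 2.00, 35-54: 1.17, 55+: 0.64 (approximately)
```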

Examples of the latter include the hypothesis that people with low social trust were both less likely to respond to polls and more likely to vote Republican in 2020. If people who don’t respond to polls have a significantly different opinion on an issue compared to survey respondents, and the key characteristic driving that difference is not something the pollster can account for, there can be a significant sampling bias which affects all pollsters in a similar fashion. Note that both factors are required for there to be sampling bias: if, for example, 800 men and 400 women respond to a poll, but both men and women intend to vote 51-49 for the Coalition, then there will not be any sampling bias in the results even though the sample was not representative of the population. On the other hand, if men intend to vote 53-47 for the Coalition while women intend to vote 51-49 for Labor, then failing to weight (or being unable to weight) for gender would produce roughly a 0.7% bias to the Coalition in that poll (assuming the voting population is split evenly between men and women).
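
To spell out the arithmetic in that last example (same hypothetical numbers, with the voting population assumed to be split 50/50 between men and women):

```r
# Hypothetical sample: 800 men (53% Coalition 2pp) and 400 women (49% Coalition 2pp)
n_resp    <- c(men = 800, women = 400)
coalition <- c(men = 0.53, women = 0.49)
pop_share <- c(men = 0.5,  women = 0.5)   # assumed 50/50 voting population

unweighted <- sum(n_resp * coalition) / sum(n_resp)   # ~0.517
weighted   <- sum(pop_share * coalition)              # 0.510

round(100 * (unweighted - weighted), 1)   # ~0.7-point skew to the Coalition
```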

Pollster herding

Another potential source of correlation is what is known as pollster herding, which refers to pollsters producing results which are much closer to each other than we would expect by chance. Pollster herding is not necessarily data fraud – there are many decisions which go into the modelling and weighting of data where there is no clear answer, and many assumptions which are legitimate given the available data.

For example, when estimating the 2pp, do you ask respondents to nominate which party they’ll preference (a method known as respondent-allocated preferences), or do you estimate the 2pp from the primary vote data, using preference flows between parties at the last election? The former sounds reasonable, except that doing so has proven to be less accurate than using the last-election preference flows method.

A further question: when using last-election preference flows, do you simply use the data from the last election of its type (e.g. preference flows at the last federal election for federal polling), or do you mix it with preference flows at recent elections in other jurisdictions (e.g. recent state elections, for federal polling)? The former may cause you to miss shifts in voter preferencing behaviour (e.g. the shift towards preferencing the Coalition by One Nation voters in 2019), while the latter may cause you to mistakenly assume similar state/federal preferencing behaviour when that isn’t true (e.g. in the final Newspoll of the 2021 WA state election, about 1.2% of the 2pp error would have been avoided if a pure last-election preference flow formula had been used). And that’s not even including the question of how to estimate the preference flows themselves, or how to mix state and federal preference flows (if you opt to do so).

However, there are many ways to pick a different set of assumptions when your initial model comes out with a result which seems implausible or wildly out of line with other polls, and doing so results in polls which are regularly closer to each other than should happen by chance.
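
As a hedged sketch of the last-election preference flows method described above (the primary votes and flow percentages here are invented for illustration, not taken from any actual poll or election), the Coalition 2pp is simply the Coalition primary plus each other party’s primary multiplied by the share of its preferences that flowed to the Coalition at the previous election:

```r
# Hypothetical primary votes from a poll (as proportions)
primaries <- c(Coalition = 0.39, Labor = 0.36, Greens = 0.10,
               OneNation = 0.05, Other = 0.10)

# Hypothetical share of each party's preferences flowing to the Coalition,
# estimated from the previous election's distribution of preferences
flow_to_coalition <- c(Coalition = 1.00, Labor = 0.00, Greens = 0.18,
                       OneNation = 0.65, Other = 0.50)

coalition_2pp <- sum(primaries * flow_to_coalition)
round(100 * coalition_2pp, 1)   # estimated Coalition 2pp from the primaries
```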

Pollster herding was a significant concern at both the 2016 and 2019 Australian federal elections due to how under-dispersed the polls were, although only 2019 produced a systematic polling error. In 2019, the reported 2pp varied within just a 1% range (51% to 52% for Labor), which is very unlikely to occur with random sampling. With the exception of the Greens primary vote – and even that was mostly thanks to Ipsos’ long-running over-estimate of the Greens and under-estimate of Labor in their polls – the primary votes were also fairly under-dispersed, although not nearly as much as the 2pp.

For example, I estimate that there was only about a 4.5% chance of getting Coalition primary votes as under-dispersed as those in the final five polls of the 2019 Australian federal election.

(This estimate was calculated using the sampling error formula, assuming p = 0.386 (the average share of respondents who intended to vote Coalition in the final polls) and a sample size of 2500 per poll, to generate random 5-tuples of poll results and estimate the standard deviation of each; I then used maximum likelihood estimation to fit a beta distribution to those standard deviations. The fitted beta distribution (alpha = 6.899, beta = 835.607) suggested there was a 4.53% chance of getting a 5-tuple of polls with a standard deviation equal to or lower than that of the final polls.)
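
A rough reconstruction of that calculation, going off the description above rather than the original script, might look like this (`observed_sd` is a placeholder for the actual standard deviation of the five final Coalition primary votes):

```r
library(MASS)   # for fitdistr()

set.seed(2019)
p_true <- 0.386    # average Coalition primary across the final polls
n_poll <- 2500     # assumed sample size per poll

# Simulate many sets of five independent polls and record the standard
# deviation of the Coalition primary within each set
sim_sds <- replicate(100000, sd(rbinom(5, n_poll, p_true) / n_poll))

# Fit a beta distribution to the simulated standard deviations by MLE
fit <- fitdistr(sim_sds, "beta", start = list(shape1 = 5, shape2 = 500))

# Probability of a spread at least as tight as the one actually observed
observed_sd <- 0.005   # placeholder: substitute the observed standard deviation
pbeta(observed_sd, fit$estimate["shape1"], fit$estimate["shape2"])

# The raw empirical tail probability gives a similar answer
mean(sim_sds <= observed_sd)
```
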
Under-dispersal (i.e. the polls being much closer to each other, or less dispersed, than expected from random sampling) makes polls more correlated with each other, which reduces the accuracy of polling averages.

How these sources of error relate to the 2019 polling failure

In 2019, the polls missed on the 2pp by a significantly larger margin than usual (I estimate the average error of individual polls since 1983 is about 2%; the 2019 misses were between 2.5% and 3.5%), especially once one takes into account the various improvements in polling (for example, the 4 – 5.5% misses by Morgan in 2004 would have been smaller had they used last-election preference flow modelling, as all pollsters now do). More crucially, unlike in past elections where polls tended to err in random directions (e.g. in 2013, Morgan, Nielsen and Newspoll overestimated the Coalition while ReachTEL, YouGov and Essential underestimated them, producing a highly accurate average overall), in 2019 the polls all overestimated Labor by nearly the same amount (2.5 – 3.5%), which is highly unlikely through sampling error alone. If the samples were perfectly representative of the population (i.e. the share of poll respondents intending to vote for the Coalition was the same as among people who weren’t polled or refused to respond) and there was no herding whatsoever, the probability of a single poll with a sample size of 2500 getting 49% Coalition or less when the electorate would vote 51.5% Coalition is about 0.6%.

The probability of five independent polls averaging out to 48.6% Coalition is ridiculously low (in percentage terms, a zero followed by nine more zeroes after the decimal point). Hence it is insanely unlikely that the pollsters did everything correctly but just got hit by bad luck.
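
Those two probabilities can be approximated with a normal approximation to the sampling distribution – a back-of-the-envelope sketch under the same assumptions (sample size of 2500, true Coalition 2pp of 51.5%), not the model itself:

```r
p_true <- 0.515   # actual Coalition 2pp
n_poll <- 2500    # assumed sample size per poll

# Standard error of a single poll's 2pp estimate
se_single <- sqrt(p_true * (1 - p_true) / n_poll)

# P(a single poll shows 49% Coalition or less) -- roughly 0.6%
pnorm(0.49, mean = p_true, sd = se_single)

# P(the average of five independent polls comes out at 48.6% or less)
se_average <- se_single / sqrt(5)
pnorm(0.486, mean = p_true, sd = se_average)
```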

Others have discussed (and dismissed) other explanations such as late swing and the “shy Tory” effect. I will only add to that discussion a note that the exit poll not only showed Labor ahead, it showed a small swing to Labor (1%) compared to YouGov-Galaxy’s final pre-election polling, making it even less likely that the polls were correct up until 2 days before the election and then got hit by a late swing.

Instead, it is very likely that either the samples acquired by the pollsters were all skewed towards Labor (sampling bias), and/or that pollsters suppressed the release of, or adjusted the models for, polls which showed the Coalition ahead (pollster herding). Both hypotheses have evidence for them – the inquiry into the 2019 polls by AMSRO suggests that a failure to weight by education might be responsible for about 23% of the error on the margin between the two major parties (which, if true, would have reduced that error by about 1.2% in 2019), while there have been reports of pollsters shoving polls showing the Coalition ahead into the drawer for fear of being the embarrassing outlier.

To test both hypotheses, I’ve been building and refining a model of pollster herding which includes a variety of factors, such as sample bias, models of changing voting intention throughout a campaign and, of course, pollster herding. I’ve uploaded it here in R for those who want to look into how it works (warning: the code is very unpolished and I haven’t had the time to insert comments yet), but broadly speaking:

  • With both models, there’s a time series model (i.e. it “starts” at a certain number of weeks before the election, then changes the vote towards the final result every week according to a randomly generated model). The “true value” in each simulation always reaches the pre-set actual vote margin by the final week (so no last-minute errors).
  • With the unherded model, it basically generates a random sample bias for each simulation, then uses sampling error to generate a random set of polls from the “sample”.
  • With the herded model, it does everything the unherded model does, but a few other factors also come into play. Firstly, outliers have some chance of being junked by the “pollster” (though this chance reduces to 0 by election day, so there are no simulations without final polls). Of the remainder, most are adjusted so that they are no longer outliers (e.g. if the poll showed Coalition 51%, the most recent polling average showed Labor 52%, and an outlier was defined as being more than 2% out from the average, the “pollster” would adjust the result to 50-50), though a small number of outliers are published anyway.

    What is defined as an outlier changes as the campaign goes on – taking my cues from Nate Silver’s graph of deviation from the polling average versus time to the election, the model starts out defining a deviation of 2% or more from the last polling average as an outlier (so if the average is 52% Coalition, anything above 54% Coalition or with Labor ahead would be an outlier), then uses a logistic function to reduce this limit to 1% by election week (with more rapid reduction the closer the election gets). This is already very generous given the extreme level of herding seen at the 2019 election – to accurately model the under-dispersion seen there, the limit would have to be halved in both cases. A rough sketch of this outlier rule is given below.
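
The sketch below is my own simplified reconstruction of that outlier rule, not code from the linked model – in particular, the steepness of the logistic decay and the exact clamping behaviour are assumptions for illustration:

```r
# Outlier threshold (in 2pp points) shrinking from ~2 to ~1 as the election
# nears, following a logistic curve in weeks-to-election
outlier_threshold <- function(weeks_out, start = 2, end = 1, steepness = 0.8) {
  end + (start - end) / (1 + exp(-steepness * (weeks_out - 3)))
}

# A toy herding step: if a raw poll deviates from the current polling average
# by more than the threshold, pull it back to the edge of the "normal" range
herd_poll <- function(raw_2pp, poll_average, weeks_out) {
  limit     <- outlier_threshold(weeks_out)
  deviation <- raw_2pp - poll_average
  if (abs(deviation) > limit) poll_average + sign(deviation) * limit else raw_2pp
}

outlier_threshold(10)   # ~2 points, ten weeks out
outlier_threshold(0)    # ~1.1 points, in election week
herd_poll(raw_2pp = 51, poll_average = 48, weeks_out = 1)   # clamped towards 48
```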

With all polls, I’ve followed our Australian pollsters’ conventions and rounded to the nearest 0.5% before “release”. Polling averages and other statistics are calculated from these rounded outputs instead of the raw data to more closely simulate how the polls shift in Australian elections; this may result in some rather odd-looking distributions in our graphs.

Testing hypotheses of the 2019 polling failure

Hypothesis 1: Random chance; no sample bias or pollster herding

This is a fairly simple check I perform to see how likely it is that our pollsters got hit by “bad luck” – i.e. what’s the chance that we would have gotten polling errors like the ones we saw if the polls were unbiased and there was no herding?

[Figure: It's unlikely we could have gotten the polls we did if there was no sample bias or herding. Data generated assuming a sample size of 2500 per poll; if the sample size were higher, the chance of getting polls this far from the true result would be even lower.]

So it’s fairly unlikely. More importantly, it’s even less likely that we would have gotten a 5-poll average as wrong as it was in 2019 if the samples were unbiased and there was no herding:

[Figure: It's very unlikely we would have gotten the polling average we did if there was no sample bias or herding.]

Furthermore, it’s fairly unlikely we would have gotten polls as under-dispersed as they were if there was no herding:

[Figure: It's very unlikely we would have gotten a sample standard deviation this low with no herding. More specifically, there's a 2.9% chance we would have gotten a standard deviation less than the one found in our sample (0.374) if the polls weren't herding. Note the irregular shape of the distribution – in this modelling, all poll outputs are rounded to the nearest 0.5%, as our pollsters do.]

With the possibility of random sampling error being a cause out of the way, let’s look at our other two hypotheses:


Hypothesis 2: Samples were biased to Labor, no herding

Given that the final polls understated the Coalition’s 2pp by about 3%, one possible cause could be that the samples used in constructing the polls were systematically skewed towards Labor by about 3% throughout the campaign.

To simulate this, I’ve set the sampling bias to 3% to Labor (-3 in my model) and the initial value to 0 – the model “starts” 10 weeks prior to the election, at which point Labor was polling at about 53-47. If I didn’t set the initial value to 0 (neither side favoured at the start), the model would assume that the “average” initial poll reading is 51.5 to Labor instead (the final result was Labor 48.5, and a 3% bias to Labor means polls showing Labor up 51.5), which would not match what actually happened in 2019; setting it to zero means the initial poll readings, 10 weeks out from the election, average around Labor 53-47.

A quick caveat: this is a simulation done knowing the results – we won’t know what the sample bias will be for future elections (if we did we would basically know the results of the election). These simulations are solely for the purpose of seeing which models best explain the pattern of errors seen at the 2019 Australian federal election, knowing what we already do about what happened.
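
As a rough sketch of how this scenario could be simulated (this mirrors the description above rather than reproducing the linked R model; the linear weekly drift and the sample size are assumptions), each week’s poll is drawn from a sample whose Labor share is inflated by the bias:

```r
set.seed(2019)
weeks_out   <- 10:0      # weeks out from the election
final_labor <- 0.485     # actual Labor 2pp at the election
start_labor <- 0.50      # assumed "true" Labor 2pp ten weeks out
bias_labor  <- 0.03      # samples skewed 3 points towards Labor
n_poll      <- 2500

# Simple linear drift of the true value towards the final result
true_labor <- start_labor + (final_labor - start_labor) * (10 - weeks_out) / 10

# Each week's poll: binomial sampling from the biased "sample" proportion
poll_labor <- rbinom(length(weeks_out), n_poll, true_labor + bias_labor) / n_poll

# Round to the nearest 0.5% before "release", as Australian pollsters do
released <- round(poll_labor * 200) / 200
data.frame(weeks_out, released_labor_2pp = 100 * released)
```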

With these settings in place, we can see that a sample bias can explain the errors seen in the 2019 Australian federal election fairly well:

[Figure: Sample bias would explain the error in the 2019 polls.]

Sample bias also explains the error in an average of the polls:

[Figure: Sample bias would explain the error in an average of the 2019 Australian federal polling.]

However, sample bias only explains why the polls were as far off as they were, not why they were so closely clustered together by the end of the campaign. This is reflected in a graph of the standard deviation in the polling sample:

[Figure: Sample bias cannot explain why the polls were under-dispersed.]

Hence, it’s pretty unlikely that the pattern of polling errors we saw in 2019 was solely due to an unrepresentative sample. Even if, say, weighting by education would have fixed the systematic bias to Labor (and it’s not clear it fully would – going off AMSRO’s estimates, the polls would still have come out at around 50.5 to 51 to Labor), the pattern of highly clustered polling is something we should be concerned about. Although highly clustered polls won’t always fail like they did in 2019 (see 2016 for an example of similarly under-dispersed polls which performed very well), when they do fail, they will systematically err in the same direction, making polling averages useless and damaging public confidence in opinion polling as a means of gauging public opinion.


Hypothesis 3: No systematic sample bias, polls herd according to model

To simulate this, we first set our sample bias to the default (i.e. no net sample bias in either direction), then change a few things to match how polling shifted over the course of the 2019 campaign. Firstly, we set the initial value, or the average 2pp at the start of the campaign, to 53% Labor (-3 in the model) – in other words, we assume that the polls were broadly correct as of March 2019, when they reported Labor up around 53-47. Note that we don’t actually have to make this assumption per se; I just do it because it’s a convenient way to make the model output polls around 53% Labor for the “start” of the campaign. Pollster herding refers to pollsters producing results which are abnormally in line with other polls, not with the true result (which is unknown ahead of the election) – even if the actual share of people intending to vote/preference Labor was, say, 48%, the patterns of pollster herding would be roughly similar given the same set of initial polls.

Secondly, we set the time series model – the model of how voting intention shifted throughout the campaign – to type 4, which basically produces less change at the start and more change in the final weeks, though the underlying “true” value always reaches Coalition 51.53% by election day. We did this because it seems there was more swing to the Coalition at the end of the campaign, although this choice usually doesn’t have a large impact in the model compared to the linear time series model.
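
The exact shape of “type 4” is defined in the linked R code; purely as an illustration of the idea (the exponent below is an assumption, not taken from the model), a path with more movement late in the campaign can be produced by applying a convex weighting to the gap between the starting and final values:

```r
weeks_out <- 10:0       # weeks out from the election
start_2pp <- 0.47       # Coalition 2pp implied by ~53-47 to Labor at the start
final_2pp <- 0.5153     # Coalition 2pp at the election

# Share of the total shift completed: rises slowly at first, quickly late
progress <- ((10 - weeks_out) / 10)^2   # convex schedule (illustrative exponent)

true_path <- start_2pp + (final_2pp - start_2pp) * progress
round(100 * true_path, 2)   # always reaches 51.53 by election week
```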

In other words, we broadly assume that the polls were fairly accurate as of two months out, and that the pattern of the shift in voting intention picked up in the polls is broadly correct, though too small. Doing so, we can see that pollster herding also serves to explain the polling error in 2019:

[Figure: Herding would explain the errors seen in the 2019 polls.]

Although the herded distribution is slightly less representative of the polls we got (in particular it’s slightly closer to the “true” result than the sample-biased distribution), it still does a pretty good job of explaining the polling error. Ditto with the herded polling average:

[Figure: Pollster herding would also explain the error seen in the polling average. Note how much flatter, or more spread out, the distribution of herded polling averages is compared to the non-herded averages. As noted above, herding means that polling errors don't cancel out; when the polls are significantly "off" (e.g. when they start out around 54-46 to Labor), all of them will be off in the same direction and by similar magnitudes.]

Most importantly, pollster herding also explains the under-dispersal in the polls seen in 2019:

[Figure: Pollster herding better explains the under-dispersal in the 2019 Australian federal election polling. Note how much "tighter" the distribution of possible deviations from the polling average is for herded polling.]

Hence, it appears very likely that pollster herding was the largest source of the polling error seen at the 2019 federal election. Based on our simulations, the primary driver appears to have been Labor’s early polling strength following Scott Morrison’s replacement of Malcolm Turnbull, which resulted in pollsters herding towards a Labor win. In contrast, sampling bias does not explain the severe degree of under-dispersal seen in the polls – a fact we actually understate here, considering we only looked at the under-dispersal of the final polls prior to the election and not the high degree of under-dispersal throughout the campaign.

From the data above, it seems that sampling bias played a relatively small role in the 2019 polling failure once herding has been accounted for. The average polling average across the herding simulations was 50.9% to Labor, suggesting that, if our model of herding is broadly correct, the various forms of sampling bias (e.g. failure to weight by education) would have had an effect of at most about 0.5% on the 2pp. Interestingly, this lines up rather well with the estimated effect of the failure to weight by education in 2019: if we use AMSRO’s estimate of a 23% reduction in the error on the margin between Labor and the Coalition from weighting by education, the error on that margin is reduced from 5.2% to 4%, i.e. a roughly 1.2% reduction in the (Coalition primary – Labor primary) statistic. This would translate into an approximately 0.6% reduction in the 2pp error, which is very close to the figure we derived here, suggesting that pollster herding, and not sampling bias, was primarily to blame for the 2019 Australian polling error.
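
The arithmetic behind that comparison is simple enough to spell out (the halving used to convert the primary-vote margin into a 2pp figure is a rough approximation on my part; the exact conversion depends on preference flows):

```r
margin_error    <- 5.2    # 2019 error on (Coalition primary - Labor primary), in points
amsro_reduction <- 0.23   # AMSRO's estimated share of that error due to education weighting

margin_saved <- margin_error * amsro_reduction   # ~1.2 points off the primary-vote margin
margin_error - margin_saved                      # ~4.0 points of margin error remaining

# Rough 2pp translation: roughly half the margin reduction accrues to each
# party's primary, shrinking the 2pp error by about 0.6 points
margin_saved / 2
```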

Edit: I’ve written an addendum where I analyse what the impacts would be if pollsters had weighted by education but continued to herd in 2019; that’s up here.

