What ended up happening
Although there are still some votes to go, at this point all districts in the 2021 Western Australian state election have pretty much been “decided” – i.e. the remaining vote to be counted is very unlikely to change the winner of any seat. Labor looks to have won 53 of the 59 electoral districts and over 68% of the two-party-preferred (2pp) vote, easily setting records for the largest share of lower house seats and of the 2pp vote in state elections. Of the remainder, the Liberals have just two (Cottesloe and Vasse) and the WA Nationals four (North West Central, Moore, Central Wheatbelt and Roe), meaning that it’s very likely that a party which barely contested a quarter of the state (the Nationals fielded candidates in just 16 of the 59 electoral districts) is going to be accorded the status of official Opposition and the ensuing perks.
Such an outcome was well within Meridiem’s margin of error, although it looks like the Labor vote and seat count will end up on the high side of that range. Of the 59 electorates, Meridiem correctly “called” – i.e. gave the highest probability of winning to the actual winner in – 55 (93%), which was fairly close to the 56 of 59 it expected to call correctly. (I don’t like this metric – if you have a seat which your model estimates is won 40% of the time by Labor, 35% Liberal and 25% National, a Labor victory would be considered a “correct call” even though Labor was more likely than not to lose. However, it happens to be the metric that everyone usually talks about, unfortunately, so I’ve included it here.) Furthermore, in many districts where there was a significant gap between Meridiem and punters’ wisdom (e.g. Bateman, Nedlands, South Perth, Warren-Blackwood), Meridiem’s forecast turned out to be closer to the mark. So overall, Meridiem seems to have done fairly well.
What didn’t go so well
Before I proceed to the scoring, I’d like to go through some of the parts of Meridiem which didn’t go so well.
Usually, I’d hesitate to say that any part of a model is outright wrong simply because of a single outcome to which the model gave low probability – that’s not how statistics works. However, there are some cases where a model simply doesn’t fit the designer’s intent, so to speak; for example, if your model is “take the polls at face value”, the polls say 66% for Labor, and your model outputs 65%, then something in the model’s code is clearly wrong.
Meridiem had quite a few such bugs, even in the final forecast. Part of this is due to my not having back-tested the full model on previous elections (I did do back-testing, but on individual components, e.g. back-testing the outputs for 2pp swing in each seat), which resulted in uncaught bugs that had to be fixed later on. Two of these bugs ended up in the final forecast:
- Failure to integrate seat polling into the model: Basically, some parts of our model run on Excel (because Excel is generally better at offloading lots of simulations to different processor cores, making the Monte Carlo process much faster), and the Excel formula we used to aggregate data from seat polling didn’t run for some unknown reason (it only updated when I manually clicked into the formula and pressed Enter). As a result, the inputs the model should have received from the seat poll of Dawesville were instead set to 0, meaning that the model inexplicably did not integrate seat polling into its forecast.
This was a pretty costly bug, as it turns out that a combination of the seat poll in Dawesville and the final statewide polling would have come closer to calling the statewide swing to Labor on primary votes than the final statewide Newspoll (at the time of writing, the swing looks like it’ll end up being around +18 to Labor; the final Newspoll had it at +15, and our model would have combined that with the +21.5 of the Dawesville seat poll to estimate a statewide swing of +16.3).
- Artifacts in the elasticity model: This is a very specific issue which mostly affected the seat of Churchlands. Basically, we model elasticity by regressing the 2pp margin in each polling booth against the statewide 2pp, which produces a higher elasticity estimate for booths where the swing tends to be greater than average and a lower estimate for booths where the swing tends to be smaller than average.
As it turns out, there was a booth in Churchlands (Kapinara Primary School) where we had just 3 elections of data and where the Liberal share of the vote surged in 2017 despite a large statewide swing to Labor; this resulted in a very negative elasticity estimate for that particular booth (-2.55), which dragged the elasticity estimate for Churchlands down to 0.7. Although a commenter did point out to me the oddity of Churchlands being forecast as more Liberal-leaning than Cottesloe, I have a philosophy of minimising midstream changes to models (as they give the developer too much freedom to tweak results they don’t like, which is usually counterproductive in modelling), and the elasticity estimate was not so implausible that I thought it was a bug rather than a genuine output of the code (if the estimate had been, say, 0.1, I might have realised it was a bug and looked into it).
Going forward, I have several ideas for refining the elasticity model, probably some combination of moving to a non-linear model of elasticity (i.e. elasticity should be different when the statewide/national 2pp is 70-30 versus when it is 51-49) and implementing some form of check to prevent the system from generating wildly implausible elasticity estimates. However, I do think the elasticity model performs well overall – a version of Meridiem with the elasticity model turned off (i.e. a uniform swing model) underperforms Meridiem across the other electorates (on both Brier and log-score) – and hence I intend to keep a modified version of it around in future forecasts.
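To make the mechanism (and the failure mode) concrete, here is a minimal sketch of a booth-level elasticity estimate in the spirit of the description above – the function, the clamp and all the vote figures are illustrative assumptions, not Meridiem’s actual code or data:

```python
import numpy as np

def booth_elasticity(booth_2pp, state_2pp):
    """Slope of a linear regression of a booth's Labor 2pp on the statewide Labor 2pp.
    Booths that swing harder than the state get slopes above 1, softer booths below 1."""
    slope, _intercept = np.polyfit(state_2pp, booth_2pp, deg=1)
    return slope

state_2pp = np.array([51.1, 42.7, 55.5])      # three elections of statewide Labor 2pp (made up)
typical_booth = np.array([46.0, 38.5, 50.8])  # swings roughly in line with the state
odd_booth = np.array([50.0, 54.0, 32.0])      # surged the other way at the last election

print(round(booth_elasticity(typical_booth, state_2pp), 2))  # ~0.95: close to uniform swing
print(round(booth_elasticity(odd_booth, state_2pp), 2))      # ~-1.55: the kind of artifact behind Kapinara's -2.55

def clamped_elasticity(estimate, lo=0.25, hi=2.5):
    """One possible sanity check for future versions: keep estimates in a plausible range."""
    return min(max(estimate, lo), hi)
```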
So I will note that the version of Meridiem which ended up published on our forecast page is actually slightly sub-optimal; had we fixed those bugs ahead of time, the forecast would have called 57 of 59 and scored better. I’m aware that this might well appear to be a case of hindsight bias (see the pundits who were very confident Trump would lose in 2016 but then ended up being the ones explaining the Trump phenomenon to the rest of us in the media), so in my defence I’ll point out that we mentioned several times in our method piece that we intended to integrate seat polling into our model, and that our elasticity model was meant to be weighted towards uniform swing; these two issues aren’t something we failed to think of beforehand and are only thinking to integrate now. There are other problems with the model – e.g. our weighting of the different methods of estimating the 2pp and seat polls, and our decision to apply a vote effect for the Liberal leader’s switch from Cottesloe to Dawesville, which seems to have been counterproductive – but the two I bring up here are ones which we did think of ahead of time but which got messed up due to coding issues.
(And yes, I am indeed regarding parts of a model which called “just” 55 of 59 seats “correctly” as having failed; you don’t get to code models like these without being at least a little obsessive about how they perform.)
A caveat before we go through the scores
I mentioned this on our forecast page when putting up the expected Brier score of our model, but forecast scoring rules like the Brier and log-score generally require a fairly large sample size to accurately score forecasts of very common or very uncommon events. This is usually not a big problem, but in Western Australia, not only are there relatively few electorates (for a single-winner system), this election was also a massive landslide. Of the forecasts made by Meridiem, 48 were estimated to have probabilities of over 95% for one side (Basic had 41 such forecasts), meaning that there are very few forecasts in electorates where either side could plausibly have won. In other words, it would take a massive error – e.g. getting more than two of those 48 forecasts wrong – to be able to say that a model is clearly incorrect. (It’s not an ideal way to estimate it, but I constructed a 90% confidence interval using the average probability in that bracket of over 95%, which is 0.989 or 98.9%. Using the formula for the standard deviation of a sample proportion, sqrt(p*(1-p)/n), with p = 0.989 and n = 48, I estimate that if a model producing those probabilities is correct, you should expect to get 2 or more such calls wrong just 5% of the time.) If you were wondering, both Meridiem and Basic – Basic being a much simpler forecast with fewer features which we use as a baseline to judge Meridiem’s performance against – got all such forecasts “correct”.
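For the curious, that back-of-the-envelope check can be reproduced in a few lines – only p, n and the sample-proportion formula come from the paragraph above; the 1.645 multiplier is just the standard one-tailed 5% cut-off for a normal approximation:

```python
import math

p, n = 0.989, 48                  # average probability in the >95% bracket, and number of such calls
se = math.sqrt(p * (1 - p) / n)   # stdev of a sample proportion, ~0.015

# Lower bound of the 90% confidence interval on the proportion of correct calls
lower_bound = p - 1.645 * se      # ~0.964, i.e. about 46.3 correct out of 48
print(f"se = {se:.4f}, expect at least {lower_bound * n:.1f} correct calls 95% of the time")
# Getting 2 or more of the 48 wrong (46 or fewer correct) falls below that bound,
# so under this approximation it should happen no more than ~5% of the time.
```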
As a result, the difference in scores between pretty much any two models of this election is going to be fairly small, unless one of the models was spectacularly bad. Thus, we’re also planning to compare each model’s 2pp predictions to the final 2pp results; however, that will have to wait until the 2pp figures are calculated and released by the Electoral Commission (hence why this is only a Part I).
Meridiem vs Basic: How our forecast scored
I’ve scaled and shifted all the scores such that the highest possible score is +1 and a forecast which gave 50% for all outcomes would score 0. Because the log-score can reach negative infinity, there’s no lower limit on these scores; however, if a forecast’s overall score is negative, that implies the forecast is worse than outright guessing 50% for all scored outcomes.
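The exact transform isn’t spelled out here, but one rescaling with exactly those properties (a perfect call scores +1, a flat 50% call scores 0) looks like the sketch below – this is my own illustration of the idea rather than necessarily the precise formula behind the tables that follow:

```python
import math

def scaled_brier(p_winner):
    """Rescaled Brier score from the probability given to the outcome that occurred.
    The raw binary Brier (1 - p)^2 is 0 for a perfect call and 0.25 for a 50% call;
    rescaling linearly gives +1 for perfect and 0 for 50%."""
    return 1 - 4 * (1 - p_winner) ** 2

def scaled_log(p_winner):
    """Rescaled log-score: log2(p) + 1, giving +1 for perfect, 0 for 50%, -inf as p -> 0."""
    return math.log2(p_winner) + 1

# A confident correct call vs the same call gone wrong
print(scaled_brier(0.95), scaled_log(0.95))   # ~0.99 and ~0.93
print(scaled_brier(0.05), scaled_log(0.05))   # ~-2.61 and ~-3.32: the log-score punishes misses harder
```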
First up, the scores for Meridiem and Basic when scoring the 2-party-preferred winner in each electorate (i.e. whether the Labor or the Liberal/National candidate ended up ahead on the 2-party-preferred in the electorate, regardless of whether the candidate won the seat overall):
| | Brier score | Log-score |
|---|---|---|
| Basic | 0.7283 | 0.6869 |
| Meridiem | 0.7467 | 0.7377 |
| % difference | +2.5% | +7.4% |
On both scores, Meridiem outperformed Basic somewhat (for context, our QLD 2020 forecast outperformed its Basic counterpart by 18% on Brier), with a larger outperformance on the log-score, which penalises “wrong” forecasts more heavily than the Brier score whilst awarding a smaller bonus for “correct” but uncertain forecasts.
Next, the scores for Meridiem and Basic when scoring the probabilities given to the winner of each electorate:
| | Brier score | Log-score |
|---|---|---|
| Basic | 0.925 | 0.6867 |
| Meridiem | 0.9294 | 0.7297 |
| % difference | +0.5% | +6.3% |
Again, Meridiem outperforms Basic, although it’s a very close call on the Brier score. Part of that is because Basic correctly integrated the Dawesville seat poll into its simulations while Meridiem didn’t; another major factor is that, due to the bug in the elasticity model, the Liberals were bigger favourites in Meridiem’s forecast for Churchlands than they should have been.
I will note that, as above, the differences here are going to be very minor compared to an “average” election (one where the 2pp is 54-46 or narrower), partly because some of the modules in Meridiem simply don’t have much impact compared to a regular election. For example, our candidate effects model predicted Labor would gain more in seats they had just flipped at the 2017 state election, as they would be gaining the personal vote effect of their incoming MP (basically the share of the electorate who only vote for that MP and not for anyone else from their party; I estimate this at about 1% of the vote) whilst removing the personal vote of the opposing MP. However, in this election, Labor was tipped to get a 5-14% swing on the 2pp on top of their last election win, meaning that a small boost from the personal vote in seats like Balcatta and Burns Beach had very little effect on their Labor MPs’ chances – when you’re tipped to win more than 60% of the 2pp anyway, an extra 2% isn’t going to meaningfully change your probability of winning. Comparing the Labor MP’s chances of winning in Meridiem vs a version of Meridiem with the candidate effects model turned off in Balcatta and Burns Beach:
Balcatta: 99.991% vs 99.97%
Burns Beach: 99.984% vs 99.96%
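To see why the differences are so tiny, here is a toy illustration – the 63% expected 2pp and 3.5-point spread are invented for the example, not Meridiem’s actual seat distribution:

```python
from statistics import NormalDist

# Hypothetical safe seat: Labor's 2pp modelled as roughly normal around 63% with a 3.5pp spread
base = NormalDist(mu=63.0, sigma=3.5)
boosted = NormalDist(mu=64.0, sigma=3.5)   # add a ~1pp personal-vote effect for the sitting MP

p_base = 1 - base.cdf(50.0)
p_boosted = 1 - boosted.cdf(50.0)
print(f"{p_base:.3%} -> {p_boosted:.3%}")
# Both come out around 99.99%, so the boost barely registers -
# much like the Balcatta and Burns Beach figures above.
```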
It is possible for forecast scoring to judge between such forecasts, but you would need a lot of trials. This is also why we intend to look at the 2pp error of Basic and Meridiem when (if) such figures are released.
How Meridiem’s components did
Here, what I’ve done is take the bug-fixed version of Meridiem, turn off various modules (e.g. the composite lean model, the elasticity model), run a bunch of simulations, score them, and then compare them to the original bug-fixed Meridiem model. The reason I use the bug-fixed version of Meridiem is that some bugs in the code might happen to be “fixed” simply by turning off the buggy part of the model, which would conflate the value of the module with the cost of the bug.
For example, while the bug would have made the elasticity model significantly worse in Churchlands, the elasticity model overall outperformed uniform swing in every other seat. Switching the buggy Meridiem model to uniform swing (i.e. turning off the elasticity model) would have made it better overall, but that would miss the fact that the elasticity model is mostly fine save for one coding bug.
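In code terms, the comparison amounts to something like the following sketch – the seat probabilities, module names and scoring helper are placeholders rather than Meridiem’s actual modules or outputs:

```python
def scaled_brier(p_winner):
    """Rescaled Brier score for the probability given to the eventual winner (as earlier)."""
    return 1 - 4 * (1 - p_winner) ** 2

def score(forecast):
    """Average rescaled Brier over seats; forecast maps seat -> probability given to the actual winner."""
    return sum(scaled_brier(p) for p in forecast.values()) / len(forecast)

# Full bug-fixed model vs the same model with one module switched off (made-up probabilities)
full_model = {"Seat A": 0.97, "Seat B": 0.88, "Seat C": 0.60}
module_off = {"Seat A": 0.95, "Seat B": 0.85, "Seat C": 0.55}

print(f"full: {score(full_model):.4f}  module off: {score(module_off):.4f}")
# If the full model scores higher, the module is earning its keep.
```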
Not all of the parts which I mentioned in my previous post on how we intend/intended to score our model were tested; I didn’t bother to test whether the fundamentals weightage model would work better if switched to the less-theoretically-appropriate-but-backtested-better version because the answer is pretty clear (giving the fundamentals more weight would have made the forecast less accurate, as the fundamentals predicted a smaller win for Labor than the polls did, and Labor outperformed the polls).
Meridiem’s electorate lean model
This is basically the model which tells us what we should expect the vote in each electorate to look like in an average election, calculated as (Party X vote in seat – Party X vote statewide). Most models, even uniform swing models, include some data regarding how Labor/Liberal/National-leaning they expect each electorate to be (uniform swing effectively assumes that the electorate’s lean remains the same between elections); our model combines the lean data from the last two elections (with the most recent election given twice the weight) to estimate each electorate’s leaning.
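As a hypothetical illustration of that 2:1 weighting (the vote shares below are made up, not real results):

```python
def electorate_lean(seat_2pp_recent, state_2pp_recent, seat_2pp_prior, state_2pp_prior):
    """Seat lean = (seat vote - statewide vote), combining the last two elections
    with the most recent election given twice the weight."""
    lean_recent = seat_2pp_recent - state_2pp_recent
    lean_prior = seat_2pp_prior - state_2pp_prior
    return (2 * lean_recent + lean_prior) / 3

# A seat 3pp more Labor-leaning than the state at the last election and 1pp less
# Labor-leaning at the one before comes out at roughly +1.7 to Labor.
print(round(electorate_lean(58.7, 55.7, 41.2, 42.2), 2))
```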
Going into the election, I was slightly concerned that the redistributed data from 2013 would be inaccurate (due to the uncertainty inherent in attempting to redistribute electorates from two elections back); however, by the looks of it, I needn’t have been too worried. Seats with massive differences in 2pp lean between those elections (e.g. Bunbury: a 4% lean to the Liberals in 2013 but a 4.6% lean to Labor in 2017; our model estimated a 2.6% lean to Labor) mostly went the way our model expected them to (e.g. the 2pp in Bunbury currently looks like it’ll end up around 72%, against a statewide 2pp of around 69%, for a 3% lean to Labor), which is reflected in the forecast scores:
| | Brier score | Log-score |
|---|---|---|
| Meridiem (lean module disabled) | 0.9317 | 0.7314 |
| Meridiem | 0.9595 | 0.8238 |
| % difference | +3% | +12.6% |
Meridiem’s elasticity model
As noted above, there was a fairly serious bug in the elasticity model; hence, turning it off altogether would have made the published model better overall. However, comparing a version of the model where the bug has been fixed to one which simply assumes uniform swing, it appears that modelling differences in elasticity between electorates does improve forecast performance somewhat:
| | Brier score | Log-score |
|---|---|---|
| Meridiem (uniform swing) | 0.9562 | 0.8077 |
| Meridiem | 0.9595 | 0.8238 |
| % difference | +0.3% | +2% |
At the same time, I do recognise that it was pretty much touch-and-go with the Brier score (and even the difference on the log-score isn’t that large); hence I’ll be reviewing and adjusting the elasticity model for future elections.
Meridiem’s candidate effects model
I actually didn’t expect Meridiem’s candidate effects model to demonstrate any significant improvement. Part of this is because, as noted above, a lot of the seats where the candidate effects model would have made a significant difference are (in a 60-40 Labor environment) very safe for Labor anyway. Part of it is also that, as part of our candidate effects model, we applied a small (1%) penalty to the Liberal candidate in Cottesloe and a small (1%) bonus to the Liberals in Dawesville, due to the party leader switching seats. That decision looks to have been wrong (the swing to Labor in Cottesloe is significantly below average while the swing to Labor in Dawesville looks to be average or higher), but, surprisingly to me, the version of our model which included candidate effects significantly outperformed one with candidate effects turned off:
| | Brier score | Log-score |
|---|---|---|
| Meridiem (no personal vote effects) | 0.9431 | 0.7718 |
| Meridiem | 0.9595 | 0.8238 |
| % difference | +1.7% | +6.7% |
Looking through the scores for individual electorates, it seems like the candidate effects model helped the forecast to be slightly more confident in Labor’s chances of holding marginal seats it had just flipped at the 2017 state election (e.g. Joondalup, Kingsley), with a bigger boost in helping the forecast correctly predict the Labor wins in usually safe Liberal seats with retiring Liberal members (Bateman, Riverton, South Perth).
Overall, I would say that while there are still some issues to work out, Meridiem and its individual components seem to have done fairly well at the WA state election, outperforming Basic as well as versions of itself with said components turned off. I will probably write a follow-up piece analysing the error on our forecasted vote shares and highlighting possible areas of improvement once the final vote tallies and distributions of preferences come in.
Update: Part II is now online, here.