Census 2016 — a case study in how not to manage load

Many, many people think that they know how to test server performance, and many people think that they know how to manage load. As someone who has done those things professionally for a few years, I can assure you that most people are wrong.

Last night (and several times this morning) a stressed-out systems engineer was asked “What’s happening?” and I’ll bet that in response he (it is a very gender-imbalanced industry) pulled out a chart that looks like this:

[Chart: requests over time]

And he said “You see those spikes? The way that the request volumes just jump from zero to incredible? Those are denial-of-service attacks.” And his boss, not being an expert in load management, took this advice in good faith and told the security guys to get onto the proper authorities.

It’s a fundamentally flawed theory that makes a lot of sense. A typical denial of service attack happens when people send massive numbers of valid requests to a system, with the objective of exhausting all of the system’s resources. Most (but not all) systems will slow down and/or crash if you do this to them.

There’s an important difference between hacking into a system and making a denial of service attack. If ‘hacking in’ is like stealing your wallet, then denial of service is like punching you in the face. Just like in the physical world, there are some real psychos out there who punch people in the face just to prove that they can. (In the real world, punching someone in the face makes it easier to steal their wallet — but it’s a bit trickier in cyberspace, because if they fall unconscious, then it becomes impossible to steal anything at all.)

So the minister is right when he says ‘We weren’t hacked, and nobody took anything.’ But when he says ‘We weren’t attacked,’ he’s contradicting the people who are looking at the chart and seeing those horrifying spikes: those concerted efforts to put demand into the system to bring it down.

However, the people advising the minister are wrong, and they’ve been kind enough to make mistakes where we can all see them. Hours before disaster, an ABS spokesman was quoted by The Age making a very silly statement:

an ABS spokesman said the site could handle “1,000,000 form submissions every hour. That’s twice the capacity we expect to need.”

That statement in itself indicates a pretty serious planning failure. If we assume 9.27 million households in Australia, with 78% of those households in Eastern states, and a 66% online participation rate (sorry, not sure where I found that one), then that’s about 4.7 million submissions from the Eastern states on ‘Tuesday night’. If we assume that ‘Tuesday night’ runs from 6pm to midnight, that’s an average of 0.8 million forms an hour.

If you expect to need only half a million forms an hour of capacity when 4.7 million forms are due within 6 hours, then I think I know where your problem started.
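For anyone who wants to check the arithmetic, here is a rough sketch of the estimate in Python. Every input is one of the assumptions stated above (including the participation rate that I can't source), and the constant and function names are just labels I made up for this example. The product comes out at about 4.77 million, which I've been loosely calling 4.7 million.

```python
# Back-of-envelope demand estimate, using only the figures quoted in this article.
HOUSEHOLDS = 9_270_000              # assumed households in Australia
EASTERN_SHARE = 0.78                # assumed share of households in Eastern states
ONLINE_RATE = 0.66                  # assumed online participation rate (unsourced)
WINDOW_HOURS = 6                    # 'Tuesday night' taken as 6pm to midnight
QUOTED_CAPACITY_PER_HOUR = 1_000_000    # the ABS spokesman's figure
EXPECTED_NEED_PER_HOUR = 500_000        # "twice the capacity we expect to need"


def eastern_submissions() -> float:
    """Forms expected from the Eastern states on census night."""
    return HOUSEHOLDS * EASTERN_SHARE * ONLINE_RATE


def average_hourly_demand() -> float:
    """Average forms per hour if demand were spread evenly over the window."""
    return eastern_submissions() / WINDOW_HOURS


if __name__ == "__main__":
    total = eastern_submissions()
    hourly = average_hourly_demand()
    print(f"Eastern-state submissions: {total / 1e6:.2f} million")
    print(f"Average per hour:          {hourly / 1e6:.2f} million")
    print(f"Ratio to expected need:    {hourly / EXPECTED_NEED_PER_HOUR:.2f}")
    print(f"Share of quoted capacity:  {hourly / QUOTED_CAPACITY_PER_HOUR:.0%}")
```

Note that this is a flat average. It already sits well above the expected need and uses most of the quoted capacity, and that is before anybody shows up at the same time as anybody else.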

But what about the ‘attack’? Here’s a video that shows a denial of service taking place. As we can see, the massive number of people in the shop means that moving through the shop and paying for your goods is much slower than it would normally be. A denial of service is undoubtedly in progress — but was it an attack? No, it was a special event — a once-in-a-year sale, with ‘never to be repeated’ prices and a limited number of items on offer.

As it turns out, if you publicize something, and tell people that they’re only going to have one shot at doing something really important, there’s a chance that they will believe you, and they’ll all show up at the same time.

But what about those spikes? Surely the spiky graph shows that this wasn’t ‘normal demand amplified’, this was a coordinated effort, on top of the demand. Doesn’t it?

Almost certainly not. The graph above is similar to a chart that I did several years ago for a financial services company, analyzing phone calls into their call centre. This traffic pattern is completely typical. There’s a group of people who like to call just after 9am, before starting the working day. Peak volume hits like a truck at the beginning of lunchtime, with a secondary spike when people start getting back from lunch.

Similar spikes would have happened with the Census — the kids getting home from school, workers getting home from work, and (just after 7pm), “Let’s get this out of the way so that we can relax for the evening.”

I also did reports for an emergency information call centre whose busiest hour in 2014 brought in more phone calls than the previous 3 years combined. (You won’t have heard about that one, because the demand management was very successful.) In the trade, we call these ‘avalanche’ scenarios. Avalanches happen, and they are not nice, neat, orderly affairs: they are messy, savage monsters.

There are subtler problems that I suspect might also have contributed to this situation. People tend to focus very heavily on counts like ‘form submissions’ and ‘page requests’ in these situations. They’re very important numbers, and they’re relatively easy to count. There are other numbers that are harder to count, and if one of those hits its limit first, your ‘form submissions’ capacity never gets tested, because something else breaks. (‘Concurrent sessions’ is much harder to estimate, and it might have been equally important.)
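One common way to connect the two kinds of number is Little's law: the average number of sessions open at once equals the arrival rate multiplied by the average time each session stays open. The sketch below applies it to the 0.8 million forms an hour from earlier. The session lengths are purely hypothetical (I have no measurements of how long a Census form took), and the function name is my own.

```python
def concurrent_sessions(arrivals_per_hour: float, session_minutes: float) -> float:
    """Little's law: average sessions in progress = arrival rate x average session time."""
    return arrivals_per_hour * (session_minutes / 60.0)


if __name__ == "__main__":
    rate = 800_000  # forms per hour, the rough average estimated earlier
    # Hypothetical session lengths in minutes; these are guesses, not measurements.
    for minutes in (15, 30, 45):
        sessions = concurrent_sessions(rate, minutes)
        print(f"{minutes:>2} minute sessions -> {sessions:,.0f} open at once")
```

The uncomfortable part is that session time is not a constant. If the site slows down and every form takes twice as long to complete, the number of open sessions doubles without a single extra person arriving.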

And, of course, once you start falling behind, things get much worse, very quickly. People try again later when the demand is higher. They think that hammering the ‘refresh’ button is somehow helpful. They try from a different device. They ask a friend (or friendly server) to check from somewhere else. They turn on that bit of software that makes their network requests look like they come from the USA, and see if that fixes the problem. (International hackers? More like ‘people who like watching TV shows.’)
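Here is a toy model of that feedback loop, to show how quickly it gets out of hand. It is a deliberately crude sketch and not a model of the actual Census system: the arrival numbers, the capacity, the `retry_factor` parameter and the assumption that a user's chance of getting through equals the fraction of requests served are all made up for illustration.

```python
def simulate(arrivals, capacity, retry_factor):
    """Return the request volume the server sees in each time step.

    arrivals:     new users arriving in each step
    capacity:     requests the server can complete per step
    retry_factor: requests sent in the next step by each user who was turned
                  away (refresh hammering, second device, a friend's computer)
    """
    offered = []
    waiting = 0.0  # users turned away in the previous step
    for new_users in arrivals:
        users = new_users + waiting
        requests = new_users + waiting * retry_factor
        offered.append(requests)
        # Simplification: each user succeeds with probability equal to the
        # fraction of requests the server managed to complete.
        served_fraction = min(1.0, capacity / requests) if requests > 0 else 1.0
        waiting = users * (1.0 - served_fraction)
    return offered


if __name__ == "__main__":
    arrivals = [80, 120, 120, 80, 80, 80]  # a brief peak above capacity
    capacity = 100
    for factor in (1, 3):
        volumes = simulate(arrivals, capacity, factor)
        print(f"retry_factor={factor}:", [round(v) for v in volumes])
```

With one retry per frustrated user, the backlog clears soon after arrivals drop back below capacity. With three, the request volume keeps climbing even though fewer new people are arriving than the server could comfortably handle. On a chart, that looks a lot like somebody attacking you.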

Matthew Hackling has checked for the signs that typically appear when a large-scale denial of service attack is made, and found no such signs. We should not attribute this event to malice when there’s a much simpler alternative that hasn’t been ruled out.

Written by

Nick Argall is an organization engineer, structuring activities to help businesses achieve their goals. nargall@gmail.com
