Setting Reasonable Limits (and Testing Them)
If I have interviewed you for a job in the last several years, we probably talked about one of the first Microsoft XML parsers¹. The parser used a stack to match beginning and ending tags, and the developers didn’t think to include any kind of limit on the size of the stack. Why would they? Who would ever think to create a document that only contained opening tags? And even if they did, the number of opening tags would be bounded by the size of the XML document, right? Unfortunately, this was very wrong.
At the same time as one developer was writing the core parser, another was implementing support for XSLT higher up in the stack. XSLT had a cool feature that enabled embedded JavaScript to transform the document as it was being parsed². Thus, it was possible to craft a relatively small XML document that contained nothing but opening tags. Expose this as an endpoint on the Internet, and bam! - you have a denial-of-service attack waiting to happen. And since the XML parser was used by virtually every Windows Server application in the world, it was a denial-of-service attack with a pretty big blast radius.
During interviews, I ask candidates to put themselves in the team's shoes and explain how they would fix the bug. It is pretty obvious you can/should/must limit the number of opening tags put on the stack – but what should the limit be? How do you set a limit that works for a library that runs on everything from tiny embedded devices to the largest supercomputers? Often, candidates start down a path of selecting a limit based on available memory (e.g., cap the stack at TotalPhysicalMemory / 2). But this 'solution' is neither sufficient to solve the problem (an attacker could just send two simultaneous requests) nor workable (the library is used on embedded devices where it may be necessary to exceed physical memory to parse some documents).
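What does work is bounding the parser's nesting depth itself. MSXML is C++ and I can't quote the real fix, but the shape of the guard is easy to sketch. Here is a minimal Python illustration using the standard library's SAX parser; the class, the exception, and the placeholder default are all mine, not the product's:

```python
import xml.sax


class DepthLimitExceeded(Exception):
    """Raised when element nesting exceeds the configured cap."""


class DepthLimitingHandler(xml.sax.ContentHandler):
    # Placeholder default: the next paragraph covers how to pick a real one.
    DEFAULT_MAX_DEPTH = 1_000

    def __init__(self, max_depth=DEFAULT_MAX_DEPTH):
        super().__init__()
        self.max_depth = max_depth
        self.depth = 0

    def startElement(self, name, attrs):
        # Every opening tag grows the stack; fail fast past the cap
        # instead of letting an attacker exhaust memory.
        self.depth += 1
        if self.depth > self.max_depth:
            raise DepthLimitExceeded(
                f"nesting depth {self.depth} exceeds limit {self.max_depth}")

    def endElement(self, name):
        self.depth -= 1
```

Feed this a pathological document of nothing but opening tags and it now fails fast with DepthLimitExceeded instead of quietly eating memory.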
So, we probably need to enforce a limit on the number of cascading opening tags. What should it be? 500? 1,000? 13,621? What number makes sense, and how do you come up with it? Simple: write a quick script to measure the nesting depth of any XML documents you can find (source repositories, the internet, etc.), add a little bit of buffer, and then make it easy for consumers of the API to override the default. If you do this, the number turns out to be 256 (although I swear it was closer to 15).
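For what it's worth, the 'quick script' doesn't need to be fancy. Here is a rough Python version, assuming a directory of sample documents as its argument; the doubling at the end is just one way to add the buffer:

```python
import sys
import xml.etree.ElementTree as ET
from pathlib import Path


def max_depth(elem, depth=1):
    """Deepest element nesting at or below `elem`."""
    return max([depth] + [max_depth(child, depth + 1) for child in elem])


deepest = scanned = 0
for path in Path(sys.argv[1]).rglob("*.xml"):
    try:
        deepest = max(deepest, max_depth(ET.parse(path).getroot()))
        scanned += 1
    except ET.ParseError:
        continue  # malformed samples tell us nothing about the limit

print(f"scanned {scanned} files; deepest nesting seen: {deepest}")
print(f"suggested default (2x buffer): {deepest * 2}")
```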
As engineers, we have a strong, intrinsic desire to make our code as flexible and adaptable as possible. We tend to resist putting limits and boundary conditions in our code – but even when we force ourselves to do it, we usually try to think of really big numbers ‘just to be safe.’ Instead, we need to take a little extra time to research and set reasonable, data-based limits.
Once you have set and enforced limits in your code, it is critical to test your system and understand the behaviour when those limits are hit. When I have failed to do this in the past, it has come back to bite me.
Intune’s hybrid MDM feature used a queue to temporarily hold device updates in the cloud until they could be downloaded and applied to the customer’s System Center Configuration Manager database. Dropping messages could result in devices getting orphaned – so it was essential to keep the queue size large enough to account for some customer downtime. I chose a big number that allowed our largest customer to be down for seven days³. We added code to enforce this limit, shipped it to production, and didn’t give it a second thought.
That is, until 2 am on a Saturday several months later, when the entire team got paged for a live-site incident⁴. One of our scale units was down, and we eventually discovered that one of our largest customers had changed its firewall rules and was failing to download messages. The max queue size hadn't been reached – but it turned out that once the queue grew to 75% of the limit, we hit a catastrophic performance bottleneck, and all calls to retrieve messages from the queue were failing for all customers, not just the one behind the firewall.
We were able to get the service back up and running pretty quickly – but I learned a vital lesson that day about testing the behaviour of a system when the limits we set are reached.
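The lesson generalizes into a test you can write up front: push the system toward its configured limit and measure behaviour at checkpoints along the way, because the cliff may come well before the cap (75% in our case). A generic sketch, assuming a queue object with size/enqueue/peek methods (not Intune's real interfaces):

```python
import time


def probe_up_to_limit(queue, make_message, max_size, checkpoints=20):
    """Fill `queue` toward `max_size`, timing a read at each checkpoint.

    Returns (fill_fraction, read_seconds) pairs; look for latency cliffs
    that show up long before the hard limit is ever reached.
    """
    results = []
    for i in range(1, checkpoints + 1):
        target = max_size * i // checkpoints
        while queue.size() < target:
            queue.enqueue(make_message())
        start = time.perf_counter()
        queue.peek()  # a representative read-path operation
        results.append((i / checkpoints, time.perf_counter() - start))
    return results
```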
There are two other important things to do when setting limits in a service. First, ensure that your code is instrumented and generates alerts when approaching or exceeding limits. Second, provide a way to override the limits with a config-only change that can be rapidly deployed to production if/when needed. There is no greater sin than hearing from a customer that calls to your service are failing because you didn't add proper monitoring, and no worse feeling than telling an incident manager that they will have to stay on the bridge for 8 hours while a one-line fix gets coded and deployed to dev, staging, pre-prod, and production.
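In code, both points amount to a few lines wrapped around every limit you set. A hypothetical sketch (the environment variable, thresholds, and logger name are illustrative, not a real service's configuration):

```python
import logging
import os

log = logging.getLogger("queue.limits")

# Overridable via environment/config, so fixing a bad default is a
# config push rather than an 8-hour code deployment.
MAX_QUEUE_SIZE = int(os.environ.get("MAX_QUEUE_SIZE", "1000000"))
WARN_FRACTION = 0.50   # emit telemetry while there is plenty of headroom
ALERT_FRACTION = 0.75  # page someone well before the hard limit


def admit_message(current_size):
    """Return True if another message may be enqueued; alert on the way up."""
    fraction = current_size / MAX_QUEUE_SIZE
    if fraction >= ALERT_FRACTION:
        log.critical("queue at %.0f%% of max size (%d items)",
                     fraction * 100, current_size)
    elif fraction >= WARN_FRACTION:
        log.warning("queue at %.0f%% of max size (%d items)",
                    fraction * 100, current_size)
    return current_size < MAX_QUEUE_SIZE
```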
Entire classes can (and probably should) be taught on the art and the science of setting and enforcing limits in code and engineering graceful degradation (as opposed to catastrophic failure) in services. But hopefully, you can learn from some of my mistakes above.
Be Happy!
Like this post? Please consider sharing, checking out my other articles, and following me here on LinkedIn for more articles on software engineering and careers in tech.
Footnotes
Please note that the opinions stated here are my own, not those of my company.