Setting Reasonable Limits (and testing them)

If I have interviewed you for a job in the last several years, we probably talked about one of the first Microsoft XML parsers¹.  The parser used a stack to match beginning and ending tags, and the developers didn’t think to include any kind of limit on the size of the stack.  Why would they?  Who would ever think to create a document that only contained opening tags?  And even if they did, the number of opening tags would be bounded by the size of the XML document, right?  Unfortunately, this was very wrong.  

At the same time as the developer was writing the core parser, another developer was implementing support for XSLT higher up in the stack. This had a cool feature that enabled embedded JavaScript to transform the document as it was being parsed². Thus, it was possible to craft a relatively small XML document that contained nothing but opening tags.  Expose this as an endpoint on the Internet, and bam! - you have a denial-of-service attack waiting to happen. And since the XML parser was used by virtually every Windows Server application in the world – it was a denial-of-service attack with a pretty big blast radius.

During interviews, I ask candidates to put themselves in the team's shoes and explain how they would fix the bug.  It is pretty obvious you can/should/must limit the number of opening tags put on the stack – but what should the limit be?  How do you set a limit that works for a library that runs on everything from tiny embedded devices to the largest supercomputers? Often, candidates start down a path of selecting a limit based on available memory (e.g., cap the stack at TotalPhysicalMemory / 2).  But this ‘solution’ is neither sufficient to solve the problem (an attacker could just send two simultaneous requests) nor workable (the library is used on embedded devices where it may be necessary to exceed physical memory to parse some documents).  

So, we probably need to enforce a limit on the depth of nested opening tags. What should this be?  500? 1000? 13,621? What number makes sense, and how do you come up with it?  Simple: write a quick script to analyze any XML documents you can find (source repositories, internet, etc.), add a little bit of a buffer, and then make it easy for consumers of the API to override the default.   If you do this – the number turns out to be 256 (although I swear it was closer to 15).  
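The "quick script" could look something like the sketch below. This is not the original MSXML analysis; it is a minimal Python illustration using the standard library's xml.etree.ElementTree. The names nesting_depth, suggest_limit, and DepthLimitExceeded, and the 2x headroom factor, are my own assumptions. Only the default of 256 comes from the story above.

```python
import io
import xml.etree.ElementTree as ET

DEFAULT_MAX_DEPTH = 256  # default from the article; callers can override it


class DepthLimitExceeded(Exception):
    """Raised when a document nests elements deeper than the allowed limit."""


def nesting_depth(source, max_depth=DEFAULT_MAX_DEPTH):
    """Return the deepest element nesting in an XML file or file-like object.

    Raises DepthLimitExceeded as soon as the limit is crossed, rather than
    continuing to track an ever-growing stack of open tags.
    """
    depth = 0
    deepest = 0
    for event, element in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            if depth > max_depth:
                raise DepthLimitExceeded(f"nesting depth exceeded {max_depth}")
            deepest = max(deepest, depth)
        else:
            depth -= 1
            element.clear()  # drop content we no longer need
    return deepest


def suggest_limit(observed_depths, headroom=2):
    """Deepest nesting seen in the sample, times a buffer (hypothetical 2x)."""
    return max(observed_depths) * headroom
```

To survey a corpus, call nesting_depth on each document with a deliberately high max_depth (so the survey itself does not trip the limit), then pass the collected depths to suggest_limit. For example, nesting_depth(io.BytesIO(b"<a><b><c/></b></a>")) returns 3.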

As engineers, we have a strong, intrinsic desire to make our code as flexible and adaptable as possible.  We tend to resist putting limits and boundary conditions in our code – but even when we force ourselves to do it, we usually try to think of really big numbers ‘just to be safe.’ Instead, we need to take a little extra time to research and set reasonable, data-based limits.

Once you have set and enforced limits in your code, it is critical to test your system and understand the behaviour when those limits are hit.  When I have failed to do this in the past, it has come back to bite me.

Intune’s hybrid MDM feature used a queue to temporarily hold device updates in the cloud until they could be downloaded and applied to the customer’s System Center Configuration Manager database. Dropping messages could result in devices getting orphaned – so it was essential to keep the queue size large enough to account for some customer downtime. I chose a big number that allowed our largest customer to be down for seven days³. We added code to enforce this limit, shipped it to production, and didn’t give it a second thought.

That is, until 2 am on a Saturday several months later, when the entire team got paged for a live site⁴. One of our scale units was down, and we eventually discovered that one of our largest customers had changed its firewall rules and was failing to download messages.  The max queue size hadn’t been reached – but it turned out that once the queue grew to 75% of the limit, we hit a catastrophic performance bottleneck and all calls to retrieve messages from the queue were failing for all customers.

We were able to get the service back up and running pretty quickly – but I learned a vital lesson that day about testing the behaviour of a system when the limits we set are reached.  
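One way to catch this kind of surprise earlier is to fill the queue to several fractions of its limit and measure how retrieval behaves at each level. The sketch below is only an illustration: an in-memory deque stands in for the real cloud queue, and probe_dequeue_latency and its parameters are hypothetical names, not anything from the Intune codebase. A real test would target the actual queue in a staging environment.

```python
import time
from collections import deque


def probe_dequeue_latency(make_queue, limit,
                          fractions=(0.25, 0.5, 0.75, 1.0), samples=100):
    """Fill a fresh queue to each fraction of `limit` and time retrievals.

    Returns {fraction: average seconds per dequeue}. A sharp jump between
    fractions suggests a bottleneck that appears well before the hard limit.
    """
    results = {}
    for fraction in fractions:
        queue = make_queue()
        fill = int(limit * fraction)
        for message_id in range(fill):
            queue.append(message_id)
        count = min(samples, fill)
        start = time.perf_counter()
        for _ in range(count):
            queue.popleft()
        elapsed = time.perf_counter() - start
        results[fraction] = elapsed / count if count else 0.0
    return results


if __name__ == "__main__":
    for fraction, latency in probe_dequeue_latency(deque, limit=100_000).items():
        print(f"{fraction:.0%} full: {latency * 1e6:.2f} microseconds per dequeue")
```

The useful part is less the absolute numbers than the shape of the curve: if latency at 75% looks nothing like latency at 25%, the effective limit is lower than the configured one.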

There are two other important things to do when setting limits in a service.  First, ensure that your code is instrumented and generates alerts when approaching or exceeding limits. Second, provide a way to override the limits with a config-only change that can rapidly be deployed to production if/when needed.  There is no greater sin than having to hear from a customer that calls to your service are failing because you didn’t add proper monitoring, and no worse feeling than having to tell an incident manager that they will have to stay on the bridge for 8 hours as a one-line fix gets coded and deployed to dev, staging, pre-prod, and production.  
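Both ideas fit in a few lines. The sketch below is a hedged Python illustration, not how any particular service does it: load_limit, check_limit, the QUEUE_MAX_SIZE setting name, and the 50% warning threshold are all assumptions made for the example. The limit is read from configuration (here, an environment-style mapping) so it can be changed without a code deployment, and a warning is logged well before the hard limit is reached.

```python
import logging
import os

logger = logging.getLogger("limits")


def load_limit(name, default, env=None):
    """Read a limit from configuration, falling back to a safe default.

    `env` is any mapping of setting names to strings; os.environ is used
    when none is given. Invalid or non-positive values are ignored.
    """
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        logger.error("ignoring invalid value %r for %s", raw, name)
        return default
    if value <= 0:
        logger.error("ignoring non-positive value %d for %s", value, name)
        return default
    return value


def check_limit(name, current, limit, warn_fraction=0.5):
    """Classify usage against a limit and log so alerts can be wired to it."""
    if current >= limit:
        logger.error("%s limit reached: %d of %d", name, current, limit)
        return "exceeded"
    if current >= warn_fraction * limit:
        logger.warning("%s approaching limit: %d of %d", name, current, limit)
        return "warning"
    return "ok"
```

For example, with limit = load_limit("QUEUE_MAX_SIZE", 10_000), a call to check_limit("queue", 7_500, limit) returns "warning" and emits a log line that an alerting rule can key on.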

Entire classes can (and probably should) be taught on the art and the science of setting and enforcing limits in code and engineering graceful degradation (as opposed to catastrophic failure) in services. But hopefully, you can learn from some of my mistakes above.

Be Happy!

Footnotes

  1. I find it amusing that I have recently found myself explaining what XML is to some new CS grads when I ask the question.  When I was interviewing for jobs a couple of decades ago, I included XML in the “languages” section of my resume and landed more than one interview based solely on XML (even at the time, I thought that was a bit odd).  The rise and fall of XML as the interchange format of choice can/will be an interesting story to tell in another article.
  2. Okay, I say ‘cool feature,’ but c’mon, an embedded scripting engine in a data file?  That was just asking for trouble and has been disabled by default for the last decade.
  3. In retrospect, this was a pretty bad way to calculate the limit.  I most certainly should have chosen a variable limit based on the size of the customer (i.e. smaller queue for smaller customers), and we should have had a better queue eviction policy (e.g. drop less important messages from the queue as the limit was approached).
  4. I can’t find someone who has already claimed it, so for the record, I will call this Flegg’s Law: If you fail to properly set, handle, and test limits in your service, the limits will inevitably be exceeded at the most inconvenient time possible (be that Cyber Monday, right before your manager was going to put you up for promotion, or 2 am on a Saturday morning when the developer responsible is backpacking through the Cascades).


Please note that the opinions stated here are my own, not those of my company.

Matt Shadbolt

Principal Product Architect | Windows 365, Azure Virtual Desktop & Remote Desktop Services

3y

I remember that FW rule outage. Doesn't help that the Firewall CSP design is so inefficient... but does reinforce your main point, that (especially when you're dependent on others' code) limits are necessary but tricky to define.

I owned MSXML on Windows CE and had to port the change you're talking about over. I wish I'd been interviewed by you since I would've just written 256 on the board and dropped the marker:).
