In a previous blog post, we tackled the problem of scaling software companies by adding more people.

In this post, we will focus on another common problem: scaling software to bigger customers.

This usually comes with three distinct challenges that need to be solved at the same time:

New feature requests
Load of larger clients
Bugs related to larger clients

I have personally been on the receiving end when this challenge was underestimated. Hopefully, the stories below will convince you not to make the same mistake.

Feature requests

Generally, when you take on new customers that are bigger than your current ones, they will need features that you are less likely to have.

Common requirements are:

Compliance
SSO
(More) granular permissions or sub-organizations
Full-featured analytics and/or data export API
Audit log
If there is user-generated content: moderation tools for dealing with nastiness

The above are mostly self-explanatory. The last one is a bit trickier. If the customer has users who can somehow put content into the product, then your customer will need tools to manage that content. Maybe you will too. At scale, there will always be bad apples who try to put nasty content into the product or misbehave in some other way.

In general, there is not much you can do about these feature requests except to build them.

Load of larger clients

One important thing to understand is that larger clients might put considerably more load on your system than might be expected.

The load might also show up in unexpected places.

For example, one product I worked on had a rough "load formula" along these lines: load = max over all queues of (conversations in the queue * agents assigned to the queue). The total number of agents, conversations and queues was largely irrelevant compared to the number of conversations and agents assigned to the biggest queue. We then onboarded a customer that was big relative to our total numbers (~40% increase), but in terms of the load experienced by the system it was a lot bigger (for some parts up to 25x).

The load was generated by a single queue which was monstrous by the standards at the time.
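The load formula can be written down as a few lines of Python. This is only a sketch of the rough formula described here, not the real system's capacity model; the Queue type, the estimated_load function and all numbers are made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Queue:
    """Hypothetical queue with the two numbers the formula cares about."""
    conversations: int
    agents: int


def estimated_load(queues: Iterable[Queue]) -> int:
    """Load is driven by the single worst queue, not by the totals."""
    return max((q.conversations * q.agents for q in queues), default=0)


# Invented numbers: ten existing queues of equal size.
existing = [Queue(conversations=100, agents=20) for _ in range(10)]
# One new customer who puts everything into a single queue.
big_customer = Queue(conversations=400, agents=80)

load_before = estimated_load(existing)
load_after = estimated_load(existing + [big_customer])
```

With these invented numbers, total conversations and agents each grow by 40%, while the estimate goes from 2000 to 32000, a 16x increase. The exact ratio depends entirely on the shape of the new customer's biggest queue.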

Most of these sorts of problems come in one of two flavors:

Unbounded result sets
Bottlenecks caused by either
- Synchronization requirements
- Lack of resources

Unbounded result sets should never happen. Always paginate, and make sensible decisions about what to do if the number of items is large. But unbounded result sets can take more forms than is immediately obvious.

For example, one of the major culprits when we onboarded the big customer was that we wanted to update dashboards. We did this by sending an update every time a conversation was queued or dequeued. This generated so much load that the system was barely keeping up. The dashboard itself also looked pretty stupid, as it was updating at a rate of up to 25 times per second. A small targeted fix capped this at 1 update per second, and the system immediately reacted by scaling down some services to 1/10 of their capacity. In this case, the "unbounded result set" was an unbounded stream of updates. By capping it, we significantly reduced the load.
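A cap like the one described for the dashboard can be sketched as a small throttle that keeps only the latest state and forwards it at most once per interval. This is an illustration of the general technique, not the fix we actually shipped; ThrottledPublisher, send and clock are hypothetical names.

```python
import time
from typing import Any, Callable


class ThrottledPublisher:
    """Forwards at most one update per min_interval seconds.

    Intermediate states are dropped and only the newest one is sent,
    which is fine for a dashboard that shows the current state.
    """

    def __init__(
        self,
        send: Callable[[Any], None],
        min_interval: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._send = send
        self._min_interval = min_interval
        self._clock = clock
        self._last_sent = None  # time of the last forwarded update
        self._pending = None
        self._has_pending = False

    def update(self, state: Any) -> None:
        """Record the newest state and forward it if the interval allows."""
        self._pending = state
        self._has_pending = True
        self.flush()

    def flush(self) -> bool:
        """Send the pending state if allowed. Returns True if something was sent."""
        if not self._has_pending:
            return False
        now = self._clock()
        if self._last_sent is not None and now - self._last_sent < self._min_interval:
            return False
        self._send(self._pending)
        self._last_sent = now
        self._has_pending = False
        return True
```

Something (a timer, for example) has to call flush() periodically, otherwise the last update of a burst stays pending until the next event arrives.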

Bottlenecks can also be interesting and sometimes difficult to spot. Another example from the same product was our use of Redis. Among other things, it was used for broadcasting events to all agents. As the load on the Redis instance got high, we started adding read replicas to it. Before onboarding the big customer, we added more read replicas to be on the safe side. This worked well until the main Redis instance started to get overloaded. It was running out of memory for what seemed like no good reason. After profusely sweating over this for 10 minutes, it dawned on us that the read replicas were saturating the outgoing network bandwidth of the main Redis instance. This caused it to hold onto updates, which very quickly meant it ran out of memory. The fix in this case was to remove the read replicas that had been added in anticipation of the load.
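The arithmetic behind this is simple once you see it: the main instance sends a full copy of its write stream to every replica, so its outbound traffic grows linearly with the number of replicas. Below is a back-of-the-envelope sketch with invented numbers. The function names are hypothetical, and the model ignores protocol overhead and traffic to regular clients unless you pass it in as other_egress_mbit.

```python
def primary_egress_mbit(write_mbit: float, replicas: int,
                        other_egress_mbit: float = 0.0) -> float:
    """Rough outbound bandwidth of the main instance in Mbit/s.

    Every replica receives its own copy of the write stream.
    """
    return write_mbit * replicas + other_egress_mbit


def max_replicas(write_mbit: float, link_mbit: float,
                 other_egress_mbit: float = 0.0) -> int:
    """How many replicas fit on the link before it saturates."""
    if write_mbit <= 0:
        raise ValueError("write_mbit must be positive")
    return max(0, int((link_mbit - other_egress_mbit) // write_mbit))


# Invented example: 150 Mbit/s of writes on a 1000 Mbit/s link.
with_four = primary_egress_mbit(150, replicas=4)   # fits on the link
with_eight = primary_egress_mbit(150, replicas=8)  # exceeds the link
```

In this made-up scenario four replicas need 600 Mbit/s and fit comfortably, while eight would need 1200 Mbit/s on a 1000 Mbit/s link. Adding replicas "to be on the safe side" is only safe if the write volume stays where it was.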

In conclusion: scaling bottlenecks are not always obvious, but when designing systems, try to think about "what happens if I turn up the volume to 11". Is there a result set that might grow unbounded? Or an operation that might run too often?
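For the first question, a result set that might grow unbounded, the usual answer is pagination with a hard cap on the page size. Here is a minimal sketch using SQLite from the Python standard library. The conversations table, its columns and MAX_PAGE_SIZE are assumptions for the example, not the schema of the product discussed above.

```python
import sqlite3
from typing import List, Optional, Tuple

MAX_PAGE_SIZE = 100  # hypothetical hard cap, no matter what the caller asks for


def fetch_page(
    conn: sqlite3.Connection,
    queue_id: int,
    after_id: int = 0,
    limit: int = 50,
) -> Tuple[List[tuple], Optional[int]]:
    """Return one page of conversations and a cursor for the next page.

    Uses keyset pagination (id > cursor) so the cost of a page does not
    grow with how deep into the result set we are.
    """
    limit = max(1, min(limit, MAX_PAGE_SIZE))
    rows = conn.execute(
        "SELECT id, subject FROM conversations "
        "WHERE queue_id = ? AND id > ? ORDER BY id LIMIT ?",
        (queue_id, after_id, limit + 1),  # one extra row tells us if more exist
    ).fetchall()
    page = rows[:limit]
    next_cursor = page[-1][0] if len(rows) > limit else None
    return page, next_cursor
```

A caller keeps passing the returned cursor as after_id until it comes back as None. Even a request for a million rows is clamped to MAX_PAGE_SIZE, so a monstrous queue costs more round trips, not one enormous response.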

Bugs related to larger clients

Larger customers usually entail more users and this tends to mean that code paths are exercised more regularly. Something that happens maybe once a month for a small customer might happen several times a day for a large one. The nature of these bugs is that they are often difficult to reproduce and only manifest under very specific conditions. This means that these can be an unusually big drain on engineering productivity.

One example from the same customer onboarding was that they used email extensively. Email is an old protocol, and many clients do not follow the spec to the letter. This means that you will sometimes have to deal with edge cases that are not covered by the spec. We didn't deal with those until the customer demanded a fix, because our inbound email handling failed to react properly in about 0.2% of the cases. This was an internal KPI for the customer, so the entire deal was at risk. It was not that difficult to fix, but this was mostly luck on our part.

Work done a year earlier to fix a completely different problem helped us tremendously here. Without that, we would have been in a bit of a bind. We did have a somewhat extensive PoC, but it seemed that many of the edge cases were geographically specific and the PoC was limited to a single region.

Conclusion

This was mostly a random collection of stories. The one point I want to convey here is that one should not underestimate the impact of scaling to larger customers.

Large customers tend to create several different types of problems. You could be hit with a barrage of feature requests, frequent downtime and especially nasty bugs, all at the same time. This can easily stretch the engineering department to its limits.

Scaling the engineering department before adding the large customers might not be a bad idea.