Great question. Throttling and backend safety are active design priorities for our MCP Servers, so the short answer is yes. Details:
Outbound (MCP Server → BAM (or our other backend systems))
- Concurrency: The MCP Servers are designed for concurrency and drive asynchronous API calls to the backend systems. We limit that concurrency with a configurable parameter that sets the maximum number of permits (a task requires a permit to execute). We can estimate the cost of most API calls from a quick survey of the size of the data set, which lets us reserve a faster path for simple API calls so the permits are never all “reserved” by potentially longer-running tasks.
- Atomic FSM Circuit Breaker: While we can estimate the cost of API calls, there are events outside our control where BAM (or any of our products) may be under load, and APIs can fail with a 5xx or similar error. After a configurable number of errors, the MCP Server backs off so it does not add insult to injury. We use an atomic finite state machine algorithm that backs off and then slowly starts to retry. The current state is available via Prometheus metrics, OTEL export, and logging. The MCP Server also lets the MCP client know the state so the user doesn’t get unintelligible errors.
- 401 Refresh Rate Limiter: If there is an issue with authentication, repeated retries can lock out user accounts, among other problems. We stop before that can happen and, again, let the user know.
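To make the circuit breaker behavior above concrete, here is a minimal Python sketch. It is an illustration only and not our actual implementation: the class name, the parameter names (`failure_threshold`, `cooldown_seconds`), and the three state names are assumptions, and a plain lock stands in for whatever atomic mechanism the real server uses.

```python
import threading
import time


class CircuitBreaker:
    """Illustrative three-state breaker: CLOSED -> OPEN -> HALF_OPEN -> CLOSED."""

    def __init__(self, failure_threshold=5, cooldown_seconds=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # hypothetical parameter name
        self.cooldown_seconds = cooldown_seconds    # hypothetical parameter name
        self._clock = clock                         # injectable for testing
        self._lock = threading.Lock()               # all transitions happen under this lock
        self.state = "CLOSED"
        self._failures = 0
        self._opened_at = 0.0

    def allow_request(self):
        """Return True if a backend call may be attempted right now."""
        with self._lock:
            if self.state == "OPEN":
                if self._clock() - self._opened_at < self.cooldown_seconds:
                    return False          # still backing off
                self.state = "HALF_OPEN"  # cooldown elapsed: allow a single probe
                return True
            if self.state == "HALF_OPEN":
                return False              # a probe is already in flight
            return True                   # CLOSED: normal operation

    def record_success(self):
        """A backend call succeeded: reset the counter and close the breaker."""
        with self._lock:
            self._failures = 0
            self.state = "CLOSED"

    def record_failure(self):
        """A backend call failed (for example with a 5xx)."""
        with self._lock:
            self._failures += 1
            # A failed probe, or too many failures while closed, opens the breaker.
            if self.state == "HALF_OPEN" or self._failures >= self.failure_threshold:
                self.state = "OPEN"
                self._opened_at = self._clock()
```

A caller would check `allow_request()` before each backend call and report the outcome with `record_success()` or `record_failure()`. The `state` attribute is the kind of value that could be exported as a metric or passed back to the MCP client.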
Inbound (LiveAssist, or any other MCP client → MCP Server)
- Token bucket rate limiting: The default is 5 requests per second, with bursts of up to 20, per session. Refused requests get `HTTP 429 Too Many Requests` with `Retry-After`, so it is clear to the caller. This, again, is tunable. Note that even if you set it to 1000, the concurrency meter described above in the outbound section would still protect the server. The difference is that the concurrency meter waits on asynchronous tasks (they stack up and are executed in an orderly fashion), while the token bucket fails back to the client with a 429.
- Global in-flight concurrency cap: While protecting BAM, we don’t want to knock over the MCP Server itself. If tasks are stacking up on the MCP Server, it will start returning 503 errors until the number of awaited tasks drops below the threshold. Again, this is tunable.
- Response-level caps: The MCP Servers aren’t intended to be data extraction tools, and there is no reason to blow out an LLM’s context by sending it a ton of data. However, we need to guard against an LLM thinking it has a full data set when we are only returning part of it. Therefore, if the data is summarized or truncated, we return “partial-result metadata” (both a coverage percentage and a truncation reason) so the LLM never mistakes a trimmed view for reality. The truncation reason includes advice on how to get the intended result, for example suggested filtering. Further, if the user simply does want the data, we provide a one-time-use, cryptographically signed URL with a 5 minute TTL that can be used to get the data directly, as opposed to via the LLM.
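As an illustration of the token bucket behavior described above, here is a minimal Python sketch using the stated defaults (5 requests per second, burst of 20). It is a sketch under assumptions, not our actual implementation: the class, method, and function names are hypothetical, and it is not thread-safe.

```python
import math
import time


class TokenBucket:
    """Illustrative token bucket; one instance would exist per session."""

    def __init__(self, rate_per_second=5.0, burst=20, clock=time.monotonic):
        self.rate = rate_per_second   # tokens added per second
        self.capacity = burst         # maximum tokens the bucket can hold
        self._clock = clock           # injectable for testing
        self._tokens = float(burst)   # start full so an initial burst is allowed
        self._last = clock()

    def try_acquire(self):
        """Return (allowed, seconds_until_next_token)."""
        now = self._clock()
        # Refill according to elapsed time, never exceeding the burst capacity.
        self._tokens = min(self.capacity,
                           self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True, 0.0
        return False, (1.0 - self._tokens) / self.rate


def check_request(bucket):
    """Map the bucket decision to an HTTP status and headers (hypothetical helper)."""
    allowed, retry_after = bucket.try_acquire()
    if allowed:
        return 200, {}
    # Retry-After takes whole seconds, so round the wait up.
    return 429, {"Retry-After": str(math.ceil(retry_after))}
```

With the defaults, a session can send 20 requests back to back, after which requests are refused with a 429 until tokens refill at 5 per second.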
As I mentioned on the call, we test against very large-scale systems. For instance, for BAM: > 10k zones, > 300 servers, > 1MM resource records, with similarly large scale on blocks, networks, and IPs. This is how we are setting the defaults for some of the parameters above. We also generate traffic while testing, so these are not static systems. This work is ongoing, and we have yet to publish the guidance on high-volume systems, but everything above represents our work to date toward your ask. We will certainly be delivering that guidance.
Happy to go deeper on any of these. If you have your BAM instance size (block / network / zone / record counts) and an expected query pattern, API load, etc., you are welcome to share the numbers, and we can walk through how the limits above apply to your deployment.
Also, after reading the above, please let me know if you’d like any further clarification. Hopefully it makes sense and gives you some confidence that the team has put serious thought and engineering into this.
Andrew
P.S. We need rate limiting in BAM as well, and hope to update you on that soon.