High Performance Optimization
The grpc_graphql_gateway is designed for extreme throughput requirements, capable of handling 100,000+ requests per second (RPS) per instance. To achieve these targets, the gateway employs several advanced architectural optimizations.
Performance Targets
With High-Performance mode enabled:
- 100K+ RPS: For cached queries serving from memory.
- 50K+ RPS: For uncached queries performing gRPC backend calls.
- Sub-millisecond P99: Latency for cache hits.
Key Optimizations
SIMD-Accelerated JSON Parsing
Standard JSON parsing is often the primary bottleneck in GraphQL gateways. We use simd-json, which employs SIMD (Single Instruction, Multiple Data) instructions (AVX2, SSE4.2, NEON) to parse JSON.
- 2x – 5x faster than
serde_jsonfor typical payloads. - Reduced CPU cycles per request, allowing more concurrency on the same hardware.
Lock-Free Sharded Caching
Global locks cause severe contention as CPU core counts increase. Our ShardedCache implementation:
- Splits the cache into 64 – 128 independent shards.
- Uses lock-free reads and independent write locks per shard.
- Eliminates the “Global Lock” bottleneck.
Object Pooling
Memory allocation is expensive at 100K RPS. We use high-performance object pools for request/response buffers:
- Zero-allocation steady state for many request patterns.
- Pre-allocated buffers are returned to a lock-free
ArrayQueuefor reuse.
Connection Pool Tuning
The gateway automatically tunes gRPC and HTTP/2 settings for maximum throughput:
- HTTP/2 Prior Knowledge: Skips nested version negotiation.
- Adaptive Window Sizes: Optimizes flow control for high-bandwidth/low-latency local networks.
- TCP NoDelay: Disables Nagle’s algorithm for immediate packet dispatch.
Configuration
Enable High-Performance mode in your GatewayBuilder:
use grpc_graphql_gateway::{Gateway, HighPerfConfig};
#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
Gateway::builder()
// ... standard config ...
.with_high_performance(HighPerfConfig::ultra_fast())
.build()?
.serve("0.0.0.0:8888")
.await
}
Configuration Profiles
We provide three pre-tuned profiles:
| Profile | Use Case |
|---|---|
ultra_fast() | Maximum Throughput: Optimized for 100K+ RPS. |
balanced() | Balanced: Good mix of throughput and latency. |
low_latency() | Low Latency: Optimized for minimal response time over raw RPS. |
Benchmarking
We include a performance benchmark suite in the repository.
# Start the example server
cargo run --example greeter --release
# Run the benchmark
cargo run --bin benchmark --release -- --concurrency=200 --duration=30
For a complete automated test, use ./benchmark.sh which handles builds and runs multiple profiles.