Autoscaling and Load Balancing
This guide covers configuring autoscaling and load balancing for the gRPC GraphQL Gateway.
Overview
The gateway supports three types of scaling and load balancing:
- Horizontal Pod Autoscaler (HPA) - Scales the number of pods based on metrics
- Vertical Pod Autoscaler (VPA) - Adjusts resource requests/limits for pods
- LoadBalancer - External load balancing for traffic distribution
Horizontal Pod Autoscaler (HPA)
HPA automatically scales the number of pods based on observed CPU, memory, or custom metrics.
Basic Configuration
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80
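To see how a utilization target turns into a replica count, the rule documented for the Kubernetes HPA controller can be written out as follows. The numbers in the second line (3 replicas averaging 90% CPU) are illustrative assumptions, not measurements from the gateway.

```latex
% Scaling rule from the Kubernetes HPA documentation
\[
\text{desiredReplicas} =
  \left\lceil \text{currentReplicas} \times
  \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
\]
% Illustrative case: 3 replicas at 90% average CPU against a 70% target
\[
\left\lceil 3 \times \frac{90}{70} \right\rceil = \lceil 3.86 \rceil = 4
\]
```

The result is clamped to the range between minReplicas and maxReplicas. When both CPU and memory targets are set, the HPA evaluates each metric and uses the largest replica count. Utilization is measured relative to the containers' resource requests, so the pods need requests defined for these targets to work.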
Custom Metrics
For advanced scaling based on custom metrics:
autoscaling:
  enabled: true
  minReplicas: 5
  maxReplicas: 50
  customMetrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
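For orientation, the values above correspond roughly to a HorizontalPodAutoscaler like the sketch below. The resource and Deployment name `my-gateway` is an assumption based on the release name used elsewhere in this guide; the manifest the chart actually renders may differ.

```yaml
# Sketch of an autoscaling/v2 HPA matching the custom-metrics values above.
# "my-gateway" is an assumed name; check the chart's rendered output.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-gateway
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-gateway
  minReplicas: 5
  maxReplicas: 50
  metrics:
    # Scale so that each pod averages about 1000 requests per second
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
```

Pods metrics are read from the custom metrics API (`custom.metrics.k8s.io`), so the cluster needs a metrics adapter, for example Prometheus Adapter, that exposes `http_requests_per_second`. Without one, the HPA cannot read the metric and will not scale on it.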
Deployment
helm install my-gateway ./grpc-graphql-gateway \
--set autoscaling.enabled=true \
--set autoscaling.minReplicas=3 \
--set autoscaling.maxReplicas=10
Monitoring HPA
# Watch HPA status
kubectl get hpa -w
# Describe HPA for detailed metrics
kubectl describe hpa my-gateway
# View current metrics
kubectl top pods -l app.kubernetes.io/name=grpc-graphql-gateway
Vertical Pod Autoscaler (VPA)
VPA automatically adjusts CPU and memory requests/limits based on actual usage.
Prerequisites
Install VPA in your cluster:
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
Configuration
verticalPodAutoscaler:
  enabled: true
  updateMode: "Auto"  # Off, Initial, Recreate, Auto
  minAllowed:
    cpu: 100m
    memory: 128Mi
  maxAllowed:
    cpu: 2000m
    memory: 2Gi
  controlledResources:
    - cpu
    - memory
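As a rough guide to what these values mean at the Kubernetes level, a VerticalPodAutoscaler object with the same settings might look like the sketch below. The name `my-gateway` and the wildcard container policy are assumptions; the chart's own template may be structured differently.

```yaml
# Sketch of a VPA object equivalent to the values above.
# "my-gateway" and containerName "*" are assumptions, not taken from the chart.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-gateway
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-gateway
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      # "*" applies the bounds to every container in the pod
      - containerName: "*"
        minAllowed:
          cpu: 100m
          memory: 128Mi
        maxAllowed:
          cpu: 2000m
          memory: 2Gi
        controlledResources: ["cpu", "memory"]
```

The minAllowed and maxAllowed bounds cap the recommendations, so VPA never sets requests below 100m CPU / 128Mi or above 2000m CPU / 2Gi per container.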
Update Modes
| Mode | Description | Use Case |
|---|---|---|
| Off | Only provides recommendations | Safe to use with HPA |
| Initial | Applies recommendations on pod creation only | Good for initial sizing |
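| Recreate | Applies recommendations at pod creation and evicts running pods to apply updated ones | When pod restarts are acceptable in exchange for automatic updates |
| Auto | Applies recommendations using VPA's preferred update mechanism (eviction, the same as Recreate, in most releases) | Full automation |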
Using VPA with HPA
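⚠️ Important: VPA and HPA should not act on the same metrics (CPU/Memory). VPA changes the pods' resource requests, which shifts the utilization percentages HPA scales on, so the two controllers can work against each other.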
Recommended Setup:
# Use VPA in "Off" mode for recommendations
verticalPodAutoscaler:
  enabled: true
  updateMode: "Off"

# Use HPA for horizontal scaling
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
Alternative: Use VPA for CPU/Memory and HPA for custom metrics.
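A minimal sketch of that alternative, reusing only the values keys shown earlier in this guide. It assumes the chart omits the CPU and memory metrics from the HPA when their targets are not set; verify this against the chart's default values before relying on it.

```yaml
# Hypothetical values: VPA owns CPU/memory sizing, HPA scales on a custom metric only.
verticalPodAutoscaler:
  enabled: true
  updateMode: "Auto"
  controlledResources:
    - cpu
    - memory

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
  # Assumption: with no targetCPUUtilizationPercentage or
  # targetMemoryUtilizationPercentage set, the HPA uses only the metric below.
  customMetrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
```

With this split, the two autoscalers react to different signals: request rate drives the replica count, and observed resource usage drives per-pod sizing.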
Viewing VPA Recommendations
# Get VPA status
kubectl describe vpa my-gateway
# View recommendations
kubectl get vpa my-gateway -o jsonpath='{.status.recommendation}'
LoadBalancer Service
LoadBalancer provides external access with cloud provider integration.
Basic Configuration
loadBalancer:
  enabled: true
  httpPort: 80
  grpcPort: 50051
  externalTrafficPolicy: Cluster
AWS Network Load Balancer
loadBalancer:
  enabled: true
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-internal: "false"
  externalTrafficPolicy: Local  # Preserve source IP
  loadBalancerSourceRanges:
    - "10.0.0.0/8"  # Restrict to VPC
Google Cloud Load Balancer
loadBalancer:
  enabled: true
  annotations:
    cloud.google.com/load-balancer-type: "Internal"
    cloud.google.com/backend-config: '{"default": "backend-config"}'
  externalTrafficPolicy: Cluster
Azure Load Balancer
loadBalancer:
  enabled: true
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
  loadBalancerIP: "10.0.0.10"  # Static internal IP
External Traffic Policy
| Policy | Pros | Cons |
|---|---|---|
| Cluster | Even load distribution across nodes | Loses source IP |
| Local | Preserves source IP, lower latency | May cause uneven load distribution |
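To make the policy concrete, the sketch below shows where externalTrafficPolicy sits on a plain Kubernetes Service of type LoadBalancer. The Service name follows the `my-gateway-lb` name used in the troubleshooting commands; the `http` and `grpc` target port names are assumptions about how the gateway container names its ports.

```yaml
# Sketch of a LoadBalancer Service with source IP preservation.
# targetPort names are hypothetical; match them to the container's port names.
apiVersion: v1
kind: Service
metadata:
  name: my-gateway-lb
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local  # Route only to pods on the receiving node
  selector:
    app.kubernetes.io/name: grpc-graphql-gateway
  ports:
    - name: http
      port: 80
      targetPort: http
    - name: grpc
      port: 50051
      targetPort: grpc
```

With Local, a node that has no ready gateway pod fails the load balancer's health check and receives no traffic, which is why the policy can produce uneven load unless pods are spread evenly across nodes.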
Complete Example
Production Deployment with All Features
# values-production.yaml
loadBalancer:
  enabled: true
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
  externalTrafficPolicy: Local
  httpPort: 80
  loadBalancerSourceRanges:
    - "0.0.0.0/0"

autoscaling:
  enabled: true
  minReplicas: 5
  maxReplicas: 50
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80

verticalPodAutoscaler:
  enabled: true
  updateMode: "Off"  # Get recommendations without conflicts
  minAllowed:
    cpu: 250m
    memory: 256Mi
  maxAllowed:
    cpu: 4000m
    memory: 4Gi

resources:
  limits:
    cpu: 2000m
    memory: 2Gi
  requests:
    cpu: 1000m
    memory: 1Gi

podDisruptionBudget:
  enabled: true
  minAvailable: 3

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app.kubernetes.io/name
              operator: In
              values:
                - grpc-graphql-gateway
        topologyKey: kubernetes.io/hostname
Deploy:
helm install gateway ./grpc-graphql-gateway \
-f helm/values-production.yaml \
--namespace production \
--create-namespace
Load Balancing Strategies
At Service Level
service:
  sessionAffinity: ClientIP  # Sticky sessions
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800  # 3 hours
At Ingress Level
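ingress:
  annotations:
    # Round Robin (default)
    nginx.ingress.kubernetes.io/load-balance: "round_robin"
    # Peak EWMA (latency-aware); current ingress-nginx releases accept only
    # round_robin and ewma for this annotation, not least_conn or ip_hash
    # nginx.ingress.kubernetes.io/load-balance: "ewma"
    # Client IP hashing (use in place of ip_hash)
    # nginx.ingress.kubernetes.io/upstream-hash-by: "$remote_addr"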
Monitoring and Troubleshooting
Check Load Distribution
# View pod distribution across nodes
kubectl get pods -o wide -l app.kubernetes.io/name=grpc-graphql-gateway
# Check service endpoints
kubectl get endpoints my-gateway
# Check LoadBalancer status
kubectl get svc my-gateway-lb
Monitor Autoscaling
# Watch HPA
watch kubectl get hpa
# Monitor resource usage
kubectl top pods
# Check VPA recommendations
kubectl describe vpa my-gateway
Load Testing
# Install k6
brew install k6
# Run load test
k6 run --vus 100 --duration 5m - <<EOF
import http from 'k6/http';

export default function () {
  const query = JSON.stringify({
    query: '{ __typename }'
  });
  http.post('http://<loadbalancer-ip>/graphql', query, {
    headers: { 'Content-Type': 'application/json' },
  });
}
EOF
# Watch scaling in action
watch kubectl get pods,hpa
Best Practices
- Start Conservative: Begin with moderate min/max replicas and adjust based on observed patterns
- VPA + HPA: Use VPA in "Off" mode alongside HPA to get recommendations without conflicts
- LoadBalancer: Use externalTrafficPolicy: Local when you need source IP preservation
- PodDisruptionBudget: Always configure a PDB to maintain availability during updates
- Multi-AZ: Use pod anti-affinity with a zone topology key (topology.kubernetes.io/zone) to spread pods across availability zones; the kubernetes.io/hostname key spreads pods across nodes only
- Gradual Rollouts: Test autoscaling in staging before production
- Monitor Costs: Set reasonable maxReplicas to prevent runaway costs
- Health Checks: Ensure liveness and readiness probes are properly configured
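For the Multi-AZ practice, a values fragment along the following lines could spread replicas across zones. It is a sketch that reuses the chart's `affinity` key and the `app.kubernetes.io/name` label from this guide; the choice of a preferred (soft) rule and the weight of 100 are assumptions, not the chart's defaults.

```yaml
# Sketch: prefer placing gateway pods in different availability zones.
affinity:
  podAntiAffinity:
    # Soft rule, so replicas beyond the number of zones can still be scheduled
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: grpc-graphql-gateway
          topologyKey: topology.kubernetes.io/zone
```

A required rule on the zone key would cap the replica count at the number of zones, which conflicts with autoscaling to higher maxReplicas values; the preferred form avoids that while still biasing the scheduler toward an even spread.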
Federation with Autoscaling
For federated deployments, each subgraph can scale independently:
# Deploy user subgraph with autoscaling
helm install user-subgraph ./grpc-graphql-gateway \
-f helm/values-federation-user.yaml \
--set autoscaling.maxReplicas=20
# Deploy product subgraph with different scaling
helm install product-subgraph ./grpc-graphql-gateway \
-f helm/values-federation-product.yaml \
--set autoscaling.maxReplicas=30