Autoscaling and Load Balancing

This guide covers configuring autoscaling and load balancing for the gRPC GraphQL Gateway on Kubernetes.

Overview

The gateway supports three mechanisms for scaling and traffic distribution:

  1. Horizontal Pod Autoscaler (HPA) - Scales the number of pods based on metrics
  2. Vertical Pod Autoscaler (VPA) - Adjusts resource requests/limits for pods
  3. LoadBalancer - External load balancing for traffic distribution

Horizontal Pod Autoscaler (HPA)

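HPA automatically scales the number of pods based on observed CPU, memory, or custom metrics. CPU and memory utilization targets are percentages of each pod's resource requests, so requests must be set and a metrics source such as metrics-server must be running in the cluster.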

Basic Configuration

autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80
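
The HPA controller derives the replica count from the ratio of observed to target utilization, then clamps the result to the configured bounds. The sketch below reproduces that arithmetic for the values above. The function name `desired_replicas` is illustrative and not part of the chart, and the sketch ignores the controller's tolerance band and stabilization window.

```shell
# desired_replicas CURRENT_REPLICAS OBSERVED_UTIL TARGET_UTIL MIN MAX
# Hypothetical helper: ceil(current * observed / target), clamped to [MIN, MAX].
# The body runs in a subshell "( ... )" so its variables do not leak.
desired_replicas() (
  current=$1 observed=$2 target=$3 min=$4 max=$5
  desired=$(( (current * observed + target - 1) / target ))  # integer ceiling
  [ "$desired" -lt "$min" ] && desired=$min
  [ "$desired" -gt "$max" ] && desired=$max
  echo "$desired"
)

# 3 pods averaging 95% CPU against the 70% target, bounds 3..10
desired_replicas 3 95 70 3 10   # prints 5
```

When both a CPU and a memory target are set, the HPA computes a count for each metric and uses the larger one.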

Custom Metrics

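To scale on application-level metrics, define custom metrics. This requires a metrics adapter (for example, Prometheus Adapter) that exposes the metric through the Kubernetes custom metrics API: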

autoscaling:
  enabled: true
  minReplicas: 5
  maxReplicas: 50
  customMetrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "1000"
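
With a `Pods` metric and an `AverageValue` target, the replica count works out to the total metric value across all pods divided by the per-pod target, rounded up. The following is a rough sketch of that calculation using the numbers above. `replicas_for_rate` is a made-up name, and the total request rate is an assumed input.

```shell
# replicas_for_rate TOTAL_RPS TARGET_RPS_PER_POD MIN MAX
# Hypothetical helper: ceil(total / per-pod target), clamped to [MIN, MAX].
replicas_for_rate() (
  total=$1 per_pod=$2 min=$3 max=$4
  desired=$(( (total + per_pod - 1) / per_pod ))  # integer ceiling
  [ "$desired" -lt "$min" ] && desired=$min
  [ "$desired" -gt "$max" ] && desired=$max
  echo "$desired"
)

# 12,500 requests/s in total, 1000 requests/s per pod, bounds 5..50
replicas_for_rate 12500 1000 5 50   # prints 13
```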

Deployment

helm install my-gateway ./grpc-graphql-gateway \
  --set autoscaling.enabled=true \
  --set autoscaling.minReplicas=3 \
  --set autoscaling.maxReplicas=10

Monitoring HPA

# Watch HPA status
kubectl get hpa -w

# Describe HPA for detailed metrics
kubectl describe hpa my-gateway

# View current metrics
kubectl top pods -l app.kubernetes.io/name=grpc-graphql-gateway

Vertical Pod Autoscaler (VPA)

VPA automatically adjusts CPU and memory requests/limits based on actual usage.

Prerequisites

Install VPA in your cluster:

git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh

Configuration

verticalPodAutoscaler:
  enabled: true
  updateMode: "Auto"  # Off, Initial, Recreate, Auto
  minAllowed:
    cpu: 100m
    memory: 128Mi
  maxAllowed:
    cpu: 2000m
    memory: 2Gi
  controlledResources:
    - cpu
    - memory
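
`minAllowed` and `maxAllowed` bound whatever VPA recommends. As a minimal sketch, assuming the chart passes these values through to the VPA resource policy, the clamping looks like this for CPU in millicores. `clamp_millicores` is an illustrative name.

```shell
# clamp_millicores RECOMMENDED MIN_ALLOWED MAX_ALLOWED  (all in millicores)
# Hypothetical helper: keeps a recommendation inside the allowed range.
clamp_millicores() (
  value=$1 min=$2 max=$3
  [ "$value" -lt "$min" ] && value=$min
  [ "$value" -gt "$max" ] && value=$max
  echo "${value}m"
)

clamp_millicores 50 100 2000     # prints 100m  (raised to minAllowed)
clamp_millicores 3500 100 2000   # prints 2000m (capped at maxAllowed)
clamp_millicores 750 100 2000    # prints 750m  (already within bounds)
```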

Update Modes

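| Mode | Description | Use Case |
|------|-------------|----------|
| Off | Only provides recommendations; never changes pods | Safe to use alongside HPA |
| Initial | Applies recommendations at pod creation only | Initial sizing |
| Recreate | Evicts running pods and recreates them with updated requests | Automatic updates where pod restarts are acceptable |
| Auto | Applies recommendations automatically (in upstream VPA this has historically behaved like Recreate) | Full automation |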

Using VPA with HPA

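⚠️ Important: VPA and HPA should not act on the same resource metrics (CPU or memory) for the same pods. VPA changes the resource requests that HPA utilization is measured against, so the two controllers can work against each other.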

Recommended Setup:

# Use VPA in "Off" mode for recommendations
verticalPodAutoscaler:
  enabled: true
  updateMode: "Off"
  
# Use HPA for horizontal scaling
autoscaling:
  enabled: true
  minReplicas: 3
  maxReplicas: 10

Alternative: Use VPA for CPU/Memory and HPA for custom metrics.
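
The conflict arises because HPA utilization is measured relative to the pod's resource request. The sketch below, with assumed usage and request figures, shows how the same load yields a very different utilization once VPA raises the request. `cpu_utilization` is an illustrative name.

```shell
# cpu_utilization USAGE_MILLICORES REQUEST_MILLICORES -> integer percent
# Hypothetical helper: utilization as the HPA sees it (usage / request).
cpu_utilization() (
  echo $(( $1 * 100 / $2 ))
)

cpu_utilization 400 500    # prints 80: above a 70% target, so HPA adds pods
cpu_utilization 400 1000   # prints 40: same load after VPA doubles the request, so HPA removes pods
```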

Viewing VPA Recommendations

# Get VPA status
kubectl describe vpa my-gateway

# View recommendations
kubectl get vpa my-gateway -o jsonpath='{.status.recommendation}'

LoadBalancer Service

A Service of type LoadBalancer exposes the gateway outside the cluster through the cloud provider's load balancer.

Basic Configuration

loadBalancer:
  enabled: true
  httpPort: 80
  grpcPort: 50051
  externalTrafficPolicy: Cluster

AWS Network Load Balancer

loadBalancer:
  enabled: true
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-cross-zone-load-balancing-enabled: "true"
    service.beta.kubernetes.io/aws-load-balancer-internal: "false"
  externalTrafficPolicy: Local  # Preserve source IP
  loadBalancerSourceRanges:
    - "10.0.0.0/8"  # Allow only clients with private 10.x source addresses (e.g. from the VPC)
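
`loadBalancerSourceRanges` admits a client only if its source address falls inside one of the listed CIDR blocks. The sketch below illustrates that check. `ip_to_int` and `in_cidr` are illustrative helpers, not part of any Kubernetes tooling. Note that with `aws-load-balancer-internal` set to "false", clients arriving from the internet carry public addresses and would fall outside 10.0.0.0/8.

```shell
# ip_to_int A.B.C.D -> 32-bit integer (hypothetical helper)
ip_to_int() (
  IFS=.
  set -- $1                      # split the dotted quad into $1..$4
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# in_cidr IP CIDR -> exit status 0 if IP is inside CIDR (hypothetical helper)
in_cidr() (
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")     # network part before the slash
  bits=${2#*/}                   # prefix length after the slash
  [ "$bits" -eq 0 ] && exit 0    # /0 matches every address
  mask=$(( (4294967295 << (32 - bits)) & 4294967295 ))  # 4294967295 = 2^32 - 1
  [ $(( ip & mask )) -eq $(( net & mask )) ]
)

in_cidr 10.1.2.3 10.0.0.0/8    && echo "10.1.2.3 allowed"
in_cidr 203.0.113.7 10.0.0.0/8 || echo "203.0.113.7 blocked"
```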

Google Cloud Load Balancer

loadBalancer:
  enabled: true
  annotations:
    cloud.google.com/load-balancer-type: "Internal"
    cloud.google.com/backend-config: '{"default": "backend-config"}'
  externalTrafficPolicy: Cluster

Azure Load Balancer

loadBalancer:
  enabled: true
  annotations:
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
  loadBalancerIP: "10.0.0.10"  # Static internal IP

External Traffic Policy

| Policy | Pros | Cons |
|--------|------|------|
| Cluster | Even load distribution across nodes | Loses source IP |
| Local | Preserves source IP, lower latency | May cause uneven load distribution |
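
The uneven distribution under `Local` follows from simple arithmetic. Assuming the load balancer splits traffic evenly across the nodes that host a gateway pod, each pod's share depends on how many pods share its node. `per_pod_share` is an illustrative name and the pod layout is an assumed example.

```shell
# per_pod_share NODES_WITH_PODS PODS_ON_THIS_NODE -> integer percent of total traffic per pod
# Hypothetical helper; assumes an even per-node split by the load balancer.
per_pod_share() (
  echo $(( 100 / ($1 * $2) ))
)

# Two nodes: node A runs 1 pod, node B runs 3 pods
per_pod_share 2 1   # prints 50: the single pod on node A
per_pod_share 2 3   # prints 16: each of the three pods on node B
```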

Complete Example

Production Deployment with All Features

# values-production.yaml
loadBalancer:
  enabled: true
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
  externalTrafficPolicy: Local
  httpPort: 80
  loadBalancerSourceRanges:
    - "0.0.0.0/0"

autoscaling:
  enabled: true
  minReplicas: 5
  maxReplicas: 50
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80

verticalPodAutoscaler:
  enabled: true
  updateMode: "Off"  # Get recommendations without conflicts
  minAllowed:
    cpu: 250m
    memory: 256Mi
  maxAllowed:
    cpu: 4000m
    memory: 4Gi

resources:
  limits:
    cpu: 2000m
    memory: 2Gi
  requests:
    cpu: 1000m
    memory: 1Gi

podDisruptionBudget:
  enabled: true
  minAvailable: 3

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: app.kubernetes.io/name
              operator: In
              values:
                - grpc-graphql-gateway
        topologyKey: kubernetes.io/hostname

Deploy:

helm install gateway ./grpc-graphql-gateway \
  -f helm/values-production.yaml \
  --namespace production \
  --create-namespace
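
Two settings in the production values interact with replica counts and are worth a quick sanity check: the PodDisruptionBudget limits how many pods can be evicted at once, and the required anti-affinity on `kubernetes.io/hostname` allows at most one gateway pod per node. The helpers below are an illustrative sketch, and the node count is an assumed figure.

```shell
# allowed_disruptions HEALTHY_PODS MIN_AVAILABLE
# Hypothetical helper: voluntary evictions the PDB permits right now.
allowed_disruptions() (
  allowed=$(( $1 - $2 ))
  [ "$allowed" -lt 0 ] && allowed=0
  echo "$allowed"
)

# unschedulable_pods REPLICAS NODES
# Hypothetical helper: pods left Pending when only one pod fits per node.
unschedulable_pods() (
  pending=$(( $1 - $2 ))
  [ "$pending" -lt 0 ] && pending=0
  echo "$pending"
)

allowed_disruptions 5 3    # prints 2: at minReplicas, two pods may be drained at a time
unschedulable_pods 50 20   # prints 30: maxReplicas of 50 needs 50 nodes, not 20
```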

Load Balancing Strategies

At Service Level

service:
  sessionAffinity: ClientIP  # Sticky sessions
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800  # 3 hours

At Ingress Level

ingress:
  annotations:
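    # Round robin (default)
    nginx.ingress.kubernetes.io/load-balance: "round_robin"
    
    # Peak EWMA (latency-aware); current ingress-nginx releases accept only round_robin or ewma here
    # nginx.ingress.kubernetes.io/load-balance: "ewma"
    
    # Consistent hashing on client IP (replaces ip_hash, which only older releases accepted)
    # nginx.ingress.kubernetes.io/upstream-hash-by: "$remote_addr"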

Monitoring and Troubleshooting

Check Load Distribution

# View pod distribution across nodes
kubectl get pods -o wide -l app.kubernetes.io/name=grpc-graphql-gateway

# Check service endpoints
kubectl get endpoints my-gateway

# Check LoadBalancer status
kubectl get svc my-gateway-lb

Monitor Autoscaling

# Watch HPA
watch kubectl get hpa

# Monitor resource usage
kubectl top pods

# Check VPA recommendations
kubectl describe vpa my-gateway

Load Testing

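# Install k6 (macOS with Homebrew; see the k6 docs for other platforms)
brew install k6

# Run load test; replace <loadbalancer-ip> with the external IP of the LoadBalancer service.
# The quoted 'EOF' keeps the shell from expanding anything inside the script.
k6 run --vus 100 --duration 5m - <<'EOF'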
import http from 'k6/http';

export default function () {
  const query = JSON.stringify({
    query: '{ __typename }'
  });
  
  http.post('http://<loadbalancer-ip>/graphql', query, {
    headers: { 'Content-Type': 'application/json' },
  });
}
EOF

# Watch scaling in action
watch kubectl get pods,hpa

Best Practices

  1. Start Conservative: Begin with moderate min/max replicas and adjust based on observed patterns

  2. VPA + HPA: Use VPA in “Off” mode alongside HPA to get recommendations without conflicts

  3. LoadBalancer: Use externalTrafficPolicy: Local when you need source IP preservation

  4. PodDisruptionBudget: Always configure PDB to maintain availability during updates

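  5. Multi-AZ: Use pod anti-affinity with topologyKey: topology.kubernetes.io/zone to spread pods across availability zones; kubernetes.io/hostname, as used in the complete example, only spreads them across nodes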

  6. Gradual Rollouts: Test autoscaling in staging before production

  7. Monitor Costs: Set reasonable maxReplicas to prevent runaway costs

  8. Health Checks: Ensure liveness and readiness probes are properly configured
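
To put a number on the cost ceiling, multiply `maxReplicas` by the per-pod requests. The sketch below uses the production example's figures (50 replicas, 1000m CPU and 1Gi memory requested per pod). `max_requested` is an illustrative name.

```shell
# max_requested REPLICAS PER_POD_REQUEST -> total requested at full scale
# Hypothetical helper for estimating worst-case cluster capacity.
max_requested() (
  echo $(( $1 * $2 ))
)

echo "CPU: $(max_requested 50 1000)m"   # prints CPU: 50000m (50 cores)
echo "Memory: $(max_requested 50 1)Gi"  # prints Memory: 50Gi
```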

Federation with Autoscaling

For federated deployments, each subgraph can scale independently:

# Deploy user subgraph with autoscaling
helm install user-subgraph ./grpc-graphql-gateway \
  -f helm/values-federation-user.yaml \
  --set autoscaling.maxReplicas=20

# Deploy product subgraph with different scaling
helm install product-subgraph ./grpc-graphql-gateway \
  -f helm/values-federation-product.yaml \
  --set autoscaling.maxReplicas=30

Next Steps