Question
I'm attempting to configure a Horizontal Pod Autoscaler to scale a deployment based on the duty cycle of attached GPUs.
I'm using GKE, and my Kubernetes master version is 1.10.7-gke.6 .
I'm working off the tutorial at <https://cloud.google.com/kubernetes- engine/docs/tutorials/external-metrics-autoscaling> . In particular, I ran the following command to set up custom metrics:
kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/k8s-stackdriver/master/custom-metrics-stackdriver-adapter/deploy/production/adapter.yaml
This appears to have worked, or at least I can access a list of metrics at /apis/custom.metrics.k8s.io/v1beta1 .
This is my YAML:
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
name: images-srv-hpa
spec:
minReplicas: 1
maxReplicas: 10
metrics:
- type: External
external:
metricName: container.googleapis.com|container|accelerator|duty_cycle
targetAverageValue: 50
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: images-srv-deployment
I believe that the metricName exists because it's listed in /apis/custom.metrics.k8s.io/v1beta1 , and because it's described on https://cloud.google.com/monitoring/api/metrics_gcp .
This is the error I get when describing the HPA:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedGetExternalMetric 18s (x3 over 1m) horizontal-pod-autoscaler unable to get external metric prod/container.googleapis.com|container|accelerator|duty_cycle/nil: no metrics returned from external metrics API
Warning FailedComputeMetricsReplicas 18s (x3 over 1m) horizontal-pod-autoscaler failed to get container.googleapis.com|container|accelerator|duty_cycle external metric: unable to get external metric prod/container.googleapis.com|container|accelerator|duty_cycle/nil: no metrics returned from external metrics API
I don't really know how to go about debugging this. Does anyone know what might be wrong, or what I could do next?
Answer
This problem went away on its own once I placed the system under load. It's working fine now with the same configuration.
I'm not sure why. My best guess is that StackMetrics wasn't reporting a duty cycle value until it went above 1%.