* Kubernetes gitlab-runner jobs cannot be scheduled
@ 2025-03-01  6:19 Stefan Hajnoczi
  2025-03-01  6:36 ` Paolo Bonzini
  0 siblings, 1 reply; 7+ messages in thread
From: Stefan Hajnoczi @ 2025-03-01  6:19 UTC (permalink / raw)
  To: Paolo Bonzini, Camilla Conte; +Cc: Thomas Huth, qemu-devel

Hi,
On February 26th GitLab CI started failing many jobs because they
could not be scheduled. I've been unable to merge pull requests
because the CI is not working.

Here is an example failed job:
https://gitlab.com/qemu-project/qemu/-/jobs/9281757413

One issue seems to be that the gitlab-cache-pvc PVC uses the
ReadWriteOnce access mode, so the volume can only be mounted on one
node at a time. Pods scheduled on a new node therefore cannot start
until the existing Pods holding the volume on another node complete,
causing gitlab-runner timeouts and failed jobs.
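
For reference, this is roughly what I assume gitlab-cache-pvc looks
like (I haven't dumped the actual resource, so the storage size is
illustrative):

  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: gitlab-cache-pvc
  spec:
    accessModes:
      - ReadWriteOnce   # only one node can mount the volume read-write
    resources:
      requests:
        storage: 50Gi   # illustrative, I haven't checked the real claim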

When trying to figure out how the Digital Ocean Kubernetes cluster is
configured I noticed that the
digitalocean-runner-manager-gitlab-runner ConfigMap created on
2024-12-03 does not match qemu.git's
scripts/ci/gitlab-kubernetes-runners/values.yaml. Here is the diff:
--- /tmp/upstream.yaml    2025-03-01 12:47:40.495216401 +0800
+++ /tmp/deployed.yaml    2025-03-01 12:47:38.884216210 +0800
@@ -9,6 +9,7 @@
   [runners.kubernetes]
     poll_timeout = 1200
     image = "ubuntu:20.04"
+    privileged = true
     cpu_request = "0.5"
     service_cpu_request = "0.5"
     helper_cpu_request = "0.25"
@@ -18,5 +19,6 @@
     name = "docker-certs"
     mount_path = "/certs/client"
     medium = "Memory"
-  [runners.kubernetes.node_selector]
-    agentpool = "jobs"
+  [[runners.kubernetes.volumes.pvc]]
+    name = "gitlab-cache-pvc"
+    mount_path = "/cache"
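
For completeness, if we decide to keep the cache PVC, I think the
equivalent addition to scripts/ci/gitlab-kubernetes-runners/values.yaml
would look roughly like this under the chart's runners.config block
(an untested sketch, key names from memory):

  runners:
    config: |
      [[runners]]
        [runners.kubernetes]
          # ... existing settings stay as they are ...
        [[runners.kubernetes.volumes.pvc]]
          name = "gitlab-cache-pvc"
          mount_path = "/cache"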

The cache PVC appears to be a manual addition made to the running
cluster but not committed to qemu.git. I don't understand why the
problems only started surfacing now. Maybe a recent .gitlab-ci.d/
change altered the timeout behavior, or maybe the gitlab-runner
configuration that enables the cache PVC simply wasn't picked up by
the gitlab-runner Pod until February 26th?

In the short term I made a manual edit to the ConfigMap removing
gitlab-cache-pvc (but I didn't delete the PVC resource itself). Jobs
are at least running now, although they may take longer due to the
lack of cache.

In the long term maybe we should deploy MinIO
(https://github.com/minio/minio) or another S3-compatible object
storage service in the cluster so gitlab-runner can use a proper
distributed cache without the ReadWriteOnce limitation?
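
On the runner side I think the configuration change would be small;
something along these lines in values.yaml, where the MinIO service
address and bucket name are placeholders and credentials would come
from a Kubernetes Secret rather than being inlined:

  runners:
    config: |
      [[runners]]
        [runners.cache]
          Type = "s3"
          Shared = true
          [runners.cache.s3]
            # placeholder in-cluster endpoint and bucket name
            ServerAddress = "minio.gitlab-runner.svc:9000"
            BucketName = "gitlab-runner-cache"
            Insecure = true   # assumes plain HTTP inside the cluster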

Since I don't know the details of how the Digital Ocean Kubernetes
cluster was configured for gitlab-runner I don't want to make too many
changes without your input. Please let me know what you think.

Stefan



Thread overview: 7+ messages (newest: 2025-03-03 13:12 UTC)

2025-03-01  6:19 Kubernetes gitlab-runner jobs cannot be scheduled Stefan Hajnoczi
2025-03-01  6:36 ` Paolo Bonzini
2025-03-01  7:27   ` Stefan Hajnoczi
2025-03-03  7:35   ` Stefan Hajnoczi
2025-03-03  9:25     ` Paolo Bonzini
2025-03-03 11:01       ` Stefan Hajnoczi
2025-03-03 13:11         ` Daniel P. Berrangé
