flakiness on CI jobs run via k8s

qemu-devel.nongnu.org archive mirror

* flakiness on CI jobs run via k8s
@ 2024-09-17 15:48 Peter Maydell
  2024-09-18  9:24 ` Daniel P. Berrangé
From: Peter Maydell @ 2024-09-17 15:48 UTC
  To: QEMU Developers; +Cc: Alex Bennée, Paolo Bonzini

I notice that a lot of the CI job flakiness I'm seeing with main
CI runs involves jobs that are run via the k8s runners. Notably
cross-i686-tci and cross-i686-system and cross-i686-user are like this.
These jobs run with no flakiness that I've noticed when they're run
by an individual gitlab user (in which case they're not running on
k8s, I believe). So something seems to be up with the environment
we're using to run the jobs for the main CI. My impression is that
the time things take to run can be very variable, especially if the
CI job believes the reported number of CPUs and actually tries to run
8 or 9 test cases in parallel.

Any ideas what might be causing issues here, or config tweaks
we might be able to make to ensure that the environment reports
to the CI job a number of CPUs/etc that accurately reflects
the amount of resource it really has?

thanks
-- PMM
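
The question above, whether the job can be told how much CPU it really has, can be illustrated with a minimal sketch. It assumes the runner pod is on cgroup v2 and has a CPU limit set, so that /sys/fs/cgroup/cpu.max carries a quota; neither is confirmed in this thread. The function names are hypothetical and not part of QEMU's CI scripts. If the pod only has a CPU request and no limit, cpu.max reads "max" and the sketch falls back to the visible CPU count.

```python
import math
import os
from pathlib import Path


def cpus_from_cpu_max(cpu_max_text, fallback):
    """Return the CPU count implied by a cgroup v2 cpu.max string.

    cpu.max holds "<quota> <period>" in microseconds, or "max <period>"
    when no quota is set; in that case `fallback` is returned unchanged.
    """
    quota, period = cpu_max_text.split()
    if quota == "max":
        return fallback
    # Round up so a 1.5-CPU quota still allows two jobs, but never go
    # above the number of CPUs the kernel says we may run on.
    return max(1, min(fallback, math.ceil(int(quota) / int(period))))


def effective_cpus(cpu_max_path="/sys/fs/cgroup/cpu.max"):
    """Best-effort CPU count for sizing test parallelism in a container."""
    try:
        visible = len(os.sched_getaffinity(0))  # Linux only
    except AttributeError:
        visible = os.cpu_count() or 1
    try:
        text = Path(cpu_max_path).read_text()
    except OSError:
        return visible  # no cgroup v2 file at this path; trust the visible count
    return cpus_from_cpu_max(text, visible)


if __name__ == "__main__":
    print(effective_cpus())
```

A CI job could use the printed value as its parallel job count instead of the raw CPU count. This only helps when the quota reflects the real allocation; it does nothing about an over-committed host.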


* Re: flakiness on CI jobs run via k8s
  2024-09-17 15:48 flakiness on CI jobs run via k8s Peter Maydell
@ 2024-09-18  9:24 ` Daniel P. Berrangé
From: Daniel P. Berrangé @ 2024-09-18  9:24 UTC
  To: Peter Maydell; +Cc: QEMU Developers, Alex Bennée, Paolo Bonzini

On Tue, Sep 17, 2024 at 04:48:45PM +0100, Peter Maydell wrote:
> I notice that a lot of the CI job flakiness I'm seeing with main
> CI runs involves jobs that are run via the k8s runners. Notably
> cross-i686-tci and cross-i686-system and cross-i686-user are like this.
> These jobs run with no flakiness that I've noticed when they're run
> by an individual gitlab user (in which case they're not running on
> k8s, I believe). So something seems to be up with the environment
> we're using to run the jobs for the main CI. My impression is that
> the time things take to run can be very variable, especially if the
> CI job believes the reported number of CPUs and actually tries to run
> 8 or 9 test cases in parallel.
> 
> Any ideas what might be causing issues here, or config tweaks
> we might be able to make to ensure that the environment reports
> to the CI job a number of CPUs/etc that accurately reflects
> the amount of resource it really has?

Didn't we change the hosting for our k8s runners recently? They were
running on Azure, but I vaguely recall hearing that it was being
switched again.

Anyway, perhaps the cloud provider is over-committing the environment,
so that we see excessive steal time and thus don't get the full
power of the CPUs we expect. I know GitLab's own public runners
suffer from this periodically, due to the very cheap VMs they are
hosted on.

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
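
The steal-time hypothesis in the reply above could be checked from inside a running job. The following is a minimal sketch under stated assumptions: it samples the aggregate "cpu" line of /proc/stat twice and reports the share of CPU time counted as steal in between. The helper names and the 5-second interval are illustrative choices, not anything used by the QEMU CI. Inside a container, /proc/stat typically shows the node's counters rather than per-pod ones, so the figure describes the underlying VM.

```python
import time


def parse_cpu_line(stat_text):
    """Return (steal, total) jiffies from the aggregate "cpu" line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            values = [int(v) for v in fields[1:]]
            # Field order: user nice system idle iowait irq softirq steal
            # guest guest_nice. Guest time is already included in user and
            # nice, so only the first eight fields are summed.
            steal = values[7] if len(values) > 7 else 0
            return steal, sum(values[:8])
    raise ValueError("no aggregate cpu line found")


def steal_fraction(before, after):
    """Fraction of elapsed CPU time spent as steal between two samples."""
    d_steal = after[0] - before[0]
    d_total = after[1] - before[1]
    return d_steal / d_total if d_total > 0 else 0.0


def sample_steal(interval=5.0, path="/proc/stat"):
    """Read /proc/stat twice, `interval` seconds apart, and return the steal share."""
    with open(path) as f:
        before = parse_cpu_line(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_line(f.read())
    return steal_fraction(before, after)


if __name__ == "__main__":
    print(f"steal: {sample_steal():.1%}")
```

Logging this value at the start and end of a flaky job would show whether slow runs line up with high steal, which is what over-commit by the provider would predict.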


