qemu-devel.nongnu.org archive mirror
* AWS CI Oddities
@ 2025-10-23 19:03 Richard Henderson
  2025-10-23 20:31 ` Alex Bennée
  2025-10-23 21:34 ` Stefan Hajnoczi
  0 siblings, 2 replies; 5+ messages in thread
From: Richard Henderson @ 2025-10-23 19:03 UTC (permalink / raw)
  To: Paolo Bonzini, Stefan Hajnoczi; +Cc: qemu-devel

https://gitlab.com/qemu-project/qemu/-/jobs/11827686852#L1010

ERROR: Job failed (system failure): pod 
"gitlab-runner/runner-o82plkzob-project-11167699-concurrent-9-ther90lx" is disrupted: 
reason "EvictionByEvictionAPI", message "Eviction API: evicting"


So... if I'm guessing correctly, the job is doing fine, 28 minutes into a 1-hour timeout,
when it gets kicked off the machine?

This has happened a few times in the last week or two.  Is there any way to fix this?


r~



* Re: AWS CI Oddities
  2025-10-23 19:03 AWS CI Oddities Richard Henderson
@ 2025-10-23 20:31 ` Alex Bennée
  2025-10-23 21:34 ` Stefan Hajnoczi
  1 sibling, 0 replies; 5+ messages in thread
From: Alex Bennée @ 2025-10-23 20:31 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Paolo Bonzini, Stefan Hajnoczi, qemu-devel

Richard Henderson <richard.henderson@linaro.org> writes:

> https://gitlab.com/qemu-project/qemu/-/jobs/11827686852#L1010
>
> ERROR: Job failed (system failure): pod
> "gitlab-runner/runner-o82plkzob-project-11167699-concurrent-9-ther90lx"
> is disrupted: reason "EvictionByEvictionAPI", message "Eviction API:
> evicting"
>
>
> So... if I'm guessing correctly, the job is doing fine, 28 minutes
> into a 1-hour timeout, when it gets kicked off the machine?
>
> This has happened a few times in the last week or two.  Is there any
> way to fix this?

Is this because we are using spot-priced AWS resources? Is our CI
workload already running as a low-priority job?

>
>
> r~

-- 
Alex Bennée
Virtualisation Tech Lead @ Linaro



* Re: AWS CI Oddities
  2025-10-23 19:03 AWS CI Oddities Richard Henderson
  2025-10-23 20:31 ` Alex Bennée
@ 2025-10-23 21:34 ` Stefan Hajnoczi
  2025-10-28  9:41   ` Camilla Conte
  1 sibling, 1 reply; 5+ messages in thread
From: Stefan Hajnoczi @ 2025-10-23 21:34 UTC (permalink / raw)
  To: Camilla Conte
  Cc: Paolo Bonzini, Stefan Hajnoczi, qemu-devel, Richard Henderson


On Thu, Oct 23, 2025, 15:03 Richard Henderson <richard.henderson@linaro.org>
wrote:

> https://gitlab.com/qemu-project/qemu/-/jobs/11827686852#L1010
>
> ERROR: Job failed (system failure): pod
> "gitlab-runner/runner-o82plkzob-project-11167699-concurrent-9-ther90lx" is
> disrupted:
> reason "EvictionByEvictionAPI", message "Eviction API: evicting"
>
>
> So... if I'm guessing correctly, the job is doing fine, 28 minutes into a
> 1-hour timeout,
> when it gets kicked off the machine?
>
> This has happened a few times in the last week or two.  Is there any way
> to fix this?
>

Hi Camilla,
Any ideas?

Stefan

>



* Re: AWS CI Oddities
  2025-10-23 21:34 ` Stefan Hajnoczi
@ 2025-10-28  9:41   ` Camilla Conte
  2025-10-28 18:25     ` Stefan Hajnoczi
  0 siblings, 1 reply; 5+ messages in thread
From: Camilla Conte @ 2025-10-28  9:41 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Paolo Bonzini, Stefan Hajnoczi, qemu-devel, Richard Henderson

Hi Stefan,

This can happen because the worker node is out of resources (likely
memory) for the running pods.
We can set a memory minimum for all pods, or we can set it per-job.

If there are some specific jobs that are memory-hungry, please set
KUBERNETES_MEMORY_REQUEST in the job variables.
https://docs.gitlab.com/runner/executors/kubernetes/#overwrite-container-resources

Else, I can set a global default.
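
For illustration, a per-job override along those lines could look something
like this in .gitlab-ci.yml (the job name and the 4Gi figure are invented
for the example; only the KUBERNETES_MEMORY_REQUEST variable itself comes
from the GitLab documentation linked above):

# Hypothetical job name and value, shown only to illustrate the mechanism.
some-memory-hungry-job:
  variables:
    # Memory request passed through to the Kubernetes executor, so the
    # pod is scheduled onto a node with at least this much memory free.
    KUBERNETES_MEMORY_REQUEST: "4Gi"

Note that a request only affects scheduling; whether a matching
KUBERNETES_MEMORY_LIMIT is also wanted is a separate decision.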

Camilla

On Thu, Oct 23, 2025 at 11:34 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Thu, Oct 23, 2025, 15:03 Richard Henderson <richard.henderson@linaro.org> wrote:
>>
>> https://gitlab.com/qemu-project/qemu/-/jobs/11827686852#L1010
>>
>> ERROR: Job failed (system failure): pod
>> "gitlab-runner/runner-o82plkzob-project-11167699-concurrent-9-ther90lx" is disrupted:
>> reason "EvictionByEvictionAPI", message "Eviction API: evicting"
>>
>>
>> So... if I'm guessing correctly, the job is doing fine, 28 minutes into a 1-hour timeout,
>> when it gets kicked off the machine?
>>
>> This has happened a few times in the last week or two.  Is there any way to fix this?
>
>
> Hi Camilla,
> Any ideas?
>
> Stefan




* Re: AWS CI Oddities
  2025-10-28  9:41   ` Camilla Conte
@ 2025-10-28 18:25     ` Stefan Hajnoczi
  0 siblings, 0 replies; 5+ messages in thread
From: Stefan Hajnoczi @ 2025-10-28 18:25 UTC (permalink / raw)
  To: Camilla Conte
  Cc: Stefan Hajnoczi, Paolo Bonzini, qemu-devel, Richard Henderson


On Tue, Oct 28, 2025 at 10:41:32AM +0100, Camilla Conte wrote:
> Hi Stefan,
> 
> This can happen because the worker node is out of resources (likely
> memory) for the running pods.
> We can set a memory minimum for all pods, or we can set it per-job.
> 
> If there are some specific jobs that are memory-hungry, please set
> KUBERNETES_MEMORY_REQUEST in the job variables.
> https://docs.gitlab.com/runner/executors/kubernetes/#overwrite-container-resources
> 
> Else, I can set a global default.

Hi Camilla,
Sizing each CI job requires memory metrics that I don't have. Does AWS give
you any insight into how much free RAM is available on k8s worker nodes
over time?

Gemini suggests that the Kubernetes Metrics Server can be enabled and then
`kubectl top` can be used to monitor Pod resource usage. Alternatively,
CloudWatch Container Insights can be enabled in AWS to get Pod memory
usage.
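
As a rough sketch of the kubectl route (assuming metrics-server is running
in the cluster and that the runner pods live in the "gitlab-runner"
namespace seen in the job log above):

  # Per-pod CPU/memory usage of the runner pods (needs metrics-server).
  $ kubectl top pods -n gitlab-runner

  # Per-node usage, to see whether worker nodes are under memory pressure.
  $ kubectl top nodes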

Once we have data we can either raise the limit (assuming there are no
huge outliers) or label the outlier jobs (to avoid overprovisioning the
normal jobs).

By the way, I haven't figured out how to access the AWS resources from
my own AWS account. I only see the total cost from our accounts, but
don't have visibility or control over what is running. So I'm unable to
investigate the EKS cluster myself at the moment.

Thanks,
Stefan


