* AWS CI Oddities
@ 2025-10-23 19:03 Richard Henderson
2025-10-23 20:31 ` Alex Bennée
2025-10-23 21:34 ` Stefan Hajnoczi
0 siblings, 2 replies; 5+ messages in thread
From: Richard Henderson @ 2025-10-23 19:03 UTC (permalink / raw)
To: Paolo Bonzini, Stefan Hajnoczi; +Cc: qemu-devel
https://gitlab.com/qemu-project/qemu/-/jobs/11827686852#L1010
ERROR: Job failed (system failure): pod
"gitlab-runner/runner-o82plkzob-project-11167699-concurrent-9-ther90lx" is disrupted:
reason "EvictionByEvictionAPI", message "Eviction API: evicting"
So... if I'm guessing correctly, the job is doing fine, 28 minutes into a 1-hour timeout,
when it gets kicked off the machine?
This has happened a few times in the last week or two. Is there any way to fix this?
r~
* Re: AWS CI Oddities
2025-10-23 19:03 AWS CI Oddities Richard Henderson
@ 2025-10-23 20:31 ` Alex Bennée
2025-10-23 21:34 ` Stefan Hajnoczi
1 sibling, 0 replies; 5+ messages in thread
From: Alex Bennée @ 2025-10-23 20:31 UTC (permalink / raw)
To: Richard Henderson; +Cc: Paolo Bonzini, Stefan Hajnoczi, qemu-devel
Richard Henderson <richard.henderson@linaro.org> writes:
> https://gitlab.com/qemu-project/qemu/-/jobs/11827686852#L1010
>
> ERROR: Job failed (system failure): pod
> "gitlab-runner/runner-o82plkzob-project-11167699-concurrent-9-ther90lx"
> is disrupted: reason "EvictionByEvictionAPI", message "Eviction API:
> evicting"
>
>
> So... if I'm guessing correctly, the job is doing fine, 28 minutes
> into a 1-hour timeout, when it gets kicked off the machine?
>
> This has happened a few times in the last week or two. Is there any
> way to fix this?
Is this because we are using spot-priced AWS resources? Is our CI
workload run as a low-priority job?
>
>
> r~
--
Alex Bennée
Virtualisation Tech Lead @ Linaro
* Re: AWS CI Oddities
2025-10-23 19:03 AWS CI Oddities Richard Henderson
2025-10-23 20:31 ` Alex Bennée
@ 2025-10-23 21:34 ` Stefan Hajnoczi
2025-10-28 9:41 ` Camilla Conte
1 sibling, 1 reply; 5+ messages in thread
From: Stefan Hajnoczi @ 2025-10-23 21:34 UTC (permalink / raw)
To: Camilla Conte
Cc: Paolo Bonzini, Stefan Hajnoczi, qemu-devel, Richard Henderson
On Thu, Oct 23, 2025, 15:03 Richard Henderson <richard.henderson@linaro.org>
wrote:
> https://gitlab.com/qemu-project/qemu/-/jobs/11827686852#L1010
>
> ERROR: Job failed (system failure): pod
> "gitlab-runner/runner-o82plkzob-project-11167699-concurrent-9-ther90lx" is
> disrupted:
> reason "EvictionByEvictionAPI", message "Eviction API: evicting"
>
>
> So... if I'm guessing correctly, the job is doing fine, 28 minutes into
> a 1-hour timeout,
> when it gets kicked off the machine?
>
> This has happened a few times in the last week or two. Is there any way
> to fix this?
>
Hi Camilla,
Any ideas?
Stefan
>
* Re: AWS CI Oddities
2025-10-23 21:34 ` Stefan Hajnoczi
@ 2025-10-28 9:41 ` Camilla Conte
2025-10-28 18:25 ` Stefan Hajnoczi
0 siblings, 1 reply; 5+ messages in thread
From: Camilla Conte @ 2025-10-28 9:41 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: Paolo Bonzini, Stefan Hajnoczi, qemu-devel, Richard Henderson
Hi Stefan,
This can happen because the worker node is out of resources (likely
memory) for the running pods.
We can set a memory minimum for all pods, or we can set it per-job.
If there are some specific jobs that are memory-hungry, please set
KUBERNETES_MEMORY_REQUEST in the job variables.
https://docs.gitlab.com/runner/executors/kubernetes/#overwrite-container-resources
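As a sketch of the per-job override (the job name and value here are
hypothetical; the right number depends on the job's actual footprint),
it would look something like this in .gitlab-ci.yml:

  cross-system-build:
    variables:
      # Ask the scheduler to reserve 4 GiB for this pod (hypothetical value)
      KUBERNETES_MEMORY_REQUEST: "4Gi"

With a request set, the pod is only scheduled on a node with that much
allocatable memory, and it is less likely to be picked for eviction
under node memory pressure.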
Else, I can set a global default.
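(If I understand the runner docs correctly, that global default would
be the memory_request setting in the runner's config.toml, roughly
along these lines -- value again hypothetical:

  [runners.kubernetes]
    memory_request = "2Gi"
)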
Camilla
On Thu, Oct 23, 2025 at 11:34 PM Stefan Hajnoczi <stefanha@gmail.com> wrote:
>
> On Thu, Oct 23, 2025, 15:03 Richard Henderson <richard.henderson@linaro.org> wrote:
>>
>> https://gitlab.com/qemu-project/qemu/-/jobs/11827686852#L1010
>>
>> ERROR: Job failed (system failure): pod
>> "gitlab-runner/runner-o82plkzob-project-11167699-concurrent-9-ther90lx" is disrupted:
>> reason "EvictionByEvictionAPI", message "Eviction API: evicting"
>>
>>
>> So... if I'm guessing correctly, the job is doing fine, 28 minutes into a 1-hour timeout,
>> when it gets kicked off the machine?
>>
>> This has happened a few times in the last week or two. Is there any way to fix this?
>
>
> Hi Camilla,
> Any ideas?
>
> Stefan
* Re: AWS CI Oddities
2025-10-28 9:41 ` Camilla Conte
@ 2025-10-28 18:25 ` Stefan Hajnoczi
0 siblings, 0 replies; 5+ messages in thread
From: Stefan Hajnoczi @ 2025-10-28 18:25 UTC (permalink / raw)
To: Camilla Conte
Cc: Stefan Hajnoczi, Paolo Bonzini, qemu-devel, Richard Henderson
On Tue, Oct 28, 2025 at 10:41:32AM +0100, Camilla Conte wrote:
> Hi Stefan,
>
> This can happen because the worker node is out of resources (likely
> memory) for the running pods.
> We can set a memory minimum for all pods, or we can set it per-job.
>
> If there are some specific jobs that are memory-hungry, please set
> KUBERNETES_MEMORY_REQUEST in the job variables.
> https://docs.gitlab.com/runner/executors/kubernetes/#overwrite-container-resources
>
> Else, I can set a global default.
Hi Camilla,
Sizing each CI job requires memory metrics that I don't have. Does AWS give
you any insight into how much free RAM is available on k8s worker nodes
over time?
Gemini suggests that the Kubernetes Metrics Server can be enabled and
then `kubectl top` can be used to monitor Pod resource usage.
Alternatively, CloudWatch Container Insights can be enabled in AWS to
get Pod memory usage.
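Assuming the Metrics Server is running, something like the following
should show point-in-time usage (the gitlab-runner namespace is taken
from the pod name in the error above):

  kubectl top pods -n gitlab-runner
  kubectl top nodes

Neither command keeps history, so for usage over time we would still
need Container Insights or similar.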
Once we have data we can either raise the default memory request
(assuming there are no huge outliers) or set KUBERNETES_MEMORY_REQUEST
on the outlier jobs (to avoid overprovisioning the normal jobs).
By the way, I haven't figured out how to access the AWS resources from
my own AWS account. I only see the total cost from our accounts, but
don't have visibility or control over what is running. So I'm unable to
investigate the EKS cluster myself at the moment.
Thanks,
Stefan