runaway avocado

qemu-devel.nongnu.org archive mirror

* runaway avocado
@ 2020-10-26 22:35 Peter Maydell
  2020-10-26 22:43 ` Philippe Mathieu-Daudé
  2021-02-05 19:23 ` Peter Maydell
  0 siblings, 2 replies; 11+ messages in thread
From: Peter Maydell @ 2020-10-26 22:35 UTC (permalink / raw)
  To: QEMU Developers; +Cc: Alex Bennée

So, I somehow ended up with this process still running on my
local machine after a (probably failed) 'make check-acceptance':

petmay01 13710 99.7  3.7 2313448 1235780 pts/16 Sl  16:10 378:00
./qemu-system-aarch64 -display none -vga none -chardev
socket,id=mon,path=/var/tmp/tmp5szft2yi/qemu-13290-monitor.sock -mon
chardev=mon,mode=control -machine virt -chardev
socket,id=console,path=/var/tmp/tmp5szft2yi/qemu-13290-console.sock,server,nowait
-serial chardev:console -icount
shift=7,rr=record,rrfile=/var/tmp/avocado_iv8dehpo/avocado_job_w9efukj5/32-tests_acceptance_reverse_debugging.py_ReverseDebugging_AArch64.test_aarch64_virt/replay.bin,rrsnapshot=init
-net none -drive
file=/var/tmp/avocado_iv8dehpo/avocado_job_w9efukj5/32-tests_acceptance_reverse_debugging.py_ReverseDebugging_AArch64.test_aarch64_virt/disk.qcow2,if=none
-kernel /home/petmay01/avocado/data/cache/by_location/a00ac4ae676ef0322126abd2f7d38f50cc9cbc95/vmlinuz
-cpu cortex-a53

and it was continuing to log to a deleted file
/var/tmp/avocado_iv8dehpo/avocado_job_w9efukj5/32-tests_acceptance_reverse_debugging.py_ReverseDebugging_AArch64.test_aarch64_virt/replay.bin

which was steadily eating my disk space and got up to nearly 100GB
in used disk (invisible to du, of course, since it was an unlinked
file) before I finally figured out what was going on and killed it
about six hours later...
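
Space held by an unlinked file stays allocated until the last open
descriptor on it is closed, which is why du cannot see it. A minimal
sketch, assuming Linux with /proc mounted, for listing the deleted files
a given process still holds open. The helper name list_deleted_fds is
hypothetical and not part of QEMU or Avocado:

```shell
# list_deleted_fds PID
# Hypothetical helper (Linux only): print every file descriptor of PID
# whose target has been unlinked. The kernel marks such targets by
# appending " (deleted)" to the link shown in /proc/PID/fd.
list_deleted_fds() {
    pid=$1
    for fd in /proc/"$pid"/fd/*; do
        # readlink fails if the fd vanished or we lack permission; skip it.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            *' (deleted)') printf '%s -> %s\n' "$fd" "$target" ;;
        esac
    done
}
```

For the process in the ps output above, the call would be
'list_deleted_fds 13710'. Where lsof is installed, 'lsof +L1' gives a
system-wide list of open files whose link count is below one.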

Any suggestions for how we might improve the robustness of the
relevant test ?

thanks
-- PMM



* Re: runaway avocado
  2020-10-26 22:35 runaway avocado Peter Maydell
@ 2020-10-26 22:43 ` Philippe Mathieu-Daudé
  2020-10-27  0:28   ` Cleber Rosa
  2021-02-05 19:23 ` Peter Maydell
  1 sibling, 1 reply; 11+ messages in thread
From: Philippe Mathieu-Daudé @ 2020-10-26 22:43 UTC (permalink / raw)
  To: Peter Maydell, avocado-devel, Cleber Rosa, Eduardo Habkost
  Cc: Alex Bennée, QEMU Developers, Pavel Dovgalyuk

Cc'ing avocado-devel@

On 10/26/20 11:35 PM, Peter Maydell wrote:
> So, I somehow ended up with this process still running on my
> local machine after a (probably failed) 'make check-acceptance':
> 
> petmay01 13710 99.7  3.7 2313448 1235780 pts/16 Sl  16:10 378:00
> ./qemu-system-aarch64 -display none -vga none -chardev
> socket,id=mon,path=/var/tmp/tmp5szft2yi/qemu-13290-monitor.sock -mon
> chardev=mon,mode=control -machine virt -chardev
> socket,id=console,path=/var/tmp/tmp5szft2yi/qemu-13290-console.sock,server,nowait
> -serial chardev:console -icount
> shift=7,rr=record,rrfile=/var/tmp/avocado_iv8dehpo/avocado_job_w9efukj5/32-tests_acceptance_reverse_debugging.py_ReverseDebugging_AArch64.test_aarch64_virt/replay.bin,rrsnapshot=init
> -net none -drive
> file=/var/tmp/avocado_iv8dehpo/avocado_job_w9efukj5/32-tests_acceptance_reverse_debugging.py_ReverseDebugging_AArch64.test_aarch64_virt/disk.qcow2,if=none
> -kernel /home/petmay01/avocado/data/cache/by_location/a00ac4ae676ef0322126abd2f7d38f50cc9cbc95/vmlinuz
> -cpu cortex-a53
> 
> and it was continuing to log to a deleted file
> /var/tmp/avocado_iv8dehpo/avocado_job_w9efukj5/32-tests_acceptance_reverse_debugging.py_ReverseDebugging_AArch64.test_aarch64_virt/replay.bin
> 
> which was steadily eating my disk space and got up to nearly 100GB
> in used disk (invisible to du, of course, since it was an unlinked
> file) before I finally figured out what was going on and killed it
> about six hours later...
> 
> Any suggestions for how we might improve the robustness of the
> relevant test ?
> 
> thanks
> -- PMM
> 




* Re: runaway avocado
  2020-10-26 22:43 ` Philippe Mathieu-Daudé
@ 2020-10-27  0:28   ` Cleber Rosa
  2020-12-07 20:45     ` John Snow
  0 siblings, 1 reply; 11+ messages in thread
From: Cleber Rosa @ 2020-10-27  0:28 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: Peter Maydell, Eduardo Habkost, QEMU Developers, avocado-devel,
	Pavel Dovgalyuk, Alex Bennée


On Mon, Oct 26, 2020 at 11:43:36PM +0100, Philippe Mathieu-Daudé wrote:
> Cc'ing avocado-devel@
> 
> On 10/26/20 11:35 PM, Peter Maydell wrote:
> > So, I somehow ended up with this process still running on my
> > local machine after a (probably failed) 'make check-acceptance':
> > 
> > petmay01 13710 99.7  3.7 2313448 1235780 pts/16 Sl  16:10 378:00
> > ./qemu-system-aarch64 -display none -vga none -chardev
> > socket,id=mon,path=/var/tmp/tmp5szft2yi/qemu-13290-monitor.sock -mon
> > chardev=mon,mode=control -machine virt -chardev
> > socket,id=console,path=/var/tmp/tmp5szft2yi/qemu-13290-console.sock,server,nowait
> > -serial chardev:console -icount
> > shift=7,rr=record,rrfile=/var/tmp/avocado_iv8dehpo/avocado_job_w9efukj5/32-tests_acceptance_reverse_debugging.py_ReverseDebugging_AArch64.test_aarch64_virt/replay.bin,rrsnapshot=init
> > -net none -drive
> > file=/var/tmp/avocado_iv8dehpo/avocado_job_w9efukj5/32-tests_acceptance_reverse_debugging.py_ReverseDebugging_AArch64.test_aarch64_virt/disk.qcow2,if=none
> > -kernel /home/petmay01/avocado/data/cache/by_location/a00ac4ae676ef0322126abd2f7d38f50cc9cbc95/vmlinuz
> > -cpu cortex-a53
> > 
> > and it was continuing to log to a deleted file
> > /var/tmp/avocado_iv8dehpo/avocado_job_w9efukj5/32-tests_acceptance_reverse_debugging.py_ReverseDebugging_AArch64.test_aarch64_virt/replay.bin
> > 
> > which was steadily eating my disk space and got up to nearly 100GB
> > in used disk (invisible to du, of course, since it was an unlinked
> > file) before I finally figured out what was going on and killed it
> > about six hours later...
> >

Ouch!

> > Any suggestions for how we might improve the robustness of the
> > relevant test ?
> >

While this test may be less robust/reliable than others, the core
issue is that the automatic shutdown of the QEMU "vms" can be
improved.  My best guess is that this specific test ended in ERROR,
and (or because?) the tearDown() method failed to end these processes.

All tests can be improved at once by adding a second, even more
forceful round of shutdown.  Currently the process gets, in the worst
case scenario, a SIGKILL.
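
The general escalation pattern (ask politely, wait, then force) can be
sketched in shell. This is an illustration only, not the shutdown code
in QEMU's Python test libraries; stop_with_escalation and its default
grace period are hypothetical:

```shell
# stop_with_escalation PID [GRACE_SECONDS]
# Hypothetical helper: send SIGTERM, poll for up to GRACE_SECONDS
# (default 5), then fall back to SIGKILL if the process is still there.
stop_with_escalation() {
    pid=$1
    grace=${2:-5}
    # If SIGTERM cannot be delivered, the process is already gone.
    kill -TERM "$pid" 2>/dev/null || return 0
    while [ "$grace" -gt 0 ]; do
        # kill -0 only probes for existence; it sends no signal.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        grace=$((grace - 1))
    done
    # Last resort: SIGKILL cannot be caught or ignored.
    kill -KILL "$pid" 2>/dev/null
    return 0
}
```

A helper like this only works if the caller still knows the PID, so it
does not cover a process the test harness has lost track of.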

But, in addition to that, an upper layer above the test could be given
the responsibility to look for and clean up resources initiated by a
test.  The Avocado job has hooks for running callbacks right before
its own process exits, but, with the new Avocado architecture (AKA "N(ext)
Runner") this should probably be implemented as async cleanup actions
that begin right after a test ends.

I'll give the "second, more forceful round of shutdown" approach some
thought and testing and, in addition to that, open an issue to track the
upper layer resource cleanup on Avocado.

Thanks,
- Cleber.

> > thanks
> > -- PMM
> > 
> 
> 


* Re: runaway avocado
  2020-10-27  0:28   ` Cleber Rosa
@ 2020-12-07 20:45     ` John Snow
  0 siblings, 0 replies; 11+ messages in thread
From: John Snow @ 2020-12-07 20:45 UTC (permalink / raw)
  To: Cleber Rosa, Philippe Mathieu-Daudé
  Cc: Peter Maydell, Eduardo Habkost, QEMU Developers, avocado-devel,
	Pavel Dovgalyuk, Alex Bennée

On 10/26/20 8:28 PM, Cleber Rosa wrote:
> On Mon, Oct 26, 2020 at 11:43:36PM +0100, Philippe Mathieu-Daudé wrote:
>> Cc'ing avocado-devel@
>>
>> On 10/26/20 11:35 PM, Peter Maydell wrote:
>>> So, I somehow ended up with this process still running on my
>>> local machine after a (probably failed) 'make check-acceptance':
>>>
>>> petmay01 13710 99.7  3.7 2313448 1235780 pts/16 Sl  16:10 378:00
>>> ./qemu-system-aarch64 -display none -vga none -chardev
>>> socket,id=mon,path=/var/tmp/tmp5szft2yi/qemu-13290-monitor.sock -mon
>>> chardev=mon,mode=control -machine virt -chardev
>>> socket,id=console,path=/var/tmp/tmp5szft2yi/qemu-13290-console.sock,server,nowait
>>> -serial chardev:console -icount
>>> shift=7,rr=record,rrfile=/var/tmp/avocado_iv8dehpo/avocado_job_w9efukj5/32-tests_acceptance_reverse_debugging.py_ReverseDebugging_AArch64.test_aarch64_virt/replay.bin,rrsnapshot=init
>>> -net none -drive
>>> file=/var/tmp/avocado_iv8dehpo/avocado_job_w9efukj5/32-tests_acceptance_reverse_debugging.py_ReverseDebugging_AArch64.test_aarch64_virt/disk.qcow2,if=none
>>> -kernel /home/petmay01/avocado/data/cache/by_location/a00ac4ae676ef0322126abd2f7d38f50cc9cbc95/vmlinuz
>>> -cpu cortex-a53
>>>
>>> and it was continuing to log to a deleted file
>>> /var/tmp/avocado_iv8dehpo/avocado_job_w9efukj5/32-tests_acceptance_reverse_debugging.py_ReverseDebugging_AArch64.test_aarch64_virt/replay.bin
>>>
>>> which was steadily eating my disk space and got up to nearly 100GB
>>> in used disk (invisible to du, of course, since it was an unlinked
>>> file) before I finally figured out what was going on and killed it
>>> about six hours later...
>>>
> 
> Ouch!
> 
>>> Any suggestions for how we might improve the robustness of the
>>> relevant test ?
>>>
> 
> While this test may be less robust/reliable than others, the core
> issue is that the automatic shutdown of the QEMU "vms" can be
> improved.  My best guess is that this specific test ended in ERROR,
> and (or because?) the tearDown() method failed to end these processes.
> 
> All tests can be improved at once by adding a second, even more
> forceful round of shutdown.  Currently the process gets, in the worst
> case scenario, a SIGKILL.
> 
> But, in addition to that, an upper layer above the test could be given
> the responsibility to look for and clean up resources initiated by a
> test.  The Avocado job has hooks for running callbacks right before
> its own process exits, but, with the new Avocado architecture (AKA "N(ext)
> Runner") this should probably be implemented as async cleanup actions
> that begin right after a test ends.
> 
> I'll give the "second, more forceful round of shutdown" approach some
> thought and testing and, in addition to that, open an issue to track the
> upper layer resource cleanup on Avocado.
> 

machine.py should have a timeout that it adheres to, unless it was 
disabled explicitly -- then I guess it can't help you.

--js




* Re: runaway avocado
  2020-10-26 22:35 runaway avocado Peter Maydell
  2020-10-26 22:43 ` Philippe Mathieu-Daudé
@ 2021-02-05 19:23 ` Peter Maydell
  2021-02-11 17:25   ` Cleber Rosa
  1 sibling, 1 reply; 11+ messages in thread
From: Peter Maydell @ 2021-02-05 19:23 UTC (permalink / raw)
  To: QEMU Developers; +Cc: Alex Bennée

On Mon, 26 Oct 2020 at 22:35, Peter Maydell <peter.maydell@linaro.org> wrote:
>
> So, I somehow ended up with this process still running on my
> local machine after a (probably failed) 'make check-acceptance':
>
> petmay01 13710 99.7  3.7 2313448 1235780 pts/16 Sl  16:10 378:00
> ./qemu-system-aarch64 -display none -vga none -chardev
> socket,id=mon,path=/var/tmp/tmp5szft2yi/qemu-13290-monitor.sock -mon
> chardev=mon,mode=control -machine virt -chardev
> socket,id=console,path=/var/tmp/tmp5szft2yi/qemu-13290-console.sock,server,nowait
> -serial chardev:console -icount
> shift=7,rr=record,rrfile=/var/tmp/avocado_iv8dehpo/avocado_job_w9efukj5/32-tests_acceptance_reverse_debugging.py_ReverseDebugging_AArch64.test_aarch64_virt/replay.bin,rrsnapshot=init
> -net none -drive
> file=/var/tmp/avocado_iv8dehpo/avocado_job_w9efukj5/32-tests_acceptance_reverse_debugging.py_ReverseDebugging_AArch64.test_aarch64_virt/disk.qcow2,if=none
> -kernel /home/petmay01/avocado/data/cache/by_location/a00ac4ae676ef0322126abd2f7d38f50cc9cbc95/vmlinuz
> -cpu cortex-a53
>
> and it was continuing to log to a deleted file
> /var/tmp/avocado_iv8dehpo/avocado_job_w9efukj5/32-tests_acceptance_reverse_debugging.py_ReverseDebugging_AArch64.test_aarch64_virt/replay.bin
>
> which was steadily eating my disk space and got up to nearly 100GB
> in used disk (invisible to du, of course, since it was an unlinked
> file) before I finally figured out what was going on and killed it
> about six hours later...

Just got hit by this test framework bug again :-( Same thing,
runaway avocado record-and-replay test ate all my disk space.

-- PMM



* Re: runaway avocado
  2021-02-05 19:23 ` Peter Maydell
@ 2021-02-11 17:25   ` Cleber Rosa
  2021-02-11 17:37     ` Peter Maydell
  0 siblings, 1 reply; 11+ messages in thread
From: Cleber Rosa @ 2021-02-11 17:25 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Alex Bennée, QEMU Developers


On Fri, Feb 05, 2021 at 07:23:22PM +0000, Peter Maydell wrote:
> On Mon, 26 Oct 2020 at 22:35, Peter Maydell <peter.maydell@linaro.org> wrote:
> >
> > So, I somehow ended up with this process still running on my
> > local machine after a (probably failed) 'make check-acceptance':
> >
> > petmay01 13710 99.7  3.7 2313448 1235780 pts/16 Sl  16:10 378:00
> > ./qemu-system-aarch64 -display none -vga none -chardev
> > socket,id=mon,path=/var/tmp/tmp5szft2yi/qemu-13290-monitor.sock -mon
> > chardev=mon,mode=control -machine virt -chardev
> > socket,id=console,path=/var/tmp/tmp5szft2yi/qemu-13290-console.sock,server,nowait
> > -serial chardev:console -icount
> > shift=7,rr=record,rrfile=/var/tmp/avocado_iv8dehpo/avocado_job_w9efukj5/32-tests_acceptance_reverse_debugging.py_ReverseDebugging_AArch64.test_aarch64_virt/replay.bin,rrsnapshot=init
> > -net none -drive
> > file=/var/tmp/avocado_iv8dehpo/avocado_job_w9efukj5/32-tests_acceptance_reverse_debugging.py_ReverseDebugging_AArch64.test_aarch64_virt/disk.qcow2,if=none
> > -kernel /home/petmay01/avocado/data/cache/by_location/a00ac4ae676ef0322126abd2f7d38f50cc9cbc95/vmlinuz
> > -cpu cortex-a53
> >
> > and it was continuing to log to a deleted file
> > /var/tmp/avocado_iv8dehpo/avocado_job_w9efukj5/32-tests_acceptance_reverse_debugging.py_ReverseDebugging_AArch64.test_aarch64_virt/replay.bin
> >
> > which was steadily eating my disk space and got up to nearly 100GB
> > in used disk (invisible to du, of course, since it was an unlinked
> > file) before I finally figured out what was going on and killed it
> > about six hours later...
> 
> Just got hit by this test framework bug again :-( Same thing,
> runaway avocado record-and-replay test ate all my disk space.
> 
> -- PMM
> 

Hi Peter,

I'm sorry this caused you trouble again.

IIUC, this specific issue was caused by a runaway QEMU.  Granted, it was
started by an Avocado test.  I've opened a bug report to look into the
possibilities to mitigate or prevent this from happening again:

   https://bugs.launchpad.net/qemu/+bug/1915431

The bug report contains a bit more context into why Avocado does not
try to kill all processes started by a test by default.

BTW, we've been working with Pavel on identifying issues with
replay/reverse features that are causing test failures.  So far,
I've seen a couple of issues that may be related to this runaway
QEMU writing to the replay.bin file.

Regards,
- Cleber.


* Re: runaway avocado
  2021-02-11 17:25   ` Cleber Rosa
@ 2021-02-11 17:37     ` Peter Maydell
  2021-02-11 18:47       ` Cleber Rosa
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Maydell @ 2021-02-11 17:37 UTC (permalink / raw)
  To: Cleber Rosa; +Cc: Alex Bennée, QEMU Developers

On Thu, 11 Feb 2021 at 17:25, Cleber Rosa <crosa@redhat.com> wrote:
> IIUC, this specific issue was caused by a runaway QEMU.  Granted, it was
> started by an Avocado test.  I've opened a bug report to look into the
> possibilities to mitigate or prevent this from happening again:

I wonder if we could have avocado run all our acceptance cases
under a 'ulimit -f' setting that restricts the amount of disk
space they can use? That would restrict the damage that could
be done by any runaways. A CPU usage limit might also be good.
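
A minimal sketch of that idea, assuming a POSIX shell. The wrapper name
run_with_file_limit is hypothetical, not an existing Avocado or QEMU
script. Note that 'ulimit -f' caps the size of each file a process
writes rather than its total disk usage, and that the unit (512-byte or
1024-byte blocks) depends on the shell:

```shell
# run_with_file_limit BLOCKS CMD [ARGS...]
# Hypothetical wrapper: run CMD with a cap on the size of any single
# file it (or its children) may write. The limit is set in a subshell,
# so the calling shell keeps its own limits.
run_with_file_limit() {
    limit_blocks=$1
    shift
    (
        # Block size is shell dependent: 512 or 1024 bytes.
        ulimit -f "$limit_blocks" || exit 125
        # A 'ulimit -t SECONDS' line here would add a CPU-time cap.
        exec "$@"
    )
}
```

With this, 'run_with_file_limit 4194304 make check-acceptance' would cap
each file at 2 or 4 GiB depending on the shell's block size. A process
that tries to write past the cap receives SIGXFSZ, which terminates it
by default.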

thanks
-- PMM



* Re: runaway avocado
  2021-02-11 17:37     ` Peter Maydell
@ 2021-02-11 18:47       ` Cleber Rosa
  2021-02-11 19:21         ` Peter Maydell
  0 siblings, 1 reply; 11+ messages in thread
From: Cleber Rosa @ 2021-02-11 18:47 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Alex Bennée, QEMU Developers


On Thu, Feb 11, 2021 at 05:37:20PM +0000, Peter Maydell wrote:
> On Thu, 11 Feb 2021 at 17:25, Cleber Rosa <crosa@redhat.com> wrote:
> > IIUC, this specific issue was caused by a runaway QEMU.  Granted, it was
> > started by an Avocado test.  I've opened a bug report to look into the
> > possibilities to mitigate or prevent this from happening again:
> 
> I wonder if we could have avocado run all our acceptance cases
> under a 'ulimit -f' setting that restricts the amount of disk
> space they can use? That would restrict the damage that could
> be done by any runaways. A CPU usage limit might also be good.
> 
> thanks
> -- PMM
> 

To me that sounds a lot like Linux cgroups.

I can see either someone setting up cgroups and having Avocado
run in it (then all tests inherit from this common parent),
or alternatively Avocado setting up cgroups for each of the
tests.

The former seems simpler and effective wrt preventing exhaustion of
system resources.  I can see a use case for the latter when tests actually
want to verify a behavior when certain resources are constrained.

We can have a script setting up a cgroup as part of a
gitlab-ci.{yml,d} job for the jobs that will run on the non-shared
GitLab runners (such as the s390 and aarch64 machines owned by the
QEMU project).

Does this sound like a solution?

Thanks,
- Cleber.


* Re: runaway avocado
  2021-02-11 18:47       ` Cleber Rosa
@ 2021-02-11 19:21         ` Peter Maydell
  2021-02-11 23:59           ` Philippe Mathieu-Daudé
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Maydell @ 2021-02-11 19:21 UTC (permalink / raw)
  To: Cleber Rosa; +Cc: Alex Bennée, QEMU Developers

On Thu, 11 Feb 2021 at 18:47, Cleber Rosa <crosa@redhat.com> wrote:
> On Thu, Feb 11, 2021 at 05:37:20PM +0000, Peter Maydell wrote:
> > I wonder if we could have avocado run all our acceptance cases
> > under a 'ulimit -f' setting that restricts the amount of disk
> > space they can use? That would restrict the damage that could
> > be done by any runaways. A CPU usage limit might also be good.

> To me that sounds a lot like Linux cgroups.

...except that ulimits are a well-established mechanism that
is straightforward, works for any user and is cross-platform
for most Unixes, whereas cgroups are complicated, Linux specific,
and AIUI require root access to set them up and configure them.

> We can have a script setting up a cgroup as part of a
> gitlab-ci.{yml,d} job for the jobs that will run on the non-shared
> GitLab runners (such as the s390 and aarch64 machines owned by the
> QEMU project).
>
> Does this sound like a solution?

We want a solution that works for anybody running
"make check-acceptance" in any situation, not just for
the CI runners.

thanks
-- PMM



* Re: runaway avocado
  2021-02-11 19:21         ` Peter Maydell
@ 2021-02-11 23:59           ` Philippe Mathieu-Daudé
  2021-02-12  2:31             ` Cleber Rosa
  0 siblings, 1 reply; 11+ messages in thread
From: Philippe Mathieu-Daudé @ 2021-02-11 23:59 UTC (permalink / raw)
  To: Cleber Rosa
  Cc: Lukáš Doktor, Peter Maydell, Yonggang Luo,
	Alex Bennée, QEMU Developers

On 2/11/21 8:21 PM, Peter Maydell wrote:
> On Thu, 11 Feb 2021 at 18:47, Cleber Rosa <crosa@redhat.com> wrote:
>> On Thu, Feb 11, 2021 at 05:37:20PM +0000, Peter Maydell wrote:
>>> I wonder if we could have avocado run all our acceptance cases
>>> under a 'ulimit -f' setting that restricts the amount of disk
>>> space they can use? That would restrict the damage that could
>>> be done by any runaways. A CPU usage limit might also be good.
> 
>> To me that sounds a lot like Linux cgroups.
> 
> ...except that ulimits are a well-established mechanism that
> is straightforward, works for any user and is cross-platform
> for most Unixes, whereas cgroups are complicated, Linux specific,
> and AIUI require root access to set them up and configure them.

I agree with Peter: being POSIX compliant is better than
restricting to (recent) Linux. But also note we have users interested
in running tests for Windows builds. See the Cirrus-CI.

> 
>> We can have a script setting up a cgroup as part of a
>> gitlab-ci.{yml,d} job for the jobs that will run on the non-shared
>> GitLab runners (such as the s390 and aarch64 machines owned by the
>> QEMU project).
>>
>> Does this sound like a solution?
> 
> We want a solution that works for anybody running
> "make check-acceptance" in any situation, not just for
> the CI runners.

Indeed. Public CI time being limited, I expect users to run tests
elsewhere. We don't mind about data loss on CI runners.

FWIW, a similar complaint from last year:
https://www.mail-archive.com/qemu-devel@nongnu.org/msg672277.html

Regards,

Phil.




* Re: runaway avocado
  2021-02-11 23:59           ` Philippe Mathieu-Daudé
@ 2021-02-12  2:31             ` Cleber Rosa
  0 siblings, 0 replies; 11+ messages in thread
From: Cleber Rosa @ 2021-02-12  2:31 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: Lukáš Doktor, Peter Maydell, Yonggang Luo,
	Alex Bennée, QEMU Developers


On Fri, Feb 12, 2021 at 12:59:23AM +0100, Philippe Mathieu-Daudé wrote:
> On 2/11/21 8:21 PM, Peter Maydell wrote:
> > On Thu, 11 Feb 2021 at 18:47, Cleber Rosa <crosa@redhat.com> wrote:
> >> On Thu, Feb 11, 2021 at 05:37:20PM +0000, Peter Maydell wrote:
> >>> I wonder if we could have avocado run all our acceptance cases
> >>> under a 'ulimit -f' setting that restricts the amount of disk
> >>> space they can use? That would restrict the damage that could
> >>> be done by any runaways. A CPU usage limit might also be good.
> > 
> >> To me that sounds a lot like Linux cgroups.
> > 
> > ...except that ulimits are a well-established mechanism that
> > is straightforward, works for any user and is cross-platform
> > for most Unixes, whereas cgroups are complicated, Linux specific,
> > and AIUI require root access to set them up and configure them.
> 
> I agree with Peter: being POSIX compliant is better than
> restricting to (recent) Linux. But also note we have users interested
> in running tests for Windows builds. See the Cirrus-CI.
> 

Sure, I feel like cgroups are more comprehensive, but they definitely
have the drawbacks you both listed.

> > 
> >> We can have a script setting up a cgroup as part of a
> >> gitlab-ci.{yml,d} job for the jobs that will run on the non-shared
> >> GitLab runners (such as the s390 and aarch64 machines owned by the
> >> QEMU project).
> >>
> >> Does this sound like a solution?
> > 
> > We want a solution that works for anybody running
> > "make check-acceptance" in any situation, not just for
> > the CI runners.
> 
> Indeed. Public CI time being limited, I expect users to run tests
> elsewhere. We don't mind about data loss on CI runners.
>

That was kind of my point.  We want to use all the resources the
GitLab CI shared runners give us, so extra limit enforcements make no
sense to me.  Also, on my personal machines, I prefer to have
faster test turnarounds, so putting extra limits is not beneficial to
me.  YMMV, so my opinion is that this should be an opt-in, *not*
enabled by default.

My initial take on this is that we can have a few pre-defined scripts
that set those limits.  Users get to activate those profiles by
name if, say, a given environment variable is set.  Something like:

  RESOURCE_LIMIT_PROFILE=low_cpu_4g_files
  # If a profile is named, run the wrapped command through its script.
  if [ -n "$RESOURCE_LIMIT_PROFILE" ]; then
      ./scripts/limit-resources/"$RESOURCE_LIMIT_PROFILE" "$@"
  fi

> FWIW, a similar complaint from last year:
> https://www.mail-archive.com/qemu-devel@nongnu.org/msg672277.html
>

The specific issue of Avocado's cache size should be addressed in this
development cycle, and a solution available on 86.0.  It's being tracked
here:

  https://github.com/avocado-framework/avocado/issues/4311

Now, in Peter's case, it was QEMU writing to a replay.bin file, and I
don't see a practical way that Avocado could limit the overall disk
space usage by whatever gets run in a test unless disk quotas are
set.  Not sure if this belongs on a test framework though.

Cheers,
- Cleber.

> Regards,
> 
> Phil.
> 


end of thread, other threads:[~2021-02-12  2:32 UTC | newest]

Thread overview: 11+ messages
2020-10-26 22:35 runaway avocado Peter Maydell
2020-10-26 22:43 ` Philippe Mathieu-Daudé
2020-10-27  0:28   ` Cleber Rosa
2020-12-07 20:45     ` John Snow
2021-02-05 19:23 ` Peter Maydell
2021-02-11 17:25   ` Cleber Rosa
2021-02-11 17:37     ` Peter Maydell
2021-02-11 18:47       ` Cleber Rosa
2021-02-11 19:21         ` Peter Maydell
2021-02-11 23:59           ` Philippe Mathieu-Daudé
2021-02-12  2:31             ` Cleber Rosa
