debugging functional tests that only fail in 'make check-functional'

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* debugging functional tests that only fail in 'make check-functional'
@ 2025-02-20 18:39 Peter Maydell
  2025-02-21 12:54 ` Thomas Huth
  0 siblings, 1 reply; 5+ messages in thread
From: Peter Maydell @ 2025-02-20 18:39 UTC (permalink / raw)
  To: QEMU Developers; +Cc: Thomas Huth

I'm trying to debug some functional tests that fail for me
with 'make check-functional' on a debug build. Consistently
(well, same set of tests in two runs) when I run
'make -j8 check-functional' these fail:

 7/44 qemu:func-thorough+func-arm-thorough+thorough / func-arm-arm_sx1
                        ERROR           173.31s   exit status 1
10/44 qemu:func-thorough+func-aarch64-thorough+thorough /
func-aarch64-aarch64_virt            TIMEOUT         720.04s   killed
by signal 15 SIGTERM
11/44 qemu:func-thorough+func-arm-thorough+thorough /
func-arm-arm_aspeed_ast2600              TIMEOUT         720.07s
killed by signal 15 SIGTERM
12/44 qemu:func-thorough+func-aarch64-thorough+thorough /
func-aarch64-aarch64_sbsaref_alpine  TIMEOUT         720.07s   killed
by signal 15 SIGTERM
40/44 qemu:func-thorough+func-arm-thorough+thorough /
func-arm-arm_aspeed_ast2500              TIMEOUT         480.01s
killed by signal 15 SIGTERM

The aarch64-virt one is gpu issue, so I know about that one.
The others pass OK on a clang no-debug sanitizer build.

If I try to run just the sx1 tests "by hand":

$ (cd build/x86 && PYTHONPATH=../../python:../../tests/functional
QEMU_TEST_QEMU_BINARY=./qemu-system-arm ./pyvenv/bin/python3
../../tests/functional/test_arm_sx1.py)
TAP version 13
ok 1 test_arm_sx1.SX1Test.test_arm_sx1_flash
ok 2 test_arm_sx1.SX1Test.test_arm_sx1_initrd
ok 3 test_arm_sx1.SX1Test.test_arm_sx1_sd
1..3

they pass; but inside the test framework that third sd test
errors: testlog-thorough.txt says:

===begin===
----------------------------------- stdout -----------------------------------
TAP version 13
ok 1 test_arm_sx1.SX1Test.test_arm_sx1_flash
ok 2 test_arm_sx1.SX1Test.test_arm_sx1_initrd
not ok 3 test_arm_sx1.SX1Test.test_arm_sx1_sd
1..3
----------------------------------- stderr -----------------------------------
Traceback (most recent call last):
  File "/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/python/qemu/machine/machine.py",
line 611, in _do_shutdown
    self._soft_shutdown(timeout)
  File "/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/python/qemu/machine/machine.py",
line 596, in _soft_shutdown
    self._subp.wait(timeout=timeout)
  File "/usr/lib/python3.12/subprocess.py", line 1264, in wait
    return self._wait(timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 2045, in _wait
    raise TimeoutExpired(self.args, timeout)
subprocess.TimeoutExpired: Command
'('/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/x86/qemu-system-arm',
'-display', 'none', '-vga', 'none', '-chardev', 'socket,id=mon,fd=5',
'-mon', 'chardev=mon,mode=control', '-machine', 'sx1', '-chardev',
'socket,id=console,fd=10', '-serial', 'chardev:console', '-append',
'kunit.enable=0 root=/dev/mmcblk0 rootwait console=ttyS0,115200
earlycon=uart8250,mmio32,0xfffb0000,115200n8', '-no-reboot',
'-snapshot', '-drive',
'format=raw,if=sd,file=/home/petmay01/.cache/qemu/download/c1db7f43ef92469ebc8605013728c8950e7608439f01d13678994f0ce101c3a8',
'-kernel', '/home/petmay01/.cache/qemu/download/a0271899a8dc2165f9e0adb2d0a57fc839ae3a469722ffc56c77e108a8887615')'
timed out after 60 seconds

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/tests/functional/test_arm_sx1.py",
line 58, in test_arm_sx1_sd
    self.vm.wait(timeout=60)
  File "/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/python/qemu/machine/machine.py",
line 666, in wait
    self.shutdown(timeout=timeout)
  File "/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/python/qemu/machine/machine.py",
line 648, in shutdown
    self._do_shutdown(timeout)
  File "/mnt/nvmedisk/linaro/qemu-from-laptop/qemu/python/qemu/machine/machine.py",
line 618, in _do_shutdown
    raise AbnormalShutdown("Could not perform graceful shutdown") \
qemu.machine.machine.AbnormalShutdown: Could not perform graceful shutdown

More information on test_arm_sx1.SX1Test.test_arm_sx1_sd could be found here:
 /mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/x86/tests/functional/arm/test_arm_sx1.SX1Test.test_arm_sx1_sd/base.log
 /mnt/nvmedisk/linaro/qemu-from-laptop/qemu/build/x86/tests/functional/arm/test_arm_sx1.SX1Test.test_arm_sx1_sd/console.log

(test program exited with status code 1)
===endit===

which I interpret to mean "we waited the 60 seconds the test says,
but the test didn't exit within that time".

Any suggestions for how to debug?

(Also the console.log is empty regardless of whether the
test passes or fails; this doesn't seem right.)

thanks
-- PMM


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: debugging functional tests that only fail in 'make check-functional'
  2025-02-20 18:39 debugging functional tests that only fail in 'make check-functional' Peter Maydell
@ 2025-02-21 12:54 ` Thomas Huth
  2025-02-21 13:00   ` Peter Maydell
  2025-02-21 13:08   ` Thomas Huth
  0 siblings, 2 replies; 5+ messages in thread
From: Thomas Huth @ 2025-02-21 12:54 UTC (permalink / raw)
  To: Peter Maydell, QEMU Developers; +Cc: Daniel P. Berrange

On 20/02/2025 19.39, Peter Maydell wrote:
> I'm trying to debug some functional tests that fail for me
> with 'make check-functional' on a debug build. Consistently
> (well, same set of tests in two runs) when I run
> 'make -j8 check-functional' these fail:
> 
>   7/44 qemu:func-thorough+func-arm-thorough+thorough / func-arm-arm_sx1
>                          ERROR           173.31s   exit status 1
> 10/44 qemu:func-thorough+func-aarch64-thorough+thorough /
> func-aarch64-aarch64_virt            TIMEOUT         720.04s   killed
> by signal 15 SIGTERM
> 11/44 qemu:func-thorough+func-arm-thorough+thorough /
> func-arm-arm_aspeed_ast2600              TIMEOUT         720.07s
> killed by signal 15 SIGTERM
> 12/44 qemu:func-thorough+func-aarch64-thorough+thorough /
> func-aarch64-aarch64_sbsaref_alpine  TIMEOUT         720.07s   killed
> by signal 15 SIGTERM
> 40/44 qemu:func-thorough+func-arm-thorough+thorough /
> func-arm-arm_aspeed_ast2500              TIMEOUT         480.01s
> killed by signal 15 SIGTERM
> 
> The aarch64-virt one is gpu issue, so I know about that one.
> The others pass OK on a clang no-debug sanitizer build.
> 
> If I try to run just the sx1 tests "by hand":
> 
> $ (cd build/x86 && PYTHONPATH=../../python:../../tests/functional
> QEMU_TEST_QEMU_BINARY=./qemu-system-arm ./pyvenv/bin/python3
> ../../tests/functional/test_arm_sx1.py)
> TAP version 13
> ok 1 test_arm_sx1.SX1Test.test_arm_sx1_flash
> ok 2 test_arm_sx1.SX1Test.test_arm_sx1_initrd
> ok 3 test_arm_sx1.SX1Test.test_arm_sx1_sd
> 1..3
> 
> they pass; but inside the test framework that third sd test
> errors: testlog-thorough.txt says:
[...]
> timed out after 60 seconds
[...]
> which I interpret to mean "we waited the 60 seconds the test says,
> but the test didn't exit within that time".
> 
> Any suggestions for how to debug?

Some TCG-based tests are slowing down very much when running on a shared hyperthreaded CPU ... Do you have 8 real cores in your system, or rather 4 real cores with 2 SMT threads each? In the latter case, have a try whether "make -j4" works better.

We apparently also increased the timeout in this test in the past already, see commit 92ee59bf56ba42954166e56ab112afe10f3c7556 ... does it work better if you increase the timeout even further?

> (Also the console.log is empty regardless of whether the
> test passes or fails; this doesn't seem right.)

I think we only log the console output when we look for strings
in the output. Since this test does not look for any strings,
there is no log.
Something like this causes some log to be generated:

diff --git a/tests/functional/test_arm_sx1.py b/tests/functional/test_arm_sx1.py
--- a/tests/functional/test_arm_sx1.py
+++ b/tests/functional/test_arm_sx1.py
@@ -43,7 +43,8 @@ def test_arm_sx1_initrd(self):
          self.vm.add_args('-append', f'kunit.enable=0 rdinit=/sbin/init {self.CONSOLE_ARGS}')
          self.vm.add_args('-no-reboot')
          self.launch_kernel(zimage_path,
-                           initrd=initrd_path)
+                           initrd=initrd_path,
+                           wait_for='Boot successful')
          self.vm.wait(timeout=60)
  
      def test_arm_sx1_sd(self):
@@ -54,7 +55,7 @@ def test_arm_sx1_sd(self):
          self.vm.add_args('-no-reboot')
          self.vm.add_args('-snapshot')
          self.vm.add_args('-drive', f'format=raw,if=sd,file={sd_fs_path}')
-        self.launch_kernel(zimage_path)
+        self.launch_kernel(zimage_path, wait_for='Boot successful')
          self.vm.wait(timeout=60)
  
      def test_arm_sx1_flash(self):
@@ -65,7 +66,7 @@ def test_arm_sx1_flash(self):
          self.vm.add_args('-no-reboot')
          self.vm.add_args('-snapshot')
          self.vm.add_args('-drive', f'format=raw,if=pflash,file={flash_path}')
-        self.launch_kernel(zimage_path)
+        self.launch_kernel(zimage_path, wait_for='Boot successful')
          self.vm.wait(timeout=60)
  
  if __name__ == '__main__':

... but maybe we should also provide a knob to flush the serial console
when tearing down the test setup?

  Thomas



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: debugging functional tests that only fail in 'make check-functional'
  2025-02-21 12:54 ` Thomas Huth
@ 2025-02-21 13:00   ` Peter Maydell
  2025-02-21 13:08   ` Thomas Huth
  1 sibling, 0 replies; 5+ messages in thread
From: Peter Maydell @ 2025-02-21 13:00 UTC (permalink / raw)
  To: Thomas Huth; +Cc: QEMU Developers, Daniel P. Berrange

On Fri, 21 Feb 2025 at 12:54, Thomas Huth <thuth@redhat.com> wrote:
>
> On 20/02/2025 19.39, Peter Maydell wrote:
> > Any suggestions for how to debug?
>
> Some TCG-based tests are slowing down very much when running on a shared hyperthreaded CPU ... Do you have 8 real cores in your system, or rather 4 real cores with 2 SMT threads each? In the latter case, have a try whether "make -j4" works better.

Yeah, it is a hyperthreaded 2-threads-per-core setup.
-j4 is a bit better but a few things still time out.
I'm currently playing with bumping the timeouts.

> > (Also the console.log is empty regardless of whether the
> > test passes or fails; this doesn't seem right.)
>
> I think we only log the console output when we look for strings
> in the output. Since this test does not look for any strings,
> there is no log.

That is very surprising to me. I think we should always
log all the guest output, because it's often an important
clue about why a test failed. (Ideally we'd log it in some
way that let you match up guest output with the logging
in base.log -- currently because of the split into three
different log files you can't tell what the relationship
is between e.g. QEMU stderr output and the test's progress
or the test progress and the guest console output.)

thanks
-- PMM


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: debugging functional tests that only fail in 'make check-functional'
  2025-02-21 12:54 ` Thomas Huth
  2025-02-21 13:00   ` Peter Maydell
@ 2025-02-21 13:08   ` Thomas Huth
  2025-02-21 13:24     ` Peter Maydell
  1 sibling, 1 reply; 5+ messages in thread
From: Thomas Huth @ 2025-02-21 13:08 UTC (permalink / raw)
  To: Peter Maydell, QEMU Developers; +Cc: Daniel P. Berrange

On 21/02/2025 13.54, Thomas Huth wrote:
> On 20/02/2025 19.39, Peter Maydell wrote:
>> I'm trying to debug some functional tests that fail for me
>> with 'make check-functional' on a debug build. Consistently
>> (well, same set of tests in two runs) when I run
>> 'make -j8 check-functional' these fail:
>>
>>   7/44 qemu:func-thorough+func-arm-thorough+thorough / func-arm-arm_sx1
>>                          ERROR           173.31s   exit status 1
>> 10/44 qemu:func-thorough+func-aarch64-thorough+thorough /
>> func-aarch64-aarch64_virt            TIMEOUT         720.04s   killed
>> by signal 15 SIGTERM
>> 11/44 qemu:func-thorough+func-arm-thorough+thorough /
>> func-arm-arm_aspeed_ast2600              TIMEOUT         720.07s
>> killed by signal 15 SIGTERM
>> 12/44 qemu:func-thorough+func-aarch64-thorough+thorough /
>> func-aarch64-aarch64_sbsaref_alpine  TIMEOUT         720.07s   killed
>> by signal 15 SIGTERM
>> 40/44 qemu:func-thorough+func-arm-thorough+thorough /
>> func-arm-arm_aspeed_ast2500              TIMEOUT         480.01s
>> killed by signal 15 SIGTERM
>>
>> The aarch64-virt one is gpu issue, so I know about that one.
>> The others pass OK on a clang no-debug sanitizer build.
>>
>> If I try to run just the sx1 tests "by hand":
>>
>> $ (cd build/x86 && PYTHONPATH=../../python:../../tests/functional
>> QEMU_TEST_QEMU_BINARY=./qemu-system-arm ./pyvenv/bin/python3
>> ../../tests/functional/test_arm_sx1.py)
>> TAP version 13
>> ok 1 test_arm_sx1.SX1Test.test_arm_sx1_flash
>> ok 2 test_arm_sx1.SX1Test.test_arm_sx1_initrd
>> ok 3 test_arm_sx1.SX1Test.test_arm_sx1_sd
>> 1..3
>>
>> they pass; but inside the test framework that third sd test
>> errors: testlog-thorough.txt says:
> [...]
>> timed out after 60 seconds
> [...]
>> which I interpret to mean "we waited the 60 seconds the test says,
>> but the test didn't exit within that time".
>>
>> Any suggestions for how to debug?
> 
> Some TCG-based tests are slowing down very much when running on a shared 
> hyperthreaded CPU ... Do you have 8 real cores in your system, or rather 4 
> real cores with 2 SMT threads each? In the latter case, have a try whether 
> "make -j4" works better.
> 
> We apparently also increased the timeout in this test in the past already, 
> see commit 92ee59bf56ba42954166e56ab112afe10f3c7556 ... does it work better 
> if you increase the timeout even further?
> 
>> (Also the console.log is empty regardless of whether the
>> test passes or fails; this doesn't seem right.)
> 
> I think we only log the console output when we look for strings
> in the output. Since this test does not look for any strings,
> there is no log.
> Something like this causes some log to be generated:
> 
> diff --git a/tests/functional/test_arm_sx1.py b/tests/functional/ 
> test_arm_sx1.py
> --- a/tests/functional/test_arm_sx1.py
> +++ b/tests/functional/test_arm_sx1.py
> @@ -43,7 +43,8 @@ def test_arm_sx1_initrd(self):
>           self.vm.add_args('-append', f'kunit.enable=0 rdinit=/sbin/init 
> {self.CONSOLE_ARGS}')
>           self.vm.add_args('-no-reboot')
>           self.launch_kernel(zimage_path,
> -                           initrd=initrd_path)
> +                           initrd=initrd_path,
> +                           wait_for='Boot successful')
>           self.vm.wait(timeout=60)

Actually, thinking about this twice, this might even be a valid fix for your 
problem. Without waiting for a a string that signals a successful boot, the 
test was just firing up the guest kernel and then waiting for 60 seconds for 
the guest kernel to boot up and shut down again. If the boot is delayed for 
some reasons, these 60 seconds might still not be enough. So if we wait for 
a successful boot first before starting with the timeout, the 60 seconds 
afterwards should be enough for a successful shut down, I guess?

  Thomas



^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: debugging functional tests that only fail in 'make check-functional'
  2025-02-21 13:08   ` Thomas Huth
@ 2025-02-21 13:24     ` Peter Maydell
  0 siblings, 0 replies; 5+ messages in thread
From: Peter Maydell @ 2025-02-21 13:24 UTC (permalink / raw)
  To: Thomas Huth; +Cc: QEMU Developers, Daniel P. Berrange

On Fri, 21 Feb 2025 at 13:08, Thomas Huth <thuth@redhat.com> wrote:
>
> On 21/02/2025 13.54, Thomas Huth wrote:
> > diff --git a/tests/functional/test_arm_sx1.py b/tests/functional/
> > test_arm_sx1.py
> > --- a/tests/functional/test_arm_sx1.py
> > +++ b/tests/functional/test_arm_sx1.py
> > @@ -43,7 +43,8 @@ def test_arm_sx1_initrd(self):
> >           self.vm.add_args('-append', f'kunit.enable=0 rdinit=/sbin/init
> > {self.CONSOLE_ARGS}')
> >           self.vm.add_args('-no-reboot')
> >           self.launch_kernel(zimage_path,
> > -                           initrd=initrd_path)
> > +                           initrd=initrd_path,
> > +                           wait_for='Boot successful')
> >           self.vm.wait(timeout=60)
>
> Actually, thinking about this twice, this might even be a valid fix for your
> problem. Without waiting for a a string that signals a successful boot, the
> test was just firing up the guest kernel and then waiting for 60 seconds for
> the guest kernel to boot up and shut down again. If the boot is delayed for
> some reasons, these 60 seconds might still not be enough. So if we wait for
> a successful boot first before starting with the timeout, the 60 seconds
> afterwards should be enough for a successful shut down, I guess?

Doesn't that just shuffle the timeouts around? If it
takes ages for the "boot" part then it will hit the timeout in
the meson.build file.

I'm trying with just bumping all the per-test timeouts
in the sx1 test to 120s.

-- PMM


^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2025-02-21 13:25 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-02-20 18:39 debugging functional tests that only fail in 'make check-functional' Peter Maydell
2025-02-21 12:54 ` Thomas Huth
2025-02-21 13:00   ` Peter Maydell
2025-02-21 13:08   ` Thomas Huth
2025-02-21 13:24     ` Peter Maydell

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).