* has anybody looked at the long-standing intermittent raspi2 avocado test failure?
@ 2022-10-18 10:23 Peter Maydell

From: Peter Maydell
To: QEMU Developers
Cc: Alex Bennée, Cleber Rosa, Philippe Mathieu-Daudé,
    Wainer dos Santos Moschetta, Beraldo Leal, John Snow

If you run this single avocado test:

  while ./build/x86/tests/venv/bin/avocado run \
    build/x86/tests/avocado/boot_linux_console.py:BootLinuxConsole.test_arm_raspi2_initrd ; \
  do true; done

then on my machine it will fail within 4 or 5 iterations. Has anybody
looked into this? Looking at a log from a failed test

  https://people.linaro.org/~peter.maydell/raspi-avocado-fail.log

what seems to happen is that the test sends the "halt" command to
the guest. The DEBUG log reports a transition from RUNNING to
DISCONNECTING and the QMP connection disconnects, so presumably
the guest really did exit -- or is that something the test framework
is doing from its end? Anyway, the next thing that happens is that
after a 30s timeout something (the avocado framework?) sends the QEMU
process a SIGTERM, and there's a bunch of timeout-related Python
backtraces in the log.

In comparison, with a pass

  https://people.linaro.org/~peter.maydell/raspi-avocado-pass.log

the transition from DISCONNECTING to IDLE is followed within about
5 seconds by a test pass.

So:

(1) has anybody looked at what's happening in this test in
    particular?
(2) is it possible for avocado to log all of the communication
    on the serial console? Currently the logging of what the guest
    prints cuts out early (in both success and failure cases),
    and what the test case sends to the guest is not logged at all
(3) what's the best way to debug this to figure out what's
    actually happening?
I tried passing timeout=None to the self.vm.wait() call in
test_arm_raspi2_initrd(), which the comments in machine.py say is
supposed to mean "wait forever", but that didn't actually seem to
change the timeout. (I had the idea of looking at the status of the
QEMU process etc at that point.)

thanks
-- PMM
* Re: has anybody looked at the long-standing intermittent raspi2 avocado test failure?
@ 2022-10-19 20:38 John Snow

From: John Snow
To: Peter Maydell
Cc: QEMU Developers, Alex Bennée, Cleber Rosa, Philippe Mathieu-Daudé,
    Wainer dos Santos Moschetta, Beraldo Leal

On Tue, Oct 18, 2022 at 6:23 AM Peter Maydell <peter.maydell@linaro.org> wrote:
>
> If you run this single avocado test:
>
> while ./build/x86/tests/venv/bin/avocado run
> build/x86/tests/avocado/boot_linux_console.py:BootLinuxConsole.test_arm_raspi2_initrd
> ; do true; done
>
> then on my machine it will fail within 4 or 5 iterations. Has anybody
> looked into this? Looking at a log from a failed test
> https://people.linaro.org/~peter.maydell/raspi-avocado-fail.log
> what seems to happen is that the test sends the "halt" command to
> the guest. The DEBUG log reports a transition from RUNNING to
> DISCONNECTING and the QMP connection disconnects, so presumably
> the guest really did exit -- or is that something the test framework
> is doing from its end?

The avocado test calls self.vm.wait(), which goes down this call
chain, all in QEMUMachine (machine.py):

  wait()
  .shutdown()
  .._do_shutdown()
  ..._soft_shutdown()
  ...._early_cleanup()
  ...._close_qmp_connection()

QEMUMachine._early_cleanup, as you've noticed, closes the console
socket right away. It has this to say on the matter:

  # If we keep the console socket open, we may deadlock waiting
  # for QEMU to exit, while QEMU is waiting for the socket to
  # become writable.

If we leave it open, QEMU is not guaranteed to close, and the
synchronous code in machine.py isn't good enough to poll both. This
was one motivation for switching the QMP library to asyncio, but
those changes haven't hit machine.py yet.
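The deadlock that comment describes is generic socket behavior rather
than anything QEMU-specific; a minimal sketch (plain Python, no QEMU
involved) of a writer wedging once its peer stops draining:

```python
import socket

# Sketch of the deadlock machine.py is guarding against: once the peer
# stops draining a socket and the kernel buffers fill up, the writer
# can make no further progress.  A blocking writer (like QEMU flushing
# its console output) would then be stuck in send() forever.
reader, writer = socket.socketpair()
writer.setblocking(False)  # so this demo raises instead of hanging
sent = 0
try:
    while True:
        sent += writer.send(b"x" * 4096)  # reader never calls recv()
except BlockingIOError:
    pass  # buffers full: a blocking sender would now be wedged here
reader.close()
writer.close()
print(sent > 0)  # → True: some data fit before the buffers filled
```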
I am hoping to have it done soon, but some other work on finalizing
the split-out of the QMP library itself took priority in my time
budget before I could return to converting more utilities to asyncio.

Next, the QMP connection itself is closed, which is the transition
from RUNNING to DISCONNECTING you see from the "protocol" module in
Avocado's log. (Avocado is truncating the full module names here; the
real source module names are "qemu.qmp.protocol", "qemu.qmp.qmp_client",
and "qemu.machine.machine".)

> Anyway, the next thing that happens is
> that after a 30s timeout something (the avocado framework?) sends
> the QEMU process a SIGTERM, and there's a bunch of timeout related
> python backtraces in the log.

It looks like the subprocess wait in _soft_shutdown never returns. We
assume that because wait() was called, the caller expects the VM to
exit and has presumably already issued a "quit" or "halt" or
equivalent, so the Machine code does not issue any further commands
to ensure this happens.

At this point, my guess -- based on your later issue filings -- is
that the "halt" command isn't actually getting processed because of
the rapid console disconnect, so the machine.py code deadlocks
waiting for QEMU to exit when it never will.

> In comparison, with a pass
> https://people.linaro.org/~peter.maydell/raspi-avocado-pass.log
> the transition from DISCONNECTING to IDLE is followed within about
> 5 seconds by a test pass.
>
> So:
>
> (1) has anybody looked at what's happening in this test in
> particular?
> (2) is it possible for avocado to log all of the communication
> on the serial console? Currently the logging of what the guest
> prints cuts out early (in both success and failure cases),
> and what the test case sends to the guest is not logged at all

The output logging being truncated is due to other problems, as
you've noticed in later issue filings.
I think input logging could be added by overriding
ConsoleSocket.send() in console_socket.py; see
https://docs.python.org/3/library/socket.html#socket.socket.send
for the signature, and use _drain_socket and/or recv() as a reference
basis. You might want to write to a separate logfile to avoid an
unreadable, interleaved mess: the data buffers here are, I think, not
line-based and may not produce coherent interleaved output.

Naive and untested (note that socket.send() returns the number of
bytes sent, an int):

```
def send(self, data: bytes, flags: int = 0) -> int:
    self._some_other_logfile.write(data)
    self._some_other_logfile.flush()
    return super().send(data, flags)
```

I am working on replacing machine.py with an async version which
should solve some of the socket problems; but if you want a band-aid
in the meantime, I'm happy to take any patches to improve what we
have in the short term. I will probably focus my own efforts on just
getting the rewrite in.

> (3) what's the best way to debug this to figure out what's
> actually happening?
>
> I tried passing timeout=None to the self.vm.wait() call in
> test_arm_raspi2_initrd(), which the comments in machine.py
> say is supposed to mean "wait forever", but that didn't actually
> seem to change the timeout. (I had the idea of looking at
> the status of the QEMU process etc at that point.)

Hm ... The stack trace for the TimeoutExpired exception sure makes it
look like the Machine appliance is hitting its own timeout (which
should be about 30 seconds -- matching what you observed), but if
that's the case, passing timeout=None should have alleviated that --
up until you hit the Avocado timeout, at least. I'd expect the
traceback output and failure mode to look a little different
depending on which component decides to time out.

> thanks
> -- PMM

"Thanks for the analysis, but, like, what are you going to do to fix
any of this?"
Long term, an asyncio version of machine.py will help alleviate these
kinds of race conditions when dealing with multiple sockets from
Python, as our testing requirements become more complex.

Short term, I'm not sure... I think removing the early socket close
code will just cause new race-condition deadlocks elsewhere. I
recommend adding some kind of wait or flush to the socket code after
sending "halt", to ensure that the command goes through. You could
also decide not to send "halt" and just use vm.shutdown() directly,
which uses QMP to issue 'quit'; that always waits for a command
success acknowledgment from QEMU, which should avoid the problem too.
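The shape of that fix -- send the command, then block on an explicit
acknowledgment before any teardown -- can be sketched without QEMU or
the avocado helpers at all; here a plain `sh` child stands in for the
guest (none of the real test names are used):

```python
import subprocess

# Race-free shutdown sketch: issue the command, then wait for the
# guest-side acknowledgment *before* tearing down any sockets.
proc = subprocess.Popen(
    ["sh", "-c", 'read cmd; echo "System halted"'],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)
proc.stdin.write("halt\n")    # fire the command...
proc.stdin.flush()
ack = proc.stdout.readline()  # ...but don't proceed until it's acted on
assert "System halted" in ack
proc.wait(timeout=5)          # the child is now guaranteed to be exiting
print("acknowledged")
```

Without the readline() in the middle, this has the same race as the
original test: the pipe could be closed before the child ever reads
"halt".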
* Re: has anybody looked at the long-standing intermittent raspi2 avocado test failure?
@ 2022-10-20 7:52 Thomas Huth

From: Thomas Huth
To: John Snow, Peter Maydell
Cc: QEMU Developers, Alex Bennée, Cleber Rosa, Philippe Mathieu-Daudé,
    Wainer dos Santos Moschetta, Beraldo Leal

On 19/10/2022 22.38, John Snow wrote:
> On Tue, Oct 18, 2022 at 6:23 AM Peter Maydell <peter.maydell@linaro.org> wrote:
>>
>> If you run this single avocado test:
>>
>> while ./build/x86/tests/venv/bin/avocado run
>> build/x86/tests/avocado/boot_linux_console.py:BootLinuxConsole.test_arm_raspi2_initrd
>> ; do true; done
>>
>> then on my machine it will fail within 4 or 5 iterations. Has anybody
>> looked into this? Looking at a log from a failed test
>> https://people.linaro.org/~peter.maydell/raspi-avocado-fail.log
>> what seems to happen is that the test sends the "halt" command to
>> the guest. The DEBUG log reports a transition from RUNNING to
>> DISCONNECTING and the QMP connection disconnects, so presumably
>> the guest really did exit -- or is that something the test framework
>> is doing from its end?
[...]
> "Thanks for the analysis, but, like, what are you going to do to fix
> any of this?"
>
> Long term, an asyncio version of machine.py will help alleviate these
> kinds of race conditions when dealing with multiple sockets from
> Python as our testing requirements become more complex.
>
> Short term, I'm not sure...
As a short-term fix, what about simply changing the test like this:

diff a/tests/avocado/boot_linux_console.py b/tests/avocado/boot_linux_console.py
--- a/tests/avocado/boot_linux_console.py
+++ b/tests/avocado/boot_linux_console.py
@@ -489,7 +489,7 @@ def test_arm_raspi2_initrd(self):
                                                  'BCM2835')
         exec_command_and_wait_for_pattern(self, 'cat /proc/iomem',
                                                 '/soc/cprman@7e101000')
-        exec_command(self, 'halt')
+        exec_command_and_wait_for_pattern(self, 'halt', 'System halted')
         # Wait for VM to shut down gracefully
         self.vm.wait()

By waiting for another console pattern that the guest sends out after
"halt", you can make sure that the connection is not closed too early.

 Thomas
* Re: has anybody looked at the long-standing intermittent raspi2 avocado test failure?
@ 2022-10-20 10:16 Peter Maydell

From: Peter Maydell
To: Thomas Huth
Cc: John Snow, QEMU Developers, Alex Bennée, Cleber Rosa,
    Philippe Mathieu-Daudé, Wainer dos Santos Moschetta, Beraldo Leal

On Thu, 20 Oct 2022 at 08:52, Thomas Huth <thuth@redhat.com> wrote:
[...]
> As a short term fix, what about simply changing the test
> like this:
>
> diff a/tests/avocado/boot_linux_console.py b/tests/avocado/boot_linux_console.py
> --- a/tests/avocado/boot_linux_console.py
> +++ b/tests/avocado/boot_linux_console.py
> @@ -489,7 +489,7 @@ def test_arm_raspi2_initrd(self):
>                                                   'BCM2835')
>          exec_command_and_wait_for_pattern(self, 'cat /proc/iomem',
>                                                  '/soc/cprman@7e101000')
> -        exec_command(self, 'halt')
> +        exec_command_and_wait_for_pattern(self, 'halt', 'System halted')
>          # Wait for VM to shut down gracefully
>          self.vm.wait()
>
> By waiting for another console pattern that the guest sends
> out after "halt", you can make sure that the connection is
> not closed too early.

That is pretty much the patch I'm testing currently :-)

-- PMM