* has anybody looked at the long-standing intermittent raspi2 avocado test failure?
@ 2022-10-18 10:23 Peter Maydell

From: Peter Maydell
To: QEMU Developers
Cc: Alex Bennée, Cleber Rosa, Philippe Mathieu-Daudé,
    Wainer dos Santos Moschetta, Beraldo Leal, John Snow

If you run this single avocado test:

  while ./build/x86/tests/venv/bin/avocado run \
    build/x86/tests/avocado/boot_linux_console.py:BootLinuxConsole.test_arm_raspi2_initrd ; \
  do true; done

then on my machine it will fail within 4 or 5 iterations. Has anybody
looked into this? Looking at a log from a failed test

  https://people.linaro.org/~peter.maydell/raspi-avocado-fail.log

what seems to happen is that the test sends the "halt" command to
the guest. The DEBUG log reports a transition from RUNNING to
DISCONNECTING and the QMP connection disconnects, so presumably
the guest really did exit -- or is that something the test framework
is doing from its end? Anyway, the next thing that happens is that
after a 30s timeout something (the avocado framework?) sends the QEMU
process a SIGTERM, and there's a bunch of timeout-related Python
backtraces in the log.

In comparison, with a pass

  https://people.linaro.org/~peter.maydell/raspi-avocado-pass.log

the transition from DISCONNECTING to IDLE is followed within about
5 seconds by a test pass.

So:

(1) has anybody looked at what's happening in this test in
    particular?
(2) is it possible for avocado to log all of the communication
    on the serial console? Currently the logging of what the guest
    prints cuts out early (in both success and failure cases),
    and what the test case sends to the guest is not logged at all
(3) what's the best way to debug this to figure out what's
    actually happening?
I tried passing timeout=None to the self.vm.wait() call in
test_arm_raspi2_initrd(), which the comments in machine.py say is
supposed to mean "wait forever", but that didn't actually seem to
change the timeout. (I had the idea of looking at the status of the
QEMU process etc at that point.)

thanks
-- PMM
* Re: has anybody looked at the long-standing intermittent raspi2 avocado test failure?
@ 2022-10-19 20:38 John Snow

From: John Snow
To: Peter Maydell
Cc: QEMU Developers, Alex Bennée, Cleber Rosa, Philippe Mathieu-Daudé,
    Wainer dos Santos Moschetta, Beraldo Leal

On Tue, Oct 18, 2022 at 6:23 AM Peter Maydell <peter.maydell@linaro.org> wrote:
>
> If you run this single avocado test:
>
> while ./build/x86/tests/venv/bin/avocado run
> build/x86/tests/avocado/boot_linux_console.py:BootLinuxConsole.test_arm_raspi2_initrd
> ; do true; done
>
> then on my machine it will fail within 4 or 5 iterations. Has anybody
> looked into this? Looking at a log from a failed test
> https://people.linaro.org/~peter.maydell/raspi-avocado-fail.log
> what seems to happen is that the test sends the "halt" command to
> the guest. The DEBUG log reports a transition from RUNNING to
> DISCONNECTING and the QMP connection disconnects, so presumably
> the guest really did exit -- or is that something the test framework
> is doing from its end?

The avocado test calls self.vm.wait(), which goes down this call
chain, all in QEMUMachine (machine.py):

  wait()
  .shutdown()
  .._do_shutdown()
  ..._soft_shutdown()
  ...._early_cleanup()
  ...._close_qmp_connection()

QEMUMachine._early_cleanup, as you've noticed, closes the console
socket right away. It has this to say on the matter:

  # If we keep the console socket open, we may deadlock waiting
  # for QEMU to exit, while QEMU is waiting for the socket to
  # become writable.

If we leave it open, QEMU is not guaranteed to close, and the
synchronous code in machine.py isn't good enough to poll both. This
was one motivation for switching the QMP library to asyncio, but
those changes haven't hit machine.py yet.
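The deadlock that comment describes is generic socket behavior rather
than anything QEMU-specific; a minimal sketch (plain Python, no QEMU
involved) of a writer wedging once its peer stops draining:

```python
import socket

# Sketch of the deadlock machine.py is guarding against: once the peer
# stops draining a socket and the kernel buffers fill up, the writer
# can make no further progress.  A blocking writer (like QEMU flushing
# its console output) would then be stuck in send() forever.
reader, writer = socket.socketpair()
writer.setblocking(False)  # so this demo raises instead of hanging
sent = 0
try:
    while True:
        sent += writer.send(b"x" * 4096)  # reader never calls recv()
except BlockingIOError:
    pass  # buffers full: a blocking sender would now be wedged here
reader.close()
writer.close()
print(sent > 0)  # → True: some data fit before the buffers filled
```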
I am hoping to have it done soon, but some other work on finalizing
the split-out of the QMP library itself took priority in my time
budget before I could return to converting more utilities to asyncio.

Next, the QMP connection itself is closed, which is the transition
from RUNNING to DISCONNECTING you see from the "protocol" module in
Avocado's log. (Avocado is truncating the full module names here; the
real source module names are "qemu.qmp.protocol", "qemu.qmp.qmp_client",
and "qemu.machine.machine".)

> Anyway, the next thing that happens is
> that after a 30s timeout something (the avocado framework?) sends
> the QEMU process a SIGTERM, and there's a bunch of timeout related
> python backtraces in the log.

It looks like the subprocess wait in _soft_shutdown never returns. We
assume that because wait() was called, the caller expects the VM to
exit and has presumably already issued a "quit" or "halt" or
equivalent, so the Machine code does not issue any further commands
to ensure this happens.

At this point, my guess -- based on your later issue filings -- is
that the "halt" command isn't actually getting processed because of
the rapid console disconnect, so the machine.py code deadlocks
waiting for QEMU to exit when it never will.

> In comparison, with a pass
> https://people.linaro.org/~peter.maydell/raspi-avocado-pass.log
> the transition from DISCONNECTING to IDLE is followed within about
> 5 seconds by a test pass.
>
> So:
>
> (1) has anybody looked at what's happening in this test in
> particular?
> (2) is it possible for avocado to log all of the communication
> on the serial console? Currently the logging of what the guest
> prints cuts out early (in both success and failure cases),
> and what the test case sends to the guest is not logged at all

The output logging being truncated is due to other problems, as
you've noticed in later issue filings.
I think input logging could be added by overriding
ConsoleSocket.send() in console_socket.py; see
https://docs.python.org/3/library/socket.html#socket.socket.send
for the signature, and use _drain_socket and/or recv() as a reference
basis. You might want to write to a separate logfile to avoid an
unreadable, interleaved mess: the data buffers here are, I think, not
line-based and may not produce coherent interleaved output.

Naive and untested (note that socket.send() returns the number of
bytes sent, an int):

```
def send(self, data: bytes, flags: int = 0) -> int:
    self._some_other_logfile.write(data)
    self._some_other_logfile.flush()
    return super().send(data, flags)
```

I am working on replacing machine.py with an async version which
should solve some of the socket problems; but if you want a band-aid
in the meantime, I'm happy to take any patches to improve what we
have in the short term. I will probably focus my own efforts on just
getting the rewrite in.

> (3) what's the best way to debug this to figure out what's
> actually happening?
>
> I tried passing timeout=None to the self.vm.wait() call in
> test_arm_raspi2_initrd(), which the comments in machine.py
> say is supposed to mean "wait forever", but that didn't actually
> seem to change the timeout. (I had the idea of looking at
> the status of the QEMU process etc at that point.)

Hm ... The stack trace for the TimeoutExpired exception sure makes it
look like the Machine appliance is hitting its own timeout (which
should be about 30 seconds -- matching what you observed), but if
that's the case, passing timeout=None should have alleviated that --
up until you hit the Avocado timeout, at least. I'd expect the
traceback output and failure mode to look a little different
depending on which component decides to time out.

> thanks
> -- PMM

"Thanks for the analysis, but, like, what are you going to do to fix
any of this?"
Long term, an asyncio version of machine.py will help alleviate these
kinds of race conditions when dealing with multiple sockets from
Python, as our testing requirements become more complex.

Short term, I'm not sure... I think removing the early socket close
code will just cause new race-condition deadlocks elsewhere. I
recommend adding some kind of wait or flush to the socket code after
sending "halt", to ensure that the command goes through. You could
also decide not to send "halt" and just use vm.shutdown() directly,
which uses QMP to issue 'quit'; that always waits for a command
success acknowledgment from QEMU, which should avoid the problem too.
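The shape of that fix -- send the command, then block on an explicit
acknowledgment before any teardown -- can be sketched without QEMU or
the avocado helpers at all; here a plain `sh` child stands in for the
guest (none of the real test names are used):

```python
import subprocess

# Race-free shutdown sketch: issue the command, then wait for the
# guest-side acknowledgment *before* tearing down any sockets.
proc = subprocess.Popen(
    ["sh", "-c", 'read cmd; echo "System halted"'],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)
proc.stdin.write("halt\n")    # fire the command...
proc.stdin.flush()
ack = proc.stdout.readline()  # ...but don't proceed until it's acted on
assert "System halted" in ack
proc.wait(timeout=5)          # the child is now guaranteed to be exiting
print("acknowledged")
```

Without the readline() in the middle, this has the same race as the
original test: the pipe could be closed before the child ever reads
"halt".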
* Re: has anybody looked at the long-standing intermittent raspi2 avocado test failure?
@ 2022-10-20 7:52 Thomas Huth

From: Thomas Huth
To: John Snow, Peter Maydell
Cc: QEMU Developers, Alex Bennée, Cleber Rosa, Philippe Mathieu-Daudé,
    Wainer dos Santos Moschetta, Beraldo Leal

On 19/10/2022 22.38, John Snow wrote:
> On Tue, Oct 18, 2022 at 6:23 AM Peter Maydell <peter.maydell@linaro.org> wrote:
>>
>> If you run this single avocado test:
>>
>> while ./build/x86/tests/venv/bin/avocado run
>> build/x86/tests/avocado/boot_linux_console.py:BootLinuxConsole.test_arm_raspi2_initrd
>> ; do true; done
>>
>> then on my machine it will fail within 4 or 5 iterations. Has anybody
>> looked into this? Looking at a log from a failed test
>> https://people.linaro.org/~peter.maydell/raspi-avocado-fail.log
>> what seems to happen is that the test sends the "halt" command to
>> the guest. The DEBUG log reports a transition from RUNNING to
>> DISCONNECTING and the QMP connection disconnects, so presumably
>> the guest really did exit -- or is that something the test framework
>> is doing from its end?
[...]
> "Thanks for the analysis, but, like, what are you going to do to fix
> any of this?"
>
> Long term, an asyncio version of machine.py will help alleviate these
> kinds of race conditions when dealing with multiple sockets from
> Python as our testing requirements become more complex.
>
> Short term, I'm not sure...
As a short-term fix, what about simply changing the test like this:

diff a/tests/avocado/boot_linux_console.py b/tests/avocado/boot_linux_console.py
--- a/tests/avocado/boot_linux_console.py
+++ b/tests/avocado/boot_linux_console.py
@@ -489,7 +489,7 @@ def test_arm_raspi2_initrd(self):
                                                  'BCM2835')
         exec_command_and_wait_for_pattern(self, 'cat /proc/iomem',
                                                 '/soc/cprman@7e101000')
-        exec_command(self, 'halt')
+        exec_command_and_wait_for_pattern(self, 'halt', 'System halted')
         # Wait for VM to shut down gracefully
         self.vm.wait()

By waiting for another console pattern that the guest sends out after
"halt", you can make sure that the connection is not closed too early.

 Thomas
* Re: has anybody looked at the long-standing intermittent raspi2 avocado test failure?
@ 2022-10-20 10:16 Peter Maydell

From: Peter Maydell
To: Thomas Huth
Cc: John Snow, QEMU Developers, Alex Bennée, Cleber Rosa,
    Philippe Mathieu-Daudé, Wainer dos Santos Moschetta, Beraldo Leal

On Thu, 20 Oct 2022 at 08:52, Thomas Huth <thuth@redhat.com> wrote:
[...]
> As a short term fix, what about simply changing the test
> like this:
>
> diff a/tests/avocado/boot_linux_console.py b/tests/avocado/boot_linux_console.py
> --- a/tests/avocado/boot_linux_console.py
> +++ b/tests/avocado/boot_linux_console.py
> @@ -489,7 +489,7 @@ def test_arm_raspi2_initrd(self):
>                                                   'BCM2835')
>          exec_command_and_wait_for_pattern(self, 'cat /proc/iomem',
>                                                  '/soc/cprman@7e101000')
> -        exec_command(self, 'halt')
> +        exec_command_and_wait_for_pattern(self, 'halt', 'System halted')
>          # Wait for VM to shut down gracefully
>          self.vm.wait()
>
> By waiting for another console pattern that the guest sends
> out after "halt", you can make sure that the connection is
> not closed too early.

That is pretty much the patch I'm testing currently :-)

-- PMM