* acceptance-system-fedora failures
  2020-10-06 23:07  John Snow
  To: Cleber Rosa, Alex Bennée
  Cc: QEMU Developers

I'm seeing this gitlab test fail quite often in my Python work; I don't
*think* this has anything to do with my patches, but maybe I need to try
and bisect this more aggressively.

The very first hint of an error I see is on line 154:

https://gitlab.com/jsnow/qemu/-/jobs/776334918#L154

22:05:36 ERROR|
22:05:36 ERROR| Reproduced traceback from: /builds/jsnow/qemu/build/tests/venv/lib64/python3.8/site-packages/avocado/core/test.py:753
22:05:36 ERROR| Traceback (most recent call last):
22:05:36 ERROR|   File "/builds/jsnow/qemu/build/tests/acceptance/avocado_qemu/__init__.py", line 171, in setUp
22:05:36 ERROR|     self.cancel("No QEMU binary defined or found in the build tree")

Is this a known problem?

--js
* Re: acceptance-system-fedora failures
  2020-10-07  5:20 ` Philippe Mathieu-Daudé
  To: John Snow, Cleber Rosa
  Cc: Thomas Huth, Eduardo Habkost, QEMU Developers, Wainer dos Santos Moschetta, Alex Bennée, Richard Henderson

On 10/7/20 1:07 AM, John Snow wrote:
> I'm seeing this gitlab test fail quite often in my Python work; I don't
> *think* this has anything to do with my patches, but maybe I need to try
> and bisect this more aggressively.
>
> The very first hint of an error I see is on line 154:
>
> https://gitlab.com/jsnow/qemu/-/jobs/776334918#L154
>
> 22:05:36 ERROR|
> 22:05:36 ERROR| Reproduced traceback from: /builds/jsnow/qemu/build/tests/venv/lib64/python3.8/site-packages/avocado/core/test.py:753
> 22:05:36 ERROR| Traceback (most recent call last):
> 22:05:36 ERROR|   File "/builds/jsnow/qemu/build/tests/acceptance/avocado_qemu/__init__.py", line 171, in setUp
> 22:05:36 ERROR|     self.cancel("No QEMU binary defined or found in the build tree")

Last year the Avocado developers said we could have a clearer error
report in the log, but meanwhile this verbose output is better than not
having anything ¯\_(ツ)_/¯

> Is this a known problem?

"No QEMU binary defined or found in the build tree" is not a problem in
itself; it means a test is skipped because the qemu-system-$ARCH binary
is not found. In your case this is because your job
(acceptance-system-fedora) is based on build-system-fedora, which only
builds the following targets:

  TARGETS: tricore-softmmu microblaze-softmmu mips-softmmu
           xtensa-softmmu m68k-softmmu riscv32-softmmu ppc-softmmu
           sparc64-softmmu

Now I don't understand what binary the EmptyCPUModel/Migration tests
are expecting. Maybe these tests only work when a single target is
built? IOW, they don't expect the Python code that searches for a
binary to return multiple ones? -> Eduardo/Cleber.

W.r.t. the error in your build, I told Thomas about the
test_ppc_mac99/day15/invaders.elf test timing out, but he said this is
not his area. Richard looked yesterday to see whether it is a TCG
regression, and said the test either finishes or crashes, raising
SIGCHLD, but the Avocado parent is still waiting for a timeout, so the
child becomes a zombie and the test hangs.

Not sure that helps :)

Regards,

Phil.
* Re: acceptance-system-fedora failures
  2020-10-07  7:13 ` Philippe Mathieu-Daudé
  To: John Snow, Cleber Rosa, Pavel Dovgalyuk
  Cc: Thomas Huth, Eduardo Habkost, QEMU Developers, Wainer dos Santos Moschetta, Alex Bennée, Richard Henderson

On 10/7/20 7:20 AM, Philippe Mathieu-Daudé wrote:
> On 10/7/20 1:07 AM, John Snow wrote:
>> I'm seeing this gitlab test fail quite often in my Python work; I don't
>> *think* this has anything to do with my patches, but maybe I need to try
>> and bisect this more aggressively.
[...]
> W.r.t. the error in your build, I told Thomas about the
> test_ppc_mac99/day15/invaders.elf test timing out, but he said this is
> not his area. Richard looked yesterday to see whether it is a TCG
> regression, and said the test either finishes or crashes, raising
> SIGCHLD, but the Avocado parent is still waiting for a timeout, so the
> child becomes a zombie and the test hangs.

Expected output:

  Quiescing Open Firmware ...
  Booting Linux via __start() @ 0x01000000 ...

But QEMU exits in replay_char_write_event_load():

  Quiescing Open Firmware ...
  qemu-system-ppc: Missing character write event in the replay log
  $ echo $?
  1

The latest events are CHECKPOINT CHECKPOINT INTERRUPT INTERRUPT INTERRUPT.

The replay file is ~22MiB. The recording ends with "system_powerdown + quit"
in HMP.

I guess we have 2 bugs:
- the replay log
- Avocado doesn't catch the child's exit(1)

Quick reproducer:

  $ make qemu-system-ppc check-venv
  $ tests/venv/bin/python -m \
      avocado --show=app,console,replay \
      run --job-timeout 300 -t machine:mac99 \
      tests/acceptance/replay_kernel.py

> Not sure that helps :)
>
> Regards,
>
> Phil.
* Re: acceptance-system-fedora failures
  2020-10-07  8:23 ` Thomas Huth
  To: Philippe Mathieu-Daudé, John Snow, Cleber Rosa, Pavel Dovgalyuk
  Cc: Richard Henderson, Alex Bennée, Eduardo Habkost, Wainer dos Santos Moschetta, QEMU Developers

On 07/10/2020 09.13, Philippe Mathieu-Daudé wrote:
> On 10/7/20 7:20 AM, Philippe Mathieu-Daudé wrote:
>> On 10/7/20 1:07 AM, John Snow wrote:
>>> I'm seeing this gitlab test fail quite often in my Python work; I don't
>>> *think* this has anything to do with my patches, but maybe I need to try
>>> and bisect this more aggressively.
[...]
> I guess we have 2 bugs:
> - the replay log
> - Avocado doesn't catch the child's exit(1)
>
> Quick reproducer:
>
>   $ make qemu-system-ppc check-venv
>   $ tests/venv/bin/python -m \
>       avocado --show=app,console,replay \
>       run --job-timeout 300 -t machine:mac99 \
>       tests/acceptance/replay_kernel.py

Thanks, that was helpful. ... and the winner is:

  commit 55adb3c45620c31f29978f209e2a44a08d34e2da
  Author: John Snow <jsnow@redhat.com>
  Date:   Fri Jul 24 01:23:00 2020 -0400
  Subject: ide: cancel pending callbacks on SRST

... starting with this commit, the test starts failing. John, any idea
what might be causing this?

 Thomas
* Re: acceptance-system-fedora failures
  2020-10-07  8:51 ` Pavel Dovgalyuk
  To: Thomas Huth, Philippe Mathieu-Daudé, John Snow, Cleber Rosa, Pavel Dovgalyuk
  Cc: Richard Henderson, Alex Bennée, Eduardo Habkost, Wainer dos Santos Moschetta, QEMU Developers

On 07.10.2020 11:23, Thomas Huth wrote:
> On 07/10/2020 09.13, Philippe Mathieu-Daudé wrote:
[...]
> Thanks, that was helpful. ... and the winner is:
>
>   commit 55adb3c45620c31f29978f209e2a44a08d34e2da
>   Author: John Snow <jsnow@redhat.com>
>   Date:   Fri Jul 24 01:23:00 2020 -0400
>   Subject: ide: cancel pending callbacks on SRST
>
> ... starting with this commit, the test starts failing. John, any idea
> what might be causing this?

This patch includes the following lines:

+        aio_bh_schedule_oneshot(qemu_get_aio_context(),
+                                ide_bus_perform_srst, bus);

replay_bh_schedule_oneshot_event should be used instead of this
function, because it synchronizes non-deterministic BHs.
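A minimal sketch of the change Pavel is suggesting, assuming the call stays
in the SRST path of hw/ide/core.c; the wrapping helper name
ide_bus_schedule_srst is invented for illustration and this is not
necessarily the exact patch that was eventually merged:

```c
/* Illustration only: route the SRST bottom half through the record/replay
 * layer so its execution point is logged in record mode and reproduced at
 * the same icount in replay mode.  With record/replay inactive,
 * replay_bh_schedule_oneshot_event() falls back to a plain oneshot BH. */
#include "qemu/osdep.h"
#include "qemu/main-loop.h"
#include "sysemu/replay.h"
#include "hw/ide/internal.h"

static void ide_bus_perform_srst(void *opaque);     /* from the SRST patch */

static void ide_bus_schedule_srst(IDEBus *bus)      /* hypothetical helper */
{
    /* was: aio_bh_schedule_oneshot(qemu_get_aio_context(), ...); */
    replay_bh_schedule_oneshot_event(qemu_get_aio_context(),
                                     ide_bus_perform_srst, bus);
}
```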
* Re: acceptance-system-fedora failures
  2020-10-07  9:57 ` Philippe Mathieu-Daudé
  To: Pavel Dovgalyuk, Thomas Huth, John Snow, Cleber Rosa, Pavel Dovgalyuk
  Cc: Richard Henderson, Alex Bennée, Eduardo Habkost, Wainer dos Santos Moschetta, QEMU Developers

On 10/7/20 10:51 AM, Pavel Dovgalyuk wrote:
> On 07.10.2020 11:23, Thomas Huth wrote:
[...]
>> ... starting with this commit, the test starts failing. John, any idea
>> what might be causing this?
>
> This patch includes the following lines:
>
> +        aio_bh_schedule_oneshot(qemu_get_aio_context(),
> +                                ide_bus_perform_srst, bus);
>
> replay_bh_schedule_oneshot_event should be used instead of this
> function, because it synchronizes non-deterministic BHs.

Why do we have 2 different functions? BHs are already complex enough,
and now we also need to think about the replay API...

What about the other cases, such as vhost-user (blk/net) and virtio-blk?
* Re: acceptance-system-fedora failures
  2020-10-07 11:22 ` Alex Bennée
  To: Philippe Mathieu-Daudé
  Cc: Thomas Huth, Pavel Dovgalyuk, Eduardo Habkost, QEMU Developers, Wainer dos Santos Moschetta, Pavel Dovgalyuk, Cleber Rosa, John Snow, Richard Henderson

Philippe Mathieu-Daudé <philmd@redhat.com> writes:

> On 10/7/20 10:51 AM, Pavel Dovgalyuk wrote:
>> On 07.10.2020 11:23, Thomas Huth wrote:
[...]
>> This patch includes the following lines:
>>
>> +        aio_bh_schedule_oneshot(qemu_get_aio_context(),
>> +                                ide_bus_perform_srst, bus);
>>
>> replay_bh_schedule_oneshot_event should be used instead of this
>> function, because it synchronizes non-deterministic BHs.
>
> Why do we have 2 different functions? BHs are already complex enough,
> and now we also need to think about the replay API...
>
> What about the other cases, such as vhost-user (blk/net) and virtio-blk?

This does seem like something that should be wrapped up inside
aio_bh_schedule_oneshot itself, or maybe we need an
aio_bh_schedule_transaction_oneshot to distinguish it from the other
uses the function has.

-- 
Alex Bennée
* Re: acceptance-system-fedora failures
  2020-10-07 12:20 ` Pavel Dovgalyuk
  To: Alex Bennée, Philippe Mathieu-Daudé
  Cc: Thomas Huth, Eduardo Habkost, QEMU Developers, Wainer dos Santos Moschetta, Pavel Dovgalyuk, Cleber Rosa, John Snow, Richard Henderson

On 07.10.2020 14:22, Alex Bennée wrote:
> Philippe Mathieu-Daudé <philmd@redhat.com> writes:
>> On 10/7/20 10:51 AM, Pavel Dovgalyuk wrote:
[...]
>>> replay_bh_schedule_oneshot_event should be used instead of this
>>> function, because it synchronizes non-deterministic BHs.
>>
>> Why do we have 2 different functions? BHs are already complex enough,
>> and now we also need to think about the replay API...
>>
>> What about the other cases, such as vhost-user (blk/net) and virtio-blk?
>
> This does seem like something that should be wrapped up inside
> aio_bh_schedule_oneshot itself, or maybe we need an
> aio_bh_schedule_transaction_oneshot to distinguish it from the other
> uses the function has.

Maybe there should be two functions:
- one for guest modifications
- one for internal QEMU things

The first one may be implemented through rr plus the second one.
Maybe the replay_ prefix is confusing and the whole BH interface should
look like that?

Pavel Dovgalyuk
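One way to read this proposal is the pair of wrappers sketched below. The
names aio_bh_schedule_oneshot_guest/_internal are invented here purely for
illustration, not an actual API:

```c
/* Hypothetical split of the oneshot-BH API along the lines suggested: the
 * guest-facing variant is built from record/replay plus the internal one,
 * so device code never needs a replay_ prefixed call. */
#include "qemu/osdep.h"
#include "block/aio.h"
#include "sysemu/replay.h"

/* For BHs that never touch guest state (e.g. monitor setup). */
static inline void aio_bh_schedule_oneshot_internal(AioContext *ctx,
                                                    QEMUBHFunc *cb,
                                                    void *opaque)
{
    aio_bh_schedule_oneshot(ctx, cb, opaque);
}

/* For BHs that may modify guest state: recorded in record mode and
 * replayed deterministically; otherwise a plain oneshot BH. */
static inline void aio_bh_schedule_oneshot_guest(AioContext *ctx,
                                                 QEMUBHFunc *cb,
                                                 void *opaque)
{
    replay_bh_schedule_oneshot_event(ctx, cb, opaque);
}
```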
* Re: acceptance-system-fedora failures
  2020-10-07 12:49 ` Philippe Mathieu-Daudé
  To: Pavel Dovgalyuk, Alex Bennée, Stefan Hajnoczi
  Cc: Thomas Huth, Eduardo Habkost, QEMU Developers, Wainer dos Santos Moschetta, Pavel Dovgalyuk, Cleber Rosa, Paolo Bonzini, John Snow

On 10/7/20 2:20 PM, Pavel Dovgalyuk wrote:
> On 07.10.2020 14:22, Alex Bennée wrote:
>> Philippe Mathieu-Daudé <philmd@redhat.com> writes:
[...]
>> This does seem like something that should be wrapped up inside
>> aio_bh_schedule_oneshot itself, or maybe we need an
>> aio_bh_schedule_transaction_oneshot to distinguish it from the other
>> uses the function has.
>
> Maybe there should be two functions:
> - one for guest modifications

aio_bh_schedule_oneshot_deterministic()?

> - one for internal QEMU things

Not sure why there is a difference. BHs are used to avoid delaying the
guest, so there is always something related to "guest modification".

> The first one may be implemented through rr plus the second one.
> Maybe the replay_ prefix is confusing and the whole BH interface should
> look like that?

Yes, but it would be safer/clearer if we didn't need to use a replay_
API at all.

Can we embed these functions?

- replay_bh_schedule_event
- replay_bh_schedule_oneshot_event

If compiled without rr, events_enabled=false and the compiler can
optimize:

-- >8 --
diff --git a/util/async.c b/util/async.c
index f758354c6a..376b6d4e27 100644
--- a/util/async.c
+++ b/util/async.c
@@ -109,14 +109,19 @@ static QEMUBH *aio_bh_dequeue(BHList *head, unsigned *flags)

 void aio_bh_schedule_oneshot(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
 {
-    QEMUBH *bh;
-    bh = g_new(QEMUBH, 1);
-    *bh = (QEMUBH){
-        .ctx = ctx,
-        .cb = cb,
-        .opaque = opaque,
-    };
-    aio_bh_enqueue(bh, BH_SCHEDULED | BH_ONESHOT);
+    if (events_enabled) {
+        replay_add_event(REPLAY_ASYNC_EVENT_BH_ONESHOT, cb,
+                         opaque, replay_get_current_icount());
+    } else {
+        QEMUBH *bh;
+        bh = g_new(QEMUBH, 1);
+        *bh = (QEMUBH){
+            .ctx = ctx,
+            .cb = cb,
+            .opaque = opaque,
+        };
+        aio_bh_enqueue(bh, BH_SCHEDULED | BH_ONESHOT);
+    }
 }

 QEMUBH *aio_bh_new(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
@@ -178,7 +183,12 @@ void qemu_bh_schedule_idle(QEMUBH *bh)

 void qemu_bh_schedule(QEMUBH *bh)
 {
-    aio_bh_enqueue(bh, BH_SCHEDULED);
+    if (events_enabled) {
+        replay_add_event(REPLAY_ASYNC_EVENT_BH, bh, NULL,
+                         replay_get_current_icount());
+    } else {
+        aio_bh_enqueue(bh, BH_SCHEDULED);
+    }
 }
---
* Re: acceptance-system-fedora failures
  2020-10-07 13:11 ` Pavel Dovgalyuk
  To: Philippe Mathieu-Daudé, Alex Bennée, Stefan Hajnoczi
  Cc: Thomas Huth, Eduardo Habkost, QEMU Developers, Wainer dos Santos Moschetta, Cleber Rosa, Paolo Bonzini, John Snow

On 07.10.2020 15:49, Philippe Mathieu-Daudé wrote:
> On 10/7/20 2:20 PM, Pavel Dovgalyuk wrote:
>> On 07.10.2020 14:22, Alex Bennée wrote:
[...]
>> Maybe there should be two functions:
>> - one for guest modifications
>
> aio_bh_schedule_oneshot_deterministic()?
>
>> - one for internal QEMU things
>
> Not sure why there is a difference. BHs are used to avoid delaying the
> guest, so there is always something related to "guest modification".

Not exactly. There is at least one non-guest-related case, in
monitor_init_qmp():

    /*
     * We can't call qemu_chr_fe_set_handlers() directly here
     * since chardev might be running in the monitor I/O
     * thread. Schedule a bottom half.
     */
    aio_bh_schedule_oneshot(iothread_get_aio_context(mon_iothread),
                            monitor_qmp_setup_handlers_bh, mon);

>> The first one may be implemented through rr plus the second one.
>> Maybe the replay_ prefix is confusing and the whole BH interface should
>> look like that?
>
> Yes, but it would be safer/clearer if we didn't need to use a replay_
> API at all.
>
> Can we embed these functions?
>
> - replay_bh_schedule_event
> - replay_bh_schedule_oneshot_event
>
> If compiled without rr, events_enabled=false and the compiler can
> optimize:
>
> -- >8 --
> diff --git a/util/async.c b/util/async.c
[...]
>  void qemu_bh_schedule(QEMUBH *bh)

qemu_bh_schedule is even worse. Many modules use it (e.g.
loadvm_postcopy_handle_run), and there is no need to use the replay
version there. In such cases QEMU will halt if it tries to call
replay_bh_schedule_event.

>  {
> -    aio_bh_enqueue(bh, BH_SCHEDULED);
> +    if (events_enabled) {
> +        replay_add_event(REPLAY_ASYNC_EVENT_BH, bh, NULL,
> +                         replay_get_current_icount());
> +    } else {
> +        aio_bh_enqueue(bh, BH_SCHEDULED);
> +    }
>  }

Pavel Dovgalyuk
* Re: acceptance-system-fedora failures
  2020-10-08 10:26 ` Philippe Mathieu-Daudé
  To: Pavel Dovgalyuk, Alex Bennée, Stefan Hajnoczi
  Cc: Thomas Huth, Eduardo Habkost, Qemu-block, QEMU Developers, Wainer dos Santos Moschetta, Cleber Rosa, Paolo Bonzini, John Snow

On 10/7/20 3:11 PM, Pavel Dovgalyuk wrote:
> On 07.10.2020 15:49, Philippe Mathieu-Daudé wrote:
>> On 10/7/20 2:20 PM, Pavel Dovgalyuk wrote:
[...]
>> Not sure why there is a difference. BHs are used to avoid delaying the
>> guest, so there is always something related to "guest modification".
>
> Not exactly. There is at least one non-guest-related case, in
> monitor_init_qmp():
>
>     /*
>      * We can't call qemu_chr_fe_set_handlers() directly here
>      * since chardev might be running in the monitor I/O
>      * thread. Schedule a bottom half.
>      */
>     aio_bh_schedule_oneshot(iothread_get_aio_context(mon_iothread),
>                             monitor_qmp_setup_handlers_bh, mon);

I don't understand the documentation in docs/devel/replay.txt:

---
Bottom halves
=============

Bottom half callbacks, that affect the guest state, should be invoked through
replay_bh_schedule_event or replay_bh_schedule_oneshot_event functions.
Their invocations are saved in record mode and synchronized with the existing
log in replay mode.
---

But then it is only used in block drivers, which are not related to
guest state:

$ git grep replay_bh_schedule_oneshot_event
block/block-backend.c:1385:        replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
block/block-backend.c:1450:        replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
block/io.c:371:    replay_bh_schedule_oneshot_event(bdrv_get_aio_context(bs),
block/iscsi.c:286:    replay_bh_schedule_oneshot_event(iTask->iscsilun->aio_context,
block/nfs.c:262:        replay_bh_schedule_oneshot_event(task->client->aio_context,
block/null.c:183:        replay_bh_schedule_oneshot_event(bdrv_get_aio_context(bs),
block/nvme.c:323:        replay_bh_schedule_oneshot_event(q->aio_context,
block/nvme.c:1075:    replay_bh_schedule_oneshot_event(data->ctx, nvme_rw_cb_bh, data);
block/rbd.c:865:    replay_bh_schedule_oneshot_event(bdrv_get_aio_context(acb->common.bs),
docs/devel/replay.txt:25:replay_bh_schedule_event or replay_bh_schedule_oneshot_event functions.
include/sysemu/replay.h:178:void replay_bh_schedule_oneshot_event(AioContext *ctx,
replay/replay-events.c:141:void replay_bh_schedule_oneshot_event(AioContext *ctx,
stubs/replay-user.c:5:void replay_bh_schedule_oneshot_event(AioContext *ctx,

Is replay_bh_schedule_oneshot_event ever used by replay? Maybe we can
remove it and use aio_bh_schedule_oneshot() in its place?

Otherwise the documentation needs to be clarified, please.
* Re: acceptance-system-fedora failures
  2020-10-08 11:50 ` Kevin Wolf
  To: Philippe Mathieu-Daudé
  Cc: Thomas Huth, Pavel Dovgalyuk, Eduardo Habkost, Qemu-block, QEMU Developers, Wainer dos Santos Moschetta, Stefan Hajnoczi, Cleber Rosa, Paolo Bonzini, Alex Bennée

On 08.10.2020 12:26, Philippe Mathieu-Daudé wrote:
> On 10/7/20 3:11 PM, Pavel Dovgalyuk wrote:
[...]
> I don't understand the documentation in docs/devel/replay.txt:
>
> ---
> Bottom halves
> =============
>
> Bottom half callbacks, that affect the guest state, should be invoked through
> replay_bh_schedule_event or replay_bh_schedule_oneshot_event functions.
> Their invocations are saved in record mode and synchronized with the existing
> log in replay mode.
> ---
>
> But then it is only used in block drivers, which are not related to
> guest state:

Pavel can tell you the details, but I think the idea was that you need
to use this function not when the code calling it modifies guest state,
but when the BH implementation can do so.

In the case of generic callbacks like those provided by the blk_aio_*()
functions, we don't know whether this is the case, but it's generally
device emulation code, so the chances are relatively high that they do.

I seem to remember that when reviewing the code that introduced
replay_bh_schedule_event(), I was relatively sure that we didn't catch
all the necessary instances, but since it worked for Pavel and I didn't
feel like getting too involved with replay code, we just merged it
anyway.

As I said, the details are a question for Pavel.

Kevin
* Re: acceptance-system-fedora failures
  2020-10-09 10:37 ` Pavel Dovgalyuk
  To: Kevin Wolf, Philippe Mathieu-Daudé
  Cc: Thomas Huth, Eduardo Habkost, Qemu-block, QEMU Developers, Wainer dos Santos Moschetta, Stefan Hajnoczi, Cleber Rosa, Paolo Bonzini, Alex Bennée

On 08.10.2020 14:50, Kevin Wolf wrote:
> On 08.10.2020 12:26, Philippe Mathieu-Daudé wrote:
[...]
> Pavel can tell you the details, but I think the idea was that you need
> to use this function not when the code calling it modifies guest state,
> but when the BH implementation can do so.
>
> In the case of generic callbacks like those provided by the blk_aio_*()
> functions, we don't know whether this is the case, but it's generally
> device emulation code, so the chances are relatively high that they do.
>
> I seem to remember that when reviewing the code that introduced
> replay_bh_schedule_event(), I was relatively sure that we didn't catch
> all the necessary instances, but since it worked for Pavel and I didn't
> feel like getting too involved with replay code, we just merged it
> anyway.

That's right.

The block layer does not touch the guest by itself, but it includes
callbacks that may raise interrupts when disk operations finish. That is
why we synchronize these callbacks with the vCPU through the replay
layer.

Pavel Dovgalyuk
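For reference, the pattern Pavel describes looks roughly like the sketch
below, a simplified illustration of the block/block-backend.c call sites
listed earlier in the thread; the function names here are invented:

```c
/* Simplified sketch: a request-completion callback in device emulation may
 * raise a guest interrupt, so instead of invoking it directly the block
 * layer schedules it through the replay layer.  In record mode the event's
 * position in the instruction stream is logged; in replay mode it fires at
 * the same icount, keeping guest-visible interrupt timing deterministic. */
#include "qemu/osdep.h"
#include "sysemu/block-backend.h"
#include "sysemu/replay.h"

static void blk_request_complete_bh(void *opaque)
{
    /* device callback: completes the request, possibly asserting an IRQ */
}

static void blk_schedule_request_completion(BlockBackend *blk, void *acb)
{
    replay_bh_schedule_oneshot_event(blk_get_aio_context(blk),
                                     blk_request_complete_bh, acb);
}
```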
* Re: acceptance-system-fedora failures
  2020-10-13  8:57 ` Philippe Mathieu-Daudé
  To: Pavel Dovgalyuk, Kevin Wolf
  Cc: Thomas Huth, Eduardo Habkost, Qemu-block, QEMU Developers, Wainer dos Santos Moschetta, Stefan Hajnoczi, Cleber Rosa, Paolo Bonzini, Alex Bennée

On 10/9/20 12:37 PM, Pavel Dovgalyuk wrote:
> On 08.10.2020 14:50, Kevin Wolf wrote:
[...]
> That's right.
>
> The block layer does not touch the guest by itself, but it includes
> callbacks that may raise interrupts when disk operations finish. That is
> why we synchronize these callbacks with the vCPU through the replay
> layer.

Instead of having to remember to use replay_bh_schedule_event whenever
guest state is modified (and having buggy code otherwise), what about
expecting replay to be used everywhere and disabling it only where we
know guest state is not modified?
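A rough sketch of this inverted default, reusing the events_enabled /
replay_add_event calls from the earlier util/async.c diff; the name
aio_bh_schedule_oneshot_nonguest is invented for illustration and this is a
proposal sketch, not existing code:

```c
/* Hypothetical inversion of the default: the common entry point is
 * replay-aware, and only callers known not to affect guest state opt out.
 * Based on the util/async.c diff quoted earlier in the thread. */

/* Always bypasses the replay log: for internal-only work such as the
 * monitor setup BH mentioned above. */
void aio_bh_schedule_oneshot_nonguest(AioContext *ctx, QEMUBHFunc *cb,
                                      void *opaque)
{
    QEMUBH *bh = g_new(QEMUBH, 1);
    *bh = (QEMUBH){
        .ctx = ctx,
        .cb = cb,
        .opaque = opaque,
    };
    aio_bh_enqueue(bh, BH_SCHEDULED | BH_ONESHOT);
}

/* Default entry point: recorded in record mode, replayed deterministically,
 * and a plain oneshot BH when record/replay is disabled. */
void aio_bh_schedule_oneshot(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
{
    if (events_enabled) {
        replay_add_event(REPLAY_ASYNC_EVENT_BH_ONESHOT, cb, opaque,
                         replay_get_current_icount());
    } else {
        aio_bh_schedule_oneshot_nonguest(ctx, cb, opaque);
    }
}
```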
* Re: acceptance-system-fedora failures
  2020-10-07 12:17 ` Pavel Dovgalyuk
  To: Philippe Mathieu-Daudé, Thomas Huth, John Snow, Cleber Rosa
  Cc: Richard Henderson, Alex Bennée, Eduardo Habkost, Wainer dos Santos Moschetta, QEMU Developers

On 07.10.2020 12:57, Philippe Mathieu-Daudé wrote:
> On 10/7/20 10:51 AM, Pavel Dovgalyuk wrote:
[...]
>> replay_bh_schedule_oneshot_event should be used instead of this
>> function, because it synchronizes non-deterministic BHs.
>
> Why do we have 2 different functions? BHs are already complex enough,
> and now we also need to think about the replay API...

There is a note about it in docs/devel/replay.txt.

It's hard to protect the guest state from incorrect operations. There
was a similar problem with icount: everyone who modifies the translation
code needs to take it into account. But now we have record/replay tests
that assert that patches do not break icount/rr.

> What about the other cases, such as vhost-user (blk/net) and virtio-blk?

I do not know much about those modules. The main idea is the following:
if the code works with the guest state, it should deal with the icount
and rr functions.

Pavel Dovgalyuk
* Re: acceptance-system-fedora failures
  2020-10-07 14:03 ` John Snow
  To: Thomas Huth, Philippe Mathieu-Daudé, Cleber Rosa, Pavel Dovgalyuk
  Cc: Richard Henderson, Alex Bennée, Eduardo Habkost, Wainer dos Santos Moschetta, QEMU Developers

On 10/7/20 4:23 AM, Thomas Huth wrote:
> ... starting with this commit, the test starts failing. John, any idea
> what might be causing this?
>
> Thomas

Ahh, oh no! It *was* my fault, just nothing to do with what I was
actually working on. Good!

I'll look a little more closely. I see some more discussion below; I'll
see if the recommended changes to the BH oneshot help alleviate the
problem.

Thanks.

--js
* Re: acceptance-system-fedora failures
  2020-10-07  7:23 ` Thomas Huth
  To: Philippe Mathieu-Daudé, John Snow, Cleber Rosa, Pavel Dovgalyuk
  Cc: Peter Maydell, Eduardo Habkost, QEMU Developers, Wainer dos Santos Moschetta, Alex Bennée, Richard Henderson

On 07/10/2020 07.20, Philippe Mathieu-Daudé wrote:
> On 10/7/20 1:07 AM, John Snow wrote:
>> I'm seeing this gitlab test fail quite often in my Python work; I don't
>> *think* this has anything to do with my patches, but maybe I need to try
>> and bisect this more aggressively.
[...]
> W.r.t. the error in your build, I told Thomas about the
> test_ppc_mac99/day15/invaders.elf test timing out, but he said this is
> not his area. Richard looked yesterday to see whether it is a TCG
> regression, and said the test either finishes or crashes, raising
> SIGCHLD, but the Avocado parent is still waiting for a timeout, so the
> child becomes a zombie and the test hangs.
>
> Not sure that helps :)

No clue why that invaders.elf test is now failing with the replay test,
but if you look at the history of the CI runs:

https://gitlab.com/qemu-project/qemu/-/pipelines

... you can clearly see that the problem started with John's
ide-pull-request 5 days ago:

https://gitlab.com/qemu-project/qemu/-/pipelines/197124608

So maybe it's worth the effort to have a look at the patches that got
merged here?

 Thomas
* Re: acceptance-system-fedora failures
  2020-10-07  7:23 ` Thomas Huth
@ 2020-10-07  8:19 ` Philippe Mathieu-Daudé
  0 siblings, 0 replies; 19+ messages in thread
From: Philippe Mathieu-Daudé @ 2020-10-07 8:19 UTC (permalink / raw)
To: Thomas Huth, John Snow, Cleber Rosa, Pavel Dovgalyuk
Cc: Peter Maydell, Eduardo Habkost, QEMU Developers,
    Wainer dos Santos Moschetta, Alex Bennée, Richard Henderson

On 10/7/20 9:23 AM, Thomas Huth wrote:
> On 07/10/2020 07.20, Philippe Mathieu-Daudé wrote:
>> On 10/7/20 1:07 AM, John Snow wrote:
>>> I'm seeing this gitlab test fail quite often in my Python work; I don't
>>> *think* this has anything to do with my patches, but maybe I need to try
>>> and bisect this more aggressively.
> [...]
>> w.r.t. the error in your build, I told Thomas about the
>> test_ppc_mac99/day15/invaders.elf timeouting but he said this is
>> not his area. Richard has been looking yesterday to see if it is
>> a TCG regression, and said the test either finished/crashed raising
>> SIGCHLD, but Avocado parent is still waiting for a timeout, so the
>> children become zombie and the test hang.
>>
>> Not sure that helps :)
>
> No clue why that invaders.elf is now failing with the replay test, but if
> you look at the history of the CI runs:
>
> https://gitlab.com/qemu-project/qemu/-/pipelines
>
> ... you can clearly see that the problem started with John's
> ide-pull-request 5 days ago:
>
> https://gitlab.com/qemu-project/qemu/-/pipelines/197124608
>
> So maybe it's worth the effort to have a look at the patches that got merged
> here?

Great idea! Bisected using:

$ make qemu-system-ppc check-venv && \
  tests/venv/bin/python -m \
    avocado --show=app,console,replay \
    run -t machine:mac99 \
    tests/acceptance/replay_kernel.py

55adb3c45620c31f29978f209e2a44a08d34e2da is the first bad commit
commit 55adb3c45620c31f29978f209e2a44a08d34e2da
Author: John Snow <jsnow@redhat.com>
Date:   Fri Jul 24 01:23:00 2020 -0400

    ide: cancel pending callbacks on SRST

    The SRST implementation did not keep up with the rest of IDE; it is
    possible to perform a weak reset on an IDE device to remove the
    BSY/DRQ bits, and then issue writes to the control/device registers
    which can cause chaos with the state machine.

    Fix that by actually performing a real reset.

    Reported-by: Alexander Bulekov <alxndr@bu.edu>
    Fixes: https://bugs.launchpad.net/qemu/+bug/1878253
    Fixes: https://bugs.launchpad.net/qemu/+bug/1887303
    Fixes: https://bugs.launchpad.net/qemu/+bug/1887309
    Signed-off-by: John Snow <jsnow@redhat.com>

 hw/ide/core.c | 58 ++++++++++++++++++++++++++++++++++++++++------------------
 1 file changed, 40 insertions(+), 18 deletions(-)

Regards,

Phil.

^ permalink raw reply [flat|nested] 19+ messages in thread
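The culprit is already identified above, but for completeness the same reproducer can also drive an automated bisection. A sketch, assuming a known-good commit is available (shown here as a placeholder) and re-adding the --job-timeout from the earlier reproducer so that a hung test turns into a failure instead of stalling the run:

  $ git bisect start 55adb3c45620 <known-good-commit>
  $ git bisect run sh -c ' \
      make qemu-system-ppc check-venv && \
      tests/venv/bin/python -m avocado \
          --show=app,console,replay \
          run --job-timeout 300 -t machine:mac99 \
          tests/acceptance/replay_kernel.py'

Note that git bisect run treats any non-zero exit status (other than 125) as "bad", so an unrelated build failure would also be marked bad; making the build step exit with 125 in that case would tell git to skip the commit instead.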
* Re: acceptance-system-fedora failures
  2020-10-07  5:20 ` Philippe Mathieu-Daudé
  2020-10-07  7:13 ` Philippe Mathieu-Daudé
  2020-10-07  7:23 ` Thomas Huth
@ 2020-10-07 14:38 ` Cleber Rosa
  2 siblings, 0 replies; 19+ messages in thread
From: Cleber Rosa @ 2020-10-07 14:38 UTC (permalink / raw)
To: Philippe Mathieu-Daudé
Cc: Thomas Huth, Eduardo Habkost, John Snow, QEMU Developers,
    Wainer dos Santos Moschetta, Alex Bennée, Richard Henderson

On Wed, Oct 07, 2020 at 07:20:49AM +0200, Philippe Mathieu-Daudé wrote:
> On 10/7/20 1:07 AM, John Snow wrote:
> > I'm seeing this gitlab test fail quite often in my Python work; I don't
> > *think* this has anything to do with my patches, but maybe I need to try
> > and bisect this more aggressively.
> >
> > The very first hint of an error I see is on line 154:
> >
> > https://gitlab.com/jsnow/qemu/-/jobs/776334918#L154
> >
> > 22:05:36 ERROR|
> > 22:05:36 ERROR| Reproduced traceback from:
> > /builds/jsnow/qemu/build/tests/venv/lib64/python3.8/site-packages/avocado/core/test.py:753
> >
> > 22:05:36 ERROR| Traceback (most recent call last):
> > 22:05:36 ERROR| File
> > "/builds/jsnow/qemu/build/tests/acceptance/avocado_qemu/__init__.py",
> > line 171, in setUp
> > 22:05:36 ERROR| self.cancel("No QEMU binary defined or found in the
> > build tree")
>

One very bad aspect of this output is that the test outcome (a test
cancellation) is determined by an exception handler in the runner, and
the "ERROR" prefix is indeed very misleading. But yes, in short, it's
not an *error*, but the test getting canceled.

> Last year the Avocado developers said we could have a clearer
> log error report, but meanwhile this verbose output is better
> that not having anything ¯\_(ツ)_/¯
>

With the new test runner ("avocado run --test-runner=nrunner") that
just made its way into Avocado 81.0, there's a much better separation
of what happens within the test and within the runner. The next step
is to post the QEMU and "Avocado >= 82.0" integration, so hopefully
this will improve soon.

> >
> > Is this a known problem?
>
> "No QEMU binary defined or found in the build tree" is not a
> problem, it means a test is skipped because the qemu-system-$ARCH
> binary is not found. In your case this is because your job
> (acceptance-system-fedora) is based on build-system-fedora
> which only build the following targets:
>
> TARGETS: tricore-softmmu microblaze-softmmu mips-softmmu
> xtensa-softmmu m68k-softmmu riscv32-softmmu ppc-softmmu
> sparc64-softmmu
>

Right.

> Now I don't understand what binary the EmptyCPUModel/Migration tests
> are expecting. Maybe these tests only work when a single target is
> built? IOW not expecting that the python code searching for a binary
> return multiple ones? -> Eduardo/Cleber.
>

`avocado_qemu.pick_default_qemu()`, if not given an arch, will try to
find a target binary that matches the host. The problem with picking
any (first available?) built target is the non-deterministic aspect.

BTW, with regards to how `avocado_qemu.pick_default_qemu()` will get
an arch, it can come either from a test parameter, or from an "arch"
tag.

Cheers,
- Cleber.

^ permalink raw reply [flat|nested] 19+ messages in thread
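As an illustration of the tag-based selection Cleber describes, a hypothetical test (not one from the tree) that pins the target architecture; with the arch tag below, the framework looks for qemu-system-ppc specifically, and the test is canceled in setUp() with the "No QEMU binary defined or found in the build tree" message, rather than picking an arbitrary target, when that binary was not built:

  from avocado_qemu import Test


  class PpcBinarySelectionExample(Test):
      """
      Hypothetical example of pinning the target architecture.

      :avocado: tags=arch:ppc
      """

      def test_smoke(self):
          # self.qemu_bin is resolved by the avocado_qemu framework from
          # the arch tag (or an explicit parameter) during setUp().
          self.vm.add_args('-nodefaults')
          self.vm.launch()
          self.vm.shutdown()

The same mechanism is what makes a job like acceptance-system-fedora skip the tests whose target was not part of the corresponding build job.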
end of thread, other threads:[~2020-10-13 9:02 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-10-06 23:07 acceptance-system-fedora failures John Snow
2020-10-07  5:20 ` Philippe Mathieu-Daudé
2020-10-07  7:13 ` Philippe Mathieu-Daudé
2020-10-07  8:23 ` Thomas Huth
2020-10-07  8:51 ` Pavel Dovgalyuk
2020-10-07  9:57 ` Philippe Mathieu-Daudé
2020-10-07 11:22 ` Alex Bennée
2020-10-07 12:20 ` Pavel Dovgalyuk
2020-10-07 12:49 ` Philippe Mathieu-Daudé
2020-10-07 13:11 ` Pavel Dovgalyuk
2020-10-08 10:26 ` Philippe Mathieu-Daudé
2020-10-08 11:50 ` Kevin Wolf
2020-10-09 10:37 ` Pavel Dovgalyuk
2020-10-13  8:57 ` Philippe Mathieu-Daudé
2020-10-07 12:17 ` Pavel Dovgalyuk
2020-10-07 14:03 ` John Snow
2020-10-07  7:23 ` Thomas Huth
2020-10-07  8:19 ` Philippe Mathieu-Daudé
2020-10-07 14:38 ` Cleber Rosa