* [Qemu-devel] [PATCH] cpus.c: Fix race condition in cpu_stop_current()
@ 2018-12-07 15:59 Peter Maydell
2018-12-08 8:47 ` Jaap Crezee
` (6 more replies)
0 siblings, 7 replies; 13+ messages in thread
From: Peter Maydell @ 2018-12-07 15:59 UTC (permalink / raw)
To: qemu-devel; +Cc: patches, Paolo Bonzini, Jaap Crezee
We use cpu_stop_current() to ensure the current CPU has stopped
from places like qemu_system_reset_request(). Unfortunately its
current implementation has a race. It calls qemu_cpu_stop(),
which sets cpu->stopped to true even though the CPU hasn't
actually stopped yet. The main thread will look at the flags
set by qemu_system_reset_request() and call pause_all_vcpus().
pause_all_vcpus() waits for every cpu to have cpu->stopped true,
so it can continue (and we will start the system reset operation)
before the vcpu thread has got back to its top level loop.
Instead, just set cpu->stop and call cpu_exit(). This will
cause the vcpu to exit back to the top level loop, and there
(as part of the wait_io_event code) it will call qemu_cpu_stop().
This fixes bugs where the reset request appeared to be ignored
or the CPU misbehaved because the reset operation started
to change vcpu state while the vcpu thread was still using it.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
---
We discussed this a little while back:
https://lists.gnu.org/archive/html/qemu-devel/2018-10/msg00154.html
and Jaap reported a bug which I suspect of being the same thing:
https://lists.gnu.org/archive/html/qemu-discuss/2018-10/msg00014.html
Annoyingly I have lost the test case that demonstrated this
race, but I analysed it at the time and this should definitely
fix it. I have opted not to try to address any of the other
possible cleanup here (eg vm_stop() has a potential similar
race if called from a vcpu thread I suspect), since it gets
pretty tangled.
Jaap: could you test whether this patch fixes the issue you
were seeing, please?
---
cpus.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/cpus.c b/cpus.c
index 0ddeeefc14f..b09b7027126 100644
--- a/cpus.c
+++ b/cpus.c
@@ -2100,7 +2100,8 @@ void qemu_init_vcpu(CPUState *cpu)
void cpu_stop_current(void)
{
if (current_cpu) {
- qemu_cpu_stop(current_cpu, true);
+ current_cpu->stop = true;
+ cpu_exit(current_cpu);
}
}
--
2.19.2
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] [PATCH] cpus.c: Fix race condition in cpu_stop_current()
2018-12-07 15:59 [Qemu-devel] [PATCH] cpus.c: Fix race condition in cpu_stop_current() Peter Maydell
@ 2018-12-08 8:47 ` Jaap Crezee
2018-12-10 7:43 ` Jaap Crezee
` (5 subsequent siblings)
6 siblings, 0 replies; 13+ messages in thread
From: Jaap Crezee @ 2018-12-08 8:47 UTC (permalink / raw)
To: Peter Maydell, qemu-devel; +Cc: patches, Paolo Bonzini
Hi all,
On 12/7/18 4:59 PM, Peter Maydell wrote:
> Jaap: could you test whether this patch fixes the issue you
> were seeing, please?
I have applied the patch and started my test tool against it. It will need some time as I have also seen cases where it only failed after 600 reboots.
My testtool logs into the VM over ssh and reboots it. Meanwhile it will check the device going offline and coming back online again and starts over
(including a random wait).
I will keep you updated. Thanks!
Jaap
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] [PATCH] cpus.c: Fix race condition in cpu_stop_current()
2018-12-07 15:59 [Qemu-devel] [PATCH] cpus.c: Fix race condition in cpu_stop_current() Peter Maydell
2018-12-08 8:47 ` Jaap Crezee
@ 2018-12-10 7:43 ` Jaap Crezee
2018-12-10 11:05 ` Alex Bennée
` (4 subsequent siblings)
6 siblings, 0 replies; 13+ messages in thread
From: Jaap Crezee @ 2018-12-10 7:43 UTC (permalink / raw)
To: Peter Maydell, qemu-devel; +Cc: patches, Paolo Bonzini
Hello all,
On 12/7/18 4:59 PM, Peter Maydell wrote:
> Jaap: could you test whether this patch fixes the issue you
> were seeing, please?
My test is going very well. With the patch applied, I have no longer been able to freeze/hang the VM. Currently at 7024 reboots and counting over
runtime 1 day 23 hours. I will start testing on my production environment as well.
Tested-by: Jaap Crezee <jaap@jcz.nl>
regards,
Jaap
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] [PATCH] cpus.c: Fix race condition in cpu_stop_current()
2018-12-07 15:59 [Qemu-devel] [PATCH] cpus.c: Fix race condition in cpu_stop_current() Peter Maydell
2018-12-08 8:47 ` Jaap Crezee
2018-12-10 7:43 ` Jaap Crezee
@ 2018-12-10 11:05 ` Alex Bennée
2018-12-10 11:17 ` Peter Maydell
2018-12-10 14:30 ` KONRAD Frederic
` (3 subsequent siblings)
6 siblings, 1 reply; 13+ messages in thread
From: Alex Bennée @ 2018-12-10 11:05 UTC (permalink / raw)
To: qemu-devel; +Cc: Paolo Bonzini, Jaap Crezee, patches
Peter Maydell <peter.maydell@linaro.org> writes:
> We use cpu_stop_current() to ensure the current CPU has stopped
> from places like qemu_system_reset_request(). Unfortunately its
> current implementation has a race. It calls qemu_cpu_stop(),
> which sets cpu->stopped to true even though the CPU hasn't
> actually stopped yet. The main thread will look at the flags
> set by qemu_system_reset_request() and call pause_all_vcpus().
> pause_all_vcpus() waits for every cpu to have cpu->stopped true,
> so it can continue (and we will start the system reset operation)
> before the vcpu thread has got back to its top level loop.
>
> Instead, just set cpu->stop and call cpu_exit(). This will
> cause the vcpu to exit back to the top level loop, and there
> (as part of the wait_io_event code) it will call qemu_cpu_stop().
>
> This fixes bugs where the reset request appeared to be ignored
> or the CPU misbehaved because the reset operation started
> to change vcpu state while the vcpu thread was still using it.
>
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> ---
> We discussed this a little while back:
> https://lists.gnu.org/archive/html/qemu-devel/2018-10/msg00154.html
> and Jaap reported a bug which I suspect of being the same thing:
> https://lists.gnu.org/archive/html/qemu-discuss/2018-10/msg00014.html
>
> Annoyingly I have lost the test case that demonstrated this
> race, but I analysed it at the time and this should definitely
> fix it. I have opted not to try to address any of the other
> possible cleanup here (eg vm_stop() has a potential similar
> race if called from a vcpu thread I suspect), since it gets
> pretty tangled.
>
> Jaap: could you test whether this patch fixes the issue you
> were seeing, please?
> ---
> cpus.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/cpus.c b/cpus.c
> index 0ddeeefc14f..b09b7027126 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -2100,7 +2100,8 @@ void qemu_init_vcpu(CPUState *cpu)
> void cpu_stop_current(void)
> {
> if (current_cpu) {
> - qemu_cpu_stop(current_cpu, true);
> + current_cpu->stop = true;
> + cpu_exit(current_cpu);
Should the FIXME in vm_stop also be fixed?
/*
* FIXME: should not return to device code in case
* vm_stop() has been requested.
*/
cpu_stop_current();
return 0;
> }
> }
--
Alex Bennée
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] [PATCH] cpus.c: Fix race condition in cpu_stop_current()
2018-12-10 11:05 ` Alex Bennée
@ 2018-12-10 11:17 ` Peter Maydell
2018-12-10 12:15 ` Alex Bennée
0 siblings, 1 reply; 13+ messages in thread
From: Peter Maydell @ 2018-12-10 11:17 UTC (permalink / raw)
To: Alex Bennée
Cc: QEMU Developers, Paolo Bonzini, Jaap Crezee, patches@linaro.org
On Mon, 10 Dec 2018 at 11:06, Alex Bennée <alex.bennee@linaro.org> wrote:
>
>
> Peter Maydell <peter.maydell@linaro.org> writes:
> > We discussed this a little while back:
> > https://lists.gnu.org/archive/html/qemu-devel/2018-10/msg00154.html
> > and Jaap reported a bug which I suspect of being the same thing:
> > https://lists.gnu.org/archive/html/qemu-discuss/2018-10/msg00014.html
> >
> > Annoyingly I have lost the test case that demonstrated this
> > race, but I analysed it at the time and this should definitely
> > fix it. I have opted not to try to address any of the other
> > possible cleanup here (eg vm_stop() has a potential similar
> > race if called from a vcpu thread I suspect), since it gets
> > pretty tangled.
> >
> > Jaap: could you test whether this patch fixes the issue you
> > were seeing, please?
> > ---
> > cpus.c | 3 ++-
> > 1 file changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/cpus.c b/cpus.c
> > index 0ddeeefc14f..b09b7027126 100644
> > --- a/cpus.c
> > +++ b/cpus.c
> > @@ -2100,7 +2100,8 @@ void qemu_init_vcpu(CPUState *cpu)
> > void cpu_stop_current(void)
> > {
> > if (current_cpu) {
> > - qemu_cpu_stop(current_cpu, true);
> > + current_cpu->stop = true;
> > + cpu_exit(current_cpu);
>
> Should the FIXME in vm_stop also be fixed?
>
> /*
> * FIXME: should not return to device code in case
> * vm_stop() has been requested.
> */
> cpu_stop_current();
> return 0;
This is one of the things I had in mind with:
> > I have opted not to try to address any of the other
> > possible cleanup here (eg vm_stop() has a potential similar
> > race if called from a vcpu thread I suspect), since it gets
> > pretty tangled.
though I might actually have meant pause_all_vcpus().
(For pause_all_vcpus() I think the correct thing is to
fix the hw/i386/kvmvapic.c code to work in some other way,
and then assert that pause_all_vcpus() is never called from
the VCPU thread.)
At any rate, this code is quite complicated, so I think
it's worth just fixing this specific race without getting
tangled up in everything else we could potentially refactor.
I'm not even sure how we would arrange for vm_stop() to
avoid returning to device emulation code if it has been
called from there -- I would favour instead changing/defining
the semantics to be like the shutdown-request and reset-request
where the device code expects that control will return but
the VM stop happens at the next opportunity, ie as soon
as the execution of the insn which wrote to the device
register has completed.
thanks
-- PMM
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] [PATCH] cpus.c: Fix race condition in cpu_stop_current()
2018-12-10 11:17 ` Peter Maydell
@ 2018-12-10 12:15 ` Alex Bennée
2018-12-10 13:07 ` Peter Maydell
0 siblings, 1 reply; 13+ messages in thread
From: Alex Bennée @ 2018-12-10 12:15 UTC (permalink / raw)
To: Peter Maydell
Cc: QEMU Developers, Paolo Bonzini, Jaap Crezee, patches@linaro.org
Peter Maydell <peter.maydell@linaro.org> writes:
> On Mon, 10 Dec 2018 at 11:06, Alex Bennée <alex.bennee@linaro.org> wrote:
>>
>>
>> Peter Maydell <peter.maydell@linaro.org> writes:
>> > We discussed this a little while back:
>> > https://lists.gnu.org/archive/html/qemu-devel/2018-10/msg00154.html
>> > and Jaap reported a bug which I suspect of being the same thing:
>> > https://lists.gnu.org/archive/html/qemu-discuss/2018-10/msg00014.html
>> >
>> > Annoyingly I have lost the test case that demonstrated this
>> > race, but I analysed it at the time and this should definitely
>> > fix it. I have opted not to try to address any of the other
>> > possible cleanup here (eg vm_stop() has a potential similar
>> > race if called from a vcpu thread I suspect), since it gets
>> > pretty tangled.
>> >
>> > Jaap: could you test whether this patch fixes the issue you
>> > were seeing, please?
>> > ---
>> > cpus.c | 3 ++-
>> > 1 file changed, 2 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/cpus.c b/cpus.c
>> > index 0ddeeefc14f..b09b7027126 100644
>> > --- a/cpus.c
>> > +++ b/cpus.c
>> > @@ -2100,7 +2100,8 @@ void qemu_init_vcpu(CPUState *cpu)
>> > void cpu_stop_current(void)
>> > {
>> > if (current_cpu) {
>> > - qemu_cpu_stop(current_cpu, true);
>> > + current_cpu->stop = true;
>> > + cpu_exit(current_cpu);
>>
>> Should the FIXME in vm_stop also be fixed?
>>
>> /*
>> * FIXME: should not return to device code in case
>> * vm_stop() has been requested.
>> */
>> cpu_stop_current();
>> return 0;
>
> This is one of the things I had in mind with:
>> > I have opted not to try to address any of the other
>> > possible cleanup here (eg vm_stop() has a potential similar
>> > race if called from a vcpu thread I suspect), since it gets
>> > pretty tangled.
>
> though I might actually have meant pause_all_vcpus().
> (For pause_all_vcpus() I think the correct thing is to
> fix the hw/i386/kvmvapic.c code to work in some other way,
> and then assert that pause_all_vcpus() is never called from
> the VCPU thread.)
I thought we had already fixed this, but it is yet another case. See the
patch_instruction code in the same file. The only niggle is ensuring
that either the helper is called from last instruction in the block or
forcing a cpu_exit after queuing the work.
The i386 helpers are tricky because they seem to be very deeply nested
so its hard to be sure everything really is done.
But yes I agree that pause_all_vcpus() should be reserved for non-vCPU
contexts only.
> At any rate, this code is quite complicated, so I think
> it's worth just fixing this specific race without getting
> tangled up in everything else we could potentially refactor.
Fair enough.
>
> I'm not even sure how we would arrange for vm_stop() to
> avoid returning to device emulation code if it has been
> called from there -- I would favour instead changing/defining
> the semantics to be like the shutdown-request and reset-request
> where the device code expects that control will return but
> the VM stop happens at the next opportunity, ie as soon
> as the execution of the insn which wrote to the device
> register has completed.
Is there anyway we can assert we are in a helper that has caused all
globals to be saved before the call? Otherwise we need to make similar
guarantees that the ARM TLB flushes have that they are always the last
in a block of instructions.
prep_port0092_write in PPC seems to do a similar thing. Perhaps we need
to expose some common helpers for vcpus?
--
Alex Bennée
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] [PATCH] cpus.c: Fix race condition in cpu_stop_current()
2018-12-10 12:15 ` Alex Bennée
@ 2018-12-10 13:07 ` Peter Maydell
0 siblings, 0 replies; 13+ messages in thread
From: Peter Maydell @ 2018-12-10 13:07 UTC (permalink / raw)
To: Alex Bennée
Cc: QEMU Developers, Paolo Bonzini, Jaap Crezee, patches@linaro.org
On Mon, 10 Dec 2018 at 12:15, Alex Bennée <alex.bennee@linaro.org> wrote:
> Peter Maydell <peter.maydell@linaro.org> writes:
> > though I might actually have meant pause_all_vcpus().
> > (For pause_all_vcpus() I think the correct thing is to
> > fix the hw/i386/kvmvapic.c code to work in some other way,
> > and then assert that pause_all_vcpus() is never called from
> > the VCPU thread.)
>
> I thought we had already fixed this, but it is yet another case. See the
> patch_instruction code in the same file. The only niggle is ensuring
> that either the helper is called from last instruction in the block or
> forcing a cpu_exit after queuing the work.
Note that the call to pause_all_vcpus() is inside an
"if (kvm_enabled())" so it doesn't matter what the TCG
code does in the way of dividing code up into blocks.
(Though the comment suggests that making it work in TCG
might be nice in theory.)
thanks
-- PMM
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] [PATCH] cpus.c: Fix race condition in cpu_stop_current()
2018-12-07 15:59 [Qemu-devel] [PATCH] cpus.c: Fix race condition in cpu_stop_current() Peter Maydell
` (2 preceding siblings ...)
2018-12-10 11:05 ` Alex Bennée
@ 2018-12-10 14:30 ` KONRAD Frederic
2018-12-10 14:39 ` Peter Maydell
2018-12-10 20:58 ` Jaap Crezee
` (2 subsequent siblings)
6 siblings, 1 reply; 13+ messages in thread
From: KONRAD Frederic @ 2018-12-10 14:30 UTC (permalink / raw)
To: Peter Maydell, qemu-devel; +Cc: Paolo Bonzini, Jaap Crezee, patches
Hi Peter,
Thanks for that patch!
I'm seeing the same kind of issue when I run 8 qemu in parallel but it doesn't
seem to be fixed by this patch. Is it supposed to fix the issue when we are
doing a reset_request through a MMIO device?
It happens (rarely) with this kind of guest code:
exit:
write to the register to reset the device
loop:
branch loop
The code after the reset is executed.. can't we exit the loop directly with
cpu_loop_exit after cpu_exit?
Thanks,
Fred
Le 12/7/18 à 4:59 PM, Peter Maydell a écrit :
> We use cpu_stop_current() to ensure the current CPU has stopped
> from places like qemu_system_reset_request(). Unfortunately its
> current implementation has a race. It calls qemu_cpu_stop(),
> which sets cpu->stopped to true even though the CPU hasn't
> actually stopped yet. The main thread will look at the flags
> set by qemu_system_reset_request() and call pause_all_vcpus().
> pause_all_vcpus() waits for every cpu to have cpu->stopped true,
> so it can continue (and we will start the system reset operation)
> before the vcpu thread has got back to its top level loop.
>
> Instead, just set cpu->stop and call cpu_exit(). This will
> cause the vcpu to exit back to the top level loop, and there
> (as part of the wait_io_event code) it will call qemu_cpu_stop().
>
> This fixes bugs where the reset request appeared to be ignored
> or the CPU misbehaved because the reset operation started
> to change vcpu state while the vcpu thread was still using it.
>
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> ---
> We discussed this a little while back:
> https://lists.gnu.org/archive/html/qemu-devel/2018-10/msg00154.html
> and Jaap reported a bug which I suspect of being the same thing:
> https://lists.gnu.org/archive/html/qemu-discuss/2018-10/msg00014.html
>
> Annoyingly I have lost the test case that demonstrated this
> race, but I analysed it at the time and this should definitely
> fix it. I have opted not to try to address any of the other
> possible cleanup here (eg vm_stop() has a potential similar
> race if called from a vcpu thread I suspect), since it gets
> pretty tangled.
>
> Jaap: could you test whether this patch fixes the issue you
> were seeing, please?
> ---
> cpus.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/cpus.c b/cpus.c
> index 0ddeeefc14f..b09b7027126 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -2100,7 +2100,8 @@ void qemu_init_vcpu(CPUState *cpu)
> void cpu_stop_current(void)
> {
> if (current_cpu) {
> - qemu_cpu_stop(current_cpu, true);
> + current_cpu->stop = true;
> + cpu_exit(current_cpu);
> }
> }
>
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] [PATCH] cpus.c: Fix race condition in cpu_stop_current()
2018-12-10 14:30 ` KONRAD Frederic
@ 2018-12-10 14:39 ` Peter Maydell
2018-12-10 14:52 ` KONRAD Frederic
0 siblings, 1 reply; 13+ messages in thread
From: Peter Maydell @ 2018-12-10 14:39 UTC (permalink / raw)
To: KONRAD Frederic
Cc: QEMU Developers, Paolo Bonzini, Jaap Crezee, patches@linaro.org
On Mon, 10 Dec 2018 at 14:30, KONRAD Frederic
<frederic.konrad@adacore.com> wrote:
>
> Hi Peter,
>
> Thanks for that patch!
>
> I'm seeing the same kind of issue when I run 8 qemu in parallel but it doesn't
> seem to be fixed by this patch. Is it supposed to fix the issue when we are
> doing a reset_request through a MMIO device?
>
> It happens (rarely) with this kind of guest code:
>
> exit:
> write to the register to reset the device
> loop:
> branch loop
>
> The code after the reset is executed.. can't we exit the loop directly with
> cpu_loop_exit after cpu_exit?
cpu_loop_exit would abort the execution of the store instruction
that writes to the reset register. I'm not sure that's a great
idea. My thought was more that we should just make sure that insn
is the last one in the TB, so effectively we execute that insn and
then reset the system before executing any further insns. Thinking
it over though I'm not sure that we do do anything that could
avoid having more insns following in the same TB, unless you're
using singlestep or icount...
thanks
-- PMM
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] [PATCH] cpus.c: Fix race condition in cpu_stop_current()
2018-12-10 14:39 ` Peter Maydell
@ 2018-12-10 14:52 ` KONRAD Frederic
0 siblings, 0 replies; 13+ messages in thread
From: KONRAD Frederic @ 2018-12-10 14:52 UTC (permalink / raw)
To: Peter Maydell
Cc: QEMU Developers, Paolo Bonzini, Jaap Crezee, patches@linaro.org
Le 12/10/18 à 3:39 PM, Peter Maydell a écrit :
> On Mon, 10 Dec 2018 at 14:30, KONRAD Frederic
> <frederic.konrad@adacore.com> wrote:
>>
>> Hi Peter,
>>
>> Thanks for that patch!
>>
>> I'm seeing the same kind of issue when I run 8 qemu in parallel but it doesn't
>> seem to be fixed by this patch. Is it supposed to fix the issue when we are
>> doing a reset_request through a MMIO device?
>>
>> It happens (rarely) with this kind of guest code:
>>
>> exit:
>> write to the register to reset the device
>> loop:
>> branch loop
>>
>> The code after the reset is executed.. can't we exit the loop directly with
>> cpu_loop_exit after cpu_exit?
>
> cpu_loop_exit would abort the execution of the store instruction
> that writes to the reset register. I'm not sure that's a great
> idea. My thought was more that we should just make sure that insn
> is the last one in the TB, so effectively we execute that insn and
> then reset the system before executing any further insns. Thinking
> it over though I'm not sure that we do do anything that could
> avoid having more insns following in the same TB, unless you're
> using singlestep or icount...
Exactly I think we don't do anything for that.. But we can't guess which IO will
require the loop to be exited though..
Fred
>
> thanks
> -- PMM
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] [PATCH] cpus.c: Fix race condition in cpu_stop_current()
2018-12-07 15:59 [Qemu-devel] [PATCH] cpus.c: Fix race condition in cpu_stop_current() Peter Maydell
` (3 preceding siblings ...)
2018-12-10 14:30 ` KONRAD Frederic
@ 2018-12-10 20:58 ` Jaap Crezee
2018-12-11 1:06 ` Emilio G. Cota
2019-01-04 15:36 ` Peter Maydell
6 siblings, 0 replies; 13+ messages in thread
From: Jaap Crezee @ 2018-12-10 20:58 UTC (permalink / raw)
To: Peter Maydell, qemu-devel; +Cc: patches, Paolo Bonzini
Hi again,
On 12/7/18 4:59 PM, Peter Maydell wrote:
> We use cpu_stop_current() to ensure the current CPU has stopped
> from places like qemu_system_reset_request(). Unfortunately its
> current implementation has a race. It calls qemu_cpu_stop(),
> which sets cpu->stopped to true even though the CPU hasn't
> actually stopped yet. The main thread will look at the flags
> set by qemu_system_reset_request() and call pause_all_vcpus().
> pause_all_vcpus() waits for every cpu to have cpu->stopped true,
> so it can continue (and we will start the system reset operation)
> before the vcpu thread has got back to its top level loop.
>
> Instead, just set cpu->stop and call cpu_exit(). This will
> cause the vcpu to exit back to the top level loop, and there
> (as part of the wait_io_event code) it will call qemu_cpu_stop().
>
> This fixes bugs where the reset request appeared to be ignored
> or the CPU misbehaved because the reset operation started
> to change vcpu state while the vcpu thread was still using it.
>
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
> ---
> We discussed this a little while back:
> https://lists.gnu.org/archive/html/qemu-devel/2018-10/msg00154.html
> and Jaap reported a bug which I suspect of being the same thing:
> https://lists.gnu.org/archive/html/qemu-discuss/2018-10/msg00014.html
>
> Annoyingly I have lost the test case that demonstrated this
> race, but I analysed it at the time and this should definitely
> fix it. I have opted not to try to address any of the other
> possible cleanup here (eg vm_stop() has a potential similar
> race if called from a vcpu thread I suspect), since it gets
> pretty tangled.
>
> Jaap: could you test whether this patch fixes the issue you
> were seeing, please?
> ---
> cpus.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/cpus.c b/cpus.c
> index 0ddeeefc14f..b09b7027126 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -2100,7 +2100,8 @@ void qemu_init_vcpu(CPUState *cpu)
> void cpu_stop_current(void)
> {
> if (current_cpu) {
> - qemu_cpu_stop(current_cpu, true);
> + current_cpu->stop = true;
> + cpu_exit(current_cpu);
> }
> }
>
>
The patch also fixed the issue on my production system. Thanks!
regards,
Jaap
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] [PATCH] cpus.c: Fix race condition in cpu_stop_current()
2018-12-07 15:59 [Qemu-devel] [PATCH] cpus.c: Fix race condition in cpu_stop_current() Peter Maydell
` (4 preceding siblings ...)
2018-12-10 20:58 ` Jaap Crezee
@ 2018-12-11 1:06 ` Emilio G. Cota
2019-01-04 15:36 ` Peter Maydell
6 siblings, 0 replies; 13+ messages in thread
From: Emilio G. Cota @ 2018-12-11 1:06 UTC (permalink / raw)
To: Peter Maydell; +Cc: qemu-devel, Paolo Bonzini, Jaap Crezee, patches
On Fri, Dec 07, 2018 at 15:59:11 +0000, Peter Maydell wrote:
> We use cpu_stop_current() to ensure the current CPU has stopped
> from places like qemu_system_reset_request(). Unfortunately its
> current implementation has a race. It calls qemu_cpu_stop(),
> which sets cpu->stopped to true even though the CPU hasn't
> actually stopped yet. The main thread will look at the flags
> set by qemu_system_reset_request() and call pause_all_vcpus().
> pause_all_vcpus() waits for every cpu to have cpu->stopped true,
> so it can continue (and we will start the system reset operation)
> before the vcpu thread has got back to its top level loop.
>
> Instead, just set cpu->stop and call cpu_exit(). This will
> cause the vcpu to exit back to the top level loop, and there
> (as part of the wait_io_event code) it will call qemu_cpu_stop().
>
> This fixes bugs where the reset request appeared to be ignored
> or the CPU misbehaved because the reset operation started
> to change vcpu state while the vcpu thread was still using it.
>
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Emilio G. Cota <cota@braap.org>
Thanks,
E.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [Qemu-devel] [PATCH] cpus.c: Fix race condition in cpu_stop_current()
2018-12-07 15:59 [Qemu-devel] [PATCH] cpus.c: Fix race condition in cpu_stop_current() Peter Maydell
` (5 preceding siblings ...)
2018-12-11 1:06 ` Emilio G. Cota
@ 2019-01-04 15:36 ` Peter Maydell
6 siblings, 0 replies; 13+ messages in thread
From: Peter Maydell @ 2019-01-04 15:36 UTC (permalink / raw)
To: QEMU Developers; +Cc: Paolo Bonzini, Jaap Crezee, patches@linaro.org
On Fri, 7 Dec 2018 at 16:00, Peter Maydell <peter.maydell@linaro.org> wrote:
>
> We use cpu_stop_current() to ensure the current CPU has stopped
> from places like qemu_system_reset_request(). Unfortunately its
> current implementation has a race. It calls qemu_cpu_stop(),
> which sets cpu->stopped to true even though the CPU hasn't
> actually stopped yet. The main thread will look at the flags
> set by qemu_system_reset_request() and call pause_all_vcpus().
> pause_all_vcpus() waits for every cpu to have cpu->stopped true,
> so it can continue (and we will start the system reset operation)
> before the vcpu thread has got back to its top level loop.
>
> Instead, just set cpu->stop and call cpu_exit(). This will
> cause the vcpu to exit back to the top level loop, and there
> (as part of the wait_io_event code) it will call qemu_cpu_stop().
>
> This fixes bugs where the reset request appeared to be ignored
> or the CPU misbehaved because the reset operation started
> to change vcpu state while the vcpu thread was still using it.
>
> Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
I'm going to put this in via target-arm.next, unless anybody
would like to suggest another route.
thanks
-- PMM
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2019-01-04 15:37 UTC | newest]
Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-12-07 15:59 [Qemu-devel] [PATCH] cpus.c: Fix race condition in cpu_stop_current() Peter Maydell
2018-12-08 8:47 ` Jaap Crezee
2018-12-10 7:43 ` Jaap Crezee
2018-12-10 11:05 ` Alex Bennée
2018-12-10 11:17 ` Peter Maydell
2018-12-10 12:15 ` Alex Bennée
2018-12-10 13:07 ` Peter Maydell
2018-12-10 14:30 ` KONRAD Frederic
2018-12-10 14:39 ` Peter Maydell
2018-12-10 14:52 ` KONRAD Frederic
2018-12-10 20:58 ` Jaap Crezee
2018-12-11 1:06 ` Emilio G. Cota
2019-01-04 15:36 ` Peter Maydell
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).