xen-devel.lists.xenproject.org archive mirror
* [PATCH] [RFC] Fix a small window on CPU online/offline
@ 2010-04-01  9:22 Jiang, Yunhong
  2010-04-01 10:25 ` Keir Fraser
  0 siblings, 1 reply; 6+ messages in thread
From: Jiang, Yunhong @ 2010-04-01  9:22 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel@lists.xensource.com, Jan Beulich

This is an RFC patch for a small window in CPU online/offline handling. It is not a clean solution; I am sending it out before I have finished checking all the related code, so that I can get feedback from the community.
Currently there is a small window during CPU online/offline. In take_cpu_down(), which runs in stop_machine_run() context, the CPU is marked offline and interrupts are disabled. But it is only later, at play_dead() (in idle_task_exit() called from cpu_exit_clear()), that the offlining CPU tries to sync the lazy execution state. The window is this: by the time play_dead() runs, stop_machine_run() has already finished, so a vcpu whose context is still out of sync may be scheduled on another CPU.

This may cause several issues:
a) When the vcpu is scheduled on another CPU, that CPU will try to sync the context left on the original CPU through flush_tlb_mask(), per the following code in context_switch(). Because the original CPU is marked offline with interrupts disabled, this hangs in flush_area_mask(). I sent changeset 21079:8ab60a883fd5 to avoid the hang.
    if ( unlikely(!cpu_isset(cpu, dirty_mask) && !cpus_empty(dirty_mask)) )
    {
        /* Other cpus call __sync_lazy_execstate from flush ipi handler. */
        flush_tlb_mask(&dirty_mask);
    }

b) However, changeset 21079 is still not the right solution, although the patch itself is fine. With that changeset the system will not hang, but the vCPU's context is still not synced.
c) Moreover, when the offlining CPU executes idle_task_exit(), it may try to re-sync the vcpu context back to the guest, which will clobber the now-running vCPU.

The following patch tries to sync the vcpu context while still in stop_machine_run() context, so that the vCPU gets its context synced. However, it still does not resolve issue c. I'm considering also marking curr_vcpu() as idle, so that idle_task_exit() will not try to sync the context again, but I suspect that is not the right way.

Any suggestions?

BTW, the flush_local() is there to make sure we flush all TLB context, so that when the CPU comes online again there is no stale state left on it, especially if the CPU has no deep C-state.
--jyh

diff -r ebd84be3420a xen/arch/x86/smpboot.c
--- a/xen/arch/x86/smpboot.c    Tue Mar 30 18:31:39 2010 +0100
+++ b/xen/arch/x86/smpboot.c    Thu Apr 01 16:47:57 2010 +0800
@@ -34,6 +34,7 @@
 *      Rusty Russell   :   Hacked into shape for new "hotplug" boot process. */

 #include <xen/config.h>
+#include <asm/i387.h>
 #include <xen/init.h>
 #include <xen/kernel.h>
 #include <xen/mm.h>
@@ -1308,6 +1309,22 @@ int __cpu_disable(void)

    cpu_disable_scheduler();

+
+    if ( !is_idle_vcpu(this_cpu(curr_vcpu)) )
+    {
+        struct cpu_user_regs *stack_regs = guest_cpu_user_regs();
+        struct vcpu *v;
+
+        v = this_cpu(curr_vcpu);
+        memcpy(&v->arch.guest_context.user_regs,
+               stack_regs,
+               CTXT_SWITCH_STACK_BYTES);
+        unlazy_fpu(v);
+        current->arch.ctxt_switch_from(v);
+    }
+
+    flush_local(FLUSH_CACHE | FLUSH_TLB_GLOBAL | FLUSH_TLB);
+
    return 0;
 }


* Re: [PATCH] [RFC] Fix a small window on CPU online/offline
  2010-04-01  9:22 [PATCH] [RFC] Fix a small window on CPU online/offline Jiang, Yunhong
@ 2010-04-01 10:25 ` Keir Fraser
  2010-04-01 10:33   ` Keir Fraser
  2010-04-02  3:31   ` Jiang, Yunhong
  0 siblings, 2 replies; 6+ messages in thread
From: Keir Fraser @ 2010-04-01 10:25 UTC (permalink / raw)
  To: Jiang, Yunhong; +Cc: xen-devel@lists.xensource.com, Jan Beulich

On 01/04/2010 10:22, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:

> The following patch tries to sync the vcpu context while still in
> stop_machine_run() context, so that the vCPU gets its context synced.
> However, it still does not resolve issue c. I'm considering also marking
> curr_vcpu() as idle, so that idle_task_exit() will not try to sync the
> context again, but I suspect that is not the right way.
> 
> Any suggestions?

Painful though it is to say it, the answer may be to run the stop_machine in
a 'hypervisor thread' context, as Linux does. That sidesteps the whole
issue, but of course would need us to introduce a whole bunch of
infrastructure which we don't have in Xen presently.

Another approach would be to defer a lot of what goes on in __cpu_disable()
until play_dead()/cpu_exit_clear(). You could do the stop_machine_run()
invocation from there, knowing that you can sync guest state before zapping
the cpu_online_map... Actually this approach does start to unravel and need
quite a lot of subtle changes itself!

I would probably investigate option A (a more Linux-y stop_machine_run) but
instead of the kthread_create() infrastructure I would consider extending
the purpose of our 'idle vcpus' to provide a sufficiently more generic
'hypervisor vcpu context'. For our purposes that would mean:
 1. Allow the scheduling priority of an idle vcpu to be changed to
highest-priority (would mean some, hopefully not very major, scheduler
surgery).
 2. Add a hook to the idle loop to call out to a hook in stop_machine.c.

Then you would loop over all online CPUs, like in Linux, whacking up the
priority and getting the idle vcpu to call back to us. Umm, we would also
need some kind of wait_for_completion() mechanism, which might look a bit
like aspects of continue_hypercall_on_cpu() -- we would pause the running
vcpu, change schedule_tail hook, and exit. We would then get our pause count
dropped when the stop_machine stuff is done, and re-gain control via
schedule_tail.
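
[Illustrative sketch, not Xen code: a rough userspace analogue, in plain C
with pthreads, of the rendezvous described above. Threads stand in for
per-CPU idle vcpus, and every name here (stop_machine_run_sketch(),
idle_vcpu_callback(), and so on) is invented for the sketch rather than
taken from Xen or Linux.]

/* Userspace sketch only: each thread plays the part of a per-CPU idle vcpu
 * that has been bumped to highest priority and calls back into the
 * stop_machine logic from its idle loop. */
#include <pthread.h>
#include <stdio.h>

#define NR_CPUS 4

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int arrived;                  /* "CPUs" that reached the rendezvous */
static int done;                     /* set once the stopped work has run  */

/* Hook the idle loop would call once its priority has been raised. */
static void *idle_vcpu_callback(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    arrived++;
    pthread_cond_broadcast(&cond);
    while (!done)                    /* held here until the work completes */
        pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
    return NULL;
}

/* The driver: gather every CPU, run fn while all are held, then release. */
static void stop_machine_run_sketch(void (*fn)(void))
{
    pthread_t tids[NR_CPUS];

    for (int i = 0; i < NR_CPUS; i++)
        pthread_create(&tids[i], NULL, idle_vcpu_callback, NULL);

    pthread_mutex_lock(&lock);
    while (arrived < NR_CPUS)        /* the wait_for_completion() part */
        pthread_cond_wait(&cond, &lock);
    fn();                            /* e.g. take_cpu_down() would run here */
    done = 1;
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);

    for (int i = 0; i < NR_CPUS; i++)
        pthread_join(tids[i], NULL);
}

static void take_cpu_down_sketch(void)
{
    puts("all cpus held; safe to take the cpu down");
}

int main(void)
{
    stop_machine_run_sketch(take_cpu_down_sketch);
    return 0;
}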

Well, it wouldn't be a trivial patch by any means, but potentially not *too*
bad, and less fragile than some other approaches? I think it's a major
benefit that it takes us closer to Linux semantics, as this stuff is
fragile, and we're already quite a way down the road of Linux but currently
our stop_machine is just a bit half-arsed and causing us problems.

By the way, you could consider that c/s 21049 starts to take us down this
path: the spin_trylock()/create_continuation() semantics is not totally
dissimilar to Linux's mutex_lock (in which other softirqs/vcpus can get to
run while we wait for the lock to be released), which are used for the
cpu-hotplug related locks in Linux.
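
[Illustrative sketch, not the actual c/s 21049 code: the shape of that
trylock-or-retry pattern in plain C with pthreads. The hypercall
continuation machinery is not reproduced; returning non-zero stands in for
"create a continuation and come back later", and cpu_hotplug_op() is an
invented name.]

#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static pthread_mutex_t hotplug_lock = PTHREAD_MUTEX_INITIALIZER;

/* Instead of blocking the shared per-CPU context on the lock, try it; if it
 * is busy, back out and ask to be re-run later. */
static int cpu_hotplug_op(void)
{
    if (pthread_mutex_trylock(&hotplug_lock) != 0)
        return 1;                    /* "continuation": please retry later */

    puts("hotplug op: lock held, doing the work");
    pthread_mutex_unlock(&hotplug_lock);
    return 0;
}

int main(void)
{
    /* The caller re-issues the op until it stops asking to be retried;
     * in the meantime other softirqs/vcpus would get to run. */
    while (cpu_hotplug_op() != 0)
        sched_yield();
    return 0;
}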

 -- Keir


* Re: [PATCH] [RFC] Fix a small window on CPU online/offline
  2010-04-01 10:25 ` Keir Fraser
@ 2010-04-01 10:33   ` Keir Fraser
  2010-04-02  3:31   ` Jiang, Yunhong
  1 sibling, 0 replies; 6+ messages in thread
From: Keir Fraser @ 2010-04-01 10:33 UTC (permalink / raw)
  To: Jiang, Yunhong; +Cc: xen-devel@lists.xensource.com, Jan Beulich

On 01/04/2010 11:25, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:

> By the way, you could consider that c/s 21049 starts to take us down this
> path: the spin_trylock()/create_continuation() semantics is not totally
> dissimilar to Linux's mutex_lock (in which other softirqs/vcpus can get to
> run while we wait for the lock to be released), which are used for the
> cpu-hotplug related locks in Linux.

Thinking some more, we could even have more Linux-y mutex and completion
semantics if we allowed vcpus to be voluntarily preemptible in the Linux
way. Obviously our barrier to that currently is that we have per-cpu stacks,
not per-vcpu stacks.

One way to get around this without needing to hack out the concept of
per-cpu stacks would be to realise that all this mutex/completion stuff
would only get used on a few control-operation paths, mainly from
physdevop/sysctl/domctl -- i.e., dom0 management hypercalls. We could wrap
those up in a start_preemptible()/end_preemptible() region which would
alloc/free a shadow stack. Within those regions it would be safe to use new
mutex/completion abstractions which could then behave just like in Linux. We
would have an underlying mechanism which could copy the active stack out
into the shadow, adjust schedule_tail, and then do_softirq() to get into the
scheduler.
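
[Illustrative sketch, not Xen code: a userspace demonstration of the
underlying mechanism -- giving a control operation its own stack so it can
block and be resumed later -- using ucontext. start_preemptible() and
end_preemptible() are only the names proposed above, and the "control op"
body is invented.]

#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

#define SHADOW_STACK_SIZE (64 * 1024)

static ucontext_t main_ctx, op_ctx;

/* Stand-in for a mutex/completion wait: yield back to the "scheduler". */
static void wait_for_completion_sketch(void)
{
    swapcontext(&op_ctx, &main_ctx);
}

/* The dom0 management operation, running on its own shadow stack. */
static void control_op(void)
{
    puts("control op: started, now blocking");
    wait_for_completion_sketch();            /* preempted here */
    puts("control op: resumed, finishing");
}

int main(void)
{
    void *shadow_stack = malloc(SHADOW_STACK_SIZE);

    /* start_preemptible(): set up a shadow stack and switch onto it. */
    getcontext(&op_ctx);
    op_ctx.uc_stack.ss_sp = shadow_stack;
    op_ctx.uc_stack.ss_size = SHADOW_STACK_SIZE;
    op_ctx.uc_link = &main_ctx;              /* where to go when op returns */
    makecontext(&op_ctx, control_op, 0);

    swapcontext(&main_ctx, &op_ctx);         /* run the op until it blocks */
    puts("scheduler: op is blocked, other work could run here");
    swapcontext(&main_ctx, &op_ctx);         /* "completion": resume the op */

    /* end_preemptible(): the shadow stack can be freed again. */
    free(shadow_stack);
    return 0;
}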

I don't know whether it is worth going this far, but perhaps having some
easier-to-use synchronisation primitives like this pays off in the long
run. You can certainly see it's going to make the dom0 paths through Xen
easier to understand, and make code-borrowing from Linux a less difficult
and fragile proposition.

I could help with some of this, and/or what I described in my previous
email, by the way.

 -- Keir


* RE: [PATCH] [RFC] Fix a small window on CPU online/offline
  2010-04-01 10:25 ` Keir Fraser
  2010-04-01 10:33   ` Keir Fraser
@ 2010-04-02  3:31   ` Jiang, Yunhong
  2010-04-02  7:14     ` Keir Fraser
  1 sibling, 1 reply; 6+ messages in thread
From: Jiang, Yunhong @ 2010-04-02  3:31 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Tian, Kevin, xen-devel@lists.xensource.com, Jan Beulich

Keir, thanks for your detailed suggestion!

But I suspect that even if we backport all the kernel infrastructure, we may still need some changes for this. Per my understanding, in the kernel the task is migrated only after the offline CPU is dead and there is no lazy state sync, which means we would still need to fix this issue.

After a short discussion with Kevin: maybe we can sync the state in cpu_disable_scheduler() if current is idle, and then set a flag so that we will not sync again in the context switch later. If current is not idle, we can leave the context switch to do the sync for us. I will investigate further to see how many changes are needed.
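
[Illustrative sketch, not Xen code: the once-only guard being proposed, with
invented names. Whichever path runs first does the sync and sets the flag,
so the other path becomes a no-op.]

#include <stdbool.h>
#include <stdio.h>

static bool lazy_state_synced;       /* per-CPU flag in the real thing */

static void sync_lazy_state(const char *caller)
{
    if (lazy_state_synced)
        return;                      /* already synced by the other path */
    printf("%s: syncing lazy vcpu state\n", caller);
    lazy_state_synced = true;
}

int main(void)
{
    /* If current is idle, cpu_disable_scheduler() syncs and sets the flag;
     * the later context switch then skips the sync. */
    sync_lazy_state("cpu_disable_scheduler");
    sync_lazy_state("context_switch");
    return 0;
}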

--jyh

>-----Original Message-----
>From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
>Sent: Thursday, April 01, 2010 6:25 PM
>To: Jiang, Yunhong
>Cc: Jan Beulich; xen-devel@lists.xensource.com
>Subject: Re: [PATCH] [RFC] Fix a small window on CPU online/offline
>
>On 01/04/2010 10:22, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:
>
>> The following patch tries to sync the vcpu context while still in
>> stop_machine_run() context, so that the vCPU gets its context synced.
>> However, it still does not resolve issue c. I'm considering also marking
>> curr_vcpu() as idle, so that idle_task_exit() will not try to sync the
>> context again, but I suspect that is not the right way.
>>
>> Any suggestions?
>
>Painful though it is to say it, the answer may be to run the stop_machine in
>a 'hypervisor thread' context, as Linux does. That sidesteps the whole
>issue, but of course would need us to introduce a whole bunch of
>infrastructure which we don't have in Xen presently.
>
>Another approach would be to defer a lot of what goes on in __cpu_disable()
>until play_dead()/cpu_exit_clear(). You could do the stop_machine_run()
>invocation from there, knowing that you can sync guest state before zapping
>the cpu_online_map... Actually this approach does start to unravel and need
>quite a lot of subtle changes itself!
>
>I would probably investigate option A (a more Linux-y stop_machine_run) but
>instead of the kthread_create() infrastructure I would consider extending
>the purpose of our 'idle vcpus' to provide a sufficiently more generic
>'hypervisor vcpu context'. For our purposes that would mean:
> 1. Allow the scheduling priority of an idle vcpu to be changed to
>highest-priority (would mean some, hopefully not very major, scheduler
>surgery).
> 2. Add a hook to the idle loop to call out to a hook in stop_machine.c.
>
>Then you would loop over all online CPUs, like in Linux, whacking up the
>priority and getting the idle vcpu to call back to us. Umm, we would also
>need some kind of wait_for_completion() mechanism, which might look a bit
>like aspects of continue_hypercall_on_cpu() -- we would pause the running
>vcpu, change schedule_tail hook, and exit. We would then get our pause count
>dropped when the stop_machine stuff is done, and re-gain control via
>schedule_tail.
>
>Well, it wouldn't be a trivial patch by any means, but potentially not *too*
>bad, and less fragile than some other approaches? I think it's a major
>benefit that it takes us closer to Linux semantics, as this stuff is
>fragile, and we're already quite a way down the road of Linux but currently
>our stop_machine is just a bit half-arsed and causing us problems.
>
>By the way, you could consider that c/s 21049 starts to take us down this
>path: the spin_trylock()/create_continuation() semantics is not totally
>dissimilar to Linux's mutex_lock (in which other softirqs/vcpus can get to
>run while we wait for the lock to be released), which are used for the
>cpu-hotplug related locks in Linux.
>
> -- Keir
>


* Re: [PATCH] [RFC] Fix a small window on CPU online/offline
  2010-04-02  3:31   ` Jiang, Yunhong
@ 2010-04-02  7:14     ` Keir Fraser
  2010-04-02  7:27       ` Keir Fraser
  0 siblings, 1 reply; 6+ messages in thread
From: Keir Fraser @ 2010-04-02  7:14 UTC (permalink / raw)
  To: Jiang, Yunhong; +Cc: Tian, Kevin, xen-devel@lists.xensource.com, Jan Beulich

On 02/04/2010 04:31, "Jiang, Yunhong" <yunhong.jiang@intel.com> wrote:

> After a short discussion with Kevin: maybe we can sync the state in
> cpu_disable_scheduler() if current is idle, and then set a flag so that we
> will not sync again in the context switch later. If current is not idle, we
> can leave the context switch to do the sync for us. I will investigate
> further to see how many changes are needed.

Hm, actually maybe that could work. You might not even need a flag in case
current is non-idle in cpu_disable_scheduler(). It might suffice to force
context_switch() to do a full context switch synchronously when the CPU is
going offline (see appended patch). I was thinking there was a race as soon
as the cpu is cleared from cpu_online_map, but actually the problem occurs
only once the vcpu is descheduled, so if we can synchronously sync its state
before calling context_saved(), perhaps we are fine.

This could be a very small patch after all! :-)

 -- Keir

--- a/xen/arch/x86/domain.c    Thu Apr 01 09:55:27 2010 +0100
+++ b/xen/arch/x86/domain.c    Fri Apr 02 08:12:14 2010 +0100
@@ -1442,7 +1442,8 @@
 
     set_current(next);
 
-    if ( (per_cpu(curr_vcpu, cpu) == next) || is_idle_vcpu(next) )
+    if ( (per_cpu(curr_vcpu, cpu) == next) ||
+         (is_idle_vcpu(next) && !cpu_is_offline(cpu)) )
     {
         local_irq_enable();
     }


* Re: Re: [PATCH] [RFC] Fix a small window on CPU online/offline
  2010-04-02  7:14     ` Keir Fraser
@ 2010-04-02  7:27       ` Keir Fraser
  0 siblings, 0 replies; 6+ messages in thread
From: Keir Fraser @ 2010-04-02  7:27 UTC (permalink / raw)
  To: Jiang, Yunhong; +Cc: Tian, Kevin, xen-devel@lists.xensource.com, Jan Beulich

[-- Attachment #1: Type: text/plain, Size: 757 bytes --]

On 02/04/2010 08:14, "Keir Fraser" <keir.fraser@eu.citrix.com> wrote:

> Hm, actually maybe that could work. You might not even need a flag in case
> current is non-idle in cpu_disable_scheduler(). It might suffice to force
> context_switch() to do a full context switch synchronously when the CPU is
> going offline (see appended patch). I was thinking there was a race as soon
> as the cpu is cleared from cpu_online_map, but actually the problem occurs
> only once the vcpu is descheduled, so if we can synchronously sync its state
> before calling context_saved(), perhaps we are fine.
> 
> This could be a very small patch after all! :-)

Perhaps even as small as the attached patch?

It's simple enough we could almost consider it for 4.0.0.

 -- Keir


[-- Attachment #2: 00-offline --]
[-- Type: application/octet-stream, Size: 1012 bytes --]

diff -r 537451477469 xen/arch/x86/domain.c
--- a/xen/arch/x86/domain.c	Thu Apr 01 09:55:27 2010 +0100
+++ b/xen/arch/x86/domain.c	Fri Apr 02 08:25:30 2010 +0100
@@ -1442,7 +1442,8 @@
 
     set_current(next);
 
-    if ( (per_cpu(curr_vcpu, cpu) == next) || is_idle_vcpu(next) )
+    if ( (per_cpu(curr_vcpu, cpu) == next) ||
+         (is_idle_vcpu(next) && cpu_online(cpu)) )
     {
         local_irq_enable();
     }
diff -r 537451477469 xen/arch/x86/smpboot.c
--- a/xen/arch/x86/smpboot.c	Thu Apr 01 09:55:27 2010 +0100
+++ b/xen/arch/x86/smpboot.c	Fri Apr 02 08:25:30 2010 +0100
@@ -1302,6 +1302,13 @@
 
 	remove_siblinginfo(cpu);
 
+	/*
+	 * If we are running the idle vcpu, sync the last non-idle vcpu's state
+	 * before changing cpu_online_map. If we are running a non-idle vcpu,
+	 * we will synchronously sync the state in context_switch() later.
+	 */
+	__sync_lazy_execstate();
+
 	/* It's now safe to remove this processor from the online map */
 	cpu_clear(cpu, cpu_online_map);
 	fixup_irqs();


