xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed
* cpuidle causing Dom0 soft lockups
@ 2010-01-21  9:51 Jan Beulich
  2010-01-21 10:11 ` Keir Fraser
                   ` (2 more replies)
  0 siblings, 3 replies; 61+ messages in thread
From: Jan Beulich @ 2010-01-21  9:51 UTC (permalink / raw)
  To: xen-devel

On large systems, with Dom0 booting with (significantly) more than
32 vCPU-s, we have received multiple reports that the C-state
management, now enabled by default, is causing soft lockups, usually
preventing the boot from completing.

The observations are:

Reducing the number of vCPU-s (or pCPU-s) sufficiently makes
the systems work.

max_cstate=0 makes the systems work.

max_cstate=1 makes the problem less severe on one (bigger) system,
and eliminates it completely on another (smaller) one.

When appearing to hang, all vCPU-s are in Dom0's timer_interrupt(),
and all (sometimes all but one) are attempting to acquire xtime_lock.
However, due to our use of ticket locks we can verify that this is not
a deadlock (repeatedly sending '0' shows forward progress, as the
tickets [visible on the stack] continue to increase). Additionally, there
is always one vCPU that has its polling event channel (used for
waking the next waiting vCPU when a lock becomes available)
signaled.

In one case (but not in the other) it is always the same vCPU that
is apparently taking very long to wake up from the polling request.
This may be coincidence, but output after sending 'c' also indicates
a significantly higher (about 3 times) usage value for C2 than the
second highest one; the duration printed is roughly the same for
all CPUs.

While I don't know this code well, it would seem that we're suffering
from extremely long wakeup times. This suggests that there likely is
a (performance) problem even for smaller numbers of vCPU-s.
Hence, unless it can be fixed before 4.0 releases, I would suggest
disabling C-state management by default again.

I can provide full logs in case needed.

Jan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: cpuidle causing Dom0 soft lockups
  2010-01-21  9:51 cpuidle causing Dom0 soft lockups Jan Beulich
@ 2010-01-21 10:11 ` Keir Fraser
  2010-01-21 10:16   ` Jan Beulich
  2010-01-21 10:35 ` Wei, Gang
  2010-01-21 12:07 ` Yu, Ke
  2 siblings, 1 reply; 61+ messages in thread
From: Keir Fraser @ 2010-01-21 10:11 UTC (permalink / raw)
  To: Jan Beulich, xen-devel@lists.xensource.com

On 21/01/2010 09:51, "Jan Beulich" <JBeulich@novell.com> wrote:

> While I don't know this code well, it would seem that we're suffering
> from extremely long wakeup times. This suggests that there likely is
> a (performance) problem even for smaller numbers of vCPU-s.
> Hence, unless it can be fixed before 4.0 releases, I would suggest
> disabling C-state management by default again.

It's already enabled by default in 3.4 too, though. Are there problems there
as well?

 -- keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: cpuidle causing Dom0 soft lockups
  2010-01-21 10:11 ` Keir Fraser
@ 2010-01-21 10:16   ` Jan Beulich
  2010-01-21 10:26     ` Keir Fraser
  0 siblings, 1 reply; 61+ messages in thread
From: Jan Beulich @ 2010-01-21 10:16 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel@lists.xensource.com

>>> Keir Fraser <keir.fraser@eu.citrix.com> 21.01.10 11:11 >>>
>On 21/01/2010 09:51, "Jan Beulich" <JBeulich@novell.com> wrote:
>
>> While I don't know this code well, it would seem that we're suffering
>> from extremely long wakeup times. This suggests that there likely is
>> a (performance) problem even for smaller numbers of vCPU-s.
>> Hence, unless it can be fixed before 4.0 releases, I would suggest
>> disabling C-state management by default again.
>
>It's already enabled by default in 3.4 too, though. Are there problems there
>as well?

We never shipped 3.4.x for anything that would normally be run on
large systems. And 3.4 doesn't support more than 32 vCPU-s per
domain, hence the problem - even if present - would be hidden.

Jan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: cpuidle causing Dom0 soft lockups
  2010-01-21 10:16   ` Jan Beulich
@ 2010-01-21 10:26     ` Keir Fraser
  2010-01-21 10:53       ` Jan Beulich
  0 siblings, 1 reply; 61+ messages in thread
From: Keir Fraser @ 2010-01-21 10:26 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel@lists.xensource.com

On 21/01/2010 10:16, "Jan Beulich" <JBeulich@novell.com> wrote:

>>> While I don't know this code well, it would seem that we're suffering
>>> from extremely long wakeup times. This suggests that there likely is
>>> a (performance) problem even for smaller numbers of vCPU-s.
>>> Hence, unless it can be fixed before 4.0 releases, I would suggest
>>> disabling C-state management by default again.
>> 
>> It's already enabled by default in 3.4 too, though. Are there problems there
>> as well?
> 
> We never shipped 3.4.x for anything that would normally be run on
> large systems. And 3.4 doesn't support more than 32 vCPU-s per
> domain, hence the problem - even if present - would be hidden.

Hm, okay. Another thing to consider is not using ticket locks when running
on Xen. Regardless of slow wakeup times (and you can't really depend on them
being fast, in general, in a virtualised environment) experiments have shown
that, as you might expect, contended ticket locks perform *abysmally* in
virtualised environments.

Turning off our major source of power-management wins by default in Xen 4.0
is not really a goer.

 -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: cpuidle causing Dom0 soft lockups
  2010-01-21  9:51 cpuidle causing Dom0 soft lockups Jan Beulich
  2010-01-21 10:11 ` Keir Fraser
@ 2010-01-21 10:35 ` Wei, Gang
  2010-01-21 12:07 ` Yu, Ke
  2 siblings, 0 replies; 61+ messages in thread
From: Wei, Gang @ 2010-01-21 10:35 UTC (permalink / raw)
  To: Jan Beulich, xen-devel@lists.xensource.com

Jan Beulich wrote:
> Reducing the number of vCPU-s (or pCPU-s) sufficiently much makes
> the systems work.

What is the largest number of vCPU-s or pCPU-s the system can work with, in your experience?

> max_cstate=1 makes the problem less severe on one (bigger) system,
> and eliminates it completely on another (smaller) one.

How much bigger/smaller? Please give the vCPU and pCPU numbers.

> I can provide full logs in case needed.

Please provide full logs.

Jimmy

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: cpuidle causing Dom0 soft lockups
  2010-01-21 10:26     ` Keir Fraser
@ 2010-01-21 10:53       ` Jan Beulich
  2010-01-21 11:03         ` Keir Fraser
  0 siblings, 1 reply; 61+ messages in thread
From: Jan Beulich @ 2010-01-21 10:53 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel@lists.xensource.com

>>> Keir Fraser <keir.fraser@eu.citrix.com> 21.01.10 11:26 >>>
>Hm, okay. Another thing to consider is not using ticket locks when running
>on Xen. Regardless of slow wakeup times (and you can't really depend on them
>being fast, in general, in a virtualised environment) experiments have shown
>that, as you might expect, contended ticket locks perform *abysmally* in
>virtualised environments.

But that's exactly why we use the polling method: That way we don't
need to wake up all potential waiters, but just the one that's going to
get the lock next. We now even go further and detect whether the
lock owner is not currently running, and go polling right away (instead
of spinning on the lock a few hundred times first).

>Turning off our major source of power-management wins by default in
>Xen 4.0 is not really a goer.

I can see your point. But how can you consider shipping with something
apparently severely broken. As said before - the fact that this manifests
itself by hanging many-vCPU Dom0 has the very likely implication that
there are (so far unnoticed) problems with smaller Dom0-s. If I had a
machine at hand that supports C3, I'd try to do some measurements
with smaller domains...

Jan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: cpuidle causing Dom0 soft lockups
  2010-01-21 10:53       ` Jan Beulich
@ 2010-01-21 11:03         ` Keir Fraser
  2010-01-21 11:13           ` Jan Beulich
  2010-02-02  7:54           ` Jan Beulich
  0 siblings, 2 replies; 61+ messages in thread
From: Keir Fraser @ 2010-01-21 11:03 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel@lists.xensource.com

On 21/01/2010 10:53, "Jan Beulich" <JBeulich@novell.com> wrote:

> But that's exactly why we use the polling method: That way we don't
> need to wake up all potential waiters, but just the one that's going to
> get the lock next. We now even go further and detect whether the
> lock owner is not currently running, and go polling right away (instead
> of spinning on the lock a few hundred times first).

Ah, these are pv spinlocks?

> I can see your point. But how can you consider shipping with something
> apparently severely broken. As said before - the fact that this manifests
> itself by hanging many-vCPU Dom0 has the very likely implication that
> there are (so far unnoticed) problems with smaller Dom0-s. If I had a
> machine at hand that supports C3, I'd try to do some measurements
> with smaller domains...

Well it's a fallback I guess. If we can't make progress on solving it then I
suppose I agree.

 -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: cpuidle causing Dom0 soft lockups
  2010-01-21 11:03         ` Keir Fraser
@ 2010-01-21 11:13           ` Jan Beulich
  2010-02-02  7:54           ` Jan Beulich
  1 sibling, 0 replies; 61+ messages in thread
From: Jan Beulich @ 2010-01-21 11:13 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel@lists.xensource.com

>>> Keir Fraser <keir.fraser@eu.citrix.com> 21.01.10 12:03 >>>
>On 21/01/2010 10:53, "Jan Beulich" <JBeulich@novell.com> wrote:
>
>> But that's exactly why we use the polling method: That way we don't
>> need to wake up all potential waiters, but just the one that's going to
>> get the lock next. We now even go further and detect whether the
>> lock owner is not currently running, and go polling right away (instead
>> of spinning on the lock a few hundred times first).
>
>Ah, these are pv spinlocks?

Yes.

Jan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: cpuidle causing Dom0 soft lockups
  2010-01-21  9:51 cpuidle causing Dom0 soft lockups Jan Beulich
  2010-01-21 10:11 ` Keir Fraser
  2010-01-21 10:35 ` Wei, Gang
@ 2010-01-21 12:07 ` Yu, Ke
  2010-01-25  8:08   ` Jan Beulich
  2 siblings, 1 reply; 61+ messages in thread
From: Yu, Ke @ 2010-01-21 12:07 UTC (permalink / raw)
  To: Jan Beulich, xen-devel@lists.xensource.com

Hi Jan,

Could you try the following debugging patch? It can help narrow down the root cause:

diff -r ea02c95af387 xen/arch/x86/acpi/cpu_idle.c
--- a/xen/arch/x86/acpi/cpu_idle.c
+++ b/xen/arch/x86/acpi/cpu_idle.c
@@ -228,7 +228,6 @@ static void acpi_processor_idle(void)

     cpufreq_dbs_timer_suspend();

-    sched_tick_suspend();
     /* sched_tick_suspend() can raise TIMER_SOFTIRQ. Process it now. */
     process_pending_softirqs();

Regards
Ke

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: cpuidle causing Dom0 soft lockups
  2010-01-21 12:07 ` Yu, Ke
@ 2010-01-25  8:08   ` Jan Beulich
  0 siblings, 0 replies; 61+ messages in thread
From: Jan Beulich @ 2010-01-25  8:08 UTC (permalink / raw)
  To: Ke Yu; +Cc: xen-devel@lists.xensource.com

>>> "Yu, Ke" <ke.yu@intel.com> 21.01.10 13:07 >>>
>Could you try the following debugging patch? It can help narrow down the root cause:

According to our testing this did not have any effect. I added you to
the Cc list of the respective bug, so you have access to the relevant
logs.

>diff -r ea02c95af387 xen/arch/x86/acpi/cpu_idle.c
>--- a/xen/arch/x86/acpi/cpu_idle.c
>+++ b/xen/arch/x86/acpi/cpu_idle.c
>@@ -228,7 +228,6 @@ static void acpi_processor_idle(void)
>
>     cpufreq_dbs_timer_suspend();
>
>-    sched_tick_suspend();
>     /* sched_tick_suspend() can raise TIMER_SOFTIRQ. Process it now. */
>     process_pending_softirqs();

Jan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: cpuidle causing Dom0 soft lockups
  2010-01-21 11:03         ` Keir Fraser
  2010-01-21 11:13           ` Jan Beulich
@ 2010-02-02  7:54           ` Jan Beulich
  2010-02-02  8:13             ` Juergen Gross
                               ` (2 more replies)
  1 sibling, 3 replies; 61+ messages in thread
From: Jan Beulich @ 2010-02-02  7:54 UTC (permalink / raw)
  To: Keir Fraser, ke.yu; +Cc: xen-devel@lists.xensource.com

>>> Keir Fraser <keir.fraser@eu.citrix.com> 21.01.10 12:03 >>>
>On 21/01/2010 10:53, "Jan Beulich" <JBeulich@novell.com> wrote:
>> I can see your point. But how can you consider shipping with something
>> apparently severely broken. As said before - the fact that this manifests
>> itself by hanging many-vCPU Dom0 has the very likely implication that
>> there are (so far unnoticed) problems with smaller Dom0-s. If I had a
>> machine at hand that supports C3, I'd try to do some measurements
>> with smaller domains...
>
>Well it's a fallback I guess. If we can't make progress on solving it then I
>suppose I agree.

Just FYI, we have now also seen an issue on a 24-CPU system that went
away with cpuidle=0 (and static analysis of the hang hinted in that
direction). All I can judge so far is that this likely has something to do
with our kernel's intensive use of the poll hypercall (i.e. we see vCPU-s
not waking up from the call despite there being pending unmasked or
polled for events).

Jan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: cpuidle causing Dom0 soft lockups
  2010-02-02  7:54           ` Jan Beulich
@ 2010-02-02  8:13             ` Juergen Gross
  2010-02-02 17:07             ` Yu, Ke
  2010-02-03  7:32             ` Yu, Ke
  2 siblings, 0 replies; 61+ messages in thread
From: Juergen Gross @ 2010-02-02  8:13 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel@lists.xensource.com, Keir Fraser, ke.yu

Jan Beulich wrote:
>>>> Keir Fraser <keir.fraser@eu.citrix.com> 21.01.10 12:03 >>>
>> On 21/01/2010 10:53, "Jan Beulich" <JBeulich@novell.com> wrote:
>>> I can see your point. But how can you consider shipping with something
>>> apparently severely broken. As said before - the fact that this manifests
>>> itself by hanging many-vCPU Dom0 has the very likely implication that
>>> there are (so far unnoticed) problems with smaller Dom0-s. If I had a
>>> machine at hand that supports C3, I'd try to do some measurements
>>> with smaller domains...
>> Well it's a fallback I guess. If we can't make progress on solving it then I
>> suppose I agree.
> 
> Just fyi, we now also have seen an issue on a 24-CPU system that went
> away with cpuidle=0 (and static analysis of the hang hinted in that
> direction). All I can judge so far is that this likely has something to do
> with our kernel's intensive use of the poll hypercall (i.e. we see vCPU-s
> not waking up from the call despite there being pending unmasked or
> polled for events).

Interesting. I see this problem on a 4-core system.
Can I help investigate?

Data of my machine (Fujitsu TX300-S5):

# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 26
model name      : Intel(R) Xeon(R) CPU           E5520  @ 2.27GHz
stepping        : 5
...


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions               e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: cpuidle causing Dom0 soft lockups
  2010-02-02  7:54           ` Jan Beulich
  2010-02-02  8:13             ` Juergen Gross
@ 2010-02-02 17:07             ` Yu, Ke
  2010-02-03 10:15               ` Jan Beulich
  2010-02-03  7:32             ` Yu, Ke
  2 siblings, 1 reply; 61+ messages in thread
From: Yu, Ke @ 2010-02-02 17:07 UTC (permalink / raw)
  To: Jan Beulich, Keir Fraser; +Cc: xen-devel@lists.xensource.com


Hi Jan,

We have just identified the cause of this issue and are trying to find an appropriate way to fix it.

This issue is the result of the following sequence:

1. Every dom0 vCPU has a 250Hz timer (i.e. a 4ms period). The vCPU's timer_interrupt handler acquires a global ticket spin lock, xtime_lock. When xtime_lock is held by another vCPU, the vCPU polls its event channel and blocks; as a result, the pCPU the vCPU runs on becomes idle. Later, when the lock holder releases xtime_lock, it notifies the event channel to wake the vCPU, at which point the pCPU wakes up from its idle state and schedules the vCPU to run.

From the above, the latency of the vCPU timer interrupt consists of the following items ("latency" here meaning the time between beginning to acquire the lock and finally acquiring it):
T1 - CPU execution time (e.g. timer interrupt lock holding time, event channel notification time)
T2 - CPU idle wakeup time, i.e. the time for a CPU to come back from a deep C state (e.g. C3) to C0; usually on the order of tens to hundreds of microseconds

2. Now consider a large number of CPUs, e.g. 64 pCPUs and 64 vCPUs in dom0, and assume the lock holding sequence is vCPU0 -> vCPU1 -> vCPU2 ... -> vCPU63.
Then vCPU63 spends 64*(T1 + T2) acquiring xtime_lock. If T1+T2 is 100us, the total latency is ~6ms.
Since the timer runs at 250Hz, i.e. a 4ms period, by the time the event channel notification is issued and the pCPU schedules vCPU63, the hypervisor finds the timer overdue and sends another TIMER_VIRQ to vCPU63 (see schedule()->vcpu_periodic_timer_work() for details). vCPU63 is then permanently busy handling timer interrupts and never gets to update the watchdog, which causes the soft lockup.

So from the above sequence, we can see:
- The cpuidle driver adds extra latency, making this issue more likely to occur.
- A large number of CPUs multiplies the latency.
- The ticket spin lock imposes a fixed lock acquisition order, so the latency is repeatedly the full 64*(T1+T2), again making this issue more likely to occur.
The fundamental cause is that the vCPU timer interrupt handler does not scale, due to the global xtime_lock.

From the cpuidle point of view, one thing we are trying to do is change the cpuidle driver to not enter a deep C state when a vCPU with local irqs disabled is polling its event channel. That eliminates the T2 latency.

Anyway, cpuidle is just one side; we can anticipate that if the CPU count is large enough that NR_CPU * T1 > 4ms, this issue will occur again. So another way is to make dom0 scale by not using xtime_lock, although that is pretty hard currently. Yet another option is to limit the dom0 vCPU count to a reasonable level.

Regards
Ke

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: cpuidle causing Dom0 soft lockups
  2010-02-02  7:54           ` Jan Beulich
  2010-02-02  8:13             ` Juergen Gross
  2010-02-02 17:07             ` Yu, Ke
@ 2010-02-03  7:32             ` Yu, Ke
  2010-02-03 10:23               ` Jan Beulich
  2010-02-05  8:48               ` Jan Beulich
  2 siblings, 2 replies; 61+ messages in thread
From: Yu, Ke @ 2010-02-03  7:32 UTC (permalink / raw)
  To: Jan Beulich, Keir Fraser; +Cc: xen-devel@lists.xensource.com

[-- Attachment #1: Type: text/plain, Size: 4675 bytes --]

Hi Jan,

Could you please try the attached patch? It tries to avoid entering a deep C state when there is a vCPU with local irqs disabled that is polling its event channel. When tested on my 64-CPU box, the issue is gone with this patch.

Best Regards
Ke


[-- Attachment #2: cpuidle-hint-v2.patch --]
[-- Type: application/octet-stream, Size: 2623 bytes --]

diff -r 2636e5619708 xen/arch/x86/acpi/cpu_idle.c
--- a/xen/arch/x86/acpi/cpu_idle.c	Tue Jan 26 15:54:40 2010 +0000
+++ b/xen/arch/x86/acpi/cpu_idle.c	Wed Feb 03 15:20:57 2010 +0800
@@ -216,6 +216,31 @@ static inline void trace_exit_reason(u32
     }
 }
 
+/* vcpu is urgent iff
+ *  - has local event delivery disabled, and
+ *  - polling event channel
+ *
+ * if urgent vcpu exists, CPU should not enter deep C state
+ */
+static int has_urgent_vcpu(void)
+{
+    struct domain *d;
+    struct vcpu *v;
+    unsigned int cpu = smp_processor_id();
+
+    for_each_domain ( d )
+    {
+        for_each_vcpu ( d, v )
+        {
+            if ( v->processor == cpu ){
+                if ( unlikely(vcpu_info(v, evtchn_upcall_mask) && v->poll_evtchn) )
+                    return 1;
+            }
+        }
+    }
+    return 0;
+}
+
 static void acpi_processor_idle(void)
 {
     struct acpi_processor_power *power = processor_powers[smp_processor_id()];
@@ -225,6 +250,26 @@ static void acpi_processor_idle(void)
     u32 t1, t2 = 0;
     u32 exp = 0, pred = 0;
     u32 irq_traced[4] = { 0 };
+
+    if ( max_cstate > 0 && power && !has_urgent_vcpu() &&
+         (next_state = cpuidle_current_governor->select(power)) > 0 )
+    {
+        cx = &power->states[next_state];
+        if ( power->flags.bm_check && acpi_idle_bm_check()
+             && cx->type == ACPI_STATE_C3 )
+            cx = power->safe_state;
+        if ( cx->idx > max_cstate )
+            cx = &power->states[max_cstate];
+        menu_get_trace_data(&exp, &pred);
+    }
+    if ( !cx )
+    {
+        if ( pm_idle_save )
+            pm_idle_save();
+        else
+            acpi_safe_halt();
+        return;
+    }
 
     cpufreq_dbs_timer_suspend();
 
@@ -241,28 +286,6 @@ static void acpi_processor_idle(void)
     if ( softirq_pending(smp_processor_id()) )
     {
         local_irq_enable();
-        sched_tick_resume();
-        cpufreq_dbs_timer_resume();
-        return;
-    }
-
-    if ( max_cstate > 0 && power && 
-         (next_state = cpuidle_current_governor->select(power)) > 0 )
-    {
-        cx = &power->states[next_state];
-        if ( power->flags.bm_check && acpi_idle_bm_check()
-             && cx->type == ACPI_STATE_C3 )
-            cx = power->safe_state;
-        if ( cx->idx > max_cstate )
-            cx = &power->states[max_cstate];
-        menu_get_trace_data(&exp, &pred);
-    }
-    if ( !cx )
-    {
-        if ( pm_idle_save )
-            pm_idle_save();
-        else
-            acpi_safe_halt();
         sched_tick_resume();
         cpufreq_dbs_timer_resume();
         return;

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: cpuidle causing Dom0 soft lockups
  2010-02-02 17:07             ` Yu, Ke
@ 2010-02-03 10:15               ` Jan Beulich
  2010-02-03 12:10                 ` Tian, Kevin
  2010-02-03 14:46                 ` Yu, Ke
  0 siblings, 2 replies; 61+ messages in thread
From: Jan Beulich @ 2010-02-03 10:15 UTC (permalink / raw)
  To: Ke Yu; +Cc: xen-devel@lists.xensource.com, Keir Fraser

>>> "Yu, Ke" <ke.yu@intel.com> 02.02.10 18:07 >>>
>>Just fyi, we now also have seen an issue on a 24-CPU system that went
>>away with cpuidle=0 (and static analysis of the hang hinted in that
>>direction). All I can judge so far is that this likely has something to do
>>with our kernel's intensive use of the poll hypercall (i.e. we see vCPU-s
>>not waking up from the call despite there being pending unmasked or
>>polled for events).
>
>We have just identified the cause of this issue and are trying to find an appropriate way to fix it.

Hmm, while I agree that the scenario you describe can be a problem, I
don't think it can explain the behavior on the 24-CPU system pointed
out above, nor the one Juergen Gross pointed out yesterday.

Nor can it explain why this happens at boot time (when you can take it
for granted that several/most of the CPUs are idle [and hence would
have their periodic timer stopped]).

Also, I would think that the rate at which xtime_lock is acquired
may not be the highest of any lock in the entire system, and hence
problems may persist even if we fix timer_interrupt().

>Anyway, cpuidle is just one side; we can anticipate that if the CPU number is large enough that NR_CPU * T1 > 4ms, this issue will occur again. So another way is to make dom0 scale well by not using xtime_lock, although this is pretty hard currently. Yet another way is to limit the dom0 vCPU number to a reasonable level.

I would not think that dealing with the xtime_lock scalability issue in
timer_interrupt() should be *that* difficult. In particular it should be
possible to assign an on-duty CPU (permanent or on a round-robin
basis) that deals with updating jiffies/wallclock, and all other CPUs
just update their local clocks. I had thought about this before, but
never found a strong need to experiment with that.
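
That on-duty idea can be sketched as follows. This is a minimal user-space
illustration under assumed names (NO_DUTY_CPU, try_claim_duty and
release_duty are invented for this sketch), with GCC atomic builtins
standing in for the kernel's cmpxchg():

```c
#include <assert.h>
#include <limits.h>

/* No CPU currently owns the jiffies/wallclock update duty. */
#define NO_DUTY_CPU UINT_MAX

static unsigned int duty_cpu = NO_DUTY_CPU;

/* In timer_interrupt(): the first CPU to observe that nobody is on duty
 * claims the role via cmpxchg and becomes the only one taking xtime_lock
 * and running do_timer(); everyone else just updates its local clock.
 * Returns nonzero if this CPU is (or just became) the duty CPU. */
static int try_claim_duty(unsigned int cpu)
{
	unsigned int expected = NO_DUTY_CPU;

	if (duty_cpu == cpu)
		return 1;
	return __atomic_compare_exchange_n(&duty_cpu, &expected, cpu,
					   0, __ATOMIC_SEQ_CST,
					   __ATOMIC_SEQ_CST);
}

/* When the duty CPU stops its periodic tick (goes idle or offline), it
 * gives up the role so another CPU's next tick can claim it. */
static void release_duty(unsigned int cpu)
{
	unsigned int expected = cpu;

	__atomic_compare_exchange_n(&duty_cpu, &expected, NO_DUTY_CPU,
				    0, __ATOMIC_SEQ_CST,
				    __ATOMIC_SEQ_CST);
}
```

This is the same shape the xtime_lock patch later in this thread takes
(a duty_cpu variable claimed with cmpxchg and released in
stop_hz_timer()).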

Jan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: cpuidle causing Dom0 soft lockups
  2010-02-03  7:32             ` Yu, Ke
@ 2010-02-03 10:23               ` Jan Beulich
  2010-02-05  8:48               ` Jan Beulich
  1 sibling, 0 replies; 61+ messages in thread
From: Jan Beulich @ 2010-02-03 10:23 UTC (permalink / raw)
  To: Ke Yu; +Cc: xen-devel@lists.xensource.com, Keir Fraser

>>> "Yu, Ke" <ke.yu@intel.com> 03.02.10 08:32 >>>
>Could you please try the attached patch. This patch tries to avoid entering deep C-states when a vCPU has local IRQs disabled and is polling an event channel. When tested on my 64-CPU box, this issue is gone with the patch.

We could try it, but I'm not convinced of the approach. Why is the
urgent determination dependent upon event delivery being disabled
on the respective vCPU? If at all, it should imo be polling *or* event
delivery disabled, not *and*.

Also, iterating over all vCPU-s in that function doesn't seem very
scalable. It would seem more reasonable for the scheduler to track
how many "urgent" vCPU-s a pCPU currently has.
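
As a sketch of that bookkeeping, with invented structure and function
names (this is not Xen's actual scheduler code), and using the
polling-*or*-event-delivery-disabled rule argued for above:

```c
#include <assert.h>

/* Per-pCPU scheduler state: one counter replaces iterating over every
 * vCPU when the idle governor decides how deep to sleep. */
struct pcpu_sched {
	unsigned int urgent_count;
};

struct vcpu_state {
	int is_polling;      /* blocked in the poll hypercall */
	int irqs_disabled;   /* event delivery disabled */
	int counted_urgent;  /* currently accounted on its pCPU */
};

/* "Urgent" if polling *or* running with event delivery disabled. */
static int vcpu_is_urgent(const struct vcpu_state *v)
{
	return v->is_polling || v->irqs_disabled;
}

/* Called whenever a vCPU's polling/IRQ state changes, keeping the
 * per-pCPU counter in sync instead of recomputing it in the idle path. */
static void vcpu_urgency_changed(struct pcpu_sched *p, struct vcpu_state *v)
{
	int urgent = vcpu_is_urgent(v);

	if (urgent && !v->counted_urgent) {
		v->counted_urgent = 1;
		p->urgent_count++;
	} else if (!urgent && v->counted_urgent) {
		v->counted_urgent = 0;
		p->urgent_count--;
	}
}

/* O(1) check in the cpuidle governor. */
static int deep_cstate_allowed(const struct pcpu_sched *p)
{
	return p->urgent_count == 0;
}
```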

Jan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: cpuidle causing Dom0 soft lockups
  2010-02-03 10:15               ` Jan Beulich
@ 2010-02-03 12:10                 ` Tian, Kevin
  2010-02-03 12:18                   ` Juergen Gross
  2010-02-03 13:20                   ` Jan Beulich
  2010-02-03 14:46                 ` Yu, Ke
  1 sibling, 2 replies; 61+ messages in thread
From: Tian, Kevin @ 2010-02-03 12:10 UTC (permalink / raw)
  To: Jan Beulich, Yu, Ke; +Cc: Keir, xen-devel@lists.xensource.com, Fraser

[-- Attachment #1: Type: text/plain, Size: 5251 bytes --]

>From: Jan Beulich
>Sent: February 3, 2010 18:16
>
>>>> "Yu, Ke" <ke.yu@intel.com> 02.02.10 18:07 >>>
>>>Just fyi, we now also have seen an issue on a 24-CPU system that went
>>>away with cpuidle=0 (and static analysis of the hang hinted in that
>>>direction). All I can judge so far is that this likely has 
>something to do
>>>with our kernel's intensive use of the poll hypercall (i.e. 
>we see vCPU-s
>>>not waking up from the call despite there being pending unmasked or
>>>polled for events).
>>
>>We just identified the cause of this issue, and is trying to 
>find appropriate way to fix it.
>
>Hmm, while I agree that the scenario you describe can be a problem, I
>don't think it can explain the behavior on the 24-CPU system pointed
>out above, nor the one Juergen Gross pointed out yesterday.

Is the 24-CPU system observed to hang at boot with the same likelihood
as the 64-CPU system, or less frequently? Ke just did some theoretical
analysis by assuming some values. There could be other factors adding to
the latency, and each system may have different characteristics too. We
can't conclude whether a smaller system will face the same issue simply
by changing the CPU number in Ke's formula. :-) Possibly you can provide
cpuidle information on your 24-core system for further comparison.

>
>Nor can it explain why this happens at boot time (when you can take it
>for granted that several/most of the CPUs are idle [and hence would
>have their periodic timer stopped]).

Well, it's true that several/most of the CPUs are idle at boot time, and
when they block into Xen, the periodic timer is stopped. However, that
doesn't mean those 'idle' CPUs block into Xen and never wake up during
the whole boot process (here we are talking about dozens of seconds).
There are random events that wake a blocked vCPU up: random process/
thread creations, device interrupts (based on affinity), per-CPU kernel
threads that wake up periodically, etc. Most importantly, an idle vCPU
hooks a singleshot timer when it blocks into Xen, and that singleshot
timeout is calculated as the minimum of the nearest timer expiration and
the soft lockup threshold. Based on those assumptions, although one dom0
vCPU may be mostly idle during the whole boot process, it will still
wake up occasionally.

Once a blocked vCPU is woken up, Xen checks whether its next periodic
timer interrupt should be delivered. If the vCPU has been blocked for
more than one dom0 tick (4ms in this specific SLES case), a virtual
timer interrupt will be pending before it resumes to the guest. So every
dom0 vCPU still enters the timer interrupt and acquires xtime_lock
intermittently. I'm not sure how likely it is for those idle vCPUs to
wake up at the same time. If there's no special treatment to interleave
vCPU timer interrupts, they may be injected close together, depending on
each vCPU's boot time point. Anyway, given a situation where some idle
vCPUs begin to contend with each other, and several of them exceed the
busy-wait threshold and block into Xen, the latency from deep Cx adds
the chance for more vCPUs to contend for the same xtime_lock in the same
loop. Once many vCPUs contend at the same time, and a full circle from
the 1st ticket to the last exceeds the 4ms expected by dom0, a new timer
interrupt will be injected again. Then no vCPU can escape from that
contention loop.
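
Ke's NR_CPU * T1 > 4ms condition, quoted earlier in the thread, can be
restated as a tiny model. The handoff latencies below are illustrative
assumptions (deep C-state wakeup plus hold time), not measurements:

```c
#include <assert.h>

/* With ticket locks, waiters are served strictly in order, so a full
 * rotation over ncpus contenders costs about ncpus * handoff_ns, where
 * handoff_ns is the holder-to-next-waiter latency (lock hold time plus
 * the next waiter's wakeup latency from a deep C-state). */
static long full_rotation_ns(unsigned int ncpus, long handoff_ns)
{
	return (long)ncpus * handoff_ns;
}

/* If one full rotation exceeds the dom0 tick, a new timer interrupt is
 * already pending before the ticket queue drains, and the contention
 * loop never ends. */
static int contention_loop_persists(unsigned int ncpus, long handoff_ns,
				    long tick_ns)
{
	return full_rotation_ns(ncpus, handoff_ns) > tick_ns;
}
```

With an assumed 100us handoff, 64 contenders overshoot a 4ms tick while
24 do not; but doubling the handoff latency (say, a deeper C-state)
pushes even the 24-CPU case over the line, which is consistent with
smaller systems hanging too.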

>
>Also I would think that the rate at which xtime_lock is being acquired
>may not be the highest one in the entire system, and hence problems
>may continue to result even if we fixed timer_interrupt().

Although the rate of acquiring xtime_lock is unlikely to be the highest
one, it's a big kernel lock acquired on every CPU. Ticket spinlock
ordering plus cpuidle latency may rendezvous more and more vCPUs in the
same contention loop, once the latency for a vCPU waiting on a specific
ticket starts to accumulate.

As Ke said, cpuidle just exaggerated a spinlock issue inherent to
virtualization. Consider another case, where the system is very busy and
the physical CPUs are overcommitted. Even once a vCPU is woken up, it
may still sit in the runqueue for dozens of milliseconds. Similarly,
vCPUs holding larger tickets have to wait a long time. Applied to
xtime_lock, you may still observe softlockup warnings then. Possibly the
real solution is to not give dom0 a large number of vCPUs.

Thanks,
Kevin

>
>>Anyway, cpuidle is just one side, we can anticipate that if 
>CPU number is large enough to lead NR_CPU * T1 > 4ms, this 
>issue will occurs again. So another way is to make dom0 
>scaling well by not using xtime_lock, although this is pretty 
>hard currently. Or another way is to limit dom0 vCPU number to 
>certain reasonable level.
>
>I would not think that dealing with the xtime_lock scalability issue in
>timer_interrupt() should be *that* difficult. In particular it 
>should be
>possible to assign an on-duty CPU (permanent or on a round-robin
>basis) that deals with updating jiffies/wallclock, and all other CPUs
>just update their local clocks. I had thought about this before, but
>never found a strong need to experiment with that.
>
>Jan
>


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: cpuidle causing Dom0 soft lockups
  2010-02-03 12:10                 ` Tian, Kevin
@ 2010-02-03 12:18                   ` Juergen Gross
  2010-02-04  1:40                     ` Tian, Kevin
  2010-02-03 13:20                   ` Jan Beulich
  1 sibling, 1 reply; 61+ messages in thread
From: Juergen Gross @ 2010-02-03 12:18 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: xen-devel@lists.xensource.com, Yu, Ke, Jan Beulich, Keir Fraser

Tian, Kevin wrote:
>> From: Jan Beulich
>> Sent: February 3, 2010 18:16
>>
>>>>> "Yu, Ke" <ke.yu@intel.com> 02.02.10 18:07 >>>
>>>> Just fyi, we now also have seen an issue on a 24-CPU system that went
>>>> away with cpuidle=0 (and static analysis of the hang hinted in that
>>>> direction). All I can judge so far is that this likely has 
>> something to do
>>>> with our kernel's intensive use of the poll hypercall (i.e. 
>> we see vCPU-s
>>>> not waking up from the call despite there being pending unmasked or
>>>> polled for events).
>>> We just identified the cause of this issue, and is trying to 
>> find appropriate way to fix it.
>>
>> Hmm, while I agree that the scenario you describe can be a problem, I
>> don't think it can explain the behavior on the 24-CPU system pointed
>> out above, nor the one Juergen Gross pointed out yesterday.
> 
> Is 24-CPU system observed with same likelihood as 64-CPU system to
> hang at boot time, or less frequent? Ke just did some theoretical analysis
> by assuming some values. There could be other factors added to latency
> and each system may have different characteristics too. We can't
> draw conclusion whether smaller system will face same issue, by simply
> changing CPU number in Ke's formula. :-) Possibly you can provide cpuidle
> information on your 24-core system for further comparison.

My 4-core system hangs _always_. For minutes. If I press any key on the
console it will resume booting with soft lockup messages (all cpus were
in xen_safe_halt).
Sometimes another hang occurs, sometimes the system will come up without
further hangs.

Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions               e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: cpuidle causing Dom0 soft lockups
  2010-02-03 12:10                 ` Tian, Kevin
  2010-02-03 12:18                   ` Juergen Gross
@ 2010-02-03 13:20                   ` Jan Beulich
  2010-02-04  1:48                     ` Tian, Kevin
  1 sibling, 1 reply; 61+ messages in thread
From: Jan Beulich @ 2010-02-03 13:20 UTC (permalink / raw)
  To: Ke Yu, Kevin Tian; +Cc: xen-devel@lists.xensource.com, KeirFraser

>>> "Tian, Kevin" <kevin.tian@intel.com> 03.02.10 13:10 >>>
> Possibly the real solution is to not have dom0 with large virtual vCPUs. 

But you realize that this is a Dom0-only issue only as long as DomU-s
with more vCPU-s cannot be created? I.e. it'll become an issue affecting
any kind of guest as soon as that limitation gets out of the way. So I
don't view this as a solution - it's at best a workaround until a solution
can be found.

Jan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: cpuidle causing Dom0 soft lockups
  2010-02-03 10:15               ` Jan Beulich
  2010-02-03 12:10                 ` Tian, Kevin
@ 2010-02-03 14:46                 ` Yu, Ke
  1 sibling, 0 replies; 61+ messages in thread
From: Yu, Ke @ 2010-02-03 14:46 UTC (permalink / raw)
  To: Jan Beulich; +Cc: xen-devel@lists.xensource.com, Keir Fraser


Kevin has explained the details on your comments, so I'll just pick a few points for further explanation.

>
>I would not think that dealing with the xtime_lock scalability issue in
>timer_interrupt() should be *that* difficult. In particular it should be
>possible to assign an on-duty CPU (permanent or on a round-robin
>basis) that deals with updating jiffies/wallclock, and all other CPUs
>just update their local clocks. I had thought about this before, but
>never found a strong need to experiment with that.
>
>Jan

This is good. Eliminating a global lock is always good practice for scalability, especially as there will be more and more CPUs in the future. I would expect this to be the best solution to the softlockup issue.

And if the global xtime_lock can be eliminated, the cpuidle patch may not be needed anymore.

>>Could you please try the attached patch. this patch try to avoid entering
>deep C state when there is vCPU local irq disabled, and polling event channel.
>When tested in my 64 CPU box, this issue is gone with this patch.
>
>We could try it, but I'm not convinced of the approach. Why is the
>urgent determination dependent upon event delivery being disabled
>on the respective vCPU? If at all, it should imo be polling *or* event
>delivery disabled, not *and*.

The rationale of this patch is: a vCPU disabling its local IRQs usually means it has an urgent task to finish ASAP and doesn't want to be interrupted.

As a first-step patch, I am being a bit conservative by combining them with *and*. Once it is verified to work, I can extend this hint to *or*, as long as the *or* does not include unwanted cases that hurt power saving significantly.

>
>Also, iterating over all vCPU-s in that function doesn't seem very
>scalable. It would seem more reasonable for the scheduler to track
>how many "urgent" vCPU-s a pCPU currently has.

Yes, we can do this optimization. The current patch is just for quick verification purposes.

Regards
Ke

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: cpuidle causing Dom0 soft lockups
  2010-02-03 12:18                   ` Juergen Gross
@ 2010-02-04  1:40                     ` Tian, Kevin
  2010-02-04  6:31                       ` Juergen Gross
  0 siblings, 1 reply; 61+ messages in thread
From: Tian, Kevin @ 2010-02-04  1:40 UTC (permalink / raw)
  To: Juergen Gross
  Cc: xen-devel@lists.xensource.com, Yu, Ke, Jan Beulich, Keir Fraser

[-- Attachment #1: Type: text/plain, Size: 1997 bytes --]

>From: Juergen Gross [mailto:juergen.gross@ts.fujitsu.com] 
>Sent: February 3, 2010 20:19
>
>Tian, Kevin wrote:
>>> From: Jan Beulich
>>> Sent: February 3, 2010 18:16
>>>
>>>>>> "Yu, Ke" <ke.yu@intel.com> 02.02.10 18:07 >>>
>>>>> Just fyi, we now also have seen an issue on a 24-CPU 
>system that went
>>>>> away with cpuidle=0 (and static analysis of the hang 
>hinted in that
>>>>> direction). All I can judge so far is that this likely has 
>>> something to do
>>>>> with our kernel's intensive use of the poll hypercall (i.e. 
>>> we see vCPU-s
>>>>> not waking up from the call despite there being pending 
>unmasked or
>>>>> polled for events).
>>>> We just identified the cause of this issue, and is trying to 
>>> find appropriate way to fix it.
>>>
>>> Hmm, while I agree that the scenario you describe can be a 
>problem, I
>>> don't think it can explain the behavior on the 24-CPU system pointed
>>> out above, nor the one Juergen Gross pointed out yesterday.
>> 
>> Is 24-CPU system observed with same likelihood as 64-CPU system to
>> hang at boot time, or less frequent? Ke just did some 
>theoretical analysis
>> by assuming some values. There could be other factors added 
>to latency
>> and each system may have different characteristics too. We can't
>> draw conclusion whether smaller system will face same issue, 
>by simply
>> changing CPU number in Ke's formula. :-) Possibly you can 
>provide cpuidle
>> information on your 24-core system for further comparison.
>
>My 4-core system hangs _always_. For minutes. If I press any key on the
>console it will resume booting with soft lockup messages (all cpus were
>in xen_safe_halt).
>Sometimes another hang occurs, sometimes the system will come 
>up without
>further hangs.
>
>Juergen
>

Interesting. Did you also observe that the hang disappeared when
disabling cpuidle? Your case really looks like some missed-event
scenario, in which the key press just kicks the CPU alive...

Thanks,
Kevin


^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: cpuidle causing Dom0 soft lockups
  2010-02-03 13:20                   ` Jan Beulich
@ 2010-02-04  1:48                     ` Tian, Kevin
  0 siblings, 0 replies; 61+ messages in thread
From: Tian, Kevin @ 2010-02-04  1:48 UTC (permalink / raw)
  To: Jan Beulich, Yu, Ke; +Cc: xen-devel@lists.xensource.com, KeirFraser

[-- Attachment #1: Type: text/plain, Size: 1199 bytes --]

>From: Jan Beulich [mailto:JBeulich@novell.com] 
>Sent: February 3, 2010 21:21
>
>>>> "Tian, Kevin" <kevin.tian@intel.com> 03.02.10 13:10 >>>
>> Possibly the real solution is to not have dom0 with large 
>virtual vCPUs. 
>
>But you realize that this is a Dom0-only issue only as long as DomU-s
>with more vCPU-s cannot be created? I.e. it'll become an issue 
>affecting
>any kind of guest as soon as that limitation gets out of the way. So I
>don't view this as a solution - it's at best a workaround 
>until a solution
>can be found.

Sure, but then it's all about how the spinlock itself should be
implemented in a virtualization environment, or how xtime_lock should be
used in this specific case, to allow scaling up to large vCPU numbers.

There's always a limit on how far it can scale without triggering a
softlockup warning. We agree that cpuidle made it worse inadvertently,
and we came up with a patch to recover it back to the original limit.
And we just realized that the current ticket PV spinlock may still hit
that limit once the system becomes hot, when the holder of xtime_lock
may stay in the runqueue for a relatively long time, even without cpuidle.


Thanks,
Kevin


^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: cpuidle causing Dom0 soft lockups
  2010-02-04  1:40                     ` Tian, Kevin
@ 2010-02-04  6:31                       ` Juergen Gross
  0 siblings, 0 replies; 61+ messages in thread
From: Juergen Gross @ 2010-02-04  6:31 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: xen-devel@lists.xensource.com, Jan Beulich, Yu, Ke, Keir Fraser

Tian, Kevin wrote:
>> From: Juergen Gross [mailto:juergen.gross@ts.fujitsu.com] 
>> Sent: February 3, 2010 20:19
>>
>> Tian, Kevin wrote:
>>>> From: Jan Beulich
>>>> Sent: February 3, 2010 18:16
>>>>
>>>>>>> "Yu, Ke" <ke.yu@intel.com> 02.02.10 18:07 >>>
>>>>>> Just fyi, we now also have seen an issue on a 24-CPU 
>> system that went
>>>>>> away with cpuidle=0 (and static analysis of the hang 
>> hinted in that
>>>>>> direction). All I can judge so far is that this likely has 
>>>> something to do
>>>>>> with our kernel's intensive use of the poll hypercall (i.e. 
>>>> we see vCPU-s
>>>>>> not waking up from the call despite there being pending 
>> unmasked or
>>>>>> polled for events).
>>>>> We just identified the cause of this issue, and is trying to 
>>>> find appropriate way to fix it.
>>>>
>>>> Hmm, while I agree that the scenario you describe can be a 
>> problem, I
>>>> don't think it can explain the behavior on the 24-CPU system pointed
>>>> out above, nor the one Juergen Gross pointed out yesterday.
>>> Is 24-CPU system observed with same likelihood as 64-CPU system to
>>> hang at boot time, or less frequent? Ke just did some 
>> theoretical analysis
>>> by assuming some values. There could be other factors added 
>> to latency
>>> and each system may have different characteristics too. We can't
>>> draw conclusion whether smaller system will face same issue, 
>> by simply
>>> changing CPU number in Ke's formula. :-) Possibly you can 
>> provide cpuidle
>>> information on your 24-core system for further comparison.
>> My 4-core system hangs _always_. For minutes. If I press any key on the
>> console it will resume booting with soft lockup messages (all cpus were
>> in xen_safe_halt).
>> Sometimes another hang occurs, sometimes the system will come 
>> up without
>> further hangs.
>>
>> Juergen
>>
> 
> interesting. Then did you also observe hang disappeared by disabling
> cpuidle? Your case really looks like some missed event scenario, in
> which key press just kicks cpu alive...

Yes, cpuidle=0 made the problem disappear.

Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions               e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: cpuidle causing Dom0 soft lockups
  2010-02-03  7:32             ` Yu, Ke
  2010-02-03 10:23               ` Jan Beulich
@ 2010-02-05  8:48               ` Jan Beulich
  2010-02-05  9:00                 ` Tian, Kevin
                                   ` (2 more replies)
  1 sibling, 3 replies; 61+ messages in thread
From: Jan Beulich @ 2010-02-05  8:48 UTC (permalink / raw)
  To: Ke Yu; +Cc: Kevin Tian, xen-devel@lists.xensource.com, Keir Fraser

[-- Attachment #1: Type: text/plain, Size: 899 bytes --]

Yes, this patch works for us too. So a non-hacky version of it would be
appreciated.

I also meanwhile tried out the idea to reduce the contention on
xtime_lock (attached for reference). Things appear to work fine, but
there is an obvious problem (with - thus far to me - no obvious
explanation) with it: The number of timer interrupts on CPUs not on
duty to run do_timer() and the like is increasing significantly, with spikes
of over 100,000 per second. I'm investigating this, but of course any
idea anyone of you might have what could be causing this would be
very welcome.

Jan

>>> "Yu, Ke" <ke.yu@intel.com> 03.02.10 08:32 >>>
Hi Jan,

Could you please try the attached patch. This patch tries to avoid entering deep C-states when a vCPU has local IRQs disabled and is polling an event channel. When tested on my 64-CPU box, this issue is gone with the patch.

Best Regards
Ke


[-- Attachment #2: xen-x86-xtime-lock.patch --]
[-- Type: text/plain, Size: 5222 bytes --]

From: jbeulich@novell.com
Subject: reduce contention on xtime_lock
Patch-mainline: n/a
References: bnc#569014, bnc#571041, bnc#571769, bnc#572146

Especially on large systems the number of CPUs queueing up on
xtime_lock may become significant, and (as reported in the bugs above)
may even prevent proper operation of the system when Xen is using
deep C-states. There is, however, no need for all CPUs in the system
to update global time - it is sufficient to have a single (at any given
point in time) CPU being responsible for this.

--- sle11sp1-2010-02-02.orig/arch/x86/kernel/time-xen.c	2010-02-04 15:51:34.000000000 +0100
+++ sle11sp1-2010-02-02/arch/x86/kernel/time-xen.c	2010-02-05 09:04:48.000000000 +0100
@@ -13,6 +13,7 @@
 #include <linux/interrupt.h>
 #include <linux/time.h>
 #include <linux/sysctl.h>
+#include <linux/cpu.h>
 #include <linux/percpu.h>
 #include <linux/kernel_stat.h>
 #include <linux/posix-timers.h>
@@ -122,6 +123,21 @@ static int __init __permitted_clock_jitt
 }
 __setup("permitted_clock_jitter=", __permitted_clock_jitter);
 
+static unsigned int duty_cpu = NR_CPUS;
+
+static int __cpuinit duty_cpu_callback(struct notifier_block *nb,
+				       unsigned long action, void *hcpu)
+{
+	if ((action & ~CPU_TASKS_FROZEN) == CPU_DOWN_PREPARE
+	    && duty_cpu == (unsigned long)hcpu)
+		duty_cpu = NR_CPUS;
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block __cpuinitdata duty_cpu_notifier = {
+	.notifier_call = duty_cpu_callback
+};
+
 /*
  * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
  * yielding a 64-bit result.
@@ -415,6 +431,7 @@ static irqreturn_t timer_interrupt(int i
 	s64 delta, delta_cpu, stolen, blocked;
 	unsigned int i, cpu = smp_processor_id();
 	struct shadow_time_info *shadow = &per_cpu(shadow_time, cpu);
+	bool duty = false;
 	struct vcpu_runstate_info runstate;
 
 	/* Keep nmi watchdog up to date */
@@ -427,7 +444,12 @@ static irqreturn_t timer_interrupt(int i
 	 * the irq version of write_lock because as just said we have irq
 	 * locally disabled. -arca
 	 */
-	write_seqlock(&xtime_lock);
+	if (duty_cpu == cpu
+	    || (duty_cpu == NR_CPUS
+		&& cmpxchg(&duty_cpu, NR_CPUS, cpu) == NR_CPUS)) {
+		duty = true;
+		write_seqlock(&xtime_lock);
+	}
 
 	do {
 		get_time_values_from_xen(cpu);
@@ -441,8 +463,7 @@ static irqreturn_t timer_interrupt(int i
 		get_runstate_snapshot(&runstate);
 	} while (!time_values_up_to_date());
 
-	if ((unlikely(delta < -(s64)permitted_clock_jitter) ||
-	     unlikely(delta_cpu < -(s64)permitted_clock_jitter))
+	if (duty && unlikely(delta < -(s64)permitted_clock_jitter)
 	    && printk_ratelimit()) {
 		printk("Timer ISR/%u: Time went backwards: "
 		       "delta=%lld delta_cpu=%lld shadow=%lld "
@@ -454,27 +475,45 @@ static irqreturn_t timer_interrupt(int i
 		for (i = 0; i < num_online_cpus(); i++)
 			printk(" %d: %lld\n", i,
 			       per_cpu(processed_system_time, i));
+	} else if (unlikely(delta_cpu < -(s64)permitted_clock_jitter)
+		   && printk_ratelimit()) {
+		if (!duty)
+			write_seqlock(&xtime_lock);
+		printk("Timer ISR/%u: Time went backwards: "
+		       "delta=-%Lx shadow=%Lx off=%Lx processed=%Lx/%Lx\n",
+		       cpu, -delta_cpu, shadow->system_timestamp,
+		       get_nsec_offset(shadow), processed_system_time,
+		       per_cpu(processed_system_time, cpu));
+		for_each_cpu_and(i, cpu_online_mask, cpumask_of(cpu))
+			printk(" %u: %Lx\n", i,
+			       per_cpu(processed_system_time, i));
+		if (!duty)
+			write_sequnlock(&xtime_lock);
 	}
 
-	/* System-wide jiffy work. */
-	if (delta >= NS_PER_TICK) {
-		do_div(delta, NS_PER_TICK);
-		processed_system_time += delta * NS_PER_TICK;
-		while (delta > HZ) {
-			clobber_induction_variable(delta);
-			do_timer(HZ);
-			delta -= HZ;
+	if (duty) {
+kstat_incr_irqs_this_cpu(0, irq_to_desc(0));//temp
+		/* System-wide jiffy work. */
+		if (delta >= NS_PER_TICK) {
+			do_div(delta, NS_PER_TICK);
+			processed_system_time += delta * NS_PER_TICK;
+			while (delta > HZ) {
+				clobber_induction_variable(delta);
+				do_timer(HZ);
+				delta -= HZ;
+			}
+			do_timer(delta);
 		}
-		do_timer(delta);
-	}
 
-	if (shadow_tv_version != HYPERVISOR_shared_info->wc_version) {
-		update_wallclock();
-		if (keventd_up())
-			schedule_work(&clock_was_set_work);
-	}
+		if (shadow_tv_version != HYPERVISOR_shared_info->wc_version) {
+			update_wallclock();
+			if (keventd_up())
+				schedule_work(&clock_was_set_work);
+		}
 
-	write_sequnlock(&xtime_lock);
+		write_sequnlock(&xtime_lock);
+	}
+else percpu_inc(mce_exception_count);//temp
 
 	/*
 	 * Account stolen ticks.
@@ -692,6 +731,7 @@ static void __init setup_cpu0_timer_irq(
 {
 	timer_irq = bind_virq_to_irqaction(VIRQ_TIMER, 0, &timer_action);
 	BUG_ON(timer_irq < 0);
+	register_hotcpu_notifier(&duty_cpu_notifier);
 }
 
 void __init time_init(void)
@@ -765,6 +805,8 @@ static void stop_hz_timer(void)
 	unsigned long j;
 	int rc;
 
+	if (duty_cpu == cpu)
+		duty_cpu = NR_CPUS;
 	cpumask_set_cpu(cpu, nohz_cpu_mask);
 
 	/* See matching smp_mb in rcu_start_batch in rcupdate.c.  These mbs  */


^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: cpuidle causing Dom0 soft lockups
  2010-02-05  8:48               ` Jan Beulich
@ 2010-02-05  9:00                 ` Tian, Kevin
  2010-02-05  9:14                   ` Jan Beulich
  2010-02-05  9:16                 ` Yu, Ke
  2010-02-07 15:36                 ` Yu, Ke
  2 siblings, 1 reply; 61+ messages in thread
From: Tian, Kevin @ 2010-02-05  9:00 UTC (permalink / raw)
  To: Jan Beulich, Yu, Ke; +Cc: xen-devel@lists.xensource.com, Keir Fraser

[-- Attachment #1: Type: text/plain, Size: 910 bytes --]

>From: Jan Beulich [mailto:JBeulich@novell.com] 
>Sent: February 5, 2010 16:49
>
>Yes, this patch works for us too. So a non-hacky version of it would be
>appreciated.
>
>I also meanwhile tried out the idea to reduce the contention on
>xtime_lock (attached for reference). Things appear to work fine, but
>there is an obvious problem (with - thus far to me - no obvious
>explanation) with it: The number of timer interrupts on CPUs not on
>duty to run do_timer() and alike is increasing significantly, 
>with spikes
>of over 100,000 per second. I'm investigating this, but of course any
>idea anyone of you might have what could be causing this would be
>very welcome.
>

Forgive my poor English. From your patch, only the on-duty CPU will
invoke do_timer to update the global timestamp. Why, in your test, do
the CPUs 'not on duty' invoke do_timer so frequently? I may be reading
it wrong. :-(

Thanks,
Kevin


^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: cpuidle causing Dom0 soft lockups
  2010-02-05  9:00                 ` Tian, Kevin
@ 2010-02-05  9:14                   ` Jan Beulich
  2010-02-05  9:52                     ` Tian, Kevin
  0 siblings, 1 reply; 61+ messages in thread
From: Jan Beulich @ 2010-02-05  9:14 UTC (permalink / raw)
  To: Ke Yu, Kevin Tian; +Cc: xen-devel@lists.xensource.com, Keir Fraser

>>> "Tian, Kevin" <kevin.tian@intel.com> 05.02.10 10:00 >>>
>>From: Jan Beulich [mailto:JBeulich@novell.com] 
>>Sent: February 5, 2010 16:49
>>
>>Yes, this patch works for us too. So a non-hacky version of it would be
>>appreciated.
>>
>>I also meanwhile tried out the idea to reduce the contention on
>>xtime_lock (attached for reference). Things appear to work fine, but
>>there is an obvious problem (with - thus far to me - no obvious
>>explanation) with it: The number of timer interrupts on CPUs not on
>>duty to run do_timer() and alike is increasing significantly, 
>>with spikes
>>of over 100,000 per second. I'm investigating this, but of course any
>>idea anyone of you might have what could be causing this would be
>>very welcome.
>>
>
>forgive my poor english. From your patch, only cpu on duty will invoke
>do_timer to update global timestamp. Why in your test it's CPUs 'not
>on duty' to have high frequent do_timer? I may read it wrong. :-(

If you look at the patch, I added extra statistics for those timer
interrupts that occur when a CPU is "on duty" (recorded as IRQ0,
which is otherwise unused) and when not "on duty" (recorded as MCEs,
since those hopefully(!!!) won't occur either, and in no case at a high
rate).

From that I know that the rate of interrupts (not the rate of do_timer()
invocations) is much higher on not-on-duty CPUs, but is roughly as
without the patch for the on-duty one.

Jan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: cpuidle causing Dom0 soft lockups
  2010-02-05  8:48               ` Jan Beulich
  2010-02-05  9:00                 ` Tian, Kevin
@ 2010-02-05  9:16                 ` Yu, Ke
  2010-02-07 15:36                 ` Yu, Ke
  2 siblings, 0 replies; 61+ messages in thread
From: Yu, Ke @ 2010-02-05  9:16 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Tian, Kevin, xen-devel@lists.xensource.com, Keir Fraser

>-----Original Message-----
>From: Jan Beulich [mailto:JBeulich@novell.com]
>Sent: Friday, February 05, 2010 4:49 PM
>To: Yu, Ke
>Cc: Keir Fraser; Tian, Kevin; xen-devel@lists.xensource.com
>Subject: RE: [Xen-devel] cpuidle causing Dom0 soft lockups
>
>Yes, this patch works for us too. So a non-hacky version of it would be
>appreciated.

That is good. I will polish the patch. 

Regards
Ke


* RE: cpuidle causing Dom0 soft lockups
  2010-02-05  9:14                   ` Jan Beulich
@ 2010-02-05  9:52                     ` Tian, Kevin
  2010-02-05 10:37                       ` Jan Beulich
  0 siblings, 1 reply; 61+ messages in thread
From: Tian, Kevin @ 2010-02-05  9:52 UTC (permalink / raw)
  To: Jan Beulich, Yu, Ke; +Cc: xen-devel@lists.xensource.com, Keir Fraser

[-- Attachment #1: Type: text/plain, Size: 2334 bytes --]

>From: Jan Beulich [mailto:JBeulich@novell.com] 
>Sent: February 5, 2010 17:14
>
>>>> "Tian, Kevin" <kevin.tian@intel.com> 05.02.10 10:00 >>>
>>>From: Jan Beulich [mailto:JBeulich@novell.com] 
>>>Sent: February 5, 2010 16:49
>>>
>>>Yes, this patch works for us too. So a non-hacky version of it
>>>would be appreciated.
>>>
>>>I also meanwhile tried out the idea to reduce the contention on
>>>xtime_lock (attached for reference). Things appear to work fine, but
>>>there is an obvious problem (with - thus far to me - no obvious
>>>explanation) with it: the number of timer interrupts on CPUs not on
>>>duty to run do_timer() and the like is increasing significantly, with
>>>spikes of over 100,000 per second. I'm investigating this, but of
>>>course any idea any of you might have as to what could be causing
>>>this would be very welcome.
>>>
>>
>>Forgive my poor English. From your patch, only the CPU on duty will
>>invoke do_timer() to update the global timestamp. Why is it the CPUs
>>'not on duty' that show such a high do_timer() frequency in your test?
>>I may be reading it wrong. :-(
>
>If you look at the patch, I added extra statistics for those timer
>interrupts that occur when a CPU is "on duty" (recorded as IRQ0,
>which is otherwise unused) and when not "on duty" (recorded as MCEs,
>since those hopefully(!!!) won't occur either, and in no case at a high
>rate).
>
>From that I know that the rate of interrupts (not the rate of
>do_timer() invocations) is much higher on the not-on-duty CPUs, but
>roughly the same as without the patch on the on-duty one.
>

Is the 100,000 per second a sum over all non-duty CPUs, or observed on
just one? And what is the level without the patch?

Your patch is quite straightforward, and at a glance it shouldn't have
such a weird issue... Timer interrupts come from two paths. One is the
periodic timer while a vCPU is running: with HZ=250 you won't get more
than 250 interrupts per second from this path, as Xen accurately drives
the injections at a 4 ms pace. The other is the singleshot timer armed
when a vCPU enters idle: the periodic timer is stopped in that case,
and a singleshot timer is registered with the nearest expiration in the
local timer queue. That portion is unpredictable, so possibly your
patch increases the activity in that part. But I haven't figured out
how that could be. :-(
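As a sanity check on the bound just stated, the HZ arithmetic can be written out. The macro names below mirror the kernel's (HZ, NS_PER_TICK) but are defined locally, so this is only an illustrative sketch, not kernel code:

```c
#include <assert.h>
#include <stdint.h>

#define HZ 250U                        /* dom0 kernel tick rate */
#define NSEC_PER_SEC 1000000000ULL
#define NS_PER_TICK (NSEC_PER_SEC / HZ)

/* Period of the periodic vCPU timer: 4 ms at HZ=250. */
static uint64_t tick_period_ns(void) { return NS_PER_TICK; }

/* Upper bound on interrupts/sec from the periodic path alone:
 * Xen injects exactly one per tick while the vCPU runs. */
static unsigned int max_periodic_irqs_per_sec(void) { return HZ; }
```

So a rate anywhere near 100,000 per second cannot come from the periodic path; it has to come from the singleshot path.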

Thanks,
Kevin 

[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel


* RE: cpuidle causing Dom0 soft lockups
  2010-02-05  9:52                     ` Tian, Kevin
@ 2010-02-05 10:37                       ` Jan Beulich
  2010-02-05 11:16                         ` Tian, Kevin
  0 siblings, 1 reply; 61+ messages in thread
From: Jan Beulich @ 2010-02-05 10:37 UTC (permalink / raw)
  To: Ke Yu, Kevin Tian; +Cc: xen-devel@lists.xensource.com, Keir Fraser

>>> "Tian, Kevin" <kevin.tian@intel.com> 05.02.10 10:52 >>>
>Is the 100,000 per second a sum over all non-duty CPUs, or observed on
>just one? And what is the level without the patch?

That's 100,000 per CPU per second. Normally on an idle system the
number is about 25 per CPU per second.

Jan


* RE: cpuidle causing Dom0 soft lockups
  2010-02-05 10:37                       ` Jan Beulich
@ 2010-02-05 11:16                         ` Tian, Kevin
  2010-02-05 14:59                           ` Jan Beulich
  0 siblings, 1 reply; 61+ messages in thread
From: Tian, Kevin @ 2010-02-05 11:16 UTC (permalink / raw)
  To: Jan Beulich, Yu, Ke; +Cc: Keir, xen-devel@lists.xensource.com, Fraser

[-- Attachment #1: Type: text/plain, Size: 1463 bytes --]

>From: Jan Beulich
>Sent: February 5, 2010 18:37
>
>>>> "Tian, Kevin" <kevin.tian@intel.com> 05.02.10 10:52 >>>
>>Is the 100,000 per second a sum over all non-duty CPUs, or observed on
>>just one? And what is the level without the patch?
>
>That's 100,000 per CPU per second. Normally on an idle system the
>number is about 25 per CPU per second.
>

I think I know the reason. With your patch, only the one duty CPU
updates the global jiffies, but that duty CPU may be preempted for
several ticks. Meanwhile, all the other non-duty CPUs calculate their
singleshot timer expirations based on jiffies (see stop_hz_timer);
under some conditions the result is jiffies+1. While the duty CPU is
preempted, the jiffies value can fall several ticks behind actual
time. A non-duty CPU may then request an already-expired time from
Xen, which immediately pends a timer interrupt, so vcpu_block()
returns immediately. As the non-duty CPU is in the idle loop, this
repeats, driving your interrupt statistics very high until the duty
CPU gets re-scheduled and updates jiffies.

Without your patch, jiffies is updated in a timely fashion as long as
any CPU in dom0 is running.

One possible option is to combine jiffies, the system timestamp, and
the per-CPU timestamp in stop_hz_timer to generate an up-to-date local
value. Alternatively, a per-CPU jiffies could be maintained in the
per-CPU timer interrupt and used only locally, as in stop_hz_timer.
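The stale-jiffies feedback loop described above can be modeled in a few lines. Here jiffies_to_st() and NS_PER_TICK are simplified stand-ins for the real kernel/hypervisor interfaces (assuming HZ=250), not the actual code:

```c
#include <assert.h>
#include <stdint.h>

#define NS_PER_TICK 4000000ULL  /* 4 ms at HZ=250 */

/* System time (ns) corresponding to a jiffy count -- a simplified
 * stand-in for the kernel's jiffies_to_st(). */
static uint64_t jiffies_to_st(uint64_t jiffies)
{
    return jiffies * NS_PER_TICK;
}

/* Expiry a non-duty CPU requests when going idle: one tick after the
 * last jiffy value it has *seen*, which may be stale. */
static uint64_t requested_expiry(uint64_t seen_jiffies)
{
    return jiffies_to_st(seen_jiffies + 1);
}

/* Xen fires the singleshot timer immediately if the requested expiry
 * is already in the past -- the source of the interrupt storm. */
static int fires_immediately(uint64_t seen_jiffies, uint64_t now_ns)
{
    return requested_expiry(seen_jiffies) <= now_ns;
}
```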

Thanks
Kevin



* RE: cpuidle causing Dom0 soft lockups
  2010-02-05 11:16                         ` Tian, Kevin
@ 2010-02-05 14:59                           ` Jan Beulich
  2010-02-05 15:51                             ` Jan Beulich
  2010-02-06  1:50                             ` Tian, Kevin
  0 siblings, 2 replies; 61+ messages in thread
From: Jan Beulich @ 2010-02-05 14:59 UTC (permalink / raw)
  To: Ke Yu, Kevin Tian; +Cc: xen-devel@lists.xensource.com, KeirFraser

>>> "Tian, Kevin" <kevin.tian@intel.com> 05.02.10 12:16 >>>
>One possible option is to combine jiffies, the system timestamp, and
>the per-CPU timestamp in stop_hz_timer to generate an up-to-date local
>value. Alternatively, a per-CPU jiffies could be maintained in the
>per-CPU timer interrupt and used only locally, as in stop_hz_timer.

That was my next thought too - but it doesn't work (unless I made a
blatant mistake). Not only does it not get the interrupt rate down,
it even grows it on the duty CPU, and it creates visible brief hangs
as well as rushes of time.

The next thing I'll try is to detect when the duty CPU is de-scheduled,
and move the duty on to one that is scheduled (i.e. one that is
currently executing timer_interrupt()).
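Such a duty handoff could be sketched as below. This is a hypothetical single-threaded model built on GCC's __sync_val_compare_and_swap, not the actual patch (the version posted later in this thread uses the kernel's cmpxchg() on a volatile duty_cpu and reads the holder's runstate shared with Xen):

```c
#include <assert.h>

#define NR_CPUS 64U
static unsigned int duty_cpu = NR_CPUS;   /* NR_CPUS means "no duty CPU" */

/* Try to take over the jiffies-update duty.  `holder_running` stands
 * in for a runstate check of the current holder (a shared-info read in
 * the real patch).  Returns 1 if `cpu` holds the duty afterwards. */
static int try_claim_duty(unsigned int cpu, int holder_running)
{
    for (;;) {
        unsigned int cur = duty_cpu;
        if (cur == cpu)
            return 1;                     /* already on duty */
        if (cur != NR_CPUS && holder_running)
            return 0;                     /* holder is scheduled: leave it */
        /* slot free or holder de-scheduled: race to take it over */
        if (__sync_val_compare_and_swap(&duty_cpu, cur, cpu) == cur)
            return 1;
        /* lost the race: re-read duty_cpu and retry */
    }
}

/* Relinquish the duty, e.g. on entry to stop_hz_timer(). */
static void drop_duty(unsigned int cpu)
{
    (void)__sync_val_compare_and_swap(&duty_cpu, cpu, NR_CPUS);
}
```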

Jan


* RE: cpuidle causing Dom0 soft lockups
  2010-02-05 14:59                           ` Jan Beulich
@ 2010-02-05 15:51                             ` Jan Beulich
  2010-02-06  1:52                               ` Tian, Kevin
  2010-02-06  1:50                             ` Tian, Kevin
  1 sibling, 1 reply; 61+ messages in thread
From: Jan Beulich @ 2010-02-05 15:51 UTC (permalink / raw)
  To: Ke Yu, Kevin Tian, Jan Beulich; +Cc: xen-devel@lists.xensource.com, KeirFraser

>>> "Jan Beulich" <JBeulich@novell.com> 05.02.10 15:59 >>>
>The next thing I'll try is to detect when the duty CPU is de-scheduled,
>and move the duty on to one that is scheduled (i.e. one that is
>currently executing timer_interrupt()).

This improves the situation (the highest spike I saw so far was 2,000
interrupts per CPU per second), but doesn't get it back the way it
ought to be (apart from the spikes, as with the original version of the
patch, interrupt activity is also generally too high, very erratic, and
even during the more quiet periods doesn't go down to the original
level).

Jan


* RE: cpuidle causing Dom0 soft lockups
  2010-02-05 14:59                           ` Jan Beulich
  2010-02-05 15:51                             ` Jan Beulich
@ 2010-02-06  1:50                             ` Tian, Kevin
  2010-02-08  8:36                               ` Jan Beulich
  1 sibling, 1 reply; 61+ messages in thread
From: Tian, Kevin @ 2010-02-06  1:50 UTC (permalink / raw)
  To: Jan Beulich, Yu, Ke; +Cc: xen-devel@lists.xensource.com, KeirFraser

[-- Attachment #1: Type: text/plain, Size: 1505 bytes --]

>From: Jan Beulich [mailto:JBeulich@novell.com] 
>Sent: February 5, 2010 22:59
>
>>>> "Tian, Kevin" <kevin.tian@intel.com> 05.02.10 12:16 >>>
>>One possible option is to combine jiffies, the system timestamp, and
>>the per-CPU timestamp in stop_hz_timer to generate an up-to-date local
>>value. Alternatively, a per-CPU jiffies could be maintained in the
>>per-CPU timer interrupt and used only locally, as in stop_hz_timer.
>
>That was my next thought too - but it doesn't work (unless I made a
>blatant mistake). Not only does it not get the interrupt rate down,
>it even grows it on the duty CPU, and it creates visible brief hangs
>as well a rushes of time.

This really twists my head. Ideally, with the above option the
singleshot timer is postponed, so there's really no reason for the
interrupt rate to go up instead... Possibly you can add some tracing
on the Xen side to check the singleshot timer registration/expiration
pattern. From my perspective the virtual-timer pending path is the
only suspicious one, so it is worth some checking to understand what's
going wrong from another angle.

>
>The next thing I'll try is to detect when the duty CPU is de-scheduled,
>and move the duty on to one that is scheduled (i.e. one that is
>currently executing timer_interrupt()).
>

It's still possible that you move the duty to an active CPU while that
CPU is in stop_hz_timer() with interrupts disabled. In that case,
again no one is able to update jiffies, and stale jiffies can again be
observed...

Thanks,
Kevin



* RE: cpuidle causing Dom0 soft lockups
  2010-02-05 15:51                             ` Jan Beulich
@ 2010-02-06  1:52                               ` Tian, Kevin
  2010-02-08  8:45                                 ` Jan Beulich
  0 siblings, 1 reply; 61+ messages in thread
From: Tian, Kevin @ 2010-02-06  1:52 UTC (permalink / raw)
  To: Jan Beulich, Yu, Ke; +Cc: xen-devel@lists.xensource.com, KeirFraser

[-- Attachment #1: Type: text/plain, Size: 847 bytes --]

>From: Jan Beulich [mailto:JBeulich@novell.com] 
>Sent: February 5, 2010 23:52
>
>>>> "Jan Beulich" <JBeulich@novell.com> 05.02.10 15:59 >>>
>>The next thing I'll try is to detect when the duty CPU is de-scheduled,
>>and move the duty on to one that is scheduled (i.e. one that is
>>currently executing timer_interrupt()).
>
>This improves the situation (the highest spike I saw so far was 2,000
>interrupts per CPU per second), but doesn't get it back the way it
>ought to be (apart from the spikes, as with the original version of the
>patch, interrupt activity is also generally too high, very erratic, and
>even during the more quiet periods doesn't go down to the original
>level).
>

Could you send out your new patch? At the same time, collecting
singleshot timer statistics on the Xen side would be helpful, as I said
earlier. :-)

Thanks,
Kevin



* RE: cpuidle causing Dom0 soft lockups
  2010-02-05  8:48               ` Jan Beulich
  2010-02-05  9:00                 ` Tian, Kevin
  2010-02-05  9:16                 ` Yu, Ke
@ 2010-02-07 15:36                 ` Yu, Ke
  2010-02-08  9:08                   ` Jan Beulich
  2 siblings, 1 reply; 61+ messages in thread
From: Yu, Ke @ 2010-02-07 15:36 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Tian, Kevin, xen-devel@lists.xensource.com, Keir Fraser

[-- Attachment #1: Type: text/plain, Size: 1544 bytes --]

Jan,

Attached is the updated patch; it has two changes:
- change the logic from local IRQ disabled *and* polling an event channel to local IRQ disabled *or* polling an event channel
- use a per-CPU vCPU list to iterate over the vCPUs, which is more scalable. The original scheduler does not provide such a list, so this patch implements it in the scheduler code.
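The revised *or* predicate boils down to something like the following sketch; `struct vcpu_model` and its fields are simplified stand-ins for the real Xen vcpu structure used in the attached patch:

```c
#include <assert.h>

struct vcpu_model {
    int event_delivery_enabled;  /* local event/IRQ upcalls enabled? */
    int poll_evtchn;             /* non-zero while polling an event channel */
};

/* v2 logic: a vCPU is "urgent" if event delivery is disabled OR it is
 * polling an event channel (the v1 patch required both at once). */
static int vcpu_is_urgent(const struct vcpu_model *v)
{
    return !v->event_delivery_enabled || v->poll_evtchn != 0;
}
```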

Regards
Ke

>-----Original Message-----
>From: Jan Beulich [mailto:JBeulich@novell.com]
>Sent: Friday, February 05, 2010 4:49 PM
>To: Yu, Ke
>Cc: Keir Fraser; Tian, Kevin; xen-devel@lists.xensource.com
>Subject: RE: [Xen-devel] cpuidle causing Dom0 soft lockups
>
>Yes, this patch works for us too. So a non-hacky version of it would be
>appreciated.
>
>I also meanwhile tried out the idea to reduce the contention on
>xtime_lock (attached for reference). Things appear to work fine, but
>there is an obvious problem (with - thus far to me - no obvious
>explanation) with it: The number of timer interrupts on CPUs not on
>duty to run do_timer() and alike is increasing significantly, with spikes
>of over 100,000 per second. I'm investigating this, but of course any
>idea anyone of you might have what could be causing this would be
>very welcome.
>
>Jan
>
>>>> "Yu, Ke" <ke.yu@intel.com> 03.02.10 08:32 >>>
>Hi Jan,
>
>Could you please try the attached patch. This patch tries to avoid
>entering deep C states when there is a vCPU with local IRQs disabled
>and polling an event channel. When tested on my 64-CPU box, the issue
>is gone with this patch.
>
>Best Regards
>Ke


[-- Attachment #2: cpuidle-hint-v3.patch --]
[-- Type: application/octet-stream, Size: 9123 bytes --]

cpuidle: do not enter a deep C state if there is an urgent VCPU

When a VCPU has local event delivery disabled or is polling on an event channel, it usually has an urgent task running, e.g. waiting on a spin lock; in this case it is better for the cpuidle driver not to enter a deep C state.

This patch fixes the issue that SLES 11 SP1 domain0 hangs on boxes with a large number of CPUs (>= 64).

Signed-off-by: Yu Ke <ke.yu@intel.com>
               Tian Kevin <kevin.tian@intel.com>

diff -r 2636e5619708 xen/arch/x86/acpi/cpu_idle.c
--- a/xen/arch/x86/acpi/cpu_idle.c	Tue Jan 26 15:54:40 2010 +0000
+++ b/xen/arch/x86/acpi/cpu_idle.c	Sun Feb 07 23:08:07 2010 +0800
@@ -41,10 +41,12 @@
 #include <xen/keyhandler.h>
 #include <xen/cpuidle.h>
 #include <xen/trace.h>
+#include <xen/sched-if.h>
 #include <asm/cache.h>
 #include <asm/io.h>
 #include <asm/hpet.h>
 #include <asm/processor.h>
+#include <asm/event.h>
 #include <xen/pmstat.h>
 #include <public/platform.h>
 #include <public/sysctl.h>
@@ -216,6 +218,42 @@ static inline void trace_exit_reason(u32
     }
 }
 
+/* vcpu is urgent if
+ *  - event delivery is disabled or
+ *  - being polling event channel
+ *
+ * if urgent vcpu exists, CPU should not enter deep C state
+ */
+static int sched_has_urgent_vcpu(void)
+{
+    struct vcpu *v;
+    int cpu;
+    struct list_head *iter;
+    unsigned long flags;
+    int rc=0;
+
+    cpu = smp_processor_id();
+    spin_lock_irqsave(
+            &per_cpu(schedule_data, cpu).schedule_lock, flags);
+
+    list_for_each(iter, &per_cpu(schedule_data,cpu).vcpu_list)
+    {
+        v = sched_vcpu_list_elem(iter);
+
+        if (unlikely(!vcpu_event_delivery_is_enabled(v) ||
+                    v->poll_evtchn != 0) )
+        {
+            rc=1;
+            break;
+        }
+    }
+
+    spin_unlock_irqrestore(
+            &per_cpu(schedule_data, cpu).schedule_lock, flags);
+
+    return rc;
+}
+
 static void acpi_processor_idle(void)
 {
     struct acpi_processor_power *power = processor_powers[smp_processor_id()];
@@ -225,6 +263,26 @@ static void acpi_processor_idle(void)
     u32 t1, t2 = 0;
     u32 exp = 0, pred = 0;
     u32 irq_traced[4] = { 0 };
+
+    if ( max_cstate > 0 && power && !sched_has_urgent_vcpu() &&
+         (next_state = cpuidle_current_governor->select(power)) > 0 )
+    {
+        cx = &power->states[next_state];
+        if ( power->flags.bm_check && acpi_idle_bm_check()
+             && cx->type == ACPI_STATE_C3 )
+            cx = power->safe_state;
+        if ( cx->idx > max_cstate )
+            cx = &power->states[max_cstate];
+        menu_get_trace_data(&exp, &pred);
+    }
+    if ( !cx )
+    {
+        if ( pm_idle_save )
+            pm_idle_save();
+        else
+            acpi_safe_halt();
+        return;
+    }
 
     cpufreq_dbs_timer_suspend();
 
@@ -241,28 +299,6 @@ static void acpi_processor_idle(void)
     if ( softirq_pending(smp_processor_id()) )
     {
         local_irq_enable();
-        sched_tick_resume();
-        cpufreq_dbs_timer_resume();
-        return;
-    }
-
-    if ( max_cstate > 0 && power && 
-         (next_state = cpuidle_current_governor->select(power)) > 0 )
-    {
-        cx = &power->states[next_state];
-        if ( power->flags.bm_check && acpi_idle_bm_check()
-             && cx->type == ACPI_STATE_C3 )
-            cx = power->safe_state;
-        if ( cx->idx > max_cstate )
-            cx = &power->states[max_cstate];
-        menu_get_trace_data(&exp, &pred);
-    }
-    if ( !cx )
-    {
-        if ( pm_idle_save )
-            pm_idle_save();
-        else
-            acpi_safe_halt();
         sched_tick_resume();
         cpufreq_dbs_timer_resume();
         return;
diff -r 2636e5619708 xen/common/sched_credit.c
--- a/xen/common/sched_credit.c	Tue Jan 26 15:54:40 2010 +0000
+++ b/xen/common/sched_credit.c	Sun Feb 07 23:08:07 2010 +0800
@@ -1060,7 +1060,9 @@ csched_runq_steal(int peer_cpu, int cpu,
                 CSCHED_VCPU_STAT_CRANK(speer, migrate_q);
                 CSCHED_STAT_CRANK(migrate_queued);
                 __runq_remove(speer);
+                sched_vcpu_list_del(vc);
                 vc->processor = cpu;
+                sched_vcpu_list_add(vc);
                 return speer;
             }
         }
diff -r 2636e5619708 xen/common/schedule.c
--- a/xen/common/schedule.c	Tue Jan 26 15:54:40 2010 +0000
+++ b/xen/common/schedule.c	Sun Feb 07 23:08:07 2010 +0800
@@ -172,6 +172,15 @@ int sched_init_vcpu(struct vcpu *v, unsi
     init_timer(&v->poll_timer, poll_timer_fn,
                v, v->processor);
 
+    /* add to sched vcpu list */
+    if ( !is_idle_domain(d))
+    {
+        INIT_LIST_HEAD(&v->sched_vcpu_list);
+        vcpu_schedule_lock_irq(v);
+        sched_vcpu_list_add(v);
+        vcpu_schedule_unlock_irq(v);
+    }
+
     /* Idle VCPUs are scheduled immediately. */
     if ( is_idle_domain(d) )
     {
@@ -190,6 +199,14 @@ void sched_destroy_vcpu(struct vcpu *v)
     kill_timer(&v->periodic_timer);
     kill_timer(&v->singleshot_timer);
     kill_timer(&v->poll_timer);
+
+    if ( !is_idle_vcpu(v) )
+    {
+        vcpu_schedule_lock_irq(v);
+        sched_vcpu_list_del(v);
+        vcpu_schedule_unlock_irq(v);
+    }
+
     SCHED_OP(destroy_vcpu, v);
 }
 
@@ -296,10 +313,15 @@ static void vcpu_migrate(struct vcpu *v)
     }
 
     /* Switch to new CPU, then unlock old CPU. */
+    sched_vcpu_list_del(v);
     old_cpu = v->processor;
     v->processor = SCHED_OP(pick_cpu, v);
     spin_unlock_irqrestore(
         &per_cpu(schedule_data, old_cpu).schedule_lock, flags);
+
+    vcpu_schedule_lock_irqsave(v, flags);
+    sched_vcpu_list_add(v);
+    vcpu_schedule_unlock_irqrestore(v, flags);
 
     /* Wake on new CPU. */
     vcpu_wake(v);
@@ -922,6 +944,7 @@ void __init scheduler_init(void)
     {
         spin_lock_init(&per_cpu(schedule_data, i).schedule_lock);
         init_timer(&per_cpu(schedule_data, i).s_timer, s_timer_fn, NULL, i);
+        INIT_LIST_HEAD(&per_cpu(schedule_data, i).vcpu_list);
     }
 
     for ( i = 0; schedulers[i] != NULL; i++ )
diff -r 2636e5619708 xen/include/asm-ia64/event.h
--- a/xen/include/asm-ia64/event.h	Tue Jan 26 15:54:40 2010 +0000
+++ b/xen/include/asm-ia64/event.h	Sun Feb 07 23:08:07 2010 +0800
@@ -48,6 +48,11 @@ static inline int local_events_need_deli
     return event_pending(current);
 }
 
+static inline int vcpu_event_delivery_is_enabled(struct vcpu* v)
+{
+    return !v->vcpu_info->evtchn_upcall_mask;
+}
+
 static inline int local_event_delivery_is_enabled(void)
 {
     return !current->vcpu_info->evtchn_upcall_mask;
diff -r 2636e5619708 xen/include/asm-x86/event.h
--- a/xen/include/asm-x86/event.h	Tue Jan 26 15:54:40 2010 +0000
+++ b/xen/include/asm-x86/event.h	Sun Feb 07 23:08:07 2010 +0800
@@ -23,6 +23,11 @@ static inline int local_events_need_deli
              !vcpu_info(v, evtchn_upcall_mask)));
 }
 
+static inline int vcpu_event_delivery_is_enabled(struct vcpu* v)
+{
+    return !vcpu_info(v, evtchn_upcall_mask);
+}
+
 static inline int local_event_delivery_is_enabled(void)
 {
     return !vcpu_info(current, evtchn_upcall_mask);
diff -r 2636e5619708 xen/include/xen/sched-if.h
--- a/xen/include/xen/sched-if.h	Tue Jan 26 15:54:40 2010 +0000
+++ b/xen/include/xen/sched-if.h	Sun Feb 07 23:08:07 2010 +0800
@@ -16,6 +16,7 @@ struct schedule_data {
     struct vcpu        *idle;           /* idle task for this cpu          */
     void               *sched_priv;
     struct timer        s_timer;        /* scheduling timer                */
+    struct list_head    vcpu_list;      /* vcpu whose vcpu->processor is this cpu */
 } __cacheline_aligned;
 
 DECLARE_PER_CPU(struct schedule_data, schedule_data);
@@ -48,6 +49,28 @@ static inline void vcpu_schedule_unlock(
     do { vcpu_schedule_unlock(v); local_irq_enable(); } while ( 0 )
 #define vcpu_schedule_unlock_irqrestore(v, flags) \
     do { vcpu_schedule_unlock(v); local_irq_restore(flags); } while ( 0 )
+
+static inline void sched_vcpu_list_add(struct vcpu *v)
+{
+    ASSERT( spin_is_locked(&per_cpu(schedule_data,
+                    v->processor).schedule_lock) );
+
+    list_add(&v->sched_vcpu_list,
+            &per_cpu(schedule_data,v->processor).vcpu_list);
+}
+
+static inline void sched_vcpu_list_del(struct vcpu *v)
+{
+    ASSERT( spin_is_locked(&per_cpu(schedule_data,
+                    v->processor).schedule_lock) );
+
+    list_del( &v->sched_vcpu_list );
+}
+
+static inline struct vcpu* sched_vcpu_list_elem(struct list_head *elem)
+{
+    return list_entry(elem, struct vcpu, sched_vcpu_list);
+}
 
 struct task_slice {
     struct vcpu *task;
diff -r 2636e5619708 xen/include/xen/sched.h
--- a/xen/include/xen/sched.h	Tue Jan 26 15:54:40 2010 +0000
+++ b/xen/include/xen/sched.h	Sun Feb 07 23:08:07 2010 +0800
@@ -91,6 +91,8 @@ struct vcpu
     struct timer     poll_timer;    /* timeout for SCHEDOP_poll */
 
     void            *sched_priv;    /* scheduler-specific data */
+    struct list_head sched_vcpu_list; /* scheduler per-CPU vcpu list
+                                        head in schedule_data->vcpu_list */
 
     struct vcpu_runstate_info runstate;
 #ifndef CONFIG_COMPAT



* RE: cpuidle causing Dom0 soft lockups
  2010-02-06  1:50                             ` Tian, Kevin
@ 2010-02-08  8:36                               ` Jan Beulich
  0 siblings, 0 replies; 61+ messages in thread
From: Jan Beulich @ 2010-02-08  8:36 UTC (permalink / raw)
  To: Ke Yu, Kevin Tian; +Cc: xen-devel@lists.xensource.com, KeirFraser

>>> "Tian, Kevin" <kevin.tian@intel.com> 06.02.10 02:50 >>>
>>The next thing I'll try is to detect when the duty CPU is de-scheduled,
>>and move the duty on to one that is scheduled (i.e. one that is
>>currently executing timer_interrupt()).
>>
>
>It's still possible that you move duty to one active CPU, while that
>active CPU is in stop_hz_timer with interrupt disabled. In that case,
>again no one is able to update jiffies and again stale jiffies can be
>observed... 

No, since the first thing a CPU entering stop_hz_timer() does is
relinquish the duty of updating jiffies.

Jan


* RE: cpuidle causing Dom0 soft lockups
  2010-02-06  1:52                               ` Tian, Kevin
@ 2010-02-08  8:45                                 ` Jan Beulich
  2010-02-09  7:55                                   ` Tian, Kevin
  0 siblings, 1 reply; 61+ messages in thread
From: Jan Beulich @ 2010-02-08  8:45 UTC (permalink / raw)
  To: Ke Yu, Kevin Tian; +Cc: xen-devel@lists.xensource.com, KeirFraser

[-- Attachment #1: Type: text/plain, Size: 1354 bytes --]

>>> "Tian, Kevin" <kevin.tian@intel.com> 06.02.10 02:52 >>>
>>From: Jan Beulich [mailto:JBeulich@novell.com] 
>>Sent: February 5, 2010 23:52
>>
>>>>> "Jan Beulich" <JBeulich@novell.com> 05.02.10 15:59 >>>
>>>The next thing I'll try is to detect when the duty CPU is de-scheduled,
>>>and move the duty on to one that is scheduled (i.e. one that is
>>>currently executing timer_interrupt()).
>>
>>This improves the situation (the highest spike I saw so far was 2,000
>>interrupts per CPU per second), but doesn't get it back the way it
>>ought to be (apart from the spikes, as with the original version of the
>>patch, interrupt activity is also generally too high, very erratic, and
>>even during the more quiet periods doesn't go down to the original
>>level).
>>
>
>Could you send out your new patch? At the same time, collecting singleshot

Attached. After another refinement (in stop_hz_timer()) I didn't see
spikes above 1,000 interrupts per CPU per second anymore. But it's
still far from being as quiescent as without the patch.

What's also interesting is that there's an initial period (a minute or so)
where the interrupt rate is really stable (though still not as low as
previously), and only then it starts becoming erratic.

>timer statistics on the Xen side would be helpful, as I said earlier. :-)

Didn't get to do that yet.

Jan

[-- Attachment #2: xen-x86-xtime-lock.patch --]
[-- Type: text/plain, Size: 6152 bytes --]

From: jbeulich@novell.com
Subject: reduce contention on xtime_lock
Patch-mainline: n/a
References: bnc#569014, bnc#571041, bnc#571769, bnc#572146

Especially on large systems the number of CPUs queueing up on
xtime_lock may become significant, and (as reported in the bugs above)
may even prevent proper operation of the system when Xen is using
deep C-states. There is, however, no need for all CPUs in the system
to update global time - it is sufficient to have a single (at any given
point in time) CPU being responsible for this.

--- sle11sp1-2010-02-02.orig/arch/x86/kernel/time-xen.c	2010-02-08 08:39:21.000000000 +0100
+++ sle11sp1-2010-02-02/arch/x86/kernel/time-xen.c	2010-02-08 08:40:15.000000000 +0100
@@ -13,6 +13,7 @@
 #include <linux/interrupt.h>
 #include <linux/time.h>
 #include <linux/sysctl.h>
+#include <linux/cpu.h>
 #include <linux/percpu.h>
 #include <linux/kernel_stat.h>
 #include <linux/posix-timers.h>
@@ -122,6 +123,20 @@ static int __init __permitted_clock_jitt
 }
 __setup("permitted_clock_jitter=", __permitted_clock_jitter);
 
+static volatile unsigned int duty_cpu = NR_CPUS;
+
+static int __cpuinit duty_cpu_callback(struct notifier_block *nb,
+				       unsigned long action, void *hcpu)
+{
+	if ((action & ~CPU_TASKS_FROZEN) == CPU_DOWN_PREPARE)
+		(void)cmpxchg(&duty_cpu, (unsigned long)hcpu, NR_CPUS);
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block __cpuinitdata duty_cpu_notifier = {
+	.notifier_call = duty_cpu_callback
+};
+
 /*
  * Scale a 64-bit delta by scaling and multiplying by a 32-bit fraction,
  * yielding a 64-bit result.
@@ -407,6 +422,7 @@ unsigned long profile_pc(struct pt_regs 
 }
 EXPORT_SYMBOL(profile_pc);
 
+#include <asm/mce.h>//temp
 /*
  * Default timer interrupt handler
  */
@@ -415,6 +431,7 @@ static irqreturn_t timer_interrupt(int i
 	s64 delta, delta_cpu, stolen, blocked;
 	unsigned int i, cpu = smp_processor_id();
 	struct shadow_time_info *shadow = &per_cpu(shadow_time, cpu);
+	bool duty = false;
 	struct vcpu_runstate_info runstate;
 
 	/* Keep nmi watchdog up to date */
@@ -427,7 +444,17 @@ static irqreturn_t timer_interrupt(int i
 	 * the irq version of write_lock because as just said we have irq
 	 * locally disabled. -arca
 	 */
-	write_seqlock(&xtime_lock);
+	do {
+		i = duty_cpu;
+		if (i == cpu
+		    || ((i == NR_CPUS
+			 || per_cpu(runstate, i).state != RUNSTATE_running)
+			&& cmpxchg(&duty_cpu, i, cpu) == i)) {
+			duty = true;
+			write_seqlock(&xtime_lock);
+			break;
+		}
+	} while (i != duty_cpu);
 
 	do {
 		get_time_values_from_xen(cpu);
@@ -441,8 +468,7 @@ static irqreturn_t timer_interrupt(int i
 		get_runstate_snapshot(&runstate);
 	} while (!time_values_up_to_date());
 
-	if ((unlikely(delta < -(s64)permitted_clock_jitter) ||
-	     unlikely(delta_cpu < -(s64)permitted_clock_jitter))
+	if (duty && unlikely(delta < -(s64)permitted_clock_jitter)
 	    && printk_ratelimit()) {
 		printk("Timer ISR/%u: Time went backwards: "
 		       "delta=%lld delta_cpu=%lld shadow=%lld "
@@ -454,27 +480,45 @@ static irqreturn_t timer_interrupt(int i
 		for (i = 0; i < num_online_cpus(); i++)
 			printk(" %d: %lld\n", i,
 			       per_cpu(processed_system_time, i));
+	} else if (unlikely(delta_cpu < -(s64)permitted_clock_jitter)
+		   && printk_ratelimit()) {
+		if (!duty)
+			write_seqlock(&xtime_lock);
+		printk("Timer ISR/%u: Time went backwards: "
+		       "delta=-%Lx shadow=%Lx off=%Lx processed=%Lx/%Lx\n",
+		       cpu, -delta_cpu, shadow->system_timestamp,
+		       get_nsec_offset(shadow), processed_system_time,
+		       per_cpu(processed_system_time, cpu));
+		for_each_cpu_and(i, cpu_online_mask, cpumask_of(cpu))
+			printk(" %u: %Lx\n", i,
+			       per_cpu(processed_system_time, i));
+		if (!duty)
+			write_sequnlock(&xtime_lock);
 	}
 
-	/* System-wide jiffy work. */
-	if (delta >= NS_PER_TICK) {
-		do_div(delta, NS_PER_TICK);
-		processed_system_time += delta * NS_PER_TICK;
-		while (delta > HZ) {
-			clobber_induction_variable(delta);
-			do_timer(HZ);
-			delta -= HZ;
+	if (duty) {
+kstat_incr_irqs_this_cpu(0, irq_to_desc(0));//temp
+		/* System-wide jiffy work. */
+		if (delta >= NS_PER_TICK) {
+			do_div(delta, NS_PER_TICK);
+			processed_system_time += delta * NS_PER_TICK;
+			while (delta > HZ) {
+				clobber_induction_variable(delta);
+				do_timer(HZ);
+				delta -= HZ;
+			}
+			do_timer(delta);
 		}
-		do_timer(delta);
-	}
 
-	if (shadow_tv_version != HYPERVISOR_shared_info->wc_version) {
-		update_wallclock();
-		if (keventd_up())
-			schedule_work(&clock_was_set_work);
-	}
+		if (shadow_tv_version != HYPERVISOR_shared_info->wc_version) {
+			update_wallclock();
+			if (keventd_up())
+				schedule_work(&clock_was_set_work);
+		}
 
-	write_sequnlock(&xtime_lock);
+		write_sequnlock(&xtime_lock);
+	}
+else percpu_add(mce_exception_count, 1);//temp
 
 	/*
 	 * Account stolen ticks.
@@ -692,6 +736,7 @@ static void __init setup_cpu0_timer_irq(
 {
 	timer_irq = bind_virq_to_irqaction(VIRQ_TIMER, 0, &timer_action);
 	BUG_ON(timer_irq < 0);
+	register_hotcpu_notifier(&duty_cpu_notifier);
 }
 
 void __init time_init(void)
@@ -765,6 +810,7 @@ static void stop_hz_timer(void)
 	unsigned long j;
 	int rc;
 
+	(void)cmpxchg(&duty_cpu, cpu, NR_CPUS);
 	cpumask_set_cpu(cpu, nohz_cpu_mask);
 
 	/* See matching smp_mb in rcu_start_batch in rcupdate.c.  These mbs  */
@@ -786,9 +832,16 @@ static void stop_hz_timer(void)
 	}
 
 	singleshot.timeout_abs_ns = jiffies_to_st(j);
-	if (singleshot.timeout_abs_ns)
+	if (singleshot.timeout_abs_ns) {
+#ifdef CONFIG_X86_64
+		u64 local = percpu_read(processed_system_time);
+#else
+???		u64 local = get64_local(&per_cpu(processed_system_time, cpu));
+#endif
+		if ((s64)(singleshot.timeout_abs_ns - local) <= 0)
+			singleshot.timeout_abs_ns = local + NS_PER_TICK;
 		singleshot.timeout_abs_ns += NS_PER_TICK/2;
-	else
+	} else
 		singleshot.timeout_abs_ns = per_cpu(processed_system_time, cpu)
 					    + ((u64)NS_PER_TICK << 32);
 	singleshot.flags = 0;

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: cpuidle causing Dom0 soft lockups
  2010-02-07 15:36                 ` Yu, Ke
@ 2010-02-08  9:08                   ` Jan Beulich
  2010-02-08 10:11                     ` Keir Fraser
                                       ` (2 more replies)
  0 siblings, 3 replies; 61+ messages in thread
From: Jan Beulich @ 2010-02-08  9:08 UTC (permalink / raw)
  To: Ke Yu; +Cc: Kevin Tian, xen-devel@lists.xensource.com, Keir Fraser

>>> "Yu, Ke" <ke.yu@intel.com> 07.02.10 16:36 >>>
>The attached is the updated patch, it has two changes
>- change the logic from local irq disabled *and* poll event to local irq disabled *or* poll event 

Thanks.

>- Use a per-CPU vcpu list to iterate over the vCPUs, which is more scalable. The original scheduler does not provide such a list, so this patch implements it in the scheduler code.

I'm still not really happy with that solution. I'd rather say that e.g.
vcpu_sleep_nosync() should set a flag in the vcpu structure indicating
whether that one is "urgent", and the scheduler should just maintain
a counter of "urgent" vCPU-s per pCPU. Setting the flag when a vCPU
is put to sleep guarantees that it won't be mis-treated if it got woken
by the time acpi_processor_idle() looks at it (or at least the window
would be minimal - not sure if it can be eliminated completely). Plus
not having to traverse a list is certainly better for scalability, not the
least since you're traversing a list (necessarily) including sleeping
vCPU-s (i.e. the ones that shouldn't affect the performance/
responsiveness of the system).

But in the end it would certainly depend much more on Keir's view on
it than on mine...

Jan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: cpuidle causing Dom0 soft lockups
  2010-02-08  9:08                   ` Jan Beulich
@ 2010-02-08 10:11                     ` Keir Fraser
  2010-02-09  8:02                     ` Tian, Kevin
  2010-02-13  2:28                     ` Yu, Ke
  2 siblings, 0 replies; 61+ messages in thread
From: Keir Fraser @ 2010-02-08 10:11 UTC (permalink / raw)
  To: Jan Beulich, Ke Yu; +Cc: Kevin Tian, xen-devel@lists.xensource.com

On 08/02/2010 09:08, "Jan Beulich" <JBeulich@novell.com> wrote:

>> - Use per-CPU vcpu list to iterate the VCPU, which is more scalable. The
>> original scheduler does not provide such kind of list, so this patch
>> implement the list in scheduler code.
> 
> I'm still not really happy with that solution. I'd rather say that e.g.
> vcpu_sleep_nosync() should set a flag in the vcpu structure indicating
> whether that one is "urgent", and the scheduler should just maintain
> a counter of "urgent" vCPU-s per pCPU. Setting the flag when a vCPU
> is put to sleep guarantees that it won't be mis-treated if it got woken
> by the time acpi_processor_idle() looks at it (or at least the window
> would be minimal - not sure if it can be eliminated completely). Plus
> not having to traverse a list is certainly better for scalability, not the
> least since you're traversing a list (necessarily) including sleeping
> vCPU-s (i.e. the ones that shouldn't affect the performance/
> responsiveness of the system).
> 
> But in the end it would certainly depend much more on Keir's view on
> it than on mine...

Your suggestion makes sense to me.

 K.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: cpuidle causing Dom0 soft lockups
  2010-02-08  8:45                                 ` Jan Beulich
@ 2010-02-09  7:55                                   ` Tian, Kevin
  2010-02-09 12:35                                     ` Jan Beulich
  2010-02-11 14:44                                     ` Jan Beulich
  0 siblings, 2 replies; 61+ messages in thread
From: Tian, Kevin @ 2010-02-09  7:55 UTC (permalink / raw)
  To: Jan Beulich, Yu, Ke; +Cc: xen-devel@lists.xensource.com, Keir Fraser

[-- Attachment #1: Type: text/plain, Size: 2389 bytes --]

>From: Jan Beulich [mailto:JBeulich@novell.com] 
>Sent: February 8, 2010 16:46
>>>> "Tian, Kevin" <kevin.tian@intel.com> 06.02.10 02:52 >>>
>>>From: Jan Beulich [mailto:JBeulich@novell.com] 
>>>Sent: February 5, 2010 23:52
>>>
>>>>>> "Jan Beulich" <JBeulich@novell.com> 05.02.10 15:59 >>>
>>>>Next I'll do is try to detect when the duty CPU is de-scheduled, and
>>>>move on the duty to one that is scheduled (i.e. one that is currently
>>>>executing timer_interrupt()).
>>>
>>>This improves the situation (the highest spike I saw so far was 2,000
>>>interrupts per CPU per second), but doesn't get it back the way it
>>>ought to be (apart from the spikes, as with the original version of the
>>>patch, interrupt activity is also generally too high, very erratic, and
>>>even during the more quiet periods doesn't go down to the original
>>>level).
>>>
>>
>>could you send out your new patch? in same time, tweaking singleshot
>
>Attached. After another refinement (in stop_hz_timer()) I didn't see
>spikes above 1,000 interrupts per CPU per second anymore. But it's
>still far from being as quiescent as without the patch.

Would you mind elaborating on what the refinement is and how it may
reduce the spikes? Understanding those variants may help draw the big
picture of the whole issue.

>
>What's also interesting is that there's an initial period (a minute or so)
>where the interrupt rate is really stable (though still not as low as
>previously), and only then it starts becoming erratic.
>

What is the average interrupt rate for the 'stable' and 'erratic' cases?
Is it close to the spikes (~1,000)?

>>timer stat from Xen side would be helpful as I said earlier. :-)
>
>Didn't get to do that yet.

This stat would be helpful, given that you can get an actual singleshot
timer trace instead of just speculating from inside dom0.

At the same time, you could pin the dom0 vCPUs as a simplified case.

BTW, with your current patch there is still the possibility of several
vCPUs contending for xtime_lock at the same time: the current duty
vCPU may be preempted in the ISR, and another, non-duty vCPU will
then note that it is not in RUNSTATE_running and designate itself
to take over the duty. This may not be a big issue compared to the
original always-contending style, but I raise it here so you can make
sure it's actually what you intended.
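
The takeover Kevin describes can be sketched as follows (a hypothetical
reconstruction based on the cmpxchg on duty_cpu visible in the
stop_hz_timer() hunk; the names and the NR_CPUS sentinel are assumptions,
and C11 atomics stand in for the kernel primitives):

```c
#include <stdatomic.h>
#include <stdbool.h>

#define NR_CPUS 64u  /* assumed; doubles as the "no duty CPU" sentinel */

static atomic_uint duty_cpu = NR_CPUS;

/* A vCPU in timer_interrupt() claims the duty when the slot is free, or
 * when the previous holder `old` was observed descheduled.  Two vCPUs
 * seeing the same stale holder race here: only one CAS wins, but both
 * have already entered the ISR, so brief xtime_lock contention remains
 * possible - the window described above. */
static bool try_take_duty(unsigned int cpu, unsigned int old)
{
    unsigned int expected = old;
    return atomic_compare_exchange_strong(&duty_cpu, &expected, cpu);
}
```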

Thanks,
Kevin

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: cpuidle causing Dom0 soft lockups
  2010-02-08  9:08                   ` Jan Beulich
  2010-02-08 10:11                     ` Keir Fraser
@ 2010-02-09  8:02                     ` Tian, Kevin
  2010-02-13  2:28                     ` Yu, Ke
  2 siblings, 0 replies; 61+ messages in thread
From: Tian, Kevin @ 2010-02-09  8:02 UTC (permalink / raw)
  To: Jan Beulich, Yu, Ke; +Cc: xen-devel@lists.xensource.com, Keir Fraser

[-- Attachment #1: Type: text/plain, Size: 1777 bytes --]

>From: Jan Beulich [mailto:JBeulich@novell.com] 
>Sent: February 8, 2010 17:08
>
>>>> "Yu, Ke" <ke.yu@intel.com> 07.02.10 16:36 >>>
>>The attached is the updated patch, it has two changes
>>- change the logic from local irq disabled *and* poll event to local
>>irq disabled *or* poll event
>
>Thanks.
>
>>- Use a per-CPU vcpu list to iterate over the vCPUs, which is more
>>scalable. The original scheduler does not provide such a list, so
>>this patch implements it in the scheduler code.
>
>I'm still not really happy with that solution. I'd rather say that e.g.
>vcpu_sleep_nosync() should set a flag in the vcpu structure indicating
>whether that one is "urgent", and the scheduler should just maintain
>a counter of "urgent" vCPU-s per pCPU. Setting the flag when a vCPU
>is put to sleep guarantees that it won't be mis-treated if it got woken
>by the time acpi_processor_idle() looks at it (or at least the window
>would be minimal - not sure if it can be eliminated completely). Plus
>not having to traverse a list is certainly better for scalability, not the
>least since you're traversing a list (necessarily) including sleeping
>vCPU-s (i.e. the ones that shouldn't affect the performance/
>responsiveness of the system).
>
>But in the end it would certainly depend much more on Keir's view on
>it than on mine...
>

Yes, that's a good point. It was actually the first choice when Ke tried
to implement this, but he gave it up later due to the difficulty of
maintaining the counter at the appropriate entry/exit points. Introducing
a new 'urgent' flag would make it easier. Another reason to use the
per-CPU vcpu list was its potential reusability in other scenarios, but
that argument doesn't look strong now.

Anyway, a new patch per your suggestion is in progress now.

Thanks,
Kevin

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: cpuidle causing Dom0 soft lockups
  2010-02-09  7:55                                   ` Tian, Kevin
@ 2010-02-09 12:35                                     ` Jan Beulich
  2010-02-11 14:44                                     ` Jan Beulich
  1 sibling, 0 replies; 61+ messages in thread
From: Jan Beulich @ 2010-02-09 12:35 UTC (permalink / raw)
  To: Ke Yu, Kevin Tian; +Cc: xen-devel@lists.xensource.com, Keir Fraser

>>> "Tian, Kevin" <kevin.tian@intel.com> 09.02.10 08:55 >>>
>>From: Jan Beulich [mailto:JBeulich@novell.com] 
>>Attached. After another refinement (in stop_hz_timer()) I didn't see
>>spikes above 1,000 interrupts per CPU per second anymore. But it's
>>still far from being as quiescent as without the patch.
>
>Would you mind elaborating what's refinement and how that may
>reduce spikes? Understand those variants may help draw big picture
>about whole issue.

This was the last hunk in stop_hz_timer() (forcing timeout_abs_ns
to never be in the past relative to the local accumulated time).

>>What's also interesting is that there's an initial period (a 
>>minute or so)
>>where the interrupt rate is really stable (though still not as low as
>>previously), and only then it starts becoming erratic.
>
>What is average interrupt rate for 'stable' and 'erratic' case? Is it
>close to spike (~1000)?

No, during the stable period the rate is less than 50; erratic has
an average rate (not counting the spikes) of about 125.

>BTW, with your current patch there could be still possibility for
>several vCPUs to contend for xtime_lock at same time. Current duty
>vCPU may be preempted in ISR, and then other non-duty vCPU
>will note it not in RUNSTATE_running and then designate itself
>to take new duty. This may not be big issue, compared to original
>always-contending style. But just raise it here and please make
>sure it's actually what's desired by you.

Yes, I realize that. It's a tradeoff (and would probably need further
refinement, if only the whole thing would first work as desired). And
it's getting us back to the fact that pv guests should be allowed to
avoid being preempted for short periods of time.

Jan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: cpuidle causing Dom0 soft lockups
  2010-02-09  7:55                                   ` Tian, Kevin
  2010-02-09 12:35                                     ` Jan Beulich
@ 2010-02-11 14:44                                     ` Jan Beulich
  2010-02-11 17:01                                       ` Keir Fraser
  1 sibling, 1 reply; 61+ messages in thread
From: Jan Beulich @ 2010-02-11 14:44 UTC (permalink / raw)
  To: Ke Yu, Kevin Tian; +Cc: xen-devel@lists.xensource.com, Keir Fraser

>>> "Tian, Kevin" <kevin.tian@intel.com> 09.02.10 08:55 >>>
>This stat would be helpful, given that you can get actual singleshot
>timer trace instead of just speculating from dom0 inside.

Finally got to that. While without the patch the number of single-shot
events is less than that of periodic ones, it is much bigger with. Both
kernel (relative to extrapolated local time) and hypervisor (relative to
extrapolated system time) agree that many of them get scheduled
with a time (up to about 6ms) in the past. Which means to me that
the per-CPU processed_system_time isn't being maintained
accurately, and I think this is a result of how it is being updated in
timer_interrupt(): Unlike the global processed_system_time,
the per-CPU one may not get increased even if delta_cpu was close
to 3*NS_PER_TICK, due to the stolen/blocked accounting. Probably
this was not a problem so far because no code outside of
timer_interrupt() read this value - Keir, since I think you wrote that
logic originally, any word on this?

While I think this should be changed (so that the value can be used
in stop_hz_timer() for a second adjustment similar to the one done
with jiffies, setting the timeout to last timer interrupt + 1 tick), I
meanwhile thought of a different approach to the original problem:
Instead of dedicating a CPU to have the duty of maintaining global
time, just limit the number of CPUs that may concurrently try to
acquire xtime_lock in timer_interrupt(). While for now I set the
threshold to __fls(nr_cpu_ids), even setting it to 1 gives pretty
good results - no unduly high interrupt rates anymore, and only
very infrequent (one every half hour or so) single-shot events
that are determined to have a timeout more than 1ms in the past.
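
The limiting scheme can be sketched like this (a standalone illustration,
not the actual patch: the function names are made up and C11 atomics
replace the kernel's atomic_t):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Cap on CPUs concurrently contending for xtime_lock; __fls(nr_cpu_ids)
 * is the threshold mentioned above, but even 1 gives good results. */
static atomic_int xtime_contenders;

/* Called from the timer interrupt before taking xtime_lock.  Returns
 * true if this CPU may contend; the caller must later call
 * xtime_duty_done().  Hypothetical helper, not real kernel code. */
static bool xtime_duty_try(int threshold)
{
    int n = atomic_load(&xtime_contenders);
    while (n < threshold) {
        /* The CAS keeps at most `threshold` CPUs inside at any time. */
        if (atomic_compare_exchange_weak(&xtime_contenders, &n, n + 1))
            return true;
    }
    return false;  /* skip the global jiffy/wallclock work this tick */
}

static void xtime_duty_done(void)
{
    atomic_fetch_sub(&xtime_contenders, 1);
}
```

With a threshold of 1 this behaves like a dynamically chosen duty CPU,
without statically dedicating one.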

Jan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: cpuidle causing Dom0 soft lockups
  2010-02-11 14:44                                     ` Jan Beulich
@ 2010-02-11 17:01                                       ` Keir Fraser
  2010-02-12  9:21                                         ` Jan Beulich
  0 siblings, 1 reply; 61+ messages in thread
From: Keir Fraser @ 2010-02-11 17:01 UTC (permalink / raw)
  To: Jan Beulich, Ke Yu, Kevin Tian; +Cc: xen-devel@lists.xensource.com

On 11/02/2010 14:44, "Jan Beulich" <JBeulich@novell.com> wrote:

> Unlike the global processed_system_time,
> the per-CPU one may not get increased even if delta_cpu was close
> to 3*NS_PER_TICK, due to the stolen/blocked accounting. Probably
> this was not a problem so far because no code outside of
> timer_interrupt() read this value - Keir, since I think you wrote that
> logic originally, any word on this?

What you say is true, as clearly it is currently implemented that way almost
by design. I'm lost in the intricacies of your current discussion though, so
not sure exactly why it's a problem, and how we should fix it?

 Thanks,
 Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: cpuidle causing Dom0 soft lockups
  2010-02-11 17:01                                       ` Keir Fraser
@ 2010-02-12  9:21                                         ` Jan Beulich
  0 siblings, 0 replies; 61+ messages in thread
From: Jan Beulich @ 2010-02-12  9:21 UTC (permalink / raw)
  To: Keir Fraser; +Cc: Kevin Tian, xen-devel@lists.xensource.com, Ke Yu

>>> Keir Fraser <keir.fraser@eu.citrix.com> 11.02.10 18:01 >>>
>On 11/02/2010 14:44, "Jan Beulich" <JBeulich@novell.com> wrote:
>
>> Unlike the global processed_system_time,
>> the per-CPU one may not get increased even if delta_cpu was close
>> to 3*NS_PER_TICK, due to the stolen/blocked accounting. Probably
>> this was not a problem so far because no code outside of
>> timer_interrupt() read this value - Keir, since I think you wrote that
>> logic originally, any word on this?
>
>What you say is true, as clearly it is currently implemented that way almost
>by design. I'm lost in the intricacies of your current discussion though, so
>not sure exactly why it's a problem, and how we should fix it?

First of all I don't think anything necessarily needs to be fixed in the
2.6.18 tree, as that one will never support >32 vCPU-s, and I don't
think the scalability issue we're talking about here is of concern there.

The problem we're trying to address is the contention on xtime_lock.
It is clear that there generally is no need for all CPUs in the system
to try to update the global time variables, so some filtering on the
number of CPUs concurrently trying to acquire xtime_lock is
reasonable. With any filtering done, there is however potential for
a CPU to see its local processed time ahead of the global one, but
while setting a single shot timer in the past (or very near future)
would guarantee that it would execute timer_interrupt() (almost)
right away, it does not guarantee that it would now be among
those CPUs that would try to acquire xtime_lock (i.e. the situation
wouldn't necessarily have improved after the interrupt was handled,
and hence an interrupt storm is possible).

Consequently, along with capping the timeout to be set in
stop_hz_timer() to jiffies+1, the timeout would also reasonably be
capped to per_cpu(processed_system_time, cpu) + NS_PER_TICK.
This in turn only makes sense if the per-CPU processed time is
accurate (i.e. within NS_PER_TICK from when the last timer
interrupt occurred). That however doesn't hold: Due to the stolen/
blocked calculations subtracting exact nanosecond values from
delta_cpu, but only adding tick granular values into per-CPU
processed_system_time, the error can accumulate up to a little
less than 3*NS_PER_TICK.

The proposed change would be to do only a single adjustment to
per-CPU processed_system_time (using the originally calculated
delta_cpu value). What I couldn't convince myself of so far was
that this wouldn't influence the stolen/blocked accounting (since
the delta_cpu calculated on the next timer interrupt would now
necessarily be different from the one calculated with the
current logic) - in particular the adjustments commented with
"clamp local-time progress" are what would appear to get used
more frequently with the proposed change.
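
One way to see the 3*NS_PER_TICK bound is with a toy model (assuming a
1 kHz tick purely for illustration; this is not the actual
timer_interrupt() code): rounding each of the stolen, blocked and local
components down to tick granularity separately can lose a little less
than one tick per component, while a single adjustment by the
originally calculated delta_cpu loses less than one tick in total.

```c
#include <stdint.h>

#define NS_PER_TICK 1000000ULL  /* assumed 1 kHz tick, for illustration */

/* Credit a nanosecond delta in whole ticks, dropping the remainder. */
static uint64_t whole_ticks(uint64_t delta_ns)
{
    return delta_ns - delta_ns % NS_PER_TICK;
}

/* Current scheme: stolen, blocked and local time are each credited at
 * tick granularity separately, so up to ~3 ticks can be lost. */
static uint64_t credited_separately(uint64_t stolen, uint64_t blocked,
                                    uint64_t local)
{
    return whole_ticks(stolen) + whole_ticks(blocked) + whole_ticks(local);
}

/* Proposed scheme: one adjustment by the full delta, losing less than
 * a single tick. */
static uint64_t credited_once(uint64_t stolen, uint64_t blocked,
                              uint64_t local)
{
    return whole_ticks(stolen + blocked + local);
}
```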

Jan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: cpuidle causing Dom0 soft lockups
  2010-02-08  9:08                   ` Jan Beulich
  2010-02-08 10:11                     ` Keir Fraser
  2010-02-09  8:02                     ` Tian, Kevin
@ 2010-02-13  2:28                     ` Yu, Ke
  2010-02-15  8:24                       ` Keir Fraser
  2010-02-15 17:33                       ` Keir Fraser
  2 siblings, 2 replies; 61+ messages in thread
From: Yu, Ke @ 2010-02-13  2:28 UTC (permalink / raw)
  To: Jan Beulich; +Cc: Tian, Kevin, xen-devel@lists.xensource.com, Keir Fraser

[-- Attachment #1: Type: text/plain, Size: 1853 bytes --]

Hi Jan,

The attached is the updated patch per your suggestion. Generally, this patch uses the per-CPU urgent vCPU count to decide whether a CPU should enter a deep C state. It introduces a per-vCPU urgent flag and updates the urgent vCPU count whenever the vCPU state changes. Could you please take a look? Thanks

Regards
Ke

>-----Original Message-----
>From: Jan Beulich [mailto:JBeulich@novell.com]
>Sent: Monday, February 08, 2010 5:08 PM
>To: Yu, Ke
>Cc: Keir Fraser; Tian, Kevin; xen-devel@lists.xensource.com
>Subject: RE: [Xen-devel] cpuidle causing Dom0 soft lockups
>
>>>> "Yu, Ke" <ke.yu@intel.com> 07.02.10 16:36 >>>
>>The attached is the updated patch, it has two changes
>>- change the logic from local irq disabled *and* poll event to local irq
>disabled *or* poll event
>
>Thanks.
>
>>- Use per-CPU vcpu list to iterate the VCPU, which is more scalable. The
>original scheduler does not provide such kind of list, so this patch implement
>the list in scheduler code.
>
>I'm still not really happy with that solution. I'd rather say that e.g.
>vcpu_sleep_nosync() should set a flag in the vcpu structure indicating
>whether that one is "urgent", and the scheduler should just maintain
>a counter of "urgent" vCPU-s per pCPU. Setting the flag when a vCPU
>is put to sleep guarantees that it won't be mis-treated if it got woken
>by the time acpi_processor_idle() looks at it (or at least the window
>would be minimal - not sure if it can be eliminated completely). Plus
>not having to traverse a list is certainly better for scalability, not the
>least since you're traversing a list (necessarily) including sleeping
>vCPU-s (i.e. the ones that shouldn't affect the performance/
>responsiveness of the system).
>
>But in the end it would certainly depend much more on Keir's view on
>it than on mine...
>
>Jan


[-- Attachment #2: cpuidle-hint-count-v2.patch --]
[-- Type: application/octet-stream, Size: 7359 bytes --]

cpuidle: do not enter a deep C state if there is an urgent VCPU

When a VCPU is polling on an event channel, it usually has an urgent task running (e.g. spinning on a lock); in this case it is better for the cpuidle driver not to enter a deep C state.

This patch fixes the issue that SLES 11 SP1 domain0 hangs on boxes with a large number of CPUs (>= 64).

Signed-off-by: Yu Ke <ke.yu@intel.com>
               Tian Kevin <kevin.tian@intel.com>

diff -r 2636e5619708 xen/arch/x86/acpi/cpu_idle.c
--- a/xen/arch/x86/acpi/cpu_idle.c	Tue Jan 26 15:54:40 2010 +0000
+++ b/xen/arch/x86/acpi/cpu_idle.c	Sat Feb 13 10:09:56 2010 +0800
@@ -41,6 +41,7 @@
 #include <xen/keyhandler.h>
 #include <xen/cpuidle.h>
 #include <xen/trace.h>
+#include <xen/sched-if.h>
 #include <asm/cache.h>
 #include <asm/io.h>
 #include <asm/hpet.h>
@@ -216,6 +217,15 @@ static inline void trace_exit_reason(u32
     }
 }
 
+/* A vcpu is urgent if it is polling an event channel.
+ *
+ * If an urgent vcpu exists, the CPU should not enter a deep C state.
+ */
+static int sched_has_urgent_vcpu(void)
+{
+    return atomic_read(&this_cpu(schedule_data).urgent_count);
+}
+
 static void acpi_processor_idle(void)
 {
     struct acpi_processor_power *power = processor_powers[smp_processor_id()];
@@ -225,6 +235,26 @@ static void acpi_processor_idle(void)
     u32 t1, t2 = 0;
     u32 exp = 0, pred = 0;
     u32 irq_traced[4] = { 0 };
+
+    if ( max_cstate > 0 && power && !sched_has_urgent_vcpu() &&
+         (next_state = cpuidle_current_governor->select(power)) > 0 )
+    {
+        cx = &power->states[next_state];
+        if ( power->flags.bm_check && acpi_idle_bm_check()
+             && cx->type == ACPI_STATE_C3 )
+            cx = power->safe_state;
+        if ( cx->idx > max_cstate )
+            cx = &power->states[max_cstate];
+        menu_get_trace_data(&exp, &pred);
+    }
+    if ( !cx )
+    {
+        if ( pm_idle_save )
+            pm_idle_save();
+        else
+            acpi_safe_halt();
+        return;
+    }
 
     cpufreq_dbs_timer_suspend();
 
@@ -241,28 +271,6 @@ static void acpi_processor_idle(void)
     if ( softirq_pending(smp_processor_id()) )
     {
         local_irq_enable();
-        sched_tick_resume();
-        cpufreq_dbs_timer_resume();
-        return;
-    }
-
-    if ( max_cstate > 0 && power && 
-         (next_state = cpuidle_current_governor->select(power)) > 0 )
-    {
-        cx = &power->states[next_state];
-        if ( power->flags.bm_check && acpi_idle_bm_check()
-             && cx->type == ACPI_STATE_C3 )
-            cx = power->safe_state;
-        if ( cx->idx > max_cstate )
-            cx = &power->states[max_cstate];
-        menu_get_trace_data(&exp, &pred);
-    }
-    if ( !cx )
-    {
-        if ( pm_idle_save )
-            pm_idle_save();
-        else
-            acpi_safe_halt();
         sched_tick_resume();
         cpufreq_dbs_timer_resume();
         return;
diff -r 2636e5619708 xen/common/sched_credit.c
--- a/xen/common/sched_credit.c	Tue Jan 26 15:54:40 2010 +0000
+++ b/xen/common/sched_credit.c	Sat Feb 13 10:09:56 2010 +0800
@@ -1059,6 +1059,7 @@ csched_runq_steal(int peer_cpu, int cpu,
                 /* We got a candidate. Grab it! */
                 CSCHED_VCPU_STAT_CRANK(speer, migrate_q);
                 CSCHED_STAT_CRANK(migrate_queued);
+                ASSERT(!test_bit(_VMF_URGENT, &vc->misc_flags));
                 __runq_remove(speer);
                 vc->processor = cpu;
                 return speer;
diff -r 2636e5619708 xen/common/schedule.c
--- a/xen/common/schedule.c	Tue Jan 26 15:54:40 2010 +0000
+++ b/xen/common/schedule.c	Sat Feb 13 10:09:56 2010 +0800
@@ -102,6 +102,33 @@ static inline void trace_continue_runnin
                 (unsigned char *)&d);
 }
 
+static inline void vcpu_urgent_count_update(struct vcpu *v)
+{
+
+    if ( is_idle_vcpu(v) )
+        return;
+
+    if ( unlikely(test_bit(_VMF_URGENT, &v->misc_flags)) )
+    {
+        if ( !test_bit(v->vcpu_id, v->domain->poll_mask) )
+        {
+            /* urgent -> non urgent */
+            clear_bit(_VMF_URGENT, &v->misc_flags);
+            atomic_dec(&per_cpu(schedule_data,v->processor).urgent_count);
+        }
+    }
+    else
+    {
+        if ( unlikely(test_bit(v->vcpu_id, v->domain->poll_mask)) )
+        {
+            /* non urgent -> urgent */
+            set_bit(_VMF_URGENT, &v->misc_flags);
+            atomic_inc(&per_cpu(schedule_data,v->processor).urgent_count);
+        }
+    }
+
+}
+
 static inline void vcpu_runstate_change(
     struct vcpu *v, int new_state, s_time_t new_entry_time)
 {
@@ -109,6 +136,8 @@ static inline void vcpu_runstate_change(
 
     ASSERT(v->runstate.state != new_state);
     ASSERT(spin_is_locked(&per_cpu(schedule_data,v->processor).schedule_lock));
+
+    vcpu_urgent_count_update(v);
 
     trace_runstate_change(v, new_state);
 
@@ -190,6 +219,10 @@ void sched_destroy_vcpu(struct vcpu *v)
     kill_timer(&v->periodic_timer);
     kill_timer(&v->singleshot_timer);
     kill_timer(&v->poll_timer);
+    if ( unlikely(test_and_clear_bit(_VMF_URGENT, &v->misc_flags)) )
+    {
+        atomic_dec(&per_cpu(schedule_data, v->processor).urgent_count);
+    }
     SCHED_OP(destroy_vcpu, v);
 }
 
@@ -279,7 +312,7 @@ static void vcpu_migrate(struct vcpu *v)
 static void vcpu_migrate(struct vcpu *v)
 {
     unsigned long flags;
-    int old_cpu;
+    int old_cpu, new_cpu;
 
     vcpu_schedule_lock_irqsave(v, flags);
 
@@ -295,9 +328,20 @@ static void vcpu_migrate(struct vcpu *v)
         return;
     }
 
-    /* Switch to new CPU, then unlock old CPU. */
+    /*
+     * Switch to new CPU, then unlock old CPU.
+     * We deliberately update the urgent count first and then switch CPU,
+     * because once the CPU is switched, v->misc_flags is no longer
+     * protected by the per-CPU scheduler lock.
+     */
     old_cpu = v->processor;
-    v->processor = SCHED_OP(pick_cpu, v);
+    new_cpu = SCHED_OP(pick_cpu, v);
+    if ( unlikely(test_bit(_VMF_URGENT, &v->misc_flags)) )
+    {
+        atomic_dec(&per_cpu(schedule_data, old_cpu).urgent_count);
+        atomic_inc(&per_cpu(schedule_data, new_cpu).urgent_count);
+    }
+    v->processor = new_cpu;
     spin_unlock_irqrestore(
         &per_cpu(schedule_data, old_cpu).schedule_lock, flags);
 
diff -r 2636e5619708 xen/include/xen/sched-if.h
--- a/xen/include/xen/sched-if.h	Tue Jan 26 15:54:40 2010 +0000
+++ b/xen/include/xen/sched-if.h	Sat Feb 13 10:09:56 2010 +0800
@@ -16,6 +16,7 @@ struct schedule_data {
     struct vcpu        *idle;           /* idle task for this cpu          */
     void               *sched_priv;
     struct timer        s_timer;        /* scheduling timer                */
+    atomic_t            urgent_count;   /* how many urgent vcpus           */
 } __cacheline_aligned;
 
 DECLARE_PER_CPU(struct schedule_data, schedule_data);
diff -r 2636e5619708 xen/include/xen/sched.h
--- a/xen/include/xen/sched.h	Tue Jan 26 15:54:40 2010 +0000
+++ b/xen/include/xen/sched.h	Sat Feb 13 10:09:56 2010 +0800
@@ -142,6 +142,10 @@ struct vcpu
 
     unsigned long    pause_flags;
     atomic_t         pause_count;
+
+#define _VMF_URGENT             0
+    /* miscellaneous flags that are not in pause_flags */
+    unsigned int    misc_flags;
 
     /* IRQ-safe virq_lock protects against delivering VIRQ to stale evtchn. */
     u16              virq_to_evtchn[NR_VIRQS];

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: cpuidle causing Dom0 soft lockups
  2010-02-13  2:28                     ` Yu, Ke
@ 2010-02-15  8:24                       ` Keir Fraser
  2010-02-15 17:33                       ` Keir Fraser
  1 sibling, 0 replies; 61+ messages in thread
From: Keir Fraser @ 2010-02-15  8:24 UTC (permalink / raw)
  To: Yu, Ke, Jan Beulich; +Cc: Tian, Kevin, xen-devel@lists.xensource.com

I'm going to do a -rc3 today I think. Shall I roll this one in?

 -- Keir

On 13/02/2010 02:28, "Yu, Ke" <ke.yu@intel.com> wrote:

> Hi Jan,
> 
> The attached is the updated patch per your suggestion. Generally, this patch
> uses the per-CPU urgent vCPU count to decide whether a CPU should enter a
> deep C state. It introduces a per-vCPU urgent flag and updates the urgent
> vCPU count whenever the vCPU state changes. Could you please take a look? Thanks
> 
> Regards
> Ke
> 
>> -----Original Message-----
>> From: Jan Beulich [mailto:JBeulich@novell.com]
>> Sent: Monday, February 08, 2010 5:08 PM
>> To: Yu, Ke
>> Cc: Keir Fraser; Tian, Kevin; xen-devel@lists.xensource.com
>> Subject: RE: [Xen-devel] cpuidle causing Dom0 soft lockups
>> 
>>>>> "Yu, Ke" <ke.yu@intel.com> 07.02.10 16:36 >>>
>>> The attached is the updated patch, it has two changes
>>> - change the logic from local irq disabled *and* poll event to local irq
>> disabled *or* poll event
>> 
>> Thanks.
>> 
>>> - Use per-CPU vcpu list to iterate the VCPU, which is more scalable. The
>> original scheduler does not provide such kind of list, so this patch
>> implement
>> the list in scheduler code.
>> 
>> I'm still not really happy with that solution. I'd rather say that e.g.
>> vcpu_sleep_nosync() should set a flag in the vcpu structure indicating
>> whether that one is "urgent", and the scheduler should just maintain
>> a counter of "urgent" vCPU-s per pCPU. Setting the flag when a vCPU
>> is put to sleep guarantees that it won't be mis-treated if it got woken
>> by the time acpi_processor_idle() looks at it (or at least the window
>> would be minimal - not sure if it can be eliminated completely). Plus
>> not having to traverse a list is certainly better for scalability, not the
>> least since you're traversing a list (necessarily) including sleeping
>> vCPU-s (i.e. the ones that shouldn't affect the performance/
>> responsiveness of the system).
>> 
>> But in the end it would certainly depend much more on Keir's view on
>> it than on mine...
>> 
>> Jan
> 

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: cpuidle causing Dom0 soft lockups
  2010-02-13  2:28                     ` Yu, Ke
  2010-02-15  8:24                       ` Keir Fraser
@ 2010-02-15 17:33                       ` Keir Fraser
  2010-02-16  4:59                         ` Yu, Ke
                                           ` (2 more replies)
  1 sibling, 3 replies; 61+ messages in thread
From: Keir Fraser @ 2010-02-15 17:33 UTC (permalink / raw)
  To: Yu, Ke, Jan Beulich; +Cc: Tian, Kevin, xen-devel@lists.xensource.com

[-- Attachment #1: Type: text/plain, Size: 2211 bytes --]

Attached is a better version of your patch (I think). I haven't applied it
because I don't see why the ASSERT() in sched_credit.c is correct. How do
you know for sure that !v->is_urgent there (and therefore avoid urgent_count
manipulation)?

 -- Keir

On 13/02/2010 02:28, "Yu, Ke" <ke.yu@intel.com> wrote:

> Hi Jan,
> 
> The attached is the updated patch per your suggestion. Generally, this patch
> uses a per-CPU urgent vCPU count to indicate whether the CPU should enter a
> deep C state. It introduces a per-vCPU urgent flag and updates the urgent
> vCPU count when the vCPU state changes. Could you please take a look? Thanks.
> 
> Regards
> Ke
> 
>> -----Original Message-----
>> From: Jan Beulich [mailto:JBeulich@novell.com]
>> Sent: Monday, February 08, 2010 5:08 PM
>> To: Yu, Ke
>> Cc: Keir Fraser; Tian, Kevin; xen-devel@lists.xensource.com
>> Subject: RE: [Xen-devel] cpuidle causing Dom0 soft lockups
>> 
>>>>> "Yu, Ke" <ke.yu@intel.com> 07.02.10 16:36 >>>
>>> The attached is the updated patch, it has two changes
>>> - change the logic from local irq disabled *and* poll event to local irq
>> disabled *or* poll event
>> 
>> Thanks.
>> 
>>> - Use per-CPU vcpu list to iterate the VCPU, which is more scalable. The
>> original scheduler does not provide such kind of list, so this patch
>> implement
>> the list in scheduler code.
>> 
>> I'm still not really happy with that solution. I'd rather say that e.g.
>> vcpu_sleep_nosync() should set a flag in the vcpu structure indicating
>> whether that one is "urgent", and the scheduler should just maintain
>> a counter of "urgent" vCPU-s per pCPU. Setting the flag when a vCPU
>> is put to sleep guarantees that it won't be mis-treated if it got woken
>> by the time acpi_processor_idle() looks at it (or at least the window
>> would be minimal - not sure if it can be eliminated completely). Plus
>> not having to traverse a list is certainly better for scalability, not the
>> least since you're traversing a list (necessarily) including sleeping
>> vCPU-s (i.e. the ones that shouldn't affect the performance/
>> responsiveness of the system).
>> 
>> But in the end it would certainly depend much more on Keir's view on
>> it than on mine...
>> 
>> Jan
> 


[-- Attachment #2: 00-urgent --]
[-- Type: application/octet-stream, Size: 6212 bytes --]

diff -r 560277d2fd20 xen/arch/x86/acpi/cpu_idle.c
--- a/xen/arch/x86/acpi/cpu_idle.c	Mon Feb 15 08:19:07 2010 +0000
+++ b/xen/arch/x86/acpi/cpu_idle.c	Mon Feb 15 17:25:29 2010 +0000
@@ -41,6 +41,7 @@
 #include <xen/keyhandler.h>
 #include <xen/cpuidle.h>
 #include <xen/trace.h>
+#include <xen/sched-if.h>
 #include <asm/cache.h>
 #include <asm/io.h>
 #include <asm/hpet.h>
@@ -216,6 +217,15 @@
     }
 }
 
+/* vcpu is urgent if vcpu is polling event channel
+ *
+ * if urgent vcpu exists, CPU should not enter deep C state
+ */
+static int sched_has_urgent_vcpu(void)
+{
+    return atomic_read(&this_cpu(schedule_data).urgent_count);
+}
+
 static void acpi_processor_idle(void)
 {
     struct acpi_processor_power *power = processor_powers[smp_processor_id()];
@@ -226,6 +236,26 @@
     u32 exp = 0, pred = 0;
     u32 irq_traced[4] = { 0 };
 
+    if ( max_cstate > 0 && power && !sched_has_urgent_vcpu() &&
+         (next_state = cpuidle_current_governor->select(power)) > 0 )
+    {
+        cx = &power->states[next_state];
+        if ( power->flags.bm_check && acpi_idle_bm_check()
+             && cx->type == ACPI_STATE_C3 )
+            cx = power->safe_state;
+        if ( cx->idx > max_cstate )
+            cx = &power->states[max_cstate];
+        menu_get_trace_data(&exp, &pred);
+    }
+    if ( !cx )
+    {
+        if ( pm_idle_save )
+            pm_idle_save();
+        else
+            acpi_safe_halt();
+        return;
+    }
+
     cpufreq_dbs_timer_suspend();
 
     sched_tick_suspend();
@@ -246,28 +276,6 @@
         return;
     }
 
-    if ( max_cstate > 0 && power && 
-         (next_state = cpuidle_current_governor->select(power)) > 0 )
-    {
-        cx = &power->states[next_state];
-        if ( power->flags.bm_check && acpi_idle_bm_check()
-             && cx->type == ACPI_STATE_C3 )
-            cx = power->safe_state;
-        if ( cx->idx > max_cstate )
-            cx = &power->states[max_cstate];
-        menu_get_trace_data(&exp, &pred);
-    }
-    if ( !cx )
-    {
-        if ( pm_idle_save )
-            pm_idle_save();
-        else
-            acpi_safe_halt();
-        sched_tick_resume();
-        cpufreq_dbs_timer_resume();
-        return;
-    }
-
     power->last_state = cx;
 
     /*
diff -r 560277d2fd20 xen/common/sched_credit.c
--- a/xen/common/sched_credit.c	Mon Feb 15 08:19:07 2010 +0000
+++ b/xen/common/sched_credit.c	Mon Feb 15 17:25:29 2010 +0000
@@ -1060,6 +1060,7 @@
                 /* We got a candidate. Grab it! */
                 CSCHED_VCPU_STAT_CRANK(speer, migrate_q);
                 CSCHED_STAT_CRANK(migrate_queued);
+                ASSERT(!vc->is_urgent);
                 __runq_remove(speer);
                 vc->processor = cpu;
                 return speer;
diff -r 560277d2fd20 xen/common/schedule.c
--- a/xen/common/schedule.c	Mon Feb 15 08:19:07 2010 +0000
+++ b/xen/common/schedule.c	Mon Feb 15 17:25:29 2010 +0000
@@ -100,6 +100,29 @@
                 (unsigned char *)&d);
 }
 
+static inline void vcpu_urgent_count_update(struct vcpu *v)
+{
+    if ( is_idle_vcpu(v) )
+        return;
+
+    if ( unlikely(v->is_urgent) )
+    {
+        if ( !test_bit(v->vcpu_id, v->domain->poll_mask) )
+        {
+            v->is_urgent = 0;
+            atomic_dec(&per_cpu(schedule_data,v->processor).urgent_count);
+        }
+    }
+    else
+    {
+        if ( unlikely(test_bit(v->vcpu_id, v->domain->poll_mask)) )
+        {
+            v->is_urgent = 1;
+            atomic_inc(&per_cpu(schedule_data,v->processor).urgent_count);
+        }
+    }
+}
+
 static inline void vcpu_runstate_change(
     struct vcpu *v, int new_state, s_time_t new_entry_time)
 {
@@ -108,6 +131,8 @@
     ASSERT(v->runstate.state != new_state);
     ASSERT(spin_is_locked(&per_cpu(schedule_data,v->processor).schedule_lock));
 
+    vcpu_urgent_count_update(v);
+
     trace_runstate_change(v, new_state);
 
     delta = new_entry_time - v->runstate.state_entry_time;
@@ -188,6 +213,8 @@
     kill_timer(&v->periodic_timer);
     kill_timer(&v->singleshot_timer);
     kill_timer(&v->poll_timer);
+    if ( test_and_clear_bool(v->is_urgent) )
+        atomic_dec(&per_cpu(schedule_data, v->processor).urgent_count);
     SCHED_OP(destroy_vcpu, v);
 }
 
@@ -277,7 +304,7 @@
 static void vcpu_migrate(struct vcpu *v)
 {
     unsigned long flags;
-    int old_cpu;
+    int old_cpu, new_cpu;
 
     vcpu_schedule_lock_irqsave(v, flags);
 
@@ -293,9 +320,23 @@
         return;
     }
 
+    /* Select new CPU. */
+    old_cpu = v->processor;
+    new_cpu = SCHED_OP(pick_cpu, v);
+
+    /*
+     * Transfer urgency status to new CPU before switching CPUs, as once
+     * the switch occurs, v->is_urgent is no longer protected by the per-CPU
+     * scheduler lock we are holding.
+     */
+    if ( unlikely(v->is_urgent) )
+    {
+        atomic_dec(&per_cpu(schedule_data, old_cpu).urgent_count);
+        atomic_inc(&per_cpu(schedule_data, new_cpu).urgent_count);
+    }
+
     /* Switch to new CPU, then unlock old CPU. */
-    old_cpu = v->processor;
-    v->processor = SCHED_OP(pick_cpu, v);
+    v->processor = new_cpu;
     spin_unlock_irqrestore(
         &per_cpu(schedule_data, old_cpu).schedule_lock, flags);
 
diff -r 560277d2fd20 xen/include/xen/sched-if.h
--- a/xen/include/xen/sched-if.h	Mon Feb 15 08:19:07 2010 +0000
+++ b/xen/include/xen/sched-if.h	Mon Feb 15 17:25:29 2010 +0000
@@ -16,6 +16,7 @@
     struct vcpu        *idle;           /* idle task for this cpu          */
     void               *sched_priv;
     struct timer        s_timer;        /* scheduling timer                */
+    atomic_t            urgent_count;   /* how many urgent vcpus           */
 } __cacheline_aligned;
 
 DECLARE_PER_CPU(struct schedule_data, schedule_data);
diff -r 560277d2fd20 xen/include/xen/sched.h
--- a/xen/include/xen/sched.h	Mon Feb 15 08:19:07 2010 +0000
+++ b/xen/include/xen/sched.h	Mon Feb 15 17:25:29 2010 +0000
@@ -115,6 +115,8 @@
     bool_t           is_initialised;
     /* Currently running on a CPU? */
     bool_t           is_running;
+    /* VCPU should wake fast (do not deep sleep the CPU). */
+    bool_t           is_urgent;
 
 #ifdef VCPU_TRAP_LAST
 #define VCPU_TRAP_NONE    0

[-- Attachment #3: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: cpuidle causing Dom0 soft lockups
  2010-02-15 17:33                       ` Keir Fraser
@ 2010-02-16  4:59                         ` Yu, Ke
  2010-02-16  7:59                         ` Jan Beulich
  2010-02-22 13:29                         ` Jan Beulich
  2 siblings, 0 replies; 61+ messages in thread
From: Yu, Ke @ 2010-02-16  4:59 UTC (permalink / raw)
  To: Keir Fraser, Jan Beulich; +Cc: Tian, Kevin, xen-devel@lists.xensource.com

Thanks for the refinement.

For the ASSERT, the reason is that this is a runnable vCPU, and a runnable vCPU should be non-urgent. Consider a vCPU changing from RUNSTATE_blocked/RUNSTATE_offline to RUNSTATE_runnable via vcpu_wake(): vcpu_wake() calls vcpu_runstate_change(), which in turn calls vcpu_urgent_count_update(), so v->is_urgent is updated accordingly. vcpu_wake() is protected by the scheduler lock, so this is atomic.

Best Regards
Ke

>-----Original Message-----
>From: Keir Fraser [mailto:keir.fraser@eu.citrix.com]
>Sent: Tuesday, February 16, 2010 1:34 AM
>To: Yu, Ke; Jan Beulich
>Cc: Tian, Kevin; xen-devel@lists.xensource.com
>Subject: Re: [Xen-devel] cpuidle causing Dom0 soft lockups
>
>Attached is a better version of your patch (I think). I haven't applied it
>because I don't see why the ASSERT() in sched_credit.c is correct. How do
>you know for sure that !v->is_urgent there (and therefore avoid urgent_count
>manipulation)?
>
> -- Keir
>
>On 13/02/2010 02:28, "Yu, Ke" <ke.yu@intel.com> wrote:
>
>> Hi Jan,
>>
>> The attached is the updated patch per your suggestion. Generally, this patch
>> uses a per-CPU urgent vCPU count to indicate whether the CPU should enter a
>> deep C state. It introduces a per-vCPU urgent flag and updates the urgent
>> vCPU count when the vCPU state changes. Could you please take a look? Thanks.
>>
>> Regards
>> Ke
>>
>>> -----Original Message-----
>>> From: Jan Beulich [mailto:JBeulich@novell.com]
>>> Sent: Monday, February 08, 2010 5:08 PM
>>> To: Yu, Ke
>>> Cc: Keir Fraser; Tian, Kevin; xen-devel@lists.xensource.com
>>> Subject: RE: [Xen-devel] cpuidle causing Dom0 soft lockups
>>>
>>>>>> "Yu, Ke" <ke.yu@intel.com> 07.02.10 16:36 >>>
>>>> The attached is the updated patch, it has two changes
>>>> - change the logic from local irq disabled *and* poll event to local irq
>>> disabled *or* poll event
>>>
>>> Thanks.
>>>
>>>> - Use per-CPU vcpu list to iterate the VCPU, which is more scalable. The
>>> original scheduler does not provide such kind of list, so this patch
>>> implement
>>> the list in scheduler code.
>>>
>>> I'm still not really happy with that solution. I'd rather say that e.g.
>>> vcpu_sleep_nosync() should set a flag in the vcpu structure indicating
>>> whether that one is "urgent", and the scheduler should just maintain
>>> a counter of "urgent" vCPU-s per pCPU. Setting the flag when a vCPU
>>> is put to sleep guarantees that it won't be mis-treated if it got woken
>>> by the time acpi_processor_idle() looks at it (or at least the window
>>> would be minimal - not sure if it can be eliminated completely). Plus
>>> not having to traverse a list is certainly better for scalability, not the
>>> least since you're traversing a list (necessarily) including sleeping
>>> vCPU-s (i.e. the ones that shouldn't affect the performance/
>>> responsiveness of the system).
>>>
>>> But in the end it would certainly depend much more on Keir's view on
>>> it than on mine...
>>>
>>> Jan
>>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: cpuidle causing Dom0 soft lockups
  2010-02-15 17:33                       ` Keir Fraser
  2010-02-16  4:59                         ` Yu, Ke
@ 2010-02-16  7:59                         ` Jan Beulich
  2010-02-16 13:12                           ` Yu, Ke
  2010-02-22 13:29                         ` Jan Beulich
  2 siblings, 1 reply; 61+ messages in thread
From: Jan Beulich @ 2010-02-16  7:59 UTC (permalink / raw)
  To: Keir Fraser, Ke Yu; +Cc: Kevin Tian, xen-devel@lists.xensource.com

>>> Keir Fraser <keir.fraser@eu.citrix.com> 15.02.10 18:33 >>>
>Attached is a better version of your patch (I think). I haven't applied it
>because I don't see why the ASSERT() in sched_credit.c is correct. How do
>you know for sure that !v->is_urgent there (and therefore avoid urgent_count
>manipulation)?

Two remarks: For one, your patch doesn't consider vCPU-s with event
delivery disabled urgent anymore. Second, here

>+    /*
>+     * Transfer urgency status to new CPU before switching CPUs, as once
>+     * the switch occurs, v->is_urgent is no longer protected by the per-CPU
>+     * scheduler lock we are holding.
>+     */
>+    if ( unlikely(v->is_urgent) )
>+    {
>+        atomic_dec(&per_cpu(schedule_data, old_cpu).urgent_count);
>+        atomic_inc(&per_cpu(schedule_data, new_cpu).urgent_count);
>+    }

I would think we should either avoid the atomic ops altogether if
old_cpu == new_cpu, or switch the updating order (inc before dec).

As to your other question - yes, I'd certainly like to see this included in
-rc3.

Jan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: cpuidle causing Dom0 soft lockups
  2010-02-16  7:59                         ` Jan Beulich
@ 2010-02-16 13:12                           ` Yu, Ke
  2010-02-16 14:24                             ` Jan Beulich
  0 siblings, 1 reply; 61+ messages in thread
From: Yu, Ke @ 2010-02-16 13:12 UTC (permalink / raw)
  To: Jan Beulich, Keir Fraser; +Cc: Tian, Kevin, xen-devel@lists.xensource.com

>>>> Keir Fraser <keir.fraser@eu.citrix.com> 15.02.10 18:33 >>>
>>Attached is a better version of your patch (I think). I haven't applied it
>>because I don't see why the ASSERT() in sched_credit.c is correct. How do
>>you know for sure that !v->is_urgent there (and therefore avoid
>urgent_count
>>manipulation)?
>
>Two remarks: For one, your patch doesn't consider vCPU-s with event
>delivery disabled urgent anymore. 

Oh, sorry that I made this change without explaining the reason. When a vCPU is blocked with event delivery disabled, it is either a guest CPU that is offline or a guest CPU polling on an event channel. An offlined guest CPU should not be treated as an urgent vCPU, so we only need to track the event-channel polling case. That is why I simplified the logic to treat only a vCPU polling on an event channel as urgent.

>Second, here
>
>>+    /*
>>+     * Transfer urgency status to new CPU before switching CPUs, as once
>>+     * the switch occurs, v->is_urgent is no longer protected by the
>per-CPU
>>+     * scheduler lock we are holding.
>>+     */
>>+    if ( unlikely(v->is_urgent) )
>>+    {
>>+        atomic_dec(&per_cpu(schedule_data, old_cpu).urgent_count);
>>+        atomic_inc(&per_cpu(schedule_data, new_cpu).urgent_count);
>>+    }
>
>I would think we should either avoid the atomic ops altogether if
>old_cpu == new_cpu, or switch the updating order (inc before dec).

Do you mean that when old_cpu == new_cpu and urgent_count == 1, the current approach (dec before inc) has a small time window (after the dec, before the inc) in which urgent_count == 0, and may thus mislead the cpuidle driver? If that is the case, I agree and would prefer switching the updating order.

Regards
Ke

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: cpuidle causing Dom0 soft lockups
  2010-02-16 13:12                           ` Yu, Ke
@ 2010-02-16 14:24                             ` Jan Beulich
  0 siblings, 0 replies; 61+ messages in thread
From: Jan Beulich @ 2010-02-16 14:24 UTC (permalink / raw)
  To: Keir Fraser, Ke Yu; +Cc: Kevin Tian, xen-devel@lists.xensource.com

>>> "Yu, Ke" <ke.yu@intel.com> 16.02.10 14:12 >>>
>>Two remarks: For one, your patch doesn't consider vCPU-s with event
>>delivery disabled urgent anymore. 
>
>Oh, sorry that I made this change without explaining the reason. When a vCPU is blocked with event delivery disabled, it is either a guest CPU that is offline or a guest CPU polling on an event channel. An offlined guest CPU should not be treated as an urgent vCPU, so we only need to track the event-channel polling case. That is why I simplified the logic to treat only a vCPU polling on an event channel as urgent.

Ah, yes, that makes sense. But it also makes me think about the concept of the change as a whole again: A vCPU polling for an event doesn't really indicate whether it is urgent. It just so happens that polling can be used from the spin lock path. There could be other uses of polling that would not warrant keeping the underlying pCPU from entering deep C states. Perhaps this should rather be based on a guest hint (passed either with the poll hypercall or associated permanently with an event channel) then?

>>I would think we should either avoid the atomic ops altogether if
>>old_cpu == new_cpu, or switch the updating order (inc before dec).
>
>Do you mean that when old_cpu == new_cpu and urgent_count == 1, the current approach (dec before inc) has a small time window (after the dec, before the inc) in which urgent_count == 0, and may thus mislead the cpuidle driver? If that is the case, I agree and would prefer switching the updating order.

Yes, that was my point.

Jan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: cpuidle causing Dom0 soft lockups
  2010-02-15 17:33                       ` Keir Fraser
  2010-02-16  4:59                         ` Yu, Ke
  2010-02-16  7:59                         ` Jan Beulich
@ 2010-02-22 13:29                         ` Jan Beulich
  2010-02-22 13:44                           ` Keir Fraser
  2 siblings, 1 reply; 61+ messages in thread
From: Jan Beulich @ 2010-02-22 13:29 UTC (permalink / raw)
  To: Keir Fraser, Ke Yu; +Cc: Kevin Tian, xen-devel@lists.xensource.com

>>> Keir Fraser <keir.fraser@eu.citrix.com> 15.02.10 18:33 >>>
>--- a/xen/common/sched_credit.c	Mon Feb 15 08:19:07 2010 +0000
>+++ b/xen/common/sched_credit.c	Mon Feb 15 17:25:29 2010 +0000
>@@ -1060,6 +1060,7 @@
>                 /* We got a candidate. Grab it! */
>                 CSCHED_VCPU_STAT_CRANK(speer, migrate_q);
>                 CSCHED_STAT_CRANK(migrate_queued);
>+                ASSERT(!vc->is_urgent);
>                 __runq_remove(speer);
>                 vc->processor = cpu;
>                 return speer;

By what is this assertion motivated? I.e. why shouldn't an urgent vCPU
be getting here? We're seeing this (BUG_ON() in the checked in version)
trigger...

Jan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: cpuidle causing Dom0 soft lockups
  2010-02-22 13:29                         ` Jan Beulich
@ 2010-02-22 13:44                           ` Keir Fraser
  2010-02-23  9:32                             ` Yu, Ke
  0 siblings, 1 reply; 61+ messages in thread
From: Keir Fraser @ 2010-02-22 13:44 UTC (permalink / raw)
  To: Jan Beulich, Ke Yu; +Cc: Kevin Tian, xen-devel@lists.xensource.com

On 22/02/2010 13:29, "Jan Beulich" <JBeulich@novell.com> wrote:

>>>> Keir Fraser <keir.fraser@eu.citrix.com> 15.02.10 18:33 >>>
>> --- a/xen/common/sched_credit.c Mon Feb 15 08:19:07 2010 +0000
>> +++ b/xen/common/sched_credit.c Mon Feb 15 17:25:29 2010 +0000
>> @@ -1060,6 +1060,7 @@
>>                 /* We got a candidate. Grab it! */
>>                 CSCHED_VCPU_STAT_CRANK(speer, migrate_q);
>>                 CSCHED_STAT_CRANK(migrate_queued);
>> +                ASSERT(!vc->is_urgent);
>>                 __runq_remove(speer);
>>                 vc->processor = cpu;
>>                 return speer;
> 
> By what is this assertion motivated? I.e. why shouldn't an urgent vCPU
> be getting here? We're seeing this (BUG_ON() in the checked in version)
> trigger...

The author's assertion was that vc must be runnable and hence cannot be
polling and hence cannot be is_urgent. It sounded dodgy to me hence I
upgraded it to a BUG_ON(), but couldn't actually eyeball a way in which
vc->is_urgent could be true at that point in the code. We have the lock on
the peer cpu's schedule_lock, so is_urgent cannot change under our feet, and
vc is not running so it cannot be added to the domain's poll_mask under our
feet.

 -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: cpuidle causing Dom0 soft lockups
  2010-02-22 13:44                           ` Keir Fraser
@ 2010-02-23  9:32                             ` Yu, Ke
  2010-02-23 10:37                               ` Jan Beulich
  0 siblings, 1 reply; 61+ messages in thread
From: Yu, Ke @ 2010-02-23  9:32 UTC (permalink / raw)
  To: Keir Fraser, Jan Beulich; +Cc: Tian, Kevin, xen-devel@lists.xensource.com

>-----Original Message-----
>From: xen-devel-bounces@lists.xensource.com
>[mailto:xen-devel-bounces@lists.xensource.com] On Behalf Of Keir Fraser
>Sent: Monday, February 22, 2010 9:45 PM
>To: Jan Beulich; Yu, Ke
>Cc: Tian, Kevin; xen-devel@lists.xensource.com
>Subject: Re: [Xen-devel] cpuidle causing Dom0 soft lockups
>
>On 22/02/2010 13:29, "Jan Beulich" <JBeulich@novell.com> wrote:
>
>>>>> Keir Fraser <keir.fraser@eu.citrix.com> 15.02.10 18:33 >>>
>>> --- a/xen/common/sched_credit.c Mon Feb 15 08:19:07 2010 +0000
>>> +++ b/xen/common/sched_credit.c Mon Feb 15 17:25:29 2010 +0000
>>> @@ -1060,6 +1060,7 @@
>>>                 /* We got a candidate. Grab it! */
>>>                 CSCHED_VCPU_STAT_CRANK(speer, migrate_q);
>>>                 CSCHED_STAT_CRANK(migrate_queued);
>>> +                ASSERT(!vc->is_urgent);
>>>                 __runq_remove(speer);
>>>                 vc->processor = cpu;
>>>                 return speer;
>>
>> By what is this assertion motivated? I.e. why shouldn't an urgent vCPU
>> be getting here? We're seeing this (BUG_ON() in the checked in version)
>> trigger...
>
>The author's assertion was that vc must be runnable and hence cannot be
>polling and hence cannot be is_urgent. It sounded dodgy to me hence I
>upgraded it to a BUG_ON(), but couldn't actually eyeball a way in which
>vc->is_urgent could be true at that point in the code. We have the lock on
>the peer cpu's schedule_lock, so is_urgent cannot change under our feet, and
>vc is not running so it cannot be added to the domain's poll_mask under our
>feet.
>
> -- Keir

Right. According to the code, there should be no way to hit this BUG_ON. If it happens, that reveals either a bug in the code or the need to add code to migrate the urgent vCPU count. Do you have more information on how this BUG_ON happens?

Regards
Ke

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: cpuidle causing Dom0 soft lockups
  2010-02-23  9:32                             ` Yu, Ke
@ 2010-02-23 10:37                               ` Jan Beulich
  2010-02-23 10:57                                 ` Keir Fraser
  0 siblings, 1 reply; 61+ messages in thread
From: Jan Beulich @ 2010-02-23 10:37 UTC (permalink / raw)
  To: Keir Fraser, Ke Yu; +Cc: Kevin Tian, xen-devel@lists.xensource.com

>>> "Yu, Ke" <ke.yu@intel.com> 23.02.10 10:32 >>>
>>The author's assertion was that vc must be runnable and hence cannot be
>>polling and hence cannot be is_urgent. It sounded dodgy to me hence I
>>upgraded it to a BUG_ON(), but couldn't actually eyeball a way in which
>>vc->is_urgent could be true at that point in the code. We have the lock on
>>the peer cpu's schedule_lock, so is_urgent cannot change under our feet, and
>>vc is not running so it cannot be added to the domain's poll_mask under our
>>feet.

>Right. According to the code, there should be no way to hit this BUG_ON.
>If it happens, that reveals either a bug in the code or the need to
>add code to migrate the urgent vCPU count. Do you have more
>information on how this BUG_ON happens?

Obviously there are vCPU-s that get inserted on a run queue with
is_urgent set (which according to my reading of Keir's description
shouldn't happen). In particular, this

--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -201,6 +201,7 @@ __runq_insert(unsigned int cpu, struct c
 
     BUG_ON( __vcpu_on_runq(svc) );
     BUG_ON( cpu != svc->vcpu->processor );
+WARN_ON(svc->vcpu->is_urgent);//temp
 
     list_for_each( iter, runq )
     {

triggers occasionally:

(XEN) Xen call trace:
(XEN)    [<ffff82c48011a1e8>] csched_vcpu_wake+0x1e8/0x200
(XEN)    [<ffff82c48011dde2>] vcpu_wake+0x152/0x3a0
(XEN)    [<ffff82c48014b6cd>] vcpu_kick+0x1d/0x80
(XEN)    [<ffff82c480108465>] evtchn_set_pending+0x145/0x1d0
(XEN)    [<ffff82c4801087e8>] evtchn_send+0xa8/0x150
(XEN)    [<ffff82c480108de2>] do_event_channel_op+0x552/0x10d0
(XEN)    [<ffff82c4801f3b61>] do_iret+0xc1/0x1a0
(XEN)    [<ffff82c4801ef169>] syscall_enter+0xa9/0xae

As I understand it, a vCPU with is_urgent set being on any run
queue is enough for the questionable BUG_ON() to eventually
trigger.

Jan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: cpuidle causing Dom0 soft lockups
  2010-02-23 10:37                               ` Jan Beulich
@ 2010-02-23 10:57                                 ` Keir Fraser
  2010-02-23 16:44                                   ` Jan Beulich
  0 siblings, 1 reply; 61+ messages in thread
From: Keir Fraser @ 2010-02-23 10:57 UTC (permalink / raw)
  To: Jan Beulich, Ke Yu; +Cc: Kevin Tian, xen-devel@lists.xensource.com

On 23/02/2010 10:37, "Jan Beulich" <JBeulich@novell.com> wrote:

>> Right. According to the code, there should be no way to this BUG_ON.
>> If it happens, that reveal either bugs of code or the necessity of
>> adding code to migrate urgent vcpu count. Do you have more
>> information on how this BUG_ON happens?
> 
> Obviously there are vCPU-s that get inserted on a run queue with
> is_urgent set (which according to my reading of Keir's description
> shouldn't happen). In particular, this

Is it possible for a polling VCPU to become runnable without it being
cleared from poll_mask? I suspect maybe that is the problem, and that needs
dealing with, or the proper handling needs to be added to sched_credit.c.

 -- Keir

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: cpuidle causing Dom0 soft lockups
  2010-02-23 10:57                                 ` Keir Fraser
@ 2010-02-23 16:44                                   ` Jan Beulich
  2010-02-24  3:08                                     ` Tian, Kevin
  0 siblings, 1 reply; 61+ messages in thread
From: Jan Beulich @ 2010-02-23 16:44 UTC (permalink / raw)
  To: Keir Fraser, Ke Yu; +Cc: Kevin Tian, xen-devel@lists.xensource.com

>>> Keir Fraser <keir.fraser@eu.citrix.com> 23.02.10 11:57 >>>
>On 23/02/2010 10:37, "Jan Beulich" <JBeulich@novell.com> wrote:
>
>>> Right. According to the code, there should be no way to this BUG_ON.
>>> If it happens, that reveal either bugs of code or the necessity of
>>> adding code to migrate urgent vcpu count. Do you have more
>>> information on how this BUG_ON happens?
>> 
>> Obviously there are vCPU-s that get inserted on a run queue with
>> is_urgent set (which according to my reading of Keir's description
>> shouldn't happen). In particular, this
>
>Is it possible for a polling VCPU to become runnable without it being
>cleared from poll_mask? I suspect maybe that is the problem, and that needs
>dealing with, or the proper handling needs to be added to sched_credit.c.

I don't think that's the case, at least not exclusively. Using

--- a/xen/common/sched_credit.c
+++ b/xen/common/sched_credit.c
@@ -201,6 +201,7 @@ __runq_insert(unsigned int cpu, struct c
 
     BUG_ON( __vcpu_on_runq(svc) );
     BUG_ON( cpu != svc->vcpu->processor );
+WARN_ON(svc->vcpu->is_urgent);//temp
 
     list_for_each( iter, runq )
     {
--- a/xen/common/schedule.c
+++ b/xen/common/schedule.c
@@ -139,6 +139,7 @@ static inline void vcpu_runstate_change(
     ASSERT(spin_is_locked(&per_cpu(schedule_data,v->processor).schedule_lock));
 
     vcpu_urgent_count_update(v);
+WARN_ON(v->is_urgent && new_state <= RUNSTATE_runnable);//temp
 
     trace_runstate_change(v, new_state);
 
I get pairs of warnings (i.e. each for the same vCPU):

(XEN) Xen WARN at schedule.c:142
(XEN) Xen call trace:
(XEN)    [<ffff82c48011c8d5>] schedule+0x375/0x510
(XEN)    [<ffff82c48011deb8>] __do_softirq+0x58/0x80
(XEN)    [<ffff82c4801e61e6>] process_softirqs+0x6/0x10

(XEN) Xen WARN at sched_credit.c:204
(XEN) Xen call trace:
(XEN)    [<ffff82c4801186b9>] csched_vcpu_wake+0x169/0x1a0
(XEN)    [<ffff82c4801497f2>] update_runstate_area+0x102/0x110
(XEN)    [<ffff82c48011cdcf>] vcpu_wake+0x13f/0x390
(XEN)    [<ffff82c48014b1a0>] context_switch+0x760/0xed0
(XEN)    [<ffff82c48014913d>] vcpu_kick+0x1d/0x80
(XEN)    [<ffff82c480107feb>] evtchn_set_pending+0xab/0x1b0
(XEN)    [<ffff82c4801083a9>] evtchn_send+0x129/0x150
(XEN)    [<ffff82c480108950>] do_event_channel_op+0x4c0/0xf50
(XEN)    [<ffff82c4801461b5>] reprogram_timer+0x55/0x90
(XEN)    [<ffff82c4801461b5>] reprogram_timer+0x55/0x90
(XEN)    [<ffff82c48011fd44>] timer_softirq_action+0x1a4/0x360
(XEN)    [<ffff82c4801e6169>] syscall_enter+0xa9/0xae

In schedule() this is always "prev" transitioning to RUNSTATE_runnable
(i.e. _VPF_blocked not set), yet the second call trace shows that
_VPF_blocked must have been set at that point (otherwise
vcpu_unblock(), tail-called from vcpu_kick(), would not have called
vcpu_wake()). If the order wasn't always that shown, or if the two
traces got intermixed, this could hint at a race - but they are always
that way, which so far I cannot make sense of.

Jan

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: cpuidle causing Dom0 soft lockups
  2010-02-23 16:44                                   ` Jan Beulich
@ 2010-02-24  3:08                                     ` Tian, Kevin
  2010-02-24  6:51                                       ` Yu, Ke
  0 siblings, 1 reply; 61+ messages in thread
From: Tian, Kevin @ 2010-02-24  3:08 UTC (permalink / raw)
  To: Jan Beulich, Keir Fraser, Yu, Ke; +Cc: xen-devel@lists.xensource.com

[-- Attachment #1: Type: text/plain, Size: 4465 bytes --]

>From: Jan Beulich [mailto:JBeulich@novell.com] 
>Sent: February 24, 2010 0:45
>
>>>> Keir Fraser <keir.fraser@eu.citrix.com> 23.02.10 11:57 >>>
>>On 23/02/2010 10:37, "Jan Beulich" <JBeulich@novell.com> wrote:
>>
>>>> Right. According to the code, there should be no way to hit
>>>> this BUG_ON. If it happens, that reveals either a bug in the code
>>>> or the need to add code to migrate the urgent vCPU count. Do you
>>>> have more information on how this BUG_ON happens?
>>> 
>>> Obviously there are vCPU-s that get inserted on a run queue with
>>> is_urgent set (which according to my reading of Keir's description
>>> shouldn't happen). In particular, this
>>
>>Is it possible for a polling VCPU to become runnable without it being
>>cleared from poll_mask? I suspect maybe that is the problem, and that
>>needs dealing with, or the proper handling needs to be added to
>>sched_credit.c.
>
>I don't think that's the case, at least not exclusively. Using
>
>--- a/xen/common/sched_credit.c
>+++ b/xen/common/sched_credit.c
>@@ -201,6 +201,7 @@ __runq_insert(unsigned int cpu, struct c
> 
>     BUG_ON( __vcpu_on_runq(svc) );
>     BUG_ON( cpu != svc->vcpu->processor );
>+WARN_ON(svc->vcpu->is_urgent);//temp
> 
>     list_for_each( iter, runq )
>     {
>--- a/xen/common/schedule.c
>+++ b/xen/common/schedule.c
>@@ -139,6 +139,7 @@ static inline void vcpu_runstate_change(
>     ASSERT(spin_is_locked(&per_cpu(schedule_data,v->processor).schedule_lock));
> 
>     vcpu_urgent_count_update(v);
>+WARN_ON(v->is_urgent && new_state <= RUNSTATE_runnable);//temp
> 
>     trace_runstate_change(v, new_state);
> 
>I get pairs of warnings (i.e. each for the same vCPU):
>
>(XEN) Xen WARN at schedule.c:142
>(XEN) Xen call trace:
>(XEN)    [<ffff82c48011c8d5>] schedule+0x375/0x510
>(XEN)    [<ffff82c48011deb8>] __do_softirq+0x58/0x80
>(XEN)    [<ffff82c4801e61e6>] process_softirqs+0x6/0x10
>
>(XEN) Xen WARN at sched_credit.c:204
>(XEN) Xen call trace:
>(XEN)    [<ffff82c4801186b9>] csched_vcpu_wake+0x169/0x1a0
>(XEN)    [<ffff82c4801497f2>] update_runstate_area+0x102/0x110
>(XEN)    [<ffff82c48011cdcf>] vcpu_wake+0x13f/0x390
>(XEN)    [<ffff82c48014b1a0>] context_switch+0x760/0xed0
>(XEN)    [<ffff82c48014913d>] vcpu_kick+0x1d/0x80
>(XEN)    [<ffff82c480107feb>] evtchn_set_pending+0xab/0x1b0
>(XEN)    [<ffff82c4801083a9>] evtchn_send+0x129/0x150
>(XEN)    [<ffff82c480108950>] do_event_channel_op+0x4c0/0xf50
>(XEN)    [<ffff82c4801461b5>] reprogram_timer+0x55/0x90
>(XEN)    [<ffff82c4801461b5>] reprogram_timer+0x55/0x90
>(XEN)    [<ffff82c48011fd44>] timer_softirq_action+0x1a4/0x360
>(XEN)    [<ffff82c4801e6169>] syscall_enter+0xa9/0xae
>
>In schedule() this is always "prev" transitioning to RUNSTATE_runnable
>(i.e. _VPF_blocked not set), yet the second call trace shows that
>_VPF_blocked must have been set at that point (otherwise
>vcpu_unblock(), tail-called from vcpu_kick(), would not have called
>vcpu_wake()). If the order wasn't always that shown, or if the two
>traces got intermixed, this could hint at a race - but they are always
>that way, which so far I cannot make sense of.
>

Such a race surely exists, since the two paths each update multiple
fields that are not all protected by the same scheduler lock.
vcpu_unblock manipulates _VPF_blocked without acquiring the scheduler
lock, so the following sequence is possible:

enter schedule() with _VPF_blocked
...
<-vcpu_unblock clears _VPF_blocked. poll_mask hasn't been cleared yet
...
vcpu_runstate_change(
        prev,
        (test_bit(_VPF_blocked, &prev->pause_flags) ? RUNSTATE_blocked :
         (vcpu_runnable(prev) ? RUNSTATE_runnable : RUNSTATE_offline)),
        now);

RUNSTATE_runnable is then chosen, and vcpu_urgent_count_update sets
the is_urgent flag since the poll_mask bit is still set. vcpu_unblock
then clears poll_mask and invokes vcpu_wake, which waits for the
scheduler lock and subsequently sees is_urgent set for a runnable vcpu.

A necessary check is missing from vcpu_urgent_count_update: only a
polling vcpu in the blocked state matters to cpuidle, because if any
runnable vcpu is in a temporary polling state, the idle vcpu won't get
scheduled anyway. is_urgent should be set only when the new runstate
is blocked and the vcpu is in poll_mask. As runstate changes always
happen under the scheduler lock, this effectively serializes the
setting and clearing of is_urgent.

Thanks
Kevin


[-- Attachment #2: Type: text/plain, Size: 138 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xensource.com
http://lists.xensource.com/xen-devel

^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: cpuidle causing Dom0 soft lockups
  2010-02-24  3:08                                     ` Tian, Kevin
@ 2010-02-24  6:51                                       ` Yu, Ke
  2010-02-24  8:56                                         ` Jan Beulich
  0 siblings, 1 reply; 61+ messages in thread
From: Yu, Ke @ 2010-02-24  6:51 UTC (permalink / raw)
  To: Tian, Kevin, Jan Beulich, Keir Fraser; +Cc: xen-devel@lists.xensource.com

[-- Attachment #1: Type: text/plain, Size: 5881 bytes --]

Here is the patch according to Kevin's analysis. Jan, could you give it a try? Thanks

diff -r 55eb480d82ae xen/common/schedule.c
--- a/xen/common/schedule.c     Wed Feb 24 13:24:54 2010 +0800
+++ b/xen/common/schedule.c     Wed Feb 24 13:55:50 2010 +0800
@@ -107,7 +107,8 @@ static inline void vcpu_urgent_count_upd

     if ( unlikely(v->is_urgent) )
     {
-        if ( !test_bit(v->vcpu_id, v->domain->poll_mask) )
+        if ( !test_bit(_VPF_blocked, &v->pause_flags) ||
+                !test_bit(v->vcpu_id, v->domain->poll_mask) )
         {
             v->is_urgent = 0;
             atomic_dec(&per_cpu(schedule_data,v->processor).urgent_count);
@@ -115,7 +116,8 @@ static inline void vcpu_urgent_count_upd
     }
     else
     {
-        if ( unlikely(test_bit(v->vcpu_id, v->domain->poll_mask)) )
+        if ( unlikely(test_bit(_VPF_blocked, &v->pause_flags) &&
+                    test_bit(v->vcpu_id, v->domain->poll_mask)) )
         {
             v->is_urgent = 1;
             atomic_inc(&per_cpu(schedule_data,v->processor).urgent_count);

>-----Original Message-----
>From: Tian, Kevin
>Sent: Wednesday, February 24, 2010 11:09 AM
>To: Jan Beulich; Keir Fraser; Yu, Ke
>Cc: xen-devel@lists.xensource.com
>Subject: RE: [Xen-devel] cpuidle causing Dom0 soft lockups
>
>>From: Jan Beulich [mailto:JBeulich@novell.com]
>>Sent: February 24, 2010 0:45
>>
>>>>> Keir Fraser <keir.fraser@eu.citrix.com> 23.02.10 11:57 >>>
>>>On 23/02/2010 10:37, "Jan Beulich" <JBeulich@novell.com> wrote:
>>>
>>>>> Right. According to the code, there should be no way to hit
>>>>> this BUG_ON. If it happens, that reveals either a bug in the
>>>>> code or the need to add code to migrate the urgent vcpu count.
>>>>> Do you have more information on how this BUG_ON happens?
>>>>
>>>> Obviously there are vCPU-s that get inserted on a run queue with
>>>> is_urgent set (which according to my reading of Keir's description
>>>> shouldn't happen). In particular, this
>>>
>>>Is it possible for a polling VCPU to become runnable without it being
>>>cleared from poll_mask? I suspect maybe that is the problem, and that
>>>needs dealing with, or the proper handling needs to be added to
>>>sched_credit.c.
>>
>>I don't think that's the case, at least not exclusively. Using
>>
>>--- a/xen/common/sched_credit.c
>>+++ b/xen/common/sched_credit.c
>>@@ -201,6 +201,7 @@ __runq_insert(unsigned int cpu, struct c
>>
>>     BUG_ON( __vcpu_on_runq(svc) );
>>     BUG_ON( cpu != svc->vcpu->processor );
>>+WARN_ON(svc->vcpu->is_urgent);//temp
>>
>>     list_for_each( iter, runq )
>>     {
>>--- a/xen/common/schedule.c
>>+++ b/xen/common/schedule.c
>>@@ -139,6 +139,7 @@ static inline void vcpu_runstate_change(
>>
>>ASSERT(spin_is_locked(&per_cpu(schedule_data,v->processor).schedule_lock));
>>
>>     vcpu_urgent_count_update(v);
>>+WARN_ON(v->is_urgent && new_state <= RUNSTATE_runnable);//temp
>>
>>     trace_runstate_change(v, new_state);
>>
>>I get pairs of warnings (i.e. each for the same vCPU):
>>
>>(XEN) Xen WARN at schedule.c:142
>>(XEN) Xen call trace:
>>(XEN)    [<ffff82c48011c8d5>] schedule+0x375/0x510
>>(XEN)    [<ffff82c48011deb8>] __do_softirq+0x58/0x80
>>(XEN)    [<ffff82c4801e61e6>] process_softirqs+0x6/0x10
>>
>>(XEN) Xen WARN at sched_credit.c:204
>>(XEN) Xen call trace:
>>(XEN)    [<ffff82c4801186b9>] csched_vcpu_wake+0x169/0x1a0
>>(XEN)    [<ffff82c4801497f2>] update_runstate_area+0x102/0x110
>>(XEN)    [<ffff82c48011cdcf>] vcpu_wake+0x13f/0x390
>>(XEN)    [<ffff82c48014b1a0>] context_switch+0x760/0xed0
>>(XEN)    [<ffff82c48014913d>] vcpu_kick+0x1d/0x80
>>(XEN)    [<ffff82c480107feb>] evtchn_set_pending+0xab/0x1b0
>>(XEN)    [<ffff82c4801083a9>] evtchn_send+0x129/0x150
>>(XEN)    [<ffff82c480108950>] do_event_channel_op+0x4c0/0xf50
>>(XEN)    [<ffff82c4801461b5>] reprogram_timer+0x55/0x90
>>(XEN)    [<ffff82c4801461b5>] reprogram_timer+0x55/0x90
>>(XEN)    [<ffff82c48011fd44>] timer_softirq_action+0x1a4/0x360
>>(XEN)    [<ffff82c4801e6169>] syscall_enter+0xa9/0xae
>>
>>In schedule() this is always "prev" transitioning to RUNSTATE_runnable
>>(i.e. _VPF_blocked not set), yet the second call trace shows that
>>_VPF_blocked must have been set at that point (otherwise
>>vcpu_unblock(), tail-called from vcpu_kick(), would not have called
>>vcpu_wake()). If the order wasn't always that shown, or if the two
>>traces got intermixed, this could hint at a race - but they are always
>>that way, which so far I cannot make sense of.
>>
>
>Such a race surely exists, since the two paths each update multiple
>fields that are not all protected by the same scheduler lock.
>vcpu_unblock manipulates _VPF_blocked without acquiring the scheduler
>lock, so the following sequence is possible:
>
>enter schedule() with _VPF_blocked
>...
><-vcpu_unblock clears _VPF_blocked. poll_mask hasn't been cleared yet
>...
>vcpu_runstate_change(
>        prev,
>        (test_bit(_VPF_blocked, &prev->pause_flags) ? RUNSTATE_blocked :
>         (vcpu_runnable(prev) ? RUNSTATE_runnable : RUNSTATE_offline)),
>        now);
>
>RUNSTATE_runnable is then chosen, and vcpu_urgent_count_update sets
>the is_urgent flag since the poll_mask bit is still set. vcpu_unblock
>then clears poll_mask and invokes vcpu_wake, which waits for the
>scheduler lock and subsequently sees is_urgent set for a runnable vcpu.
>
>A necessary check is missing from vcpu_urgent_count_update: only a
>polling vcpu in the blocked state matters to cpuidle, because if any
>runnable vcpu is in a temporary polling state, the idle vcpu won't get
>scheduled anyway. is_urgent should be set only when the new runstate
>is blocked and the vcpu is in poll_mask. As runstate changes always
>happen under the scheduler lock, this effectively serializes the
>setting and clearing of is_urgent.
>
>Thanks
>Kevin



^ permalink raw reply	[flat|nested] 61+ messages in thread

* RE: cpuidle causing Dom0 soft lockups
  2010-02-24  6:51                                       ` Yu, Ke
@ 2010-02-24  8:56                                         ` Jan Beulich
  0 siblings, 0 replies; 61+ messages in thread
From: Jan Beulich @ 2010-02-24  8:56 UTC (permalink / raw)
  To: Ke Yu, Kevin Tian; +Cc: xen-devel@lists.xensource.com, Keir Fraser

>>> "Yu, Ke" <ke.yu@intel.com> 24.02.10 07:51 >>>
>Here is the patch according to Kevin's analysis. Jan, could you give it a try? Thanks

Indeed, all of the warnings are gone with that after running for almost
an hour where previously the BUG_ON() (converted to a WARN_ON())
would hit within minutes.

Not sure whether Keir needs another formal submission or could push
this in right away.

Anyway, I believe the original BUG_ON() causing this issue should also
be converted to a WARN_ON(). After all, the state checked doesn't
indicate a problem the system cannot survive - all it indicates is
that the urgent vCPU accounting is broken, and hence deep C states
may get entered more or less frequently than intended.

Jan

^ permalink raw reply	[flat|nested] 61+ messages in thread

end of thread, other threads:[~2010-02-24  8:56 UTC | newest]

Thread overview: 61+ messages
-- links below jump to the message on this page --
2010-01-21  9:51 cpuidle causing Dom0 soft lockups Jan Beulich
2010-01-21 10:11 ` Keir Fraser
2010-01-21 10:16   ` Jan Beulich
2010-01-21 10:26     ` Keir Fraser
2010-01-21 10:53       ` Jan Beulich
2010-01-21 11:03         ` Keir Fraser
2010-01-21 11:13           ` Jan Beulich
2010-02-02  7:54           ` Jan Beulich
2010-02-02  8:13             ` Juergen Gross
2010-02-02 17:07             ` Yu, Ke
2010-02-03 10:15               ` Jan Beulich
2010-02-03 12:10                 ` Tian, Kevin
2010-02-03 12:18                   ` Juergen Gross
2010-02-04  1:40                     ` Tian, Kevin
2010-02-04  6:31                       ` Juergen Gross
2010-02-03 13:20                   ` Jan Beulich
2010-02-04  1:48                     ` Tian, Kevin
2010-02-03 14:46                 ` Yu, Ke
2010-02-03  7:32             ` Yu, Ke
2010-02-03 10:23               ` Jan Beulich
2010-02-05  8:48               ` Jan Beulich
2010-02-05  9:00                 ` Tian, Kevin
2010-02-05  9:14                   ` Jan Beulich
2010-02-05  9:52                     ` Tian, Kevin
2010-02-05 10:37                       ` Jan Beulich
2010-02-05 11:16                         ` Tian, Kevin
2010-02-05 14:59                           ` Jan Beulich
2010-02-05 15:51                             ` Jan Beulich
2010-02-06  1:52                               ` Tian, Kevin
2010-02-08  8:45                                 ` Jan Beulich
2010-02-09  7:55                                   ` Tian, Kevin
2010-02-09 12:35                                     ` Jan Beulich
2010-02-11 14:44                                     ` Jan Beulich
2010-02-11 17:01                                       ` Keir Fraser
2010-02-12  9:21                                         ` Jan Beulich
2010-02-06  1:50                             ` Tian, Kevin
2010-02-08  8:36                               ` Jan Beulich
2010-02-05  9:16                 ` Yu, Ke
2010-02-07 15:36                 ` Yu, Ke
2010-02-08  9:08                   ` Jan Beulich
2010-02-08 10:11                     ` Keir Fraser
2010-02-09  8:02                     ` Tian, Kevin
2010-02-13  2:28                     ` Yu, Ke
2010-02-15  8:24                       ` Keir Fraser
2010-02-15 17:33                       ` Keir Fraser
2010-02-16  4:59                         ` Yu, Ke
2010-02-16  7:59                         ` Jan Beulich
2010-02-16 13:12                           ` Yu, Ke
2010-02-16 14:24                             ` Jan Beulich
2010-02-22 13:29                         ` Jan Beulich
2010-02-22 13:44                           ` Keir Fraser
2010-02-23  9:32                             ` Yu, Ke
2010-02-23 10:37                               ` Jan Beulich
2010-02-23 10:57                                 ` Keir Fraser
2010-02-23 16:44                                   ` Jan Beulich
2010-02-24  3:08                                     ` Tian, Kevin
2010-02-24  6:51                                       ` Yu, Ke
2010-02-24  8:56                                         ` Jan Beulich
2010-01-21 10:35 ` Wei, Gang
2010-01-21 12:07 ` Yu, Ke
2010-01-25  8:08   ` Jan Beulich
