From mboxrd@z Thu Jan  1 00:00:00 1970
From: Mukesh Rathor <mukesh.rathor@oracle.com>
Subject: Re: dom0 hang
Date: Mon, 06 Jul 2009 20:46:52 -0700
Message-ID: <4A52C52C.9080409@oracle.com>
References: <4A426D50.80401@oracle.com>
	<4A4C2743.5030703@oracle.com>		<de76405a0907021050y10a8bea0kc9de92126b58a9e8@mail.gmail.com>		<4A4D0710.10309@oracle.com>		<de76405a0907021349q20e47f5ave3cc86b74c511f0@mail.gmail.com>		<4A4D2253.8070807@oracle.com>	<de76405a0907021437m52f1913au2cd0963dd99eada3@mail.gmail.com>
	<4A4D4D78.1060609@oracle.com> <4A4D5C69.5020409@oracle.com>
	<4D05DB80B95B23498C72C700BD6C2E0B2F9F599D@pdsmsx502.ccr.corp.intel.com>
Reply-To: mukesh.rathor@oracle.com
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xensource.com>
In-Reply-To: <4D05DB80B95B23498C72C700BD6C2E0B2F9F599D@pdsmsx502.ccr.corp.intel.com>
List-Unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xensource.com>
List-Help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-Subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: "Yu, Ke" <ke.yu@intel.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>, "Kurt C. Hackel" <kurt.hackel@oracle.com>, "Tian,
	Kevin" <kevin.tian@intel.com>, "xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>
List-Id: xen-devel@lists.xenproject.org


Well, the problem takes a long time to reproduce (and only on certain
boxes), and even then it may not always happen. So I want to make sure I
understand the fix, as it was pretty hard to debug.

While the fix will still allow softirqs to be pending, I guess it's
functionally OK, because after local_irq_disable() it checks
softirq_pending() and, if anything is pending, just returns. I think the
comment about expecting no softirq pending should be fixed.
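
To spell out the flow I mean, here is a minimal standalone sketch in C.
Only the names sched_tick_suspend, sched_tick_resume and
process_pending_timers come from the thread; the stub bodies, the two
flags, idle_entry() and its raised_late parameter are made up for
illustration and are not the Xen code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the Xen primitives named in the thread;
 * none of these bodies is the real hypervisor code. */
static bool softirq_is_pending;   /* models softirq_pending(smp_processor_id()) */
static bool tick_suspended;       /* true while the scheduler tick is stopped */

static void sched_tick_suspend(void)     { tick_suspended = true; }
static void sched_tick_resume(void)      { tick_suspended = false; }
static void process_pending_timers(void) { /* stub: would run expired timers */ }

/*
 * Models the entry path of acpi_processor_idle() with the patch applied.
 * raised_late simulates a softirq that shows up after the timers were
 * processed but before interrupts are disabled.
 * Returns true if the cpu would go on to enter a C-state.
 */
static bool idle_entry(bool raised_late)
{
    sched_tick_suspend();
    process_pending_timers();

    softirq_is_pending = raised_late;

    /* local_irq_disable() would be here */
    if (softirq_is_pending) {
        /* local_irq_enable() would be here */
        sched_tick_resume();      /* bail out with the tick running again */
        return false;
    }

    assert(tick_suspended);       /* only sleep with the tick stopped */
    return true;
}
```

In this model idle_entry(true) returns false with the tick resumed, so a
softirq left pending after process_pending_timers() only costs one more
trip through the idle loop.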

BTW, why can't the tick be suspended when csched_schedule() concludes
it's the idle vcpu, before returning? Wouldn't that make it less intrusive?

thanks,
Mukesh


Yu, Ke wrote:
> Hi Mukesh,
> 
> Could you please try the following patch, to see if it can resolve the issue you observed? Thanks.
> 
> Best Regards
> Ke
> 
> diff -r d461c4d8af17 xen/arch/x86/acpi/cpu_idle.c
> --- a/xen/arch/x86/acpi/cpu_idle.c
> +++ b/xen/arch/x86/acpi/cpu_idle.c
> @@ -228,10 +228,10 @@ static void acpi_processor_idle(void)
>      /*
>       * sched_tick_suspend may raise TIMER_SOFTIRQ by __stop_timer,
>       * which will break the later assumption of no sofirq pending,
> -     * so add do_softirq
> +     * so process the pending timers
>       */
> -    if ( softirq_pending(smp_processor_id()) )
> -        do_softirq();
> +
> +    process_pending_timers();
> 
>      /*
>       * Interrupts must be disabled during bus mastering calculations and
> 
>> -----Original Message-----
>> From: Mukesh Rathor [mailto:mukesh.rathor@oracle.com]
>> Sent: Friday, July 03, 2009 9:19 AM
>> To: mukesh.rathor@oracle.com
>> Cc: George Dunlap; Tian, Kevin; xen-devel@lists.xensource.com; Yu, Ke; Kurt C.
>> Hackel
>> Subject: Re: [Xen-devel] dom0 hang
>>
>>
>> Hi Kevin/Yu:
>>
>> acpi_processor_idle()
>> {
>>     sched_tick_suspend();
>>      /*
>>      * sched_tick_suspend may raise TIMER_SOFTIRQ by __stop_timer,
>>      * which will break the later assumption of no sofirq pending,
>>      * so add do_softirq
>>      */
>>     if ( softirq_pending(smp_processor_id()) )
>>         do_softirq();             <===============
>>
>>     local_irq_disable();
>>     if ( softirq_pending(smp_processor_id()) )
>>     {
>>         local_irq_enable();
>>         sched_tick_resume();
>>         cpufreq_dbs_timer_resume();
>>         return;
>>     }
>>
>> wouldn't the do_softirq() call scheduler with tick suspended, and
>> the scheduler then context switches to another vcpu0 (with *_BOOST) which
>> would result in the stuck vcpu I described?
>>
>> thanks
>> Mukesh
>>
>>
>> Mukesh Rathor wrote:
>>> ah, i totally missed csched_tick():
>>>     if ( !is_idle_vcpu(current) )
>>>         csched_vcpu_acct(cpu);
>>>
>>> yeah, looks like that's what is going on. i'm still waiting to
>>> reproduce. at first glance, looking at c/s 19460, seems like
>>> suspend/resume, well at least the resume, should happen in
>>> csched_schedule().....
>>>
>>> thanks,
>>> Mukesh
>>>
>>>
>>> George Dunlap wrote:
>>>> [Oops, adding back in distro list, also adding Kevin Tian and Yu Ke
>>>> who wrote cs 19460]
>>>>
>>>> The functionality I was talking about, subtracting credits and
>>>> clearing BOOST, happens in csched_vcpu_acct() (which is different than
>>>> csched_acct()).  vcpu_acct() is called from csched_tick(), which
>>>> should still happen every 10ms on every cpu.
>>>>
>>>> The patch I referred to (cs 19460) disables and re-enables tickers in
>>>> xen/arch/x86/acpi/cpu_idle.c:acpi_processor_idle() every time the
>>>> processor idles.  I can't see anywhere else that tickers are disabled,
>>>> so it's probably something not properly re-enabling them again.
>>>>
>>>> Try applying the attached patch to see if that changes anything.  (I'm
>>>> on the road, so I can't repro the lockup issue.)  If that doesn't
>>>> work, try disabling c-states and see if that helps.  Then at least
>>>> we'll know where the problem lies.
>>>>
>>>>  -George
>>>>
>>>> On Thu, Jul 2, 2009 at 10:10 PM, Mukesh
>>>> Rathor<mukesh.rathor@oracle.com> wrote:
>>>>> that seems to only suspend csched_pcpu.ticker which is csched_tick that is
>>>>> only sorting local runq.
>>>>>
>>>>> again, we are concerned about csched_priv.master_ticker that calls
>>>>> csched_acct? correct, so i can trace that?
>>>>>
>>>>> thanks,
>>>>> mukesh
>>>>>
>>>>>
>>>>> George Dunlap wrote:
>>>>>> Ah, I see that there's been some changes to tick stuff with the
>>>>>> c-state (e.g., cs 19460).  It looks like they're supposed to be going
>>>>>> still, but perhaps the tick_suspend() and tick_resume() aren't being
>>>>>> called properly.  Let me take a closer look.
>>>>>>
>>>>>>  -George
>>>>>>
>>>>>> On Thu, Jul 2, 2009 at 8:14 PM, Mukesh
>>>>>> Rathor<mukesh.rathor@oracle.com>
>>>>>> wrote:
>>>>>>> George Dunlap wrote:
>>>>>>>> On Thu, Jul 2, 2009 at 4:19 AM, Mukesh
>>>>>>>> Rathor<mukesh.rathor@oracle.com>
>>>>>>>> wrote:
>>>>>>>>> dom0 hang:
>>>>>>>>>  vcpu0 is trying to wakeup a task and in try_to_wake_up() calls
>>>>>>>>>  task_rq_lock(). since the task has cpu set to 1, it gets runq lock
>>>>>>>>>  for vcpu1. next it calls resched_task() which results in sending IPI
>>>>>>>>>  to vcpu1. for that, vcpu0 gets into the HYPERVISOR_event_channel_op
>>>>>>>>>  HCALL and is waiting to return. Meanwhile, vcpu1 got running, and is
>>>>>>>>>  spinning on it's runq lock in "schedule():spin_lock_irq(&rq->lock);",
>>>>>>>>>  that vcpu0 is holding (and is waiting to return from the HCALL).
>>>>>>>>>
>>>>>>>>>  As I had noticed before, vcpu0 never gets scheduled in xen. So
>>>>>>>>>  looking further into xen:
>>>>>>>>>
>>>>>>>>> xen:
>>>>>>>>>  Both vcpu's are on the same runq, in this case cpu1. But the
>>>>>>>>>  priority of vcpu1 has been set to CSCHED_PRI_TS_BOOST. As a result,
>>>>>>>>>  the scheduler always picks vcpu1, and vcpu0 is starved. Also, I see in
>>>>>>>>>  kdb that the scheduler timer is not set on cpu 0. That would've
>>>>>>>>>  allowed csched_load_balance() to kick in on cpu0. [Also, on
>>>>>>>>>  cpu1, the accounting timer, csched_tick, is not set.  Altho,
>>>>>>>>>  csched_tick() is running on cpu0, it only checks runq for cpu0.]
>>>>>>>>>
>>>>>>>>>  Looks like c/s 19500 changed csched_schedule():
>>>>>>>>>
>>>>>>>>> -    ret.time = MILLISECS(CSCHED_MSECS_PER_TSLICE);
>>>>>>>>> +    ret.time = (is_idle_vcpu(snext->vcpu) ?
>>>>>>>>> +                -1 : MILLISECS(CSCHED_MSECS_PER_TSLICE));
>>>>>>>>>
>>>>>>>>>  The quickest fix for us would be to just back that out.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  BTW, just a comment on following (all in sched_credit.c):
>>>>>>>>>
>>>>>>>>>    if ( svc->pri == CSCHED_PRI_TS_UNDER &&
>>>>>>>>>       !(svc->flags & CSCHED_FLAG_VCPU_PARKED) )
>>>>>>>>>    {
>>>>>>>>>       svc->pri = CSCHED_PRI_TS_BOOST;
>>>>>>>>>    }
>>>>>>>>>  combined with
>>>>>>>>>  if ( snext->pri > CSCHED_PRI_TS_OVER )
>>>>>>>>>          __runq_remove(snext);
>>>>>>>>>
>>>>>>>>>    Setting CSCHED_PRI_TS_BOOST as pri of vcpu seems dangerous. To me,
>>>>>>>>>    since csched_schedule() never checks for time accumulated by a
>>>>>>>>>    vcpu at pri CSCHED_PRI_TS_BOOST, that is same as pinning a vcpu to a
>>>>>>>>>    pcpu. if that vcpu never makes progress, essentially, the system
>>>>>>>>>    has lost a physical cpu.  Optionally, csched_schedule() should always
>>>>>>>>>    check for cpu time accumulated and reduce the priority over time.
>>>>>>>>>    I can't tell right off if it already does that. or something like
>>>>>>>>>    that :)...  my 2 cents.
>>>>>>>> Hmm... what's supposed to happen is that eventually a timer tick will
>>>>>>>> interrupt vcpu1.  If cpu1 is set to be "active", then it will be
>>>>>>>> debited 10ms worth of credit.  Eventually, it will go into OVER, and
>>>>>>>> lose BOOST.  If it's "inactive", then when the tick happens, it will
>>>>>>>> be set to "active" and be debited 10ms again, setting it directly into
>>>>>>>> OVER (and thus also losing boost).
>>>>>>>>
>>>>>>>> Can you see if the timer ticks are still happening, and perhaps put
>>>>>>>> some tracing it to verify that what I described above is happening?
>>>>>>>>
>>>>>>>>  -George
>>>>>>> George,
>>>>>>>
>>>>>>> Is that in csched_acct()? Looks like that's somehow gotten removed. If
>>>>>>> true, then may be that's the fundamental problem to chase.
>>>>>>>
>>>>>>> Here's what the trq looks like when hung, not in any schedule function:
>>>>>>>
>>>>>>> [0]xkdb> dtrq
>>>>>>> CPU[00]: NOW:0x00003f2db9af369e
>>>>>>>  1: exp=0x00003ee31cb32200 fn:csched_tick data:0000000000000000
>>>>>>>  2: exp=0x00003ee347ece164 fn:time_calibration data:0000000000000000
>>>>>>>  3: exp=0x00003ee69a28f04b fn:mce_work_fn data:0000000000000000
>>>>>>>  4: exp=0x00003f055895e25f fn:plt_overflow data:0000000000000000
>>>>>>>  5: exp=0x00003ee353810216 fn:rtc_update_second data:ffff83007f0226d8
>>>>>>> CPU[01]: NOW:0x00003f2db9af369e
>>>>>>>  1: exp=0x00003ee30b847988 fn:s_timer_fn data:0000000000000000
>>>>>>>  2: exp=0x00003f1b309ebd45 fn:pmt_timer_callback data:ffff83007f022a68
>>>>>>>
>>>>>>> thanks
>>>>>>> Mukesh
>>>>>>>