From mboxrd@z Thu Jan  1 00:00:00 1970
From: Mukesh Rathor <mukesh.rathor@oracle.com>
Subject: Re: dom0 hang
Date: Thu, 02 Jul 2009 18:18:33 -0700
Message-ID: <4A4D5C69.5020409@oracle.com>
References: <4A426D50.80401@oracle.com>
	<4A4C2743.5030703@oracle.com>		<de76405a0907021050y10a8bea0kc9de92126b58a9e8@mail.gmail.com>		<4A4D0710.10309@oracle.com>		<de76405a0907021349q20e47f5ave3cc86b74c511f0@mail.gmail.com>		<4A4D2253.8070807@oracle.com>	<de76405a0907021437m52f1913au2cd0963dd99eada3@mail.gmail.com>
	<4A4D4D78.1060609@oracle.com>
Reply-To: mukesh.rathor@oracle.com
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xensource.com>
In-Reply-To: <4A4D4D78.1060609@oracle.com>
List-Unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xensource.com>
List-Help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-Subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: mukesh.rathor@oracle.com
Cc: George Dunlap <George.Dunlap@eu.citrix.com>, "Kurt C. Hackel" <kurt.hackel@oracle.com>, "Tian,
	Kevin" <kevin.tian@intel.com>, xen-devel@lists.xensource.com, Yu Ke <ke.yu@intel.com>
List-Id: xen-devel@lists.xenproject.org


Hi Kevin/Yu:

acpi_processor_idle()
{
     sched_tick_suspend();
      /*
      * sched_tick_suspend may raise TIMER_SOFTIRQ by __stop_timer,
      * which will break the later assumption of no sofirq pending,
      * so add do_softirq
      */
     if ( softirq_pending(smp_processor_id()) )
         do_softirq();             <===============

     local_irq_disable();
     if ( softirq_pending(smp_processor_id()) )
     {
         local_irq_enable();
         sched_tick_resume();
         cpufreq_dbs_timer_resume();
         return;
     }

wouldn't the do_softirq() call scheduler with tick suspended, and
the scheduler then context switches to another vcpu0 (with *_BOOST) which
would result in the stuck vcpu I described?

thanks
Mukesh


Mukesh Rathor wrote:
> ah, i totally missed csched_tick():
>     if ( !is_idle_vcpu(current) )
>         csched_vcpu_acct(cpu);
> 
> yeah, looks like that's what is going on. i'm still waiting to 
> reproduce. at first glance, looking at c/s 19460, seems like 
> suspend/resume, well at least the resume, should happen in 
> csched_schedule().....
> 
> thanks,
> Mukesh
> 
> 
> George Dunlap wrote:
>> [Oops, adding back in distro list, also adding Kevin Tian and Yu Ke
>> who wrote cs 19460]
>>
>> The functionality I was talking about, subtracting credits and
>> clearing BOOST, happens in csched_vcpu_acct() (which is different than
>> csched_acct()).  vcpu_acct() is called from csched_tick(), which
>> should still happen every 10ms on every cpu.
>>
>> The patch I referred to (cs 19460) disables and re-enables tickers in
>> xen/arch/x86/acpi/cpu_idle.c:acpi_processor_idle() every time the
>> processor idles.  I can't see anywhere else that tickers are disabled,
>> so it's probably something not properly re-enabling them again.
>>
>> Try applying the attached patch to see if that changes anything.  (I'm
>> on the road, so I can't repro the lockup issue.)  If that doesn't
>> work, try disabling c-states and see if that helps.  Then at least
>> we'll know where the problem lies.
>>
>>  -George
>>
>> On Thu, Jul 2, 2009 at 10:10 PM, Mukesh 
>> Rathor<mukesh.rathor@oracle.com> wrote:
>>> that seems to only suspend csched_pcpu.ticker which is csched_tick 
>>> that is
>>> only sorting local runq.
>>>
>>> again, we are concerned about csched_priv.master_ticker that calls
>>> csched_acct? correct, so i can trace that?
>>>
>>> thanks,
>>> mukesh
>>>
>>>
>>> George Dunlap wrote:
>>>> Ah, I see that there's been some changes to tick stuff with the
>>>> c-state (e.g., cs 19460).  It looks like they're supposed to be going
>>>> still, but perhaps the tick_suspend() and tick_resume() aren't being
>>>> called properly.  Let me take a closer look.
>>>>
>>>>  -George
>>>>
>>>> On Thu, Jul 2, 2009 at 8:14 PM, Mukesh Rathor<mukesh.rathor@oracle.com>
>>>> wrote:
>>>>> George Dunlap wrote:
>>>>>> On Thu, Jul 2, 2009 at 4:19 AM, Mukesh 
>>>>>> Rathor<mukesh.rathor@oracle.com>
>>>>>> wrote:
>>>>>>> dom0 hang:
>>>>>>>  vcpu0 is trying to wakeup a task and in try_to_wake_up() calls
>>>>>>>  task_rq_lock(). since the task has cpu set to 1, it gets runq lock
>>>>>>>  for vcpu1. next it calls resched_task() which results in sending 
>>>>>>> IPI
>>>>>>>  to vcpu1. for that, vcpu0 gets into the HYPERVISOR_event_channel_op
>>>>>>>  HCALL and is waiting to return. Meanwhile, vcpu1 got running, 
>>>>>>> and is
>>>>>>>  spinning on it's runq lock in 
>>>>>>> "schedule():spin_lock_irq(&rq->lock);",
>>>>>>>  that vcpu0 is holding (and is waiting to return from the HCALL).
>>>>>>>
>>>>>>>  As I had noticed before, vcpu0 never gets scheduled in xen. So
>>>>>>>  looking further into xen:
>>>>>>>
>>>>>>> xen:
>>>>>>>  Both vcpu's are on the same runq, in this case cpu1. But the
>>>>>>>  priority of vcpu1 has been set to CSCHED_PRI_TS_BOOST. As a result,
>>>>>>>  the scheduler always picks vcpu1, and vcpu0 is starved. Also, I 
>>>>>>> see in
>>>>>>>  kdb that the scheduler timer is not set on cpu 0. That would've
>>>>>>>  allowed csched_load_balance() to kick in on cpu0. [Also, on
>>>>>>>  cpu1, the accounting timer, csched_tick, is not set.  Altho,
>>>>>>>  csched_tick() is running on cpu0, it only checks runq for cpu0.]
>>>>>>>
>>>>>>>  Looks like c/s 19500 changed csched_schedule():
>>>>>>>
>>>>>>> -    ret.time = MILLISECS(CSCHED_MSECS_PER_TSLICE);
>>>>>>> +    ret.time = (is_idle_vcpu(snext->vcpu) ?
>>>>>>> +                -1 : MILLISECS(CSCHED_MSECS_PER_TSLICE));
>>>>>>>
>>>>>>>  The quickest fix for us would be to just back that out.
>>>>>>>
>>>>>>>
>>>>>>>  BTW, just a comment on following (all in sched_credit.c):
>>>>>>>
>>>>>>>    if ( svc->pri == CSCHED_PRI_TS_UNDER &&
>>>>>>>       !(svc->flags & CSCHED_FLAG_VCPU_PARKED) )
>>>>>>>    {
>>>>>>>       svc->pri = CSCHED_PRI_TS_BOOST;
>>>>>>>    }
>>>>>>>  comibined with
>>>>>>>  if ( snext->pri > CSCHED_PRI_TS_OVER )
>>>>>>>          __runq_remove(snext);
>>>>>>>
>>>>>>>    Setting CSCHED_PRI_TS_BOOST as pri of vcpu seems dangerous. To 
>>>>>>> me,
>>>>>>>    since csched_schedule() never checks for time accumulated by a
>>>>>>>    vcpu at pri CSCHED_PRI_TS_BOOST, that is same as pinning a 
>>>>>>> vcpu to a
>>>>>>>    pcpu. if that vcpu never makes progress, essentially, the system
>>>>>>>    has lost a physical cpu.  Optionally, csched_schedule() should
>>>>>>> always
>>>>>>>    check for cpu time accumulated and reduce the priority over time.
>>>>>>>    I can't tell right off if it already does that. or something like
>>>>>>>    that :)...  my 2 cents.
>>>>>> Hmm... what's supposed to happen is that eventually a timer tick will
>>>>>> interrupt vcpu1.  If cpu1 is set to be "active", then it will be
>>>>>> debited 10ms worth of credit.  Eventually, it will go into OVER, and
>>>>>> lose BOOST.  If it's "inactive", then when the tick happens, it will
>>>>>> be set to "active" and be debited 10ms again, setting it directly 
>>>>>> into
>>>>>> OVER (and thus also losing boost).
>>>>>>
>>>>>> Can you see if the timer ticks are still happening, and perhaps put
>>>>>> some tracing it to verify that what I described above is happening?
>>>>>>
>>>>>>  -George
>>>>> George,
>>>>>
>>>>> Is that in csched_acct()? Looks like that's somehow gotten removed. If
>>>>> true, then may be that's the fundamental problem to chase.
>>>>>
>>>>> Here's what the trq looks like when hung, not in any schedule 
>>>>> function:
>>>>>
>>>>> [0]xkdb> dtrq
>>>>> CPU[00]: NOW:0x00003f2db9af369e
>>>>>  1: exp=0x00003ee31cb32200 fn:csched_tick data:0000000000000000
>>>>>  2: exp=0x00003ee347ece164 fn:time_calibration data:0000000000000000
>>>>>  3: exp=0x00003ee69a28f04b fn:mce_work_fn data:0000000000000000
>>>>>  4: exp=0x00003f055895e25f fn:plt_overflow data:0000000000000000
>>>>>  5: exp=0x00003ee353810216 fn:rtc_update_second data:ffff83007f0226d8
>>>>>
>>>>> CPU[01]: NOW:0x00003f2db9af369e
>>>>>  1: exp=0x00003ee30b847988 fn:s_timer_fn data:0000000000000000
>>>>>  2: exp=0x00003f1b309ebd45 fn:pmt_timer_callback data:ffff83007f022a68
>>>>>
>>>>>
>>>>> thanks
>>>>> Mukesh
>>>>>
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xensource.com
> http://lists.xensource.com/xen-devel