From: Mukesh Rathor
Subject: Re: dom0 hang
Date: Mon, 06 Jul 2009 20:46:52 -0700
Message-ID: <4A52C52C.9080409@oracle.com>
References: <4A426D50.80401@oracle.com> <4A4C2743.5030703@oracle.com> <4A4D0710.10309@oracle.com> <4A4D2253.8070807@oracle.com> <4A4D4D78.1060609@oracle.com> <4A4D5C69.5020409@oracle.com> <4D05DB80B95B23498C72C700BD6C2E0B2F9F599D@pdsmsx502.ccr.corp.intel.com>
In-Reply-To: <4D05DB80B95B23498C72C700BD6C2E0B2F9F599D@pdsmsx502.ccr.corp.intel.com>
Reply-To: mukesh.rathor@oracle.com
To: "Yu, Ke"
Cc: George Dunlap, "Kurt C. Hackel", "Tian, Kevin", "xen-devel@lists.xensource.com"
List-Id: xen-devel@lists.xenproject.org

Well, the problem takes a long time to reproduce (only on certain boxes), and even then it may not always happen. So I want to make sure I understand the fix, as it was pretty hard to debug.

While the fix still allows softirqs to be left pending, I guess it's functionally OK: after interrupts are disabled, acpi_processor_idle() checks for a pending softirq again and just returns. I think the comment about expecting no softirq to be pending should be fixed.

BTW, why can't the tick be suspended in csched_schedule(), before it returns, when it concludes that the next vcpu is the idle vcpu? Wouldn't that make it less intrusive?

thanks,
Mukesh

Yu, Ke wrote:
> Hi Mukesh,
>
> Could you please try the following patch, to see if it can resolve the issue you observed? Thanks.
>
> Best Regards
> Ke
>
> diff -r d461c4d8af17 xen/arch/x86/acpi/cpu_idle.c
> --- a/xen/arch/x86/acpi/cpu_idle.c
> +++ b/xen/arch/x86/acpi/cpu_idle.c
> @@ -228,10 +228,10 @@ static void acpi_processor_idle(void)
>      /*
>       * sched_tick_suspend may raise TIMER_SOFTIRQ by __stop_timer,
>       * which will break the later assumption of no sofirq pending,
> -     * so add do_softirq
> +     * so process the pending timers
>       */
> -    if ( softirq_pending(smp_processor_id()) )
> -        do_softirq();
> +
> +    process_pending_timers();
>
>      /*
>       * Interrupts must be disabled during bus mastering calculations and
>
>> -----Original Message-----
>> From: Mukesh Rathor [mailto:mukesh.rathor@oracle.com]
>> Sent: Friday, July 03, 2009 9:19 AM
>> To: mukesh.rathor@oracle.com
>> Cc: George Dunlap; Tian, Kevin; xen-devel@lists.xensource.com; Yu, Ke; Kurt C. Hackel
>> Subject: Re: [Xen-devel] dom0 hang
>>
>>
>> Hi Kevin/Yu:
>>
>> acpi_processor_idle()
>> {
>>     sched_tick_suspend();
>>     /*
>>      * sched_tick_suspend may raise TIMER_SOFTIRQ by __stop_timer,
>>      * which will break the later assumption of no sofirq pending,
>>      * so add do_softirq
>>      */
>>     if ( softirq_pending(smp_processor_id()) )
>>         do_softirq();   <===============
>>
>>     local_irq_disable();
>>     if ( softirq_pending(smp_processor_id()) )
>>     {
>>         local_irq_enable();
>>         sched_tick_resume();
>>         cpufreq_dbs_timer_resume();
>>         return;
>>     }
>>
>> Wouldn't do_softirq() call the scheduler with the tick suspended, and
>> the scheduler then context-switch to another vcpu (with *_BOOST), which
>> would result in the stuck vcpu I described?
>>
>> thanks
>> Mukesh
>>
>>
>> Mukesh Rathor wrote:
>>> Ah, I totally missed csched_tick():
>>>     if ( !is_idle_vcpu(current) )
>>>         csched_vcpu_acct(cpu);
>>>
>>> Yeah, looks like that's what is going on. I'm still waiting to
>>> reproduce. At first glance, looking at c/s 19460, it seems like
>>> suspend/resume, well at least the resume, should happen in
>>> csched_schedule().....
>>>
>>> thanks,
>>> Mukesh
>>>
>>>
>>> George Dunlap wrote:
>>>> [Oops, adding back in distro list, also adding Kevin Tian and Yu Ke
>>>> who wrote cs 19460]
>>>>
>>>> The functionality I was talking about, subtracting credits and
>>>> clearing BOOST, happens in csched_vcpu_acct() (which is different from
>>>> csched_acct()). vcpu_acct() is called from csched_tick(), which
>>>> should still happen every 10ms on every cpu.
>>>>
>>>> The patch I referred to (cs 19460) disables and re-enables tickers in
>>>> xen/arch/x86/acpi/cpu_idle.c:acpi_processor_idle() every time the
>>>> processor idles. I can't see anywhere else that tickers are disabled,
>>>> so it's probably something not properly re-enabling them again.
>>>>
>>>> Try applying the attached patch to see if that changes anything. (I'm
>>>> on the road, so I can't repro the lockup issue.) If that doesn't
>>>> work, try disabling c-states and see if that helps. Then at least
>>>> we'll know where the problem lies.
>>>>
>>>> -George
>>>>
>>>> On Thu, Jul 2, 2009 at 10:10 PM, Mukesh Rathor wrote:
>>>>> That seems to only suspend csched_pcpu.ticker, which is csched_tick,
>>>>> and that only sorts the local runq.
>>>>>
>>>>> Again, we are concerned about csched_priv.master_ticker, which calls
>>>>> csched_acct? Correct, so I can trace that?
>>>>>
>>>>> thanks,
>>>>> mukesh
>>>>>
>>>>>
>>>>> George Dunlap wrote:
>>>>>> Ah, I see that there have been some changes to tick stuff with the
>>>>>> c-state (e.g., cs 19460). It looks like they're supposed to be going
>>>>>> still, but perhaps the tick_suspend() and tick_resume() aren't being
>>>>>> called properly. Let me take a closer look.
>>>>>>
>>>>>> -George
>>>>>>
>>>>>> On Thu, Jul 2, 2009 at 8:14 PM, Mukesh Rathor wrote:
>>>>>>> George Dunlap wrote:
>>>>>>>> On Thu, Jul 2, 2009 at 4:19 AM, Mukesh Rathor wrote:
>>>>>>>>> dom0 hang:
>>>>>>>>> vcpu0 is trying to wake up a task and in try_to_wake_up() calls
>>>>>>>>> task_rq_lock().
>>>>>>>>> Since the task has cpu set to 1, it gets the runq lock
>>>>>>>>> for vcpu1. Next it calls resched_task(), which results in sending
>>>>>>>>> an IPI to vcpu1. For that, vcpu0 enters the
>>>>>>>>> HYPERVISOR_event_channel_op hypercall and is waiting for it to
>>>>>>>>> return. Meanwhile, vcpu1 got to run and is spinning on its runq
>>>>>>>>> lock in "schedule():spin_lock_irq(&rq->lock);", which vcpu0 is
>>>>>>>>> holding (while it waits to return from the hypercall).
>>>>>>>>>
>>>>>>>>> As I had noticed before, vcpu0 never gets scheduled in xen. So,
>>>>>>>>> looking further into xen:
>>>>>>>>>
>>>>>>>>> xen:
>>>>>>>>> Both vcpus are on the same runq, in this case cpu1's. But the
>>>>>>>>> priority of vcpu1 has been set to CSCHED_PRI_TS_BOOST. As a result,
>>>>>>>>> the scheduler always picks vcpu1, and vcpu0 is starved. Also, I see
>>>>>>>>> in kdb that the scheduler timer is not set on cpu0. That would have
>>>>>>>>> allowed csched_load_balance() to kick in on cpu0. [Also, on
>>>>>>>>> cpu1, the accounting timer, csched_tick, is not set. Although
>>>>>>>>> csched_tick() is running on cpu0, it only checks the runq for cpu0.]
>>>>>>>>>
>>>>>>>>> Looks like c/s 19500 changed csched_schedule():
>>>>>>>>>
>>>>>>>>> -    ret.time = MILLISECS(CSCHED_MSECS_PER_TSLICE);
>>>>>>>>> +    ret.time = (is_idle_vcpu(snext->vcpu) ?
>>>>>>>>> +                -1 : MILLISECS(CSCHED_MSECS_PER_TSLICE));
>>>>>>>>>
>>>>>>>>> The quickest fix for us would be to just back that out.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> BTW, just a comment on the following (all in sched_credit.c):
>>>>>>>>>
>>>>>>>>>     if ( svc->pri == CSCHED_PRI_TS_UNDER &&
>>>>>>>>>          !(svc->flags & CSCHED_FLAG_VCPU_PARKED) )
>>>>>>>>>     {
>>>>>>>>>         svc->pri = CSCHED_PRI_TS_BOOST;
>>>>>>>>>     }
>>>>>>>>> combined with
>>>>>>>>>     if ( snext->pri > CSCHED_PRI_TS_OVER )
>>>>>>>>>         __runq_remove(snext);
>>>>>>>>>
>>>>>>>>> Setting CSCHED_PRI_TS_BOOST as the pri of a vcpu seems dangerous.
>>>>>>>>> To me, since csched_schedule() never checks the time accumulated
>>>>>>>>> by a vcpu at pri CSCHED_PRI_TS_BOOST, that is the same as pinning
>>>>>>>>> a vcpu to a pcpu. If that vcpu never makes progress, essentially,
>>>>>>>>> the system has lost a physical cpu. Optionally, csched_schedule()
>>>>>>>>> should always check the cpu time accumulated and reduce the
>>>>>>>>> priority over time. I can't tell right off if it already does
>>>>>>>>> that, or something like that :)... my 2 cents.
>>>>>>>> Hmm... what's supposed to happen is that eventually a timer tick will
>>>>>>>> interrupt vcpu1. If vcpu1 is set to be "active", then it will be
>>>>>>>> debited 10ms worth of credit. Eventually, it will go into OVER and
>>>>>>>> lose BOOST. If it's "inactive", then when the tick happens, it will
>>>>>>>> be set to "active" and be debited 10ms again, setting it directly
>>>>>>>> into OVER (and thus also losing BOOST).
>>>>>>>>
>>>>>>>> Can you see if the timer ticks are still happening, and perhaps put
>>>>>>>> some tracing in to verify that what I described above is happening?
>>>>>>>>
>>>>>>>> -George
>>>>>>> George,
>>>>>>>
>>>>>>> Is that in csched_acct()? Looks like that's somehow gotten removed. If
>>>>>>> true, then maybe that's the fundamental problem to chase.
>>>>>>>
>>>>>>> Here's what the trq looks like when hung, not in any schedule
>>>>>>> function:
>>>>>>>
>>>>>>> [0]xkdb> dtrq
>>>>>>> CPU[00]: NOW:0x00003f2db9af369e
>>>>>>>   1: exp=0x00003ee31cb32200 fn:csched_tick          data:0000000000000000
>>>>>>>   2: exp=0x00003ee347ece164 fn:time_calibration     data:0000000000000000
>>>>>>>   3: exp=0x00003ee69a28f04b fn:mce_work_fn          data:0000000000000000
>>>>>>>   4: exp=0x00003f055895e25f fn:plt_overflow         data:0000000000000000
>>>>>>>   5: exp=0x00003ee353810216 fn:rtc_update_second    data:ffff83007f0226d8
>>>>>>> CPU[01]: NOW:0x00003f2db9af369e
>>>>>>>   1: exp=0x00003ee30b847988 fn:s_timer_fn           data:0000000000000000
>>>>>>>   2: exp=0x00003f1b309ebd45 fn:pmt_timer_callback   data:ffff83007f022a68
>>>>>>>
>>>>>>> thanks
>>>>>>> Mukesh
>>>>>>>
>>> _______________________________________________
>>> Xen-devel mailing list
>>> Xen-devel@lists.xensource.com
>>> http://lists.xensource.com/xen-devel