From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <4C77FF51.4000801@domain.hid>
Date: Fri, 27 Aug 2010 20:09:21 +0200
From: Jan Kiszka
MIME-Version: 1.0
References: <4C7783C9.5090501@domain.hid> <4C7788A5.2020307@domain.hid>
 <4C77B054.8080609@domain.hid> <4C77B208.5090802@domain.hid>
 <4C77F142.4030103@domain.hid> <4C77F5EF.3010701@domain.hid>
 <4C77F65E.5000208@domain.hid> <4C77FB05.4090301@domain.hid>
 <4C77FD89.5040100@domain.hid>
In-Reply-To: <4C77FD89.5040100@domain.hid>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
Subject: Re: [Xenomai-core] False positive XENO_BUGON(NUCLEUS, need_resched == 0)?
List-Id: Xenomai life and development
To: Gilles Chanteperdrix
Cc: xenomai-core

Gilles Chanteperdrix wrote:
> Jan Kiszka wrote:
>> Gilles Chanteperdrix wrote:
>>> Gilles Chanteperdrix wrote:
>>>> Jan Kiszka wrote:
>>>>> Gilles Chanteperdrix wrote:
>>>>>> Jan Kiszka wrote:
>>>>>>> Gilles Chanteperdrix wrote:
>>>>>>>> Jan Kiszka wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I'm hitting that bug check in __xnpod_schedule after
>>>>>>>>> xnintr_clock_handler issued a xnpod_schedule like this:
>>>>>>>>>
>>>>>>>>>     if (--sched->inesting == 0) {
>>>>>>>>>         __clrbits(sched->status, XNINIRQ);
>>>>>>>>>         xnpod_schedule();
>>>>>>>>>     }
>>>>>>>>>
>>>>>>>>> Either the assumption behind the bug check is no longer correct (no call
>>>>>>>>> to xnpod_schedule() without a real need), or we should check for
>>>>>>>>> __xnpod_test_resched(sched) in xnintr_clock_handler (but under nklock then).
>>>>>>>>>
>>>>>>>>> Comments?
>>>>>>>> You probably have a real bug. This BUG_ON means that the scheduler is
>>>>>>>> about to switch context for real, whereas the resched bit is not set,
>>>>>>>> which is wrong.
>>>>>>> This happened over my 2.6.35 port - maybe some spurious IRQ enabling.
>>>>>>> Debugging further...
>>>>>> You should look for something which changes the scheduler state without
>>>>>> setting the resched bit, or for something which clears the bit without
>>>>>> taking the scheduler changes into account.
>>>>> It looks like a generic Xenomai issue on SMP boxes, though a mostly
>>>>> harmless one:
>>>>>
>>>>> The task that was scheduled in without XNRESCHED set locally has been
>>>>> woken up by a remote CPU. The waker requeued the task and set the
>>>>> resched condition for itself and in the resched proxy mask for the
>>>>> remote CPU. But there is at least one place in the Xenomai code where we
>>>>> drop the nklock between xnsched_set_resched and xnpod_schedule:
>>>>> do_taskexit_event (I bet there are even more). Now the resched target
>>>>> CPU runs into a timer handler, issues xnpod_schedule unconditionally,
>>>>> and happens to find the woken-up task before it is actually informed via
>>>>> an IPI.
>>>>>
>>>>> I think this is a harmless race, but it ruins the debug assertion
>>>>> "need_resched != 0".
>>>> Not that harmless, since without the debugging code, we would miss the
>>>> reschedule too...
>>> Ok. But we would finally reschedule when handling the IPI. So, the
>>> effect we see is a useless delay in the rescheduling.
>>>
>> Depends on the POV: The interrupt or context switch between set_resched
>> and xnpod_reschedule that may defer rescheduling may also hit us before
>> we were able to wake up the thread at all. The worst case should not
>> differ significantly.
>
> Yes, and whether we set the bit and call xnpod_schedule atomically does
> not really matter either: the IPI takes time to propagate, and since
> xnarch_send_ipi does not wait for the IPI to have been received on the
> remote CPU, there is no guarantee that xnpod_schedule could not have
> been called in the mean time.

Indeed.

>
> More importantly, since in order to do an action on a remote xnsched_t,
> we need to hold the nklock, is there any point in not setting the
> XNRESCHED bit on that distant structure, at the same time as when we set
> the cpu bit on the local sched structure mask and send the IPI? This
> way, setting the XNRESCHED bit in the IPI handler would no longer be
> necessary, and we would avoid the race.
>

I guess so. The IPI isn't more than a hint that something /may/ have
changed in the schedule anyway.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux