From mboxrd@z Thu Jan 1 00:00:00 1970 Message-ID: <4A4C5ADB.5080602@domain.hid> Date: Thu, 02 Jul 2009 08:59:39 +0200 From: Gilles Chanteperdrix MIME-Version: 1.0 References: <4A48FB71.6070506@domain.hid> <4A49CD81.4060706@domain.hid> <4A49CFF0.7070202@domain.hid> <1246353623.7803.21.camel@domain.hid> <4A49D935.3060900@domain.hid> <1246353913.7803.24.camel@domain.hid> <4A49DA4E.2020604@domain.hid> <1246354047.7803.25.camel@domain.hid> <4A49DC0A.5000208@domain.hid> <4A4A391B.8000700@domain.hid> <4A4B4ED4.6020208@domain.hid> <4A4B558D.20307@domain.hid> <4A4B58E9.4050407@domain.hid> <4A4B5985.3070504@domain.hid> <4A4B8617.5000704@domain.hid> <4A4B8851.9070005@domain.hid> <4A4B8912.1060700@domain.hid> <4A4BA334.8090701@domain.hid> <1246472157.7803.62.camel@domain.hid> <1246472833.7803.67.camel@domain.hid> <4A4C15DC.80106@domain.hid> <4A4C52B2.9070000@domain.hid> In-Reply-To: <4A4C52B2.9070000@domain.hid> Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit Subject: Re: [Xenomai-core] x86: Endless minor faults List-Id: Xenomai life and development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Jan Kiszka Cc: xenomai-core Jan Kiszka wrote: > Gilles Chanteperdrix wrote: >> Philippe Gerum wrote: >>> On Wed, 2009-07-01 at 20:15 +0200, Philippe Gerum wrote: >>>> On Wed, 2009-07-01 at 19:56 +0200, Jan Kiszka wrote: >>>>> Jan Kiszka wrote: >>>>>> Jan Kiszka wrote: >>>>>>> Gilles Chanteperdrix wrote: >>>>>>>> Jan Kiszka wrote: >>>>>>>>> Jan Kiszka wrote: >>>>>>>>>> Gilles Chanteperdrix wrote: >>>>>>>>>>> Jan Kiszka wrote: >>>>>>>>>>>> Jan Kiszka wrote: >>>>>>>>>>>>> It's still unclear what goes on precisely, we are still digging, but the >>>>>>>>>>>>> test system that can produce this is highly contended. >>>>>>>>>>>> Short update: Further instrumentation revealed that cr3 differs from >>>>>>>>>>>> active_mm->pgd while we are looping over that fault, ie. the kernel >>>>>>>>>>>> tries to fixup the wrong mm. And that means we have some open race >>>>>>>>>>>> window between updating cr3 and active_mm somewhere (isn't switch_mm run >>>>>>>>>>>> in a preemptible manner now?). >>>>>>>>>>> Maybe the rsp is wrong and leads you to the wrong active_mm ? >>>>>>>>>>> >>>>>>>>>>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and we are now >>>>>>>>>>>> checking if it makes a difference. Digging deeper into the code in the >>>>>>>>>>>> meanwhile... >>>>>>>>>>> As you have found out in the mean time, we do not use unlocked context >>>>>>>>>>> switches on x86. >>>>>>>>>>> >>>>>>>>>> Yes. >>>>>>>>>> >>>>>>>>>> The last question I asked myself (but couldn't answer yet due to other >>>>>>>>>> activity) was: Where are the local_irq_disable/enable_hw around >>>>>>>>>> switch_mm for its Linux callers? >>>>>>>>> Ha, that's the point: only activate_mm is protected, but we have more >>>>>>>>> spots in 2.6.29 and maybe other kernels, too! >>>>>>>> Ok, I do not see where switch_mm is called with IRQs off. What I found, >>>>>>> We have two direct callers of switch_mm in sched.c and one in fs/aio.c. >>>>>>> Both need protection (I pushed IRQ disabling into switch_mm), but that >>>>>>> is not enough according to current tests. It seems to reduce to >>>>>>> probability of corruption, though. >>>>>>> >>>>>>>> however, is that leave_mm sets the cr3 and just clears >>>>>>>> active_mm->cpu_vm_mask. So, at this point, we have a discrepancy between >>>>>>>> cr3 and active_mm. I do not know what could happen if Xenomai could >>>>>>>> interrupt leave_mm between the cpu_clear and the write_cr3. From what I >>>>>>>> understand, switch_mm called by Xenomai upon return to root would re-set >>>>>>>> the bit, and re-set cr3, which would be set to the kernel cr3 right >>>>>>>> after that, but this would result in the active_mm.cpu_vm_mask bit being >>>>>>>> set instead of cleared as expected. So, maybe an irqs off section is >>>>>>>> missing in leave_mm. >>>>>>> leave_mm is already protected by its caller smp_invalidate_interrupt - >>>>>>> but now I'm parsing context_switch /wrt to lazy tlb. >>>>>>> >>>>>> Hmm... lazy tlb: This means a new task is switched in and has active_mm >>>>>> != mm. But do_page_fault reads task->mm... Just thoughts, no clear >>>>>> picture yet. >>>>>> >>>>> Looking closer at the call sites of switch_mm, I think our the problem >>>>> is mostly related to use_mm from fs/aio.c (customer is using aio >>>>> heavily). But other callers need protection, too. We are going to test >>>>> this patch tomorrow: >>>>> >>>>> diff --git a/fs/aio.c b/fs/aio.c >>>>> index 76da125..d90fca3 100644 >>>>> --- a/fs/aio.c >>>>> +++ b/fs/aio.c >>>>> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm) >>>>> { >>>>> struct mm_struct *active_mm; >>>>> struct task_struct *tsk = current; >>>>> + unsigned long flags; >>>>> >>>>> task_lock(tsk); >>>>> active_mm = tsk->active_mm; >>>>> atomic_inc(&mm->mm_count); >>>>> + local_irq_save_hw_cond(flags); >>>>> tsk->mm = mm; >>>>> tsk->active_mm = mm; >>>>> switch_mm(active_mm, mm, tsk); >>>>> + local_irq_restore_hw_cond(flags); >>>>> task_unlock(tsk); >>>>> >>>>> mmdrop(active_mm); >>>>> diff --git a/kernel/sched.c b/kernel/sched.c >>>>> index aa8f86c..8c545a4 100644 >>>>> --- a/kernel/sched.c >>>>> +++ b/kernel/sched.c >>>>> @@ -2668,8 +2668,12 @@ context_switch(struct rq *rq, struct task_struct *prev, >>>>> next->active_mm = oldmm; >>>>> atomic_inc(&oldmm->mm_count); >>>>> enter_lazy_tlb(oldmm, next); >>>>> - } else >>>>> + } else { >>>>> + unsigned long flags; >>>>> + local_irq_save_hw_cond(flags); >>>>> switch_mm(oldmm, mm, next); >>>>> + local_irq_restore_hw_cond(flags); >>>>> + } >>>>> >>>>> if (unlikely(!prev->mm)) { >>>>> prev->active_mm = NULL; >>>>> @@ -6406,8 +6410,12 @@ void idle_task_exit(void) >>>>> >>>>> BUG_ON(cpu_online(smp_processor_id())); >>>>> >>>>> - if (mm != &init_mm) >>>>> + if (mm != &init_mm) { >>>>> + unsigned long flags; >>>>> + local_irq_save_hw_cond(flags); >>>>> switch_mm(mm, &init_mm, current); >>>>> + local_irq_restore_hw_cond(flags); >>>>> + } >>>>> mmdrop(mm); >>>>> } >>>> Please fix the callee instead of ironing the call sites. This would >>>> avoid further issues as upstream emits additional switch_mm calls over >>>> time, and make ironing activate_mm useless. >>>> >>> Btw, this how the powerpc port works in locked switch mode, and this >>> particular bug does not apply in unlocked switch mode anyway; this is >>> why we don't have that problem on this arch. The ARM port always works >>> in unlocked switch mode IIRC for latency reasons, so this should be a >>> non-issue here as well. Gilles, would you confirm this? >> Yes, on arm switch_mm always has irqs on. We handle the case where >> switch_mm was interrupted in the middle of its operations by restarting >> the mm switch after xenomai switches back to Linux. >> > > On archs with non-atomic switch_mm(), use_mm() will require a different > strategy. I'm thinking about something like > > use_mm(): > set_some_flag(); > barrier(); > current->mm = new_mm; > current->active_mm = new_mm; > switch_mm(old_active_mm, new_mm, current); > clear_some_flag(); > > and switch_mm(): > ... > if (likely(prev != next) || some_flag_set()) { > clear_some_flag(); > ... > > ie. enforce mm switch if we may have interrupted use_mm at an unpleasant > time. I just don't know yet where to attach that some_flag to. Should be > the current task, but can we always access it from switch_mm? These mechanisms are already in place. All you have to do is: use_mm() ipipe_active_mm = NULL; barrier(); current->mm = new_mm; current->active_mm = new_mm; switch_mm(old_active_mm, new_mm, current); -- Gilles.