From mboxrd@z Thu Jan 1 00:00:00 1970 Message-ID: <4A4BB58D.3070207@domain.hid> Date: Wed, 01 Jul 2009 21:14:21 +0200 From: Jan Kiszka MIME-Version: 1.0 References: <4A48FB71.6070506@domain.hid> <4A49CD81.4060706@domain.hid> <4A49CFF0.7070202@domain.hid> <1246353623.7803.21.camel@domain.hid> <4A49D935.3060900@domain.hid> <1246353913.7803.24.camel@domain.hid> <4A49DA4E.2020604@domain.hid> <1246354047.7803.25.camel@domain.hid> <4A49DC0A.5000208@domain.hid> <4A4A391B.8000700@domain.hid> <4A4B4ED4.6020208@domain.hid> <4A4B558D.20307@domain.hid> <4A4B58E9.4050407@domain.hid> <4A4B5985.3070504@domain.hid> <4A4B8617.5000704@domain.hid> <4A4B8851.9070005@domain.hid> <4A4B8912.1060700@domain.hid> <4A4BA334.8090701@domain.hid> <1246472157.7803.62.camel@domain.hid> <1246472833.7803.67.camel@domain.hid> <4A4BB1BD.8020101@domain.hid> In-Reply-To: <4A4BB1BD.8020101@domain.hid> Content-Type: multipart/signed; micalg=pgp-sha1; protocol="application/pgp-signature"; boundary="------------enig81DB1BA6BE7848EA796B6E63" Sender: jan.kiszka@domain.hid Subject: Re: [Xenomai-core] x86: Endless minor faults List-Id: Xenomai life and development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Philippe Gerum Cc: xenomai-core This is an OpenPGP/MIME signed message (RFC 2440 and 3156) --------------enig81DB1BA6BE7848EA796B6E63 Content-Type: text/plain; charset=ISO-8859-15 Content-Transfer-Encoding: quoted-printable Jan Kiszka wrote: > Philippe Gerum wrote: >> On Wed, 2009-07-01 at 20:15 +0200, Philippe Gerum wrote: >>> On Wed, 2009-07-01 at 19:56 +0200, Jan Kiszka wrote: >>>> Jan Kiszka wrote: >>>>> Jan Kiszka wrote: >>>>>> Gilles Chanteperdrix wrote: >>>>>>> Jan Kiszka wrote: >>>>>>>> Jan Kiszka wrote: >>>>>>>>> Gilles Chanteperdrix wrote: >>>>>>>>>> Jan Kiszka wrote: >>>>>>>>>>> Jan Kiszka wrote: >>>>>>>>>>>> It's still unclear what goes on precisely, we are still digg= ing, but the >>>>>>>>>>>> test system that can produce this is highly contended. >>>>>>>>>>> Short update: Further instrumentation revealed that cr3 diffe= rs from >>>>>>>>>>> active_mm->pgd while we are looping over that fault, ie. the = kernel >>>>>>>>>>> tries to fixup the wrong mm. And that means we have some open= race >>>>>>>>>>> window between updating cr3 and active_mm somewhere (isn't sw= itch_mm run >>>>>>>>>>> in a preemptible manner now?). >>>>>>>>>> Maybe the rsp is wrong and leads you to the wrong active_mm ? >>>>>>>>>> >>>>>>>>>>> As a first shot I disabled CONFIG_IPIPE_DELAYED_ATOMICSW, and= we are now >>>>>>>>>>> checking if it makes a difference. Digging deeper into the co= de in the >>>>>>>>>>> meanwhile... >>>>>>>>>> As you have found out in the mean time, we do not use unlocked= context >>>>>>>>>> switches on x86. >>>>>>>>>> >>>>>>>>> Yes. >>>>>>>>> >>>>>>>>> The last question I asked myself (but couldn't answer yet due t= o other >>>>>>>>> activity) was: Where are the local_irq_disable/enable_hw around= >>>>>>>>> switch_mm for its Linux callers? >>>>>>>> Ha, that's the point: only activate_mm is protected, but we have= more >>>>>>>> spots in 2.6.29 and maybe other kernels, too! >>>>>>> Ok, I do not see where switch_mm is called with IRQs off. What I = found, >>>>>> We have two direct callers of switch_mm in sched.c and one in fs/a= io.c. >>>>>> Both need protection (I pushed IRQ disabling into switch_mm), but = that >>>>>> is not enough according to current tests. It seems to reduce to >>>>>> probability of corruption, though. >>>>>> >>>>>>> however, is that leave_mm sets the cr3 and just clears >>>>>>> active_mm->cpu_vm_mask. So, at this point, we have a discrepancy = between >>>>>>> cr3 and active_mm. I do not know what could happen if Xenomai cou= ld >>>>>>> interrupt leave_mm between the cpu_clear and the write_cr3. From = what I >>>>>>> understand, switch_mm called by Xenomai upon return to root would= re-set >>>>>>> the bit, and re-set cr3, which would be set to the kernel cr3 rig= ht >>>>>>> after that, but this would result in the active_mm.cpu_vm_mask bi= t being >>>>>>> set instead of cleared as expected. So, maybe an irqs off section= is >>>>>>> missing in leave_mm. >>>>>> leave_mm is already protected by its caller smp_invalidate_interru= pt - >>>>>> but now I'm parsing context_switch /wrt to lazy tlb. >>>>>> >>>>> Hmm... lazy tlb: This means a new task is switched in and has activ= e_mm >>>>> !=3D mm. But do_page_fault reads task->mm... Just thoughts, no clea= r >>>>> picture yet. >>>>> >>>> Looking closer at the call sites of switch_mm, I think our the probl= em >>>> is mostly related to use_mm from fs/aio.c (customer is using aio >>>> heavily). But other callers need protection, too. We are going to te= st >>>> this patch tomorrow: >>>> >>>> diff --git a/fs/aio.c b/fs/aio.c >>>> index 76da125..d90fca3 100644 >>>> --- a/fs/aio.c >>>> +++ b/fs/aio.c >>>> @@ -618,13 +618,16 @@ static void use_mm(struct mm_struct *mm) >>>> { >>>> struct mm_struct *active_mm; >>>> struct task_struct *tsk =3D current; >>>> + unsigned long flags; >>>> =20 >>>> task_lock(tsk); >>>> active_mm =3D tsk->active_mm; >>>> atomic_inc(&mm->mm_count); >>>> + local_irq_save_hw_cond(flags); >>>> tsk->mm =3D mm; >>>> tsk->active_mm =3D mm; >>>> switch_mm(active_mm, mm, tsk); >>>> + local_irq_restore_hw_cond(flags); >>>> task_unlock(tsk); >>>> =20 >>>> mmdrop(active_mm); >>>> diff --git a/kernel/sched.c b/kernel/sched.c >>>> index aa8f86c..8c545a4 100644 >>>> --- a/kernel/sched.c >>>> +++ b/kernel/sched.c >>>> @@ -2668,8 +2668,12 @@ context_switch(struct rq *rq, struct task_str= uct *prev, >>>> next->active_mm =3D oldmm; >>>> atomic_inc(&oldmm->mm_count); >>>> enter_lazy_tlb(oldmm, next); >>>> - } else >>>> + } else { >>>> + unsigned long flags; >>>> + local_irq_save_hw_cond(flags); >>>> switch_mm(oldmm, mm, next); >>>> + local_irq_restore_hw_cond(flags); >>>> + } >>>> =20 >>>> if (unlikely(!prev->mm)) { >>>> prev->active_mm =3D NULL; >>>> @@ -6406,8 +6410,12 @@ void idle_task_exit(void) >>>> =20 >>>> BUG_ON(cpu_online(smp_processor_id())); >>>> =20 >>>> - if (mm !=3D &init_mm) >>>> + if (mm !=3D &init_mm) { >>>> + unsigned long flags; >>>> + local_irq_save_hw_cond(flags); >>>> switch_mm(mm, &init_mm, current); >>>> + local_irq_restore_hw_cond(flags); >>>> + } >>>> mmdrop(mm); >>>> } >>> Please fix the callee instead of ironing the call sites. This would >>> avoid further issues as upstream emits additional switch_mm calls ove= r >>> time, and make ironing activate_mm useless. >>> >> Btw, this how the powerpc port works in locked switch mode, and this >> particular bug does not apply in unlocked switch mode anyway; this is >> why we don't have that problem on this arch. The ARM port always works= >> in unlocked switch mode IIRC for latency reasons, so this should be a >> non-issue here as well. Gilles, would you confirm this? >=20 > The aio issue (if it turns out to be one, /me still needs to understand= > the concrete patch) should bite any arch, but we may solve it > differently where switch_mm is actually reentrant. Took a beer to get my mind free, here is the potential race: -> aio thread runs with mm =3D=3D init_mm -> aio thread calls into use_mm -> preemption by Xenomai right after tsk->active_mm =3D mm -> Xenomai switches from the aio thread to an RT thread with identical mm (definitely possible in our scenario) -> switch_mm will not update cr3 as prev =3D=3D next -> next fault of the RT thread (e.g. in __xn_put_user...) will attempt to fix the right mm, but the CPU will continue to fail over init_mm Jan --------------enig81DB1BA6BE7848EA796B6E63 Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v2.0.9 (GNU/Linux) Comment: Using GnuPG with SUSE - http://enigmail.mozdev.org iEYEARECAAYFAkpLtZEACgkQniDOoMHTA+lCiwCdFLAz+jroB3lCESldGy8+bpen qvsAnAwixFhog43TySwbT54oF/fcgKQi =bzv/ -----END PGP SIGNATURE----- --------------enig81DB1BA6BE7848EA796B6E63--