From mboxrd@z Thu Jan 1 00:00:00 1970 Message-ID: <4B5ACAD7.5070607@domain.hid> Date: Sat, 23 Jan 2010 11:09:27 +0100 From: Jan Kiszka MIME-Version: 1.0 References: <4B59D952.30202@domain.hid> <1264181859.2350.152.camel@domain.hid> <4B59E2D4.5040004@domain.hid> <4B59E33E.1040203@domain.hid> <1264182738.2350.158.camel@domain.hid> <4B59E87F.80108@domain.hid> <1264183719.2350.159.camel@domain.hid> <1264240767.2350.164.camel@domain.hid> In-Reply-To: <1264240767.2350.164.camel@domain.hid> Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: 7bit Subject: Re: [Xenomai-core] Domain switch during page fault handling List-Id: Xenomai life and development List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Philippe Gerum Cc: "Mauerer, Wolfgang" , xenomai-core Philippe Gerum wrote: > On Fri, 2010-01-22 at 19:08 +0100, Philippe Gerum wrote: >> On Fri, 2010-01-22 at 19:03 +0100, Jan Kiszka wrote: >>> Philippe Gerum wrote: >>>> On Fri, 2010-01-22 at 18:41 +0100, Jan Kiszka wrote: >>>>> Gilles Chanteperdrix wrote: >>>>>> Philippe Gerum wrote: >>>>>>> On Fri, 2010-01-22 at 17:58 +0100, Jan Kiszka wrote: >>>>>>>> Hi guys, >>>>>>>> >>>>>>>> we are currently trying to catch an ugly Linux pipeline state corruption >>>>>>>> on x86-64. >>>>>>>> >>>>>>>> Conceptual question: If a Xenomai task causes a fault, we enter >>>>>>>> ipipe_trap_notify over the primary domain and leave it over the root >>>>>>>> domain, right? Now, if the root domain happened to be stalled when the >>>>>>>> exception happened, where should it normally be unstalled again, >>>>>>>> *for_that_task*? Our problem is that we generate a code path where this >>>>>>>> does not happen. >>>>>>> xnhadow_relax -> ipipe_reenter_root -> finish_task_switch -> >>>>>>> finish_lock_switch -> unstall >>>>>>> >>>>>>> Since xnshadow_relax is called on behalf the event dispatcher, we should >>>>>>> expect it to return with the root domain unstalled after a domain >>>>>>> downgrade, from primary to root. >>>>>> Ok, but what about local_irq_restore_nosync at the end of the function ? >>>>>> >>>>> That is, IMO, our problem: It replays the root state on fault entry, but >>>>> that one is totally unrelated to the (Xenomai) task that caused the fault. >>>> The code seems fishy. Try restoring only when the incoming domain was >>>> the root one. Indeed. >>>> >>> Something like this? >>> >>> diff --git a/arch/x86/kernel/ipipe.c b/arch/x86/kernel/ipipe.c >>> index 4442d96..0558ea3 100644 >>> --- a/arch/x86/kernel/ipipe.c >>> +++ b/arch/x86/kernel/ipipe.c >>> @@ -702,19 +702,21 @@ static int __ipipe_xlate_signo[] = { >>> >>> int __ipipe_handle_exception(struct pt_regs *regs, long error_code, int vector) >>> { >>> + bool restore_flags = false; >>> unsigned long flags; >>> >>> - /* Pick up the root domain state of the interrupted context. */ >>> - local_save_flags(flags); >>> + if (ipipe_root_domain_p && irqs_disabled_hw()) { >>> + /* Pick up the root domain state of the interrupted context. */ >>> + local_save_flags(flags); >>> >>> - if (ipipe_root_domain_p) { >>> /* >>> * Replicate hw interrupt state into the virtual mask before >>> * calling the I-pipe event handler over the root domain. Also >>> * required later when calling the Linux exception handler. >>> */ >>> - if (irqs_disabled_hw()) >>> - local_irq_disable(); >>> + local_irq_disable(); >>> + >>> + restore_flags = true; >>> } >>> #ifdef CONFIG_KGDB >>> /* catch exception KGDB is interested in over non-root domains */ >>> @@ -725,7 +727,8 @@ int __ipipe_handle_exception(struct pt_regs *regs, long error_code, int vector) >>> #endif /* CONFIG_KGDB */ >>> >>> if (unlikely(ipipe_trap_notify(vector, regs))) { >>> - local_irq_restore_nosync(flags); >>> + if (restore_flags) >>> + local_irq_restore_nosync(flags); >>> return 1; >>> } >>> >>> @@ -770,7 +773,8 @@ int __ipipe_handle_exception(struct pt_regs *regs, long error_code, int vector) >>> * Relevant for 64-bit: Restore root domain state as the low-level >>> * return code will not align it to regs.flags. >>> */ >>> - local_irq_restore_nosync(flags); >>> + if (restore_flags) >>> + local_irq_restore_nosync(flags); >>> >>> return 0; >>> } >>> >>> >>> We are currently not able to test this on the system that triggers it, >>> but we'll do so tomorrow (yeah...). >>> >> Should work. Famous last words. >> > > Strike that. This won't work, because the fixup code will use the saved > flags even when root is not the incoming domain and/or hw IRQs are on on > entry. In short, local_save_flags() must be done unconditionally, as > previously. It will accidentally work for 64-bit where __fixup_if is empty. And for 32-bit, I would say we need to make it depend on restore_flags as well. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux