From mboxrd@z Thu Jan 1 00:00:00 1970 References: <0531d2db-aa87-1950-6d7f-41c80d4fe679@web.de> <875yyrf3yo.fsf@xenomai.org> <87a6o2b0to.fsf@xenomai.org> <50f40c7f-ae8d-7ded-9dc7-91310f0d7909@siemens.com> <874ke9bzde.fsf@xenomai.org> <3814ce50-b6cd-04e3-587b-40197f905ef8@siemens.com> <87k0n5af78.fsf@xenomai.org> <14f3aaa0-0d59-3b10-c21e-d408a1346673@siemens.com> <87eeddabpx.fsf@xenomai.org> <87im2o8fn6.fsf@xenomai.org> <87zgw06ysa.fsf@xenomai.org> <22521ff4-b3b1-86c7-2486-831fade9f268@siemens.com> <87wnr37eht.fsf@xenomai.org> <99135b48-1452-4f47-eb93-1bfae2e5e0a2@siemens.com> From: Philippe Gerum Subject: Re: [PATCH] irq_pipeline: Prevent returning to user space with pending inband IRQs In-reply-to: <99135b48-1452-4f47-eb93-1bfae2e5e0a2@siemens.com> Date: Fri, 11 Jun 2021 17:39:23 +0200 Message-ID: <8735tobec4.fsf@xenomai.org> MIME-Version: 1.0 Content-Type: text/plain List-Id: Discussions about the Xenomai project List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Jan Kiszka Cc: Xenomai Jan Kiszka writes: > On 09.06.21 08:12, Philippe Gerum wrote: >> >> Jan Kiszka writes: >> >>> On 08.06.21 19:39, Philippe Gerum wrote: >>>> >>>> Philippe Gerum writes: >>>> >>>>> Jan Kiszka writes: >>>>> >>>>>> On 07.06.21 18:19, Philippe Gerum wrote: >>>>>>> >>>>>>> Jan Kiszka writes: >>>>>>> >>>>>>>> On 07.06.21 17:04, Philippe Gerum wrote: >>>>>>>>> >>>>>>>>> Jan Kiszka writes: >>>>>>>>> >>>>>>>>>> On 07.06.21 15:48, Jan Kiszka wrote: >>>>>>>>>>> On 07.06.21 15:44, Jan Kiszka wrote: >>>>>>>>>>>> On 07.06.21 15:03, Philippe Gerum wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> Jan Kiszka writes: >>>>>>>>>>>>> >>>>>>>>>>>>>> On 07.06.21 09:17, Philippe Gerum wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Jan Kiszka writes: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> On 06.06.21 16:43, Philippe Gerum via Xenomai wrote: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Jan Kiszka writes: >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> From: Jan Kiszka >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> The correct order is hard IRQs off first, then stalling inband, see also >>>>>>>>>>>>>>>>>> I-pipe. Otherwise, we risk to take an interrupt for the inband stage but >>>>>>>>>>>>>>>>>> do not inject it before returning to user space. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Signed-off-by: Jan Kiszka >>>>>>>>>>>>>>>>>> --- >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Fixes the RCU stalls Florian reported, at least for me. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I'm afraid this wasn't the last regression of translating I-pipe into >>>>>>>>>>>>>>>>>> dovetail.. >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> include/linux/entry-common.h | 3 ++- >>>>>>>>>>>>>>>>>> 1 file changed, 2 insertions(+), 1 deletion(-) >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> diff --git a/include/linux/entry-common.h b/include/linux/entry-common.h >>>>>>>>>>>>>>>>>> index 0fb45b2d6094..00540110985e 100644 >>>>>>>>>>>>>>>>>> --- a/include/linux/entry-common.h >>>>>>>>>>>>>>>>>> +++ b/include/linux/entry-common.h >>>>>>>>>>>>>>>>>> @@ -203,7 +203,8 @@ static inline void local_irq_disable_exit_to_user(void); >>>>>>>>>>>>>>>>>> #ifndef local_irq_disable_exit_to_user >>>>>>>>>>>>>>>>>> static inline void local_irq_disable_exit_to_user(void) >>>>>>>>>>>>>>>>>> { >>>>>>>>>>>>>>>>>> - local_irq_disable_full(); >>>>>>>>>>>>>>>>>> + hard_cond_local_irq_disable(); >>>>>>>>>>>>>>>>>> + local_irq_disable(); >>>>>>>>>>>>>>>>>> } >>>>>>>>>>>>>>>>>> #endif >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Good spot, real issue here, just like the CPU_IDLE code deals with. The >>>>>>>>>>>>>>>>> fix looks good, but maybe we could do even better. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> The way local_irq_disable_full() works tends to introduce this issue in >>>>>>>>>>>>>>>>> a sneaky way, when the code path does not synchronize the interrupt log >>>>>>>>>>>>>>>>> (exit_to_user_mode_loop may be at fault in this case, which does not >>>>>>>>>>>>>>>>> traverses synchronize_pipeline()). Lack of synchronization would still >>>>>>>>>>>>>>>>> hit us if some trap handler turns hard irqs off -> on -> off the same >>>>>>>>>>>>>>>>> way, and we don't leave through the user exit path. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Inverting the order of the actions in local_irq_disable_full() may be an >>>>>>>>>>>>>>>>> option (inband_irq_disable would allow that), making sure we cannot be >>>>>>>>>>>>>>>>> caught in the same issue when returning to kernel mode is another >>>>>>>>>>>>>>>>> way. This needs more thought I think. >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> So, always this way? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> local_irq_disable_full: >>>>>>>>>>>>>>>> hard_local_irq_disable >>>>>>>>>>>>>>>> local_irq_disable >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> local_irq_enable_full: >>>>>>>>>>>>>>>> hard_local_irq_enable >>>>>>>>>>>>>>>> local_irq_enable >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Yes. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Or even flip local_irq_enable_full as well? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Not this one, hard irqs must be on before local_irq_enable() is issued >>>>>>>>>>>>>>> otherwise we would trigger a debug assertion. The reason for such >>>>>>>>>>>>>>> assertion is that the I-pipe version of local_irq_enable() force enables >>>>>>>>>>>>>>> hard irqs on, and this is something we want to depart from because >>>>>>>>>>>>>>> whether the caller expects or not such side-effect is error-prone. For >>>>>>>>>>>>>>> this reason, inband_irq_enable() expects hard irqs on, which should be >>>>>>>>>>>>>>> the current hard irq state for the caller under normal circumstances. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Was there ever code that required ordering local_irq_disable_full the >>>>>>>>>>>>>>>> other way around? >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> After review, I don't think so. The current order for >>>>>>>>>>>>>>> local_irq_disable_full() was rather determined by applying a reverse >>>>>>>>>>>>>>> sequence compared to local_irq_enable_full(). So this looks good. I need >>>>>>>>>>>>>>> to test this on the armv[7/8] side here first before merging. >>>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> Is there a commit somewhere that I can drop already? >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> 1. These are fine with me in their latest iteration: >>>>>>>>>>>>> >>>>>>>>>>>>> irq_pipeline: Clean up stage_info field and users >>>>>>>>>>>>> irq_pipeline: Account for stage migration across faults >>>>>>>>>>>>> irq_pipeline: Warn when calling irqentry_enter with oob stalled >>>>>>>>>>>>> x86: dovetail: Fix TS flag reservation >>>>>>>>>>>>> >>>>>>>>>>>>> 2. This one should be replaced by a fix in local_irq_disable_full(), >>>>>>>>>>>>> pending some tests ongoing here: >>>>>>>>>>>>> >>>>>>>>>>>>> irq_pipeline: Prevent returning to user space with pending inband IRQs >>>>>>>>>>>>> >>>>>>>>>>>>> However, as I mentioned earlier, this may not be sufficient per se, we >>>>>>>>>>>>> may have to synchronize the in-band log when unstalling the stage on the >>>>>>>>>>>>> kernel exit path, i.e. the exit point fixed up by the "Account for stage >>>>>>>>>>>>> migration across faults" patch. This would address the following >>>>>>>>>>>>> scenario: >>>>>>>>>>>>> >>>>>>>>>>>>> irqentry_enter on kernel context: >>>>>>>>>>>>> stall_inband(); >>>>>>>>>>>>> trap_handler(); >>>>>>>>>>>>> hard_local_irq_enable() >>>>>>>>>>>>> hard_local_irq_disable() >>>>>>>>>>>>> if (state.stage_info == IRQENTRY_INBAND_UNSTALLED || >>>>>>>>>>>>> state.stage_info == IRQENTRY_OOB) { >>>>>>>>>>>>> unstall_inband(); >>>>>>>>>>>>> synchronize_pipeline(); >>>>>>>>>>>> >>>>>>>>>>>> Oh, unstall_inband does not synchronize? Because of >>>>>>>>>>>> hard_local_irq_disable or in general? >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Ah, found it - unstall != inband_irq_enable. Why not using the latter >>>>>>>>>>> directly? >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> ...because it's instrumented to detect hard_irqs_disabled. >>>>>>>>>> >>>>>>>>>> Can we safely sync at that point? >>>>>>>>> >>>>>>>>> Yes we may do this, because we restore the original -unstalled- state of >>>>>>>>> the in-band stage. So if we have pending IRQs, we may play them at this >>>>>>>>> point. What we may not do obviously, is creating synchronization points >>>>>>>>> where the kernel does not expect irqs. >>>>>>>>> >>>>>>>> >>>>>>>> Again: When will any reschedules that might be requested by those >>>>>>>> replayed IRQs be handled then? >>>>>>>> >>>>>>> >>>>>>> Nowhere unless we provide a log synchronization routine dedicated to the >>>>>>> in-band kernel exit path, which would fold in the rescheduling call via >>>>>>> irqentry_exit_cond_resched(). I'll write that one. >>>>>>> >>>>>> >>>>>> How about simply moving the replay up, before the kernel checks for >>>>>> pending reschedules anyway? We can stall inband then again, to please >>>>>> lockdep & Co., as long as there is no hard-irq-on section before the intret. >>>>>> >>>>> >>>>> Yes, this is the right way to do this. I'm on it. >>>> >>>> Please test this (heavily) with cobalt. This survived a 2-hours long kvm >>>> test here, running loops of the evl test suite + hackbench over 64 vcpus, >>>> so far so good. No reason for any difference in behavior between cobalt >>>> and evl in this area, this is merely a dovetail business. >>>> >>>> commit 30808d958277a525baad33f3c68ccb7910dc3176 (HEAD -> rebase/v5.10.y-evl) >>>> Author: Philippe Gerum >>>> >>>> genirq: irq_pipeline: synchronize log on irq exit to kernel >>>> >>>> We must make sure to play any IRQ which might be pending in the >>>> in-band log before leaving an interrupt frame for a preempted kernel >>>> context. >>>> >>>> This completes "irq_pipeline: Account for stage migration across >>>> faults", so that we synchronize the log once the in-band stage is >>>> unstalled. In addition, we also care to do this before >>>> preempt_schedule_irq() runs, so that we won't miss any rescheduling >>>> request which might have been triggered by some IRQ we just played. >>>> >>>> Signed-off-by: Philippe Gerum >>>> Suggested-by: Jan Kiszka >>>> >>>> diff --git a/kernel/entry/common.c b/kernel/entry/common.c >>>> index 4e81c0c03e5726a..99512307537561b 100644 >>>> --- a/kernel/entry/common.c >>>> +++ b/kernel/entry/common.c >>>> @@ -477,8 +477,57 @@ bool irqexit_may_preempt_schedule(irqentry_state_t state, >>>> >>>> #endif >>>> >>>> +#ifdef CONFIG_IRQ_PIPELINE >>>> + >>>> +static bool irqentry_syncstage(irqentry_state_t state) /* hard irqs off */ >>>> +{ >>>> + /* >>>> + * If pipelining interrupts, enable in-band IRQs then >>>> + * synchronize the interrupt log on exit if: >>>> + * >>>> + * - irqentry_enter() stalled the stage in order to mirror the >>>> + * hardware state. >>>> + * >>>> + * - we where coming from oob, thus went through a stage migration >>>> + * that was caused by taking a CPU exception, e.g., a fault. >>>> + * >>>> + * We run before preempt_schedule_irq() may be called later on >>>> + * by preemptible kernels, so that any rescheduling request >>>> + * triggered by in-band IRQ handlers is considered. >>>> + */ >>>> + if (state.stage_info == IRQENTRY_INBAND_UNSTALLED || >>>> + state.stage_info == IRQENTRY_OOB) { >>>> + unstall_inband_nocheck(); >>>> + synchronize_pipeline_on_irq(); >>>> + stall_inband_nocheck(); >>>> + return true; >>>> + } >>>> + >>>> + return false; >>>> +} >>>> + >>>> +static void irqentry_unstall(void) >>>> +{ >>>> + unstall_inband_nocheck(); >>>> +} >>>> + >>>> +#else >>>> + >>>> +static bool irqentry_syncstage(irqentry_state_t state) >>>> +{ >>>> + return false; >>>> +} >>>> + >>>> +static void irqentry_unstall(void) >>>> +{ >>>> +} >>>> + >>>> +#endif >>>> + >>>> noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state) >>>> { >>>> + bool synchronized = false; >>>> + >>>> if (running_oob()) >>>> return; >>>> >>>> @@ -488,7 +537,11 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state) >>>> if (user_mode(regs)) { >>>> irqentry_exit_to_user_mode(regs); >>>> return; >>>> - } else if (irqexit_may_preempt_schedule(state, regs)) { >>>> + } >>>> + >>>> + synchronized = irqentry_syncstage(state); >>>> + >>>> + if (irqexit_may_preempt_schedule(state, regs)) { >>>> /* >>>> * If RCU was not watching on entry this needs to be done >>>> * carefully and needs the same ordering of lockdep/tracing >>>> @@ -521,17 +574,9 @@ noinstr void irqentry_exit(struct pt_regs *regs, irqentry_state_t state) >>>> } >>>> >>>> out: >>>> -#ifdef CONFIG_IRQ_PIPELINE >>>> - /* >>>> - * If pipelining interrupts, clear the in-band stall bit if >>>> - * irqentry_enter() raised it in order to mirror the hardware >>>> - * state. Also clear it when we where coming from oob, thus went >>>> - * through a migration that was caused by taking, e.g., a fault. >>>> - */ >>>> - if (state.stage_info == IRQENTRY_INBAND_UNSTALLED || >>>> - state.stage_info == IRQENTRY_OOB) >>>> - unstall_inband(); >>>> -#endif >>>> + if (synchronized) >>>> + irqentry_unstall(); >>>> + >>>> return; >>>> } >>>> >>>> >>> >>> I've put that into a staging/v5.10.y-dovetail branch, pointed >>> xenomai-images to that and started a run in our lab. That should give us >>> some confidence that things work under cobalt across all archs, and we >>> can make that commit official. >>> >>> However, heavy stressing would require triggering migration-on-fault in >>> tighter loops while running something like stress-ng in the background. >>> Maybe it would make sense to add such a pattern to the switchtest (would >>> also require a way to silence the related kernel warnings during >>> runtime, not only build-time)? >>> >> >> switchtest aims at exercising the core scheduling logic including the >> transition between execution stages, so migration-on-fault would fit >> there indeed. >> > > Do we need this testcase first? My feeling from review and testing is > that is safe to move forward with the patch already. > Looks ok here too. The patch is fairly straightforward, and this has been tested on the EVL side of the universe on a handful of armv7, armv8 machines without any glitch. I'll merge this patch shortly. -- Philippe.