* [PATCH 0/2] sched: Preemption fixlet and cleanup
From: Frederic Weisbecker @ 2014-12-15  2:24 UTC
To: Ingo Molnar, Linus Torvalds; +Cc: LKML, Frederic Weisbecker, Peter Zijlstra

Here is the reworked fix for the missing preemption check that Linus
reported, plus the need_resched() loop pulled up to the __schedule()
callers, as he also suggested.

Thanks.

Frederic Weisbecker (2):
  sched: Fix missing preemption check in cond_resched()
  sched: Pull resched loop to __schedule() callers

 kernel/sched/core.c | 51 ++++++++++++++++++++++++++-------------------------
 1 file changed, 26 insertions(+), 25 deletions(-)

-- 
2.1.3
* [PATCH 1/2] sched: Fix missing preemption check in cond_resched()
From: Frederic Weisbecker @ 2014-12-15  2:24 UTC
To: Ingo Molnar, Linus Torvalds; +Cc: LKML, Frederic Weisbecker, Peter Zijlstra

If an interrupt fires in cond_resched(), between the call to __schedule()
and the PREEMPT_ACTIVE count decrement, and that interrupt sets
TIF_NEED_RESCHED, the call to preempt_schedule_irq() will be ignored due
to the PREEMPT_ACTIVE count.

This kind of scenario, where irq preemption is delayed because we are
interrupting a preempt-disabled area, is usually fixed up once preemption
is re-enabled, with an explicit call to preempt_schedule(). This is what
preempt_enable() does, but a raw preempt count decrement such as
__preempt_count_sub(PREEMPT_ACTIVE) doesn't perform that delayed
preemption check.

Therefore, when this race happens, the rescheduling is delayed until the
next scheduler or preemption entrypoint. This can be a problem for
workloads sensitive to scheduler latency.

Let's fix that by consolidating cond_resched() with the
preempt_schedule() internals.
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-and-inspired-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 kernel/sched/core.c | 40 +++++++++++++++++++---------------------
 1 file changed, 19 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b5797b7..069a2d8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2877,6 +2877,21 @@ void __sched schedule_preempt_disabled(void)
 	preempt_disable();
 }
 
+static void preempt_schedule_common(void)
+{
+	do {
+		__preempt_count_add(PREEMPT_ACTIVE);
+		__schedule();
+		__preempt_count_sub(PREEMPT_ACTIVE);
+
+		/*
+		 * Check again in case we missed a preemption opportunity
+		 * between schedule and now.
+		 */
+		barrier();
+	} while (need_resched());
+}
+
 #ifdef CONFIG_PREEMPT
 /*
  * this is the entry point to schedule() from in-kernel preemption
@@ -2892,17 +2907,7 @@ asmlinkage __visible void __sched notrace preempt_schedule(void)
 	if (likely(!preemptible()))
 		return;
 
-	do {
-		__preempt_count_add(PREEMPT_ACTIVE);
-		__schedule();
-		__preempt_count_sub(PREEMPT_ACTIVE);
-
-		/*
-		 * Check again in case we missed a preemption opportunity
-		 * between schedule and now.
-		 */
-		barrier();
-	} while (need_resched());
+	preempt_schedule_common();
 }
 NOKPROBE_SYMBOL(preempt_schedule);
 EXPORT_SYMBOL(preempt_schedule);
@@ -4202,17 +4207,10 @@ SYSCALL_DEFINE0(sched_yield)
 	return 0;
 }
 
-static void __cond_resched(void)
-{
-	__preempt_count_add(PREEMPT_ACTIVE);
-	__schedule();
-	__preempt_count_sub(PREEMPT_ACTIVE);
-}
-
 int __sched _cond_resched(void)
 {
 	if (should_resched()) {
-		__cond_resched();
+		preempt_schedule_common();
 		return 1;
 	}
 	return 0;
@@ -4237,7 +4235,7 @@ int __cond_resched_lock(spinlock_t *lock)
 	if (spin_needbreak(lock) || resched) {
 		spin_unlock(lock);
 		if (resched)
-			__cond_resched();
+			preempt_schedule_common();
 		else
 			cpu_relax();
 		ret = 1;
@@ -4253,7 +4251,7 @@ int __sched __cond_resched_softirq(void)
 
 	if (should_resched()) {
 		local_bh_enable();
-		__cond_resched();
+		preempt_schedule_common();
 		local_bh_disable();
 		return 1;
 	}
-- 
2.1.3
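The retry loop that preempt_schedule_common() introduces can be modeled in
user space. The sketch below is purely illustrative: the sim_* names and the
counter standing in for interrupts that set TIF_NEED_RESCHED are made up for
this example; they are not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * User-space model of the retry loop in preempt_schedule_common().
 * pending_irqs simulates interrupts that set TIF_NEED_RESCHED while
 * __schedule() is running; the do/while guarantees none of those
 * requests is lost. Returns the number of scheduling passes performed.
 */
static int sim_preempt_schedule_common(int pending_irqs)
{
	int passes = 0;
	bool need_resched = true;	/* we were asked to reschedule */

	do {
		/* "__schedule()": switch tasks, consuming the request... */
		need_resched = false;
		passes++;
		/* ...but an "irq" may raise the flag again before we return */
		if (pending_irqs > 0) {
			pending_irqs--;
			need_resched = true;
		}
	} while (need_resched);

	return passes;
}
```

With no mid-schedule interrupt a single pass suffices; each simulated
interrupt that fires during a pass costs exactly one extra pass instead of
being silently dropped, which is the behavior the patch restores.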
* [PATCH 2/2] sched: Pull resched loop to __schedule() callers
From: Frederic Weisbecker @ 2014-12-15  2:24 UTC
To: Ingo Molnar, Linus Torvalds; +Cc: LKML, Frederic Weisbecker, Peter Zijlstra

__schedule() disables preemption during its job and re-enables it
afterward without doing a preemption check, to avoid recursion. But if an
event happens after the context switch which requires rescheduling, we
need to check again whether a task of higher priority needs the CPU. A
preempt irq can create such a situation. To handle that, __schedule()
loops on need_resched().

But the preempt_schedule_*() functions, which call __schedule(), also
loop on need_resched() to handle missed preempt irqs. Hence we end up
with the same loop happening twice.

Let's simplify that by giving responsibility for the need_resched() loop
to all __schedule() callers.

There is a risk that the outer loop now handles reschedules that used to
be handled by the inner loop, with the added overhead of the caller's
details (inc/dec of PREEMPT_ACTIVE, irq save/restore), but assuming those
inner rescheduling loops weren't too frequent, this shouldn't matter.
Especially since the whole preemption path now loses one loop in any
case.
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 kernel/sched/core.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 069a2d8..368c8f3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2748,6 +2748,10 @@ again:
  *          - explicit schedule() call
  *          - return from syscall or exception to user-space
  *          - return from interrupt-handler to user-space
+ *
+ * WARNING: all callers must re-check need_resched() afterward and reschedule
+ * accordingly in case an event triggered the need for rescheduling (such as
+ * an interrupt waking up a task) while preemption was disabled in __schedule().
  */
 static void __sched __schedule(void)
 {
@@ -2756,7 +2760,6 @@ static void __sched __schedule(void)
 	struct rq *rq;
 	int cpu;
 
-need_resched:
 	preempt_disable();
 	cpu = smp_processor_id();
 	rq = cpu_rq(cpu);
@@ -2821,8 +2824,6 @@ need_resched:
 	post_schedule(rq);
 
 	sched_preempt_enable_no_resched();
-	if (need_resched())
-		goto need_resched;
 }
 
 static inline void sched_submit_work(struct task_struct *tsk)
@@ -2842,7 +2843,9 @@ asmlinkage __visible void __sched schedule(void)
 	struct task_struct *tsk = current;
 
 	sched_submit_work(tsk);
-	__schedule();
+	do {
+		__schedule();
+	} while (need_resched());
 }
 EXPORT_SYMBOL(schedule);
-- 
2.1.3
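The structural change can also be modeled in user space: __schedule() does a
single pass (the internal "goto need_resched" is gone) and the caller owns
the loop. All names below are illustrative stand-ins, not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>

static bool need_resched_flag;		/* stands in for TIF_NEED_RESCHED */
static int wakeups_during_switch;	/* events while preemption is off */
static int switches;

/* Models the new __schedule(): one pass, no internal loop. */
static void sim___schedule(void)
{
	need_resched_flag = false;
	switches++;
	/* a wakeup may re-raise the flag while we hold preemption off */
	if (wakeups_during_switch > 0) {
		wakeups_during_switch--;
		need_resched_flag = true;
	}
}

/* Models the new schedule(): the caller re-checks need_resched(). */
static int sim_schedule(int wakeups)
{
	switches = 0;
	wakeups_during_switch = wakeups;
	do {
		sim___schedule();
	} while (need_resched_flag);
	return switches;
}
```

The observable behavior matches the old goto-based version: every wakeup
raised during a context switch triggers exactly one further pass, only now
there is a single loop instead of one in __schedule() plus one in each
preempt_schedule_*() caller.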
* Re: [PATCH 2/2] sched: Pull resched loop to __schedule() callers
From: Linus Torvalds @ 2014-12-15  2:46 UTC
To: Frederic Weisbecker; +Cc: Ingo Molnar, LKML, Peter Zijlstra

On Sun, Dec 14, 2014 at 6:24 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> -need_resched:
>  	preempt_disable();
>  	cpu = smp_processor_id();
>  	rq = cpu_rq(cpu);
> @@ -2821,8 +2824,6 @@ need_resched:
>  	post_schedule(rq);
>
>  	sched_preempt_enable_no_resched();
> -	if (need_resched())
> -		goto need_resched;

So my suggestion was to move the "preempt_disable()/enable_no_resched()"
into the callers too.

Everybody already does that - except for "schedule()" itself. So that
would involve adding it here too:

> @@ -2842,7 +2843,9 @@ asmlinkage __visible void __sched schedule(void)
>  	struct task_struct *tsk = current;
>
>  	sched_submit_work(tsk);
> -	__schedule();
> +	do {
          preempt_disable();
> +		__schedule();
          sched_preempt_enable_no_resched();
> +	} while (need_resched());
>  }

Hmm?

That would mean that we have one less preempt update in the
preempt_schedule() cases. If somebody cares about the preempt counter
value, we'd have to increment the preempt count add values too, i.e. do

	__preempt_count_add(PREEMPT_ACTIVE + 1);

there, but I'm not convinced anybody cares about the exact count.

As it is, you end up doing the preempt count things twice in the
preempt_schedule() case: first the PREEMPT_ACTIVE count, and then the
count inside __schedule().

Hmm?

		Linus
* Re: [PATCH 2/2] sched: Pull resched loop to __schedule() callers
From: Frederic Weisbecker @ 2014-12-15 16:32 UTC
To: Linus Torvalds; +Cc: Ingo Molnar, LKML, Peter Zijlstra

On Sun, Dec 14, 2014 at 06:46:27PM -0800, Linus Torvalds wrote:
> So my suggestion was to move the
> "preempt_disable()/enable_no_resched()" into the callers too.
>
> Everybody already does that - except for "schedule()" itself. So that
> would involve adding it here too:
>
> > @@ -2842,7 +2843,9 @@ asmlinkage __visible void __sched schedule(void)
> >  	struct task_struct *tsk = current;
> >
> >  	sched_submit_work(tsk);
> > -	__schedule();
> > +	do {
>           preempt_disable();
> > +		__schedule();
>           sched_preempt_enable_no_resched();
> > +	} while (need_resched());
> >  }
>
> Hmm?

Indeed!

> That would mean that we have one less preempt update in the
> preempt_schedule() cases. If somebody cares about the preempt counter
> value, we'd have to increment the preempt count add values too, i.e. do
>
> 	__preempt_count_add(PREEMPT_ACTIVE + 1);
>
> there, but I'm not convinced anybody cares about the exact count.

It _looks_ fine to only add PREEMPT_ACTIVE without PREEMPT_OFFSET. I
guess we are only interested in avoiding preemption.

But it may have an impact on some context checkers that rely on
in_atomic*(), which ignores the PREEMPT_ACTIVE value. It shouldn't
ignore that, but I guess it's a hack for some specific situation.

Now if we do what we plan, PREEMPT_ACTIVE won't involve PREEMPT_OFFSET
anymore, so in_atomic() should handle PREEMPT_ACTIVE.

schedule_debug(), for example, relies on in_atomic_preempt_off(), which
wants preemption disabled through PREEMPT_OFFSET. But we can tweak that.
This is the only context check I know of; let's hope there are no others
lurking.

Ah, and there is also the preempt-off tracer, which ignores the
PREEMPT_ACTIVE counts because they use __preempt_count_add/sub() and
used to be immediately followed by a preempt disable. So we need to use
the non-underscored version of preempt_count_add/sub() in
preempt_schedule*() to trace correctly. (Note though that ftrace only
records preempt_count() & PREEMPT_MASK, so it's going to record
preemption-disabled areas with a 0 pc.)

> As it is, you end up doing the preempt count things twice in the
> preempt_schedule() case: first the PREEMPT_ACTIVE count, and then the
> count inside __schedule().

Indeed, so I'm going to split the changes into two steps:

1) Disable preemption from schedule(), add PREEMPT_ACTIVE +
   PREEMPT_OFFSET from the preempt_schedule*() callers, make them use
   the non-underscored preempt_count_add(), and remove the
   preempt_disable() from __schedule(). That change should be safe, and
   it removes the overhead of the double preempt_disable().

2) Optional change: remove the PREEMPT_OFFSET from the
   preempt_schedule*() callers and tweak schedule_debug(). Having that
   as a separate patch makes it easier to ignore, drop or revert, as it
   looks like a sensitive change.
* Re: [PATCH 2/2] sched: Pull resched loop to __schedule() callers
From: Linus Torvalds @ 2014-12-15 19:16 UTC
To: Frederic Weisbecker; +Cc: Ingo Molnar, LKML, Peter Zijlstra

On Mon, Dec 15, 2014 at 8:32 AM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
>
> But it may have an impact on some context checkers that rely on in_atomic*()
> which ignore the PREEMPT_ACTIVE value. It shouldn't ignore that though but I
> guess it's a hack for some specific situation.

I think we should remove it. The only reason for it is the scheduler
itself, which used to have the in_atomic() check (ok, it still has it,
it's just called "in_atomic_preempt_off()").

But yes, if we keep the "mask off PREEMPT_ACTIVE" in in_atomic(), then
we do need to update the counts with "PREEMPT_ACTIVE+1" instead. Or
something like that.

		Linus
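The masking behavior discussed above can be seen in a tiny model. The bit
values here are made-up stand-ins, not the kernel's real preempt_count
layout: since in_atomic() masks off PREEMPT_ACTIVE, an add of PREEMPT_ACTIVE
alone is invisible to it, while PREEMPT_ACTIVE + 1 is not.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for the PREEMPT_ACTIVE bit (not the real value). */
#define SIM_PREEMPT_ACTIVE	0x10000000u

static unsigned int sim_preempt_count;

/* Model of in_atomic(): the PREEMPT_ACTIVE bit is masked off, so it only
 * sees the "real" disable counts below it. */
static bool sim_in_atomic(void)
{
	return (sim_preempt_count & ~SIM_PREEMPT_ACTIVE) != 0;
}
```

This is why keeping the mask in in_atomic() forces the callers to add
"PREEMPT_ACTIVE + 1": the extra 1 is what context checkers actually observe.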