public inbox for linux-kernel@vger.kernel.org
* [PATCH 0/2] sched: Preemption fixlet and cleanup
@ 2014-12-15  2:24 Frederic Weisbecker
  2014-12-15  2:24 ` [PATCH 1/2] sched: Fix missing preemption check in cond_resched() Frederic Weisbecker
  2014-12-15  2:24 ` [PATCH 2/2] sched: Pull resched loop to __schedule() callers Frederic Weisbecker
  0 siblings, 2 replies; 6+ messages in thread
From: Frederic Weisbecker @ 2014-12-15  2:24 UTC (permalink / raw)
  To: Ingo Molnar, Linus Torvalds; +Cc: LKML, Frederic Weisbecker, Peter Zijlstra

Here is the reworked fix for the missing preemption check that Linus
reported, plus the need_resched() loop pulled up to the __schedule() callers,
as also suggested by him.

Thanks.

Frederic Weisbecker (2):
  sched: Fix missing preemption check in cond_resched()
  sched: Pull resched loop to __schedule() callers

 kernel/sched/core.c | 51 ++++++++++++++++++++++++++-------------------------
 1 file changed, 26 insertions(+), 25 deletions(-)

-- 
2.1.3


^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 1/2] sched: Fix missing preemption check in cond_resched()
  2014-12-15  2:24 [PATCH 0/2] sched: Preemption fixlet and cleanup Frederic Weisbecker
@ 2014-12-15  2:24 ` Frederic Weisbecker
  2014-12-15  2:24 ` [PATCH 2/2] sched: Pull resched loop to __schedule() callers Frederic Weisbecker
  1 sibling, 0 replies; 6+ messages in thread
From: Frederic Weisbecker @ 2014-12-15  2:24 UTC (permalink / raw)
  To: Ingo Molnar, Linus Torvalds; +Cc: LKML, Frederic Weisbecker, Peter Zijlstra

If an interrupt fires in cond_resched() between the call to __schedule()
and the PREEMPT_ACTIVE count decrement, and that interrupt sets
TIF_NEED_RESCHED, the call to preempt_schedule_irq() will be ignored
due to the PREEMPT_ACTIVE count. This kind of scenario, where irq preemption
is delayed because we are interrupting a preempt-disabled area, is
usually fixed up once preemption is re-enabled, with an explicit
call to preempt_schedule().

This is what preempt_enable() does, but a raw preempt count decrement, as
performed by __preempt_count_sub(PREEMPT_ACTIVE), doesn't perform that
delayed preemption check. Therefore, when such a race happens, the
rescheduling is delayed until the next scheduler or preemption entry point.
This can be a problem for latency-sensitive workloads.

Let's fix that by consolidating cond_resched() with the preempt_schedule()
internals.
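
For illustration, here is roughly where the race sits, using an annotated
copy of the current __cond_resched() (comments added for this changelog,
not part of the patch):

	static void __cond_resched(void)
	{
		__preempt_count_add(PREEMPT_ACTIVE);
		__schedule();
		/*
		 * An irq firing right here can set TIF_NEED_RESCHED, but
		 * the preempt_schedule_irq() call on irq return is ignored
		 * because PREEMPT_ACTIVE keeps the preempt count non-zero...
		 */
		__preempt_count_sub(PREEMPT_ACTIVE);
		/*
		 * ...and this raw decrement, unlike preempt_enable(), never
		 * re-checks need_resched(), so the resched request stays
		 * pending until the next scheduler or preemption entry point.
		 */
	}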

Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-and-inspired-by: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 kernel/sched/core.c | 40 +++++++++++++++++++---------------------
 1 file changed, 19 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b5797b7..069a2d8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2877,6 +2877,21 @@ void __sched schedule_preempt_disabled(void)
 	preempt_disable();
 }
 
+static void preempt_schedule_common(void)
+{
+	do {
+		__preempt_count_add(PREEMPT_ACTIVE);
+		__schedule();
+		__preempt_count_sub(PREEMPT_ACTIVE);
+
+		/*
+		 * Check again in case we missed a preemption opportunity
+		 * between schedule and now.
+		 */
+		barrier();
+	} while (need_resched());
+}
+
 #ifdef CONFIG_PREEMPT
 /*
  * this is the entry point to schedule() from in-kernel preemption
@@ -2892,17 +2907,7 @@ asmlinkage __visible void __sched notrace preempt_schedule(void)
 	if (likely(!preemptible()))
 		return;
 
-	do {
-		__preempt_count_add(PREEMPT_ACTIVE);
-		__schedule();
-		__preempt_count_sub(PREEMPT_ACTIVE);
-
-		/*
-		 * Check again in case we missed a preemption opportunity
-		 * between schedule and now.
-		 */
-		barrier();
-	} while (need_resched());
+	preempt_schedule_common();
 }
 NOKPROBE_SYMBOL(preempt_schedule);
 EXPORT_SYMBOL(preempt_schedule);
@@ -4202,17 +4207,10 @@ SYSCALL_DEFINE0(sched_yield)
 	return 0;
 }
 
-static void __cond_resched(void)
-{
-	__preempt_count_add(PREEMPT_ACTIVE);
-	__schedule();
-	__preempt_count_sub(PREEMPT_ACTIVE);
-}
-
 int __sched _cond_resched(void)
 {
 	if (should_resched()) {
-		__cond_resched();
+		preempt_schedule_common();
 		return 1;
 	}
 	return 0;
@@ -4237,7 +4235,7 @@ int __cond_resched_lock(spinlock_t *lock)
 	if (spin_needbreak(lock) || resched) {
 		spin_unlock(lock);
 		if (resched)
-			__cond_resched();
+			preempt_schedule_common();
 		else
 			cpu_relax();
 		ret = 1;
@@ -4253,7 +4251,7 @@ int __sched __cond_resched_softirq(void)
 
 	if (should_resched()) {
 		local_bh_enable();
-		__cond_resched();
+		preempt_schedule_common();
 		local_bh_disable();
 		return 1;
 	}
-- 
2.1.3


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 2/2] sched: Pull resched loop to __schedule() callers
  2014-12-15  2:24 [PATCH 0/2] sched: Preemption fixlet and cleanup Frederic Weisbecker
  2014-12-15  2:24 ` [PATCH 1/2] sched: Fix missing preemption check in cond_resched() Frederic Weisbecker
@ 2014-12-15  2:24 ` Frederic Weisbecker
  2014-12-15  2:46   ` Linus Torvalds
  1 sibling, 1 reply; 6+ messages in thread
From: Frederic Weisbecker @ 2014-12-15  2:24 UTC (permalink / raw)
  To: Ingo Molnar, Linus Torvalds; +Cc: LKML, Frederic Weisbecker, Peter Zijlstra

__schedule() disables preemption during its job and re-enables it
afterward without doing a preemption check to avoid recursion.

But if an event requiring rescheduling happens after the context switch,
we need to check again whether a higher priority task needs the CPU.
A preempting irq can raise such a situation. To handle that, __schedule()
loops on need_resched().

But preempt_schedule_*() functions, which call __schedule(), also loop
on need_resched() to handle missed preempt irqs. Hence we end up with
the same loop happening twice.

Let's simplify that by making the need_resched() loop the responsibility
of all __schedule() callers.

There is a risk that the outer loop now handles reschedules that used
to be handled by the inner loop, with the added overhead of the caller
details (inc/dec of PREEMPT_ACTIVE, irq save/restore), but assuming those
inner rescheduling loops weren't too frequent, this shouldn't matter.
Especially since the whole preemption path now loses one loop in any case.
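
To make the duplication concrete, here is roughly how the two loops nest
after patch 1/2 (annotated sketch, comments added for this changelog):

	static void preempt_schedule_common(void)
	{
		do {
			__preempt_count_add(PREEMPT_ACTIVE);
			__schedule();		/* loops on need_resched() internally */
			__preempt_count_sub(PREEMPT_ACTIVE);
			barrier();
		} while (need_resched());	/* ...and so does the caller */
	}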

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
---
 kernel/sched/core.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 069a2d8..368c8f3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2748,6 +2748,10 @@ again:
  *          - explicit schedule() call
  *          - return from syscall or exception to user-space
  *          - return from interrupt-handler to user-space
+ *
+ * WARNING: all callers must re-check need_resched() afterward and reschedule
+ * accordingly in case an event triggered the need for rescheduling (such as
+ * an interrupt waking up a task) while preemption was disabled in __schedule().
  */
 static void __sched __schedule(void)
 {
@@ -2756,7 +2760,6 @@ static void __sched __schedule(void)
 	struct rq *rq;
 	int cpu;
 
-need_resched:
 	preempt_disable();
 	cpu = smp_processor_id();
 	rq = cpu_rq(cpu);
@@ -2821,8 +2824,6 @@ need_resched:
 	post_schedule(rq);
 
 	sched_preempt_enable_no_resched();
-	if (need_resched())
-		goto need_resched;
 }
 
 static inline void sched_submit_work(struct task_struct *tsk)
@@ -2842,7 +2843,9 @@ asmlinkage __visible void __sched schedule(void)
 	struct task_struct *tsk = current;
 
 	sched_submit_work(tsk);
-	__schedule();
+	do {
+		__schedule();
+	} while (need_resched());
 }
 EXPORT_SYMBOL(schedule);
 
-- 
2.1.3


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH 2/2] sched: Pull resched loop to __schedule() callers
  2014-12-15  2:24 ` [PATCH 2/2] sched: Pull resched loop to __schedule() callers Frederic Weisbecker
@ 2014-12-15  2:46   ` Linus Torvalds
  2014-12-15 16:32     ` Frederic Weisbecker
  0 siblings, 1 reply; 6+ messages in thread
From: Linus Torvalds @ 2014-12-15  2:46 UTC (permalink / raw)
  To: Frederic Weisbecker; +Cc: Ingo Molnar, LKML, Peter Zijlstra

On Sun, Dec 14, 2014 at 6:24 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> -need_resched:
>         preempt_disable();
>         cpu = smp_processor_id();
>         rq = cpu_rq(cpu);
> @@ -2821,8 +2824,6 @@ need_resched:
>         post_schedule(rq);
>
>         sched_preempt_enable_no_resched();
> -       if (need_resched())
> -               goto need_resched;


So my suggestion was to move the
"preempt_disable()/enable_no_resched()" into the callers too.

Everybody already does that - except for "schedule()" itself. So that
would involve adding it here too:

> @@ -2842,7 +2843,9 @@ asmlinkage __visible void __sched schedule(void)
>         struct task_struct *tsk = current;
>
>         sched_submit_work(tsk);
> -       __schedule();
> +       do {
preempt_disable();
> +               __schedule();
sched_preempt_enable_no_resched();
> +       } while (need_resched());
>  }

Hmm?

That would mean that we have one less preempt update in the
__sched_preempt() cases. If somebody cares about the preempt counter
value, we'd have to increment the preempt count add values too, i.e. do

    __preempt_count_add(PREEMPT_ACTIVE+1);

there, but I'm not convinced anybody cares about the exact count.

As it is, you end up doing the preempt count things twice in the
__sched_preempt() case: first the PREEMPT_ACTIVE count, and then the
count inside __schedule().

Hmm?

                     Linus

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 2/2] sched: Pull resched loop to __schedule() callers
  2014-12-15  2:46   ` Linus Torvalds
@ 2014-12-15 16:32     ` Frederic Weisbecker
  2014-12-15 19:16       ` Linus Torvalds
  0 siblings, 1 reply; 6+ messages in thread
From: Frederic Weisbecker @ 2014-12-15 16:32 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Ingo Molnar, LKML, Peter Zijlstra

On Sun, Dec 14, 2014 at 06:46:27PM -0800, Linus Torvalds wrote:
> On Sun, Dec 14, 2014 at 6:24 PM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > -need_resched:
> >         preempt_disable();
> >         cpu = smp_processor_id();
> >         rq = cpu_rq(cpu);
> > @@ -2821,8 +2824,6 @@ need_resched:
> >         post_schedule(rq);
> >
> >         sched_preempt_enable_no_resched();
> > -       if (need_resched())
> > -               goto need_resched;
> 
> 
> So my suggestion was to move the
> "preempt_disable()/enable_no_resched()" into the callers too.
> 
> Everybody already does that - except for "schedule()" itself. So that
> would involve adding it here too:
> 
> > @@ -2842,7 +2843,9 @@ asmlinkage __visible void __sched schedule(void)
> >         struct task_struct *tsk = current;
> >
> >         sched_submit_work(tsk);
> > -       __schedule();
> > +       do {
> preempt_disable();
> > +               __schedule();
> sched_preempt_enable_no_resched();
> > +       } while (need_resched());
> >  }
> 
> Hmm?

Indeed!

> 
> That would mean that we have one less preempt update in the
> __sched_preempt() cases. If somebody cares about the preempt counter
> value, we'd have to increment the preempt count add values too, i.e. do
> 
>     __preempt_count_add(PREEMPT_ACTIVE+1);
> 
> there, but I'm not convinced anybody cares about the exact count.

It _looks_ fine to only add PREEMPT_ACTIVE without PREEMPT_OFFSET. I guess
we are only interested in avoiding preemption.

But it may have an impact on some context checkers that rely on in_atomic*(),
which ignores the PREEMPT_ACTIVE value. It shouldn't ignore it, but I guess
that's a hack for some specific situation. Now, if we do what we plan,
PREEMPT_ACTIVE won't imply PREEMPT_OFFSET anymore, so in_atomic() will have
to take PREEMPT_ACTIVE into account.

schedule_debug() for example relies on in_atomic_preempt_off() which wants
preemption disabled through PREEMPT_OFFSET. But we can tweak that.

This is the only context check I know of; let's hope there are no others lurking.
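
For reference, the checks in question look roughly like this (quoting
preempt_mask.h from memory, so the exact names should be double checked):

	#define in_atomic()	((preempt_count() & ~PREEMPT_ACTIVE) != 0)
	#define in_atomic_preempt_off() \
			((preempt_count() & ~PREEMPT_ACTIVE) != PREEMPT_CHECK_OFFSET)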

Ah, and there is also the preempt-off tracer, which ignores the PREEMPT_ACTIVE
counts because they are added with __preempt_count_add/sub() and they used to be
immediately followed by a preempt disable. So we need to use the non-underscored
version of preempt_count_add/sub() in preempt_schedule*() to trace correctly
(note though that ftrace only records preempt_count() & PREEMPT_MASK, so it's
going to record the preemption-disabled area with a pc of 0).

> As it is, you end up doing the preempt count things twice in the
> __sched_preempt() case: first the PREEMPT_ACTIVE count, and then the
> count inside __schedule().

Indeed, so I'm going to split the changes into two steps:

1) Disable preemption from schedule(), add PREEMPT_ACTIVE + PREEMPT_OFFSET from
the preempt_schedule*() callers, make them use the non-underscored
preempt_count_add(), and remove the preempt_disable() from __schedule()
(rough sketch below, after step 2). That change should be safe and it removes
the overhead of the double preempt_disable().

2) Optional change: remove the PREEMPT_OFFSET from the preempt_schedule*() callers
and tweak schedule_debug(). Having that as a separate patch makes it easier to
ignore, drop or revert, as it looks like a sensitive change.
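
As a rough, untested sketch of what step 1) would do to preempt_schedule_common()
(just to illustrate the plan, not a patch):

	static void preempt_schedule_common(void)
	{
		do {
			/*
			 * Cover PREEMPT_ACTIVE plus the preempt_disable() that
			 * __schedule() would no longer do itself; non-underscored
			 * so the preempt-off tracer sees it.
			 */
			preempt_count_add(PREEMPT_ACTIVE + PREEMPT_OFFSET);
			__schedule();
			preempt_count_sub(PREEMPT_ACTIVE + PREEMPT_OFFSET);
			barrier();
		} while (need_resched());
	}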

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 2/2] sched: Pull resched loop to __schedule() callers
  2014-12-15 16:32     ` Frederic Weisbecker
@ 2014-12-15 19:16       ` Linus Torvalds
  0 siblings, 0 replies; 6+ messages in thread
From: Linus Torvalds @ 2014-12-15 19:16 UTC (permalink / raw)
  To: Frederic Weisbecker; +Cc: Ingo Molnar, LKML, Peter Zijlstra

On Mon, Dec 15, 2014 at 8:32 AM, Frederic Weisbecker <fweisbec@gmail.com> wrote:
>
> But it may have an impact on some context checkers that rely on in_atomic*()
> which ignore the PREEMPT_ACTIVE value. It shouldn't ignore that though but I
> guess it's a hack for some specific situation.

I think we should remove it. The only reason for it is the scheduler
itself, which used to have the in_atomic() check (ok, still has, it's
just called "in_atomic_preempt_off()").

But yes, if we keep the "mask off PREEMPT_ACTIVE" in in_atomic(), then
we do need to update the counts with "PREEMPT_ACTIVE+1" instead. Or
something like that.

                Linus

^ permalink raw reply	[flat|nested] 6+ messages in thread
