Re: Too many rescheduling interrupts (still!)


From: Peter Zijlstra <peterz@infradead.org>
To: Andy Lutomirski <luto@amacapital.net>
Cc: Thomas Gleixner <tglx@linutronix.de>,
	Mike Galbraith <bitbucket@online.de>, X86 ML <x86@kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: Too many rescheduling interrupts (still!)
Date: Wed, 12 Feb 2014 17:39:16 +0100
Message-ID: <20140212163916.GA27965@twins.programming.kicks-ass.net>
In-Reply-To: <CALCETrW-GCNuqSTO=p==0gPE2c8wZKSe98Gi4PP2NRdgk5iKag@mail.gmail.com>

On Wed, Feb 12, 2014 at 07:49:07AM -0800, Andy Lutomirski wrote:
> On Wed, Feb 12, 2014 at 2:13 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, Feb 11, 2014 at 02:34:11PM -0800, Andy Lutomirski wrote:
> >> On Tue, Feb 11, 2014 at 1:21 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >> >> A small number of reschedule interrupts appear to be due to a race:
> >> >> both resched_task and wake_up_idle_cpu do, essentially:
> >> >>
> >> >> set_tsk_need_resched(t);
> >> >> smp_mb();
> >> >> if (!tsk_is_polling(t))
> >> >>   smp_send_reschedule(cpu);
> >> >>
> >> >> The problem is that set_tsk_need_resched wakes the CPU and, if the CPU
> >> >> is too quick (which isn't surprising if it was in C0 or C1), then it
> >> >> could *clear* TS_POLLING before tsk_is_polling is read.
> >
> > Yeah we have the wrong default for the idle loops.. it should default to
> > polling and only switch to !polling at the very last moment if it really
> > needs an interrupt to wake.
> 
> I might be missing something, but won't that break the scheduler? 

Only for the idle task; all other tasks will have it !polling.

But note how the current generic idle loop does:

	if (!current_clr_polling_and_test()) {
		...
		if (cpuidle_idle_call())
			arch_cpu_idle();
		...
	}

This means that it still runs a metric ton of code, right up to the
mwait with !polling, and then at the mwait we switch it back to polling.

Completely daft.

> Since rq->lock is held, the resched calls could check the rq state
> (curr == idle, maybe) to distinguish these cases.

Not enough; but I'm afraid I confused you with the above.

My suggestion was really more that we should call into the cpuidle/arch
idle code with polling set, and only right before we hit hlt/wfi/etc..
should we clear the polling bit.

> > It can't we're holding its rq->lock.
> 
> Exactly.  AFAICT the only reason that any of this code holds rq->lock
> (especially ttwu_queue_remote, which I seem to call a few thousand
> times per second) is because the only way to make a cpu reschedule
> involves playing with per-task flags.  If the flags were per-rq or
> per-cpu instead, then rq->lock wouldn't be needed.  If this were all
> done locklessly, then I think either a full cmpxchg or some fairly
> careful use of full barriers would be needed, but I bet that cmpxchg
> is still considerably faster than a spinlock plus a set_bit.

Ahh, that's what you're saying. Yes we should be able to do something
clever there.

Something like the below is, I think, as close as we can come without
major surgery and without moving TIF_NEED_RESCHED and POLLING into a
per-cpu variable.

I might have messed it up though; brain seems to have given out for the
day :/

---
 kernel/sched/core.c  | 17 +++++++++++++----
 kernel/sched/idle.c  | 21 +++++++++++++--------
 kernel/sched/sched.h |  5 ++++-
 3 files changed, 30 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fb9764fbc537..a5b64040c21d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -529,7 +529,7 @@ void resched_task(struct task_struct *p)
 	}
 
 	/* NEED_RESCHED must be visible before we test polling */
-	smp_mb();
+	smp_mb__after_clear_bit();
 	if (!tsk_is_polling(p))
 		smp_send_reschedule(cpu);
 }
@@ -1476,12 +1476,15 @@ static int ttwu_remote(struct task_struct *p, int wake_flags)
 }
 
 #ifdef CONFIG_SMP
-static void sched_ttwu_pending(void)
+void sched_ttwu_pending(void)
 {
 	struct rq *rq = this_rq();
 	struct llist_node *llist = llist_del_all(&rq->wake_list);
 	struct task_struct *p;
 
+	if (!llist)
+		return;
+
 	raw_spin_lock(&rq->lock);
 
 	while (llist) {
@@ -1536,8 +1539,14 @@ void scheduler_ipi(void)
 
 static void ttwu_queue_remote(struct task_struct *p, int cpu)
 {
-	if (llist_add(&p->wake_entry, &cpu_rq(cpu)->wake_list))
-		smp_send_reschedule(cpu);
+	struct rq *rq = cpu_rq(cpu);
+
+	if (llist_add(&p->wake_entry, &rq->wake_list)) {
+		set_tsk_need_resched(rq->idle);
+		smp_mb__after_clear_bit();
+		if (!tsk_is_polling(rq->idle) || rq->curr != rq->idle)
+			smp_send_reschedule(cpu);
+	}
 }
 
 bool cpus_share_cache(int this_cpu, int that_cpu)
diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index 14ca43430aee..bd8ed2d2f2f7 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -105,19 +105,24 @@ static void cpu_idle_loop(void)
 				} else {
 					local_irq_enable();
 				}
-				__current_set_polling();
 			}
 			arch_cpu_idle_exit();
-			/*
-			 * We need to test and propagate the TIF_NEED_RESCHED
-			 * bit here because we might not have send the
-			 * reschedule IPI to idle tasks.
-			 */
-			if (tif_need_resched())
-				set_preempt_need_resched();
 		}
+
+		/*
+		 * We must clear polling before running sched_ttwu_pending().
+		 * Otherwise it becomes possible to have entries added in
+		 * ttwu_queue_remote() and still not get an IPI to process
+		 * them.
+		 */
+		__current_clr_polling();
+
+		set_preempt_need_resched();
+		sched_ttwu_pending();
+
 		tick_nohz_idle_exit();
 		schedule_preempt_disabled();
+		__current_set_polling();
 	}
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1bf34c257d3b..b59dbdb135d8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1157,9 +1157,10 @@ extern const struct sched_class rt_sched_class;
 extern const struct sched_class fair_sched_class;
 extern const struct sched_class idle_sched_class;
 
-
 #ifdef CONFIG_SMP
 
+extern void sched_ttwu_pending(void);
+
 extern void update_group_power(struct sched_domain *sd, int cpu);
 
 extern void trigger_load_balance(struct rq *rq);
@@ -1170,6 +1171,8 @@ extern void idle_exit_fair(struct rq *this_rq);
 
 #else	/* CONFIG_SMP */
 
+static inline void sched_ttwu_pending(void) { }
+
 static inline void idle_balance(int cpu, struct rq *rq)
 {
 }
