From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org, mingo@kernel.org,
jiangshanlai@gmail.com, dipankar@in.ibm.com,
akpm@linux-foundation.org, mathieu.desnoyers@efficios.com,
josh@joshtriplett.org, tglx@linutronix.de, rostedt@goodmis.org,
dhowells@redhat.com, edumazet@google.com, fweisbec@gmail.com,
oleg@redhat.com, will.deacon@arm.com
Subject: Re: [PATCH tip/core/rcu 4/5] sys_membarrier: Add expedited option
Date: Thu, 27 Jul 2017 06:08:16 -0700 [thread overview]
Message-ID: <20170727130816.GP3730@linux.vnet.ibm.com> (raw)
In-Reply-To: <20170727083003.ivb2fr47vepa2g6t@hirez.programming.kicks-ass.net>
On Thu, Jul 27, 2017 at 10:30:03AM +0200, Peter Zijlstra wrote:
> On Wed, Jul 26, 2017 at 08:41:10AM -0700, Paul E. McKenney wrote:
> > On Wed, Jul 26, 2017 at 09:41:28AM +0200, Peter Zijlstra wrote:
> > > On Tue, Jul 25, 2017 at 04:59:36PM -0700, Paul E. McKenney wrote:
>
> > > Sure, but SCHED_OTHER auto throttles in that if there's anything else to
> > > run, you get to wait. So you can't generate an IPI storm with it. Also,
> > > again, we can be limited to a subset of CPUs.
> >
> > OK, what is its auto-throttle policy? One round of IPIs per jiffy or
> > some such?
>
> No. It's called wakeup latency :-) Your SCHED_OTHER task will not get to
> insta-run all the time. If there are other tasks already running, we'll
> not IPI unless it should preempt.
>
> If it's idle, nobody cares...
So it does IPI immediately sometimes.
> > Does this auto-throttling also apply if the user is running a CPU-bound
> > SCHED_BATCH or SCHED_IDLE task on each CPU, and periodically waking up
> > one of a large group of SCHED_OTHER tasks, where the SCHED_OTHER tasks
> > immediately sleep upon being awakened?
>
> SCHED_BATCH is even more likely to suffer wakeup latency since it will
> never preempt anything.
Ahem. In this scenario, SCHED_BATCH is already running on the CPU in
question, and a SCHED_OTHER task is awakened from some other CPU.
Do we IPI in that case?
> > OK, and the rq->curr assignment is in common code, correct? Does this
> > allow the IPI-only-requesting-process approach to live entirely within
> > common code?
>
> That is the idea.
>
> > The 2010 email thread ended up with sys_membarrier() acquiring the
> > runqueue lock for each CPU,
>
> Yes, that's something I'm not happy with. Machine wide banging of that
> lock will be a performance no-no.
Regardless of whether we end up acquiring runqueue locks, sending IPIs,
or carrying out some other distasteful operation, we should also be able
to throttle and batch these operations to at least some extent.
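
For concreteness, here is one way such throttling might look. This is
strictly a sketch: the function name, the ratelimit parameters, and the
fallback choice are all made up, and allocation, error handling, and the
IPI fan-out are elided.

	#include <linux/ratelimit.h>

	/*
	 * Illustrative only: permit at most 10 expedited rounds per
	 * second, otherwise fall back to the existing non-expedited
	 * slow path.
	 */
	static DEFINE_RATELIMIT_STATE(membarrier_rs, HZ, 10);

	static void membarrier_maybe_expedited(void)
	{
		if (!__ratelimit(&membarrier_rs)) {
			synchronize_sched(); /* throttled: slow path */
			return;
		}
		/* ... scan rq->curr and IPI the matching CPUs ... */
	}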
> > because doing otherwise meant adding code to the scheduler fastpath.
>
> And that's obviously another thing I'm not happy with either.
Nor should you or anyone be.
> > Don't we still need to do this?
> >
> > https://marc.info/?l=linux-kernel&m=126341138408407&w=2
> > https://marc.info/?l=linux-kernel&m=126349766324224&w=2
>
> I don't know.. those seem focussed on mm_cpumask() and we can't use that
> per Will's email.
>
> So I think we need to think anew on this, start from the ground up.
Probably from several points in the ground, but OK...
> What is missing for this:
>
> static void ipi_mb(void *info)
> {
> 	smp_mb(); // IPIs should be serializing but paranoid
> }
>
>
> sys_membarrier()
> {
> 	cpumask_var_t cpus;	// allocation/free elided
> 	int cpu;
>
> 	smp_mb(); // because sysenter isn't an unconditional mb
>
> 	for_each_online_cpu(cpu) {
> 		struct task_struct *p;
>
> 		rcu_read_lock();
> 		p = task_rcu_dereference(&cpu_curr(cpu));
> 		if (p && p->mm == current->mm)
> 			__cpumask_set_cpu(cpu, cpus);
> 		rcu_read_unlock();
> 	}
>
> 	on_each_cpu_mask(cpus, ipi_mb, NULL, 1); // does local smp_mb() too
> }
>
> VS
>
> __schedule()
> {
> 	spin_lock(&rq->lock);
> 	smp_mb__after_spinlock(); // really full mb implied
>
> 	/* lots */
>
> 	if (likely(prev != next)) {
>
> 		rq->curr = next;
>
> 		context_switch() {
> 			switch_mm();
> 			switch_to();
> 			// neither need imply a barrier
>
> 			spin_unlock(&rq->lock);
> 		}
> 	}
> }
>
>
>
>
> So I think we need either switch_mm() or switch_to() to imply a full
> barrier for this to work, otherwise we get:
>
> 	CPU0				CPU1
>
>
> 					lock rq->lock
> 					mb
>
> 					rq->curr = A
>
> 					unlock rq->lock
>
> 					lock rq->lock
> 					mb
>
> 	sys_membarrier()
>
> 	mb
>
> 	for_each_online_cpu()
> 	  p = A
> 	  // no match no IPI
>
> 	mb
> 					rq->curr = B
>
> 					unlock rq->lock
>
>
> And that's bad, because now CPU0 doesn't have an MB happening _after_
> sys_membarrier() if B matches.
Yes, this looks somewhat similar to the scenario that Mathieu pointed out
back in 2010: https://marc.info/?l=linux-kernel&m=126349766324224&w=2
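
In litmus-test form, the essence is the classic store-buffering pattern
with one of the two full barriers missing. Here P0 stands in for the
scheduler's unordered rq->curr update followed by B's first load, and
P1 for the sys_membarrier() caller. This is a hand-written sketch, not
mechanically checked:

	C membarrier-switch-sketch

	{}

	P0(int *curr, int *x)
	{
		int r0;

		WRITE_ONCE(*curr, 1);	/* rq->curr = B, no mb in switch_*() */
		r0 = READ_ONCE(*x);	/* B's first load after switching in */
	}

	P1(int *curr, int *x)
	{
		int r1;

		WRITE_ONCE(*x, 1);	/* caller's store before the syscall */
		smp_mb();		/* sys_membarrier()'s own barrier    */
		r1 = READ_ONCE(*curr);	/* scan sees old rq->curr, no IPI    */
	}

	exists (0:r0=0 /\ 1:r1=0)

The "exists" clause really can be satisfied when P0 lacks a full
barrier, which is exactly the no-IPI-and-no-ordering failure above.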
> So without audit, I only know of PPC and Alpha not having a barrier in
> either switch_*().
>
> x86 obviously has barriers all over the place, arm has a super duper
> heavy barrier in switch_to().
Agreed, if we are going to rely on ->mm, we need ordering on assignment
to it.
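
And if the audit turns up architectures that lack such ordering, the
obvious (if blunt) fix would be an arch-overridable barrier after the
assignment. A sketch only: the name smp_mb__after_rq_assign() is made
up, defaulting to smp_mb(), with architectures whose switch_*() already
implies a full barrier overriding it to a no-op:

	#ifndef smp_mb__after_rq_assign
	#define smp_mb__after_rq_assign()	smp_mb()
	#endif

	/* In __schedule(), immediately after the assignment: */
	rq->curr = next;
	smp_mb__after_rq_assign();	/* pairs with sys_membarrier()'s mb */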
> One option might be to resurrect spin_unlock_wait(), although to use
> that here is really ugly too, but it would avoid thrashing the
> rq->lock.
>
> I think it'd end up having to look like:
>
> 	struct task_struct *p, *q;
>
> 	rq = cpu_rq(cpu);
> again:
> 	rcu_read_lock();
> 	p = task_rcu_dereference(&rq->curr);
> 	if (p) {
> 		raw_spin_unlock_wait(&rq->lock);
> 		q = task_rcu_dereference(&rq->curr);
> 		if (q != p) {
> 			rcu_read_unlock();
> 			goto again;
> 		}
> 	}
> 	...
>
> which is just about as horrible as it looks.
It does indeed look a bit suboptimal.
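
For anyone following along from userspace, the existing syscall is
invoked as shown below. (The expedited command under discussion does
not exist yet, so this example uses the current MEMBARRIER_CMD_SHARED.)

	#include <linux/membarrier.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static int membarrier(int cmd, int flags)
	{
		return syscall(__NR_membarrier, cmd, flags);
	}

	int main(void)
	{
		int cmds = membarrier(MEMBARRIER_CMD_QUERY, 0);

		if (cmds < 0 || !(cmds & MEMBARRIER_CMD_SHARED))
			return 1; /* unsupported kernel */
		/* Full barrier on all running threads on return. */
		return membarrier(MEMBARRIER_CMD_SHARED, 0);
	}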
Thanx, Paul