Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Nicholas Miell <nmiell@comcast.net>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Ingo Molnar <mingo@redhat.com>,
	Alan Cox <gnomes@lxorguk.ukuu.org.uk>,
	Lai Jiangshan <laijs@cn.fujitsu.com>,
	Stephen Hemminger <stephen@networkplumber.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Josh Triplett <josh@joshtriplett.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	David Howells <dhowells@redhat.com>,
	Nick Piggin <npiggin@kernel.dk>
Subject: Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
Date: Mon, 16 Mar 2015 15:43:56 +0000 (UTC)	[thread overview]
Message-ID: <1203077851.9491.1426520636551.JavaMail.zimbra@efficios.com> (raw)
In-Reply-To: <20150316141939.GE21418@twins.programming.kicks-ass.net>

----- Original Message -----
> From: "Peter Zijlstra" <peterz@infradead.org>
> To: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
> Cc: linux-kernel@vger.kernel.org, "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com>, "Steven Rostedt"
> <rostedt@goodmis.org>, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>, "Nicholas Miell" <nmiell@comcast.net>,
> "Linus Torvalds" <torvalds@linux-foundation.org>, "Ingo Molnar" <mingo@redhat.com>, "Alan Cox"
> <gnomes@lxorguk.ukuu.org.uk>, "Lai Jiangshan" <laijs@cn.fujitsu.com>, "Stephen Hemminger"
> <stephen@networkplumber.org>, "Andrew Morton" <akpm@linux-foundation.org>, "Josh Triplett" <josh@joshtriplett.org>,
> "Thomas Gleixner" <tglx@linutronix.de>, "David Howells" <dhowells@redhat.com>, "Nick Piggin" <npiggin@kernel.dk>
> Sent: Monday, March 16, 2015 10:19:39 AM
> Subject: Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
> 
> On Sun, Mar 15, 2015 at 03:24:19PM -0400, Mathieu Desnoyers wrote:
> 
> TL;DR
> 
> > --- a/arch/x86/include/asm/mmu_context.h
> > +++ b/arch/x86/include/asm/mmu_context.h
> > @@ -45,6 +45,16 @@ static inline void switch_mm(struct mm_struct *prev,
> > struct mm_struct *next,
> >  #endif
> >  		cpumask_set_cpu(cpu, mm_cpumask(next));
> >  
> > +		/*
> > +		 * smp_mb() between mm_cpumask set and following memory
> > +		 * accesses to user-space addresses is required by
> > +		 * sys_membarrier(). A smp_mb() is also needed between
> > +		 * prior memory accesses and mm_cpumask clear. This
> > +		 * ensures that all user-space address memory accesses
> > +		 * performed by the current thread are in program order
> > +		 * when the mm_cpumask is set. Implied by load_cr3.
> > +		 */
> > +
> >  		/* Re-load page tables */
> >  		load_cr3(next->pgd);
> >  		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
> > @@ -82,6 +92,13 @@ static inline void switch_mm(struct mm_struct *prev,
> > struct mm_struct *next,
> >  			 * We were in lazy tlb mode and leave_mm disabled
> >  			 * tlb flush IPI delivery. We must reload CR3
> >  			 * to make sure to use no freed page tables.
> > +			 *
> > +			 * smp_mb() between mm_cpumask set and memory accesses
> > +			 * to user-space addresses is required by
> > +			 * sys_membarrier(). This ensures that all user-space
> > +			 * address memory accesses performed by the current
> > +			 * thread are in program order when the mm_cpumask is
> > +			 * set. Implied by load_cr3.
> >  			 */
> >  			load_cr3(next->pgd);
> >  			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
> 
> 
> In both cases the cpumask_set_cpu() will also imply a MB.

I'm probably missing what exactly in cpumask_set_cpu()
implies this guarantee. cpumask_set_cpu() uses set_bit().
On x86, set_bit is indeed implemented with a lock-prefixed
orb or bts. However, the comment above set_bit() states:

 * Note: there are no guarantees that this function will not be reordered
 * on non x86 architectures, so if you are writing portable code,
 * make sure not to rely on its reordering guarantees.

And it states nothing about memory barriers. Typically,
atomic ops that imply memory barriers always return
something (xchg, cmpxchg, add_return). Ops like atomic_add
do not imply barriers.

> 
> > +enum {
> > +	/*
> > +	 * Private flag set: only synchronize across a single process. If this
> > +	 * flag is not set, it means "shared": synchronize across multiple
> > +	 * processes.  The shared mode is useful for shared memory mappings
> > +	 * across processes.
> > +	 */
> > +	MEMBARRIER_PRIVATE_FLAG = (1 << 0),
> > +
> > +	/*
> > +	 * Expedited flag set: adds some overhead, fast execution (few
> > +	 * microseconds).  If this flag is not set, it means "delayed": low
> > +	 * overhead, but slow execution (few milliseconds).
> > +	 */
> > +	MEMBARRIER_EXPEDITED_FLAG = (1 << 1),
> 
> 
> I suppose this is an unprivileged syscall; so what do we do about:
> 
> 	for (;;)
> 		sys_membar(EXPEDITED);
> 
> Which would spray the entire system with IPIs at break neck speed.

Currently, combining EXPEDITED with non-PRIVATE returns -EINVAL.
Therefore, if someone cares about issuing barriers on the entire
system, the only option is to use non-EXPEDITED, which rely on
synchronize_rcu().

The only way to invoke expedited barriers in a loop is:

for (;;)
        sys_membarrier(MEMBARRIER_EXPEDITED | MEMBARRIER_PRIVATE);

Which will only send IPIs to the CPU running threads from the same
process.

> 
> > +static void membarrier_ipi(void *unused)
> > +{
> > +	/* Order memory accesses with respects to sys_membarrier caller. */
> > +	smp_mb();
> > +}
> > +
> > +/*
> > + * Handle out-of-memory by sending per-cpu IPIs instead.
> > + */
> > +static void membarrier_fallback(void)
> > +{
> > +	struct mm_struct *mm;
> > +	int cpu;
> > +
> > +	for_each_cpu(cpu, mm_cpumask(current->mm)) {
> > +		raw_spin_lock_irq(&cpu_rq(cpu)->lock);
> > +		mm = cpu_curr(cpu)->mm;
> > +		raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
> > +		if (current->mm == mm)
> > +			smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
> > +	}
> > +}
> 
> > +static void membarrier_expedited(void)
> > +{
> > +	struct mm_struct *mm;
> > +	cpumask_var_t tmpmask;
> > +	int cpu;
> > +
> > +	/*
> > +	 * Memory barrier on the caller thread between previous memory accesses
> > +	 * to user-space addresses and sending memory-barrier IPIs. Orders all
> > +	 * user-space address memory accesses prior to sys_membarrier() before
> > +	 * mm_cpumask read and membarrier_ipi executions. This barrier is paired
> > +	 * with memory barriers in:
> > +	 * - membarrier_ipi() (for each running threads of the current process)
> > +	 * - switch_mm() (ordering scheduler mm_cpumask update wrt memory
> > +	 *                accesses to user-space addresses)
> > +	 * - Each CPU ->mm update performed with rq lock held by the scheduler.
> > +	 *   A memory barrier is issued each time ->mm is changed while the rq
> > +	 *   lock is held.
> > +	 */
> > +	smp_mb();
> > +	if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
> > +		membarrier_fallback();
> > +		goto out;
> > +	}
> > +	cpumask_copy(tmpmask, mm_cpumask(current->mm));
> > +	preempt_disable();
> > +	cpumask_clear_cpu(smp_processor_id(), tmpmask);
> > +	for_each_cpu(cpu, tmpmask) {
> > +		raw_spin_lock_irq(&cpu_rq(cpu)->lock);
> > +		mm = cpu_curr(cpu)->mm;
> > +		raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
> > +		if (current->mm != mm)
> > +			cpumask_clear_cpu(cpu, tmpmask);
> > +	}
> > +	smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);
> > +	preempt_enable();
> > +	free_cpumask_var(tmpmask);
> > +out:
> > +	/*
> > +	 * Memory barrier on the caller thread between sending & waiting for
> > +	 * memory-barrier IPIs and following memory accesses to user-space
> > +	 * addresses. Orders mm_cpumask read and membarrier_ipi executions
> > +	 * before all user-space address memory accesses following
> > +	 * sys_membarrier(). This barrier is paired with memory barriers in:
> > +	 * - membarrier_ipi() (for each running threads of the current process)
> > +	 * - switch_mm() (ordering scheduler mm_cpumask update wrt memory
> > +	 *                accesses to user-space addresses)
> > +	 * - Each CPU ->mm update performed with rq lock held by the scheduler.
> > +	 *   A memory barrier is issued each time ->mm is changed while the rq
> > +	 *   lock is held.
> > +	 */
> > +	smp_mb();
> > +}
> 
> Did you just write:
> 
> bool membar_cpu_is_mm(int cpu, void *info)
> {
> 	struct mm_struct *mm = info;
> 	struct rq *rq = cpu_rq(cpu);
> 	bool ret;
> 
> 	raw_spin_lock_irq(&rq->lock);
> 	ret = rq->curr->mm == mm;
> 	raw_spin_unlock_irq(&rq->lock);
> 
> 	return ret;
> }
> 
> 	on_each_cpu_cond(membar_cpu_is_mm, membar_ipi, mm, 1, GFP_NOWAIT);
> 

It is very similar indeed! The main difference is that my implementation
was starting from a copy of mm_cpumask(current->mm) and clearing the CPUs
for which TLB shootdown is simply pending (confirmed by taking the rq lock
and checking cpu_curr(cpu)->mm against current->mm).

Now that you mention this, I think we don't really need to use
mm_cpumask(current->mm) at all. Just iterating on each cpu, taking
the rq lock, and comparing the mm should be enough. This would
remove the need to rely on having extra memory barriers around
set/clear of the mm cpumask.

The main reason why I did not use on_each_cpu_cond() was that it did
not exist back in 2010. ;-)

> On which; I absolutely hate that rq->lock thing in there. What is
> 'wrong' with doing a lockless compare there? Other than not actually
> being able to deref rq->curr of course, but we need to fix that anyhow.

If we can make sure rq->curr deref could be done without holding the rq
lock, then I think all we would need is to ensure that updates to rq->curr
are surrounded by memory barriers. Therefore, we would have the following:

* When a thread is scheduled out, a memory barrier would be issued before
  rq->curr is updated to the next thread task_struct.

* Before a thread is scheduled in, a memory barrier needs to be issued
  after rq->curr is updated to the incoming thread.

In order to be able to dereference rq->curr->mm without holding the
rq->lock, do you envision we should protect task reclaim with RCU-sched ?

Thanks!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

next prev parent reply	other threads:[~2015-03-16 15:44 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-03-15 19:24 [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12) Mathieu Desnoyers
2015-03-15 22:05 ` Paul E. McKenney
2015-03-16  3:25 ` Josh Triplett
2015-03-16 13:00   ` Mathieu Desnoyers
2015-03-16 14:19 ` Peter Zijlstra
2015-03-16 14:24   ` Steven Rostedt
2015-03-16 15:49     ` Mathieu Desnoyers
2015-03-16 15:49     ` Paul E. McKenney
2015-03-16 16:12       ` Steven Rostedt
2015-03-16 15:43   ` Mathieu Desnoyers [this message]
2015-03-16 15:57     ` Mathieu Desnoyers
2015-03-16 17:13     ` Peter Zijlstra
2015-03-16 17:21     ` Peter Zijlstra
2015-03-16 18:53       ` Mathieu Desnoyers
2015-03-16 20:54         ` Peter Zijlstra
2015-03-17  1:45           ` Mathieu Desnoyers
2015-03-17  2:26             ` Steven Rostedt
2015-03-17  6:40               ` Peter Zijlstra
2015-03-17 11:44                 ` Paul E. McKenney
2015-03-17 14:10                   ` Steven Rostedt
2015-03-17 16:35                     ` Paul E. McKenney
2015-03-17 12:46               ` Mathieu Desnoyers
2015-03-18  1:06                 ` Steven Rostedt
2015-03-17  6:30             ` Peter Zijlstra
2015-03-17 11:56               ` Paul E. McKenney
2015-03-17 12:01                 ` Paul E. McKenney
2015-03-17 13:13               ` Mathieu Desnoyers
2015-03-17 16:36                 ` Mathieu Desnoyers
2015-03-17 16:48                   ` Paul E. McKenney
2015-03-17 17:55                   ` josh
2015-03-17 16:37                 ` Peter Zijlstra
2015-03-17 16:49                   ` Paul E. McKenney
2015-03-17 17:00                     ` Peter Zijlstra
2015-03-16 17:24     ` Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1203077851.9491.1426520636551.JavaMail.zimbra@efficios.com \
    --to=mathieu.desnoyers@efficios.com \
    --cc=akpm@linux-foundation.org \
    --cc=dhowells@redhat.com \
    --cc=gnomes@lxorguk.ukuu.org.uk \
    --cc=josh@joshtriplett.org \
    --cc=kosaki.motohiro@jp.fujitsu.com \
    --cc=laijs@cn.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=nmiell@comcast.net \
    --cc=npiggin@kernel.dk \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=stephen@networkplumber.org \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox