Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: linux-kernel@vger.kernel.org,
	KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Nicholas Miell <nmiell@comcast.net>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Ingo Molnar <mingo@redhat.com>,
	Alan Cox <gnomes@lxorguk.ukuu.org.uk>,
	Lai Jiangshan <laijs@cn.fujitsu.com>,
	Stephen Hemminger <stephen@networkplumber.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Josh Triplett <josh@joshtriplett.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	David Howells <dhowells@redhat.com>,
	Nick Piggin <npiggin@kernel.dk>
Subject: Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
Date: Mon, 16 Mar 2015 15:43:56 +0000 (UTC)	[thread overview]
Message-ID: <1203077851.9491.1426520636551.JavaMail.zimbra@efficios.com> (raw)
In-Reply-To: <20150316141939.GE21418@twins.programming.kicks-ass.net>

----- Original Message -----
> From: "Peter Zijlstra" <peterz@infradead.org>
> To: "Mathieu Desnoyers" <mathieu.desnoyers@efficios.com>
> Cc: linux-kernel@vger.kernel.org, "KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com>, "Steven Rostedt"
> <rostedt@goodmis.org>, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>, "Nicholas Miell" <nmiell@comcast.net>,
> "Linus Torvalds" <torvalds@linux-foundation.org>, "Ingo Molnar" <mingo@redhat.com>, "Alan Cox"
> <gnomes@lxorguk.ukuu.org.uk>, "Lai Jiangshan" <laijs@cn.fujitsu.com>, "Stephen Hemminger"
> <stephen@networkplumber.org>, "Andrew Morton" <akpm@linux-foundation.org>, "Josh Triplett" <josh@joshtriplett.org>,
> "Thomas Gleixner" <tglx@linutronix.de>, "David Howells" <dhowells@redhat.com>, "Nick Piggin" <npiggin@kernel.dk>
> Sent: Monday, March 16, 2015 10:19:39 AM
> Subject: Re: [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12)
> 
> On Sun, Mar 15, 2015 at 03:24:19PM -0400, Mathieu Desnoyers wrote:
> 
> TL;DR
> 
> > --- a/arch/x86/include/asm/mmu_context.h
> > +++ b/arch/x86/include/asm/mmu_context.h
> > @@ -45,6 +45,16 @@ static inline void switch_mm(struct mm_struct *prev,
> > struct mm_struct *next,
> >  #endif
> >  		cpumask_set_cpu(cpu, mm_cpumask(next));
> >  
> > +		/*
> > +		 * smp_mb() between mm_cpumask set and following memory
> > +		 * accesses to user-space addresses is required by
> > +		 * sys_membarrier(). A smp_mb() is also needed between
> > +		 * prior memory accesses and mm_cpumask clear. This
> > +		 * ensures that all user-space address memory accesses
> > +		 * performed by the current thread are in program order
> > +		 * when the mm_cpumask is set. Implied by load_cr3.
> > +		 */
> > +
> >  		/* Re-load page tables */
> >  		load_cr3(next->pgd);
> >  		trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
> > @@ -82,6 +92,13 @@ static inline void switch_mm(struct mm_struct *prev,
> > struct mm_struct *next,
> >  			 * We were in lazy tlb mode and leave_mm disabled
> >  			 * tlb flush IPI delivery. We must reload CR3
> >  			 * to make sure to use no freed page tables.
> > +			 *
> > +			 * smp_mb() between mm_cpumask set and memory accesses
> > +			 * to user-space addresses is required by
> > +			 * sys_membarrier(). This ensures that all user-space
> > +			 * address memory accesses performed by the current
> > +			 * thread are in program order when the mm_cpumask is
> > +			 * set. Implied by load_cr3.
> >  			 */
> >  			load_cr3(next->pgd);
> >  			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
> 
> 
> In both cases the cpumask_set_cpu() will also imply a MB.

I'm probably missing what exactly in cpumask_set_cpu()
implies this guarantee. cpumask_set_cpu() uses set_bit().
On x86, set_bit is indeed implemented with a lock-prefixed
orb or bts. However, the comment above set_bit() states:

 * Note: there are no guarantees that this function will not be reordered
 * on non x86 architectures, so if you are writing portable code,
 * make sure not to rely on its reordering guarantees.

And it states nothing about memory barriers. Typically,
atomic ops that imply memory barriers always return
something (xchg, cmpxchg, add_return). Ops like atomic_add
do not imply barriers.

> 
> > +enum {
> > +	/*
> > +	 * Private flag set: only synchronize across a single process. If this
> > +	 * flag is not set, it means "shared": synchronize across multiple
> > +	 * processes.  The shared mode is useful for shared memory mappings
> > +	 * across processes.
> > +	 */
> > +	MEMBARRIER_PRIVATE_FLAG = (1 << 0),
> > +
> > +	/*
> > +	 * Expedited flag set: adds some overhead, fast execution (few
> > +	 * microseconds).  If this flag is not set, it means "delayed": low
> > +	 * overhead, but slow execution (few milliseconds).
> > +	 */
> > +	MEMBARRIER_EXPEDITED_FLAG = (1 << 1),
> 
> 
> I suppose this is an unprivileged syscall; so what do we do about:
> 
> 	for (;;)
> 		sys_membar(EXPEDITED);
> 
> Which would spray the entire system with IPIs at break neck speed.

Currently, combining EXPEDITED with non-PRIVATE returns -EINVAL.
Therefore, if someone cares about issuing barriers on the entire
system, the only option is to use non-EXPEDITED, which rely on
synchronize_rcu().

The only way to invoke expedited barriers in a loop is:

for (;;)
        sys_membarrier(MEMBARRIER_EXPEDITED | MEMBARRIER_PRIVATE);

Which will only send IPIs to the CPU running threads from the same
process.

> 
> > +static void membarrier_ipi(void *unused)
> > +{
> > +	/* Order memory accesses with respects to sys_membarrier caller. */
> > +	smp_mb();
> > +}
> > +
> > +/*
> > + * Handle out-of-memory by sending per-cpu IPIs instead.
> > + */
> > +static void membarrier_fallback(void)
> > +{
> > +	struct mm_struct *mm;
> > +	int cpu;
> > +
> > +	for_each_cpu(cpu, mm_cpumask(current->mm)) {
> > +		raw_spin_lock_irq(&cpu_rq(cpu)->lock);
> > +		mm = cpu_curr(cpu)->mm;
> > +		raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
> > +		if (current->mm == mm)
> > +			smp_call_function_single(cpu, membarrier_ipi, NULL, 1);
> > +	}
> > +}
> 
> > +static void membarrier_expedited(void)
> > +{
> > +	struct mm_struct *mm;
> > +	cpumask_var_t tmpmask;
> > +	int cpu;
> > +
> > +	/*
> > +	 * Memory barrier on the caller thread between previous memory accesses
> > +	 * to user-space addresses and sending memory-barrier IPIs. Orders all
> > +	 * user-space address memory accesses prior to sys_membarrier() before
> > +	 * mm_cpumask read and membarrier_ipi executions. This barrier is paired
> > +	 * with memory barriers in:
> > +	 * - membarrier_ipi() (for each running threads of the current process)
> > +	 * - switch_mm() (ordering scheduler mm_cpumask update wrt memory
> > +	 *                accesses to user-space addresses)
> > +	 * - Each CPU ->mm update performed with rq lock held by the scheduler.
> > +	 *   A memory barrier is issued each time ->mm is changed while the rq
> > +	 *   lock is held.
> > +	 */
> > +	smp_mb();
> > +	if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
> > +		membarrier_fallback();
> > +		goto out;
> > +	}
> > +	cpumask_copy(tmpmask, mm_cpumask(current->mm));
> > +	preempt_disable();
> > +	cpumask_clear_cpu(smp_processor_id(), tmpmask);
> > +	for_each_cpu(cpu, tmpmask) {
> > +		raw_spin_lock_irq(&cpu_rq(cpu)->lock);
> > +		mm = cpu_curr(cpu)->mm;
> > +		raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
> > +		if (current->mm != mm)
> > +			cpumask_clear_cpu(cpu, tmpmask);
> > +	}
> > +	smp_call_function_many(tmpmask, membarrier_ipi, NULL, 1);
> > +	preempt_enable();
> > +	free_cpumask_var(tmpmask);
> > +out:
> > +	/*
> > +	 * Memory barrier on the caller thread between sending & waiting for
> > +	 * memory-barrier IPIs and following memory accesses to user-space
> > +	 * addresses. Orders mm_cpumask read and membarrier_ipi executions
> > +	 * before all user-space address memory accesses following
> > +	 * sys_membarrier(). This barrier is paired with memory barriers in:
> > +	 * - membarrier_ipi() (for each running threads of the current process)
> > +	 * - switch_mm() (ordering scheduler mm_cpumask update wrt memory
> > +	 *                accesses to user-space addresses)
> > +	 * - Each CPU ->mm update performed with rq lock held by the scheduler.
> > +	 *   A memory barrier is issued each time ->mm is changed while the rq
> > +	 *   lock is held.
> > +	 */
> > +	smp_mb();
> > +}
> 
> Did you just write:
> 
> bool membar_cpu_is_mm(int cpu, void *info)
> {
> 	struct mm_struct *mm = info;
> 	struct rq *rq = cpu_rq(cpu);
> 	bool ret;
> 
> 	raw_spin_lock_irq(&rq->lock);
> 	ret = rq->curr->mm == mm;
> 	raw_spin_unlock_irq(&rq->lock);
> 
> 	return ret;
> }
> 
> 	on_each_cpu_cond(membar_cpu_is_mm, membar_ipi, mm, 1, GFP_NOWAIT);
> 

It is very similar indeed! The main difference is that my implementation
was starting from a copy of mm_cpumask(current->mm) and clearing the CPUs
for which TLB shootdown is simply pending (confirmed by taking the rq lock
and checking cpu_curr(cpu)->mm against current->mm).

Now that you mention this, I think we don't really need to use
mm_cpumask(current->mm) at all. Just iterating on each cpu, taking
the rq lock, and comparing the mm should be enough. This would
remove the need to rely on having extra memory barriers around
set/clear of the mm cpumask.

The main reason why I did not use on_each_cpu_cond() was that it did
not exist back in 2010. ;-)

> On which; I absolutely hate that rq->lock thing in there. What is
> 'wrong' with doing a lockless compare there? Other than not actually
> being able to deref rq->curr of course, but we need to fix that anyhow.

If we can make sure rq->curr deref could be done without holding the rq
lock, then I think all we would need is to ensure that updates to rq->curr
are surrounded by memory barriers. Therefore, we would have the following:

* When a thread is scheduled out, a memory barrier would be issued before
  rq->curr is updated to the next thread task_struct.

* Before a thread is scheduled in, a memory barrier needs to be issued
  after rq->curr is updated to the incoming thread.

In order to be able to dereference rq->curr->mm without holding the
rq->lock, do you envision we should protect task reclaim with RCU-sched ?

Thanks!

Mathieu

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

next prev parent reply	other threads:[~2015-03-16 15:44 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-03-15 19:24 [RFC PATCH] sys_membarrier(): system/process-wide memory barrier (x86) (v12) Mathieu Desnoyers
2015-03-15 22:05 ` Paul E. McKenney
2015-03-16  3:25 ` Josh Triplett
2015-03-16 13:00   ` Mathieu Desnoyers
2015-03-16 14:19 ` Peter Zijlstra
2015-03-16 14:24   ` Steven Rostedt
2015-03-16 15:49     ` Mathieu Desnoyers
2015-03-16 15:49     ` Paul E. McKenney
2015-03-16 16:12       ` Steven Rostedt
2015-03-16 15:43   ` Mathieu Desnoyers [this message]
2015-03-16 15:57     ` Mathieu Desnoyers
2015-03-16 17:13     ` Peter Zijlstra
2015-03-16 17:21     ` Peter Zijlstra
2015-03-16 18:53       ` Mathieu Desnoyers
2015-03-16 20:54         ` Peter Zijlstra
2015-03-17  1:45           ` Mathieu Desnoyers
2015-03-17  2:26             ` Steven Rostedt
2015-03-17  6:40               ` Peter Zijlstra
2015-03-17 11:44                 ` Paul E. McKenney
2015-03-17 14:10                   ` Steven Rostedt
2015-03-17 16:35                     ` Paul E. McKenney
2015-03-17 12:46               ` Mathieu Desnoyers
2015-03-18  1:06                 ` Steven Rostedt
2015-03-17  6:30             ` Peter Zijlstra
2015-03-17 11:56               ` Paul E. McKenney
2015-03-17 12:01                 ` Paul E. McKenney
2015-03-17 13:13               ` Mathieu Desnoyers
2015-03-17 16:36                 ` Mathieu Desnoyers
2015-03-17 16:48                   ` Paul E. McKenney
2015-03-17 17:55                   ` josh
2015-03-17 16:37                 ` Peter Zijlstra
2015-03-17 16:49                   ` Paul E. McKenney
2015-03-17 17:00                     ` Peter Zijlstra
2015-03-16 17:24     ` Peter Zijlstra

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1203077851.9491.1426520636551.JavaMail.zimbra@efficios.com \
    --to=mathieu.desnoyers@efficios.com \
    --cc=akpm@linux-foundation.org \
    --cc=dhowells@redhat.com \
    --cc=gnomes@lxorguk.ukuu.org.uk \
    --cc=josh@joshtriplett.org \
    --cc=kosaki.motohiro@jp.fujitsu.com \
    --cc=laijs@cn.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=nmiell@comcast.net \
    --cc=npiggin@kernel.dk \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=stephen@networkplumber.org \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.