All of lore.kernel.org
 help / color / mirror / Atom feed
From: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
To: Peter Zijlstra <peterz@infradead.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Boqun Feng <boqun.feng@gmail.com>, Andrew Hunter <ahh@google.com>,
	maged michael <maged.michael@gmail.com>,
	gromer <gromer@google.com>, Avi Kivity <avi@scylladb.com>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Paul Mackerras <paulus@samba.org>,
	Michael Ellerman <mpe@ellerman.id.au>
Subject: Re: [RFC PATCH v2] membarrier: expedited private command
Date: Fri, 28 Jul 2017 15:36:09 +0000 (UTC)	[thread overview]
Message-ID: <624649192.29564.1501256169885.JavaMail.zimbra@efficios.com> (raw)
In-Reply-To: <20170728085532.ylhuz2irwmgpmejv@hirez.programming.kicks-ass.net>

----- On Jul 28, 2017, at 4:55 AM, Peter Zijlstra peterz@infradead.org wrote:

> On Thu, Jul 27, 2017 at 05:13:14PM -0400, Mathieu Desnoyers wrote:
>> +static void membarrier_expedited_mb_after_set_current(struct mm_struct *mm,
>> +		struct mm_struct *oldmm)
> 
> That is a bit of a mouth-full...
> 
>> +{
>> +	if (!IS_ENABLED(CONFIG_MEMBARRIER))
>> +		return;
>> +	/*
>> +	 * __schedule()
>> +	 *   finish_task_switch()
>> +	 *    if (mm)
>> +	 *      mmdrop(mm)
>> +	 *        atomic_dec_and_test()
>         *
>> +	 * takes care of issuing a memory barrier when oldmm is
>> +	 * non-NULL. We also don't need the barrier when switching to a
>> +	 * kernel thread, nor when we switch between threads belonging
>> +	 * to the same process.
>> +	 */
>> +	if (likely(oldmm || !mm || mm == oldmm))
>> +		return;
>> +	/*
>> +	 * When switching between processes, membarrier expedited
>> +	 * private requires a memory barrier after we set the current
>> +	 * task.
>> +	 */
>> +	smp_mb();
>> +}
> 
> And because of what it complements, I would have expected the callsite:
> 
>> @@ -2737,6 +2763,7 @@ context_switch(struct rq *rq, struct task_struct *prev,
>>  
>>  	mm = next->mm;
>>  	oldmm = prev->active_mm;
>> +	membarrier_expedited_mb_after_set_current(mm, oldmm);
>>  	/*
>>  	 * For paravirt, this is coupled with an exit in switch_to to
>>  	 * combine the page table reload and the switch backend into
> 
> to be in finish_task_switch(), something like:
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e9785f7aed75..33f34a201255 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2641,8 +2641,18 @@ static struct rq *finish_task_switch(struct task_struct
> *prev)
> 	finish_arch_post_lock_switch();
> 
> 	fire_sched_in_preempt_notifiers(current);
> +
> +	/*
> +	 * For CONFIG_MEMBARRIER we need a full memory barrier after the
> +	 * rq->curr assignment. Not all architectures have one in either
> +	 * switch_to() or switch_mm() so we use (and complement) the one
> +	 * implied by mmdrop()'s atomic_dec_and_test().
> +	 */
> 	if (mm)
> 		mmdrop(mm);
> +	else if (IS_ENABLED(CONFIG_MEMBARRIER))
> +		smp_mb();
> +
> 	if (unlikely(prev_state == TASK_DEAD)) {
> 		if (prev->sched_class->task_dead)
> 			prev->sched_class->task_dead(prev);
> 
> 
> I realize this is sub-optimal if we're switching to a kernel thread, so
> it might want some work, then again, a whole bunch of architectures
> don't in fact need this extra barrier at all.

As discussed on IRC, I plan to go instead for:

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 01e3b881ab3a..dd677fb2ee92 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2636,6 +2636,11 @@ static struct rq *finish_task_switch(struct task_struct *
prev)
        vtime_task_switch(prev);
        perf_event_task_sched_in(prev, current);
        finish_lock_switch(rq, prev);
+       /*
+        * The membarrier system call requires a full memory barrier
+        * after storing to rq->curr, before going back to user-space.
+        */
+       smp_mb__after_unlock_lock();
        finish_arch_post_lock_switch();
 
        fire_sched_in_preempt_notifiers(current);

Which is free on most architectures, except those defining
CONFIG_ARCH_WEAK_RELEASE_ACQUIRE. CCing PPC maintainers.

> 
>> +static void membarrier_private_expedited(void)
>> +{
>> +	int cpu, this_cpu;
>> +	bool fallback = false;
>> +	cpumask_var_t tmpmask;
>> +
>> +	if (num_online_cpus() == 1)
>> +		return;
>> +
>> +	/*
>> +	 * Matches memory barriers around rq->curr modification in
>> +	 * scheduler.
>> +	 */
>> +	smp_mb();	/* system call entry is not a mb. */
>> +
> 
> Weren't you going to put in a comment on that GFP_NOWAIT thing?

I only added it to the uapi header. Adding this to the implementation
too:

+       /*
+        * Expedited membarrier commands guarantee that they won't
+        * block, hence the GFP_NOWAIT allocation flag and fallback
+        * implementation.
+        */



> 
>> +	if (!alloc_cpumask_var(&tmpmask, GFP_NOWAIT)) {
> 
> You really want: zalloc_cpumask_var().

ok

> 
>> +		/* Fallback for OOM. */
>> +		fallback = true;
>> +	}
>> +
>> +	/*
>> +	 * Skipping the current CPU is OK even through we can be
>> +	 * migrated at any point. The current CPU, at the point where we
>> +	 * read raw_smp_processor_id(), is ensured to be in program
>> +	 * order with respect to the caller thread. Therefore, we can
>> +	 * skip this CPU from the iteration.
>> +	 */
>> +	this_cpu = raw_smp_processor_id();
> 
> So if instead you do the below, that is still true, but you have the
> opportunity to skip moar CPUs, then again, if you migrate the wrong way
> you'll end up not skipping yourself.. a well.

Chances are better to skip more CPUs in face of migration if we do it
in the loop as you suggest. Will do.

> 
>> +	cpus_read_lock();
>> +	for_each_online_cpu(cpu) {
>> +		struct task_struct *p;
>> +
>		if (cpu == raw_smp_processor_id())
>			continue;
> 
>> +		rcu_read_lock();
>> +		p = task_rcu_dereference(&cpu_rq(cpu)->curr);
>> +		if (p && p->mm == current->mm) {
>> +			if (!fallback)
>> +				__cpumask_set_cpu(cpu, tmpmask);
>> +			else
>> +				smp_call_function_single(cpu, ipi_mb, NULL, 1);
>> +		}
>> +		rcu_read_unlock();
>> +	}
>> +	cpus_read_unlock();
> 
> This ^, wants to go after that v
> 
>> +	if (!fallback) {
>> +		smp_call_function_many(tmpmask, ipi_mb, NULL, 1);
>> +		free_cpumask_var(tmpmask);
>> +	}
> 
> Because otherwise the bits in your tmpmask might no longer match the
> online state.

Good point, thanks!

Mathieu

> 
>> +
>> +	/*
>> +	 * Memory barrier on the caller thread _after_ we finished
>> +	 * waiting for the last IPI. Matches memory barriers around
>> +	 * rq->curr modification in scheduler.
>> +	 */
>> +	smp_mb();	/* exit from system call is not a mb */
> > +}

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

      parent reply	other threads:[~2017-07-28 15:34 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-07-27 21:13 [RFC PATCH v2] membarrier: expedited private command Mathieu Desnoyers
2017-07-27 22:13 ` Paul E. McKenney
2017-07-27 22:41   ` Mathieu Desnoyers
2017-07-27 22:57     ` Paul E. McKenney
2017-07-28  8:55 ` Peter Zijlstra
2017-07-28 11:10   ` Peter Zijlstra
2017-07-28 11:57   ` Peter Zijlstra
2017-07-28 15:38     ` Mathieu Desnoyers
2017-07-28 16:46       ` Peter Zijlstra
2017-07-28 17:06         ` Mathieu Desnoyers
2017-07-29  1:58           ` Nicholas Piggin
2017-07-29  9:23             ` Peter Zijlstra
2017-07-29  9:45               ` Nicholas Piggin
2017-07-29  9:48                 ` Nicholas Piggin
2017-07-29 10:51                   ` Peter Zijlstra
2017-07-31 19:31             ` Mathieu Desnoyers
2017-07-31 13:20     ` Michael Ellerman
2017-07-31 13:36       ` Peter Zijlstra
2017-08-01  0:35       ` Nicholas Piggin
2017-08-01  1:33         ` Mathieu Desnoyers
2017-08-01  2:00           ` Nicholas Piggin
2017-08-01  8:12             ` Peter Zijlstra
2017-08-01  9:57               ` Nicholas Piggin
2017-08-01 10:22                 ` Peter Zijlstra
2017-08-01 10:32                   ` Avi Kivity
2017-08-01 10:46                     ` Peter Zijlstra
2017-08-01 10:39                   ` Nicholas Piggin
2017-08-01 11:00                     ` Peter Zijlstra
2017-08-01 11:54                       ` Nicholas Piggin
2017-08-01 13:23                   ` Paul E. McKenney
2017-08-01 14:16                     ` Peter Zijlstra
2017-08-01 23:32                       ` Paul E. McKenney
2017-08-02  0:45                         ` Nicholas Piggin
2017-07-28 15:36   ` Mathieu Desnoyers [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=624649192.29564.1501256169885.JavaMail.zimbra@efficios.com \
    --to=mathieu.desnoyers@efficios.com \
    --cc=ahh@google.com \
    --cc=avi@scylladb.com \
    --cc=benh@kernel.crashing.org \
    --cc=boqun.feng@gmail.com \
    --cc=gromer@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=maged.michael@gmail.com \
    --cc=mpe@ellerman.id.au \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=paulus@samba.org \
    --cc=peterz@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.