Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)

From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
To: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: linux-kernel@vger.kernel.org,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Oleg Nesterov <oleg@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@elte.hu>,
	akpm@linux-foundation.org, josh@joshtriplett.org,
	tglx@linutronix.de, Valdis.Kletnieks@vt.edu, dhowells@redhat.com,
	laijs@cn.fujitsu.com, dipankar@in.ibm.com
Subject: Re: [RFC PATCH] introduce sys_membarrier(): process-wide memory barrier (v5)
Date: Wed, 13 Jan 2010 21:16:45 -0500
Message-ID: <20100114021645.GA28784@Krystal>
In-Reply-To: <20100114085019.D716.A69D9226@jp.fujitsu.com>

* KOSAKI Motohiro (kosaki.motohiro@jp.fujitsu.com) wrote:
> > * KOSAKI Motohiro (kosaki.motohiro@jp.fujitsu.com) wrote:
[...]
> > > It depends on what "constant overhead" means. kmalloc might cause
> > > page reclaim and a nondeterministic delay. I'm not sure (1) how much
> > > slower membarrier_retry() is than smp_call_function_many and (2) which
> > > you think is more important, average or worst-case performance. I only
> > > note that I don't think GFP_KERNEL is constant overhead.
> > 
> > 10,000,000 sys_membarrier calls (varying the number of threads to which
> > we send IPIs), IPI-to-many, 8-core system:
> > 
> > T=1: 0m20.173s
> > T=2: 0m20.506s
> > T=3: 0m22.632s
> > T=4: 0m24.759s
> > T=5: 0m26.633s
> > T=6: 0m29.654s
> > T=7: 0m30.669s
> > 
> > Just doing local mb()+single IPI to T other threads:
> > 
> > T=1: 0m18.801s
> > T=2: 0m29.086s
> > T=3: 0m46.841s
> > T=4: 0m53.758s
> > T=5: 1m10.856s
> > T=6: 1m21.142s
> > T=7: 1m38.362s
> > 
> > So sending single IPIs adds about 1.5 microseconds per extra core. With
> > the IPI-to-many scheme, we add about 0.2 microseconds per extra core. So
> > we have a factor 10 gain in scalability. The initial cost of the cpumask
> > allocation (which seems to be allocated on the stack in my config) is
> > just about 1.4 microseconds. So here, we only have a small gain for the
> > 1 IPI case, which does not justify the added complexity of dealing with
> > it differently.
> 
> I'd like to discuss CONFIG_CPUMASK_OFFSTACK=1 and CONFIG_CPUMASK_OFFSTACK=0 separately.
> 
> CONFIG_CPUMASK_OFFSTACK=0 (your config)
> 	- cpumask is allocated on the stack
> 	- alloc_cpumask_var() is a nop (yes, a nop is constant overhead ;)
> 	- alloc_cpumask_var() always returns 1, so membarrier_retry() is never called.
> 	- alloc_cpumask_var() ignores the GFP_KERNEL parameter
> 
> CONFIG_CPUMASK_OFFSTACK=1 with GFP_KERNEL
> 	- cpumask is allocated on the heap
> 	- alloc_cpumask_var() is a wrapper around kmalloc()
> 	- the GFP_KERNEL parameter is passed to kmalloc()
> 	- GFP_KERNEL means alloc_cpumask_var() always returns 1, except in
> 	  the oom-killer case. IOW, membarrier_retry() is still never called
> 	  in the typical use case.
> 	- kmalloc(GFP_KERNEL) might invoke page reclaim, which can take a few
> 	  seconds (not microseconds).
> 
> CONFIG_CPUMASK_OFFSTACK=1 with GFP_ATOMIC
> 	- cpumask is allocated on the heap
> 	- alloc_cpumask_var() is a wrapper around kmalloc()
> 	- GFP_ATOMIC means kmalloc() never invokes page reclaim. IOW, the
> 	  kmalloc() cost is nearly constant (a few to many microseconds).
> 	- OTOH, alloc_cpumask_var() might fail, in which case membarrier_retry()
> 	  is called.
> 
> So my last mail talked about CONFIG_CPUMASK_OFFSTACK=1, but you measured CONFIG_CPUMASK_OFFSTACK=0.
> That's the reason why our conclusions differ.
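
As a cross-check on the per-core figures in the timing tables quoted
above, here is a small sketch of the arithmetic. It takes only the T=1
and T=7 endpoints of each table and averages over the 6 extra cores, so
it is an endpoint average rather than a proper fit. The helper names
(per_core_us, print_per_core_costs) are illustrative and not part of the
patch. It gives roughly 0.17 us per extra core for IPI-to-many and
roughly 1.33 us for single IPIs.

```c
#include <stdio.h>

/* Calls per benchmark run, as stated in the quoted message. */
#define NR_CALLS 10000000.0

/*
 * Average added cost per extra core, in microseconds, between the
 * T=1 and T=7 runs (6 extra cores). Wall-clock totals are in seconds.
 */
static double per_core_us(double t1_sec, double t7_sec)
{
	return (t7_sec - t1_sec) / NR_CALLS / 6.0 * 1e6;
}

/* Endpoints copied from the two timing tables quoted above. */
static void print_per_core_costs(void)
{
	printf("IPI-to-many: %.2f us/core\n", per_core_us(20.173, 30.669));
	printf("single IPIs: %.2f us/core\n", per_core_us(18.801, 98.362));
}
```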

I would have to put my system in OOM condition anyway to measure the
page reclaim overhead. Given that sys_membarrier is not exactly a fast
path, I don't think it matters _that much_.

Hrm. Well, given the "expedited" nature of the system call, it might
come as a surprise to have to wait for page reclaim, and surprises are
not good. OTOH, I don't want to allow users to easily consume all the
GFP_ATOMIC pool. But I think it's unlikely, as we are bounded by the
number of processors which can concurrently run sys_membarrier().
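
To put a rough number on that bound, here is a back-of-the-envelope
sketch. It assumes the heap-allocated mask holds one bit per possible
CPU rounded up to whole unsigned longs, and that at most one mask is in
flight per CPU at any time. Both are simplifying assumptions, and the
function names are hypothetical.

```c
#include <stddef.h>

/*
 * Bytes in one heap-allocated cpumask covering nr_cpus CPUs:
 * one bit per CPU, rounded up to whole unsigned longs.
 */
static size_t cpumask_bytes(size_t nr_cpus)
{
	size_t bits_per_long = 8 * sizeof(unsigned long);
	size_t nr_longs = (nr_cpus + bits_per_long - 1) / bits_per_long;

	return nr_longs * sizeof(unsigned long);
}

/* Assumption: at most one mask in flight per CPU. */
static size_t max_inflight_bytes(size_t nr_cpus)
{
	return nr_cpus * cpumask_bytes(nr_cpus);
}
```

Under these assumptions a 4096-CPU machine needs 512 bytes per mask, so
the worst case is 4096 * 512 bytes = 2 MiB of atomic allocations
outstanding at once.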

> 
> > 
> > Also... it's pretty much a slow path anyway compared to the RCU
> > read-side. I just don't want this slow path to scale badly.
> > 
> > > 
> > > hmm...
> > > Do you intend to GFP_ATOMIC?
> > 
> > Would it help to lower the allocation overhead ?
> 
> No. If the system has lots of free memory, GFP_ATOMIC and GFP_KERNEL
> make no difference. But if the system has no free memory,
> GFP_KERNEL might cause a large latency.

Having a somewhat bounded latency is good for a synchronization
primitive, even for the slow path.

> 
> 
> Perhaps it is not a big issue. If the system has no free memory, another
> syscall will invoke page reclaim soon even if sys_membarrier avoids it.
> I'm not sure. It depends on the librcu latency policy.

I'd like to stay on the safe side. If you tell me that there is no risk
to let users exhaust GFP_ATOMIC pools prematurely, then I'll use it.

> 
> Another alternative plan is,
> 
> 	if (!alloc_cpumask_var(&tmpmask, GFP_KERNEL)) {
> 		err = -ENOMEM;
> 		goto unlock;
> 	}
> 
> and kill membarrier_retry(). Because CONFIG_CPUMASK_OFFSTACK=1 is
> only used on big SGI HPC machines, nobody can test membarrier_retry().
> A function that is never called isn't worth much.
> 
> Thoughts?

I don't want to rely on a system call which can fail at arbitrary points
in the program to create a synchronization primitive. Currently (with
the forthcoming v6 patch), I can test if the system call exists and if
the flags are supported at library init time (by checking -ENOSYS and
-EINVAL return values). From that point on, I don't want to check error
values anymore. This means that a system call that fails on a given
kernel will _always_ fail. The same is true for the opposite. This is
why not returning -ENOMEM is important here.

So I would rather have one single simple failure handler in the
kernel, even if it is not often used, than have multiple subtly
different -ENOMEM error handlers at the user-space call sites,
resulting in a predictable mess. These error handlers won't be tested
any more than the one located in the kernel.
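
A minimal userspace sketch of the probe-once pattern described above.
It assumes the interface as it was later merged in mainline
(MEMBARRIER_CMD_QUERY and MEMBARRIER_CMD_SHARED from
<linux/membarrier.h>), not the prototype in this RFC. The names
have_membarrier, membarrier_probe and heavy_barrier are hypothetical.

```c
#define _GNU_SOURCE
#include <stdbool.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/membarrier.h>

/* Decided once at library init, never re-evaluated afterwards. */
static bool have_membarrier;

/*
 * Probe at init time. Any failure (ENOSYS on kernels without the
 * syscall, EINVAL for an unsupported command, ...) means "not
 * available" for the whole lifetime of the process.
 */
static void membarrier_probe(void)
{
	long cmds = syscall(__NR_membarrier, MEMBARRIER_CMD_QUERY, 0, 0);

	have_membarrier = cmds > 0 && (cmds & MEMBARRIER_CMD_SHARED);
}

/*
 * Call site: no error checking after a successful probe. The
 * fallback is only a local full barrier, so a library taking this
 * path would also need real barriers on its read side.
 */
static void heavy_barrier(void)
{
	if (have_membarrier)
		(void)syscall(__NR_membarrier, MEMBARRIER_CMD_SHARED, 0, 0);
	else
		__sync_synchronize();
}
```

The point of the design is visible in heavy_barrier(): once the probe
has run, the caller never looks at a return value again, which only
works if the kernel side cannot start failing with -ENOMEM later.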

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
