From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
To: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jiri Olsa <jolsa@redhat.com>, Ingo Molnar <mingo@elte.hu>,
	Peter Zijlstra <a.p.zijlstra@chello.nl>,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	fbl@redhat.com, nhorman@redhat.com, davem@redhat.com,
	htejun@gmail.com, jarkao2@gmail.com, oleg@redhat.com,
	davidel@xmailserver.org
Subject: Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
Date: Tue, 7 Jul 2009 10:57:10 -0400
Message-ID: <20090707145710.GB7124@Krystal>
In-Reply-To: <4A535EB9.2020406@gmail.com>

* Eric Dumazet (eric.dumazet@gmail.com) wrote:
> Mathieu Desnoyers wrote:
> > * Jiri Olsa (jolsa@redhat.com) wrote:
> >> On Tue, Jul 07, 2009 at 12:18:16PM +0200, Jiri Olsa wrote:
> >>> On Fri, Jul 03, 2009 at 01:18:48PM +0200, Jiri Olsa wrote:
> >>>> On Fri, Jul 03, 2009 at 12:25:30PM +0200, Ingo Molnar wrote:
> >>>>> * Jiri Olsa <jolsa@redhat.com> wrote:
> >>>>>
> >>>>>> On Fri, Jul 03, 2009 at 11:24:38AM +0200, Ingo Molnar wrote:
> >>>>>>> * Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Ingo Molnar wrote:
> >>>>>>>>> * Jiri Olsa <jolsa@redhat.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> +++ b/arch/x86/include/asm/spinlock.h
> >>>>>>>>>> @@ -302,4 +302,7 @@ static inline void __raw_write_unlock(raw_rwlock_t *rw)
> >>>>>>>>>>  #define _raw_read_relax(lock)	cpu_relax()
> >>>>>>>>>>  #define _raw_write_relax(lock)	cpu_relax()
> >>>>>>>>>>  
> >>>>>>>>>> +/* The {read|write|spin}_lock() on x86 are full memory barriers. */
> >>>>>>>>>> +#define smp_mb__after_lock() do { } while (0)
> >>>>>>>>> Two small stylistic comments, please make this an inline function:
> >>>>>>>>>
> >>>>>>>>> static inline void smp_mb__after_lock(void) { }
> >>>>>>>>> #define smp_mb__after_lock
> >>>>>>>>>
> >>>>>>>>> (untested)
> >>>>>>>>>
> >>>>>>>>>> +/* The lock does not imply a full memory barrier. */
> >>>>>>>>>> +#ifndef smp_mb__after_lock
> >>>>>>>>>> +#define smp_mb__after_lock() smp_mb()
> >>>>>>>>>> +#endif
> >>>>>>>>> ditto.
> >>>>>>>>>
> >>>>>>>>> 	Ingo
> >>>>>>>> This was following the existing implementations of the various smp_mb__*() helpers:
> >>>>>>>>
> >>>>>>>> # grep -4 smp_mb__before_clear_bit include/asm-generic/bitops.h
> >>>>>>>>
> >>>>>>>> /*
> >>>>>>>>  * clear_bit may not imply a memory barrier
> >>>>>>>>  */
> >>>>>>>> #ifndef smp_mb__before_clear_bit
> >>>>>>>> #define smp_mb__before_clear_bit()      smp_mb()
> >>>>>>>> #define smp_mb__after_clear_bit()       smp_mb()
> >>>>>>>> #endif
> >>>>>>> Did i mention that those should be fixed too? :-)
> >>>>>>>
> >>>>>>> 	Ingo
> >>>>>> ok, could I include it in the 2/2 or you prefer separate patch?
> >>>>> depends on whether it will regress ;-)
> >>>>>
> >>>>> If it regresses, it's better to have it separate. If it won't, it can 
> >>>>> be included. If unsure, default to the more conservative option.
> >>>>>
> >>>>> 	Ingo
> >>>>
> >>>> How about this, with a similar change for
> >>>> smp_mb__before_clear_bit to follow in a separate patch:
> >>>>
> >>>>
> >>>> diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
> >>>> index b7e5db8..4e77853 100644
> >>>> --- a/arch/x86/include/asm/spinlock.h
> >>>> +++ b/arch/x86/include/asm/spinlock.h
> >>>> @@ -302,4 +302,8 @@ static inline void __raw_write_unlock(raw_rwlock_t *rw)
> >>>>  #define _raw_read_relax(lock)	cpu_relax()
> >>>>  #define _raw_write_relax(lock)	cpu_relax()
> >>>>  
> >>>> +/* The {read|write|spin}_lock() on x86 are full memory barriers. */
> >>>> +static inline void smp_mb__after_lock(void) { }
> >>>> +#define ARCH_HAS_SMP_MB_AFTER_LOCK
> >>>> +
> >>>>  #endif /* _ASM_X86_SPINLOCK_H */
> >>>> diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
> >>>> index 252b245..4be57ab 100644
> >>>> --- a/include/linux/spinlock.h
> >>>> +++ b/include/linux/spinlock.h
> >>>> @@ -132,6 +132,11 @@ do {								\
> >>>>  #endif /*__raw_spin_is_contended*/
> >>>>  #endif
> >>>>  
> >>>> +/* The lock does not imply a full memory barrier. */
> >>>> +#ifndef ARCH_HAS_SMP_MB_AFTER_LOCK
> >>>> +static inline void smp_mb__after_lock(void) { smp_mb(); }
> >>>> +#endif
> >>>> +
> >>>>  /**
> >>>>   * spin_unlock_wait - wait until the spinlock gets unlocked
> >>>>   * @lock: the spinlock in question.
> >>>> diff --git a/include/net/sock.h b/include/net/sock.h
> >>>> index 4eb8409..98afcd9 100644
> >>>> --- a/include/net/sock.h
> >>>> +++ b/include/net/sock.h
> >>>> @@ -1271,6 +1271,9 @@ static inline int sk_has_allocations(const struct sock *sk)
> >>>>   * in its cache, and so does the tp->rcv_nxt update on CPU2 side.  The CPU1
> >>>>   * could then endup calling schedule and sleep forever if there are no more
> >>>>   * data on the socket.
> >>>> + *
> >>>> + * The sk_has_sleeper is always called right after a call to read_lock, so we
> >>>> + * can use the smp_mb__after_lock() barrier.
> >>>>   */
> >>>>  static inline int sk_has_sleeper(struct sock *sk)
> >>>>  {
> >>>> @@ -1280,7 +1283,7 @@ static inline int sk_has_sleeper(struct sock *sk)
> >>>>  	 *
> >>>>  	 * This memory barrier is paired in the sock_poll_wait.
> >>>>  	 */
> >>>> -	smp_mb();
> >>>> +	smp_mb__after_lock();
> >>>>  	return sk->sk_sleep && waitqueue_active(sk->sk_sleep);
> >>>>  }
> >>>>  
> >>> Any feedback on this?
> >>> I'll send v6 if this approach is acceptable.
> >>>
> >>> thanks,
> >>> jirka
> >> Also, I checked smp_mb__before_clear_bit/smp_mb__after_clear_bit, and
> >> they are used quite extensively.
> >>
> >> I'd prefer to send that in a separate patch, so we can move on with the
> >> changes I've sent so far.
> >>
> > 
> > As with any optimization (and this one adds a semantic that will
> > just grow the memory-barrier/locking rule complexity), it should come
> > with performance benchmarks showing real-life improvements.
> > 
> > Otherwise I'd recommend sticking to smp_mb() if this execution path is
> > not that critical, or moving to RCU if it's _that_ critical.
> > 
> > A valid argument would be if the data structures protected are so
> > complex that RCU is out of the question, but the few cycles saved by
> > removing a memory barrier are still really significant. And even then, the
> > proper solution would be something more like
> > __read_lock()+smp_mb()+smp_mb()+__read_unlock(), so we get the performance
> > improvement on architectures other than x86 as well.
> > 
> > So in all cases, I don't think the smp_mb__after_lock() is the
> > appropriate solution.
> 
> RCU on this part is out of the question, as David already mentioned.
> 

OK

> It would be a regression for short-lived TCP/UDP sessions, and some workloads
> use them a lot...
> 
> We gained about 20% performance between 2.6.26 and 2.6.31 by carefully
> removing some atomic ops in the network stack and adding RCU where it was
> sensible, but this is a painful process, not something Jiri can use to fix
> bugs on legacy Red Hat kernels :) (We are still sorting out regressions.)
> 

Yep, I can understand that. Tbench on localhost is an especially good
benchmark for this ;)

> To solve the problem pointed out by Jiri, we have to insert an smp_mb() at
> this point (not to mention the other change in the select() logic, of course):
> 
>  static void sock_def_readable(struct sock *sk, int len)
>  {
>  	read_lock(&sk->sk_callback_lock);
> +	smp_mb(); /* paired with opposite smp_mb() in sk poll logic */
>  	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
>  		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLIN |
>  						POLLRDNORM | POLLRDBAND);
>  	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
>  	read_unlock(&sk->sk_callback_lock);
>  }
> 
> As just about every incoming packet goes through this path, we should be very
> careful not to slow down the stack unnecessarily.
> 
> On x86 this extra smp_mb() is not needed, since the previous call to
> read_lock() already provides a full barrier for free.
> 
> 
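
For context, the smp_mb() above pairs with the one patch 1/2 of this series
adds on the poll side; roughly (a sketch of the sock_poll_wait() helper
described earlier in the thread, exact signature and details assumed):

	static inline void sock_poll_wait(struct file *filp,
					  wait_queue_head_t *wait_address,
					  poll_table *p)
	{
		if (p && wait_address) {
			poll_wait(filp, wait_address, p);
			/*
			 * Make sure the flag tests that follow in the
			 * poll code see the writes done before the
			 * wakeup side took sk_callback_lock (pairs
			 * with the barrier in sk_has_sleeper()).
			 */
			smp_mb();
		}
	}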

Well, I see that __read_lock() + 2x smp_mb() + __read_unlock() is not well
suited to x86. You're right.

But read_lock() + smp_mb__after_lock() + read_unlock() is not well suited to
powerpc, arm, mips, and probably others, where there is already an explicit
memory barrier at the end of the read-lock primitive.
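
To illustrate (rough pseudo-code, not actual architecture code), on such an
architecture the proposed sequence ends up executing two barriers back to
back:

	read_lock(&sk->sk_callback_lock);  /* lock loop, then the acquire
					    * barrier at the end of the
					    * lock primitive */
	smp_mb__after_lock();		   /* no arch override here, so this
					    * falls back to a second,
					    * full smp_mb() */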

One thing that would be efficient on all architectures is to create a
locking primitive that contains the smp_mb(), e.g.:

read_lock_smp_mb()

which would act as a read_lock() that performs a full smp_mb() after the
lock is taken.

The naming may be a bit odd; better ideas are welcome.
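
A minimal sketch of what this could look like (hypothetical code; the
ARCH_HAS_READ_LOCK_SMP_MB symbol and the x86 override are made up for
illustration, mirroring the #ifndef pattern Jiri's patch already uses):

	/* include/linux/spinlock.h: generic fallback for architectures
	 * whose read_lock() does not imply a full barrier. */
	#ifndef ARCH_HAS_READ_LOCK_SMP_MB
	#define read_lock_smp_mb(lock)		\
		do {				\
			read_lock(lock);	\
			smp_mb();		\
		} while (0)
	#endif

	/* arch/x86/include/asm/spinlock.h: read_lock() is already a
	 * full barrier, so the plain lock suffices. */
	#define read_lock_smp_mb(lock)	read_lock(lock)
	#define ARCH_HAS_READ_LOCK_SMP_MB

Architectures like powerpc could then presumably implement it as the raw
lock followed by a single full barrier, instead of paying for the lock's
acquire barrier plus an extra smp_mb().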

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
