From: Mathieu Desnoyers
Subject: Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
Date: Tue, 7 Jul 2009 10:57:10 -0400
Message-ID: <20090707145710.GB7124@Krystal>
In-Reply-To: <4A535EB9.2020406@gmail.com>
References: <20090703090606.GA3902@elte.hu> <4A4DCD54.1080908@gmail.com>
 <20090703092438.GE3902@elte.hu> <20090703095659.GA4518@jolsa.lab.eng.brq.redhat.com>
 <20090703102530.GD32128@elte.hu> <20090703111848.GA10267@jolsa.lab.eng.brq.redhat.com>
 <20090707101816.GA6619@jolsa.lab.eng.brq.redhat.com> <20090707134601.GB6619@jolsa.lab.eng.brq.redhat.com>
 <20090707140135.GA5506@Krystal> <4A535EB9.2020406@gmail.com>
To: Eric Dumazet
Cc: Jiri Olsa, Ingo Molnar, Peter Zijlstra, netdev@vger.kernel.org,
 linux-kernel@vger.kernel.org, fbl@redhat.com, nhorman@redhat.com,
 davem@redhat.com, htejun@gmail.com, jarkao2@gmail.com, oleg@redhat.com,
 davidel@xmailserver.org

* Eric Dumazet (eric.dumazet@gmail.com) wrote:
> Mathieu Desnoyers wrote:
> > * Jiri Olsa (jolsa@redhat.com) wrote:
> >> On Tue, Jul 07, 2009 at 12:18:16PM +0200, Jiri Olsa wrote:
> >>> On Fri, Jul 03, 2009 at 01:18:48PM +0200, Jiri Olsa wrote:
> >>>> On Fri, Jul 03, 2009 at 12:25:30PM +0200, Ingo Molnar wrote:
> >>>>> * Jiri Olsa wrote:
> >>>>>
> >>>>>> On Fri, Jul 03, 2009 at 11:24:38AM +0200, Ingo Molnar wrote:
> >>>>>>> * Eric Dumazet wrote:
> >>>>>>>
> >>>>>>>> Ingo Molnar wrote:
> >>>>>>>>> * Jiri Olsa wrote:
> >>>>>>>>>
> >>>>>>>>>> +++ b/arch/x86/include/asm/spinlock.h
> >>>>>>>>>> @@ -302,4 +302,7 @@ static inline void __raw_write_unlock(raw_rwlock_t *rw)
> >>>>>>>>>>  #define _raw_read_relax(lock)	cpu_relax()
> >>>>>>>>>>  #define _raw_write_relax(lock)	cpu_relax()
> >>>>>>>>>>
> >>>>>>>>>> +/* The {read|write|spin}_lock() on x86 are full memory barriers. */
> >>>>>>>>>> +#define smp_mb__after_lock() do { } while (0)
> >>>>>>>>> Two small stylistic comments; please make this an inline function:
> >>>>>>>>>
> >>>>>>>>> static inline void smp_mb__after_lock(void) { }
> >>>>>>>>> #define smp_mb__after_lock
> >>>>>>>>>
> >>>>>>>>> (untested)
> >>>>>>>>>
> >>>>>>>>>> +/* The lock does not imply full memory barrier. */
> >>>>>>>>>> +#ifndef smp_mb__after_lock
> >>>>>>>>>> +#define smp_mb__after_lock() smp_mb()
> >>>>>>>>>> +#endif
> >>>>>>>>> ditto.
> >>>>>>>>>
> >>>>>>>>> 	Ingo
> >>>>>>>> This was following the existing implementations of the various
> >>>>>>>> smp_mb__??? helpers:
> >>>>>>>>
> >>>>>>>> # grep -4 smp_mb__before_clear_bit include/asm-generic/bitops.h
> >>>>>>>>
> >>>>>>>> /*
> >>>>>>>>  * clear_bit may not imply a memory barrier
> >>>>>>>>  */
> >>>>>>>> #ifndef smp_mb__before_clear_bit
> >>>>>>>> #define smp_mb__before_clear_bit()	smp_mb()
> >>>>>>>> #define smp_mb__after_clear_bit()	smp_mb()
> >>>>>>>> #endif
> >>>>>>> Did I mention that those should be fixed too? :-)
> >>>>>>>
> >>>>>>> 	Ingo
> >>>>>> ok, could I include it in the 2/2, or do you prefer a separate patch?
> >>>>> depends on whether it will regress ;-)
> >>>>>
> >>>>> If it regresses, it's better to have it separate. If it won't, it can
> >>>>> be included. If unsure, default to the more conservative option.
> >>>>>
> >>>>> 	Ingo
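(As an aside, the inline-function treatment Ingo asks for, applied to the
bitops fallbacks quoted above, might look roughly like the sketch below. The
ARCH_HAS_SMP_MB_BEFORE_CLEAR_BIT guard is a made-up name here, mirroring the
ARCH_HAS_SMP_MB_AFTER_LOCK define in Jiri's patch that follows; the actual
patch may differ.)

	/* In include/asm-generic/bitops.h -- generic fallback; sketch
	 * only, the guard name is hypothetical: */
	#ifndef ARCH_HAS_SMP_MB_BEFORE_CLEAR_BIT
	/* clear_bit may not imply a memory barrier */
	static inline void smp_mb__before_clear_bit(void) { smp_mb(); }
	static inline void smp_mb__after_clear_bit(void) { smp_mb(); }
	#endif

	/* In an arch header where the clear_bit() family already orders
	 * memory, an override plus the guard would replace the fallback: */
	static inline void smp_mb__before_clear_bit(void) { barrier(); }
	static inline void smp_mb__after_clear_bit(void) { barrier(); }
	#define ARCH_HAS_SMP_MB_BEFORE_CLEAR_BIT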
> >>>>
> >>>> how about this..
> >>>> and a similar change for smp_mb__before_clear_bit in a separate patch
> >>>>
> >>>>
> >>>> diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
> >>>> index b7e5db8..4e77853 100644
> >>>> --- a/arch/x86/include/asm/spinlock.h
> >>>> +++ b/arch/x86/include/asm/spinlock.h
> >>>> @@ -302,4 +302,8 @@ static inline void __raw_write_unlock(raw_rwlock_t *rw)
> >>>>  #define _raw_read_relax(lock)	cpu_relax()
> >>>>  #define _raw_write_relax(lock)	cpu_relax()
> >>>>
> >>>> +/* The {read|write|spin}_lock() on x86 are full memory barriers. */
> >>>> +static inline void smp_mb__after_lock(void) { }
> >>>> +#define ARCH_HAS_SMP_MB_AFTER_LOCK
> >>>> +
> >>>>  #endif /* _ASM_X86_SPINLOCK_H */
> >>>> diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
> >>>> index 252b245..4be57ab 100644
> >>>> --- a/include/linux/spinlock.h
> >>>> +++ b/include/linux/spinlock.h
> >>>> @@ -132,6 +132,11 @@ do { \
> >>>>  #endif /*__raw_spin_is_contended*/
> >>>>  #endif
> >>>>
> >>>> +/* The lock does not imply full memory barrier. */
> >>>> +#ifndef ARCH_HAS_SMP_MB_AFTER_LOCK
> >>>> +static inline void smp_mb__after_lock(void) { smp_mb(); }
> >>>> +#endif
> >>>> +
> >>>>  /**
> >>>>   * spin_unlock_wait - wait until the spinlock gets unlocked
> >>>>   * @lock: the spinlock in question.
> >>>> diff --git a/include/net/sock.h b/include/net/sock.h
> >>>> index 4eb8409..98afcd9 100644
> >>>> --- a/include/net/sock.h
> >>>> +++ b/include/net/sock.h
> >>>> @@ -1271,6 +1271,9 @@ static inline int sk_has_allocations(const struct sock *sk)
> >>>>   * in its cache, and so does the tp->rcv_nxt update on the CPU2 side. CPU1
> >>>>   * could then end up calling schedule and sleeping forever if there is no
> >>>>   * more data on the socket.
> >>>> + *
> >>>> + * The sk_has_sleeper is always called right after a call to read_lock, so we
> >>>> + * can use the smp_mb__after_lock barrier.
> >>>>   */
> >>>>  static inline int sk_has_sleeper(struct sock *sk)
> >>>>  {
> >>>> @@ -1280,7 +1283,7 @@ static inline int sk_has_sleeper(struct sock *sk)
> >>>>  	 *
> >>>>  	 * This memory barrier is paired in the sock_poll_wait.
> >>>>  	 */
> >>>> -	smp_mb();
> >>>> +	smp_mb__after_lock();
> >>>>  	return sk->sk_sleep && waitqueue_active(sk->sk_sleep);
> >>>>  }
> >>>>
> >>> any feedback on this?
> >>> I'd send v6 if this way is acceptable..
> >>>
> >>> thanks,
> >>> jirka
> >> also I checked smp_mb__before_clear_bit/smp_mb__after_clear_bit, and
> >> it is used quite extensively.
> >>
> >> I'd prefer to send it in a separate patch, so we can move on with the
> >> changes I've sent so far..
> >>
> >
> > As with any optimization (and this is one that adds a semantic that will
> > just grow the memory barrier/locking rule complexity), it should come
> > with performance benchmarks showing real-life improvements.
> >
> > Otherwise I'd recommend sticking to smp_mb() if this execution path is
> > not that critical, or moving to RCU if it's _that_ critical.
> >
> > A valid argument would be if the protected data structures are so
> > complex that RCU is out of the question, but the few cycles saved by
> > removing a memory barrier are still really significant. And even then,
> > the proper solution would be something more like
> > __read_lock()+smp_mb()+smp_mb()+__read_unlock(), so that we get the
> > performance improvements on architectures other than x86 as well.
> >
> > So in all cases, I don't think smp_mb__after_lock() is the
> > appropriate solution.
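Spelled out as code, the __read_lock()+smp_mb()+smp_mb()+__read_unlock()
pattern named above might look like this sketch. It follows Mathieu's
notation: __read_lock()/__read_unlock() are assumed to be raw rwlock
primitives that imply no memory barrier, and the socket-wakeup body is only
illustrative.

	/* Sketch: the lock primitives are assumed barrier-free, so the
	 * two explicit smp_mb() calls supply all the ordering, instead
	 * of barriers hidden inside the arch's read_lock()/read_unlock(). */
	static void sock_wakeup_sketch(struct sock *sk)
	{
		__read_lock(&sk->sk_callback_lock);
		smp_mb();	/* pairs with the barrier on the poll side */

		if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
			wake_up_interruptible(sk->sk_sleep);

		smp_mb();	/* replaces the barrier implied by unlock */
		__read_unlock(&sk->sk_callback_lock);
	}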
>
> RCU on this part is out of the question, as David already mentioned.
>

OK

> It would be a regression for short-lived tcp/udp sessions, and some workloads
> use them a lot...
>
> We gained about 20% performance between 2.6.26 and 2.6.31 by carefully
> removing some atomic ops in the network stack and adding RCU where it was
> sensible, but this is a painful process, not something Jiri can use to fix
> bugs on legacy Red Hat kernels :) (We are still sorting out regressions.)
>

Yep, I can understand that. Tbench on localhost is an especially good
benchmark for this ;)

> To solve the problem pointed out by Jiri, we have to insert an smp_mb() at
> this point (not mentioning the other change in the select() logic, of
> course):
>
> static void sock_def_readable(struct sock *sk, int len)
> {
> 	read_lock(&sk->sk_callback_lock);
> +	smp_mb();	/* paired with the opposite smp_mb() in sk poll logic */
> 	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> 		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLIN |
> 						POLLRDNORM | POLLRDBAND);
> 	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
> 	read_unlock(&sk->sk_callback_lock);
> }
>
> As almost every incoming packet calls this path, we should be very careful
> not to slow down the stack when it is not necessary.
>
> On x86 this extra smp_mb() is not needed, since the previous call to
> read_lock() already gives the full barrier for free.
>

Well, I see that __read_lock()+2x smp_mb()+__read_unlock() is not well suited
for x86. You're right.

But read_lock + smp_mb__after_lock + read_unlock is not well suited for
powerpc, arm, mips and probably others, where there is an explicit memory
barrier at the end of the read-lock primitive.

One thing that would be efficient for all architectures is to create a
locking primitive that contains the smp_mb, e.g.:

read_lock_smp_mb()

which would act as a read_lock that does a full smp_mb after the lock is
taken. The naming may be a bit odd; better ideas are welcome.

Mathieu
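For illustration, here is a minimal sketch of the proposed primitive as a
generic fallback. Only the name read_lock_smp_mb() and its described
semantics come from the mail above; the body and the per-arch override idea
are assumptions.

	/* Generic fallback sketch: a read_lock that guarantees a full
	 * memory barrier after the lock is taken. An architecture whose
	 * read_lock() is already a full barrier (e.g. x86) could define
	 * this as a plain read_lock(); one whose read_lock() merely ends
	 * in a barrier could build it from a raw lock plus one smp_mb(),
	 * avoiding the double barrier. */
	static inline void read_lock_smp_mb(rwlock_t *lock)
	{
		read_lock(lock);
		smp_mb();
	}

sock_def_readable() above would then presumably take the lock with
read_lock_smp_mb(&sk->sk_callback_lock) and drop the separate smp_mb().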