[PATCHv5 0/2] net: fix race in the receive/select

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCHv5 0/2] net: fix race in the receive/select
@ 2009-07-03  8:12 Jiri Olsa
  2009-07-03  8:13 ` [PATCHv5 1/2] net: adding memory barrier to the poll and receive callbacks Jiri Olsa
  2009-07-03  8:14 ` [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock Jiri Olsa
  0 siblings, 2 replies; 39+ messages in thread
From: Jiri Olsa @ 2009-07-03  8:12 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, fbl, nhorman, davem, htejun, jarkao2, oleg, davidel,
	eric.dumazet

This patchset fixies the race within the poll_wait call and the receive callbacks,
by adding memory barrier.

- 1/2 includes the actual fix
- 2/2 includes optimization for the x86 arch

It is discussed and described in the following discussions:

http://lkml.org/lkml/2009/6/18/124
http://lkml.org/lkml/2009/6/25/188
http://lkml.org/lkml/2009/6/29/216

Patchset is made on top of Linus'es tree 7c5371c403abb29f01bc6cff6c5096abdf2dc524 .

Booted on x86_64.

wbr,
jirka

---
 arch/x86/include/asm/spinlock.h |    3 ++
 include/linux/spinlock.h        |    5 +++
 include/net/sock.h              |   69 +++++++++++++++++++++++++++++++++++++++
 net/atm/common.c                |    6 ++--
 net/core/datagram.c             |    2 +-
 net/core/sock.c                 |    8 ++--
 net/dccp/output.c               |    2 +-
 net/dccp/proto.c                |    2 +-
 net/ipv4/tcp.c                  |    2 +-
 net/iucv/af_iucv.c              |    4 +-
 net/rxrpc/af_rxrpc.c            |    4 +-
 net/unix/af_unix.c              |    8 ++--
 12 files changed, 96 insertions(+), 19 deletions(-)

^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCHv5 1/2] net: adding memory barrier to the poll and receive callbacks
  2009-07-03  8:12 [PATCHv5 0/2] net: fix race in the receive/select Jiri Olsa
@ 2009-07-03  8:13 ` Jiri Olsa
  2009-07-07 15:56   ` Eric Dumazet
  2009-07-03  8:14 ` [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock Jiri Olsa
  1 sibling, 1 reply; 39+ messages in thread
From: Jiri Olsa @ 2009-07-03  8:13 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, fbl, nhorman, davem, htejun, jarkao2, oleg, davidel,
	eric.dumazet

Adding memory barrier after the poll_wait function, paired with
receive callbacks. Adding fuctions sk_poll_wait and sk_has_sleeper
to wrap the memory barrier.

Without the memory barrier, following race can happen.
The race fires, when following code paths meet, and the tp->rcv_nxt 
and __add_wait_queue updates stay in CPU caches.


CPU1                         CPU2

sys_select                   receive packet
  ...                        ...
  __add_wait_queue           update tp->rcv_nxt
  ...                        ...
  tp->rcv_nxt check          sock_def_readable
  ...                        {
  schedule                      ...
                                if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
                                        wake_up_interruptible(sk->sk_sleep)
                                ...
                             }

If there was no cache the code would work ok, since the wait_queue and
rcv_nxt are opposit to each other.

Meaning that once tp->rcv_nxt is updated by CPU2, the CPU1 either already
passed the tp->rcv_nxt check and sleeps, or will get the new value for
tp->rcv_nxt and will return with new data mask.
In both cases the process (CPU1) is being added to the wait queue, so the
waitqueue_active (CPU2) call cannot miss and will wake up CPU1.

The bad case is when the __add_wait_queue changes done by CPU1 stay in its
cache, and so does the tp->rcv_nxt update on CPU2 side.  The CPU1 will then
endup calling schedule and sleep forever if there are no more data on the
socket.


Calls to poll_wait in following modules were ommited:
	net/bluetooth/af_bluetooth.c
	net/irda/af_irda.c
	net/irda/irnet/irnet_ppp.c
	net/mac80211/rc80211_pid_debugfs.c
	net/phonet/socket.c
	net/rds/af_rds.c
	net/rfkill/core.c
	net/sunrpc/cache.c
	net/sunrpc/rpc_pipe.c
	net/tipc/socket.c

wbr,
jirka


Signed-off-by: Jiri Olsa <jolsa@redhat.com>

---
 include/net/sock.h   |   66 ++++++++++++++++++++++++++++++++++++++++++++++++++
 net/atm/common.c     |    6 ++--
 net/core/datagram.c  |    2 +-
 net/core/sock.c      |    8 +++---
 net/dccp/output.c    |    2 +-
 net/dccp/proto.c     |    2 +-
 net/ipv4/tcp.c       |    2 +-
 net/iucv/af_iucv.c   |    4 +-
 net/rxrpc/af_rxrpc.c |    4 +-
 net/unix/af_unix.c   |    8 +++---
 10 files changed, 85 insertions(+), 19 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 352f06b..4eb8409 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -54,6 +54,7 @@
 
 #include <linux/filter.h>
 #include <linux/rculist_nulls.h>
+#include <linux/poll.h>
 
 #include <asm/atomic.h>
 #include <net/dst.h>
@@ -1241,6 +1242,71 @@ static inline int sk_has_allocations(const struct sock *sk)
 	return sk_wmem_alloc_get(sk) || sk_rmem_alloc_get(sk);
 }
 
+/**
+ * sk_has_sleeper - check if there are any waiting processes
+ * @sk: socket
+ *
+ * Returns true if socket has waiting processes
+ *
+ * The purpose of the sk_has_sleeper and sock_poll_wait is to wrap the memory
+ * barrier call. They were added due to the race found within the tcp code.
+ *
+ * Consider following tcp code paths:
+ *
+ * CPU1                  CPU2
+ *
+ * sys_select            receive packet
+ *   ...                 ...
+ *   __add_wait_queue    update tp->rcv_nxt
+ *   ...                 ...
+ *   tp->rcv_nxt check   sock_def_readable
+ *   ...                 {
+ *   schedule               ...
+ *                          if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
+ *                              wake_up_interruptible(sk->sk_sleep)
+ *                          ...
+ *                       }
+ *
+ * The race for tcp fires when the __add_wait_queue changes done by CPU1 stay
+ * in its cache, and so does the tp->rcv_nxt update on CPU2 side.  The CPU1
+ * could then endup calling schedule and sleep forever if there are no more
+ * data on the socket.
+ */
+static inline int sk_has_sleeper(struct sock *sk)
+{
+	/*
+	 * We need to be sure we are in sync with the
+	 * add_wait_queue modifications to the wait queue.
+	 *
+	 * This memory barrier is paired in the sock_poll_wait.
+	 */
+	smp_mb();
+	return sk->sk_sleep && waitqueue_active(sk->sk_sleep);
+}
+
+/**
+ * sock_poll_wait - place memory barrier behind the poll_wait call.
+ * @filp:           file
+ * @wait_address:   socket wait queue
+ * @p:              poll_table
+ *
+ * See the comments in the sk_has_sleeper function.
+ */
+static inline void sock_poll_wait(struct file *filp,
+		wait_queue_head_t *wait_address, poll_table *p)
+{
+	if (p && wait_address) {
+		poll_wait(filp, wait_address, p);
+		/*
+		 * We need to be sure we are in sync with the
+		 * socket flags modification.
+		 *
+		 * This memory barrier is paired in the sk_has_sleeper.
+		*/
+		smp_mb();
+	}
+}
+
 /*
  * 	Queue a received datagram if it will fit. Stream and sequenced
  *	protocols can't normally use this as they need to fit buffers in
diff --git a/net/atm/common.c b/net/atm/common.c
index c1c9793..8c4d843 100644
--- a/net/atm/common.c
+++ b/net/atm/common.c
@@ -92,7 +92,7 @@ static void vcc_sock_destruct(struct sock *sk)
 static void vcc_def_wakeup(struct sock *sk)
 {
 	read_lock(&sk->sk_callback_lock);
-	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
+	if (sk_has_sleeper(sk))
 		wake_up(sk->sk_sleep);
 	read_unlock(&sk->sk_callback_lock);
 }
@@ -110,7 +110,7 @@ static void vcc_write_space(struct sock *sk)
 	read_lock(&sk->sk_callback_lock);
 
 	if (vcc_writable(sk)) {
-		if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
+		if (sk_has_sleeper(sk))
 			wake_up_interruptible(sk->sk_sleep);
 
 		sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
@@ -594,7 +594,7 @@ unsigned int vcc_poll(struct file *file, struct socket *sock, poll_table *wait)
 	struct atm_vcc *vcc;
 	unsigned int mask;
 
-	poll_wait(file, sk->sk_sleep, wait);
+	sock_poll_wait(file, sk->sk_sleep, wait);
 	mask = 0;
 
 	vcc = ATM_SD(sock);
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 58abee1..b0fe692 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -712,7 +712,7 @@ unsigned int datagram_poll(struct file *file, struct socket *sock,
 	struct sock *sk = sock->sk;
 	unsigned int mask;
 
-	poll_wait(file, sk->sk_sleep, wait);
+	sock_poll_wait(file, sk->sk_sleep, wait);
 	mask = 0;
 
 	/* exceptional events? */
diff --git a/net/core/sock.c b/net/core/sock.c
index b0ba569..6354863 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1715,7 +1715,7 @@ EXPORT_SYMBOL(sock_no_sendpage);
 static void sock_def_wakeup(struct sock *sk)
 {
 	read_lock(&sk->sk_callback_lock);
-	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
+	if (sk_has_sleeper(sk))
 		wake_up_interruptible_all(sk->sk_sleep);
 	read_unlock(&sk->sk_callback_lock);
 }
@@ -1723,7 +1723,7 @@ static void sock_def_wakeup(struct sock *sk)
 static void sock_def_error_report(struct sock *sk)
 {
 	read_lock(&sk->sk_callback_lock);
-	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
+	if (sk_has_sleeper(sk))
 		wake_up_interruptible_poll(sk->sk_sleep, POLLERR);
 	sk_wake_async(sk, SOCK_WAKE_IO, POLL_ERR);
 	read_unlock(&sk->sk_callback_lock);
@@ -1732,7 +1732,7 @@ static void sock_def_error_report(struct sock *sk)
 static void sock_def_readable(struct sock *sk, int len)
 {
 	read_lock(&sk->sk_callback_lock);
-	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
+	if (sk_has_sleeper(sk))
 		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLIN |
 						POLLRDNORM | POLLRDBAND);
 	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
@@ -1747,7 +1747,7 @@ static void sock_def_write_space(struct sock *sk)
 	 * progress.  --DaveM
 	 */
 	if ((atomic_read(&sk->sk_wmem_alloc) << 1) <= sk->sk_sndbuf) {
-		if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
+		if (sk_has_sleeper(sk))
 			wake_up_interruptible_sync_poll(sk->sk_sleep, POLLOUT |
 						POLLWRNORM | POLLWRBAND);
 
diff --git a/net/dccp/output.c b/net/dccp/output.c
index c0e88c1..c96119f 100644
--- a/net/dccp/output.c
+++ b/net/dccp/output.c
@@ -196,7 +196,7 @@ void dccp_write_space(struct sock *sk)
 {
 	read_lock(&sk->sk_callback_lock);
 
-	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
+	if (sk_has_sleeper(sk))
 		wake_up_interruptible(sk->sk_sleep);
 	/* Should agree with poll, otherwise some programs break */
 	if (sock_writeable(sk))
diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index 314a1b5..94ca8ea 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -311,7 +311,7 @@ unsigned int dccp_poll(struct file *file, struct socket *sock,
 	unsigned int mask;
 	struct sock *sk = sock->sk;
 
-	poll_wait(file, sk->sk_sleep, wait);
+	sock_poll_wait(file, sk->sk_sleep, wait);
 	if (sk->sk_state == DCCP_LISTEN)
 		return inet_csk_listen_poll(sk);
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 7870a53..9114524 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -339,7 +339,7 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
 	struct sock *sk = sock->sk;
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	poll_wait(file, sk->sk_sleep, wait);
+	sock_poll_wait(file, sk->sk_sleep, wait);
 	if (sk->sk_state == TCP_LISTEN)
 		return inet_csk_listen_poll(sk);
 
diff --git a/net/iucv/af_iucv.c b/net/iucv/af_iucv.c
index 6be5f92..49c15b4 100644
--- a/net/iucv/af_iucv.c
+++ b/net/iucv/af_iucv.c
@@ -306,7 +306,7 @@ static inline int iucv_below_msglim(struct sock *sk)
 static void iucv_sock_wake_msglim(struct sock *sk)
 {
 	read_lock(&sk->sk_callback_lock);
-	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
+	if (sk_has_sleeper(sk))
 		wake_up_interruptible_all(sk->sk_sleep);
 	sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
 	read_unlock(&sk->sk_callback_lock);
@@ -1256,7 +1256,7 @@ unsigned int iucv_sock_poll(struct file *file, struct socket *sock,
 	struct sock *sk = sock->sk;
 	unsigned int mask = 0;
 
-	poll_wait(file, sk->sk_sleep, wait);
+	sock_poll_wait(file, sk->sk_sleep, wait);
 
 	if (sk->sk_state == IUCV_LISTEN)
 		return iucv_accept_poll(sk);
diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c
index eac5e7b..bfe493e 100644
--- a/net/rxrpc/af_rxrpc.c
+++ b/net/rxrpc/af_rxrpc.c
@@ -63,7 +63,7 @@ static void rxrpc_write_space(struct sock *sk)
 	_enter("%p", sk);
 	read_lock(&sk->sk_callback_lock);
 	if (rxrpc_writable(sk)) {
-		if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
+		if (sk_has_sleeper(sk))
 			wake_up_interruptible(sk->sk_sleep);
 		sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
 	}
@@ -588,7 +588,7 @@ static unsigned int rxrpc_poll(struct file *file, struct socket *sock,
 	unsigned int mask;
 	struct sock *sk = sock->sk;
 
-	poll_wait(file, sk->sk_sleep, wait);
+	sock_poll_wait(file, sk->sk_sleep, wait);
 	mask = 0;
 
 	/* the socket is readable if there are any messages waiting on the Rx
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 36d4e44..fc3ebb9 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -315,7 +315,7 @@ static void unix_write_space(struct sock *sk)
 {
 	read_lock(&sk->sk_callback_lock);
 	if (unix_writable(sk)) {
-		if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
+		if (sk_has_sleeper(sk))
 			wake_up_interruptible_sync(sk->sk_sleep);
 		sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
 	}
@@ -1985,7 +1985,7 @@ static unsigned int unix_poll(struct file *file, struct socket *sock, poll_table
 	struct sock *sk = sock->sk;
 	unsigned int mask;
 
-	poll_wait(file, sk->sk_sleep, wait);
+	sock_poll_wait(file, sk->sk_sleep, wait);
 	mask = 0;
 
 	/* exceptional events? */
@@ -2022,7 +2022,7 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
 	struct sock *sk = sock->sk, *other;
 	unsigned int mask, writable;
 
-	poll_wait(file, sk->sk_sleep, wait);
+	sock_poll_wait(file, sk->sk_sleep, wait);
 	mask = 0;
 
 	/* exceptional events? */
@@ -2053,7 +2053,7 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
 		other = unix_peer_get(sk);
 		if (other) {
 			if (unix_peer(other) != sk) {
-				poll_wait(file, &unix_sk(other)->peer_wait,
+				sock_poll_wait(file, &unix_sk(other)->peer_wait,
 					  wait);
 				if (unix_recvq_full(other))
 					writable = 0;

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 1/2] net: adding memory barrier to the poll and receive callbacks
  2009-07-03  8:13 ` [PATCHv5 1/2] net: adding memory barrier to the poll and receive callbacks Jiri Olsa
@ 2009-07-07 15:56   ` Eric Dumazet
  0 siblings, 0 replies; 39+ messages in thread
From: Eric Dumazet @ 2009-07-07 15:56 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: netdev, linux-kernel, fbl, nhorman, davem, htejun, jarkao2, oleg,
	davidel

Jiri Olsa a écrit :
> Adding memory barrier after the poll_wait function, paired with
> receive callbacks. Adding fuctions sk_poll_wait and sk_has_sleeper
> to wrap the memory barrier.
> 
> Without the memory barrier, following race can happen.
> The race fires, when following code paths meet, and the tp->rcv_nxt 
> and __add_wait_queue updates stay in CPU caches.
> 
> 
> CPU1                         CPU2
> 
> sys_select                   receive packet
>   ...                        ...
>   __add_wait_queue           update tp->rcv_nxt
>   ...                        ...
>   tp->rcv_nxt check          sock_def_readable
>   ...                        {
>   schedule                      ...
>                                 if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
>                                         wake_up_interruptible(sk->sk_sleep)
>                                 ...
>                              }
> 
> If there was no cache the code would work ok, since the wait_queue and
> rcv_nxt are opposit to each other.
> 
> Meaning that once tp->rcv_nxt is updated by CPU2, the CPU1 either already
> passed the tp->rcv_nxt check and sleeps, or will get the new value for
> tp->rcv_nxt and will return with new data mask.
> In both cases the process (CPU1) is being added to the wait queue, so the
> waitqueue_active (CPU2) call cannot miss and will wake up CPU1.
> 
> The bad case is when the __add_wait_queue changes done by CPU1 stay in its
> cache, and so does the tp->rcv_nxt update on CPU2 side.  The CPU1 will then
> endup calling schedule and sleep forever if there are no more data on the
> socket.
> 
> 
> Calls to poll_wait in following modules were ommited:
> 	net/bluetooth/af_bluetooth.c
> 	net/irda/af_irda.c
> 	net/irda/irnet/irnet_ppp.c
> 	net/mac80211/rc80211_pid_debugfs.c
> 	net/phonet/socket.c
> 	net/rds/af_rds.c
> 	net/rfkill/core.c
> 	net/sunrpc/cache.c
> 	net/sunrpc/rpc_pipe.c
> 	net/tipc/socket.c
> 
> wbr,
> jirka
> 
> 
> Signed-off-by: Jiri Olsa <jolsa@redhat.com>

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

> 
> ---
>  include/net/sock.h   |   66 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  net/atm/common.c     |    6 ++--
>  net/core/datagram.c  |    2 +-
>  net/core/sock.c      |    8 +++---
>  net/dccp/output.c    |    2 +-
>  net/dccp/proto.c     |    2 +-
>  net/ipv4/tcp.c       |    2 +-
>  net/iucv/af_iucv.c   |    4 +-
>  net/rxrpc/af_rxrpc.c |    4 +-
>  net/unix/af_unix.c   |    8 +++---
>  10 files changed, 85 insertions(+), 19 deletions(-)
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 352f06b..4eb8409 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -54,6 +54,7 @@
>  
>  #include <linux/filter.h>
>  #include <linux/rculist_nulls.h>
> +#include <linux/poll.h>
>  
>  #include <asm/atomic.h>
>  #include <net/dst.h>
> @@ -1241,6 +1242,71 @@ static inline int sk_has_allocations(const struct sock *sk)
>  	return sk_wmem_alloc_get(sk) || sk_rmem_alloc_get(sk);
>  }
>  
> +/**
> + * sk_has_sleeper - check if there are any waiting processes
> + * @sk: socket
> + *
> + * Returns true if socket has waiting processes
> + *
> + * The purpose of the sk_has_sleeper and sock_poll_wait is to wrap the memory
> + * barrier call. They were added due to the race found within the tcp code.
> + *
> + * Consider following tcp code paths:
> + *
> + * CPU1                  CPU2
> + *
> + * sys_select            receive packet
> + *   ...                 ...
> + *   __add_wait_queue    update tp->rcv_nxt
> + *   ...                 ...
> + *   tp->rcv_nxt check   sock_def_readable
> + *   ...                 {
> + *   schedule               ...
> + *                          if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> + *                              wake_up_interruptible(sk->sk_sleep)
> + *                          ...
> + *                       }
> + *
> + * The race for tcp fires when the __add_wait_queue changes done by CPU1 stay
> + * in its cache, and so does the tp->rcv_nxt update on CPU2 side.  The CPU1
> + * could then endup calling schedule and sleep forever if there are no more
> + * data on the socket.
> + */
> +static inline int sk_has_sleeper(struct sock *sk)
> +{
> +	/*
> +	 * We need to be sure we are in sync with the
> +	 * add_wait_queue modifications to the wait queue.
> +	 *
> +	 * This memory barrier is paired in the sock_poll_wait.
> +	 */
> +	smp_mb();
> +	return sk->sk_sleep && waitqueue_active(sk->sk_sleep);
> +}
> +
> +/**
> + * sock_poll_wait - place memory barrier behind the poll_wait call.
> + * @filp:           file
> + * @wait_address:   socket wait queue
> + * @p:              poll_table
> + *
> + * See the comments in the sk_has_sleeper function.
> + */
> +static inline void sock_poll_wait(struct file *filp,
> +		wait_queue_head_t *wait_address, poll_table *p)
> +{
> +	if (p && wait_address) {
> +		poll_wait(filp, wait_address, p);
> +		/*
> +		 * We need to be sure we are in sync with the
> +		 * socket flags modification.
> +		 *
> +		 * This memory barrier is paired in the sk_has_sleeper.
> +		*/
> +		smp_mb();
> +	}
> +}
> +
>  /*
>   * 	Queue a received datagram if it will fit. Stream and sequenced
>   *	protocols can't normally use this as they need to fit buffers in
> diff --git a/net/atm/common.c b/net/atm/common.c
> index c1c9793..8c4d843 100644
> --- a/net/atm/common.c
> +++ b/net/atm/common.c
> @@ -92,7 +92,7 @@ static void vcc_sock_destruct(struct sock *sk)
>  static void vcc_def_wakeup(struct sock *sk)
>  {
>  	read_lock(&sk->sk_callback_lock);
> -	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> +	if (sk_has_sleeper(sk))
>  		wake_up(sk->sk_sleep);
>  	read_unlock(&sk->sk_callback_lock);
>  }
> @@ -110,7 +110,7 @@ static void vcc_write_space(struct sock *sk)
>  	read_lock(&sk->sk_callback_lock);
>  
>  	if (vcc_writable(sk)) {
> -		if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> +		if (sk_has_sleeper(sk))
>  			wake_up_interruptible(sk->sk_sleep);
>  
>  		sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
> @@ -594,7 +594,7 @@ unsigned int vcc_poll(struct file *file, struct socket *sock, poll_table *wait)
>  	struct atm_vcc *vcc;
>  	unsigned int mask;
>  
> -	poll_wait(file, sk->sk_sleep, wait);
> +	sock_poll_wait(file, sk->sk_sleep, wait);
>  	mask = 0;
>  
>  	vcc = ATM_SD(sock);
> diff --git a/net/core/datagram.c b/net/core/datagram.c
> index 58abee1..b0fe692 100644
> --- a/net/core/datagram.c
> +++ b/net/core/datagram.c
> @@ -712,7 +712,7 @@ unsigned int datagram_poll(struct file *file, struct socket *sock,
>  	struct sock *sk = sock->sk;
>  	unsigned int mask;
>  
> -	poll_wait(file, sk->sk_sleep, wait);
> +	sock_poll_wait(file, sk->sk_sleep, wait);
>  	mask = 0;
>  
>  	/* exceptional events? */
> diff --git a/net/core/sock.c b/net/core/sock.c
> index b0ba569..6354863 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1715,7 +1715,7 @@ EXPORT_SYMBOL(sock_no_sendpage);
>  static void sock_def_wakeup(struct sock *sk)
>  {
>  	read_lock(&sk->sk_callback_lock);
> -	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> +	if (sk_has_sleeper(sk))
>  		wake_up_interruptible_all(sk->sk_sleep);
>  	read_unlock(&sk->sk_callback_lock);
>  }
> @@ -1723,7 +1723,7 @@ static void sock_def_wakeup(struct sock *sk)
>  static void sock_def_error_report(struct sock *sk)
>  {
>  	read_lock(&sk->sk_callback_lock);
> -	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> +	if (sk_has_sleeper(sk))
>  		wake_up_interruptible_poll(sk->sk_sleep, POLLERR);
>  	sk_wake_async(sk, SOCK_WAKE_IO, POLL_ERR);
>  	read_unlock(&sk->sk_callback_lock);
> @@ -1732,7 +1732,7 @@ static void sock_def_error_report(struct sock *sk)
>  static void sock_def_readable(struct sock *sk, int len)
>  {
>  	read_lock(&sk->sk_callback_lock);
> -	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> +	if (sk_has_sleeper(sk))
>  		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLIN |
>  						POLLRDNORM | POLLRDBAND);
>  	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
> @@ -1747,7 +1747,7 @@ static void sock_def_write_space(struct sock *sk)
>  	 * progress.  --DaveM
>  	 */
>  	if ((atomic_read(&sk->sk_wmem_alloc) << 1) <= sk->sk_sndbuf) {
> -		if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> +		if (sk_has_sleeper(sk))
>  			wake_up_interruptible_sync_poll(sk->sk_sleep, POLLOUT |
>  						POLLWRNORM | POLLWRBAND);
>  
> diff --git a/net/dccp/output.c b/net/dccp/output.c
> index c0e88c1..c96119f 100644
> --- a/net/dccp/output.c
> +++ b/net/dccp/output.c
> @@ -196,7 +196,7 @@ void dccp_write_space(struct sock *sk)
>  {
>  	read_lock(&sk->sk_callback_lock);
>  
> -	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> +	if (sk_has_sleeper(sk))
>  		wake_up_interruptible(sk->sk_sleep);
>  	/* Should agree with poll, otherwise some programs break */
>  	if (sock_writeable(sk))
> diff --git a/net/dccp/proto.c b/net/dccp/proto.c
> index 314a1b5..94ca8ea 100644
> --- a/net/dccp/proto.c
> +++ b/net/dccp/proto.c
> @@ -311,7 +311,7 @@ unsigned int dccp_poll(struct file *file, struct socket *sock,
>  	unsigned int mask;
>  	struct sock *sk = sock->sk;
>  
> -	poll_wait(file, sk->sk_sleep, wait);
> +	sock_poll_wait(file, sk->sk_sleep, wait);
>  	if (sk->sk_state == DCCP_LISTEN)
>  		return inet_csk_listen_poll(sk);
>  
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 7870a53..9114524 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -339,7 +339,7 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
>  	struct sock *sk = sock->sk;
>  	struct tcp_sock *tp = tcp_sk(sk);
>  
> -	poll_wait(file, sk->sk_sleep, wait);
> +	sock_poll_wait(file, sk->sk_sleep, wait);
>  	if (sk->sk_state == TCP_LISTEN)
>  		return inet_csk_listen_poll(sk);
>  
> diff --git a/net/iucv/af_iucv.c b/net/iucv/af_iucv.c
> index 6be5f92..49c15b4 100644
> --- a/net/iucv/af_iucv.c
> +++ b/net/iucv/af_iucv.c
> @@ -306,7 +306,7 @@ static inline int iucv_below_msglim(struct sock *sk)
>  static void iucv_sock_wake_msglim(struct sock *sk)
>  {
>  	read_lock(&sk->sk_callback_lock);
> -	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> +	if (sk_has_sleeper(sk))
>  		wake_up_interruptible_all(sk->sk_sleep);
>  	sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
>  	read_unlock(&sk->sk_callback_lock);
> @@ -1256,7 +1256,7 @@ unsigned int iucv_sock_poll(struct file *file, struct socket *sock,
>  	struct sock *sk = sock->sk;
>  	unsigned int mask = 0;
>  
> -	poll_wait(file, sk->sk_sleep, wait);
> +	sock_poll_wait(file, sk->sk_sleep, wait);
>  
>  	if (sk->sk_state == IUCV_LISTEN)
>  		return iucv_accept_poll(sk);
> diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c
> index eac5e7b..bfe493e 100644
> --- a/net/rxrpc/af_rxrpc.c
> +++ b/net/rxrpc/af_rxrpc.c
> @@ -63,7 +63,7 @@ static void rxrpc_write_space(struct sock *sk)
>  	_enter("%p", sk);
>  	read_lock(&sk->sk_callback_lock);
>  	if (rxrpc_writable(sk)) {
> -		if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> +		if (sk_has_sleeper(sk))
>  			wake_up_interruptible(sk->sk_sleep);
>  		sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
>  	}
> @@ -588,7 +588,7 @@ static unsigned int rxrpc_poll(struct file *file, struct socket *sock,
>  	unsigned int mask;
>  	struct sock *sk = sock->sk;
>  
> -	poll_wait(file, sk->sk_sleep, wait);
> +	sock_poll_wait(file, sk->sk_sleep, wait);
>  	mask = 0;
>  
>  	/* the socket is readable if there are any messages waiting on the Rx
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 36d4e44..fc3ebb9 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -315,7 +315,7 @@ static void unix_write_space(struct sock *sk)
>  {
>  	read_lock(&sk->sk_callback_lock);
>  	if (unix_writable(sk)) {
> -		if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> +		if (sk_has_sleeper(sk))
>  			wake_up_interruptible_sync(sk->sk_sleep);
>  		sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
>  	}
> @@ -1985,7 +1985,7 @@ static unsigned int unix_poll(struct file *file, struct socket *sock, poll_table
>  	struct sock *sk = sock->sk;
>  	unsigned int mask;
>  
> -	poll_wait(file, sk->sk_sleep, wait);
> +	sock_poll_wait(file, sk->sk_sleep, wait);
>  	mask = 0;
>  
>  	/* exceptional events? */
> @@ -2022,7 +2022,7 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
>  	struct sock *sk = sock->sk, *other;
>  	unsigned int mask, writable;
>  
> -	poll_wait(file, sk->sk_sleep, wait);
> +	sock_poll_wait(file, sk->sk_sleep, wait);
>  	mask = 0;
>  
>  	/* exceptional events? */
> @@ -2053,7 +2053,7 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
>  		other = unix_peer_get(sk);
>  		if (other) {
>  			if (unix_peer(other) != sk) {
> -				poll_wait(file, &unix_sk(other)->peer_wait,
> +				sock_poll_wait(file, &unix_sk(other)->peer_wait,
>  					  wait);
>  				if (unix_recvq_full(other))
>  					writable = 0;


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-03  8:12 [PATCHv5 0/2] net: fix race in the receive/select Jiri Olsa
  2009-07-03  8:13 ` [PATCHv5 1/2] net: adding memory barrier to the poll and receive callbacks Jiri Olsa
@ 2009-07-03  8:14 ` Jiri Olsa
  2009-07-03  9:06   ` Ingo Molnar
  1 sibling, 1 reply; 39+ messages in thread
From: Jiri Olsa @ 2009-07-03  8:14 UTC (permalink / raw)
  To: netdev
  Cc: linux-kernel, fbl, nhorman, davem, htejun, jarkao2, oleg, davidel,
	eric.dumazet

Adding smp_mb__after_lock define to be used as a smp_mb call after
a lock.  

Making it nop for x86, since {read|write|spin}_lock() on x86 are 
full memory barriers.

wbr,
jirka


Signed-off-by: Jiri Olsa <jolsa@redhat.com>

---
 arch/x86/include/asm/spinlock.h |    3 +++
 include/linux/spinlock.h        |    5 +++++
 include/net/sock.h              |    5 ++++-
 3 files changed, 12 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index b7e5db8..39ecc5f 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -302,4 +302,7 @@ static inline void __raw_write_unlock(raw_rwlock_t *rw)
 #define _raw_read_relax(lock)	cpu_relax()
 #define _raw_write_relax(lock)	cpu_relax()
 
+/* The {read|write|spin}_lock() on x86 are full memory barriers. */
+#define smp_mb__after_lock() do { } while (0)
+
 #endif /* _ASM_X86_SPINLOCK_H */
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 252b245..ae053bd 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -132,6 +132,11 @@ do {								\
 #endif /*__raw_spin_is_contended*/
 #endif
 
+/* The lock does not imply full memory barrier. */
+#ifndef smp_mb__after_lock
+#define smp_mb__after_lock() smp_mb()
+#endif
+
 /**
  * spin_unlock_wait - wait until the spinlock gets unlocked
  * @lock: the spinlock in question.
diff --git a/include/net/sock.h b/include/net/sock.h
index 4eb8409..98afcd9 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1271,6 +1271,9 @@ static inline int sk_has_allocations(const struct sock *sk)
  * in its cache, and so does the tp->rcv_nxt update on CPU2 side.  The CPU1
  * could then endup calling schedule and sleep forever if there are no more
  * data on the socket.
+ *
+ * The sk_has_helper is always called right after a call to read_lock, so we
+ * can use smp_mb__after_lock barrier.
  */
 static inline int sk_has_sleeper(struct sock *sk)
 {
@@ -1280,7 +1283,7 @@ static inline int sk_has_sleeper(struct sock *sk)
 	 *
 	 * This memory barrier is paired in the sock_poll_wait.
 	 */
-	smp_mb();
+	smp_mb__after_lock();
 	return sk->sk_sleep && waitqueue_active(sk->sk_sleep);
 }
 

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-03  8:14 ` [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock Jiri Olsa
@ 2009-07-03  9:06   ` Ingo Molnar
  2009-07-03  9:20     ` Eric Dumazet
  2009-07-03 14:04     ` Mathieu Desnoyers
  0 siblings, 2 replies; 39+ messages in thread
From: Ingo Molnar @ 2009-07-03  9:06 UTC (permalink / raw)
  To: Jiri Olsa, Peter Zijlstra, Mathieu Desnoyers
  Cc: netdev, linux-kernel, fbl, nhorman, davem, htejun, jarkao2, oleg,
	davidel, eric.dumazet


* Jiri Olsa <jolsa@redhat.com> wrote:

> +++ b/arch/x86/include/asm/spinlock.h
> @@ -302,4 +302,7 @@ static inline void __raw_write_unlock(raw_rwlock_t *rw)
>  #define _raw_read_relax(lock)	cpu_relax()
>  #define _raw_write_relax(lock)	cpu_relax()
>  
> +/* The {read|write|spin}_lock() on x86 are full memory barriers. */
> +#define smp_mb__after_lock() do { } while (0)

Two small stylistic comments, please make this an inline function:

static inline void smp_mb__after_lock(void) { }
#define smp_mb__after_lock

(untested)

> +/* The lock does not imply full memory barrier. */
> +#ifndef smp_mb__after_lock
> +#define smp_mb__after_lock() smp_mb()
> +#endif

ditto.

	Ingo

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-03  9:06   ` Ingo Molnar
@ 2009-07-03  9:20     ` Eric Dumazet
  2009-07-03  9:24       ` Ingo Molnar
  2009-07-03 14:04     ` Mathieu Desnoyers
  1 sibling, 1 reply; 39+ messages in thread
From: Eric Dumazet @ 2009-07-03  9:20 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jiri Olsa, Peter Zijlstra, Mathieu Desnoyers, netdev,
	linux-kernel, fbl, nhorman, davem, htejun, jarkao2, oleg, davidel

Ingo Molnar a écrit :
> * Jiri Olsa <jolsa@redhat.com> wrote:
> 
>> +++ b/arch/x86/include/asm/spinlock.h
>> @@ -302,4 +302,7 @@ static inline void __raw_write_unlock(raw_rwlock_t *rw)
>>  #define _raw_read_relax(lock)	cpu_relax()
>>  #define _raw_write_relax(lock)	cpu_relax()
>>  
>> +/* The {read|write|spin}_lock() on x86 are full memory barriers. */
>> +#define smp_mb__after_lock() do { } while (0)
> 
> Two small stylistic comments, please make this an inline function:
> 
> static inline void smp_mb__after_lock(void) { }
> #define smp_mb__after_lock
> 
> (untested)
> 
>> +/* The lock does not imply full memory barrier. */
>> +#ifndef smp_mb__after_lock
>> +#define smp_mb__after_lock() smp_mb()
>> +#endif
> 
> ditto.
> 
> 	Ingo

This was following existing implementations of various smp_mb__??? helpers :

# grep -4 smp_mb__before_clear_bit include/asm-generic/bitops.h

/*
 * clear_bit may not imply a memory barrier
 */
#ifndef smp_mb__before_clear_bit
#define smp_mb__before_clear_bit()      smp_mb()
#define smp_mb__after_clear_bit()       smp_mb()
#endif



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-03  9:20     ` Eric Dumazet
@ 2009-07-03  9:24       ` Ingo Molnar
  2009-07-03  9:56         ` Jiri Olsa
  0 siblings, 1 reply; 39+ messages in thread
From: Ingo Molnar @ 2009-07-03  9:24 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jiri Olsa, Peter Zijlstra, Mathieu Desnoyers, netdev,
	linux-kernel, fbl, nhorman, davem, htejun, jarkao2, oleg, davidel


* Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Ingo Molnar a écrit :
> > * Jiri Olsa <jolsa@redhat.com> wrote:
> > 
> >> +++ b/arch/x86/include/asm/spinlock.h
> >> @@ -302,4 +302,7 @@ static inline void __raw_write_unlock(raw_rwlock_t *rw)
> >>  #define _raw_read_relax(lock)	cpu_relax()
> >>  #define _raw_write_relax(lock)	cpu_relax()
> >>  
> >> +/* The {read|write|spin}_lock() on x86 are full memory barriers. */
> >> +#define smp_mb__after_lock() do { } while (0)
> > 
> > Two small stylistic comments, please make this an inline function:
> > 
> > static inline void smp_mb__after_lock(void) { }
> > #define smp_mb__after_lock
> > 
> > (untested)
> > 
> >> +/* The lock does not imply full memory barrier. */
> >> +#ifndef smp_mb__after_lock
> >> +#define smp_mb__after_lock() smp_mb()
> >> +#endif
> > 
> > ditto.
> > 
> > 	Ingo
> 
> This was following existing implementations of various smp_mb__??? helpers :
> 
> # grep -4 smp_mb__before_clear_bit include/asm-generic/bitops.h
> 
> /*
>  * clear_bit may not imply a memory barrier
>  */
> #ifndef smp_mb__before_clear_bit
> #define smp_mb__before_clear_bit()      smp_mb()
> #define smp_mb__after_clear_bit()       smp_mb()
> #endif

Did i mention that those should be fixed too? :-)

	Ingo

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-03  9:24       ` Ingo Molnar
@ 2009-07-03  9:56         ` Jiri Olsa
  2009-07-03 10:25           ` Ingo Molnar
  0 siblings, 1 reply; 39+ messages in thread
From: Jiri Olsa @ 2009-07-03  9:56 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, Peter Zijlstra, Mathieu Desnoyers, netdev,
	linux-kernel, fbl, nhorman, davem, htejun, jarkao2, oleg, davidel

On Fri, Jul 03, 2009 at 11:24:38AM +0200, Ingo Molnar wrote:
> 
> * Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > Ingo Molnar a écrit :
> > > * Jiri Olsa <jolsa@redhat.com> wrote:
> > > 
> > >> +++ b/arch/x86/include/asm/spinlock.h
> > >> @@ -302,4 +302,7 @@ static inline void __raw_write_unlock(raw_rwlock_t *rw)
> > >>  #define _raw_read_relax(lock)	cpu_relax()
> > >>  #define _raw_write_relax(lock)	cpu_relax()
> > >>  
> > >> +/* The {read|write|spin}_lock() on x86 are full memory barriers. */
> > >> +#define smp_mb__after_lock() do { } while (0)
> > > 
> > > Two small stylistic comments, please make this an inline function:
> > > 
> > > static inline void smp_mb__after_lock(void) { }
> > > #define smp_mb__after_lock
> > > 
> > > (untested)
> > > 
> > >> +/* The lock does not imply full memory barrier. */
> > >> +#ifndef smp_mb__after_lock
> > >> +#define smp_mb__after_lock() smp_mb()
> > >> +#endif
> > > 
> > > ditto.
> > > 
> > > 	Ingo
> > 
> > This was following existing implementations of various smp_mb__??? helpers :
> > 
> > # grep -4 smp_mb__before_clear_bit include/asm-generic/bitops.h
> > 
> > /*
> >  * clear_bit may not imply a memory barrier
> >  */
> > #ifndef smp_mb__before_clear_bit
> > #define smp_mb__before_clear_bit()      smp_mb()
> > #define smp_mb__after_clear_bit()       smp_mb()
> > #endif
> 
> Did i mention that those should be fixed too? :-)
> 
> 	Ingo

ok, could I include it in the 2/2 or you prefer separate patch?

jirka

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-03  9:56         ` Jiri Olsa
@ 2009-07-03 10:25           ` Ingo Molnar
  2009-07-03 11:18             ` Jiri Olsa
  0 siblings, 1 reply; 39+ messages in thread
From: Ingo Molnar @ 2009-07-03 10:25 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Eric Dumazet, Peter Zijlstra, Mathieu Desnoyers, netdev,
	linux-kernel, fbl, nhorman, davem, htejun, jarkao2, oleg, davidel


* Jiri Olsa <jolsa@redhat.com> wrote:

> On Fri, Jul 03, 2009 at 11:24:38AM +0200, Ingo Molnar wrote:
> > 
> > * Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > 
> > > Ingo Molnar a écrit :
> > > > * Jiri Olsa <jolsa@redhat.com> wrote:
> > > > 
> > > >> +++ b/arch/x86/include/asm/spinlock.h
> > > >> @@ -302,4 +302,7 @@ static inline void __raw_write_unlock(raw_rwlock_t *rw)
> > > >>  #define _raw_read_relax(lock)	cpu_relax()
> > > >>  #define _raw_write_relax(lock)	cpu_relax()
> > > >>  
> > > >> +/* The {read|write|spin}_lock() on x86 are full memory barriers. */
> > > >> +#define smp_mb__after_lock() do { } while (0)
> > > > 
> > > > Two small stylistic comments, please make this an inline function:
> > > > 
> > > > static inline void smp_mb__after_lock(void) { }
> > > > #define smp_mb__after_lock
> > > > 
> > > > (untested)
> > > > 
> > > >> +/* The lock does not imply full memory barrier. */
> > > >> +#ifndef smp_mb__after_lock
> > > >> +#define smp_mb__after_lock() smp_mb()
> > > >> +#endif
> > > > 
> > > > ditto.
> > > > 
> > > > 	Ingo
> > > 
> > > This was following existing implementations of various smp_mb__??? helpers :
> > > 
> > > # grep -4 smp_mb__before_clear_bit include/asm-generic/bitops.h
> > > 
> > > /*
> > >  * clear_bit may not imply a memory barrier
> > >  */
> > > #ifndef smp_mb__before_clear_bit
> > > #define smp_mb__before_clear_bit()      smp_mb()
> > > #define smp_mb__after_clear_bit()       smp_mb()
> > > #endif
> > 
> > Did i mention that those should be fixed too? :-)
> > 
> > 	Ingo
> 
> ok, could I include it in the 2/2 or you prefer separate patch?

depends on whether it will regress ;-)

If it regresses, it's better to have it separate. If it wont, it can 
be included. If unsure, default to the more conservative option.

	Ingo

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-03 10:25           ` Ingo Molnar
@ 2009-07-03 11:18             ` Jiri Olsa
  2009-07-03 11:30               ` Jarek Poplawski
  2009-07-07 10:18               ` Jiri Olsa
  0 siblings, 2 replies; 39+ messages in thread
From: Jiri Olsa @ 2009-07-03 11:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, Peter Zijlstra, Mathieu Desnoyers, netdev,
	linux-kernel, fbl, nhorman, davem, htejun, jarkao2, oleg, davidel

On Fri, Jul 03, 2009 at 12:25:30PM +0200, Ingo Molnar wrote:
> 
> * Jiri Olsa <jolsa@redhat.com> wrote:
> 
> > On Fri, Jul 03, 2009 at 11:24:38AM +0200, Ingo Molnar wrote:
> > > 
> > > * Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > 
> > > > Ingo Molnar a écrit :
> > > > > * Jiri Olsa <jolsa@redhat.com> wrote:
> > > > > 
> > > > >> +++ b/arch/x86/include/asm/spinlock.h
> > > > >> @@ -302,4 +302,7 @@ static inline void __raw_write_unlock(raw_rwlock_t *rw)
> > > > >>  #define _raw_read_relax(lock)	cpu_relax()
> > > > >>  #define _raw_write_relax(lock)	cpu_relax()
> > > > >>  
> > > > >> +/* The {read|write|spin}_lock() on x86 are full memory barriers. */
> > > > >> +#define smp_mb__after_lock() do { } while (0)
> > > > > 
> > > > > Two small stylistic comments, please make this an inline function:
> > > > > 
> > > > > static inline void smp_mb__after_lock(void) { }
> > > > > #define smp_mb__after_lock
> > > > > 
> > > > > (untested)
> > > > > 
> > > > >> +/* The lock does not imply full memory barrier. */
> > > > >> +#ifndef smp_mb__after_lock
> > > > >> +#define smp_mb__after_lock() smp_mb()
> > > > >> +#endif
> > > > > 
> > > > > ditto.
> > > > > 
> > > > > 	Ingo
> > > > 
> > > > This was following existing implementations of various smp_mb__??? helpers :
> > > > 
> > > > # grep -4 smp_mb__before_clear_bit include/asm-generic/bitops.h
> > > > 
> > > > /*
> > > >  * clear_bit may not imply a memory barrier
> > > >  */
> > > > #ifndef smp_mb__before_clear_bit
> > > > #define smp_mb__before_clear_bit()      smp_mb()
> > > > #define smp_mb__after_clear_bit()       smp_mb()
> > > > #endif
> > > 
> > > Did i mention that those should be fixed too? :-)
> > > 
> > > 	Ingo
> > 
> > ok, could I include it in the 2/2 or you prefer separate patch?
> 
> depends on whether it will regress ;-)
> 
> If it regresses, it's better to have it separate. If it wont, it can 
> be included. If unsure, default to the more conservative option.
> 
> 	Ingo


how about this.. 
and similar change for smp_mb__before_clear_bit in a separate patch


diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index b7e5db8..4e77853 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -302,4 +302,8 @@ static inline void __raw_write_unlock(raw_rwlock_t *rw)
 #define _raw_read_relax(lock)	cpu_relax()
 #define _raw_write_relax(lock)	cpu_relax()
 
+/* The {read|write|spin}_lock() on x86 are full memory barriers. */
+static inline void smp_mb__after_lock(void) { }
+#define ARCH_HAS_SMP_MB_AFTER_LOCK
+
 #endif /* _ASM_X86_SPINLOCK_H */
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 252b245..4be57ab 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -132,6 +132,11 @@ do {								\
 #endif /*__raw_spin_is_contended*/
 #endif
 
+/* The lock does not imply full memory barrier. */
+#ifndef ARCH_HAS_SMP_MB_AFTER_LOCK
+static inline void smp_mb__after_lock(void) { smp_mb(); }
+#endif
+
 /**
  * spin_unlock_wait - wait until the spinlock gets unlocked
  * @lock: the spinlock in question.
diff --git a/include/net/sock.h b/include/net/sock.h
index 4eb8409..98afcd9 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1271,6 +1271,9 @@ static inline int sk_has_allocations(const struct sock *sk)
  * in its cache, and so does the tp->rcv_nxt update on CPU2 side.  The CPU1
  * could then endup calling schedule and sleep forever if there are no more
  * data on the socket.
+ *
+ * The sk_has_helper is always called right after a call to read_lock, so we
+ * can use smp_mb__after_lock barrier.
  */
 static inline int sk_has_sleeper(struct sock *sk)
 {
@@ -1280,7 +1283,7 @@ static inline int sk_has_sleeper(struct sock *sk)
 	 *
 	 * This memory barrier is paired in the sock_poll_wait.
 	 */
-	smp_mb();
+	smp_mb__after_lock();
 	return sk->sk_sleep && waitqueue_active(sk->sk_sleep);
 }
 

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-03 11:18             ` Jiri Olsa
@ 2009-07-03 11:30               ` Jarek Poplawski
  2009-07-03 11:43                 ` Jiri Olsa
  2009-07-07 10:18               ` Jiri Olsa
  1 sibling, 1 reply; 39+ messages in thread
From: Jarek Poplawski @ 2009-07-03 11:30 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Ingo Molnar, Eric Dumazet, Peter Zijlstra, Mathieu Desnoyers,
	netdev, linux-kernel, fbl, nhorman, davem, htejun, oleg, davidel

On Fri, Jul 03, 2009 at 01:18:48PM +0200, Jiri Olsa wrote:
...
> diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
> index b7e5db8..4e77853 100644
> --- a/arch/x86/include/asm/spinlock.h
> +++ b/arch/x86/include/asm/spinlock.h
...
> @@ -1271,6 +1271,9 @@ static inline int sk_has_allocations(const struct sock *sk)
>   * in its cache, and so does the tp->rcv_nxt update on CPU2 side.  The CPU1
>   * could then endup calling schedule and sleep forever if there are no more
>   * data on the socket.
> + *
> + * The sk_has_helper is always called right after a call to read_lock, so we
Btw.:
- * The sk_has_helper is always called right after a call to read_lock, so we
+ * The sk_has_sleeper is always called right after a call to read_lock, so we

Jarek P.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-03 11:30               ` Jarek Poplawski
@ 2009-07-03 11:43                 ` Jiri Olsa
  0 siblings, 0 replies; 39+ messages in thread
From: Jiri Olsa @ 2009-07-03 11:43 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: Ingo Molnar, Eric Dumazet, Peter Zijlstra, Mathieu Desnoyers,
	netdev, linux-kernel, fbl, nhorman, davem, htejun, oleg, davidel

On Fri, Jul 03, 2009 at 11:30:27AM +0000, Jarek Poplawski wrote:
> On Fri, Jul 03, 2009 at 01:18:48PM +0200, Jiri Olsa wrote:
> ...
> > diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
> > index b7e5db8..4e77853 100644
> > --- a/arch/x86/include/asm/spinlock.h
> > +++ b/arch/x86/include/asm/spinlock.h
> ...
> > @@ -1271,6 +1271,9 @@ static inline int sk_has_allocations(const struct sock *sk)
> >   * in its cache, and so does the tp->rcv_nxt update on CPU2 side.  The CPU1
> >   * could then endup calling schedule and sleep forever if there are no more
> >   * data on the socket.
> > + *
> > + * The sk_has_helper is always called right after a call to read_lock, so we
> Btw.:
> - * The sk_has_helper is always called right after a call to read_lock, so we
> + * The sk_has_sleeper is always called right after a call to read_lock, so we
> 
> Jarek P.

oops, thanks

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-03 11:18             ` Jiri Olsa
  2009-07-03 11:30               ` Jarek Poplawski
@ 2009-07-07 10:18               ` Jiri Olsa
  2009-07-07 13:46                 ` Jiri Olsa
  1 sibling, 1 reply; 39+ messages in thread
From: Jiri Olsa @ 2009-07-07 10:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, Peter Zijlstra, Mathieu Desnoyers, netdev,
	linux-kernel, fbl, nhorman, davem, htejun, jarkao2, oleg, davidel

On Fri, Jul 03, 2009 at 01:18:48PM +0200, Jiri Olsa wrote:
> On Fri, Jul 03, 2009 at 12:25:30PM +0200, Ingo Molnar wrote:
> > 
> > * Jiri Olsa <jolsa@redhat.com> wrote:
> > 
> > > On Fri, Jul 03, 2009 at 11:24:38AM +0200, Ingo Molnar wrote:
> > > > 
> > > > * Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > > 
> > > > > Ingo Molnar a écrit :
> > > > > > * Jiri Olsa <jolsa@redhat.com> wrote:
> > > > > > 
> > > > > >> +++ b/arch/x86/include/asm/spinlock.h
> > > > > >> @@ -302,4 +302,7 @@ static inline void __raw_write_unlock(raw_rwlock_t *rw)
> > > > > >>  #define _raw_read_relax(lock)	cpu_relax()
> > > > > >>  #define _raw_write_relax(lock)	cpu_relax()
> > > > > >>  
> > > > > >> +/* The {read|write|spin}_lock() on x86 are full memory barriers. */
> > > > > >> +#define smp_mb__after_lock() do { } while (0)
> > > > > > 
> > > > > > Two small stylistic comments, please make this an inline function:
> > > > > > 
> > > > > > static inline void smp_mb__after_lock(void) { }
> > > > > > #define smp_mb__after_lock
> > > > > > 
> > > > > > (untested)
> > > > > > 
> > > > > >> +/* The lock does not imply full memory barrier. */
> > > > > >> +#ifndef smp_mb__after_lock
> > > > > >> +#define smp_mb__after_lock() smp_mb()
> > > > > >> +#endif
> > > > > > 
> > > > > > ditto.
> > > > > > 
> > > > > > 	Ingo
> > > > > 
> > > > > This was following existing implementations of various smp_mb__??? helpers :
> > > > > 
> > > > > # grep -4 smp_mb__before_clear_bit include/asm-generic/bitops.h
> > > > > 
> > > > > /*
> > > > >  * clear_bit may not imply a memory barrier
> > > > >  */
> > > > > #ifndef smp_mb__before_clear_bit
> > > > > #define smp_mb__before_clear_bit()      smp_mb()
> > > > > #define smp_mb__after_clear_bit()       smp_mb()
> > > > > #endif
> > > > 
> > > > Did i mention that those should be fixed too? :-)
> > > > 
> > > > 	Ingo
> > > 
> > > ok, could I include it in the 2/2 or you prefer separate patch?
> > 
> > depends on whether it will regress ;-)
> > 
> > If it regresses, it's better to have it separate. If it wont, it can 
> > be included. If unsure, default to the more conservative option.
> > 
> > 	Ingo
> 
> 
> how about this.. 
> and similar change for smp_mb__before_clear_bit in a separate patch
> 
> 
> diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
> index b7e5db8..4e77853 100644
> --- a/arch/x86/include/asm/spinlock.h
> +++ b/arch/x86/include/asm/spinlock.h
> @@ -302,4 +302,8 @@ static inline void __raw_write_unlock(raw_rwlock_t *rw)
>  #define _raw_read_relax(lock)	cpu_relax()
>  #define _raw_write_relax(lock)	cpu_relax()
>  
> +/* The {read|write|spin}_lock() on x86 are full memory barriers. */
> +static inline void smp_mb__after_lock(void) { }
> +#define ARCH_HAS_SMP_MB_AFTER_LOCK
> +
>  #endif /* _ASM_X86_SPINLOCK_H */
> diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
> index 252b245..4be57ab 100644
> --- a/include/linux/spinlock.h
> +++ b/include/linux/spinlock.h
> @@ -132,6 +132,11 @@ do {								\
>  #endif /*__raw_spin_is_contended*/
>  #endif
>  
> +/* The lock does not imply full memory barrier. */
> +#ifndef ARCH_HAS_SMP_MB_AFTER_LOCK
> +static inline void smp_mb__after_lock(void) { smp_mb(); }
> +#endif
> +
>  /**
>   * spin_unlock_wait - wait until the spinlock gets unlocked
>   * @lock: the spinlock in question.
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 4eb8409..98afcd9 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -1271,6 +1271,9 @@ static inline int sk_has_allocations(const struct sock *sk)
>   * in its cache, and so does the tp->rcv_nxt update on CPU2 side.  The CPU1
>   * could then endup calling schedule and sleep forever if there are no more
>   * data on the socket.
> + *
> + * The sk_has_helper is always called right after a call to read_lock, so we
> + * can use smp_mb__after_lock barrier.
>   */
>  static inline int sk_has_sleeper(struct sock *sk)
>  {
> @@ -1280,7 +1283,7 @@ static inline int sk_has_sleeper(struct sock *sk)
>  	 *
>  	 * This memory barrier is paired in the sock_poll_wait.
>  	 */
> -	smp_mb();
> +	smp_mb__after_lock();
>  	return sk->sk_sleep && waitqueue_active(sk->sk_sleep);
>  }
>  

any feedback on this? 
I'd send v6 if this way is acceptable..

thanks,
jirka

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-07 10:18               ` Jiri Olsa
@ 2009-07-07 13:46                 ` Jiri Olsa
  2009-07-07 14:01                   ` Mathieu Desnoyers
  0 siblings, 1 reply; 39+ messages in thread
From: Jiri Olsa @ 2009-07-07 13:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric Dumazet, Peter Zijlstra, Mathieu Desnoyers, netdev,
	linux-kernel, fbl, nhorman, davem, htejun, jarkao2, oleg, davidel

On Tue, Jul 07, 2009 at 12:18:16PM +0200, Jiri Olsa wrote:
> On Fri, Jul 03, 2009 at 01:18:48PM +0200, Jiri Olsa wrote:
> > On Fri, Jul 03, 2009 at 12:25:30PM +0200, Ingo Molnar wrote:
> > > 
> > > * Jiri Olsa <jolsa@redhat.com> wrote:
> > > 
> > > > On Fri, Jul 03, 2009 at 11:24:38AM +0200, Ingo Molnar wrote:
> > > > > 
> > > > > * Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > > > 
> > > > > > Ingo Molnar a écrit :
> > > > > > > * Jiri Olsa <jolsa@redhat.com> wrote:
> > > > > > > 
> > > > > > >> +++ b/arch/x86/include/asm/spinlock.h
> > > > > > >> @@ -302,4 +302,7 @@ static inline void __raw_write_unlock(raw_rwlock_t *rw)
> > > > > > >>  #define _raw_read_relax(lock)	cpu_relax()
> > > > > > >>  #define _raw_write_relax(lock)	cpu_relax()
> > > > > > >>  
> > > > > > >> +/* The {read|write|spin}_lock() on x86 are full memory barriers. */
> > > > > > >> +#define smp_mb__after_lock() do { } while (0)
> > > > > > > 
> > > > > > > Two small stylistic comments, please make this an inline function:
> > > > > > > 
> > > > > > > static inline void smp_mb__after_lock(void) { }
> > > > > > > #define smp_mb__after_lock
> > > > > > > 
> > > > > > > (untested)
> > > > > > > 
> > > > > > >> +/* The lock does not imply full memory barrier. */
> > > > > > >> +#ifndef smp_mb__after_lock
> > > > > > >> +#define smp_mb__after_lock() smp_mb()
> > > > > > >> +#endif
> > > > > > > 
> > > > > > > ditto.
> > > > > > > 
> > > > > > > 	Ingo
> > > > > > 
> > > > > > This was following existing implementations of various smp_mb__??? helpers :
> > > > > > 
> > > > > > # grep -4 smp_mb__before_clear_bit include/asm-generic/bitops.h
> > > > > > 
> > > > > > /*
> > > > > >  * clear_bit may not imply a memory barrier
> > > > > >  */
> > > > > > #ifndef smp_mb__before_clear_bit
> > > > > > #define smp_mb__before_clear_bit()      smp_mb()
> > > > > > #define smp_mb__after_clear_bit()       smp_mb()
> > > > > > #endif
> > > > > 
> > > > > Did i mention that those should be fixed too? :-)
> > > > > 
> > > > > 	Ingo
> > > > 
> > > > ok, could I include it in the 2/2 or you prefer separate patch?
> > > 
> > > depends on whether it will regress ;-)
> > > 
> > > If it regresses, it's better to have it separate. If it wont, it can 
> > > be included. If unsure, default to the more conservative option.
> > > 
> > > 	Ingo
> > 
> > 
> > how about this.. 
> > and similar change for smp_mb__before_clear_bit in a separate patch
> > 
> > 
> > diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
> > index b7e5db8..4e77853 100644
> > --- a/arch/x86/include/asm/spinlock.h
> > +++ b/arch/x86/include/asm/spinlock.h
> > @@ -302,4 +302,8 @@ static inline void __raw_write_unlock(raw_rwlock_t *rw)
> >  #define _raw_read_relax(lock)	cpu_relax()
> >  #define _raw_write_relax(lock)	cpu_relax()
> >  
> > +/* The {read|write|spin}_lock() on x86 are full memory barriers. */
> > +static inline void smp_mb__after_lock(void) { }
> > +#define ARCH_HAS_SMP_MB_AFTER_LOCK
> > +
> >  #endif /* _ASM_X86_SPINLOCK_H */
> > diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
> > index 252b245..4be57ab 100644
> > --- a/include/linux/spinlock.h
> > +++ b/include/linux/spinlock.h
> > @@ -132,6 +132,11 @@ do {								\
> >  #endif /*__raw_spin_is_contended*/
> >  #endif
> >  
> > +/* The lock does not imply full memory barrier. */
> > +#ifndef ARCH_HAS_SMP_MB_AFTER_LOCK
> > +static inline void smp_mb__after_lock(void) { smp_mb(); }
> > +#endif
> > +
> >  /**
> >   * spin_unlock_wait - wait until the spinlock gets unlocked
> >   * @lock: the spinlock in question.
> > diff --git a/include/net/sock.h b/include/net/sock.h
> > index 4eb8409..98afcd9 100644
> > --- a/include/net/sock.h
> > +++ b/include/net/sock.h
> > @@ -1271,6 +1271,9 @@ static inline int sk_has_allocations(const struct sock *sk)
> >   * in its cache, and so does the tp->rcv_nxt update on CPU2 side.  The CPU1
> >   * could then endup calling schedule and sleep forever if there are no more
> >   * data on the socket.
> > + *
> > + * The sk_has_helper is always called right after a call to read_lock, so we
> > + * can use smp_mb__after_lock barrier.
> >   */
> >  static inline int sk_has_sleeper(struct sock *sk)
> >  {
> > @@ -1280,7 +1283,7 @@ static inline int sk_has_sleeper(struct sock *sk)
> >  	 *
> >  	 * This memory barrier is paired in the sock_poll_wait.
> >  	 */
> > -	smp_mb();
> > +	smp_mb__after_lock();
> >  	return sk->sk_sleep && waitqueue_active(sk->sk_sleep);
> >  }
> >  
> 
> any feedback on this? 
> I'd send v6 if this way is acceptable..
> 
> thanks,
> jirka

also I checked the smp_mb__before_clear_bit/smp_mb__after_clear_bit and
it is used quite extensivelly.

I'd prefer to send it in a separate patch, so we can move on with the 
changes I've sent so far..

regards,
jirka

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-07 13:46                 ` Jiri Olsa
@ 2009-07-07 14:01                   ` Mathieu Desnoyers
  2009-07-07 14:34                     ` Oleg Nesterov
                                       ` (2 more replies)
  0 siblings, 3 replies; 39+ messages in thread
From: Mathieu Desnoyers @ 2009-07-07 14:01 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Ingo Molnar, Eric Dumazet, Peter Zijlstra, netdev, linux-kernel,
	fbl, nhorman, davem, htejun, jarkao2, oleg, davidel

* Jiri Olsa (jolsa@redhat.com) wrote:
> On Tue, Jul 07, 2009 at 12:18:16PM +0200, Jiri Olsa wrote:
> > On Fri, Jul 03, 2009 at 01:18:48PM +0200, Jiri Olsa wrote:
> > > On Fri, Jul 03, 2009 at 12:25:30PM +0200, Ingo Molnar wrote:
> > > > 
> > > > * Jiri Olsa <jolsa@redhat.com> wrote:
> > > > 
> > > > > On Fri, Jul 03, 2009 at 11:24:38AM +0200, Ingo Molnar wrote:
> > > > > > 
> > > > > > * Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > > > > 
> > > > > > > Ingo Molnar a écrit :
> > > > > > > > * Jiri Olsa <jolsa@redhat.com> wrote:
> > > > > > > > 
> > > > > > > >> +++ b/arch/x86/include/asm/spinlock.h
> > > > > > > >> @@ -302,4 +302,7 @@ static inline void __raw_write_unlock(raw_rwlock_t *rw)
> > > > > > > >>  #define _raw_read_relax(lock)	cpu_relax()
> > > > > > > >>  #define _raw_write_relax(lock)	cpu_relax()
> > > > > > > >>  
> > > > > > > >> +/* The {read|write|spin}_lock() on x86 are full memory barriers. */
> > > > > > > >> +#define smp_mb__after_lock() do { } while (0)
> > > > > > > > 
> > > > > > > > Two small stylistic comments, please make this an inline function:
> > > > > > > > 
> > > > > > > > static inline void smp_mb__after_lock(void) { }
> > > > > > > > #define smp_mb__after_lock
> > > > > > > > 
> > > > > > > > (untested)
> > > > > > > > 
> > > > > > > >> +/* The lock does not imply full memory barrier. */
> > > > > > > >> +#ifndef smp_mb__after_lock
> > > > > > > >> +#define smp_mb__after_lock() smp_mb()
> > > > > > > >> +#endif
> > > > > > > > 
> > > > > > > > ditto.
> > > > > > > > 
> > > > > > > > 	Ingo
> > > > > > > 
> > > > > > > This was following existing implementations of various smp_mb__??? helpers :
> > > > > > > 
> > > > > > > # grep -4 smp_mb__before_clear_bit include/asm-generic/bitops.h
> > > > > > > 
> > > > > > > /*
> > > > > > >  * clear_bit may not imply a memory barrier
> > > > > > >  */
> > > > > > > #ifndef smp_mb__before_clear_bit
> > > > > > > #define smp_mb__before_clear_bit()      smp_mb()
> > > > > > > #define smp_mb__after_clear_bit()       smp_mb()
> > > > > > > #endif
> > > > > > 
> > > > > > Did i mention that those should be fixed too? :-)
> > > > > > 
> > > > > > 	Ingo
> > > > > 
> > > > > ok, could I include it in the 2/2 or you prefer separate patch?
> > > > 
> > > > depends on whether it will regress ;-)
> > > > 
> > > > If it regresses, it's better to have it separate. If it wont, it can 
> > > > be included. If unsure, default to the more conservative option.
> > > > 
> > > > 	Ingo
> > > 
> > > 
> > > how about this.. 
> > > and similar change for smp_mb__before_clear_bit in a separate patch
> > > 
> > > 
> > > diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
> > > index b7e5db8..4e77853 100644
> > > --- a/arch/x86/include/asm/spinlock.h
> > > +++ b/arch/x86/include/asm/spinlock.h
> > > @@ -302,4 +302,8 @@ static inline void __raw_write_unlock(raw_rwlock_t *rw)
> > >  #define _raw_read_relax(lock)	cpu_relax()
> > >  #define _raw_write_relax(lock)	cpu_relax()
> > >  
> > > +/* The {read|write|spin}_lock() on x86 are full memory barriers. */
> > > +static inline void smp_mb__after_lock(void) { }
> > > +#define ARCH_HAS_SMP_MB_AFTER_LOCK
> > > +
> > >  #endif /* _ASM_X86_SPINLOCK_H */
> > > diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
> > > index 252b245..4be57ab 100644
> > > --- a/include/linux/spinlock.h
> > > +++ b/include/linux/spinlock.h
> > > @@ -132,6 +132,11 @@ do {								\
> > >  #endif /*__raw_spin_is_contended*/
> > >  #endif
> > >  
> > > +/* The lock does not imply full memory barrier. */
> > > +#ifndef ARCH_HAS_SMP_MB_AFTER_LOCK
> > > +static inline void smp_mb__after_lock(void) { smp_mb(); }
> > > +#endif
> > > +
> > >  /**
> > >   * spin_unlock_wait - wait until the spinlock gets unlocked
> > >   * @lock: the spinlock in question.
> > > diff --git a/include/net/sock.h b/include/net/sock.h
> > > index 4eb8409..98afcd9 100644
> > > --- a/include/net/sock.h
> > > +++ b/include/net/sock.h
> > > @@ -1271,6 +1271,9 @@ static inline int sk_has_allocations(const struct sock *sk)
> > >   * in its cache, and so does the tp->rcv_nxt update on CPU2 side.  The CPU1
> > >   * could then endup calling schedule and sleep forever if there are no more
> > >   * data on the socket.
> > > + *
> > > + * The sk_has_helper is always called right after a call to read_lock, so we
> > > + * can use smp_mb__after_lock barrier.
> > >   */
> > >  static inline int sk_has_sleeper(struct sock *sk)
> > >  {
> > > @@ -1280,7 +1283,7 @@ static inline int sk_has_sleeper(struct sock *sk)
> > >  	 *
> > >  	 * This memory barrier is paired in the sock_poll_wait.
> > >  	 */
> > > -	smp_mb();
> > > +	smp_mb__after_lock();
> > >  	return sk->sk_sleep && waitqueue_active(sk->sk_sleep);
> > >  }
> > >  
> > 
> > any feedback on this? 
> > I'd send v6 if this way is acceptable..
> > 
> > thanks,
> > jirka
> 
> also I checked the smp_mb__before_clear_bit/smp_mb__after_clear_bit and
> it is used quite extensivelly.
> 
> I'd prefer to send it in a separate patch, so we can move on with the 
> changes I've sent so far..
> 

As with any optimization (and this is one that adds a semantic that will
just grow the memory barrier/locking rule complexity), it should come
with performance benchmarks showing real-life improvements.

Otherwise I'd recommend sticking to smp_mb() if this execution path is
not that critical, or to move to RCU if it's _that_ critical.

A valid argument would be if the data structures protected are so
complex that RCU is out of question but still the few cycles saved by
removing a memory barrier are really significant. And even then, the
proper solution would be more something like a
__read_lock()+smp_mb+smp_mb+__read_unlock(), so we get the performance
improvements on architectures other than x86 as well.

So in all cases, I don't think the smp_mb__after_lock() is the
appropriate solution.

Mathieu

> regards,
> jirka

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-07 14:01                   ` Mathieu Desnoyers
@ 2009-07-07 14:34                     ` Oleg Nesterov
  2009-07-07 15:04                       ` Mathieu Desnoyers
  2009-07-07 14:34                     ` Jiri Olsa
  2009-07-07 14:42                     ` Eric Dumazet
  2 siblings, 1 reply; 39+ messages in thread
From: Oleg Nesterov @ 2009-07-07 14:34 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Jiri Olsa, Ingo Molnar, Eric Dumazet, Peter Zijlstra, netdev,
	linux-kernel, fbl, nhorman, davem, htejun, jarkao2, davidel

On 07/07, Mathieu Desnoyers wrote:
>
> As with any optimization (and this is one that adds a semantic that will
> just grow the memory barrier/locking rule complexity), it should come
> with performance benchmarks showing real-life improvements.

Well, the same applies to smp_mb__xxx_atomic_yyy or smp_mb__before_clear_bit.

Imho the new helper is not worse, and it could be also used by
try_to_wake_up(), __pollwake(), insert_work() at least.

> Otherwise I'd recommend sticking to smp_mb() if this execution path is
> not that critical, or to move to RCU if it's _that_ critical.
>
> A valid argument would be if the data structures protected are so
> complex that RCU is out of question but still the few cycles saved by
> removing a memory barrier are really significant.

Not sure I understand how RCU can help,

> And even then, the
> proper solution would be more something like a
> __read_lock()+smp_mb+smp_mb+__read_unlock(), so we get the performance
> improvements on architectures other than x86 as well.

Hmm. could you explain what you mean?

Oleg.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-07 14:34                     ` Oleg Nesterov
@ 2009-07-07 15:04                       ` Mathieu Desnoyers
  2009-07-07 15:44                         ` Oleg Nesterov
  0 siblings, 1 reply; 39+ messages in thread
From: Mathieu Desnoyers @ 2009-07-07 15:04 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Jiri Olsa, Ingo Molnar, Eric Dumazet, Peter Zijlstra, netdev,
	linux-kernel, fbl, nhorman, davem, htejun, jarkao2, davidel

* Oleg Nesterov (oleg@redhat.com) wrote:
> On 07/07, Mathieu Desnoyers wrote:
> >
> > As with any optimization (and this is one that adds a semantic that will
> > just grow the memory barrier/locking rule complexity), it should come
> > with performance benchmarks showing real-life improvements.
> 
> Well, the same applies to smp_mb__xxx_atomic_yyy or smp_mb__before_clear_bit.
> 
> Imho the new helper is not worse, and it could be also used by
> try_to_wake_up(), __pollwake(), insert_work() at least.

It's basically related to Amdahl law. If the smp_mb is a small portion
of the overall read_lock cost, then it may not be worth it to remove it.
At the contrary, if the mb is a big portion of set/clear bit, then it's
worth it. We also have to consider the frequency at which these
operations are done to figure out the overall performance impact.
Also, locks imply cache-line bouncing, which are typically costly.
clear/set bit does not imply this as much. So the tradeoffs are very
different there.

So it's not as simple as "we do this for set/clear bit, we should
therefore do this for locks".

> 
> > Otherwise I'd recommend sticking to smp_mb() if this execution path is
> > not that critical, or to move to RCU if it's _that_ critical.
> >
> > A valid argument would be if the data structures protected are so
> > complex that RCU is out of question but still the few cycles saved by
> > removing a memory barrier are really significant.
> 
> Not sure I understand how RCU can help,
> 

Changing a read_lock to a rcu_read_lock would save the whole atomic
cache-line bouncing operation on that fast path. But it may imply data
structure redesign. So it is more for future development than current
kernel releases.

> > And even then, the
> > proper solution would be more something like a
> > __read_lock()+smp_mb+smp_mb+__read_unlock(), so we get the performance
> > improvements on architectures other than x86 as well.
> 
> Hmm. could you explain what you mean?
> 

Actually, thinking about it more, to appropriately support x86, as well
as powerpc, arm and mips, we would need something like:

read_lock_smp_mb()

Which would be a read_lock with an included memory barrier.

Mathieu

> Oleg.
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-07 15:04                       ` Mathieu Desnoyers
@ 2009-07-07 15:44                         ` Oleg Nesterov
  2009-07-07 15:50                           ` Peter Zijlstra
  0 siblings, 1 reply; 39+ messages in thread
From: Oleg Nesterov @ 2009-07-07 15:44 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Jiri Olsa, Ingo Molnar, Eric Dumazet, Peter Zijlstra, netdev,
	linux-kernel, fbl, nhorman, davem, htejun, jarkao2, davidel

On 07/07, Mathieu Desnoyers wrote:
>
> Actually, thinking about it more, to appropriately support x86, as well
> as powerpc, arm and mips, we would need something like:
>
> read_lock_smp_mb()
>
> Which would be a read_lock with an included memory barrier.

Then we need read_lock_irq_smp_mb, read_lock_irqsave__smp_mb, write_lock_xxx,
otherwise it is not clear why only read_lock() has _smp_mb() version.

The same for spin_lock_xxx...

Oleg.


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-07 15:44                         ` Oleg Nesterov
@ 2009-07-07 15:50                           ` Peter Zijlstra
  2009-07-07 19:45                             ` Mathieu Desnoyers
  0 siblings, 1 reply; 39+ messages in thread
From: Peter Zijlstra @ 2009-07-07 15:50 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Mathieu Desnoyers, Jiri Olsa, Ingo Molnar, Eric Dumazet, netdev,
	linux-kernel, fbl, nhorman, davem, htejun, jarkao2, davidel

On Tue, 2009-07-07 at 17:44 +0200, Oleg Nesterov wrote:
> On 07/07, Mathieu Desnoyers wrote:
> >
> > Actually, thinking about it more, to appropriately support x86, as well
> > as powerpc, arm and mips, we would need something like:
> >
> > read_lock_smp_mb()
> >
> > Which would be a read_lock with an included memory barrier.
> 
> Then we need read_lock_irq_smp_mb, read_lock_irqsave__smp_mb, write_lock_xxx,
> otherwise it is not clear why only read_lock() has _smp_mb() version.
> 
> The same for spin_lock_xxx...

At which time the smp_mb__{before,after}_{un,}lock become attractive
again.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-07 15:50                           ` Peter Zijlstra
@ 2009-07-07 19:45                             ` Mathieu Desnoyers
  2009-07-07 22:44                               ` Eric Dumazet
  0 siblings, 1 reply; 39+ messages in thread
From: Mathieu Desnoyers @ 2009-07-07 19:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Oleg Nesterov, Jiri Olsa, Ingo Molnar, Eric Dumazet, netdev,
	linux-kernel, fbl, nhorman, davem, htejun, jarkao2, davidel

* Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> On Tue, 2009-07-07 at 17:44 +0200, Oleg Nesterov wrote:
> > On 07/07, Mathieu Desnoyers wrote:
> > >
> > > Actually, thinking about it more, to appropriately support x86, as well
> > > as powerpc, arm and mips, we would need something like:
> > >
> > > read_lock_smp_mb()
> > >
> > > Which would be a read_lock with an included memory barrier.
> > 
> > Then we need read_lock_irq_smp_mb, read_lock_irqsave__smp_mb, write_lock_xxx,
> > otherwise it is not clear why only read_lock() has _smp_mb() version.
> > 
> > The same for spin_lock_xxx...
> 
> At which time the smp_mb__{before,after}_{un,}lock become attractive
> again.
> 

Then having a new __read_lock() (without acquire semantic) which would
be required to be followed by a smp__mb_after_lock() would make sense. I
think this would fit all of x86, powerpc, arm, mips without having to
create tons of new primitives. Only "simpler" ones that clearly separate
locking from memory barriers.

Mathieu




-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-07 19:45                             ` Mathieu Desnoyers
@ 2009-07-07 22:44                               ` Eric Dumazet
  2009-07-07 23:28                                 ` Mathieu Desnoyers
  0 siblings, 1 reply; 39+ messages in thread
From: Eric Dumazet @ 2009-07-07 22:44 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Peter Zijlstra, Oleg Nesterov, Jiri Olsa, Ingo Molnar, netdev,
	linux-kernel, fbl, nhorman, davem, htejun, jarkao2, davidel

Mathieu Desnoyers a écrit :
> * Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
>> On Tue, 2009-07-07 at 17:44 +0200, Oleg Nesterov wrote:
>>> On 07/07, Mathieu Desnoyers wrote:
>>>> Actually, thinking about it more, to appropriately support x86, as well
>>>> as powerpc, arm and mips, we would need something like:
>>>>
>>>> read_lock_smp_mb()
>>>>
>>>> Which would be a read_lock with an included memory barrier.
>>> Then we need read_lock_irq_smp_mb, read_lock_irqsave__smp_mb, write_lock_xxx,
>>> otherwise it is not clear why only read_lock() has _smp_mb() version.
>>>
>>> The same for spin_lock_xxx...
>> At which time the smp_mb__{before,after}_{un,}lock become attractive
>> again.
>>
> 
> Then having a new __read_lock() (without acquire semantic) which would
> be required to be followed by a smp__mb_after_lock() would make sense. I
> think this would fit all of x86, powerpc, arm, mips without having to
> create tons of new primitives. Only "simpler" ones that clearly separate
> locking from memory barriers.
> 

Hmm... On x86, read_lock() is :

	lock subl $0x1,(%eax)
	jns   .Lok
	call	__read_lock_failed
.Lok:	ret


What would be __read_lock() ? I cant see how it could *not* use lock prefix
actually and or being cheaper...


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-07 22:44                               ` Eric Dumazet
@ 2009-07-07 23:28                                 ` Mathieu Desnoyers
  2009-07-07 23:51                                   ` Oleg Nesterov
  0 siblings, 1 reply; 39+ messages in thread
From: Mathieu Desnoyers @ 2009-07-07 23:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Peter Zijlstra, Oleg Nesterov, Jiri Olsa, Ingo Molnar, netdev,
	linux-kernel, fbl, nhorman, davem, htejun, jarkao2, davidel

* Eric Dumazet (eric.dumazet@gmail.com) wrote:
> Mathieu Desnoyers a écrit :
> > * Peter Zijlstra (a.p.zijlstra@chello.nl) wrote:
> >> On Tue, 2009-07-07 at 17:44 +0200, Oleg Nesterov wrote:
> >>> On 07/07, Mathieu Desnoyers wrote:
> >>>> Actually, thinking about it more, to appropriately support x86, as well
> >>>> as powerpc, arm and mips, we would need something like:
> >>>>
> >>>> read_lock_smp_mb()
> >>>>
> >>>> Which would be a read_lock with an included memory barrier.
> >>> Then we need read_lock_irq_smp_mb, read_lock_irqsave__smp_mb, write_lock_xxx,
> >>> otherwise it is not clear why only read_lock() has _smp_mb() version.
> >>>
> >>> The same for spin_lock_xxx...
> >> At which time the smp_mb__{before,after}_{un,}lock become attractive
> >> again.
> >>
> > 
> > Then having a new __read_lock() (without acquire semantic) which would
> > be required to be followed by a smp__mb_after_lock() would make sense. I
> > think this would fit all of x86, powerpc, arm, mips without having to
> > create tons of new primitives. Only "simpler" ones that clearly separate
> > locking from memory barriers.
> > 
> 
> Hmm... On x86, read_lock() is :
> 
> 	lock subl $0x1,(%eax)
> 	jns   .Lok
> 	call	__read_lock_failed
> .Lok:	ret
> 
> 
> What would be __read_lock() ? I cant see how it could *not* use lock prefix
> actually and or being cheaper...
> 

(I'll use read_lock_noacquire() instead of __read_lock() because
__read_lock() is already used for low-level primitives and will produce
name clashes. But I recognise that noacquire is just an ugly name.)

Here, a __read_lock_noacquire _must_ be followed by a
smp__mb_after_lock(), and a __read_unlock_norelease() _must_ be
preceded by a smp__mb_before_unlock().

x86 :

#define __read_lock_noacquire	read_lock
/* Assumes all __*_lock_noacquire primitives act as a full smp_mb() */
#define smp__mb_after_lock()

/* Assumes all __*_unlock_norelease primitives act as a full smp_mb() */
#define smp__mb_before_unlock()
#define __read_unlock_norelease	read_unlock

it's that easy :-)


however, on powerpc, we have to stop and think about it a bit more:

quoting http://www.linuxjournal.com/article/8212

"lwsync, or lightweight sync, orders loads with respect to subsequent
loads and stores, and it also orders stores. However, it does not order
stores with respect to subsequent loads. Interestingly enough, the
lwsync instruction enforces the same ordering as does the zSeries and,
coincidentally, the SPARC TSO."

static inline long __read_trylock_noacquire(raw_rwlock_t *rw)
{
        long tmp;

        __asm__ __volatile__(
"1:     lwarx           %0,0,%1\n"
        __DO_SIGN_EXTEND
"       addic.          %0,%0,1\n\
        ble-            2f\n"
        PPC405_ERR77(0,%1)
"       stwcx.          %0,0,%1\n\
        bne-            1b\n\
        /* isync\n\ Removed the isync because following smp_mb (sync
         * instruction) includes a core synchronizing barrier. */
2:"     : "=&r" (tmp)
        : "r" (&rw->lock)
        : "cr0", "xer", "memory");

        return tmp;
}

#define smp__mb_after_lock()	smp_mb()


#define smp__mb_before_unlock()	smp_mb()

static inline void __raw_read_unlock_norelease(raw_rwlock_t *rw)
{
        long tmp;

        __asm__ __volatile__(
        "# read_unlock\n\t"
        /* LWSYNC_ON_SMP -------- can be removed, replace by prior
         * smp_mb() */
"1:     lwarx           %0,0,%1\n\
        addic           %0,%0,-1\n"
        PPC405_ERR77(0,%1)
"       stwcx.          %0,0,%1\n\
        bne-            1b"
        : "=&r"(tmp)
        : "r"(&rw->lock)
        : "cr0", "xer", "memory");
}

I assume here that lwarx/stwcx pairs for different addresses cannot be
reordered with other pairs. If they can, then we already have a problem
with the current powerpc read lock implementation.

I just wrote this as an example to show how this could become a
performance improvement on architectures different than x86. The code
proposed above comes without warranty and should be tested with care. :)

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-07 23:28                                 ` Mathieu Desnoyers
@ 2009-07-07 23:51                                   ` Oleg Nesterov
  2009-07-08  4:34                                     ` Mathieu Desnoyers
  0 siblings, 1 reply; 39+ messages in thread
From: Oleg Nesterov @ 2009-07-07 23:51 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Eric Dumazet, Peter Zijlstra, Jiri Olsa, Ingo Molnar, netdev,
	linux-kernel, fbl, nhorman, davem, htejun, jarkao2, davidel

On 07/07, Mathieu Desnoyers wrote:
>
> * Eric Dumazet (eric.dumazet@gmail.com) wrote:
> >
> > What would be __read_lock() ? I cant see how it could *not* use lock prefix
> > actually and or being cheaper...
> >
>
> (I'll use read_lock_noacquire() instead of __read_lock() because
> __read_lock() is already used for low-level primitives and will produce
> name clashes. But I recognise that noacquire is just an ugly name.)
>
> Here, a __read_lock_noacquire _must_ be followed by a
> smp__mb_after_lock(), and a __read_unlock_norelease() _must_ be
> preceded by a smp__mb_before_unlock().

Your point was, smp_mb__after_lock() adds more complexity to the
barriers/locking rules.

Do you really think __read_lock_noacquire() makes this all more
simple/understandable? And again, we need __read_lock_irq_noaquire/etc.

Personally, I disagree. In fact, I do not understand when/why
_noacquire can be used, but this is another story.

Let's look from the different angle. The first patch from Jiri fixes
the bug. Yes, it is not clear if this is possible to trigger this
bug in practice, but still nobody disagrees the bug does exist.
The second patch fixes the added pessimization.

So, if you do not agree with these patches, perhaps you can send
fixes on top of these changes?

Sadly, I already removed the previous emails so I can't add my
acked-by to Jiri's patches. I didn't do this before because I
thought I am in no position to ack these changes. But looking
at this discussion, I'd like to vote for both these patches
anyway ;)

Oleg.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-07 23:51                                   ` Oleg Nesterov
@ 2009-07-08  4:34                                     ` Mathieu Desnoyers
  2009-07-08  7:18                                       ` Jarek Poplawski
  0 siblings, 1 reply; 39+ messages in thread
From: Mathieu Desnoyers @ 2009-07-08  4:34 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Eric Dumazet, Peter Zijlstra, Jiri Olsa, Ingo Molnar, netdev,
	linux-kernel, fbl, nhorman, davem, htejun, jarkao2, davidel

* Oleg Nesterov (oleg@redhat.com) wrote:
> On 07/07, Mathieu Desnoyers wrote:
> >
> > * Eric Dumazet (eric.dumazet@gmail.com) wrote:
> > >
> > > What would be __read_lock() ? I cant see how it could *not* use lock prefix
> > > actually and or being cheaper...
> > >
> >
> > (I'll use read_lock_noacquire() instead of __read_lock() because
> > __read_lock() is already used for low-level primitives and will produce
> > name clashes. But I recognise that noacquire is just an ugly name.)
> >
> > Here, a __read_lock_noacquire _must_ be followed by a
> > smp__mb_after_lock(), and a __read_unlock_norelease() _must_ be
> > preceded by a smp__mb_before_unlock().
> 
> Your point was, smp_mb__after_lock() adds more complexity to the
> barriers/locking rules.
> 
> Do you really think __read_lock_noacquire() makes this all more
> simple/understandable? And again, we need __read_lock_irq_noaquire/etc.
> 

Yep, agreed that it also sounds like added complexity in locking rules,
and I've not yet seen the benefit of it.

> Personally, I disagree. In fact, I do not understand when/why
> _noacquire can be used, but this is another story.
> 

Because adding smp_mb__after_lock() is _only_ useful on x86. Most other
architectures _will_ suffer from a performance degradation, unless you
implement the __read_lock_noacquire.

> Let's look from the different angle. The first patch from Jiri fixes
> the bug. Yes, it is not clear if this is possible to trigger this
> bug in practice, but still nobody disagrees the bug does exist.
> The second patch fixes the added pessimization.

I fully agree with the bugfix.

> 
> So, if you do not agree with these patches, perhaps you can send
> fixes on top of these changes?

Given we can later build around the smp__mb_after_lock() to eliminate the
performance deterioration on non-x86 architectures by adding a
__read_lock_noacquire() primitive, I guess this can be done in a later
phase as an optimization.

I do not care if performance are not perfect for all architectures at
this point. What I really care about is that we do not introduce new
locking, atomic ops or memory barrier semantics that only make sense
for a single architecture and limit others.

Given that we can eventually move to a
__read_lock_noacquire()/smp_mb__after_lock() scheme, then adding just
smp_mb__after_lock() in the first place does not seem like a bad move.
It will just degrade performance of non-x86 architectures until
__read_lock_noacquire() or something similar comes.

So it looks fine if the code path is critical enough to justify adding
such new memory barrier. As long as we don't end up having
smp_mb__after_ponies().

Cheers,

Mathieu

> 
> 
> 
> Sadly, I already removed the previous emails so I can't add my
> acked-by to Jiri's patches. I didn't do this before because I
> thought I am in no position to ack these changes. But looking
> at this discussion, I'd like to vote for both these patches
> anyway ;)
> 
> Oleg.
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-08  4:34                                     ` Mathieu Desnoyers
@ 2009-07-08  7:18                                       ` Jarek Poplawski
  0 siblings, 0 replies; 39+ messages in thread
From: Jarek Poplawski @ 2009-07-08  7:18 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Oleg Nesterov, Eric Dumazet, Peter Zijlstra, Jiri Olsa,
	Ingo Molnar, netdev, linux-kernel, fbl, nhorman, davem, htejun,
	davidel

On Wed, Jul 08, 2009 at 12:34:32AM -0400, Mathieu Desnoyers wrote:
...
> Because adding smp_mb__after_lock() is _only_ useful on x86. Most other
> architectures _will_ suffer from a performance degradation, unless you
> implement the __read_lock_noacquire.

It's completely backwards: processor barriers are just expected to
add a performance degradation. That's like:

x86 developer:
OK, we need to add a barrier here: even x86 might need this.

Alpha developer:
Right, than we need this even more.

x86 developer:
But wait, we can avoid it using a dummy after some locks, because they
have such a barrier already.

Alpha developer:
Then it's not OK: it's _only_ useful on x86; our architecture _will_
suffer from a performance degradation...

Cheers,
Jarek P.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-07 14:01                   ` Mathieu Desnoyers
  2009-07-07 14:34                     ` Oleg Nesterov
@ 2009-07-07 14:34                     ` Jiri Olsa
  2009-07-07 14:42                     ` Eric Dumazet
  2 siblings, 0 replies; 39+ messages in thread
From: Jiri Olsa @ 2009-07-07 14:34 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Ingo Molnar, Eric Dumazet, Peter Zijlstra, netdev, linux-kernel,
	fbl, nhorman, davem, htejun, jarkao2, oleg, davidel

On Tue, Jul 07, 2009 at 10:01:35AM -0400, Mathieu Desnoyers wrote:
> * Jiri Olsa (jolsa@redhat.com) wrote:
> > On Tue, Jul 07, 2009 at 12:18:16PM +0200, Jiri Olsa wrote:
> > > On Fri, Jul 03, 2009 at 01:18:48PM +0200, Jiri Olsa wrote:
> > > > On Fri, Jul 03, 2009 at 12:25:30PM +0200, Ingo Molnar wrote:
> > > > > 
> > > > > * Jiri Olsa <jolsa@redhat.com> wrote:
> > > > > 
> > > > > > On Fri, Jul 03, 2009 at 11:24:38AM +0200, Ingo Molnar wrote:
> > > > > > > 
> > > > > > > * Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > > > > > 
> > > > > > > > Ingo Molnar a écrit :
> > > > > > > > > * Jiri Olsa <jolsa@redhat.com> wrote:
> > > > > > > > > 
> > > > > > > > >> +++ b/arch/x86/include/asm/spinlock.h
> > > > > > > > >> @@ -302,4 +302,7 @@ static inline void __raw_write_unlock(raw_rwlock_t *rw)
> > > > > > > > >>  #define _raw_read_relax(lock)	cpu_relax()
> > > > > > > > >>  #define _raw_write_relax(lock)	cpu_relax()
> > > > > > > > >>  
> > > > > > > > >> +/* The {read|write|spin}_lock() on x86 are full memory barriers. */
> > > > > > > > >> +#define smp_mb__after_lock() do { } while (0)
> > > > > > > > > 
> > > > > > > > > Two small stylistic comments, please make this an inline function:
> > > > > > > > > 
> > > > > > > > > static inline void smp_mb__after_lock(void) { }
> > > > > > > > > #define smp_mb__after_lock
> > > > > > > > > 
> > > > > > > > > (untested)
> > > > > > > > > 
> > > > > > > > >> +/* The lock does not imply full memory barrier. */
> > > > > > > > >> +#ifndef smp_mb__after_lock
> > > > > > > > >> +#define smp_mb__after_lock() smp_mb()
> > > > > > > > >> +#endif
> > > > > > > > > 
> > > > > > > > > ditto.
> > > > > > > > > 
> > > > > > > > > 	Ingo
> > > > > > > > 
> > > > > > > > This was following existing implementations of various smp_mb__??? helpers :
> > > > > > > > 
> > > > > > > > # grep -4 smp_mb__before_clear_bit include/asm-generic/bitops.h
> > > > > > > > 
> > > > > > > > /*
> > > > > > > >  * clear_bit may not imply a memory barrier
> > > > > > > >  */
> > > > > > > > #ifndef smp_mb__before_clear_bit
> > > > > > > > #define smp_mb__before_clear_bit()      smp_mb()
> > > > > > > > #define smp_mb__after_clear_bit()       smp_mb()
> > > > > > > > #endif
> > > > > > > 
> > > > > > > Did i mention that those should be fixed too? :-)
> > > > > > > 
> > > > > > > 	Ingo
> > > > > > 
> > > > > > ok, could I include it in the 2/2 or you prefer separate patch?
> > > > > 
> > > > > depends on whether it will regress ;-)
> > > > > 
> > > > > If it regresses, it's better to have it separate. If it wont, it can 
> > > > > be included. If unsure, default to the more conservative option.
> > > > > 
> > > > > 	Ingo
> > > > 
> > > > 
> > > > how about this.. 
> > > > and similar change for smp_mb__before_clear_bit in a separate patch
> > > > 
> > > > 
> > > > diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
> > > > index b7e5db8..4e77853 100644
> > > > --- a/arch/x86/include/asm/spinlock.h
> > > > +++ b/arch/x86/include/asm/spinlock.h
> > > > @@ -302,4 +302,8 @@ static inline void __raw_write_unlock(raw_rwlock_t *rw)
> > > >  #define _raw_read_relax(lock)	cpu_relax()
> > > >  #define _raw_write_relax(lock)	cpu_relax()
> > > >  
> > > > +/* The {read|write|spin}_lock() on x86 are full memory barriers. */
> > > > +static inline void smp_mb__after_lock(void) { }
> > > > +#define ARCH_HAS_SMP_MB_AFTER_LOCK
> > > > +
> > > >  #endif /* _ASM_X86_SPINLOCK_H */
> > > > diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
> > > > index 252b245..4be57ab 100644
> > > > --- a/include/linux/spinlock.h
> > > > +++ b/include/linux/spinlock.h
> > > > @@ -132,6 +132,11 @@ do {								\
> > > >  #endif /*__raw_spin_is_contended*/
> > > >  #endif
> > > >  
> > > > +/* The lock does not imply full memory barrier. */
> > > > +#ifndef ARCH_HAS_SMP_MB_AFTER_LOCK
> > > > +static inline void smp_mb__after_lock(void) { smp_mb(); }
> > > > +#endif
> > > > +
> > > >  /**
> > > >   * spin_unlock_wait - wait until the spinlock gets unlocked
> > > >   * @lock: the spinlock in question.
> > > > diff --git a/include/net/sock.h b/include/net/sock.h
> > > > index 4eb8409..98afcd9 100644
> > > > --- a/include/net/sock.h
> > > > +++ b/include/net/sock.h
> > > > @@ -1271,6 +1271,9 @@ static inline int sk_has_allocations(const struct sock *sk)
> > > >   * in its cache, and so does the tp->rcv_nxt update on CPU2 side.  The CPU1
> > > >   * could then endup calling schedule and sleep forever if there are no more
> > > >   * data on the socket.
> > > > + *
> > > > + * The sk_has_helper is always called right after a call to read_lock, so we
> > > > + * can use smp_mb__after_lock barrier.
> > > >   */
> > > >  static inline int sk_has_sleeper(struct sock *sk)
> > > >  {
> > > > @@ -1280,7 +1283,7 @@ static inline int sk_has_sleeper(struct sock *sk)
> > > >  	 *
> > > >  	 * This memory barrier is paired in the sock_poll_wait.
> > > >  	 */
> > > > -	smp_mb();
> > > > +	smp_mb__after_lock();
> > > >  	return sk->sk_sleep && waitqueue_active(sk->sk_sleep);
> > > >  }
> > > >  
> > > 
> > > any feedback on this? 
> > > I'd send v6 if this way is acceptable..
> > > 
> > > thanks,
> > > jirka
> > 
> > also I checked the smp_mb__before_clear_bit/smp_mb__after_clear_bit and
> > it is used quite extensivelly.
> > 
> > I'd prefer to send it in a separate patch, so we can move on with the 
> > changes I've sent so far..
> > 
> 
> As with any optimization (and this is one that adds a semantic that will
> just grow the memory barrier/locking rule complexity), it should come
> with performance benchmarks showing real-life improvements.
> 
> Otherwise I'd recommend sticking to smp_mb() if this execution path is
> not that critical, or to move to RCU if it's _that_ critical.
> 
> A valid argument would be if the data structures protected are so
> complex that RCU is out of question but still the few cycles saved by
> removing a memory barrier are really significant. And even then, the
> proper solution would be more something like a
> __read_lock()+smp_mb+smp_mb+__read_unlock(), so we get the performance
> improvements on architectures other than x86 as well.
> 
> So in all cases, I don't think the smp_mb__after_lock() is the
> appropriate solution.

well, I'm not that familiar with RCU, but I dont mind to omit the smp_mb__after_lock 
change as long as the first one (1/2) stays :)

how about others, any ideas?

thanks,
jirka
> 
> Mathieu
> 
> > regards,
> > jirka
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-07 14:01                   ` Mathieu Desnoyers
  2009-07-07 14:34                     ` Oleg Nesterov
  2009-07-07 14:34                     ` Jiri Olsa
@ 2009-07-07 14:42                     ` Eric Dumazet
  2009-07-07 14:57                       ` Mathieu Desnoyers
  2 siblings, 1 reply; 39+ messages in thread
From: Eric Dumazet @ 2009-07-07 14:42 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Jiri Olsa, Ingo Molnar, Eric Dumazet, Peter Zijlstra, netdev,
	linux-kernel, fbl, nhorman, davem, htejun, jarkao2, oleg, davidel

Mathieu Desnoyers a écrit :
> * Jiri Olsa (jolsa@redhat.com) wrote:
>> On Tue, Jul 07, 2009 at 12:18:16PM +0200, Jiri Olsa wrote:
>>> On Fri, Jul 03, 2009 at 01:18:48PM +0200, Jiri Olsa wrote:
>>>> On Fri, Jul 03, 2009 at 12:25:30PM +0200, Ingo Molnar wrote:
>>>>> * Jiri Olsa <jolsa@redhat.com> wrote:
>>>>>
>>>>>> On Fri, Jul 03, 2009 at 11:24:38AM +0200, Ingo Molnar wrote:
>>>>>>> * Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>>>>>
>>>>>>>> Ingo Molnar a écrit :
>>>>>>>>> * Jiri Olsa <jolsa@redhat.com> wrote:
>>>>>>>>>
>>>>>>>>>> +++ b/arch/x86/include/asm/spinlock.h
>>>>>>>>>> @@ -302,4 +302,7 @@ static inline void __raw_write_unlock(raw_rwlock_t *rw)
>>>>>>>>>>  #define _raw_read_relax(lock)	cpu_relax()
>>>>>>>>>>  #define _raw_write_relax(lock)	cpu_relax()
>>>>>>>>>>  
>>>>>>>>>> +/* The {read|write|spin}_lock() on x86 are full memory barriers. */
>>>>>>>>>> +#define smp_mb__after_lock() do { } while (0)
>>>>>>>>> Two small stylistic comments, please make this an inline function:
>>>>>>>>>
>>>>>>>>> static inline void smp_mb__after_lock(void) { }
>>>>>>>>> #define smp_mb__after_lock
>>>>>>>>>
>>>>>>>>> (untested)
>>>>>>>>>
>>>>>>>>>> +/* The lock does not imply full memory barrier. */
>>>>>>>>>> +#ifndef smp_mb__after_lock
>>>>>>>>>> +#define smp_mb__after_lock() smp_mb()
>>>>>>>>>> +#endif
>>>>>>>>> ditto.
>>>>>>>>>
>>>>>>>>> 	Ingo
>>>>>>>> This was following existing implementations of various smp_mb__??? helpers :
>>>>>>>>
>>>>>>>> # grep -4 smp_mb__before_clear_bit include/asm-generic/bitops.h
>>>>>>>>
>>>>>>>> /*
>>>>>>>>  * clear_bit may not imply a memory barrier
>>>>>>>>  */
>>>>>>>> #ifndef smp_mb__before_clear_bit
>>>>>>>> #define smp_mb__before_clear_bit()      smp_mb()
>>>>>>>> #define smp_mb__after_clear_bit()       smp_mb()
>>>>>>>> #endif
>>>>>>> Did i mention that those should be fixed too? :-)
>>>>>>>
>>>>>>> 	Ingo
>>>>>> ok, could I include it in the 2/2 or you prefer separate patch?
>>>>> depends on whether it will regress ;-)
>>>>>
>>>>> If it regresses, it's better to have it separate. If it wont, it can 
>>>>> be included. If unsure, default to the more conservative option.
>>>>>
>>>>> 	Ingo
>>>>
>>>> how about this.. 
>>>> and similar change for smp_mb__before_clear_bit in a separate patch
>>>>
>>>>
>>>> diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
>>>> index b7e5db8..4e77853 100644
>>>> --- a/arch/x86/include/asm/spinlock.h
>>>> +++ b/arch/x86/include/asm/spinlock.h
>>>> @@ -302,4 +302,8 @@ static inline void __raw_write_unlock(raw_rwlock_t *rw)
>>>>  #define _raw_read_relax(lock)	cpu_relax()
>>>>  #define _raw_write_relax(lock)	cpu_relax()
>>>>  
>>>> +/* The {read|write|spin}_lock() on x86 are full memory barriers. */
>>>> +static inline void smp_mb__after_lock(void) { }
>>>> +#define ARCH_HAS_SMP_MB_AFTER_LOCK
>>>> +
>>>>  #endif /* _ASM_X86_SPINLOCK_H */
>>>> diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
>>>> index 252b245..4be57ab 100644
>>>> --- a/include/linux/spinlock.h
>>>> +++ b/include/linux/spinlock.h
>>>> @@ -132,6 +132,11 @@ do {								\
>>>>  #endif /*__raw_spin_is_contended*/
>>>>  #endif
>>>>  
>>>> +/* The lock does not imply full memory barrier. */
>>>> +#ifndef ARCH_HAS_SMP_MB_AFTER_LOCK
>>>> +static inline void smp_mb__after_lock(void) { smp_mb(); }
>>>> +#endif
>>>> +
>>>>  /**
>>>>   * spin_unlock_wait - wait until the spinlock gets unlocked
>>>>   * @lock: the spinlock in question.
>>>> diff --git a/include/net/sock.h b/include/net/sock.h
>>>> index 4eb8409..98afcd9 100644
>>>> --- a/include/net/sock.h
>>>> +++ b/include/net/sock.h
>>>> @@ -1271,6 +1271,9 @@ static inline int sk_has_allocations(const struct sock *sk)
>>>>   * in its cache, and so does the tp->rcv_nxt update on CPU2 side.  The CPU1
>>>>   * could then endup calling schedule and sleep forever if there are no more
>>>>   * data on the socket.
>>>> + *
>>>> + * The sk_has_helper is always called right after a call to read_lock, so we
>>>> + * can use smp_mb__after_lock barrier.
>>>>   */
>>>>  static inline int sk_has_sleeper(struct sock *sk)
>>>>  {
>>>> @@ -1280,7 +1283,7 @@ static inline int sk_has_sleeper(struct sock *sk)
>>>>  	 *
>>>>  	 * This memory barrier is paired in the sock_poll_wait.
>>>>  	 */
>>>> -	smp_mb();
>>>> +	smp_mb__after_lock();
>>>>  	return sk->sk_sleep && waitqueue_active(sk->sk_sleep);
>>>>  }
>>>>  
>>> any feedback on this? 
>>> I'd send v6 if this way is acceptable..
>>>
>>> thanks,
>>> jirka
>> also I checked the smp_mb__before_clear_bit/smp_mb__after_clear_bit and
>> it is used quite extensivelly.
>>
>> I'd prefer to send it in a separate patch, so we can move on with the 
>> changes I've sent so far..
>>
> 
> As with any optimization (and this is one that adds a semantic that will
> just grow the memory barrier/locking rule complexity), it should come
> with performance benchmarks showing real-life improvements.
> 
> Otherwise I'd recommend sticking to smp_mb() if this execution path is
> not that critical, or to move to RCU if it's _that_ critical.
> 
> A valid argument would be if the data structures protected are so
> complex that RCU is out of question but still the few cycles saved by
> removing a memory barrier are really significant. And even then, the
> proper solution would be more something like a
> __read_lock()+smp_mb+smp_mb+__read_unlock(), so we get the performance
> improvements on architectures other than x86 as well.
> 
> So in all cases, I don't think the smp_mb__after_lock() is the
> appropriate solution.

RCU on this part is out of the question, as David already mentioned it.

It would be a regression for short lived tcp/udp sessions, and some workloads
use them a lot...

We gained about 20% performance between 2.6.26 and 2.6.31, carefuly removing
some atomic ops in network stack, adding RCU where it was sensible, but this
is a painful process, not something Jiri can use to fix bugs on legacy RedHat
kernels :) (We still are sorting out regressions)

To solve problem pointed by Jiri, we have to insert an smp_mb() at this point,
(not mentioning the other change in select() logic of course)

 static void sock_def_readable(struct sock *sk, int len)
 {
 	read_lock(&sk->sk_callback_lock);
+	smp_mb(); /* paired with opposite smp_mb() in sk poll logic */
 	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
 		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLIN |
 						POLLRDNORM | POLLRDBAND);
 	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
 	read_unlock(&sk->sk_callback_lock);
 }

As about every incoming packet calls this path, we should be very careful not
slowing down stack if not necessary.

On x86 this extra smp_mb() is not needed, since previous call to read_lock()
already gives the full barrier for free.



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-07 14:42                     ` Eric Dumazet
@ 2009-07-07 14:57                       ` Mathieu Desnoyers
  2009-07-07 15:23                         ` Eric Dumazet
  0 siblings, 1 reply; 39+ messages in thread
From: Mathieu Desnoyers @ 2009-07-07 14:57 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jiri Olsa, Ingo Molnar, Peter Zijlstra, netdev, linux-kernel, fbl,
	nhorman, davem, htejun, jarkao2, oleg, davidel

* Eric Dumazet (eric.dumazet@gmail.com) wrote:
> Mathieu Desnoyers a écrit :
> > * Jiri Olsa (jolsa@redhat.com) wrote:
> >> On Tue, Jul 07, 2009 at 12:18:16PM +0200, Jiri Olsa wrote:
> >>> On Fri, Jul 03, 2009 at 01:18:48PM +0200, Jiri Olsa wrote:
> >>>> On Fri, Jul 03, 2009 at 12:25:30PM +0200, Ingo Molnar wrote:
> >>>>> * Jiri Olsa <jolsa@redhat.com> wrote:
> >>>>>
> >>>>>> On Fri, Jul 03, 2009 at 11:24:38AM +0200, Ingo Molnar wrote:
> >>>>>>> * Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >>>>>>>
> >>>>>>>> Ingo Molnar a écrit :
> >>>>>>>>> * Jiri Olsa <jolsa@redhat.com> wrote:
> >>>>>>>>>
> >>>>>>>>>> +++ b/arch/x86/include/asm/spinlock.h
> >>>>>>>>>> @@ -302,4 +302,7 @@ static inline void __raw_write_unlock(raw_rwlock_t *rw)
> >>>>>>>>>>  #define _raw_read_relax(lock)	cpu_relax()
> >>>>>>>>>>  #define _raw_write_relax(lock)	cpu_relax()
> >>>>>>>>>>  
> >>>>>>>>>> +/* The {read|write|spin}_lock() on x86 are full memory barriers. */
> >>>>>>>>>> +#define smp_mb__after_lock() do { } while (0)
> >>>>>>>>> Two small stylistic comments, please make this an inline function:
> >>>>>>>>>
> >>>>>>>>> static inline void smp_mb__after_lock(void) { }
> >>>>>>>>> #define smp_mb__after_lock
> >>>>>>>>>
> >>>>>>>>> (untested)
> >>>>>>>>>
> >>>>>>>>>> +/* The lock does not imply full memory barrier. */
> >>>>>>>>>> +#ifndef smp_mb__after_lock
> >>>>>>>>>> +#define smp_mb__after_lock() smp_mb()
> >>>>>>>>>> +#endif
> >>>>>>>>> ditto.
> >>>>>>>>>
> >>>>>>>>> 	Ingo
> >>>>>>>> This was following existing implementations of various smp_mb__??? helpers :
> >>>>>>>>
> >>>>>>>> # grep -4 smp_mb__before_clear_bit include/asm-generic/bitops.h
> >>>>>>>>
> >>>>>>>> /*
> >>>>>>>>  * clear_bit may not imply a memory barrier
> >>>>>>>>  */
> >>>>>>>> #ifndef smp_mb__before_clear_bit
> >>>>>>>> #define smp_mb__before_clear_bit()      smp_mb()
> >>>>>>>> #define smp_mb__after_clear_bit()       smp_mb()
> >>>>>>>> #endif
> >>>>>>> Did i mention that those should be fixed too? :-)
> >>>>>>>
> >>>>>>> 	Ingo
> >>>>>> ok, could I include it in the 2/2 or you prefer separate patch?
> >>>>> depends on whether it will regress ;-)
> >>>>>
> >>>>> If it regresses, it's better to have it separate. If it wont, it can 
> >>>>> be included. If unsure, default to the more conservative option.
> >>>>>
> >>>>> 	Ingo
> >>>>
> >>>> how about this.. 
> >>>> and similar change for smp_mb__before_clear_bit in a separate patch
> >>>>
> >>>>
> >>>> diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
> >>>> index b7e5db8..4e77853 100644
> >>>> --- a/arch/x86/include/asm/spinlock.h
> >>>> +++ b/arch/x86/include/asm/spinlock.h
> >>>> @@ -302,4 +302,8 @@ static inline void __raw_write_unlock(raw_rwlock_t *rw)
> >>>>  #define _raw_read_relax(lock)	cpu_relax()
> >>>>  #define _raw_write_relax(lock)	cpu_relax()
> >>>>  
> >>>> +/* The {read|write|spin}_lock() on x86 are full memory barriers. */
> >>>> +static inline void smp_mb__after_lock(void) { }
> >>>> +#define ARCH_HAS_SMP_MB_AFTER_LOCK
> >>>> +
> >>>>  #endif /* _ASM_X86_SPINLOCK_H */
> >>>> diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
> >>>> index 252b245..4be57ab 100644
> >>>> --- a/include/linux/spinlock.h
> >>>> +++ b/include/linux/spinlock.h
> >>>> @@ -132,6 +132,11 @@ do {								\
> >>>>  #endif /*__raw_spin_is_contended*/
> >>>>  #endif
> >>>>  
> >>>> +/* The lock does not imply full memory barrier. */
> >>>> +#ifndef ARCH_HAS_SMP_MB_AFTER_LOCK
> >>>> +static inline void smp_mb__after_lock(void) { smp_mb(); }
> >>>> +#endif
> >>>> +
> >>>>  /**
> >>>>   * spin_unlock_wait - wait until the spinlock gets unlocked
> >>>>   * @lock: the spinlock in question.
> >>>> diff --git a/include/net/sock.h b/include/net/sock.h
> >>>> index 4eb8409..98afcd9 100644
> >>>> --- a/include/net/sock.h
> >>>> +++ b/include/net/sock.h
> >>>> @@ -1271,6 +1271,9 @@ static inline int sk_has_allocations(const struct sock *sk)
> >>>>   * in its cache, and so does the tp->rcv_nxt update on CPU2 side.  The CPU1
> >>>>   * could then endup calling schedule and sleep forever if there are no more
> >>>>   * data on the socket.
> >>>> + *
> >>>> + * The sk_has_helper is always called right after a call to read_lock, so we
> >>>> + * can use smp_mb__after_lock barrier.
> >>>>   */
> >>>>  static inline int sk_has_sleeper(struct sock *sk)
> >>>>  {
> >>>> @@ -1280,7 +1283,7 @@ static inline int sk_has_sleeper(struct sock *sk)
> >>>>  	 *
> >>>>  	 * This memory barrier is paired in the sock_poll_wait.
> >>>>  	 */
> >>>> -	smp_mb();
> >>>> +	smp_mb__after_lock();
> >>>>  	return sk->sk_sleep && waitqueue_active(sk->sk_sleep);
> >>>>  }
> >>>>  
> >>> any feedback on this? 
> >>> I'd send v6 if this way is acceptable..
> >>>
> >>> thanks,
> >>> jirka
> >> also I checked the smp_mb__before_clear_bit/smp_mb__after_clear_bit and
> >> it is used quite extensivelly.
> >>
> >> I'd prefer to send it in a separate patch, so we can move on with the 
> >> changes I've sent so far..
> >>
> > 
> > As with any optimization (and this is one that adds a semantic that will
> > just grow the memory barrier/locking rule complexity), it should come
> > with performance benchmarks showing real-life improvements.
> > 
> > Otherwise I'd recommend sticking to smp_mb() if this execution path is
> > not that critical, or to move to RCU if it's _that_ critical.
> > 
> > A valid argument would be if the data structures protected are so
> > complex that RCU is out of question but still the few cycles saved by
> > removing a memory barrier are really significant. And even then, the
> > proper solution would be more something like a
> > __read_lock()+smp_mb+smp_mb+__read_unlock(), so we get the performance
> > improvements on architectures other than x86 as well.
> > 
> > So in all cases, I don't think the smp_mb__after_lock() is the
> > appropriate solution.
> 
> RCU on this part is out of the question, as David already mentioned it.
> 

OK

> It would be a regression for short lived tcp/udp sessions, and some workloads
> use them a lot...
> 
> We gained about 20% performance between 2.6.26 and 2.6.31, carefuly removing
> some atomic ops in network stack, adding RCU where it was sensible, but this
> is a painful process, not something Jiri can use to fix bugs on legacy RedHat
> kernels :) (We still are sorting out regressions)
> 

Yep, I can understand that. Tbench on localhost is an especially good
benchmark for this ;)

> To solve problem pointed by Jiri, we have to insert an smp_mb() at this point,
> (not mentioning the other change in select() logic of course)
> 
>  static void sock_def_readable(struct sock *sk, int len)
>  {
>  	read_lock(&sk->sk_callback_lock);
> +	smp_mb(); /* paired with opposite smp_mb() in sk poll logic */
>  	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
>  		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLIN |
>  						POLLRDNORM | POLLRDBAND);
>  	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
>  	read_unlock(&sk->sk_callback_lock);
>  }
> 
> As about every incoming packet calls this path, we should be very careful not
> slowing down stack if not necessary.
> 
> On x86 this extra smp_mb() is not needed, since previous call to read_lock()
> already gives the full barrier for free.
> 
> 

Well, I see the __read_lock()+2x smp_mb+__read_unlock is not well suited
for x86. You're right.

But read_lock + smp_mb__after_lock + read_unlock is not well suited for
powerpc, arm, mips and probably others where there is an explicit memory
barrier at the end of the read lock primitive.

One thing that would be efficient for all architectures is to create a
locking primitive that contains the smp_mb, e.g.:

read_lock_smp_mb()

which would act as a read_lock which does a full smp_mb after the lock
is taken.

The naming may be a bit odd, better ideas are welcome.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-07 14:57                       ` Mathieu Desnoyers
@ 2009-07-07 15:23                         ` Eric Dumazet
  2009-07-08 17:47                           ` Jiri Olsa
  0 siblings, 1 reply; 39+ messages in thread
From: Eric Dumazet @ 2009-07-07 15:23 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Jiri Olsa, Ingo Molnar, Peter Zijlstra, netdev, linux-kernel, fbl,
	nhorman, davem, htejun, jarkao2, oleg, davidel

Mathieu Desnoyers a écrit :

> But read_lock + smp_mb__after_lock + read_unlock is not well suited for
> powerpc, arm, mips and probably others where there is an explicit memory
> barrier at the end of the read lock primitive.
> 
> One thing that would be efficient for all architectures is to create a
> locking primitive that contains the smp_mb, e.g.:
> 
> read_lock_smp_mb()
> 
> which would act as a read_lock which does a full smp_mb after the lock
> is taken.
> 
> The naming may be a bit odd, better ideas are welcome.

I see your point now, thanks for your patience.

Jiri, I think your first patch can be applied (including the full smp_mb()),
 then we will optimize both for x86 and other arches, when all 
arch maintainers have a chance to change
"read_lock();smp_mb()" to a faster "read_lock_mb()" or something :)


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-07 15:23                         ` Eric Dumazet
@ 2009-07-08 17:47                           ` Jiri Olsa
  2009-07-08 18:07                             ` David Miller
  0 siblings, 1 reply; 39+ messages in thread
From: Jiri Olsa @ 2009-07-08 17:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Mathieu Desnoyers, Ingo Molnar, Peter Zijlstra, netdev,
	linux-kernel, fbl, nhorman, davem, htejun, jarkao2, oleg, davidel

On Tue, Jul 07, 2009 at 05:23:18PM +0200, Eric Dumazet wrote:
> Mathieu Desnoyers a écrit :
> 
> > But read_lock + smp_mb__after_lock + read_unlock is not well suited for
> > powerpc, arm, mips and probably others where there is an explicit memory
> > barrier at the end of the read lock primitive.
> > 
> > One thing that would be efficient for all architectures is to create a
> > locking primitive that contains the smp_mb, e.g.:
> > 
> > read_lock_smp_mb()
> > 
> > which would act as a read_lock which does a full smp_mb after the lock
> > is taken.
> > 
> > The naming may be a bit odd, better ideas are welcome.
> 
> I see your point now, thanks for your patience.
> 
> Jiri, I think your first patch can be applied (including the full smp_mb()),
>  then we will optimize both for x86 and other arches, when all 
> arch maintainers have a chance to change
> "read_lock();smp_mb()" to a faster "read_lock_mb()" or something :)
>

great, I saw you Signed-off the 1/2 part.. could I leave it,
or do I need to resend as a single patch?

jirka

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-08 17:47                           ` Jiri Olsa
@ 2009-07-08 18:07                             ` David Miller
  2009-07-08 18:16                               ` Jiri Olsa
  0 siblings, 1 reply; 39+ messages in thread
From: David Miller @ 2009-07-08 18:07 UTC (permalink / raw)
  To: jolsa
  Cc: eric.dumazet, mathieu.desnoyers, mingo, a.p.zijlstra, netdev,
	linux-kernel, fbl, nhorman, htejun, jarkao2, oleg, davidel

From: Jiri Olsa <jolsa@redhat.com>
Date: Wed, 8 Jul 2009 19:47:48 +0200

> great, I saw you Signed-off the 1/2 part.. could I leave it,
> or do I need to resend as a single patch?

Could you formally resubmit things, regardless of whether
they are kept split up or combined?  That will help me
a lot.  Thanks.

I think things are settled down at this point and we can
push these patches into the tree.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-08 18:07                             ` David Miller
@ 2009-07-08 18:16                               ` Jiri Olsa
  0 siblings, 0 replies; 39+ messages in thread
From: Jiri Olsa @ 2009-07-08 18:16 UTC (permalink / raw)
  To: David Miller
  Cc: eric.dumazet, mathieu.desnoyers, mingo, a.p.zijlstra, netdev,
	linux-kernel, fbl, nhorman, htejun, jarkao2, oleg, davidel

On Wed, Jul 08, 2009 at 11:07:36AM -0700, David Miller wrote:
> From: Jiri Olsa <jolsa@redhat.com>
> Date: Wed, 8 Jul 2009 19:47:48 +0200
> 
> > great, I saw you Signed-off the 1/2 part.. could I leave it,
> > or do I need to resend as a single patch?
> 
> Could you formally resubmit things, regardless of whether
> they are kept split up or combined?  That will help me
> a lot.  Thanks.
> 
> I think things are settled down at this point and we can
> push these patches into the tree.

np, I'll send out shortly complete v6 then..

jirka

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-03  9:06   ` Ingo Molnar
  2009-07-03  9:20     ` Eric Dumazet
@ 2009-07-03 14:04     ` Mathieu Desnoyers
  2009-07-03 15:29       ` Herbert Xu
  1 sibling, 1 reply; 39+ messages in thread
From: Mathieu Desnoyers @ 2009-07-03 14:04 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Jiri Olsa, Peter Zijlstra, netdev, linux-kernel, fbl, nhorman,
	davem, htejun, jarkao2, oleg, davidel, eric.dumazet,
	Paul McKenney

* Ingo Molnar (mingo@elte.hu) wrote:
> 
> * Jiri Olsa <jolsa@redhat.com> wrote:
> 
> > +++ b/arch/x86/include/asm/spinlock.h
> > @@ -302,4 +302,7 @@ static inline void __raw_write_unlock(raw_rwlock_t *rw)
> >  #define _raw_read_relax(lock)	cpu_relax()
> >  #define _raw_write_relax(lock)	cpu_relax()
> >  
> > +/* The {read|write|spin}_lock() on x86 are full memory barriers. */
> > +#define smp_mb__after_lock() do { } while (0)
> 

Hm. Looking at http://lkml.org/lkml/2009/6/23/192, a very basic question
comes to my mind :

Why don't we create a read_lock without acquire semantic instead (e.g.
read_lock_nomb(), or something with a better name like __read_lock()) ?
On architectures where memory barriers are needed to provide the acquire
semantic, it would be faster to do :

__read_lock();
smp_mb();

than :

read_lock(); <- e.g. lwsync + isync or something like that
smp_mb(); <- full sync.

Second point : __add_wait_queue/waitqueue_active/wake_up_interruptible
would probably benefit from adding comments about their combined use
with other checks and how nice memory barriers are.

Mathieu


> Two small stylistic comments, please make this an inline function:
> 
> static inline void smp_mb__after_lock(void) { }
> #define smp_mb__after_lock
> 
> (untested)
> 
> > +/* The lock does not imply full memory barrier. */
> > +#ifndef smp_mb__after_lock
> > +#define smp_mb__after_lock() smp_mb()
> > +#endif
> 
> ditto.
> 
> 	Ingo

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-03 14:04     ` Mathieu Desnoyers
@ 2009-07-03 15:29       ` Herbert Xu
  2009-07-03 15:37         ` Eric Dumazet
  2009-07-03 15:40         ` Mathieu Desnoyers
  0 siblings, 2 replies; 39+ messages in thread
From: Herbert Xu @ 2009-07-03 15:29 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: mingo, jolsa, a.p.zijlstra, netdev, linux-kernel, fbl, nhorman,
	davem, htejun, jarkao2, oleg, davidel, eric.dumazet,
	Paul.McKenney

Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> 
> Why don't we create a read_lock without acquire semantic instead (e.g.
> read_lock_nomb(), or something with a better name like __read_lock()) ?
> On architectures where memory barriers are needed to provide the acquire
> semantic, it would be faster to do :
> 
> __read_lock();
> smp_mb();
> 
> than :
> 
> read_lock(); <- e.g. lwsync + isync or something like that
> smp_mb(); <- full sync.

Hmm, why do we even care when read_lock should just die?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-03 15:29       ` Herbert Xu
@ 2009-07-03 15:37         ` Eric Dumazet
  2009-07-03 15:47           ` Mathieu Desnoyers
  2009-07-03 15:40         ` Mathieu Desnoyers
  1 sibling, 1 reply; 39+ messages in thread
From: Eric Dumazet @ 2009-07-03 15:37 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Mathieu Desnoyers, mingo, jolsa, a.p.zijlstra, netdev,
	linux-kernel, fbl, nhorman, davem, htejun, jarkao2, oleg, davidel,
	eric.dumazet, Paul.McKenney

Herbert Xu a écrit :
> Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
>> Why don't we create a read_lock without acquire semantic instead (e.g.
>> read_lock_nomb(), or something with a better name like __read_lock()) ?
>> On architectures where memory barriers are needed to provide the acquire
>> semantic, it would be faster to do :
>>
>> __read_lock();
>> smp_mb();
>>
>> than :
>>
>> read_lock(); <- e.g. lwsync + isync or something like that
>> smp_mb(); <- full sync.
> 
> Hmm, why do we even care when read_lock should just die?
> 
> Cheers,

+1 :)

Do you mean using a spinlock instead or what ?

Also, how many arches are able to have a true __read_lock()
(or __spin_lock() if that matters), without acquire semantic ?

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-03 15:37         ` Eric Dumazet
@ 2009-07-03 15:47           ` Mathieu Desnoyers
  2009-07-03 17:06             ` Paul E. McKenney
  0 siblings, 1 reply; 39+ messages in thread
From: Mathieu Desnoyers @ 2009-07-03 15:47 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Herbert Xu, mingo, jolsa, a.p.zijlstra, netdev, linux-kernel, fbl,
	nhorman, davem, htejun, jarkao2, oleg, davidel, Paul.McKenney

* Eric Dumazet (eric.dumazet@gmail.com) wrote:
> Herbert Xu a écrit :
> > Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> >> Why don't we create a read_lock without acquire semantic instead (e.g.
> >> read_lock_nomb(), or something with a better name like __read_lock()) ?
> >> On architectures where memory barriers are needed to provide the acquire
> >> semantic, it would be faster to do :
> >>
> >> __read_lock();
> >> smp_mb();
> >>
> >> than :
> >>
> >> read_lock(); <- e.g. lwsync + isync or something like that
> >> smp_mb(); <- full sync.
> > 
> > Hmm, why do we even care when read_lock should just die?
> > 
> > Cheers,
> 
> +1 :)
> 
> Do you mean using a spinlock instead or what ?
> 

I think he meant RCU.

> Also, how many arches are able to have a true __read_lock()
> (or __spin_lock() if that matters), without acquire semantic ?

At least PowerPC, MIPS, recent ARM, alpha.

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-03 15:47           ` Mathieu Desnoyers
@ 2009-07-03 17:06             ` Paul E. McKenney
  2009-07-03 17:31               ` Mathieu Desnoyers
  0 siblings, 1 reply; 39+ messages in thread
From: Paul E. McKenney @ 2009-07-03 17:06 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Eric Dumazet, Herbert Xu, mingo, jolsa, a.p.zijlstra, netdev,
	linux-kernel, fbl, nhorman, davem, htejun, jarkao2, oleg, davidel,
	Paul.McKenney

On Fri, Jul 03, 2009 at 11:47:00AM -0400, Mathieu Desnoyers wrote:
> * Eric Dumazet (eric.dumazet@gmail.com) wrote:
> > Herbert Xu a écrit :
> > > Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> > >> Why don't we create a read_lock without acquire semantic instead (e.g.
> > >> read_lock_nomb(), or something with a better name like __read_lock()) ?
> > >> On architectures where memory barriers are needed to provide the acquire
> > >> semantic, it would be faster to do :
> > >>
> > >> __read_lock();
> > >> smp_mb();
> > >>
> > >> than :
> > >>
> > >> read_lock(); <- e.g. lwsync + isync or something like that
> > >> smp_mb(); <- full sync.
> > > 
> > > Hmm, why do we even care when read_lock should just die?
> > > 
> > > Cheers,
> > 
> > +1 :)
> > 
> > Do you mean using a spinlock instead or what ?
> > 
> 
> I think he meant RCU.
> 
> > Also, how many arches are able to have a true __read_lock()
> > (or __spin_lock() if that matters), without acquire semantic ?
> 
> At least PowerPC, MIPS, recent ARM, alpha.

Are you guys sure you are in agreement about what you all mean by
"acquire semantics"?

Clearly, any correct __read_lock() implementation must enforce ordering
with respect to the most recent __write_unlock(), but this does not
necesarily imply all possible definitions of "acquire semantics".

							Thanx, Paul

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-03 17:06             ` Paul E. McKenney
@ 2009-07-03 17:31               ` Mathieu Desnoyers
  0 siblings, 0 replies; 39+ messages in thread
From: Mathieu Desnoyers @ 2009-07-03 17:31 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Eric Dumazet, Herbert Xu, mingo, jolsa, a.p.zijlstra, netdev,
	linux-kernel, fbl, nhorman, davem, htejun, jarkao2, oleg, davidel,
	Paul.McKenney

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Fri, Jul 03, 2009 at 11:47:00AM -0400, Mathieu Desnoyers wrote:
> > * Eric Dumazet (eric.dumazet@gmail.com) wrote:
> > > Herbert Xu a écrit :
> > > > Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> > > >> Why don't we create a read_lock without acquire semantic instead (e.g.
> > > >> read_lock_nomb(), or something with a better name like __read_lock()) ?
> > > >> On architectures where memory barriers are needed to provide the acquire
> > > >> semantic, it would be faster to do :
> > > >>
> > > >> __read_lock();
> > > >> smp_mb();
> > > >>
> > > >> than :
> > > >>
> > > >> read_lock(); <- e.g. lwsync + isync or something like that
> > > >> smp_mb(); <- full sync.
> > > > 
> > > > Hmm, why do we even care when read_lock should just die?
> > > > 
> > > > Cheers,
> > > 
> > > +1 :)
> > > 
> > > Do you mean using a spinlock instead or what ?
> > > 
> > 
> > I think he meant RCU.
> > 
> > > Also, how many arches are able to have a true __read_lock()
> > > (or __spin_lock() if that matters), without acquire semantic ?
> > 
> > At least PowerPC, MIPS, recent ARM, alpha.
> 
> Are you guys sure you are in agreement about what you all mean by
> "acquire semantics"?
> 

I use acquire/release semantic with the following meaning :

...
read A
read_unlock()

read B

read_lock();
read C

read_unlock would provide release semantic by disallowing read A to move
after the read_unlock.

read_lock would provide acquire semantic by disallowing read C to move
before read_lock.

read B is free to move.


> Clearly, any correct __read_lock() implementation must enforce ordering
> with respect to the most recent __write_unlock(), but this does not
> necesarily imply all possible definitions of "acquire semantics".
> 

Yes, you are right. We could never remove _all_ memory barriers from
__read_lock()/__read_unlock implementations even if we require something
such as :

__read_lock()
smp_mb()

critical section.

smp_mb()
__read_unlock()

Because we also need to guarantee that consecutive unlock/lock won't be
reordered, which implies a barrier _outside_ of the read lock/unlock
atomic operations.

But anyway I'm not sure it's worth trying to optimize rwlocks, given
that for critical sections where the performance hit of a memory barrier
would be perceivable, we should really think about using RCU rather than
beating this dead horse. :)

Thanks,

Mathieu.


> 							Thanx, Paul

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock
  2009-07-03 15:29       ` Herbert Xu
  2009-07-03 15:37         ` Eric Dumazet
@ 2009-07-03 15:40         ` Mathieu Desnoyers
  1 sibling, 0 replies; 39+ messages in thread
From: Mathieu Desnoyers @ 2009-07-03 15:40 UTC (permalink / raw)
  To: Herbert Xu
  Cc: mingo, jolsa, a.p.zijlstra, netdev, linux-kernel, fbl, nhorman,
	davem, htejun, jarkao2, oleg, davidel, eric.dumazet,
	Paul.McKenney

* Herbert Xu (herbert@gondor.apana.org.au) wrote:
> Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
> > 
> > Why don't we create a read_lock without acquire semantic instead (e.g.
> > read_lock_nomb(), or something with a better name like __read_lock()) ?
> > On architectures where memory barriers are needed to provide the acquire
> > semantic, it would be faster to do :
> > 
> > __read_lock();
> > smp_mb();
> > 
> > than :
> > 
> > read_lock(); <- e.g. lwsync + isync or something like that
> > smp_mb(); <- full sync.
> 
> Hmm, why do we even care when read_lock should just die?
> 

I guess you are proposing migration from rwlock to RCU ?

Well, in cases where critical sections are in the order of 20000
cycles or more, and with 8 to 64-core machines, there is no significant
gain in using RCU vs rwlocks, so the added complexity might not be
justified if the critical sections are very long.

But then it's a case by case thing. We would have to see what exactly is
protected by this read lock and how long the critical section is.
However, in any case, you are right: rwlocks are acceptable only for
long critical sections, for which we just don't care about one extra
memory barrier.

Instead of optimizing away these read-side rwlock barriers, we would
spend our time much more efficiently switching to RCU read side.

Mathieu

> Cheers,
> -- 
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2009-07-08 18:16 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-07-03  8:12 [PATCHv5 0/2] net: fix race in the receive/select Jiri Olsa
2009-07-03  8:13 ` [PATCHv5 1/2] net: adding memory barrier to the poll and receive callbacks Jiri Olsa
2009-07-07 15:56   ` Eric Dumazet
2009-07-03  8:14 ` [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock Jiri Olsa
2009-07-03  9:06   ` Ingo Molnar
2009-07-03  9:20     ` Eric Dumazet
2009-07-03  9:24       ` Ingo Molnar
2009-07-03  9:56         ` Jiri Olsa
2009-07-03 10:25           ` Ingo Molnar
2009-07-03 11:18             ` Jiri Olsa
2009-07-03 11:30               ` Jarek Poplawski
2009-07-03 11:43                 ` Jiri Olsa
2009-07-07 10:18               ` Jiri Olsa
2009-07-07 13:46                 ` Jiri Olsa
2009-07-07 14:01                   ` Mathieu Desnoyers
2009-07-07 14:34                     ` Oleg Nesterov
2009-07-07 15:04                       ` Mathieu Desnoyers
2009-07-07 15:44                         ` Oleg Nesterov
2009-07-07 15:50                           ` Peter Zijlstra
2009-07-07 19:45                             ` Mathieu Desnoyers
2009-07-07 22:44                               ` Eric Dumazet
2009-07-07 23:28                                 ` Mathieu Desnoyers
2009-07-07 23:51                                   ` Oleg Nesterov
2009-07-08  4:34                                     ` Mathieu Desnoyers
2009-07-08  7:18                                       ` Jarek Poplawski
2009-07-07 14:34                     ` Jiri Olsa
2009-07-07 14:42                     ` Eric Dumazet
2009-07-07 14:57                       ` Mathieu Desnoyers
2009-07-07 15:23                         ` Eric Dumazet
2009-07-08 17:47                           ` Jiri Olsa
2009-07-08 18:07                             ` David Miller
2009-07-08 18:16                               ` Jiri Olsa
2009-07-03 14:04     ` Mathieu Desnoyers
2009-07-03 15:29       ` Herbert Xu
2009-07-03 15:37         ` Eric Dumazet
2009-07-03 15:47           ` Mathieu Desnoyers
2009-07-03 17:06             ` Paul E. McKenney
2009-07-03 17:31               ` Mathieu Desnoyers
2009-07-03 15:40         ` Mathieu Desnoyers

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).