* [PATCH v2 net 00/15] af_unix: Fix lockless access of sk->sk_state and other fields.
@ 2024-06-04 16:52 Kuniyuki Iwashima
2024-06-04 16:52 ` [PATCH v2 net 01/15] af_unix: Set sk->sk_state under unix_state_lock() for truly disconnected peer Kuniyuki Iwashima
` (15 more replies)
0 siblings, 16 replies; 32+ messages in thread
From: Kuniyuki Iwashima @ 2024-06-04 16:52 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
Patch 1 fixes a bug where a SOCK_DGRAM socket's sk->sk_state is changed
to TCP_CLOSE even though the socket is still connect()ed to another socket.
The rest of this series annotates lockless accesses to the following
fields.
* sk->sk_state
* sk->sk_sndbuf
* net->unx.sysctl_max_dgram_qlen
* sk->sk_receive_queue.qlen
* sk->sk_shutdown
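The annotation pattern is the usual WRITE_ONCE()/READ_ONCE() pairing; a
minimal sketch of what the series does (illustrative, not actual patch code):

  /* writer: keeps unix_state_lock() and becomes WRITE_ONCE() */
  unix_state_lock(sk);
  WRITE_ONCE(sk->sk_state, TCP_CLOSE);
  unix_state_unlock(sk);

  /* lockless reader (poll(), ioctl(), diag, ...) becomes READ_ONCE() */
  if (READ_ONCE(sk->sk_state) == TCP_LISTEN)
          return -EINVAL;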
Note that even with this series, the skb_queue_empty() left in
unix_dgram_disconnected() still needs to be converted to its lockless
version, and the unix_peer(other) access there should be protected by
unix_state_lock().
This will require some refactoring, so another series will follow.
Changes:
v2:
* Patch 1: Fix wrong double lock
v1: https://lore.kernel.org/netdev/20240603143231.62085-1-kuniyu@amazon.com/
Kuniyuki Iwashima (15):
af_unix: Set sk->sk_state under unix_state_lock() for truly
disconnected peer.
af_unix: Annotate data-races around sk->sk_state for writers.
af_unix: Annotate data-race of sk->sk_state in unix_inq_len().
af_unix: Annotate data-races around sk->sk_state in unix_write_space()
and poll().
af_unix: Annotate data-race of sk->sk_state in unix_stream_connect().
af_unix: Annotate data-race of sk->sk_state in unix_accept().
af_unix: Annotate data-races around sk->sk_state in sendmsg() and
recvmsg().
af_unix: Annotate data-race of sk->sk_state in unix_stream_read_skb().
af_unix: Annotate data-races around sk->sk_state in UNIX_DIAG.
af_unix: Annotate data-races around sk->sk_sndbuf.
af_unix: Annotate data-race of net->unx.sysctl_max_dgram_qlen.
af_unix: Use unix_recvq_full_lockless() in unix_stream_connect().
af_unix: Use skb_queue_empty_lockless() in unix_release_sock().
af_unix: Use skb_queue_len_lockless() in sk_diag_show_rqlen().
af_unix: Annotate data-race of sk->sk_shutdown in sk_diag_fill().
net/unix/af_unix.c | 90 +++++++++++++++++++++++-----------------------
net/unix/diag.c | 12 +++----
2 files changed, 50 insertions(+), 52 deletions(-)
--
2.30.2
^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH v2 net 01/15] af_unix: Set sk->sk_state under unix_state_lock() for truly disconnected peer.
2024-06-04 16:52 [PATCH v2 net 00/15] af_unix: Fix lockless access of sk->sk_state and other fields Kuniyuki Iwashima
@ 2024-06-04 16:52 ` Kuniyuki Iwashima
2024-06-09 11:28 ` Michal Luczaj
2024-06-04 16:52 ` [PATCH v2 net 02/15] af_unix: Annotate data-races around sk->sk_state for writers Kuniyuki Iwashima
` (14 subsequent siblings)
15 siblings, 1 reply; 32+ messages in thread
From: Kuniyuki Iwashima @ 2024-06-04 16:52 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, Cong Wang
When a SOCK_DGRAM socket connect()s to another socket, both sockets'
sk->sk_state are changed to TCP_ESTABLISHED so that we can register them
to BPF SOCKMAP.
When the socket disconnects from the peer by connect(AF_UNSPEC), the state
is set back to TCP_CLOSE.
Then, the peer's state is also set to TCP_CLOSE, but the update is done
locklessly and unconditionally.
Let's say socket A connect()ed to B, B connect()ed to C, and A disconnects
from B.
After the first two connect()s, all three sockets' sk->sk_state are
TCP_ESTABLISHED:
$ ss -xa
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
u_dgr ESTAB 0 0 @A 641 * 642
u_dgr ESTAB 0 0 @B 642 * 643
u_dgr ESTAB 0 0 @C 643 * 0
And after the disconnect, B's state is TCP_CLOSE even though it's still
connected to C and C's state is TCP_ESTABLISHED.
$ ss -xa
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
u_dgr UNCONN 0 0 @A 641 * 0
u_dgr UNCONN 0 0 @B 642 * 643
u_dgr ESTAB 0 0 @C 643 * 0
In this case, we cannot register B to SOCKMAP.
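A minimal reproducer sketch for the scenario above (illustrative only,
error handling omitted; the abstract names match the @A/@B/@C entries
in the ss output):

  #include <stddef.h>
  #include <sys/socket.h>
  #include <sys/un.h>
  #include <unistd.h>

  static const socklen_t alen = offsetof(struct sockaddr_un, sun_path) + 2;

  static int dgram_sock(char name)
  {
          struct sockaddr_un addr = { .sun_family = AF_UNIX };
          int fd = socket(AF_UNIX, SOCK_DGRAM, 0);

          addr.sun_path[1] = name;        /* abstract address "@<name>" */
          bind(fd, (struct sockaddr *)&addr, alen);
          return fd;
  }

  static void dgram_connect(int fd, char name)
  {
          struct sockaddr_un addr = { .sun_family = AF_UNIX };

          addr.sun_path[1] = name;
          connect(fd, (struct sockaddr *)&addr, alen);
  }

  int main(void)
  {
          struct sockaddr unspec = { .sa_family = AF_UNSPEC };
          int a = dgram_sock('A');
          int b = dgram_sock('B');
          int c = dgram_sock('C');

          dgram_connect(a, 'B');                  /* A -> B */
          dgram_connect(b, 'C');                  /* B -> C */

          /* A disconnects from B; without the fix, B is flipped to
           * TCP_CLOSE (UNCONN in ss) although it is still connected to C.
           */
          connect(a, &unspec, sizeof(unspec));

          (void)c;
          pause();                                /* inspect with `ss -xa` */
          return 0;
  }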
So, when a socket disconnects from its peer, we should not set the peer's
state to TCP_CLOSE if the peer is already connected to yet another socket,
and this check and update must be done under the peer's unix_state_lock().
Note that we use WRITE_ONCE() for sk->sk_state as there are many lockless
readers. These data-races will be fixed in the following patches.
Fixes: 83301b5367a9 ("af_unix: Set TCP_ESTABLISHED for datagram sockets too")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
CC: Cong Wang <cong.wang@bytedance.com>
---
net/unix/af_unix.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 25b49efc0926..b162164b7a42 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -570,7 +570,6 @@ static void unix_dgram_disconnected(struct sock *sk, struct sock *other)
sk_error_report(other);
}
}
- other->sk_state = TCP_CLOSE;
}
static void unix_sock_destructor(struct sock *sk)
@@ -1424,8 +1423,15 @@ static int unix_dgram_connect(struct socket *sock, struct sockaddr *addr,
unix_state_double_unlock(sk, other);
- if (other != old_peer)
+ if (other != old_peer) {
unix_dgram_disconnected(sk, old_peer);
+
+ unix_state_lock(old_peer);
+ if (!unix_peer(old_peer))
+ WRITE_ONCE(old_peer->sk_state, TCP_CLOSE);
+ unix_state_unlock(old_peer);
+ }
+
sock_put(old_peer);
} else {
unix_peer(sk) = other;
--
2.30.2
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH v2 net 02/15] af_unix: Annotate data-races around sk->sk_state for writers.
2024-06-04 16:52 [PATCH v2 net 00/15] af_unix: Fix lockless access of sk->sk_state and other fields Kuniyuki Iwashima
2024-06-04 16:52 ` [PATCH v2 net 01/15] af_unix: Set sk->sk_state under unix_state_lock() for truly disconnected peer Kuniyuki Iwashima
@ 2024-06-04 16:52 ` Kuniyuki Iwashima
2024-06-04 16:52 ` [PATCH v2 net 03/15] af_unix: Annotate data-race of sk->sk_state in unix_inq_len() Kuniyuki Iwashima
` (13 subsequent siblings)
15 siblings, 0 replies; 32+ messages in thread
From: Kuniyuki Iwashima @ 2024-06-04 16:52 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
sk->sk_state is changed under unix_state_lock(), but it's read locklessly
in many places.
This patch adds WRITE_ONCE() on the writer side.
We will add READ_ONCE() to the lockless readers in the following patches.
Fixes: 83301b5367a9 ("af_unix: Set TCP_ESTABLISHED for datagram sockets too")
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
net/unix/af_unix.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index b162164b7a42..424d021a4d7d 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -616,7 +616,7 @@ static void unix_release_sock(struct sock *sk, int embrion)
u->path.dentry = NULL;
u->path.mnt = NULL;
state = sk->sk_state;
- sk->sk_state = TCP_CLOSE;
+ WRITE_ONCE(sk->sk_state, TCP_CLOSE);
skpair = unix_peer(sk);
unix_peer(sk) = NULL;
@@ -738,7 +738,8 @@ static int unix_listen(struct socket *sock, int backlog)
if (backlog > sk->sk_max_ack_backlog)
wake_up_interruptible_all(&u->peer_wait);
sk->sk_max_ack_backlog = backlog;
- sk->sk_state = TCP_LISTEN;
+ WRITE_ONCE(sk->sk_state, TCP_LISTEN);
+
/* set credentials so connect can copy them */
init_peercred(sk);
err = 0;
@@ -1401,7 +1402,8 @@ static int unix_dgram_connect(struct socket *sock, struct sockaddr *addr,
if (err)
goto out_unlock;
- sk->sk_state = other->sk_state = TCP_ESTABLISHED;
+ WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED);
+ WRITE_ONCE(other->sk_state, TCP_ESTABLISHED);
} else {
/*
* 1003.1g breaking connected state with AF_UNSPEC
@@ -1418,7 +1420,7 @@ static int unix_dgram_connect(struct socket *sock, struct sockaddr *addr,
unix_peer(sk) = other;
if (!other)
- sk->sk_state = TCP_CLOSE;
+ WRITE_ONCE(sk->sk_state, TCP_CLOSE);
unix_dgram_peer_wake_disconnect_wakeup(sk, old_peer);
unix_state_double_unlock(sk, other);
@@ -1639,7 +1641,7 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
copy_peercred(sk, other);
sock->state = SS_CONNECTED;
- sk->sk_state = TCP_ESTABLISHED;
+ WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED);
sock_hold(newsk);
smp_mb__after_atomic(); /* sock_hold() does an atomic_inc() */
@@ -2050,7 +2052,7 @@ static int unix_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
unix_peer(sk) = NULL;
unix_dgram_peer_wake_disconnect_wakeup(sk, other);
- sk->sk_state = TCP_CLOSE;
+ WRITE_ONCE(sk->sk_state, TCP_CLOSE);
unix_state_unlock(sk);
unix_dgram_disconnected(sk, other);
--
2.30.2
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH v2 net 03/15] af_unix: Annotate data-race of sk->sk_state in unix_inq_len().
2024-06-04 16:52 [PATCH v2 net 00/15] af_unix: Fix lockless access of sk->sk_state and other fields Kuniyuki Iwashima
2024-06-04 16:52 ` [PATCH v2 net 01/15] af_unix: Set sk->sk_state under unix_state_lock() for truly disconnected peer Kuniyuki Iwashima
2024-06-04 16:52 ` [PATCH v2 net 02/15] af_unix: Annotate data-races around sk->sk_state for writers Kuniyuki Iwashima
@ 2024-06-04 16:52 ` Kuniyuki Iwashima
2024-06-04 16:52 ` [PATCH v2 net 04/15] af_unix: Annotate data-races around sk->sk_state in unix_write_space() and poll() Kuniyuki Iwashima
` (12 subsequent siblings)
15 siblings, 0 replies; 32+ messages in thread
From: Kuniyuki Iwashima @ 2024-06-04 16:52 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
ioctl(SIOCINQ) calls unix_inq_len(), which checks sk->sk_state first
and returns -EINVAL if it's TCP_LISTEN.
Then, for SOCK_STREAM sockets, unix_inq_len() returns the number of
bytes in recvq.
However, unix_inq_len() does not hold unix_state_lock(), and the
concurrent listen() might change the state after checking sk->sk_state.
If the race occurs, 0 is returned for the listener instead of -EINVAL
because the data length of an embryo skb is 0.
We could hold unix_state_lock() in unix_inq_len(), but that would be
overkill given the same result is also valid for the pre-listen()
TCP_CLOSE state.
So, let's use READ_ONCE() for sk->sk_state in unix_inq_len().
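For reference, the user-visible interface is just ioctl(SIOCINQ); an
illustrative snippet (not part of the patch):

  #include <linux/sockios.h>      /* SIOCINQ */
  #include <stdio.h>
  #include <sys/ioctl.h>
  #include <sys/socket.h>

  int main(void)
  {
          int sv[2], unread = -1;
          char byte = 'x';

          socketpair(AF_UNIX, SOCK_STREAM, 0, sv);
          send(sv[0], &byte, 1, 0);

          /* ends up in unix_inq_len(); a listener gets EINVAL instead */
          if (ioctl(sv[1], SIOCINQ, &unread) == 0)
                  printf("unread bytes: %d\n", unread);   /* 1 */

          return 0;
  }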
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
net/unix/af_unix.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 424d021a4d7d..b37b53767b29 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -3017,7 +3017,7 @@ long unix_inq_len(struct sock *sk)
struct sk_buff *skb;
long amount = 0;
- if (sk->sk_state == TCP_LISTEN)
+ if (READ_ONCE(sk->sk_state) == TCP_LISTEN)
return -EINVAL;
spin_lock(&sk->sk_receive_queue.lock);
--
2.30.2
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH v2 net 04/15] af_unix: Annotate data-races around sk->sk_state in unix_write_space() and poll().
2024-06-04 16:52 [PATCH v2 net 00/15] af_unix: Fix lockless access of sk->sk_state and other fields Kuniyuki Iwashima
` (2 preceding siblings ...)
2024-06-04 16:52 ` [PATCH v2 net 03/15] af_unix: Annotate data-race of sk->sk_state in unix_inq_len() Kuniyuki Iwashima
@ 2024-06-04 16:52 ` Kuniyuki Iwashima
2024-06-04 16:52 ` [PATCH v2 net 05/15] af_unix: Annotate data-race of sk->sk_state in unix_stream_connect() Kuniyuki Iwashima
` (11 subsequent siblings)
15 siblings, 0 replies; 32+ messages in thread
From: Kuniyuki Iwashima @ 2024-06-04 16:52 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
unix_poll() and unix_dgram_poll() read sk->sk_state locklessly and
call unix_writable(), which also reads sk->sk_state without holding
unix_state_lock().
Let's use READ_ONCE() in unix_poll() and unix_dgram_poll() and pass
it to unix_writable().
While at it, we remove the TCP_SYN_SENT check in unix_dgram_poll(), as
that state has never existed for AF_UNIX sockets since the code was added.
Fixes: 1586a5877db9 ("af_unix: do not report POLLOUT on listeners")
Fixes: 3c73419c09a5 ("af_unix: fix 'poll for write'/ connected DGRAM sockets")
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
net/unix/af_unix.c | 25 ++++++++++++-------------
1 file changed, 12 insertions(+), 13 deletions(-)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index b37b53767b29..3bacf47cb94d 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -530,9 +530,9 @@ static int unix_dgram_peer_wake_me(struct sock *sk, struct sock *other)
return 0;
}
-static int unix_writable(const struct sock *sk)
+static int unix_writable(const struct sock *sk, unsigned char state)
{
- return sk->sk_state != TCP_LISTEN &&
+ return state != TCP_LISTEN &&
(refcount_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
}
@@ -541,7 +541,7 @@ static void unix_write_space(struct sock *sk)
struct socket_wq *wq;
rcu_read_lock();
- if (unix_writable(sk)) {
+ if (unix_writable(sk, READ_ONCE(sk->sk_state))) {
wq = rcu_dereference(sk->sk_wq);
if (skwq_has_sleeper(wq))
wake_up_interruptible_sync_poll(&wq->wait,
@@ -3129,12 +3129,14 @@ static int unix_compat_ioctl(struct socket *sock, unsigned int cmd, unsigned lon
static __poll_t unix_poll(struct file *file, struct socket *sock, poll_table *wait)
{
struct sock *sk = sock->sk;
+ unsigned char state;
__poll_t mask;
u8 shutdown;
sock_poll_wait(file, sock, wait);
mask = 0;
shutdown = READ_ONCE(sk->sk_shutdown);
+ state = READ_ONCE(sk->sk_state);
/* exceptional events? */
if (READ_ONCE(sk->sk_err))
@@ -3156,14 +3158,14 @@ static __poll_t unix_poll(struct file *file, struct socket *sock, poll_table *wa
/* Connection-based need to check for termination and startup */
if ((sk->sk_type == SOCK_STREAM || sk->sk_type == SOCK_SEQPACKET) &&
- sk->sk_state == TCP_CLOSE)
+ state == TCP_CLOSE)
mask |= EPOLLHUP;
/*
* we set writable also when the other side has shut down the
* connection. This prevents stuck sockets.
*/
- if (unix_writable(sk))
+ if (unix_writable(sk, state))
mask |= EPOLLOUT | EPOLLWRNORM | EPOLLWRBAND;
return mask;
@@ -3174,12 +3176,14 @@ static __poll_t unix_dgram_poll(struct file *file, struct socket *sock,
{
struct sock *sk = sock->sk, *other;
unsigned int writable;
+ unsigned char state;
__poll_t mask;
u8 shutdown;
sock_poll_wait(file, sock, wait);
mask = 0;
shutdown = READ_ONCE(sk->sk_shutdown);
+ state = READ_ONCE(sk->sk_state);
/* exceptional events? */
if (READ_ONCE(sk->sk_err) ||
@@ -3199,19 +3203,14 @@ static __poll_t unix_dgram_poll(struct file *file, struct socket *sock,
mask |= EPOLLIN | EPOLLRDNORM;
/* Connection-based need to check for termination and startup */
- if (sk->sk_type == SOCK_SEQPACKET) {
- if (sk->sk_state == TCP_CLOSE)
- mask |= EPOLLHUP;
- /* connection hasn't started yet? */
- if (sk->sk_state == TCP_SYN_SENT)
- return mask;
- }
+ if (sk->sk_type == SOCK_SEQPACKET && state == TCP_CLOSE)
+ mask |= EPOLLHUP;
/* No write status requested, avoid expensive OUT tests. */
if (!(poll_requested_events(wait) & (EPOLLWRBAND|EPOLLWRNORM|EPOLLOUT)))
return mask;
- writable = unix_writable(sk);
+ writable = unix_writable(sk, state);
if (writable) {
unix_state_lock(sk);
--
2.30.2
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH v2 net 05/15] af_unix: Annotate data-race of sk->sk_state in unix_stream_connect().
2024-06-04 16:52 [PATCH v2 net 00/15] af_unix: Fix lockless access of sk->sk_state and other fields Kuniyuki Iwashima
` (3 preceding siblings ...)
2024-06-04 16:52 ` [PATCH v2 net 04/15] af_unix: Annotate data-races around sk->sk_state in unix_write_space() and poll() Kuniyuki Iwashima
@ 2024-06-04 16:52 ` Kuniyuki Iwashima
2024-06-04 16:52 ` [PATCH v2 net 06/15] af_unix: Annotate data-race of sk->sk_state in unix_accept() Kuniyuki Iwashima
` (10 subsequent siblings)
15 siblings, 0 replies; 32+ messages in thread
From: Kuniyuki Iwashima @ 2024-06-04 16:52 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
As a small optimisation, unix_stream_connect() prefetches the client's
sk->sk_state without unix_state_lock() and checks if it's TCP_CLOSE.
Later, sk->sk_state is checked again under unix_state_lock().
Let's use READ_ONCE() for the first check and TCP_CLOSE directly for
the second check.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
net/unix/af_unix.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 3bacf47cb94d..84552826530d 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1481,7 +1481,6 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
struct sk_buff *skb = NULL;
long timeo;
int err;
- int st;
err = unix_validate_addr(sunaddr, addr_len);
if (err)
@@ -1571,9 +1570,7 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
Well, and we have to recheck the state after socket locked.
*/
- st = sk->sk_state;
-
- switch (st) {
+ switch (READ_ONCE(sk->sk_state)) {
case TCP_CLOSE:
/* This is ok... continue with connect */
break;
@@ -1588,7 +1585,7 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
unix_state_lock_nested(sk, U_LOCK_SECOND);
- if (sk->sk_state != st) {
+ if (sk->sk_state != TCP_CLOSE) {
unix_state_unlock(sk);
unix_state_unlock(other);
sock_put(other);
--
2.30.2
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH v2 net 06/15] af_unix: Annotate data-race of sk->sk_state in unix_accept().
2024-06-04 16:52 [PATCH v2 net 00/15] af_unix: Fix lockless access of sk->sk_state and other fields Kuniyuki Iwashima
` (4 preceding siblings ...)
2024-06-04 16:52 ` [PATCH v2 net 05/15] af_unix: Annotate data-race of sk->sk_state in unix_stream_connect() Kuniyuki Iwashima
@ 2024-06-04 16:52 ` Kuniyuki Iwashima
2024-06-04 16:52 ` [PATCH v2 net 07/15] af_unix: Annotate data-races around sk->sk_state in sendmsg() and recvmsg() Kuniyuki Iwashima
` (9 subsequent siblings)
15 siblings, 0 replies; 32+ messages in thread
From: Kuniyuki Iwashima @ 2024-06-04 16:52 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
Once sk->sk_state is changed to TCP_LISTEN, it never changes.
unix_accept() takes advantage of this and reads sk->sk_state without
holding unix_state_lock().
Let's use READ_ONCE() there.
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
net/unix/af_unix.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 84552826530d..4763c26ae480 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1710,7 +1710,7 @@ static int unix_accept(struct socket *sock, struct socket *newsock,
goto out;
arg->err = -EINVAL;
- if (sk->sk_state != TCP_LISTEN)
+ if (READ_ONCE(sk->sk_state) != TCP_LISTEN)
goto out;
/* If socket state is TCP_LISTEN it cannot change (for now...),
--
2.30.2
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH v2 net 07/15] af_unix: Annotate data-races around sk->sk_state in sendmsg() and recvmsg().
2024-06-04 16:52 [PATCH v2 net 00/15] af_unix: Fix lockless access of sk->sk_state and other fields Kuniyuki Iwashima
` (5 preceding siblings ...)
2024-06-04 16:52 ` [PATCH v2 net 06/15] af_unix: Annotate data-race of sk->sk_state in unix_accept() Kuniyuki Iwashima
@ 2024-06-04 16:52 ` Kuniyuki Iwashima
2024-06-04 16:52 ` [PATCH v2 net 08/15] af_unix: Annotate data-race of sk->sk_state in unix_stream_read_skb() Kuniyuki Iwashima
` (8 subsequent siblings)
15 siblings, 0 replies; 32+ messages in thread
From: Kuniyuki Iwashima @ 2024-06-04 16:52 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
The following functions read sk->sk_state locklessly and proceed only if
the state is TCP_ESTABLISHED.
* unix_stream_sendmsg
* unix_stream_read_generic
* unix_seqpacket_sendmsg
* unix_seqpacket_recvmsg
Let's use READ_ONCE() there.
Fixes: a05d2ad1c1f3 ("af_unix: Only allow recv on connected seqpacket sockets.")
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
net/unix/af_unix.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 4763c26ae480..4ef9c21783a5 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2226,7 +2226,7 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
}
if (msg->msg_namelen) {
- err = sk->sk_state == TCP_ESTABLISHED ? -EISCONN : -EOPNOTSUPP;
+ err = READ_ONCE(sk->sk_state) == TCP_ESTABLISHED ? -EISCONN : -EOPNOTSUPP;
goto out_err;
} else {
err = -ENOTCONN;
@@ -2340,7 +2340,7 @@ static int unix_seqpacket_sendmsg(struct socket *sock, struct msghdr *msg,
if (err)
return err;
- if (sk->sk_state != TCP_ESTABLISHED)
+ if (READ_ONCE(sk->sk_state) != TCP_ESTABLISHED)
return -ENOTCONN;
if (msg->msg_namelen)
@@ -2354,7 +2354,7 @@ static int unix_seqpacket_recvmsg(struct socket *sock, struct msghdr *msg,
{
struct sock *sk = sock->sk;
- if (sk->sk_state != TCP_ESTABLISHED)
+ if (READ_ONCE(sk->sk_state) != TCP_ESTABLISHED)
return -ENOTCONN;
return unix_dgram_recvmsg(sock, msg, size, flags);
@@ -2683,7 +2683,7 @@ static int unix_stream_read_generic(struct unix_stream_read_state *state,
size_t size = state->size;
unsigned int last_len;
- if (unlikely(sk->sk_state != TCP_ESTABLISHED)) {
+ if (unlikely(READ_ONCE(sk->sk_state) != TCP_ESTABLISHED)) {
err = -EINVAL;
goto out;
}
--
2.30.2
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH v2 net 08/15] af_unix: Annotate data-race of sk->sk_state in unix_stream_read_skb().
2024-06-04 16:52 [PATCH v2 net 00/15] af_unix: Fix lockless access of sk->sk_state and other fields Kuniyuki Iwashima
` (6 preceding siblings ...)
2024-06-04 16:52 ` [PATCH v2 net 07/15] af_unix: Annotate data-races around sk->sk_state in sendmsg() and recvmsg() Kuniyuki Iwashima
@ 2024-06-04 16:52 ` Kuniyuki Iwashima
2024-06-04 16:52 ` [PATCH v2 net 09/15] af_unix: Annotate data-races around sk->sk_state in UNIX_DIAG Kuniyuki Iwashima
` (7 subsequent siblings)
15 siblings, 0 replies; 32+ messages in thread
From: Kuniyuki Iwashima @ 2024-06-04 16:52 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
unix_stream_read_skb() is called from sk->sk_data_ready() context
where unix_state_lock() is not held.
Let's use READ_ONCE() there.
Fixes: 77462de14a43 ("af_unix: Add read_sock for stream socket types")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
net/unix/af_unix.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 4ef9c21783a5..e7b74207aa3b 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2659,7 +2659,7 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
static int unix_stream_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
{
- if (unlikely(sk->sk_state != TCP_ESTABLISHED))
+ if (unlikely(READ_ONCE(sk->sk_state) != TCP_ESTABLISHED))
return -ENOTCONN;
return unix_read_skb(sk, recv_actor);
--
2.30.2
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH v2 net 09/15] af_unix: Annotate data-races around sk->sk_state in UNIX_DIAG.
2024-06-04 16:52 [PATCH v2 net 00/15] af_unix: Fix lockless access of sk->sk_state and other fields Kuniyuki Iwashima
` (7 preceding siblings ...)
2024-06-04 16:52 ` [PATCH v2 net 08/15] af_unix: Annotate data-race of sk->sk_state in unix_stream_read_skb() Kuniyuki Iwashima
@ 2024-06-04 16:52 ` Kuniyuki Iwashima
2024-06-04 16:52 ` [PATCH v2 net 10/15] af_unix: Annotate data-races around sk->sk_sndbuf Kuniyuki Iwashima
` (6 subsequent siblings)
15 siblings, 0 replies; 32+ messages in thread
From: Kuniyuki Iwashima @ 2024-06-04 16:52 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
While dumping AF_UNIX sockets via UNIX_DIAG, sk->sk_state is read
locklessly.
Let's use READ_ONCE() there.
Note that the result could be inconsistent if the socket is dumped
during the state change. This is common for other SOCK_DIAG and
similar interfaces.
Fixes: c9da99e6475f ("unix_diag: Fixup RQLEN extension report")
Fixes: 2aac7a2cb0d9 ("unix_diag: Pending connections IDs NLA")
Fixes: 45a96b9be6ec ("unix_diag: Dumping all sockets core")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
net/unix/diag.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/net/unix/diag.c b/net/unix/diag.c
index ae39538c5042..116cf508aea4 100644
--- a/net/unix/diag.c
+++ b/net/unix/diag.c
@@ -65,7 +65,7 @@ static int sk_diag_dump_icons(struct sock *sk, struct sk_buff *nlskb)
u32 *buf;
int i;
- if (sk->sk_state == TCP_LISTEN) {
+ if (READ_ONCE(sk->sk_state) == TCP_LISTEN) {
spin_lock(&sk->sk_receive_queue.lock);
attr = nla_reserve(nlskb, UNIX_DIAG_ICONS,
@@ -103,7 +103,7 @@ static int sk_diag_show_rqlen(struct sock *sk, struct sk_buff *nlskb)
{
struct unix_diag_rqlen rql;
- if (sk->sk_state == TCP_LISTEN) {
+ if (READ_ONCE(sk->sk_state) == TCP_LISTEN) {
rql.udiag_rqueue = sk->sk_receive_queue.qlen;
rql.udiag_wqueue = sk->sk_max_ack_backlog;
} else {
@@ -136,7 +136,7 @@ static int sk_diag_fill(struct sock *sk, struct sk_buff *skb, struct unix_diag_r
rep = nlmsg_data(nlh);
rep->udiag_family = AF_UNIX;
rep->udiag_type = sk->sk_type;
- rep->udiag_state = sk->sk_state;
+ rep->udiag_state = READ_ONCE(sk->sk_state);
rep->pad = 0;
rep->udiag_ino = sk_ino;
sock_diag_save_cookie(sk, rep->udiag_cookie);
@@ -215,7 +215,7 @@ static int unix_diag_dump(struct sk_buff *skb, struct netlink_callback *cb)
sk_for_each(sk, &net->unx.table.buckets[slot]) {
if (num < s_num)
goto next;
- if (!(req->udiag_states & (1 << sk->sk_state)))
+ if (!(req->udiag_states & (1 << READ_ONCE(sk->sk_state))))
goto next;
if (sk_diag_dump(sk, skb, req, sk_user_ns(skb->sk),
NETLINK_CB(cb->skb).portid,
--
2.30.2
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH v2 net 10/15] af_unix: Annotate data-races around sk->sk_sndbuf.
2024-06-04 16:52 [PATCH v2 net 00/15] af_unix: Fix lockless access of sk->sk_state and other fields Kuniyuki Iwashima
` (8 preceding siblings ...)
2024-06-04 16:52 ` [PATCH v2 net 09/15] af_unix: Annotate data-races around sk->sk_state in UNIX_DIAG Kuniyuki Iwashima
@ 2024-06-04 16:52 ` Kuniyuki Iwashima
2024-06-04 16:52 ` [PATCH v2 net 11/15] af_unix: Annotate data-race of net->unx.sysctl_max_dgram_qlen Kuniyuki Iwashima
` (5 subsequent siblings)
15 siblings, 0 replies; 32+ messages in thread
From: Kuniyuki Iwashima @ 2024-06-04 16:52 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
sk_setsockopt() changes sk->sk_sndbuf under lock_sock(), but lock_sock()
is not used in the af_unix.c paths that read sk->sk_sndbuf.
Let's use READ_ONCE() to read sk->sk_sndbuf in unix_writable(),
unix_dgram_sendmsg(), and unix_stream_sendmsg().
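For context, the lockless readers above race with nothing more exotic
than setsockopt(SO_SNDBUF); a hedged sketch of the two sides
(illustrative, not a real reproducer):

  #include <pthread.h>
  #include <sys/socket.h>

  static int sv[2];

  /* writer: sk_setsockopt() updates sk->sk_sndbuf under lock_sock() */
  static void *resize_sndbuf(void *arg)
  {
          int val = 4096;         /* the kernel doubles and clamps this */

          setsockopt(sv[0], SOL_SOCKET, SO_SNDBUF, &val, sizeof(val));
          return NULL;
  }

  int main(void)
  {
          char buf[1024] = { 0 };
          pthread_t t;

          socketpair(AF_UNIX, SOCK_DGRAM, 0, sv);
          pthread_create(&t, NULL, resize_sndbuf, NULL);

          /* reader: unix_dgram_sendmsg() checks sk->sk_sndbuf locklessly */
          send(sv[0], buf, sizeof(buf), 0);

          pthread_join(t, NULL);
          return 0;
  }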
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
net/unix/af_unix.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index e7b74207aa3b..de2a5e334a88 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -533,7 +533,7 @@ static int unix_dgram_peer_wake_me(struct sock *sk, struct sock *other)
static int unix_writable(const struct sock *sk, unsigned char state)
{
return state != TCP_LISTEN &&
- (refcount_read(&sk->sk_wmem_alloc) << 2) <= sk->sk_sndbuf;
+ (refcount_read(&sk->sk_wmem_alloc) << 2) <= READ_ONCE(sk->sk_sndbuf);
}
static void unix_write_space(struct sock *sk)
@@ -1967,7 +1967,7 @@ static int unix_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
}
err = -EMSGSIZE;
- if (len > sk->sk_sndbuf - 32)
+ if (len > READ_ONCE(sk->sk_sndbuf) - 32)
goto out;
if (len > SKB_MAX_ALLOC) {
@@ -2247,7 +2247,7 @@ static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
&err, 0);
} else {
/* Keep two messages in the pipe so it schedules better */
- size = min_t(int, size, (sk->sk_sndbuf >> 1) - 64);
+ size = min_t(int, size, (READ_ONCE(sk->sk_sndbuf) >> 1) - 64);
/* allow fallback to order-0 allocations */
size = min_t(int, size, SKB_MAX_HEAD(0) + UNIX_SKB_FRAGS_SZ);
--
2.30.2
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH v2 net 11/15] af_unix: Annotate data-race of net->unx.sysctl_max_dgram_qlen.
2024-06-04 16:52 [PATCH v2 net 00/15] af_unix: Fix lockless access of sk->sk_state and other fields Kuniyuki Iwashima
` (9 preceding siblings ...)
2024-06-04 16:52 ` [PATCH v2 net 10/15] af_unix: Annotate data-races around sk->sk_sndbuf Kuniyuki Iwashima
@ 2024-06-04 16:52 ` Kuniyuki Iwashima
2024-06-04 16:52 ` [PATCH v2 net 12/15] af_unix: Use unix_recvq_full_lockless() in unix_stream_connect() Kuniyuki Iwashima
` (4 subsequent siblings)
15 siblings, 0 replies; 32+ messages in thread
From: Kuniyuki Iwashima @ 2024-06-04 16:52 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
net->unx.sysctl_max_dgram_qlen is exposed as a sysctl knob and can be
changed concurrently.
Let's use READ_ONCE() in unix_create1().
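The concurrent writer is simply the sysctl handler (proc_dointvec()),
e.g. via the snippet below (illustrative only):

  #include <fcntl.h>
  #include <unistd.h>

  int main(void)
  {
          /* same as `sysctl -w net.unix.max_dgram_qlen=64`; the store is
           * not serialised with the read in unix_create1()
           */
          int fd = open("/proc/sys/net/unix/max_dgram_qlen", O_WRONLY);

          write(fd, "64\n", 3);
          close(fd);
          return 0;
  }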
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
net/unix/af_unix.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index de2a5e334a88..623ee39657e2 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -976,7 +976,7 @@ static struct sock *unix_create1(struct net *net, struct socket *sock, int kern,
sk->sk_hash = unix_unbound_hash(sk);
sk->sk_allocation = GFP_KERNEL_ACCOUNT;
sk->sk_write_space = unix_write_space;
- sk->sk_max_ack_backlog = net->unx.sysctl_max_dgram_qlen;
+ sk->sk_max_ack_backlog = READ_ONCE(net->unx.sysctl_max_dgram_qlen);
sk->sk_destruct = unix_sock_destructor;
u = unix_sk(sk);
u->listener = NULL;
--
2.30.2
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH v2 net 12/15] af_unix: Use unix_recvq_full_lockless() in unix_stream_connect().
2024-06-04 16:52 [PATCH v2 net 00/15] af_unix: Fix lockless access of sk->sk_state and other fields Kuniyuki Iwashima
` (10 preceding siblings ...)
2024-06-04 16:52 ` [PATCH v2 net 11/15] af_unix: Annotate data-race of net->unx.sysctl_max_dgram_qlen Kuniyuki Iwashima
@ 2024-06-04 16:52 ` Kuniyuki Iwashima
2024-06-04 16:52 ` [PATCH v2 net 13/15] af_unix: Use skb_queue_empty_lockless() in unix_release_sock() Kuniyuki Iwashima
` (3 subsequent siblings)
15 siblings, 0 replies; 32+ messages in thread
From: Kuniyuki Iwashima @ 2024-06-04 16:52 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
Once sk->sk_state is changed to TCP_LISTEN, it never changes.
unix_accept() takes advantage of this characteristic; it does not
hold the listener's unix_state_lock() and only acquires the recvq lock
to pop one skb.
It means unix_state_lock() does not prevent the queue length from
changing in unix_stream_connect().
Thus, we need to use unix_recvq_full_lockless() to avoid the data race.
Now we can remove unix_recvq_full() as no one uses it any longer.
Note that we can remove READ_ONCE() for sk->sk_max_ack_backlog in
unix_recvq_full_lockless() for the following reasons:
(1) For SOCK_DGRAM, it is a written-once field in unix_create1()
(2) For SOCK_STREAM and SOCK_SEQPACKET, it is changed under the
listener's unix_state_lock() in unix_listen(), and we hold
the lock in unix_stream_connect()
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
net/unix/af_unix.c | 10 ++--------
1 file changed, 2 insertions(+), 8 deletions(-)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 623ee39657e2..eb3ba3448ed3 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -221,15 +221,9 @@ static inline int unix_may_send(struct sock *sk, struct sock *osk)
return unix_peer(osk) == NULL || unix_our_peer(sk, osk);
}
-static inline int unix_recvq_full(const struct sock *sk)
-{
- return skb_queue_len(&sk->sk_receive_queue) > sk->sk_max_ack_backlog;
-}
-
static inline int unix_recvq_full_lockless(const struct sock *sk)
{
- return skb_queue_len_lockless(&sk->sk_receive_queue) >
- READ_ONCE(sk->sk_max_ack_backlog);
+ return skb_queue_len_lockless(&sk->sk_receive_queue) > sk->sk_max_ack_backlog;
}
struct sock *unix_peer_get(struct sock *s)
@@ -1545,7 +1539,7 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
if (other->sk_shutdown & RCV_SHUTDOWN)
goto out_unlock;
- if (unix_recvq_full(other)) {
+ if (unix_recvq_full_lockless(other)) {
err = -EAGAIN;
if (!timeo)
goto out_unlock;
--
2.30.2
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH v2 net 13/15] af_unix: Use skb_queue_empty_lockless() in unix_release_sock().
2024-06-04 16:52 [PATCH v2 net 00/15] af_unix: Fix lockless access of sk->sk_state and other fields Kuniyuki Iwashima
` (11 preceding siblings ...)
2024-06-04 16:52 ` [PATCH v2 net 12/15] af_unix: Use unix_recvq_full_lockless() in unix_stream_connect() Kuniyuki Iwashima
@ 2024-06-04 16:52 ` Kuniyuki Iwashima
2024-06-04 16:52 ` [PATCH v2 net 14/15] af_unix: Use skb_queue_len_lockless() in sk_diag_show_rqlen() Kuniyuki Iwashima
` (2 subsequent siblings)
15 siblings, 0 replies; 32+ messages in thread
From: Kuniyuki Iwashima @ 2024-06-04 16:52 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
If the socket type is SOCK_STREAM or SOCK_SEQPACKET, unix_release_sock()
checks the length of the peer socket's recvq under unix_state_lock().
However, unix_stream_read_generic() calls skb_unlink() after releasing
the lock. Also, for SOCK_SEQPACKET, __skb_try_recv_datagram() unlinks
skb without unix_state_lock().
Thus, unix_state_lock() does not protect the queue length.
Let's use skb_queue_empty_lockless() in unix_release_sock().
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
net/unix/af_unix.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index eb3ba3448ed3..80846279de9f 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -631,7 +631,7 @@ static void unix_release_sock(struct sock *sk, int embrion)
unix_state_lock(skpair);
/* No more writes */
WRITE_ONCE(skpair->sk_shutdown, SHUTDOWN_MASK);
- if (!skb_queue_empty(&sk->sk_receive_queue) || embrion)
+ if (!skb_queue_empty_lockless(&sk->sk_receive_queue) || embrion)
WRITE_ONCE(skpair->sk_err, ECONNRESET);
unix_state_unlock(skpair);
skpair->sk_state_change(skpair);
--
2.30.2
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH v2 net 14/15] af_unix: Use skb_queue_len_lockless() in sk_diag_show_rqlen().
2024-06-04 16:52 [PATCH v2 net 00/15] af_unix: Fix lockless access of sk->sk_state and other fields Kuniyuki Iwashima
` (12 preceding siblings ...)
2024-06-04 16:52 ` [PATCH v2 net 13/15] af_unix: Use skb_queue_empty_lockless() in unix_release_sock() Kuniyuki Iwashima
@ 2024-06-04 16:52 ` Kuniyuki Iwashima
2024-06-04 16:52 ` [PATCH v2 net 15/15] af_unix: Annotate data-race of sk->sk_shutdown in sk_diag_fill() Kuniyuki Iwashima
2024-06-06 11:10 ` [PATCH v2 net 00/15] af_unix: Fix lockless access of sk->sk_state and other fields patchwork-bot+netdevbpf
15 siblings, 0 replies; 32+ messages in thread
From: Kuniyuki Iwashima @ 2024-06-04 16:52 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
We can dump the socket queue length via UNIX_DIAG by specifying
UDIAG_SHOW_RQLEN.
If sk->sk_state is TCP_LISTEN, we return the recv queue length, but
we do not hold the recvq lock there.
Let's use skb_queue_len_lockless() in sk_diag_show_rqlen().
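A minimal sketch of such a dump request (illustrative only; response
parsing omitted):

  #include <linux/netlink.h>
  #include <linux/sock_diag.h>
  #include <linux/unix_diag.h>
  #include <sys/socket.h>
  #include <unistd.h>

  int main(void)
  {
          int nl = socket(AF_NETLINK, SOCK_RAW, NETLINK_SOCK_DIAG);
          struct {
                  struct nlmsghdr nlh;
                  struct unix_diag_req req;
          } msg = {
                  .nlh = {
                          .nlmsg_len = sizeof(msg),
                          .nlmsg_type = SOCK_DIAG_BY_FAMILY,
                          .nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP,
                  },
                  .req = {
                          .sdiag_family = AF_UNIX,
                          .udiag_states = ~0U,            /* all states */
                          .udiag_show = UDIAG_SHOW_RQLEN, /* rqlen/wqlen */
                  },
          };
          char buf[8192];

          send(nl, &msg, sizeof(msg), 0);
          recv(nl, buf, sizeof(buf), 0);  /* walk nlmsg/unix_diag_msg here */
          close(nl);
          return 0;
  }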
Fixes: c9da99e6475f ("unix_diag: Fixup RQLEN extension report")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
net/unix/diag.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/unix/diag.c b/net/unix/diag.c
index 116cf508aea4..321336f91a0a 100644
--- a/net/unix/diag.c
+++ b/net/unix/diag.c
@@ -104,7 +104,7 @@ static int sk_diag_show_rqlen(struct sock *sk, struct sk_buff *nlskb)
struct unix_diag_rqlen rql;
if (READ_ONCE(sk->sk_state) == TCP_LISTEN) {
- rql.udiag_rqueue = sk->sk_receive_queue.qlen;
+ rql.udiag_rqueue = skb_queue_len_lockless(&sk->sk_receive_queue);
rql.udiag_wqueue = sk->sk_max_ack_backlog;
} else {
rql.udiag_rqueue = (u32) unix_inq_len(sk);
--
2.30.2
^ permalink raw reply related [flat|nested] 32+ messages in thread
* [PATCH v2 net 15/15] af_unix: Annotate data-race of sk->sk_shutdown in sk_diag_fill().
2024-06-04 16:52 [PATCH v2 net 00/15] af_unix: Fix lockless access of sk->sk_state and other fields Kuniyuki Iwashima
` (13 preceding siblings ...)
2024-06-04 16:52 ` [PATCH v2 net 14/15] af_unix: Use skb_queue_len_lockless() in sk_diag_show_rqlen() Kuniyuki Iwashima
@ 2024-06-04 16:52 ` Kuniyuki Iwashima
2024-06-06 11:10 ` [PATCH v2 net 00/15] af_unix: Fix lockless access of sk->sk_state and other fields patchwork-bot+netdevbpf
15 siblings, 0 replies; 32+ messages in thread
From: Kuniyuki Iwashima @ 2024-06-04 16:52 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
While dumping sockets via UNIX_DIAG, we do not hold unix_state_lock().
Let's use READ_ONCE() to read sk->sk_shutdown.
Fixes: e4e541a84863 ("sock-diag: Report shutdown for inet and unix sockets (v2)")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
net/unix/diag.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/unix/diag.c b/net/unix/diag.c
index 321336f91a0a..937edf4afed4 100644
--- a/net/unix/diag.c
+++ b/net/unix/diag.c
@@ -165,7 +165,7 @@ static int sk_diag_fill(struct sock *sk, struct sk_buff *skb, struct unix_diag_r
sock_diag_put_meminfo(sk, skb, UNIX_DIAG_MEMINFO))
goto out_nlmsg_trim;
- if (nla_put_u8(skb, UNIX_DIAG_SHUTDOWN, sk->sk_shutdown))
+ if (nla_put_u8(skb, UNIX_DIAG_SHUTDOWN, READ_ONCE(sk->sk_shutdown)))
goto out_nlmsg_trim;
if ((req->udiag_show & UDIAG_SHOW_UID) &&
--
2.30.2
^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [PATCH v2 net 00/15] af_unix: Fix lockless access of sk->sk_state and other fields.
2024-06-04 16:52 [PATCH v2 net 00/15] af_unix: Fix lockless access of sk->sk_state and other fields Kuniyuki Iwashima
` (14 preceding siblings ...)
2024-06-04 16:52 ` [PATCH v2 net 15/15] af_unix: Annotate data-race of sk->sk_shutdown in sk_diag_fill() Kuniyuki Iwashima
@ 2024-06-06 11:10 ` patchwork-bot+netdevbpf
15 siblings, 0 replies; 32+ messages in thread
From: patchwork-bot+netdevbpf @ 2024-06-06 11:10 UTC (permalink / raw)
To: Kuniyuki Iwashima; +Cc: davem, edumazet, kuba, pabeni, kuni1840, netdev
Hello:
This series was applied to netdev/net.git (main)
by Paolo Abeni <pabeni@redhat.com>:
On Tue, 4 Jun 2024 09:52:26 -0700 you wrote:
> The patch 1 fixes a bug where SOCK_DGRAM's sk->sk_state is changed
> to TCP_CLOSE even if the socket is connect()ed to another socket.
>
> The rest of this series annotates lockless accesses to the following
> fields.
>
> * sk->sk_state
> * sk->sk_sndbuf
> * net->unx.sysctl_max_dgram_qlen
> * sk->sk_receive_queue.qlen
> * sk->sk_shutdown
>
> [...]
Here is the summary with links:
- [v2,net,01/15] af_unix: Set sk->sk_state under unix_state_lock() for truly disconnected peer.
https://git.kernel.org/netdev/net/c/26bfb8b57063
- [v2,net,02/15] af_unix: Annotate data-races around sk->sk_state for writers.
https://git.kernel.org/netdev/net/c/942238f9735a
- [v2,net,03/15] af_unix: Annotate data-race of sk->sk_state in unix_inq_len().
https://git.kernel.org/netdev/net/c/3a0f38eb285c
- [v2,net,04/15] af_unix: Annotate data-races around sk->sk_state in unix_write_space() and poll().
https://git.kernel.org/netdev/net/c/eb0718fb3e97
- [v2,net,05/15] af_unix: Annotate data-race of sk->sk_state in unix_stream_connect().
https://git.kernel.org/netdev/net/c/a9bf9c7dc6a5
- [v2,net,06/15] af_unix: Annotate data-race of sk->sk_state in unix_accept().
https://git.kernel.org/netdev/net/c/1b536948e805
- [v2,net,07/15] af_unix: Annotate data-races around sk->sk_state in sendmsg() and recvmsg().
https://git.kernel.org/netdev/net/c/8a34d4e8d974
- [v2,net,08/15] af_unix: Annotate data-race of sk->sk_state in unix_stream_read_skb().
https://git.kernel.org/netdev/net/c/af4c733b6b1a
- [v2,net,09/15] af_unix: Annotate data-races around sk->sk_state in UNIX_DIAG.
https://git.kernel.org/netdev/net/c/0aa3be7b3e1f
- [v2,net,10/15] af_unix: Annotate data-races around sk->sk_sndbuf.
https://git.kernel.org/netdev/net/c/b0632e53e0da
- [v2,net,11/15] af_unix: Annotate data-race of net->unx.sysctl_max_dgram_qlen.
https://git.kernel.org/netdev/net/c/bd9f2d05731f
- [v2,net,12/15] af_unix: Use unix_recvq_full_lockless() in unix_stream_connect().
https://git.kernel.org/netdev/net/c/45d872f0e655
- [v2,net,13/15] af_unix: Use skb_queue_empty_lockless() in unix_release_sock().
https://git.kernel.org/netdev/net/c/83690b82d228
- [v2,net,14/15] af_unix: Use skb_queue_len_lockless() in sk_diag_show_rqlen().
https://git.kernel.org/netdev/net/c/5d915e584d84
- [v2,net,15/15] af_unix: Annotate data-race of sk->sk_shutdown in sk_diag_fill().
https://git.kernel.org/netdev/net/c/efaf24e30ec3
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 net 01/15] af_unix: Set sk->sk_state under unix_state_lock() for truly disconnected peer.
2024-06-04 16:52 ` [PATCH v2 net 01/15] af_unix: Set sk->sk_state under unix_state_lock() for truly disconnected peer Kuniyuki Iwashima
@ 2024-06-09 11:28 ` Michal Luczaj
2024-06-09 19:53 ` Kuniyuki Iwashima
0 siblings, 1 reply; 32+ messages in thread
From: Michal Luczaj @ 2024-06-09 11:28 UTC (permalink / raw)
To: Kuniyuki Iwashima, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Kuniyuki Iwashima, netdev, Cong Wang
On 6/4/24 18:52, Kuniyuki Iwashima wrote:
> When a SOCK_DGRAM socket connect()s to another socket, both sockets'
> sk->sk_state are changed to TCP_ESTABLISHED so that we can register them
> to BPF SOCKMAP. (...)
Speaking of af_unix and sockmap, SOCK_STREAM has a tiny window for
bpf(BPF_MAP_UPDATE_ELEM) and unix_stream_connect() to race: when
sock_map_sk_state_allowed() passes (sk_state == TCP_ESTABLISHED), but
unix_peer(sk) in unix_stream_bpf_update_proto() _still_ returns NULL:
T0 bpf T1 connect
====== ==========
WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
sock_map_sk_state_allowed(sk)
...
sk_pair = unix_peer(sk)
sock_hold(sk_pair)
sock_hold(newsk)
smp_mb__after_atomic()
unix_peer(sk) = newsk
unix_state_unlock(sk)
With mdelay(1) stuffed in unix_stream_connect():
[ 902.277593] BUG: kernel NULL pointer dereference, address: 0000000000000080
[ 902.277633] #PF: supervisor write access in kernel mode
[ 902.277661] #PF: error_code(0x0002) - not-present page
[ 902.277688] PGD 107191067 P4D 107191067 PUD 10f63c067 PMD 0
[ 902.277716] Oops: Oops: 0002 [#23] PREEMPT SMP NOPTI
[ 902.277742] CPU: 2 PID: 1505 Comm: a.out Tainted: G D 6.10.0-rc1+ #130
[ 902.277769] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[ 902.277793] RIP: 0010:unix_stream_bpf_update_proto+0xa1/0x150
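(For the record, the userspace side of such a race is tiny; a hedged
sketch, not the reproducer used above, assuming libbpf, CAP_BPF and the
mdelay() hack to widen the window:)

  #include <bpf/bpf.h>
  #include <pthread.h>
  #include <stddef.h>
  #include <string.h>
  #include <sys/socket.h>
  #include <sys/un.h>

  static int cli, map_fd;

  static void *t0_bpf(void *arg)                  /* T0 */
  {
          int key = 0;
          __u64 val;

          val = cli;
          /* typically fails (EOPNOTSUPP) until sk_state is TCP_ESTABLISHED;
           * can oops if it wins the window before unix_peer(sk) is set
           */
          while (bpf_map_update_elem(map_fd, &key, &val, BPF_ANY))
                  ;
          return NULL;
  }

  int main(void)
  {
          struct sockaddr_un addr = { .sun_family = AF_UNIX };
          socklen_t alen = offsetof(struct sockaddr_un, sun_path) + 5;
          int srv = socket(AF_UNIX, SOCK_STREAM, 0);
          pthread_t t0;

          strcpy(addr.sun_path + 1, "race");      /* abstract "@race" */
          bind(srv, (struct sockaddr *)&addr, alen);
          listen(srv, 1);

          cli = socket(AF_UNIX, SOCK_STREAM, 0);
          map_fd = bpf_map_create(BPF_MAP_TYPE_SOCKMAP, NULL,
                                  sizeof(int), sizeof(__u64), 1, NULL);

          pthread_create(&t0, NULL, t0_bpf, NULL);
          connect(cli, (struct sockaddr *)&addr, alen);   /* T1 */
          pthread_join(t0, NULL);
          return 0;
  }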
Setting TCP_ESTABLISHED _after_ unix_peer() fixes the issue, so how about
something like
@@ -1631,12 +1631,13 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
/* Set credentials */
copy_peercred(sk, other);
- sock->state = SS_CONNECTED;
- WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED);
sock_hold(newsk);
+ smp_mb__after_atomic(); /* sock_hold() does an atomic_inc() */
+ WRITE_ONCE(unix_peer(sk), newsk);
+ smp_wmb(); /* ensure peer is set before sk_state */
- smp_mb__after_atomic(); /* sock_hold() does an atomic_inc() */
- unix_peer(sk) = newsk;
+ sock->state = SS_CONNECTED;
+ WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED);
unix_state_unlock(sk);
@@ -180,7 +180,8 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
* be a single matching destroy operation.
*/
if (!psock->sk_pair) {
- sk_pair = unix_peer(sk);
+ smp_rmb();
+ sk_pair = READ_ONCE(unix_peer(sk));
sock_hold(sk_pair);
psock->sk_pair = sk_pair;
}
This should keep things ordered and lockless... I hope.
Alternatively, maybe it would be better just to make BPF respect the unix
state lock?
@@ -180,6 +180,8 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
* be a single matching destroy operation.
*/
if (!psock->sk_pair) {
+ unix_state_lock(sk);
sk_pair = unix_peer(sk);
+ unix_state_unlock(sk);
sock_hold(sk_pair);
psock->sk_pair = sk_pair;
What do you think?
Thanks,
Michal
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 net 01/15] af_unix: Set sk->sk_state under unix_state_lock() for truly disconnected peer.
2024-06-09 11:28 ` Michal Luczaj
@ 2024-06-09 19:53 ` Kuniyuki Iwashima
2024-06-09 21:03 ` Kuniyuki Iwashima
0 siblings, 1 reply; 32+ messages in thread
From: Kuniyuki Iwashima @ 2024-06-09 19:53 UTC (permalink / raw)
To: mhal; +Cc: cong.wang, davem, edumazet, kuba, kuni1840, kuniyu, netdev,
pabeni
From: Michal Luczaj <mhal@rbox.co>
Date: Sun, 9 Jun 2024 13:28:34 +0200
> On 6/4/24 18:52, Kuniyuki Iwashima wrote:
> > When a SOCK_DGRAM socket connect()s to another socket, both sockets'
> > sk->sk_state are changed to TCP_ESTABLISHED so that we can register them
> > to BPF SOCKMAP. (...)
>
> Speaking of af_unix and sockmap, SOCK_STREAM has a tiny window for
> bpf(BPF_MAP_UPDATE_ELEM) and unix_stream_connect() to race: when
> sock_map_sk_state_allowed() passes (sk_state == TCP_ESTABLISHED), but
> unix_peer(sk) in unix_stream_bpf_update_proto() _still_ returns NULL:
>
> T0 bpf T1 connect
> ====== ==========
>
> WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
> sock_map_sk_state_allowed(sk)
> ...
> sk_pair = unix_peer(sk)
> sock_hold(sk_pair)
> sock_hold(newsk)
> smp_mb__after_atomic()
> unix_peer(sk) = newsk
> unix_state_unlock(sk)
>
> With mdelay(1) stuffed in unix_stream_connect():
>
> [ 902.277593] BUG: kernel NULL pointer dereference, address: 0000000000000080
> [ 902.277633] #PF: supervisor write access in kernel mode
> [ 902.277661] #PF: error_code(0x0002) - not-present page
> [ 902.277688] PGD 107191067 P4D 107191067 PUD 10f63c067 PMD 0
> [ 902.277716] Oops: Oops: 0002 [#23] PREEMPT SMP NOPTI
> [ 902.277742] CPU: 2 PID: 1505 Comm: a.out Tainted: G D 6.10.0-rc1+ #130
> [ 902.277769] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
> [ 902.277793] RIP: 0010:unix_stream_bpf_update_proto+0xa1/0x150
>
> Setting TCP_ESTABLISHED _after_ unix_peer() fixes the issue, so how about
> something like
>
> @@ -1631,12 +1631,13 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
> /* Set credentials */
> copy_peercred(sk, other);
>
> - sock->state = SS_CONNECTED;
> - WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED);
> sock_hold(newsk);
> + smp_mb__after_atomic(); /* sock_hold() does an atomic_inc() */
> + WRITE_ONCE(unix_peer(sk), newsk);
> + smp_wmb(); /* ensure peer is set before sk_state */
>
> - smp_mb__after_atomic(); /* sock_hold() does an atomic_inc() */
> - unix_peer(sk) = newsk;
> + sock->state = SS_CONNECTED;
> + WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED);
>
> unix_state_unlock(sk);
>
> @@ -180,7 +180,8 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
> * be a single matching destroy operation.
> */
> if (!psock->sk_pair) {
> - sk_pair = unix_peer(sk);
> + smp_rmb();
> + sk_pair = READ_ONCE(unix_peer(sk));
> sock_hold(sk_pair);
> psock->sk_pair = sk_pair;
> }
>
> This should keep things ordered and lockless... I hope.
sock_map_update_elem() assumes that the socket is protected
by lock_sock(), but AF_UNIX uses it only for the general path.
So, I think we should fix sock_map_sk_state_allowed() and
then use smp_store_release()/smp_load_acquire() rather than
smp_[rw]mb() for unix_peer(sk).
Could you test this with the mdelay(1) change ?
Note that we need not touch sock->state. I have a patch for
net-next that removes the sock->state uses completely from AF_UNIX
as it is not really used; even unix_seq_show() depends on sk->sk_state.
---8<---
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index d3dbb92153f2..67794d2c7498 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -549,7 +549,7 @@ static bool sock_map_sk_state_allowed(const struct sock *sk)
if (sk_is_tcp(sk))
return (1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_LISTEN);
if (sk_is_stream_unix(sk))
- return (1 << sk->sk_state) & TCPF_ESTABLISHED;
+ return (1 << READ_ONCE(sk->sk_state)) & TCPF_ESTABLISHED;
return true;
}
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 80846279de9f..a558745c7d76 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1632,11 +1632,11 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
copy_peercred(sk, other);
sock->state = SS_CONNECTED;
- WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED);
sock_hold(newsk);
smp_mb__after_atomic(); /* sock_hold() does an atomic_inc() */
- unix_peer(sk) = newsk;
+ smp_store_release(&unix_peer(sk), newsk);
+ WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED);
unix_state_unlock(sk);
diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
index bd84785bf8d6..6d9ae8e63901 100644
--- a/net/unix/unix_bpf.c
+++ b/net/unix/unix_bpf.c
@@ -180,7 +180,7 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
* be a single matching destroy operation.
*/
if (!psock->sk_pair) {
- sk_pair = unix_peer(sk);
+ sk_pair = smp_load_acquire(&unix_peer(sk));
sock_hold(sk_pair);
psock->sk_pair = sk_pair;
}
---8<---
>
> Alternatively, maybe it would be better just to make BPF respect the unix
> state lock?
>
> @@ -180,6 +180,8 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
> * be a single matching destroy operation.
> */
> if (!psock->sk_pair) {
> + unix_state_lock(sk);
> sk_pair = unix_peer(sk);
> + unix_state_unlock(sk);
> sock_hold(sk_pair);
> psock->sk_pair = sk_pair;
>
> What do you think?
If we'd go this way, I'd change like this:
---8<---
diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
index bd84785bf8d6..1db42cfee70d 100644
--- a/net/unix/unix_bpf.c
+++ b/net/unix/unix_bpf.c
@@ -159,8 +159,6 @@ int unix_dgram_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool re
int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
{
- struct sock *sk_pair;
-
/* Restore does not decrement the sk_pair reference yet because we must
* keep the a reference to the socket until after an RCU grace period
* and any pending sends have completed.
@@ -180,9 +178,9 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
* be a single matching destroy operation.
*/
if (!psock->sk_pair) {
- sk_pair = unix_peer(sk);
- sock_hold(sk_pair);
- psock->sk_pair = sk_pair;
+ psock->sk_pair = unix_peer_get(sk);
+ if (WARN_ON_ONCE(!psock->sk_pair))
+ return -EINVAL;
}
unix_stream_bpf_check_needs_rebuild(psock->sk_proto);
---8<---
And the _last_ option would be..., no :)
---8<---
diff --git a/include/net/af_unix.h b/include/net/af_unix.h
index b6eedf7650da..c7e31bc3e95e 100644
--- a/include/net/af_unix.h
+++ b/include/net/af_unix.h
@@ -94,8 +94,8 @@ struct unix_sock {
#define unix_sk(ptr) container_of_const(ptr, struct unix_sock, sk)
#define unix_peer(sk) (unix_sk(sk)->peer)
-#define unix_state_lock(s) spin_lock(&unix_sk(s)->lock)
-#define unix_state_unlock(s) spin_unlock(&unix_sk(s)->lock)
+#define unix_state_lock(s) lock_sock(s)
+#define unix_state_unlock(s) release_sock(s)
enum unix_socket_lock_class {
U_LOCK_NORMAL,
U_LOCK_SECOND, /* for double locking, see unix_state_double_lock(). */
@@ -108,7 +108,7 @@ enum unix_socket_lock_class {
static inline void unix_state_lock_nested(struct sock *sk,
enum unix_socket_lock_class subclass)
{
- spin_lock_nested(&unix_sk(sk)->lock, subclass);
+ lock_sock_nested(sk, subclass);
}
#define peer_wait peer_wq.wait
---8<---
^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [PATCH v2 net 01/15] af_unix: Set sk->sk_state under unix_state_lock() for truly disconnected peer.
2024-06-09 19:53 ` Kuniyuki Iwashima
@ 2024-06-09 21:03 ` Kuniyuki Iwashima
2024-06-10 12:55 ` Michal Luczaj
0 siblings, 1 reply; 32+ messages in thread
From: Kuniyuki Iwashima @ 2024-06-09 21:03 UTC (permalink / raw)
To: kuniyu; +Cc: cong.wang, davem, edumazet, kuba, kuni1840, mhal, netdev, pabeni
From: Kuniyuki Iwashima <kuniyu@amazon.com>
Date: Sun, 9 Jun 2024 12:53:20 -0700
> From: Michal Luczaj <mhal@rbox.co>
> Date: Sun, 9 Jun 2024 13:28:34 +0200
> > On 6/4/24 18:52, Kuniyuki Iwashima wrote:
> > > When a SOCK_DGRAM socket connect()s to another socket, both sockets'
> > > sk->sk_state are changed to TCP_ESTABLISHED so that we can register them
> > > to BPF SOCKMAP. (...)
> >
> > Speaking of af_unix and sockmap, SOCK_STREAM has a tiny window for
> > bpf(BPF_MAP_UPDATE_ELEM) and unix_stream_connect() to race: when
> > sock_map_sk_state_allowed() passes (sk_state == TCP_ESTABLISHED), but
> > unix_peer(sk) in unix_stream_bpf_update_proto() _still_ returns NULL:
> >
> > T0 bpf T1 connect
> > ====== ==========
> >
> > WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
> > sock_map_sk_state_allowed(sk)
> > ...
> > sk_pair = unix_peer(sk)
> > sock_hold(sk_pair)
> > sock_hold(newsk)
> > smp_mb__after_atomic()
> > unix_peer(sk) = newsk
> > unix_state_unlock(sk)
> >
> > With mdelay(1) stuffed in unix_stream_connect():
> >
> > [ 902.277593] BUG: kernel NULL pointer dereference, address: 0000000000000080
> > [ 902.277633] #PF: supervisor write access in kernel mode
> > [ 902.277661] #PF: error_code(0x0002) - not-present page
> > [ 902.277688] PGD 107191067 P4D 107191067 PUD 10f63c067 PMD 0
> > [ 902.277716] Oops: Oops: 0002 [#23] PREEMPT SMP NOPTI
> > [ 902.277742] CPU: 2 PID: 1505 Comm: a.out Tainted: G D 6.10.0-rc1+ #130
> > [ 902.277769] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
> > [ 902.277793] RIP: 0010:unix_stream_bpf_update_proto+0xa1/0x150
> >
> > Setting TCP_ESTABLISHED _after_ unix_peer() fixes the issue, so how about
> > something like
> >
> > @@ -1631,12 +1631,13 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
> > /* Set credentials */
> > copy_peercred(sk, other);
> >
> > - sock->state = SS_CONNECTED;
> > - WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED);
> > sock_hold(newsk);
> > + smp_mb__after_atomic(); /* sock_hold() does an atomic_inc() */
> > + WRITE_ONCE(unix_peer(sk), newsk);
> > + smp_wmb(); /* ensure peer is set before sk_state */
> >
> > - smp_mb__after_atomic(); /* sock_hold() does an atomic_inc() */
> > - unix_peer(sk) = newsk;
> > + sock->state = SS_CONNECTED;
> > + WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED);
> >
> > unix_state_unlock(sk);
> >
> > @@ -180,7 +180,8 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
> > * be a single matching destroy operation.
> > */
> > if (!psock->sk_pair) {
> > - sk_pair = unix_peer(sk);
> > + smp_rmb();
> > + sk_pair = READ_ONCE(unix_peer(sk));
> > sock_hold(sk_pair);
> > psock->sk_pair = sk_pair;
> > }
> >
> > This should keep things ordered and lockless... I hope.
>
> sock_map_update_elem() assumes that the socket is protected
> by lock_sock(), but AF_UNIX uses it only for the general path.
>
> So, I think we should fix sock_map_sk_state_allowed() and
> then use smp_store_release()/smp_load_acquire() rather than
> smp_[rw]mb() for unix_peer(sk).
Sorry, I think I was wrong and we can't use smp_store_release()
and smp_load_acquire(), and smp_[rw]mb() is needed.
Given we avoid adding code in the hotpath in the original fix
8866730aed510 [0], I prefer adding unix_state_lock() in the SOCKMAP
path again.
[0]: https://lore.kernel.org/bpf/6545bc9f7e443_3358c208ae@john.notmuch/
---8<---
diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index d3dbb92153f2..67794d2c7498 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -549,7 +549,7 @@ static bool sock_map_sk_state_allowed(const struct sock *sk)
if (sk_is_tcp(sk))
return (1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_LISTEN);
if (sk_is_stream_unix(sk))
- return (1 << sk->sk_state) & TCPF_ESTABLISHED;
+ return (1 << READ_ONCE(sk->sk_state)) & TCPF_ESTABLISHED;
return true;
}
diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
index bd84785bf8d6..1db42cfee70d 100644
--- a/net/unix/unix_bpf.c
+++ b/net/unix/unix_bpf.c
@@ -159,8 +159,6 @@ int unix_dgram_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool re
int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
{
- struct sock *sk_pair;
-
/* Restore does not decrement the sk_pair reference yet because we must
* keep the a reference to the socket until after an RCU grace period
* and any pending sends have completed.
@@ -180,9 +178,9 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
* be a single matching destroy operation.
*/
if (!psock->sk_pair) {
- sk_pair = unix_peer(sk);
- sock_hold(sk_pair);
- psock->sk_pair = sk_pair;
+ psock->sk_pair = unix_peer_get(sk);
+ if (WARN_ON_ONCE(!psock->sk_pair))
+ return -EINVAL;
}
unix_stream_bpf_check_needs_rebuild(psock->sk_proto);
---8<---
^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [PATCH v2 net 01/15] af_unix: Set sk->sk_state under unix_state_lock() for truly disconencted peer.
2024-06-09 21:03 ` Kuniyuki Iwashima
@ 2024-06-10 12:55 ` Michal Luczaj
2024-06-10 17:49 ` Kuniyuki Iwashima
0 siblings, 1 reply; 32+ messages in thread
From: Michal Luczaj @ 2024-06-10 12:55 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: cong.wang, davem, edumazet, kuba, kuni1840, netdev, pabeni
On 6/9/24 23:03, Kuniyuki Iwashima wrote:
> (...)
> Sorry, I think I was wrong and we can't use smp_store_release()
> and smp_load_acquire(), and smp_[rw]mb() is needed.
>
> Given we avoid adding code in the hotpath in the original fix
> 8866730aed510 [0], I prefer adding unix_state_lock() in the SOCKMAP
> path again.
>
> [0]: https://lore.kernel.org/bpf/6545bc9f7e443_3358c208ae@john.notmuch/
You're saying smp_wmb() in connect() is too much for the hot path, do I
understand correctly?
> ---8<---
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index d3dbb92153f2..67794d2c7498 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -549,7 +549,7 @@ static bool sock_map_sk_state_allowed(const struct sock *sk)
> if (sk_is_tcp(sk))
> return (1 << sk->sk_state) & (TCPF_ESTABLISHED | TCPF_LISTEN);
> if (sk_is_stream_unix(sk))
> - return (1 << sk->sk_state) & TCPF_ESTABLISHED;
> + return (1 << READ_ONCE(sk->sk_state)) & TCPF_ESTABLISHED;
> return true;
> }
>
> diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
> index bd84785bf8d6..1db42cfee70d 100644
> --- a/net/unix/unix_bpf.c
> +++ b/net/unix/unix_bpf.c
> @@ -159,8 +159,6 @@ int unix_dgram_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool re
>
> int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
> {
> - struct sock *sk_pair;
> -
> /* Restore does not decrement the sk_pair reference yet because we must
> * keep the a reference to the socket until after an RCU grace period
> * and any pending sends have completed.
> @@ -180,9 +178,9 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
> * be a single matching destroy operation.
> */
> if (!psock->sk_pair) {
> - sk_pair = unix_peer(sk);
> - sock_hold(sk_pair);
> - psock->sk_pair = sk_pair;
> + psock->sk_pair = unix_peer_get(sk);
> + if (WARN_ON_ONCE(!psock->sk_pair))
> + return -EINVAL;
> }
>
> unix_stream_bpf_check_needs_rebuild(psock->sk_proto);
> ---8<---
FWIW, by this point we've already passed sock_map_sk_state_allowed(), so the
critical section can be empty:
if (!psock->sk_pair) {
unix_state_lock(sk);
unix_state_unlock(sk);
sk_pair = READ_ONCE(unix_peer(sk));
...
}
Thanks,
Michal
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 net 01/15] af_unix: Set sk->sk_state under unix_state_lock() for truly disconencted peer.
2024-06-10 12:55 ` Michal Luczaj
@ 2024-06-10 17:49 ` Kuniyuki Iwashima
2024-06-16 23:28 ` Michal Luczaj
0 siblings, 1 reply; 32+ messages in thread
From: Kuniyuki Iwashima @ 2024-06-10 17:49 UTC (permalink / raw)
To: mhal; +Cc: cong.wang, davem, edumazet, kuba, kuni1840, kuniyu, netdev,
pabeni
From: Michal Luczaj <mhal@rbox.co>
Date: Mon, 10 Jun 2024 14:55:08 +0200
> On 6/9/24 23:03, Kuniyuki Iwashima wrote:
> > (...)
> > Sorry, I think I was wrong and we can't use smp_store_release()
> > and smp_load_acquire(), and smp_[rw]mb() is needed.
> >
> > Given we avoid adding code in the hotpath in the original fix
> > 8866730aed510 [0], I prefer adding unix_state_lock() in the SOCKMAP
> > path again.
> >
> > [0]: https://lore.kernel.org/bpf/6545bc9f7e443_3358c208ae@john.notmuch/
>
> You're saying smp_wmb() in connect() is too much for the hot path, do I
> understand correctly?
Yes, and now I think WARN_ON_ONCE() would be enough because it's unlikely
that the delay happens between the two store ops and concurrent bpf()
is in progress.
If syzkaller was able to hit this on vanilla kernel, we can revisit.
Then, probably we could just do s/WARN_ON_ONCE/unlikely/ because users
who call bpf() in such a way know that the state was TCP_CLOSE while
calling bpf().
---8<---
diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
index bd84785bf8d6..46dc747349f2 100644
--- a/net/unix/unix_bpf.c
+++ b/net/unix/unix_bpf.c
@@ -181,6 +181,9 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
*/
if (!psock->sk_pair) {
sk_pair = unix_peer(sk);
+ if (WARN_ON_ONCE(!sk_pair))
+ return -EINVAL;
+
sock_hold(sk_pair);
psock->sk_pair = sk_pair;
}
---8<---
^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [PATCH v2 net 01/15] af_unix: Set sk->sk_state under unix_state_lock() for truly disconencted peer.
2024-06-10 17:49 ` Kuniyuki Iwashima
@ 2024-06-16 23:28 ` Michal Luczaj
2024-06-17 18:21 ` Kuniyuki Iwashima
0 siblings, 1 reply; 32+ messages in thread
From: Michal Luczaj @ 2024-06-16 23:28 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: cong.wang, davem, edumazet, kuba, kuni1840, netdev, pabeni
On 6/10/24 19:49, Kuniyuki Iwashima wrote:
> From: Michal Luczaj <mhal@rbox.co>
> Date: Mon, 10 Jun 2024 14:55:08 +0200
>> On 6/9/24 23:03, Kuniyuki Iwashima wrote:
>>> (...)
>>> Sorry, I think I was wrong and we can't use smp_store_release()
>>> and smp_load_acquire(), and smp_[rw]mb() is needed.
>>>
>>> Given we avoid adding code in the hotpath in the original fix
>>> 8866730aed510 [0], I prefer adding unix_state_lock() in the SOCKMAP
>>> path again.
>>>
>>> [0]: https://lore.kernel.org/bpf/6545bc9f7e443_3358c208ae@john.notmuch/
>>
>> You're saying smp_wmb() in connect() is too much for the hot path, do I
>> understand correctly?
>
> Yes, and now I think WARN_ON_ONCE() would be enough because it's unlikely
> that the delay happens between the two store ops and concurrent bpf()
> is in progress.
>
> If syzkaller was able to hit this on vanilla kernel, we can revisit.
>
> Then, probably we could just do s/WARN_ON_ONCE/unlikely/ because users
> who call bpf() in such a way know that the state was TCP_CLOSE while
> calling bpf().
>
> ---8<---
> diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
> index bd84785bf8d6..46dc747349f2 100644
> --- a/net/unix/unix_bpf.c
> +++ b/net/unix/unix_bpf.c
> @@ -181,6 +181,9 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
> */
> if (!psock->sk_pair) {
> sk_pair = unix_peer(sk);
> + if (WARN_ON_ONCE(!sk_pair))
> + return -EINVAL;
> +
> sock_hold(sk_pair);
> psock->sk_pair = sk_pair;
> }
> ---8<---
Oh. That's a peculiar approach :) But, hey, it's your call.
Another AF_UNIX sockmap issue is with OOB. When OOB packet is sent, skb is
added to recv queue, but also u->oob_skb is set. Here's the problem: when
this skb goes through bpf_sk_redirect_map() and is moved between socks,
oob_skb remains set on the original sock.
[ 23.688994] WARNING: CPU: 2 PID: 993 at net/unix/garbage.c:351 unix_collect_queue+0x6c/0xb0
[ 23.689019] CPU: 2 PID: 993 Comm: kworker/u32:13 Not tainted 6.10.0-rc2+ #137
[ 23.689021] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[ 23.689024] Workqueue: events_unbound __unix_gc
[ 23.689027] RIP: 0010:unix_collect_queue+0x6c/0xb0
I wanted to write a patch, but then I realized I'm not sure what's the
expected behaviour. Should the oob_skb setting follow to the skb's new sock
or should it be dropped (similarly to what is happening today with
scm_fp_list, i.e. redirect strips inflights)?
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 net 01/15] af_unix: Set sk->sk_state under unix_state_lock() for truly disconencted peer.
2024-06-16 23:28 ` Michal Luczaj
@ 2024-06-17 18:21 ` Kuniyuki Iwashima
2024-06-19 18:14 ` Michal Luczaj
0 siblings, 1 reply; 32+ messages in thread
From: Kuniyuki Iwashima @ 2024-06-17 18:21 UTC (permalink / raw)
To: mhal; +Cc: cong.wang, davem, edumazet, kuba, kuni1840, kuniyu, netdev,
pabeni
From: Michal Luczaj <mhal@rbox.co>
Date: Mon, 17 Jun 2024 01:28:52 +0200
> On 6/10/24 19:49, Kuniyuki Iwashima wrote:
> > From: Michal Luczaj <mhal@rbox.co>
> > Date: Mon, 10 Jun 2024 14:55:08 +0200
> >> On 6/9/24 23:03, Kuniyuki Iwashima wrote:
> >>> (...)
> >>> Sorry, I think I was wrong and we can't use smp_store_release()
> >>> and smp_load_acquire(), and smp_[rw]mb() is needed.
> >>>
> >>> Given we avoid adding code in the hotpath in the original fix
> >>> 8866730aed510 [0], I prefer adding unix_state_lock() in the SOCKMAP
> >>> path again.
> >>>
> >>> [0]: https://lore.kernel.org/bpf/6545bc9f7e443_3358c208ae@john.notmuch/
> >>
> >> You're saying smp_wmb() in connect() is too much for the hot path, do I
> >> understand correctly?
> >
> > Yes, and now I think WARN_ON_ONCE() would be enough because it's unlikely
> > that the delay happens between the two store ops and concurrent bpf()
> > is in progress.
> >
> > If syzkaller was able to hit this on vanilla kernel, we can revisit.
> >
> > Then, probably we could just do s/WARN_ON_ONCE/unlikely/ because users
> > who call bpf() in such a way know that the state was TCP_CLOSE while
> > calling bpf().
> >
> > ---8<---
> > diff --git a/net/unix/unix_bpf.c b/net/unix/unix_bpf.c
> > index bd84785bf8d6..46dc747349f2 100644
> > --- a/net/unix/unix_bpf.c
> > +++ b/net/unix/unix_bpf.c
> > @@ -181,6 +181,9 @@ int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool r
> > */
> > if (!psock->sk_pair) {
> > sk_pair = unix_peer(sk);
> > + if (WARN_ON_ONCE(!sk_pair))
> > + return -EINVAL;
> > +
> > sock_hold(sk_pair);
> > psock->sk_pair = sk_pair;
> > }
> > ---8<---
>
> Oh. That's a peculiar approach :) But, hey, it's your call.
>
> Another AF_UNIX sockmap issue is with OOB. When OOB packet is sent, skb is
> added to recv queue, but also u->oob_skb is set. Here's the problem: when
> this skb goes through bpf_sk_redirect_map() and is moved between socks,
> oob_skb remains set on the original sock.
Good catch!
>
> [ 23.688994] WARNING: CPU: 2 PID: 993 at net/unix/garbage.c:351 unix_collect_queue+0x6c/0xb0
> [ 23.689019] CPU: 2 PID: 993 Comm: kworker/u32:13 Not tainted 6.10.0-rc2+ #137
> [ 23.689021] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
> [ 23.689024] Workqueue: events_unbound __unix_gc
> [ 23.689027] RIP: 0010:unix_collect_queue+0x6c/0xb0
>
> I wanted to write a patch, but then I realized I'm not sure what's the
> expected behaviour. Should the oob_skb setting follow to the skb's new sock
> or should it be dropped (similarly to what is happening today with
> scm_fp_list, i.e. redirect strips inflights)?
The former will require large refactoring as we need to check if the
redirect happens for BPF_F_INGRESS and if the redirected sk is also
SOCK_STREAM etc.
So, I'd go with the latter. Probably we can check if skb is u->oob_skb
and drop OOB data and retry next in unix_stream_read_skb(), and forbid
MSG_OOB in unix_bpf_recvmsg().
Both features were merged in 5.15, with OOB landing a month later than SOCKMAP,
so the Fixes tag would be 314001f0bf927 again: with a redirected OOB msg,
ioctl(SIOCATMARK) (and epoll(EPOLLPRI), after d9a232d435dcc was backported
to all stable kernels) lies.
Thanks!
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 net 01/15] af_unix: Set sk->sk_state under unix_state_lock() for truly disconencted peer.
2024-06-17 18:21 ` Kuniyuki Iwashima
@ 2024-06-19 18:14 ` Michal Luczaj
2024-06-19 19:19 ` Kuniyuki Iwashima
0 siblings, 1 reply; 32+ messages in thread
From: Michal Luczaj @ 2024-06-19 18:14 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: cong.wang, davem, edumazet, kuba, kuni1840, netdev, pabeni
On 6/17/24 20:21, Kuniyuki Iwashima wrote:
> From: Michal Luczaj <mhal@rbox.co>
> Date: Mon, 17 Jun 2024 01:28:52 +0200
>> (...)
>> Another AF_UNIX sockmap issue is with OOB. When OOB packet is sent, skb is
>> added to recv queue, but also u->oob_skb is set. Here's the problem: when
>> this skb goes through bpf_sk_redirect_map() and is moved between socks,
>> oob_skb remains set on the original sock.
>
> Good catch!
>
>>
>> [ 23.688994] WARNING: CPU: 2 PID: 993 at net/unix/garbage.c:351 unix_collect_queue+0x6c/0xb0
>> [ 23.689019] CPU: 2 PID: 993 Comm: kworker/u32:13 Not tainted 6.10.0-rc2+ #137
>> [ 23.689021] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
>> [ 23.689024] Workqueue: events_unbound __unix_gc
>> [ 23.689027] RIP: 0010:unix_collect_queue+0x6c/0xb0
>>
>> I wanted to write a patch, but then I realized I'm not sure what's the
>> expected behaviour. Should the oob_skb setting follow to the skb's new sock
>> or should it be dropped (similarly to what is happening today with
>> scm_fp_list, i.e. redirect strips inflights)?
>
> The former will require large refactoring as we need to check if the
> redirect happens for BPF_F_INGRESS and if the redirected sk is also
> SOCK_STREAM etc.
>
> So, I'd go with the latter. Probably we can check if skb is u->oob_skb
> and drop OOB data and retry next in unix_stream_read_skb(), and forbid
> MSG_OOB in unix_bpf_recvmsg().
> (...)
Yeah, sounds reasonable. I'm just not sure I understand the retry part. For
each skb_queue_tail() there's one ->sk_data_ready() (which does
->read_skb()). Why bother with a retry?
This is what I was thinking:
static int unix_stream_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
{
+ struct unix_sock *u = unix_sk(sk);
+ struct sk_buff *skb;
+ int err;
+
if (unlikely(READ_ONCE(sk->sk_state) != TCP_ESTABLISHED))
return -ENOTCONN;
- return unix_read_skb(sk, recv_actor);
+ mutex_lock(&u->iolock);
+ skb = skb_recv_datagram(sk, MSG_DONTWAIT, &err);
+
+#if IS_ENABLED(CONFIG_AF_UNIX_OOB)
+ if (skb) {
+ bool drop = false;
+
+ spin_lock(&sk->sk_receive_queue.lock);
+ if (skb == u->oob_skb) {
+ WRITE_ONCE(u->oob_skb, NULL);
+ drop = true;
+ }
+ spin_unlock(&sk->sk_receive_queue.lock);
+
+ if (drop) {
+ WARN_ON_ONCE(skb_unref(skb));
+ kfree_skb(skb);
+ skb = NULL;
+ err = 0;
+ }
+ }
+#endif
+
+ mutex_unlock(&u->iolock);
+ return skb ? recv_actor(sk, skb) : err;
}
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 net 01/15] af_unix: Set sk->sk_state under unix_state_lock() for truly disconencted peer.
2024-06-19 18:14 ` Michal Luczaj
@ 2024-06-19 19:19 ` Kuniyuki Iwashima
2024-06-20 20:35 ` Michal Luczaj
0 siblings, 1 reply; 32+ messages in thread
From: Kuniyuki Iwashima @ 2024-06-19 19:19 UTC (permalink / raw)
To: mhal; +Cc: cong.wang, davem, edumazet, kuba, kuni1840, kuniyu, netdev,
pabeni
From: Michal Luczaj <mhal@rbox.co>
Date: Wed, 19 Jun 2024 20:14:48 +0200
> On 6/17/24 20:21, Kuniyuki Iwashima wrote:
> > From: Michal Luczaj <mhal@rbox.co>
> > Date: Mon, 17 Jun 2024 01:28:52 +0200
> >> (...)
> >> Another AF_UNIX sockmap issue is with OOB. When OOB packet is sent, skb is
> >> added to recv queue, but also u->oob_skb is set. Here's the problem: when
> >> this skb goes through bpf_sk_redirect_map() and is moved between socks,
> >> oob_skb remains set on the original sock.
> >
> > Good catch!
> >
> >>
> >> [ 23.688994] WARNING: CPU: 2 PID: 993 at net/unix/garbage.c:351 unix_collect_queue+0x6c/0xb0
> >> [ 23.689019] CPU: 2 PID: 993 Comm: kworker/u32:13 Not tainted 6.10.0-rc2+ #137
> >> [ 23.689021] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
> >> [ 23.689024] Workqueue: events_unbound __unix_gc
> >> [ 23.689027] RIP: 0010:unix_collect_queue+0x6c/0xb0
> >>
> >> I wanted to write a patch, but then I realized I'm not sure what's the
> >> expected behaviour. Should the oob_skb setting follow to the skb's new sock
> >> or should it be dropped (similarly to what is happening today with
> >> scm_fp_list, i.e. redirect strips inflights)?
> >
> > The former will require large refactoring as we need to check if the
> > redirect happens for BPF_F_INGRESS and if the redirected sk is also
> > SOCK_STREAM etc.
> >
> > So, I'd go with the latter. Probably we can check if skb is u->oob_skb
> > and drop OOB data and retry next in unix_stream_read_skb(), and forbid
> > MSG_OOB in unix_bpf_recvmsg().
> > (...)
>
> Yeah, sounds reasonable. I'm just not sure I understand the retry part. For
> each skb_queue_tail() there's one ->sk_data_ready() (which does
> ->read_skb()). Why bother with a retry?
Exactly.
>
> This is what I was thinking:
>
When you post it, please make sure to CC bpf@ and sockmap maintainers too.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 net 01/15] af_unix: Set sk->sk_state under unix_state_lock() for truly disconencted peer.
2024-06-19 19:19 ` Kuniyuki Iwashima
@ 2024-06-20 20:35 ` Michal Luczaj
2024-06-20 21:56 ` Kuniyuki Iwashima
0 siblings, 1 reply; 32+ messages in thread
From: Michal Luczaj @ 2024-06-20 20:35 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: cong.wang, davem, edumazet, kuba, kuni1840, netdev, pabeni
On 6/19/24 21:19, Kuniyuki Iwashima wrote:
> From: Michal Luczaj <mhal@rbox.co>
> Date: Wed, 19 Jun 2024 20:14:48 +0200
>> On 6/17/24 20:21, Kuniyuki Iwashima wrote:
>>> From: Michal Luczaj <mhal@rbox.co>
>>> Date: Mon, 17 Jun 2024 01:28:52 +0200
>>>> (...)
>>>> Another AF_UNIX sockmap issue is with OOB. When OOB packet is sent, skb is
>>>> added to recv queue, but also u->oob_skb is set. Here's the problem: when
>>>> this skb goes through bpf_sk_redirect_map() and is moved between socks,
>>>> oob_skb remains set on the original sock.
>>>
>>> Good catch!
>>>
>>>>
>>>> [ 23.688994] WARNING: CPU: 2 PID: 993 at net/unix/garbage.c:351 unix_collect_queue+0x6c/0xb0
>>>> [ 23.689019] CPU: 2 PID: 993 Comm: kworker/u32:13 Not tainted 6.10.0-rc2+ #137
>>>> [ 23.689021] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
>>>> [ 23.689024] Workqueue: events_unbound __unix_gc
>>>> [ 23.689027] RIP: 0010:unix_collect_queue+0x6c/0xb0
>>>>
>>>> I wanted to write a patch, but then I realized I'm not sure what's the
>>>> expected behaviour. Should the oob_skb setting follow to the skb's new sock
>>>> or should it be dropped (similarly to what is happening today with
>>>> scm_fp_list, i.e. redirect strips inflights)?
>>>
>>> The former will require large refactoring as we need to check if the
>>> redirect happens for BPF_F_INGRESS and if the redirected sk is also
>>> SOCK_STREAM etc.
>>>
>>> So, I'd go with the latter. Probably we can check if skb is u->oob_skb
>>> and drop OOB data and retry next in unix_stream_read_skb(), and forbid
>>> MSG_OOB in unix_bpf_recvmsg().
>>> (...)
>>
>> Yeah, sounds reasonable. I'm just not sure I understand the retry part. For
>> each skb_queue_tail() there's one ->sk_data_ready() (which does
>> ->read_skb()). Why bother with a retry?
>
> Exactly.
>
>
>>
>> This is what I was thinking:
>>
>
> When you post it, please make sure to CC bpf@ and sockmap maintainers too.
Done: https://lore.kernel.org/netdev/20240620203009.2610301-1-mhal@rbox.co/
Thanks!
In fact, should I try to document those not-so-obvious OOB/sockmap
interactions? And speaking of documentation, an astute reader noted that
`man unix` is lying:
Sockets API
...
UNIX domain sockets do not support the transmission of out-of-band
data (the MSG_OOB flag for send(2) and recv(2)).
NOTES
...
UNIX domain stream sockets do not support the notion of out-of-band
data.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 net 01/15] af_unix: Set sk->sk_state under unix_state_lock() for truly disconencted peer.
2024-06-20 20:35 ` Michal Luczaj
@ 2024-06-20 21:56 ` Kuniyuki Iwashima
2024-06-22 22:43 ` Michal Luczaj
0 siblings, 1 reply; 32+ messages in thread
From: Kuniyuki Iwashima @ 2024-06-20 21:56 UTC (permalink / raw)
To: mhal; +Cc: cong.wang, davem, edumazet, kuba, kuni1840, kuniyu, netdev,
pabeni
From: Michal Luczaj <mhal@rbox.co>
Date: Thu, 20 Jun 2024 22:35:55 +0200
> In fact, should I try to document those not-so-obvious OOB/sockmap
> interactions? And speaking of documentation, an astute reader noted that
> `man unix` is lying:
At least I wouldn't update man until we can say AF_UNIX MSG_OOB handling
is stable enough; the behaviour is not compliant with TCP now.
While rewriting the oob test thoroughly, I found a few more weird behaviours,
and patches will follow.
For example:
>>> from socket import *
>>> c1, c2 = socketpair(AF_UNIX)
>>> c1.send(b'hello', MSG_OOB)
5
>>> c1.send(b'world')
5
>>> c2.recv(10)
b'hell'
>>> c2.recv(1, MSG_OOB)
b'o'
>>> c2.setblocking(False) # This causes -EAGAIN even with available data
>>> c2.recv(5)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
BlockingIOError: [Errno 11] Resource temporarily unavailable
>>> c2.recv(5)
b'world'
And we need
---8<---
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 5e695a9a609c..2875a7ce1887 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2614,9 +2614,20 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
struct unix_sock *u = unix_sk(sk);
if (!unix_skb_len(skb) && !(flags & MSG_PEEK)) {
- skb_unlink(skb, &sk->sk_receive_queue);
- consume_skb(skb);
- skb = NULL;
+ struct sk_buff *unlinked_skb = skb;
+
+ spin_lock(&sk->sk_receive_queue.lock);
+
+ __skb_unlink(skb, &sk->sk_receive_queue);
+
+ if (copied)
+ skb = NULL;
+ else
+ skb = skb_peek(&sk->sk_receive_queue);
+
+ spin_unlock(&sk->sk_receive_queue.lock);
+
+ consume_skb(unlinked_skb);
} else {
struct sk_buff *unlinked_skb = NULL;
---8<---
^ permalink raw reply related [flat|nested] 32+ messages in thread
* Re: [PATCH v2 net 01/15] af_unix: Set sk->sk_state under unix_state_lock() for truly disconencted peer.
2024-06-20 21:56 ` Kuniyuki Iwashima
@ 2024-06-22 22:43 ` Michal Luczaj
2024-06-23 5:19 ` Kuniyuki Iwashima
0 siblings, 1 reply; 32+ messages in thread
From: Michal Luczaj @ 2024-06-22 22:43 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: cong.wang, davem, edumazet, kuba, kuni1840, netdev, pabeni
On 6/20/24 23:56, Kuniyuki Iwashima wrote:
> From: Michal Luczaj <mhal@rbox.co>
> Date: Thu, 20 Jun 2024 22:35:55 +0200
>> In fact, should I try to document those not-so-obvious OOB/sockmap
> >> interactions? And speaking of documentation, an astute reader noted that
>> `man unix` is lying:
>
> At least I wouldn't update man until we can say AF_UNIX MSG_OOB handling
> is stable enough; the behaviour is not compliant with TCP now.
Sure, I get it.
> (...)
> And we need
>
> ---8<---
> diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> index 5e695a9a609c..2875a7ce1887 100644
> --- a/net/unix/af_unix.c
> +++ b/net/unix/af_unix.c
> @@ -2614,9 +2614,20 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
> struct unix_sock *u = unix_sk(sk);
>
> if (!unix_skb_len(skb) && !(flags & MSG_PEEK)) {
> - skb_unlink(skb, &sk->sk_receive_queue);
> - consume_skb(skb);
> - skb = NULL;
> + struct sk_buff *unlinked_skb = skb;
> +
> + spin_lock(&sk->sk_receive_queue.lock);
> +
> + __skb_unlink(skb, &sk->sk_receive_queue);
> +
> + if (copied)
> + skb = NULL;
> + else
> + skb = skb_peek(&sk->sk_receive_queue);
> +
> + spin_unlock(&sk->sk_receive_queue.lock);
> +
> + consume_skb(unlinked_skb);
> } else {
> struct sk_buff *unlinked_skb = NULL;
>
> ---8<---
I gotta ask, is there a reason for unlinking an already consumed
('consumed' as in 'unix_skb_len(skb) == 0') skb so late, in manage_oob()?
IOW, can't it be unlinked immediately once it's consumed in
unix_stream_recv_urg()? I suppose that would simplify things.
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 net 01/15] af_unix: Set sk->sk_state under unix_state_lock() for truly disconencted peer.
2024-06-22 22:43 ` Michal Luczaj
@ 2024-06-23 5:19 ` Kuniyuki Iwashima
2024-06-26 10:48 ` Michal Luczaj
0 siblings, 1 reply; 32+ messages in thread
From: Kuniyuki Iwashima @ 2024-06-23 5:19 UTC (permalink / raw)
To: mhal; +Cc: cong.wang, davem, edumazet, kuba, kuni1840, kuniyu, netdev,
pabeni
From: Michal Luczaj <mhal@rbox.co>
Date: Sun, 23 Jun 2024 00:43:27 +0200
> On 6/20/24 23:56, Kuniyuki Iwashima wrote:
> > From: Michal Luczaj <mhal@rbox.co>
> > Date: Thu, 20 Jun 2024 22:35:55 +0200
> >> In fact, should I try to document those not-so-obvious OOB/sockmap
> >> interaction? And speaking of documentation, an astute reader noted that
> >> `man unix` is lying:
> >
> > At least I wouldn't update man until we can say AF_UNIX MSG_OOB handling
> > is stable enough; the behaviour is not compliant with TCP now.
>
> Sure, I get it.
>
> > (...)
> > And we need
> >
> > ---8<---
> > diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
> > index 5e695a9a609c..2875a7ce1887 100644
> > --- a/net/unix/af_unix.c
> > +++ b/net/unix/af_unix.c
> > @@ -2614,9 +2614,20 @@ static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
> > struct unix_sock *u = unix_sk(sk);
> >
> > if (!unix_skb_len(skb) && !(flags & MSG_PEEK)) {
> > - skb_unlink(skb, &sk->sk_receive_queue);
> > - consume_skb(skb);
> > - skb = NULL;
> > + struct sk_buff *unlinked_skb = skb;
> > +
> > + spin_lock(&sk->sk_receive_queue.lock);
> > +
> > + __skb_unlink(skb, &sk->sk_receive_queue);
> > +
> > + if (copied)
> > + skb = NULL;
> > + else
> > + skb = skb_peek(&sk->sk_receive_queue);
> > +
> > + spin_unlock(&sk->sk_receive_queue.lock);
> > +
> > + consume_skb(unlinked_skb);
> > } else {
> > struct sk_buff *unlinked_skb = NULL;
> >
> > ---8<---
>
> I gotta ask, is there a reason for unlinking an already consumed
> ('consumed' as in 'unix_skb_len(skb) == 0') skb so late, in manage_oob()?
> IOW, can't it be unlinked immediately once it's consumed in
> unix_stream_recv_urg()? I suppose that would simplify things.
I also thought that before, but we can't do that.
Even after reading OOB data, we need to remember the position
and break recv() at that point. That's why the skb is unlinked
in manage_oob() rather than unix_stream_recv_urg().
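A rough sketch of the user-visible behaviour I mean (expected output under
the intended semantics; not an exhaustive test):
>>> from socket import *
>>> c1, c2 = socketpair(AF_UNIX)
>>> c1.send(b'hello', MSG_OOB)  # the last byte 'o' is the OOB byte
5
>>> c1.send(b'world')
5
>>> c2.recv(1, MSG_OOB)         # consume the OOB byte first
b'o'
>>> c2.recv(20)                 # recv() still breaks at the OOB position
b'hell'
>>> c2.recv(20)                 # only the next recv() crosses the mark
b'world'
If we unlinked the skb in unix_stream_recv_urg(), we would lose that break
point and the second recv() could return b'hellworld'.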
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 net 01/15] af_unix: Set sk->sk_state under unix_state_lock() for truly disconencted peer.
2024-06-23 5:19 ` Kuniyuki Iwashima
@ 2024-06-26 10:48 ` Michal Luczaj
2024-06-26 21:56 ` Kuniyuki Iwashima
0 siblings, 1 reply; 32+ messages in thread
From: Michal Luczaj @ 2024-06-26 10:48 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: cong.wang, davem, edumazet, kuba, kuni1840, netdev, pabeni
On 6/23/24 07:19, Kuniyuki Iwashima wrote:
> From: Michal Luczaj <mhal@rbox.co>
> Date: Sun, 23 Jun 2024 00:43:27 +0200
>> I gotta ask, is there a reason for unlinking an already consumed
>> ('consumed' as in 'unix_skb_len(skb) == 0') skb so late, in manage_oob()?
>> IOW, can't it be unlinked immediately once it's consumed in
>> unix_stream_recv_urg()? I suppose that would simplify things.
>
> I also thought that before, but we can't do that.
>
> Even after reading OOB data, we need to remember the position
> and break recv() at that point. That's why the skb is unlinked
> in manage_oob() rather than unix_stream_recv_urg().
Ahh, I see. Thanks for explaining.
One more thing about unix sockmap. AF_UNIX SOCK_DGRAM supports 0-length
packets. But sockmap doesn't handle that; once a 0-length skb/msg is in the
psock queue, unix_bpf_recvmsg() starts throwing -EFAULT. Sockmap'ed AF_INET
SOCK_DGRAM does the same, so is this a bug or a feature?
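For reference, outside sockmap a 0-length datagram is perfectly valid
(rough sketch):
>>> from socket import *
>>> d1, d2 = socketpair(AF_UNIX, SOCK_DGRAM)
>>> d1.send(b'')
0
>>> d2.recv(16)  # the empty datagram is delivered as b'', not an error
b''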
^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH v2 net 01/15] af_unix: Set sk->sk_state under unix_state_lock() for truly disconencted peer.
2024-06-26 10:48 ` Michal Luczaj
@ 2024-06-26 21:56 ` Kuniyuki Iwashima
0 siblings, 0 replies; 32+ messages in thread
From: Kuniyuki Iwashima @ 2024-06-26 21:56 UTC (permalink / raw)
To: mhal; +Cc: cong.wang, davem, edumazet, kuba, kuni1840, kuniyu, netdev,
pabeni
From: Michal Luczaj <mhal@rbox.co>
Date: Wed, 26 Jun 2024 12:48:27 +0200
> On 6/23/24 07:19, Kuniyuki Iwashima wrote:
> > From: Michal Luczaj <mhal@rbox.co>
> > Date: Sun, 23 Jun 2024 00:43:27 +0200
> >> I gotta ask, is there a reason for unlinking an already consumed
> >> ('consumed' as in 'unix_skb_len(skb) == 0') skb so late, in manage_oob()?
> >> IOW, can't it be unlinked immediately once it's consumed in
> >> unix_stream_recv_urg()? I suppose that would simplify things.
> >
> > I also thought that before, but we can't do that.
> >
> > Even after reading OOB data, we need to remember the position
> > and break recv() at that point. That's why the skb is unlinked
> > in manage_oob() rather than unix_stream_recv_urg().
>
> Ahh, I see. Thanks for explaining.
>
> One more thing about unix sockmap. AF_UNIX SOCK_DGRAM supports 0-length
> packets. But sockmap doesn't handle that; once a 0-length skb/msg is in the
> psock queue, unix_bpf_recvmsg() starts throwing -EFAULT. Sockmap'ed AF_INET
> SOCK_DGRAM does the same, so is this a bug or a feature?
I guess it's a kind of safeguard.
A return value of 0 has a special meaning for SOCK_STREAM: EOF/shutdown().
If we passed a 0-byte dgram through to a SOCK_STREAM sk, the application
would be confused as if the original peer had disconnected.
At least -EFAULT avoids such confusion, so that only the true peer can
trigger a 0-byte return via the saved ->recvmsg().
So the requirement would be similar to scm handling: we need to recognize
the sockmap verdict and destination to support the full feature set.
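For reference, on the stream side a 0-byte return is already taken
(plain-socket sketch, outside sockmap):
>>> from socket import *
>>> s1, s2 = socketpair(AF_UNIX)  # SOCK_STREAM by default
>>> s1.shutdown(SHUT_WR)
>>> s2.recv(16)  # b'' here already means EOF, not an empty message
b''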
^ permalink raw reply [flat|nested] 32+ messages in thread
end of thread
Thread overview: 32+ messages
2024-06-04 16:52 [PATCH v2 net 00/15] af_unix: Fix lockless access of sk->sk_state and others fields Kuniyuki Iwashima
2024-06-04 16:52 ` [PATCH v2 net 01/15] af_unix: Set sk->sk_state under unix_state_lock() for truly disconencted peer Kuniyuki Iwashima
2024-06-09 11:28 ` Michal Luczaj
2024-06-09 19:53 ` Kuniyuki Iwashima
2024-06-09 21:03 ` Kuniyuki Iwashima
2024-06-10 12:55 ` Michal Luczaj
2024-06-10 17:49 ` Kuniyuki Iwashima
2024-06-16 23:28 ` Michal Luczaj
2024-06-17 18:21 ` Kuniyuki Iwashima
2024-06-19 18:14 ` Michal Luczaj
2024-06-19 19:19 ` Kuniyuki Iwashima
2024-06-20 20:35 ` Michal Luczaj
2024-06-20 21:56 ` Kuniyuki Iwashima
2024-06-22 22:43 ` Michal Luczaj
2024-06-23 5:19 ` Kuniyuki Iwashima
2024-06-26 10:48 ` Michal Luczaj
2024-06-26 21:56 ` Kuniyuki Iwashima
2024-06-04 16:52 ` [PATCH v2 net 02/15] af_unix: Annodate data-races around sk->sk_state for writers Kuniyuki Iwashima
2024-06-04 16:52 ` [PATCH v2 net 03/15] af_unix: Annotate data-race of sk->sk_state in unix_inq_len() Kuniyuki Iwashima
2024-06-04 16:52 ` [PATCH v2 net 04/15] af_unix: Annotate data-races around sk->sk_state in unix_write_space() and poll() Kuniyuki Iwashima
2024-06-04 16:52 ` [PATCH v2 net 05/15] af_unix: Annotate data-race of sk->sk_state in unix_stream_connect() Kuniyuki Iwashima
2024-06-04 16:52 ` [PATCH v2 net 06/15] af_unix: Annotate data-race of sk->sk_state in unix_accept() Kuniyuki Iwashima
2024-06-04 16:52 ` [PATCH v2 net 07/15] af_unix: Annotate data-races around sk->sk_state in sendmsg() and recvmsg() Kuniyuki Iwashima
2024-06-04 16:52 ` [PATCH v2 net 08/15] af_unix: Annotate data-race of sk->sk_state in unix_stream_read_skb() Kuniyuki Iwashima
2024-06-04 16:52 ` [PATCH v2 net 09/15] af_unix: Annotate data-races around sk->sk_state in UNIX_DIAG Kuniyuki Iwashima
2024-06-04 16:52 ` [PATCH v2 net 10/15] af_unix: Annotate data-races around sk->sk_sndbuf Kuniyuki Iwashima
2024-06-04 16:52 ` [PATCH v2 net 11/15] af_unix: Annotate data-race of net->unx.sysctl_max_dgram_qlen Kuniyuki Iwashima
2024-06-04 16:52 ` [PATCH v2 net 12/15] af_unix: Use unix_recvq_full_lockless() in unix_stream_connect() Kuniyuki Iwashima
2024-06-04 16:52 ` [PATCH v2 net 13/15] af_unix: Use skb_queue_empty_lockless() in unix_release_sock() Kuniyuki Iwashima
2024-06-04 16:52 ` [PATCH v2 net 14/15] af_unix: Use skb_queue_len_lockless() in sk_diag_show_rqlen() Kuniyuki Iwashima
2024-06-04 16:52 ` [PATCH v2 net 15/15] af_unix: Annotate data-race of sk->sk_shutdown in sk_diag_fill() Kuniyuki Iwashima
2024-06-06 11:10 ` [PATCH v2 net 00/15] af_unix: Fix lockless access of sk->sk_state and others fields patchwork-bot+netdevbpf