netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net 0/2] af_unix: Garbage collector vs connect() race condition
@ 2024-04-08 15:58 Michal Luczaj
  2024-04-08 15:58 ` [PATCH net 1/2] af_unix: Fix garbage collector racing against connect() Michal Luczaj
  2024-04-08 15:58 ` [PATCH net 2/2] af_unix: Add GC race reproducer + slow down unix_stream_connect() Michal Luczaj
  0 siblings, 2 replies; 8+ messages in thread
From: Michal Luczaj @ 2024-04-08 15:58 UTC (permalink / raw)
  To: netdev; +Cc: davem, edumazet, kuba, pabeni, kuniyu, Michal Luczaj

Garbage collector does not take into account the risk of embryo getting
enqueued during the garbage collection. If such embryo has a peer that
carries SCM_RIGHTS, two consecutive passes of scan_children() may see a
different set of children. Leading to an incorrectly elevated inflight
count, and then a dangling pointer within the gc_inflight_list.

sockets are AF_UNIX/SOCK_STREAM
S is an unconnected socket
L is a listening in-flight socket bound to addr, not in fdtable
V's fd will be passed via sendmsg(), gets inflight count bumped

connect(S, addr)	sendmsg(S, [V]); close(V)	__unix_gc()
----------------	-------------------------	-----------

NS = unix_create1()
skb1 = sock_wmalloc(NS)
L = unix_find_other(addr)
unix_state_lock(L)
unix_peer(S) = NS
			// V count=1 inflight=0

 			NS = unix_peer(S)
 			skb2 = sock_alloc()
			skb_queue_tail(NS, skb2[V])

			// V became in-flight
			// V count=2 inflight=1

			close(V)

			// V count=1 inflight=1
			// GC candidate condition met

						for u in gc_inflight_list:
						  if (total_refs == inflight_refs)
						    add u to gc_candidates

						// gc_candidates={L, V}

						for u in gc_candidates:
						  scan_children(u, dec_inflight)

						// embryo (skb1) was not
						// reachable from L yet, so V's
						// inflight remains unchanged
__skb_queue_tail(L, skb1)
unix_state_unlock(L)
						for u in gc_candidates:
						  if (u.inflight)
						    scan_children(u, inc_inflight_move_tail)

						// V count=1 inflight=2 (!)

The idea behind the patch is to unix_state_lock()/unlock() L during the
gc_candidates selection. That would guarantee either connect() has
concluded, or - if we raced it - parallel sendmsg() with SCM_RIGHTS could
not happen as we are already holding the unix_gc_lock.

Running the reproducer with mdelay(1) stuffed in unix_stream_connect()
results in immediate splats:

$ ./tools/testing/selftests/net/af_unix/gc_vs_connect
running
[   47.019387] WARNING: CPU: 3 PID: 12 at net/unix/garbage.c:284 __unix_gc+0x473/0x4a0
[   47.019405] Modules linked in: 9p netfs kvm_intel kvm 9pnet_virtio 9pnet i2c_piix4 zram crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel virtio_blk serio_raw fuse qemu_fw_cfg virtio_console
[   47.019419] CPU: 3 PID: 12 Comm: kworker/u32:1 Not tainted 6.9.0-rc2nokasan+ #1
[   47.019421] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[   47.019422] Workqueue: events_unbound __unix_gc
[   47.019424] RIP: 0010:__unix_gc+0x473/0x4a0
[   47.019425] Code: 8d bb d0 f9 ff ff 48 0f ba 73 58 01 0f b6 83 e2 f9 ff ff 31 d2 48 c7 c6 a0 c3 de 81 3c 0a 74 18 e8 e2 f8 ff ff e9 96 fd ff ff <0f> 0b e9 ef fb ff ff 0f 0b e9 39 fc ff ff e8 5a fa ff ff e9 7e fd
[   47.019427] RSP: 0018:ffffc9000006bd90 EFLAGS: 00010297
[   47.019429] RAX: 0000000000000002 RBX: ffff888109459680 RCX: 000000003342a6b0
[   47.019430] RDX: 0000000000000001 RSI: ffffffff82634e8c RDI: ffffffff83161740
[   47.019431] RBP: ffffc9000006be40 R08: 00000000000003d7 R09: 00000000000003d7
[   47.019432] R10: 0000000000000000 R11: 0000000000000001 R12: ffffffff831610e0
[   47.019433] R13: ffff888109459cb0 R14: ffffc9000006bd90 R15: ffffffff831616c0
[   47.019434] FS:  0000000000000000(0000) GS:ffff88842fb80000(0000) knlGS:0000000000000000
[   47.019435] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   47.019436] CR2: 000055c16e3893b8 CR3: 0000000002e37000 CR4: 0000000000750ef0
[   47.019438] PKRU: 55555554
[   47.019439] Call Trace:
[   47.019440]  <TASK>
[   47.019442]  ? __warn+0x88/0x180
[   47.019445]  ? __unix_gc+0x473/0x4a0
[   47.019447]  ? report_bug+0x189/0x1c0
[   47.019451]  ? handle_bug+0x38/0x70
[   47.019453]  ? exc_invalid_op+0x13/0x60
[   47.019454]  ? asm_exc_invalid_op+0x16/0x20
[   47.019460]  ? __unix_gc+0x473/0x4a0
[   47.019464]  ? lock_acquire+0xd5/0x2c0
[   47.019466]  ? lock_release+0x133/0x290
[   47.019471]  process_one_work+0x215/0x700
[   47.019473]  ? move_linked_works+0x70/0xa0
[   47.019477]  worker_thread+0x1ca/0x3b0
[   47.019479]  ? rescuer_thread+0x340/0x340
[   47.019480]  kthread+0xdd/0x110
[   47.019482]  ? kthread_complete_and_exit+0x20/0x20
[   47.019485]  ret_from_fork+0x2d/0x50
[   47.019487]  ? kthread_complete_and_exit+0x20/0x20
[   47.019489]  ret_from_fork_asm+0x11/0x20
[   47.019495]  </TASK>
[   47.019496] irq event stamp: 446769
[   47.019497] hardirqs last  enabled at (446775): [<ffffffff81199a69>] console_unlock+0xf9/0x120
[   47.019499] hardirqs last disabled at (446780): [<ffffffff81199a4e>] console_unlock+0xde/0x120
[   47.019501] softirqs last  enabled at (444340): [<ffffffff810fcd73>] __irq_exit_rcu+0x93/0x100
[   47.019504] softirqs last disabled at (444333): [<ffffffff810fcd73>] __irq_exit_rcu+0x93/0x100
[   47.019505] ---[ end trace 0000000000000000 ]---
...
[   47.019555] list_add corruption. prev->next should be next (ffffffff83161710), but was ffff888109459cb0. (prev=ffff888109459cb0).
[   47.019572] kernel BUG at lib/list_debug.c:32!
[   47.019654] invalid opcode: 0000 [#1] PREEMPT SMP
[   47.019676] CPU: 0 PID: 1057 Comm: gc_vs_connect Tainted: G        W          6.9.0-rc2nokasan+ #1
[   47.019698] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[   47.019720] RIP: 0010:__list_add_valid_or_report+0x70/0x90
[   47.019743] Code: 98 ff 0f 0b 48 89 c1 48 c7 c7 90 1a 72 82 e8 77 ed 98 ff 0f 0b 48 89 d1 48 89 c6 4c 89 c2 48 c7 c7 e8 1a 72 82 e8 60 ed 98 ff <0f> 0b 48 89 f2 48 89 c1 48 89 fe 48 c7 c7 40 1b 72 82 e8 49 ed 98
[   47.019766] RSP: 0018:ffffc9000125fba8 EFLAGS: 00010286
[   47.019790] RAX: 0000000000000075 RBX: ffff888109459680 RCX: 0000000000000000
[   47.019814] RDX: 0000000000000002 RSI: ffffffff82667570 RDI: 00000000ffffffff
[   47.019838] RBP: ffff8881067bc400 R08: 0000000000000000 R09: 0000000000000003
[   47.019861] R10: ffffc9000125fa78 R11: ffffffff82f571e8 R12: ffff888109459cb0
[   47.019883] R13: ffff888109459cb0 R14: ffff88810945ad00 R15: 0000000000000000
[   47.019906] FS:  00007fba6fde1680(0000) GS:ffff88842fa00000(0000) knlGS:0000000000000000
[   47.019928] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   47.019949] CR2: 00007fba6fc5ac80 CR3: 0000000104819000 CR4: 0000000000750ef0
[   47.019972] PKRU: 55555554
[   47.019993] Call Trace:
[   47.020014]  <TASK>
[   47.020034]  ? die+0x32/0x80
[   47.020056]  ? do_trap+0xd5/0x100
[   47.020078]  ? __list_add_valid_or_report+0x70/0x90
[   47.020100]  ? __list_add_valid_or_report+0x70/0x90
[   47.020122]  ? do_error_trap+0x81/0x110
[   47.020144]  ? __list_add_valid_or_report+0x70/0x90
[   47.020166]  ? exc_invalid_op+0x4c/0x60
[   47.020189]  ? __list_add_valid_or_report+0x70/0x90
[   47.020211]  ? asm_exc_invalid_op+0x16/0x20
[   47.020237]  ? __list_add_valid_or_report+0x70/0x90
[   47.020355]  ? __list_add_valid_or_report+0x70/0x90
[   47.020424]  unix_inflight+0x6a/0xf0
[   47.020483]  unix_scm_to_skb+0xe4/0x160
[   47.020518]  unix_stream_sendmsg+0x174/0x630
[   47.020573]  __sock_sendmsg+0x38/0x70
[   47.020635]  ____sys_sendmsg+0x237/0x2a0
[   47.020694]  ? import_iovec+0x16/0x20
[   47.020754]  ___sys_sendmsg+0x86/0xd0
[   47.020791]  ? find_held_lock+0x2b/0x80
[   47.021011]  ? lock_release+0x133/0x290
[   47.021467]  ? __fget_files+0xca/0x180
[   47.021564]  __sys_sendmsg+0x47/0x80
[   47.021663]  do_syscall_64+0x94/0x180
[   47.021751]  ? do_syscall_64+0xa1/0x180
[   47.021812]  ? do_syscall_64+0xa1/0x180
[   47.021864]  entry_SYSCALL_64_after_hwframe+0x46/0x4e
[   47.021890] RIP: 0033:0x7fba6fd15dbb
[   47.021917] Code: 48 89 e5 48 83 ec 20 89 55 ec 48 89 75 f0 89 7d f8 e8 69 2c f7 ff 8b 55 ec 48 8b 75 f0 41 89 c0 8b 7d f8 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2d 44 89 c7 48 89 45 f8 e8 c1 2c f7 ff 48 8b
[   47.021948] RSP: 002b:00007ffc1aa77410 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
[   47.021974] RAX: ffffffffffffffda RBX: 00007ffc1aa775c8 RCX: 00007fba6fd15dbb
[   47.022000] RDX: 0000000000000000 RSI: 00005625cb5df0e0 RDI: 0000000000000003
[   47.022025] RBP: 00007ffc1aa77430 R08: 0000000000000000 R09: 0000000000001033
[   47.022050] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000001
[   47.022076] R13: 0000000000000000 R14: 00007fba6fe2c000 R15: 00005625cb5dedd8
[   47.022103]  </TASK>
[   47.022127] Modules linked in: 9p netfs kvm_intel kvm 9pnet_virtio 9pnet i2c_piix4 zram crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel virtio_blk serio_raw fuse qemu_fw_cfg virtio_console
[   47.022209] ---[ end trace 0000000000000000 ]---
[   47.022233] RIP: 0010:__list_add_valid_or_report+0x70/0x90
[   47.022258] Code: 98 ff 0f 0b 48 89 c1 48 c7 c7 90 1a 72 82 e8 77 ed 98 ff 0f 0b 48 89 d1 48 89 c6 4c 89 c2 48 c7 c7 e8 1a 72 82 e8 60 ed 98 ff <0f> 0b 48 89 f2 48 89 c1 48 89 fe 48 c7 c7 40 1b 72 82 e8 49 ed 98
[   47.022284] RSP: 0018:ffffc9000125fba8 EFLAGS: 00010286
[   47.022308] RAX: 0000000000000075 RBX: ffff888109459680 RCX: 0000000000000000
[   47.022332] RDX: 0000000000000002 RSI: ffffffff82667570 RDI: 00000000ffffffff
[   47.022356] RBP: ffff8881067bc400 R08: 0000000000000000 R09: 0000000000000003
[   47.022406] R10: ffffc9000125fa78 R11: ffffffff82f571e8 R12: ffff888109459cb0
[   47.022451] R13: ffff888109459cb0 R14: ffff88810945ad00 R15: 0000000000000000
[   47.022517] FS:  00007fba6fde1680(0000) GS:ffff88842fa00000(0000) knlGS:0000000000000000
[   47.022606] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   47.022653] CR2: 00007fba6fc5ac80 CR3: 0000000104819000 CR4: 0000000000750ef0
[   47.022698] PKRU: 55555554
[   47.022764] note: gc_vs_connect[1057] exited with preempt_count 1

Michal Luczaj (2):
  af_unix: Fix garbage collector racing against connect()
  af_unix: Add GC race reproducer + slow down unix_stream_connect()

 net/unix/af_unix.c                            |   2 +
 net/unix/garbage.c                            |  20 ++-
 tools/testing/selftests/net/af_unix/Makefile  |   2 +-
 .../selftests/net/af_unix/gc_vs_connect.c     | 158 ++++++++++++++++++
 4 files changed, 178 insertions(+), 4 deletions(-)
 create mode 100644 tools/testing/selftests/net/af_unix/gc_vs_connect.c

-- 
2.44.0


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH net 1/2] af_unix: Fix garbage collector racing against connect()
  2024-04-08 15:58 [PATCH net 0/2] af_unix: Garbage collector vs connect() race condition Michal Luczaj
@ 2024-04-08 15:58 ` Michal Luczaj
  2024-04-08 21:18   ` Kuniyuki Iwashima
  2024-04-08 15:58 ` [PATCH net 2/2] af_unix: Add GC race reproducer + slow down unix_stream_connect() Michal Luczaj
  1 sibling, 1 reply; 8+ messages in thread
From: Michal Luczaj @ 2024-04-08 15:58 UTC (permalink / raw)
  To: netdev; +Cc: davem, edumazet, kuba, pabeni, kuniyu, Michal Luczaj

Garbage collector does not take into account the risk of embryo getting
enqueued during the garbage collection. If such embryo has a peer that
carries SCM_RIGHTS, two consecutive passes of scan_children() may see a
different set of children. Leading to an incorrectly elevated inflight
count, and then a dangling pointer within the gc_inflight_list.

sockets are AF_UNIX/SOCK_STREAM
S is an unconnected socket
L is a listening in-flight socket bound to addr, not in fdtable
V's fd will be passed via sendmsg(), gets inflight count bumped

connect(S, addr)	sendmsg(S, [V]); close(V)	__unix_gc()
----------------	-------------------------	-----------

NS = unix_create1()
skb1 = sock_wmalloc(NS)
L = unix_find_other(addr)
unix_state_lock(L)
unix_peer(S) = NS
			// V count=1 inflight=0

 			NS = unix_peer(S)
 			skb2 = sock_alloc()
			skb_queue_tail(NS, skb2[V])

			// V became in-flight
			// V count=2 inflight=1

			close(V)

			// V count=1 inflight=1
			// GC candidate condition met

						for u in gc_inflight_list:
						  if (total_refs == inflight_refs)
						    add u to gc_candidates

						// gc_candidates={L, V}

						for u in gc_candidates:
						  scan_children(u, dec_inflight)

						// embryo (skb1) was not
						// reachable from L yet, so V's
						// inflight remains unchanged
__skb_queue_tail(L, skb1)
unix_state_unlock(L)
						for u in gc_candidates:
						  if (u.inflight)
						    scan_children(u, inc_inflight_move_tail)

						// V count=1 inflight=2 (!)

If there is a GC-candidate listening socket, lock/unlock its state. This
makes GC wait until the end of any ongoing connect() to that socket. After
flipping the lock, a possibly SCM-laden embryo is already enqueued. And if
there is another connect() coming, its embryo won't carry SCM_RIGHTS as we
already took the unix_gc_lock.

Fixes: 1fd05ba5a2f2 ("[AF_UNIX]: Rewrite garbage collector, fixes race.")
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
 net/unix/garbage.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index fa39b6265238..cd3e8585ceb2 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -274,11 +274,20 @@ static void __unix_gc(struct work_struct *work)
 	 * receive queues.  Other, non candidate sockets _can_ be
 	 * added to queue, so we must make sure only to touch
 	 * candidates.
+	 *
+	 * Embryos, though never candidates themselves, affect which
+	 * candidates are reachable by the garbage collector.  Before
+	 * being added to a listener's queue, an embryo may already
+	 * receive data carrying SCM_RIGHTS, potentially making the
+	 * passed socket a candidate that is not yet reachable by the
+	 * collector.  It becomes reachable once the embryo is
+	 * enqueued.  Therefore, we must ensure that no SCM-laden
+	 * embryo appears in a (candidate) listener's queue between
+	 * consecutive scan_children() calls.
 	 */
 	list_for_each_entry_safe(u, next, &gc_inflight_list, link) {
-		long total_refs;
-
-		total_refs = file_count(u->sk.sk_socket->file);
+		struct sock *sk = &u->sk;
+		long total_refs = file_count(sk->sk_socket->file);
 
 		WARN_ON_ONCE(!u->inflight);
 		WARN_ON_ONCE(total_refs < u->inflight);
@@ -286,6 +295,11 @@ static void __unix_gc(struct work_struct *work)
 			list_move_tail(&u->link, &gc_candidates);
 			__set_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
 			__set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
+
+			if (sk->sk_state == TCP_LISTEN) {
+				unix_state_lock(sk);
+				unix_state_unlock(sk);
+			}
 		}
 	}
 
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH net 2/2] af_unix: Add GC race reproducer + slow down unix_stream_connect()
  2024-04-08 15:58 [PATCH net 0/2] af_unix: Garbage collector vs connect() race condition Michal Luczaj
  2024-04-08 15:58 ` [PATCH net 1/2] af_unix: Fix garbage collector racing against connect() Michal Luczaj
@ 2024-04-08 15:58 ` Michal Luczaj
  1 sibling, 0 replies; 8+ messages in thread
From: Michal Luczaj @ 2024-04-08 15:58 UTC (permalink / raw)
  To: netdev; +Cc: davem, edumazet, kuba, pabeni, kuniyu, Michal Luczaj

Attempt to crash kernel racing unix socket garbage collector.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
 net/unix/af_unix.c                            |   2 +
 tools/testing/selftests/net/af_unix/Makefile  |   2 +-
 .../selftests/net/af_unix/gc_vs_connect.c     | 158 ++++++++++++++++++
 3 files changed, 161 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/net/af_unix/gc_vs_connect.c

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 5b41e2321209..8e56a094dc80 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1636,6 +1636,8 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
 
 	unix_state_unlock(sk);
 
+	mdelay(1);
+
 	/* take ten and send info to listening sock */
 	spin_lock(&other->sk_receive_queue.lock);
 	__skb_queue_tail(&other->sk_receive_queue, skb);
diff --git a/tools/testing/selftests/net/af_unix/Makefile b/tools/testing/selftests/net/af_unix/Makefile
index 221c387a7d7f..3b12a9290e06 100644
--- a/tools/testing/selftests/net/af_unix/Makefile
+++ b/tools/testing/selftests/net/af_unix/Makefile
@@ -1,4 +1,4 @@
 CFLAGS += $(KHDR_INCLUDES)
-TEST_GEN_PROGS := diag_uid test_unix_oob unix_connect scm_pidfd
+TEST_GEN_PROGS := diag_uid test_unix_oob unix_connect scm_pidfd gc_vs_connect
 
 include ../../lib.mk
diff --git a/tools/testing/selftests/net/af_unix/gc_vs_connect.c b/tools/testing/selftests/net/af_unix/gc_vs_connect.c
new file mode 100644
index 000000000000..8b724f1616dd
--- /dev/null
+++ b/tools/testing/selftests/net/af_unix/gc_vs_connect.c
@@ -0,0 +1,158 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <errno.h>
+#include <string.h>
+#include <assert.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <pthread.h>
+#include <sys/un.h>
+#include <sys/socket.h>
+
+#define SOCK_TYPE	SOCK_STREAM	/* or SOCK_SEQPACKET */
+
+union {
+	char buf[CMSG_SPACE(sizeof(int))];
+	struct cmsghdr align;
+} cbuf;
+
+struct iovec io = {
+	.iov_base = (char[1]) {0},
+	.iov_len = 1
+};
+
+struct msghdr msg = {
+	.msg_iov = &io,
+	.msg_iovlen = 1,
+	.msg_control = cbuf.buf,
+	.msg_controllen = sizeof(cbuf.buf)
+};
+
+pthread_barrier_t barr;
+struct sockaddr_un saddr;
+int salen, client;
+
+static void barrier(void)
+{
+	int ret = pthread_barrier_wait(&barr);
+
+	assert(!ret || ret == PTHREAD_BARRIER_SERIAL_THREAD);
+}
+
+static int socket_unix(void)
+{
+	int sock = socket(AF_UNIX, SOCK_TYPE, 0);
+
+	assert(sock != -1);
+	return sock;
+}
+
+static int recv_fd(int socket)
+{
+	struct cmsghdr *cmsg;
+	int ret, fd;
+
+	ret = recvmsg(socket, &msg, 0);
+	assert(ret == 1 && !(msg.msg_flags & MSG_CTRUNC));
+
+	cmsg = CMSG_FIRSTHDR(&msg);
+	assert(cmsg);
+	memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
+	assert(fd >= 0);
+
+	return fd;
+}
+
+static void send_fd(int socket, int fd)
+{
+	struct cmsghdr *cmsg;
+	int ret;
+
+	cmsg = CMSG_FIRSTHDR(&msg);
+	assert(cmsg);
+	cmsg->cmsg_level = SOL_SOCKET;
+	cmsg->cmsg_type = SCM_RIGHTS;
+	cmsg->cmsg_len = CMSG_LEN(sizeof(fd));
+
+	memcpy(CMSG_DATA(cmsg), &fd, sizeof(fd));
+
+	do {
+		ret = sendmsg(socket, &msg, 0);
+		assert(ret == 1 || (ret == -1 && errno == ENOTCONN));
+	} while (ret != 1);
+}
+
+static void *racer_connect(void *arg)
+{
+	for (;;) {
+		int ret;
+
+		barrier();
+		ret = connect(client, (struct sockaddr *)&saddr, salen);
+		assert(!ret);
+		barrier();
+	}
+
+	return NULL;
+}
+
+static void *racer_gc(void *arg)
+{
+	for (;;)
+		close(socket_unix()); /* trigger GC */
+
+	return NULL;
+}
+
+int main(void)
+{
+	pthread_t thread_conn, thread_gc;
+	int ret, pair[2];
+
+	printf("running\n");
+
+	ret = pthread_barrier_init(&barr, NULL, 2);
+	assert(!ret);
+
+	ret = pthread_create(&thread_conn, NULL, racer_connect, NULL);
+	assert(!ret);
+
+	ret = pthread_create(&thread_gc, NULL, racer_gc, NULL);
+	assert(!ret);
+
+	ret = socketpair(AF_UNIX, SOCK_TYPE, 0, pair);
+	assert(!ret);
+
+	saddr.sun_family = AF_UNIX;
+	salen = sizeof(saddr.sun_family) +
+		sprintf(saddr.sun_path, "%c/unix-gc-%d", '\0', getpid());
+
+	for (;;) {
+		int server, victim;
+
+		server = socket_unix();
+		ret = bind(server, (struct sockaddr *)&saddr, salen);
+		assert(!ret);
+		ret = listen(server, -1);
+		assert(!ret);
+
+		send_fd(pair[0], server);
+		close(server);
+
+		client = socket_unix();
+		victim = socket_unix();
+
+		barrier();
+		send_fd(client, victim);
+		close(victim);
+		barrier();
+
+		server = recv_fd(pair[1]);
+		close(client);
+		close(server);
+	}
+
+	return 0;
+}
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH net 1/2] af_unix: Fix garbage collector racing against connect()
  2024-04-08 15:58 ` [PATCH net 1/2] af_unix: Fix garbage collector racing against connect() Michal Luczaj
@ 2024-04-08 21:18   ` Kuniyuki Iwashima
  2024-04-08 23:25     ` Michal Luczaj
  0 siblings, 1 reply; 8+ messages in thread
From: Kuniyuki Iwashima @ 2024-04-08 21:18 UTC (permalink / raw)
  To: mhal; +Cc: davem, edumazet, kuba, kuniyu, netdev, pabeni

From: Michal Luczaj <mhal@rbox.co>
Date: Mon,  8 Apr 2024 17:58:45 +0200
> Garbage collector does not take into account the risk of embryo getting
> enqueued during the garbage collection. If such embryo has a peer that
> carries SCM_RIGHTS, two consecutive passes of scan_children() may see a
> different set of children. Leading to an incorrectly elevated inflight
> count, and then a dangling pointer within the gc_inflight_list.
> 
> sockets are AF_UNIX/SOCK_STREAM
> S is an unconnected socket
> L is a listening in-flight socket bound to addr, not in fdtable
> V's fd will be passed via sendmsg(), gets inflight count bumped
> 
> connect(S, addr)	sendmsg(S, [V]); close(V)	__unix_gc()
> ----------------	-------------------------	-----------
> 
> NS = unix_create1()
> skb1 = sock_wmalloc(NS)
> L = unix_find_other(addr)
> unix_state_lock(L)
> unix_peer(S) = NS
> 			// V count=1 inflight=0
> 
>  			NS = unix_peer(S)
>  			skb2 = sock_alloc()
> 			skb_queue_tail(NS, skb2[V])
> 
> 			// V became in-flight
> 			// V count=2 inflight=1
> 
> 			close(V)
> 
> 			// V count=1 inflight=1
> 			// GC candidate condition met
> 
> 						for u in gc_inflight_list:
> 						  if (total_refs == inflight_refs)
> 						    add u to gc_candidates
> 
> 						// gc_candidates={L, V}
> 
> 						for u in gc_candidates:
> 						  scan_children(u, dec_inflight)
> 
> 						// embryo (skb1) was not
> 						// reachable from L yet, so V's
> 						// inflight remains unchanged
> __skb_queue_tail(L, skb1)
> unix_state_unlock(L)
> 						for u in gc_candidates:
> 						  if (u.inflight)
> 						    scan_children(u, inc_inflight_move_tail)
> 
> 						// V count=1 inflight=2 (!)
> 
> If there is a GC-candidate listening socket, lock/unlock its state. This
> makes GC wait until the end of any ongoing connect() to that socket. After
> flipping the lock, a possibly SCM-laden embryo is already enqueued. And if
> there is another connect() coming, its embryo won't carry SCM_RIGHTS as we
> already took the unix_gc_lock.
> 
> Fixes: 1fd05ba5a2f2 ("[AF_UNIX]: Rewrite garbage collector, fixes race.")
> Signed-off-by: Michal Luczaj <mhal@rbox.co>
> ---
>  net/unix/garbage.c | 20 +++++++++++++++++---
>  1 file changed, 17 insertions(+), 3 deletions(-)
> 
> diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> index fa39b6265238..cd3e8585ceb2 100644
> --- a/net/unix/garbage.c
> +++ b/net/unix/garbage.c
> @@ -274,11 +274,20 @@ static void __unix_gc(struct work_struct *work)
>  	 * receive queues.  Other, non candidate sockets _can_ be
>  	 * added to queue, so we must make sure only to touch
>  	 * candidates.
> +	 *
> +	 * Embryos, though never candidates themselves, affect which
> +	 * candidates are reachable by the garbage collector.  Before
> +	 * being added to a listener's queue, an embryo may already
> +	 * receive data carrying SCM_RIGHTS, potentially making the
> +	 * passed socket a candidate that is not yet reachable by the
> +	 * collector.  It becomes reachable once the embryo is
> +	 * enqueued.  Therefore, we must ensure that no SCM-laden
> +	 * embryo appears in a (candidate) listener's queue between
> +	 * consecutive scan_children() calls.
>  	 */
>  	list_for_each_entry_safe(u, next, &gc_inflight_list, link) {
> -		long total_refs;
> -
> -		total_refs = file_count(u->sk.sk_socket->file);
> +		struct sock *sk = &u->sk;
> +		long total_refs = file_count(sk->sk_socket->file);
>  
>  		WARN_ON_ONCE(!u->inflight);
>  		WARN_ON_ONCE(total_refs < u->inflight);
> @@ -286,6 +295,11 @@ static void __unix_gc(struct work_struct *work)
>  			list_move_tail(&u->link, &gc_candidates);
>  			__set_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
>  			__set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
> +
> +			if (sk->sk_state == TCP_LISTEN) {
> +				unix_state_lock(sk);
> +				unix_state_unlock(sk);

Less likely though, what if the same connect() happens after this ?

connect(S, addr)	sendmsg(S, [V]); close(V)	__unix_gc()
----------------	-------------------------	-----------
NS = unix_create1()
skb1 = sock_wmalloc(NS)
L = unix_find_other(addr)
						for u in gc_inflight_list:
						  if (total_refs == inflight_refs)
						    add u to gc_candidates
						    // L was already traversed
						    // in a previous iteration.
unix_state_lock(L)
unix_peer(S) = NS

						// gc_candidates={L, V}

						for u in gc_candidates:
						  scan_children(u, dec_inflight)

						// embryo (skb1) was not
						// reachable from L yet, so V's
						// inflight remains unchanged
__skb_queue_tail(L, skb1)
unix_state_unlock(L)
						for u in gc_candidates:
						  if (u.inflight)
						    scan_children(u, inc_inflight_move_tail)

						// V count=1 inflight=2 (!)


As you pointed out, this GC's assumption is basically wrong; the GC
works correctly only when the set of traversed sockets does not change
over 3 scan_children() calls.

That's why I reworked the GC not to rely on receive queue.
https://lore.kernel.org/netdev/20240325202425.60930-1-kuniyu@amazon.com/


> +			}
>  		}
>  	}
>  
> -- 
> 2.44.0
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH net 1/2] af_unix: Fix garbage collector racing against connect()
  2024-04-08 21:18   ` Kuniyuki Iwashima
@ 2024-04-08 23:25     ` Michal Luczaj
  2024-04-09  0:22       ` Kuniyuki Iwashima
  0 siblings, 1 reply; 8+ messages in thread
From: Michal Luczaj @ 2024-04-08 23:25 UTC (permalink / raw)
  To: Kuniyuki Iwashima; +Cc: davem, edumazet, kuba, netdev, pabeni

On 4/8/24 23:18, Kuniyuki Iwashima wrote:
> From: Michal Luczaj <mhal@rbox.co>
> Date: Mon,  8 Apr 2024 17:58:45 +0200
...
>>  	list_for_each_entry_safe(u, next, &gc_inflight_list, link) {
>> -		long total_refs;
>> -
>> -		total_refs = file_count(u->sk.sk_socket->file);
>> +		struct sock *sk = &u->sk;
>> +		long total_refs = file_count(sk->sk_socket->file);
>>  
>>  		WARN_ON_ONCE(!u->inflight);
>>  		WARN_ON_ONCE(total_refs < u->inflight);
>> @@ -286,6 +295,11 @@ static void __unix_gc(struct work_struct *work)
>>  			list_move_tail(&u->link, &gc_candidates);
>>  			__set_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
>>  			__set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
>> +
>> +			if (sk->sk_state == TCP_LISTEN) {
>> +				unix_state_lock(sk);
>> +				unix_state_unlock(sk);
> 
> Less likely though, what if the same connect() happens after this ?
> 
> connect(S, addr)	sendmsg(S, [V]); close(V)	__unix_gc()
> ----------------	-------------------------	-----------
> NS = unix_create1()
> skb1 = sock_wmalloc(NS)
> L = unix_find_other(addr)
> 						for u in gc_inflight_list:
> 						  if (total_refs == inflight_refs)
> 						    add u to gc_candidates
> 						    // L was already traversed
> 						    // in a previous iteration.
> unix_state_lock(L)
> unix_peer(S) = NS
> 
> 						// gc_candidates={L, V}
> 
> 						for u in gc_candidates:
> 						  scan_children(u, dec_inflight)
> 
> 						// embryo (skb1) was not
> 						// reachable from L yet, so V's
> 						// inflight remains unchanged
> __skb_queue_tail(L, skb1)
> unix_state_unlock(L)
> 						for u in gc_candidates:
> 						  if (u.inflight)
> 						    scan_children(u, inc_inflight_move_tail)
> 
> 						// V count=1 inflight=2 (!)

If I understand your question, in this case L's queue technically does change
between scan_children()s: embryo appears, but that's meaningless. __unix_gc()
already holds unix_gc_lock, so the enqueued embryo can not carry any SCM_RIGHTS
(i.e. it doesn't affect the inflight graph). Note that unix_inflight() takes the
same unix_gc_lock.

Is there something I'm missing?

> As you pointed out, this GC's assumption is basically wrong; the GC
> works correctly only when the set of traversed sockets does not change
> over 3 scan_children() calls.
> 
> That's why I reworked the GC not to rely on receive queue.
> https://lore.kernel.org/netdev/20240325202425.60930-1-kuniyu@amazon.com/

Right, I'll try to get my head around your series :)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH net 1/2] af_unix: Fix garbage collector racing against connect()
  2024-04-08 23:25     ` Michal Luczaj
@ 2024-04-09  0:22       ` Kuniyuki Iwashima
  2024-04-09  9:16         ` Michal Luczaj
  0 siblings, 1 reply; 8+ messages in thread
From: Kuniyuki Iwashima @ 2024-04-09  0:22 UTC (permalink / raw)
  To: mhal; +Cc: davem, edumazet, kuba, kuniyu, netdev, pabeni

From: Michal Luczaj <mhal@rbox.co>
Date: Tue, 9 Apr 2024 01:25:23 +0200
> On 4/8/24 23:18, Kuniyuki Iwashima wrote:
> > From: Michal Luczaj <mhal@rbox.co>
> > Date: Mon,  8 Apr 2024 17:58:45 +0200
> ...
> >>  	list_for_each_entry_safe(u, next, &gc_inflight_list, link) {

Please move sk declaration here and


> >> -		long total_refs;
> >> -
> >> -		total_refs = file_count(u->sk.sk_socket->file);

keep these 3 lines for reverse xmax tree order.


> >> +		struct sock *sk = &u->sk;
> >> +		long total_refs = file_count(sk->sk_socket->file);
> >>  
> >>  		WARN_ON_ONCE(!u->inflight);
> >>  		WARN_ON_ONCE(total_refs < u->inflight);
> >> @@ -286,6 +295,11 @@ static void __unix_gc(struct work_struct *work)
> >>  			list_move_tail(&u->link, &gc_candidates);
> >>  			__set_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
> >>  			__set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
> >> +
> >> +			if (sk->sk_state == TCP_LISTEN) {
> >> +				unix_state_lock(sk);
> >> +				unix_state_unlock(sk);
> > 
> > Less likely though, what if the same connect() happens after this ?
> > 
> > connect(S, addr)	sendmsg(S, [V]); close(V)	__unix_gc()
> > ----------------	-------------------------	-----------
> > NS = unix_create1()
> > skb1 = sock_wmalloc(NS)
> > L = unix_find_other(addr)
> > 						for u in gc_inflight_list:
> > 						  if (total_refs == inflight_refs)
> > 						    add u to gc_candidates
> > 						    // L was already traversed
> > 						    // in a previous iteration.
> > unix_state_lock(L)
> > unix_peer(S) = NS
> > 
> > 						// gc_candidates={L, V}
> > 
> > 						for u in gc_candidates:
> > 						  scan_children(u, dec_inflight)
> > 
> > 						// embryo (skb1) was not
> > 						// reachable from L yet, so V's
> > 						// inflight remains unchanged
> > __skb_queue_tail(L, skb1)
> > unix_state_unlock(L)
> > 						for u in gc_candidates:
> > 						  if (u.inflight)
> > 						    scan_children(u, inc_inflight_move_tail)
> > 
> > 						// V count=1 inflight=2 (!)
> 
> If I understand your question, in this case L's queue technically does change
> between scan_children()s: embryo appears, but that's meaningless. __unix_gc()
> already holds unix_gc_lock, so the enqueued embryo can not carry any SCM_RIGHTS
> (i.e. it doesn't affect the inflight graph). Note that unix_inflight() takes the
> same unix_gc_lock.
> 
> Is there something I'm missing?

Ah exactly, you are right.

Could you repost this patch only with my comment above addressed ?

Thanks!

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH net 1/2] af_unix: Fix garbage collector racing against connect()
  2024-04-09  0:22       ` Kuniyuki Iwashima
@ 2024-04-09  9:16         ` Michal Luczaj
  2024-04-09 19:02           ` Kuniyuki Iwashima
  0 siblings, 1 reply; 8+ messages in thread
From: Michal Luczaj @ 2024-04-09  9:16 UTC (permalink / raw)
  To: Kuniyuki Iwashima; +Cc: davem, edumazet, kuba, netdev, pabeni

On 4/9/24 02:22, Kuniyuki Iwashima wrote:
> From: Michal Luczaj <mhal@rbox.co>
> Date: Tue, 9 Apr 2024 01:25:23 +0200
>> On 4/8/24 23:18, Kuniyuki Iwashima wrote:
>>> From: Michal Luczaj <mhal@rbox.co>
>>> Date: Mon,  8 Apr 2024 17:58:45 +0200
>> ...
>>>>  	list_for_each_entry_safe(u, next, &gc_inflight_list, link) {
> 
> Please move sk declaration here and
> 
>>>> -		long total_refs;
>>>> -
>>>> -		total_refs = file_count(u->sk.sk_socket->file);
> 
> keep these 3 lines for reverse xmax tree order.

Tricky to have them all 3 in reverse xmax. Did you mean

	struct sock *sk = &u->sk;
	long total_refs;

	total_refs = file_count(sk->sk_socket->file);

?

>>> connect(S, addr)	sendmsg(S, [V]); close(V)	__unix_gc()
>>> ----------------	-------------------------	-----------
>>> NS = unix_create1()
>>> skb1 = sock_wmalloc(NS)
>>> L = unix_find_other(addr)
>>> 						for u in gc_inflight_list:
>>> 						  if (total_refs == inflight_refs)
>>> 						    add u to gc_candidates
>>> 						    // L was already traversed
>>> 						    // in a previous iteration.
>>> unix_state_lock(L)
>>> unix_peer(S) = NS
>>>
>>> 						// gc_candidates={L, V}
>>>
>>> 						for u in gc_candidates:
>>> 						  scan_children(u, dec_inflight)
>>>
>>> 						// embryo (skb1) was not
>>> 						// reachable from L yet, so V's
>>> 						// inflight remains unchanged
>>> __skb_queue_tail(L, skb1)
>>> unix_state_unlock(L)
>>> 						for u in gc_candidates:
>>> 						  if (u.inflight)
>>> 						    scan_children(u, inc_inflight_move_tail)
>>>
>>> 						// V count=1 inflight=2 (!)
>>
>> If I understand your question, in this case L's queue technically does change
>> between scan_children()s: embryo appears, but that's meaningless. __unix_gc()
>> already holds unix_gc_lock, so the enqueued embryo can not carry any SCM_RIGHTS
>> (i.e. it doesn't affect the inflight graph). Note that unix_inflight() takes the
>> same unix_gc_lock.
>>
>> Is there something I'm missing?
> 
> Ah exactly, you are right.
> 
> Could you repost this patch only with my comment above addressed ?

Yeah, sure. One question though: what I wrote above is basically a rephrasing of
the commit message:

    (...) After flipping the lock, a possibly SCM-laden embryo is already
    enqueued. And if there is another connect() coming, its embryo won't
    carry SCM_RIGHTS as we already took the unix_gc_lock.

As I understand, the important missing part was the clarification that embryo,
even though enqueued after the lock flipping, won't affect the inflight graph,
right? So how about:

    (...) After flipping the lock, a possibly SCM-laden embryo is already
    enqueued. And if there is another embryo coming, it can not possibly carry
    SCM_RIGHTS. At this point, unix_inflight() can not happen because
    unix_gc_lock is already taken. Inflight graph remains unaffected.

Thanks!


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH net 1/2] af_unix: Fix garbage collector racing against connect()
  2024-04-09  9:16         ` Michal Luczaj
@ 2024-04-09 19:02           ` Kuniyuki Iwashima
  0 siblings, 0 replies; 8+ messages in thread
From: Kuniyuki Iwashima @ 2024-04-09 19:02 UTC (permalink / raw)
  To: mhal; +Cc: davem, edumazet, kuba, kuniyu, netdev, pabeni

From: Michal Luczaj <mhal@rbox.co>
Date: Tue, 9 Apr 2024 11:16:35 +0200
> On 4/9/24 02:22, Kuniyuki Iwashima wrote:
> > From: Michal Luczaj <mhal@rbox.co>
> > Date: Tue, 9 Apr 2024 01:25:23 +0200
> >> On 4/8/24 23:18, Kuniyuki Iwashima wrote:
> >>> From: Michal Luczaj <mhal@rbox.co>
> >>> Date: Mon,  8 Apr 2024 17:58:45 +0200
> >> ...
> >>>>  	list_for_each_entry_safe(u, next, &gc_inflight_list, link) {
> > 
> > Please move sk declaration here and
> > 
> >>>> -		long total_refs;
> >>>> -
> >>>> -		total_refs = file_count(u->sk.sk_socket->file);
> > 
> > keep these 3 lines for reverse xmax tree order.
> 
> Tricky to have them all 3 in reverse xmax. Did you mean
> 
> 	struct sock *sk = &u->sk;
> 	long total_refs;
> 
> 	total_refs = file_count(sk->sk_socket->file);
> 
> ?

Yes, it's netdev convention.
https://docs.kernel.org/process/maintainer-netdev.html#local-variable-ordering-reverse-xmas-tree-rcs


> 
> >>> connect(S, addr)	sendmsg(S, [V]); close(V)	__unix_gc()
> >>> ----------------	-------------------------	-----------
> >>> NS = unix_create1()
> >>> skb1 = sock_wmalloc(NS)
> >>> L = unix_find_other(addr)
> >>> 						for u in gc_inflight_list:
> >>> 						  if (total_refs == inflight_refs)
> >>> 						    add u to gc_candidates
> >>> 						    // L was already traversed
> >>> 						    // in a previous iteration.
> >>> unix_state_lock(L)
> >>> unix_peer(S) = NS
> >>>
> >>> 						// gc_candidates={L, V}
> >>>
> >>> 						for u in gc_candidates:
> >>> 						  scan_children(u, dec_inflight)
> >>>
> >>> 						// embryo (skb1) was not
> >>> 						// reachable from L yet, so V's
> >>> 						// inflight remains unchanged
> >>> __skb_queue_tail(L, skb1)
> >>> unix_state_unlock(L)
> >>> 						for u in gc_candidates:
> >>> 						  if (u.inflight)
> >>> 						    scan_children(u, inc_inflight_move_tail)
> >>>
> >>> 						// V count=1 inflight=2 (!)
> >>
> >> If I understand your question, in this case L's queue technically does change
> >> between scan_children()s: embryo appears, but that's meaningless. __unix_gc()
> >> already holds unix_gc_lock, so the enqueued embryo can not carry any SCM_RIGHTS
> >> (i.e. it doesn't affect the inflight graph). Note that unix_inflight() takes the
> >> same unix_gc_lock.
> >>
> >> Is there something I'm missing?
> > 
> > Ah exactly, you are right.
> > 
> > Could you repost this patch only with my comment above addressed ?
> 
> Yeah, sure. One question though: what I wrote above is basically a rephrasing of
> the commit message:
> 
>     (...) After flipping the lock, a possibly SCM-laden embryo is already
>     enqueued. And if there is another connect() coming, its embryo won't
>     carry SCM_RIGHTS as we already took the unix_gc_lock.
> 
> As I understand, the important missing part was the clarification that embryo,
> even though enqueued after the lock flipping, won't affect the inflight graph,
> right? So how about:
> 
>     (...) After flipping the lock, a possibly SCM-laden embryo is already
>     enqueued. And if there is another embryo coming, it can not possibly carry
>     SCM_RIGHTS. At this point, unix_inflight() can not happen because
>     unix_gc_lock is already taken. Inflight graph remains unaffected.

Sounds good to me.

Thanks!

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2024-04-09 19:29 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-04-08 15:58 [PATCH net 0/2] af_unix: Garbage collector vs connect() race condition Michal Luczaj
2024-04-08 15:58 ` [PATCH net 1/2] af_unix: Fix garbage collector racing against connect() Michal Luczaj
2024-04-08 21:18   ` Kuniyuki Iwashima
2024-04-08 23:25     ` Michal Luczaj
2024-04-09  0:22       ` Kuniyuki Iwashima
2024-04-09  9:16         ` Michal Luczaj
2024-04-09 19:02           ` Kuniyuki Iwashima
2024-04-08 15:58 ` [PATCH net 2/2] af_unix: Add GC race reproducer + slow down unix_stream_connect() Michal Luczaj

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).