* [PATCH net 0/2] af_unix: Garbage collector vs connect() race condition
@ 2024-04-08 15:58 Michal Luczaj
2024-04-08 15:58 ` [PATCH net 1/2] af_unix: Fix garbage collector racing against connect() Michal Luczaj
2024-04-08 15:58 ` [PATCH net 2/2] af_unix: Add GC race reproducer + slow down unix_stream_connect() Michal Luczaj
0 siblings, 2 replies; 8+ messages in thread
From: Michal Luczaj @ 2024-04-08 15:58 UTC (permalink / raw)
To: netdev; +Cc: davem, edumazet, kuba, pabeni, kuniyu, Michal Luczaj
Garbage collector does not take into account the risk of embryo getting
enqueued during the garbage collection. If such embryo has a peer that
carries SCM_RIGHTS, two consecutive passes of scan_children() may see a
different set of children. Leading to an incorrectly elevated inflight
count, and then a dangling pointer within the gc_inflight_list.
sockets are AF_UNIX/SOCK_STREAM
S is an unconnected socket
L is a listening in-flight socket bound to addr, not in fdtable
V's fd will be passed via sendmsg(), gets inflight count bumped
connect(S, addr) sendmsg(S, [V]); close(V) __unix_gc()
---------------- ------------------------- -----------
NS = unix_create1()
skb1 = sock_wmalloc(NS)
L = unix_find_other(addr)
unix_state_lock(L)
unix_peer(S) = NS
// V count=1 inflight=0
NS = unix_peer(S)
skb2 = sock_alloc()
skb_queue_tail(NS, skb2[V])
// V became in-flight
// V count=2 inflight=1
close(V)
// V count=1 inflight=1
// GC candidate condition met
for u in gc_inflight_list:
if (total_refs == inflight_refs)
add u to gc_candidates
// gc_candidates={L, V}
for u in gc_candidates:
scan_children(u, dec_inflight)
// embryo (skb1) was not
// reachable from L yet, so V's
// inflight remains unchanged
__skb_queue_tail(L, skb1)
unix_state_unlock(L)
for u in gc_candidates:
if (u.inflight)
scan_children(u, inc_inflight_move_tail)
// V count=1 inflight=2 (!)
The idea behind the patch is to unix_state_lock()/unlock() L during the
gc_candidates selection. That would guarantee either connect() has
concluded, or - if we raced it - parallel sendmsg() with SCM_RIGHTS could
not happen as we are already holding the unix_gc_lock.
Running the reproducer with mdelay(1) stuffed in unix_stream_connect()
results in immediate splats:
$ ./tools/testing/selftests/net/af_unix/gc_vs_connect
running
[ 47.019387] WARNING: CPU: 3 PID: 12 at net/unix/garbage.c:284 __unix_gc+0x473/0x4a0
[ 47.019405] Modules linked in: 9p netfs kvm_intel kvm 9pnet_virtio 9pnet i2c_piix4 zram crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel virtio_blk serio_raw fuse qemu_fw_cfg virtio_console
[ 47.019419] CPU: 3 PID: 12 Comm: kworker/u32:1 Not tainted 6.9.0-rc2nokasan+ #1
[ 47.019421] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[ 47.019422] Workqueue: events_unbound __unix_gc
[ 47.019424] RIP: 0010:__unix_gc+0x473/0x4a0
[ 47.019425] Code: 8d bb d0 f9 ff ff 48 0f ba 73 58 01 0f b6 83 e2 f9 ff ff 31 d2 48 c7 c6 a0 c3 de 81 3c 0a 74 18 e8 e2 f8 ff ff e9 96 fd ff ff <0f> 0b e9 ef fb ff ff 0f 0b e9 39 fc ff ff e8 5a fa ff ff e9 7e fd
[ 47.019427] RSP: 0018:ffffc9000006bd90 EFLAGS: 00010297
[ 47.019429] RAX: 0000000000000002 RBX: ffff888109459680 RCX: 000000003342a6b0
[ 47.019430] RDX: 0000000000000001 RSI: ffffffff82634e8c RDI: ffffffff83161740
[ 47.019431] RBP: ffffc9000006be40 R08: 00000000000003d7 R09: 00000000000003d7
[ 47.019432] R10: 0000000000000000 R11: 0000000000000001 R12: ffffffff831610e0
[ 47.019433] R13: ffff888109459cb0 R14: ffffc9000006bd90 R15: ffffffff831616c0
[ 47.019434] FS: 0000000000000000(0000) GS:ffff88842fb80000(0000) knlGS:0000000000000000
[ 47.019435] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 47.019436] CR2: 000055c16e3893b8 CR3: 0000000002e37000 CR4: 0000000000750ef0
[ 47.019438] PKRU: 55555554
[ 47.019439] Call Trace:
[ 47.019440] <TASK>
[ 47.019442] ? __warn+0x88/0x180
[ 47.019445] ? __unix_gc+0x473/0x4a0
[ 47.019447] ? report_bug+0x189/0x1c0
[ 47.019451] ? handle_bug+0x38/0x70
[ 47.019453] ? exc_invalid_op+0x13/0x60
[ 47.019454] ? asm_exc_invalid_op+0x16/0x20
[ 47.019460] ? __unix_gc+0x473/0x4a0
[ 47.019464] ? lock_acquire+0xd5/0x2c0
[ 47.019466] ? lock_release+0x133/0x290
[ 47.019471] process_one_work+0x215/0x700
[ 47.019473] ? move_linked_works+0x70/0xa0
[ 47.019477] worker_thread+0x1ca/0x3b0
[ 47.019479] ? rescuer_thread+0x340/0x340
[ 47.019480] kthread+0xdd/0x110
[ 47.019482] ? kthread_complete_and_exit+0x20/0x20
[ 47.019485] ret_from_fork+0x2d/0x50
[ 47.019487] ? kthread_complete_and_exit+0x20/0x20
[ 47.019489] ret_from_fork_asm+0x11/0x20
[ 47.019495] </TASK>
[ 47.019496] irq event stamp: 446769
[ 47.019497] hardirqs last enabled at (446775): [<ffffffff81199a69>] console_unlock+0xf9/0x120
[ 47.019499] hardirqs last disabled at (446780): [<ffffffff81199a4e>] console_unlock+0xde/0x120
[ 47.019501] softirqs last enabled at (444340): [<ffffffff810fcd73>] __irq_exit_rcu+0x93/0x100
[ 47.019504] softirqs last disabled at (444333): [<ffffffff810fcd73>] __irq_exit_rcu+0x93/0x100
[ 47.019505] ---[ end trace 0000000000000000 ]---
...
[ 47.019555] list_add corruption. prev->next should be next (ffffffff83161710), but was ffff888109459cb0. (prev=ffff888109459cb0).
[ 47.019572] kernel BUG at lib/list_debug.c:32!
[ 47.019654] invalid opcode: 0000 [#1] PREEMPT SMP
[ 47.019676] CPU: 0 PID: 1057 Comm: gc_vs_connect Tainted: G W 6.9.0-rc2nokasan+ #1
[ 47.019698] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[ 47.019720] RIP: 0010:__list_add_valid_or_report+0x70/0x90
[ 47.019743] Code: 98 ff 0f 0b 48 89 c1 48 c7 c7 90 1a 72 82 e8 77 ed 98 ff 0f 0b 48 89 d1 48 89 c6 4c 89 c2 48 c7 c7 e8 1a 72 82 e8 60 ed 98 ff <0f> 0b 48 89 f2 48 89 c1 48 89 fe 48 c7 c7 40 1b 72 82 e8 49 ed 98
[ 47.019766] RSP: 0018:ffffc9000125fba8 EFLAGS: 00010286
[ 47.019790] RAX: 0000000000000075 RBX: ffff888109459680 RCX: 0000000000000000
[ 47.019814] RDX: 0000000000000002 RSI: ffffffff82667570 RDI: 00000000ffffffff
[ 47.019838] RBP: ffff8881067bc400 R08: 0000000000000000 R09: 0000000000000003
[ 47.019861] R10: ffffc9000125fa78 R11: ffffffff82f571e8 R12: ffff888109459cb0
[ 47.019883] R13: ffff888109459cb0 R14: ffff88810945ad00 R15: 0000000000000000
[ 47.019906] FS: 00007fba6fde1680(0000) GS:ffff88842fa00000(0000) knlGS:0000000000000000
[ 47.019928] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 47.019949] CR2: 00007fba6fc5ac80 CR3: 0000000104819000 CR4: 0000000000750ef0
[ 47.019972] PKRU: 55555554
[ 47.019993] Call Trace:
[ 47.020014] <TASK>
[ 47.020034] ? die+0x32/0x80
[ 47.020056] ? do_trap+0xd5/0x100
[ 47.020078] ? __list_add_valid_or_report+0x70/0x90
[ 47.020100] ? __list_add_valid_or_report+0x70/0x90
[ 47.020122] ? do_error_trap+0x81/0x110
[ 47.020144] ? __list_add_valid_or_report+0x70/0x90
[ 47.020166] ? exc_invalid_op+0x4c/0x60
[ 47.020189] ? __list_add_valid_or_report+0x70/0x90
[ 47.020211] ? asm_exc_invalid_op+0x16/0x20
[ 47.020237] ? __list_add_valid_or_report+0x70/0x90
[ 47.020355] ? __list_add_valid_or_report+0x70/0x90
[ 47.020424] unix_inflight+0x6a/0xf0
[ 47.020483] unix_scm_to_skb+0xe4/0x160
[ 47.020518] unix_stream_sendmsg+0x174/0x630
[ 47.020573] __sock_sendmsg+0x38/0x70
[ 47.020635] ____sys_sendmsg+0x237/0x2a0
[ 47.020694] ? import_iovec+0x16/0x20
[ 47.020754] ___sys_sendmsg+0x86/0xd0
[ 47.020791] ? find_held_lock+0x2b/0x80
[ 47.021011] ? lock_release+0x133/0x290
[ 47.021467] ? __fget_files+0xca/0x180
[ 47.021564] __sys_sendmsg+0x47/0x80
[ 47.021663] do_syscall_64+0x94/0x180
[ 47.021751] ? do_syscall_64+0xa1/0x180
[ 47.021812] ? do_syscall_64+0xa1/0x180
[ 47.021864] entry_SYSCALL_64_after_hwframe+0x46/0x4e
[ 47.021890] RIP: 0033:0x7fba6fd15dbb
[ 47.021917] Code: 48 89 e5 48 83 ec 20 89 55 ec 48 89 75 f0 89 7d f8 e8 69 2c f7 ff 8b 55 ec 48 8b 75 f0 41 89 c0 8b 7d f8 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2d 44 89 c7 48 89 45 f8 e8 c1 2c f7 ff 48 8b
[ 47.021948] RSP: 002b:00007ffc1aa77410 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
[ 47.021974] RAX: ffffffffffffffda RBX: 00007ffc1aa775c8 RCX: 00007fba6fd15dbb
[ 47.022000] RDX: 0000000000000000 RSI: 00005625cb5df0e0 RDI: 0000000000000003
[ 47.022025] RBP: 00007ffc1aa77430 R08: 0000000000000000 R09: 0000000000001033
[ 47.022050] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000001
[ 47.022076] R13: 0000000000000000 R14: 00007fba6fe2c000 R15: 00005625cb5dedd8
[ 47.022103] </TASK>
[ 47.022127] Modules linked in: 9p netfs kvm_intel kvm 9pnet_virtio 9pnet i2c_piix4 zram crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel virtio_blk serio_raw fuse qemu_fw_cfg virtio_console
[ 47.022209] ---[ end trace 0000000000000000 ]---
[ 47.022233] RIP: 0010:__list_add_valid_or_report+0x70/0x90
[ 47.022258] Code: 98 ff 0f 0b 48 89 c1 48 c7 c7 90 1a 72 82 e8 77 ed 98 ff 0f 0b 48 89 d1 48 89 c6 4c 89 c2 48 c7 c7 e8 1a 72 82 e8 60 ed 98 ff <0f> 0b 48 89 f2 48 89 c1 48 89 fe 48 c7 c7 40 1b 72 82 e8 49 ed 98
[ 47.022284] RSP: 0018:ffffc9000125fba8 EFLAGS: 00010286
[ 47.022308] RAX: 0000000000000075 RBX: ffff888109459680 RCX: 0000000000000000
[ 47.022332] RDX: 0000000000000002 RSI: ffffffff82667570 RDI: 00000000ffffffff
[ 47.022356] RBP: ffff8881067bc400 R08: 0000000000000000 R09: 0000000000000003
[ 47.022406] R10: ffffc9000125fa78 R11: ffffffff82f571e8 R12: ffff888109459cb0
[ 47.022451] R13: ffff888109459cb0 R14: ffff88810945ad00 R15: 0000000000000000
[ 47.022517] FS: 00007fba6fde1680(0000) GS:ffff88842fa00000(0000) knlGS:0000000000000000
[ 47.022606] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 47.022653] CR2: 00007fba6fc5ac80 CR3: 0000000104819000 CR4: 0000000000750ef0
[ 47.022698] PKRU: 55555554
[ 47.022764] note: gc_vs_connect[1057] exited with preempt_count 1
Michal Luczaj (2):
af_unix: Fix garbage collector racing against connect()
af_unix: Add GC race reproducer + slow down unix_stream_connect()
net/unix/af_unix.c | 2 +
net/unix/garbage.c | 20 ++-
tools/testing/selftests/net/af_unix/Makefile | 2 +-
.../selftests/net/af_unix/gc_vs_connect.c | 158 ++++++++++++++++++
4 files changed, 178 insertions(+), 4 deletions(-)
create mode 100644 tools/testing/selftests/net/af_unix/gc_vs_connect.c
--
2.44.0
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH net 1/2] af_unix: Fix garbage collector racing against connect()
2024-04-08 15:58 [PATCH net 0/2] af_unix: Garbage collector vs connect() race condition Michal Luczaj
@ 2024-04-08 15:58 ` Michal Luczaj
2024-04-08 21:18 ` Kuniyuki Iwashima
2024-04-08 15:58 ` [PATCH net 2/2] af_unix: Add GC race reproducer + slow down unix_stream_connect() Michal Luczaj
1 sibling, 1 reply; 8+ messages in thread
From: Michal Luczaj @ 2024-04-08 15:58 UTC (permalink / raw)
To: netdev; +Cc: davem, edumazet, kuba, pabeni, kuniyu, Michal Luczaj
Garbage collector does not take into account the risk of embryo getting
enqueued during the garbage collection. If such embryo has a peer that
carries SCM_RIGHTS, two consecutive passes of scan_children() may see a
different set of children. Leading to an incorrectly elevated inflight
count, and then a dangling pointer within the gc_inflight_list.
sockets are AF_UNIX/SOCK_STREAM
S is an unconnected socket
L is a listening in-flight socket bound to addr, not in fdtable
V's fd will be passed via sendmsg(), gets inflight count bumped
connect(S, addr) sendmsg(S, [V]); close(V) __unix_gc()
---------------- ------------------------- -----------
NS = unix_create1()
skb1 = sock_wmalloc(NS)
L = unix_find_other(addr)
unix_state_lock(L)
unix_peer(S) = NS
// V count=1 inflight=0
NS = unix_peer(S)
skb2 = sock_alloc()
skb_queue_tail(NS, skb2[V])
// V became in-flight
// V count=2 inflight=1
close(V)
// V count=1 inflight=1
// GC candidate condition met
for u in gc_inflight_list:
if (total_refs == inflight_refs)
add u to gc_candidates
// gc_candidates={L, V}
for u in gc_candidates:
scan_children(u, dec_inflight)
// embryo (skb1) was not
// reachable from L yet, so V's
// inflight remains unchanged
__skb_queue_tail(L, skb1)
unix_state_unlock(L)
for u in gc_candidates:
if (u.inflight)
scan_children(u, inc_inflight_move_tail)
// V count=1 inflight=2 (!)
If there is a GC-candidate listening socket, lock/unlock its state. This
makes GC wait until the end of any ongoing connect() to that socket. After
flipping the lock, a possibly SCM-laden embryo is already enqueued. And if
there is another connect() coming, its embryo won't carry SCM_RIGHTS as we
already took the unix_gc_lock.
Fixes: 1fd05ba5a2f2 ("[AF_UNIX]: Rewrite garbage collector, fixes race.")
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
net/unix/garbage.c | 20 +++++++++++++++++---
1 file changed, 17 insertions(+), 3 deletions(-)
diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index fa39b6265238..cd3e8585ceb2 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -274,11 +274,20 @@ static void __unix_gc(struct work_struct *work)
* receive queues. Other, non candidate sockets _can_ be
* added to queue, so we must make sure only to touch
* candidates.
+ *
+ * Embryos, though never candidates themselves, affect which
+ * candidates are reachable by the garbage collector. Before
+ * being added to a listener's queue, an embryo may already
+ * receive data carrying SCM_RIGHTS, potentially making the
+ * passed socket a candidate that is not yet reachable by the
+ * collector. It becomes reachable once the embryo is
+ * enqueued. Therefore, we must ensure that no SCM-laden
+ * embryo appears in a (candidate) listener's queue between
+ * consecutive scan_children() calls.
*/
list_for_each_entry_safe(u, next, &gc_inflight_list, link) {
- long total_refs;
-
- total_refs = file_count(u->sk.sk_socket->file);
+ struct sock *sk = &u->sk;
+ long total_refs = file_count(sk->sk_socket->file);
WARN_ON_ONCE(!u->inflight);
WARN_ON_ONCE(total_refs < u->inflight);
@@ -286,6 +295,11 @@ static void __unix_gc(struct work_struct *work)
list_move_tail(&u->link, &gc_candidates);
__set_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
__set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
+
+ if (sk->sk_state == TCP_LISTEN) {
+ unix_state_lock(sk);
+ unix_state_unlock(sk);
+ }
}
}
--
2.44.0
^ permalink raw reply related [flat|nested] 8+ messages in thread
* [PATCH net 2/2] af_unix: Add GC race reproducer + slow down unix_stream_connect()
2024-04-08 15:58 [PATCH net 0/2] af_unix: Garbage collector vs connect() race condition Michal Luczaj
2024-04-08 15:58 ` [PATCH net 1/2] af_unix: Fix garbage collector racing against connect() Michal Luczaj
@ 2024-04-08 15:58 ` Michal Luczaj
1 sibling, 0 replies; 8+ messages in thread
From: Michal Luczaj @ 2024-04-08 15:58 UTC (permalink / raw)
To: netdev; +Cc: davem, edumazet, kuba, pabeni, kuniyu, Michal Luczaj
Attempt to crash kernel racing unix socket garbage collector.
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
net/unix/af_unix.c | 2 +
tools/testing/selftests/net/af_unix/Makefile | 2 +-
.../selftests/net/af_unix/gc_vs_connect.c | 158 ++++++++++++++++++
3 files changed, 161 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/net/af_unix/gc_vs_connect.c
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 5b41e2321209..8e56a094dc80 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -1636,6 +1636,8 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr,
unix_state_unlock(sk);
+ mdelay(1);
+
/* take ten and send info to listening sock */
spin_lock(&other->sk_receive_queue.lock);
__skb_queue_tail(&other->sk_receive_queue, skb);
diff --git a/tools/testing/selftests/net/af_unix/Makefile b/tools/testing/selftests/net/af_unix/Makefile
index 221c387a7d7f..3b12a9290e06 100644
--- a/tools/testing/selftests/net/af_unix/Makefile
+++ b/tools/testing/selftests/net/af_unix/Makefile
@@ -1,4 +1,4 @@
CFLAGS += $(KHDR_INCLUDES)
-TEST_GEN_PROGS := diag_uid test_unix_oob unix_connect scm_pidfd
+TEST_GEN_PROGS := diag_uid test_unix_oob unix_connect scm_pidfd gc_vs_connect
include ../../lib.mk
diff --git a/tools/testing/selftests/net/af_unix/gc_vs_connect.c b/tools/testing/selftests/net/af_unix/gc_vs_connect.c
new file mode 100644
index 000000000000..8b724f1616dd
--- /dev/null
+++ b/tools/testing/selftests/net/af_unix/gc_vs_connect.c
@@ -0,0 +1,158 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <stdio.h>
+#include <errno.h>
+#include <string.h>
+#include <assert.h>
+#include <unistd.h>
+#include <stdlib.h>
+#include <pthread.h>
+#include <sys/un.h>
+#include <sys/socket.h>
+
+#define SOCK_TYPE SOCK_STREAM /* or SOCK_SEQPACKET */
+
+union {
+ char buf[CMSG_SPACE(sizeof(int))];
+ struct cmsghdr align;
+} cbuf;
+
+struct iovec io = {
+ .iov_base = (char[1]) {0},
+ .iov_len = 1
+};
+
+struct msghdr msg = {
+ .msg_iov = &io,
+ .msg_iovlen = 1,
+ .msg_control = cbuf.buf,
+ .msg_controllen = sizeof(cbuf.buf)
+};
+
+pthread_barrier_t barr;
+struct sockaddr_un saddr;
+int salen, client;
+
+static void barrier(void)
+{
+ int ret = pthread_barrier_wait(&barr);
+
+ assert(!ret || ret == PTHREAD_BARRIER_SERIAL_THREAD);
+}
+
+static int socket_unix(void)
+{
+ int sock = socket(AF_UNIX, SOCK_TYPE, 0);
+
+ assert(sock != -1);
+ return sock;
+}
+
+static int recv_fd(int socket)
+{
+ struct cmsghdr *cmsg;
+ int ret, fd;
+
+ ret = recvmsg(socket, &msg, 0);
+ assert(ret == 1 && !(msg.msg_flags & MSG_CTRUNC));
+
+ cmsg = CMSG_FIRSTHDR(&msg);
+ assert(cmsg);
+ memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd));
+ assert(fd >= 0);
+
+ return fd;
+}
+
+static void send_fd(int socket, int fd)
+{
+ struct cmsghdr *cmsg;
+ int ret;
+
+ cmsg = CMSG_FIRSTHDR(&msg);
+ assert(cmsg);
+ cmsg->cmsg_level = SOL_SOCKET;
+ cmsg->cmsg_type = SCM_RIGHTS;
+ cmsg->cmsg_len = CMSG_LEN(sizeof(fd));
+
+ memcpy(CMSG_DATA(cmsg), &fd, sizeof(fd));
+
+ do {
+ ret = sendmsg(socket, &msg, 0);
+ assert(ret == 1 || (ret == -1 && errno == ENOTCONN));
+ } while (ret != 1);
+}
+
+static void *racer_connect(void *arg)
+{
+ for (;;) {
+ int ret;
+
+ barrier();
+ ret = connect(client, (struct sockaddr *)&saddr, salen);
+ assert(!ret);
+ barrier();
+ }
+
+ return NULL;
+}
+
+static void *racer_gc(void *arg)
+{
+ for (;;)
+ close(socket_unix()); /* trigger GC */
+
+ return NULL;
+}
+
+int main(void)
+{
+ pthread_t thread_conn, thread_gc;
+ int ret, pair[2];
+
+ printf("running\n");
+
+ ret = pthread_barrier_init(&barr, NULL, 2);
+ assert(!ret);
+
+ ret = pthread_create(&thread_conn, NULL, racer_connect, NULL);
+ assert(!ret);
+
+ ret = pthread_create(&thread_gc, NULL, racer_gc, NULL);
+ assert(!ret);
+
+ ret = socketpair(AF_UNIX, SOCK_TYPE, 0, pair);
+ assert(!ret);
+
+ saddr.sun_family = AF_UNIX;
+ salen = sizeof(saddr.sun_family) +
+ sprintf(saddr.sun_path, "%c/unix-gc-%d", '\0', getpid());
+
+ for (;;) {
+ int server, victim;
+
+ server = socket_unix();
+ ret = bind(server, (struct sockaddr *)&saddr, salen);
+ assert(!ret);
+ ret = listen(server, -1);
+ assert(!ret);
+
+ send_fd(pair[0], server);
+ close(server);
+
+ client = socket_unix();
+ victim = socket_unix();
+
+ barrier();
+ send_fd(client, victim);
+ close(victim);
+ barrier();
+
+ server = recv_fd(pair[1]);
+ close(client);
+ close(server);
+ }
+
+ return 0;
+}
--
2.44.0
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH net 1/2] af_unix: Fix garbage collector racing against connect()
2024-04-08 15:58 ` [PATCH net 1/2] af_unix: Fix garbage collector racing against connect() Michal Luczaj
@ 2024-04-08 21:18 ` Kuniyuki Iwashima
2024-04-08 23:25 ` Michal Luczaj
0 siblings, 1 reply; 8+ messages in thread
From: Kuniyuki Iwashima @ 2024-04-08 21:18 UTC (permalink / raw)
To: mhal; +Cc: davem, edumazet, kuba, kuniyu, netdev, pabeni
From: Michal Luczaj <mhal@rbox.co>
Date: Mon, 8 Apr 2024 17:58:45 +0200
> Garbage collector does not take into account the risk of embryo getting
> enqueued during the garbage collection. If such embryo has a peer that
> carries SCM_RIGHTS, two consecutive passes of scan_children() may see a
> different set of children. Leading to an incorrectly elevated inflight
> count, and then a dangling pointer within the gc_inflight_list.
>
> sockets are AF_UNIX/SOCK_STREAM
> S is an unconnected socket
> L is a listening in-flight socket bound to addr, not in fdtable
> V's fd will be passed via sendmsg(), gets inflight count bumped
>
> connect(S, addr) sendmsg(S, [V]); close(V) __unix_gc()
> ---------------- ------------------------- -----------
>
> NS = unix_create1()
> skb1 = sock_wmalloc(NS)
> L = unix_find_other(addr)
> unix_state_lock(L)
> unix_peer(S) = NS
> // V count=1 inflight=0
>
> NS = unix_peer(S)
> skb2 = sock_alloc()
> skb_queue_tail(NS, skb2[V])
>
> // V became in-flight
> // V count=2 inflight=1
>
> close(V)
>
> // V count=1 inflight=1
> // GC candidate condition met
>
> for u in gc_inflight_list:
> if (total_refs == inflight_refs)
> add u to gc_candidates
>
> // gc_candidates={L, V}
>
> for u in gc_candidates:
> scan_children(u, dec_inflight)
>
> // embryo (skb1) was not
> // reachable from L yet, so V's
> // inflight remains unchanged
> __skb_queue_tail(L, skb1)
> unix_state_unlock(L)
> for u in gc_candidates:
> if (u.inflight)
> scan_children(u, inc_inflight_move_tail)
>
> // V count=1 inflight=2 (!)
>
> If there is a GC-candidate listening socket, lock/unlock its state. This
> makes GC wait until the end of any ongoing connect() to that socket. After
> flipping the lock, a possibly SCM-laden embryo is already enqueued. And if
> there is another connect() coming, its embryo won't carry SCM_RIGHTS as we
> already took the unix_gc_lock.
>
> Fixes: 1fd05ba5a2f2 ("[AF_UNIX]: Rewrite garbage collector, fixes race.")
> Signed-off-by: Michal Luczaj <mhal@rbox.co>
> ---
> net/unix/garbage.c | 20 +++++++++++++++++---
> 1 file changed, 17 insertions(+), 3 deletions(-)
>
> diff --git a/net/unix/garbage.c b/net/unix/garbage.c
> index fa39b6265238..cd3e8585ceb2 100644
> --- a/net/unix/garbage.c
> +++ b/net/unix/garbage.c
> @@ -274,11 +274,20 @@ static void __unix_gc(struct work_struct *work)
> * receive queues. Other, non candidate sockets _can_ be
> * added to queue, so we must make sure only to touch
> * candidates.
> + *
> + * Embryos, though never candidates themselves, affect which
> + * candidates are reachable by the garbage collector. Before
> + * being added to a listener's queue, an embryo may already
> + * receive data carrying SCM_RIGHTS, potentially making the
> + * passed socket a candidate that is not yet reachable by the
> + * collector. It becomes reachable once the embryo is
> + * enqueued. Therefore, we must ensure that no SCM-laden
> + * embryo appears in a (candidate) listener's queue between
> + * consecutive scan_children() calls.
> */
> list_for_each_entry_safe(u, next, &gc_inflight_list, link) {
> - long total_refs;
> -
> - total_refs = file_count(u->sk.sk_socket->file);
> + struct sock *sk = &u->sk;
> + long total_refs = file_count(sk->sk_socket->file);
>
> WARN_ON_ONCE(!u->inflight);
> WARN_ON_ONCE(total_refs < u->inflight);
> @@ -286,6 +295,11 @@ static void __unix_gc(struct work_struct *work)
> list_move_tail(&u->link, &gc_candidates);
> __set_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
> __set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
> +
> + if (sk->sk_state == TCP_LISTEN) {
> + unix_state_lock(sk);
> + unix_state_unlock(sk);
Less likely though, what if the same connect() happens after this ?
connect(S, addr) sendmsg(S, [V]); close(V) __unix_gc()
---------------- ------------------------- -----------
NS = unix_create1()
skb1 = sock_wmalloc(NS)
L = unix_find_other(addr)
for u in gc_inflight_list:
if (total_refs == inflight_refs)
add u to gc_candidates
// L was already traversed
// in a previous iteration.
unix_state_lock(L)
unix_peer(S) = NS
// gc_candidates={L, V}
for u in gc_candidates:
scan_children(u, dec_inflight)
// embryo (skb1) was not
// reachable from L yet, so V's
// inflight remains unchanged
__skb_queue_tail(L, skb1)
unix_state_unlock(L)
for u in gc_candidates:
if (u.inflight)
scan_children(u, inc_inflight_move_tail)
// V count=1 inflight=2 (!)
As you pointed out, this GC's assumption is basically wrong; the GC
works correctly only when the set of traversed sockets does not change
over 3 scan_children() calls.
That's why I reworked the GC not to rely on receive queue.
https://lore.kernel.org/netdev/20240325202425.60930-1-kuniyu@amazon.com/
> + }
> }
> }
>
> --
> 2.44.0
>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH net 1/2] af_unix: Fix garbage collector racing against connect()
2024-04-08 21:18 ` Kuniyuki Iwashima
@ 2024-04-08 23:25 ` Michal Luczaj
2024-04-09 0:22 ` Kuniyuki Iwashima
0 siblings, 1 reply; 8+ messages in thread
From: Michal Luczaj @ 2024-04-08 23:25 UTC (permalink / raw)
To: Kuniyuki Iwashima; +Cc: davem, edumazet, kuba, netdev, pabeni
On 4/8/24 23:18, Kuniyuki Iwashima wrote:
> From: Michal Luczaj <mhal@rbox.co>
> Date: Mon, 8 Apr 2024 17:58:45 +0200
...
>> list_for_each_entry_safe(u, next, &gc_inflight_list, link) {
>> - long total_refs;
>> -
>> - total_refs = file_count(u->sk.sk_socket->file);
>> + struct sock *sk = &u->sk;
>> + long total_refs = file_count(sk->sk_socket->file);
>>
>> WARN_ON_ONCE(!u->inflight);
>> WARN_ON_ONCE(total_refs < u->inflight);
>> @@ -286,6 +295,11 @@ static void __unix_gc(struct work_struct *work)
>> list_move_tail(&u->link, &gc_candidates);
>> __set_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
>> __set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
>> +
>> + if (sk->sk_state == TCP_LISTEN) {
>> + unix_state_lock(sk);
>> + unix_state_unlock(sk);
>
> Less likely though, what if the same connect() happens after this ?
>
> connect(S, addr) sendmsg(S, [V]); close(V) __unix_gc()
> ---------------- ------------------------- -----------
> NS = unix_create1()
> skb1 = sock_wmalloc(NS)
> L = unix_find_other(addr)
> for u in gc_inflight_list:
> if (total_refs == inflight_refs)
> add u to gc_candidates
> // L was already traversed
> // in a previous iteration.
> unix_state_lock(L)
> unix_peer(S) = NS
>
> // gc_candidates={L, V}
>
> for u in gc_candidates:
> scan_children(u, dec_inflight)
>
> // embryo (skb1) was not
> // reachable from L yet, so V's
> // inflight remains unchanged
> __skb_queue_tail(L, skb1)
> unix_state_unlock(L)
> for u in gc_candidates:
> if (u.inflight)
> scan_children(u, inc_inflight_move_tail)
>
> // V count=1 inflight=2 (!)
If I understand your question, in this case L's queue technically does change
between scan_children()s: embryo appears, but that's meaningless. __unix_gc()
already holds unix_gc_lock, so the enqueued embryo can not carry any SCM_RIGHTS
(i.e. it doesn't affect the inflight graph). Note that unix_inflight() takes the
same unix_gc_lock.
Is there something I'm missing?
> As you pointed out, this GC's assumption is basically wrong; the GC
> works correctly only when the set of traversed sockets does not change
> over 3 scan_children() calls.
>
> That's why I reworked the GC not to rely on receive queue.
> https://lore.kernel.org/netdev/20240325202425.60930-1-kuniyu@amazon.com/
Right, I'll try to get my head around your series :)
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH net 1/2] af_unix: Fix garbage collector racing against connect()
2024-04-08 23:25 ` Michal Luczaj
@ 2024-04-09 0:22 ` Kuniyuki Iwashima
2024-04-09 9:16 ` Michal Luczaj
0 siblings, 1 reply; 8+ messages in thread
From: Kuniyuki Iwashima @ 2024-04-09 0:22 UTC (permalink / raw)
To: mhal; +Cc: davem, edumazet, kuba, kuniyu, netdev, pabeni
From: Michal Luczaj <mhal@rbox.co>
Date: Tue, 9 Apr 2024 01:25:23 +0200
> On 4/8/24 23:18, Kuniyuki Iwashima wrote:
> > From: Michal Luczaj <mhal@rbox.co>
> > Date: Mon, 8 Apr 2024 17:58:45 +0200
> ...
> >> list_for_each_entry_safe(u, next, &gc_inflight_list, link) {
Please move sk declaration here and
> >> - long total_refs;
> >> -
> >> - total_refs = file_count(u->sk.sk_socket->file);
keep these 3 lines for reverse xmax tree order.
> >> + struct sock *sk = &u->sk;
> >> + long total_refs = file_count(sk->sk_socket->file);
> >>
> >> WARN_ON_ONCE(!u->inflight);
> >> WARN_ON_ONCE(total_refs < u->inflight);
> >> @@ -286,6 +295,11 @@ static void __unix_gc(struct work_struct *work)
> >> list_move_tail(&u->link, &gc_candidates);
> >> __set_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
> >> __set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
> >> +
> >> + if (sk->sk_state == TCP_LISTEN) {
> >> + unix_state_lock(sk);
> >> + unix_state_unlock(sk);
> >
> > Less likely though, what if the same connect() happens after this ?
> >
> > connect(S, addr) sendmsg(S, [V]); close(V) __unix_gc()
> > ---------------- ------------------------- -----------
> > NS = unix_create1()
> > skb1 = sock_wmalloc(NS)
> > L = unix_find_other(addr)
> > for u in gc_inflight_list:
> > if (total_refs == inflight_refs)
> > add u to gc_candidates
> > // L was already traversed
> > // in a previous iteration.
> > unix_state_lock(L)
> > unix_peer(S) = NS
> >
> > // gc_candidates={L, V}
> >
> > for u in gc_candidates:
> > scan_children(u, dec_inflight)
> >
> > // embryo (skb1) was not
> > // reachable from L yet, so V's
> > // inflight remains unchanged
> > __skb_queue_tail(L, skb1)
> > unix_state_unlock(L)
> > for u in gc_candidates:
> > if (u.inflight)
> > scan_children(u, inc_inflight_move_tail)
> >
> > // V count=1 inflight=2 (!)
>
> If I understand your question, in this case L's queue technically does change
> between scan_children()s: embryo appears, but that's meaningless. __unix_gc()
> already holds unix_gc_lock, so the enqueued embryo can not carry any SCM_RIGHTS
> (i.e. it doesn't affect the inflight graph). Note that unix_inflight() takes the
> same unix_gc_lock.
>
> Is there something I'm missing?
Ah exactly, you are right.
Could you repost this patch only with my comment above addressed ?
Thanks!
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH net 1/2] af_unix: Fix garbage collector racing against connect()
2024-04-09 0:22 ` Kuniyuki Iwashima
@ 2024-04-09 9:16 ` Michal Luczaj
2024-04-09 19:02 ` Kuniyuki Iwashima
0 siblings, 1 reply; 8+ messages in thread
From: Michal Luczaj @ 2024-04-09 9:16 UTC (permalink / raw)
To: Kuniyuki Iwashima; +Cc: davem, edumazet, kuba, netdev, pabeni
On 4/9/24 02:22, Kuniyuki Iwashima wrote:
> From: Michal Luczaj <mhal@rbox.co>
> Date: Tue, 9 Apr 2024 01:25:23 +0200
>> On 4/8/24 23:18, Kuniyuki Iwashima wrote:
>>> From: Michal Luczaj <mhal@rbox.co>
>>> Date: Mon, 8 Apr 2024 17:58:45 +0200
>> ...
>>>> list_for_each_entry_safe(u, next, &gc_inflight_list, link) {
>
> Please move sk declaration here and
>
>>>> - long total_refs;
>>>> -
>>>> - total_refs = file_count(u->sk.sk_socket->file);
>
> keep these 3 lines for reverse xmax tree order.
Tricky to have them all 3 in reverse xmax. Did you mean
struct sock *sk = &u->sk;
long total_refs;
total_refs = file_count(sk->sk_socket->file);
?
>>> connect(S, addr) sendmsg(S, [V]); close(V) __unix_gc()
>>> ---------------- ------------------------- -----------
>>> NS = unix_create1()
>>> skb1 = sock_wmalloc(NS)
>>> L = unix_find_other(addr)
>>> for u in gc_inflight_list:
>>> if (total_refs == inflight_refs)
>>> add u to gc_candidates
>>> // L was already traversed
>>> // in a previous iteration.
>>> unix_state_lock(L)
>>> unix_peer(S) = NS
>>>
>>> // gc_candidates={L, V}
>>>
>>> for u in gc_candidates:
>>> scan_children(u, dec_inflight)
>>>
>>> // embryo (skb1) was not
>>> // reachable from L yet, so V's
>>> // inflight remains unchanged
>>> __skb_queue_tail(L, skb1)
>>> unix_state_unlock(L)
>>> for u in gc_candidates:
>>> if (u.inflight)
>>> scan_children(u, inc_inflight_move_tail)
>>>
>>> // V count=1 inflight=2 (!)
>>
>> If I understand your question, in this case L's queue technically does change
>> between scan_children()s: embryo appears, but that's meaningless. __unix_gc()
>> already holds unix_gc_lock, so the enqueued embryo can not carry any SCM_RIGHTS
>> (i.e. it doesn't affect the inflight graph). Note that unix_inflight() takes the
>> same unix_gc_lock.
>>
>> Is there something I'm missing?
>
> Ah exactly, you are right.
>
> Could you repost this patch only with my comment above addressed ?
Yeah, sure. One question though: what I wrote above is basically a rephrasing of
the commit message:
(...) After flipping the lock, a possibly SCM-laden embryo is already
enqueued. And if there is another connect() coming, its embryo won't
carry SCM_RIGHTS as we already took the unix_gc_lock.
As I understand, the important missing part was the clarification that embryo,
even though enqueued after the lock flipping, won't affect the inflight graph,
right? So how about:
(...) After flipping the lock, a possibly SCM-laden embryo is already
enqueued. And if there is another embryo coming, it can not possibly carry
SCM_RIGHTS. At this point, unix_inflight() can not happen because
unix_gc_lock is already taken. Inflight graph remains unaffected.
Thanks!
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH net 1/2] af_unix: Fix garbage collector racing against connect()
2024-04-09 9:16 ` Michal Luczaj
@ 2024-04-09 19:02 ` Kuniyuki Iwashima
0 siblings, 0 replies; 8+ messages in thread
From: Kuniyuki Iwashima @ 2024-04-09 19:02 UTC (permalink / raw)
To: mhal; +Cc: davem, edumazet, kuba, kuniyu, netdev, pabeni
From: Michal Luczaj <mhal@rbox.co>
Date: Tue, 9 Apr 2024 11:16:35 +0200
> On 4/9/24 02:22, Kuniyuki Iwashima wrote:
> > From: Michal Luczaj <mhal@rbox.co>
> > Date: Tue, 9 Apr 2024 01:25:23 +0200
> >> On 4/8/24 23:18, Kuniyuki Iwashima wrote:
> >>> From: Michal Luczaj <mhal@rbox.co>
> >>> Date: Mon, 8 Apr 2024 17:58:45 +0200
> >> ...
> >>>> list_for_each_entry_safe(u, next, &gc_inflight_list, link) {
> >
> > Please move sk declaration here and
> >
> >>>> - long total_refs;
> >>>> -
> >>>> - total_refs = file_count(u->sk.sk_socket->file);
> >
> > keep these 3 lines for reverse xmax tree order.
>
> Tricky to have them all 3 in reverse xmax. Did you mean
>
> struct sock *sk = &u->sk;
> long total_refs;
>
> total_refs = file_count(sk->sk_socket->file);
>
> ?
Yes, it's netdev convention.
https://docs.kernel.org/process/maintainer-netdev.html#local-variable-ordering-reverse-xmas-tree-rcs
>
> >>> connect(S, addr) sendmsg(S, [V]); close(V) __unix_gc()
> >>> ---------------- ------------------------- -----------
> >>> NS = unix_create1()
> >>> skb1 = sock_wmalloc(NS)
> >>> L = unix_find_other(addr)
> >>> for u in gc_inflight_list:
> >>> if (total_refs == inflight_refs)
> >>> add u to gc_candidates
> >>> // L was already traversed
> >>> // in a previous iteration.
> >>> unix_state_lock(L)
> >>> unix_peer(S) = NS
> >>>
> >>> // gc_candidates={L, V}
> >>>
> >>> for u in gc_candidates:
> >>> scan_children(u, dec_inflight)
> >>>
> >>> // embryo (skb1) was not
> >>> // reachable from L yet, so V's
> >>> // inflight remains unchanged
> >>> __skb_queue_tail(L, skb1)
> >>> unix_state_unlock(L)
> >>> for u in gc_candidates:
> >>> if (u.inflight)
> >>> scan_children(u, inc_inflight_move_tail)
> >>>
> >>> // V count=1 inflight=2 (!)
> >>
> >> If I understand your question, in this case L's queue technically does change
> >> between scan_children()s: embryo appears, but that's meaningless. __unix_gc()
> >> already holds unix_gc_lock, so the enqueued embryo can not carry any SCM_RIGHTS
> >> (i.e. it doesn't affect the inflight graph). Note that unix_inflight() takes the
> >> same unix_gc_lock.
> >>
> >> Is there something I'm missing?
> >
> > Ah exactly, you are right.
> >
> > Could you repost this patch only with my comment above addressed ?
>
> Yeah, sure. One question though: what I wrote above is basically a rephrasing of
> the commit message:
>
> (...) After flipping the lock, a possibly SCM-laden embryo is already
> enqueued. And if there is another connect() coming, its embryo won't
> carry SCM_RIGHTS as we already took the unix_gc_lock.
>
> As I understand, the important missing part was the clarification that embryo,
> even though enqueued after the lock flipping, won't affect the inflight graph,
> right? So how about:
>
> (...) After flipping the lock, a possibly SCM-laden embryo is already
> enqueued. And if there is another embryo coming, it can not possibly carry
> SCM_RIGHTS. At this point, unix_inflight() can not happen because
> unix_gc_lock is already taken. Inflight graph remains unaffected.
Sounds good to me.
Thanks!
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2024-04-09 19:29 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-04-08 15:58 [PATCH net 0/2] af_unix: Garbage collector vs connect() race condition Michal Luczaj
2024-04-08 15:58 ` [PATCH net 1/2] af_unix: Fix garbage collector racing against connect() Michal Luczaj
2024-04-08 21:18 ` Kuniyuki Iwashima
2024-04-08 23:25 ` Michal Luczaj
2024-04-09 0:22 ` Kuniyuki Iwashima
2024-04-09 9:16 ` Michal Luczaj
2024-04-09 19:02 ` Kuniyuki Iwashima
2024-04-08 15:58 ` [PATCH net 2/2] af_unix: Add GC race reproducer + slow down unix_stream_connect() Michal Luczaj
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).