* [PATCH net 0/2] af_unix: Garbage collector vs connect() race condition
@ 2024-04-08 15:58 Michal Luczaj
2024-04-08 15:58 ` [PATCH net 1/2] af_unix: Fix garbage collector racing against connect() Michal Luczaj
2024-04-08 15:58 ` [PATCH net 2/2] af_unix: Add GC race reproducer + slow down unix_stream_connect() Michal Luczaj
0 siblings, 2 replies; 8+ messages in thread
From: Michal Luczaj @ 2024-04-08 15:58 UTC (permalink / raw)
To: netdev; +Cc: davem, edumazet, kuba, pabeni, kuniyu, Michal Luczaj
The garbage collector does not take into account the risk of an embryo
getting enqueued during garbage collection. If such an embryo has a peer
carrying SCM_RIGHTS, two consecutive passes of scan_children() may see a
different set of children, leading to an incorrectly elevated inflight
count and, eventually, a dangling pointer within the gc_inflight_list.
sockets are AF_UNIX/SOCK_STREAM
S is an unconnected socket
L is a listening in-flight socket bound to addr, not in fdtable
V's fd will be passed via sendmsg(), gets inflight count bumped
connect(S, addr)          sendmsg(S, [V]); close(V)     __unix_gc()
----------------          -------------------------     -----------

NS = unix_create1()
skb1 = sock_wmalloc(NS)
L = unix_find_other(addr)
unix_state_lock(L)
unix_peer(S) = NS
                          // V count=1 inflight=0

                          NS = unix_peer(S)
                          skb2 = sock_alloc()
                          skb_queue_tail(NS, skb2[V])

                          // V became in-flight
                          // V count=2 inflight=1

                          close(V)

                          // V count=1 inflight=1
                          // GC candidate condition met

                                                        for u in gc_inflight_list:
                                                          if (total_refs == inflight_refs)
                                                            add u to gc_candidates

                                                        // gc_candidates={L, V}

                                                        for u in gc_candidates:
                                                          scan_children(u, dec_inflight)

                                                        // embryo (skb1) was not
                                                        // reachable from L yet, so V's
                                                        // inflight remains unchanged
__skb_queue_tail(L, skb1)
unix_state_unlock(L)
                                                        for u in gc_candidates:
                                                          if (u.inflight)
                                                            scan_children(u, inc_inflight_move_tail)

                                                        // V count=1 inflight=2 (!)
The idea behind the patch is to unix_state_lock()/unlock() L during the
gc_candidates selection. This guarantees that either connect() has
already concluded, or - if we raced it - a parallel sendmsg() carrying
SCM_RIGHTS cannot happen, because we are already holding the
unix_gc_lock.
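
For reference, a minimal sketch of what the candidate-selection loop in
__unix_gc() ends up looking like with the change applied (this is only an
illustration; the authoritative version is the diff in patch 1/2, and the
WARN_ON_ONCE() sanity checks are omitted here):

	list_for_each_entry_safe(u, next, &gc_inflight_list, link) {
		struct sock *sk = &u->sk;
		long total_refs = file_count(sk->sk_socket->file);

		if (total_refs == u->inflight) {
			list_move_tail(&u->link, &gc_candidates);
			__set_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
			__set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);

			/* Wait out any connect() currently holding this
			 * listener's state lock; any embryo enqueued after
			 * this point cannot carry SCM_RIGHTS because
			 * unix_gc_lock is held for the whole collection.
			 */
			if (sk->sk_state == TCP_LISTEN) {
				unix_state_lock(sk);
				unix_state_unlock(sk);
			}
		}
	}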
Running the reproducer with mdelay(1) stuffed in unix_stream_connect()
results in immediate splats:
$ ./tools/testing/selftests/net/af_unix/gc_vs_connect
running
[ 47.019387] WARNING: CPU: 3 PID: 12 at net/unix/garbage.c:284 __unix_gc+0x473/0x4a0
[ 47.019405] Modules linked in: 9p netfs kvm_intel kvm 9pnet_virtio 9pnet i2c_piix4 zram crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel virtio_blk serio_raw fuse qemu_fw_cfg virtio_console
[ 47.019419] CPU: 3 PID: 12 Comm: kworker/u32:1 Not tainted 6.9.0-rc2nokasan+ #1
[ 47.019421] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[ 47.019422] Workqueue: events_unbound __unix_gc
[ 47.019424] RIP: 0010:__unix_gc+0x473/0x4a0
[ 47.019425] Code: 8d bb d0 f9 ff ff 48 0f ba 73 58 01 0f b6 83 e2 f9 ff ff 31 d2 48 c7 c6 a0 c3 de 81 3c 0a 74 18 e8 e2 f8 ff ff e9 96 fd ff ff <0f> 0b e9 ef fb ff ff 0f 0b e9 39 fc ff ff e8 5a fa ff ff e9 7e fd
[ 47.019427] RSP: 0018:ffffc9000006bd90 EFLAGS: 00010297
[ 47.019429] RAX: 0000000000000002 RBX: ffff888109459680 RCX: 000000003342a6b0
[ 47.019430] RDX: 0000000000000001 RSI: ffffffff82634e8c RDI: ffffffff83161740
[ 47.019431] RBP: ffffc9000006be40 R08: 00000000000003d7 R09: 00000000000003d7
[ 47.019432] R10: 0000000000000000 R11: 0000000000000001 R12: ffffffff831610e0
[ 47.019433] R13: ffff888109459cb0 R14: ffffc9000006bd90 R15: ffffffff831616c0
[ 47.019434] FS: 0000000000000000(0000) GS:ffff88842fb80000(0000) knlGS:0000000000000000
[ 47.019435] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 47.019436] CR2: 000055c16e3893b8 CR3: 0000000002e37000 CR4: 0000000000750ef0
[ 47.019438] PKRU: 55555554
[ 47.019439] Call Trace:
[ 47.019440] <TASK>
[ 47.019442] ? __warn+0x88/0x180
[ 47.019445] ? __unix_gc+0x473/0x4a0
[ 47.019447] ? report_bug+0x189/0x1c0
[ 47.019451] ? handle_bug+0x38/0x70
[ 47.019453] ? exc_invalid_op+0x13/0x60
[ 47.019454] ? asm_exc_invalid_op+0x16/0x20
[ 47.019460] ? __unix_gc+0x473/0x4a0
[ 47.019464] ? lock_acquire+0xd5/0x2c0
[ 47.019466] ? lock_release+0x133/0x290
[ 47.019471] process_one_work+0x215/0x700
[ 47.019473] ? move_linked_works+0x70/0xa0
[ 47.019477] worker_thread+0x1ca/0x3b0
[ 47.019479] ? rescuer_thread+0x340/0x340
[ 47.019480] kthread+0xdd/0x110
[ 47.019482] ? kthread_complete_and_exit+0x20/0x20
[ 47.019485] ret_from_fork+0x2d/0x50
[ 47.019487] ? kthread_complete_and_exit+0x20/0x20
[ 47.019489] ret_from_fork_asm+0x11/0x20
[ 47.019495] </TASK>
[ 47.019496] irq event stamp: 446769
[ 47.019497] hardirqs last enabled at (446775): [<ffffffff81199a69>] console_unlock+0xf9/0x120
[ 47.019499] hardirqs last disabled at (446780): [<ffffffff81199a4e>] console_unlock+0xde/0x120
[ 47.019501] softirqs last enabled at (444340): [<ffffffff810fcd73>] __irq_exit_rcu+0x93/0x100
[ 47.019504] softirqs last disabled at (444333): [<ffffffff810fcd73>] __irq_exit_rcu+0x93/0x100
[ 47.019505] ---[ end trace 0000000000000000 ]---
...
[ 47.019555] list_add corruption. prev->next should be next (ffffffff83161710), but was ffff888109459cb0. (prev=ffff888109459cb0).
[ 47.019572] kernel BUG at lib/list_debug.c:32!
[ 47.019654] invalid opcode: 0000 [#1] PREEMPT SMP
[ 47.019676] CPU: 0 PID: 1057 Comm: gc_vs_connect Tainted: G W 6.9.0-rc2nokasan+ #1
[ 47.019698] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[ 47.019720] RIP: 0010:__list_add_valid_or_report+0x70/0x90
[ 47.019743] Code: 98 ff 0f 0b 48 89 c1 48 c7 c7 90 1a 72 82 e8 77 ed 98 ff 0f 0b 48 89 d1 48 89 c6 4c 89 c2 48 c7 c7 e8 1a 72 82 e8 60 ed 98 ff <0f> 0b 48 89 f2 48 89 c1 48 89 fe 48 c7 c7 40 1b 72 82 e8 49 ed 98
[ 47.019766] RSP: 0018:ffffc9000125fba8 EFLAGS: 00010286
[ 47.019790] RAX: 0000000000000075 RBX: ffff888109459680 RCX: 0000000000000000
[ 47.019814] RDX: 0000000000000002 RSI: ffffffff82667570 RDI: 00000000ffffffff
[ 47.019838] RBP: ffff8881067bc400 R08: 0000000000000000 R09: 0000000000000003
[ 47.019861] R10: ffffc9000125fa78 R11: ffffffff82f571e8 R12: ffff888109459cb0
[ 47.019883] R13: ffff888109459cb0 R14: ffff88810945ad00 R15: 0000000000000000
[ 47.019906] FS: 00007fba6fde1680(0000) GS:ffff88842fa00000(0000) knlGS:0000000000000000
[ 47.019928] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 47.019949] CR2: 00007fba6fc5ac80 CR3: 0000000104819000 CR4: 0000000000750ef0
[ 47.019972] PKRU: 55555554
[ 47.019993] Call Trace:
[ 47.020014] <TASK>
[ 47.020034] ? die+0x32/0x80
[ 47.020056] ? do_trap+0xd5/0x100
[ 47.020078] ? __list_add_valid_or_report+0x70/0x90
[ 47.020100] ? __list_add_valid_or_report+0x70/0x90
[ 47.020122] ? do_error_trap+0x81/0x110
[ 47.020144] ? __list_add_valid_or_report+0x70/0x90
[ 47.020166] ? exc_invalid_op+0x4c/0x60
[ 47.020189] ? __list_add_valid_or_report+0x70/0x90
[ 47.020211] ? asm_exc_invalid_op+0x16/0x20
[ 47.020237] ? __list_add_valid_or_report+0x70/0x90
[ 47.020355] ? __list_add_valid_or_report+0x70/0x90
[ 47.020424] unix_inflight+0x6a/0xf0
[ 47.020483] unix_scm_to_skb+0xe4/0x160
[ 47.020518] unix_stream_sendmsg+0x174/0x630
[ 47.020573] __sock_sendmsg+0x38/0x70
[ 47.020635] ____sys_sendmsg+0x237/0x2a0
[ 47.020694] ? import_iovec+0x16/0x20
[ 47.020754] ___sys_sendmsg+0x86/0xd0
[ 47.020791] ? find_held_lock+0x2b/0x80
[ 47.021011] ? lock_release+0x133/0x290
[ 47.021467] ? __fget_files+0xca/0x180
[ 47.021564] __sys_sendmsg+0x47/0x80
[ 47.021663] do_syscall_64+0x94/0x180
[ 47.021751] ? do_syscall_64+0xa1/0x180
[ 47.021812] ? do_syscall_64+0xa1/0x180
[ 47.021864] entry_SYSCALL_64_after_hwframe+0x46/0x4e
[ 47.021890] RIP: 0033:0x7fba6fd15dbb
[ 47.021917] Code: 48 89 e5 48 83 ec 20 89 55 ec 48 89 75 f0 89 7d f8 e8 69 2c f7 ff 8b 55 ec 48 8b 75 f0 41 89 c0 8b 7d f8 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2d 44 89 c7 48 89 45 f8 e8 c1 2c f7 ff 48 8b
[ 47.021948] RSP: 002b:00007ffc1aa77410 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
[ 47.021974] RAX: ffffffffffffffda RBX: 00007ffc1aa775c8 RCX: 00007fba6fd15dbb
[ 47.022000] RDX: 0000000000000000 RSI: 00005625cb5df0e0 RDI: 0000000000000003
[ 47.022025] RBP: 00007ffc1aa77430 R08: 0000000000000000 R09: 0000000000001033
[ 47.022050] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000001
[ 47.022076] R13: 0000000000000000 R14: 00007fba6fe2c000 R15: 00005625cb5dedd8
[ 47.022103] </TASK>
[ 47.022127] Modules linked in: 9p netfs kvm_intel kvm 9pnet_virtio 9pnet i2c_piix4 zram crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel virtio_blk serio_raw fuse qemu_fw_cfg virtio_console
[ 47.022209] ---[ end trace 0000000000000000 ]---
[ 47.022233] RIP: 0010:__list_add_valid_or_report+0x70/0x90
[ 47.022258] Code: 98 ff 0f 0b 48 89 c1 48 c7 c7 90 1a 72 82 e8 77 ed 98 ff 0f 0b 48 89 d1 48 89 c6 4c 89 c2 48 c7 c7 e8 1a 72 82 e8 60 ed 98 ff <0f> 0b 48 89 f2 48 89 c1 48 89 fe 48 c7 c7 40 1b 72 82 e8 49 ed 98
[ 47.022284] RSP: 0018:ffffc9000125fba8 EFLAGS: 00010286
[ 47.022308] RAX: 0000000000000075 RBX: ffff888109459680 RCX: 0000000000000000
[ 47.022332] RDX: 0000000000000002 RSI: ffffffff82667570 RDI: 00000000ffffffff
[ 47.022356] RBP: ffff8881067bc400 R08: 0000000000000000 R09: 0000000000000003
[ 47.022406] R10: ffffc9000125fa78 R11: ffffffff82f571e8 R12: ffff888109459cb0
[ 47.022451] R13: ffff888109459cb0 R14: ffff88810945ad00 R15: 0000000000000000
[ 47.022517] FS: 00007fba6fde1680(0000) GS:ffff88842fa00000(0000) knlGS:0000000000000000
[ 47.022606] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 47.022653] CR2: 00007fba6fc5ac80 CR3: 0000000104819000 CR4: 0000000000750ef0
[ 47.022698] PKRU: 55555554
[ 47.022764] note: gc_vs_connect[1057] exited with preempt_count 1
Michal Luczaj (2):
af_unix: Fix garbage collector racing against connect()
af_unix: Add GC race reproducer + slow down unix_stream_connect()
net/unix/af_unix.c | 2 +
net/unix/garbage.c | 20 ++-
tools/testing/selftests/net/af_unix/Makefile | 2 +-
.../selftests/net/af_unix/gc_vs_connect.c | 158 ++++++++++++++++++
4 files changed, 178 insertions(+), 4 deletions(-)
create mode 100644 tools/testing/selftests/net/af_unix/gc_vs_connect.c
--
2.44.0
^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH net 1/2] af_unix: Fix garbage collector racing against connect()
  2024-04-08 15:58 [PATCH net 0/2] af_unix: Garbage collector vs connect() race condition Michal Luczaj
@ 2024-04-08 15:58 ` Michal Luczaj
  2024-04-08 21:18   ` Kuniyuki Iwashima
  2024-04-08 15:58 ` [PATCH net 2/2] af_unix: Add GC race reproducer + slow down unix_stream_connect() Michal Luczaj
  1 sibling, 1 reply; 8+ messages in thread
From: Michal Luczaj @ 2024-04-08 15:58 UTC (permalink / raw)
  To: netdev; +Cc: davem, edumazet, kuba, pabeni, kuniyu, Michal Luczaj

Garbage collector does not take into account the risk of embryo getting
enqueued during the garbage collection. If such embryo has a peer that
carries SCM_RIGHTS, two consecutive passes of scan_children() may see a
different set of children. Leading to an incorrectly elevated inflight
count, and then a dangling pointer within the gc_inflight_list.

sockets are AF_UNIX/SOCK_STREAM
S is an unconnected socket
L is a listening in-flight socket bound to addr, not in fdtable
V's fd will be passed via sendmsg(), gets inflight count bumped

connect(S, addr)          sendmsg(S, [V]); close(V)     __unix_gc()
----------------          -------------------------     -----------

NS = unix_create1()
skb1 = sock_wmalloc(NS)
L = unix_find_other(addr)
unix_state_lock(L)
unix_peer(S) = NS
                          // V count=1 inflight=0

                          NS = unix_peer(S)
                          skb2 = sock_alloc()
                          skb_queue_tail(NS, skb2[V])

                          // V became in-flight
                          // V count=2 inflight=1

                          close(V)

                          // V count=1 inflight=1
                          // GC candidate condition met

                                                        for u in gc_inflight_list:
                                                          if (total_refs == inflight_refs)
                                                            add u to gc_candidates

                                                        // gc_candidates={L, V}

                                                        for u in gc_candidates:
                                                          scan_children(u, dec_inflight)

                                                        // embryo (skb1) was not
                                                        // reachable from L yet, so V's
                                                        // inflight remains unchanged
__skb_queue_tail(L, skb1)
unix_state_unlock(L)
                                                        for u in gc_candidates:
                                                          if (u.inflight)
                                                            scan_children(u, inc_inflight_move_tail)

                                                        // V count=1 inflight=2 (!)

If there is a GC-candidate listening socket, lock/unlock its state. This
makes GC wait until the end of any ongoing connect() to that socket. After
flipping the lock, a possibly SCM-laden embryo is already enqueued. And if
there is another connect() coming, its embryo won't carry SCM_RIGHTS as we
already took the unix_gc_lock.

Fixes: 1fd05ba5a2f2 ("[AF_UNIX]: Rewrite garbage collector, fixes race.")
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
 net/unix/garbage.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/net/unix/garbage.c b/net/unix/garbage.c
index fa39b6265238..cd3e8585ceb2 100644
--- a/net/unix/garbage.c
+++ b/net/unix/garbage.c
@@ -274,11 +274,20 @@ static void __unix_gc(struct work_struct *work)
 	 * receive queues.  Other, non candidate sockets _can_ be
 	 * added to queue, so we must make sure only to touch
 	 * candidates.
+	 *
+	 * Embryos, though never candidates themselves, affect which
+	 * candidates are reachable by the garbage collector.  Before
+	 * being added to a listener's queue, an embryo may already
+	 * receive data carrying SCM_RIGHTS, potentially making the
+	 * passed socket a candidate that is not yet reachable by the
+	 * collector.  It becomes reachable once the embryo is
+	 * enqueued.  Therefore, we must ensure that no SCM-laden
+	 * embryo appears in a (candidate) listener's queue between
+	 * consecutive scan_children() calls.
 	 */
 	list_for_each_entry_safe(u, next, &gc_inflight_list, link) {
-		long total_refs;
-
-		total_refs = file_count(u->sk.sk_socket->file);
+		struct sock *sk = &u->sk;
+		long total_refs = file_count(sk->sk_socket->file);
 
 		WARN_ON_ONCE(!u->inflight);
 		WARN_ON_ONCE(total_refs < u->inflight);
@@ -286,6 +295,11 @@ static void __unix_gc(struct work_struct *work)
 			list_move_tail(&u->link, &gc_candidates);
 			__set_bit(UNIX_GC_CANDIDATE, &u->gc_flags);
 			__set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags);
+
+			if (sk->sk_state == TCP_LISTEN) {
+				unix_state_lock(sk);
+				unix_state_unlock(sk);
+			}
 		}
 	}
 
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread
* Re: [PATCH net 1/2] af_unix: Fix garbage collector racing against connect() 2024-04-08 15:58 ` [PATCH net 1/2] af_unix: Fix garbage collector racing against connect() Michal Luczaj @ 2024-04-08 21:18 ` Kuniyuki Iwashima 2024-04-08 23:25 ` Michal Luczaj 0 siblings, 1 reply; 8+ messages in thread From: Kuniyuki Iwashima @ 2024-04-08 21:18 UTC (permalink / raw) To: mhal; +Cc: davem, edumazet, kuba, kuniyu, netdev, pabeni From: Michal Luczaj <mhal@rbox.co> Date: Mon, 8 Apr 2024 17:58:45 +0200 > Garbage collector does not take into account the risk of embryo getting > enqueued during the garbage collection. If such embryo has a peer that > carries SCM_RIGHTS, two consecutive passes of scan_children() may see a > different set of children. Leading to an incorrectly elevated inflight > count, and then a dangling pointer within the gc_inflight_list. > > sockets are AF_UNIX/SOCK_STREAM > S is an unconnected socket > L is a listening in-flight socket bound to addr, not in fdtable > V's fd will be passed via sendmsg(), gets inflight count bumped > > connect(S, addr) sendmsg(S, [V]); close(V) __unix_gc() > ---------------- ------------------------- ----------- > > NS = unix_create1() > skb1 = sock_wmalloc(NS) > L = unix_find_other(addr) > unix_state_lock(L) > unix_peer(S) = NS > // V count=1 inflight=0 > > NS = unix_peer(S) > skb2 = sock_alloc() > skb_queue_tail(NS, skb2[V]) > > // V became in-flight > // V count=2 inflight=1 > > close(V) > > // V count=1 inflight=1 > // GC candidate condition met > > for u in gc_inflight_list: > if (total_refs == inflight_refs) > add u to gc_candidates > > // gc_candidates={L, V} > > for u in gc_candidates: > scan_children(u, dec_inflight) > > // embryo (skb1) was not > // reachable from L yet, so V's > // inflight remains unchanged > __skb_queue_tail(L, skb1) > unix_state_unlock(L) > for u in gc_candidates: > if (u.inflight) > scan_children(u, inc_inflight_move_tail) > > // V count=1 inflight=2 (!) > > If there is a GC-candidate listening socket, lock/unlock its state. This > makes GC wait until the end of any ongoing connect() to that socket. After > flipping the lock, a possibly SCM-laden embryo is already enqueued. And if > there is another connect() coming, its embryo won't carry SCM_RIGHTS as we > already took the unix_gc_lock. > > Fixes: 1fd05ba5a2f2 ("[AF_UNIX]: Rewrite garbage collector, fixes race.") > Signed-off-by: Michal Luczaj <mhal@rbox.co> > --- > net/unix/garbage.c | 20 +++++++++++++++++--- > 1 file changed, 17 insertions(+), 3 deletions(-) > > diff --git a/net/unix/garbage.c b/net/unix/garbage.c > index fa39b6265238..cd3e8585ceb2 100644 > --- a/net/unix/garbage.c > +++ b/net/unix/garbage.c > @@ -274,11 +274,20 @@ static void __unix_gc(struct work_struct *work) > * receive queues. Other, non candidate sockets _can_ be > * added to queue, so we must make sure only to touch > * candidates. > + * > + * Embryos, though never candidates themselves, affect which > + * candidates are reachable by the garbage collector. Before > + * being added to a listener's queue, an embryo may already > + * receive data carrying SCM_RIGHTS, potentially making the > + * passed socket a candidate that is not yet reachable by the > + * collector. It becomes reachable once the embryo is > + * enqueued. Therefore, we must ensure that no SCM-laden > + * embryo appears in a (candidate) listener's queue between > + * consecutive scan_children() calls. 
> */ > list_for_each_entry_safe(u, next, &gc_inflight_list, link) { > - long total_refs; > - > - total_refs = file_count(u->sk.sk_socket->file); > + struct sock *sk = &u->sk; > + long total_refs = file_count(sk->sk_socket->file); > > WARN_ON_ONCE(!u->inflight); > WARN_ON_ONCE(total_refs < u->inflight); > @@ -286,6 +295,11 @@ static void __unix_gc(struct work_struct *work) > list_move_tail(&u->link, &gc_candidates); > __set_bit(UNIX_GC_CANDIDATE, &u->gc_flags); > __set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags); > + > + if (sk->sk_state == TCP_LISTEN) { > + unix_state_lock(sk); > + unix_state_unlock(sk); Less likely though, what if the same connect() happens after this ? connect(S, addr) sendmsg(S, [V]); close(V) __unix_gc() ---------------- ------------------------- ----------- NS = unix_create1() skb1 = sock_wmalloc(NS) L = unix_find_other(addr) for u in gc_inflight_list: if (total_refs == inflight_refs) add u to gc_candidates // L was already traversed // in a previous iteration. unix_state_lock(L) unix_peer(S) = NS // gc_candidates={L, V} for u in gc_candidates: scan_children(u, dec_inflight) // embryo (skb1) was not // reachable from L yet, so V's // inflight remains unchanged __skb_queue_tail(L, skb1) unix_state_unlock(L) for u in gc_candidates: if (u.inflight) scan_children(u, inc_inflight_move_tail) // V count=1 inflight=2 (!) As you pointed out, this GC's assumption is basically wrong; the GC works correctly only when the set of traversed sockets does not change over 3 scan_children() calls. That's why I reworked the GC not to rely on receive queue. https://lore.kernel.org/netdev/20240325202425.60930-1-kuniyu@amazon.com/ > + } > } > } > > -- > 2.44.0 > ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH net 1/2] af_unix: Fix garbage collector racing against connect() 2024-04-08 21:18 ` Kuniyuki Iwashima @ 2024-04-08 23:25 ` Michal Luczaj 2024-04-09 0:22 ` Kuniyuki Iwashima 0 siblings, 1 reply; 8+ messages in thread From: Michal Luczaj @ 2024-04-08 23:25 UTC (permalink / raw) To: Kuniyuki Iwashima; +Cc: davem, edumazet, kuba, netdev, pabeni On 4/8/24 23:18, Kuniyuki Iwashima wrote: > From: Michal Luczaj <mhal@rbox.co> > Date: Mon, 8 Apr 2024 17:58:45 +0200 ... >> list_for_each_entry_safe(u, next, &gc_inflight_list, link) { >> - long total_refs; >> - >> - total_refs = file_count(u->sk.sk_socket->file); >> + struct sock *sk = &u->sk; >> + long total_refs = file_count(sk->sk_socket->file); >> >> WARN_ON_ONCE(!u->inflight); >> WARN_ON_ONCE(total_refs < u->inflight); >> @@ -286,6 +295,11 @@ static void __unix_gc(struct work_struct *work) >> list_move_tail(&u->link, &gc_candidates); >> __set_bit(UNIX_GC_CANDIDATE, &u->gc_flags); >> __set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags); >> + >> + if (sk->sk_state == TCP_LISTEN) { >> + unix_state_lock(sk); >> + unix_state_unlock(sk); > > Less likely though, what if the same connect() happens after this ? > > connect(S, addr) sendmsg(S, [V]); close(V) __unix_gc() > ---------------- ------------------------- ----------- > NS = unix_create1() > skb1 = sock_wmalloc(NS) > L = unix_find_other(addr) > for u in gc_inflight_list: > if (total_refs == inflight_refs) > add u to gc_candidates > // L was already traversed > // in a previous iteration. > unix_state_lock(L) > unix_peer(S) = NS > > // gc_candidates={L, V} > > for u in gc_candidates: > scan_children(u, dec_inflight) > > // embryo (skb1) was not > // reachable from L yet, so V's > // inflight remains unchanged > __skb_queue_tail(L, skb1) > unix_state_unlock(L) > for u in gc_candidates: > if (u.inflight) > scan_children(u, inc_inflight_move_tail) > > // V count=1 inflight=2 (!) If I understand your question, in this case L's queue technically does change between scan_children()s: embryo appears, but that's meaningless. __unix_gc() already holds unix_gc_lock, so the enqueued embryo can not carry any SCM_RIGHTS (i.e. it doesn't affect the inflight graph). Note that unix_inflight() takes the same unix_gc_lock. Is there something I'm missing? > As you pointed out, this GC's assumption is basically wrong; the GC > works correctly only when the set of traversed sockets does not change > over 3 scan_children() calls. > > That's why I reworked the GC not to rely on receive queue. > https://lore.kernel.org/netdev/20240325202425.60930-1-kuniyu@amazon.com/ Right, I'll try to get my head around your series :) ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH net 1/2] af_unix: Fix garbage collector racing against connect() 2024-04-08 23:25 ` Michal Luczaj @ 2024-04-09 0:22 ` Kuniyuki Iwashima 2024-04-09 9:16 ` Michal Luczaj 0 siblings, 1 reply; 8+ messages in thread From: Kuniyuki Iwashima @ 2024-04-09 0:22 UTC (permalink / raw) To: mhal; +Cc: davem, edumazet, kuba, kuniyu, netdev, pabeni From: Michal Luczaj <mhal@rbox.co> Date: Tue, 9 Apr 2024 01:25:23 +0200 > On 4/8/24 23:18, Kuniyuki Iwashima wrote: > > From: Michal Luczaj <mhal@rbox.co> > > Date: Mon, 8 Apr 2024 17:58:45 +0200 > ... > >> list_for_each_entry_safe(u, next, &gc_inflight_list, link) { Please move sk declaration here and > >> - long total_refs; > >> - > >> - total_refs = file_count(u->sk.sk_socket->file); keep these 3 lines for reverse xmax tree order. > >> + struct sock *sk = &u->sk; > >> + long total_refs = file_count(sk->sk_socket->file); > >> > >> WARN_ON_ONCE(!u->inflight); > >> WARN_ON_ONCE(total_refs < u->inflight); > >> @@ -286,6 +295,11 @@ static void __unix_gc(struct work_struct *work) > >> list_move_tail(&u->link, &gc_candidates); > >> __set_bit(UNIX_GC_CANDIDATE, &u->gc_flags); > >> __set_bit(UNIX_GC_MAYBE_CYCLE, &u->gc_flags); > >> + > >> + if (sk->sk_state == TCP_LISTEN) { > >> + unix_state_lock(sk); > >> + unix_state_unlock(sk); > > > > Less likely though, what if the same connect() happens after this ? > > > > connect(S, addr) sendmsg(S, [V]); close(V) __unix_gc() > > ---------------- ------------------------- ----------- > > NS = unix_create1() > > skb1 = sock_wmalloc(NS) > > L = unix_find_other(addr) > > for u in gc_inflight_list: > > if (total_refs == inflight_refs) > > add u to gc_candidates > > // L was already traversed > > // in a previous iteration. > > unix_state_lock(L) > > unix_peer(S) = NS > > > > // gc_candidates={L, V} > > > > for u in gc_candidates: > > scan_children(u, dec_inflight) > > > > // embryo (skb1) was not > > // reachable from L yet, so V's > > // inflight remains unchanged > > __skb_queue_tail(L, skb1) > > unix_state_unlock(L) > > for u in gc_candidates: > > if (u.inflight) > > scan_children(u, inc_inflight_move_tail) > > > > // V count=1 inflight=2 (!) > > If I understand your question, in this case L's queue technically does change > between scan_children()s: embryo appears, but that's meaningless. __unix_gc() > already holds unix_gc_lock, so the enqueued embryo can not carry any SCM_RIGHTS > (i.e. it doesn't affect the inflight graph). Note that unix_inflight() takes the > same unix_gc_lock. > > Is there something I'm missing? Ah exactly, you are right. Could you repost this patch only with my comment above addressed ? Thanks! ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH net 1/2] af_unix: Fix garbage collector racing against connect() 2024-04-09 0:22 ` Kuniyuki Iwashima @ 2024-04-09 9:16 ` Michal Luczaj 2024-04-09 19:02 ` Kuniyuki Iwashima 0 siblings, 1 reply; 8+ messages in thread From: Michal Luczaj @ 2024-04-09 9:16 UTC (permalink / raw) To: Kuniyuki Iwashima; +Cc: davem, edumazet, kuba, netdev, pabeni On 4/9/24 02:22, Kuniyuki Iwashima wrote: > From: Michal Luczaj <mhal@rbox.co> > Date: Tue, 9 Apr 2024 01:25:23 +0200 >> On 4/8/24 23:18, Kuniyuki Iwashima wrote: >>> From: Michal Luczaj <mhal@rbox.co> >>> Date: Mon, 8 Apr 2024 17:58:45 +0200 >> ... >>>> list_for_each_entry_safe(u, next, &gc_inflight_list, link) { > > Please move sk declaration here and > >>>> - long total_refs; >>>> - >>>> - total_refs = file_count(u->sk.sk_socket->file); > > keep these 3 lines for reverse xmax tree order. Tricky to have them all 3 in reverse xmax. Did you mean struct sock *sk = &u->sk; long total_refs; total_refs = file_count(sk->sk_socket->file); ? >>> connect(S, addr) sendmsg(S, [V]); close(V) __unix_gc() >>> ---------------- ------------------------- ----------- >>> NS = unix_create1() >>> skb1 = sock_wmalloc(NS) >>> L = unix_find_other(addr) >>> for u in gc_inflight_list: >>> if (total_refs == inflight_refs) >>> add u to gc_candidates >>> // L was already traversed >>> // in a previous iteration. >>> unix_state_lock(L) >>> unix_peer(S) = NS >>> >>> // gc_candidates={L, V} >>> >>> for u in gc_candidates: >>> scan_children(u, dec_inflight) >>> >>> // embryo (skb1) was not >>> // reachable from L yet, so V's >>> // inflight remains unchanged >>> __skb_queue_tail(L, skb1) >>> unix_state_unlock(L) >>> for u in gc_candidates: >>> if (u.inflight) >>> scan_children(u, inc_inflight_move_tail) >>> >>> // V count=1 inflight=2 (!) >> >> If I understand your question, in this case L's queue technically does change >> between scan_children()s: embryo appears, but that's meaningless. __unix_gc() >> already holds unix_gc_lock, so the enqueued embryo can not carry any SCM_RIGHTS >> (i.e. it doesn't affect the inflight graph). Note that unix_inflight() takes the >> same unix_gc_lock. >> >> Is there something I'm missing? > > Ah exactly, you are right. > > Could you repost this patch only with my comment above addressed ? Yeah, sure. One question though: what I wrote above is basically a rephrasing of the commit message: (...) After flipping the lock, a possibly SCM-laden embryo is already enqueued. And if there is another connect() coming, its embryo won't carry SCM_RIGHTS as we already took the unix_gc_lock. As I understand, the important missing part was the clarification that embryo, even though enqueued after the lock flipping, won't affect the inflight graph, right? So how about: (...) After flipping the lock, a possibly SCM-laden embryo is already enqueued. And if there is another embryo coming, it can not possibly carry SCM_RIGHTS. At this point, unix_inflight() can not happen because unix_gc_lock is already taken. Inflight graph remains unaffected. Thanks! ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH net 1/2] af_unix: Fix garbage collector racing against connect() 2024-04-09 9:16 ` Michal Luczaj @ 2024-04-09 19:02 ` Kuniyuki Iwashima 0 siblings, 0 replies; 8+ messages in thread From: Kuniyuki Iwashima @ 2024-04-09 19:02 UTC (permalink / raw) To: mhal; +Cc: davem, edumazet, kuba, kuniyu, netdev, pabeni From: Michal Luczaj <mhal@rbox.co> Date: Tue, 9 Apr 2024 11:16:35 +0200 > On 4/9/24 02:22, Kuniyuki Iwashima wrote: > > From: Michal Luczaj <mhal@rbox.co> > > Date: Tue, 9 Apr 2024 01:25:23 +0200 > >> On 4/8/24 23:18, Kuniyuki Iwashima wrote: > >>> From: Michal Luczaj <mhal@rbox.co> > >>> Date: Mon, 8 Apr 2024 17:58:45 +0200 > >> ... > >>>> list_for_each_entry_safe(u, next, &gc_inflight_list, link) { > > > > Please move sk declaration here and > > > >>>> - long total_refs; > >>>> - > >>>> - total_refs = file_count(u->sk.sk_socket->file); > > > > keep these 3 lines for reverse xmax tree order. > > Tricky to have them all 3 in reverse xmax. Did you mean > > struct sock *sk = &u->sk; > long total_refs; > > total_refs = file_count(sk->sk_socket->file); > > ? Yes, it's netdev convention. https://docs.kernel.org/process/maintainer-netdev.html#local-variable-ordering-reverse-xmas-tree-rcs > > >>> connect(S, addr) sendmsg(S, [V]); close(V) __unix_gc() > >>> ---------------- ------------------------- ----------- > >>> NS = unix_create1() > >>> skb1 = sock_wmalloc(NS) > >>> L = unix_find_other(addr) > >>> for u in gc_inflight_list: > >>> if (total_refs == inflight_refs) > >>> add u to gc_candidates > >>> // L was already traversed > >>> // in a previous iteration. > >>> unix_state_lock(L) > >>> unix_peer(S) = NS > >>> > >>> // gc_candidates={L, V} > >>> > >>> for u in gc_candidates: > >>> scan_children(u, dec_inflight) > >>> > >>> // embryo (skb1) was not > >>> // reachable from L yet, so V's > >>> // inflight remains unchanged > >>> __skb_queue_tail(L, skb1) > >>> unix_state_unlock(L) > >>> for u in gc_candidates: > >>> if (u.inflight) > >>> scan_children(u, inc_inflight_move_tail) > >>> > >>> // V count=1 inflight=2 (!) > >> > >> If I understand your question, in this case L's queue technically does change > >> between scan_children()s: embryo appears, but that's meaningless. __unix_gc() > >> already holds unix_gc_lock, so the enqueued embryo can not carry any SCM_RIGHTS > >> (i.e. it doesn't affect the inflight graph). Note that unix_inflight() takes the > >> same unix_gc_lock. > >> > >> Is there something I'm missing? > > > > Ah exactly, you are right. > > > > Could you repost this patch only with my comment above addressed ? > > Yeah, sure. One question though: what I wrote above is basically a rephrasing of > the commit message: > > (...) After flipping the lock, a possibly SCM-laden embryo is already > enqueued. And if there is another connect() coming, its embryo won't > carry SCM_RIGHTS as we already took the unix_gc_lock. > > As I understand, the important missing part was the clarification that embryo, > even though enqueued after the lock flipping, won't affect the inflight graph, > right? So how about: > > (...) After flipping the lock, a possibly SCM-laden embryo is already > enqueued. And if there is another embryo coming, it can not possibly carry > SCM_RIGHTS. At this point, unix_inflight() can not happen because > unix_gc_lock is already taken. Inflight graph remains unaffected. Sounds good to me. Thanks! ^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH net 2/2] af_unix: Add GC race reproducer + slow down unix_stream_connect() 2024-04-08 15:58 [PATCH net 0/2] af_unix: Garbage collector vs connect() race condition Michal Luczaj 2024-04-08 15:58 ` [PATCH net 1/2] af_unix: Fix garbage collector racing against connect() Michal Luczaj @ 2024-04-08 15:58 ` Michal Luczaj 1 sibling, 0 replies; 8+ messages in thread From: Michal Luczaj @ 2024-04-08 15:58 UTC (permalink / raw) To: netdev; +Cc: davem, edumazet, kuba, pabeni, kuniyu, Michal Luczaj Attempt to crash kernel racing unix socket garbage collector. Signed-off-by: Michal Luczaj <mhal@rbox.co> --- net/unix/af_unix.c | 2 + tools/testing/selftests/net/af_unix/Makefile | 2 +- .../selftests/net/af_unix/gc_vs_connect.c | 158 ++++++++++++++++++ 3 files changed, 161 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/net/af_unix/gc_vs_connect.c diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c index 5b41e2321209..8e56a094dc80 100644 --- a/net/unix/af_unix.c +++ b/net/unix/af_unix.c @@ -1636,6 +1636,8 @@ static int unix_stream_connect(struct socket *sock, struct sockaddr *uaddr, unix_state_unlock(sk); + mdelay(1); + /* take ten and send info to listening sock */ spin_lock(&other->sk_receive_queue.lock); __skb_queue_tail(&other->sk_receive_queue, skb); diff --git a/tools/testing/selftests/net/af_unix/Makefile b/tools/testing/selftests/net/af_unix/Makefile index 221c387a7d7f..3b12a9290e06 100644 --- a/tools/testing/selftests/net/af_unix/Makefile +++ b/tools/testing/selftests/net/af_unix/Makefile @@ -1,4 +1,4 @@ CFLAGS += $(KHDR_INCLUDES) -TEST_GEN_PROGS := diag_uid test_unix_oob unix_connect scm_pidfd +TEST_GEN_PROGS := diag_uid test_unix_oob unix_connect scm_pidfd gc_vs_connect include ../../lib.mk diff --git a/tools/testing/selftests/net/af_unix/gc_vs_connect.c b/tools/testing/selftests/net/af_unix/gc_vs_connect.c new file mode 100644 index 000000000000..8b724f1616dd --- /dev/null +++ b/tools/testing/selftests/net/af_unix/gc_vs_connect.c @@ -0,0 +1,158 @@ +// SPDX-License-Identifier: GPL-2.0 + +#define _GNU_SOURCE +#include <stdio.h> +#include <errno.h> +#include <string.h> +#include <assert.h> +#include <unistd.h> +#include <stdlib.h> +#include <pthread.h> +#include <sys/un.h> +#include <sys/socket.h> + +#define SOCK_TYPE SOCK_STREAM /* or SOCK_SEQPACKET */ + +union { + char buf[CMSG_SPACE(sizeof(int))]; + struct cmsghdr align; +} cbuf; + +struct iovec io = { + .iov_base = (char[1]) {0}, + .iov_len = 1 +}; + +struct msghdr msg = { + .msg_iov = &io, + .msg_iovlen = 1, + .msg_control = cbuf.buf, + .msg_controllen = sizeof(cbuf.buf) +}; + +pthread_barrier_t barr; +struct sockaddr_un saddr; +int salen, client; + +static void barrier(void) +{ + int ret = pthread_barrier_wait(&barr); + + assert(!ret || ret == PTHREAD_BARRIER_SERIAL_THREAD); +} + +static int socket_unix(void) +{ + int sock = socket(AF_UNIX, SOCK_TYPE, 0); + + assert(sock != -1); + return sock; +} + +static int recv_fd(int socket) +{ + struct cmsghdr *cmsg; + int ret, fd; + + ret = recvmsg(socket, &msg, 0); + assert(ret == 1 && !(msg.msg_flags & MSG_CTRUNC)); + + cmsg = CMSG_FIRSTHDR(&msg); + assert(cmsg); + memcpy(&fd, CMSG_DATA(cmsg), sizeof(fd)); + assert(fd >= 0); + + return fd; +} + +static void send_fd(int socket, int fd) +{ + struct cmsghdr *cmsg; + int ret; + + cmsg = CMSG_FIRSTHDR(&msg); + assert(cmsg); + cmsg->cmsg_level = SOL_SOCKET; + cmsg->cmsg_type = SCM_RIGHTS; + cmsg->cmsg_len = CMSG_LEN(sizeof(fd)); + + memcpy(CMSG_DATA(cmsg), &fd, sizeof(fd)); + + do { + ret = sendmsg(socket, &msg, 0); + 
assert(ret == 1 || (ret == -1 && errno == ENOTCONN)); + } while (ret != 1); +} + +static void *racer_connect(void *arg) +{ + for (;;) { + int ret; + + barrier(); + ret = connect(client, (struct sockaddr *)&saddr, salen); + assert(!ret); + barrier(); + } + + return NULL; +} + +static void *racer_gc(void *arg) +{ + for (;;) + close(socket_unix()); /* trigger GC */ + + return NULL; +} + +int main(void) +{ + pthread_t thread_conn, thread_gc; + int ret, pair[2]; + + printf("running\n"); + + ret = pthread_barrier_init(&barr, NULL, 2); + assert(!ret); + + ret = pthread_create(&thread_conn, NULL, racer_connect, NULL); + assert(!ret); + + ret = pthread_create(&thread_gc, NULL, racer_gc, NULL); + assert(!ret); + + ret = socketpair(AF_UNIX, SOCK_TYPE, 0, pair); + assert(!ret); + + saddr.sun_family = AF_UNIX; + salen = sizeof(saddr.sun_family) + + sprintf(saddr.sun_path, "%c/unix-gc-%d", '\0', getpid()); + + for (;;) { + int server, victim; + + server = socket_unix(); + ret = bind(server, (struct sockaddr *)&saddr, salen); + assert(!ret); + ret = listen(server, -1); + assert(!ret); + + send_fd(pair[0], server); + close(server); + + client = socket_unix(); + victim = socket_unix(); + + barrier(); + send_fd(client, victim); + close(victim); + barrier(); + + server = recv_fd(pair[1]); + close(client); + close(server); + } + + return 0; +} -- 2.44.0 ^ permalink raw reply related [flat|nested] 8+ messages in thread