netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH bpf 0/3] bpf, sockmap: Fix the element replace
@ 2024-12-02 11:29 Michal Luczaj
  2024-12-02 11:29 ` [PATCH bpf 1/3] bpf, sockmap: Fix update element with same Michal Luczaj
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Michal Luczaj @ 2024-12-02 11:29 UTC (permalink / raw)
  To: Andrii Nakryiko, Eduard Zingerman, Mykola Lysenko,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan, Jakub Sitnicki, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman
  Cc: bpf, linux-kselftest, netdev, Michal Luczaj

Series takes care of two issues with sockmap update: inconsistent behaviour
after update with same, and race/refcount imbalance on element replace.

I am hesitant if patch 3/3 ("bpf, sockmap: Fix race between element replace
and close()") is the right approach. I might have missed some detail of the
current __sock_map_delete() implementation. I'd be grateful for comments,
thanks.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
Michal Luczaj (3):
      bpf, sockmap: Fix update element with same
      selftest/bpf: Extend test for sockmap update with same
      bpf, sockmap: Fix race between element replace and close()

 net/core/sock_map.c                                    | 6 +++---
 tools/testing/selftests/bpf/prog_tests/sockmap_basic.c | 8 +++++---
 2 files changed, 8 insertions(+), 6 deletions(-)
---
base-commit: 537a2525eaf76ea9b0dca62b994500d8670b39d5
change-id: 20241201-sockmap-replace-67c7077f3a31

Best regards,
-- 
Michal Luczaj <mhal@rbox.co>


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH bpf 1/3] bpf, sockmap: Fix update element with same
  2024-12-02 11:29 [PATCH bpf 0/3] bpf, sockmap: Fix the element replace Michal Luczaj
@ 2024-12-02 11:29 ` Michal Luczaj
  2024-12-09  5:47   ` John Fastabend
  2024-12-02 11:29 ` [PATCH bpf 2/3] selftest/bpf: Extend test for sockmap update " Michal Luczaj
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: Michal Luczaj @ 2024-12-02 11:29 UTC (permalink / raw)
  To: Andrii Nakryiko, Eduard Zingerman, Mykola Lysenko,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan, Jakub Sitnicki, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman
  Cc: bpf, linux-kselftest, netdev, Michal Luczaj

Consider a sockmap entry being updated with the same socket:

	osk = stab->sks[idx];
	sock_map_add_link(psock, link, map, &stab->sks[idx]);
	stab->sks[idx] = sk;
	if (osk)
		sock_map_unref(osk, &stab->sks[idx]);

Due to sock_map_unref(), which invokes sock_map_del_link(), all the psock's
links for stab->sks[idx] are torn:

	list_for_each_entry_safe(link, tmp, &psock->link, list) {
		if (link->link_raw == link_raw) {
			...
			list_del(&link->list);
			sk_psock_free_link(link);
		}
	}

And that includes the new link sock_map_add_link() added just before the
unref.

This results in a sockmap holding a socket, but without the respective
link. This in turn means that close(sock) won't trigger the cleanup, i.e. a
closed socket will not be automatically removed from the sockmap.

Stop tearing the links when a matching link_raw is found.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
 net/core/sock_map.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 78347d7d25ef31525f8ec0a755a18e5793ad92c0..20b348b1964a10a1b0bfbe1a90a4a4cd99715b81 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -159,6 +159,7 @@ static void sock_map_del_link(struct sock *sk,
 				verdict_stop = true;
 			list_del(&link->list);
 			sk_psock_free_link(link);
+			break;
 		}
 	}
 	spin_unlock_bh(&psock->link_lock);

-- 
2.46.2


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH bpf 2/3] selftest/bpf: Extend test for sockmap update with same
  2024-12-02 11:29 [PATCH bpf 0/3] bpf, sockmap: Fix the element replace Michal Luczaj
  2024-12-02 11:29 ` [PATCH bpf 1/3] bpf, sockmap: Fix update element with same Michal Luczaj
@ 2024-12-02 11:29 ` Michal Luczaj
  2024-12-09  5:50   ` John Fastabend
  2024-12-02 11:29 ` [PATCH bpf 3/3] bpf, sockmap: Fix race between element replace and close() Michal Luczaj
  2024-12-10 16:50 ` [PATCH bpf 0/3] bpf, sockmap: Fix the element replace patchwork-bot+netdevbpf
  3 siblings, 1 reply; 9+ messages in thread
From: Michal Luczaj @ 2024-12-02 11:29 UTC (permalink / raw)
  To: Andrii Nakryiko, Eduard Zingerman, Mykola Lysenko,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan, Jakub Sitnicki, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman
  Cc: bpf, linux-kselftest, netdev, Michal Luczaj

Verify that the sockmap link was not severed, and socket's entry is indeed
removed from the map when the corresponding descriptor gets closed.

Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
 tools/testing/selftests/bpf/prog_tests/sockmap_basic.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
index fdff0652d7ef5cdd4892af8c2c83cbf18cbf163f..248754296d972286e45d79331e95a8a6ae824590 100644
--- a/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
+++ b/tools/testing/selftests/bpf/prog_tests/sockmap_basic.c
@@ -934,8 +934,10 @@ static void test_sockmap_same_sock(void)
 
 	err = socketpair(AF_UNIX, SOCK_STREAM, 0, stream);
 	ASSERT_OK(err, "socketpair(af_unix, sock_stream)");
-	if (err)
+	if (err) {
+		close(tcp);
 		goto out;
+	}
 
 	for (i = 0; i < 2; i++) {
 		err = bpf_map_update_elem(map, &zero, &stream[0], BPF_ANY);
@@ -954,14 +956,14 @@ static void test_sockmap_same_sock(void)
 		ASSERT_OK(err, "bpf_map_update_elem(tcp)");
 	}
 
+	close(tcp);
 	err = bpf_map_delete_elem(map, &zero);
-	ASSERT_OK(err, "bpf_map_delete_elem(entry)");
+	ASSERT_ERR(err, "bpf_map_delete_elem(entry)");
 
 	close(stream[0]);
 	close(stream[1]);
 out:
 	close(dgram);
-	close(tcp);
 	close(udp);
 	test_sockmap_pass_prog__destroy(skel);
 }

-- 
2.46.2


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH bpf 3/3] bpf, sockmap: Fix race between element replace and close()
  2024-12-02 11:29 [PATCH bpf 0/3] bpf, sockmap: Fix the element replace Michal Luczaj
  2024-12-02 11:29 ` [PATCH bpf 1/3] bpf, sockmap: Fix update element with same Michal Luczaj
  2024-12-02 11:29 ` [PATCH bpf 2/3] selftest/bpf: Extend test for sockmap update " Michal Luczaj
@ 2024-12-02 11:29 ` Michal Luczaj
  2024-12-09  6:11   ` John Fastabend
  2024-12-10 16:50 ` [PATCH bpf 0/3] bpf, sockmap: Fix the element replace patchwork-bot+netdevbpf
  3 siblings, 1 reply; 9+ messages in thread
From: Michal Luczaj @ 2024-12-02 11:29 UTC (permalink / raw)
  To: Andrii Nakryiko, Eduard Zingerman, Mykola Lysenko,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan, Jakub Sitnicki, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman
  Cc: bpf, linux-kselftest, netdev, Michal Luczaj

Element replace (with a socket different from the one stored) may race with
socket's close() link popping & unlinking. __sock_map_delete()
unconditionally unrefs the (wrong) element:

// set map[0] = s0
map_update_elem(map, 0, s0)

// drop fd of s0
close(s0)
  sock_map_close()
    lock_sock(sk)               (s0!)
    sock_map_remove_links(sk)
      link = sk_psock_link_pop()
      sock_map_unlink(sk, link)
        sock_map_delete_from_link
                                        // replace map[0] with s1
                                        map_update_elem(map, 0, s1)
                                          sock_map_update_elem
                                (s1!)       lock_sock(sk)
                                            sock_map_update_common
                                              psock = sk_psock(sk)
                                              spin_lock(&stab->lock)
                                              osk = stab->sks[idx]
                                              sock_map_add_link(..., &stab->sks[idx])
                                              sock_map_unref(osk, &stab->sks[idx])
                                                psock = sk_psock(osk)
                                                sk_psock_put(sk, psock)
                                                  if (refcount_dec_and_test(&psock))
                                                    sk_psock_drop(sk, psock)
                                              spin_unlock(&stab->lock)
                                            unlock_sock(sk)
          __sock_map_delete
            spin_lock(&stab->lock)
            sk = *psk                        // s1 replaced s0; sk == s1
            if (!sk_test || sk_test == sk)   // sk_test (s0) != sk (s1); no branch
              sk = xchg(psk, NULL)
            if (sk)
              sock_map_unref(sk, psk)        // unref s1; sks[idx] will dangle
                psock = sk_psock(sk)
                sk_psock_put(sk, psock)
                  if (refcount_dec_and_test())
                    sk_psock_drop(sk, psock)
            spin_unlock(&stab->lock)
    release_sock(sk)

Then close(map) enqueues bpf_map_free_deferred, which finally calls
sock_map_free(). This results in some refcount_t warnings along with a
KASAN splat[1].

Fix __sock_map_delete(), do not allow sock_map_unref() on elements that may
have been replaced.

[1]:
BUG: KASAN: slab-use-after-free in sock_map_free+0x10e/0x330
Write of size 4 at addr ffff88811f5b9100 by task kworker/u64:12/1063

CPU: 14 UID: 0 PID: 1063 Comm: kworker/u64:12 Not tainted 6.12.0+ #125
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
Workqueue: events_unbound bpf_map_free_deferred
Call Trace:
 <TASK>
 dump_stack_lvl+0x68/0x90
 print_report+0x174/0x4f6
 kasan_report+0xb9/0x190
 kasan_check_range+0x10f/0x1e0
 sock_map_free+0x10e/0x330
 bpf_map_free_deferred+0x173/0x320
 process_one_work+0x846/0x1420
 worker_thread+0x5b3/0xf80
 kthread+0x29e/0x360
 ret_from_fork+0x2d/0x70
 ret_from_fork_asm+0x1a/0x30
 </TASK>

Allocated by task 1202:
 kasan_save_stack+0x1e/0x40
 kasan_save_track+0x10/0x30
 __kasan_slab_alloc+0x85/0x90
 kmem_cache_alloc_noprof+0x131/0x450
 sk_prot_alloc+0x5b/0x220
 sk_alloc+0x2c/0x870
 unix_create1+0x88/0x8a0
 unix_create+0xc5/0x180
 __sock_create+0x241/0x650
 __sys_socketpair+0x1ce/0x420
 __x64_sys_socketpair+0x92/0x100
 do_syscall_64+0x93/0x180
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Freed by task 46:
 kasan_save_stack+0x1e/0x40
 kasan_save_track+0x10/0x30
 kasan_save_free_info+0x37/0x60
 __kasan_slab_free+0x4b/0x70
 kmem_cache_free+0x1a1/0x590
 __sk_destruct+0x388/0x5a0
 sk_psock_destroy+0x73e/0xa50
 process_one_work+0x846/0x1420
 worker_thread+0x5b3/0xf80
 kthread+0x29e/0x360
 ret_from_fork+0x2d/0x70
 ret_from_fork_asm+0x1a/0x30

The buggy address belongs to the object at ffff88811f5b9080
 which belongs to the cache UNIX-STREAM of size 1984
The buggy address is located 128 bytes inside of
 freed 1984-byte region [ffff88811f5b9080, ffff88811f5b9840)

The buggy address belongs to the physical page:
page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11f5b8
head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0
memcg:ffff888127d49401
flags: 0x17ffffc0000040(head|node=0|zone=2|lastcpupid=0x1fffff)
page_type: f5(slab)
raw: 0017ffffc0000040 ffff8881042e4500 dead000000000122 0000000000000000
raw: 0000000000000000 00000000800f000f 00000001f5000000 ffff888127d49401
head: 0017ffffc0000040 ffff8881042e4500 dead000000000122 0000000000000000
head: 0000000000000000 00000000800f000f 00000001f5000000 ffff888127d49401
head: 0017ffffc0000003 ffffea00047d6e01 ffffffffffffffff 0000000000000000
head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff88811f5b9000: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff88811f5b9080: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff88811f5b9100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                   ^
 ffff88811f5b9180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff88811f5b9200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Disabling lock debugging due to kernel taint

refcount_t: addition on 0; use-after-free.
WARNING: CPU: 14 PID: 1063 at lib/refcount.c:25 refcount_warn_saturate+0xce/0x150
CPU: 14 UID: 0 PID: 1063 Comm: kworker/u64:12 Tainted: G    B              6.12.0+ #125
Tainted: [B]=BAD_PAGE
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
Workqueue: events_unbound bpf_map_free_deferred
RIP: 0010:refcount_warn_saturate+0xce/0x150
Code: 34 73 eb 03 01 e8 82 53 ad fe 0f 0b eb b1 80 3d 27 73 eb 03 00 75 a8 48 c7 c7 80 bd 95 84 c6 05 17 73 eb 03 01 e8 62 53 ad fe <0f> 0b eb 91 80 3d 06 73 eb 03 00 75 88 48 c7 c7 e0 bd 95 84 c6 05
RSP: 0018:ffff88815c49fc70 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff88811f5b9100 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000001
RBP: 0000000000000002 R08: 0000000000000001 R09: ffffed10bcde6349
R10: ffff8885e6f31a4b R11: 0000000000000000 R12: ffff88813be0b000
R13: ffff88811f5b9100 R14: ffff88811f5b9080 R15: ffff88813be0b024
FS:  0000000000000000(0000) GS:ffff8885e6f00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055dda99b0250 CR3: 000000015dbac000 CR4: 0000000000752ef0
PKRU: 55555554
Call Trace:
 <TASK>
 ? __warn.cold+0x5f/0x1ff
 ? refcount_warn_saturate+0xce/0x150
 ? report_bug+0x1ec/0x390
 ? handle_bug+0x58/0x90
 ? exc_invalid_op+0x13/0x40
 ? asm_exc_invalid_op+0x16/0x20
 ? refcount_warn_saturate+0xce/0x150
 sock_map_free+0x2e5/0x330
 bpf_map_free_deferred+0x173/0x320
 process_one_work+0x846/0x1420
 worker_thread+0x5b3/0xf80
 kthread+0x29e/0x360
 ret_from_fork+0x2d/0x70
 ret_from_fork_asm+0x1a/0x30
 </TASK>
irq event stamp: 10741
hardirqs last  enabled at (10741): [<ffffffff84400ec6>] asm_sysvec_apic_timer_interrupt+0x16/0x20
hardirqs last disabled at (10740): [<ffffffff811e532d>] handle_softirqs+0x60d/0x770
softirqs last  enabled at (10506): [<ffffffff811e55a9>] __irq_exit_rcu+0x109/0x210
softirqs last disabled at (10301): [<ffffffff811e55a9>] __irq_exit_rcu+0x109/0x210

refcount_t: underflow; use-after-free.
WARNING: CPU: 14 PID: 1063 at lib/refcount.c:28 refcount_warn_saturate+0xee/0x150
CPU: 14 UID: 0 PID: 1063 Comm: kworker/u64:12 Tainted: G    B   W          6.12.0+ #125
Tainted: [B]=BAD_PAGE, [W]=WARN
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
Workqueue: events_unbound bpf_map_free_deferred
RIP: 0010:refcount_warn_saturate+0xee/0x150
Code: 17 73 eb 03 01 e8 62 53 ad fe 0f 0b eb 91 80 3d 06 73 eb 03 00 75 88 48 c7 c7 e0 bd 95 84 c6 05 f6 72 eb 03 01 e8 42 53 ad fe <0f> 0b e9 6e ff ff ff 80 3d e6 72 eb 03 00 0f 85 61 ff ff ff 48 c7
RSP: 0018:ffff88815c49fc70 EFLAGS: 00010282
RAX: 0000000000000000 RBX: ffff88811f5b9100 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000001
RBP: 0000000000000003 R08: 0000000000000001 R09: ffffed10bcde6349
R10: ffff8885e6f31a4b R11: 0000000000000000 R12: ffff88813be0b000
R13: ffff88811f5b9100 R14: ffff88811f5b9080 R15: ffff88813be0b024
FS:  0000000000000000(0000) GS:ffff8885e6f00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055dda99b0250 CR3: 000000015dbac000 CR4: 0000000000752ef0
PKRU: 55555554
Call Trace:
 <TASK>
 ? __warn.cold+0x5f/0x1ff
 ? refcount_warn_saturate+0xee/0x150
 ? report_bug+0x1ec/0x390
 ? handle_bug+0x58/0x90
 ? exc_invalid_op+0x13/0x40
 ? asm_exc_invalid_op+0x16/0x20
 ? refcount_warn_saturate+0xee/0x150
 sock_map_free+0x2d3/0x330
 bpf_map_free_deferred+0x173/0x320
 process_one_work+0x846/0x1420
 worker_thread+0x5b3/0xf80
 kthread+0x29e/0x360
 ret_from_fork+0x2d/0x70
 ret_from_fork_asm+0x1a/0x30
 </TASK>
irq event stamp: 10741
hardirqs last  enabled at (10741): [<ffffffff84400ec6>] asm_sysvec_apic_timer_interrupt+0x16/0x20
hardirqs last disabled at (10740): [<ffffffff811e532d>] handle_softirqs+0x60d/0x770
softirqs last  enabled at (10506): [<ffffffff811e55a9>] __irq_exit_rcu+0x109/0x210
softirqs last disabled at (10301): [<ffffffff811e55a9>] __irq_exit_rcu+0x109/0x210

Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Michal Luczaj <mhal@rbox.co>
---
 net/core/sock_map.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/core/sock_map.c b/net/core/sock_map.c
index 20b348b1964a10a1b0bfbe1a90a4a4cd99715b81..f1b9b3958792cd599efcb591742874e9b3f4a76b 100644
--- a/net/core/sock_map.c
+++ b/net/core/sock_map.c
@@ -412,12 +412,11 @@ static void *sock_map_lookup_sys(struct bpf_map *map, void *key)
 static int __sock_map_delete(struct bpf_stab *stab, struct sock *sk_test,
 			     struct sock **psk)
 {
-	struct sock *sk;
+	struct sock *sk = NULL;
 	int err = 0;
 
 	spin_lock_bh(&stab->lock);
-	sk = *psk;
-	if (!sk_test || sk_test == sk)
+	if (!sk_test || sk_test == *psk)
 		sk = xchg(psk, NULL);
 
 	if (likely(sk))

-- 
2.46.2


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* RE: [PATCH bpf 1/3] bpf, sockmap: Fix update element with same
  2024-12-02 11:29 ` [PATCH bpf 1/3] bpf, sockmap: Fix update element with same Michal Luczaj
@ 2024-12-09  5:47   ` John Fastabend
  2024-12-09  9:54     ` Michal Luczaj
  0 siblings, 1 reply; 9+ messages in thread
From: John Fastabend @ 2024-12-09  5:47 UTC (permalink / raw)
  To: Michal Luczaj, Andrii Nakryiko, Eduard Zingerman, Mykola Lysenko,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan, Jakub Sitnicki, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman
  Cc: bpf, linux-kselftest, netdev, Michal Luczaj

Michal Luczaj wrote:
> Consider a sockmap entry being updated with the same socket:
> 
> 	osk = stab->sks[idx];
> 	sock_map_add_link(psock, link, map, &stab->sks[idx]);
> 	stab->sks[idx] = sk;
> 	if (osk)
> 		sock_map_unref(osk, &stab->sks[idx]);
> 
> Due to sock_map_unref(), which invokes sock_map_del_link(), all the psock's
> links for stab->sks[idx] are torn:
> 
> 	list_for_each_entry_safe(link, tmp, &psock->link, list) {
> 		if (link->link_raw == link_raw) {
> 			...
> 			list_del(&link->list);
> 			sk_psock_free_link(link);
> 		}
> 	}
> 
> And that includes the new link sock_map_add_link() added just before the
> unref.
> 
> This results in a sockmap holding a socket, but without the respective
> link. This in turn means that close(sock) won't trigger the cleanup, i.e. a
> closed socket will not be automatically removed from the sockmap.
> 
> Stop tearing the links when a matching link_raw is found.
> 
> Signed-off-by: Michal Luczaj <mhal@rbox.co>
> ---

Thanks. LGTM.

Reviewed-by: John Fastabend <john.fastabend@gmail.com>

>  net/core/sock_map.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index 78347d7d25ef31525f8ec0a755a18e5793ad92c0..20b348b1964a10a1b0bfbe1a90a4a4cd99715b81 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -159,6 +159,7 @@ static void sock_map_del_link(struct sock *sk,
>  				verdict_stop = true;
>  			list_del(&link->list);
>  			sk_psock_free_link(link);
> +			break;
>  		}
>  	}
>  	spin_unlock_bh(&psock->link_lock);
> 
> -- 
> 2.46.2
> 



^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH bpf 2/3] selftest/bpf: Extend test for sockmap update with same
  2024-12-02 11:29 ` [PATCH bpf 2/3] selftest/bpf: Extend test for sockmap update " Michal Luczaj
@ 2024-12-09  5:50   ` John Fastabend
  0 siblings, 0 replies; 9+ messages in thread
From: John Fastabend @ 2024-12-09  5:50 UTC (permalink / raw)
  To: Michal Luczaj, Andrii Nakryiko, Eduard Zingerman, Mykola Lysenko,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan, Jakub Sitnicki, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman
  Cc: bpf, linux-kselftest, netdev, Michal Luczaj

Michal Luczaj wrote:
> Verify that the sockmap link was not severed, and socket's entry is indeed
> removed from the map when the corresponding descriptor gets closed.
> 
> Signed-off-by: Michal Luczaj <mhal@rbox.co>
> ---

Reviewed-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH bpf 3/3] bpf, sockmap: Fix race between element replace and close()
  2024-12-02 11:29 ` [PATCH bpf 3/3] bpf, sockmap: Fix race between element replace and close() Michal Luczaj
@ 2024-12-09  6:11   ` John Fastabend
  0 siblings, 0 replies; 9+ messages in thread
From: John Fastabend @ 2024-12-09  6:11 UTC (permalink / raw)
  To: Michal Luczaj, Andrii Nakryiko, Eduard Zingerman, Mykola Lysenko,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Shuah Khan, Jakub Sitnicki, David S. Miller,
	Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman
  Cc: bpf, linux-kselftest, netdev, Michal Luczaj

Michal Luczaj wrote:
> Element replace (with a socket different from the one stored) may race with
> socket's close() link popping & unlinking. __sock_map_delete()
> unconditionally unrefs the (wrong) element:
> 
> // set map[0] = s0
> map_update_elem(map, 0, s0)
> 
> // drop fd of s0
> close(s0)
>   sock_map_close()
>     lock_sock(sk)               (s0!)
>     sock_map_remove_links(sk)
>       link = sk_psock_link_pop()
>       sock_map_unlink(sk, link)
>         sock_map_delete_from_link
>                                         // replace map[0] with s1
>                                         map_update_elem(map, 0, s1)
>                                           sock_map_update_elem
>                                 (s1!)       lock_sock(sk)
>                                             sock_map_update_common
>                                               psock = sk_psock(sk)
>                                               spin_lock(&stab->lock)
>                                               osk = stab->sks[idx]
>                                               sock_map_add_link(..., &stab->sks[idx])
>                                               sock_map_unref(osk, &stab->sks[idx])
>                                                 psock = sk_psock(osk)
>                                                 sk_psock_put(sk, psock)
>                                                   if (refcount_dec_and_test(&psock))
>                                                     sk_psock_drop(sk, psock)
>                                               spin_unlock(&stab->lock)
>                                             unlock_sock(sk)
>           __sock_map_delete
>             spin_lock(&stab->lock)
>             sk = *psk                        // s1 replaced s0; sk == s1
>             if (!sk_test || sk_test == sk)   // sk_test (s0) != sk (s1); no branch
>               sk = xchg(psk, NULL)
>             if (sk)
>               sock_map_unref(sk, psk)        // unref s1; sks[idx] will dangle
>                 psock = sk_psock(sk)
>                 sk_psock_put(sk, psock)
>                   if (refcount_dec_and_test())
>                     sk_psock_drop(sk, psock)
>             spin_unlock(&stab->lock)
>     release_sock(sk)
> 
> Then close(map) enqueues bpf_map_free_deferred, which finally calls
> sock_map_free(). This results in some refcount_t warnings along with a
> KASAN splat[1].
> 

[...]
 
> Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")
> Signed-off-by: Michal Luczaj <mhal@rbox.co>
> ---
>  net/core/sock_map.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
> index 20b348b1964a10a1b0bfbe1a90a4a4cd99715b81..f1b9b3958792cd599efcb591742874e9b3f4a76b 100644
> --- a/net/core/sock_map.c
> +++ b/net/core/sock_map.c
> @@ -412,12 +412,11 @@ static void *sock_map_lookup_sys(struct bpf_map *map, void *key)
>  static int __sock_map_delete(struct bpf_stab *stab, struct sock *sk_test,
>  			     struct sock **psk)
>  {
> -	struct sock *sk;
> +	struct sock *sk = NULL;
>  	int err = 0;
>  
>  	spin_lock_bh(&stab->lock);
> -	sk = *psk;
> -	if (!sk_test || sk_test == sk)
> +	if (!sk_test || sk_test == *psk)
>  		sk = xchg(psk, NULL);
>  
>  	if (likely(sk))
> 
> -- 
> 2.46.2
> 

Reviewed-by: John Fastabend <john.fastabend@gmail.com>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH bpf 1/3] bpf, sockmap: Fix update element with same
  2024-12-09  5:47   ` John Fastabend
@ 2024-12-09  9:54     ` Michal Luczaj
  0 siblings, 0 replies; 9+ messages in thread
From: Michal Luczaj @ 2024-12-09  9:54 UTC (permalink / raw)
  To: John Fastabend, Andrii Nakryiko, Eduard Zingerman, Mykola Lysenko,
	Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Song Liu,
	Yonghong Song, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Jakub Sitnicki, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Simon Horman
  Cc: bpf, linux-kselftest, netdev

On 12/9/24 06:47, John Fastabend wrote:
> Michal Luczaj wrote:
>> Consider a sockmap entry being updated with the same socket:
>>
>> 	osk = stab->sks[idx];
>> 	sock_map_add_link(psock, link, map, &stab->sks[idx]);
>> 	stab->sks[idx] = sk;
>> 	if (osk)
>> 		sock_map_unref(osk, &stab->sks[idx]);
>>
>> Due to sock_map_unref(), which invokes sock_map_del_link(), all the psock's
>> links for stab->sks[idx] are torn:
>>
>> 	list_for_each_entry_safe(link, tmp, &psock->link, list) {
>> 		if (link->link_raw == link_raw) {
>> 			...
>> 			list_del(&link->list);
>> 			sk_psock_free_link(link);
>> 		}
>> 	}
>>
>> And that includes the new link sock_map_add_link() added just before the
>> unref.
>>
>> This results in a sockmap holding a socket, but without the respective
>> link. This in turn means that close(sock) won't trigger the cleanup, i.e. a
>> closed socket will not be automatically removed from the sockmap.
>>
>> Stop tearing the links when a matching link_raw is found.
>>
>> Signed-off-by: Michal Luczaj <mhal@rbox.co>
>> ---
> 
> Thanks. LGTM.
> 
> Reviewed-by: John Fastabend <john.fastabend@gmail.com>

Thanks, and sorry for a missing tag:

Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface")


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH bpf 0/3] bpf, sockmap: Fix the element replace
  2024-12-02 11:29 [PATCH bpf 0/3] bpf, sockmap: Fix the element replace Michal Luczaj
                   ` (2 preceding siblings ...)
  2024-12-02 11:29 ` [PATCH bpf 3/3] bpf, sockmap: Fix race between element replace and close() Michal Luczaj
@ 2024-12-10 16:50 ` patchwork-bot+netdevbpf
  3 siblings, 0 replies; 9+ messages in thread
From: patchwork-bot+netdevbpf @ 2024-12-10 16:50 UTC (permalink / raw)
  To: Michal Luczaj
  Cc: andrii, eddyz87, mykolal, ast, daniel, martin.lau, song,
	yonghong.song, john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah,
	jakub, davem, edumazet, kuba, pabeni, horms, bpf, linux-kselftest,
	netdev

Hello:

This series was applied to bpf/bpf.git (master)
by Daniel Borkmann <daniel@iogearbox.net>:

On Mon, 02 Dec 2024 12:29:22 +0100 you wrote:
> Series takes care of two issues with sockmap update: inconsistent behaviour
> after update with same, and race/refcount imbalance on element replace.
> 
> I am hesitant if patch 3/3 ("bpf, sockmap: Fix race between element replace
> and close()") is the right approach. I might have missed some detail of the
> current __sock_map_delete() implementation. I'd be grateful for comments,
> thanks.
> 
> [...]

Here is the summary with links:
  - [bpf,1/3] bpf, sockmap: Fix update element with same
    https://git.kernel.org/bpf/bpf/c/75e072a390da
  - [bpf,2/3] selftest/bpf: Extend test for sockmap update with same
    https://git.kernel.org/bpf/bpf/c/11d5245f608f
  - [bpf,3/3] bpf, sockmap: Fix race between element replace and close()
    https://git.kernel.org/bpf/bpf/c/ed1fc5d76b81

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2024-12-10 16:50 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-12-02 11:29 [PATCH bpf 0/3] bpf, sockmap: Fix the element replace Michal Luczaj
2024-12-02 11:29 ` [PATCH bpf 1/3] bpf, sockmap: Fix update element with same Michal Luczaj
2024-12-09  5:47   ` John Fastabend
2024-12-09  9:54     ` Michal Luczaj
2024-12-02 11:29 ` [PATCH bpf 2/3] selftest/bpf: Extend test for sockmap update " Michal Luczaj
2024-12-09  5:50   ` John Fastabend
2024-12-02 11:29 ` [PATCH bpf 3/3] bpf, sockmap: Fix race between element replace and close() Michal Luczaj
2024-12-09  6:11   ` John Fastabend
2024-12-10 16:50 ` [PATCH bpf 0/3] bpf, sockmap: Fix the element replace patchwork-bot+netdevbpf

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).