* [PATCH net 1/2] netlink: terminate outstanding dump on socket close
@ 2024-11-05 1:03 Jakub Kicinski
2024-11-05 1:03 ` [PATCH net 2/2] selftests: net: add a test for closing a netlink socket ith dump in progress Jakub Kicinski
2024-11-05 5:31 ` [PATCH net 1/2] netlink: terminate outstanding dump on socket close Kuniyuki Iwashima
0 siblings, 2 replies; 6+ messages in thread
From: Jakub Kicinski @ 2024-11-05 1:03 UTC (permalink / raw)
To: davem; +Cc: netdev, edumazet, pabeni, johannes, pablo, Jakub Kicinski, syzbot
Netlink supports iterative dumping of data. It provides the families
the following ops:
- start - (optional) kicks off the dumping process
- dump - actual dump helper, keeps getting called until it returns 0
- done - (optional) pairs with .start, can be used for cleanup
The whole process is asynchronous and the repeated calls to .dump
don't actually happen in a tight loop, but rather are triggered
in response to recvmsg() on the socket.
This gives the user full control over the dump, but also means that
the user can close the socket without getting to the end of the dump.
To make sure .start is always paired with .done we check if there
is an ongoing dump before freeing the socket, and if so call .done.
The complication is that sockets can get freed from BH and .done
is allowed to sleep. So we use a workqueue to defer the call, when
needed.
Unfortunately this does not work correctly. What we defer is not
the cleanup but rather releasing a reference on the socket.
We have no guarantee that we own the last reference, if someone
else holds the socket they may release it in BH and we're back
to square one.
The whole dance, however, appears to be unnecessary. Only the user
can interact with dumps, so we can clean up when socket is closed.
And close always happens in process context. Some async code may
still access the socket after close, queue notification skbs to it etc.
but no dumps can start, end or otherwise make progress.
Delete the workqueue and flush the dump state directly from the release
handler. Note that further cleanup is possible in -next, for instance
we now always call .done before releasing the main module reference,
so dump doesn't have to take a reference of its own.
Reported-by: syzbot <syzkaller@googlegroups.com>
Fixes: ed5d7788a934 ("netlink: Do not schedule work from sk_destruct")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
net/netlink/af_netlink.c | 31 ++++++++-----------------------
net/netlink/af_netlink.h | 2 --
2 files changed, 8 insertions(+), 25 deletions(-)
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 0a9287fadb47..f84aad420d44 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -393,15 +393,6 @@ static void netlink_skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
static void netlink_sock_destruct(struct sock *sk)
{
- struct netlink_sock *nlk = nlk_sk(sk);
-
- if (nlk->cb_running) {
- if (nlk->cb.done)
- nlk->cb.done(&nlk->cb);
- module_put(nlk->cb.module);
- kfree_skb(nlk->cb.skb);
- }
-
skb_queue_purge(&sk->sk_receive_queue);
if (!sock_flag(sk, SOCK_DEAD)) {
@@ -414,14 +405,6 @@ static void netlink_sock_destruct(struct sock *sk)
WARN_ON(nlk_sk(sk)->groups);
}
-static void netlink_sock_destruct_work(struct work_struct *work)
-{
- struct netlink_sock *nlk = container_of(work, struct netlink_sock,
- work);
-
- sk_free(&nlk->sk);
-}
-
/* This lock without WQ_FLAG_EXCLUSIVE is good on UP and it is _very_ bad on
* SMP. Look, when several writers sleep and reader wakes them up, all but one
* immediately hit write lock and grab all the cpus. Exclusive sleep solves
@@ -731,12 +714,6 @@ static void deferred_put_nlk_sk(struct rcu_head *head)
if (!refcount_dec_and_test(&sk->sk_refcnt))
return;
- if (nlk->cb_running && nlk->cb.done) {
- INIT_WORK(&nlk->work, netlink_sock_destruct_work);
- schedule_work(&nlk->work);
- return;
- }
-
sk_free(sk);
}
@@ -788,6 +765,14 @@ static int netlink_release(struct socket *sock)
NETLINK_URELEASE, &n);
}
+ /* Terminate any outstanding dump */
+ if (nlk->cb_running) {
+ if (nlk->cb.done)
+ nlk->cb.done(&nlk->cb);
+ module_put(nlk->cb.module);
+ kfree_skb(nlk->cb.skb);
+ }
+
module_put(nlk->module);
if (netlink_is_kernel(sk)) {
diff --git a/net/netlink/af_netlink.h b/net/netlink/af_netlink.h
index 5b0e4e62ab8b..778a3809361f 100644
--- a/net/netlink/af_netlink.h
+++ b/net/netlink/af_netlink.h
@@ -4,7 +4,6 @@
#include <linux/rhashtable.h>
#include <linux/atomic.h>
-#include <linux/workqueue.h>
#include <net/sock.h>
/* flags */
@@ -50,7 +49,6 @@ struct netlink_sock {
struct rhash_head node;
struct rcu_head rcu;
- struct work_struct work;
};
static inline struct netlink_sock *nlk_sk(struct sock *sk)
--
2.47.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [PATCH net 2/2] selftests: net: add a test for closing a netlink socket ith dump in progress
2024-11-05 1:03 [PATCH net 1/2] netlink: terminate outstanding dump on socket close Jakub Kicinski
@ 2024-11-05 1:03 ` Jakub Kicinski
2024-11-05 6:21 ` Kuniyuki Iwashima
2024-11-05 5:31 ` [PATCH net 1/2] netlink: terminate outstanding dump on socket close Kuniyuki Iwashima
1 sibling, 1 reply; 6+ messages in thread
From: Jakub Kicinski @ 2024-11-05 1:03 UTC (permalink / raw)
To: davem; +Cc: netdev, edumazet, pabeni, johannes, pablo, Jakub Kicinski
Close a socket with dump in progress. We need a dump which generates
enough info not to fit into a single skb. Policy dump fits the bill.
Use the trick discovered by syzbot for keeping a ref on the socket
longer than just close, with mqueue.
TAP version 13
1..3
# Starting 3 tests from 1 test cases.
# RUN global.test_sanity ...
# OK global.test_sanity
ok 1 global.test_sanity
# RUN global.close_in_progress ...
# OK global.close_in_progress
ok 2 global.close_in_progress
# RUN global.close_with_ref ...
# OK global.close_with_ref
ok 3 global.close_with_ref
# PASSED: 3 / 3 tests passed.
# Totals: pass:3 fail:0 xfail:0 xpass:0 skip:0 error:0
Note that this test is not expected to fail but rather crash
the kernel if we get the cleanup wrong.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
tools/testing/selftests/net/Makefile | 1 +
tools/testing/selftests/net/netlink-dumps.c | 110 ++++++++++++++++++++
2 files changed, 111 insertions(+)
create mode 100644 tools/testing/selftests/net/netlink-dumps.c
diff --git a/tools/testing/selftests/net/Makefile b/tools/testing/selftests/net/Makefile
index 649f1fe0dc46..816447323b39 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -34,6 +34,7 @@ TEST_PROGS += gre_gso.sh
TEST_PROGS += cmsg_so_mark.sh
TEST_PROGS += cmsg_time.sh cmsg_ipv6.sh
TEST_PROGS += netns-name.sh
+TEST_PROGS += netlink-dumps
TEST_PROGS += nl_netdev.py
TEST_PROGS += srv6_end_dt46_l3vpn_test.sh
TEST_PROGS += srv6_end_dt4_l3vpn_test.sh
diff --git a/tools/testing/selftests/net/netlink-dumps.c b/tools/testing/selftests/net/netlink-dumps.c
new file mode 100644
index 000000000000..7ee6dcd334df
--- /dev/null
+++ b/tools/testing/selftests/net/netlink-dumps.c
@@ -0,0 +1,110 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+
+#include <fcntl.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#include <linux/genetlink.h>
+#include <linux/netlink.h>
+#include <linux/mqueue.h>
+
+#include "../kselftest_harness.h"
+
+static const struct {
+ struct nlmsghdr nlhdr;
+ struct genlmsghdr genlhdr;
+ struct nlattr ahdr;
+ __u16 val;
+ __u16 pad;
+} dump_policies = {
+ .nlhdr = {
+ .nlmsg_len = sizeof(dump_policies),
+ .nlmsg_type = GENL_ID_CTRL,
+ .nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK | NLM_F_DUMP,
+ .nlmsg_seq = 1,
+ },
+ .genlhdr = {
+ .cmd = CTRL_CMD_GETPOLICY,
+ .version = 2,
+ },
+ .ahdr = {
+ .nla_len = 6,
+ .nla_type = CTRL_ATTR_FAMILY_ID,
+ },
+ .val = GENL_ID_CTRL,
+ .pad = 0,
+};
+
+// Sanity check for the test itself, make sure the dump doesn't fit in one msg
+TEST(test_sanity)
+{
+ int netlink_sock;
+ char buf[8192];
+ ssize_t n;
+
+ netlink_sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
+ ASSERT_GE(netlink_sock, 0);
+
+ n = send(netlink_sock, &dump_policies, sizeof(dump_policies), 0);
+ ASSERT_EQ(n, sizeof(dump_policies));
+
+ n = recv(netlink_sock, buf, sizeof(buf), MSG_DONTWAIT);
+ ASSERT_GE(n, sizeof(struct nlmsghdr));
+
+ n = recv(netlink_sock, buf, sizeof(buf), MSG_DONTWAIT);
+ ASSERT_GE(n, sizeof(struct nlmsghdr));
+
+ close(netlink_sock);
+}
+
+TEST(close_in_progress)
+{
+ int netlink_sock;
+ ssize_t n;
+
+ netlink_sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
+ ASSERT_GE(netlink_sock, 0);
+
+ n = send(netlink_sock, &dump_policies, sizeof(dump_policies), 0);
+ ASSERT_EQ(n, sizeof(dump_policies));
+
+ close(netlink_sock);
+}
+
+TEST(close_with_ref)
+{
+ char cookie[NOTIFY_COOKIE_LEN] = {};
+ int netlink_sock, mq_fd;
+ struct sigevent sigev;
+ ssize_t n;
+
+ netlink_sock = socket(AF_NETLINK, SOCK_RAW, NETLINK_GENERIC);
+ ASSERT_GE(netlink_sock, 0);
+
+ n = send(netlink_sock, &dump_policies, sizeof(dump_policies), 0);
+ ASSERT_EQ(n, sizeof(dump_policies));
+
+ mq_fd = syscall(__NR_mq_open, "sed", O_CREAT | O_WRONLY, 0600, 0);
+ ASSERT_GE(mq_fd, 0);
+
+ memset(&sigev, 0, sizeof(sigev));
+ sigev.sigev_notify = SIGEV_THREAD;
+ sigev.sigev_value.sival_ptr = cookie;
+ sigev.sigev_signo = netlink_sock;
+
+ syscall(__NR_mq_notify, mq_fd, &sigev);
+
+ close(netlink_sock);
+
+ // give mqueue time to fire
+ usleep(100 * 1000);
+}
+
+TEST_HARNESS_MAIN
--
2.47.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH net 1/2] netlink: terminate outstanding dump on socket close
2024-11-05 1:03 [PATCH net 1/2] netlink: terminate outstanding dump on socket close Jakub Kicinski
2024-11-05 1:03 ` [PATCH net 2/2] selftests: net: add a test for closing a netlink socket ith dump in progress Jakub Kicinski
@ 2024-11-05 5:31 ` Kuniyuki Iwashima
2024-11-05 10:03 ` Eric Dumazet
2024-11-05 19:42 ` Jakub Kicinski
1 sibling, 2 replies; 6+ messages in thread
From: Kuniyuki Iwashima @ 2024-11-05 5:31 UTC (permalink / raw)
To: kuba; +Cc: davem, edumazet, johannes, netdev, pabeni, pablo, syzkaller,
kuniyu
From: Jakub Kicinski <kuba@kernel.org>
Date: Mon, 4 Nov 2024 17:03:46 -0800
> Netlink supports iterative dumping of data. It provides the families
> the following ops:
> - start - (optional) kicks off the dumping process
> - dump - actual dump helper, keeps getting called until it returns 0
> - done - (optional) pairs with .start, can be used for cleanup
> The whole process is asynchronous and the repeated calls to .dump
> don't actually happen in a tight loop, but rather are triggered
> in response to recvmsg() on the socket.
>
> This gives the user full control over the dump, but also means that
> the user can close the socket without getting to the end of the dump.
> To make sure .start is always paired with .done we check if there
> is an ongoing dump before freeing the socket, and if so call .done.
>
> The complication is that sockets can get freed from BH and .done
> is allowed to sleep. So we use a workqueue to defer the call, when
> needed.
>
> Unfortunately this does not work correctly. What we defer is not
> the cleanup but rather releasing a reference on the socket.
> We have no guarantee that we own the last reference, if someone
> else holds the socket they may release it in BH and we're back
> to square one.
>
> The whole dance, however, appears to be unnecessary. Only the user
> can interact with dumps, so we can clean up when socket is closed.
> And close always happens in process context. Some async code may
> still access the socket after close, queue notification skbs to it etc.
> but no dumps can start, end or otherwise make progress.
>
> Delete the workqueue and flush the dump state directly from the release
> handler. Note that further cleanup is possible in -next, for instance
> we now always call .done before releasing the main module reference,
> so dump doesn't have to take a reference of its own.
and we can remove netns & reftracker switching for kernel socket
>
> Reported-by: syzbot <syzkaller@googlegroups.com>
> Fixes: ed5d7788a934 ("netlink: Do not schedule work from sk_destruct")
Do you have a link to a public report ?
Previously syzkaller's author asked me to use a different name for
Reported-by not to confuse their internal metrics if the report is
generated by local syzkaller.
Reported-by: syzkaller <syzkaller@googlegroups.com>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Change itself looks good to me
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH net 2/2] selftests: net: add a test for closing a netlink socket ith dump in progress
2024-11-05 1:03 ` [PATCH net 2/2] selftests: net: add a test for closing a netlink socket ith dump in progress Jakub Kicinski
@ 2024-11-05 6:21 ` Kuniyuki Iwashima
0 siblings, 0 replies; 6+ messages in thread
From: Kuniyuki Iwashima @ 2024-11-05 6:21 UTC (permalink / raw)
To: kuba; +Cc: davem, edumazet, johannes, netdev, pabeni, pablo, kuniyu
From: Jakub Kicinski <kuba@kernel.org>
Date: Mon, 4 Nov 2024 17:03:47 -0800
> Close a socket with dump in progress. We need a dump which generates
> enough info not to fit into a single skb. Policy dump fits the bill.
>
> Use the trick discovered by syzbot for keeping a ref on the socket
> longer than just close, with mqueue.
>
> TAP version 13
> 1..3
> # Starting 3 tests from 1 test cases.
> # RUN global.test_sanity ...
> # OK global.test_sanity
> ok 1 global.test_sanity
> # RUN global.close_in_progress ...
> # OK global.close_in_progress
> ok 2 global.close_in_progress
> # RUN global.close_with_ref ...
> # OK global.close_with_ref
> ok 3 global.close_with_ref
> # PASSED: 3 / 3 tests passed.
> # Totals: pass:3 fail:0 xfail:0 xpass:0 skip:0 error:0
>
> Note that this test is not expected to fail but rather crash
> the kernel if we get the cleanup wrong.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
TIL mq_notify, and the trick was interesting that calls fdget(),
netlink_getsockbyfilp(), and fdput() to keep sock_hold() only :)
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH net 1/2] netlink: terminate outstanding dump on socket close
2024-11-05 5:31 ` [PATCH net 1/2] netlink: terminate outstanding dump on socket close Kuniyuki Iwashima
@ 2024-11-05 10:03 ` Eric Dumazet
2024-11-05 19:42 ` Jakub Kicinski
1 sibling, 0 replies; 6+ messages in thread
From: Eric Dumazet @ 2024-11-05 10:03 UTC (permalink / raw)
To: Kuniyuki Iwashima; +Cc: kuba, davem, johannes, netdev, pabeni, pablo, syzkaller
On Tue, Nov 5, 2024 at 6:31 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: Jakub Kicinski <kuba@kernel.org>
> Date: Mon, 4 Nov 2024 17:03:46 -0800
> > Netlink supports iterative dumping of data. It provides the families
> > the following ops:
> > - start - (optional) kicks off the dumping process
> > - dump - actual dump helper, keeps getting called until it returns 0
> > - done - (optional) pairs with .start, can be used for cleanup
> > The whole process is asynchronous and the repeated calls to .dump
> > don't actually happen in a tight loop, but rather are triggered
> > in response to recvmsg() on the socket.
> >
> > This gives the user full control over the dump, but also means that
> > the user can close the socket without getting to the end of the dump.
> > To make sure .start is always paired with .done we check if there
> > is an ongoing dump before freeing the socket, and if so call .done.
> >
> > The complication is that sockets can get freed from BH and .done
> > is allowed to sleep. So we use a workqueue to defer the call, when
> > needed.
> >
> > Unfortunately this does not work correctly. What we defer is not
> > the cleanup but rather releasing a reference on the socket.
> > We have no guarantee that we own the last reference, if someone
> > else holds the socket they may release it in BH and we're back
> > to square one.
> >
> > The whole dance, however, appears to be unnecessary. Only the user
> > can interact with dumps, so we can clean up when socket is closed.
> > And close always happens in process context. Some async code may
> > still access the socket after close, queue notification skbs to it etc.
> > but no dumps can start, end or otherwise make progress.
> >
> > Delete the workqueue and flush the dump state directly from the release
> > handler. Note that further cleanup is possible in -next, for instance
> > we now always call .done before releasing the main module reference,
> > so dump doesn't have to take a reference of its own.
>
> and we can remove netns & reftracker switching for kernel socket
>
>
> >
> > Reported-by: syzbot <syzkaller@googlegroups.com>
> > Fixes: ed5d7788a934 ("netlink: Do not schedule work from sk_destruct")
>
> Do you have a link to a public report ?
I only had reports for old kernels (6.1 stable), but the repro seems
to work on current kernel.
>
> Previously syzkaller's author asked me to use a different name for
> Reported-by not to confuse their internal metrics if the report is
> generated by local syzkaller.
I definitely have upstream reports (latest tree) but no repro yet.
I can release them, but IMO this would add noise to the mailing lists,
already flooded with such reports.
Reviewed-by: Eric Dumazet <edumazet@google.com>
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH net 1/2] netlink: terminate outstanding dump on socket close
2024-11-05 5:31 ` [PATCH net 1/2] netlink: terminate outstanding dump on socket close Kuniyuki Iwashima
2024-11-05 10:03 ` Eric Dumazet
@ 2024-11-05 19:42 ` Jakub Kicinski
1 sibling, 0 replies; 6+ messages in thread
From: Jakub Kicinski @ 2024-11-05 19:42 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: davem, edumazet, johannes, netdev, pabeni, pablo, syzkaller
On Mon, 4 Nov 2024 21:31:15 -0800 Kuniyuki Iwashima wrote:
> > Reported-by: syzbot <syzkaller@googlegroups.com>
> > Fixes: ed5d7788a934 ("netlink: Do not schedule work from sk_destruct")
>
> Do you have a link to a public report ?
>
> Previously syzkaller's author asked me to use a different name for
> Reported-by not to confuse their internal metrics if the report is
> generated by local syzkaller.
>
> Reported-by: syzkaller <syzkaller@googlegroups.com>
Ah, will use that, thanks.
I need to repost, anyway, because I added the test to the wrong
Makefile group :(
--
pw-bot: cr
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2024-11-05 19:42 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-05 1:03 [PATCH net 1/2] netlink: terminate outstanding dump on socket close Jakub Kicinski
2024-11-05 1:03 ` [PATCH net 2/2] selftests: net: add a test for closing a netlink socket ith dump in progress Jakub Kicinski
2024-11-05 6:21 ` Kuniyuki Iwashima
2024-11-05 5:31 ` [PATCH net 1/2] netlink: terminate outstanding dump on socket close Kuniyuki Iwashima
2024-11-05 10:03 ` Eric Dumazet
2024-11-05 19:42 ` Jakub Kicinski
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).