* [PATCH net v2 0/5] Make TCP-MD5-diag slightly less broken
@ 2024-11-13 18:46 Dmitry Safonov via B4 Relay
2024-11-13 18:46 ` [PATCH net v2 1/5] net/diag: Do not race on dumping MD5 keys with adding new MD5 keys Dmitry Safonov via B4 Relay
` (7 more replies)
0 siblings, 8 replies; 23+ messages in thread
From: Dmitry Safonov via B4 Relay @ 2024-11-13 18:46 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, David Ahern, Ivan Delalande, Matthieu Baerts,
Mat Martineau, Geliang Tang, Boris Pismenny, John Fastabend,
Davide Caratti, Kuniyuki Iwashima
Cc: netdev, linux-kernel, mptcp, Dmitry Safonov, stable
Changes in v2:
- Fixup for uninitilized md5sig_count stack variable
(Oops! Kudos to kernel test robot <lkp@intel.com>)
- Correct space damage, add a missing Fixes tag &
reformat tcp_ulp_ops_size() (Kuniyuki Iwashima)
- Take out patch for maximum attribute length, see (4) below.
Going to send it later with the next TCP-AO-diag part
(Kuniyuki Iwashima)
- Link to v1: https://lore.kernel.org/r/20241106-tcp-md5-diag-prep-v1-0-d62debf3dded@gmail.com
My original intent was to replace the last non-upstream Arista's TCP-AO
piece. That is per-netns procfs seqfile which lists AO keys. In my view
an acceptable upstream alternative would be TCP-AO-diag uAPI.
So, I started by looking and reviewing TCP-MD5-diag code. And straight
away I saw a bunch of issues:
1. Similarly to TCP_MD5SIG_EXT, which doesn't check tcpm_flags for
unknown flags and so being non-extendable setsockopt(), the same
way tcp_diag_put_md5sig() dumps md5 keys in an array of
tcp_diag_md5sig, which makes it ABI non-extendable structure
as userspace can't tolerate any new members in it.
2. Inet-diag allocates netlink message for sockets in
inet_diag_dump_one_icsk(), which uses a TCP-diag callback
.idiag_get_aux_size(), that pre-calculates the needed space for
TCP-diag related information. But as neither socket lock nor
rcu_readlock() are held between allocation and the actual TCP
info filling, the TCP-related space requirement may change before
reaching tcp_diag_put_md5sig(). I.e., the number of TCP-MD5 keys on
a socket. Thankfully, TCP-MD5-diag won't overwrite the skb, but will
return EMSGSIZE, triggering WARN_ON() in inet_diag_dump_one_icsk().
3. Inet-diag "do" request* can create skb of any message required size.
But "dump" request* the skb size, since d35c99ff77ec ("netlink: do
not enter direct reclaim from netlink_dump()") is limited by
32 KB. Having in mind that sizeof(struct tcp_diag_md5sig) = 100 bytes,
dumps for sockets that have more than 327 keys are going to fail
(not counting other diag infos, which lower this limit futher).
That is much lower than the number of TCP-MD5 keys that can be
allocated on a socket with the current default
optmem_max limit (128Kb).
So, then I went and written selftests for TCP-MD5-diag and besides
confirming that (2) and (3) are not theoretical issues, I also
discovered another issues, that I didn't notice on code inspection:
4. nlattr::nla_len is __u16, which limits the largest netlink attibute
by 64Kb or by 655 tcp_diag_md5sig keys in the diag array. What
happens de-facto is that the netlink attribute gets u16 overflow,
breaking the userspace parsing - RTA_NEXT(), that should point
to the next attribute, points into the middle of md5 keys array.
In this patch set issues (2) and (4) are addressed.
(2) by not returning EMSGSIZE when the dump raced with modifying
TCP-MD5 keys on a socket, but mark the dump inconsistent by setting
NLM_F_DUMP_INTR nlmsg flag. Which changes uAPI in situations where
previously kernel did WARN() and errored the dump.
(4) by artificially limiting the maximum attribute size by U16_MAX - 1.
In order to remove the new limit from (4) solution, my plan is to
convert the dump of TCP-MD5 keys from an array to
NL_ATTR_TYPE_NESTED_ARRAY (or alike), which should also address (1).
And for (3), it's needed to teach tcp-diag how-to remember not only
socket on which previous recvmsg() stopped, but potentially TCP-MD5
key as well.
I plan in the next part of patch set address (3), (1) and the new limit
for (4), together with adding new TCP-AO-diag.
* Terminology from Documentation/userspace-api/netlink/intro.rst
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
---
Dmitry Safonov (5):
net/diag: Do not race on dumping MD5 keys with adding new MD5 keys
net/diag: Warn only once on EMSGSIZE
net/diag: Pre-allocate optional info only if requested
net/diag: Always pre-allocate tcp_ulp info
net/netlink: Correct the comment on netlink message max cap
include/linux/inet_diag.h | 3 +-
include/net/tcp.h | 1 -
net/ipv4/inet_diag.c | 89 ++++++++++++++++++++++++++++++++++++++---------
net/ipv4/tcp_diag.c | 68 ++++++++++++++++++------------------
net/mptcp/diag.c | 20 -----------
net/netlink/af_netlink.c | 4 +--
net/tls/tls_main.c | 17 ---------
7 files changed, 110 insertions(+), 92 deletions(-)
---
base-commit: f1b785f4c7870c42330b35522c2514e39a1e28e7
change-id: 20241106-tcp-md5-diag-prep-2f0dcf371d90
Best regards,
--
Dmitry Safonov <0x7f454c46@gmail.com>
^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH net v2 1/5] net/diag: Do not race on dumping MD5 keys with adding new MD5 keys
2024-11-13 18:46 [PATCH net v2 0/5] Make TCP-MD5-diag slightly less broken Dmitry Safonov via B4 Relay
@ 2024-11-13 18:46 ` Dmitry Safonov via B4 Relay
2024-11-13 18:46 ` [PATCH net v2 2/5] net/diag: Warn only once on EMSGSIZE Dmitry Safonov via B4 Relay
` (6 subsequent siblings)
7 siblings, 0 replies; 23+ messages in thread
From: Dmitry Safonov via B4 Relay @ 2024-11-13 18:46 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, David Ahern, Ivan Delalande, Matthieu Baerts,
Mat Martineau, Geliang Tang, Boris Pismenny, John Fastabend,
Davide Caratti, Kuniyuki Iwashima
Cc: netdev, linux-kernel, mptcp, Dmitry Safonov, stable
From: Dmitry Safonov <0x7f454c46@gmail.com>
Inet-diag has two modes: (1) dumping information for a specific socket,
for which kernel creates one netlink message with the information and
(2) dumping information for multiple sockets (possibly with a filter),
where for the reply kernel sends many messages, one for each matched
socket.
Currently those two modes work differently as the information about
a specific socket is never split between multiple messages. For (2),
multi-socket dump for the reply kernel allocates up to 32Kb skb and
fills that with as many socket dumps as possible. For (1), one-socket
dump kernel pre-calculates the required space for the reply, allocates
a new skb and nlmsg and only then starts filling the socket's details.
Preallocating the needed size quite makes sense as most of the details
are fix-sized and provided for each socket, see inet_sk_attr_size().
But there's an exception: .idiag_get_aux_size() which is optional for
a socket. This is provided only for TCP sockets by tcp_diag.
For TCP-MD5 it calculates the memory needed to fill an array of
(struct tcp_diag_md5sig). The issue here is that the amount of keys may
change in inet_diag_dump_one_icsk() between inet_sk_attr_size() and
sk_diag_fill() calls. As the code expects fix-sized information on any
socket, it considers sk_diag_fill() failures by -EMSGSIZE reason as
a bug, resulting in such WARN_ON():
[] ------------[ cut here ]------------
[] WARNING: CPU: 7 PID: 17420 at net/ipv4/inet_diag.c:586 inet_diag_dump_one_icsk+0x3c8/0x420
[] CPU: 7 UID: 0 PID: 17420 Comm: diag_ipv4 Tainted: G W 6.11.0-rc6-00022-gc9fd7a9f9aca-dirty #2
[] pc : inet_diag_dump_one_icsk+0x3c8/0x420
[] lr : inet_diag_dump_one_icsk+0x1d4/0x420
[] sp : ffff8000aef87460
...
[] Call trace:
[] inet_diag_dump_one_icsk+0x3c8/0x420
[] tcp_diag_dump_one+0xa0/0xf0
[] inet_diag_cmd_exact+0x234/0x278
[] inet_diag_handler_cmd+0x16c/0x288
[] sock_diag_rcv_msg+0x1a8/0x550
[] netlink_rcv_skb+0x198/0x378
[] sock_diag_rcv+0x20/0x48
[] netlink_unicast+0x400/0x6a8
[] netlink_sendmsg+0x654/0xa58
[] __sys_sendto+0x1ec/0x330
[] __arm64_sys_sendto+0xc8/0x168
...
[] ---[ end trace 0000000000000000 ]---
One way to solve it would be to grab lock_sock() in
inet_diag_dump_one_icsk(), but that may be costly and bring new lock
dependencies. The alternative is to call tcp_diag_put_md5sig() as
the last attribute of the netlink message and calculate how much space
left after all previous attributes filled and translate it into
(struct tcp_diag_md5sig)-sized units. If it turns out that there's not
enough space for all TCP-MD5 keys, mark the dump as inconsistent by
setting NLM_F_DUMP_INTR flag. Userspace may figure out that dumping
raced with the socket properties change and retry again.
Currently it may be unexpected by userspace that netlink message for one
socket may be inconsistent, but I believe we're on a safe side from
breaking userspace as previously dump would fail and an ugly WARN was
produced in dmesg. IOW, it is a clear improvement.
This is not a theoretical issue: I've written a test and that reproduces
the issue I suspected (the backtrace above).
Fixes: c03fa9bcacd9 ("tcp_diag: report TCP MD5 signing keys and addresses")
Cc: stable@vger.kernel.org
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
---
include/linux/inet_diag.h | 3 ++-
net/ipv4/inet_diag.c | 8 +++----
net/ipv4/tcp_diag.c | 55 ++++++++++++++++++++++++++++-------------------
3 files changed, 39 insertions(+), 27 deletions(-)
diff --git a/include/linux/inet_diag.h b/include/linux/inet_diag.h
index a9033696b0aad36ab9abd47e4b68e272053019d7..cb2ba672eba131986d0432dd628fc42bbf800886 100644
--- a/include/linux/inet_diag.h
+++ b/include/linux/inet_diag.h
@@ -22,7 +22,8 @@ struct inet_diag_handler {
int (*idiag_get_aux)(struct sock *sk,
bool net_admin,
- struct sk_buff *skb);
+ struct sk_buff *skb,
+ struct nlmsghdr *nlh);
size_t (*idiag_get_aux_size)(struct sock *sk,
bool net_admin);
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 67639309163d05c034fad80fc9a6096c3b79d42f..67b9cc4c0e47a596a4d588e793b7f13ee040a1e3 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -350,10 +350,6 @@ int inet_sk_diag_fill(struct sock *sk, struct inet_connection_sock *icsk,
handler->idiag_get_info(sk, r, info);
- if (ext & (1 << (INET_DIAG_INFO - 1)) && handler->idiag_get_aux)
- if (handler->idiag_get_aux(sk, net_admin, skb) < 0)
- goto errout;
-
if (sk->sk_state < TCP_TIME_WAIT) {
union tcp_cc_info info;
size_t sz = 0;
@@ -368,6 +364,10 @@ int inet_sk_diag_fill(struct sock *sk, struct inet_connection_sock *icsk,
goto errout;
}
+ if (ext & (1 << (INET_DIAG_INFO - 1)) && handler->idiag_get_aux)
+ if (handler->idiag_get_aux(sk, net_admin, skb, nlh) < 0)
+ goto errout;
+
/* Keep it at the end for potential retry with a larger skb,
* or else do best-effort fitting, which is only done for the
* first_nlmsg.
diff --git a/net/ipv4/tcp_diag.c b/net/ipv4/tcp_diag.c
index f428ecf9120f2f596e1d67db2b2a0d0d0e211905..d752dc5de3536303aeb075c10fbdc2c9fc417cd5 100644
--- a/net/ipv4/tcp_diag.c
+++ b/net/ipv4/tcp_diag.c
@@ -53,29 +53,39 @@ static void tcp_diag_md5sig_fill(struct tcp_diag_md5sig *info,
}
static int tcp_diag_put_md5sig(struct sk_buff *skb,
- const struct tcp_md5sig_info *md5sig)
+ const struct tcp_md5sig_info *md5sig,
+ struct nlmsghdr *nlh)
{
+ size_t key_size = sizeof(struct tcp_diag_md5sig);
+ unsigned int attrlen, md5sig_count = 0;
const struct tcp_md5sig_key *key;
struct tcp_diag_md5sig *info;
struct nlattr *attr;
- int md5sig_count = 0;
+ /*
+ * Userspace doesn't like to see zero-filled key-values, so
+ * allocating too large attribute is bad.
+ */
hlist_for_each_entry_rcu(key, &md5sig->head, node)
md5sig_count++;
if (md5sig_count == 0)
return 0;
- attr = nla_reserve(skb, INET_DIAG_MD5SIG,
- md5sig_count * sizeof(struct tcp_diag_md5sig));
+ attrlen = skb_availroom(skb) - NLA_HDRLEN;
+ md5sig_count = min(md5sig_count, attrlen / key_size);
+ attr = nla_reserve(skb, INET_DIAG_MD5SIG, md5sig_count * key_size);
if (!attr)
return -EMSGSIZE;
info = nla_data(attr);
- memset(info, 0, md5sig_count * sizeof(struct tcp_diag_md5sig));
+ memset(info, 0, md5sig_count * key_size);
hlist_for_each_entry_rcu(key, &md5sig->head, node) {
- tcp_diag_md5sig_fill(info++, key);
- if (--md5sig_count == 0)
+ /* More keys on a socket than pre-allocated space available */
+ if (md5sig_count-- == 0) {
+ nlh->nlmsg_flags |= NLM_F_DUMP_INTR;
break;
+ }
+ tcp_diag_md5sig_fill(info++, key);
}
return 0;
@@ -110,25 +120,11 @@ static int tcp_diag_put_ulp(struct sk_buff *skb, struct sock *sk,
}
static int tcp_diag_get_aux(struct sock *sk, bool net_admin,
- struct sk_buff *skb)
+ struct sk_buff *skb, struct nlmsghdr *nlh)
{
struct inet_connection_sock *icsk = inet_csk(sk);
int err = 0;
-#ifdef CONFIG_TCP_MD5SIG
- if (net_admin) {
- struct tcp_md5sig_info *md5sig;
-
- rcu_read_lock();
- md5sig = rcu_dereference(tcp_sk(sk)->md5sig_info);
- if (md5sig)
- err = tcp_diag_put_md5sig(skb, md5sig);
- rcu_read_unlock();
- if (err < 0)
- return err;
- }
-#endif
-
if (net_admin) {
const struct tcp_ulp_ops *ulp_ops;
@@ -138,6 +134,21 @@ static int tcp_diag_get_aux(struct sock *sk, bool net_admin,
if (err)
return err;
}
+
+#ifdef CONFIG_TCP_MD5SIG
+ if (net_admin) {
+ struct tcp_md5sig_info *md5sig;
+
+ rcu_read_lock();
+ md5sig = rcu_dereference(tcp_sk(sk)->md5sig_info);
+ if (md5sig)
+ err = tcp_diag_put_md5sig(skb, md5sig, nlh);
+ rcu_read_unlock();
+ if (err < 0)
+ return err;
+ }
+#endif
+
return 0;
}
--
2.42.2
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH net v2 2/5] net/diag: Warn only once on EMSGSIZE
2024-11-13 18:46 [PATCH net v2 0/5] Make TCP-MD5-diag slightly less broken Dmitry Safonov via B4 Relay
2024-11-13 18:46 ` [PATCH net v2 1/5] net/diag: Do not race on dumping MD5 keys with adding new MD5 keys Dmitry Safonov via B4 Relay
@ 2024-11-13 18:46 ` Dmitry Safonov via B4 Relay
2024-11-13 18:46 ` [PATCH net v2 3/5] net/diag: Pre-allocate optional info only if requested Dmitry Safonov via B4 Relay
` (5 subsequent siblings)
7 siblings, 0 replies; 23+ messages in thread
From: Dmitry Safonov via B4 Relay @ 2024-11-13 18:46 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, David Ahern, Ivan Delalande, Matthieu Baerts,
Mat Martineau, Geliang Tang, Boris Pismenny, John Fastabend,
Davide Caratti, Kuniyuki Iwashima
Cc: netdev, linux-kernel, mptcp, Dmitry Safonov
From: Dmitry Safonov <0x7f454c46@gmail.com>
The code clearly expects that the pre-allocated skb will be enough for
the netlink reply message. But if in an unbelievable situation there is
a kernel issue and sk_diag_fill() fails with -EMSGSIZE, this WARN_ON()
can be triggered from userspace. That aggravates the issue from KASLR
leak into possible DOS vector. Use WARN_ON_ONCE() which is clearly
enough to provide an information on a kernel issue.
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
---
net/ipv4/inet_diag.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 67b9cc4c0e47a596a4d588e793b7f13ee040a1e3..ca9a7e61d8d7de80cb234c45c41d6357fde50c11 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -583,7 +583,7 @@ int inet_diag_dump_one_icsk(struct inet_hashinfo *hashinfo,
err = sk_diag_fill(sk, rep, cb, req, 0, net_admin);
if (err < 0) {
- WARN_ON(err == -EMSGSIZE);
+ WARN_ON_ONCE(err == -EMSGSIZE);
nlmsg_free(rep);
goto out;
}
--
2.42.2
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH net v2 3/5] net/diag: Pre-allocate optional info only if requested
2024-11-13 18:46 [PATCH net v2 0/5] Make TCP-MD5-diag slightly less broken Dmitry Safonov via B4 Relay
2024-11-13 18:46 ` [PATCH net v2 1/5] net/diag: Do not race on dumping MD5 keys with adding new MD5 keys Dmitry Safonov via B4 Relay
2024-11-13 18:46 ` [PATCH net v2 2/5] net/diag: Warn only once on EMSGSIZE Dmitry Safonov via B4 Relay
@ 2024-11-13 18:46 ` Dmitry Safonov via B4 Relay
2024-11-13 18:46 ` [PATCH net v2 4/5] net/diag: Always pre-allocate tcp_ulp info Dmitry Safonov via B4 Relay
` (4 subsequent siblings)
7 siblings, 0 replies; 23+ messages in thread
From: Dmitry Safonov via B4 Relay @ 2024-11-13 18:46 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, David Ahern, Ivan Delalande, Matthieu Baerts,
Mat Martineau, Geliang Tang, Boris Pismenny, John Fastabend,
Davide Caratti, Kuniyuki Iwashima
Cc: netdev, linux-kernel, mptcp, Dmitry Safonov
From: Dmitry Safonov <0x7f454c46@gmail.com>
Those INET_DIAG_* flags from req->idiag_ext are provided by the
userspace, so they are not going to change during one socket dump.
This is going to save just nits and bits for typical netlink reply,
which I'm going to utilise in the very next patch by always allocating
tls_get_info_size().
It's possible to save even some more by checking the request in
inet_diag_msg_attrs_size(), but that's being on very stingy side.
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
---
net/ipv4/inet_diag.c | 29 ++++++++++++++++++-----------
1 file changed, 18 insertions(+), 11 deletions(-)
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index ca9a7e61d8d7de80cb234c45c41d6357fde50c11..2dd173a73bd1e2657957e5e4ecb70401cc85dfda 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -102,24 +102,31 @@ static size_t inet_sk_attr_size(struct sock *sk,
bool net_admin)
{
const struct inet_diag_handler *handler;
- size_t aux = 0;
+ int ext = req->idiag_ext;
+ size_t ret = 0;
rcu_read_lock();
handler = rcu_dereference(inet_diag_table[req->sdiag_protocol]);
DEBUG_NET_WARN_ON_ONCE(!handler);
if (handler && handler->idiag_get_aux_size)
- aux = handler->idiag_get_aux_size(sk, net_admin);
+ ret += handler->idiag_get_aux_size(sk, net_admin);
rcu_read_unlock();
- return nla_total_size(sizeof(struct tcp_info))
- + nla_total_size(sizeof(struct inet_diag_msg))
- + inet_diag_msg_attrs_size()
- + nla_total_size(sizeof(struct inet_diag_meminfo))
- + nla_total_size(SK_MEMINFO_VARS * sizeof(u32))
- + nla_total_size(TCP_CA_NAME_MAX)
- + nla_total_size(sizeof(struct tcpvegas_info))
- + aux
- + 64;
+ ret += nla_total_size(sizeof(struct tcp_info))
+ + nla_total_size(sizeof(struct inet_diag_msg))
+ + inet_diag_msg_attrs_size()
+ + 64;
+
+ if (ext & (1 << (INET_DIAG_MEMINFO - 1)))
+ ret += nla_total_size(sizeof(struct inet_diag_meminfo));
+ if (ext & (1 << (INET_DIAG_SKMEMINFO - 1)))
+ ret += nla_total_size(SK_MEMINFO_VARS * sizeof(u32));
+ if (ext & (1 << (INET_DIAG_CONG - 1)))
+ ret += nla_total_size(TCP_CA_NAME_MAX);
+ if (ext & (1 << (INET_DIAG_VEGASINFO - 1)))
+ ret += nla_total_size(sizeof(struct tcpvegas_info));
+
+ return ret;
}
int inet_diag_msg_attrs_fill(struct sock *sk, struct sk_buff *skb,
--
2.42.2
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH net v2 4/5] net/diag: Always pre-allocate tcp_ulp info
2024-11-13 18:46 [PATCH net v2 0/5] Make TCP-MD5-diag slightly less broken Dmitry Safonov via B4 Relay
` (2 preceding siblings ...)
2024-11-13 18:46 ` [PATCH net v2 3/5] net/diag: Pre-allocate optional info only if requested Dmitry Safonov via B4 Relay
@ 2024-11-13 18:46 ` Dmitry Safonov via B4 Relay
2024-11-13 18:46 ` [PATCH net v2 5/5] net/netlink: Correct the comment on netlink message max cap Dmitry Safonov via B4 Relay
` (3 subsequent siblings)
7 siblings, 0 replies; 23+ messages in thread
From: Dmitry Safonov via B4 Relay @ 2024-11-13 18:46 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, David Ahern, Ivan Delalande, Matthieu Baerts,
Mat Martineau, Geliang Tang, Boris Pismenny, John Fastabend,
Davide Caratti, Kuniyuki Iwashima
Cc: netdev, linux-kernel, mptcp, Dmitry Safonov
From: Dmitry Safonov <0x7f454c46@gmail.com>
Currently there is a theoretical race between netlink one-socket dump
and allocating icsk->icsk_ulp_ops.
Simplify the expectations by always allocating maximum tcp_ulp-info.
With the previous patch the typical netlink message allocation was
decreased for kernel replies on requests without idiag_ext flags,
so let's use it.
Fixes: 61723b393292 ("tcp: ulp: add functions to dump ulp-specific information")
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
---
include/net/tcp.h | 1 -
net/ipv4/inet_diag.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
net/ipv4/tcp_diag.c | 13 -------------
net/mptcp/diag.c | 20 --------------------
net/tls/tls_main.c | 17 -----------------
5 files changed, 50 insertions(+), 51 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index d1948d357dade0842777265d3397842919f9eee0..757711aa5337ae7e6abee62d303eb66d37082e19 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -2568,7 +2568,6 @@ struct tcp_ulp_ops {
void (*release)(struct sock *sk);
/* diagnostic */
int (*get_info)(struct sock *sk, struct sk_buff *skb);
- size_t (*get_info_size)(const struct sock *sk);
/* clone ulp */
void (*clone)(const struct request_sock *req, struct sock *newsk,
const gfp_t priority);
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 2dd173a73bd1e2657957e5e4ecb70401cc85dfda..ac6d9ee8e2cc21fc97d6018547d33b540712e780 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -97,6 +97,55 @@ void inet_diag_msg_common_fill(struct inet_diag_msg *r, struct sock *sk)
}
EXPORT_SYMBOL_GPL(inet_diag_msg_common_fill);
+static size_t tls_get_info_size(void)
+{
+ size_t size = 0;
+
+#ifdef CONFIG_TLS
+ size += nla_total_size(0) + /* INET_ULP_INFO_TLS */
+ nla_total_size(sizeof(u16)) + /* TLS_INFO_VERSION */
+ nla_total_size(sizeof(u16)) + /* TLS_INFO_CIPHER */
+ nla_total_size(sizeof(u16)) + /* TLS_INFO_RXCONF */
+ nla_total_size(sizeof(u16)) + /* TLS_INFO_TXCONF */
+ nla_total_size(0) + /* TLS_INFO_ZC_RO_TX */
+ nla_total_size(0) + /* TLS_INFO_RX_NO_PAD */
+ 0;
+#endif
+
+ return size;
+}
+
+static size_t subflow_get_info_size(void)
+{
+ size_t size = 0;
+
+#ifdef CONFIG_MPTCP
+ size += nla_total_size(0) + /* INET_ULP_INFO_MPTCP */
+ nla_total_size(4) + /* MPTCP_SUBFLOW_ATTR_TOKEN_REM */
+ nla_total_size(4) + /* MPTCP_SUBFLOW_ATTR_TOKEN_LOC */
+ nla_total_size(4) + /* MPTCP_SUBFLOW_ATTR_RELWRITE_SEQ */
+ nla_total_size_64bit(8) + /* MPTCP_SUBFLOW_ATTR_MAP_SEQ */
+ nla_total_size(4) + /* MPTCP_SUBFLOW_ATTR_MAP_SFSEQ */
+ nla_total_size(4) + /* MPTCP_SUBFLOW_ATTR_SSN_OFFSET */
+ nla_total_size(2) + /* MPTCP_SUBFLOW_ATTR_MAP_DATALEN */
+ nla_total_size(4) + /* MPTCP_SUBFLOW_ATTR_FLAGS */
+ nla_total_size(1) + /* MPTCP_SUBFLOW_ATTR_ID_REM */
+ nla_total_size(1) + /* MPTCP_SUBFLOW_ATTR_ID_LOC */
+ 0;
+#endif
+
+ return size;
+}
+
+static size_t tcp_ulp_ops_size(void)
+{
+ size_t size = max(tls_get_info_size(), subflow_get_info_size());
+
+ return size +
+ nla_total_size(0) + /* INET_DIAG_ULP_INFO */
+ nla_total_size(TCP_ULP_NAME_MAX); /* INET_ULP_INFO_NAME */
+}
+
static size_t inet_sk_attr_size(struct sock *sk,
const struct inet_diag_req_v2 *req,
bool net_admin)
@@ -115,6 +164,7 @@ static size_t inet_sk_attr_size(struct sock *sk,
ret += nla_total_size(sizeof(struct tcp_info))
+ nla_total_size(sizeof(struct inet_diag_msg))
+ inet_diag_msg_attrs_size()
+ + tcp_ulp_ops_size()
+ 64;
if (ext & (1 << (INET_DIAG_MEMINFO - 1)))
diff --git a/net/ipv4/tcp_diag.c b/net/ipv4/tcp_diag.c
index d752dc5de3536303aeb075c10fbdc2c9fc417cd5..a60968ceb1553b2d219290b84a36b05a2d1fa8d2 100644
--- a/net/ipv4/tcp_diag.c
+++ b/net/ipv4/tcp_diag.c
@@ -154,7 +154,6 @@ static int tcp_diag_get_aux(struct sock *sk, bool net_admin,
static size_t tcp_diag_get_aux_size(struct sock *sk, bool net_admin)
{
- struct inet_connection_sock *icsk = inet_csk(sk);
size_t size = 0;
#ifdef CONFIG_TCP_MD5SIG
@@ -174,18 +173,6 @@ static size_t tcp_diag_get_aux_size(struct sock *sk, bool net_admin)
sizeof(struct tcp_diag_md5sig));
}
#endif
-
- if (net_admin && sk_fullsock(sk)) {
- const struct tcp_ulp_ops *ulp_ops;
-
- ulp_ops = icsk->icsk_ulp_ops;
- if (ulp_ops) {
- size += nla_total_size(0) +
- nla_total_size(TCP_ULP_NAME_MAX);
- if (ulp_ops->get_info_size)
- size += ulp_ops->get_info_size(sk);
- }
- }
return size;
}
diff --git a/net/mptcp/diag.c b/net/mptcp/diag.c
index 2d3efb405437d85c0bca70d7a92ca3a7363365e1..8b36867e4ddd5f45cebcf60e9093a061d5208756 100644
--- a/net/mptcp/diag.c
+++ b/net/mptcp/diag.c
@@ -84,27 +84,7 @@ static int subflow_get_info(struct sock *sk, struct sk_buff *skb)
return err;
}
-static size_t subflow_get_info_size(const struct sock *sk)
-{
- size_t size = 0;
-
- size += nla_total_size(0) + /* INET_ULP_INFO_MPTCP */
- nla_total_size(4) + /* MPTCP_SUBFLOW_ATTR_TOKEN_REM */
- nla_total_size(4) + /* MPTCP_SUBFLOW_ATTR_TOKEN_LOC */
- nla_total_size(4) + /* MPTCP_SUBFLOW_ATTR_RELWRITE_SEQ */
- nla_total_size_64bit(8) + /* MPTCP_SUBFLOW_ATTR_MAP_SEQ */
- nla_total_size(4) + /* MPTCP_SUBFLOW_ATTR_MAP_SFSEQ */
- nla_total_size(4) + /* MPTCP_SUBFLOW_ATTR_SSN_OFFSET */
- nla_total_size(2) + /* MPTCP_SUBFLOW_ATTR_MAP_DATALEN */
- nla_total_size(4) + /* MPTCP_SUBFLOW_ATTR_FLAGS */
- nla_total_size(1) + /* MPTCP_SUBFLOW_ATTR_ID_REM */
- nla_total_size(1) + /* MPTCP_SUBFLOW_ATTR_ID_LOC */
- 0;
- return size;
-}
-
void mptcp_diag_subflow_init(struct tcp_ulp_ops *ops)
{
ops->get_info = subflow_get_info;
- ops->get_info_size = subflow_get_info_size;
}
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 6b4b9f2749a6fd6de495940c5cb3f2154a5a451e..f3491c4e942e08dc882cb81eef071203384b2b37 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -1072,22 +1072,6 @@ static int tls_get_info(struct sock *sk, struct sk_buff *skb)
return err;
}
-static size_t tls_get_info_size(const struct sock *sk)
-{
- size_t size = 0;
-
- size += nla_total_size(0) + /* INET_ULP_INFO_TLS */
- nla_total_size(sizeof(u16)) + /* TLS_INFO_VERSION */
- nla_total_size(sizeof(u16)) + /* TLS_INFO_CIPHER */
- nla_total_size(sizeof(u16)) + /* TLS_INFO_RXCONF */
- nla_total_size(sizeof(u16)) + /* TLS_INFO_TXCONF */
- nla_total_size(0) + /* TLS_INFO_ZC_RO_TX */
- nla_total_size(0) + /* TLS_INFO_RX_NO_PAD */
- 0;
-
- return size;
-}
-
static int __net_init tls_init_net(struct net *net)
{
int err;
@@ -1123,7 +1107,6 @@ static struct tcp_ulp_ops tcp_tls_ulp_ops __read_mostly = {
.init = tls_init,
.update = tls_update,
.get_info = tls_get_info,
- .get_info_size = tls_get_info_size,
};
static int __init tls_register(void)
--
2.42.2
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH net v2 5/5] net/netlink: Correct the comment on netlink message max cap
2024-11-13 18:46 [PATCH net v2 0/5] Make TCP-MD5-diag slightly less broken Dmitry Safonov via B4 Relay
` (3 preceding siblings ...)
2024-11-13 18:46 ` [PATCH net v2 4/5] net/diag: Always pre-allocate tcp_ulp info Dmitry Safonov via B4 Relay
@ 2024-11-13 18:46 ` Dmitry Safonov via B4 Relay
2024-11-16 0:13 ` Jakub Kicinski
2024-11-16 0:08 ` [PATCH net v2 0/5] Make TCP-MD5-diag slightly less broken Jakub Kicinski
` (2 subsequent siblings)
7 siblings, 1 reply; 23+ messages in thread
From: Dmitry Safonov via B4 Relay @ 2024-11-13 18:46 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Simon Horman, David Ahern, Ivan Delalande, Matthieu Baerts,
Mat Martineau, Geliang Tang, Boris Pismenny, John Fastabend,
Davide Caratti, Kuniyuki Iwashima
Cc: netdev, linux-kernel, mptcp, Dmitry Safonov
From: Dmitry Safonov <0x7f454c46@gmail.com>
Since commit d35c99ff77ec ("netlink: do not enter direct reclaim from
netlink_dump()") the cap is 32KiB.
Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
---
net/netlink/af_netlink.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index 0a9287fadb47a2afaf0babe675738bc43051c5a7..27979cefc06256bde052898d193ed99f710c2087 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2279,7 +2279,7 @@ static int netlink_dump(struct sock *sk, bool lock_taken)
goto errout_skb;
/* NLMSG_GOODSIZE is small to avoid high order allocations being
- * required, but it makes sense to _attempt_ a 16K bytes allocation
+ * required, but it makes sense to _attempt_ a 32KiB allocation
* to reduce number of system calls on dump operations, if user
* ever provided a big enough buffer.
*/
@@ -2301,7 +2301,7 @@ static int netlink_dump(struct sock *sk, bool lock_taken)
goto errout_skb;
/* Trim skb to allocated size. User is expected to provide buffer as
- * large as max(min_dump_alloc, 16KiB (mac_recvmsg_len capped at
+ * large as max(min_dump_alloc, 32KiB (max_recvmsg_len capped at
* netlink_recvmsg())). dump will pack as many smaller messages as
* could fit within the allocated skb. skb is typically allocated
* with larger space than required (could be as much as near 2x the
--
2.42.2
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH net v2 0/5] Make TCP-MD5-diag slightly less broken
2024-11-13 18:46 [PATCH net v2 0/5] Make TCP-MD5-diag slightly less broken Dmitry Safonov via B4 Relay
` (4 preceding siblings ...)
2024-11-13 18:46 ` [PATCH net v2 5/5] net/netlink: Correct the comment on netlink message max cap Dmitry Safonov via B4 Relay
@ 2024-11-16 0:08 ` Jakub Kicinski
2024-11-16 0:48 ` Dmitry Safonov
2024-11-20 8:44 ` Johannes Berg
2024-11-16 0:20 ` patchwork-bot+netdevbpf
2024-12-05 1:13 ` Jakub Kicinski
7 siblings, 2 replies; 23+ messages in thread
From: Jakub Kicinski @ 2024-11-16 0:08 UTC (permalink / raw)
To: Dmitry Safonov via B4 Relay
Cc: 0x7f454c46, David S. Miller, Eric Dumazet, Paolo Abeni,
Simon Horman, David Ahern, Ivan Delalande, Matthieu Baerts,
Mat Martineau, Geliang Tang, John Fastabend, Davide Caratti,
Kuniyuki Iwashima, netdev, linux-kernel, mptcp, Johannes Berg
On Wed, 13 Nov 2024 18:46:39 +0000 Dmitry Safonov via B4 Relay wrote:
> 2. Inet-diag allocates netlink message for sockets in
> inet_diag_dump_one_icsk(), which uses a TCP-diag callback
> .idiag_get_aux_size(), that pre-calculates the needed space for
> TCP-diag related information. But as neither socket lock nor
> rcu_readlock() are held between allocation and the actual TCP
> info filling, the TCP-related space requirement may change before
> reaching tcp_diag_put_md5sig(). I.e., the number of TCP-MD5 keys on
> a socket. Thankfully, TCP-MD5-diag won't overwrite the skb, but will
> return EMSGSIZE, triggering WARN_ON() in inet_diag_dump_one_icsk().
Would it be too ugly if we simply retried with a 32kB skb if the initial
dump failed with EMSGSIZE?
Another option would be to automatically grow the skb. The size
accounting is an endless source of bugs. We'd just need to scan
the codebase to make sure there are no cases where someone does
ptr = __nla_reserve();
nla_put();
*ptr = 0;
Which may be too much of a project and source of bugs in itself.
Or do both, retry as a fix, and auto-grow in net-next.
> In order to remove the new limit from (4) solution, my plan is to
> convert the dump of TCP-MD5 keys from an array to
> NL_ATTR_TYPE_NESTED_ARRAY (or alike), which should also address (1).
> And for (3), it's needed to teach tcp-diag how-to remember not only
> socket on which previous recvmsg() stopped, but potentially TCP-MD5
> key as well.
Just putting the same attribute type multiple times is preferable
to array types.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH net v2 5/5] net/netlink: Correct the comment on netlink message max cap
2024-11-13 18:46 ` [PATCH net v2 5/5] net/netlink: Correct the comment on netlink message max cap Dmitry Safonov via B4 Relay
@ 2024-11-16 0:13 ` Jakub Kicinski
0 siblings, 0 replies; 23+ messages in thread
From: Jakub Kicinski @ 2024-11-16 0:13 UTC (permalink / raw)
To: Dmitry Safonov via B4 Relay
Cc: 0x7f454c46, David S. Miller, Eric Dumazet, Paolo Abeni,
Simon Horman, David Ahern, Ivan Delalande, Matthieu Baerts,
Mat Martineau, Geliang Tang, Boris Pismenny, John Fastabend,
Davide Caratti, Kuniyuki Iwashima, netdev, linux-kernel, mptcp
On Wed, 13 Nov 2024 18:46:44 +0000 Dmitry Safonov via B4 Relay wrote:
> From: Dmitry Safonov <0x7f454c46@gmail.com>
>
> Since commit d35c99ff77ec ("netlink: do not enter direct reclaim from
> netlink_dump()") the cap is 32KiB.
>
> Signed-off-by: Dmitry Safonov <0x7f454c46@gmail.com>
I'll apply this one already, looks obviously correct (tm)
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH net v2 0/5] Make TCP-MD5-diag slightly less broken
2024-11-13 18:46 [PATCH net v2 0/5] Make TCP-MD5-diag slightly less broken Dmitry Safonov via B4 Relay
` (5 preceding siblings ...)
2024-11-16 0:08 ` [PATCH net v2 0/5] Make TCP-MD5-diag slightly less broken Jakub Kicinski
@ 2024-11-16 0:20 ` patchwork-bot+netdevbpf
2024-12-05 1:13 ` Jakub Kicinski
7 siblings, 0 replies; 23+ messages in thread
From: patchwork-bot+netdevbpf @ 2024-11-16 0:20 UTC (permalink / raw)
To: Dmitry Safonov via B4 Relay
Cc: davem, edumazet, kuba, pabeni, horms, dsahern, colona, matttbe,
martineau, geliang, borisp, john.fastabend, dcaratti, kuniyu,
netdev, linux-kernel, mptcp, 0x7f454c46, stable
Hello:
This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Wed, 13 Nov 2024 18:46:39 +0000 you wrote:
> Changes in v2:
> - Fixup for uninitilized md5sig_count stack variable
> (Oops! Kudos to kernel test robot <lkp@intel.com>)
> - Correct space damage, add a missing Fixes tag &
> reformat tcp_ulp_ops_size() (Kuniyuki Iwashima)
> - Take out patch for maximum attribute length, see (4) below.
> Going to send it later with the next TCP-AO-diag part
> (Kuniyuki Iwashima)
> - Link to v1: https://lore.kernel.org/r/20241106-tcp-md5-diag-prep-v1-0-d62debf3dded@gmail.com
>
> [...]
Here is the summary with links:
- [net,v2,1/5] net/diag: Do not race on dumping MD5 keys with adding new MD5 keys
(no matching commit)
- [net,v2,2/5] net/diag: Warn only once on EMSGSIZE
(no matching commit)
- [net,v2,3/5] net/diag: Pre-allocate optional info only if requested
(no matching commit)
- [net,v2,4/5] net/diag: Always pre-allocate tcp_ulp info
(no matching commit)
- [net,v2,5/5] net/netlink: Correct the comment on netlink message max cap
https://git.kernel.org/netdev/net-next/c/e51edeaf3506
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH net v2 0/5] Make TCP-MD5-diag slightly less broken
2024-11-16 0:08 ` [PATCH net v2 0/5] Make TCP-MD5-diag slightly less broken Jakub Kicinski
@ 2024-11-16 0:48 ` Dmitry Safonov
2024-11-16 1:58 ` Jakub Kicinski
2024-11-20 8:44 ` Johannes Berg
1 sibling, 1 reply; 23+ messages in thread
From: Dmitry Safonov @ 2024-11-16 0:48 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Dmitry Safonov via B4 Relay, David S. Miller, Eric Dumazet,
Paolo Abeni, Simon Horman, David Ahern, Ivan Delalande,
Matthieu Baerts, Mat Martineau, Geliang Tang, John Fastabend,
Davide Caratti, Kuniyuki Iwashima, netdev, linux-kernel, mptcp,
Johannes Berg
Hi Jakub,
On Sat, 16 Nov 2024 at 00:08, Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 13 Nov 2024 18:46:39 +0000 Dmitry Safonov via B4 Relay wrote:
> > 2. Inet-diag allocates netlink message for sockets in
> > inet_diag_dump_one_icsk(), which uses a TCP-diag callback
> > .idiag_get_aux_size(), that pre-calculates the needed space for
> > TCP-diag related information. But as neither socket lock nor
> > rcu_readlock() are held between allocation and the actual TCP
> > info filling, the TCP-related space requirement may change before
> > reaching tcp_diag_put_md5sig(). I.e., the number of TCP-MD5 keys on
> > a socket. Thankfully, TCP-MD5-diag won't overwrite the skb, but will
> > return EMSGSIZE, triggering WARN_ON() in inet_diag_dump_one_icsk().
>
> Would it be too ugly if we simply retried with a 32kB skb if the initial
> dump failed with EMSGSIZE?
Yeah, I'm not sure. I thought of keeping it simple and just marking
the nlmsg "inconsistent". This is arguably a change of meaning for
NLM_F_DUMP_INTR because previously, it meant that the multi-message
dump became inconsistent between recvmsg() calls. And now, it is also
utilized in the "do" version if it raced with the socket setsockopts()
in another thread.
> Another option would be to automatically grow the skb. The size
> accounting is an endless source of bugs. We'd just need to scan
> the codebase to make sure there are no cases where someone does
>
> ptr = __nla_reserve();
> nla_put();
> *ptr = 0;
>
> Which may be too much of a project and source of bugs in itself.
This seems quite more complex than just marking the dump inconsistent
and letting the userspace deal with the result or retry if it wants
precise key information.
> Or do both, retry as a fix, and auto-grow in net-next.
>
> > In order to remove the new limit from (4) solution, my plan is to
> > convert the dump of TCP-MD5 keys from an array to
> > NL_ATTR_TYPE_NESTED_ARRAY (or alike), which should also address (1).
> > And for (3), it's needed to teach tcp-diag how-to remember not only
> > socket on which previous recvmsg() stopped, but potentially TCP-MD5
> > key as well.
>
> Just putting the same attribute type multiple times is preferable
> to array types.
Cool. I didn't know that. I think I was confused by iproute way of
parsing [which I read very briefly, so might have misunderstood]:
: while (RTA_OK(rta, len)) {
: type = rta->rta_type & ~flags;
: if ((type <= max) && (!tb[type]))
: tb[type] = rta;
: rta = RTA_NEXT(rta, len);
: }
https://github.com/iproute2/iproute2/blob/main/lib/libnetlink.c#L1526
which seems like it will just ignore duplicate attributes.
That doesn't mean iproute has to dictate new code in kernel, for sure.
Thanks,
Dmitry
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH net v2 0/5] Make TCP-MD5-diag slightly less broken
2024-11-16 0:48 ` Dmitry Safonov
@ 2024-11-16 1:58 ` Jakub Kicinski
2024-11-16 3:52 ` Dmitry Safonov
0 siblings, 1 reply; 23+ messages in thread
From: Jakub Kicinski @ 2024-11-16 1:58 UTC (permalink / raw)
To: Dmitry Safonov
Cc: Dmitry Safonov via B4 Relay, David S. Miller, Eric Dumazet,
Paolo Abeni, Simon Horman, David Ahern, Ivan Delalande,
Matthieu Baerts, Mat Martineau, Geliang Tang, John Fastabend,
Davide Caratti, Kuniyuki Iwashima, netdev, linux-kernel, mptcp,
Johannes Berg
On Sat, 16 Nov 2024 00:48:17 +0000 Dmitry Safonov wrote:
> On Sat, 16 Nov 2024 at 00:08, Jakub Kicinski <kuba@kernel.org> wrote:
> > On Wed, 13 Nov 2024 18:46:39 +0000 Dmitry Safonov via B4 Relay wrote:
> > > 2. Inet-diag allocates netlink message for sockets in
> > > inet_diag_dump_one_icsk(), which uses a TCP-diag callback
> > > .idiag_get_aux_size(), that pre-calculates the needed space for
> > > TCP-diag related information. But as neither socket lock nor
> > > rcu_readlock() are held between allocation and the actual TCP
> > > info filling, the TCP-related space requirement may change before
> > > reaching tcp_diag_put_md5sig(). I.e., the number of TCP-MD5 keys on
> > > a socket. Thankfully, TCP-MD5-diag won't overwrite the skb, but will
> > > return EMSGSIZE, triggering WARN_ON() in inet_diag_dump_one_icsk().
> >
> > Would it be too ugly if we simply retried with a 32kB skb if the initial
> > dump failed with EMSGSIZE?
>
> Yeah, I'm not sure. I thought of keeping it simple and just marking
> the nlmsg "inconsistent". This is arguably a change of meaning for
> NLM_F_DUMP_INTR because previously, it meant that the multi-message
> dump became inconsistent between recvmsg() calls. And now, it is also
> utilized in the "do" version if it raced with the socket setsockopts()
> in another thread.
NLM_F_DUMP_INTR is an interesting idea, but exactly as you say NLM_F_DUMP_INTR
was a workaround for consistency of the dump as a whole. Single message
we can re-generate quite easily in the kernel, so forcing the user to
handle INTR and retry seems unnecessarily cruel ;)
> > > In order to remove the new limit from (4) solution, my plan is to
> > > convert the dump of TCP-MD5 keys from an array to
> > > NL_ATTR_TYPE_NESTED_ARRAY (or alike), which should also address (1).
> > > And for (3), it's needed to teach tcp-diag how-to remember not only
> > > socket on which previous recvmsg() stopped, but potentially TCP-MD5
> > > key as well.
> >
> > Just putting the same attribute type multiple times is preferable
> > to array types.
>
> Cool. I didn't know that. I think I was confused by iproute way of
> parsing [which I read very briefly, so might have misunderstood]:
> : while (RTA_OK(rta, len)) {
> : type = rta->rta_type & ~flags;
> : if ((type <= max) && (!tb[type]))
> : tb[type] = rta;
> : rta = RTA_NEXT(rta, len);
> : }
> https://github.com/iproute2/iproute2/blob/main/lib/libnetlink.c#L1526
>
> which seems like it will just ignore duplicate attributes.
>
> That doesn't mean iproute has to dictate new code in kernel, for sure.
Right, the table based parsing doesn't work well with multi-attr,
but other table formats aren't fundamentally better. Or at least
I never came up with a good way of solving this. And the multi-attr
at least doesn't suffer from the u16 problem.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH net v2 0/5] Make TCP-MD5-diag slightly less broken
2024-11-16 1:58 ` Jakub Kicinski
@ 2024-11-16 3:52 ` Dmitry Safonov
2024-11-19 0:12 ` Jakub Kicinski
0 siblings, 1 reply; 23+ messages in thread
From: Dmitry Safonov @ 2024-11-16 3:52 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Dmitry Safonov via B4 Relay, David S. Miller, Eric Dumazet,
Paolo Abeni, Simon Horman, David Ahern, Ivan Delalande,
Matthieu Baerts, Mat Martineau, Geliang Tang, John Fastabend,
Davide Caratti, Kuniyuki Iwashima, netdev, linux-kernel, mptcp,
Johannes Berg
On Sat, 16 Nov 2024 at 01:58, Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Sat, 16 Nov 2024 00:48:17 +0000 Dmitry Safonov wrote:
> > On Sat, 16 Nov 2024 at 00:08, Jakub Kicinski <kuba@kernel.org> wrote:
> > > Would it be too ugly if we simply retried with a 32kB skb if the initial
> > > dump failed with EMSGSIZE?
> >
> > Yeah, I'm not sure. I thought of keeping it simple and just marking
> > the nlmsg "inconsistent". This is arguably a change of meaning for
> > NLM_F_DUMP_INTR because previously, it meant that the multi-message
> > dump became inconsistent between recvmsg() calls. And now, it is also
> > utilized in the "do" version if it raced with the socket setsockopts()
> > in another thread.
>
> NLM_F_DUMP_INTR is an interesting idea, but exactly as you say NLM_F_DUMP_INTR
> was a workaround for consistency of the dump as a whole. Single message
> we can re-generate quite easily in the kernel, so forcing the user to
> handle INTR and retry seems unnecessarily cruel ;)
Kind of agree. But then, it seems to be quite rare. Even on a
purposely created selftest it fires not each time (maybe I'm not
skilful enough). Yet somewhat sceptical about a re-try in the kernel:
the need for it is caused by another thread manipulating keys, so we
may need another re-try after the first re-try... So, then we would
have to introduce a limit on retries :D
Hmm, what do you think about a kind of middle-ground/compromise
solution: keeping this NLM_F_DUMP_INTR flag and logic, but making it
hardly ever/never happen by purposely allocating larger skb. I don't
want to set some value in stone as one day it might become not enough
for all different socket infos, but maybe just add 4kB more to the
initial allocation? So, for it to reproduce, another thread would have
to add 4kB/sizeof(tcp_diag_md5sig) = 4kB/100 ~= 40 MD5 keys on the
socket between this thread's skb allocation and filling of the info
array. I'd call it "attempting to be nice to a user, but not at their
busylooping expense".
> > > Just putting the same attribute type multiple times is preferable
> > > to array types.
> >
> > Cool. I didn't know that. I think I was confused by iproute way of
> > parsing [which I read very briefly, so might have misunderstood]:
> > : while (RTA_OK(rta, len)) {
> > : type = rta->rta_type & ~flags;
> > : if ((type <= max) && (!tb[type]))
> > : tb[type] = rta;
> > : rta = RTA_NEXT(rta, len);
> > : }
> > https://github.com/iproute2/iproute2/blob/main/lib/libnetlink.c#L1526
> >
> > which seems like it will just ignore duplicate attributes.
> >
> > That doesn't mean iproute has to dictate new code in kernel, for sure.
>
> Right, the table based parsing doesn't work well with multi-attr,
> but other table formats aren't fundamentally better. Or at least
> I never came up with a good way of solving this. And the multi-attr
> at least doesn't suffer from the u16 problem.
Yeah, also an array of structs that makes it impossible to extend such
an ABI with new members.
And with regards to u16, I was thinking of this diff for net-next, but
was not sure if it's worth it:
diff --git a/lib/nlattr.c b/lib/nlattr.c
index be9c576b6e2d..01c5a49ffa34 100644
--- a/lib/nlattr.c
+++ b/lib/nlattr.c
@@ -903,6 +903,9 @@ struct nlattr *__nla_reserve(struct sk_buff *skb,
int attrtype, int attrlen)
{
struct nlattr *nla;
+ DEBUG_NET_WARN_ONCE(attrlen >= U16_MAX,
+ "requested nlattr::nla_len %d >= U16_MAX", attrlen);
+
nla = skb_put(skb, nla_total_size(attrlen));
nla->nla_type = attrtype;
nla->nla_len = nla_attr_size(attrlen);
--->8---
Thanks,
Dmitry
^ permalink raw reply related [flat|nested] 23+ messages in thread
* Re: [PATCH net v2 0/5] Make TCP-MD5-diag slightly less broken
2024-11-16 3:52 ` Dmitry Safonov
@ 2024-11-19 0:12 ` Jakub Kicinski
2024-11-20 0:19 ` Dmitry Safonov
0 siblings, 1 reply; 23+ messages in thread
From: Jakub Kicinski @ 2024-11-19 0:12 UTC (permalink / raw)
To: Dmitry Safonov
Cc: Dmitry Safonov via B4 Relay, David S. Miller, Eric Dumazet,
Paolo Abeni, Simon Horman, David Ahern, Ivan Delalande,
Matthieu Baerts, Mat Martineau, Geliang Tang, John Fastabend,
Davide Caratti, Kuniyuki Iwashima, netdev, linux-kernel, mptcp,
Johannes Berg
On Sat, 16 Nov 2024 03:52:47 +0000 Dmitry Safonov wrote:
> On Sat, 16 Nov 2024 at 01:58, Jakub Kicinski <kuba@kernel.org> wrote:
> > On Sat, 16 Nov 2024 00:48:17 +0000 Dmitry Safonov wrote:
> > > Yeah, I'm not sure. I thought of keeping it simple and just marking
> > > the nlmsg "inconsistent". This is arguably a change of meaning for
> > > NLM_F_DUMP_INTR because previously, it meant that the multi-message
> > > dump became inconsistent between recvmsg() calls. And now, it is also
> > > utilized in the "do" version if it raced with the socket setsockopts()
> > > in another thread.
> >
> > NLM_F_DUMP_INTR is an interesting idea, but exactly as you say NLM_F_DUMP_INTR
> > was a workaround for consistency of the dump as a whole. Single message
> > we can re-generate quite easily in the kernel, so forcing the user to
> > handle INTR and retry seems unnecessarily cruel ;)
>
> Kind of agree. But then, it seems to be quite rare. Even on a
> purposely created selftest it fires not each time (maybe I'm not
> skilful enough). Yet somewhat sceptical about a re-try in the kernel:
> the need for it is caused by another thread manipulating keys, so we
> may need another re-try after the first re-try... So, then we would
> have to introduce a limit on retries :D
Wouldn't be the first time ;)
But I'd just retry once with a "very large" buffer.
> Hmm, what do you think about a kind of middle-ground/compromise
> solution: keeping this NLM_F_DUMP_INTR flag and logic, but making it
> hardly ever/never happen by purposely allocating larger skb. I don't
> want to set some value in stone as one day it might become not enough
> for all different socket infos, but maybe just add 4kB more to the
> initial allocation? So, for it to reproduce, another thread would have
> to add 4kB/sizeof(tcp_diag_md5sig) = 4kB/100 ~= 40 MD5 keys on the
> socket between this thread's skb allocation and filling of the info
> array. I'd call it "attempting to be nice to a user, but not at their
> busylooping expense".
The size of the retry buffer should be larger than any valid size.
We can add a warning if calculated size >= 32kB.
If we support an inf number of md5 keys we need to cap it.
Eric is back later this week, perhaps we should wait for his advice.
> > Right, the table based parsing doesn't work well with multi-attr,
> > but other table formats aren't fundamentally better. Or at least
> > I never came up with a good way of solving this. And the multi-attr
> > at least doesn't suffer from the u16 problem.
>
> Yeah, also an array of structs that makes it impossible to extend such
> an ABI with new members.
>
> And with regards to u16, I was thinking of this diff for net-next, but
> was not sure if it's worth it:
>
> diff --git a/lib/nlattr.c b/lib/nlattr.c
> index be9c576b6e2d..01c5a49ffa34 100644
> --- a/lib/nlattr.c
> +++ b/lib/nlattr.c
> @@ -903,6 +903,9 @@ struct nlattr *__nla_reserve(struct sk_buff *skb,
> int attrtype, int attrlen)
> {
> struct nlattr *nla;
>
> + DEBUG_NET_WARN_ONCE(attrlen >= U16_MAX,
> + "requested nlattr::nla_len %d >= U16_MAX", attrlen);
> +
> nla = skb_put(skb, nla_total_size(attrlen));
> nla->nla_type = attrtype;
> nla->nla_len = nla_attr_size(attrlen);
I'm slightly worried that this can be triggered already from user
space, but we can try DEBUG_NET_* and see. Here and in nla_nest_end().
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH net v2 0/5] Make TCP-MD5-diag slightly less broken
2024-11-19 0:12 ` Jakub Kicinski
@ 2024-11-20 0:19 ` Dmitry Safonov
0 siblings, 0 replies; 23+ messages in thread
From: Dmitry Safonov @ 2024-11-20 0:19 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Dmitry Safonov via B4 Relay, David S. Miller, Eric Dumazet,
Paolo Abeni, Simon Horman, David Ahern, Ivan Delalande,
Matthieu Baerts, Mat Martineau, Geliang Tang, John Fastabend,
Davide Caratti, Kuniyuki Iwashima, netdev, linux-kernel, mptcp,
Johannes Berg
On Tue, 19 Nov 2024 at 00:12, Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Sat, 16 Nov 2024 03:52:47 +0000 Dmitry Safonov wrote:
> > Kind of agree. But then, it seems to be quite rare. Even on a
> > purposely created selftest it fires not each time (maybe I'm not
> > skilful enough). Yet somewhat sceptical about a re-try in the kernel:
> > the need for it is caused by another thread manipulating keys, so we
> > may need another re-try after the first re-try... So, then we would
> > have to introduce a limit on retries :D
>
> Wouldn't be the first time ;)
> But I'd just retry once with a "very large" buffer.
>
> > Hmm, what do you think about a kind of middle-ground/compromise
> > solution: keeping this NLM_F_DUMP_INTR flag and logic, but making it
> > hardly ever/never happen by purposely allocating larger skb. I don't
> > want to set some value in stone as one day it might become not enough
> > for all different socket infos, but maybe just add 4kB more to the
> > initial allocation? So, for it to reproduce, another thread would have
> > to add 4kB/sizeof(tcp_diag_md5sig) = 4kB/100 ~= 40 MD5 keys on the
> > socket between this thread's skb allocation and filling of the info
> > array. I'd call it "attempting to be nice to a user, but not at their
> > busylooping expense".
>
> The size of the retry buffer should be larger than any valid size.
> We can add a warning if calculated size >= 32kB.
Currently, md5/ao keys are limited by sock_kmalloc(), which uses
optmem_max sysctl limit. The default nowadays is 128KB.
From [1] I see that the current in-kernel (struct tcp_md5sig_key) hits
optmem_max on
# ok 38 optmem limit was hit on adding 655 key
IOW, with the default limit and sizeof(struct tcp_diag_md5sig) = 100,
the maximum skb size would be ~= 65Kb. Sounds a little too big for
kmemcache allocation.
Initially, my idea was to limit this old version of tcp-md5-diag by
U16_MAX. Now I'm thinking of adopting your idea by always allocating
32kB skb for single-message and marking it somehow, if it's not big
enough to fit all the keys on a socket (NLM_F_DUMP_INTR or any other
alternative for userspace to get a clue that the single message wasn't
enough).
Then, as I planned, teach the multi-message dump iterator to stop
between recvmsg() on N-th md5/ao key and continue the dump from that
key on the next recvmsg().
> If we support an inf number of md5 keys we need to cap it.
Yeah, unfortunately, we have some customers with 1000 peers (and
because of that we internally test BGP with even more peers).
And that's with an assumption of one key per peer, which is not
necessarily true for AO.
> Eric is back later this week, perhaps we should wait for his advice.
Sure, I will be glad to have advice from you both, thanks!
> > > Right, the table based parsing doesn't work well with multi-attr,
> > > but other table formats aren't fundamentally better. Or at least
> > > I never came up with a good way of solving this. And the multi-attr
> > > at least doesn't suffer from the u16 problem.
> >
> > Yeah, also an array of structs that makes it impossible to extend such
> > an ABI with new members.
> >
> > And with regards to u16, I was thinking of this diff for net-next, but
> > was not sure if it's worth it:
> >
> > diff --git a/lib/nlattr.c b/lib/nlattr.c
> > index be9c576b6e2d..01c5a49ffa34 100644
> > --- a/lib/nlattr.c
> > +++ b/lib/nlattr.c
> > @@ -903,6 +903,9 @@ struct nlattr *__nla_reserve(struct sk_buff *skb,
> > int attrtype, int attrlen)
> > {
> > struct nlattr *nla;
> >
> > + DEBUG_NET_WARN_ONCE(attrlen >= U16_MAX,
> > + "requested nlattr::nla_len %d >= U16_MAX", attrlen);
> > +
> > nla = skb_put(skb, nla_total_size(attrlen));
> > nla->nla_type = attrtype;
> > nla->nla_len = nla_attr_size(attrlen);
>
> I'm slightly worried that this can be triggered already from user
> space, but we can try DEBUG_NET_* and see. Here and in nla_nest_end().
Yeah, I thought that CONFIG_DEBUG_NET is not enabled on generic
distros, but the description is:
: Enable extra sanity checks in networking.
: This is mostly used by fuzzers, but is safe to select.
not sure if that guards any production users from enabling it.
But that would be interesting to see if, with those new additions,
netdev doesn't produce any warnings.
[1] https://netdev-3.bots.linux.dev/vmksft-tcp-ao/results/867500/14-setsockopt-closed-ipv4/stdout
Thanks,
Dmitry
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH net v2 0/5] Make TCP-MD5-diag slightly less broken
2024-11-16 0:08 ` [PATCH net v2 0/5] Make TCP-MD5-diag slightly less broken Jakub Kicinski
2024-11-16 0:48 ` Dmitry Safonov
@ 2024-11-20 8:44 ` Johannes Berg
2024-11-20 16:13 ` Dmitry Safonov
1 sibling, 1 reply; 23+ messages in thread
From: Johannes Berg @ 2024-11-20 8:44 UTC (permalink / raw)
To: Jakub Kicinski, Dmitry Safonov via B4 Relay
Cc: 0x7f454c46, David S. Miller, Eric Dumazet, Paolo Abeni,
Simon Horman, David Ahern, Ivan Delalande, Matthieu Baerts,
Mat Martineau, Geliang Tang, John Fastabend, Davide Caratti,
Kuniyuki Iwashima, netdev, linux-kernel, mptcp
(Sorry, late to the party)
On Fri, 2024-11-15 at 16:08 -0800, Jakub Kicinski wrote:
> On Wed, 13 Nov 2024 18:46:39 +0000 Dmitry Safonov via B4 Relay wrote:
> > 2. Inet-diag allocates netlink message for sockets in
> > inet_diag_dump_one_icsk(), which uses a TCP-diag callback
> > .idiag_get_aux_size(), that pre-calculates the needed space for
> > TCP-diag related information. But as neither socket lock nor
> > rcu_readlock() are held between allocation and the actual TCP
> > info filling, the TCP-related space requirement may change before
> > reaching tcp_diag_put_md5sig(). I.e., the number of TCP-MD5 keys on
> > a socket. Thankfully, TCP-MD5-diag won't overwrite the skb, but will
> > return EMSGSIZE, triggering WARN_ON() in inet_diag_dump_one_icsk().
>
> Would it be too ugly if we simply retried with a 32kB skb if the initial
> dump failed with EMSGSIZE?
We have min_dump_alloc, which a number of places are setting much higher
than the default, partially at least because there were similar issues,
e.g. in nl80211. See e.g. nl80211_dump_wiphy() doing it dynamically.
Kind of ugly? Sure! And we shouldn't use it now with newer userspace
that knows to request a more finely split dump. For older userspace it's
the only way though.
Also, we don't even give all the data to older userspace (it must
support split dumps to get information about the more modern features, 6
GHz channels, etc.), but I gather that's not an option here.
> Another option would be to automatically grow the skb. The size
> accounting is an endless source of bugs. We'd just need to scan
> the codebase to make sure there are no cases where someone does
>
> ptr = __nla_reserve();
> nla_put();
> *ptr = 0;
>
> Which may be too much of a project and source of bugs in itself.
>
> Or do both, retry as a fix, and auto-grow in net-next.
For auto-grow you'd also have to have information about the userspace
buffer, I think? It still has to fit there, might as well fail anyway if
that buffer is too small? I'm not sure we have that link back? But I'm
not really sure right now, just remember this as an additional wrinkle
from the above-mentioned nl80211 problem.
johannes
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH net v2 0/5] Make TCP-MD5-diag slightly less broken
2024-11-20 8:44 ` Johannes Berg
@ 2024-11-20 16:13 ` Dmitry Safonov
2024-11-20 19:36 ` Johannes Berg
0 siblings, 1 reply; 23+ messages in thread
From: Dmitry Safonov @ 2024-11-20 16:13 UTC (permalink / raw)
To: Johannes Berg
Cc: Jakub Kicinski, Dmitry Safonov via B4 Relay, David S. Miller,
Eric Dumazet, Paolo Abeni, Simon Horman, David Ahern,
Ivan Delalande, Matthieu Baerts, Mat Martineau, Geliang Tang,
John Fastabend, Davide Caratti, Kuniyuki Iwashima, netdev,
linux-kernel, mptcp
On Wed, 20 Nov 2024 at 08:44, Johannes Berg <johannes@sipsolutions.net> wrote:
>
> (Sorry, late to the party)
Thanks for joining! :-)
> On Fri, 2024-11-15 at 16:08 -0800, Jakub Kicinski wrote:
> > Would it be too ugly if we simply retried with a 32kB skb if the initial
> > dump failed with EMSGSIZE?
>
> We have min_dump_alloc, which a number of places are setting much higher
> than the default, partially at least because there were similar issues,
> e.g. in nl80211. See e.g. nl80211_dump_wiphy() doing it dynamically.
Yeah, your example seems alike what netlink_dump() does with
min_dump_alloc and max_recvmsg_len. You have there
.doit = nl80211_get_wiphy,
.dumpit = nl80211_dump_wiphy,
So at this initial patch set, I'm trying to fix
inet_diag_handler::dump_one() callback, which is to my understanding
same as .doit() for generic netlink [should we just rename struct
inet_diag_handler callbacks to match the generics?]. See
inet_diag_handler_cmd() and NLM_F_DUMP in
Documentation/userspace-api/netlink/intro.rst
For TCP-MD5-diag even the single message reply may have a variable
number of keys on a socket's dump. For multi-messages dump, my plan is
to use netlink_callback::ctx[] and add an iterator type that will
allow to stop on N-th key between recvmsg() calls.
> Kind of ugly? Sure! And we shouldn't use it now with newer userspace
> that knows to request a more finely split dump. For older userspace it's
> the only way though.
Heh, the comment in nl80211_dump_wiphy() on sending an empty message
and retrying is ouch!
>
> Also, we don't even give all the data to older userspace (it must
> support split dumps to get information about the more modern features, 6
> GHz channels, etc.), but I gather that's not an option here.
>
> > Another option would be to automatically grow the skb. The size
> > accounting is an endless source of bugs. We'd just need to scan
> > the codebase to make sure there are no cases where someone does
> >
> > ptr = __nla_reserve();
> > nla_put();
> > *ptr = 0;
> >
> > Which may be too much of a project and source of bugs in itself.
> >
> > Or do both, retry as a fix, and auto-grow in net-next.
>
> For auto-grow you'd also have to have information about the userspace
> buffer, I think? It still has to fit there, might as well fail anyway if
> that buffer is too small? I'm not sure we have that link back? But I'm
> not really sure right now, just remember this as an additional wrinkle
> from the above-mentioned nl80211 problem.
Yeah, netlink_recvmsg() attempts to track what the userspace is asking:
: /* Record the max length of recvmsg() calls for future allocations */
: max_recvmsg_len = max(READ_ONCE(nlk->max_recvmsg_len), len);
: max_recvmsg_len = min_t(size_t, max_recvmsg_len,
: SKB_WITH_OVERHEAD(32768));
: WRITE_ONCE(nlk->max_recvmsg_len, max_recvmsg_len);
Thanks,
Dmitry
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH net v2 0/5] Make TCP-MD5-diag slightly less broken
2024-11-20 16:13 ` Dmitry Safonov
@ 2024-11-20 19:36 ` Johannes Berg
0 siblings, 0 replies; 23+ messages in thread
From: Johannes Berg @ 2024-11-20 19:36 UTC (permalink / raw)
To: Dmitry Safonov
Cc: Jakub Kicinski, Dmitry Safonov via B4 Relay, David S. Miller,
Eric Dumazet, Paolo Abeni, Simon Horman, David Ahern,
Ivan Delalande, Matthieu Baerts, Mat Martineau, Geliang Tang,
John Fastabend, Davide Caratti, Kuniyuki Iwashima, netdev,
linux-kernel, mptcp
On Wed, 2024-11-20 at 16:13 +0000, Dmitry Safonov wrote:
> > We have min_dump_alloc, which a number of places are setting much higher
> > than the default, partially at least because there were similar issues,
> > e.g. in nl80211. See e.g. nl80211_dump_wiphy() doing it dynamically.
>
> Yeah, your example seems alike what netlink_dump() does with
> min_dump_alloc and max_recvmsg_len. You have there
> .doit = nl80211_get_wiphy,
> .dumpit = nl80211_dump_wiphy,
>
> So at this initial patch set, I'm trying to fix
> inet_diag_handler::dump_one() callback, which is to my understanding
> same as .doit() for generic netlink [should we just rename struct
> inet_diag_handler callbacks to match the generics?].
dump_one() doesn't sound like doit(), it sounds more like dump one
object? In generic netlink dumpit has to handle that internally, and
doit() is for commands, without F_DUMP.
But also generic netlink is just one netlink family, so I wouldn't
really rename anything to match it anyway.
> See
> inet_diag_handler_cmd() and NLM_F_DUMP in
> Documentation/userspace-api/netlink/intro.rst
> For TCP-MD5-diag even the single message reply may have a variable
> number of keys on a socket's dump.
Right.
> For multi-messages dump, my plan is
> to use netlink_callback::ctx[] and add an iterator type that will
> allow to stop on N-th key between recvmsg() calls.
Right, so userspace has to understand that format. In nl80211 we've made
an input flag attribute (inputs are often unused with dump, but are
present) to request that split format.
> > Kind of ugly? Sure! And we shouldn't use it now with newer userspace
> > that knows to request a more finely split dump. For older userspace it's
> > the only way though.
>
> Heh, the comment in nl80211_dump_wiphy() on sending an empty message
> and retrying is ouch!
Yeah ... Luckily we basically converted all userspace to request split
dumps, so we shouldn't ever get there now.
> > For auto-grow you'd also have to have information about the userspace
> > buffer, I think? It still has to fit there, might as well fail anyway if
> > that buffer is too small? I'm not sure we have that link back? But I'm
> > not really sure right now, just remember this as an additional wrinkle
> > from the above-mentioned nl80211 problem.
>
> Yeah, netlink_recvmsg() attempts to track what the userspace is asking:
>
> : /* Record the max length of recvmsg() calls for future allocations */
> : max_recvmsg_len = max(READ_ONCE(nlk->max_recvmsg_len), len);
> : max_recvmsg_len = min_t(size_t, max_recvmsg_len,
> : SKB_WITH_OVERHEAD(32768));
> : WRITE_ONCE(nlk->max_recvmsg_len, max_recvmsg_len);
Right, OK, so that's sorted then :)
johannes
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH net v2 0/5] Make TCP-MD5-diag slightly less broken
2024-11-13 18:46 [PATCH net v2 0/5] Make TCP-MD5-diag slightly less broken Dmitry Safonov via B4 Relay
` (6 preceding siblings ...)
2024-11-16 0:20 ` patchwork-bot+netdevbpf
@ 2024-12-05 1:13 ` Jakub Kicinski
2024-12-05 9:09 ` Eric Dumazet
7 siblings, 1 reply; 23+ messages in thread
From: Jakub Kicinski @ 2024-12-05 1:13 UTC (permalink / raw)
To: Eric Dumazet
Cc: Dmitry Safonov via B4 Relay, 0x7f454c46, David S. Miller,
Paolo Abeni, Simon Horman, David Ahern, Ivan Delalande,
Matthieu Baerts, Mat Martineau, Geliang Tang, Boris Pismenny,
John Fastabend, Davide Caratti, Kuniyuki Iwashima, netdev,
linux-kernel, mptcp, stable
On Wed, 13 Nov 2024 18:46:39 +0000 Dmitry Safonov via B4 Relay wrote:
> 2. Inet-diag allocates netlink message for sockets in
> inet_diag_dump_one_icsk(), which uses a TCP-diag callback
> .idiag_get_aux_size(), that pre-calculates the needed space for
> TCP-diag related information. But as neither socket lock nor
> rcu_readlock() are held between allocation and the actual TCP
> info filling, the TCP-related space requirement may change before
> reaching tcp_diag_put_md5sig(). I.e., the number of TCP-MD5 keys on
> a socket. Thankfully, TCP-MD5-diag won't overwrite the skb, but will
> return EMSGSIZE, triggering WARN_ON() in inet_diag_dump_one_icsk().
Hi Eric!
This was posted while you were away -- any thoughts or recommendation on
how to address the required nl message size changing? Or other problems
pointed out by Dmitry? My suggestion in the subthread is to re-dump
with a fixed, large buffer on EMSGSIZE, but that's not super clean..
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH net v2 0/5] Make TCP-MD5-diag slightly less broken
2024-12-05 1:13 ` Jakub Kicinski
@ 2024-12-05 9:09 ` Eric Dumazet
2024-12-06 0:31 ` Jakub Kicinski
2024-12-06 2:49 ` Dmitry Safonov
0 siblings, 2 replies; 23+ messages in thread
From: Eric Dumazet @ 2024-12-05 9:09 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Dmitry Safonov via B4 Relay, 0x7f454c46, David S. Miller,
Paolo Abeni, Simon Horman, David Ahern, Ivan Delalande,
Matthieu Baerts, Mat Martineau, Geliang Tang, Boris Pismenny,
John Fastabend, Davide Caratti, Kuniyuki Iwashima, netdev,
linux-kernel, mptcp, stable
On Thu, Dec 5, 2024 at 2:13 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Wed, 13 Nov 2024 18:46:39 +0000 Dmitry Safonov via B4 Relay wrote:
> > 2. Inet-diag allocates netlink message for sockets in
> > inet_diag_dump_one_icsk(), which uses a TCP-diag callback
> > .idiag_get_aux_size(), that pre-calculates the needed space for
> > TCP-diag related information. But as neither socket lock nor
> > rcu_readlock() are held between allocation and the actual TCP
> > info filling, the TCP-related space requirement may change before
> > reaching tcp_diag_put_md5sig(). I.e., the number of TCP-MD5 keys on
> > a socket. Thankfully, TCP-MD5-diag won't overwrite the skb, but will
> > return EMSGSIZE, triggering WARN_ON() in inet_diag_dump_one_icsk().
>
> Hi Eric!
>
> This was posted while you were away -- any thoughts or recommendation on
> how to address the required nl message size changing? Or other problems
> pointed out by Dmitry? My suggestion in the subthread is to re-dump
> with a fixed, large buffer on EMSGSIZE, but that's not super clean..
Hi Jakub
inet_diag_dump_one_icsk() could retry, doubling the size until the
~32768 byte limit is reached ?
Also, we could make sure inet_sk_attr_size() returns at least
NLMSG_DEFAULT_SIZE, there is no
point trying to save memory for a single skb in inet_diag_dump_one_icsk().
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 321acc8abf17e8c7d6a4e3326615123fff19deab..cd2e7fe9b090ea9127aebbba0faf2ef12c0f86a4
100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -102,7 +102,7 @@ static size_t inet_sk_attr_size(struct sock *sk,
bool net_admin)
{
const struct inet_diag_handler *handler;
- size_t aux = 0;
+ size_t aux = 0, res;
rcu_read_lock();
handler = rcu_dereference(inet_diag_table[req->sdiag_protocol]);
@@ -111,7 +111,7 @@ static size_t inet_sk_attr_size(struct sock *sk,
aux = handler->idiag_get_aux_size(sk, net_admin);
rcu_read_unlock();
- return nla_total_size(sizeof(struct tcp_info))
+ res = nla_total_size(sizeof(struct tcp_info))
+ nla_total_size(sizeof(struct inet_diag_msg))
+ inet_diag_msg_attrs_size()
+ nla_total_size(sizeof(struct inet_diag_meminfo))
@@ -120,6 +120,7 @@ static size_t inet_sk_attr_size(struct sock *sk,
+ nla_total_size(sizeof(struct tcpvegas_info))
+ aux
+ 64;
+ return max(res, NLMSG_DEFAULT_SIZE);
}
int inet_diag_msg_attrs_fill(struct sock *sk, struct sk_buff *skb,
@@ -570,6 +571,7 @@ int inet_diag_dump_one_icsk(struct inet_hashinfo *hashinfo,
bool net_admin = netlink_net_capable(in_skb, CAP_NET_ADMIN);
struct net *net = sock_net(in_skb->sk);
struct sk_buff *rep;
+ size_t attr_size;
struct sock *sk;
int err;
@@ -577,7 +579,9 @@ int inet_diag_dump_one_icsk(struct inet_hashinfo *hashinfo,
if (IS_ERR(sk))
return PTR_ERR(sk);
- rep = nlmsg_new(inet_sk_attr_size(sk, req, net_admin), GFP_KERNEL);
+ attr_size = inet_sk_attr_size(sk, req, net_admin);
+retry:
+ rep = nlmsg_new(attr_size, GFP_KERNEL);
if (!rep) {
err = -ENOMEM;
goto out;
@@ -585,8 +589,14 @@ int inet_diag_dump_one_icsk(struct inet_hashinfo *hashinfo,
err = sk_diag_fill(sk, rep, cb, req, 0, net_admin);
if (err < 0) {
- WARN_ON(err == -EMSGSIZE);
nlmsg_free(rep);
+ if (err == -EMSGSIZE) {
+ attr_size <<= 1;
+ if (attr_size + NLMSG_HDRLEN <=
SKB_WITH_OVERHEAD(32768)) {
+ cond_resched();
+ goto retry;
+ }
+ }
goto out;
}
err = nlmsg_unicast(net->diag_nlsk, rep, NETLINK_CB(in_skb).portid);
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH net v2 0/5] Make TCP-MD5-diag slightly less broken
2024-12-05 9:09 ` Eric Dumazet
@ 2024-12-06 0:31 ` Jakub Kicinski
2024-12-06 2:49 ` Dmitry Safonov
1 sibling, 0 replies; 23+ messages in thread
From: Jakub Kicinski @ 2024-12-06 0:31 UTC (permalink / raw)
To: Eric Dumazet
Cc: Dmitry Safonov via B4 Relay, 0x7f454c46, David S. Miller,
Paolo Abeni, Simon Horman, David Ahern, Ivan Delalande,
Matthieu Baerts, Mat Martineau, Geliang Tang, Boris Pismenny,
John Fastabend, Davide Caratti, Kuniyuki Iwashima, netdev,
linux-kernel, mptcp, stable
On Thu, 5 Dec 2024 10:09:02 +0100 Eric Dumazet wrote:
> inet_diag_dump_one_icsk() could retry, doubling the size until the
> ~32768 byte limit is reached ?
>
> Also, we could make sure inet_sk_attr_size() returns at least
> NLMSG_DEFAULT_SIZE, there is no
> point trying to save memory for a single skb in inet_diag_dump_one_icsk().
SGTM :)
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH net v2 0/5] Make TCP-MD5-diag slightly less broken
2024-12-05 9:09 ` Eric Dumazet
2024-12-06 0:31 ` Jakub Kicinski
@ 2024-12-06 2:49 ` Dmitry Safonov
2024-12-06 15:14 ` Eric Dumazet
1 sibling, 1 reply; 23+ messages in thread
From: Dmitry Safonov @ 2024-12-06 2:49 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jakub Kicinski, Dmitry Safonov via B4 Relay, David S. Miller,
Paolo Abeni, Simon Horman, David Ahern, Ivan Delalande,
Matthieu Baerts, Mat Martineau, Geliang Tang, Boris Pismenny,
John Fastabend, Davide Caratti, Kuniyuki Iwashima, netdev,
linux-kernel, mptcp, stable
Hi Jakub, Eric,
On Thu, 5 Dec 2024 at 09:09, Eric Dumazet <edumazet@google.com> wrote:
> On Thu, Dec 5, 2024 at 2:13 AM Jakub Kicinski <kuba@kernel.org> wrote:
> >
> > Hi Eric!
> >
> > This was posted while you were away -- any thoughts or recommendation on
> > how to address the required nl message size changing? Or other problems
> > pointed out by Dmitry? My suggestion in the subthread is to re-dump
> > with a fixed, large buffer on EMSGSIZE, but that's not super clean..
>
> Hi Jakub
>
> inet_diag_dump_one_icsk() could retry, doubling the size until the
> ~32768 byte limit is reached ?
>
> Also, we could make sure inet_sk_attr_size() returns at least
> NLMSG_DEFAULT_SIZE, there is no
> point trying to save memory for a single skb in inet_diag_dump_one_icsk().
Starting from NLMSG_DEFAULT_SIZE sounds like a really sane idea! :-)
[..]
> @@ -585,8 +589,14 @@ int inet_diag_dump_one_icsk(struct inet_hashinfo *hashinfo,
>
> err = sk_diag_fill(sk, rep, cb, req, 0, net_admin);
> if (err < 0) {
> - WARN_ON(err == -EMSGSIZE);
> nlmsg_free(rep);
> + if (err == -EMSGSIZE) {
> + attr_size <<= 1;
> + if (attr_size + NLMSG_HDRLEN <=
> SKB_WITH_OVERHEAD(32768)) {
> + cond_resched();
> + goto retry;
> + }
> + }
> goto out;
> }
> err = nlmsg_unicast(net->diag_nlsk, rep, NETLINK_CB(in_skb).portid);
To my personal taste on larger than 327 md5 keys scale, I'd prefer to
see "dump may be inconsistent, retry if you need consistency" than
-EMSGSIZE fail, yet userspace potentially may use the errno as a
"retry" signal.
Do you plan to re-send it as a proper patch? Or I can send it with my
next patches for TCP-MD5-diag issues (1), (3), (4) and TCP-AO-diag.
Thanks,
Dmitry
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH net v2 0/5] Make TCP-MD5-diag slightly less broken
2024-12-06 2:49 ` Dmitry Safonov
@ 2024-12-06 15:14 ` Eric Dumazet
2024-12-06 20:35 ` Dmitry Safonov
0 siblings, 1 reply; 23+ messages in thread
From: Eric Dumazet @ 2024-12-06 15:14 UTC (permalink / raw)
To: Dmitry Safonov
Cc: Jakub Kicinski, Dmitry Safonov via B4 Relay, David S. Miller,
Paolo Abeni, Simon Horman, David Ahern, Ivan Delalande,
Matthieu Baerts, Mat Martineau, Geliang Tang, Boris Pismenny,
John Fastabend, Davide Caratti, Kuniyuki Iwashima, netdev,
linux-kernel, mptcp, stable
On Fri, Dec 6, 2024 at 3:49 AM Dmitry Safonov <0x7f454c46@gmail.com> wrote:
>
> Hi Jakub, Eric,
>
> On Thu, 5 Dec 2024 at 09:09, Eric Dumazet <edumazet@google.com> wrote:
> > On Thu, Dec 5, 2024 at 2:13 AM Jakub Kicinski <kuba@kernel.org> wrote:
> > >
> > > Hi Eric!
> > >
> > > This was posted while you were away -- any thoughts or recommendation on
> > > how to address the required nl message size changing? Or other problems
> > > pointed out by Dmitry? My suggestion in the subthread is to re-dump
> > > with a fixed, large buffer on EMSGSIZE, but that's not super clean..
> >
> > Hi Jakub
> >
> > inet_diag_dump_one_icsk() could retry, doubling the size until the
> > ~32768 byte limit is reached ?
> >
> > Also, we could make sure inet_sk_attr_size() returns at least
> > NLMSG_DEFAULT_SIZE, there is no
> > point trying to save memory for a single skb in inet_diag_dump_one_icsk().
>
> Starting from NLMSG_DEFAULT_SIZE sounds like a really sane idea! :-)
There is a consensus for this one, I will cook a patch with this part only.
>
> [..]
> > @@ -585,8 +589,14 @@ int inet_diag_dump_one_icsk(struct inet_hashinfo *hashinfo,
> >
> > err = sk_diag_fill(sk, rep, cb, req, 0, net_admin);
> > if (err < 0) {
> > - WARN_ON(err == -EMSGSIZE);
> > nlmsg_free(rep);
> > + if (err == -EMSGSIZE) {
> > + attr_size <<= 1;
> > + if (attr_size + NLMSG_HDRLEN <=
> > SKB_WITH_OVERHEAD(32768)) {
> > + cond_resched();
> > + goto retry;
> > + }
> > + }
> > goto out;
> > }
> > err = nlmsg_unicast(net->diag_nlsk, rep, NETLINK_CB(in_skb).portid);
>
> To my personal taste on larger than 327 md5 keys scale, I'd prefer to
> see "dump may be inconsistent, retry if you need consistency" than
> -EMSGSIZE fail, yet userspace potentially may use the errno as a
> "retry" signal.
>
I do not yet understand this point. I will let you send a patch for
further discussion.
Thanks.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH net v2 0/5] Make TCP-MD5-diag slightly less broken
2024-12-06 15:14 ` Eric Dumazet
@ 2024-12-06 20:35 ` Dmitry Safonov
0 siblings, 0 replies; 23+ messages in thread
From: Dmitry Safonov @ 2024-12-06 20:35 UTC (permalink / raw)
To: Eric Dumazet
Cc: Jakub Kicinski, Dmitry Safonov via B4 Relay, David S. Miller,
Paolo Abeni, Simon Horman, David Ahern, Ivan Delalande,
Matthieu Baerts, Mat Martineau, Geliang Tang, Boris Pismenny,
John Fastabend, Davide Caratti, Kuniyuki Iwashima, netdev,
linux-kernel, mptcp, stable
On Fri, 6 Dec 2024 at 15:15, Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Dec 6, 2024 at 3:49 AM Dmitry Safonov <0x7f454c46@gmail.com> wrote:
> >
[..]
> > > @@ -585,8 +589,14 @@ int inet_diag_dump_one_icsk(struct inet_hashinfo *hashinfo,
> > >
> > > err = sk_diag_fill(sk, rep, cb, req, 0, net_admin);
> > > if (err < 0) {
> > > - WARN_ON(err == -EMSGSIZE);
> > > nlmsg_free(rep);
> > > + if (err == -EMSGSIZE) {
> > > + attr_size <<= 1;
> > > + if (attr_size + NLMSG_HDRLEN <=
> > > SKB_WITH_OVERHEAD(32768)) {
> > > + cond_resched();
> > > + goto retry;
> > > + }
> > > + }
> > > goto out;
> > > }
> > > err = nlmsg_unicast(net->diag_nlsk, rep, NETLINK_CB(in_skb).portid);
> >
> > To my personal taste on larger than 327 md5 keys scale, I'd prefer to
> > see "dump may be inconsistent, retry if you need consistency" than
> > -EMSGSIZE fail, yet userspace potentially may use the errno as a
> > "retry" signal.
> >
>
> I do not yet understand this point. I will let you send a patch for
> further discussion.
Let me explain my view. It's based on two points:
(a) TCP-MD5/AO-diag interfaces are mostly used for
debugging/investigating/monitoring by tools alike ss. Without a
side-synchronisation, they can't be used by BGP or other tools/tests
to make decisions as the socket is controlled by another process and
the resulting dump may be incomplete, inconsistent or outdated.
(b) The current default of optmem_max limit (128Kb) allows to allocate
on a socket 655 TCP-AO keys and even more TCP-MD5 keys. Some of
Arista's customers (I'd guess the same for other BGP users) have 1000
peers (for MD5 it's one key per peer on a listen socket, for AO might
be higher).
I think the situation that's being addressed here is a race and
potentially it's rare to hit (I have to run a reproducer in a loop to
hit it). That's why in my view a re-try jump is too big of a hammer.
And failing with -EMSGSIZE on 327+ keys scale sounds slighly worse
than just marking the resulting dump as inconsistent and letting the
user decide if he wants to re-run the dump or if the dump is "good
enough" to get a sense of the situation. I would even say "the dump is
inconsistent" has its value as a signal that the keys on a socket
change right now, which may be useful.
Regarding the patch, my attempt was in this thread:
https://lore.kernel.org/all/20241113-tcp-md5-diag-prep-v2-1-00a2a7feb1fa@gmail.com/
However, I should note that I'm fine with either of the approaches
(userspace to retry on EMSGSIZE or to get an inconsistent dump and
decide what to do with that). I'm somewhat looking forward to
switching to problems (1)/(3)/(4) from the cover-letter and adding
TCP-AO-diag, rather than being stuck arguing about what's the best
solution for quite a rare race :-)
Thanks,
Dmitry
^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2024-12-06 20:35 UTC | newest]
Thread overview: 23+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-13 18:46 [PATCH net v2 0/5] Make TCP-MD5-diag slightly less broken Dmitry Safonov via B4 Relay
2024-11-13 18:46 ` [PATCH net v2 1/5] net/diag: Do not race on dumping MD5 keys with adding new MD5 keys Dmitry Safonov via B4 Relay
2024-11-13 18:46 ` [PATCH net v2 2/5] net/diag: Warn only once on EMSGSIZE Dmitry Safonov via B4 Relay
2024-11-13 18:46 ` [PATCH net v2 3/5] net/diag: Pre-allocate optional info only if requested Dmitry Safonov via B4 Relay
2024-11-13 18:46 ` [PATCH net v2 4/5] net/diag: Always pre-allocate tcp_ulp info Dmitry Safonov via B4 Relay
2024-11-13 18:46 ` [PATCH net v2 5/5] net/netlink: Correct the comment on netlink message max cap Dmitry Safonov via B4 Relay
2024-11-16 0:13 ` Jakub Kicinski
2024-11-16 0:08 ` [PATCH net v2 0/5] Make TCP-MD5-diag slightly less broken Jakub Kicinski
2024-11-16 0:48 ` Dmitry Safonov
2024-11-16 1:58 ` Jakub Kicinski
2024-11-16 3:52 ` Dmitry Safonov
2024-11-19 0:12 ` Jakub Kicinski
2024-11-20 0:19 ` Dmitry Safonov
2024-11-20 8:44 ` Johannes Berg
2024-11-20 16:13 ` Dmitry Safonov
2024-11-20 19:36 ` Johannes Berg
2024-11-16 0:20 ` patchwork-bot+netdevbpf
2024-12-05 1:13 ` Jakub Kicinski
2024-12-05 9:09 ` Eric Dumazet
2024-12-06 0:31 ` Jakub Kicinski
2024-12-06 2:49 ` Dmitry Safonov
2024-12-06 15:14 ` Eric Dumazet
2024-12-06 20:35 ` Dmitry Safonov
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).