* [PATCH net] inet: bring NLM_DONE out to a separate recv() in inet_dump_ifaddr()
@ 2024-06-01 21:25 Jakub Kicinski
2024-06-01 23:10 ` Stephen Hemminger
0 siblings, 1 reply; 15+ messages in thread
From: Jakub Kicinski @ 2024-06-01 21:25 UTC (permalink / raw)
To: davem; +Cc: netdev, edumazet, pabeni, Jakub Kicinski, Jaroslav Pulchart,
dsahern
Jaroslav reports Dell's OMSA Systems Management Data Engine
expects NLM_DONE in a separate recvmsg(), so revert to the
old behavior.
This is the same kind of fix as we added in
commit 460b0d33cf10 ("inet: bring NLM_DONE out to a separate recv() again")
so wrap the logic into a helper, to make it easier to keep track
of which dump handles we know to require legacy handling
(and possibly one day let sockets opt into not doing this).
Tested:
./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_addr.yaml \
--dump getaddr --json '{"ifa-family": 2}'
./cli.py --dbg-small-recv 4096 --spec netlink/specs/rt_route.yaml \
--dump getroute --json '{"rtm-family": 2}'
Fixes: cdb2f80f1c10 ("inet: use xa_array iterator to implement inet_dump_ifaddr()")
Reported-by: Jaroslav Pulchart <jaroslav.pulchart@gooddata.com>
Link: https://lore.kernel.org/all/CAK8fFZ7MKoFSEzMBDAOjoUt+vTZRRQgLDNXEOfdCCXSoXXKE0g@mail.gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
CC: dsahern@kernel.org
---
include/net/netlink.h | 42 +++++++++++++++++++++++++++++++++++++++++
net/ipv4/devinet.c | 3 +++
net/ipv4/fib_frontend.c | 5 +----
3 files changed, 46 insertions(+), 4 deletions(-)
diff --git a/include/net/netlink.h b/include/net/netlink.h
index e78ce008e07c..8369aca32443 100644
--- a/include/net/netlink.h
+++ b/include/net/netlink.h
@@ -1198,6 +1198,48 @@ nl_dump_check_consistent(struct netlink_callback *cb,
cb->prev_seq = cb->seq;
}
+/**
+ * nl_dump_legacy_retval - get legacy return code for netlink dumps
+ * @err: error encountered during dump, negative errno or zero
+ * @skb: skb with the dump results
+ *
+ * Netlink dump callbacks get called multiple times per dump, because
+ * all the objects may not fit into a single skb. Whether another iteration
+ * is necessary gets decided based on the return value of the callback
+ * (with 0 meaning "end reached").
+ *
+ * The semantics used to be more complicated, with positive return values
+ * meaning "continue" and negative meaning "end with an error". A lot of
+ * handlers simplified this to return skb->len ? : -errno. Meaning that zero
+ * would only be returned when skb was empty, requiring another recvmsg()
+ * syscall just to get the NLM_DONE message.
+ *
+ * The current semantics allow handlers to also return -EMSGSIZE to continue.
+ *
+ * Unfortunately, some user space has started to depend on the NLM_DONE
+ * message being returned individually, in a separate recvmsg(). Select
+ * netlink dumps must preserve those semantics.
+ *
+ * This helper wraps the "legacy logic" and serves as an annotation for
+ * dumps which are known to require legacy handling.
+ *
+ * When used in combination with for_each_netdev_dump() - make sure to
+ * invalidate the ifindex when iteration is done. for_each_netdev_dump()
+ * does not move the iterator index "after" the last valid entry.
+ *
+ * NOTE: Do not use this helper for dumps without known legacy users!
+ * Most families are accessed only using well-written libraries
+ * so starting to coalesce NLM_DONE is perfectly fine, and more efficient.
+ *
+ * Return: return code to use for a dump handler
+ */
+static inline int nl_dump_legacy_retval(int err, const struct sk_buff *skb)
+{
+ if (err < 0 && err != -EMSGSIZE)
+ return err;
+ return skb->len;
+}
+
/**************************************************************************
* Netlink Attributes
**************************************************************************/
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 96accde527da..6d0e5cbd95b4 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -1911,7 +1911,10 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb)
if (err < 0)
goto done;
}
+ ctx->ifindex = ULONG_MAX;
done:
+ err = nl_dump_legacy_retval(err, skb);
+
if (fillargs.netnsid >= 0)
put_net(tgt_net);
rcu_read_unlock();
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index c484b1c0fc00..100d77eafe35 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -1051,10 +1051,7 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
}
}
- /* Don't let NLM_DONE coalesce into a message, even if it could.
- * Some user space expects NLM_DONE in a separate recv().
- */
- err = skb->len;
+ err = nl_dump_legacy_retval(err, skb);
out:
cb->args[1] = e;
--
2.45.1
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH net] inet: bring NLM_DONE out to a separate recv() in inet_dump_ifaddr()
2024-06-01 21:25 [PATCH net] inet: bring NLM_DONE out to a separate recv() in inet_dump_ifaddr() Jakub Kicinski
@ 2024-06-01 23:10 ` Stephen Hemminger
2024-06-01 23:48 ` Jakub Kicinski
0 siblings, 1 reply; 15+ messages in thread
From: Stephen Hemminger @ 2024-06-01 23:10 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, Jaroslav Pulchart, dsahern
On Sat, 1 Jun 2024 14:25:17 -0700
Jakub Kicinski <kuba@kernel.org> wrote:
> + * When used in combination with for_each_netdev_dump() - make sure to
> + * invalidate the ifindex when iteration is done. for_each_netdev_dump()
> + * does not move the iterator index "after" the last valid entry.
> + *
> + * NOTE: Do not use this helper for dumps without known legacy users!
> + * Most families are accessed only using well-written libraries
> + * so starting to coalesce NLM_DONE is perfectly fine, and more efficient.
Sorry, I disagree.
You can't just fix the problem areas. The split was an ABI change, and there could
be a problem in any dump. This the ABI version of the old argument
If a tree falls in a forest and no one is around to hear it, does it make a sound?
All dumps must behave the same. You are stuck with the legacy behavior.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net] inet: bring NLM_DONE out to a separate recv() in inet_dump_ifaddr()
2024-06-01 23:10 ` Stephen Hemminger
@ 2024-06-01 23:48 ` Jakub Kicinski
2024-06-02 2:23 ` David Ahern
0 siblings, 1 reply; 15+ messages in thread
From: Jakub Kicinski @ 2024-06-01 23:48 UTC (permalink / raw)
To: Stephen Hemminger
Cc: davem, netdev, edumazet, pabeni, Jaroslav Pulchart, dsahern
On Sat, 1 Jun 2024 16:10:13 -0700 Stephen Hemminger wrote:
> Sorry, I disagree.
>
> You can't just fix the problem areas. The split was an ABI change, and there could
> be a problem in any dump. This the ABI version of the old argument
> If a tree falls in a forest and no one is around to hear it, does it make a sound?
>
> All dumps must behave the same. You are stuck with the legacy behavior.
The dump partitioning is up to the family. Multiple families
coalesce NLM_DONE from day 1. "All dumps must behave the same"
is saying we should convert all families to be poorly behaved.
Admittedly changing the most heavily used parts of rtnetlink is very
risky. And there's couple more corner cases which I'm afraid someone
will hit. I'm adding this helper to clearly annotate "legacy"
callbacks, so we don't regress again. At the same time nobody should
use this in new code or "just to be safe" (read: because they don't
understand netlink).
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net] inet: bring NLM_DONE out to a separate recv() in inet_dump_ifaddr()
2024-06-01 23:48 ` Jakub Kicinski
@ 2024-06-02 2:23 ` David Ahern
2024-06-02 10:00 ` Eric Dumazet
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: David Ahern @ 2024-06-02 2:23 UTC (permalink / raw)
To: Jakub Kicinski, Stephen Hemminger
Cc: davem, netdev, edumazet, pabeni, Jaroslav Pulchart
On 6/1/24 5:48 PM, Jakub Kicinski wrote:
> On Sat, 1 Jun 2024 16:10:13 -0700 Stephen Hemminger wrote:
>> Sorry, I disagree.
>>
>> You can't just fix the problem areas. The split was an ABI change, and there could
>> be a problem in any dump. This the ABI version of the old argument
>> If a tree falls in a forest and no one is around to hear it, does it make a sound?
>>
>> All dumps must behave the same. You are stuck with the legacy behavior.
I don't agree with such a hard line stance. Mistakes made 20 years ago
cannot hold Linux back from moving forward. We have to continue
searching for ways to allow better or more performant behavior.
>
> The dump partitioning is up to the family. Multiple families
> coalesce NLM_DONE from day 1. "All dumps must behave the same"
> is saying we should convert all families to be poorly behaved.
>
> Admittedly changing the most heavily used parts of rtnetlink is very
> risky. And there's couple more corner cases which I'm afraid someone
> will hit. I'm adding this helper to clearly annotate "legacy"
> callbacks, so we don't regress again. At the same time nobody should
> use this in new code or "just to be safe" (read: because they don't
> understand netlink).
What about a socket option that says "I am a modern app and can handle
the new way" - similar to the strict mode option that was added? Then
the decision of requiring a separate message for NLM_DONE can be based
on the app. Could even throw a `pr_warn_once("modernize app %s/%d\n")`
to help old apps understand they need to move forward.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net] inet: bring NLM_DONE out to a separate recv() in inet_dump_ifaddr()
2024-06-02 2:23 ` David Ahern
@ 2024-06-02 10:00 ` Eric Dumazet
2024-06-02 22:21 ` Jakub Kicinski
2024-06-02 21:59 ` Jakub Kicinski
2024-06-03 14:05 ` Jamal Hadi Salim
2 siblings, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2024-06-02 10:00 UTC (permalink / raw)
To: David Ahern
Cc: Jakub Kicinski, Stephen Hemminger, davem, netdev, pabeni,
Jaroslav Pulchart
On Sun, Jun 2, 2024 at 4:23 AM David Ahern <dsahern@kernel.org> wrote:
>
> On 6/1/24 5:48 PM, Jakub Kicinski wrote:
> > On Sat, 1 Jun 2024 16:10:13 -0700 Stephen Hemminger wrote:
> >> Sorry, I disagree.
> >>
> >> You can't just fix the problem areas. The split was an ABI change, and there could
> >> be a problem in any dump. This the ABI version of the old argument
> >> If a tree falls in a forest and no one is around to hear it, does it make a sound?
> >>
> >> All dumps must behave the same. You are stuck with the legacy behavior.
>
> I don't agree with such a hard line stance. Mistakes made 20 years ago
> cannot hold Linux back from moving forward. We have to continue
> searching for ways to allow better or more performant behavior.
>
> >
> > The dump partitioning is up to the family. Multiple families
> > coalesce NLM_DONE from day 1. "All dumps must behave the same"
> > is saying we should convert all families to be poorly behaved.
> >
> > Admittedly changing the most heavily used parts of rtnetlink is very
> > risky. And there's couple more corner cases which I'm afraid someone
> > will hit. I'm adding this helper to clearly annotate "legacy"
> > callbacks, so we don't regress again. At the same time nobody should
> > use this in new code or "just to be safe" (read: because they don't
> > understand netlink).
>
> What about a socket option that says "I am a modern app and can handle
> the new way" - similar to the strict mode option that was added? Then
> the decision of requiring a separate message for NLM_DONE can be based
> on the app. Could even throw a `pr_warn_once("modernize app %s/%d\n")`
> to help old apps understand they need to move forward.
Main motivation for me was to avoid re-grabbing RTNL again just to get NLM_DONE.
The avoidance of the two system calls was really secondary.
I think we could make a generic change in netlink_dump() to force NLM_DONE
in an empty message _and_ avoiding a useless call to the dump method, which
might still use RTNL or other contended mutex.
In a prior feedback I suggested a sysctl that Jakub disliked,
but really we do not care yet, as long as we avoid RTNL as much as we can.
Jakub, what about the following generic change, instead of ad-hoc changes ?
I tested it, I can send it with the minimal change (the alloc skb
optim will reach net-next)
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index fa9c090cf629e6e92c097285b262ed90324c7656..0a58e5d13b8e68dd3fbb2b3fb362c3399fa29909
100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2289,15 +2289,20 @@ static int netlink_dump(struct sock *sk, bool
lock_taken)
* ever provided a big enough buffer.
*/
cb = &nlk->cb;
- alloc_min_size = max_t(int, cb->min_dump_alloc, NLMSG_GOODSIZE);
-
- max_recvmsg_len = READ_ONCE(nlk->max_recvmsg_len);
- if (alloc_min_size < max_recvmsg_len) {
- alloc_size = max_recvmsg_len;
- skb = alloc_skb(alloc_size,
+ if (nlk->dump_done_errno) {
+ alloc_min_size = max_t(int, cb->min_dump_alloc, NLMSG_GOODSIZE);
+ max_recvmsg_len = READ_ONCE(nlk->max_recvmsg_len);
+ if (alloc_min_size < max_recvmsg_len) {
+ alloc_size = max_recvmsg_len;
+ skb = alloc_skb(alloc_size,
(GFP_KERNEL & ~__GFP_DIRECT_RECLAIM) |
__GFP_NOWARN | __GFP_NORETRY);
+ }
+ } else {
+ /* Allocate the space needed for NLMSG_DONE alone. */
+ alloc_min_size = nlmsg_total_size(sizeof(nlk->dump_done_errno));
}
+
if (!skb) {
alloc_size = alloc_min_size;
skb = alloc_skb(alloc_size, GFP_KERNEL);
@@ -2350,8 +2355,7 @@ static int netlink_dump(struct sock *sk, bool lock_taken)
cb->extack = NULL;
}
- if (nlk->dump_done_errno > 0 ||
- skb_tailroom(skb) <
nlmsg_total_size(sizeof(nlk->dump_done_errno))) {
+ if (skb->len) {
mutex_unlock(&nlk->nl_cb_mutex);
if (sk_filter(sk, skb))
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net] inet: bring NLM_DONE out to a separate recv() in inet_dump_ifaddr()
2024-06-02 2:23 ` David Ahern
2024-06-02 10:00 ` Eric Dumazet
@ 2024-06-02 21:59 ` Jakub Kicinski
2024-06-03 2:42 ` David Ahern
2024-06-03 14:05 ` Jamal Hadi Salim
2 siblings, 1 reply; 15+ messages in thread
From: Jakub Kicinski @ 2024-06-02 21:59 UTC (permalink / raw)
To: David Ahern
Cc: Stephen Hemminger, davem, netdev, edumazet, pabeni,
Jaroslav Pulchart
On Sat, 1 Jun 2024 20:23:17 -0600 David Ahern wrote:
> > The dump partitioning is up to the family. Multiple families
> > coalesce NLM_DONE from day 1. "All dumps must behave the same"
> > is saying we should convert all families to be poorly behaved.
> >
> > Admittedly changing the most heavily used parts of rtnetlink is very
> > risky. And there's couple more corner cases which I'm afraid someone
> > will hit. I'm adding this helper to clearly annotate "legacy"
> > callbacks, so we don't regress again. At the same time nobody should
> > use this in new code or "just to be safe" (read: because they don't
> > understand netlink).
>
> What about a socket option that says "I am a modern app and can handle
> the new way" - similar to the strict mode option that was added? Then
> the decision of requiring a separate message for NLM_DONE can be based
> on the app.
That seems like a good solution, with the helper marking the "legacy"
handlers - I hope it should be trivial to add such option and change
the helper's behavior based on the socket state.
> Could even throw a `pr_warn_once("modernize app %s/%d\n")`
> to help old apps understand they need to move forward.
Hm, do you think people would actually modernize all the legacy apps?
Coincidentally, looking at Jaroslav's traces it appears that the app
sets ifindex for the link dump, so it must not be opting into strict
checking, either.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net] inet: bring NLM_DONE out to a separate recv() in inet_dump_ifaddr()
2024-06-02 10:00 ` Eric Dumazet
@ 2024-06-02 22:21 ` Jakub Kicinski
2024-06-03 13:54 ` Jakub Kicinski
0 siblings, 1 reply; 15+ messages in thread
From: Jakub Kicinski @ 2024-06-02 22:21 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Ahern, Stephen Hemminger, davem, netdev, pabeni,
Jaroslav Pulchart
On Sun, 2 Jun 2024 12:00:09 +0200 Eric Dumazet wrote:
> Main motivation for me was to avoid re-grabbing RTNL again just to get NLM_DONE.
>
> The avoidance of the two system calls was really secondary.
>
> I think we could make a generic change in netlink_dump() to force NLM_DONE
> in an empty message _and_ avoiding a useless call to the dump method, which
> might still use RTNL or other contended mutex.
>
> In a prior feedback I suggested a sysctl that Jakub disliked,
> but really we do not care yet, as long as we avoid RTNL as much as we can.
>
> Jakub, what about the following generic change, instead of ad-hoc changes ?
Netlink is full of legacy behavior, the only way to make it usable
in modern environments is to let new families not repeat the mistakes.
That's why I'd really rather not add a workaround at the af_netlink
level. Why would ethtool (which correctly coalesced NLM_DONE from day 1)
suddenly start needed another recv(). A lot of the time the entire dump
fits in one skb.
If you prefer to sacrifice all of rtnetlink (some of which, to be clear,
has also been correctly coded from day 1) - we can add a trampoline for
rtnetlink dump handlers?
(just to illustrate, not even compile tested)
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 522bbd70c205..c59b39ee544f 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -6487,6 +6487,18 @@ static int rtnl_mdb_del(struct sk_buff *skb, struct nlmsghdr *nlh,
/* Process one rtnetlink message. */
+static int rtnl_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
+{
+ rtnl_dumpit_func dumpit = cb->data;
+ int err;
+
+ err = dumpit(skb, cb);
+ if (err < 0 && err != -EMSGSIZE)
+ return err;
+
+ return skb->len;
+}
+
static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh,
struct netlink_ext_ack *extack)
{
@@ -6546,10 +6558,11 @@ static int rtnetlink_rcv_msg(struct sk_buff *skb, struct nlmsghdr *nlh,
rtnl = net->rtnl;
if (err == 0) {
struct netlink_dump_control c = {
- .dump = dumpit,
+ .dump = rtnl_dumpit,
.min_dump_alloc = min_dump_alloc,
.module = owner,
.flags = flags,
+ .data = dumpit,
};
err = netlink_dump_start(rtnl, skb, nlh, &c);
/* netlink_dump_start() will keep a reference on
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH net] inet: bring NLM_DONE out to a separate recv() in inet_dump_ifaddr()
2024-06-02 21:59 ` Jakub Kicinski
@ 2024-06-03 2:42 ` David Ahern
0 siblings, 0 replies; 15+ messages in thread
From: David Ahern @ 2024-06-03 2:42 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Stephen Hemminger, davem, netdev, edumazet, pabeni,
Jaroslav Pulchart
On 6/2/24 3:59 PM, Jakub Kicinski wrote:
> On Sat, 1 Jun 2024 20:23:17 -0600 David Ahern wrote:
>>> The dump partitioning is up to the family. Multiple families
>>> coalesce NLM_DONE from day 1. "All dumps must behave the same"
>>> is saying we should convert all families to be poorly behaved.
>>>
>>> Admittedly changing the most heavily used parts of rtnetlink is very
>>> risky. And there's couple more corner cases which I'm afraid someone
>>> will hit. I'm adding this helper to clearly annotate "legacy"
>>> callbacks, so we don't regress again. At the same time nobody should
>>> use this in new code or "just to be safe" (read: because they don't
>>> understand netlink).
>>
>> What about a socket option that says "I am a modern app and can handle
>> the new way" - similar to the strict mode option that was added? Then
>> the decision of requiring a separate message for NLM_DONE can be based
>> on the app.
>
> That seems like a good solution, with the helper marking the "legacy"
> handlers - I hope it should be trivial to add such option and change
> the helper's behavior based on the socket state.
>
>> Could even throw a `pr_warn_once("modernize app %s/%d\n")`
>> to help old apps understand they need to move forward.
>
> Hm, do you think people would actually modernize all the legacy apps?
I have worked for a few companies that do monitor dmesg and when given
the right push will update apps. Best an OS can do.
>
> Coincidentally, looking at Jaroslav's traces it appears that the app
> sets ifindex for the link dump, so it must not be opting into strict
> checking, either.
:-(
I should have added a warning back when the option was introduced - that
and a warning when options to a dump are ignored.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net] inet: bring NLM_DONE out to a separate recv() in inet_dump_ifaddr()
2024-06-02 22:21 ` Jakub Kicinski
@ 2024-06-03 13:54 ` Jakub Kicinski
2024-06-03 14:22 ` Eric Dumazet
0 siblings, 1 reply; 15+ messages in thread
From: Jakub Kicinski @ 2024-06-03 13:54 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Ahern, Stephen Hemminger, davem, netdev, pabeni,
Jaroslav Pulchart
On Sun, 2 Jun 2024 15:21:02 -0700 Jakub Kicinski wrote:
> Netlink is full of legacy behavior, the only way to make it usable
> in modern environments is to let new families not repeat the mistakes.
> That's why I'd really rather not add a workaround at the af_netlink
> level. Why would ethtool (which correctly coalesced NLM_DONE from day 1)
> suddenly start needed another recv(). A lot of the time the entire dump
> fits in one skb.
>
> If you prefer to sacrifice all of rtnetlink (some of which, to be clear,
> has also been correctly coded from day 1) - we can add a trampoline for
> rtnetlink dump handlers?
Hi Eric, how do you feel about this approach? It would also let us
extract the "RTNL unlocked dump" handling from af_netlink.c, which
would be nice.
BTW it will probably need to be paired with fixing the
for_each_netdev_dump() foot gun, maybe (untested):
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3025,7 +3025,8 @@ int call_netdevice_notifiers_info(unsigned long val,
#define net_device_entry(lh) list_entry(lh, struct net_device, dev_list)
#define for_each_netdev_dump(net, d, ifindex) \
- xa_for_each_start(&(net)->dev_by_index, (ifindex), (d), (ifindex))
+ for (; (d = xa_find(&(net)->dev_by_index, &ifindex, \
+ ULONG_MAX, XA_PRESENT)); ifindex++)
static inline struct net_device *next_net_device(struct net_device *dev)
{
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net] inet: bring NLM_DONE out to a separate recv() in inet_dump_ifaddr()
2024-06-02 2:23 ` David Ahern
2024-06-02 10:00 ` Eric Dumazet
2024-06-02 21:59 ` Jakub Kicinski
@ 2024-06-03 14:05 ` Jamal Hadi Salim
2024-06-03 15:33 ` David Ahern
2024-06-03 15:34 ` Eric Dumazet
2 siblings, 2 replies; 15+ messages in thread
From: Jamal Hadi Salim @ 2024-06-03 14:05 UTC (permalink / raw)
To: David Ahern
Cc: Jakub Kicinski, Stephen Hemminger, davem, netdev, edumazet,
pabeni, Jaroslav Pulchart
On Sat, Jun 1, 2024 at 10:23 PM David Ahern <dsahern@kernel.org> wrote:
>
> On 6/1/24 5:48 PM, Jakub Kicinski wrote:
> > On Sat, 1 Jun 2024 16:10:13 -0700 Stephen Hemminger wrote:
> >> Sorry, I disagree.
> >>
> >> You can't just fix the problem areas. The split was an ABI change, and there could
> >> be a problem in any dump. This the ABI version of the old argument
> >> If a tree falls in a forest and no one is around to hear it, does it make a sound?
> >>
> >> All dumps must behave the same. You are stuck with the legacy behavior.
>
> I don't agree with such a hard line stance. Mistakes made 20 years ago
> cannot hold Linux back from moving forward. We have to continue
> searching for ways to allow better or more performant behavior.
>
> >
> > The dump partitioning is up to the family. Multiple families
> > coalesce NLM_DONE from day 1. "All dumps must behave the same"
> > is saying we should convert all families to be poorly behaved.
> >
> > Admittedly changing the most heavily used parts of rtnetlink is very
> > risky. And there's couple more corner cases which I'm afraid someone
> > will hit. I'm adding this helper to clearly annotate "legacy"
> > callbacks, so we don't regress again. At the same time nobody should
> > use this in new code or "just to be safe" (read: because they don't
> > understand netlink).
>
> What about a socket option that says "I am a modern app and can handle
> the new way" - similar to the strict mode option that was added? Then
> the decision of requiring a separate message for NLM_DONE can be based
> on the app. Could even throw a `pr_warn_once("modernize app %s/%d\n")`
> to help old apps understand they need to move forward.
>
Sorry, being a little lazy so asking instead:
NLMSG_DONE is traditionally the "EOT" (end of transaction) signal, if
you get rid of it - how does the user know there are more msgs coming
or the dump transaction is over? In addition to the user->kernel "I am
modern", perhaps set the nlmsg_flag in the reverse path to either say
"there's more coming" which you dont set on the last message or "we
are doing this the new way". Backward compat is very important - there
are dinosaur apps out there that will break otherwise.
cheers,
jamal
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net] inet: bring NLM_DONE out to a separate recv() in inet_dump_ifaddr()
2024-06-03 13:54 ` Jakub Kicinski
@ 2024-06-03 14:22 ` Eric Dumazet
2024-06-03 14:59 ` Jakub Kicinski
0 siblings, 1 reply; 15+ messages in thread
From: Eric Dumazet @ 2024-06-03 14:22 UTC (permalink / raw)
To: Jakub Kicinski
Cc: David Ahern, Stephen Hemminger, davem, netdev, pabeni,
Jaroslav Pulchart
On Mon, Jun 3, 2024 at 3:54 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Sun, 2 Jun 2024 15:21:02 -0700 Jakub Kicinski wrote:
> > Netlink is full of legacy behavior, the only way to make it usable
> > in modern environments is to let new families not repeat the mistakes.
> > That's why I'd really rather not add a workaround at the af_netlink
> > level. Why would ethtool (which correctly coalesced NLM_DONE from day 1)
> > suddenly start needed another recv(). A lot of the time the entire dump
> > fits in one skb.
> >
> > If you prefer to sacrifice all of rtnetlink (some of which, to be clear,
> > has also been correctly coded from day 1) - we can add a trampoline for
> > rtnetlink dump handlers?
>
> Hi Eric, how do you feel about this approach? It would also let us
> extract the "RTNL unlocked dump" handling from af_netlink.c, which
> would be nice.
Sure, I have not thought of af_netlink
>
> BTW it will probably need to be paired with fixing the
> for_each_netdev_dump() foot gun, maybe (untested):
>
I confess I am a bit lost : this part relates to your original submission,
when you set "ctx->ifindex = ULONG_MAX;" in inet_dump_ifaddr() ?
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3025,7 +3025,8 @@ int call_netdevice_notifiers_info(unsigned long val,
> #define net_device_entry(lh) list_entry(lh, struct net_device, dev_list)
>
> #define for_each_netdev_dump(net, d, ifindex) \
> - xa_for_each_start(&(net)->dev_by_index, (ifindex), (d), (ifindex))
> + for (; (d = xa_find(&(net)->dev_by_index, &ifindex, \
> + ULONG_MAX, XA_PRESENT)); ifindex++)
>
> static inline struct net_device *next_net_device(struct net_device *dev)
> {
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net] inet: bring NLM_DONE out to a separate recv() in inet_dump_ifaddr()
2024-06-03 14:22 ` Eric Dumazet
@ 2024-06-03 14:59 ` Jakub Kicinski
2024-06-03 15:26 ` Eric Dumazet
0 siblings, 1 reply; 15+ messages in thread
From: Jakub Kicinski @ 2024-06-03 14:59 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Ahern, Stephen Hemminger, davem, netdev, pabeni,
Jaroslav Pulchart
On Mon, 3 Jun 2024 16:22:19 +0200 Eric Dumazet wrote:
> > Hi Eric, how do you feel about this approach? It would also let us
> > extract the "RTNL unlocked dump" handling from af_netlink.c, which
> > would be nice.
>
> Sure, I have not thought of af_netlink
>
> > BTW it will probably need to be paired with fixing the
> > for_each_netdev_dump() foot gun, maybe (untested):
>
> I confess I am a bit lost : this part relates to your original submission,
> when you set "ctx->ifindex = ULONG_MAX;" in inet_dump_ifaddr() ?
Yes, xa_for_each_start() leaves ifindex at the last valid entry, so
without this snippet it's not safe to call dump again after we reached
the end.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net] inet: bring NLM_DONE out to a separate recv() in inet_dump_ifaddr()
2024-06-03 14:59 ` Jakub Kicinski
@ 2024-06-03 15:26 ` Eric Dumazet
0 siblings, 0 replies; 15+ messages in thread
From: Eric Dumazet @ 2024-06-03 15:26 UTC (permalink / raw)
To: Jakub Kicinski
Cc: David Ahern, Stephen Hemminger, davem, netdev, pabeni,
Jaroslav Pulchart
On Mon, Jun 3, 2024 at 4:59 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 3 Jun 2024 16:22:19 +0200 Eric Dumazet wrote:
> > > Hi Eric, how do you feel about this approach? It would also let us
> > > extract the "RTNL unlocked dump" handling from af_netlink.c, which
> > > would be nice.
> >
> > Sure, I have not thought of af_netlink
> >
> > > BTW it will probably need to be paired with fixing the
> > > for_each_netdev_dump() foot gun, maybe (untested):
> >
> > I confess I am a bit lost : this part relates to your original submission,
> > when you set "ctx->ifindex = ULONG_MAX;" in inet_dump_ifaddr() ?
>
> Yes, xa_for_each_start() leaves ifindex at the last valid entry, so
> without this snippet it's not safe to call dump again after we reached
> the end.
A generic change would be to make sure we remember we hit the end,
I guess this can be done later.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net] inet: bring NLM_DONE out to a separate recv() in inet_dump_ifaddr()
2024-06-03 14:05 ` Jamal Hadi Salim
@ 2024-06-03 15:33 ` David Ahern
2024-06-03 15:34 ` Eric Dumazet
1 sibling, 0 replies; 15+ messages in thread
From: David Ahern @ 2024-06-03 15:33 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: Jakub Kicinski, Stephen Hemminger, davem, netdev, edumazet,
pabeni, Jaroslav Pulchart
On 6/3/24 8:05 AM, Jamal Hadi Salim wrote:
>
> Sorry, being a little lazy so asking instead:
> NLMSG_DONE is traditionally the "EOT" (end of transaction) signal, if
> you get rid of it - how does the user know there are more msgs coming
> or the dump transaction is over? In addition to the user->kernel "I am
> modern", perhaps set the nlmsg_flag in the reverse path to either say
> "there's more coming" which you dont set on the last message or "we
> are doing this the new way". Backward compat is very important - there
> are dinosaur apps out there that will break otherwise.
>
NLM_DONE is not getting removed. The recent changes allow the end
message signal without a separate system call.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH net] inet: bring NLM_DONE out to a separate recv() in inet_dump_ifaddr()
2024-06-03 14:05 ` Jamal Hadi Salim
2024-06-03 15:33 ` David Ahern
@ 2024-06-03 15:34 ` Eric Dumazet
1 sibling, 0 replies; 15+ messages in thread
From: Eric Dumazet @ 2024-06-03 15:34 UTC (permalink / raw)
To: Jamal Hadi Salim
Cc: David Ahern, Jakub Kicinski, Stephen Hemminger, davem, netdev,
pabeni, Jaroslav Pulchart
On Mon, Jun 3, 2024 at 4:05 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Sat, Jun 1, 2024 at 10:23 PM David Ahern <dsahern@kernel.org> wrote:
> >
> > On 6/1/24 5:48 PM, Jakub Kicinski wrote:
> > > On Sat, 1 Jun 2024 16:10:13 -0700 Stephen Hemminger wrote:
> > >> Sorry, I disagree.
> > >>
> > >> You can't just fix the problem areas. The split was an ABI change, and there could
> > >> be a problem in any dump. This the ABI version of the old argument
> > >> If a tree falls in a forest and no one is around to hear it, does it make a sound?
> > >>
> > >> All dumps must behave the same. You are stuck with the legacy behavior.
> >
> > I don't agree with such a hard line stance. Mistakes made 20 years ago
> > cannot hold Linux back from moving forward. We have to continue
> > searching for ways to allow better or more performant behavior.
> >
> > >
> > > The dump partitioning is up to the family. Multiple families
> > > coalesce NLM_DONE from day 1. "All dumps must behave the same"
> > > is saying we should convert all families to be poorly behaved.
> > >
> > > Admittedly changing the most heavily used parts of rtnetlink is very
> > > risky. And there's couple more corner cases which I'm afraid someone
> > > will hit. I'm adding this helper to clearly annotate "legacy"
> > > callbacks, so we don't regress again. At the same time nobody should
> > > use this in new code or "just to be safe" (read: because they don't
> > > understand netlink).
> >
> > What about a socket option that says "I am a modern app and can handle
> > the new way" - similar to the strict mode option that was added? Then
> > the decision of requiring a separate message for NLM_DONE can be based
> > on the app. Could even throw a `pr_warn_once("modernize app %s/%d\n")`
> > to help old apps understand they need to move forward.
> >
>
> Sorry, being a little lazy so asking instead:
> NLMSG_DONE is traditionally the "EOT" (end of transaction) signal, if
> you get rid of it - how does the user know there are more msgs coming
> or the dump transaction is over? In addition to the user->kernel "I am
> modern", perhaps set the nlmsg_flag in the reverse path to either say
> "there's more coming" which you dont set on the last message or "we
> are doing this the new way". Backward compat is very important - there
> are dinosaur apps out there that will break otherwise.
The NLMSG_DONE was not removed.
Some applications expected it to be carried in a standalone message
for some of the rtnetlink operations,
because old kernel implementations accidentally had this
(undocumented) behavior.
When the kernel started to be smart and piggy-back the NLMSG_DONE in
the 'last given answer',
these applications started to complain.
Basically these applications do not correctly parse the full answer
the kernel gives to them.
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2024-06-03 15:34 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-01 21:25 [PATCH net] inet: bring NLM_DONE out to a separate recv() in inet_dump_ifaddr() Jakub Kicinski
2024-06-01 23:10 ` Stephen Hemminger
2024-06-01 23:48 ` Jakub Kicinski
2024-06-02 2:23 ` David Ahern
2024-06-02 10:00 ` Eric Dumazet
2024-06-02 22:21 ` Jakub Kicinski
2024-06-03 13:54 ` Jakub Kicinski
2024-06-03 14:22 ` Eric Dumazet
2024-06-03 14:59 ` Jakub Kicinski
2024-06-03 15:26 ` Eric Dumazet
2024-06-02 21:59 ` Jakub Kicinski
2024-06-03 2:42 ` David Ahern
2024-06-03 14:05 ` Jamal Hadi Salim
2024-06-03 15:33 ` David Ahern
2024-06-03 15:34 ` Eric Dumazet
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).