netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net-next v2 0/3] netlink: handle EMSGSIZE errors in the core
@ 2024-03-03  5:24 Jakub Kicinski
  2024-03-03  5:24 ` [PATCH net-next v2 1/3] " Jakub Kicinski
                   ` (3 more replies)
  0 siblings, 4 replies; 25+ messages in thread
From: Jakub Kicinski @ 2024-03-03  5:24 UTC (permalink / raw)
  To: davem
  Cc: netdev, edumazet, pabeni, jiri, idosch, johannes, fw, pablo,
	Jakub Kicinski

Ido discovered some time back that we usually force NLMSG_DONE
to be delivered in a separate recv() syscall, even if it would
fit into the same skb as data messages. He made nexthop try
to fit DONE with data in commit 8743aeff5bc4 ("nexthop: Fix
infinite nexthop bucket dump when using maximum nexthop ID"),
and nobody has complained so far.

We have since also tried to follow the same pattern in new
genetlink families, but explaining to people, or even remembering
the correct handling ourselves is tedious.

Let the netlink socket layer consume -EMSGSIZE errors.
Practically speaking most families use this error code
as "dump needs more space", anyway.

v2:
 - init err to 0 in last patch
v1: https://lore.kernel.org/all/20240301012845.2951053-1-kuba@kernel.org/

Jakub Kicinski (3):
  netlink: handle EMSGSIZE errors in the core
  netdev: let netlink core handle -EMSGSIZE errors
  genetlink: fit NLMSG_DONE into same read() as families

 net/core/netdev-genl.c    | 15 +++------------
 net/core/page_pool_user.c |  2 --
 net/netlink/af_netlink.c  |  9 +++++++++
 net/netlink/genetlink.c   | 12 +++++++-----
 4 files changed, 19 insertions(+), 19 deletions(-)

-- 
2.44.0


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH net-next v2 1/3] netlink: handle EMSGSIZE errors in the core
  2024-03-03  5:24 [PATCH net-next v2 0/3] netlink: handle EMSGSIZE errors in the core Jakub Kicinski
@ 2024-03-03  5:24 ` Jakub Kicinski
  2024-03-03 15:01   ` Ido Schimmel
  2024-03-03  5:24 ` [PATCH net-next v2 2/3] netdev: let netlink core handle -EMSGSIZE errors Jakub Kicinski
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 25+ messages in thread
From: Jakub Kicinski @ 2024-03-03  5:24 UTC (permalink / raw)
  To: davem
  Cc: netdev, edumazet, pabeni, jiri, idosch, johannes, fw, pablo,
	Jakub Kicinski, kuniyu

Eric points out that our current suggested way of handling
EMSGSIZE errors ((err == -EMSGSIZE) ? skb->len : err) will
break if we didn't fit even a single object into the buffer
provided by the user. This should not happen for well behaved
applications, but we can fix that, and free netlink families
from dealing with that completely by moving error handling
into the core.

Let's assume from now on that all EMSGSIZE errors in dumps are
because we run out of skb space. Families can now propagate
the error nla_put_*() etc generated and not worry about any
return value magic. If some family really wants to send EMSGSIZE
to user space, assuming it generates the same error on the next
dump iteration the skb->len should be 0, and user space should
still see the EMSGSIZE.

This should simplify families and prevent mistakes in return
values which lead to DONE being forced into a separate recv()
call as discovered by Ido some time ago.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
CC: kuniyu@amazon.com
---
 net/netlink/af_netlink.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index ad7b645e3ae7..da846212fb9b 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2267,6 +2267,15 @@ static int netlink_dump(struct sock *sk, bool lock_taken)
 		if (extra_mutex)
 			mutex_unlock(extra_mutex);
 
+		/* EMSGSIZE plus something already in the skb means
+		 * that there's more to dump but current skb has filled up.
+		 * If the callback really wants to return EMSGSIZE to user space
+		 * it needs to do so again, on the next cb->dump() call,
+		 * without putting data in the skb.
+		 */
+		if (nlk->dump_done_errno == -EMSGSIZE && skb->len)
+			nlk->dump_done_errno = skb->len;
+
 		cb->extack = NULL;
 	}
 
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH net-next v2 2/3] netdev: let netlink core handle -EMSGSIZE errors
  2024-03-03  5:24 [PATCH net-next v2 0/3] netlink: handle EMSGSIZE errors in the core Jakub Kicinski
  2024-03-03  5:24 ` [PATCH net-next v2 1/3] " Jakub Kicinski
@ 2024-03-03  5:24 ` Jakub Kicinski
  2024-03-03 15:08   ` Ido Schimmel
  2024-03-03  5:24 ` [PATCH net-next v2 3/3] genetlink: fit NLMSG_DONE into same read() as families Jakub Kicinski
  2024-03-06  8:10 ` [PATCH net-next v2 0/3] netlink: handle EMSGSIZE errors in the core patchwork-bot+netdevbpf
  3 siblings, 1 reply; 25+ messages in thread
From: Jakub Kicinski @ 2024-03-03  5:24 UTC (permalink / raw)
  To: davem
  Cc: netdev, edumazet, pabeni, jiri, idosch, johannes, fw, pablo,
	Jakub Kicinski, amritha.nambiar, sridhar.samudrala, hawk

Previous change added -EMSGSIZE handling to af_netlink, we don't
have to hide these errors any longer.

Theoretically the error handling changes from:
 if (err == -EMSGSIZE)
to
 if (err == -EMSGSIZE && skb->len)

everywhere, but in practice it doesn't matter.
All messages fit into NLMSG_GOODSIZE, so overflow of an empty
skb cannot happen.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
CC: amritha.nambiar@intel.com
CC: sridhar.samudrala@intel.com
CC: hawk@kernel.org
---
 net/core/netdev-genl.c    | 15 +++------------
 net/core/page_pool_user.c |  2 --
 2 files changed, 3 insertions(+), 14 deletions(-)

diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index fd98936da3ae..918b109e0cf4 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -152,10 +152,7 @@ int netdev_nl_dev_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
 	}
 	rtnl_unlock();
 
-	if (err != -EMSGSIZE)
-		return err;
-
-	return skb->len;
+	return err;
 }
 
 static int
@@ -287,10 +284,7 @@ int netdev_nl_napi_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
 	}
 	rtnl_unlock();
 
-	if (err != -EMSGSIZE)
-		return err;
-
-	return skb->len;
+	return err;
 }
 
 static int
@@ -463,10 +457,7 @@ int netdev_nl_queue_get_dumpit(struct sk_buff *skb, struct netlink_callback *cb)
 	}
 	rtnl_unlock();
 
-	if (err != -EMSGSIZE)
-		return err;
-
-	return skb->len;
+	return err;
 }
 
 static int netdev_genl_netdevice_event(struct notifier_block *nb,
diff --git a/net/core/page_pool_user.c b/net/core/page_pool_user.c
index 278294aca66a..3a3277ba167b 100644
--- a/net/core/page_pool_user.c
+++ b/net/core/page_pool_user.c
@@ -103,8 +103,6 @@ netdev_nl_page_pool_get_dump(struct sk_buff *skb, struct netlink_callback *cb,
 	mutex_unlock(&page_pools_lock);
 	rtnl_unlock();
 
-	if (skb->len && err == -EMSGSIZE)
-		return skb->len;
 	return err;
 }
 
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH net-next v2 3/3] genetlink: fit NLMSG_DONE into same read() as families
  2024-03-03  5:24 [PATCH net-next v2 0/3] netlink: handle EMSGSIZE errors in the core Jakub Kicinski
  2024-03-03  5:24 ` [PATCH net-next v2 1/3] " Jakub Kicinski
  2024-03-03  5:24 ` [PATCH net-next v2 2/3] netdev: let netlink core handle -EMSGSIZE errors Jakub Kicinski
@ 2024-03-03  5:24 ` Jakub Kicinski
  2024-03-03 15:10   ` Ido Schimmel
  2024-03-15 11:48   ` Stefano Brivio
  2024-03-06  8:10 ` [PATCH net-next v2 0/3] netlink: handle EMSGSIZE errors in the core patchwork-bot+netdevbpf
  3 siblings, 2 replies; 25+ messages in thread
From: Jakub Kicinski @ 2024-03-03  5:24 UTC (permalink / raw)
  To: davem
  Cc: netdev, edumazet, pabeni, jiri, idosch, johannes, fw, pablo,
	Jakub Kicinski

Make sure ctrl_fill_info() returns sensible error codes and
propagate them out to netlink core. Let netlink core decide
when to return skb->len and when to treat the exit as an
error. Netlink core does better job at it, if we always
return skb->len the core doesn't know when we're done
dumping and NLMSG_DONE ends up in a separate read().

Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
CC: jiri@resnulli.us
---
 net/netlink/genetlink.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
index 50ec599a5cff..3b7666944b11 100644
--- a/net/netlink/genetlink.c
+++ b/net/netlink/genetlink.c
@@ -1232,7 +1232,7 @@ static int ctrl_fill_info(const struct genl_family *family, u32 portid, u32 seq,
 
 	hdr = genlmsg_put(skb, portid, seq, &genl_ctrl, flags, cmd);
 	if (hdr == NULL)
-		return -1;
+		return -EMSGSIZE;
 
 	if (nla_put_string(skb, CTRL_ATTR_FAMILY_NAME, family->name) ||
 	    nla_put_u16(skb, CTRL_ATTR_FAMILY_ID, family->id) ||
@@ -1355,6 +1355,7 @@ static int ctrl_dumpfamily(struct sk_buff *skb, struct netlink_callback *cb)
 	struct net *net = sock_net(skb->sk);
 	int fams_to_skip = cb->args[0];
 	unsigned int id;
+	int err = 0;
 
 	idr_for_each_entry(&genl_fam_idr, rt, id) {
 		if (!rt->netnsok && !net_eq(net, &init_net))
@@ -1363,16 +1364,17 @@ static int ctrl_dumpfamily(struct sk_buff *skb, struct netlink_callback *cb)
 		if (n++ < fams_to_skip)
 			continue;
 
-		if (ctrl_fill_info(rt, NETLINK_CB(cb->skb).portid,
-				   cb->nlh->nlmsg_seq, NLM_F_MULTI,
-				   skb, CTRL_CMD_NEWFAMILY) < 0) {
+		err = ctrl_fill_info(rt, NETLINK_CB(cb->skb).portid,
+				     cb->nlh->nlmsg_seq, NLM_F_MULTI,
+				     skb, CTRL_CMD_NEWFAMILY);
+		if (err) {
 			n--;
 			break;
 		}
 	}
 
 	cb->args[0] = n;
-	return skb->len;
+	return err;
 }
 
 static struct sk_buff *ctrl_build_family_msg(const struct genl_family *family,
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next v2 1/3] netlink: handle EMSGSIZE errors in the core
  2024-03-03  5:24 ` [PATCH net-next v2 1/3] " Jakub Kicinski
@ 2024-03-03 15:01   ` Ido Schimmel
  0 siblings, 0 replies; 25+ messages in thread
From: Ido Schimmel @ 2024-03-03 15:01 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, jiri, johannes, fw, pablo,
	kuniyu

On Sat, Mar 02, 2024 at 09:24:06PM -0800, Jakub Kicinski wrote:
> Eric points out that our current suggested way of handling
> EMSGSIZE errors ((err == -EMSGSIZE) ? skb->len : err) will
> break if we didn't fit even a single object into the buffer
> provided by the user. This should not happen for well behaved
> applications, but we can fix that, and free netlink families
> from dealing with that completely by moving error handling
> into the core.
> 
> Let's assume from now on that all EMSGSIZE errors in dumps are
> because we run out of skb space. Families can now propagate
> the error nla_put_*() etc generated and not worry about any
> return value magic. If some family really wants to send EMSGSIZE
> to user space, assuming it generates the same error on the next
> dump iteration the skb->len should be 0, and user space should
> still see the EMSGSIZE.
> 
> This should simplify families and prevent mistakes in return
> values which lead to DONE being forced into a separate recv()
> call as discovered by Ido some time ago.
> 
> Reviewed-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Reviewed-by: Ido Schimmel <idosch@nvidia.com>

I read your comment here [1] and I believe this patch [2] is needed to
avoid another pass in case of an error code other than EMSGSIZE. I can
submit it after your series is accepted.

[1] https://lore.kernel.org/netdev/20240229073750.6e59155e@kernel.org/

[2]
diff --git a/net/ipv4/nexthop.c b/net/ipv4/nexthop.c
index 70509da4f080..b3a24b61f76b 100644
--- a/net/ipv4/nexthop.c
+++ b/net/ipv4/nexthop.c
@@ -3241,10 +3241,6 @@ static int rtm_dump_nexthop(struct sk_buff *skb, struct netlink_callback *cb)
 
        err = rtm_dump_walk_nexthops(skb, cb, root, ctx,
                                     &rtm_dump_nexthop_cb, &filter);
-       if (err < 0) {
-               if (likely(skb->len))
-                       err = skb->len;
-       }
 
        cb->seq = net->nexthop.seq;
        nl_dump_check_consistent(cb, nlmsg_hdr(skb));
@@ -3439,11 +3435,6 @@ static int rtm_dump_nexthop_bucket(struct sk_buff *skb,
                                             &rtm_dump_nexthop_bucket_cb, &dd);
        }
 
-       if (err < 0) {
-               if (likely(skb->len))
-                       err = skb->len;
-       }
-
        cb->seq = net->nexthop.seq;
        nl_dump_check_consistent(cb, nlmsg_hdr(skb));
        return err;

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next v2 2/3] netdev: let netlink core handle -EMSGSIZE errors
  2024-03-03  5:24 ` [PATCH net-next v2 2/3] netdev: let netlink core handle -EMSGSIZE errors Jakub Kicinski
@ 2024-03-03 15:08   ` Ido Schimmel
  0 siblings, 0 replies; 25+ messages in thread
From: Ido Schimmel @ 2024-03-03 15:08 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, jiri, johannes, fw, pablo,
	amritha.nambiar, sridhar.samudrala, hawk

On Sat, Mar 02, 2024 at 09:24:07PM -0800, Jakub Kicinski wrote:
> Previous change added -EMSGSIZE handling to af_netlink, we don't
> have to hide these errors any longer.
> 
> Theoretically the error handling changes from:
>  if (err == -EMSGSIZE)
> to
>  if (err == -EMSGSIZE && skb->len)
> 
> everywhere, but in practice it doesn't matter.
> All messages fit into NLMSG_GOODSIZE, so overflow of an empty
> skb cannot happen.
> 
> Reviewed-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Reviewed-by: Ido Schimmel <idosch@nvidia.com>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next v2 3/3] genetlink: fit NLMSG_DONE into same read() as families
  2024-03-03  5:24 ` [PATCH net-next v2 3/3] genetlink: fit NLMSG_DONE into same read() as families Jakub Kicinski
@ 2024-03-03 15:10   ` Ido Schimmel
  2024-03-15 11:48   ` Stefano Brivio
  1 sibling, 0 replies; 25+ messages in thread
From: Ido Schimmel @ 2024-03-03 15:10 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: davem, netdev, edumazet, pabeni, jiri, johannes, fw, pablo

On Sat, Mar 02, 2024 at 09:24:08PM -0800, Jakub Kicinski wrote:
> Make sure ctrl_fill_info() returns sensible error codes and
> propagate them out to netlink core. Let netlink core decide
> when to return skb->len and when to treat the exit as an
> error. Netlink core does better job at it, if we always
> return skb->len the core doesn't know when we're done
> dumping and NLMSG_DONE ends up in a separate read().
> 
> Reviewed-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Reviewed-by: Ido Schimmel <idosch@nvidia.com>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next v2 0/3] netlink: handle EMSGSIZE errors in the core
  2024-03-03  5:24 [PATCH net-next v2 0/3] netlink: handle EMSGSIZE errors in the core Jakub Kicinski
                   ` (2 preceding siblings ...)
  2024-03-03  5:24 ` [PATCH net-next v2 3/3] genetlink: fit NLMSG_DONE into same read() as families Jakub Kicinski
@ 2024-03-06  8:10 ` patchwork-bot+netdevbpf
  3 siblings, 0 replies; 25+ messages in thread
From: patchwork-bot+netdevbpf @ 2024-03-06  8:10 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, jiri, idosch, johannes, fw,
	pablo

Hello:

This series was applied to netdev/net-next.git (main)
by David S. Miller <davem@davemloft.net>:

On Sat,  2 Mar 2024 21:24:05 -0800 you wrote:
> Ido discovered some time back that we usually force NLMSG_DONE
> to be delivered in a separate recv() syscall, even if it would
> fit into the same skb as data messages. He made nexthop try
> to fit DONE with data in commit 8743aeff5bc4 ("nexthop: Fix
> infinite nexthop bucket dump when using maximum nexthop ID"),
> and nobody has complained so far.
> 
> [...]

Here is the summary with links:
  - [net-next,v2,1/3] netlink: handle EMSGSIZE errors in the core
    https://git.kernel.org/netdev/net-next/c/b5a899154aa9
  - [net-next,v2,2/3] netdev: let netlink core handle -EMSGSIZE errors
    https://git.kernel.org/netdev/net-next/c/0b11b1c5c320
  - [net-next,v2,3/3] genetlink: fit NLMSG_DONE into same read() as families
    https://git.kernel.org/netdev/net-next/c/87d381973e49

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next v2 3/3] genetlink: fit NLMSG_DONE into same read() as families
  2024-03-03  5:24 ` [PATCH net-next v2 3/3] genetlink: fit NLMSG_DONE into same read() as families Jakub Kicinski
  2024-03-03 15:10   ` Ido Schimmel
@ 2024-03-15 11:48   ` Stefano Brivio
  2024-03-19 15:55     ` Jakub Kicinski
  1 sibling, 1 reply; 25+ messages in thread
From: Stefano Brivio @ 2024-03-15 11:48 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, jiri, idosch, johannes, fw,
	pablo, Martin Pitt, Paul Holzinger, David Gibson

Hi,

On Sat,  2 Mar 2024 21:24:08 -0800
Jakub Kicinski <kuba@kernel.org> wrote:

> Make sure ctrl_fill_info() returns sensible error codes and
> propagate them out to netlink core. Let netlink core decide
> when to return skb->len and when to treat the exit as an
> error. Netlink core does better job at it, if we always
> return skb->len the core doesn't know when we're done
> dumping and NLMSG_DONE ends up in a separate read().

While this change is obviously correct, it breaks... well, broken
applications that _wrongly_ rely on the fact that NLMSG_DONE is
delivered in a separate datagram.

This was the (embarrassing) case for passt(1), which I just fixed:
  https://archives.passt.top/passt-dev/20240315112432.382212-1-sbrivio@redhat.com/

but the "separate" NLMSG_DONE is such an established behaviour,
I think, that this might raise a more general concern.

From my perspective, I'm just happy that this change revealed the
issue, but I wanted to report this anyway in case somebody has
similar possible breakages in mind.

> Reviewed-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> CC: jiri@resnulli.us
> ---
>  net/netlink/genetlink.c | 12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> diff --git a/net/netlink/genetlink.c b/net/netlink/genetlink.c
> index 50ec599a5cff..3b7666944b11 100644
> --- a/net/netlink/genetlink.c
> +++ b/net/netlink/genetlink.c
> @@ -1232,7 +1232,7 @@ static int ctrl_fill_info(const struct genl_family *family, u32 portid, u32 seq,
>  
>  	hdr = genlmsg_put(skb, portid, seq, &genl_ctrl, flags, cmd);
>  	if (hdr == NULL)
> -		return -1;
> +		return -EMSGSIZE;
>  
>  	if (nla_put_string(skb, CTRL_ATTR_FAMILY_NAME, family->name) ||
>  	    nla_put_u16(skb, CTRL_ATTR_FAMILY_ID, family->id) ||
> @@ -1355,6 +1355,7 @@ static int ctrl_dumpfamily(struct sk_buff *skb, struct netlink_callback *cb)
>  	struct net *net = sock_net(skb->sk);
>  	int fams_to_skip = cb->args[0];
>  	unsigned int id;
> +	int err = 0;
>  
>  	idr_for_each_entry(&genl_fam_idr, rt, id) {
>  		if (!rt->netnsok && !net_eq(net, &init_net))
> @@ -1363,16 +1364,17 @@ static int ctrl_dumpfamily(struct sk_buff *skb, struct netlink_callback *cb)
>  		if (n++ < fams_to_skip)
>  			continue;
>  
> -		if (ctrl_fill_info(rt, NETLINK_CB(cb->skb).portid,
> -				   cb->nlh->nlmsg_seq, NLM_F_MULTI,
> -				   skb, CTRL_CMD_NEWFAMILY) < 0) {
> +		err = ctrl_fill_info(rt, NETLINK_CB(cb->skb).portid,
> +				     cb->nlh->nlmsg_seq, NLM_F_MULTI,
> +				     skb, CTRL_CMD_NEWFAMILY);
> +		if (err) {
>  			n--;
>  			break;
>  		}
>  	}
>  
>  	cb->args[0] = n;
> -	return skb->len;
> +	return err;
>  }
>  
>  static struct sk_buff *ctrl_build_family_msg(const struct genl_family *family,

-- 
Stefano


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next v2 3/3] genetlink: fit NLMSG_DONE into same read() as families
  2024-03-15 11:48   ` Stefano Brivio
@ 2024-03-19 15:55     ` Jakub Kicinski
  2024-03-19 17:17       ` Eric Dumazet
  2024-03-19 23:36       ` David Gibson
  0 siblings, 2 replies; 25+ messages in thread
From: Jakub Kicinski @ 2024-03-19 15:55 UTC (permalink / raw)
  To: Stefano Brivio
  Cc: davem, netdev, edumazet, pabeni, jiri, idosch, johannes, fw,
	pablo, Martin Pitt, Paul Holzinger, David Gibson

On Fri, 15 Mar 2024 12:48:08 +0100 Stefano Brivio wrote:
> > Make sure ctrl_fill_info() returns sensible error codes and
> > propagate them out to netlink core. Let netlink core decide
> > when to return skb->len and when to treat the exit as an
> > error. Netlink core does better job at it, if we always
> > return skb->len the core doesn't know when we're done
> > dumping and NLMSG_DONE ends up in a separate read().  
> 
> While this change is obviously correct, it breaks... well, broken
> applications that _wrongly_ rely on the fact that NLMSG_DONE is
> delivered in a separate datagram.
> 
> This was the (embarrassing) case for passt(1), which I just fixed:
>   https://archives.passt.top/passt-dev/20240315112432.382212-1-sbrivio@redhat.com/
> 
> but the "separate" NLMSG_DONE is such an established behaviour,
> I think, that this might raise a more general concern.
> 
> From my perspective, I'm just happy that this change revealed the
> issue, but I wanted to report this anyway in case somebody has
> similar possible breakages in mind.

Hi Stefano! I was worried this may happen :( I think we should revert
offending commits, but I'd like to take it on case by case basis. 
I'd imagine majority of netlink is only exercised by iproute2 and
libmnl-based tools. Does passt hang specifically on genetlink family
dump? Your commit also mentions RTM_GETROUTE. This is not the only
commit which removed DONE:

$ git log --since='1 month ago' --grep=NLMSG_DONE --no-merges  --oneline 

9cc4cc329d30 ipv6: use xa_array iterator to implement inet6_dump_addr()
87d381973e49 genetlink: fit NLMSG_DONE into same read() as families
4ce5dc9316de inet: switch inet_dump_fib() to RCU protection
6647b338fc5c netlink: fix netlink_diag_dump() return value

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next v2 3/3] genetlink: fit NLMSG_DONE into same read() as families
  2024-03-19 15:55     ` Jakub Kicinski
@ 2024-03-19 17:17       ` Eric Dumazet
  2024-03-19 17:40         ` Jakub Kicinski
  2024-03-19 23:36       ` David Gibson
  1 sibling, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2024-03-19 17:17 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Stefano Brivio, davem, netdev, pabeni, jiri, idosch, johannes, fw,
	pablo, Martin Pitt, Paul Holzinger, David Gibson

On Tue, Mar 19, 2024 at 4:55 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Fri, 15 Mar 2024 12:48:08 +0100 Stefano Brivio wrote:
> > > Make sure ctrl_fill_info() returns sensible error codes and
> > > propagate them out to netlink core. Let netlink core decide
> > > when to return skb->len and when to treat the exit as an
> > > error. Netlink core does better job at it, if we always
> > > return skb->len the core doesn't know when we're done
> > > dumping and NLMSG_DONE ends up in a separate read().
> >
> > While this change is obviously correct, it breaks... well, broken
> > applications that _wrongly_ rely on the fact that NLMSG_DONE is
> > delivered in a separate datagram.
> >
> > This was the (embarrassing) case for passt(1), which I just fixed:
> >   https://archives.passt.top/passt-dev/20240315112432.382212-1-sbrivio@redhat.com/
> >
> > but the "separate" NLMSG_DONE is such an established behaviour,
> > I think, that this might raise a more general concern.
> >
> > From my perspective, I'm just happy that this change revealed the
> > issue, but I wanted to report this anyway in case somebody has
> > similar possible breakages in mind.
>
> Hi Stefano! I was worried this may happen :( I think we should revert
> offending commits, but I'd like to take it on case by case basis.
> I'd imagine majority of netlink is only exercised by iproute2 and
> libmnl-based tools. Does passt hang specifically on genetlink family
> dump? Your commit also mentions RTM_GETROUTE. This is not the only
> commit which removed DONE:
>
> $ git log --since='1 month ago' --grep=NLMSG_DONE --no-merges  --oneline
>
> 9cc4cc329d30 ipv6: use xa_array iterator to implement inet6_dump_addr()
> 87d381973e49 genetlink: fit NLMSG_DONE into same read() as families
> 4ce5dc9316de inet: switch inet_dump_fib() to RCU protection
> 6647b338fc5c netlink: fix netlink_diag_dump() return value

Lets not bring back more RTNL locking please for the handlers that
still require it.

The core can generate an NLMSG_DONE by itself, if we decide this needs
to be done.

I find this discussion a bit strange, because NLMSG_DONE being
piggybacked has been a long established behavior.

Jason patch was from 2017

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next v2 3/3] genetlink: fit NLMSG_DONE into same read() as families
  2024-03-19 17:17       ` Eric Dumazet
@ 2024-03-19 17:40         ` Jakub Kicinski
  2024-03-21 12:56           ` Gal Pressman
  2024-04-03 22:52           ` Ilya Maximets
  0 siblings, 2 replies; 25+ messages in thread
From: Jakub Kicinski @ 2024-03-19 17:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Stefano Brivio, davem, netdev, pabeni, jiri, idosch, johannes, fw,
	pablo, Martin Pitt, Paul Holzinger, David Gibson

On Tue, 19 Mar 2024 18:17:47 +0100 Eric Dumazet wrote:
> > Hi Stefano! I was worried this may happen :( I think we should revert
> > offending commits, but I'd like to take it on case by case basis.
> > I'd imagine majority of netlink is only exercised by iproute2 and
> > libmnl-based tools. Does passt hang specifically on genetlink family
> > dump? Your commit also mentions RTM_GETROUTE. This is not the only
> > commit which removed DONE:
> >
> > $ git log --since='1 month ago' --grep=NLMSG_DONE --no-merges  --oneline
> >
> > 9cc4cc329d30 ipv6: use xa_array iterator to implement inet6_dump_addr()
> > 87d381973e49 genetlink: fit NLMSG_DONE into same read() as families
> > 4ce5dc9316de inet: switch inet_dump_fib() to RCU protection
> > 6647b338fc5c netlink: fix netlink_diag_dump() return value  
> 
> Lets not bring back more RTNL locking please for the handlers that
> still require it.

Definitely. My git log copy/paste is pretty inaccurate, these two are
better examples:

5d9b7cb383bb nexthop: Simplify dump error handling
02e24903e5a4 netlink: let core handle error cases in dump operations

I was trying to point out that we merged a handful of DONE "coalescing"
patches, and if we need to revert - let's only do that for the exact
commands needed. The comment was raised on my genetlink patch while
the discussion in the link points to RTM_GETROUTE.

> The core can generate an NLMSG_DONE by itself, if we decide this needs
> to be done.

Exactly.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next v2 3/3] genetlink: fit NLMSG_DONE into same read() as families
  2024-03-19 15:55     ` Jakub Kicinski
  2024-03-19 17:17       ` Eric Dumazet
@ 2024-03-19 23:36       ` David Gibson
  1 sibling, 0 replies; 25+ messages in thread
From: David Gibson @ 2024-03-19 23:36 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Stefano Brivio, davem, netdev, edumazet, pabeni, jiri, idosch,
	johannes, fw, pablo, Martin Pitt, Paul Holzinger

[-- Attachment #1: Type: text/plain, Size: 2527 bytes --]

On Tue, Mar 19, 2024 at 08:55:45AM -0700, Jakub Kicinski wrote:
> On Fri, 15 Mar 2024 12:48:08 +0100 Stefano Brivio wrote:
> > > Make sure ctrl_fill_info() returns sensible error codes and
> > > propagate them out to netlink core. Let netlink core decide
> > > when to return skb->len and when to treat the exit as an
> > > error. Netlink core does better job at it, if we always
> > > return skb->len the core doesn't know when we're done
> > > dumping and NLMSG_DONE ends up in a separate read().  
> > 
> > While this change is obviously correct, it breaks... well, broken
> > applications that _wrongly_ rely on the fact that NLMSG_DONE is
> > delivered in a separate datagram.
> > 
> > This was the (embarrassing) case for passt(1), which I just fixed:
> >   https://archives.passt.top/passt-dev/20240315112432.382212-1-sbrivio@redhat.com/
> > 
> > but the "separate" NLMSG_DONE is such an established behaviour,
> > I think, that this might raise a more general concern.
> > 
> > From my perspective, I'm just happy that this change revealed the
> > issue, but I wanted to report this anyway in case somebody has
> > similar possible breakages in mind.
> 
> Hi Stefano! I was worried this may happen :( I think we should revert
> offending commits, but I'd like to take it on case by case basis. 
> I'd imagine majority of netlink is only exercised by iproute2 and
> libmnl-based tools. Does passt hang specifically on genetlink family
> dump? Your commit also mentions RTM_GETROUTE. This is not the only
> commit which removed DONE:

I don't think there's anything specirfic to RTM_GETROUTE here from the
kernel side.  We've looked at the problem in passt more closely now,
and it turns out we handled a merged NLMSG_DONE correctly in most
cases.  For various reasons internal to passt, our handling of
RTM_GETROUTE on one path is more complex, and we had a subtle error
there which broke the handling of a merged NLMSG_DONE.

> 
> $ git log --since='1 month ago' --grep=NLMSG_DONE --no-merges  --oneline 
> 
> 9cc4cc329d30 ipv6: use xa_array iterator to implement inet6_dump_addr()
> 87d381973e49 genetlink: fit NLMSG_DONE into same read() as families
> 4ce5dc9316de inet: switch inet_dump_fib() to RCU protection
> 6647b338fc5c netlink: fix netlink_diag_dump() return value
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next v2 3/3] genetlink: fit NLMSG_DONE into same read() as families
  2024-03-19 17:40         ` Jakub Kicinski
@ 2024-03-21 12:56           ` Gal Pressman
  2024-03-21 13:51             ` Ido Schimmel
  2024-04-03 22:52           ` Ilya Maximets
  1 sibling, 1 reply; 25+ messages in thread
From: Gal Pressman @ 2024-03-21 12:56 UTC (permalink / raw)
  To: Jakub Kicinski, Eric Dumazet
  Cc: Stefano Brivio, davem, netdev, pabeni, jiri, idosch, johannes, fw,
	pablo, Martin Pitt, Paul Holzinger, David Gibson

On 19/03/2024 19:40, Jakub Kicinski wrote:
> On Tue, 19 Mar 2024 18:17:47 +0100 Eric Dumazet wrote:
>>> Hi Stefano! I was worried this may happen :( I think we should revert
>>> offending commits, but I'd like to take it on case by case basis.
>>> I'd imagine majority of netlink is only exercised by iproute2 and
>>> libmnl-based tools. Does passt hang specifically on genetlink family
>>> dump? Your commit also mentions RTM_GETROUTE. This is not the only
>>> commit which removed DONE:
>>>
>>> $ git log --since='1 month ago' --grep=NLMSG_DONE --no-merges  --oneline
>>>
>>> 9cc4cc329d30 ipv6: use xa_array iterator to implement inet6_dump_addr()
>>> 87d381973e49 genetlink: fit NLMSG_DONE into same read() as families
>>> 4ce5dc9316de inet: switch inet_dump_fib() to RCU protection
>>> 6647b338fc5c netlink: fix netlink_diag_dump() return value  
>>
>> Lets not bring back more RTNL locking please for the handlers that
>> still require it.
> 
> Definitely. My git log copy/paste is pretty inaccurate, these two are
> better examples:
> 
> 5d9b7cb383bb nexthop: Simplify dump error handling
> 02e24903e5a4 netlink: let core handle error cases in dump operations
> 
> I was trying to point out that we merged a handful of DONE "coalescing"
> patches, and if we need to revert - let's only do that for the exact
> commands needed. The comment was raised on my genetlink patch while
> the discussion in the link points to RTM_GETROUTE.
> 
>> The core can generate an NLMSG_DONE by itself, if we decide this needs
>> to be done.
> 
> Exactly.
> 

We've encountered a new issue recently which I believe is related to
this discussion.

Following Eric's patch:
9cc4cc329d30 ("ipv6: use xa_array iterator to implement inet6_dump_addr()")

Setting the interface mtu to < 1280 results in 'ip addr show eth2'
returning an error, because the ipv6 dump fails. This is a degradation
from the user's perspective.

# ip addr show eth2
4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group
default qlen 1000
    link/ether 24:42:53:21:52:44 brd ff:ff:ff:ff:ff:ff
    altname enp6s0f0np0
# ip link set dev eth2 mtu 1000
# ip addr show eth2
RTNETLINK answers: No such device
Dump terminated

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next v2 3/3] genetlink: fit NLMSG_DONE into same read() as families
  2024-03-21 12:56           ` Gal Pressman
@ 2024-03-21 13:51             ` Ido Schimmel
  2024-03-21 15:03               ` Gal Pressman
  2024-03-21 17:26               ` Eric Dumazet
  0 siblings, 2 replies; 25+ messages in thread
From: Ido Schimmel @ 2024-03-21 13:51 UTC (permalink / raw)
  To: Gal Pressman
  Cc: Jakub Kicinski, Eric Dumazet, Stefano Brivio, davem, netdev,
	pabeni, jiri, johannes, fw, pablo, Martin Pitt, Paul Holzinger,
	David Gibson

On Thu, Mar 21, 2024 at 02:56:41PM +0200, Gal Pressman wrote:
> We've encountered a new issue recently which I believe is related to
> this discussion.
> 
> Following Eric's patch:
> 9cc4cc329d30 ("ipv6: use xa_array iterator to implement inet6_dump_addr()")
> 
> Setting the interface mtu to < 1280 results in 'ip addr show eth2'
> returning an error, because the ipv6 dump fails. This is a degradation
> from the user's perspective.
> 
> # ip addr show eth2
> 4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group
> default qlen 1000
>     link/ether 24:42:53:21:52:44 brd ff:ff:ff:ff:ff:ff
>     altname enp6s0f0np0
> # ip link set dev eth2 mtu 1000
> # ip addr show eth2
> RTNETLINK answers: No such device
> Dump terminated

I don't think it's the same issue. Original issue was about user space
not knowing how to handle NLMSG_DONE being sent together with dump
responses. The issue you reported seems to be related to an
unintentional change in the return code when IPv6 is disabled on an
interface. Can you please test the following patch?

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 247bd4d8ee45..92db9b474f2b 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -5416,10 +5416,11 @@ static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb,
 
                err = 0;
                if (fillargs.ifindex) {
-                       err = -ENODEV;
                        dev = dev_get_by_index_rcu(tgt_net, fillargs.ifindex);
-                       if (!dev)
+                       if (!dev) {
+                               err = -ENODEV;
                                goto done;
+                       }
                        idev = __in6_dev_get(dev);
                        if (idev)
                                err = in6_dump_addrs(idev, skb, cb,

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next v2 3/3] genetlink: fit NLMSG_DONE into same read() as families
  2024-03-21 13:51             ` Ido Schimmel
@ 2024-03-21 15:03               ` Gal Pressman
  2024-03-21 17:26               ` Eric Dumazet
  1 sibling, 0 replies; 25+ messages in thread
From: Gal Pressman @ 2024-03-21 15:03 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Jakub Kicinski, Eric Dumazet, Stefano Brivio, davem, netdev,
	pabeni, jiri, johannes, fw, pablo, Martin Pitt, Paul Holzinger,
	David Gibson

On 21/03/2024 15:51, Ido Schimmel wrote:
> On Thu, Mar 21, 2024 at 02:56:41PM +0200, Gal Pressman wrote:
>> We've encountered a new issue recently which I believe is related to
>> this discussion.
>>
>> Following Eric's patch:
>> 9cc4cc329d30 ("ipv6: use xa_array iterator to implement inet6_dump_addr()")
>>
>> Setting the interface mtu to < 1280 results in 'ip addr show eth2'
>> returning an error, because the ipv6 dump fails. This is a degradation
>> from the user's perspective.
>>
>> # ip addr show eth2
>> 4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group
>> default qlen 1000
>>     link/ether 24:42:53:21:52:44 brd ff:ff:ff:ff:ff:ff
>>     altname enp6s0f0np0
>> # ip link set dev eth2 mtu 1000
>> # ip addr show eth2
>> RTNETLINK answers: No such device
>> Dump terminated
> 
> I don't think it's the same issue. Original issue was about user space
> not knowing how to handle NLMSG_DONE being sent together with dump
> responses. The issue you reported seems to be related to an
> unintentional change in the return code when IPv6 is disabled on an
> interface. Can you please test the following patch?
> 
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index 247bd4d8ee45..92db9b474f2b 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -5416,10 +5416,11 @@ static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb,
>  
>                 err = 0;
>                 if (fillargs.ifindex) {
> -                       err = -ENODEV;
>                         dev = dev_get_by_index_rcu(tgt_net, fillargs.ifindex);
> -                       if (!dev)
> +                       if (!dev) {
> +                               err = -ENODEV;
>                                 goto done;
> +                       }
>                         idev = __in6_dev_get(dev);
>                         if (idev)
>                                 err = in6_dump_addrs(idev, skb, cb,

This seems to fix it, thanks!
Will you submit a patch?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next v2 3/3] genetlink: fit NLMSG_DONE into same read() as families
  2024-03-21 13:51             ` Ido Schimmel
  2024-03-21 15:03               ` Gal Pressman
@ 2024-03-21 17:26               ` Eric Dumazet
  2024-03-21 17:41                 ` Ido Schimmel
  1 sibling, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2024-03-21 17:26 UTC (permalink / raw)
  To: Ido Schimmel
  Cc: Gal Pressman, Jakub Kicinski, Stefano Brivio, davem, netdev,
	pabeni, jiri, johannes, fw, pablo, Martin Pitt, Paul Holzinger,
	David Gibson

On Thu, Mar 21, 2024 at 2:51 PM Ido Schimmel <idosch@idosch.org> wrote:
>
> On Thu, Mar 21, 2024 at 02:56:41PM +0200, Gal Pressman wrote:
> > We've encountered a new issue recently which I believe is related to
> > this discussion.
> >
> > Following Eric's patch:
> > 9cc4cc329d30 ("ipv6: use xa_array iterator to implement inet6_dump_addr()")
> >
> > Setting the interface mtu to < 1280 results in 'ip addr show eth2'
> > returning an error, because the ipv6 dump fails. This is a degradation
> > from the user's perspective.
> >
> > # ip addr show eth2
> > 4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group
> > default qlen 1000
> >     link/ether 24:42:53:21:52:44 brd ff:ff:ff:ff:ff:ff
> >     altname enp6s0f0np0
> > # ip link set dev eth2 mtu 1000
> > # ip addr show eth2
> > RTNETLINK answers: No such device
> > Dump terminated
>
> I don't think it's the same issue. Original issue was about user space
> not knowing how to handle NLMSG_DONE being sent together with dump
> responses. The issue you reported seems to be related to an
> unintentional change in the return code when IPv6 is disabled on an
> interface. Can you please test the following patch?
>
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index 247bd4d8ee45..92db9b474f2b 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -5416,10 +5416,11 @@ static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb,
>
>                 err = 0;
>                 if (fillargs.ifindex) {
> -                       err = -ENODEV;
>                         dev = dev_get_by_index_rcu(tgt_net, fillargs.ifindex);
> -                       if (!dev)
> +                       if (!dev) {
> +                               err = -ENODEV;
>                                 goto done;
> +                       }
>                         idev = __in6_dev_get(dev);
>                         if (idev)
>                                 err = in6_dump_addrs(idev, skb, cb,

Good catch, thanks !

Not sure why ip command is so picky about the error code here.

There is no IPv6 on this device after all.

The following seems to react quite differently :

# ip addr  show lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host proto kernel_lo
       valid_lft forever preferred_lft forever
# ip link set dev lo mtu 1000
# ip addr  show lo
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 1000 qdisc noqueue state UNKNOWN
group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever



Reviewed-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next v2 3/3] genetlink: fit NLMSG_DONE into same read() as families
  2024-03-21 17:26               ` Eric Dumazet
@ 2024-03-21 17:41                 ` Ido Schimmel
  0 siblings, 0 replies; 25+ messages in thread
From: Ido Schimmel @ 2024-03-21 17:41 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Gal Pressman, Jakub Kicinski, Stefano Brivio, davem, netdev,
	pabeni, jiri, johannes, fw, pablo, Martin Pitt, Paul Holzinger,
	David Gibson

On Thu, Mar 21, 2024 at 06:26:31PM +0100, Eric Dumazet wrote:
> The following seems to react quite differently :
> 
> # ip addr  show lo
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
> group default qlen 1000
>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>     inet 127.0.0.1/8 scope host lo
>        valid_lft forever preferred_lft forever
>     inet6 ::1/128 scope host proto kernel_lo
>        valid_lft forever preferred_lft forever
> # ip link set dev lo mtu 1000
> # ip addr  show lo
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 1000 qdisc noqueue state UNKNOWN
> group default qlen 1000
>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>     inet 127.0.0.1/8 scope host lo
>        valid_lft forever preferred_lft forever

Yes, for loopback the NETDEV_CHANGEMTU event is treated as NETDEV_DOWN
rather than NETDEV_UNREGISTER:

if (dev->mtu < IPV6_MIN_MTU) {
	addrconf_ifdown(dev, dev != net->loopback_dev);
	break;
}

vim net/ipv6/addrconf.c +3656

> Reviewed-by: Eric Dumazet <edumazet@google.com>

Sorry Eric, already sent the patch without your tag:

https://lore.kernel.org/netdev/20240321173042.2151756-1-idosch@nvidia.com/

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next v2 3/3] genetlink: fit NLMSG_DONE into same read() as families
  2024-03-19 17:40         ` Jakub Kicinski
  2024-03-21 12:56           ` Gal Pressman
@ 2024-04-03 22:52           ` Ilya Maximets
  2024-04-11 15:16             ` Jakub Kicinski
  1 sibling, 1 reply; 25+ messages in thread
From: Ilya Maximets @ 2024-04-03 22:52 UTC (permalink / raw)
  To: Jakub Kicinski, Eric Dumazet
  Cc: Stefano Brivio, davem, netdev, pabeni, jiri, idosch, johannes, fw,
	pablo, Martin Pitt, Paul Holzinger, David Gibson, i.maximets

On 3/19/24 18:40, Jakub Kicinski wrote:
> On Tue, 19 Mar 2024 18:17:47 +0100 Eric Dumazet wrote:
>>> Hi Stefano! I was worried this may happen :( I think we should revert
>>> offending commits, but I'd like to take it on case by case basis.
>>> I'd imagine majority of netlink is only exercised by iproute2 and
>>> libmnl-based tools. Does passt hang specifically on genetlink family
>>> dump? Your commit also mentions RTM_GETROUTE. This is not the only
>>> commit which removed DONE:
>>>
>>> $ git log --since='1 month ago' --grep=NLMSG_DONE --no-merges  --oneline
>>>
>>> 9cc4cc329d30 ipv6: use xa_array iterator to implement inet6_dump_addr()
>>> 87d381973e49 genetlink: fit NLMSG_DONE into same read() as families
>>> 4ce5dc9316de inet: switch inet_dump_fib() to RCU protection
>>> 6647b338fc5c netlink: fix netlink_diag_dump() return value  
>>
>> Lets not bring back more RTNL locking please for the handlers that
>> still require it.
> 
> Definitely. My git log copy/paste is pretty inaccurate, these two are
> better examples:
> 
> 5d9b7cb383bb nexthop: Simplify dump error handling
> 02e24903e5a4 netlink: let core handle error cases in dump operations
> 
> I was trying to point out that we merged a handful of DONE "coalescing"
> patches, and if we need to revert - let's only do that for the exact
> commands needed. The comment was raised on my genetlink patch while
> the discussion in the link points to RTM_GETROUTE.
> 
>> The core can generate an NLMSG_DONE by itself, if we decide this needs
>> to be done.
> 
> Exactly.

FWIW, it seems that Libreswan is suffering from the same issue on
RTM_GETROUTE dump.

On 6.9.0-rc1 I see:

/usr/sbin/ipsec auto --config ipsec.conf --ctlsocket pluto.ctl \
                     --start --asynchronous tun-in-1

recvfrom(7, 
[
  [{nlmsg_len=52, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, ...],
  ...
  [{nlmsg_len=52, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, ...],
  [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, ...]
], 40960, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, [12])

recvfrom(7, <-- Stuck here forever


On 6.8.0 the output is following:

recvfrom(7, 
[
  [{nlmsg_len=52, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, ...],
  ...
  [{nlmsg_len=52, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, ...]
], 40960, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, [12])

recvfrom(7,
  [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI,}, 0],
  40728, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, [12])
close(7)

So, it seems like it explicitly waits for NLMSG_DONE to be in a separate
message.

I reported the issue to Libreswan:
  https://github.com/libreswan/libreswan/issues/1675
but just wanted to let you know as well.

Found this since it breaks IPsec system tests in Open vSwitch.

Best regards, Ilya Maximets.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next v2 3/3] genetlink: fit NLMSG_DONE into same read() as families
  2024-04-03 22:52           ` Ilya Maximets
@ 2024-04-11 15:16             ` Jakub Kicinski
  2024-04-11 15:39               ` Ilya Maximets
  0 siblings, 1 reply; 25+ messages in thread
From: Jakub Kicinski @ 2024-04-11 15:16 UTC (permalink / raw)
  To: Ilya Maximets
  Cc: Eric Dumazet, Stefano Brivio, davem, netdev, pabeni, jiri, idosch,
	johannes, fw, pablo, Martin Pitt, Paul Holzinger, David Gibson

On Thu, 4 Apr 2024 00:52:15 +0200 Ilya Maximets wrote:
> /usr/sbin/ipsec auto --config ipsec.conf --ctlsocket pluto.ctl \
>                      --start --asynchronous tun-in-1
> 
> recvfrom(7, 
> [
>   [{nlmsg_len=52, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, ...],
>   ...
>   [{nlmsg_len=52, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, ...],
>   [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, ...]
> ], 40960, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, [12])
> 
> recvfrom(7, <-- Stuck here forever

I think we should probably fix this..
Would you mind sharing the sendmsg() call? Eyeballing rtnl_dump_all() -
it does seem to coalesce DONE..

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next v2 3/3] genetlink: fit NLMSG_DONE into same read() as families
  2024-04-11 15:16             ` Jakub Kicinski
@ 2024-04-11 15:39               ` Ilya Maximets
  2024-04-11 15:52                 ` Jakub Kicinski
  0 siblings, 1 reply; 25+ messages in thread
From: Ilya Maximets @ 2024-04-11 15:39 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: i.maximets, Eric Dumazet, Stefano Brivio, davem, netdev, pabeni,
	jiri, idosch, johannes, fw, pablo, Martin Pitt, Paul Holzinger,
	David Gibson

On 4/11/24 17:16, Jakub Kicinski wrote:
> On Thu, 4 Apr 2024 00:52:15 +0200 Ilya Maximets wrote:
>> /usr/sbin/ipsec auto --config ipsec.conf --ctlsocket pluto.ctl \
>>                      --start --asynchronous tun-in-1
>>
>> recvfrom(7, 
>> [
>>   [{nlmsg_len=52, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, ...],
>>   ...
>>   [{nlmsg_len=52, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, ...],
>>   [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, ...]
>> ], 40960, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, [12])
>>
>> recvfrom(7, <-- Stuck here forever
> 
> I think we should probably fix this..
> Would you mind sharing the sendmsg() call? Eyeballing rtnl_dump_all() -
> it does seem to coalesce DONE..

Sure, the whole sequence looks like this in strace:

socket(AF_NETLINK, SOCK_DGRAM|SOCK_CLOEXEC, NETLINK_ROUTE) = 7

sendto(7, [
  {
    nlmsg_len=36, nlmsg_type=0x1a /* NLMSG_??? */,
    nlmsg_flags=NLM_F_REQUEST|0x300, nlmsg_seq=0, nlmsg_pid=138040
  },
  "\x02\x20\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08\x00\x01\x00\x0a\x01\x01\x02"
], 36, 0, NULL, 0) = 36

recvfrom(7, [
 [
   {nlmsg_len=52, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=0, nlmsg_pid=138040},
   {rtm_family=AF_INET, rtm_dst_len=0, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN,
    rtm_protocol=RTPROT_BOOT, rtm_scope=RT_SCOPE_UNIVERSE, rtm_type=RTN_UNICAST, rtm_flags=0},
   [
     [{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN],
     [{nla_len=8, nla_type=RTA_GATEWAY}, inet_addr("10.1.1.2")],
     [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("p0")]
   ]
 ],
 [
   {nlmsg_len=60, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=0, nlmsg_pid=138040},
   {rtm_family=AF_INET, rtm_dst_len=24, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_MAIN,
    rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_LINK, rtm_type=RTN_UNICAST, rtm_flags=0},
   [
     [{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_MAIN],
     [{nla_len=8, nla_type=RTA_DST}, inet_addr("10.1.1.0")],
     [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("10.1.1.1")],
     [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("p0")]
   ]
 ],
 [
   {nlmsg_len=60, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=0, nlmsg_pid=138040},
   {rtm_family=AF_INET, rtm_dst_len=32, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_LOCAL,
    rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_HOST, rtm_type=RTN_LOCAL, rtm_flags=0},
   [
     [{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_LOCAL],
     [{nla_len=8, nla_type=RTA_DST}, inet_addr("10.1.1.1")],
     [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("10.1.1.1")],
     [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("p0")]
   ]
 ],
 [
   {nlmsg_len=60, nlmsg_type=RTM_NEWROUTE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=0, nlmsg_pid=138040},
   {rtm_family=AF_INET, rtm_dst_len=32, rtm_src_len=0, rtm_tos=0, rtm_table=RT_TABLE_LOCAL,
    rtm_protocol=RTPROT_KERNEL, rtm_scope=RT_SCOPE_LINK, rtm_type=RTN_BROADCAST, rtm_flags=0},
   [
     [{nla_len=8, nla_type=RTA_TABLE}, RT_TABLE_LOCAL],
     [{nla_len=8, nla_type=RTA_DST}, inet_addr("10.1.1.255")],
     [{nla_len=8, nla_type=RTA_PREFSRC}, inet_addr("10.1.1.1")],
     [{nla_len=8, nla_type=RTA_OIF}, if_nametoindex("p0")]
   ]
 ],
 [
   {nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=0, nlmsg_pid=138040}, 0
 ]
], 40960, 0, {sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, [12]) = 252

recvfrom(7,  <unfinished ...>


nlmsg_type=0x1a /* NLMSG_??? */  --> RTM_GETROUTE
nlmsg_flags=NLM_F_REQUEST|0x300  --> NLM_F_REQUEST|NLM_F_DUMP

Best regards, Ilya Maximets.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next v2 3/3] genetlink: fit NLMSG_DONE into same read() as families
  2024-04-11 15:39               ` Ilya Maximets
@ 2024-04-11 15:52                 ` Jakub Kicinski
  2024-04-11 16:38                   ` Ilya Maximets
  0 siblings, 1 reply; 25+ messages in thread
From: Jakub Kicinski @ 2024-04-11 15:52 UTC (permalink / raw)
  To: Ilya Maximets
  Cc: Eric Dumazet, Stefano Brivio, davem, netdev, pabeni, jiri, idosch,
	johannes, fw, pablo, Martin Pitt, Paul Holzinger, David Gibson

On Thu, 11 Apr 2024 17:39:20 +0200 Ilya Maximets wrote:
> nlmsg_type=0x1a /* NLMSG_??? */  --> RTM_GETROUTE
> nlmsg_flags=NLM_F_REQUEST|0x300  --> NLM_F_REQUEST|NLM_F_DUMP

Thanks!

Also:

"\x02\x20\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08\x00\x01\x00\x0a\x01\x01\x02"
 ^^^^

So it's dumping AF_INET. I'm guessing its also going to dump v6,
separately? To fix v4 I think something like this would do (untested):

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 48741352a88a..749baa74eee7 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -1050,6 +1050,11 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 			e++;
 		}
 	}
+
+	/* Don't let NLM_DONE coalesce into a message, even if it could.
+	 * Some user space expects NLM_DONE in a separate recv().
+	 */
+	err = skb->len;
 out:
 
 	cb->args[1] = e;

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next v2 3/3] genetlink: fit NLMSG_DONE into same read() as families
  2024-04-11 15:52                 ` Jakub Kicinski
@ 2024-04-11 16:38                   ` Ilya Maximets
  2024-04-11 18:03                     ` Jakub Kicinski
  0 siblings, 1 reply; 25+ messages in thread
From: Ilya Maximets @ 2024-04-11 16:38 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: i.maximets, Eric Dumazet, Stefano Brivio, davem, netdev, pabeni,
	jiri, idosch, johannes, fw, pablo, Martin Pitt, Paul Holzinger,
	David Gibson

On 4/11/24 17:52, Jakub Kicinski wrote:
> On Thu, 11 Apr 2024 17:39:20 +0200 Ilya Maximets wrote:
>> nlmsg_type=0x1a /* NLMSG_??? */  --> RTM_GETROUTE
>> nlmsg_flags=NLM_F_REQUEST|0x300  --> NLM_F_REQUEST|NLM_F_DUMP
> 
> Thanks!
> 
> Also:
> 
> "\x02\x20\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x08\x00\x01\x00\x0a\x01\x01\x02"
>  ^^^^
> 
> So it's dumping AF_INET. I'm guessing its also going to dump v6,
> separately?

AFAICT, it dumps the family according to the tunnel configuration.
If ipsec is set up for IPv4, then it dumps IPv4, if v6 - v6.

> To fix v4 I think something like this would do (untested):
> 
> diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
> index 48741352a88a..749baa74eee7 100644
> --- a/net/ipv4/fib_frontend.c
> +++ b/net/ipv4/fib_frontend.c
> @@ -1050,6 +1050,11 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
>  			e++;
>  		}
>  	}
> +
> +	/* Don't let NLM_DONE coalesce into a message, even if it could.
> +	 * Some user space expects NLM_DONE in a separate recv().
> +	 */
> +	err = skb->len;
>  out:
>  
>  	cb->args[1] = e;

I tried that and IPv4 tests with Libreswan are passing with this change
on latest net-next/main.

IPv6 tests are stuck in the same way as IPv4 did before.  The sendto
as captured by strace is following:

sendto(7, [
  {
    nlmsg_len=48, nlmsg_type=0x1a /* NLMSG_??? */,
    nlmsg_flags=NLM_F_REQUEST|0x300, nlmsg_seq=0, nlmsg_pid=17084
  },
  "\x0a\x80\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x14\x00\x01\x00\xfd\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x02"], 48, 0, NULL, 0) = 48

NLMSG_DONE is part of the first recvfrom and the process is stuck
waiting for something from the second one.

Best regards, Ilya Maximets.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next v2 3/3] genetlink: fit NLMSG_DONE into same read() as families
  2024-04-11 16:38                   ` Ilya Maximets
@ 2024-04-11 18:03                     ` Jakub Kicinski
  2024-04-11 18:04                       ` Jakub Kicinski
  0 siblings, 1 reply; 25+ messages in thread
From: Jakub Kicinski @ 2024-04-11 18:03 UTC (permalink / raw)
  To: Ilya Maximets
  Cc: Eric Dumazet, Stefano Brivio, davem, netdev, pabeni, jiri, idosch,
	johannes, fw, pablo, Martin Pitt, Paul Holzinger, David Gibson

On Thu, 11 Apr 2024 18:38:19 +0200 Ilya Maximets wrote:
> I tried that and IPv4 tests with Libreswan are passing with this change
> on latest net-next/main.
> 
> IPv6 tests are stuck in the same way as IPv4 did before.  The sendto
> as captured by strace is following:
> 
> sendto(7, [
>   {
>     nlmsg_len=48, nlmsg_type=0x1a /* NLMSG_??? */,
>     nlmsg_flags=NLM_F_REQUEST|0x300, nlmsg_seq=0, nlmsg_pid=17084
>   },
>   "\x0a\x80\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x14\x00\x01\x00\xfd\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x02"], 48, 0, NULL, 0) = 48
> 
> NLMSG_DONE is part of the first recvfrom and the process is stuck
> waiting for something from the second one.

Perfect, thank you! I just sent the v4 fix,
just to be clear you were testing on -next right?
Because AFAICT the v6 change hasn't made it to Linus.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH net-next v2 3/3] genetlink: fit NLMSG_DONE into same read() as families
  2024-04-11 18:03                     ` Jakub Kicinski
@ 2024-04-11 18:04                       ` Jakub Kicinski
  0 siblings, 0 replies; 25+ messages in thread
From: Jakub Kicinski @ 2024-04-11 18:04 UTC (permalink / raw)
  To: Ilya Maximets
  Cc: Eric Dumazet, Stefano Brivio, davem, netdev, pabeni, jiri, idosch,
	johannes, fw, pablo, Martin Pitt, Paul Holzinger, David Gibson

On Thu, 11 Apr 2024 11:03:13 -0700 Jakub Kicinski wrote:
> > NLMSG_DONE is part of the first recvfrom and the process is stuck
> > waiting for something from the second one.  
> 
> Perfect, thank you! I just sent the v4 fix,
> just to be clear you were testing on -next right?
> Because AFAICT the v6 change hasn't made it to Linus.

Oh, you said net-next/main earlier in your message, all makes sense.

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2024-04-11 18:04 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-03-03  5:24 [PATCH net-next v2 0/3] netlink: handle EMSGSIZE errors in the core Jakub Kicinski
2024-03-03  5:24 ` [PATCH net-next v2 1/3] " Jakub Kicinski
2024-03-03 15:01   ` Ido Schimmel
2024-03-03  5:24 ` [PATCH net-next v2 2/3] netdev: let netlink core handle -EMSGSIZE errors Jakub Kicinski
2024-03-03 15:08   ` Ido Schimmel
2024-03-03  5:24 ` [PATCH net-next v2 3/3] genetlink: fit NLMSG_DONE into same read() as families Jakub Kicinski
2024-03-03 15:10   ` Ido Schimmel
2024-03-15 11:48   ` Stefano Brivio
2024-03-19 15:55     ` Jakub Kicinski
2024-03-19 17:17       ` Eric Dumazet
2024-03-19 17:40         ` Jakub Kicinski
2024-03-21 12:56           ` Gal Pressman
2024-03-21 13:51             ` Ido Schimmel
2024-03-21 15:03               ` Gal Pressman
2024-03-21 17:26               ` Eric Dumazet
2024-03-21 17:41                 ` Ido Schimmel
2024-04-03 22:52           ` Ilya Maximets
2024-04-11 15:16             ` Jakub Kicinski
2024-04-11 15:39               ` Ilya Maximets
2024-04-11 15:52                 ` Jakub Kicinski
2024-04-11 16:38                   ` Ilya Maximets
2024-04-11 18:03                     ` Jakub Kicinski
2024-04-11 18:04                       ` Jakub Kicinski
2024-03-19 23:36       ` David Gibson
2024-03-06  8:10 ` [PATCH net-next v2 0/3] netlink: handle EMSGSIZE errors in the core patchwork-bot+netdevbpf

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).