* [PATCH net v2] net: consume xmit errors of GSO frames
@ 2026-02-23 23:51 Jakub Kicinski
From: Jakub Kicinski @ 2026-02-23 23:51 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, Jakub Kicinski,
sdf, hawk, alexander.h.duyck
udpgro_frglist.sh and udpgro_bench.sh are the flakiest tests
currently in NIPA. They fail in exactly the same way: the TCP GRO
test stalls occasionally and the test gets killed after 10min.
These tests use veth to simulate GRO. They attach a trivial
("return XDP_PASS;") XDP program to the veth to force TSO off
and NAPI on.
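For reference, the XDP program is the smallest possible one. A hedged sketch (the actual selftest source may differ; the function name is made up here, and XDP_PASS is the verdict value from enum xdp_action in <linux/bpf.h>):

```c
/* Minimal "do nothing" XDP program of the kind the tests attach.
 * Attaching any XDP program is enough to force TSO off and NAPI on
 * for veth; the verdict just lets every packet through unmodified. */
#define XDP_PASS 2		/* enum xdp_action value from <linux/bpf.h> */

struct xdp_md;			/* BPF context; fields unused here */

int xdp_pass_prog(struct xdp_md *ctx)
{
	(void)ctx;		/* packet is neither inspected nor modified */
	return XDP_PASS;
}
```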
Digging into the failure mode we can see that the connection
is completely stuck after a burst of drops. The sender's snd_nxt
is at sequence number N [1], but the receiver claims to have
received (rcv_nxt) up to N + 3 * MSS [2]. The last piece of the
puzzle is that the sender's rtx queue is not empty (let's say the
block in the rtx queue is at sequence number N - 4 * MSS [3]).
In this state, the sender sends a retransmission from the rtx queue
with a single segment, and sequence numbers N-4*MSS:N-3*MSS [3].
The receiver sees it and responds with an ACK all the way up to
N + 3 * MSS [2]. But the sender will reject this ack as
TCP_ACK_UNSENT_DATA because it has no recollection of ever sending
data that far out [1]. And we are stuck.
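The reject comes from TCP's basic sequence sanity check: an ACK beyond snd_nxt acknowledges data the sender believes it never sent. A simplified sketch (the comparison macro is modeled on the kernel's after() helper; the wrapper function name is made up, this is not the real tcp_ack()):

```c
#include <stdint.h>

/* Wrap-safe sequence comparison, modeled on the kernel's after(). */
#define after(seq2, seq1)	((int32_t)((seq1) - (seq2)) < 0)

/* An ACK beyond snd_nxt refers to data the sender does not believe
 * it ever sent, so it is discarded (what the kernel reports as
 * TCP_ACK_UNSENT_DATA). */
int ack_acks_unsent_data(uint32_t ack, uint32_t snd_nxt)
{
	return after(ack, snd_nxt);
}
```

With the state above, the ACK for N + 3 * MSS arrives while snd_nxt is still N, so the check fires.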
The root cause is the mess of the xmit return codes. veth returns
an error when it can't xmit a frame. We end up with a loss event
like this:
-------------------------------------------------
|  GSO super frame 1    |  GSO super frame 2    |
|-----------------------------------------------|
| seg | seg | seg | seg | seg | seg | seg | seg |
|  1  |  2  |  3  |  4  |  5  |  6  |  7  |  8  |
-------------------------------------------------
   x    ok    ok   <ok>|  ok    ok    ok   <x>
                       \
                      snd_nxt
"x" means packet lost by veth, and "ok" means it went thru.
Since veth has TSO disabled in this test it sees individual segments.
Segment 1 is on the retransmit queue and will be resent.
So why did the sender not advance snd_nxt even though it clearly
did send up to seg 8? tcp_write_xmit() interprets the return code
from the core to mean that data has not been sent at all. Since
TCP deals with GSO super frames, not individual segments, the crux
of the problem is that loss of a single segment can be interpreted
as loss of all. TCP only sees the return code for the last
segment of the GSO frame (in <> brackets in the diagram above).
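The "last return code wins" behavior can be illustrated with a toy model of the per-segment transmit loop (pure simulation with made-up names, not kernel code):

```c
#include <stddef.h>

/* Toy model of the driver xmit loop walking a GSO segment list:
 * each segment produces its own return code (0 = ok, negative =
 * lost), but the caller only ever sees the code of the last one. */
int send_gso_frame(const int *seg_rc, size_t nsegs)
{
	int rc = 0;

	for (size_t i = 0; i < nsegs; i++)
		rc = seg_rc[i];	/* earlier codes are overwritten */
	return rc;
}
```

With the loss pattern from the diagram, frame 1 (seg 1 lost) reports success, while frame 2 (only seg 8 lost) reports an error for the whole frame.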
Of course for the problem to occur we need a setup or a device
without a Qdisc. Otherwise the Qdisc layer completely disconnects
the protocol layer from the device errors.
We have multiple ways to fix this.
1) make veth not return an error when it loses a packet.
While this is what I think we did in the past, the issue keeps
reappearing and it's annoying to debug. The game of
whack-a-mole is not great.
2) fix the damn return codes
We only talk about NETDEV_TX_OK and NETDEV_TX_BUSY in the
documentation, so maybe we should make the return code from
ndo_start_xmit() a boolean. I like that the most, but perhaps
some ancient, not-really-networking protocol would suffer.
3) make TCP ignore the errors
It is not entirely clear to me what benefit TCP gets from
interpreting the result of ip_queue_xmit()? Specifically once
the connection is established and we're pushing data - packet
loss is just packet loss?
4) this fix
Ignore the rc in the Qdisc-less+GSO case, since it's unreliable.
We already always return OK in the TCQ_F_CAN_BYPASS case.
In the Qdisc-less case let's be a bit more conservative and only
mask the GSO errors. This path is taken by non-IP-"networks"
like CAN, MCTP etc, so we could regress some ancient thing.
This is the simplest, but also maybe the hackiest fix?
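For context on option 2, the return-code contract in question looks roughly like this; values and the completion check are modeled on include/linux/netdevice.h (sketch; the helper name here is made up to avoid claiming it is the exact kernel source):

```c
/* The two documented ndo_start_xmit() return codes, plus the mask
 * used to classify completion. Qdisc NET_XMIT_* codes (0x00-0x02)
 * and negative errnos also count as "driver consumed the skb". */
#define NETDEV_TX_OK	0x00
#define NETDEV_TX_BUSY	0x10	/* driver did NOT consume the skb */
#define NET_XMIT_MASK	0x0f

/* Modeled on dev_xmit_complete(): only NETDEV_TX_BUSY means the
 * caller still owns the skb and must requeue it; success, errors,
 * and qdisc codes all mean the skb is gone. */
static int xmit_complete_sketch(int rc)
{
	return rc < NET_XMIT_MASK;
}
```

Note how an error (negative rc) and success are both "complete" here, which is exactly why a single boolean would cover the documented cases.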
A similar fix was proposed by Eric in the past but never committed
because the original reporter was working with an out-of-tree driver
and wasn't providing feedback (see Link).
Link: https://lore.kernel.org/CANn89iJcLepEin7EtBETrZ36bjoD9LrR=k4cfwWh046GB+4f9A@mail.gmail.com
Fixes: 1f59533f9ca5 ("qdisc: validate frames going through the direct_xmit path")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
v2:
- rewrite to also cover the BUSY case
v1: https://lore.kernel.org/20260220170032.2964535-1-kuba@kernel.org
CC: sdf@fomichev.me
CC: hawk@kernel.org
CC: alexander.h.duyck@intel.com
---
net/core/dev.c | 23 ++++++++++++++++++-----
1 file changed, 18 insertions(+), 5 deletions(-)
diff --git a/net/core/dev.c b/net/core/dev.c
index 096b3ff13f6b..52e5dd6c2e20 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4822,6 +4822,8 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 		 * to -1 or to their cpu id, but not to our id.
 		 */
 		if (READ_ONCE(txq->xmit_lock_owner) != cpu) {
+			bool is_list = false;
+
 			if (dev_xmit_recursion())
 				goto recursion_alert;
 
@@ -4832,17 +4834,28 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 			HARD_TX_LOCK(dev, txq, cpu);
 
 			if (!netif_xmit_stopped(txq)) {
+				is_list = !!skb->next;
+
 				dev_xmit_recursion_inc();
 				skb = dev_hard_start_xmit(skb, dev, txq, &rc);
 				dev_xmit_recursion_dec();
-				if (dev_xmit_complete(rc)) {
-					HARD_TX_UNLOCK(dev, txq);
-					goto out;
-				}
+
+				/* GSO segments a single SKB into
+				 * a list of frames. TCP expects error
+				 * to mean none of the data was sent.
+				 */
+				if (is_list)
+					rc = NETDEV_TX_OK;
 			}
 			HARD_TX_UNLOCK(dev, txq);
+			if (!skb) /* xmit completed */
+				goto out;
+
 			net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
 					     dev->name);
+			/* NETDEV_TX_BUSY or queue was stopped */
+			if (!is_list)
+				rc = -ENETDOWN;
 		} else {
 			/* Recursion is detected! It is possible,
 			 * unfortunately
@@ -4850,10 +4863,10 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
 recursion_alert:
 			net_crit_ratelimited("Dead loop on virtual device %s, fix it urgently!\n",
 					     dev->name);
+			rc = -ENETDOWN;
 		}
 	}
 
-	rc = -ENETDOWN;
 	rcu_read_unlock_bh();
 
 	dev_core_stats_tx_dropped_inc(dev);
--
2.53.0
* Re: [PATCH net v2] net: consume xmit errors of GSO frames
From: Eric Dumazet @ 2026-02-24 9:00 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, pabeni, andrew+netdev, horms, sdf, hawk,
alexander.h.duyck
On Tue, Feb 24, 2026 at 12:51 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> udpgro_frglist.sh and udpgro_bench.sh are the flakiest tests
> currently in NIPA. They fail in exactly the same way: the TCP GRO
> test stalls occasionally and the test gets killed after 10min.
>
> These tests use veth to simulate GRO. They attach a trivial
> ("return XDP_PASS;") XDP program to the veth to force TSO off
> and NAPI on.
>
> Digging into the failure mode we can see that the connection
> is completely stuck after a burst of drops. The sender's snd_nxt
> is at sequence number N [1], but the receiver claims to have
> received (rcv_nxt) up to N + 3 * MSS [2]. The last piece of the
> puzzle is that the sender's rtx queue is not empty (let's say the
> block in the rtx queue is at sequence number N - 4 * MSS [3]).
>
> In this state, the sender sends a retransmission from the rtx queue
> with a single segment, and sequence numbers N-4*MSS:N-3*MSS [3].
> The receiver sees it and responds with an ACK all the way up to
> N + 3 * MSS [2]. But the sender will reject this ack as
> TCP_ACK_UNSENT_DATA because it has no recollection of ever sending
> data that far out [1]. And we are stuck.
>
> The root cause is the mess of the xmit return codes. veth returns
> an error when it can't xmit a frame. We end up with a loss event
> like this:
>
> -------------------------------------------------
> |  GSO super frame 1    |  GSO super frame 2    |
> |-----------------------------------------------|
> | seg | seg | seg | seg | seg | seg | seg | seg |
> |  1  |  2  |  3  |  4  |  5  |  6  |  7  |  8  |
> -------------------------------------------------
>    x    ok    ok   <ok>|  ok    ok    ok   <x>
>                        \
>                       snd_nxt
>
> "x" means packet lost by veth, and "ok" means it went thru.
> Since veth has TSO disabled in this test it sees individual segments.
> Segment 1 is on the retransmit queue and will be resent.
>
> So why did the sender not advance snd_nxt even though it clearly
> did send up to seg 8? tcp_write_xmit() interprets the return code
> from the core to mean that data has not been sent at all. Since
> TCP deals with GSO super frames, not individual segments, the crux
> of the problem is that loss of a single segment can be interpreted
> as loss of all. TCP only sees the return code for the last
> segment of the GSO frame (in <> brackets in the diagram above).
>
> Of course for the problem to occur we need a setup or a device
> without a Qdisc. Otherwise Qdisc layer disconnects the protocol
> layer from the device errors completely.
>
> We have multiple ways to fix this.
>
> 1) make veth not return an error when it loses a packet.
> While this is what I think we did in the past, the issue keeps
> reappearing and it's annoying to debug. The game of
> whack-a-mole is not great.
>
> 2) fix the damn return codes
> We only talk about NETDEV_TX_OK and NETDEV_TX_BUSY in the
> documentation, so maybe we should make the return code from
> ndo_start_xmit() a boolean. I like that the most, but perhaps
> some ancient, not-really-networking protocol would suffer.
>
> 3) make TCP ignore the errors
> It is not entirely clear to me what benefit TCP gets from
> interpreting the result of ip_queue_xmit()? Specifically once
> the connection is established and we're pushing data - packet
> loss is just packet loss?
TCP updates some state immediately, instead of relying on RTO.
Local congestion (especially with pfifo_fast) has been a thing; I
agree that with FQ or FQ_CODEL we do not care much.
>
> 4) this fix
> Ignore the rc in the Qdisc-less+GSO case, since it's unreliable.
> We already always return OK in the TCQ_F_CAN_BYPASS case.
> In the Qdisc-less case let's be a bit more conservative and only
> mask the GSO errors. This path is taken by non-IP-"networks"
> like CAN, MCTP etc, so we could regress some ancient thing.
> This is the simplest, but also maybe the hackiest fix?
>
> A similar fix was proposed by Eric in the past but never committed
> because the original reporter was working with an out-of-tree driver
> and wasn't providing feedback (see Link).
>
> Link: https://lore.kernel.org/CANn89iJcLepEin7EtBETrZ36bjoD9LrR=k4cfwWh046GB+4f9A@mail.gmail.com
> Fixes: 1f59533f9ca5 ("qdisc: validate frames going through the direct_xmit path")
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
> v2:
> - rewrite to also cover the BUSY case
> v1: https://lore.kernel.org/20260220170032.2964535-1-kuba@kernel.org
>
> CC: sdf@fomichev.me
> CC: hawk@kernel.org
> CC: alexander.h.duyck@intel.com
> ---
> net/core/dev.c | 23 ++++++++++++++++++-----
> 1 file changed, 18 insertions(+), 5 deletions(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 096b3ff13f6b..52e5dd6c2e20 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4822,6 +4822,8 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> * to -1 or to their cpu id, but not to our id.
> */
> if (READ_ONCE(txq->xmit_lock_owner) != cpu) {
> + bool is_list = false;
> +
> if (dev_xmit_recursion())
> goto recursion_alert;
>
> @@ -4832,17 +4834,28 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> HARD_TX_LOCK(dev, txq, cpu);
>
> if (!netif_xmit_stopped(txq)) {
> + is_list = !!skb->next;
> +
> dev_xmit_recursion_inc();
> skb = dev_hard_start_xmit(skb, dev, txq, &rc);
> dev_xmit_recursion_dec();
> - if (dev_xmit_complete(rc)) {
> - HARD_TX_UNLOCK(dev, txq);
> - goto out;
> - }
> +
> + /* GSO segments a single SKB into
> + * a list of frames. TCP expects error
> + * to mean none of the data was sent.
> + */
> + if (is_list)
Note this will return NETDEV_TX_OK even if no segment was sent.
This is probably fine.
Reviewed-by: Eric Dumazet <edumazet@google.com>
* Re: [PATCH net v2] net: consume xmit errors of GSO frames
From: patchwork-bot+netdevbpf @ 2026-02-26 10:40 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms, sdf, hawk,
alexander.h.duyck
Hello:
This patch was applied to netdev/net.git (main)
by Paolo Abeni <pabeni@redhat.com>:
On Mon, 23 Feb 2026 15:51:00 -0800 you wrote:
> udpgro_frglist.sh and udpgro_bench.sh are the flakiest tests
> currently in NIPA. They fail in exactly the same way: the TCP GRO
> test stalls occasionally and the test gets killed after 10min.
>
> These tests use veth to simulate GRO. They attach a trivial
> ("return XDP_PASS;") XDP program to the veth to force TSO off
> and NAPI on.
>
> [...]
Here is the summary with links:
- [net,v2] net: consume xmit errors of GSO frames
https://git.kernel.org/netdev/net/c/7aa767d0d3d0
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
* Re: [PATCH net v2] net: consume xmit errors of GSO frames
From: Ben Hutchings @ 2026-03-25 12:01 UTC (permalink / raw)
To: Jakub Kicinski, davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, sdf, hawk,
alexander.h.duyck
On Mon, 2026-02-23 at 15:51 -0800, Jakub Kicinski wrote:
[...]
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4822,6 +4822,8 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> * to -1 or to their cpu id, but not to our id.
> */
> if (READ_ONCE(txq->xmit_lock_owner) != cpu) {
> + bool is_list = false;
> +
> if (dev_xmit_recursion())
> goto recursion_alert;
>
> @@ -4832,17 +4834,28 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> HARD_TX_LOCK(dev, txq, cpu);
>
> if (!netif_xmit_stopped(txq)) {
> + is_list = !!skb->next;
> +
> dev_xmit_recursion_inc();
> skb = dev_hard_start_xmit(skb, dev, txq, &rc);
> dev_xmit_recursion_dec();
> - if (dev_xmit_complete(rc)) {
> - HARD_TX_UNLOCK(dev, txq);
> - goto out;
> - }
> +
> + /* GSO segments a single SKB into
> + * a list of frames. TCP expects error
> + * to mean none of the data was sent.
> + */
> + if (is_list)
> + rc = NETDEV_TX_OK;
> }
> HARD_TX_UNLOCK(dev, txq);
> + if (!skb) /* xmit completed */
> + goto out;
> +
> net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
> dev->name);
> + /* NETDEV_TX_BUSY or queue was stopped */
> + if (!is_list)
> + rc = -ENETDOWN;
> } else {
> /* Recursion is detected! It is possible,
> * unfortunately
> @@ -4850,10 +4863,10 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
> recursion_alert:
> net_crit_ratelimited("Dead loop on virtual device %s, fix it urgently!\n",
> dev->name);
> + rc = -ENETDOWN;
> }
> }
>
> - rc = -ENETDOWN;
After removing this assignment, rc is not set properly in case the
device is down or TX is stopped.
Perhaps this assignment should be restored before the check for
(dev->flags & IFF_UP)?
Ben.
> rcu_read_unlock_bh();
>
> dev_core_stats_tx_dropped_inc(dev);
--
Ben Hutchings
Theory and practice are closer in theory than in practice - John Levine