* [PATCH net] net: consume xmit errors of GSO frames
From: Jakub Kicinski @ 2026-02-20 17:00 UTC (permalink / raw)
To: davem
Cc: netdev, edumazet, pabeni, andrew+netdev, horms, Jakub Kicinski,
sdf, hawk, alexander.h.duyck
udpgro_frglist.sh and udpgro_bench.sh are currently the flakiest tests
in NIPA. They fail in the exact same way: the TCP GRO test stalls
occasionally and the test gets killed after 10 minutes.
These tests use veth to simulate GRO. They attach a trivial
("return XDP_PASS;") XDP program to the veth to force TSO off
and NAPI on.
Digging into the failure mode we can see that the connection
is completely stuck after a burst of drops. The sender's snd_nxt
is at sequence number N [1], but the receiver claims to have
received (rcv_nxt) up to N + 3 * MSS [2]. The last piece of the puzzle
is that the sender's rtx queue is not empty (let's say the block in
the rtx queue is at sequence number N - 4 * MSS [3]).
In this state, the sender sends a retransmission from the rtx queue
with a single segment, covering sequence numbers N-4*MSS:N-3*MSS [3].
The receiver sees it and responds with an ACK all the way up to
N + 3 * MSS [2]. But the sender will reject this ACK as TCP_ACK_UNSENT_DATA
because it has no recollection of ever having sent data that far out [1].
And we are stuck.
The root cause is the mess of the xmit return codes. veth returns
an error when it can't xmit a frame. We end up with a loss event
like this:
-------------------------------------------------
|   GSO super frame 1   |   GSO super frame 2   |
|-----------------------------------------------|
| seg | seg | seg | seg | seg | seg | seg | seg |
|  1  |  2  |  3  |  4  |  5  |  6  |  7  |  8  |
-------------------------------------------------
   x    ok    ok   <ok> | ok    ok    ok   <x>
                      \
                       snd_nxt
"x" means a packet lost by veth, and "ok" means it went through.
Since veth has TSO disabled in this test it sees individual segments.
Segment 1 is on the retransmit queue and will be resent.
So why did the sender not advance snd_nxt even though it clearly did
send up to seg 8? tcp_write_xmit() interprets the return code
from the core to mean that the data has not been sent at all. Since
TCP deals with GSO super frames, not individual segments, the crux
of the problem is that loss of a single segment can be interpreted
as loss of all of them. TCP only sees the return code for the last
segment of the GSO frame (in <> brackets in the diagram above).
Of course for the problem to occur we need a setup or a device
without a Qdisc. Otherwise the Qdisc layer completely disconnects
the protocol layer from the device errors.
We have multiple ways to fix this.
1) make veth not return an error when it loses a packet.
While I believe this is what we did in the past, the issue keeps
reappearing and it's annoying to debug. The game of
whack-a-mole is not great.
2) fix the damn return codes
We only talk about NETDEV_TX_OK and NETDEV_TX_BUSY in the
documentation, so maybe we should make the return code from
ndo_start_xmit() a boolean. I like that the most, but perhaps
some ancient, not-really-networking protocol would suffer.
3) make TCP ignore the errors
It is not entirely clear to me what benefit TCP gets from
interpreting the result of ip_queue_xmit(). Specifically, once
the connection is established and we're pushing data, packet
loss is just packet loss?
4) this fix
Ignore the rc in the Qdisc-less+GSO case, since it's unreliable.
We already always return OK in the TCQ_F_CAN_BYPASS case.
In the Qdisc-less case let's be a bit more conservative and only
mask the GSO errors. This path is taken by non-IP "networks"
like CAN, MCTP, etc., so we could regress some ancient thing.
This is the simplest, but also maybe the hackiest fix?
Fixes: 1f59533f9ca5 ("qdisc: validate frames going through the direct_xmit path")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
CC: sdf@fomichev.me
CC: hawk@kernel.org
CC: alexander.h.duyck@intel.com
---
net/core/dev.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/net/core/dev.c b/net/core/dev.c
index 096b3ff13f6b..fd700a00c4d9 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4832,11 +4832,19 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
HARD_TX_LOCK(dev, txq, cpu);
if (!netif_xmit_stopped(txq)) {
+ bool is_list = skb->next != NULL;
+
dev_xmit_recursion_inc();
skb = dev_hard_start_xmit(skb, dev, txq, &rc);
dev_xmit_recursion_dec();
if (dev_xmit_complete(rc)) {
HARD_TX_UNLOCK(dev, txq);
+ /* GSO segments a single SKB into
+ * a list of frames. TCP expects error
+ * to mean none of the data was sent.
+ */
+ if (is_list && rc > 0)
+ rc = NETDEV_TX_OK;
goto out;
}
}
--
2.53.0
* Re: [PATCH net] net: consume xmit errors of GSO frames
From: Eric Dumazet @ 2026-02-20 17:17 UTC (permalink / raw)
To: Jakub Kicinski
Cc: davem, netdev, pabeni, andrew+netdev, horms, sdf, hawk,
alexander.h.duyck
On Fri, Feb 20, 2026 at 6:00 PM Jakub Kicinski <kuba@kernel.org> wrote:
This rings a bell.
I sent back in October something in the same vein.
https://www.spinics.net/lists/netdev/msg1131452.html
* Re: [PATCH net] net: consume xmit errors of GSO frames
From: Jakub Kicinski @ 2026-02-20 18:32 UTC (permalink / raw)
To: Eric Dumazet
Cc: davem, netdev, pabeni, andrew+netdev, horms, sdf, hawk,
alexander.h.duyck
On Fri, 20 Feb 2026 18:17:14 +0100 Eric Dumazet wrote:
> This rings a bell.
>
> I sent back in October something in the same vein.
>
> https://www.spinics.net/lists/netdev/msg1131452.html
Yes, I had a recollection of you explaining this problem in the past.
Otherwise I would have never thought of this path! :)
If we want to cover the BUSY case, maybe the patch below? Too messy?
diff --git a/net/core/dev.c b/net/core/dev.c
index 096b3ff13f6b..65b7c54e1bef 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4822,6 +4822,8 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
* to -1 or to their cpu id, but not to our id.
*/
if (READ_ONCE(txq->xmit_lock_owner) != cpu) {
+ bool is_list = false, completed = false;
+
if (dev_xmit_recursion())
goto recursion_alert;
@@ -4832,17 +4834,27 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
HARD_TX_LOCK(dev, txq, cpu);
if (!netif_xmit_stopped(txq)) {
+ /* GSO segments a single SKB into
+ * a list of frames. TCP expects error
+ * to mean none of the data was sent.
+ */
+ is_list = !!skb->next;
+
dev_xmit_recursion_inc();
skb = dev_hard_start_xmit(skb, dev, txq, &rc);
dev_xmit_recursion_dec();
- if (dev_xmit_complete(rc)) {
- HARD_TX_UNLOCK(dev, txq);
- goto out;
- }
+ completed = dev_xmit_complete(rc);
+ if (is_list)
+ rc = NETDEV_TX_OK;
}
HARD_TX_UNLOCK(dev, txq);
+ if (completed)
+ goto out;
+
net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
dev->name);
+ if (!is_list)
+ rc = -ENETDOWN;
} else {
/* Recursion is detected! It is possible,
* unfortunately
@@ -4850,10 +4862,10 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
recursion_alert:
net_crit_ratelimited("Dead loop on virtual device %s, fix it urgently!\n",
dev->name);
+ rc = -ENETDOWN;
}
}
- rc = -ENETDOWN;
rcu_read_unlock_bh();
dev_core_stats_tx_dropped_inc(dev);