netdev.vger.kernel.org archive mirror
* Re: TCP sender stuck despite receiving ACKs from the peer
       [not found] <CA+suKw5OhWLJe_7uth4q=qxVpsD4qpwGRENORwA=beNLpiDuwg@mail.gmail.com>
@ 2025-10-04  1:24 ` Neal Cardwell
  2025-10-07 21:32   ` Christoph Schwarz
  2025-10-23 22:52   ` Christoph Schwarz
  0 siblings, 2 replies; 8+ messages in thread
From: Neal Cardwell @ 2025-10-04  1:24 UTC (permalink / raw)
  To: Christoph Schwarz; +Cc: edumazet, netdev

On Fri, Oct 3, 2025 at 8:29 PM Christoph Schwarz <cschwarz@arista.com> wrote:
>
> Hi,
>
> tldr; we believe there might be an issue with the TCP stack of the Linux kernel 5.10 that causes TCP connections to get stuck when they encounter a certain pattern of retransmissions and delayed and/or lost ACKs. We gathered extensive evidence supporting this theory, but we need help confirming it and further narrowing down the problem. Please read on if you find this interesting.
>
> Background: We have an application where multiple clients concurrently download large files (~900 MB) from an HTTP server. Both server and clients run on Linux with kernel 5.10.165.
>
> We observed that occasionally one or more of those downloads get stuck, i.e. download a portion of the file and then stop making any progress. In this state, ss shows a large (2 MB-ish) Send-Q on the server side, while Recv-Q on the client is zero, i.e. there is data to send, but it is just not making it across.
>
> We ran tcpdump on the server on one of the stuck connections and noticed that the server is retransmitting the same packet over and over again. The client ACKs each retransmission immediately, but the server doesn't seem to care. This goes on until either an application timeout hits, or (with application timeouts disabled) the kernel eventually closes the connection after ~15 minutes, which we believe is due to having exhausted the maximum number of retransmissions (tcp_retries2).
>
> We can reproduce this problem with selective ACK enabled or disabled, ruling out any direct connection to it.
>
> Example:
>
> 11:20:04.676418 02:1c:a7:00:00:01 > 02:1c:a7:00:00:04, ethertype IPv4 (0x0800), length 1514: 127.2.0.1.3102 > 127.2.0.4.46598: Flags [.], seq 1380896424:1380897872, ack 2678783744, win 500, options [nop,nop,TS val 2175898514 ecr 3444405317], length 1448
> 11:20:04.676525 02:1c:a7:00:00:04 > 02:1c:a7:00:00:01, ethertype IPv4 (0x0800), length 78: 127.2.0.4.46598 > 127.2.0.1.3102: Flags [.], ack 1381019504, win 24567, options [nop,nop,TS val 3444986302 ecr 2175317524,nop,nop,sack 1 {1380896424:1380897872}], length 0
> ...
> (this pattern continues, with incremental backoff, until either application level timeout hits, or maximum number of retransmissions is exceeded)
>
> The packet that the sender keeps sending is apparently a retransmission, with the client ACK'ing a sequence number further ahead.
>
> The next thing we tried was whether we could bring such a connection out of the problem state by manually constructing and injecting ACKs, and indeed this is possible. As long as we keep ACKing the right edge of the retransmitted packet(s), the server will send more packets that are further ahead in the stream. If we ACK larger seqnos, such as the one the client TCP stack is using, the server doesn't react. But if we continue to ACK the right edges of retransmitted packets, then eventually the connection recovers and the download resumes and finishes successfully.
>
> At this point it is evident that the server is ignoring ACKs above a certain seqno. We just don't know what this seqno is.
>
> With some more hacks, we extracted snd_nxt from a socket in the problem state:
>    socklen_t sz = sizeof(tqi->write_seq);
>    if (getsockopt(fd, SOL_TCP, TCP_QUEUE_SEQ, &tqi->write_seq, &sz))
>       return false;
>
>    // SIOCOUTQNSD: tp->write_seq - tp->snd_nxt
>    int write_seq__snd_nxt;
>    if (ioctl(fd, SIOCOUTQNSD, &write_seq__snd_nxt) == -1)
>       return false;
>    tqi->snd_nxt = tqi->write_seq - write_seq__snd_nxt;
>
> Then we cross-referenced the snd_nxt acquired this way with the seqno that the client is ACKing and, surprise, the seqno is LARGER than snd_nxt.
>
> We now have a suspicion why the sender is ignoring the ACKs. The following is very old code in tcp_ack that ignores all ACKs for data the server hasn't sent yet:
> 	/* If the ack includes data we haven't sent yet, discard
> 	 * this segment (RFC793 Section 3.9).
> 	 */
> 	if (after(ack, tp->snd_nxt))
> 		return -1;
>
> To verify this theory, we added additional trace instructions to tcp_rcv_established and tcp_ack, then reproduced the issue once more while taking a packet capture on the server. This experiment confirmed the theory.
>
>   <...>-10864   [002] .... 56338.066092: tcp_rcv_established: tcp_rcv_established(2874212, 3102->33240) ack_seq=1678664094 after snd_nxt=1678609070
>   <...>-10864   [002] .... 56338.066093: tcp_ack: tcp_ack(2874212, 3102->33240, 16640): ack=1678664094, ack_seq=308986386, prior_snd_una=1678606174, snd_nxt=1678609070, high_seq=1678606174
>   <...>-10864   [002] .... 56338.066093: tcp_ack: tcp_ack(2874212), exit2=-1
>
> The traces show that in this instance, the client is ACK'ing 1678664094 which is greater than snd_nxt 1678609070. tcp_ack then returns at the place indicated above without processing the ACK.
>
> From the packet capture of this instance, we reconstructed the timeline of events happening before the connection entered the problem state. This was with SACK disabled.
>
> 1. the HTTP download starts, and all seems fine, with the server sending TCP segments of 1448 bytes in each packet and the client ACKing them.
> 2. at some point, the server decides to retransmit certain packets. When it does, it retransmits 45 consecutive packets, starting at a certain sequence number. The first thing to note is that this is not the oldest unacknowledged sequence number. There are in fact 88 older, unacknowledged packets before the first retransmitted one. This retransmission happens 0.000078 seconds after the initial transmission (according to timestamps in the packet capture)
> 3. the server retransmits the same 45 packets for a second time, 0.000061 seconds after the first retransmission.
> 4. ACKs arrive that cover receipt of all data up to, but not including, those 45 packets. For the purpose of the following events, let those packets be numbered 1 through 45
> 5. the server retransmits packet 1 for the third time
> 6. multiple ACKs arrive covering packets 2 through 41
> 7. the server retransmits packet 2
> 8. two ACKs arrive for packet 41
> 9. the server retransmits packet 1
> 10. an ACK arrives for packet 41
> 11. steps 9. and 10. repeat with incremental backoff. The connection is stuck at this point
>
> From the kernel traces, we can tell the sender's state as follows:
> snd_nxt = packet 3
> high_seq and prior_snd_una = packet 1
>
> At this point, the sender believes it sent only packets 1 and 2, but the peer received more packets, up to packet 41. Packets 42 through 45 seem to have been lost.
>
> This is where we need help:
> 1. why did the retransmission of the 45 packets start so shortly after the initial transmission?
> 2. why were there two retransmissions?
> 3. why did retransmission not start at the oldest unacknowledged packet, given that SACK was disabled?
> 4. is it possible, given the sequence of events, that snd_nxt and high_seq were reset in step 5 or 6, and what would be the reason for it?
> 5. does this look like a bug in the TCP stack?
> 6. any advice on how we can further narrow this down?
>
> thank you,
> Chris

Thanks for the report!

A few thoughts:

(1) For the trace you described in detail, would it be possible to
place the binary .pcap file on a server somewhere and share the URL
for the file? This will be vastly easier to diagnose if we can see the
whole trace, and use visualization tools, etc. The best traces are
those that capture the SYN and SYN/ACK, so we can see the option
negotiation. (If the trace is large, keep in mind that usually
analysis only requires the headers; tcpdump with "-s 120" is usually
sufficient.)
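
As a concrete illustration only (the interface name and port are
placeholders for your setup), something like this captures header-only
traces:

  tcpdump -i eth0 -s 120 -w /tmp/stuck-flow.pcap 'tcp port 3102'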

(2) After that, would it be possible to try this test with a newer
kernel? You mentioned this is with kernel version 5.10.165, but that's
more than 2.5 years old at this point, and it's possible the bug has
been fixed since then.  Could you please try this test with the newest
kernel that is available in your distribution? (If you are forced to
use 5.10.x on your distribution, note that even with 5.10.x there is
v5.10.245, which was released yesterday.)

(3) If this bug is still reproducible with a recent kernel, would it
be possible to gather .pcap traces from both client and server,
including SYN and SYN/ACK? Sometimes it can be helpful to see the
perspective of both ends, especially if there are middleboxes
manipulating the packets in some way.

Thanks!

Best regards,
neal

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: TCP sender stuck despite receiving ACKs from the peer
  2025-10-04  1:24 ` TCP sender stuck despite receiving ACKs from the peer Neal Cardwell
@ 2025-10-07 21:32   ` Christoph Schwarz
  2025-10-23 22:52   ` Christoph Schwarz
  1 sibling, 0 replies; 8+ messages in thread
From: Christoph Schwarz @ 2025-10-07 21:32 UTC (permalink / raw)
  To: Neal Cardwell; +Cc: edumazet, netdev

On 10/3/25 18:24, Neal Cardwell wrote:
> On Fri, Oct 3, 2025 at 8:29 PM Christoph Schwarz <cschwarz@arista.com> wrote:
>>
>> Hi,
>>
>> tldr; we believe there might be an issue with the TCP stack of the Linux kernel 5.10 that causes TCP connections to get stuck when they encounter a certain pattern of retransmissions and delayed and/or lost ACKs. We gathered extensive evidence supporting this theory, but we need help confirming it and further narrowing down the problem. Please read on if you find this interesting.
>>
>> Background: We have an application where multiple clients concurrently download large files (~900 MB) from an HTTP server. Both server and clients run on a Linux with kernel 5.10.165
>>
>> We observed that occasionally one or more of those downloads get stuck, i.e. download a portion of the file and then stop making any progress. In this state, ss shows a large (2 MB-ish) Send-Q on the server side, while Recv-Q on the client is zero, i.e. there is data to send, but it is just not making it across.
>>
>> We ran tcpdump on the server on one of the stuck connections and noticed that the server is retransmitting the same packet over and over again. The client ACK's each retransmission immediately, but the server doesn't seem to care. This goes on until either an application timeout hits, or (with application timeouts disabled) the kernel eventually closes the connection after ~15 minutes, which we believe is due to having exhausted the maximum number of retransmissions (tcp_retries2).
>>
>> We can reproduce this problem with selective ACK enabled or disabled, ruling out any direct connection to it.
>>
>> Example:
>>
>> 11:20:04.676418 02:1c:a7:00:00:01 > 02:1c:a7:00:00:04, ethertype IPv4 (0x0800), length 1514: 127.2.0.1.3102 > 127.2.0.4.46598: Flags [.], seq 1380896424:1380897872, ack 2678783744, win 500, options [nop,nop,TS val 2175898514 ecr 3444405317], length 1448
>> 11:20:04.676525 02:1c:a7:00:00:04 > 02:1c:a7:00:00:01, ethertype IPv4 (0x0800), length 78: 127.2.0.4.46598 > 127.2.0.1.3102: Flags [.], ack 1381019504, win 24567, options [nop,nop,TS val 3444986302 ecr 2175317524,nop,nop,sack 1 {1380896424:1380897872}], length 0
>> ...
>> (this pattern continues, with incremental backoff, until either application level timeout hits, or maximum number of retransmissions is exceeded)
>>
>> The packet that the sender keeps sending is apparently a retransmission, with the client ACK'ing a sequence number further ahead.
>>
>> The next thing we tried is if we can bring such a connection out of the problem state by manually constructing and injecting ACKs, and indeed this is possible. As long as we keep ACKing the right edge of the retransmitted packet(s), the server will send more packets that are further ahead in the stream. If we ACK larger seqnos, such as the one the the client TCP stack is using, the server doesn't react. But if we continue to ACKs the right edges of retransmitted packets, then eventually the connection recovers and the download resumes and finishes successfully.
>>
>> At this point it is evident that the server is ignoring ACKs above a certain seqno. We just don't know what this seqno is.
>>
>> With some more hacks, we extracted snd_nxt from a socket in the problem state:
>>     sz = sizeof(tqi->write_seq);
>>     if (getsockopt(fd, SOL_TCP, TCP_QUEUE_SEQ, &tqi->write_seq, &sz))
>>        return false;
>>
>>     // SIOCOUTQNSD: tp->write_seq - tp->snd_nxt
>>     int write_seq__snd_nxt;
>>     if (ioctl(fd, SIOCOUTQNSD, &write_seq__snd_nxt) == -1)
>>        return false;
>>     tqi->snd_nxt = tqi->write_seq - write_seq__snd_nxt;
>>
>> Then we cross-referenced the so acquired snd_nxt with the seqno that the client is ACK'ing and surprise, the seqno is LARGER than snd_nxt.
>>
>> We now have a suspicion why the sender is ignoring the ACKs. The following is very old code in tcp_ack that ignores all ACKs for data the the server hasn't sent yet:
>> /* If the ack includes data we haven't sent yet, discard
>> * this segment (RFC793 Section 3.9).
>> */
>> if (after(ack, tp->snd_nxt))
>> return -1;
>>
>> To verify this theory, we added additional trace instructions to tcp_rcv_established and tcp_ack, then reproduced the issue once more while taking a packet capture on the server. This experiment confirmed the theory.
>>
>>    <...>-10864   [002] .... 56338.066092: tcp_rcv_established: tcp_rcv_established(2874212, 3102->33240) ack_seq=1678664094 after snd_nxt=1678609070
>>    <...>-10864   [002] .... 56338.066093: tcp_ack: tcp_ack(2874212, 3102->33240, 16640): ack=1678664094, ack_seq=308986386, prior_snd_una=1678606174, snd_nxt=1678609070, high_seq=1678606174
>>    <...>-10864   [002] .... 56338.066093: tcp_ack: tcp_ack(2874212), exit2=-1
>>
>> The traces show that in this instance, the client is ACK'ing 1678664094 which is greater than snd_nxt 1678609070. tcp_ack then returns at the place indicated above without processing the ACK.
>>
>>  From the packet capture of this instance, we reconstructed the timeline of events happening before the connections entered the problem state. This was with SACK disabled.
>>
>> 1. the HTTP download starts, and all seems fine, with the server sending TCP segments of 1448 bytes in each packet and the client ACKing them.
>> 2. at some point, the server decides to retransmit certain packets. When it does, it retransmits 45 consecutive packets, starting at a certain sequence number. The first thing to note is that this is not the oldest unacknowledged sequence number. There are in fact 88 older, unacknowledged packets before the first retransmitted one. This retransmission happens 0.000078 seconds after the initial transmission (according to timestamps in the packet capture)
>> 3. the server retransmits the same 45 packets for a second time, 0.000061 seconds after the first retransmission.
>> 4. ACKs arrive that cover receipt of all data up to, but not including, those 45 packets. For the purpose of the following events, let those packets be numbered 1 through 45
>> 5. the server retransmits packet 1 for the third time
>> 6. multiple ACKs arrive covering packets 2 through 41
>> 7. the server retransmits packet 2
>> 8. two ACKs arrive for packet 41
>> 9. the server retransmits packet 1
>> 10. an ACK arrives for packet 41
>> 11. steps 9. and 10. repeat with incremental backoff. The connection is stuck at this point
>>
>>  From the kernel traces, we can tell the sender's state as follows:
>> snd_nxt = packet 3
>> high_seq and prior_snd_una = packet 1
>>
>> At this point, the sender believes it sent only packets 1 and 2, but the peer received more packets, up to packet 41. Packets 42 through 45 seem to have been lost.
>>
>> This is where we need help:
>> 1. why did the retransmission of the 45 packets start so shortly after the initial transmission?
>> 2. why were there two retransmissions?
>> 3. why did retransmission not start at the oldest unacknowledged packet, given that SACK was disabled?
>> 4. is this possible given the sequence of events, that snd_nxt and high_seq were reset in step 5. or 6. and what would be the reason for it?
>> 5. does this look like a bug in the TCP stack?
>> 6. any advice how we can further narrow this down?
>>
>> thank you,
>> Chris
> 
> Thanks for the report!
> 
> A few thoughts:
> 
> (1) For the trace you described in detail, would it be possible to
> place the binary .pcap file on a server somewhere and share the URL
> for the file? This will be vastly easier to diagnose if we can see the
> whole trace, and use visualization tools, etc. The best traces are
> those that capture the SYN and SYN/ACK, so we can see the option
> negotiation. (If the trace is large, keep in mind that usually
> analysis only requires the headers; tcpdump with "-s 120" is usually
> sufficient.)
> 
> (2) After that, would it be possible to try this test with a newer
> kernel? You mentioned this is with kernel version 5.10.165, but that's
> more than 2.5 years old at this point, and it's possible the bug has
> been fixed since then.  Could you please try this test with the newest
> kernel that is available in your distribution? (If you are forced to
> use 5.10.x on your distribution, note that even with 5.10.x there is
> v5.10.245, which was released yesterday.)
> 
> (3) If this bug is still reproducible with a recent kernel, would it
> be possible to gather .pcap traces from both client and server,
> including SYN and SYN/ACK? Sometimes it can be helpful to see the
> perspective of both ends, especially if there are middleboxes
> manipulating the packets in some way.
> 
> Thanks!
> 
> Best regards,
> neal

Hi Neal,

Thank you for your feedback. The pcap file for (1) is available at:
https://drive.google.com/drive/folders/147C8Fzt9hsStolASh1tcdJArEYq-wUkY?usp=sharing

Handshake: tcp-stuck-w10-no-sack-server-handshake-5.10.165.pcap
Full TCP session: tcp-stuck-w10-no-sack-server-5.10.165.pcap

I stripped TCP payloads for privacy reasons (snaplen 66), but it still 
shows nicely in Wireshark.

We will be getting to (2) and (3) hopefully by the end of this week. We 
won't have 5.10.245 available, but we can try with 6.12.40.

thanks,
Chris


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: TCP sender stuck despite receiving ACKs from the peer
  2025-10-04  1:24 ` TCP sender stuck despite receiving ACKs from the peer Neal Cardwell
  2025-10-07 21:32   ` Christoph Schwarz
@ 2025-10-23 22:52   ` Christoph Schwarz
  2025-10-24  5:29     ` Eric Dumazet
  1 sibling, 1 reply; 8+ messages in thread
From: Christoph Schwarz @ 2025-10-23 22:52 UTC (permalink / raw)
  To: Neal Cardwell; +Cc: edumazet, netdev

On 10/3/25 18:24, Neal Cardwell wrote:
[...]
> Thanks for the report!
> 
> A few thoughts:
> 
[...]
> 
> (2) After that, would it be possible to try this test with a newer
> kernel? You mentioned this is with kernel version 5.10.165, but that's
> more than 2.5 years old at this point, and it's possible the bug has
> been fixed since then.  Could you please try this test with the newest
> kernel that is available in your distribution? (If you are forced to
> use 5.10.x on your distribution, note that even with 5.10.x there is
> v5.10.245, which was released yesterday.)
> 
> (3) If this bug is still reproducible with a recent kernel, would it
> be possible to gather .pcap traces from both client and server,
> including SYN and SYN/ACK? Sometimes it can be helpful to see the
> perspective of both ends, especially if there are middleboxes
> manipulating the packets in some way.
> 
> Thanks!
> 
> Best regards,
> neal

Hi,

I want to give an update as we made some progress.

We tried with the 6.12.40 kernel, but it was much harder to reproduce 
and we were not able to do a successful packet capture and reproduction 
at the same time. So we went back to 5.10.165, added more tracing and 
eventually figured out how the TCP connection got into the bad state.

This is a backtrace from the TCP stack calling down to the device driver:
  => fdev_tx    // ndo_start_xmit hook of a proprietary device driver
  => dev_hard_start_xmit
  => sch_direct_xmit
  => __qdisc_run
  => __dev_queue_xmit
  => vlan_dev_hard_start_xmit
  => dev_hard_start_xmit
  => __dev_queue_xmit
  => ip_finish_output2
  => __ip_queue_xmit
  => __tcp_transmit_skb
  => tcp_write_xmit

tcp_write_xmit sends segments of 65160 bytes. Due to an MSS of 1448, 
they get broken down into 45 packets of 1448 bytes each. These 45 
packets eventually reach dev_hard_start_xmit, which is a simple loop 
forwarding packets one by one. When the problem occurs, we see that 
dev_hard_start_xmit transmits the initial N packets successfully, but 
the remaining 45-N ones fail with error code 1. The loop runs to 
completion and does not break.
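
For reference, the loop in question looks roughly like this (paraphrased
and trimmed from net/core/dev.c in 5.10, not verbatim):

struct sk_buff *dev_hard_start_xmit(struct sk_buff *first, struct net_device *dev,
				    struct netdev_queue *txq, int *ret)
{
	struct sk_buff *skb = first;
	int rc = NETDEV_TX_OK;

	while (skb) {
		struct sk_buff *next = skb->next;

		skb_mark_not_on_list(skb);
		rc = xmit_one(skb, dev, txq, next != NULL);
		/* NET_XMIT_DROP (1) from a stacked device still counts as
		 * "complete" here, so the loop keeps going. */
		if (unlikely(!dev_xmit_complete(rc))) {
			skb->next = next;
			goto out;
		}
		skb = next;
	}
out:
	*ret = rc;	/* only the rc of the last segment survives */
	return skb;
}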

The error code 1 from dev_hard_start_xmit gets returned through the call 
stack up to tcp_write_xmit, which treats this as an error and breaks its 
own loop without advancing snd_nxt:

		if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
			break; // <<< breaks here

repair:
		/* Advance the send_head.  This one is sent out.
		 * This call will increment packets_out.
		 */
		tcp_event_new_data_sent(sk, skb);

 From packet captures we can prove that the 45 packets show up on the 
kernel device on the sender. In addition, the first N of those 45 
packets show up on the kernel device on the peer. The connection is now 
in the problem state where the peer is N packets ahead of the sender and 
the sender thinks that it never sent those packets, leading to the problem as 
described in my initial mail.

Furthermore, we noticed that the 45-N missing packets show up as drops 
on the sender's kernel device:

vlan0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
         inet 127.2.0.1  netmask 255.255.255.0  broadcast 0.0.0.0
         [...]
         TX errors 0  dropped 36 overruns 0  carrier 0  collisions 0

This device is a vlan device stacked on another device like this:

49: vlan0@parent: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc 
noqueue state UP mode DEFAULT group default qlen 1000
     link/ether 02:1c:a7:00:00:01 brd ff:ff:ff:ff:ff:ff
3: parent: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 10000 qdisc prio state 
UNKNOWN mode DEFAULT group default qlen 1000
     link/ether xx:xx:xx:xx:xx:xx brd ff:ff:ff:ff:ff:ff

Eventually packets need to go through the device driver, which has only 
a limited number of TX buffers. The driver implements flow control: when 
it is about to exhaust its buffers, it stops TX by calling 
netif_stop_queue. Once more buffers become available again, it resumes 
TX by calling netif_wake_queue. From packet counters we can tell that 
this is happening frequently.
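
In pseudo-form, the pattern is the usual one (illustrative only;
fdev_priv, free_tx_slots and post_to_hw are made-up names, not the
actual driver code):

static netdev_tx_t fdev_tx(struct sk_buff *skb, struct net_device *dev)
{
	struct fdev_priv *priv = netdev_priv(dev);

	if (!free_tx_slots(priv)) {
		/* no room left: stop the queue and ask the stack to requeue */
		netif_stop_queue(dev);
		return NETDEV_TX_BUSY;
	}
	post_to_hw(priv, skb);
	if (!free_tx_slots(priv))
		netif_stop_queue(dev);
	return NETDEV_TX_OK;
}

and, from the TX completion path:

	if (netif_queue_stopped(dev) && free_tx_slots(priv))
		netif_wake_queue(dev);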

At this point we suspected "qdisc noqueue" to be a factor, and indeed, 
after adding a queue to vlan0 the problem no longer happened, although 
there are still TX drops on the vlan0 device.
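
By "adding a queue" we mean attaching a real root qdisc to vlan0 in
place of noqueue, for example:

  tc qdisc replace dev vlan0 root pfifo_fast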

Missing queue or not, we think there is a disconnect between the device 
driver API and the TCP stack. The device driver API only allows 
transmitting packets one by one (ndo_start_xmit). The TCP stack operates 
on larger segments that it breaks down into smaller pieces 
(tcp_write_xmit / __tcp_transmit_skb). This can lead to a classic "short 
write" condition which the network stack doesn't seem to handle well in 
all cases.

Appreciate your comments,
Chris


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: TCP sender stuck despite receiving ACKs from the peer
  2025-10-23 22:52   ` Christoph Schwarz
@ 2025-10-24  5:29     ` Eric Dumazet
  2025-10-24  5:57       ` Eric Dumazet
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2025-10-24  5:29 UTC (permalink / raw)
  To: Christoph Schwarz; +Cc: Neal Cardwell, netdev

On Thu, Oct 23, 2025 at 3:52 PM Christoph Schwarz <cschwarz@arista.com> wrote:
>
> On 10/3/25 18:24, Neal Cardwell wrote:
> [...]
> > Thanks for the report!
> >
> > A few thoughts:
> >
> [...]
> >
> > (2) After that, would it be possible to try this test with a newer
> > kernel? You mentioned this is with kernel version 5.10.165, but that's
> > more than 2.5 years old at this point, and it's possible the bug has
> > been fixed since then.  Could you please try this test with the newest
> > kernel that is available in your distribution? (If you are forced to
> > use 5.10.x on your distribution, note that even with 5.10.x there is
> > v5.10.245, which was released yesterday.)
> >
> > (3) If this bug is still reproducible with a recent kernel, would it
> > be possible to gather .pcap traces from both client and server,
> > including SYN and SYN/ACK? Sometimes it can be helpful to see the
> > perspective of both ends, especially if there are middleboxes
> > manipulating the packets in some way.
> >
> > Thanks!
> >
> > Best regards,
> > neal
>
> Hi,
>
> I want to give an update as we made some progress.
>
> We tried with the 6.12.40 kernel, but it was much harder to reproduce
> and we were not able to do a successful packet capture and reproduction
> at the same time. So we went back to 5.10.165, added more tracing and
> eventually figured out how the TCP connection got into the bad state.
>
> This is a backtrace from the TCP stack calling down to the device driver:
>   => fdev_tx    // ndo_start_xmit hook of a proprietary device driver
>   => dev_hard_start_xmit
>   => sch_direct_xmit
>   => __qdisc_run
>   => __dev_queue_xmit
>   => vlan_dev_hard_start_xmit
>   => dev_hard_start_xmit
>   => __dev_queue_xmit
>   => ip_finish_output2
>   => __ip_queue_xmit
>   => __tcp_transmit_skb
>   => tcp_write_xmit
>
> tcp_write_xmit sends segments of 65160 bytes. Due to an MSS of 1448,
> they get broken down into 45 packets of 1448 bytes each.

So the driver does not support TSO ? Quite odd in 2025...

One thing you want is to make sure your vlan device (the one without a
Qdisc on it) advertises TSO support.

ethtool -k vlan0
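
If that shows "tcp-segmentation-offload: off", it may also be worth
trying to switch it on explicitly (just a suggestion; re-check with
ethtool -k whether the vlan code lets it stick):

ethtool -K vlan0 tso on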


> These 45
> packets eventually reach dev_hard_start_xmit, which is a simple loop
> forwarding packets one by one. When the problem occurs, we see that
> dev_hard_start_xmit transmits the initial N packets successfully, but
> the remaining 45-N ones fail with error code 1. The loop runs to
> completion and does not break.
>
> The error code 1 from dev_hard_start_xmit gets returned through the call
> stack up to tcp_write_xmit, which treats this as error and breaks its
> own loop without advancing snd_nxt:
>
>                 if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
>                         break; // <<< breaks here
>
> repair:
>                 /* Advance the send_head.  This one is sent out.
>                  * This call will increment packets_out.
>                  */
>                 tcp_event_new_data_sent(sk, skb);
>
>  From packet captures we can prove that the 45 packets show up on the
> kernel device on the sender. In addition, the first N of those 45
> packets show up on the kernel device on the peer. The connection is now
> in the problem state where the peer is N packets ahead of the sender and
> the sender thinks that it never those packets, leading to the problem as
> described in my initial mail.
>
> Furthermore, we noticed that the N-45 missing packets show up as drops
> on the sender's kernel device:
>
> vlan0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
>          inet 127.2.0.1  netmask 255.255.255.0  broadcast 0.0.0.0
>          [...]
>          TX errors 0  dropped 36 overruns 0  carrier 0  collisions 0
>
> This device is a vlan device stacked on another device like this:
>
> 49: vlan0@parent: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
> noqueue state UP mode DEFAULT group default qlen 1000
>      link/ether 02:1c:a7:00:00:01 brd ff:ff:ff:ff:ff:ff
> 3: parent: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 10000 qdisc prio state
> UNKNOWN mode DEFAULT group default qlen 1000
>      link/ether xx:xx:xx:xx:xx:xx brd ff:ff:ff:ff:ff:ff
>
> Eventually packets need to go through the device driver, which has only
> a limited number of TX buffers. The driver implements flow control: when
> it is about to exhaust its buffers, it stops TX by calling
> netif_stop_queue. Once more buffers become available again, it resumes
> TX by calling netif_wake_queue. From packet counters we can tell that
> this is happening frequently.
>
> At this point we suspected "qdisc noqueue" to be a factor, and indeed,
> after adding a queue to vlan0 the problem no longer happened, although
> there are still TX drops on the vlan0 device.
>
> Missing queue or not, we think there is a disconnect between the device
> driver API and the TCP stack. The device driver API only allows
> transmitting packets one by one (ndo_start_xmit). The TCP stack operates
> on larger segments that is breaks down into smaller pieces
> (tcp_write_xmit / __tcp_transmit_skb). This can lead to a classic "short
> write" condition which the network stack doesn't seem to handle well in
> all cases.
>
> Appreciate you comments,

Very nice analysis, very much appreciated.

I think the issue here is that __tcp_transmit_skb() trusts the return
value of icsk->icsk_af_ops->queue_xmit().

An error means: the packet was _not_ sent at all.

Here, it seems that the GSO layer returns an error, even if some
segments were sent.
This needs to be confirmed and fixed, but in the meantime, make sure
vlan0 has TSO support.
It will also be more efficient to segment (if your ethernet device has
no TSO capability) at the last moment, because all the segments will be
sent in the described scenario thanks to qdisc requeues.
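
For reference, the return path I am referring to looks roughly like this
in 5.10 (net/ipv4/tcp_output.c, paraphrased):

	err = icsk->icsk_af_ops->queue_xmit(sk, skb, &inet->cork.fl);

	if (unlikely(err > 0)) {
		tcp_enter_cwr(sk);
		/* NET_XMIT_CN is mapped to 0, NET_XMIT_DROP (1) is kept */
		err = net_xmit_eval(err);
	}
	...
	return err;	/* tcp_write_xmit() stops on any non-zero value */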

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: TCP sender stuck despite receiving ACKs from the peer
  2025-10-24  5:29     ` Eric Dumazet
@ 2025-10-24  5:57       ` Eric Dumazet
  2025-10-31  9:06         ` Eric Dumazet
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2025-10-24  5:57 UTC (permalink / raw)
  To: Christoph Schwarz; +Cc: Neal Cardwell, netdev

On Thu, Oct 23, 2025 at 10:29 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Oct 23, 2025 at 3:52 PM Christoph Schwarz <cschwarz@arista.com> wrote:
> >
> > On 10/3/25 18:24, Neal Cardwell wrote:
> > [...]
> > > Thanks for the report!
> > >
> > > A few thoughts:
> > >
> > [...]
> > >
> > > (2) After that, would it be possible to try this test with a newer
> > > kernel? You mentioned this is with kernel version 5.10.165, but that's
> > > more than 2.5 years old at this point, and it's possible the bug has
> > > been fixed since then.  Could you please try this test with the newest
> > > kernel that is available in your distribution? (If you are forced to
> > > use 5.10.x on your distribution, note that even with 5.10.x there is
> > > v5.10.245, which was released yesterday.)
> > >
> > > (3) If this bug is still reproducible with a recent kernel, would it
> > > be possible to gather .pcap traces from both client and server,
> > > including SYN and SYN/ACK? Sometimes it can be helpful to see the
> > > perspective of both ends, especially if there are middleboxes
> > > manipulating the packets in some way.
> > >
> > > Thanks!
> > >
> > > Best regards,
> > > neal
> >
> > Hi,
> >
> > I want to give an update as we made some progress.
> >
> > We tried with the 6.12.40 kernel, but it was much harder to reproduce
> > and we were not able to do a successful packet capture and reproduction
> > at the same time. So we went back to 5.10.165, added more tracing and
> > eventually figured out how the TCP connection got into the bad state.
> >
> > This is a backtrace from the TCP stack calling down to the device driver:
> >   => fdev_tx    // ndo_start_xmit hook of a proprietary device driver
> >   => dev_hard_start_xmit
> >   => sch_direct_xmit
> >   => __qdisc_run
> >   => __dev_queue_xmit
> >   => vlan_dev_hard_start_xmit
> >   => dev_hard_start_xmit
> >   => __dev_queue_xmit
> >   => ip_finish_output2
> >   => __ip_queue_xmit
> >   => __tcp_transmit_skb
> >   => tcp_write_xmit
> >
> > tcp_write_xmit sends segments of 65160 bytes. Due to an MSS of 1448,
> > they get broken down into 45 packets of 1448 bytes each.
>
> So the driver does not support TSO ? Quite odd in 2025...
>
> One thing you want is to make sure your vlan device (the one without a
> Qdisc on it)
> advertizes tso support.
>
> ethtool -k vlan0
>
>
> > These 45
> > packets eventually reach dev_hard_start_xmit, which is a simple loop
> > forwarding packets one by one. When the problem occurs, we see that
> > dev_hard_start_xmit transmits the initial N packets successfully, but
> > the remaining 45-N ones fail with error code 1. The loop runs to
> > completion and does not break.
> >
> > The error code 1 from dev_hard_start_xmit gets returned through the call
> > stack up to tcp_write_xmit, which treats this as error and breaks its
> > own loop without advancing snd_nxt:
> >
> >                 if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
> >                         break; // <<< breaks here
> >
> > repair:
> >                 /* Advance the send_head.  This one is sent out.
> >                  * This call will increment packets_out.
> >                  */
> >                 tcp_event_new_data_sent(sk, skb);
> >
> >  From packet captures we can prove that the 45 packets show up on the
> > kernel device on the sender. In addition, the first N of those 45
> > packets show up on the kernel device on the peer. The connection is now
> > in the problem state where the peer is N packets ahead of the sender and
> > the sender thinks that it never those packets, leading to the problem as
> > described in my initial mail.
> >
> > Furthermore, we noticed that the N-45 missing packets show up as drops
> > on the sender's kernel device:
> >
> > vlan0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
> >          inet 127.2.0.1  netmask 255.255.255.0  broadcast 0.0.0.0
> >          [...]
> >          TX errors 0  dropped 36 overruns 0  carrier 0  collisions 0
> >
> > This device is a vlan device stacked on another device like this:
> >
> > 49: vlan0@parent: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
> > noqueue state UP mode DEFAULT group default qlen 1000
> >      link/ether 02:1c:a7:00:00:01 brd ff:ff:ff:ff:ff:ff
> > 3: parent: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 10000 qdisc prio state
> > UNKNOWN mode DEFAULT group default qlen 1000
> >      link/ether xx:xx:xx:xx:xx:xx brd ff:ff:ff:ff:ff:ff
> >
> > Eventually packets need to go through the device driver, which has only
> > a limited number of TX buffers. The driver implements flow control: when
> > it is about to exhaust its buffers, it stops TX by calling
> > netif_stop_queue. Once more buffers become available again, it resumes
> > TX by calling netif_wake_queue. From packet counters we can tell that
> > this is happening frequently.
> >
> > At this point we suspected "qdisc noqueue" to be a factor, and indeed,
> > after adding a queue to vlan0 the problem no longer happened, although
> > there are still TX drops on the vlan0 device.
> >
> > Missing queue or not, we think there is a disconnect between the device
> > driver API and the TCP stack. The device driver API only allows
> > transmitting packets one by one (ndo_start_xmit). The TCP stack operates
> > on larger segments that is breaks down into smaller pieces
> > (tcp_write_xmit / __tcp_transmit_skb). This can lead to a classic "short
> > write" condition which the network stack doesn't seem to handle well in
> > all cases.
> >
> > Appreciate you comments,
>
> Very nice analysis, very much appreciated.
>
> I think the issue here is that __tcp_transmit_skb() trusts the return
> of icsk->icsk_af_ops->queue_xmit()
>
> An error means : the packet was _not_ sent at all.
>
> Here, it seems that the GSO layer returns an error, even if some
> segments were sent.
> This needs to be confirmed and fixed, but in the meantime, make sure
> vlan0 has TSO support.
> It will also be more efficient to segment (if you ethernet device has
> no TSO capability) at the last moment,
> because all the segments will be sent in  the described scenario
> thanks to qdisc requeues.

Could you try the following patch?

Thanks again !

diff --git a/net/core/dev.c b/net/core/dev.c
index 378c2d010faf251ffd874ebf0cc3dd6968eee447..8efda845611129920a9ae21d5e9dd05ffab36103 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4796,6 +4796,8 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
                 * to -1 or to their cpu id, but not to our id.
                 */
                if (READ_ONCE(txq->xmit_lock_owner) != cpu) {
+                       struct sk_buff *orig;
+
                        if (dev_xmit_recursion())
                                goto recursion_alert;

@@ -4805,6 +4807,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)

                        HARD_TX_LOCK(dev, txq, cpu);

+                       orig = skb;
                        if (!netif_xmit_stopped(txq)) {
                                dev_xmit_recursion_inc();
                                skb = dev_hard_start_xmit(skb, dev, txq, &rc);
@@ -4817,6 +4820,11 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
                        HARD_TX_UNLOCK(dev, txq);
                        net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
                                             dev->name);
+                       if (skb != orig) {
+                               /* If at least one packet was sent, we must return NETDEV_TX_OK */
+                               rc = NETDEV_TX_OK;
+                               goto unlock;
+                       }
                } else {
                        /* Recursion is detected! It is possible,
                         * unfortunately
@@ -4828,6 +4836,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
        }

        rc = -ENETDOWN;
+unlock:
        rcu_read_unlock_bh();

        dev_core_stats_tx_dropped_inc(dev);

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: TCP sender stuck despite receiving ACKs from the peer
  2025-10-24  5:57       ` Eric Dumazet
@ 2025-10-31  9:06         ` Eric Dumazet
  2025-10-31 17:43           ` Christoph Schwarz
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2025-10-31  9:06 UTC (permalink / raw)
  To: Christoph Schwarz; +Cc: Neal Cardwell, netdev

On Thu, Oct 23, 2025 at 10:57 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Oct 23, 2025 at 10:29 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Thu, Oct 23, 2025 at 3:52 PM Christoph Schwarz <cschwarz@arista.com> wrote:
> > >
> > > On 10/3/25 18:24, Neal Cardwell wrote:
> > > [...]
> > > > Thanks for the report!
> > > >
> > > > A few thoughts:
> > > >
> > > [...]
> > > >
> > > > (2) After that, would it be possible to try this test with a newer
> > > > kernel? You mentioned this is with kernel version 5.10.165, but that's
> > > > more than 2.5 years old at this point, and it's possible the bug has
> > > > been fixed since then.  Could you please try this test with the newest
> > > > kernel that is available in your distribution? (If you are forced to
> > > > use 5.10.x on your distribution, note that even with 5.10.x there is
> > > > v5.10.245, which was released yesterday.)
> > > >
> > > > (3) If this bug is still reproducible with a recent kernel, would it
> > > > be possible to gather .pcap traces from both client and server,
> > > > including SYN and SYN/ACK? Sometimes it can be helpful to see the
> > > > perspective of both ends, especially if there are middleboxes
> > > > manipulating the packets in some way.
> > > >
> > > > Thanks!
> > > >
> > > > Best regards,
> > > > neal
> > >
> > > Hi,
> > >
> > > I want to give an update as we made some progress.
> > >
> > > We tried with the 6.12.40 kernel, but it was much harder to reproduce
> > > and we were not able to do a successful packet capture and reproduction
> > > at the same time. So we went back to 5.10.165, added more tracing and
> > > eventually figured out how the TCP connection got into the bad state.
> > >
> > > This is a backtrace from the TCP stack calling down to the device driver:
> > >   => fdev_tx    // ndo_start_xmit hook of a proprietary device driver
> > >   => dev_hard_start_xmit
> > >   => sch_direct_xmit
> > >   => __qdisc_run
> > >   => __dev_queue_xmit
> > >   => vlan_dev_hard_start_xmit
> > >   => dev_hard_start_xmit
> > >   => __dev_queue_xmit
> > >   => ip_finish_output2
> > >   => __ip_queue_xmit
> > >   => __tcp_transmit_skb
> > >   => tcp_write_xmit
> > >
> > > tcp_write_xmit sends segments of 65160 bytes. Due to an MSS of 1448,
> > > they get broken down into 45 packets of 1448 bytes each.
> >
> > So the driver does not support TSO ? Quite odd in 2025...
> >
> > One thing you want is to make sure your vlan device (the one without a
> > Qdisc on it)
> > advertizes tso support.
> >
> > ethtool -k vlan0
> >
> >
> > > These 45
> > > packets eventually reach dev_hard_start_xmit, which is a simple loop
> > > forwarding packets one by one. When the problem occurs, we see that
> > > dev_hard_start_xmit transmits the initial N packets successfully, but
> > > the remaining 45-N ones fail with error code 1. The loop runs to
> > > completion and does not break.
> > >
> > > The error code 1 from dev_hard_start_xmit gets returned through the call
> > > stack up to tcp_write_xmit, which treats this as error and breaks its
> > > own loop without advancing snd_nxt:
> > >
> > >                 if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
> > >                         break; // <<< breaks here
> > >
> > > repair:
> > >                 /* Advance the send_head.  This one is sent out.
> > >                  * This call will increment packets_out.
> > >                  */
> > >                 tcp_event_new_data_sent(sk, skb);
> > >
> > >  From packet captures we can prove that the 45 packets show up on the
> > > kernel device on the sender. In addition, the first N of those 45
> > > packets show up on the kernel device on the peer. The connection is now
> > > in the problem state where the peer is N packets ahead of the sender and
> > > the sender thinks that it never those packets, leading to the problem as
> > > described in my initial mail.
> > >
> > > Furthermore, we noticed that the N-45 missing packets show up as drops
> > > on the sender's kernel device:
> > >
> > > vlan0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
> > >          inet 127.2.0.1  netmask 255.255.255.0  broadcast 0.0.0.0
> > >          [...]
> > >          TX errors 0  dropped 36 overruns 0  carrier 0  collisions 0
> > >
> > > This device is a vlan device stacked on another device like this:
> > >
> > > 49: vlan0@parent: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
> > > noqueue state UP mode DEFAULT group default qlen 1000
> > >      link/ether 02:1c:a7:00:00:01 brd ff:ff:ff:ff:ff:ff
> > > 3: parent: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 10000 qdisc prio state
> > > UNKNOWN mode DEFAULT group default qlen 1000
> > >      link/ether xx:xx:xx:xx:xx:xx brd ff:ff:ff:ff:ff:ff
> > >
> > > Eventually packets need to go through the device driver, which has only
> > > a limited number of TX buffers. The driver implements flow control: when
> > > it is about to exhaust its buffers, it stops TX by calling
> > > netif_stop_queue. Once more buffers become available again, it resumes
> > > TX by calling netif_wake_queue. From packet counters we can tell that
> > > this is happening frequently.
> > >
> > > At this point we suspected "qdisc noqueue" to be a factor, and indeed,
> > > after adding a queue to vlan0 the problem no longer happened, although
> > > there are still TX drops on the vlan0 device.
> > >
> > > Missing queue or not, we think there is a disconnect between the device
> > > driver API and the TCP stack. The device driver API only allows
> > > transmitting packets one by one (ndo_start_xmit). The TCP stack operates
> > > on larger segments that is breaks down into smaller pieces
> > > (tcp_write_xmit / __tcp_transmit_skb). This can lead to a classic "short
> > > write" condition which the network stack doesn't seem to handle well in
> > > all cases.
> > >
> > > Appreciate you comments,
> >
> > Very nice analysis, very much appreciated.
> >
> > I think the issue here is that __tcp_transmit_skb() trusts the return
> > of icsk->icsk_af_ops->queue_xmit()
> >
> > An error means : the packet was _not_ sent at all.
> >
> > Here, it seems that the GSO layer returns an error, even if some
> > segments were sent.
> > This needs to be confirmed and fixed, but in the meantime, make sure
> > vlan0 has TSO support.
> > It will also be more efficient to segment (if you ethernet device has
> > no TSO capability) at the last moment,
> > because all the segments will be sent in  the described scenario
> > thanks to qdisc requeues.
>
> Could you try the following patch ?
>
> Thanks again !
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 378c2d010faf251ffd874ebf0cc3dd6968eee447..8efda845611129920a9ae21d5e9dd05ffab36103
> 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4796,6 +4796,8 @@ int __dev_queue_xmit(struct sk_buff *skb, struct
> net_device *sb_dev)
>                  * to -1 or to their cpu id, but not to our id.
>                  */
>                 if (READ_ONCE(txq->xmit_lock_owner) != cpu) {
> +                       struct sk_buff *orig;
> +
>                         if (dev_xmit_recursion())
>                                 goto recursion_alert;
>
> @@ -4805,6 +4807,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct
> net_device *sb_dev)
>
>                         HARD_TX_LOCK(dev, txq, cpu);
>
> +                       orig = skb;
>                         if (!netif_xmit_stopped(txq)) {
>                                 dev_xmit_recursion_inc();
>                                 skb = dev_hard_start_xmit(skb, dev, txq, &rc);
> @@ -4817,6 +4820,11 @@ int __dev_queue_xmit(struct sk_buff *skb,
> struct net_device *sb_dev)
>                         HARD_TX_UNLOCK(dev, txq);
>                         net_crit_ratelimited("Virtual device %s asks
> to queue packet!\n",
>                                              dev->name);
> +                       if (skb != orig) {
> +                               /* If at least one packet was sent, we
> must return NETDEV_TX_OK */
> +                               rc = NETDEV_TX_OK;
> +                               goto unlock;
> +                       }
>                 } else {
>                         /* Recursion is detected! It is possible,
>                          * unfortunately
> @@ -4828,6 +4836,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct
> net_device *sb_dev)
>         }
>
>         rc = -ENETDOWN;
> +unlock:
>         rcu_read_unlock_bh();
>
>         dev_core_stats_tx_dropped_inc(dev);

Hi Christoph

Any progress on your side ?

Thanks.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: TCP sender stuck despite receiving ACKs from the peer
  2025-10-31  9:06         ` Eric Dumazet
@ 2025-10-31 17:43           ` Christoph Schwarz
  2025-10-31 18:01             ` Stephen Hemminger
  0 siblings, 1 reply; 8+ messages in thread
From: Christoph Schwarz @ 2025-10-31 17:43 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Neal Cardwell, netdev



On 10/31/25 02:06, Eric Dumazet wrote:
> On Thu, Oct 23, 2025 at 10:57 PM Eric Dumazet <edumazet@google.com> wrote:
>>
[...]
>> Could you try the following patch ?
>>
>> Thanks again !
>>
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 378c2d010faf251ffd874ebf0cc3dd6968eee447..8efda845611129920a9ae21d5e9dd05ffab36103
>> 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -4796,6 +4796,8 @@ int __dev_queue_xmit(struct sk_buff *skb, struct
>> net_device *sb_dev)
>>                   * to -1 or to their cpu id, but not to our id.
>>                   */
>>                  if (READ_ONCE(txq->xmit_lock_owner) != cpu) {
>> +                       struct sk_buff *orig;
>> +
>>                          if (dev_xmit_recursion())
>>                                  goto recursion_alert;
>>
>> @@ -4805,6 +4807,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct
>> net_device *sb_dev)
>>
>>                          HARD_TX_LOCK(dev, txq, cpu);
>>
>> +                       orig = skb;
>>                          if (!netif_xmit_stopped(txq)) {
>>                                  dev_xmit_recursion_inc();
>>                                  skb = dev_hard_start_xmit(skb, dev, txq, &rc);
>> @@ -4817,6 +4820,11 @@ int __dev_queue_xmit(struct sk_buff *skb,
>> struct net_device *sb_dev)
>>                          HARD_TX_UNLOCK(dev, txq);
>>                          net_crit_ratelimited("Virtual device %s asks
>> to queue packet!\n",
>>                                               dev->name);
>> +                       if (skb != orig) {
>> +                               /* If at least one packet was sent, we
>> must return NETDEV_TX_OK */
>> +                               rc = NETDEV_TX_OK;
>> +                               goto unlock;
>> +                       }
>>                  } else {
>>                          /* Recursion is detected! It is possible,
>>                           * unfortunately
>> @@ -4828,6 +4836,7 @@ int __dev_queue_xmit(struct sk_buff *skb, struct
>> net_device *sb_dev)
>>          }
>>
>>          rc = -ENETDOWN;
>> +unlock:
>>          rcu_read_unlock_bh();
>>
>>          dev_core_stats_tx_dropped_inc(dev);
> 
> Hi Christoph
> 
> Any progress on your side ?
> 
> Thanks.

Hi Eric,

Thanks for your help. This is much appreciated.

We tried your patch but unfortunately it did not help. We have some 
ideas why that is. Here is what we figured out:

It is very likely that device stacking as described in my previous mail 
is a factor.

49: vlan0@parent: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc
noqueue state UP mode DEFAULT group default qlen 1000
      link/ether 02:1c:a7:00:00:01 brd ff:ff:ff:ff:ff:ff
3: parent: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 10000 qdisc prio state
UNKNOWN mode DEFAULT group default qlen 1000
      link/ether xx:xx:xx:xx:xx:xx brd ff:ff:ff:ff:ff:ff

The "parent" device is served by a proprietary device driver for a 
switch ASIC, and implements TX flow control, with the TX queue being 
stopped frequently. It does not have TSO capabilities. We could look 
into adding that, but as of now it is not an option.

The "vlan0" device stacked on top is Linux kernel code 
(net/8021q/vlan_dev.c) and has the IP address to which the HTTP server 
binds. However, its TX queue never stops.

So now it can get into this situation where the TX queue on the 
underlying device is stopped, but on the stacked vlan0 device it is not. 
In this situation, we see return codes of NET_XMIT_DROP (1).

This means it never reaches the code that you patched in: because of 
rc=1, dev_xmit_complete() is always true, so execution goes to "out". And 
because the TX queue on vlan0 is never stopped, it always enters the 
"!netif_xmit_stopped(txq)" block and never skips over it, again 
preventing the new code from ever being executed.

if (!netif_xmit_stopped(txq)) {
	dev_xmit_recursion_inc();
	skb = dev_hard_start_xmit(skb, dev, txq, &rc);
	dev_xmit_recursion_dec();
	if (dev_xmit_complete(rc)) {
		HARD_TX_UNLOCK(dev, txq);
		goto out;
	}
}
HARD_TX_UNLOCK(dev, txq);
net_crit_ratelimited("Virtual device %s asks to queue packet!\n",
		     dev->name);
if (skb != orig) {
	/* If at least one packet was sent, we must return NETDEV_TX_OK */
	rc = NETDEV_TX_OK;
	goto unlock;
}

I think for your patch to work we would need to see a NETDEV_TX_BUSY 
(0x10) rc from dev_hard_start_xmit, but that does not seem to happen, 
maybe due to the device stacking?
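
For reference, this is (roughly) the check involved, from
include/linux/netdevice.h; NET_XMIT_DROP is 0x01 and NET_XMIT_MASK is
0x0f, so a drop reported by a stacked device still looks like a
completed transmit:

static inline bool dev_xmit_complete(int rc)
{
	/*
	 * Positive cases with an skb consumed by a driver:
	 * - successful transmission (rc == NETDEV_TX_OK)
	 * - error while transmitting (rc < 0)
	 * - error while queueing to a different device (rc & NET_XMIT_MASK)
	 */
	if (likely(rc < NET_XMIT_MASK))
		return true;

	return false;
}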

best regards,
Chris


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: TCP sender stuck despite receiving ACKs from the peer
  2025-10-31 17:43           ` Christoph Schwarz
@ 2025-10-31 18:01             ` Stephen Hemminger
  0 siblings, 0 replies; 8+ messages in thread
From: Stephen Hemminger @ 2025-10-31 18:01 UTC (permalink / raw)
  To: Christoph Schwarz; +Cc: Eric Dumazet, Neal Cardwell, netdev

On Fri, 31 Oct 2025 10:43:36 -0700
Christoph Schwarz <cschwarz@arista.com> wrote:

> The "parent" device is served by a proprietary device driver for a 
> switch ASIC, and implements TX flow control, with the TX queue being 
> stopped frequently. It does not have TSO capabilities. We could look 
> into adding that, but as of now it is not an option.

You really should not expect any mailing-list support for anything
that uses a proprietary device driver, i.e. "not my problem, go away".

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2025-10-31 18:01 UTC | newest]

Thread overview: 8+ messages
     [not found] <CA+suKw5OhWLJe_7uth4q=qxVpsD4qpwGRENORwA=beNLpiDuwg@mail.gmail.com>
2025-10-04  1:24 ` TCP sender stuck despite receiving ACKs from the peer Neal Cardwell
2025-10-07 21:32   ` Christoph Schwarz
2025-10-23 22:52   ` Christoph Schwarz
2025-10-24  5:29     ` Eric Dumazet
2025-10-24  5:57       ` Eric Dumazet
2025-10-31  9:06         ` Eric Dumazet
2025-10-31 17:43           ` Christoph Schwarz
2025-10-31 18:01             ` Stephen Hemminger
