* tcp: socket stuck with zero receive window after SACK
@ 2025-05-19 13:31 Simon Campion
2025-05-19 14:42 ` Neal Cardwell
0 siblings, 1 reply; 11+ messages in thread
From: Simon Campion @ 2025-05-19 13:31 UTC
To: netdev
Hi all,
We have a TCP socket that's stuck in the following state:
* it SACKed ~40KB of data, but is missing 602 bytes at the beginning
* it has a zero receive window
* the Recv-Q as reported by ss is 0
Because of the zero window, the kernel drops the missing 602 bytes when
the peer retransmits them. So the socket is stuck indefinitely, waiting
for data that it drops on arrival. Since the Recv-Q reported by ss is 0,
we suspect that the zero window is not caused by the owner of the socket
failing to read data. Rather, we wonder whether the kernel SACKed more
data than it should have, given the receive buffer size, leaving too
little space to store the missing bytes when they arrive.
Could this happen?
We don't have a reproducer for this issue. The socket is still in this
state, so we're happy to provide more debugging information while we
have it. This is the first time we've seen this problem.
Here are more details:
# uname -r
6.6.83-flatcar
The stuck socket is used by the in-kernel cephfs driver to fetch data
from an OSD on a storage node. The storage node runs the same kernel
version.
In tcpdump, we see SACK {603:42189} and win 0:
# tcpdump -ni any 'host 10.70.3.48 and tcp port 6945'
...
10:02:56.739075 eth1a Out IP 10.70.112.146.35432 > 10.70.3.48.6945:
Flags [P.], seq 4260252836:4260252845, ack 838667293, win 0, options
[nop,nop,TS val 1683964360 ecr 1218657118,nop,nop,sack 1 {603:42189}],
length 9
10:02:56.739157 eth1b In IP 10.70.3.48.6945 > 10.70.112.146.35432:
Flags [.], ack 9, win 518, options [nop,nop,TS val 1218662238 ecr
1683964360], length 0
10:03:01.859080 eth1a Out IP 10.70.112.146.35432 > 10.70.3.48.6945:
Flags [P.], seq 9:18, ack 1, win 0, options [nop,nop,TS val 1683969480
ecr 1218662238,nop,nop,sack 1 {603:42189}], length 9
10:03:01.859185 eth1b In IP 10.70.3.48.6945 > 10.70.112.146.35432:
Flags [.], ack 18, win 518, options [nop,nop,TS val 1218667358 ecr
1683969480], length 0
Every two minutes, the storage node tries to retransmit the missing 602 bytes:
10:03:10.896289 eth1b In IP 10.70.3.48.6945 > 10.70.112.146.35432:
Flags [.], seq 1:603, ack 27, win 518, options [nop,nop,TS val
1218676396 ecr 1683974600], length 602
10:03:10.896319 eth1a Out IP 10.70.112.146.35432 > 10.70.3.48.6945:
Flags [.], ack 1, win 0, options [nop,nop,TS val 1683978517 ecr
1218676396,nop,nop,sack 1 {603:42189}], length 0
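The sizes quoted in this report can be recomputed from the relative sequence numbers in the capture. The following is a minimal sketch, assuming the SACK right edge is exclusive (as in RFC 2018) and that `ack 1` is the next in-order byte the receiver expects; the helper name `sack_summary` is made up for illustration.

```python
def sack_summary(cum_ack, sack_blocks):
    """Return (hole_bytes, sacked_bytes) for one ACK from the receiver.

    cum_ack:     next in-order sequence number the receiver expects
    sack_blocks: list of (left_edge, right_edge) pairs, right edge exclusive
    """
    blocks = sorted(sack_blocks)
    # Bytes missing between the cumulative ACK and the first SACKed block.
    hole = blocks[0][0] - cum_ack if blocks else 0
    # Bytes held out of order, summed over all SACK blocks.
    sacked = sum(right - left for left, right in blocks)
    return hole, sacked


# Values from the capture: ack 1, sack 1 {603:42189}
hole, sacked = sack_summary(1, [(603, 42189)])
print(hole, sacked)  # 602 41586
```

So the hole is 602 bytes and the SACKed range is 41586 bytes, which matches the "~40KB" figure above.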
pwru shows that the packet with the missing 602 bytes is dropped
because of SKB_DROP_REASON_TCP_ZEROWINDOW:
# ./pwru 'src host 10.70.3.48 and tcp port 6945 and greater 100'
...
SKB CPU PROCESS NETNS MARK/x IFACE
PROTO MTU LEN TUPLE FUNC
0xff1100014dd71f00 10 <empty>:0 4026531840 0
eth1b:5 0x0800 9086 654
10.70.3.48:6945->10.70.112.146:35432(tcp) inet_gro_receive
0xff1100014dd71f00 10 <empty>:0 4026531840 0
eth1b:5 0x0800 9086 654
10.70.3.48:6945->10.70.112.146:35432(tcp) tcp4_gro_receive
0xff1100014dd71f00 10 <empty>:0 4026531840 0
eth1b:5 0x0800 9086 654
10.70.3.48:6945->10.70.112.146:35432(tcp) tcp_gro_receive
0xff1100014dd71f00 10 <empty>:0 4026531840 0
eth1b:5 0x0800 9086 654
10.70.3.48:6945->10.70.112.146:35432(tcp) skb_defer_rx_timestamp
0xff1100014dd71f00 10 <empty>:0 4026531840 0
eth1b:5 0x0800 9086 668
10.70.3.48:6945->10.70.112.146:35432(tcp) skb_ensure_writable
0xff1100014dd71f00 10 <empty>:0 4026531840 300
eth1b:5 0x0800 9086 654
10.70.3.48:6945->10.70.112.146:35432(tcp) ip_rcv_core
0xff1100014dd71f00 10 <empty>:0 4026531840 300
eth1b:5 0x0800 9086 654
10.70.3.48:6945->10.70.112.146:35432(tcp) nf_hook_slow
0xff1100014dd71f00 10 <empty>:0 4026531840 300
eth1b:5 0x0800 9086 654
10.70.3.48:6945->10.70.112.146:35432(tcp) nf_checksum
0xff1100014dd71f00 10 <empty>:0 4026531840 300
eth1b:5 0x0800 9086 654
10.70.3.48:6945->10.70.112.146:35432(tcp) nf_ip_checksum
0xff1100014dd71f00 10 <empty>:0 4026531840 300
eth1b:5 0x0800 9086 654
10.70.3.48:6945->10.70.112.146:35432(tcp) tcp_v4_early_demux
0xff1100014dd71f00 10 <empty>:0 4026531840 300
eth1b:5 0x0800 65536 654
10.70.3.48:6945->10.70.112.146:35432(tcp) ip_local_deliver
0xff1100014dd71f00 10 <empty>:0 4026531840 300
eth1b:5 0x0800 65536 654
10.70.3.48:6945->10.70.112.146:35432(tcp) nf_hook_slow
0xff1100014dd71f00 10 <empty>:0 4026531840 300
eth1b:5 0x0800 65536 654
10.70.3.48:6945->10.70.112.146:35432(tcp) ip_local_deliver_finish
0xff1100014dd71f00 10 <empty>:0 4026531840 300
eth1b:5 0x0800 65536 634
10.70.3.48:6945->10.70.112.146:35432(tcp) ip_protocol_deliver_rcu
0xff1100014dd71f00 10 <empty>:0 4026531840 300
eth1b:5 0x0800 65536 634
10.70.3.48:6945->10.70.112.146:35432(tcp) raw_local_deliver
0xff1100014dd71f00 10 <empty>:0 4026531840 300
eth1b:5 0x0800 65536 634
10.70.3.48:6945->10.70.112.146:35432(tcp) tcp_v4_rcv
0xff1100014dd71f00 10 <empty>:0 4026531840 300
eth1b:5 0x0800 65536 634
10.70.3.48:6945->10.70.112.146:35432(tcp) tcp_filter
0xff1100014dd71f00 10 <empty>:0 4026531840 300
eth1b:5 0x0800 65536 634
10.70.3.48:6945->10.70.112.146:35432(tcp) sk_filter_trim_cap
0xff1100014dd71f00 10 <empty>:0 4026531840 300
eth1b:5 0x0800 65536 634
10.70.3.48:6945->10.70.112.146:35432(tcp) security_sock_rcv_skb
0xff1100014dd71f00 10 <empty>:0 4026531840 300
eth1b:5 0x0800 65536 634
10.70.3.48:6945->10.70.112.146:35432(tcp) selinux_socket_sock_rcv_skb
0xff1100014dd71f00 10 <empty>:0 4026531840 300
eth1b:5 0x0800 65536 634
10.70.3.48:6945->10.70.112.146:35432(tcp) tcp_v4_fill_cb
0xff1100014dd71f00 10 <empty>:0 0 300 0
0x0800 65536 634 10.70.3.48:6945->10.70.112.146:35432(tcp)
tcp_v4_do_rcv
0xff1100014dd71f00 10 <empty>:0 0 300 0
0x0800 65536 634 10.70.3.48:6945->10.70.112.146:35432(tcp)
tcp_rcv_established
0xff1100014dd71f00 10 <empty>:0 0 300 0
0x0800 65536 634 10.70.3.48:6945->10.70.112.146:35432(tcp)
tcp_validate_incoming
0xff1100014dd71f00 10 <empty>:0 0 300 0
0x0800 65536 634 10.70.3.48:6945->10.70.112.146:35432(tcp)
tcp_urg
0xff1100014dd71f00 10 <empty>:0 0 300 0
0x0800 65536 634 10.70.3.48:6945->10.70.112.146:35432(tcp)
tcp_data_queue
0xff1100014dd71f00 10 <empty>:0 0 300 0
0x0800 0 602 10.70.3.48:6945->10.70.112.146:35432(tcp)
kfree_skb_reason(SKB_DROP_REASON_TCP_ZEROWINDOW)
0xff1100014dd71f00 10 <empty>:0 0 300 0
0x0800 0 602 10.70.3.48:6945->10.70.112.146:35432(tcp)
skb_release_head_state
0xff1100014dd71f00 10 <empty>:0 0 300 0
0x0800 0 602 10.70.3.48:6945->10.70.112.146:35432(tcp)
skb_release_data
0xff1100014dd71f00 10 <empty>:0 0 300 0
0x0800 0 602 10.70.3.48:6945->10.70.112.146:35432(tcp)
skb_free_head
0xff1100014dd71f00 10 <empty>:0 0 300 0
0x0800 0 602 10.70.3.48:6945->10.70.112.146:35432(tcp)
kfree_skbmem
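One plausible reading of the LEN column in the trace above is that it shrinks as each header is pulled off the skb. The sketch below checks that arithmetic. The header sizes are assumptions: an untagged Ethernet header, an IPv4 header without options, and a TCP header carrying only the nop,nop,timestamp option seen in the tcpdump output.

```python
ETH_HDR = 14       # assumed: Ethernet header without a VLAN tag
IP_HDR = 20        # assumed: IPv4 header without options
TCP_HDR = 20 + 12  # base TCP header plus nop,nop,timestamp option

payload = 602                  # LEN at kfree_skb_reason(...)
at_tcp = payload + TCP_HDR     # LEN around tcp_v4_rcv
at_ip = at_tcp + IP_HDR        # LEN around ip_rcv_core
with_eth = at_ip + ETH_HDR     # LEN at skb_ensure_writable

print(payload, at_tcp, at_ip, with_eth)  # 602 634 654 668
```

Under these assumptions, the 602 bytes reported at the drop point are exactly the TCP payload of the retransmitted segment.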
If we interpret the ss output below correctly, the Recv-Q is 0 and the
receive buffer is almost full (r127488 of rb131072 bytes, roughly
127 KB of 131 KB):
# ss --tcp -timo | grep -A 1 10.70.3.48:6945
ESTAB 0 0 10.70.112.146:35432
10.70.3.48:6945
skmem:(r127488,rb131072,t0,tb46080,f3584,w0,o0,bl0,d2821) ts
sack cubic wscale:7,7 rto:201 rtt:0.11/0.02 ato:40 mss:1434 pmtu:9086
rcvmss:1434 advmss:9034 cwnd:2 ssthresh:2 bytes_sent:610174
bytes_retrans:36 bytes_acked:610139 bytes_received:102223
segs_out:70278 segs_in:70365 data_segs_out:67444 data_segs_in:2927
send 209Mbps lastsnd:4071 lastrcv:345233042 lastack:4071 pacing_rate
249Mbps delivery_rate 417Mbps delivered:67441 app_limited busy:2957ms
retrans:0/4 rcv_rtt:0.075 rcv_space:45056 rcv_ssthresh:45056
minrtt:0.058 rcv_ooopack:31 snd_wnd:66304 rehash:1
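To make the skmem reading explicit, here is a minimal sketch that parses the `skmem:(...)` field and computes how full the receive buffer is. The long field names follow the ss(8) man page; the function name `parse_skmem` is made up for illustration.

```python
import re

# Short skmem keys as printed by ss, mapped to the names used in ss(8).
SKMEM_FIELDS = {
    "r": "rmem_alloc", "rb": "rcvbuf", "t": "wmem_alloc", "tb": "sndbuf",
    "f": "fwd_alloc", "w": "wmem_queued", "o": "opt_mem", "bl": "back_log",
    "d": "sock_drop",
}


def parse_skmem(text):
    """Extract the skmem:(...) counters from a line of ss output."""
    match = re.search(r"skmem:\(([^)]*)\)", text)
    if match is None:
        raise ValueError("no skmem field found")
    counters = {}
    for item in match.group(1).split(","):
        key, value = re.fullmatch(r"([a-z]+)(\d+)", item).groups()
        counters[SKMEM_FIELDS.get(key, key)] = int(value)
    return counters


mem = parse_skmem("skmem:(r127488,rb131072,t0,tb46080,f3584,w0,o0,bl0,d2821)")
fill = mem["rmem_alloc"] / mem["rcvbuf"]
headroom = mem["rcvbuf"] - mem["rmem_alloc"]
print(f"{fill:.1%} full, {headroom} bytes of headroom")
```

With the values above, the buffer is about 97.3% full with 3584 bytes of headroom. Because Recv-Q is 0, the allocated memory is presumably charged to the SACKed out-of-order data rather than to unread in-order data.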
Here's the `ss` output from the other end of the TCP connection, the
storage node:
# ss --tcp -timo | grep -A 1 10.70.112.146:35432
ESTAB 0 186408 10.70.3.48:6945 10.70.112.146:35432
timer:(on,35sec,0)
skmem:(r0,rb2420208,t0,tb332800,f3032,w332840,o0,bl0,d0) ts
sack cubic wscale:7,7 rto:120000 rtt:0.066/0.015 ato:40 mss:1434
pmtu:1486 rcvmss:536 advmss:1434 cwnd:1 ssthresh:2 bytes_sent:1852143
bytes_retrans:1707732 bytes_acked:102223 bytes_received:613144
segs_out:70712 segs_in:70622 data_segs_out:2941 data_segs_in:67774
send 174Mbps lastsnd:84760 lastrcv:1638 lastack:1638 pacing_rate
6.64Gbps delivery_rate 6.98Gbps delivered:107 busy:346940689ms
rwnd_limited:209ms(0.0%) unacked:32 retrans:1/2834 lost:1 sacked:31
rcv_rtt:362445 rcv_space:64677 rcv_ssthresh:66234 notsent:144220
minrtt:0.044 rcv_wnd:66304
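The `win` values in tcpdump and the `snd_wnd`/`rcv_wnd` values in ss can be cross-checked with the negotiated window scale. This is a small sketch, assuming tcpdump prints the raw 16-bit window field and that the `wscale:7,7` shift applies to it; the helper name `scaled_window` is made up.

```python
def scaled_window(raw_window, wscale):
    """Convert the raw TCP header window field to bytes using the window scale."""
    return raw_window << wscale


# Storage node advertises "win 518" with wscale 7.
print(scaled_window(518, 7))  # 66304
# The stuck socket advertises "win 0", which stays 0 at any scale.
print(scaled_window(0, 7))    # 0
```

The 66304 result agrees with `snd_wnd:66304` on the stuck socket and `rcv_wnd:66304` on the storage node.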
The receive buffer has the default size:
# cat /proc/sys/net/ipv4/tcp_rmem
4096 131072 6291456
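For reference, the three tcp_rmem values are the minimum, default, and maximum receive buffer sizes in bytes. The sketch below parses them and compares the default with the `rb131072` value from the ss output. The function name `parse_tcp_rmem` is made up, and the string is passed in directly rather than read from /proc so the example runs anywhere.

```python
def parse_tcp_rmem(text):
    """Split the tcp_rmem sysctl into its min/default/max byte values."""
    minimum, default, maximum = (int(field) for field in text.split())
    return {"min": minimum, "default": default, "max": maximum}


rmem = parse_tcp_rmem("4096 131072 6291456")
socket_rcvbuf = 131072  # rb value from the skmem field of the stuck socket
print(socket_rcvbuf == rmem["default"])  # True
```

The match suggests that the socket's receive buffer was still at its initial size when it got stuck, although this alone does not say why.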
Let us know if any other information would be helpful to diagnose this
issue further.
Thanks for your help!
Simon
* Re: tcp: socket stuck with zero receive window after SACK
From: Neal Cardwell @ 2025-05-19 14:42 UTC
To: Simon Campion; +Cc: netdev, Eric Dumazet, Yuchung Cheng, Kevin Yang
On Mon, May 19, 2025 at 9:31 AM Simon Campion <simon.campion@deepl.com> wrote:
>
> Hi all,
>
> We have a TCP socket that's stuck in the following state:
>
> * it SACKed ~40KB of data, but is missing 602 bytes at the beginning
> * it has a zero receive window
> * the Recv-Q as reported by ss is 0
>
> Because of the zero window, the kernel drops the missing 602 bytes when
> the peer retransmits them. So the socket is stuck indefinitely, waiting
> for data that it drops on arrival. Since the Recv-Q reported by ss is 0,
> we suspect that the zero window is not caused by the owner of the socket
> failing to read data. Rather, we wonder whether the kernel SACKed more
> data than it should have, given the receive buffer size, leaving too
> little space to store the missing bytes when they arrive.
> Could this happen?
>
> We don't have a reproducer for this issue. The socket is still in this
> state, so we're happy to provide more debugging information while we
> have it. This is the first time we've seen this problem.
>
> Here are more details:
>
> # uname -r
> 6.6.83-flatcar
Thanks for the detailed report!
Can you please attach the output of the following command, run on the
same machine (and in the same network namespace) as the socket with
the receive buffer that is almost full:
nstat -az > /tmp/nstat.txt
This should help us get a better idea about which "prune" methods are
being tried, and which of them are failing to free up enough memory.
Thanks!
neal
* Re: [EXT] Re: tcp: socket stuck with zero receive window after SACK
From: Simon Campion @ 2025-05-19 15:03 UTC
To: Neal Cardwell; +Cc: netdev, Eric Dumazet, Yuchung Cheng, Kevin Yang
[-- Attachment #1: Type: text/plain, Size: 28026 bytes --]
Gladly! I've attached the output of nstat -az. I ran it twice, right
before a 602-byte retransmit was received and dropped, and right
after, in case looking at the diff is helpful.
Here's the first run:
#kernel
IpInReceives 18788097214 0.0
IpInHdrErrors 47522 0.0
IpInAddrErrors 0 0.0
IpForwDatagrams 11778747 0.0
IpInUnknownProtos 0 0.0
IpInDiscards 0 0.0
IpInDelivers 18776248435 0.0
IpOutRequests 14428799907 0.0
IpOutDiscards 40798 0.0
IpOutNoRoutes 0 0.0
IpReasmTimeout 0 0.0
IpReasmReqds 32952 0.0
IpReasmOKs 16476 0.0
IpReasmFails 0 0.0
IpFragOKs 0 0.0
IpFragFails 0 0.0
IpFragCreates 0 0.0
IpOutTransmits 14440537900 0.0
IcmpInMsgs 7052154 0.0
IcmpInErrors 107 0.0
IcmpInCsumErrors 0 0.0
IcmpInDestUnreachs 100237 0.0
IcmpInTimeExcds 298 0.0
IcmpInParmProbs 0 0.0
IcmpInSrcQuenchs 0 0.0
IcmpInRedirects 67 0.0
IcmpInEchos 4035849 0.0
IcmpInEchoReps 2915703 0.0
IcmpInTimestamps 0 0.0
IcmpInTimestampReps 0 0.0
IcmpInAddrMasks 0 0.0
IcmpInAddrMaskReps 0 0.0
IcmpOutMsgs 7061967 0.0
IcmpOutErrors 0 0.0
IcmpOutRateLimitGlobal 0 0.0
IcmpOutRateLimitHost 47087 0.0
IcmpOutDestUnreachs 10421 0.0
IcmpOutTimeExcds 449 0.0
IcmpOutParmProbs 0 0.0
IcmpOutSrcQuenchs 0 0.0
IcmpOutRedirects 46216 0.0
IcmpOutEchos 2969032 0.0
IcmpOutEchoReps 4035849 0.0
IcmpOutTimestamps 0 0.0
IcmpOutTimestampReps 0 0.0
IcmpOutAddrMasks 0 0.0
IcmpOutAddrMaskReps 0 0.0
IcmpMsgInType0 2915703 0.0
IcmpMsgInType3 100237 0.0
IcmpMsgInType5 67 0.0
IcmpMsgInType8 4035849 0.0
IcmpMsgInType11 298 0.0
IcmpMsgOutType0 4035849 0.0
IcmpMsgOutType3 10421 0.0
IcmpMsgOutType5 46216 0.0
IcmpMsgOutType8 2969032 0.0
IcmpMsgOutType11 449 0.0
TcpActiveOpens 7902932 0.0
TcpPassiveOpens 1447146 0.0
TcpAttemptFails 41605 0.0
TcpEstabResets 1327046 0.0
TcpInSegs 4953075475 0.0
TcpOutSegs 2988066928 0.0
TcpRetransSegs 656187 0.0
TcpInErrs 0 0.0
TcpOutRsts 3401866 0.0
TcpInCsumErrors 0 0.0
UdpInDatagrams 13816173054 0.0
UdpNoPorts 10435 0.0
UdpInErrors 2 0.0
UdpOutDatagrams 493212273 0.0
UdpRcvbufErrors 0 0.0
UdpSndbufErrors 10 0.0
UdpInCsumErrors 0 0.0
UdpIgnoredMulti 0 0.0
UdpMemErrors 0 0.0
UdpLiteInDatagrams 0 0.0
UdpLiteNoPorts 0 0.0
UdpLiteInErrors 0 0.0
UdpLiteOutDatagrams 0 0.0
UdpLiteRcvbufErrors 0 0.0
UdpLiteSndbufErrors 0 0.0
UdpLiteInCsumErrors 0 0.0
UdpLiteIgnoredMulti 0 0.0
UdpLiteMemErrors 0 0.0
Ip6InReceives 268280 0.0
Ip6InHdrErrors 0 0.0
Ip6InTooBigErrors 0 0.0
Ip6InNoRoutes 39 0.0
Ip6InAddrErrors 0 0.0
Ip6InUnknownProtos 0 0.0
Ip6InTruncatedPkts 0 0.0
Ip6InDiscards 0 0.0
Ip6InDelivers 85862 0.0
Ip6OutForwDatagrams 0 0.0
Ip6OutRequests 500197 0.0
Ip6OutDiscards 1 0.0
Ip6OutNoRoutes 591 0.0
Ip6ReasmTimeout 0 0.0
Ip6ReasmReqds 0 0.0
Ip6ReasmOKs 0 0.0
Ip6ReasmFails 0 0.0
Ip6FragOKs 0 0.0
Ip6FragFails 0 0.0
Ip6FragCreates 0 0.0
Ip6InMcastPkts 194727 0.0
Ip6OutMcastPkts 426562 0.0
Ip6InOctets 20173005 0.0
Ip6OutOctets 45359068 0.0
Ip6InMcastOctets 14345494 0.0
Ip6OutMcastOctets 39440790 0.0
Ip6InBcastOctets 0 0.0
Ip6OutBcastOctets 0 0.0
Ip6InNoECTPkts 268280 0.0
Ip6InECT1Pkts 0 0.0
Ip6InECT0Pkts 0 0.0
Ip6InCEPkts 0 0.0
Ip6OutTransmits 500197 0.0
Icmp6InMsgs 23176 0.0
Icmp6InErrors 0 0.0
Icmp6OutMsgs 437389 0.0
Icmp6OutErrors 0 0.0
Icmp6InCsumErrors 0 0.0
Icmp6OutRateLimitHost 0 0.0
Icmp6InDestUnreachs 0 0.0
Icmp6InPktTooBigs 0 0.0
Icmp6InTimeExcds 0 0.0
Icmp6InParmProblems 0 0.0
Icmp6InEchos 0 0.0
Icmp6InEchoReplies 0 0.0
Icmp6InGroupMembQueries 0 0.0
Icmp6InGroupMembResponses 0 0.0
Icmp6InGroupMembReductions 0 0.0
Icmp6InRouterSolicits 12339 0.0
Icmp6InRouterAdvertisements 0 0.0
Icmp6InNeighborSolicits 2117 0.0
Icmp6InNeighborAdvertisements 8720 0.0
Icmp6InRedirects 0 0.0
Icmp6InMLDv2Reports 0 0.0
Icmp6OutDestUnreachs 0 0.0
Icmp6OutPktTooBigs 0 0.0
Icmp6OutTimeExcds 0 0.0
Icmp6OutParmProblems 0 0.0
Icmp6OutEchos 0 0.0
Icmp6OutEchoReplies 0 0.0
Icmp6OutGroupMembQueries 0 0.0
Icmp6OutGroupMembResponses 0 0.0
Icmp6OutGroupMembReductions 0 0.0
Icmp6OutRouterSolicits 2 0.0
Icmp6OutRouterAdvertisements 0 0.0
Icmp6OutNeighborSolicits 69476 0.0
Icmp6OutNeighborAdvertisements 2116 0.0
Icmp6OutRedirects 0 0.0
Icmp6OutMLDv2Reports 365795 0.0
Icmp6InType133 12339 0.0
Icmp6InType135 2117 0.0
Icmp6InType136 8720 0.0
Icmp6OutType133 2 0.0
Icmp6OutType135 69476 0.0
Icmp6OutType136 2116 0.0
Icmp6OutType143 365795 0.0
Udp6InDatagrams 6 0.0
Udp6NoPorts 0 0.0
Udp6InErrors 0 0.0
Udp6OutDatagrams 6 0.0
Udp6RcvbufErrors 0 0.0
Udp6SndbufErrors 0 0.0
Udp6InCsumErrors 0 0.0
Udp6IgnoredMulti 0 0.0
Udp6MemErrors 0 0.0
UdpLite6InDatagrams 0 0.0
UdpLite6NoPorts 0 0.0
UdpLite6InErrors 0 0.0
UdpLite6OutDatagrams 0 0.0
UdpLite6RcvbufErrors 0 0.0
UdpLite6SndbufErrors 0 0.0
UdpLite6InCsumErrors 0 0.0
UdpLite6MemErrors 0 0.0
TcpExtSyncookiesSent 0 0.0
TcpExtSyncookiesRecv 0 0.0
TcpExtSyncookiesFailed 0 0.0
TcpExtEmbryonicRsts 3 0.0
TcpExtPruneCalled 3891 0.0
TcpExtRcvPruned 0 0.0
TcpExtOfoPruned 0 0.0
TcpExtOutOfWindowIcmps 10 0.0
TcpExtLockDroppedIcmps 178 0.0
TcpExtArpFilter 0 0.0
TcpExtTW 3583160 0.0
TcpExtTWRecycled 4217 0.0
TcpExtTWKilled 0 0.0
TcpExtPAWSActive 0 0.0
TcpExtPAWSEstab 60 0.0
TcpExtDelayedACKs 5133416 0.0
TcpExtDelayedACKLocked 3008 0.0
TcpExtDelayedACKLost 111165 0.0
TcpExtListenOverflows 0 0.0
TcpExtListenDrops 0 0.0
TcpExtTCPHPHits 2023413 0.0
TcpExtTCPPureAcks 53087713 0.0
TcpExtTCPHPAcks 115340211 0.0
TcpExtTCPRenoRecovery 0 0.0
TcpExtTCPSackRecovery 13396 0.0
TcpExtTCPSACKReneging 0 0.0
TcpExtTCPSACKReorder 202 0.0
TcpExtTCPRenoReorder 0 0.0
TcpExtTCPTSReorder 1 0.0
TcpExtTCPFullUndo 0 0.0
TcpExtTCPPartialUndo 1 0.0
TcpExtTCPDSACKUndo 252 0.0
TcpExtTCPLossUndo 85 0.0
TcpExtTCPLostRetransmit 195718 0.0
TcpExtTCPRenoFailures 0 0.0
TcpExtTCPSackFailures 3 0.0
TcpExtTCPLossFailures 0 0.0
TcpExtTCPFastRetrans 381976 0.0
TcpExtTCPSlowStartRetrans 76 0.0
TcpExtTCPTimeouts 195387 0.0
TcpExtTCPLossProbes 110863 0.0
TcpExtTCPLossProbeRecovery 773 0.0
TcpExtTCPRenoRecoveryFail 0 0.0
TcpExtTCPSackRecoveryFail 38 0.0
TcpExtTCPRcvCollapsed 2960 0.0
TcpExtTCPBacklogCoalesce 252840759 0.0
TcpExtTCPDSACKOldSent 111216 0.0
TcpExtTCPDSACKOfoSent 2 0.0
TcpExtTCPDSACKRecv 77523 0.0
TcpExtTCPDSACKOfoRecv 3 0.0
TcpExtTCPAbortOnData 509577 0.0
TcpExtTCPAbortOnClose 1246626 0.0
TcpExtTCPAbortOnMemory 0 0.0
TcpExtTCPAbortOnTimeout 32 0.0
TcpExtTCPAbortOnLinger 0 0.0
TcpExtTCPAbortFailed 0 0.0
TcpExtTCPMemoryPressures 0 0.0
TcpExtTCPMemoryPressuresChrono 0 0.0
TcpExtTCPSACKDiscard 10 0.0
TcpExtTCPDSACKIgnoredOld 2 0.0
TcpExtTCPDSACKIgnoredNoUndo 73310 0.0
TcpExtTCPSpuriousRTOs 23 0.0
TcpExtTCPMD5NotFound 0 0.0
TcpExtTCPMD5Unexpected 0 0.0
TcpExtTCPMD5Failure 0 0.0
TcpExtTCPSackShifted 45975 0.0
TcpExtTCPSackMerged 248207 0.0
TcpExtTCPSackShiftFallback 328257 0.0
TcpExtTCPBacklogDrop 0 0.0
TcpExtPFMemallocDrop 0 0.0
TcpExtTCPMinTTLDrop 0 0.0
TcpExtTCPDeferAcceptDrop 0 0.0
TcpExtIPReversePathFilter 0 0.0
TcpExtTCPTimeWaitOverflow 0 0.0
TcpExtTCPReqQFullDoCookies 0 0.0
TcpExtTCPReqQFullDrop 0 0.0
TcpExtTCPRetransFail 55 0.0
TcpExtTCPRcvCoalesce 1575181380 0.0
TcpExtTCPOFOQueue 2661363 0.0
TcpExtTCPOFODrop 0 0.0
TcpExtTCPOFOMerge 2 0.0
TcpExtTCPChallengeACK 138 0.0
TcpExtTCPSYNChallenge 87 0.0
TcpExtTCPFastOpenActive 0 0.0
TcpExtTCPFastOpenActiveFail 0 0.0
TcpExtTCPFastOpenPassive 0 0.0
TcpExtTCPFastOpenPassiveFail 0 0.0
TcpExtTCPFastOpenListenOverflow 0 0.0
TcpExtTCPFastOpenCookieReqd 0 0.0
TcpExtTCPFastOpenBlackhole 0 0.0
TcpExtTCPSpuriousRtxHostQueues 118 0.0
TcpExtBusyPollRxPackets 0 0.0
TcpExtTCPAutoCorking 20400747 0.0
TcpExtTCPFromZeroWindowAdv 3612949 0.0
TcpExtTCPToZeroWindowAdv 3612953 0.0
TcpExtTCPWantZeroWindowAdv 0 0.0
TcpExtTCPSynRetrans 194466 0.0
TcpExtTCPOrigDataSent 1261792021 0.0
TcpExtTCPHystartTrainDetect 40140 0.0
TcpExtTCPHystartTrainCwnd 4828784 0.0
TcpExtTCPHystartDelayDetect 17 0.0
TcpExtTCPHystartDelayCwnd 8489 0.0
TcpExtTCPACKSkippedSynRecv 0 0.0
TcpExtTCPACKSkippedPAWS 3 0.0
TcpExtTCPACKSkippedSeq 580 0.0
TcpExtTCPACKSkippedFinWait2 0 0.0
TcpExtTCPACKSkippedTimeWait 0 0.0
TcpExtTCPACKSkippedChallenge 0 0.0
TcpExtTCPWinProbe 0 0.0
TcpExtTCPKeepAlive 2024725 0.0
TcpExtTCPMTUPFail 0 0.0
TcpExtTCPMTUPSuccess 0 0.0
TcpExtTCPDelivered 1269268682 0.0
TcpExtTCPDeliveredCE 0 0.0
TcpExtTCPAckCompressed 1028306 0.0
TcpExtTCPZeroWindowDrop 485489 0.0
TcpExtTCPRcvQDrop 0 0.0
TcpExtTCPWqueueTooBig 0 0.0
TcpExtTCPFastOpenPassiveAltKey 0 0.0
TcpExtTcpTimeoutRehash 195346 0.0
TcpExtTcpDuplicateDataRehash 53 0.0
TcpExtTCPDSACKRecvSegs 77539 0.0
TcpExtTCPDSACKIgnoredDubious 12 0.0
TcpExtTCPMigrateReqSuccess 0 0.0
TcpExtTCPMigrateReqFailure 0 0.0
TcpExtTCPPLBRehash 0 0.0
IpExtInNoRoutes 4 0.0
IpExtInTruncatedPkts 0 0.0
IpExtInMcastPkts 6 0.0
IpExtOutMcastPkts 16 0.0
IpExtInBcastPkts 0 0.0
IpExtOutBcastPkts 0 0.0
IpExtInOctets 115665856793051 0.0
IpExtOutOctets 20991618500174 0.0
IpExtInMcastOctets 306 0.0
IpExtOutMcastOctets 706 0.0
IpExtInBcastOctets 0 0.0
IpExtOutBcastOctets 0 0.0
IpExtInCsumErrors 0 0.0
IpExtInNoECTPkts 41778706183 0.0
IpExtInECT1Pkts 0 0.0
IpExtInECT0Pkts 299 0.0
IpExtInCEPkts 0 0.0
IpExtReasmOverlaps 0 0.0
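To pick the receive-memory counters out of a dump this long, a small filter helps. The sketch below is one way to do it. Which counters count as relevant to pruning is an assumption made here, not something stated in the thread, and the names `PRUNE_COUNTERS`, `parse_nstat` and `prune_view` are made up.

```python
# Assumed selection of counters related to receive-buffer pruning and drops.
PRUNE_COUNTERS = (
    "TcpExtPruneCalled", "TcpExtRcvPruned", "TcpExtOfoPruned",
    "TcpExtTCPRcvCollapsed", "TcpExtTCPOFODrop", "TcpExtTCPRcvQDrop",
    "TcpExtTCPZeroWindowDrop",
)


def parse_nstat(text):
    """Parse `nstat -az` output into a {counter_name: value} dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip the "#kernel" header and any blank or malformed lines.
        if len(parts) >= 2 and not line.startswith("#"):
            counters[parts[0]] = int(parts[1])
    return counters


def prune_view(text):
    """Return only the counters listed in PRUNE_COUNTERS, in that order."""
    counters = parse_nstat(text)
    return {name: counters[name] for name in PRUNE_COUNTERS if name in counters}


sample = """#kernel
IpInReceives 18788097214 0.0
TcpExtPruneCalled 3891 0.0
TcpExtRcvPruned 0 0.0
TcpExtOfoPruned 0 0.0
TcpExtTCPRcvCollapsed 2960 0.0
TcpExtTCPZeroWindowDrop 485489 0.0
"""
for name, value in prune_view(sample).items():
    print(f"{name:28} {value}")
```

In the first run above, TcpExtPruneCalled is 3891 and TcpExtTCPRcvCollapsed is 2960, while TcpExtRcvPruned and TcpExtOfoPruned are both 0.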
And here's the diff:
2c2
< IpInReceives 18788097214 0.0
---
> IpInReceives 18788830514 0.0
5c5
< IpForwDatagrams 11778747 0.0
---
> IpForwDatagrams 11778974 0.0
8,9c8,9
< IpInDelivers 18776248435 0.0
< IpOutRequests 14428799907 0.0
---
> IpInDelivers 18776981508 0.0
> IpOutRequests 14429439831 0.0
19,20c19,20
< IpOutTransmits 14440537900 0.0
< IcmpInMsgs 7052154 0.0
---
> IpOutTransmits 14441178051 0.0
> IcmpInMsgs 7052682 0.0
23c23
< IcmpInDestUnreachs 100237 0.0
---
> IcmpInDestUnreachs 100240 0.0
28,29c28,29
< IcmpInEchos 4035849 0.0
< IcmpInEchoReps 2915703 0.0
---
> IcmpInEchos 4035969 0.0
> IcmpInEchoReps 2916108 0.0
34c34
< IcmpOutMsgs 7061967 0.0
---
> IcmpOutMsgs 7062503 0.0
43,44c43,44
< IcmpOutEchos 2969032 0.0
< IcmpOutEchoReps 4035849 0.0
---
> IcmpOutEchos 2969448 0.0
> IcmpOutEchoReps 4035969 0.0
49,50c49,50
< IcmpMsgInType0 2915703 0.0
< IcmpMsgInType3 100237 0.0
---
> IcmpMsgInType0 2916108 0.0
> IcmpMsgInType3 100240 0.0
52c52
< IcmpMsgInType8 4035849 0.0
---
> IcmpMsgInType8 4035969 0.0
54c54
< IcmpMsgOutType0 4035849 0.0
---
> IcmpMsgOutType0 4035969 0.0
57c57
< IcmpMsgOutType8 2969032 0.0
---
> IcmpMsgOutType8 2969448 0.0
59,65c59,65
< TcpActiveOpens 7902932 0.0
< TcpPassiveOpens 1447146 0.0
< TcpAttemptFails 41605 0.0
< TcpEstabResets 1327046 0.0
< TcpInSegs 4953075475 0.0
< TcpOutSegs 2988066928 0.0
< TcpRetransSegs 656187 0.0
---
> TcpActiveOpens 7903289 0.0
> TcpPassiveOpens 1447188 0.0
> TcpAttemptFails 41607 0.0
> TcpEstabResets 1327071 0.0
> TcpInSegs 4953086019 0.0
> TcpOutSegs 2988189227 0.0
> TcpRetransSegs 656198 0.0
67c67
< TcpOutRsts 3401866 0.0
---
> TcpOutRsts 3401906 0.0
69c69
< UdpInDatagrams 13816173054 0.0
---
> UdpInDatagrams 13816895055 0.0
72c72
< UdpOutDatagrams 493212273 0.0
---
> UdpOutDatagrams 493225063 0.0
87c87
< Ip6InReceives 268280 0.0
---
> Ip6InReceives 268283 0.0
95c95
< Ip6InDelivers 85862 0.0
---
> Ip6InDelivers 85865 0.0
97c97
< Ip6OutRequests 500197 0.0
---
> Ip6OutRequests 500200 0.0
109,110c109,110
< Ip6InOctets 20173005 0.0
< Ip6OutOctets 45359068 0.0
---
> Ip6InOctets 20173232 0.0
> Ip6OutOctets 45359303 0.0
115c115
< Ip6InNoECTPkts 268280 0.0
---
> Ip6InNoECTPkts 268283 0.0
119,120c119,120
< Ip6OutTransmits 500197 0.0
< Icmp6InMsgs 23176 0.0
---
> Ip6OutTransmits 500200 0.0
> Icmp6InMsgs 23177 0.0
122c122
< Icmp6OutMsgs 437389 0.0
---
> Icmp6OutMsgs 437390 0.0
138c138
< Icmp6InNeighborAdvertisements 8720 0.0
---
> Icmp6InNeighborAdvertisements 8721 0.0
152c152
< Icmp6OutNeighborSolicits 69476 0.0
---
> Icmp6OutNeighborSolicits 69477 0.0
158c158
< Icmp6InType136 8720 0.0
---
> Icmp6InType136 8721 0.0
160c160
< Icmp6OutType135 69476 0.0
---
> Icmp6OutType135 69477 0.0
190c190
< TcpExtTW 3583160 0.0
---
> TcpExtTW 3583407 0.0
195c195
< TcpExtDelayedACKs 5133416 0.0
---
> TcpExtDelayedACKs 5133551 0.0
200,202c200,202
< TcpExtTCPHPHits 2023413 0.0
< TcpExtTCPPureAcks 53087713 0.0
< TcpExtTCPHPAcks 115340211 0.0
---
> TcpExtTCPHPHits 2023452 0.0
> TcpExtTCPPureAcks 53091101 0.0
> TcpExtTCPHPAcks 115343941 0.0
213c213
< TcpExtTCPLostRetransmit 195718 0.0
---
> TcpExtTCPLostRetransmit 195727 0.0
219,220c219,220
< TcpExtTCPTimeouts 195387 0.0
< TcpExtTCPLossProbes 110863 0.0
---
> TcpExtTCPTimeouts 195398 0.0
> TcpExtTCPLossProbes 110865 0.0
225c225
< TcpExtTCPBacklogCoalesce 252840759 0.0
---
> TcpExtTCPBacklogCoalesce 252840913 0.0
230,231c230,231
< TcpExtTCPAbortOnData 509577 0.0
< TcpExtTCPAbortOnClose 1246626 0.0
---
> TcpExtTCPAbortOnData 509579 0.0
> TcpExtTCPAbortOnClose 1246649 0.0
257c257
< TcpExtTCPRcvCoalesce 1575181380 0.0
---
> TcpExtTCPRcvCoalesce 1575181523 0.0
272c272
< TcpExtTCPAutoCorking 20400747 0.0
---
> TcpExtTCPAutoCorking 20401664 0.0
276,279c276,279
< TcpExtTCPSynRetrans 194466 0.0
< TcpExtTCPOrigDataSent 1261792021 0.0
< TcpExtTCPHystartTrainDetect 40140 0.0
< TcpExtTCPHystartTrainCwnd 4828784 0.0
---
> TcpExtTCPSynRetrans 194477 0.0
> TcpExtTCPOrigDataSent 1261911536 0.0
> TcpExtTCPHystartTrainDetect 40143 0.0
> TcpExtTCPHystartTrainCwnd 4829428 0.0
289c289
< TcpExtTCPKeepAlive 2024725 0.0
---
> TcpExtTCPKeepAlive 2024763 0.0
292c292
< TcpExtTCPDelivered 1269268682 0.0
---
> TcpExtTCPDelivered 1269388548 0.0
295c295
< TcpExtTCPZeroWindowDrop 485489 0.0
---
> TcpExtTCPZeroWindowDrop 485490 0.0
299c299
< TcpExtTcpTimeoutRehash 195346 0.0
---
> TcpExtTcpTimeoutRehash 195357 0.0
312,313c312,313
< IpExtInOctets 115665856793051 0.0
< IpExtOutOctets 20991618500174 0.0
---
> IpExtInOctets 115667256768208 0.0
> IpExtOutOctets 20993368512190 0.0
319c319
< IpExtInNoECTPkts 41778706183 0.0
---
> IpExtInNoECTPkts 41780216583 0.0
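The same comparison can be done per counter instead of per line. This minimal sketch computes deltas between two `nstat -az` snapshots; the helper names `parse_nstat` and `nstat_delta` are made up, and the sample strings contain only a few counters taken from the two runs.

```python
def parse_nstat(text):
    """Parse `nstat -az` output into a {counter_name: value} dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and not line.startswith("#"):
            counters[parts[0]] = int(parts[1])
    return counters


def nstat_delta(before_text, after_text):
    """Return {counter: after - before} for counters whose value changed."""
    before = parse_nstat(before_text)
    after = parse_nstat(after_text)
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }


before = """#kernel
TcpExtPruneCalled 3891 0.0
TcpExtTCPTimeouts 195387 0.0
TcpExtTCPZeroWindowDrop 485489 0.0
"""
after = """#kernel
TcpExtPruneCalled 3891 0.0
TcpExtTCPTimeouts 195398 0.0
TcpExtTCPZeroWindowDrop 485490 0.0
"""
print(nstat_delta(before, after))
```

For these counters the result is +11 for TcpExtTCPTimeouts and +1 for TcpExtTCPZeroWindowDrop, while TcpExtPruneCalled is unchanged. That is consistent with the diff above, where the prune counters do not appear.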
[-- Attachment #2: nstat_2_52.txt --]
[-- Type: text/plain, Size: 17718 bytes --]
#kernel
IpInReceives 18788830514 0.0
IpInHdrErrors 47522 0.0
IpInAddrErrors 0 0.0
IpForwDatagrams 11778974 0.0
IpInUnknownProtos 0 0.0
IpInDiscards 0 0.0
IpInDelivers 18776981508 0.0
IpOutRequests 14429439831 0.0
IpOutDiscards 40798 0.0
IpOutNoRoutes 0 0.0
IpReasmTimeout 0 0.0
IpReasmReqds 32952 0.0
IpReasmOKs 16476 0.0
IpReasmFails 0 0.0
IpFragOKs 0 0.0
IpFragFails 0 0.0
IpFragCreates 0 0.0
IpOutTransmits 14441178051 0.0
IcmpInMsgs 7052682 0.0
IcmpInErrors 107 0.0
IcmpInCsumErrors 0 0.0
IcmpInDestUnreachs 100240 0.0
IcmpInTimeExcds 298 0.0
IcmpInParmProbs 0 0.0
IcmpInSrcQuenchs 0 0.0
IcmpInRedirects 67 0.0
IcmpInEchos 4035969 0.0
IcmpInEchoReps 2916108 0.0
IcmpInTimestamps 0 0.0
IcmpInTimestampReps 0 0.0
IcmpInAddrMasks 0 0.0
IcmpInAddrMaskReps 0 0.0
IcmpOutMsgs 7062503 0.0
IcmpOutErrors 0 0.0
IcmpOutRateLimitGlobal 0 0.0
IcmpOutRateLimitHost 47087 0.0
IcmpOutDestUnreachs 10421 0.0
IcmpOutTimeExcds 449 0.0
IcmpOutParmProbs 0 0.0
IcmpOutSrcQuenchs 0 0.0
IcmpOutRedirects 46216 0.0
IcmpOutEchos 2969448 0.0
IcmpOutEchoReps 4035969 0.0
IcmpOutTimestamps 0 0.0
IcmpOutTimestampReps 0 0.0
IcmpOutAddrMasks 0 0.0
IcmpOutAddrMaskReps 0 0.0
IcmpMsgInType0 2916108 0.0
IcmpMsgInType3 100240 0.0
IcmpMsgInType5 67 0.0
IcmpMsgInType8 4035969 0.0
IcmpMsgInType11 298 0.0
IcmpMsgOutType0 4035969 0.0
IcmpMsgOutType3 10421 0.0
IcmpMsgOutType5 46216 0.0
IcmpMsgOutType8 2969448 0.0
IcmpMsgOutType11 449 0.0
TcpActiveOpens 7903289 0.0
TcpPassiveOpens 1447188 0.0
TcpAttemptFails 41607 0.0
TcpEstabResets 1327071 0.0
TcpInSegs 4953086019 0.0
TcpOutSegs 2988189227 0.0
TcpRetransSegs 656198 0.0
TcpInErrs 0 0.0
TcpOutRsts 3401906 0.0
TcpInCsumErrors 0 0.0
UdpInDatagrams 13816895055 0.0
UdpNoPorts 10435 0.0
UdpInErrors 2 0.0
UdpOutDatagrams 493225063 0.0
UdpRcvbufErrors 0 0.0
UdpSndbufErrors 10 0.0
UdpInCsumErrors 0 0.0
UdpIgnoredMulti 0 0.0
UdpMemErrors 0 0.0
UdpLiteInDatagrams 0 0.0
UdpLiteNoPorts 0 0.0
UdpLiteInErrors 0 0.0
UdpLiteOutDatagrams 0 0.0
UdpLiteRcvbufErrors 0 0.0
UdpLiteSndbufErrors 0 0.0
UdpLiteInCsumErrors 0 0.0
UdpLiteIgnoredMulti 0 0.0
UdpLiteMemErrors 0 0.0
Ip6InReceives 268283 0.0
Ip6InHdrErrors 0 0.0
Ip6InTooBigErrors 0 0.0
Ip6InNoRoutes 39 0.0
Ip6InAddrErrors 0 0.0
Ip6InUnknownProtos 0 0.0
Ip6InTruncatedPkts 0 0.0
Ip6InDiscards 0 0.0
Ip6InDelivers 85865 0.0
Ip6OutForwDatagrams 0 0.0
Ip6OutRequests 500200 0.0
Ip6OutDiscards 1 0.0
Ip6OutNoRoutes 591 0.0
Ip6ReasmTimeout 0 0.0
Ip6ReasmReqds 0 0.0
Ip6ReasmOKs 0 0.0
Ip6ReasmFails 0 0.0
Ip6FragOKs 0 0.0
Ip6FragFails 0 0.0
Ip6FragCreates 0 0.0
Ip6InMcastPkts 194727 0.0
Ip6OutMcastPkts 426562 0.0
Ip6InOctets 20173232 0.0
Ip6OutOctets 45359303 0.0
Ip6InMcastOctets 14345494 0.0
Ip6OutMcastOctets 39440790 0.0
Ip6InBcastOctets 0 0.0
Ip6OutBcastOctets 0 0.0
Ip6InNoECTPkts 268283 0.0
Ip6InECT1Pkts 0 0.0
Ip6InECT0Pkts 0 0.0
Ip6InCEPkts 0 0.0
Ip6OutTransmits 500200 0.0
Icmp6InMsgs 23177 0.0
Icmp6InErrors 0 0.0
Icmp6OutMsgs 437390 0.0
Icmp6OutErrors 0 0.0
Icmp6InCsumErrors 0 0.0
Icmp6OutRateLimitHost 0 0.0
Icmp6InDestUnreachs 0 0.0
Icmp6InPktTooBigs 0 0.0
Icmp6InTimeExcds 0 0.0
Icmp6InParmProblems 0 0.0
Icmp6InEchos 0 0.0
Icmp6InEchoReplies 0 0.0
Icmp6InGroupMembQueries 0 0.0
Icmp6InGroupMembResponses 0 0.0
Icmp6InGroupMembReductions 0 0.0
Icmp6InRouterSolicits 12339 0.0
Icmp6InRouterAdvertisements 0 0.0
Icmp6InNeighborSolicits 2117 0.0
Icmp6InNeighborAdvertisements 8721 0.0
Icmp6InRedirects 0 0.0
Icmp6InMLDv2Reports 0 0.0
Icmp6OutDestUnreachs 0 0.0
Icmp6OutPktTooBigs 0 0.0
Icmp6OutTimeExcds 0 0.0
Icmp6OutParmProblems 0 0.0
Icmp6OutEchos 0 0.0
Icmp6OutEchoReplies 0 0.0
Icmp6OutGroupMembQueries 0 0.0
Icmp6OutGroupMembResponses 0 0.0
Icmp6OutGroupMembReductions 0 0.0
Icmp6OutRouterSolicits 2 0.0
Icmp6OutRouterAdvertisements 0 0.0
Icmp6OutNeighborSolicits 69477 0.0
Icmp6OutNeighborAdvertisements 2116 0.0
Icmp6OutRedirects 0 0.0
Icmp6OutMLDv2Reports 365795 0.0
Icmp6InType133 12339 0.0
Icmp6InType135 2117 0.0
Icmp6InType136 8721 0.0
Icmp6OutType133 2 0.0
Icmp6OutType135 69477 0.0
Icmp6OutType136 2116 0.0
Icmp6OutType143 365795 0.0
Udp6InDatagrams 6 0.0
Udp6NoPorts 0 0.0
Udp6InErrors 0 0.0
Udp6OutDatagrams 6 0.0
Udp6RcvbufErrors 0 0.0
Udp6SndbufErrors 0 0.0
Udp6InCsumErrors 0 0.0
Udp6IgnoredMulti 0 0.0
Udp6MemErrors 0 0.0
UdpLite6InDatagrams 0 0.0
UdpLite6NoPorts 0 0.0
UdpLite6InErrors 0 0.0
UdpLite6OutDatagrams 0 0.0
UdpLite6RcvbufErrors 0 0.0
UdpLite6SndbufErrors 0 0.0
UdpLite6InCsumErrors 0 0.0
UdpLite6MemErrors 0 0.0
TcpExtSyncookiesSent 0 0.0
TcpExtSyncookiesRecv 0 0.0
TcpExtSyncookiesFailed 0 0.0
TcpExtEmbryonicRsts 3 0.0
TcpExtPruneCalled 3891 0.0
TcpExtRcvPruned 0 0.0
TcpExtOfoPruned 0 0.0
TcpExtOutOfWindowIcmps 10 0.0
TcpExtLockDroppedIcmps 178 0.0
TcpExtArpFilter 0 0.0
TcpExtTW 3583407 0.0
TcpExtTWRecycled 4217 0.0
TcpExtTWKilled 0 0.0
TcpExtPAWSActive 0 0.0
TcpExtPAWSEstab 60 0.0
TcpExtDelayedACKs 5133551 0.0
TcpExtDelayedACKLocked 3008 0.0
TcpExtDelayedACKLost 111165 0.0
TcpExtListenOverflows 0 0.0
TcpExtListenDrops 0 0.0
TcpExtTCPHPHits 2023452 0.0
TcpExtTCPPureAcks 53091101 0.0
TcpExtTCPHPAcks 115343941 0.0
TcpExtTCPRenoRecovery 0 0.0
TcpExtTCPSackRecovery 13396 0.0
TcpExtTCPSACKReneging 0 0.0
TcpExtTCPSACKReorder 202 0.0
TcpExtTCPRenoReorder 0 0.0
TcpExtTCPTSReorder 1 0.0
TcpExtTCPFullUndo 0 0.0
TcpExtTCPPartialUndo 1 0.0
TcpExtTCPDSACKUndo 252 0.0
TcpExtTCPLossUndo 85 0.0
TcpExtTCPLostRetransmit 195727 0.0
TcpExtTCPRenoFailures 0 0.0
TcpExtTCPSackFailures 3 0.0
TcpExtTCPLossFailures 0 0.0
TcpExtTCPFastRetrans 381976 0.0
TcpExtTCPSlowStartRetrans 76 0.0
TcpExtTCPTimeouts 195398 0.0
TcpExtTCPLossProbes 110865 0.0
TcpExtTCPLossProbeRecovery 773 0.0
TcpExtTCPRenoRecoveryFail 0 0.0
TcpExtTCPSackRecoveryFail 38 0.0
TcpExtTCPRcvCollapsed 2960 0.0
TcpExtTCPBacklogCoalesce 252840913 0.0
TcpExtTCPDSACKOldSent 111216 0.0
TcpExtTCPDSACKOfoSent 2 0.0
TcpExtTCPDSACKRecv 77523 0.0
TcpExtTCPDSACKOfoRecv 3 0.0
TcpExtTCPAbortOnData 509579 0.0
TcpExtTCPAbortOnClose 1246649 0.0
TcpExtTCPAbortOnMemory 0 0.0
TcpExtTCPAbortOnTimeout 32 0.0
TcpExtTCPAbortOnLinger 0 0.0
TcpExtTCPAbortFailed 0 0.0
TcpExtTCPMemoryPressures 0 0.0
TcpExtTCPMemoryPressuresChrono 0 0.0
TcpExtTCPSACKDiscard 10 0.0
TcpExtTCPDSACKIgnoredOld 2 0.0
TcpExtTCPDSACKIgnoredNoUndo 73310 0.0
TcpExtTCPSpuriousRTOs 23 0.0
TcpExtTCPMD5NotFound 0 0.0
TcpExtTCPMD5Unexpected 0 0.0
TcpExtTCPMD5Failure 0 0.0
TcpExtTCPSackShifted 45975 0.0
TcpExtTCPSackMerged 248207 0.0
TcpExtTCPSackShiftFallback 328257 0.0
TcpExtTCPBacklogDrop 0 0.0
TcpExtPFMemallocDrop 0 0.0
TcpExtTCPMinTTLDrop 0 0.0
TcpExtTCPDeferAcceptDrop 0 0.0
TcpExtIPReversePathFilter 0 0.0
TcpExtTCPTimeWaitOverflow 0 0.0
TcpExtTCPReqQFullDoCookies 0 0.0
TcpExtTCPReqQFullDrop 0 0.0
TcpExtTCPRetransFail 55 0.0
TcpExtTCPRcvCoalesce 1575181523 0.0
TcpExtTCPOFOQueue 2661363 0.0
TcpExtTCPOFODrop 0 0.0
TcpExtTCPOFOMerge 2 0.0
TcpExtTCPChallengeACK 138 0.0
TcpExtTCPSYNChallenge 87 0.0
TcpExtTCPFastOpenActive 0 0.0
TcpExtTCPFastOpenActiveFail 0 0.0
TcpExtTCPFastOpenPassive 0 0.0
TcpExtTCPFastOpenPassiveFail 0 0.0
TcpExtTCPFastOpenListenOverflow 0 0.0
TcpExtTCPFastOpenCookieReqd 0 0.0
TcpExtTCPFastOpenBlackhole 0 0.0
TcpExtTCPSpuriousRtxHostQueues 118 0.0
TcpExtBusyPollRxPackets 0 0.0
TcpExtTCPAutoCorking 20401664 0.0
TcpExtTCPFromZeroWindowAdv 3612949 0.0
TcpExtTCPToZeroWindowAdv 3612953 0.0
TcpExtTCPWantZeroWindowAdv 0 0.0
TcpExtTCPSynRetrans 194477 0.0
TcpExtTCPOrigDataSent 1261911536 0.0
TcpExtTCPHystartTrainDetect 40143 0.0
TcpExtTCPHystartTrainCwnd 4829428 0.0
TcpExtTCPHystartDelayDetect 17 0.0
TcpExtTCPHystartDelayCwnd 8489 0.0
TcpExtTCPACKSkippedSynRecv 0 0.0
TcpExtTCPACKSkippedPAWS 3 0.0
TcpExtTCPACKSkippedSeq 580 0.0
TcpExtTCPACKSkippedFinWait2 0 0.0
TcpExtTCPACKSkippedTimeWait 0 0.0
TcpExtTCPACKSkippedChallenge 0 0.0
TcpExtTCPWinProbe 0 0.0
TcpExtTCPKeepAlive 2024763 0.0
TcpExtTCPMTUPFail 0 0.0
TcpExtTCPMTUPSuccess 0 0.0
TcpExtTCPDelivered 1269388548 0.0
TcpExtTCPDeliveredCE 0 0.0
TcpExtTCPAckCompressed 1028306 0.0
TcpExtTCPZeroWindowDrop 485490 0.0
TcpExtTCPRcvQDrop 0 0.0
TcpExtTCPWqueueTooBig 0 0.0
TcpExtTCPFastOpenPassiveAltKey 0 0.0
TcpExtTcpTimeoutRehash 195357 0.0
TcpExtTcpDuplicateDataRehash 53 0.0
TcpExtTCPDSACKRecvSegs 77539 0.0
TcpExtTCPDSACKIgnoredDubious 12 0.0
TcpExtTCPMigrateReqSuccess 0 0.0
TcpExtTCPMigrateReqFailure 0 0.0
TcpExtTCPPLBRehash 0 0.0
IpExtInNoRoutes 4 0.0
IpExtInTruncatedPkts 0 0.0
IpExtInMcastPkts 6 0.0
IpExtOutMcastPkts 16 0.0
IpExtInBcastPkts 0 0.0
IpExtOutBcastPkts 0 0.0
IpExtInOctets 115667256768208 0.0
IpExtOutOctets 20993368512190 0.0
IpExtInMcastOctets 306 0.0
IpExtOutMcastOctets 706 0.0
IpExtInBcastOctets 0 0.0
IpExtOutBcastOctets 0 0.0
IpExtInCsumErrors 0 0.0
IpExtInNoECTPkts 41780216583 0.0
IpExtInECT1Pkts 0 0.0
IpExtInECT0Pkts 299 0.0
IpExtInCEPkts 0 0.0
IpExtReasmOverlaps 0 0.0
[-- Attachment #3: nstat_2_51.txt --]
[-- Type: text/plain, Size: 17718 bytes --]
#kernel
IpInReceives 18788097214 0.0
IpInHdrErrors 47522 0.0
IpInAddrErrors 0 0.0
IpForwDatagrams 11778747 0.0
IpInUnknownProtos 0 0.0
IpInDiscards 0 0.0
IpInDelivers 18776248435 0.0
IpOutRequests 14428799907 0.0
IpOutDiscards 40798 0.0
IpOutNoRoutes 0 0.0
IpReasmTimeout 0 0.0
IpReasmReqds 32952 0.0
IpReasmOKs 16476 0.0
IpReasmFails 0 0.0
IpFragOKs 0 0.0
IpFragFails 0 0.0
IpFragCreates 0 0.0
IpOutTransmits 14440537900 0.0
IcmpInMsgs 7052154 0.0
IcmpInErrors 107 0.0
IcmpInCsumErrors 0 0.0
IcmpInDestUnreachs 100237 0.0
IcmpInTimeExcds 298 0.0
IcmpInParmProbs 0 0.0
IcmpInSrcQuenchs 0 0.0
IcmpInRedirects 67 0.0
IcmpInEchos 4035849 0.0
IcmpInEchoReps 2915703 0.0
IcmpInTimestamps 0 0.0
IcmpInTimestampReps 0 0.0
IcmpInAddrMasks 0 0.0
IcmpInAddrMaskReps 0 0.0
IcmpOutMsgs 7061967 0.0
IcmpOutErrors 0 0.0
IcmpOutRateLimitGlobal 0 0.0
IcmpOutRateLimitHost 47087 0.0
IcmpOutDestUnreachs 10421 0.0
IcmpOutTimeExcds 449 0.0
IcmpOutParmProbs 0 0.0
IcmpOutSrcQuenchs 0 0.0
IcmpOutRedirects 46216 0.0
IcmpOutEchos 2969032 0.0
IcmpOutEchoReps 4035849 0.0
IcmpOutTimestamps 0 0.0
IcmpOutTimestampReps 0 0.0
IcmpOutAddrMasks 0 0.0
IcmpOutAddrMaskReps 0 0.0
IcmpMsgInType0 2915703 0.0
IcmpMsgInType3 100237 0.0
IcmpMsgInType5 67 0.0
IcmpMsgInType8 4035849 0.0
IcmpMsgInType11 298 0.0
IcmpMsgOutType0 4035849 0.0
IcmpMsgOutType3 10421 0.0
IcmpMsgOutType5 46216 0.0
IcmpMsgOutType8 2969032 0.0
IcmpMsgOutType11 449 0.0
TcpActiveOpens 7902932 0.0
TcpPassiveOpens 1447146 0.0
TcpAttemptFails 41605 0.0
TcpEstabResets 1327046 0.0
TcpInSegs 4953075475 0.0
TcpOutSegs 2988066928 0.0
TcpRetransSegs 656187 0.0
TcpInErrs 0 0.0
TcpOutRsts 3401866 0.0
TcpInCsumErrors 0 0.0
UdpInDatagrams 13816173054 0.0
UdpNoPorts 10435 0.0
UdpInErrors 2 0.0
UdpOutDatagrams 493212273 0.0
UdpRcvbufErrors 0 0.0
UdpSndbufErrors 10 0.0
UdpInCsumErrors 0 0.0
UdpIgnoredMulti 0 0.0
UdpMemErrors 0 0.0
UdpLiteInDatagrams 0 0.0
UdpLiteNoPorts 0 0.0
UdpLiteInErrors 0 0.0
UdpLiteOutDatagrams 0 0.0
UdpLiteRcvbufErrors 0 0.0
UdpLiteSndbufErrors 0 0.0
UdpLiteInCsumErrors 0 0.0
UdpLiteIgnoredMulti 0 0.0
UdpLiteMemErrors 0 0.0
Ip6InReceives 268280 0.0
Ip6InHdrErrors 0 0.0
Ip6InTooBigErrors 0 0.0
Ip6InNoRoutes 39 0.0
Ip6InAddrErrors 0 0.0
Ip6InUnknownProtos 0 0.0
Ip6InTruncatedPkts 0 0.0
Ip6InDiscards 0 0.0
Ip6InDelivers 85862 0.0
Ip6OutForwDatagrams 0 0.0
Ip6OutRequests 500197 0.0
Ip6OutDiscards 1 0.0
Ip6OutNoRoutes 591 0.0
Ip6ReasmTimeout 0 0.0
Ip6ReasmReqds 0 0.0
Ip6ReasmOKs 0 0.0
Ip6ReasmFails 0 0.0
Ip6FragOKs 0 0.0
Ip6FragFails 0 0.0
Ip6FragCreates 0 0.0
Ip6InMcastPkts 194727 0.0
Ip6OutMcastPkts 426562 0.0
Ip6InOctets 20173005 0.0
Ip6OutOctets 45359068 0.0
Ip6InMcastOctets 14345494 0.0
Ip6OutMcastOctets 39440790 0.0
Ip6InBcastOctets 0 0.0
Ip6OutBcastOctets 0 0.0
Ip6InNoECTPkts 268280 0.0
Ip6InECT1Pkts 0 0.0
Ip6InECT0Pkts 0 0.0
Ip6InCEPkts 0 0.0
Ip6OutTransmits 500197 0.0
Icmp6InMsgs 23176 0.0
Icmp6InErrors 0 0.0
Icmp6OutMsgs 437389 0.0
Icmp6OutErrors 0 0.0
Icmp6InCsumErrors 0 0.0
Icmp6OutRateLimitHost 0 0.0
Icmp6InDestUnreachs 0 0.0
Icmp6InPktTooBigs 0 0.0
Icmp6InTimeExcds 0 0.0
Icmp6InParmProblems 0 0.0
Icmp6InEchos 0 0.0
Icmp6InEchoReplies 0 0.0
Icmp6InGroupMembQueries 0 0.0
Icmp6InGroupMembResponses 0 0.0
Icmp6InGroupMembReductions 0 0.0
Icmp6InRouterSolicits 12339 0.0
Icmp6InRouterAdvertisements 0 0.0
Icmp6InNeighborSolicits 2117 0.0
Icmp6InNeighborAdvertisements 8720 0.0
Icmp6InRedirects 0 0.0
Icmp6InMLDv2Reports 0 0.0
Icmp6OutDestUnreachs 0 0.0
Icmp6OutPktTooBigs 0 0.0
Icmp6OutTimeExcds 0 0.0
Icmp6OutParmProblems 0 0.0
Icmp6OutEchos 0 0.0
Icmp6OutEchoReplies 0 0.0
Icmp6OutGroupMembQueries 0 0.0
Icmp6OutGroupMembResponses 0 0.0
Icmp6OutGroupMembReductions 0 0.0
Icmp6OutRouterSolicits 2 0.0
Icmp6OutRouterAdvertisements 0 0.0
Icmp6OutNeighborSolicits 69476 0.0
Icmp6OutNeighborAdvertisements 2116 0.0
Icmp6OutRedirects 0 0.0
Icmp6OutMLDv2Reports 365795 0.0
Icmp6InType133 12339 0.0
Icmp6InType135 2117 0.0
Icmp6InType136 8720 0.0
Icmp6OutType133 2 0.0
Icmp6OutType135 69476 0.0
Icmp6OutType136 2116 0.0
Icmp6OutType143 365795 0.0
Udp6InDatagrams 6 0.0
Udp6NoPorts 0 0.0
Udp6InErrors 0 0.0
Udp6OutDatagrams 6 0.0
Udp6RcvbufErrors 0 0.0
Udp6SndbufErrors 0 0.0
Udp6InCsumErrors 0 0.0
Udp6IgnoredMulti 0 0.0
Udp6MemErrors 0 0.0
UdpLite6InDatagrams 0 0.0
UdpLite6NoPorts 0 0.0
UdpLite6InErrors 0 0.0
UdpLite6OutDatagrams 0 0.0
UdpLite6RcvbufErrors 0 0.0
UdpLite6SndbufErrors 0 0.0
UdpLite6InCsumErrors 0 0.0
UdpLite6MemErrors 0 0.0
TcpExtSyncookiesSent 0 0.0
TcpExtSyncookiesRecv 0 0.0
TcpExtSyncookiesFailed 0 0.0
TcpExtEmbryonicRsts 3 0.0
TcpExtPruneCalled 3891 0.0
TcpExtRcvPruned 0 0.0
TcpExtOfoPruned 0 0.0
TcpExtOutOfWindowIcmps 10 0.0
TcpExtLockDroppedIcmps 178 0.0
TcpExtArpFilter 0 0.0
TcpExtTW 3583160 0.0
TcpExtTWRecycled 4217 0.0
TcpExtTWKilled 0 0.0
TcpExtPAWSActive 0 0.0
TcpExtPAWSEstab 60 0.0
TcpExtDelayedACKs 5133416 0.0
TcpExtDelayedACKLocked 3008 0.0
TcpExtDelayedACKLost 111165 0.0
TcpExtListenOverflows 0 0.0
TcpExtListenDrops 0 0.0
TcpExtTCPHPHits 2023413 0.0
TcpExtTCPPureAcks 53087713 0.0
TcpExtTCPHPAcks 115340211 0.0
TcpExtTCPRenoRecovery 0 0.0
TcpExtTCPSackRecovery 13396 0.0
TcpExtTCPSACKReneging 0 0.0
TcpExtTCPSACKReorder 202 0.0
TcpExtTCPRenoReorder 0 0.0
TcpExtTCPTSReorder 1 0.0
TcpExtTCPFullUndo 0 0.0
TcpExtTCPPartialUndo 1 0.0
TcpExtTCPDSACKUndo 252 0.0
TcpExtTCPLossUndo 85 0.0
TcpExtTCPLostRetransmit 195718 0.0
TcpExtTCPRenoFailures 0 0.0
TcpExtTCPSackFailures 3 0.0
TcpExtTCPLossFailures 0 0.0
TcpExtTCPFastRetrans 381976 0.0
TcpExtTCPSlowStartRetrans 76 0.0
TcpExtTCPTimeouts 195387 0.0
TcpExtTCPLossProbes 110863 0.0
TcpExtTCPLossProbeRecovery 773 0.0
TcpExtTCPRenoRecoveryFail 0 0.0
TcpExtTCPSackRecoveryFail 38 0.0
TcpExtTCPRcvCollapsed 2960 0.0
TcpExtTCPBacklogCoalesce 252840759 0.0
TcpExtTCPDSACKOldSent 111216 0.0
TcpExtTCPDSACKOfoSent 2 0.0
TcpExtTCPDSACKRecv 77523 0.0
TcpExtTCPDSACKOfoRecv 3 0.0
TcpExtTCPAbortOnData 509577 0.0
TcpExtTCPAbortOnClose 1246626 0.0
TcpExtTCPAbortOnMemory 0 0.0
TcpExtTCPAbortOnTimeout 32 0.0
TcpExtTCPAbortOnLinger 0 0.0
TcpExtTCPAbortFailed 0 0.0
TcpExtTCPMemoryPressures 0 0.0
TcpExtTCPMemoryPressuresChrono 0 0.0
TcpExtTCPSACKDiscard 10 0.0
TcpExtTCPDSACKIgnoredOld 2 0.0
TcpExtTCPDSACKIgnoredNoUndo 73310 0.0
TcpExtTCPSpuriousRTOs 23 0.0
TcpExtTCPMD5NotFound 0 0.0
TcpExtTCPMD5Unexpected 0 0.0
TcpExtTCPMD5Failure 0 0.0
TcpExtTCPSackShifted 45975 0.0
TcpExtTCPSackMerged 248207 0.0
TcpExtTCPSackShiftFallback 328257 0.0
TcpExtTCPBacklogDrop 0 0.0
TcpExtPFMemallocDrop 0 0.0
TcpExtTCPMinTTLDrop 0 0.0
TcpExtTCPDeferAcceptDrop 0 0.0
TcpExtIPReversePathFilter 0 0.0
TcpExtTCPTimeWaitOverflow 0 0.0
TcpExtTCPReqQFullDoCookies 0 0.0
TcpExtTCPReqQFullDrop 0 0.0
TcpExtTCPRetransFail 55 0.0
TcpExtTCPRcvCoalesce 1575181380 0.0
TcpExtTCPOFOQueue 2661363 0.0
TcpExtTCPOFODrop 0 0.0
TcpExtTCPOFOMerge 2 0.0
TcpExtTCPChallengeACK 138 0.0
TcpExtTCPSYNChallenge 87 0.0
TcpExtTCPFastOpenActive 0 0.0
TcpExtTCPFastOpenActiveFail 0 0.0
TcpExtTCPFastOpenPassive 0 0.0
TcpExtTCPFastOpenPassiveFail 0 0.0
TcpExtTCPFastOpenListenOverflow 0 0.0
TcpExtTCPFastOpenCookieReqd 0 0.0
TcpExtTCPFastOpenBlackhole 0 0.0
TcpExtTCPSpuriousRtxHostQueues 118 0.0
TcpExtBusyPollRxPackets 0 0.0
TcpExtTCPAutoCorking 20400747 0.0
TcpExtTCPFromZeroWindowAdv 3612949 0.0
TcpExtTCPToZeroWindowAdv 3612953 0.0
TcpExtTCPWantZeroWindowAdv 0 0.0
TcpExtTCPSynRetrans 194466 0.0
TcpExtTCPOrigDataSent 1261792021 0.0
TcpExtTCPHystartTrainDetect 40140 0.0
TcpExtTCPHystartTrainCwnd 4828784 0.0
TcpExtTCPHystartDelayDetect 17 0.0
TcpExtTCPHystartDelayCwnd 8489 0.0
TcpExtTCPACKSkippedSynRecv 0 0.0
TcpExtTCPACKSkippedPAWS 3 0.0
TcpExtTCPACKSkippedSeq 580 0.0
TcpExtTCPACKSkippedFinWait2 0 0.0
TcpExtTCPACKSkippedTimeWait 0 0.0
TcpExtTCPACKSkippedChallenge 0 0.0
TcpExtTCPWinProbe 0 0.0
TcpExtTCPKeepAlive 2024725 0.0
TcpExtTCPMTUPFail 0 0.0
TcpExtTCPMTUPSuccess 0 0.0
TcpExtTCPDelivered 1269268682 0.0
TcpExtTCPDeliveredCE 0 0.0
TcpExtTCPAckCompressed 1028306 0.0
TcpExtTCPZeroWindowDrop 485489 0.0
TcpExtTCPRcvQDrop 0 0.0
TcpExtTCPWqueueTooBig 0 0.0
TcpExtTCPFastOpenPassiveAltKey 0 0.0
TcpExtTcpTimeoutRehash 195346 0.0
TcpExtTcpDuplicateDataRehash 53 0.0
TcpExtTCPDSACKRecvSegs 77539 0.0
TcpExtTCPDSACKIgnoredDubious 12 0.0
TcpExtTCPMigrateReqSuccess 0 0.0
TcpExtTCPMigrateReqFailure 0 0.0
TcpExtTCPPLBRehash 0 0.0
IpExtInNoRoutes 4 0.0
IpExtInTruncatedPkts 0 0.0
IpExtInMcastPkts 6 0.0
IpExtOutMcastPkts 16 0.0
IpExtInBcastPkts 0 0.0
IpExtOutBcastPkts 0 0.0
IpExtInOctets 115665856793051 0.0
IpExtOutOctets 20991618500174 0.0
IpExtInMcastOctets 306 0.0
IpExtOutMcastOctets 706 0.0
IpExtInBcastOctets 0 0.0
IpExtOutBcastOctets 0 0.0
IpExtInCsumErrors 0 0.0
IpExtInNoECTPkts 41778706183 0.0
IpExtInECT1Pkts 0 0.0
IpExtInECT0Pkts 299 0.0
IpExtInCEPkts 0 0.0
IpExtReasmOverlaps 0 0.0
* Re: [EXT] Re: tcp: socket stuck with zero receive window after SACK
2025-05-19 15:03 ` [EXT] " Simon Campion
@ 2025-05-21 3:04 ` Neal Cardwell
2025-05-21 15:08 ` Simon Campion
2025-05-21 15:29 ` Simon Campion
0 siblings, 2 replies; 11+ messages in thread
From: Neal Cardwell @ 2025-05-21 3:04 UTC (permalink / raw)
To: Simon Campion; +Cc: netdev, Eric Dumazet, Yuchung Cheng, Kevin Yang, Jon Maloy
cc += Jon Maloy <jmaloy@redhat.com>
On Mon, May 19, 2025 at 11:03 AM Simon Campion <simon.campion@deepl.com> wrote:
>
> Gladly! I attached the output of nstat -az. I ran it twice, right
> before a 602 byte retransmit was received and dropped, and right
> after, in case looking at the diff is helpful.
Thanks, Simon, for the data!
Skimming the data and the code for your kernel (6.6.83), I have a theory:
In your nstat data, we see TcpExtTCPZeroWindowDrop is incremented by 1
when the 602 byte retransmit was received and dropped:
> < TcpExtTCPZeroWindowDrop 485489 0.0
> ---
> > TcpExtTCPZeroWindowDrop 485490 0.0
That SNMP stat (corresponding to the SKB_DROP_REASON_TCP_ZEROWINDOW
drop reason Simon mentioned earlier) is incremented by
tcp_data_queue() when an in-order packet arrives and
tcp_receive_window(tp) == 0, and the packet is dropped.
But, critically, tcp_data_queue() in that code path does not call
tcp_try_rmem_schedule() to try to free up memory.
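The counter comparison above can be reproduced from two saved
"nstat -az" snapshots. A minimal sketch, assuming each snapshot is
plain text with "name value rate" lines as in the attachments; the
function names parse_nstat and diff_nstat are illustrative, not part
of any tool:

```python
def parse_nstat(text):
    """Parse `nstat -az` text output into {counter_name: int_value}."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the "#kernel" header and anything that is not "name value rate".
        if len(fields) != 3 or fields[0].startswith("#"):
            continue
        try:
            counters[fields[0]] = int(fields[1])
        except ValueError:
            continue
    return counters


def diff_nstat(before_text, after_text):
    """Return {name: delta} for counters whose value changed between snapshots."""
    before = parse_nstat(before_text)
    after = parse_nstat(after_text)
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }


if __name__ == "__main__":
    # Two-line excerpts of the snapshots; real input would be the full files.
    before = "#kernel\nTcpExtTCPZeroWindowDrop 485489 0.0\nTcpExtTCPRcvQDrop 0 0.0\n"
    after = "#kernel\nTcpExtTCPZeroWindowDrop 485490 0.0\nTcpExtTCPRcvQDrop 0 0.0\n"
    for name, delta in sorted(diff_nstat(before, after).items()):
        print(f"{name} {delta:+d}")
```

Run on the full snapshots, this lists every counter that moved between
the two captures, which makes it easier to see what else changed
around the dropped retransmit.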
Why is tcp_receive_window(tp) == 0 in this case? A conjecture:
(a) I bet the machine was probably under memory pressure earlier,
triggering ICSK_ACK_NOMEM
(b) We can see your kernel 6.6.83 has a backport of the recent bug fix
patch that sets tp->rcv_wnd = 0 upon ICSK_ACK_NOMEM events:
commit b01e7ceb35dcb7ffad413da657b78c3340a09039
Author: Jon Maloy <jmaloy@redhat.com>
Date: Mon Jan 27 18:13:04 2025 -0500
tcp: correct handling of extreme memory squeeze
[ Upstream commit 8c670bdfa58e48abad1d5b6ca1ee843ca91f7303 ]
...
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index cfddc94508f0b..3771ed22c2f56 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -263,11 +263,14 @@ static u16 tcp_select_window(struct sock *sk)
u32 cur_win, new_win;
/* Make the window 0 if we failed to queue the data because we
- * are out of memory. The window is temporary, so we don't store
- * it on the socket.
+ * are out of memory.
*/
- if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
+ if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
+ tp->pred_flags = 0;
+ tp->rcv_wnd = 0;
+ tp->rcv_wup = tp->rcv_nxt;
return 0;
+ }
---
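To make the effect of those three assignments concrete, here is a
small worked example. It assumes tcp_receive_window() computes
rcv_wup + rcv_wnd - rcv_nxt in 32-bit arithmetic and clamps negative
results to zero; the sequence numbers are hypothetical and the Python
function is only a model of that arithmetic, not kernel code:

```python
def tcp_receive_window(rcv_wup, rcv_wnd, rcv_nxt):
    """Bytes the receiver is still willing to accept, clamped at zero.

    Models win = rcv_wup + rcv_wnd - rcv_nxt with 32-bit wraparound, where a
    result that is negative as a signed 32-bit value is treated as 0.
    """
    win = (rcv_wup + rcv_wnd - rcv_nxt) & 0xFFFFFFFF
    return 0 if win >= 0x80000000 else win


# Hypothetical state before the ICSK_ACK_NOMEM branch runs.
rcv_wup, rcv_wnd, rcv_nxt = 1000, 65536, 1500
before = tcp_receive_window(rcv_wup, rcv_wnd, rcv_nxt)

# The assignments from the patched branch (pred_flags is not modelled).
rcv_wnd = 0
rcv_wup = rcv_nxt
after = tcp_receive_window(rcv_wup, rcv_wnd, rcv_nxt)

print(before, after)  # prints "65036 0"
```

Once the branch has run, the computed receive window is exactly zero
regardless of how much window had been offered before.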
Putting this all together, a conjecture about what happened:
+ the machine was under memory pressure, so triggered ICSK_ACK_NOMEM
+ this caused the new "tcp: correct handling of extreme memory
squeeze" patch to set tp->rcv_wnd = 0
+ this caused tcp_data_queue() to see the in-order packet arrive with
tcp_receive_window(tp) == 0 and drop it, incrementing
TcpExtTCPZeroWindowDrop
+ tcp_data_queue() in that code path does not call
tcp_try_rmem_schedule() to try to free up memory
+ so even if more memory was available at this point,
tcp_try_rmem_schedule() is not called, because of the new "tcp:
correct handling of extreme memory squeeze" patch
I suppose one possible fix would be to change tcp_data_queue() in that
(tcp_receive_window(tp) == 0) case, to make sure it calls
tcp_try_rmem_schedule() to try to free up memory.
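The conjecture and the possible fix can be illustrated with a toy
model. This is a rough sketch under loud assumptions: ToyReceiver,
prune_ooo and the byte counts are all hypothetical, pruning is
modelled as simply discarding out-of-order bytes, and none of this is
the kernel's actual logic:

```python
class ToyReceiver:
    """Very rough model of the stuck state; not kernel code."""

    def __init__(self, rcvbuf, ooo_bytes):
        self.rcvbuf = rcvbuf        # receive buffer budget, bytes
        self.ooo_bytes = ooo_bytes  # SACKed out-of-order data held in the buffer
        self.rcv_wnd = 0            # window already forced to zero
        self.zero_window_drops = 0

    def free_space(self):
        return max(self.rcvbuf - self.ooo_bytes, 0)

    def prune_ooo(self, need):
        """Hypothetical stand-in for tcp_try_rmem_schedule(): discard
        out-of-order bytes until `need` bytes are free."""
        shortfall = need - self.free_space()
        if shortfall > 0:
            self.ooo_bytes -= min(self.ooo_bytes, shortfall)

    def receive_in_order(self, length, prune_on_zero_window):
        """Return True if the segment is queued, False if it is dropped."""
        if self.rcv_wnd == 0:
            self.zero_window_drops += 1
            if prune_on_zero_window:
                self.prune_ooo(length)
            # The ACK answering the dropped segment advertises what is free now.
            self.rcv_wnd = self.free_space()
            return False
        self.rcv_wnd -= min(self.rcv_wnd, length)
        return True


def attempts_until_accepted(receiver, length, max_attempts, prune_on_zero_window):
    """Retransmit the same segment; return the attempt that succeeds, or None."""
    for attempt in range(1, max_attempts + 1):
        if receiver.receive_in_order(length, prune_on_zero_window):
            return attempt
    return None


if __name__ == "__main__":
    stuck = ToyReceiver(rcvbuf=41586, ooo_bytes=41586)
    print(attempts_until_accepted(stuck, 602, 5, prune_on_zero_window=False))
    fixed = ToyReceiver(rcvbuf=41586, ooo_bytes=41586)
    print(attempts_until_accepted(fixed, 602, 5, prune_on_zero_window=True))
```

In this model the receiver that never prunes drops every retransmit,
while the one that prunes on the zero-window drop reopens the window
and accepts the next retransmit. Whether the real fix should work this
way is exactly the open question.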
Eric and Jon, WDYT?
It's a bit past my bedtime here in NYC so I may not be thinking straight.... :-)
thanks,
neal
* Re: Re: [EXT] Re: tcp: socket stuck with zero receive window after SACK
2025-05-21 3:04 ` Neal Cardwell
@ 2025-05-21 15:08 ` Simon Campion
2025-05-21 15:56 ` Neal Cardwell
2025-05-21 15:29 ` Simon Campion
1 sibling, 1 reply; 11+ messages in thread
From: Simon Campion @ 2025-05-21 15:08 UTC (permalink / raw)
To: Neal Cardwell; +Cc: netdev, Eric Dumazet, Yuchung Cheng, Kevin Yang, Jon Maloy
Great to hear we have a potential lead to investigate!
We've now seen this problem occur several times on multiple different
nodes. We tried two workarounds, without success:
* As far as we see, the patch Neal mentioned was included in the
6.6.76 release. We rolled back some nodes to an earlier Flatcar image
with kernel 6.6.74. But we saw the issue occur on 6.6.74 as well.
* We disabled SACK on the nodes with broken connections (not on the
nodes they connect to). The problem occurs in the absence of SACK as
well:
05:59:05.706056 eth1b Out IP 10.70.3.80.57136 > 10.70.3.46.6920: Flags
[P.], seq 306:315, ack 1, win 0, options [nop,nop,TS val 2554169028
ecr 1041911222], length 9
05:59:05.706142 eth1b In IP 10.70.3.46.6920 > 10.70.3.80.57136: Flags
[.], ack 315, win 501, options [nop,nop,TS val 1041916342 ecr
2554169028], length 0
05:59:07.846543 eth1b In IP 10.70.3.46.6920 > 10.70.3.80.57136: Flags
[.], seq 1:609, ack 315, win 501, options [nop,nop,TS val 1041918483
ecr 2554169028], length 608
05:59:07.846569 eth1b Out IP 10.70.3.80.57136 > 10.70.3.46.6920: Flags
[.], ack 1, win 0, options [nop,nop,TS val 2554171168 ecr 1041918483],
length 0
05:59:10.826079 eth1b Out IP 10.70.3.80.57136 > 10.70.3.46.6920: Flags
[P.], seq 315:324, ack 1, win 0, options [nop,nop,TS val 2554174148
ecr 1041918483], length 9
05:59:10.826205 eth1b In IP 10.70.3.46.6920 > 10.70.3.80.57136: Flags
[.], ack 324, win 501, options [nop,nop,TS val 1041921462 ecr
2554174148], length 0
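To spot the same pattern in a longer capture, a sketch along these
lines might help. It assumes tcpdump text output from "-i any" that
has been wrapped across lines as in the excerpt above; the function
names and the regular expressions are illustrative only:

```python
import re

# A record starts with a tcpdump timestamp such as "05:59:05.706056".
RECORD_START = re.compile(r"^\d{2}:\d{2}:\d{2}\.\d{6}\s")
FIELDS = re.compile(
    r"^(?P<ts>\S+)\s+(?P<ifname>\S+)\s+(?P<dir>In|Out)\s+IP\s"
    r".*?\bwin (?P<win>\d+),.*\blength (?P<length>\d+)"
)


def join_records(text):
    """Re-join records that were wrapped over several lines."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def zero_window_segments(text):
    """Return (timestamp, payload_length) for outgoing segments with win 0."""
    hits = []
    for record in join_records(text):
        m = FIELDS.match(record)
        if m and m.group("dir") == "Out" and int(m.group("win")) == 0:
            hits.append((m.group("ts"), int(m.group("length"))))
    return hits
```

A long run of outgoing "win 0" segments while the peer keeps
retransmitting the same range is the signature we are looking for.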
Another important piece of information (which I should've included in
my first message!): we set net.ipv4.tcp_shrink_window=1. We disabled
it to check whether this will avoid the issue.
Thanks for all your help!
Simon
* Re: [EXT] Re: tcp: socket stuck with zero receive window after SACK
2025-05-21 3:04 ` Neal Cardwell
2025-05-21 15:08 ` Simon Campion
@ 2025-05-21 15:29 ` Simon Campion
1 sibling, 0 replies; 11+ messages in thread
From: Simon Campion @ 2025-05-21 15:29 UTC (permalink / raw)
To: Neal Cardwell; +Cc: netdev, Eric Dumazet, Yuchung Cheng, Kevin Yang, Jon Maloy
(Sorry---resending with different subject to hopefully get this into
the correct thread.)
Great to hear we have a potential lead to investigate!
We've now seen this problem occur several times on multiple different
nodes. We tried two workarounds, without success:
* As far as we see, the patch Neal mentioned was included in the
6.6.76 release. We rolled back some nodes to an earlier Flatcar image
with kernel 6.6.74, but we saw the issue occur on 6.6.74 as well.
* We disabled SACK on the Ceph client nodes. But the problem occurs in
the absence of SACK as well:
05:59:05.706056 eth1b Out IP 10.70.3.80.57136 > 10.70.3.46.6920: Flags
[P.], seq 306:315, ack 1, win 0, options [nop,nop,TS val 2554169028
ecr 1041911222], length 9
05:59:05.706142 eth1b In IP 10.70.3.46.6920 > 10.70.3.80.57136: Flags
[.], ack 315, win 501, options [nop,nop,TS val 1041916342 ecr
2554169028], length 0
05:59:07.846543 eth1b In IP 10.70.3.46.6920 > 10.70.3.80.57136: Flags
[.], seq 1:609, ack 315, win 501, options [nop,nop,TS val 1041918483
ecr 2554169028], length 608
05:59:07.846569 eth1b Out IP 10.70.3.80.57136 > 10.70.3.46.6920: Flags
[.], ack 1, win 0, options [nop,nop,TS val 2554171168 ecr 1041918483],
length 0
05:59:10.826079 eth1b Out IP 10.70.3.80.57136 > 10.70.3.46.6920: Flags
[P.], seq 315:324, ack 1, win 0, options [nop,nop,TS val 2554174148
ecr 1041918483], length 9
05:59:10.826205 eth1b In IP 10.70.3.46.6920 > 10.70.3.80.57136: Flags
[.], ack 324, win 501, options [nop,nop,TS val 1041921462 ecr
2554174148], length 0
Another important piece of information (which I should've included in
my first message!): we set net.ipv4.tcp_shrink_window=1. To test
whether this setting triggers the issue, we disabled it. We will
report back whether this appears to fix it or not.
* Re: Re: [EXT] Re: tcp: socket stuck with zero receive window after SACK
2025-05-21 15:08 ` Simon Campion
@ 2025-05-21 15:56 ` Neal Cardwell
2025-05-22 10:34 ` Simon Campion
0 siblings, 1 reply; 11+ messages in thread
From: Neal Cardwell @ 2025-05-21 15:56 UTC (permalink / raw)
To: Simon Campion; +Cc: netdev, Eric Dumazet, Yuchung Cheng, Kevin Yang, Jon Maloy
On Wed, May 21, 2025 at 11:08 AM Simon Campion <simon.campion@deepl.com> wrote:
>
> Great to hear we have a potential lead to investigate!
>
> We've now seen this problem occur several times on multiple different
> nodes. We tried two workarounds, without success:
> * As far as we see, the patch Neal mentioned was included in the
> 6.6.76 release. We rolled back some nodes to an earlier Flatcar image
> with kernel 6.6.74. But we saw the issue occur on 6.6.74 as well.
For clarity, it sounds like you observed the issue occur on 6.6.74
with net.ipv4.tcp_shrink_window=1. Is that correct?
> * We disabled SACK on the nodes with broken connections (not on the
> nodes they connect to). The problem occurs in the absence of SACK as
> well:
> 05:59:05.706056 eth1b Out IP 10.70.3.80.57136 > 10.70.3.46.6920: Flags
> [P.], seq 306:315, ack 1, win 0, options [nop,nop,TS val 2554169028
> ecr 1041911222], length 9
> 05:59:05.706142 eth1b In IP 10.70.3.46.6920 > 10.70.3.80.57136: Flags
> [.], ack 315, win 501, options [nop,nop,TS val 1041916342 ecr
> 2554169028], length 0
> 05:59:07.846543 eth1b In IP 10.70.3.46.6920 > 10.70.3.80.57136: Flags
> [.], seq 1:609, ack 315, win 501, options [nop,nop,TS val 1041918483
> ecr 2554169028], length 608
> 05:59:07.846569 eth1b Out IP 10.70.3.80.57136 > 10.70.3.46.6920: Flags
> [.], ack 1, win 0, options [nop,nop,TS val 2554171168 ecr 1041918483],
> length 0
> 05:59:10.826079 eth1b Out IP 10.70.3.80.57136 > 10.70.3.46.6920: Flags
> [P.], seq 315:324, ack 1, win 0, options [nop,nop,TS val 2554174148
> ecr 1041918483], length 9
> 05:59:10.826205 eth1b In IP 10.70.3.46.6920 > 10.70.3.80.57136: Flags
> [.], ack 324, win 501, options [nop,nop,TS val 1041921462 ecr
> 2554174148], length 0
>
> Another important piece of information (which I should've included in
> my first message!): we set net.ipv4.tcp_shrink_window=1. We disabled
> it to check whether this will avoid the issue.
Oh! That's a big deal. For my education, why do you set
net.ipv4.tcp_shrink_window=1?
If at all feasible, I *strongly* recommend running with
net.ipv4.tcp_shrink_window=0, since this is the default setting, and I
imagine net.ipv4.tcp_shrink_window=1 could run into all sorts of weird
problems like this. :-)
It would be super useful if you can share results for whether you see
this problem with:
(a) 6.6.74 (before "tcp: correct handling of extreme memory squeeze")
with net.ipv4.tcp_shrink_window=0
(b) 6.6.83 (after "tcp: correct handling of extreme memory squeeze")
with net.ipv4.tcp_shrink_window=0
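When collecting these data points across nodes, it may help to record
the kernel release and the sysctl value together. A minimal sketch,
assuming the setting is exposed at the usual /proc/sys path; the
helper names are hypothetical:

```python
import platform
from pathlib import Path

# Assumed location of the sysctl; absent on kernels without the setting.
SHRINK_WINDOW_PATH = Path("/proc/sys/net/ipv4/tcp_shrink_window")


def read_sysctl_int(path):
    """Return the integer stored in a /proc/sys file, or None if unreadable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def describe_test_config(path=SHRINK_WINDOW_PATH):
    """Format 'kernel release + sysctl value' for a test report."""
    value = read_sysctl_int(path)
    shown = "unavailable" if value is None else str(value)
    return f"{platform.release()} + net.ipv4.tcp_shrink_window={shown}"


if __name__ == "__main__":
    print(describe_test_config())
```

Logging this line next to each observation keeps the kernel version
and the setting unambiguous for every node.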
Thanks,
neal
* Re: Re: [EXT] Re: tcp: socket stuck with zero receive window after SACK
2025-05-21 15:56 ` Neal Cardwell
@ 2025-05-22 10:34 ` Simon Campion
2025-05-22 13:02 ` Neal Cardwell
0 siblings, 1 reply; 11+ messages in thread
From: Simon Campion @ 2025-05-22 10:34 UTC (permalink / raw)
To: Neal Cardwell; +Cc: netdev, Eric Dumazet, Yuchung Cheng, Kevin Yang, Jon Maloy
On Wed, 21 May 2025 at 17:56, Neal Cardwell <ncardwell@google.com> wrote:
> For my education, why do you set net.ipv4.tcp_shrink_window=1?
We enabled it mainly as an attempt to decrease the frequency of a
different issue in which jumbo frames were dropped indefinitely on a
host, presumably after memory pressure, discussed in [1]. The jumbo
frame issue is most likely triggered by system-wide memory pressure
rather than hitting net.ipv4.tcp_mem. So,
net.ipv4.tcp_shrink_window=1, which, as far as we understand, makes
hitting net.ipv4.tcp_mem less likely, probably didn't help with
decreasing the frequency of the jumbo frame issue. But the issue had
sufficiently serious impact and we were sufficiently unsure about the
root cause that we deemed net.ipv4.tcp_shrink_window=1 worth a try.
(Also, the rationale behind net.ipv4.tcp_shrink_window=1 laid out in
[2] and [3] sounded reasonable.)
But yes, it's feasible for us to revert to the default
net.ipv4.tcp_shrink_window=0, in particular because there's another
workaround for the jumbo frame issue: reduce the MTU. We set
net.ipv4.tcp_shrink_window=0 yesterday and haven't seen the issue
since. So:
6.6.74 + net.ipv4.tcp_shrink_window=1: issue occurs
6.6.83 + net.ipv4.tcp_shrink_window=1: issue occurs
6.6.74 + net.ipv4.tcp_shrink_window=0: no issue so far
6.6.83 + net.ipv4.tcp_shrink_window=0: no issue so far
Since the issue occurred sporadically, it's too soon to be fully
confident that it's gone with net.ipv4.tcp_shrink_window=0. We'll
write again in a week or so to confirm.
If net.ipv4.tcp_shrink_window=1 seems to have caused this issue, we'd
still be curious to understand why it leads to TCP connections being
stuck indefinitely even though the recv-q (as reported by ss) is 0.
Assuming the recv-q was indeed correctly reported as 0, the issue
might be that receive buffers can fill up in a way so that the only
way for data to leave the receive buffer is receipt of further data.
In particular, the application can't read data out of the receive
buffer and empty it that way. Maybe filling up buffers with data
received out-of-order (whether we SACK it or not) satisfies this
condition. This would explain why we saw this issue only in the
presence of SACK flags before we disabled SACK. With
net.ipv4.tcp_shrink_window=1, a full receive buffer leads to a zero
window being advertised (see [2]) and if the buffer filled up in a way
so that no data can leave until further data is received, we are stuck
forever because the kernel drops incoming data due to the zero window.
In contrast, with net.ipv4.tcp_shrink_window=0, we will keep advertising a
non-zero window, so incoming data isn't dropped and we can have data
leave the receive buffer. I'm speculating here; once we confirm that
the issue seems to have been triggered by
net.ipv4.tcp_shrink_window=1, I'd be keen to hear other thoughts as to
why the setting may have this effect in certain environments.
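The speculation above can be written down as a toy model. It is only
a sketch of the hypothesis, not of the kernel's window selection: it
assumes Recv-Q counts only in-order unread data, that an in-order
segment is dropped only when the window is exactly zero, and that
with the setting disabled the window is never retracted. All names
and the promised window of 602 bytes are hypothetical; the 41586
bytes are the SACKed range 603:42189 from the capture.

```python
from dataclasses import dataclass


@dataclass
class ToyBuffer:
    capacity: int         # receive buffer budget, bytes (hypothetical)
    in_order_unread: int  # data the application could read (assumed Recv-Q)
    out_of_order: int     # data parked behind a hole, unreadable until it fills

    @property
    def free_space(self):
        return max(self.capacity - self.in_order_unread - self.out_of_order, 0)


def advertised_window(buf, promised_window, shrink_window):
    """Toy window selection under the hypothesis in the text.

    shrink_window=True : offer only what currently fits, even if that is 0.
    shrink_window=False: never retract what was already promised to the peer.
    """
    if shrink_window:
        return buf.free_space
    return max(buf.free_space, promised_window)


def hole_fill_accepted(buf, promised_window, shrink_window):
    """In this model an in-order segment is dropped only at window zero."""
    return advertised_window(buf, promised_window, shrink_window) > 0


if __name__ == "__main__":
    buf = ToyBuffer(capacity=41586, in_order_unread=0, out_of_order=41586)
    print("Recv-Q:", buf.in_order_unread, "free:", buf.free_space)
    print("shrink_window=1 accepts hole fill:", hole_fill_accepted(buf, 602, True))
    print("shrink_window=0 accepts hole fill:", hole_fill_accepted(buf, 602, False))
```

In the model, Recv-Q is 0 and the buffer is still full, so only the
non-shrinking variant lets the 602 missing bytes in.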
[1] https://marc.info/?l=linux-netdev&m=174600337131981&w=2
[2] https://github.com/torvalds/linux/commit/b650d953cd391595e536153ce30b4aab385643ac
[3] https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/
* Re: Re: [EXT] Re: tcp: socket stuck with zero receive window after SACK
2025-05-22 10:34 ` Simon Campion
@ 2025-05-22 13:02 ` Neal Cardwell
2025-06-03 7:26 ` Simon Campion
0 siblings, 1 reply; 11+ messages in thread
From: Neal Cardwell @ 2025-05-22 13:02 UTC (permalink / raw)
To: Simon Campion; +Cc: netdev, Eric Dumazet, Yuchung Cheng, Kevin Yang, Jon Maloy
On Thu, May 22, 2025 at 6:34 AM Simon Campion <simon.campion@deepl.com> wrote:
>
> On Wed, 21 May 2025 at 17:56, Neal Cardwell <ncardwell@google.com> wrote:
> > For my education, why do you set net.ipv4.tcp_shrink_window=1?
>
> We enabled it mainly as an attempt to decrease the frequency of a
> different issue in which jumbo frames were dropped indefinitely on a
> host, presumably after memory pressure, discussed in [1]. The jumbo
> frame issue is most likely triggered by system-wide memory pressure
> rather than hitting net.ipv4.tcp_mem. So,
> net.ipv4.tcp_shrink_window=1, which, as far as we understand, makes
> hitting net.ipv4.tcp_mem less likely, probably didn't help with
> decreasing the frequency of the jumbo frame issue. But the issue had
> sufficiently serious impact and we were sufficiently unsure about the
> root cause that we deemed net.ipv4.tcp_shrink_window=1 worth a try.
> (Also, the rationale behind net.ipv4.tcp_shrink_window=1 laid out in
> [2] and [3] sounded reasonable.)
>
> But yes, it's feasible for us to revert to the default
> net.ipv4.tcp_shrink_window=0, in particular because there's another
> workaround for the jumbo frame issue: reduce the MTU. We set
> net.ipv4.tcp_shrink_window=0 yesterday and haven't seen the issue
> since. So:
>
> 6.6.74 + net.ipv4.tcp_shrink_window=1: issue occurs
> 6.6.83 + net.ipv4.tcp_shrink_window=1: issue occurs
> 6.6.74 + net.ipv4.tcp_shrink_window=0: no issue so far
> 6.6.83 + net.ipv4.tcp_shrink_window=0: no issue so far
>
> Since the issue occurred sporadically, it's too soon to be fully
> confident that it's gone with net.ipv4.tcp_shrink_window=0. We'll
> write again in a week or so to confirm.
Thanks for the data points and testing!
I agree it will take a while to gather more confidence that the issue
is gone for your workload with net.ipv4.tcp_shrink_window=0.
Based on your data, my current sense is that for your workload the
buggy behavior was triggered by net.ipv4.tcp_shrink_window=1.
However, AFAICT with the current code there could be similar problems
with the default net.ipv4.tcp_shrink_window=0 setting if the socket
suffers a memory pressure event while there is a tiny amount of free
receive buffer.
> If net.ipv4.tcp_shrink_window=1 seems to have caused this issue, we'd
> still be curious to understand why it leads to TCP connections being
> stuck indefinitely even though the recv-q (as reported by ss) is 0.
> Assuming the recv-q was indeed correctly reported as 0, the issue
> might be that receive buffers can fill up in a way so that the only
> way for data to leave the receive buffer is receipt of further data.
> In particular, the application can't read data out of the receive
> buffer and empty it that way. Maybe filling up buffers with data
> received out-of-order (whether we SACK it or not) satisfies this
> condition. This would explain why we saw this issue only in the
> presence of SACK flags before we disabled SACK. With
> net.ipv4.tcp_shrink_window=1, a full receive buffer leads to a zero
> window being advertised (see [2]) and if the buffer filled up in a way
> so that no data can leave until further data is received, we are stuck
> forever because the kernel drops incoming data due to the zero window.
> In contrast, with net.ipv4.tcp_shrink_window=0, we will keep advertising a
> non-zero window, so incoming data isn't dropped and we can have data
> leave the receive buffer.
Yes, this matches my theory of the case as well.
Except I would add that with net.ipv4.tcp_shrink_window=0, AFAICT with
recent kernels a receiver can get into a situation where a memory
pressure event while there is a tiny amount of free receive buffer can
cause the receiver to set tp->rcv_wnd to 0, and thus get into a
similar situation where the receiver (due to the zero window) will
keep advertising a zero window and dropping incoming data without
pruning SACKed skbs to make room in the receive buffer.
(It sounds like in your case the net.ipv4.tcp_shrink_window=1 is
triggering the problem rather than the memory pressure issue.)
> ... I'm speculating here; once we confirm that
> the issue seems to have been triggered by
> net.ipv4.tcp_shrink_window=1, I'd be keen to hear other thoughts as to
> why the setting may have this effect in certain environments.
I suspect the following environmental factors make your workload
susceptible to these issues:
+ the amount of space used by the NIC driver on the receiver to hold
incoming packets may be large relative to the rcvmss
+ the variation in the incoming packet sizes (the hole was 602 bytes
when the rcvmss is a larger 1434 bytes) may be causing challenges
+ the packet loss is definitely causing challenges for the algorithm,
since the SACKed out-of-order data can eat up most of the space needed
to buffer the packet to fill the hole and allow the app to read the
data out of the receive buffer to free up more space
Thanks,
neal
> [1] https://marc.info/?l=linux-netdev&m=174600337131981&w=2
> [2] https://github.com/torvalds/linux/commit/b650d953cd391595e536153ce30b4aab385643ac
> [3] https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/
* Re: Re: [EXT] Re: tcp: socket stuck with zero receive window after SACK
2025-05-22 13:02 ` Neal Cardwell
@ 2025-06-03 7:26 ` Simon Campion
2025-06-09 17:47 ` Neal Cardwell
0 siblings, 1 reply; 11+ messages in thread
From: Simon Campion @ 2025-06-03 7:26 UTC (permalink / raw)
To: Neal Cardwell; +Cc: netdev, Eric Dumazet, Yuchung Cheng, Kevin Yang, Jon Maloy
> I agree it will take a while to gather more confidence that the issue
> is gone for your workload with net.ipv4.tcp_shrink_window=0.
To confirm, it's been over a week since we set
net.ipv4.tcp_shrink_window=0, and so far we haven't seen an issue with
TCP connections being stuck with a zero window and an empty recv-q.
So, it looks like the problem is either entirely gone or occurs much
less frequently with net.ipv4.tcp_shrink_window=0.
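For anyone who wants to check which hosts still run with the setting
enabled, a minimal sketch (not part of our tooling) that reads the
sysctl through procfs. The helper name read_shrink_window is
hypothetical; the path is the usual procfs mapping of
net.ipv4.tcp_shrink_window and is expected to be missing on kernels
that predate the sysctl.

```python
from pathlib import Path

# procfs view of the net.ipv4.tcp_shrink_window sysctl.
SHRINK_WINDOW_PATH = Path("/proc/sys/net/ipv4/tcp_shrink_window")


def read_shrink_window(path=SHRINK_WINDOW_PATH):
    """Return the sysctl value as an int, or None if it cannot be read
    (for example on a kernel without this sysctl)."""
    try:
        return int(Path(path).read_text().strip())
    except (FileNotFoundError, PermissionError, ValueError):
        return None


if __name__ == "__main__":
    value = read_shrink_window()
    if value is None:
        print("net.ipv4.tcp_shrink_window: not available")
    else:
        print(f"net.ipv4.tcp_shrink_window={value}")
```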
We also attempted to reproduce the issue with a program that sends
data over a TCP connection but leaves out the first N bytes.
Unfortunately, we haven't been able to reproduce the issue so far,
even with net.ipv4.tcp_shrink_window=1 and with a system under memory
pressure.
Simon
* Re: Re: [EXT] Re: tcp: socket stuck with zero receive window after SACK
2025-06-03 7:26 ` Simon Campion
@ 2025-06-09 17:47 ` Neal Cardwell
0 siblings, 0 replies; 11+ messages in thread
From: Neal Cardwell @ 2025-06-09 17:47 UTC (permalink / raw)
To: Simon Campion; +Cc: netdev, Eric Dumazet, Yuchung Cheng, Kevin Yang, Jon Maloy
On Tue, Jun 3, 2025 at 3:27 AM Simon Campion <simon.campion@deepl.com> wrote:
>
> > I agree it will take a while to gather more confidence that the issue
> > is gone for your workload with net.ipv4.tcp_shrink_window=0.
>
> To confirm, it's been over a week since we set
> net.ipv4.tcp_shrink_window=0, and so far we haven't seen an issue with
> TCP connections being stuck with a zero window and an empty recv-q.
> So, it looks like the problem is either entirely gone or occurs much
> less frequently with net.ipv4.tcp_shrink_window=0.
>
> We also attempted to reproduce the issue with a program that sends
> data over a TCP connection but leaves out the first N bytes.
> Unfortunately, we haven't been able to reproduce the issue so far,
> even with net.ipv4.tcp_shrink_window=1 and with a system under memory
> pressure.
Thanks for the detailed update! Those are very useful data points.
best regards,
neal