* Issue with delayed segments despite TCP_NODELAY
From: Dennis Baurichter @ 2025-05-26 0:44 UTC
To: netdev, netfilter
Hi,
I have a question about why the kernel stops sending further TCP
segments after the handshake and the first 2 (or 3) payload segments
have been sent.
This seems to happen if the round trip time is "too high" (e.g., over
9ms or 15ms, depending on system). Remaining segments are (apparently)
only sent after an ACK has been received, even though TCP_NODELAY is set
on the socket.
This is happening on a range of different kernels, from Arch Linux's
6.14.7 (which should be rather close to mainline) down to Ubuntu 22.04's
5.15.0-134-generic (admittedly somewhat "farther away" from mainline). I
can test on an actual mainline kernel, too, if that helps.
I will describe our (probably somewhat uncommon) setup below. If you
need any further information, I'll be happy to provide it.
My colleague and I have the following setup (a rough sketch of the
socket side follows after this list):
- Userland application connects to a server via TCP/IPv4 (complete TCP
handshake is performed).
- A nftables rule is added to intercept packets of this connection and
put them into a netfilter queue.
- Userland application writes data into this TCP socket.
- The data is written in up to 4 chunks, which are intended to end up
in individual TCP segments.
- The socket has TCP_NODELAY set.
- sysctl net.ipv4.tcp_autocorking=0
- The above nftables rule is removed.
- Userland application (a different part of it) retrieves all packets
from the netfilter queue.
- Here it may happen that, e.g., only 2 out of 4 segments can be retrieved.
- Reading from the netfilter queue is attempted until 5 timeouts of
20ms each have occurred. Even much higher timeout values don't change
the results, so it's not a race condition.
- Userland application performs some modifications on the intercepted
segments and eventually issues verdict NF_ACCEPT.
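For concreteness, the socket side looks roughly like this (a sketch, not
our actual code; error handling omitted):

  #include <netinet/in.h>
  #include <netinet/tcp.h>
  #include <sys/socket.h>

  int make_client_socket(void)
  {
      int fd = socket(AF_INET, SOCK_STREAM, 0);
      int one = 1;

      /* disable Nagle so each of the up-to-4 write()s may leave as its
       * own segment instead of being coalesced with the next one */
      setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
      return fd;  /* connect() and the write()s of the chunks follow */
  }

and on the shell side (the table/chain, port and queue number here are
placeholders, not our real rule):

  sysctl -w net.ipv4.tcp_autocorking=0
  nft add rule inet filter output tcp dport 443 queue num 0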
We checked (via strace) that all payload chunks are successfully written
to the socket, (via nlmon kernel module) that there are no errors in the
netlink communication, and (via nft monitor) that indeed no further
segments traverse the netfilter pipeline before the first two payload
segments are actually sent on the wire.
We dug through the entire list of TCP and IPv4 sysctls (testing several
of them), tried loading and using different congestion control modules,
toggled TCP_NODELAY off and on between each write to the socket (to
trigger an explicit flush), and tried other things, but to no avail.
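The toggle we tried was essentially this fragment (sketch; fd is the
connected socket from the sketch above, chunk/chunk_len stand in for one
of the payload chunks):

  int off = 0, on = 1;

  write(fd, chunk, chunk_len);
  /* the off->on toggle was meant to force an immediate flush of any
   * pending data; it made no difference in our tests */
  setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &off, sizeof(off));
  setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));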
Modifying our code, we can see that after NF_ACCEPT'ing the first
segments, we can retrieve the remaining segments from the netfilter queue.
In Wireshark we see that this seems to be triggered by the incoming ACK
segment from the server.
Notably, we can intercept all segments at once when testing this on
localhost or on a LAN. However, on long-distance /
higher-latency connections, we can only intercept 2 (sometimes 3) segments.
Testing on a LAN connection from an old laptop to a fast PC, we delayed
packets on the latter one with variants of:
tc qdisc add dev eth0 root netem delay 15ms
We got the following mappings of delay / rtt to number of segments
intercepted:
below 15ms -> all (up to 4) segments intercepted
15-16ms -> 2-3 segments
16-17ms -> 2 (sometimes 3) segments
over 20ms -> 2 segments (tested 20ms, 200ms, 500ms)
Testing in the other direction, from fast PC to old laptop (which now
has the qdisc delay), we get similar results, just with lower round trip
times (15ms becomes more like 8-9ms).
We would very much appreciate it if someone could help us with the
following questions:
- Why are the remaining segments not sent out immediately, despite
TCP_NODELAY?
- Is there a way to change this?
- If not, do you have better workarounds than injecting a fake ACK
pretending to come "from the server" via a raw socket?
Actually, we haven't tried this yet, but probably will soon.
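(What we have in mind for the fake ACK is roughly the following -- an
untested sketch; every address, port and sequence number is a
placeholder that would really have to come from the intercepted
segments, and it needs CAP_NET_RAW:)

  #include <arpa/inet.h>
  #include <netinet/in.h>
  #include <netinet/ip.h>
  #include <netinet/tcp.h>
  #include <stdint.h>
  #include <sys/socket.h>

  struct pseudo_hdr {
      uint32_t saddr, daddr;
      uint8_t  zero, proto;
      uint16_t tcp_len;
  };

  /* standard Internet checksum */
  static uint16_t csum16(const void *data, size_t len)
  {
      const uint16_t *p = data;
      uint32_t sum = 0;

      while (len > 1) {
          sum += *p++;
          len -= 2;
      }
      if (len)
          sum += *(const uint8_t *)p;
      while (sum >> 16)
          sum = (sum & 0xffff) + (sum >> 16);
      return (uint16_t)~sum;
  }

  int send_fake_ack(void)
  {
      int fd = socket(AF_INET, SOCK_RAW, IPPROTO_RAW); /* IP_HDRINCL implied */
      struct { struct iphdr ip; struct tcphdr tcp; } pkt = { 0 };
      struct { struct pseudo_hdr ph; struct tcphdr tcp; } ck;
      struct sockaddr_in dst = { .sin_family = AF_INET };

      pkt.ip.version  = 4;
      pkt.ip.ihl      = 5;
      pkt.ip.ttl      = 64;
      pkt.ip.protocol = IPPROTO_TCP;
      pkt.ip.tot_len  = htons(sizeof(pkt));
      pkt.ip.saddr    = inet_addr("192.0.2.1");  /* the server (placeholder) */
      pkt.ip.daddr    = inet_addr("192.0.2.2");  /* our own host (placeholder) */
      pkt.ip.check    = csum16(&pkt.ip, sizeof(pkt.ip));

      pkt.tcp.source  = htons(443);     /* server port (placeholder) */
      pkt.tcp.dest    = htons(50000);   /* our local port (placeholder) */
      pkt.tcp.seq     = htonl(1000);    /* server's current seq (placeholder) */
      pkt.tcp.ack_seq = htonl(2000);    /* byte we want acked (placeholder) */
      pkt.tcp.doff    = 5;
      pkt.tcp.ack     = 1;
      pkt.tcp.window  = htons(65535);

      /* TCP checksum is computed over a pseudo header + the TCP header */
      ck.ph  = (struct pseudo_hdr){ pkt.ip.saddr, pkt.ip.daddr, 0,
                                    IPPROTO_TCP, htons(sizeof(pkt.tcp)) };
      ck.tcp = pkt.tcp;
      pkt.tcp.check = csum16(&ck, sizeof(ck));

      /* whether our own stack accepts this (rp_filter, sequence checks)
       * is a separate question we would still have to answer */
      dst.sin_addr.s_addr = pkt.ip.daddr;
      return sendto(fd, &pkt, sizeof(pkt), 0,
                    (struct sockaddr *)&dst, sizeof(dst));
  }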
Regards,
Dennis
* Re: Issue with delayed segments despite TCP_NODELAY
From: Neal Cardwell @ 2025-05-26 13:50 UTC
To: Dennis Baurichter; +Cc: netdev, netfilter, Eric Dumazet
On Sun, May 25, 2025 at 9:01 PM Dennis Baurichter
<dennisba@mail.uni-paderborn.de> wrote:
>
> Hi,
>
> I have a question about why the kernel stops sending further TCP
> segments after the handshake and the first 2 (or 3) payload segments
> have been sent.
> This seems to happen if the round trip time is "too high" (e.g., over
> 9ms or 15ms, depending on system). Remaining segments are (apparently)
> only sent after an ACK has been received, even though TCP_NODELAY is set
> on the socket.
>
> This is happening on a range of different kernels, from Arch Linux's
> 6.14.7 (which should be rather close to mainline) down to Ubuntu 22.04's
> 5.15.0-134-generic (admittedly somewhat "farther away" from mainline). I
> can test on an actual mainline kernel, too, if that helps.
> I will describe our (probably somewhat uncommon) setup below. If you
> need any further information, I'll be happy to provide it.
>
> My colleague and I have the following setup:
> - Userland application connects to a server via TCP/IPv4 (complete TCP
> handshake is performed).
> - A nftables rule is added to intercept packets of this connection and
> put them into a netfilter queue.
> - Userland application writes data into this TCP socket.
> - The data is written in up to 4 chunks, which are intended to end up
> in individual TCP segments.
> - The socket has TCP_NODELAY set.
> - sysctl net.ipv4.tcp_autocorking=0
> - The above nftables rule is removed.
> - Userland application (a different part of it) retrieves all packets
> from the netfilter queue.
> - Here it may happen that, e.g., only 2 out of 4 segments can be retrieved.
> - Reading from the netfilter queue is attempted until 5 timeouts of
> 20ms each have occurred. Even much higher timeout values don't change
> the results, so it's not a race condition.
> - Userland application performs some modifications on the intercepted
> segments and eventually issues verdict NF_ACCEPT.
>
> We checked (via strace) that all payload chunks are successfully written
> to the socket, (via nlmon kernel module) that there are no errors in the
> netlink communication, and (via nft monitor) that indeed no further
> segments traverse the netfilter pipeline before the first two payload
> segments are actually sent on the wire.
> We dug through the entire list of TCP and IPv4 sysctls (testing several
> of them), tried loading and using different congestion control modules,
> toggled TCP_NODELAY off and on between each write to the socket (to
> trigger an explicit flush), and tried other things, but to no avail.
>
> Modifying our code, we can see that after NF_ACCEPT'ing the first
> segments, we can retrieve the remaining segments from the netfilter queue.
> In Wireshark we see that this seems to be triggered by the incoming ACK
> segment from the server.
>
> Notably, we can intercept all segments at once when testing this on
> localhost or on a LAN. However, on long-distance /
> higher-latency connections, we can only intercept 2 (sometimes 3) segments.
>
> Testing on a LAN connection from an old laptop to a fast PC, we delayed
> packets on the latter one with variants of:
> tc qdisc add dev eth0 root netem delay 15ms
> We got the following mappings of delay / rtt to number of segments
> intercepted:
> below 15ms -> all (up to 4) segments intercepted
> 15-16ms -> 2-3 segments
> 16-17ms -> 2 (sometimes 3) segments
> over 20ms -> 2 segments (tested 20ms, 200ms, 500ms)
> Testing in the other direction, from fast PC to old laptop (which now
> has the qdisc delay), we get similar results, just with lower round trip
> times (15ms becomes more like 8-9ms).
>
> We would very much appreciate it if someone could help us with the
> following questions:
> - Why are the remaining segments not sent out immediately, despite
> TCP_NODELAY?
> - Is there a way to change this?
> - If not, do you have better workarounds than injecting a fake ACK
> pretending to come "from the server" via a raw socket?
> Actually, we haven't tried this yet, but probably will soon.
Sounds like you are probably seeing the effects of TCP Small Queues
(TSQ) limiting the number of skbs queued in various layers of the
sending machine. See tcp_small_queue_check() for details.
Probably with shorter RTTs the incoming ACKs clear skbs from the rtx
queue, and thus the tcp_small_queue_check() call to
tcp_rtx_queue_empty_or_single_skb(sk) returns true and
tcp_small_queue_check() returns false, enabling transmissions.
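Roughly, the shape of that check is (simplified sketch, not the verbatim
code; net/ipv4/tcp_output.c is authoritative):

  static bool tcp_small_queue_check(struct sock *sk, const struct sk_buff *skb,
                                    unsigned int factor)
  {
      unsigned long limit;

      /* allow roughly 2 skbs, or about 1ms worth of data at the current
       * pacing rate, to sit in the layers below TCP (qdisc, driver) */
      limit = max_t(unsigned long, 2 * skb->truesize,
                    sk->sk_pacing_rate >> sk->sk_pacing_shift);
      limit <<= factor;

      if (refcount_read(&sk->sk_wmem_alloc) > limit) {
          /* still send if the rtx queue is empty or holds one skb,
           * so we don't depend on a TX completion to restart */
          if (tcp_rtx_queue_empty_or_single_skb(sk))
              return false;

          set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags);
          return true;  /* throttle until queued skbs are freed */
      }
      return false;
  }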
What is it that you are trying to accomplish with this nftables approach?
neal
* Re: Issue with delayed segments despite TCP_NODELAY
From: Dennis Baurichter @ 2025-05-27 23:40 UTC
To: Neal Cardwell; +Cc: netdev, netfilter, Eric Dumazet
Hi Neal,
On 26.05.25 at 15:50, Neal Cardwell wrote:
>> We would very much appreciate it if someone could help us with the
>> following questions:
>> - Why are the remaining segments not sent out immediately, despite
>> TCP_NODELAY?
>> - Is there a way to change this?
>> - If not, do you have better workarounds than injecting a fake ACK
>> pretending to come "from the server" via a raw socket?
>> Actually, we haven't tried this yet, but probably will soon.
>
> Sounds like you are probably seeing the effects of TCP Small Queues
> (TSQ) limiting the number of skbs queued in various layers of the
> sending machine. See tcp_small_queue_check() for details.
Thank you so much! I compiled v6.15 with tcp_small_queue_check() patched
to always return false, and things just worked (again)! Now I wrote a
small module using a kretprobe and regs_set_return_value() to let us
apply this change a bit more selectively (and without recompiling the
entire kernel). That's probably not optimal for anything that should be
widely deployed, but since we are currently just experimenting and don't
even know what might actually be used later on, it seems good enough
for now.
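In case it helps anyone, the module is essentially a variation of the
following (stripped-down sketch, not our exact code; it unconditionally
forces the return value to 0, i.e. "don't throttle", for every call):

  #include <linux/kprobes.h>
  #include <linux/module.h>
  #include <linux/ptrace.h>

  static int tsq_ret_handler(struct kretprobe_instance *ri,
                             struct pt_regs *regs)
  {
      /* make tcp_small_queue_check() appear to have returned false */
      regs_set_return_value(regs, 0);
      return 0;
  }

  static struct kretprobe tsq_probe = {
      .kp.symbol_name = "tcp_small_queue_check",
      .handler        = tsq_ret_handler,
      .maxactive      = 64,
  };

  static int __init tsq_override_init(void)
  {
      return register_kretprobe(&tsq_probe);
  }

  static void __exit tsq_override_exit(void)
  {
      unregister_kretprobe(&tsq_probe);
  }

  module_init(tsq_override_init);
  module_exit(tsq_override_exit);
  MODULE_LICENSE("GPL");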
> Probably with shorter RTTs the incoming ACKs clear skbs from the rtx
> queue, and thus the tcp_small_queue_check() call to
> tcp_rtx_queue_empty_or_single_skb(sk) returns true and
> tcp_small_queue_check() returns false, enabling transmissions.
Honestly, I still don't quite understand why this works the way it does.
We intercept all outgoing (initial) payload segments before we NF_ACCEPT
any of them (i.e., collect all first, then release), so after the
handshake itself there shouldn't be any skb clearing triggered by new
ACKs from our server... Oh well. In any case, it does work, and I'm
happy with that.
Thanks again,
Dennis