From: xietangxin <xietangxin@yeah.net>
To: Willy Tarreau <w@1wt.eu>, Jiayuan Chen <jiayuan.chen@linux.dev>
Cc: Eric Dumazet <edumazet@google.com>,
Pablo Neira Ayuso <pablo@netfilter.org>,
"David S . Miller" <davem@davemloft.net>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
Simon Horman <horms@kernel.org>,
Neal Cardwell <ncardwell@google.com>,
Kuniyuki Iwashima <kuniyu@google.com>,
netdev@vger.kernel.org, eric.dumazet@gmail.com,
Zhouyan Deng <dengzhouyan_nwpu@163.com>,
Florian Westphal <fw@strlen.de>
Subject: Re: [PATCH net] tcp: secure_seq: add back ports to TS offset
Date: Thu, 11 Jun 2026 10:29:59 +0800
Message-ID: <92935c00-e0be-4591-ac44-5978c7804d57@yeah.net>
In-Reply-To: <aibaZyz4MT8Ixt0N@1wt.eu>

On 6/8/2026 11:06 PM, Willy Tarreau wrote:
> On Mon, Jun 08, 2026 at 09:14:26PM +0800, Jiayuan Chen wrote:
>>
>> On 6/8/26 8:51 PM, xietangxin wrote:
>>>
>>> On 6/8/2026 5:42 PM, Willy Tarreau wrote:
>>>> On Mon, Jun 08, 2026 at 01:51:49AM -0700, Eric Dumazet wrote:
>>>>> On Sat, Jun 6, 2026 at 4:06 AM xietangxin <xietangxin@yeah.net> wrote:
>>>>>>
>>>>>> Hi Eric and netdev,
>>>>>>
>>>>>> I noticed a significant TCP performance regression (QPS drop) when using
>>>>>> iptables MASQUERADE with the `--random-fully` option, and I have bisected
>>>>>> it down to commit 165573e41f2f66ef98940cf65f838b2cb575d9d1
>>>>>> (tcp: secure_seq: add back ports to TS offset).
>>>>>>
>>>>>> Here is the benchmark environment and test results.
>>>>>> Environment:
>>>>>> - Client & Server: 2 VMs
>>>>>> - Server: Nginx listening on port 80 (HTTP), and ip 10.0.0.1
>>>>>> - Benchmark tool: wrk (short-lived connections with "Connection: close")
>>>>>>
>>>>>> Test Commands
>>>>>> 1. With random-fully:
>>>>>> # iptables -t nat -A POSTROUTING -d 10.0.0.1 -p tcp --dport 80 -j MASQUERADE --random-fully
>>>>>> # wrk -t8 -c200 -H "Connection: close" -d10s --latency http://10.0.0.1:80
>>>>>> 2. Without random-fully:
>>>>>> # iptables -t nat -A POSTROUTING -d 10.0.0.1 -p tcp --dport 80 -j MASQUERADE
>>>>>> # wrk -t8 -c200 -H "Connection: close" -d10s --latency http://10.0.0.1:80
>>>>>>
>>>>>> Test Results (QPS):
>>>>>> 1. Parent Commit (7f083faf59d14c04e01ec05a7507f036c965acf8):
>>>>>> - with random-fully: 18145.74, 15006.39, 15716.67
>>>>>> - without random-fully: 18556.36, 16339.22, 21506.02
>>>>>>
>>>>>> 2. Bad Commit (165573e41f2f66ef98940cf65f838b2cb575d9d1):
>>>>>> - with random-fully: 11074.76, 10383.20, 10164.81 <-- (~35% drop)
>>>>>> - without random-fully: 17310.75, 20279.85, 18399.48
>>>>>>
>>>>>> Is this performance degradation an expected side-effect of the security fix,
>>>>>> or is there any sysctl param we should tune when `--random-fully` is
>>>>>> required for high-concurrency short connections?
>>>>> Hi Tangxin
>>>>>
>>>>> I do not know why that patch would affect MASQUERADE performance.
>>>>>
>>>>> Pablo, Florian, do you have an idea?
>>>> I suspect it's because MASQUERADE can shuffle the ports around and
>>>> break the end-to-end mapping. With host-based ISN the increments
>>>> remain positive regardless of the ports, while with port-based
>>>> increments if you shuffle ports around, two consecutive uses of
>>>> the same port can end up showing a decreasing ISN, and some
>>>> outgoing SYN will get an ACK instead of a SYN-ACK, then send an
>>>> RST, and a SYN again, causing a degradation.
>>>>
>>>> I'm not saying this is necessarily what happens here but based on the
>>>> commit message description I suspect that this is what's happening
>>>> here. There's always a tradeoff between ISN secrecy and reliability
>>>> unfortunately.
>>>>
>>>> Willy
>>> Hi,
>>>
>>> Willy, your hypothesis is 100% correct!
>>> I captured the packets during the benchmark on the bad commit,
>>> and the trace perfectly shows the "SYN -> ACK -> RST".
>>>
>>> Here is the key snippet of the packet trace (Client: 10.0.0.2, Server: 10.0.0.1):
>>>
>>> // 1. First connection closes, Server sends last ACK(410615916), entering TIME_WAIT.
>>> 12105 08:54:39.128861 10.0.0.1 -> 10.0.0.2 TCP 80 -> 47824 [ACK] Seq=3315216203 Ack=410615916 TSval=273827652 TSecr=370383870
>>>
>>> // 2. ~200ms later, next short-conn reuses port 47824 via MASQUERADE --random-fully
>>> 47637 08:54:39.332281 10.0.0.2 -> 10.0.0.1 TCP 47824 -> 80 [SYN] Seq=559739866 TSval=4137539723 TSecr=0
>>>
>>> // 3. Server sends an ACK carrying the old connection's expected ACK (410615916).
>>> 48591 08:54:39.337692 10.0.0.1 -> 10.0.0.2 TCP 80 -> 47824 [ACK] Seq=3315216203 Ack=410615916 TSval=273827858 TSecr=370383870
>>>
>>> // 4. Client receives the unexpected old ACK, responds with RST, and has to retry the connection.
>>> 48600 08:54:39.337799 10.0.0.2 -> 10.0.0.1 TCP 47824 -> 80 [RST] Seq=410615916 Win=0
>>>
>>>
>>> Are there any architectural recommendations we should consider here,
>>> or is this considered an acceptable trade-off for security?
>>
>>
>> It's a classic PAWS problem when packets go through a NAT/gateway.
>>
>> Can you test the performance with the following two configs (on the client)?
>>
>> sysctl -w net.ipv4.tcp_timestamps=2
>>
>> sysctl -w net.ipv4.tcp_timestamps=0
>
> The thing is, nothing forces a server to use PAWS to distinguish SYNs,
> and some OSes only apply the spec to the letter (at least Solaris in
> my experience). The spec basically says that PAWS protects against
> duplicate non-SYN segments, and duplicate SYNs while there is a
> connection. So while Linux (and other OSes) nicely covers the case
> of the transition from TIME_WAIT to SYN_RECV, others do not always do
> it.
>
> Regardless, while I hate it when we play borderline games with TCP for
> the sake of supposed security issues that can only be demonstrated in
> a lab, we must admit that masquerading remains an issue when the same
> port range is shared between multiple hosts, even with PAWS since there
> is no reason for multiple hosts to have the same clock.
>
> Willy

Hi Willy, Jiayuan and all,

Thanks for the excellent analysis. I have verified the suggestions
and captured detailed traces.

- Using `tcp_timestamps=2` on the client completely restores performance
  because it sets `ts_offset = 0`.
- Recalculating the `ts_offset` in Netfilter after NAT port allocation
  also completely fixes the regression.

Here are the detailed test results, packet analysis, and a potential fix:

1. Test Results (QPS) with different tcp_timestamps (on commit 165573e41f2f)

- With net.ipv4.tcp_timestamps=2 (ts_offset = 0):
  * with random-fully:    21198.8, 20754.97, 21598.64 (performance recovered)
  * without random-fully: 21561.6, 22531.71, 23119.88
- With net.ipv4.tcp_timestamps=0 (timestamps disabled):
  * with random-fully:    12018.82, 10098.1, 10714.86 (still suffers)
  * without random-fully: 18409.88, 20951.87, 24635.95
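
To make the runs easier to compare, they can be averaged with a short script.
This is only a sketch in Python: the samples are copied from this thread, while
the labels and the `summarize` helper are made up for illustration.

```python
from statistics import mean

# QPS samples copied from this thread, as
# (with --random-fully, without --random-fully). Labels are illustrative.
RUNS = {
    "parent commit": (
        [18145.74, 15006.39, 15716.67],
        [18556.36, 16339.22, 21506.02],
    ),
    "165573e41f2f (original report)": (
        [11074.76, 10383.20, 10164.81],
        [17310.75, 20279.85, 18399.48],
    ),
    "165573e41f2f, tcp_timestamps=2": (
        [21198.8, 20754.97, 21598.64],
        [21561.6, 22531.71, 23119.88],
    ),
    "165573e41f2f, tcp_timestamps=0": (
        [12018.82, 10098.1, 10714.86],
        [18409.88, 20951.87, 24635.95],
    ),
}


def summarize(runs):
    """Return {label: (mean_with, mean_without, with/without ratio)}."""
    out = {}
    for label, (with_rf, without_rf) in runs.items():
        m_with, m_without = mean(with_rf), mean(without_rf)
        out[label] = (m_with, m_without, m_with / m_without)
    return out


if __name__ == "__main__":
    for label, (m_with, m_without, ratio) in summarize(RUNS).items():
        print(f"{label:32s} with={m_with:9.2f} "
              f"without={m_without:9.2f} ratio={ratio:.2f}")
```

On these samples the with/without ratio is roughly 0.56 for the original report
on the bad commit, 0.51 with timestamps disabled, and 0.95 with
tcp_timestamps=2. Three runs per case is a small sample, so the ratios are
indicative only.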

2. Packet Analysis: PAWS Failure due to Non-monotonic TSval

I captured the traffic during the regression. Most of the "SYN -> ACK -> RST"
cases occur because the new SYN's TSval is smaller than the TSval last recorded
by the server's TIME_WAIT socket (confirmed via the TSecr in the ACK):

  Client -> Server  24870 → 80 [SYN] Seq=3540240581 TSval=2294041168 TSecr=0
  Server -> Client  80 → 24870 [ACK] Seq=2293248269 Ack=855605690 TSecr=2846236456
  Client -> Server  24870 → 80 [RST] Seq=855605690
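
The arithmetic behind that trace can be checked with a small model. This is a
simplified sketch in Python, not kernel code: it assumes the TIME_WAIT socket's
last recorded TSval equals the TSecr of the ACK and its expected sequence
number equals the Ack field, it uses a PAWS window of 1 tick as an assumption,
and it ignores the 24-day timestamp expiry and RST handling. All function names
are hypothetical.

```python
PAWS_WINDOW = 1  # assumption: tolerated backwards step, in timestamp ticks


def s32(x: int) -> int:
    """Interpret x modulo 2**32 as a signed 32-bit integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def paws_rejects(ts_recent: int, tsval: int) -> bool:
    """True if tsval is older than ts_recent by more than the window."""
    return s32(ts_recent - tsval) > PAWS_WINDOW


def seq_after(seq: int, rcv_nxt: int) -> bool:
    """True if seq is strictly after rcv_nxt in wrapping sequence space."""
    return s32(rcv_nxt - seq) < 0


def time_wait_accepts_syn(seq, tsval, rcv_nxt, ts_recent) -> bool:
    """Simplified test for letting a new SYN recycle a TIME_WAIT slot."""
    if paws_rejects(ts_recent, tsval):
        return False  # the server answers with the old ACK instead
    return seq_after(seq, rcv_nxt) or s32(ts_recent - tsval) < 0


if __name__ == "__main__":
    # Values from the trace above.
    syn_seq, syn_tsval = 3540240581, 2294041168
    tw_rcv_nxt, tw_ts_recent = 855605690, 2846236456
    print("TSval goes backwards by", s32(tw_ts_recent - syn_tsval), "ticks")
    print("SYN accepted:",
          time_wait_accepts_syn(syn_seq, syn_tsval, tw_rcv_nxt, tw_ts_recent))
```

With these values the SYN's TSval is 552195288 ticks behind the recorded one,
so the model rejects the SYN. This is consistent with the server replying with
the old ACK and the client answering with an RST.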

When Conn1 (local sport 10000) and Conn2 (local sport 20000) are both mapped to the
same external sport 30000 via `--random-fully`, their TS offsets are still calculated
from the local ports, i.e. `ts_offset1(10000)` and `ts_offset2(20000)`.
If Conn2 reuses the TIME_WAIT slot of Conn1 on the server, there is a chance that
`ts_offset2 < ts_offset1`, which breaks TSval monotonicity for the same 4-tuple.

3. A Potential Fix: Recalculating ts_offset in Netfilter

I wrote a quick demo patch that updates the `ts_offset` inside `tcp_manip_pkt` using
the newly mapped port. With it, performance is completely restored:

- With the Netfilter demo patch:
  * with random-fully:    21966.86, 21113.68, 20763.9 (performance recovered)
  * without random-fully: 19306.86, 20018.69, 20288.24

Since I am not an expert in netfilter internals, I would like to ask: is recalculating
the `ts_offset` during the NAT mangling phase an acceptable approach?
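
The effect of keying the offset on the pre-NAT versus the post-NAT port can be
illustrated with a toy simulation. This is a rough sketch in Python and is not
the demo patch: keyed BLAKE2b stands in for the kernel's keyed hash, and `KEY`,
the millisecond clock, the 200 ms gap, and the port numbers are all
assumptions chosen for illustration.

```python
import hashlib
import socket
import struct

KEY = bytes(range(16))  # hypothetical per-boot secret


def ts_offset(saddr: str, daddr: str, sport: int, dport: int) -> int:
    """Toy per-4-tuple offset; keyed BLAKE2b is only a stand-in hash."""
    msg = (socket.inet_aton(saddr) + socket.inet_aton(daddr)
           + struct.pack("!HH", sport, dport))
    digest = hashlib.blake2b(msg, key=KEY, digest_size=4).digest()
    return int.from_bytes(digest, "big")


def s32(x: int) -> int:
    """Interpret x modulo 2**32 as a signed 32-bit integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def regression_rate(use_mapped_port: bool, pairs: int = 2000,
                    gap_ms: int = 200) -> float:
    """Fraction of connection pairs whose second TSval is not newer."""
    client, server, dport, mapped = "10.0.0.2", "10.0.0.1", 80, 30000
    bad = 0
    for i in range(pairs):
        sport1, sport2 = 10000 + i, 20000 + i  # two distinct local ports
        p1 = mapped if use_mapped_port else sport1
        p2 = mapped if use_mapped_port else sport2
        clock = 1_000_000 + i  # arbitrary millisecond clock
        ts1 = (clock + ts_offset(client, server, p1, dport)) & 0xFFFFFFFF
        ts2 = (clock + gap_ms
               + ts_offset(client, server, p2, dport)) & 0xFFFFFFFF
        bad += s32(ts2 - ts1) <= 0  # what the server sees on one 4-tuple
    return bad / pairs


if __name__ == "__main__":
    print("offset from local port: ", regression_rate(False))
    print("offset from mapped port:", regression_rate(True))
```

In this model, offsets derived from the local port make the second TSval go
backwards for about half of the pairs, while offsets derived from the mapped
port never do, because both connections then share one offset. Whether the
kernel can safely recompute the offset at NAT time is a separate question
that the model does not answer.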
--
Best regards,
Tangxin Xie