From: xietangxin <xietangxin@yeah.net>
To: Willy Tarreau <w@1wt.eu>, Jiayuan Chen <jiayuan.chen@linux.dev>
Cc: Eric Dumazet <edumazet@google.com>,
	Pablo Neira Ayuso <pablo@netfilter.org>,
	"David S . Miller" <davem@davemloft.net>,
	Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
	Simon Horman <horms@kernel.org>,
	Neal Cardwell <ncardwell@google.com>,
	Kuniyuki Iwashima <kuniyu@google.com>,
	netdev@vger.kernel.org, eric.dumazet@gmail.com,
	Zhouyan Deng <dengzhouyan_nwpu@163.com>,
	Florian Westphal <fw@strlen.de>
Subject: Re: [PATCH net] tcp: secure_seq: add back ports to TS offset
Date: Thu, 11 Jun 2026 10:29:59 +0800
Message-ID: <92935c00-e0be-4591-ac44-5978c7804d57@yeah.net>
In-Reply-To: <aibaZyz4MT8Ixt0N@1wt.eu>



On 6/8/2026 11:06 PM, Willy Tarreau wrote:
> On Mon, Jun 08, 2026 at 09:14:26PM +0800, Jiayuan Chen wrote:
>>
>> On 6/8/26 8:51 PM, xietangxin wrote:
>>>
>>> On 6/8/2026 5:42 PM, Willy Tarreau wrote:
>>>> On Mon, Jun 08, 2026 at 01:51:49AM -0700, Eric Dumazet wrote:
>>>>> On Sat, Jun 6, 2026 at 4:06 AM xietangxin <xietangxin@yeah.net> wrote:
>>>>>>
>>>>>> Hi Eric and netdev,
>>>>>>
>>>>>> I noticed a significant TCP performance regression (QPS drop) when using
>>>>>> iptables MASQUERADE with the `--random-fully` option, and I have bisected
>>>>>> it down to commit 165573e41f2f66ef98940cf65f838b2cb575d9d1
>>>>>> (tcp: secure_seq: add back ports to TS offset).
>>>>>>
>>>>>> Here is the benchmark environment and test results.
>>>>>> Environment:
>>>>>> - Client & Server: 2 VMs
>>>>>> - Server: Nginx listening on port 80 (HTTP), and ip 10.0.0.1
>>>>>> - Benchmark tool: wrk (short-lived connections with "Connection: close")
>>>>>>
>>>>>> Test Commands
>>>>>> 1. With random-fully:
>>>>>>     # iptables -t nat -A POSTROUTING -d 10.0.0.1 -p tcp --dport 80 -j MASQUERADE --random-fully
>>>>>>     # wrk -t8 -c200 -H "Connection: close" -d10s --latency http://10.0.0.1:80
>>>>>> 2. Without random-fully:
>>>>>>     # iptables -t nat -A POSTROUTING -d 10.0.0.1 -p tcp --dport 80 -j MASQUERADE
>>>>>>     # wrk -t8 -c200 -H "Connection: close" -d10s --latency http://10.0.0.1:80
>>>>>>
>>>>>> Test Results (QPS):
>>>>>> 1. Parent Commit (7f083faf59d14c04e01ec05a7507f036c965acf8):
>>>>>>     - with random-fully:    18145.74, 15006.39, 15716.67
>>>>>>     - without random-fully: 18556.36, 16339.22, 21506.02
>>>>>>
>>>>>> 2. Bad Commit (165573e41f2f66ef98940cf65f838b2cb575d9d1):
>>>>>>     - with random-fully:    11074.76, 10383.20, 10164.81  <-- (~35% drop)
>>>>>>     - without random-fully: 17310.75, 20279.85, 18399.48
>>>>>>
>>>>>> Is this performance degradation an expected side-effect of the security fix,
>>>>>> or is there any sysctl param we should tune when `--random-fully` is
>>>>>> required for high-concurrency short connections?
>>>>> Hi Tangxin
>>>>>
>>>>> I do not know why that patch would affect MASQUERADE performance.
>>>>>
>>>>> Pablo, Florian, do you have an idea?
>>>> I suspect it's because MASQUERADE can shuffle the ports around and
>>>> break the end-to-end mapping. With host-based ISN the increments
>>>> remain positive regardless of the ports, while with port-based
>>>> increments if you shuffle ports around, two consecutive uses of
>>>> the same port can end up showing a decreasing ISN, and some
>>>> outgoing SYN will get an ACK instead of a SYN-ACK, then send an
>>>> RST, and a SYN again, causing a degradation.
>>>>
>>>> I'm not saying this is necessarily what happens here but based on the
>>>> commit message description I suspect that this is what's happening
>>>> here. There's always a tradeoff between ISN secrecy and reliability
>>>> unfortunately.
>>>>
>>>> Willy
>>> Hi,
>>>
>>> Willy, your hypothesis is 100% correct!
>>> I captured the packets during the benchmark on the bad commit,
>>> and the trace perfectly shows the "SYN -> ACK -> RST".
>>>
>>> Here is the key snippet of the packet trace (Client: 10.0.0.2, Server: 10.0.0.1):
>>>
>>> // 1. First connection closes, Server sends last ACK(410615916), entering TIME_WAIT.
>>> 12105 08:54:39.128861 10.0.0.1 -> 10.0.0.2 TCP 80 -> 47824 [ACK] Seq=3315216203 Ack=410615916 TSval=273827652 TSecr=370383870
>>>
>>> // 2. ~200ms later, next short-conn reuses port 47824 via MASQUERADE --random-fully
>>> 47637 08:54:39.332281 10.0.0.2 -> 10.0.0.1 TCP 47824 -> 80 [SYN] Seq=559739866 TSval=4137539723 TSecr=0
>>>
>>> // 3. Server sends an ACK carrying the old connection's expected ACK number (410615916).
>>> 48591 08:54:39.337692 10.0.0.1 -> 10.0.0.2 TCP 80 -> 47824 [ACK] Seq=3315216203 Ack=410615916 TSval=273827858 TSecr=370383870
>>>
>>> // 4. Client receives the unexpected old ACK, responds with RST, and has to retry the connection.
>>> 48600 08:54:39.337799 10.0.0.2 -> 10.0.0.1 TCP 47824 -> 80 [RST] Seq=410615916 Win=0
>>>
>>>
>>> Are there any architectural recommendations we should consider here,
>>> or is this considered an acceptable trade-off for security?
>>
>>
>> It's a classic PAWS problem when packets go through a NAT/gateway.
>>
>> Can you test the performance with the following two configs (on the client)?
>>
>>     sysctl -w net.ipv4.tcp_timestamps=2
>>
>>     sysctl -w net.ipv4.tcp_timestamps=0
> 
> The thing is, nothing forces a server to use PAWS to distinguish SYNs,
> and some OSes only apply the spec to the letter (at least Solaris in
> my experience). The spec basically says that PAWS protects against
> duplicate non-SYN segments, and duplicate SYNs while there is a
> connection. So while Linux (and other OSes) nicely covers the case
> of the transition from TIME_WAIT to SYN_RECV, others do not always do
> it.
> 
> Regardless, while I hate it when we play borderline games with TCP for
> the sake of supposed security issues that can only be demonstrated in
> a lab, we must admit that masquerading remains an issue when the same
> port range is shared between multiple hosts, even with PAWS since there
> is no reason for multiple hosts to have the same clock.
> 
> Willy
Hi Willy, Jiayuan and all,

Thanks for the excellent analysis. I have verified the suggestions
and captured detailed traces.

- Using `tcp_timestamps=2` on the client completely restores performance
  because it sets `ts_offset = 0`.
- Recalculating the `ts_offset` in Netfilter after NAT port allocation
  also completely fixes the regression.

Here are the detailed test results, packet analysis, and a potential fix:

1. Test Results with different tcp_timestamps (on commit 165573e41f2f)
- With net.ipv4.tcp_timestamps=2 (ts_offset = 0):
  * with random-fully:    21198.8,  20754.97, 21598.64 (Performance recovered)
  * without random-fully: 21561.6,  22531.71, 23119.88

- With net.ipv4.tcp_timestamps=0 (Timestamps disabled):
  * with random-fully:    12018.82, 10098.1,  10714.86 (Still suffers)
  * without random-fully: 18409.88, 20951.87, 24635.95
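
To make the two settings easier to compare, here is a small sketch that averages
the three runs per configuration and reports how far `--random-fully` trails plain
MASQUERADE. The sample values are copied from the results above; the names `RUNS`,
`summarize` and `relative_gap` are only illustrative, and three 10-second runs are
of course a noisy basis.

```python
from statistics import mean

# QPS samples copied from the results above (commit 165573e41f2f).
RUNS = {
    ("tcp_timestamps=2", "random-fully"): [21198.8, 20754.97, 21598.64],
    ("tcp_timestamps=2", "plain"): [21561.6, 22531.71, 23119.88],
    ("tcp_timestamps=0", "random-fully"): [12018.82, 10098.1, 10714.86],
    ("tcp_timestamps=0", "plain"): [18409.88, 20951.87, 24635.95],
}


def summarize(runs):
    """Mean QPS for every (sysctl setting, NAT mode) pair."""
    return {key: mean(values) for key, values in runs.items()}


def relative_gap(summary, setting):
    """Fraction by which random-fully trails plain MASQUERADE for one setting."""
    random_fully = summary[(setting, "random-fully")]
    plain = summary[(setting, "plain")]
    return 1.0 - random_fully / plain


if __name__ == "__main__":
    summary = summarize(RUNS)
    for setting in ("tcp_timestamps=2", "tcp_timestamps=0"):
        print(f"{setting}: random-fully trails by {relative_gap(summary, setting):.1%}")
```

On these samples the gap is roughly 5% with `tcp_timestamps=2` and roughly 49%
with `tcp_timestamps=0`.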

2. Packet Analysis: PAWS Failure due to Non-monotonic TSval
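I captured the traffic during the regression. The majority of the "SYN -> ACK -> RST"
cases occur because the new SYN's TSval is smaller than the last TSval recorded by the
server's TIME_WAIT socket, which the server echoes as TSecr in its ACK. In the example
below, the SYN carries TSval=2294041168 while the ACK echoes TSecr=2846236456.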

Client -> Server 24870 → 80 [SYN] Seq=3540240581 TSval=2294041168 TSecr=0
Server -> Client 80 → 24870 [ACK] Seq=2293248269 Ack=855605690 TSecr=2846236456
Client -> Server 24870 → 80 [RST] Seq=855605690

When Conn1 (local sport 10000) and Conn2 (local sport 20000) are both mapped to the
same external sport 30000 via `--random-fully`, each TS offset is still computed
from the local port before NAT: `ts_offset1(10000)` and `ts_offset2(20000)`.

If Conn2 reuses the TIME_WAIT slot of Conn1 on the server, there is a chance that
`ts_offset2 < ts_offset1`, so TSval is no longer monotonic for the same 4-tuple
as seen by the server.
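
The collision described above can be reproduced in a toy model. The sketch below is
not kernel code. It assumes the offset is a keyed hash of the 4-tuple truncated to
32 bits (BLAKE2 stands in for the kernel's hash), that both connections read the
same millisecond clock, and that the server compares timestamps with 32-bit
wraparound arithmetic. `SECRET`, `ts_offset`, `ts_after` and `backward_fraction`
are names made up for illustration, and the ISN is not modelled at all.

```python
import hashlib

SECRET = b"hypothetical-boot-secret"  # assumption: stands in for the per-boot key
MASK32 = 0xFFFFFFFF


def ts_offset(saddr, daddr, sport, dport):
    """Toy per-4-tuple TS offset: keyed BLAKE2 hash truncated to 32 bits."""
    msg = f"{saddr}|{daddr}|{sport}|{dport}".encode()
    digest = hashlib.blake2b(msg, key=SECRET, digest_size=4).digest()
    return int.from_bytes(digest, "little")


def ts_after(new, old):
    """True if `new` is later than `old` in 32-bit wraparound arithmetic."""
    return 0 < ((new - old) & MASK32) < 0x80000000


def backward_fraction(sport_pairs, saddr="10.0.0.2", daddr="10.0.0.1",
                      dport=80, gap_ms=200, clock_ms=1_000_000):
    """Fraction of port pairs whose second SYN carries an older TSval.

    Each pair is (first_sport, second_sport) before NAT. Both are assumed to
    be rewritten to the same external port, so the server sees one 4-tuple.
    """
    backward = 0
    for first, second in sport_pairs:
        # TSval of the old connection, remembered by the TIME_WAIT socket.
        ts_recent = (clock_ms + ts_offset(saddr, daddr, first, dport)) & MASK32
        # TSval of the new SYN, sent gap_ms later with a different offset.
        new_syn = (clock_ms + gap_ms
                   + ts_offset(saddr, daddr, second, dport)) & MASK32
        if not ts_after(new_syn, ts_recent):
            backward += 1
    return backward / len(sport_pairs)


if __name__ == "__main__":
    pairs = [(10000 + i, 20000 + i) for i in range(2000)]
    print(f"SYNs with a backward TSval: {backward_fraction(pairs):.1%}")
```

In this model the fraction should come out close to one half, since the difference
of two independent 32-bit offsets is negative about half of the time. When both
connections use the same local port the fraction is zero.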


3. A Potential Fix: Recalculating ts_offset in Netfilter
I wrote a quick demo patch to update the `ts_offset` inside `tcp_manip_pkt` using
the newly mapped port. The results show that performance is completely restored:

- With Netfilter Demo Patch:
  * with random-fully:    21966.86, 21113.68, 20763.9 (Performance recovered!)
  * without random-fully: 19306.86, 20018.69, 20288.24

I am not an expert in netfilter internals, so my question is: is recalculating
the `ts_offset` during the NAT mangling phase an acceptable approach?
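
To illustrate why keying the offset on the mapped port helps, here is a minimal
sketch of the idea, not the demo patch itself. It assumes a toy offset function
(keyed BLAKE2 truncated to 32 bits instead of the kernel's hash), a single client
clock, and a source address that NAT leaves unchanged. `SECRET`, `ts_offset`,
`signed_delta` and `tsval_step` are hypothetical names.

```python
import hashlib

SECRET = b"hypothetical-boot-secret"  # assumption: stands in for the per-boot key
MASK32 = 0xFFFFFFFF


def ts_offset(saddr, daddr, sport, dport):
    """Toy per-4-tuple TS offset: keyed BLAKE2 hash truncated to 32 bits."""
    msg = f"{saddr}|{daddr}|{sport}|{dport}".encode()
    digest = hashlib.blake2b(msg, key=SECRET, digest_size=4).digest()
    return int.from_bytes(digest, "little")


def signed_delta(new, old):
    """Signed 32-bit difference new - old (negative: TSval went backwards)."""
    diff = (new - old) & MASK32
    return diff - (1 << 32) if diff >= (1 << 31) else diff


def tsval_step(first_sport, second_sport, mapped_sport, use_mapped_port,
               saddr="10.0.0.2", daddr="10.0.0.1", dport=80,
               clock_ms=1_000_000, gap_ms=200):
    """TSval change the server sees between two SYNs on one external port."""
    if use_mapped_port:
        # Offset keyed on the post-NAT port: identical for both connections.
        first_key = second_key = mapped_sport
    else:
        # Offset keyed on the pre-NAT port: unrelated for the two connections.
        first_key, second_key = first_sport, second_sport
    old = (clock_ms + ts_offset(saddr, daddr, first_key, dport)) & MASK32
    new = (clock_ms + gap_ms
           + ts_offset(saddr, daddr, second_key, dport)) & MASK32
    return signed_delta(new, old)


if __name__ == "__main__":
    print("pre-NAT key :", tsval_step(10000, 20000, 30000, use_mapped_port=False))
    print("post-NAT key:", tsval_step(10000, 20000, 30000, use_mapped_port=True))
```

With the post-NAT key the step always equals the elapsed time, so TSval stays
monotonic per external 4-tuple. This holds only while a single clock sits behind
the mapped port; it does not cover several hosts sharing one port range.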
-- 
Best regards,
Tangxin Xie

