From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
To: Ido Schimmel <idosch@idosch.org>, Dmitry <demetriousz@proton.me>,
willemdebruijn.kernel@gmail.com
Cc: "David S. Miller" <davem@davemloft.net>,
David Ahern <dsahern@kernel.org>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>,
Paolo Abeni <pabeni@redhat.com>,
Simon Horman <horms@kernel.org>,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH net-next] net: ipv6: respect route prfsrc and fill empty saddr before ECMP hash
Date: Tue, 07 Oct 2025 18:25:47 -0400 [thread overview]
Message-ID: <willemdebruijn.kernel.3a8157265761f@gmail.com> (raw)
In-Reply-To: <aOVIAWAxpWto8ETd@shredder>
Ido Schimmel wrote:
> On Mon, Oct 06, 2025 at 06:31:10PM +0000, Dmitry wrote:
> > If the 5-tuple is not changed, then both the hash and the outgoing interface
> > (OIF) should remain consistent, which is not the case.
With git fetch over SSH, the process apparenty explicitly changes DSCP
(by calling setsockopt IPV6_TCLASS?). Which triggers a dst reset,
which that may trigger a different path. That is WAI, right?
Policy routing can explicitly specify different egress devices for
different DSCP settings.
Is this the entire issue? The original message states
> In an IPv6 ECMP scenario, if a multi-homed host initiates a connection,
> `saddr` may remain empty during the initial call to `rt6_multipath_hash()`.
> It gets filled later, once the outgoing interface (OIF) is determined and
> `ipv6_dev_get_saddr()` (RFC 6724) selects the proper source address.
>
> In some cases, this can cause the flow to switch paths: the first packets
> go via one link, while the rest of the flow is routed over another.
That sounds as if the OIF can change in between the rt6_multipath_hash
and ipv6_dev_get_saddr calls for a regular socket, without such
explicit DSCP changes. Does this happen?
> > Only with the fix does it
> > respect the configured SRC and produce a consistent, correct 5-tuple with the
> > proper hash.
> >
> > Therefore, in my opinion, this should be fixed.
>
> Note that even if the hash is consistent throughout the lifetime of the
> socket, it is still possible for packets to be routed out of different
> interfaces. This can happen, for example, if one of the nexthop devices
> loses its carrier. This will change the hash thresholds in the ECMP
> group and can cause packets to egress a different interface even if the
> current one is not the one that went down. Obviously packets can also
> change paths due to changes in other routers between you and the
> destination. A network design that results in connections being severed
> every time a flow is routed differently seems fragile to me.
>
> If you still want to address the issue, then I believe that the correct
> way to do it would be to align tcp_v6_connect() with tcp_v4_connect().
> I'm not sure why they differ, but the IPv4 version will first do a route
> lookup to determine the source address, then allocate a source port and
> only when all the parameters are known it will do a final route lookup
> and cache the result in the socket. IPv6 on the other hand, does a
> single route lookup with an unknown source address and an unknown source
> port.
>
> This is explained in the comment above ip_route_connect_init() and
> Willem also explained it here:
>
> https://lore.kernel.org/all/20250424143549.669426-2-willemdebruijn.kernel@gmail.com/
>
> Willem, do you happen to know why tcp_v6_connect() only performs a
> single route lookup?
I did not fully get to the historical reasons for the differences.
From v1 of that patch:
"Side-quest: I wonder if the second route lookup in ip_route_connect
is vestigial since the introduction of the third route lookup with
ip_route_newports. IPv6 has neither second nor third lookup, which
hints that perhaps both can be removed. "
https://lore.kernel.org/netdev/20250420180537.2973960-2-willemdebruijn.kernel@gmail.com/
> Link to the original patch:
>
> https://lore.kernel.org/netdev/20251005-ipv6-set-saddr-to-prefsrc-before-hash-to-stabilize-ecmp-v1-1-d43b6ef00035@proton.me/
>
> Thanks
next prev parent reply other threads:[~2025-10-07 22:25 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-10-05 20:49 [PATCH net-next] net: ipv6: respect route prfsrc and fill empty saddr before ECMP hash Dmitry Z via B4 Relay
2025-10-06 13:30 ` Ido Schimmel
2025-10-06 18:31 ` Dmitry
2025-10-07 17:04 ` Ido Schimmel
2025-10-07 22:25 ` Willem de Bruijn [this message]
2025-10-08 6:57 ` Ido Schimmel
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=willemdebruijn.kernel.3a8157265761f@gmail.com \
--to=willemdebruijn.kernel@gmail.com \
--cc=davem@davemloft.net \
--cc=demetriousz@proton.me \
--cc=dsahern@kernel.org \
--cc=edumazet@google.com \
--cc=horms@kernel.org \
--cc=idosch@idosch.org \
--cc=kuba@kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).