Re: [PATCH net-next] net: ipv6: respect route prfsrc and fill empty saddr before ECMP hash

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
To: Ido Schimmel <idosch@idosch.org>,  Dmitry <demetriousz@proton.me>,
	 willemdebruijn.kernel@gmail.com
Cc: "David S. Miller" <davem@davemloft.net>,
	 David Ahern <dsahern@kernel.org>,
	 Eric Dumazet <edumazet@google.com>,
	 Jakub Kicinski <kuba@kernel.org>,
	 Paolo Abeni <pabeni@redhat.com>,
	 Simon Horman <horms@kernel.org>,
	 netdev@vger.kernel.org,  linux-kernel@vger.kernel.org
Subject: Re: [PATCH net-next] net: ipv6: respect route prfsrc and fill empty saddr before ECMP hash
Date: Tue, 07 Oct 2025 18:25:47 -0400	[thread overview]
Message-ID: <willemdebruijn.kernel.3a8157265761f@gmail.com> (raw)
In-Reply-To: <aOVIAWAxpWto8ETd@shredder>

Ido Schimmel wrote:
> On Mon, Oct 06, 2025 at 06:31:10PM +0000, Dmitry wrote:
> > If the 5-tuple is not changed, then both the hash and the outgoing interface
> > (OIF) should remain consistent, which is not the case.

With git fetch over SSH, the process apparenty explicitly changes DSCP
(by calling setsockopt IPV6_TCLASS?). Which triggers a dst reset,
which that may trigger a different path. That is WAI, right?

Policy routing can explicitly specify different egress devices for
different DSCP settings.

Is this the entire issue? The original message states

> In an IPv6 ECMP scenario, if a multi-homed host initiates a connection,
> `saddr` may remain empty during the initial call to `rt6_multipath_hash()`.
> It gets filled later, once the outgoing interface (OIF) is determined and
> `ipv6_dev_get_saddr()` (RFC 6724) selects the proper source address.
>
> In some cases, this can cause the flow to switch paths: the first packets
> go via one link, while the rest of the flow is routed over another.

That sounds as if the OIF can change in between the rt6_multipath_hash
and ipv6_dev_get_saddr calls for a regular socket, without such
explicit DSCP changes. Does this happen?


> > Only with the fix does it
> > respect the configured SRC and produce a consistent, correct 5-tuple with the
> > proper hash.
> > 
> > Therefore, in my opinion, this should be fixed.
> 
> Note that even if the hash is consistent throughout the lifetime of the
> socket, it is still possible for packets to be routed out of different
> interfaces. This can happen, for example, if one of the nexthop devices
> loses its carrier. This will change the hash thresholds in the ECMP
> group and can cause packets to egress a different interface even if the
> current one is not the one that went down. Obviously packets can also
> change paths due to changes in other routers between you and the
> destination. A network design that results in connections being severed
> every time a flow is routed differently seems fragile to me.
> 
> If you still want to address the issue, then I believe that the correct
> way to do it would be to align tcp_v6_connect() with tcp_v4_connect().
> I'm not sure why they differ, but the IPv4 version will first do a route
> lookup to determine the source address, then allocate a source port and
> only when all the parameters are known it will do a final route lookup
> and cache the result in the socket. IPv6 on the other hand, does a
> single route lookup with an unknown source address and an unknown source
> port.
> 
> This is explained in the comment above ip_route_connect_init() and
> Willem also explained it here:
> 
> https://lore.kernel.org/all/20250424143549.669426-2-willemdebruijn.kernel@gmail.com/
> 
> Willem, do you happen to know why tcp_v6_connect() only performs a
> single route lookup?

I did not fully get to the historical reasons for the differences.
From v1 of that patch:

"Side-quest: I wonder if the second route lookup in ip_route_connect
is vestigial since the introduction of the third route lookup with
ip_route_newports. IPv6 has neither second nor third lookup, which
hints that perhaps both can be removed. "

https://lore.kernel.org/netdev/20250420180537.2973960-2-willemdebruijn.kernel@gmail.com/
 
> Link to the original patch:
> 
> https://lore.kernel.org/netdev/20251005-ipv6-set-saddr-to-prefsrc-before-hash-to-stabilize-ecmp-v1-1-d43b6ef00035@proton.me/
> 
> Thanks

next prev parent reply	other threads:[~2025-10-07 22:25 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-05 20:49 [PATCH net-next] net: ipv6: respect route prfsrc and fill empty saddr before ECMP hash Dmitry Z via B4 Relay
2025-10-06 13:30 ` Ido Schimmel
2025-10-06 18:31   ` Dmitry
2025-10-07 17:04     ` Ido Schimmel
2025-10-07 22:25       ` Willem de Bruijn [this message]
2025-10-08  6:57         ` Ido Schimmel

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=willemdebruijn.kernel.3a8157265761f@gmail.com \
    --to=willemdebruijn.kernel@gmail.com \
    --cc=davem@davemloft.net \
    --cc=demetriousz@proton.me \
    --cc=dsahern@kernel.org \
    --cc=edumazet@google.com \
    --cc=horms@kernel.org \
    --cc=idosch@idosch.org \
    --cc=kuba@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).