From: Leon Romanovsky <leon@kernel.org>
To: Jakub Sitnicki <jakub@cloudflare.com>
Cc: netdev@vger.kernel.org, "David S. Miller" <davem@davemloft.net>,
Eric Dumazet <edumazet@google.com>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>,
Kuniyuki Iwashima <kuniyu@amazon.com>,
Neal Cardwell <ncardwell@google.com>,
selinux@vger.kernel.org, Paul Moore <paul@paul-moore.com>,
Stephen Smalley <stephen.smalley.work@gmail.com>,
Eric Paris <eparis@parisplace.org>,
kernel-team@cloudflare.com,
Marek Majkowski <marek@cloudflare.com>
Subject: Re: [PATCH net-next v4 1/2] inet: Add IP_LOCAL_PORT_RANGE socket option
Date: Mon, 23 Jan 2023 19:47:50 +0200
Message-ID: <Y87IRq1ITGcWIh3F@unreal>
In-Reply-To: <20221221-sockopt-port-range-v4-1-d7d2f2561238@cloudflare.com>
On Mon, Jan 23, 2023 at 03:44:39PM +0100, Jakub Sitnicki wrote:
> Users who want to share a single public IP address for outgoing connections
> between several hosts traditionally reach for SNAT. However, SNAT requires
> state keeping on the node(s) performing the NAT.
>
> A stateless alternative exists, where a single IP address used for egress
> can be shared between several hosts by partitioning the available ephemeral
> port range. In such a setup:
>
> 1. Each host gets assigned a disjoint range of ephemeral ports.
> 2. Applications open connections from the host-assigned port range.
> 3. Return traffic gets routed to the host based on both the destination IP
> and the destination port.
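The partitioning scheme in the three steps above can be sketched roughly as follows. This is only an illustration: the helper names, the host labels, and the choice of equal-sized slices are assumptions made for the example and are not part of the patch.

```python
def partition_port_range(lo, hi, hosts):
    """Split the inclusive port range [lo, hi] into equal, disjoint slices.

    Returns a dict mapping each host to its (first, last) port pair.
    Ports left over after the even split are simply not assigned.
    """
    size = (hi - lo + 1) // len(hosts)
    if size == 0:
        raise ValueError("more hosts than ports in the range")
    return {
        host: (lo + i * size, lo + (i + 1) * size - 1)
        for i, host in enumerate(hosts)
    }


def host_for_port(ranges, port):
    """Pick the host that owns `port`, as a router would for return traffic."""
    for host, (first, last) in ranges.items():
        if first <= port <= last:
            return host
    return None


# Hypothetical hosts sharing one egress IP address.
ranges = partition_port_range(60_000, 61_023, ["host-a", "host-b"])
```

With these inputs "host-a" ends up owning 60000-60511, the same range the examples further down assume.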
>
> An application which wants to open an outgoing connection (connect) from a
> given port range today can choose between two solutions:
>
> 1. Manually pick the source port by bind()'ing to it before connect()'ing
> the socket.
>
> This approach has a couple of downsides:
>
> a) The search for a free port has to be implemented in user space. If
> the chosen 4-tuple happens to be busy, the application needs to retry
> from a different local port number.
>
> Detecting whether a 4-tuple is busy can be either easy (TCP) or hard
> (UDP). In the TCP case, the application simply has to check if connect()
> returned an error (EADDRNOTAVAIL). This assumes that local port
> sharing was enabled (REUSEADDR) on all the sockets.
>
> # Assume desired local port range is 60_000-60_511
> s = socket(AF_INET, SOCK_STREAM)
> s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
> s.bind(("192.0.2.1", 60_000))
> s.connect(("1.1.1.1", 53))
> # Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy
> # Application must retry with another local port
>
> In case of UDP, the network stack allows binding more than one socket
> to the same 4-tuple, when local port sharing is enabled
> (REUSEADDR). Hence detecting the conflict is much harder and involves
> querying sock_diag and toggling the REUSEADDR flag [1].
>
> b) For TCP, bind()-ing to a port within the ephemeral port range means
> that no connecting sockets, that is those which leave it to the
> network stack to find a free local port at connect() time, can use
> this port.
>
> IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port
> will be skipped during the free port search at connect() time.
>
> 2. Isolate the app in a dedicated netns and use the per-netns
> ip_local_port_range sysctl to adjust the ephemeral port range bounds.
>
> The per-netns setting affects all sockets, so this approach can be used
> only if:
>
> - there is just one egress IP address, or
> - the desired egress port range is the same for all egress IP addresses
> used by the application.
>
> For TCP, this approach avoids the downsides of (1). Free port search and
> 4-tuple conflict detection are done by the network stack:
>
> system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'")
>
> s = socket(AF_INET, SOCK_STREAM)
> s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1)
> s.bind(("192.0.2.1", 0))
> s.connect(("1.1.1.1", 53))
> # Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy
>
> For UDP, this approach has limited applicability. Setting the
> IP_BIND_ADDRESS_NO_PORT socket option does not result in the local source
> port being shared with other connected UDP sockets.
>
> Hence relying on the network stack to find a free source port limits the
> number of outgoing UDP flows from a single IP address to the number
> of available ephemeral ports.
>
> To put it another way, partitioning the ephemeral port range between hosts
> using the existing Linux networking API is cumbersome.
>
> To address this use case, add a new socket option at the SOL_IP level,
> named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the
> ephemeral port range for each socket individually.
>
> The option can be used only to narrow down the per-netns local port
> range. If the per-socket range lies outside of the per-netns range, the
> latter takes precedence.
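One way to read the narrowing rule above is sketched below. It assumes that each bound is checked on its own, so a per-socket bound that falls outside the per-netns range is ignored and the per-netns bound is used in its place. The function name and the sample ranges are made up for the example. This is a user-space model of the described behaviour, not the kernel code.

```python
def effective_port_range(netns_range, sock_range):
    """Model how a per-socket range narrows the per-netns range.

    Both arguments are inclusive (low, high) pairs. A per-socket bound
    only takes effect if it lies within the per-netns range (assumption:
    bounds are evaluated independently of each other).
    """
    net_lo, net_hi = netns_range
    sk_lo, sk_hi = sock_range
    lo = sk_lo if net_lo <= sk_lo <= net_hi else net_lo
    hi = sk_hi if net_lo <= sk_hi <= net_hi else net_hi
    return lo, hi


NETNS_RANGE = (32_768, 60_999)  # example per-netns setting

narrowed = effective_port_range(NETNS_RANGE, (40_000, 40_511))
outside = effective_port_range(NETNS_RANGE, (1_000, 2_000))
```

Here `narrowed` is the per-socket range itself, while `outside` falls back to the full per-netns range.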
>
> UAPI-wise, the low and high range bounds are passed to the kernel as a pair
> of u16 values in host byte order packed into a u32. This avoids pointer
> passing.
>
> PORT_LO = 40_000
> PORT_HI = 40_511
>
> s = socket(AF_INET, SOCK_STREAM)
> v = struct.pack("I", PORT_HI << 16 | PORT_LO)
> s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v)
> s.bind(("127.0.0.1", 0))
> s.getsockname()
> # Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511),
> # if there is a free port. EADDRINUSE otherwise.
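The u32 encoding used in the snippet above can be wrapped in a pair of small helpers, sketched here. The helper names are hypothetical. The layout (high bound in the upper 16 bits, low bound in the lower 16 bits, native byte order) is taken from the description above.

```python
import struct


def pack_port_range(lo, hi):
    """Pack two u16 port bounds into the u32 option value (host byte order)."""
    if not (0 <= lo <= 0xFFFF and 0 <= hi <= 0xFFFF):
        raise ValueError("port bounds must fit in 16 bits")
    return struct.pack("I", (hi << 16) | lo)


def unpack_port_range(buf):
    """Inverse of pack_port_range(): return the (lo, hi) pair."""
    (value,) = struct.unpack("I", buf)
    return value & 0xFFFF, value >> 16


optval = pack_port_range(40_000, 40_511)
```

The resulting `optval` is what would be passed as the third argument to setsockopt() in the example above.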
>
> [1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116
>
> v3 -> v4:
> * Clarify that u16 values are in host byte order (Neal)
>
> v2 -> v3:
> * Make SCTP bind()/bind_add() respect IP_LOCAL_PORT_RANGE option (Eric)
>
> v1 -> v2:
> * Fix the corner case when the per-socket range doesn't overlap with the
> per-netns range. Fall back correctly to the per-netns range. (Kuniyuki)
Please put the changelog after the "---" trailer, so that it is stripped
when the patch is applied.
Thanks