From: Jakub Sitnicki <jakub@cloudflare.com>
To: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: davem@davemloft.net, edumazet@google.com,
	kernel-team@cloudflare.com, kuba@kernel.org,
	marek@cloudflare.com, netdev@vger.kernel.org, pabeni@redhat.com
Subject: Re: [PATCH net-next v2 1/2] inet: Add IP_LOCAL_PORT_RANGE socket option
Date: Wed, 11 Jan 2023 13:44:03 +0100
Message-ID: <87tu0xckab.fsf@cloudflare.com>
In-Reply-To: <20230111005923.47037-1-kuniyu@amazon.com>

On Wed, Jan 11, 2023 at 09:59 AM +09, Kuniyuki Iwashima wrote:
> From:   Jakub Sitnicki <jakub@cloudflare.com>
> Date:   Tue, 10 Jan 2023 14:37:29 +0100
>> Users who want to share a single public IP address for outgoing connections
>> between several hosts traditionally reach for SNAT. However, SNAT requires
>> state keeping on the node(s) performing the NAT.
>> 
>> A stateless alternative exists, where a single IP address used for egress
>> can be shared between several hosts by partitioning the available ephemeral
>> port range. In such a setup:
>> 
>> 1. Each host gets assigned a disjoint range of ephemeral ports.
>> 2. Applications open connections from the host-assigned port range.
>> 3. Return traffic gets routed to the host based on both the destination IP
>>    and the destination port.
>> 
>> An application which wants to open an outgoing connection (connect) from a
>> given port range today can choose between two solutions:
>> 
>> 1. Manually pick the source port by bind()'ing to it before connect()'ing
>>    the socket.
>> 
>>    This approach has a couple of downsides:
>> 
>>    a) Search for a free port has to be implemented in user space. If
>>       the chosen 4-tuple happens to be busy, the application needs to retry
>>       from a different local port number.
>> 
>>       Detecting if the 4-tuple is busy can be either easy (TCP) or hard
>>       (UDP). In the TCP case, the application simply has to check if connect()
>>       returned an error (EADDRNOTAVAIL). That is assuming that local
>>       port sharing was enabled (REUSEADDR) by all the sockets.
>> 
>>         # Assume desired local port range is 60_000-60_511
>>         s = socket(AF_INET, SOCK_STREAM)
>>         s.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
>>         s.bind(("192.0.2.1", 60_000))
>>         s.connect(("1.1.1.1", 53))
>>         # Fails only if 192.0.2.1:60000 -> 1.1.1.1:53 is busy
>>         # Application must retry with another local port
>> 
>>       In case of UDP, the network stack allows binding more than one socket
>>       to the same 4-tuple, when local port sharing is enabled
>>       (REUSEADDR). Hence detecting the conflict is much harder and involves
>>       querying sock_diag and toggling the REUSEADDR flag [1].
>> 
>>    b) For TCP, bind()-ing to a port within the ephemeral port range means
>>       that no connecting sockets, that is those which leave it to the
>>       network stack to find a free local port at connect() time, can use
>>       this port.
>> 
>>       IOW, the bind hash bucket tb->fastreuse will be 0 or 1, and the port
>>       will be skipped during the free port search at connect() time.
>> 
>> 2. Isolate the app in a dedicated netns and use the per-netns
>>    ip_local_port_range sysctl to adjust the ephemeral port range bounds.
>> 
>>    The per-netns setting affects all sockets, so this approach can be used
>>    only if:
>> 
>>    - there is just one egress IP address, or
>>    - the desired egress port range is the same for all egress IP addresses
>>      used by the application.
>> 
>>    For TCP, this approach avoids the downsides of (1). Free port search and
>>    4-tuple conflict detection are done by the network stack:
>> 
>>      system("sysctl -w net.ipv4.ip_local_port_range='60000 60511'")
>> 
>>      s = socket(AF_INET, SOCK_STREAM)
>>      s.setsockopt(SOL_IP, IP_BIND_ADDRESS_NO_PORT, 1)
>>      s.bind(("192.0.2.1", 0))
>>      s.connect(("1.1.1.1", 53))
>>      # Fails if all 4-tuples 192.0.2.1:60000-60511 -> 1.1.1.1:53 are busy
>> 
>>    For UDP this approach has limited applicability. Setting the
>>    IP_BIND_ADDRESS_NO_PORT socket option does not result in the local
>>    source port being shared with other connected UDP sockets.
>> 
>>    Hence relying on the network stack to find a free source port limits the
>>    number of outgoing UDP flows from a single IP address down to the number
>>    of available ephemeral ports.
>> 
>> To put it another way, partitioning the ephemeral port range between hosts
>> using the existing Linux networking API is cumbersome.
>> 
>> To address this use case, add a new socket option at the SOL_IP level,
>> named IP_LOCAL_PORT_RANGE. The new option can be used to clamp down the
>> ephemeral port range for each socket individually.
>> 
>> The option can be used only to narrow down the per-netns local port
>> range. If the per-socket range lies outside of the per-netns range, the
>> latter takes precedence.
>> 
>> UAPI-wise, the low and high range bounds are passed to the kernel as a pair
>> of u16 values packed into a u32. This avoids pointer passing.
>> 
>>   PORT_LO = 40_000
>>   PORT_HI = 40_511
>> 
>>   s = socket(AF_INET, SOCK_STREAM)
>>   v = struct.pack("I", PORT_HI << 16 | PORT_LO)
>>   s.setsockopt(SOL_IP, IP_LOCAL_PORT_RANGE, v)
>>   s.bind(("127.0.0.1", 0))
>>   s.getsockname()
>>   # Local address between ("127.0.0.1", 40_000) and ("127.0.0.1", 40_511),
>>   # if there is a free port. EADDRINUSE otherwise.
>> 
>> [1] https://github.com/cloudflare/cloudflare-blog/blob/232b432c1d57/2022-02-connectx/connectx.py#L116
>> 
>> v1 -> v2:
>>  * Fix the corner case when the per-socket range doesn't overlap with the
>>    per-netns range. Fall back correctly to the per-netns range. (Kuniyuki)
>> 
>> Reviewed-by: Marek Majkowski <marek@cloudflare.com>
>> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
>> ---

[...]

>>  include/net/ip.h                |  3 ++-
>>  include/uapi/linux/in.h         |  1 +
>>  net/ipv4/inet_connection_sock.c | 25 +++++++++++++++++++++++--
>>  net/ipv4/inet_hashtables.c      |  2 +-
>>  net/ipv4/ip_sockglue.c          | 18 ++++++++++++++++++
>>  net/ipv4/udp.c                  |  2 +-
>>  7 files changed, 50 insertions(+), 5 deletions(-)
>> 
>> diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
>> index bf5654ce711e..51857117ac09 100644
>> --- a/include/net/inet_sock.h
>> +++ b/include/net/inet_sock.h
>> @@ -249,6 +249,10 @@ struct inet_sock {
>>  	__be32			mc_addr;
>>  	struct ip_mc_socklist __rcu	*mc_list;
>>  	struct inet_cork_full	cork;
>> +	struct {
>> +		__u16 lo;
>> +		__u16 hi;
>> +	}			local_port_range;
>>  };
>>  
>>  #define IPCORK_OPT	1	/* ip-options has been held in ipcork.opt */
>> diff --git a/include/net/ip.h b/include/net/ip.h
>> index 144bdfbb25af..c3fffaa92d6e 100644
>> --- a/include/net/ip.h
>> +++ b/include/net/ip.h
>> @@ -340,7 +340,8 @@ static inline u64 snmp_fold_field64(void __percpu *mib, int offt, size_t syncp_o
>>  	} \
>>  }
>>  
>> -void inet_get_local_port_range(struct net *net, int *low, int *high);
>> +void inet_get_local_port_range(const struct net *net, int *low, int *high);
>> +void inet_sk_get_local_port_range(const struct sock *sk, int *low, int *high);
>>  
>>  #ifdef CONFIG_SYSCTL
>>  static inline bool inet_is_local_reserved_port(struct net *net, unsigned short port)
>> diff --git a/include/uapi/linux/in.h b/include/uapi/linux/in.h
>> index 07a4cb149305..4b7f2df66b99 100644
>> --- a/include/uapi/linux/in.h
>> +++ b/include/uapi/linux/in.h
>> @@ -162,6 +162,7 @@ struct in_addr {
>>  #define MCAST_MSFILTER			48
>>  #define IP_MULTICAST_ALL		49
>>  #define IP_UNICAST_IF			50
>> +#define IP_LOCAL_PORT_RANGE		51
>>  
>>  #define MCAST_EXCLUDE	0
>>  #define MCAST_INCLUDE	1
>> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
>> index d1f837579398..1049a9b8d152 100644
>> --- a/net/ipv4/inet_connection_sock.c
>> +++ b/net/ipv4/inet_connection_sock.c
>> @@ -117,7 +117,7 @@ bool inet_rcv_saddr_any(const struct sock *sk)
>>  	return !sk->sk_rcv_saddr;
>>  }
>>  
>> -void inet_get_local_port_range(struct net *net, int *low, int *high)
>> +void inet_get_local_port_range(const struct net *net, int *low, int *high)
>>  {
>>  	unsigned int seq;
>>  
>> @@ -130,6 +130,27 @@ void inet_get_local_port_range(struct net *net, int *low, int *high)
>>  }
>>  EXPORT_SYMBOL(inet_get_local_port_range);
>>  
>> +void inet_sk_get_local_port_range(const struct sock *sk, int *low, int *high)
>> +{
>> +	const struct inet_sock *inet = inet_sk(sk);
>> +	const struct net *net = sock_net(sk);
>> +	int lo, hi, sk_lo, sk_hi;
>> +
>> +	inet_get_local_port_range(net, &lo, &hi);
>> +
>> +	sk_lo = inet->local_port_range.lo;
>> +	sk_hi = inet->local_port_range.hi;
>> +
>> +	if (unlikely(sk_lo && sk_lo <= hi))
>> +		lo = max(lo, sk_lo);
>> +	if (unlikely(sk_hi && sk_hi >= lo))
>> +		hi = min(hi, sk_hi);
>
> nit: The min of sysctl lo/hi is 1, so
>
>         if (unlikely(lo <= sk_lo && sk_lo <= hi))
>                 lo = sk_lo;
>         if (unlikely(lo <= sk_hi && sk_hi <= hi))
>                 hi = sk_hi;
>
> this seems cleaner.

That is much cleaner. Will apply to v3.
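
For reference, the equivalence of the two variants can be checked by brute
force. The following is a minimal Python sketch, not kernel code: the function
names clamp_v2, clamp_suggested and find_mismatches are made up for
illustration, and it assumes the per-netns range always satisfies
1 <= lo <= hi, as the nit above states for the sysctl minimum. A per-socket
bound of 0 is taken to mean "unset".

```python
from itertools import product


def clamp_v2(lo, hi, sk_lo, sk_hi):
    """Clamping as written in the v2 patch; 0 means the bound is unset."""
    if sk_lo and sk_lo <= hi:
        lo = max(lo, sk_lo)
    if sk_hi and sk_hi >= lo:
        hi = min(hi, sk_hi)
    return lo, hi


def clamp_suggested(lo, hi, sk_lo, sk_hi):
    """Clamping as suggested in review; relies on the netns lo being >= 1."""
    if lo <= sk_lo <= hi:
        lo = sk_lo
    if lo <= sk_hi <= hi:
        hi = sk_hi
    return lo, hi


def find_mismatches(max_port=12):
    """Compare both variants over a small, scaled-down port space.

    Per-netns bounds run over 1..max_port with lo <= hi (assumption);
    per-socket bounds run over 0..max_port, including invalid orderings.
    """
    mismatches = []
    for lo, hi in product(range(1, max_port + 1), repeat=2):
        if lo > hi:
            continue
        for sk_lo, sk_hi in product(range(0, max_port + 1), repeat=2):
            a = clamp_v2(lo, hi, sk_lo, sk_hi)
            b = clamp_suggested(lo, hi, sk_lo, sk_hi)
            if a != b:
                mismatches.append((lo, hi, sk_lo, sk_hi, a, b))
    return mismatches


if __name__ == "__main__":
    print("mismatches:", len(find_mismatches()))
```

The assumption about the sysctl minimum matters here: if the per-netns low
bound could be 0, an unset (0) per-socket bound would pass the range check in
the suggested form and the two variants would diverge.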
