All of lore.kernel.org
 help / color / mirror / Atom feed
From: Kuniyuki Iwashima <kuniyu@amazon.com>
To: <edumazet@google.com>
Cc: <davem@davemloft.net>, <eric.dumazet@gmail.com>,
	<jakub@cloudflare.com>, <kuba@kernel.org>,
	<netdev@vger.kernel.org>, <pabeni@redhat.com>,
	"Kuniyuki Iwashima" <kuniyu@amazon.com>
Subject: [PATCH net-next 2/2] tcp/dccp: change source port selection at connect() time
Date: Fri, 15 Dec 2023 10:58:07 +0900	[thread overview]
Message-ID: <20231215015807.39107-1-kuniyu@amazon.com> (raw)
In-Reply-To: <20231214192939.1962891-3-edumazet@google.com>

From: Eric Dumazet <edumazet@google.com>
Date: Thu, 14 Dec 2023 19:29:39 +0000
> In commit 1580ab63fc9a ("tcp/dccp: better use of ephemeral ports in connect()")
> we added an heuristic to select even ports for connect() and odd ports for bind().
> 
> This was nice because no applications changes were needed.
> 
> But it added more costs when all even ports are in use,
> when there are few listeners and many active connections.
> 
> Since then, IP_LOCAL_PORT_RANGE has been added to permit an application
> to partition ephemeral port range at will.
> 
> This patch extends the idea so that if IP_LOCAL_PORT_RANGE is set on
> a socket before accept(), port selection no longer favors even ports.
> 
> This means that connect() can find a suitable source port faster,
> and applications can use a different split between connect() and bind()
> users.
> 
> This should give more entropy to Toeplitz hash used in RSS: Using even
> ports was wasting one bit from the 16bit sport.
> 
> A similar change can be done in inet_csk_find_open_port() if needed.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>


> Cc: Jakub Sitnicki <jakub@cloudflare.com>
> ---
>  net/ipv4/inet_hashtables.c | 27 ++++++++++++++++-----------
>  1 file changed, 16 insertions(+), 11 deletions(-)
> 
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index a532f749e47781cc951f2003f621cec4387a2384..9ff201bc4e6d2da04735e8c160d446602e0adde1 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -1012,7 +1012,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
>  	bool tb_created = false;
>  	u32 remaining, offset;
>  	int ret, i, low, high;
> -	int l3mdev;
> +	bool local_ports;
> +	int step, l3mdev;
>  	u32 index;
>  
>  	if (port) {
> @@ -1024,10 +1025,12 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
>  
>  	l3mdev = inet_sk_bound_l3mdev(sk);
>  
> -	inet_sk_get_local_port_range(sk, &low, &high);
> +	local_ports = inet_sk_get_local_port_range(sk, &low, &high);
> +	step = local_ports ? 1 : 2;
> +
>  	high++; /* [32768, 60999] -> [32768, 61000[ */
>  	remaining = high - low;
> -	if (likely(remaining > 1))
> +	if (!local_ports && remaining > 1)
>  		remaining &= ~1U;
>  
>  	get_random_sleepable_once(table_perturb,
> @@ -1040,10 +1043,11 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
>  	/* In first pass we try ports of @low parity.
>  	 * inet_csk_get_port() does the opposite choice.
>  	 */
> -	offset &= ~1U;
> +	if (!local_ports)
> +		offset &= ~1U;
>  other_parity_scan:
>  	port = low + offset;
> -	for (i = 0; i < remaining; i += 2, port += 2) {
> +	for (i = 0; i < remaining; i += step, port += step) {
>  		if (unlikely(port >= high))
>  			port -= remaining;
>  		if (inet_is_local_reserved_port(net, port))
> @@ -1083,10 +1087,11 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
>  		cond_resched();
>  	}
>  
> -	offset++;
> -	if ((offset & 1) && remaining > 1)
> -		goto other_parity_scan;
> -
> +	if (!local_ports) {
> +		offset++;
> +		if ((offset & 1) && remaining > 1)
> +			goto other_parity_scan;
> +	}
>  	return -EADDRNOTAVAIL;
>  
>  ok:
> @@ -1109,8 +1114,8 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
>  	 * on low contention the randomness is maximal and on high contention
>  	 * it may be inexistent.
>  	 */
> -	i = max_t(int, i, get_random_u32_below(8) * 2);
> -	WRITE_ONCE(table_perturb[index], READ_ONCE(table_perturb[index]) + i + 2);
> +	i = max_t(int, i, get_random_u32_below(8) * step);
> +	WRITE_ONCE(table_perturb[index], READ_ONCE(table_perturb[index]) + i + step);
>  
>  	/* Head lock still held and bh's disabled */
>  	inet_bind_hash(sk, tb, tb2, port);
> -- 
> 2.43.0.472.g3155946c3a-goog

  reply	other threads:[~2023-12-15  1:58 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-12-14 19:29 [PATCH net-next 0/2] tcp/dccp: refine source port selection Eric Dumazet
2023-12-14 19:29 ` [PATCH net-next 1/2] inet: returns a bool from inet_sk_get_local_port_range() Eric Dumazet
2023-12-15  1:50   ` Kuniyuki Iwashima
2023-12-14 19:29 ` [PATCH net-next 2/2] tcp/dccp: change source port selection at connect() time Eric Dumazet
2023-12-15  1:58   ` Kuniyuki Iwashima [this message]
2023-12-15  2:26   ` Jason Xing
2023-12-16  2:10 ` [PATCH net-next 0/2] tcp/dccp: refine source port selection patchwork-bot+netdevbpf
2024-01-03 14:17 ` Jakub Sitnicki
2024-01-03 16:48   ` Eric Dumazet

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20231215015807.39107-1-kuniyu@amazon.com \
    --to=kuniyu@amazon.com \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=eric.dumazet@gmail.com \
    --cc=jakub@cloudflare.com \
    --cc=kuba@kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.