From: Kuniyuki Iwashima <kuniyu@amazon.com>
To: <edumazet@google.com>
Cc: <chuck.lever@oracle.com>, <davem@davemloft.net>,
<jlayton@kernel.org>, <keescook@chromium.org>, <kuba@kernel.org>,
<kuni1840@gmail.com>, <kuniyu@amazon.com>,
<linux-fsdevel@vger.kernel.org>, <mcgrof@kernel.org>,
<netdev@vger.kernel.org>, <pabeni@redhat.com>,
<yzaikin@google.com>
Subject: Re: [PATCH v1 net-next 00/13] tcp/udp: Introduce optional per-netns hash table.
Date: Fri, 26 Aug 2022 09:51:44 -0700
Message-ID: <20220826165144.95976-1-kuniyu@amazon.com>
In-Reply-To: <CANn89i+pfVeH0Gs4tFPcZstnfxjz-Vp2D86H5AQsdsR_+p_3qQ@mail.gmail.com>

From: Eric Dumazet <edumazet@google.com>
Date: Fri, 26 Aug 2022 08:17:25 -0700
> On Thu, Aug 25, 2022 at 5:05 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > The more sockets we have in the hash table, the more time we spend
> > looking up the socket. While running a number of small workloads on
> > the same host, they penalise each other and cause performance degradation.
> >
> > Also, the root cause might be a single workload that consumes much more
> > resources than the others. It often happens on a cloud service where
> > different workloads share the same computing resource.
> >
> > On an EC2 c5.24xlarge instance (196 GiB memory and 524288 (1Mi / 2) ehash
> > entries), after running iperf3 in a different netns, creating 24Mi sockets
> > without data transfer in the root netns causes about a 10% performance
> > regression for the iperf3 connection.
> >
> >   thash_entries  sockets  length  Gbps
> >          524288        1       1  50.7
> >                     24Mi      48  45.1
> >
> > It is basically related to the length of the list in each hash bucket.
> > To see how performance drops as the length grows, I set thash_entries
> > to 131072 (1Mi / 8) for testing, and here's the result.
> >
> >   thash_entries  sockets  length  Gbps
> >          131072        1       1  50.7
> >                      1Mi       8  49.9
> >                      2Mi      16  48.9
> >                      4Mi      32  47.3
> >                      8Mi      64  44.6
> >                     16Mi     128  40.6
> >                     24Mi     192  36.3
> >                     32Mi     256  32.5
> >                     40Mi     320  27.0
> >                     48Mi     384  25.0
> >
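The "length" column in both tables is the socket count divided by thash_entries. Below is a minimal Python sketch of that arithmetic, assuming sockets spread uniformly over the buckets. The helper name avg_chain_length is made up for illustration and is not part of the series; the throughput figures are copied from the thash_entries=131072 table.

```python
MI = 1 << 20  # 1Mi = 2**20


def avg_chain_length(sockets: int, thash_entries: int) -> float:
    """Average number of sockets per ehash bucket, assuming a uniform hash."""
    return sockets / thash_entries


# (sockets, Gbps) pairs copied from the thash_entries=131072 table.
MEASUREMENTS = [
    (1 * MI, 49.9), (2 * MI, 48.9), (4 * MI, 47.3), (8 * MI, 44.6),
    (16 * MI, 40.6), (24 * MI, 36.3), (32 * MI, 32.5), (40 * MI, 27.0),
    (48 * MI, 25.0),
]
BASELINE_GBPS = 50.7  # throughput with a single socket in the table

for sockets, gbps in MEASUREMENTS:
    length = avg_chain_length(sockets, 131072)
    drop = (BASELINE_GBPS - gbps) / BASELINE_GBPS * 100
    print(f"{sockets // MI:>2}Mi sockets: length {length:.0f}, "
          f"{gbps:.1f} Gbps ({drop:.1f}% below baseline)")
```

The same helper reproduces the first table: 24Mi sockets over 524288 entries gives a length of 48.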
> > To resolve the socket lookup degradation, we introduce an optional
> > per-netns hash table for TCP and UDP. With a smaller hash table, we
> > can look up sockets faster and isolate noisy neighbours. Also, we can
> > reduce lock contention.
> >
> > We can control and check the hash size via sysctl knobs. It requires
> > some tuning based on workloads, so the per-netns hash table is disabled
> > by default.
> >
> > # dmesg | cut -d ' ' -f 5- | grep "established hash"
> > TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage)
> >
> > # sysctl net.ipv4.tcp_ehash_entries
> > net.ipv4.tcp_ehash_entries = 524288 # can be changed by thash_entries
> >
> > # sysctl net.ipv4.tcp_child_ehash_entries
> > net.ipv4.tcp_child_ehash_entries = 0 # disabled by default
> >
> > # ip netns add test1
> > # ip netns exec test1 sysctl net.ipv4.tcp_ehash_entries
> > net.ipv4.tcp_ehash_entries = -524288 # share the global ehash
> >
> > # sysctl -w net.ipv4.tcp_child_ehash_entries=100
> > net.ipv4.tcp_child_ehash_entries = 100
> >
> > # sysctl net.ipv4.tcp_child_ehash_entries
> > net.ipv4.tcp_child_ehash_entries = 128 # rounded up to 2^n
> >
> > # ip netns add test2
> > # ip netns exec test2 sysctl net.ipv4.tcp_ehash_entries
> > net.ipv4.tcp_ehash_entries = 128 # own per-netns ehash
> >
> > [ UDP has the same interface as udp_hash_entries and
> > udp_child_hash_entries. ]
> >
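Two conventions show up in the transcript above: a requested size is rounded up to a power of two (100 becomes 128), and a negative tcp_ehash_entries means the netns shares the global table. Below is a small Python sketch of both, read off the transcript only. The function names are hypothetical, and this is not the kernel implementation.

```python
def round_up_pow2(n: int) -> int:
    """Smallest power of two that is >= n (for n >= 1)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return 1 << (n - 1).bit_length()


def describe_ehash_entries(value: int) -> str:
    """Interpret a tcp_ehash_entries reading as shown in the transcript.

    Assumption: a negative value means the netns shares the global ehash
    and the magnitude is the global table size; a positive value is the
    size of the table the netns uses as its own.
    """
    if value < 0:
        return f"shares the global ehash ({-value} entries)"
    return f"own ehash ({value} entries)"


print(round_up_pow2(100))
print(describe_ehash_entries(-524288))
print(describe_ehash_entries(128))
```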
> > When creating per-netns hash tables concurrently with different sizes,
> > we can guarantee the size in one of the following ways.
> >
> > 1) Share the global hash table and create per-netns one
> >
> > First, unshare() with tcp_child_ehash_entries==0. It creates dedicated
> > netns sysctl knobs where we can safely change tcp_child_ehash_entries
> > and clone()/unshare() to create a per-netns hash table.
> >
> > 2) Lock the sysctl knob
> >
>
> This is orthogonal.
>
> Your series should have been split in three really.
>
> I do not want to discuss the merit of re-instating LOCK_MAND :/

I see.
I'll drop the flock() part at once and respin only the TCP part in v2.