* [PATCH v2 net] af_unix: Do not call kmemdup() for init_net's sysctl table.
@ 2022-06-26 8:23 Kuniyuki Iwashima
2022-06-26 16:43 ` Eric W. Biederman
0 siblings, 1 reply; 12+ messages in thread
From: Kuniyuki Iwashima @ 2022-06-26 8:23 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: Eric W. Biederman, Pavel Emelyanov, Herbert Xu, Kuniyuki Iwashima,
Kuniyuki Iwashima, netdev
While setting up init_net's sysctl table, we need not duplicate the global
table and can use it directly.
Fixes: 1597fbc0faf8 ("[UNIX]: Make the unix sysctl tables per-namespace")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
v2:
* Fix NULL comparison style by checkpatch.pl
v1: https://lore.kernel.org/all/20220626074454.28944-1-kuniyu@amazon.com/
---
---
net/unix/sysctl_net_unix.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)
diff --git a/net/unix/sysctl_net_unix.c b/net/unix/sysctl_net_unix.c
index 01d44e2598e2..4bd856d05135 100644
--- a/net/unix/sysctl_net_unix.c
+++ b/net/unix/sysctl_net_unix.c
@@ -26,11 +26,16 @@ int __net_init unix_sysctl_register(struct net *net)
{
struct ctl_table *table;
- table = kmemdup(unix_table, sizeof(unix_table), GFP_KERNEL);
- if (table == NULL)
- goto err_alloc;
+ if (net_eq(net, &init_net)) {
+ table = unix_table;
+ } else {
+ table = kmemdup(unix_table, sizeof(unix_table), GFP_KERNEL);
+ if (!table)
+ goto err_alloc;
+
+ table[0].data = &net->unx.sysctl_max_dgram_qlen;
+ }
- table[0].data = &net->unx.sysctl_max_dgram_qlen;
net->unx.ctl = register_net_sysctl(net, "net/unix", table);
if (net->unx.ctl == NULL)
goto err_reg;
@@ -38,7 +43,8 @@ int __net_init unix_sysctl_register(struct net *net)
return 0;
err_reg:
- kfree(table);
+ if (net_eq(net, &init_net))
+ kfree(table);
err_alloc:
return -ENOMEM;
}
--
2.30.2
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH v2 net] af_unix: Do not call kmemdup() for init_net's sysctl table.
2022-06-26 8:23 [PATCH v2 net] af_unix: Do not call kmemdup() for init_net's sysctl table Kuniyuki Iwashima
@ 2022-06-26 16:43 ` Eric W. Biederman
2022-06-27 17:58 ` Jakub Kicinski
0 siblings, 1 reply; 12+ messages in thread
From: Eric W. Biederman @ 2022-06-26 16:43 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Pavel Emelyanov, Herbert Xu, Kuniyuki Iwashima, netdev
Kuniyuki Iwashima <kuniyu@amazon.com> writes:
> While setting up init_net's sysctl table, we need not duplicate the global
> table and can use it directly.
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
I am not quite certain the savings of a single entry table justivies
the complexity. But the looks correct.
Eric
>
> Fixes: 1597fbc0faf8 ("[UNIX]: Make the unix sysctl tables per-namespace")
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> ---
> v2:
> * Fix NULL comparison style by checkpatch.pl
>
> v1: https://lore.kernel.org/all/20220626074454.28944-1-kuniyu@amazon.com/
> ---
> ---
> net/unix/sysctl_net_unix.c | 16 +++++++++++-----
> 1 file changed, 11 insertions(+), 5 deletions(-)
>
> diff --git a/net/unix/sysctl_net_unix.c b/net/unix/sysctl_net_unix.c
> index 01d44e2598e2..4bd856d05135 100644
> --- a/net/unix/sysctl_net_unix.c
> +++ b/net/unix/sysctl_net_unix.c
> @@ -26,11 +26,16 @@ int __net_init unix_sysctl_register(struct net *net)
> {
> struct ctl_table *table;
>
> - table = kmemdup(unix_table, sizeof(unix_table), GFP_KERNEL);
> - if (table == NULL)
> - goto err_alloc;
> + if (net_eq(net, &init_net)) {
> + table = unix_table;
> + } else {
> + table = kmemdup(unix_table, sizeof(unix_table), GFP_KERNEL);
> + if (!table)
> + goto err_alloc;
> +
> + table[0].data = &net->unx.sysctl_max_dgram_qlen;
> + }
>
> - table[0].data = &net->unx.sysctl_max_dgram_qlen;
> net->unx.ctl = register_net_sysctl(net, "net/unix", table);
> if (net->unx.ctl == NULL)
> goto err_reg;
> @@ -38,7 +43,8 @@ int __net_init unix_sysctl_register(struct net *net)
> return 0;
>
> err_reg:
> - kfree(table);
> + if (net_eq(net, &init_net))
> + kfree(table);
> err_alloc:
> return -ENOMEM;
> }
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 net] af_unix: Do not call kmemdup() for init_net's sysctl table.
2022-06-26 16:43 ` Eric W. Biederman
@ 2022-06-27 17:58 ` Jakub Kicinski
2022-06-27 18:30 ` Kuniyuki Iwashima
0 siblings, 1 reply; 12+ messages in thread
From: Jakub Kicinski @ 2022-06-27 17:58 UTC (permalink / raw)
To: Eric W. Biederman, Kuniyuki Iwashima
Cc: David S. Miller, Eric Dumazet, Paolo Abeni, Pavel Emelyanov,
Herbert Xu, Kuniyuki Iwashima, netdev
On Sun, 26 Jun 2022 11:43:27 -0500 Eric W. Biederman wrote:
> Kuniyuki Iwashima <kuniyu@amazon.com> writes:
>
> > While setting up init_net's sysctl table, we need not duplicate the global
> > table and can use it directly.
>
> Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
>
> I am not quite certain the savings of a single entry table justivies
> the complexity. But the looks correct.
Yeah, the commit message is a little sparse. The "why" is not addressed.
Could you add more details to explain the motivation?
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 net] af_unix: Do not call kmemdup() for init_net's sysctl table.
2022-06-27 17:58 ` Jakub Kicinski
@ 2022-06-27 18:30 ` Kuniyuki Iwashima
2022-06-27 18:40 ` Eric Dumazet
0 siblings, 1 reply; 12+ messages in thread
From: Kuniyuki Iwashima @ 2022-06-27 18:30 UTC (permalink / raw)
To: kuba
Cc: davem, ebiederm, edumazet, herbert, kuni1840, kuniyu, netdev,
pabeni, xemul
From: Jakub Kicinski <kuba@kernel.org>
Date: Mon, 27 Jun 2022 10:58:59 -0700
> On Sun, 26 Jun 2022 11:43:27 -0500 Eric W. Biederman wrote:
> > Kuniyuki Iwashima <kuniyu@amazon.com> writes:
> >
> > > While setting up init_net's sysctl table, we need not duplicate the global
> > > table and can use it directly.
> >
> > Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> >
> > I am not quite certain the savings of a single entry table justivies
> > the complexity. But the looks correct.
>
> Yeah, the commit message is a little sparse. The "why" is not addressed.
> Could you add more details to explain the motivation?
I was working on a series which converts UDP/TCP hash tables into per-netns
ones like AF_UNIX to speed up looking up sockets. It will consume much
memory on a host with thousands of netns, but it can be waste if we do not
have its protocol family's sockets.
So, I'm now working on a follow-up series for AF_UNIX per-netns hash table
so that we can change the size for a child netns by a sysctl knob:
# sysctl -w net.unix.child_hash_entries=128
# ip net add test # created with the hash table size 128
# ip net exec test sh
# sysctl net.unix.hash_entries # read-only
128
(The size for init_net can be changed via a new boot parameter
xhash_entries like uhash_entries/thash_entries.)
While implementing that, I found that kmemdup() is called for init_net but
TCP/UDP does not (See: ipv4_sysctl_init_net()). Unlike IPv4, AF_UNIX does
not have a huge sysctl table, so it cannot be a problem though, this patch
is for consuming less memory and kind of consistency. The reason I submit
this seperately is that it might be better to have a Fixes tag.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 net] af_unix: Do not call kmemdup() for init_net's sysctl table.
2022-06-27 18:30 ` Kuniyuki Iwashima
@ 2022-06-27 18:40 ` Eric Dumazet
2022-06-27 18:58 ` Kuniyuki Iwashima
0 siblings, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2022-06-27 18:40 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: Jakub Kicinski, David Miller, Eric W. Biederman, Herbert Xu,
Kuniyuki Iwashima, netdev, Paolo Abeni, Pavel Emelyanov
On Mon, Jun 27, 2022 at 8:30 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: Jakub Kicinski <kuba@kernel.org>
> Date: Mon, 27 Jun 2022 10:58:59 -0700
> > On Sun, 26 Jun 2022 11:43:27 -0500 Eric W. Biederman wrote:
> > > Kuniyuki Iwashima <kuniyu@amazon.com> writes:
> > >
> > > > While setting up init_net's sysctl table, we need not duplicate the global
> > > > table and can use it directly.
> > >
> > > Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> > >
> > > I am not quite certain the savings of a single entry table justivies
> > > the complexity. But the looks correct.
> >
> > Yeah, the commit message is a little sparse. The "why" is not addressed.
> > Could you add more details to explain the motivation?
>
> I was working on a series which converts UDP/TCP hash tables into per-netns
> ones like AF_UNIX to speed up looking up sockets. It will consume much
> memory on a host with thousands of netns, but it can be waste if we do not
> have its protocol family's sockets.
For the record, I doubt we will accept such a patch (per net-ns
TCP/UDP hash tables)
>
> So, I'm now working on a follow-up series for AF_UNIX per-netns hash table
> so that we can change the size for a child netns by a sysctl knob:
>
> # sysctl -w net.unix.child_hash_entries=128
> # ip net add test # created with the hash table size 128
> # ip net exec test sh
> # sysctl net.unix.hash_entries # read-only
> 128
>
> (The size for init_net can be changed via a new boot parameter
> xhash_entries like uhash_entries/thash_entries.)
>
> While implementing that, I found that kmemdup() is called for init_net but
> TCP/UDP does not (See: ipv4_sysctl_init_net()). Unlike IPv4, AF_UNIX does
> not have a huge sysctl table, so it cannot be a problem though, this patch
> is for consuming less memory and kind of consistency. The reason I submit
> this seperately is that it might be better to have a Fixes tag.
I think that af_unix module can be unloaded.
Your patch will break the module unload operation.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 net] af_unix: Do not call kmemdup() for init_net's sysctl table.
2022-06-27 18:40 ` Eric Dumazet
@ 2022-06-27 18:58 ` Kuniyuki Iwashima
2022-06-27 19:06 ` Eric Dumazet
0 siblings, 1 reply; 12+ messages in thread
From: Kuniyuki Iwashima @ 2022-06-27 18:58 UTC (permalink / raw)
To: edumazet
Cc: davem, ebiederm, herbert, kuba, kuni1840, kuniyu, netdev, pabeni,
xemul
From: Eric Dumazet <edumazet@google.com>
Date: Mon, 27 Jun 2022 20:40:24 +0200
> On Mon, Jun 27, 2022 at 8:30 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > From: Jakub Kicinski <kuba@kernel.org>
> > Date: Mon, 27 Jun 2022 10:58:59 -0700
> > > On Sun, 26 Jun 2022 11:43:27 -0500 Eric W. Biederman wrote:
> > > > Kuniyuki Iwashima <kuniyu@amazon.com> writes:
> > > >
> > > > > While setting up init_net's sysctl table, we need not duplicate the global
> > > > > table and can use it directly.
> > > >
> > > > Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> > > >
> > > > I am not quite certain the savings of a single entry table justivies
> > > > the complexity. But the looks correct.
> > >
> > > Yeah, the commit message is a little sparse. The "why" is not addressed.
> > > Could you add more details to explain the motivation?
> >
> > I was working on a series which converts UDP/TCP hash tables into per-netns
> > ones like AF_UNIX to speed up looking up sockets. It will consume much
> > memory on a host with thousands of netns, but it can be waste if we do not
> > have its protocol family's sockets.
>
> For the record, I doubt we will accept such a patch (per net-ns
> TCP/UDP hash tables)
Is it because it's risky?
IIRC, you said we need per netns table for TCP in the future.
> > So, I'm now working on a follow-up series for AF_UNIX per-netns hash table
> > so that we can change the size for a child netns by a sysctl knob:
> >
> > # sysctl -w net.unix.child_hash_entries=128
> > # ip net add test # created with the hash table size 128
> > # ip net exec test sh
> > # sysctl net.unix.hash_entries # read-only
> > 128
> >
> > (The size for init_net can be changed via a new boot parameter
> > xhash_entries like uhash_entries/thash_entries.)
> >
> > While implementing that, I found that kmemdup() is called for init_net but
> > TCP/UDP does not (See: ipv4_sysctl_init_net()). Unlike IPv4, AF_UNIX does
> > not have a huge sysctl table, so it cannot be a problem though, this patch
> > is for consuming less memory and kind of consistency. The reason I submit
> > this seperately is that it might be better to have a Fixes tag.
>
> I think that af_unix module can be unloaded.
>
> Your patch will break the module unload operation.
Thank you!
I had to take of kfree() in unix_sysctl_unregister().
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 net] af_unix: Do not call kmemdup() for init_net's sysctl table.
2022-06-27 18:58 ` Kuniyuki Iwashima
@ 2022-06-27 19:06 ` Eric Dumazet
2022-06-27 19:15 ` Kuniyuki Iwashima
0 siblings, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2022-06-27 19:06 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: David Miller, Eric W. Biederman, Herbert Xu, Jakub Kicinski,
Kuniyuki Iwashima, netdev, Paolo Abeni, Pavel Emelyanov
On Mon, Jun 27, 2022 at 8:59 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: Eric Dumazet <edumazet@google.com>
> Date: Mon, 27 Jun 2022 20:40:24 +0200
> > On Mon, Jun 27, 2022 at 8:30 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > >
> > > From: Jakub Kicinski <kuba@kernel.org>
> > > Date: Mon, 27 Jun 2022 10:58:59 -0700
> > > > On Sun, 26 Jun 2022 11:43:27 -0500 Eric W. Biederman wrote:
> > > > > Kuniyuki Iwashima <kuniyu@amazon.com> writes:
> > > > >
> > > > > > While setting up init_net's sysctl table, we need not duplicate the global
> > > > > > table and can use it directly.
> > > > >
> > > > > Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> > > > >
> > > > > I am not quite certain the savings of a single entry table justivies
> > > > > the complexity. But the looks correct.
> > > >
> > > > Yeah, the commit message is a little sparse. The "why" is not addressed.
> > > > Could you add more details to explain the motivation?
> > >
> > > I was working on a series which converts UDP/TCP hash tables into per-netns
> > > ones like AF_UNIX to speed up looking up sockets. It will consume much
> > > memory on a host with thousands of netns, but it can be waste if we do not
> > > have its protocol family's sockets.
> >
> > For the record, I doubt we will accept such a patch (per net-ns
> > TCP/UDP hash tables)
>
> Is it because it's risky?
Because it will be very expensive. TCP hash tables are quite big.
[ 4.917080] tcp_listen_portaddr_hash hash table entries: 65536
(order: 8, 1048576 bytes, vmalloc)
[ 4.917260] TCP established hash table entries: 524288 (order: 10,
4194304 bytes, vmalloc hugepage)
[ 4.917760] TCP bind hash table entries: 65536 (order: 8, 1048576
bytes, vmalloc)
[ 4.917881] TCP: Hash tables configured (established 524288 bind 65536)
> IIRC, you said we need per netns table for TCP in the future.
Which ones exactly ? I guess you misunderstood.
>
>
> > > So, I'm now working on a follow-up series for AF_UNIX per-netns hash table
> > > so that we can change the size for a child netns by a sysctl knob:
> > >
> > > # sysctl -w net.unix.child_hash_entries=128
> > > # ip net add test # created with the hash table size 128
> > > # ip net exec test sh
> > > # sysctl net.unix.hash_entries # read-only
> > > 128
> > >
> > > (The size for init_net can be changed via a new boot parameter
> > > xhash_entries like uhash_entries/thash_entries.)
> > >
> > > While implementing that, I found that kmemdup() is called for init_net but
> > > TCP/UDP does not (See: ipv4_sysctl_init_net()). Unlike IPv4, AF_UNIX does
> > > not have a huge sysctl table, so it cannot be a problem though, this patch
> > > is for consuming less memory and kind of consistency. The reason I submit
> > > this seperately is that it might be better to have a Fixes tag.
> >
> > I think that af_unix module can be unloaded.
> >
> > Your patch will break the module unload operation.
>
> Thank you!
> I had to take of kfree() in unix_sysctl_unregister().
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 net] af_unix: Do not call kmemdup() for init_net's sysctl table.
2022-06-27 19:06 ` Eric Dumazet
@ 2022-06-27 19:15 ` Kuniyuki Iwashima
2022-06-27 19:36 ` Eric Dumazet
0 siblings, 1 reply; 12+ messages in thread
From: Kuniyuki Iwashima @ 2022-06-27 19:15 UTC (permalink / raw)
To: edumazet
Cc: davem, ebiederm, herbert, kuba, kuni1840, kuniyu, netdev, pabeni,
xemul
From: Eric Dumazet <edumazet@google.com>
Date: Mon, 27 Jun 2022 21:06:14 +0200
> On Mon, Jun 27, 2022 at 8:59 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > From: Eric Dumazet <edumazet@google.com>
> > Date: Mon, 27 Jun 2022 20:40:24 +0200
> > > On Mon, Jun 27, 2022 at 8:30 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > >
> > > > From: Jakub Kicinski <kuba@kernel.org>
> > > > Date: Mon, 27 Jun 2022 10:58:59 -0700
> > > > > On Sun, 26 Jun 2022 11:43:27 -0500 Eric W. Biederman wrote:
> > > > > > Kuniyuki Iwashima <kuniyu@amazon.com> writes:
> > > > > >
> > > > > > > While setting up init_net's sysctl table, we need not duplicate the global
> > > > > > > table and can use it directly.
> > > > > >
> > > > > > Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> > > > > >
> > > > > > I am not quite certain the savings of a single entry table justivies
> > > > > > the complexity. But the looks correct.
> > > > >
> > > > > Yeah, the commit message is a little sparse. The "why" is not addressed.
> > > > > Could you add more details to explain the motivation?
> > > >
> > > > I was working on a series which converts UDP/TCP hash tables into per-netns
> > > > ones like AF_UNIX to speed up looking up sockets. It will consume much
> > > > memory on a host with thousands of netns, but it can be waste if we do not
> > > > have its protocol family's sockets.
> > >
> > > For the record, I doubt we will accept such a patch (per net-ns
> > > TCP/UDP hash tables)
> >
> > Is it because it's risky?
>
> Because it will be very expensive. TCP hash tables are quite big.
Yes, so I'm wondering if changing the size by sysctl makes sense. If we
have per-netns hash tables, each table should have smaller amount of
sockets and smaller size should be enough, I think.
>
> [ 4.917080] tcp_listen_portaddr_hash hash table entries: 65536
> (order: 8, 1048576 bytes, vmalloc)
> [ 4.917260] TCP established hash table entries: 524288 (order: 10,
> 4194304 bytes, vmalloc hugepage)
> [ 4.917760] TCP bind hash table entries: 65536 (order: 8, 1048576
> bytes, vmalloc)
> [ 4.917881] TCP: Hash tables configured (established 524288 bind 65536)
>
>
>
> > IIRC, you said we need per netns table for TCP in the future.
>
> Which ones exactly ? I guess you misunderstood.
I think this.
https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=04c494e68a13
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 net] af_unix: Do not call kmemdup() for init_net's sysctl table.
2022-06-27 19:15 ` Kuniyuki Iwashima
@ 2022-06-27 19:36 ` Eric Dumazet
2022-06-27 19:59 ` Kuniyuki Iwashima
0 siblings, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2022-06-27 19:36 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: David Miller, Eric W. Biederman, Herbert Xu, Jakub Kicinski,
Kuniyuki Iwashima, netdev, Paolo Abeni, Pavel Emelyanov
On Mon, Jun 27, 2022 at 9:16 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: Eric Dumazet <edumazet@google.com>
> Date: Mon, 27 Jun 2022 21:06:14 +0200
> > On Mon, Jun 27, 2022 at 8:59 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > >
> > > From: Eric Dumazet <edumazet@google.com>
> > > Date: Mon, 27 Jun 2022 20:40:24 +0200
> > > > On Mon, Jun 27, 2022 at 8:30 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > >
> > > > > From: Jakub Kicinski <kuba@kernel.org>
> > > > > Date: Mon, 27 Jun 2022 10:58:59 -0700
> > > > > > On Sun, 26 Jun 2022 11:43:27 -0500 Eric W. Biederman wrote:
> > > > > > > Kuniyuki Iwashima <kuniyu@amazon.com> writes:
> > > > > > >
> > > > > > > > While setting up init_net's sysctl table, we need not duplicate the global
> > > > > > > > table and can use it directly.
> > > > > > >
> > > > > > > Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> > > > > > >
> > > > > > > I am not quite certain the savings of a single entry table justivies
> > > > > > > the complexity. But the looks correct.
> > > > > >
> > > > > > Yeah, the commit message is a little sparse. The "why" is not addressed.
> > > > > > Could you add more details to explain the motivation?
> > > > >
> > > > > I was working on a series which converts UDP/TCP hash tables into per-netns
> > > > > ones like AF_UNIX to speed up looking up sockets. It will consume much
> > > > > memory on a host with thousands of netns, but it can be waste if we do not
> > > > > have its protocol family's sockets.
> > > >
> > > > For the record, I doubt we will accept such a patch (per net-ns
> > > > TCP/UDP hash tables)
> > >
> > > Is it because it's risky?
> >
> > Because it will be very expensive. TCP hash tables are quite big.
>
> Yes, so I'm wondering if changing the size by sysctl makes sense. If we
> have per-netns hash tables, each table should have smaller amount of
> sockets and smaller size should be enough, I think.
How can a sysctl be safely used if two different threads call "unshare
-n" at the same time ?
>
> >
> > [ 4.917080] tcp_listen_portaddr_hash hash table entries: 65536
> > (order: 8, 1048576 bytes, vmalloc)
> > [ 4.917260] TCP established hash table entries: 524288 (order: 10,
> > 4194304 bytes, vmalloc hugepage)
> > [ 4.917760] TCP bind hash table entries: 65536 (order: 8, 1048576
> > bytes, vmalloc)
> > [ 4.917881] TCP: Hash tables configured (established 524288 bind 65536)
> >
> >
> >
> > > IIRC, you said we need per netns table for TCP in the future.
> >
> > Which ones exactly ? I guess you misunderstood.
>
> I think this.
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=04c494e68a13
"might" is very different than "will"
I would rather use the list of time_wait, instead of adding huge
memory costs for hosts with hundreds of netns.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 net] af_unix: Do not call kmemdup() for init_net's sysctl table.
2022-06-27 19:36 ` Eric Dumazet
@ 2022-06-27 19:59 ` Kuniyuki Iwashima
2022-06-27 20:04 ` Eric Dumazet
0 siblings, 1 reply; 12+ messages in thread
From: Kuniyuki Iwashima @ 2022-06-27 19:59 UTC (permalink / raw)
To: edumazet
Cc: davem, ebiederm, herbert, kuba, kuni1840, kuniyu, netdev, pabeni,
xemul
From: Eric Dumazet <edumazet@google.com>
Date: Mon, 27 Jun 2022 21:36:18 +0200
> On Mon, Jun 27, 2022 at 9:16 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > From: Eric Dumazet <edumazet@google.com>
> > Date: Mon, 27 Jun 2022 21:06:14 +0200
> > > On Mon, Jun 27, 2022 at 8:59 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > >
> > > > From: Eric Dumazet <edumazet@google.com>
> > > > Date: Mon, 27 Jun 2022 20:40:24 +0200
> > > > > On Mon, Jun 27, 2022 at 8:30 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > > >
> > > > > > From: Jakub Kicinski <kuba@kernel.org>
> > > > > > Date: Mon, 27 Jun 2022 10:58:59 -0700
> > > > > > > On Sun, 26 Jun 2022 11:43:27 -0500 Eric W. Biederman wrote:
> > > > > > > > Kuniyuki Iwashima <kuniyu@amazon.com> writes:
> > > > > > > >
> > > > > > > > > While setting up init_net's sysctl table, we need not duplicate the global
> > > > > > > > > table and can use it directly.
> > > > > > > >
> > > > > > > > Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> > > > > > > >
> > > > > > > > I am not quite certain the savings of a single entry table justivies
> > > > > > > > the complexity. But the looks correct.
> > > > > > >
> > > > > > > Yeah, the commit message is a little sparse. The "why" is not addressed.
> > > > > > > Could you add more details to explain the motivation?
> > > > > >
> > > > > > I was working on a series which converts UDP/TCP hash tables into per-netns
> > > > > > ones like AF_UNIX to speed up looking up sockets. It will consume much
> > > > > > memory on a host with thousands of netns, but it can be waste if we do not
> > > > > > have its protocol family's sockets.
> > > > >
> > > > > For the record, I doubt we will accept such a patch (per net-ns
> > > > > TCP/UDP hash tables)
> > > >
> > > > Is it because it's risky?
> > >
> > > Because it will be very expensive. TCP hash tables are quite big.
> >
> > Yes, so I'm wondering if changing the size by sysctl makes sense. If we
> > have per-netns hash tables, each table should have smaller amount of
> > sockets and smaller size should be enough, I think.
>
> How can a sysctl be safely used if two different threads call "unshare
> -n" at the same time ?
I think two unshare are safe. Each of them reads its parent netns's sysctl
knob. Even when the parent is the same, they can read the same value.
But I think we need READ_ONCE()/WRITE_ONCE() in such a sysctl. While
creating a child netns, another one can change the value and there can be
a data-race. So we have to use custome handler and pass write/read handler
as conv of do_proc_douintvec(), like do_proc_douintvec_conv_lockless().
If there are some sysctls missing READ_ONCE/WRITE_ONCE(), I will add
more general one, proc_douintvec_lockless().
> > > [ 4.917080] tcp_listen_portaddr_hash hash table entries: 65536
> > > (order: 8, 1048576 bytes, vmalloc)
> > > [ 4.917260] TCP established hash table entries: 524288 (order: 10,
> > > 4194304 bytes, vmalloc hugepage)
> > > [ 4.917760] TCP bind hash table entries: 65536 (order: 8, 1048576
> > > bytes, vmalloc)
> > > [ 4.917881] TCP: Hash tables configured (established 524288 bind 65536)
> > >
> > >
> > >
> > > > IIRC, you said we need per netns table for TCP in the future.
> > >
> > > Which ones exactly ? I guess you misunderstood.
> >
> > I think this.
> > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=04c494e68a13
>
> "might" is very different than "will"
>
> I would rather use the list of time_wait, instead of adding huge
> memory costs for hosts with hundreds of netns.
Sorry, my bad.
I would give it a try only for TIME_WAIT.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 net] af_unix: Do not call kmemdup() for init_net's sysctl table.
2022-06-27 19:59 ` Kuniyuki Iwashima
@ 2022-06-27 20:04 ` Eric Dumazet
2022-06-27 20:18 ` Kuniyuki Iwashima
0 siblings, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2022-06-27 20:04 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: David Miller, Eric W. Biederman, Herbert Xu, Jakub Kicinski,
Kuniyuki Iwashima, netdev, Paolo Abeni, Pavel Emelyanov
On Mon, Jun 27, 2022 at 10:00 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: Eric Dumazet <edumazet@google.com>
> Date: Mon, 27 Jun 2022 21:36:18 +0200
> > On Mon, Jun 27, 2022 at 9:16 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > >
> > > From: Eric Dumazet <edumazet@google.com>
> > > Date: Mon, 27 Jun 2022 21:06:14 +0200
> > > > On Mon, Jun 27, 2022 at 8:59 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > >
> > > > > From: Eric Dumazet <edumazet@google.com>
> > > > > Date: Mon, 27 Jun 2022 20:40:24 +0200
> > > > > > On Mon, Jun 27, 2022 at 8:30 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > > > >
> > > > > > > From: Jakub Kicinski <kuba@kernel.org>
> > > > > > > Date: Mon, 27 Jun 2022 10:58:59 -0700
> > > > > > > > On Sun, 26 Jun 2022 11:43:27 -0500 Eric W. Biederman wrote:
> > > > > > > > > Kuniyuki Iwashima <kuniyu@amazon.com> writes:
> > > > > > > > >
> > > > > > > > > > While setting up init_net's sysctl table, we need not duplicate the global
> > > > > > > > > > table and can use it directly.
> > > > > > > > >
> > > > > > > > > Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> > > > > > > > >
> > > > > > > > > I am not quite certain the savings of a single entry table justivies
> > > > > > > > > the complexity. But the looks correct.
> > > > > > > >
> > > > > > > > Yeah, the commit message is a little sparse. The "why" is not addressed.
> > > > > > > > Could you add more details to explain the motivation?
> > > > > > >
> > > > > > > I was working on a series which converts UDP/TCP hash tables into per-netns
> > > > > > > ones like AF_UNIX to speed up looking up sockets. It will consume much
> > > > > > > memory on a host with thousands of netns, but it can be waste if we do not
> > > > > > > have its protocol family's sockets.
> > > > > >
> > > > > > For the record, I doubt we will accept such a patch (per net-ns
> > > > > > TCP/UDP hash tables)
> > > > >
> > > > > Is it because it's risky?
> > > >
> > > > Because it will be very expensive. TCP hash tables are quite big.
> > >
> > > Yes, so I'm wondering if changing the size by sysctl makes sense. If we
> > > have per-netns hash tables, each table should have smaller amount of
> > > sockets and smaller size should be enough, I think.
> >
> > How can a sysctl be safely used if two different threads call "unshare
> > -n" at the same time ?
>
> I think two unshare are safe. Each of them reads its parent netns's sysctl
> knob. Even when the parent is the same, they can read the same value.
How can one thread create a netns with a TCP ehash table with 1024 buckets,
and a second one create a netns with a TCP ehash table with 1 million
buckets at the same time,
if they share the same sysctl ???
>
> But I think we need READ_ONCE()/WRITE_ONCE() in such a sysctl.
Like all sysctls really.
While
> creating a child netns, another one can change the value and there can be
> a data-race. So we have to use custome handler and pass write/read handler
> as conv of do_proc_douintvec(), like do_proc_douintvec_conv_lockless().
>
> If there are some sysctls missing READ_ONCE/WRITE_ONCE(), I will add
> more general one, proc_douintvec_lockless().
Seriously, all sysctls can be set while being read. That is not something new.
>
>
> > > > [ 4.917080] tcp_listen_portaddr_hash hash table entries: 65536
> > > > (order: 8, 1048576 bytes, vmalloc)
> > > > [ 4.917260] TCP established hash table entries: 524288 (order: 10,
> > > > 4194304 bytes, vmalloc hugepage)
> > > > [ 4.917760] TCP bind hash table entries: 65536 (order: 8, 1048576
> > > > bytes, vmalloc)
> > > > [ 4.917881] TCP: Hash tables configured (established 524288 bind 65536)
> > > >
> > > >
> > > >
> > > > > IIRC, you said we need per netns table for TCP in the future.
> > > >
> > > > Which ones exactly ? I guess you misunderstood.
> > >
> > > I think this.
> > > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=04c494e68a13
> >
> > "might" is very different than "will"
> >
> > I would rather use the list of time_wait, instead of adding huge
> > memory costs for hosts with hundreds of netns.
>
> Sorry, my bad.
> I would give it a try only for TIME_WAIT.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 net] af_unix: Do not call kmemdup() for init_net's sysctl table.
2022-06-27 20:04 ` Eric Dumazet
@ 2022-06-27 20:18 ` Kuniyuki Iwashima
0 siblings, 0 replies; 12+ messages in thread
From: Kuniyuki Iwashima @ 2022-06-27 20:18 UTC (permalink / raw)
To: edumazet
Cc: davem, ebiederm, herbert, kuba, kuni1840, kuniyu, netdev, pabeni,
xemul
From: Eric Dumazet <edumazet@google.com>
Date: Mon, 27 Jun 2022 22:04:07 +0200
> On Mon, Jun 27, 2022 at 10:00 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> >
> > From: Eric Dumazet <edumazet@google.com>
> > Date: Mon, 27 Jun 2022 21:36:18 +0200
> > > On Mon, Jun 27, 2022 at 9:16 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > >
> > > > From: Eric Dumazet <edumazet@google.com>
> > > > Date: Mon, 27 Jun 2022 21:06:14 +0200
> > > > > On Mon, Jun 27, 2022 at 8:59 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > > >
> > > > > > From: Eric Dumazet <edumazet@google.com>
> > > > > > Date: Mon, 27 Jun 2022 20:40:24 +0200
> > > > > > > On Mon, Jun 27, 2022 at 8:30 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> > > > > > > >
> > > > > > > > From: Jakub Kicinski <kuba@kernel.org>
> > > > > > > > Date: Mon, 27 Jun 2022 10:58:59 -0700
> > > > > > > > > On Sun, 26 Jun 2022 11:43:27 -0500 Eric W. Biederman wrote:
> > > > > > > > > > Kuniyuki Iwashima <kuniyu@amazon.com> writes:
> > > > > > > > > >
> > > > > > > > > > > While setting up init_net's sysctl table, we need not duplicate the global
> > > > > > > > > > > table and can use it directly.
> > > > > > > > > >
> > > > > > > > > > Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> > > > > > > > > >
> > > > > > > > > > I am not quite certain the savings of a single entry table justivies
> > > > > > > > > > the complexity. But the looks correct.
> > > > > > > > >
> > > > > > > > > Yeah, the commit message is a little sparse. The "why" is not addressed.
> > > > > > > > > Could you add more details to explain the motivation?
> > > > > > > >
> > > > > > > > I was working on a series which converts UDP/TCP hash tables into per-netns
> > > > > > > > ones like AF_UNIX to speed up looking up sockets. It will consume much
> > > > > > > > memory on a host with thousands of netns, but it can be waste if we do not
> > > > > > > > have its protocol family's sockets.
> > > > > > >
> > > > > > > For the record, I doubt we will accept such a patch (per net-ns
> > > > > > > TCP/UDP hash tables)
> > > > > >
> > > > > > Is it because it's risky?
> > > > >
> > > > > Because it will be very expensive. TCP hash tables are quite big.
> > > >
> > > > Yes, so I'm wondering if changing the size by sysctl makes sense. If we
> > > > have per-netns hash tables, each table should have smaller amount of
> > > > sockets and smaller size should be enough, I think.
> > >
> > > How can a sysctl be safely used if two different threads call "unshare
> > > -n" at the same time ?
> >
> > I think two unshare are safe. Each of them reads its parent netns's sysctl
> > knob. Even when the parent is the same, they can read the same value.
>
> How can one thread create a netns with a TCP ehash table with 1024 buckets,
> and a second one create a netns with a TCP ehash table with 1 million
> buckets at the same time,
> if they share the same sysctl ???
Oh, I undertood.
In the example, I added net.unix.hash_entries so we can confirm if the size
is intended one, but yes, checking it and recreating netns is crazy...
# sysctl -w net.unix.child_hash_entries=128
# ip net add test # created with the hash table size 128
# ip net exec test sh
# sysctl net.unix.hash_entries # read-only
128
Do you have good idea?
>
> >
> > But I think we need READ_ONCE()/WRITE_ONCE() in such a sysctl.
>
> Like all sysctls really.
>
> While
> > creating a child netns, another one can change the value and there can be
> > a data-race. So we have to use custome handler and pass write/read handler
> > as conv of do_proc_douintvec(), like do_proc_douintvec_conv_lockless().
> >
> > If there are some sysctls missing READ_ONCE/WRITE_ONCE(), I will add
> > more general one, proc_douintvec_lockless().
>
> Seriously, all sysctls can be set while being read. That is not something new.
Ok, I added that on TODO.
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2022-06-27 20:18 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-06-26 8:23 [PATCH v2 net] af_unix: Do not call kmemdup() for init_net's sysctl table Kuniyuki Iwashima
2022-06-26 16:43 ` Eric W. Biederman
2022-06-27 17:58 ` Jakub Kicinski
2022-06-27 18:30 ` Kuniyuki Iwashima
2022-06-27 18:40 ` Eric Dumazet
2022-06-27 18:58 ` Kuniyuki Iwashima
2022-06-27 19:06 ` Eric Dumazet
2022-06-27 19:15 ` Kuniyuki Iwashima
2022-06-27 19:36 ` Eric Dumazet
2022-06-27 19:59 ` Kuniyuki Iwashima
2022-06-27 20:04 ` Eric Dumazet
2022-06-27 20:18 ` Kuniyuki Iwashima
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).