* [PATCH v2 net] af_unix: Do not call kmemdup() for init_net's sysctl table.
@ 2022-06-26 8:23 Kuniyuki Iwashima
2022-06-26 16:43 ` Eric W. Biederman
0 siblings, 1 reply; 12+ messages in thread
From: Kuniyuki Iwashima @ 2022-06-26 8:23 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni
Cc: Eric W. Biederman, Pavel Emelyanov, Herbert Xu, Kuniyuki Iwashima,
Kuniyuki Iwashima, netdev
While setting up init_net's sysctl table, we need not duplicate the global
table and can use it directly.
Fixes: 1597fbc0faf8 ("[UNIX]: Make the unix sysctl tables per-namespace")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
v2:
* Fix NULL comparison style by checkpatch.pl
v1: https://lore.kernel.org/all/20220626074454.28944-1-kuniyu@amazon.com/
---
---
net/unix/sysctl_net_unix.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)
diff --git a/net/unix/sysctl_net_unix.c b/net/unix/sysctl_net_unix.c
index 01d44e2598e2..4bd856d05135 100644
--- a/net/unix/sysctl_net_unix.c
+++ b/net/unix/sysctl_net_unix.c
@@ -26,11 +26,16 @@ int __net_init unix_sysctl_register(struct net *net)
{
struct ctl_table *table;
- table = kmemdup(unix_table, sizeof(unix_table), GFP_KERNEL);
- if (table == NULL)
- goto err_alloc;
+ if (net_eq(net, &init_net)) {
+ table = unix_table;
+ } else {
+ table = kmemdup(unix_table, sizeof(unix_table), GFP_KERNEL);
+ if (!table)
+ goto err_alloc;
+
+ table[0].data = &net->unx.sysctl_max_dgram_qlen;
+ }
- table[0].data = &net->unx.sysctl_max_dgram_qlen;
net->unx.ctl = register_net_sysctl(net, "net/unix", table);
if (net->unx.ctl == NULL)
goto err_reg;
@@ -38,7 +43,8 @@ int __net_init unix_sysctl_register(struct net *net)
return 0;
err_reg:
- kfree(table);
+ if (net_eq(net, &init_net))
+ kfree(table);
err_alloc:
return -ENOMEM;
}
--
2.30.2
^ permalink raw reply related [flat|nested] 12+ messages in thread* Re: [PATCH v2 net] af_unix: Do not call kmemdup() for init_net's sysctl table. 2022-06-26 8:23 [PATCH v2 net] af_unix: Do not call kmemdup() for init_net's sysctl table Kuniyuki Iwashima @ 2022-06-26 16:43 ` Eric W. Biederman 2022-06-27 17:58 ` Jakub Kicinski 0 siblings, 1 reply; 12+ messages in thread From: Eric W. Biederman @ 2022-06-26 16:43 UTC (permalink / raw) To: Kuniyuki Iwashima Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Pavel Emelyanov, Herbert Xu, Kuniyuki Iwashima, netdev Kuniyuki Iwashima <kuniyu@amazon.com> writes: > While setting up init_net's sysctl table, we need not duplicate the global > table and can use it directly. Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> I am not quite certain the savings of a single entry table justivies the complexity. But the looks correct. Eric > > Fixes: 1597fbc0faf8 ("[UNIX]: Make the unix sysctl tables per-namespace") > Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> > --- > v2: > * Fix NULL comparison style by checkpatch.pl > > v1: https://lore.kernel.org/all/20220626074454.28944-1-kuniyu@amazon.com/ > --- > --- > net/unix/sysctl_net_unix.c | 16 +++++++++++----- > 1 file changed, 11 insertions(+), 5 deletions(-) > > diff --git a/net/unix/sysctl_net_unix.c b/net/unix/sysctl_net_unix.c > index 01d44e2598e2..4bd856d05135 100644 > --- a/net/unix/sysctl_net_unix.c > +++ b/net/unix/sysctl_net_unix.c > @@ -26,11 +26,16 @@ int __net_init unix_sysctl_register(struct net *net) > { > struct ctl_table *table; > > - table = kmemdup(unix_table, sizeof(unix_table), GFP_KERNEL); > - if (table == NULL) > - goto err_alloc; > + if (net_eq(net, &init_net)) { > + table = unix_table; > + } else { > + table = kmemdup(unix_table, sizeof(unix_table), GFP_KERNEL); > + if (!table) > + goto err_alloc; > + > + table[0].data = &net->unx.sysctl_max_dgram_qlen; > + } > > - table[0].data = &net->unx.sysctl_max_dgram_qlen; > net->unx.ctl = register_net_sysctl(net, "net/unix", table); > if (net->unx.ctl == NULL) > goto err_reg; > @@ -38,7 +43,8 @@ int __net_init unix_sysctl_register(struct net *net) > return 0; > > err_reg: > - kfree(table); > + if (net_eq(net, &init_net)) > + kfree(table); > err_alloc: > return -ENOMEM; > } ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 net] af_unix: Do not call kmemdup() for init_net's sysctl table. 2022-06-26 16:43 ` Eric W. Biederman @ 2022-06-27 17:58 ` Jakub Kicinski 2022-06-27 18:30 ` Kuniyuki Iwashima 0 siblings, 1 reply; 12+ messages in thread From: Jakub Kicinski @ 2022-06-27 17:58 UTC (permalink / raw) To: Eric W. Biederman, Kuniyuki Iwashima Cc: David S. Miller, Eric Dumazet, Paolo Abeni, Pavel Emelyanov, Herbert Xu, Kuniyuki Iwashima, netdev On Sun, 26 Jun 2022 11:43:27 -0500 Eric W. Biederman wrote: > Kuniyuki Iwashima <kuniyu@amazon.com> writes: > > > While setting up init_net's sysctl table, we need not duplicate the global > > table and can use it directly. > > Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> > > I am not quite certain the savings of a single entry table justivies > the complexity. But the looks correct. Yeah, the commit message is a little sparse. The "why" is not addressed. Could you add more details to explain the motivation? ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 net] af_unix: Do not call kmemdup() for init_net's sysctl table. 2022-06-27 17:58 ` Jakub Kicinski @ 2022-06-27 18:30 ` Kuniyuki Iwashima 2022-06-27 18:40 ` Eric Dumazet 0 siblings, 1 reply; 12+ messages in thread From: Kuniyuki Iwashima @ 2022-06-27 18:30 UTC (permalink / raw) To: kuba Cc: davem, ebiederm, edumazet, herbert, kuni1840, kuniyu, netdev, pabeni, xemul From: Jakub Kicinski <kuba@kernel.org> Date: Mon, 27 Jun 2022 10:58:59 -0700 > On Sun, 26 Jun 2022 11:43:27 -0500 Eric W. Biederman wrote: > > Kuniyuki Iwashima <kuniyu@amazon.com> writes: > > > > > While setting up init_net's sysctl table, we need not duplicate the global > > > table and can use it directly. > > > > Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> > > > > I am not quite certain the savings of a single entry table justivies > > the complexity. But the looks correct. > > Yeah, the commit message is a little sparse. The "why" is not addressed. > Could you add more details to explain the motivation? I was working on a series which converts UDP/TCP hash tables into per-netns ones like AF_UNIX to speed up looking up sockets. It will consume much memory on a host with thousands of netns, but it can be waste if we do not have its protocol family's sockets. So, I'm now working on a follow-up series for AF_UNIX per-netns hash table so that we can change the size for a child netns by a sysctl knob: # sysctl -w net.unix.child_hash_entries=128 # ip net add test # created with the hash table size 128 # ip net exec test sh # sysctl net.unix.hash_entries # read-only 128 (The size for init_net can be changed via a new boot parameter xhash_entries like uhash_entries/thash_entries.) While implementing that, I found that kmemdup() is called for init_net but TCP/UDP does not (See: ipv4_sysctl_init_net()). Unlike IPv4, AF_UNIX does not have a huge sysctl table, so it cannot be a problem though, this patch is for consuming less memory and kind of consistency. The reason I submit this seperately is that it might be better to have a Fixes tag. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 net] af_unix: Do not call kmemdup() for init_net's sysctl table. 2022-06-27 18:30 ` Kuniyuki Iwashima @ 2022-06-27 18:40 ` Eric Dumazet 2022-06-27 18:58 ` Kuniyuki Iwashima 0 siblings, 1 reply; 12+ messages in thread From: Eric Dumazet @ 2022-06-27 18:40 UTC (permalink / raw) To: Kuniyuki Iwashima Cc: Jakub Kicinski, David Miller, Eric W. Biederman, Herbert Xu, Kuniyuki Iwashima, netdev, Paolo Abeni, Pavel Emelyanov On Mon, Jun 27, 2022 at 8:30 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > From: Jakub Kicinski <kuba@kernel.org> > Date: Mon, 27 Jun 2022 10:58:59 -0700 > > On Sun, 26 Jun 2022 11:43:27 -0500 Eric W. Biederman wrote: > > > Kuniyuki Iwashima <kuniyu@amazon.com> writes: > > > > > > > While setting up init_net's sysctl table, we need not duplicate the global > > > > table and can use it directly. > > > > > > Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> > > > > > > I am not quite certain the savings of a single entry table justivies > > > the complexity. But the looks correct. > > > > Yeah, the commit message is a little sparse. The "why" is not addressed. > > Could you add more details to explain the motivation? > > I was working on a series which converts UDP/TCP hash tables into per-netns > ones like AF_UNIX to speed up looking up sockets. It will consume much > memory on a host with thousands of netns, but it can be waste if we do not > have its protocol family's sockets. For the record, I doubt we will accept such a patch (per net-ns TCP/UDP hash tables) > > So, I'm now working on a follow-up series for AF_UNIX per-netns hash table > so that we can change the size for a child netns by a sysctl knob: > > # sysctl -w net.unix.child_hash_entries=128 > # ip net add test # created with the hash table size 128 > # ip net exec test sh > # sysctl net.unix.hash_entries # read-only > 128 > > (The size for init_net can be changed via a new boot parameter > xhash_entries like uhash_entries/thash_entries.) > > While implementing that, I found that kmemdup() is called for init_net but > TCP/UDP does not (See: ipv4_sysctl_init_net()). Unlike IPv4, AF_UNIX does > not have a huge sysctl table, so it cannot be a problem though, this patch > is for consuming less memory and kind of consistency. The reason I submit > this seperately is that it might be better to have a Fixes tag. I think that af_unix module can be unloaded. Your patch will break the module unload operation. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 net] af_unix: Do not call kmemdup() for init_net's sysctl table. 2022-06-27 18:40 ` Eric Dumazet @ 2022-06-27 18:58 ` Kuniyuki Iwashima 2022-06-27 19:06 ` Eric Dumazet 0 siblings, 1 reply; 12+ messages in thread From: Kuniyuki Iwashima @ 2022-06-27 18:58 UTC (permalink / raw) To: edumazet Cc: davem, ebiederm, herbert, kuba, kuni1840, kuniyu, netdev, pabeni, xemul From: Eric Dumazet <edumazet@google.com> Date: Mon, 27 Jun 2022 20:40:24 +0200 > On Mon, Jun 27, 2022 at 8:30 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > From: Jakub Kicinski <kuba@kernel.org> > > Date: Mon, 27 Jun 2022 10:58:59 -0700 > > > On Sun, 26 Jun 2022 11:43:27 -0500 Eric W. Biederman wrote: > > > > Kuniyuki Iwashima <kuniyu@amazon.com> writes: > > > > > > > > > While setting up init_net's sysctl table, we need not duplicate the global > > > > > table and can use it directly. > > > > > > > > Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> > > > > > > > > I am not quite certain the savings of a single entry table justivies > > > > the complexity. But the looks correct. > > > > > > Yeah, the commit message is a little sparse. The "why" is not addressed. > > > Could you add more details to explain the motivation? > > > > I was working on a series which converts UDP/TCP hash tables into per-netns > > ones like AF_UNIX to speed up looking up sockets. It will consume much > > memory on a host with thousands of netns, but it can be waste if we do not > > have its protocol family's sockets. > > For the record, I doubt we will accept such a patch (per net-ns > TCP/UDP hash tables) Is it because it's risky? IIRC, you said we need per netns table for TCP in the future. > > So, I'm now working on a follow-up series for AF_UNIX per-netns hash table > > so that we can change the size for a child netns by a sysctl knob: > > > > # sysctl -w net.unix.child_hash_entries=128 > > # ip net add test # created with the hash table size 128 > > # ip net exec test sh > > # sysctl net.unix.hash_entries # read-only > > 128 > > > > (The size for init_net can be changed via a new boot parameter > > xhash_entries like uhash_entries/thash_entries.) > > > > While implementing that, I found that kmemdup() is called for init_net but > > TCP/UDP does not (See: ipv4_sysctl_init_net()). Unlike IPv4, AF_UNIX does > > not have a huge sysctl table, so it cannot be a problem though, this patch > > is for consuming less memory and kind of consistency. The reason I submit > > this seperately is that it might be better to have a Fixes tag. > > I think that af_unix module can be unloaded. > > Your patch will break the module unload operation. Thank you! I had to take of kfree() in unix_sysctl_unregister(). ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 net] af_unix: Do not call kmemdup() for init_net's sysctl table. 2022-06-27 18:58 ` Kuniyuki Iwashima @ 2022-06-27 19:06 ` Eric Dumazet 2022-06-27 19:15 ` Kuniyuki Iwashima 0 siblings, 1 reply; 12+ messages in thread From: Eric Dumazet @ 2022-06-27 19:06 UTC (permalink / raw) To: Kuniyuki Iwashima Cc: David Miller, Eric W. Biederman, Herbert Xu, Jakub Kicinski, Kuniyuki Iwashima, netdev, Paolo Abeni, Pavel Emelyanov On Mon, Jun 27, 2022 at 8:59 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > From: Eric Dumazet <edumazet@google.com> > Date: Mon, 27 Jun 2022 20:40:24 +0200 > > On Mon, Jun 27, 2022 at 8:30 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > > > From: Jakub Kicinski <kuba@kernel.org> > > > Date: Mon, 27 Jun 2022 10:58:59 -0700 > > > > On Sun, 26 Jun 2022 11:43:27 -0500 Eric W. Biederman wrote: > > > > > Kuniyuki Iwashima <kuniyu@amazon.com> writes: > > > > > > > > > > > While setting up init_net's sysctl table, we need not duplicate the global > > > > > > table and can use it directly. > > > > > > > > > > Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> > > > > > > > > > > I am not quite certain the savings of a single entry table justivies > > > > > the complexity. But the looks correct. > > > > > > > > Yeah, the commit message is a little sparse. The "why" is not addressed. > > > > Could you add more details to explain the motivation? > > > > > > I was working on a series which converts UDP/TCP hash tables into per-netns > > > ones like AF_UNIX to speed up looking up sockets. It will consume much > > > memory on a host with thousands of netns, but it can be waste if we do not > > > have its protocol family's sockets. > > > > For the record, I doubt we will accept such a patch (per net-ns > > TCP/UDP hash tables) > > Is it because it's risky? Because it will be very expensive. TCP hash tables are quite big. [ 4.917080] tcp_listen_portaddr_hash hash table entries: 65536 (order: 8, 1048576 bytes, vmalloc) [ 4.917260] TCP established hash table entries: 524288 (order: 10, 4194304 bytes, vmalloc hugepage) [ 4.917760] TCP bind hash table entries: 65536 (order: 8, 1048576 bytes, vmalloc) [ 4.917881] TCP: Hash tables configured (established 524288 bind 65536) > IIRC, you said we need per netns table for TCP in the future. Which ones exactly ? I guess you misunderstood. > > > > > So, I'm now working on a follow-up series for AF_UNIX per-netns hash table > > > so that we can change the size for a child netns by a sysctl knob: > > > > > > # sysctl -w net.unix.child_hash_entries=128 > > > # ip net add test # created with the hash table size 128 > > > # ip net exec test sh > > > # sysctl net.unix.hash_entries # read-only > > > 128 > > > > > > (The size for init_net can be changed via a new boot parameter > > > xhash_entries like uhash_entries/thash_entries.) > > > > > > While implementing that, I found that kmemdup() is called for init_net but > > > TCP/UDP does not (See: ipv4_sysctl_init_net()). Unlike IPv4, AF_UNIX does > > > not have a huge sysctl table, so it cannot be a problem though, this patch > > > is for consuming less memory and kind of consistency. The reason I submit > > > this seperately is that it might be better to have a Fixes tag. > > > > I think that af_unix module can be unloaded. > > > > Your patch will break the module unload operation. > > Thank you! > I had to take of kfree() in unix_sysctl_unregister(). ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 net] af_unix: Do not call kmemdup() for init_net's sysctl table. 2022-06-27 19:06 ` Eric Dumazet @ 2022-06-27 19:15 ` Kuniyuki Iwashima 2022-06-27 19:36 ` Eric Dumazet 0 siblings, 1 reply; 12+ messages in thread From: Kuniyuki Iwashima @ 2022-06-27 19:15 UTC (permalink / raw) To: edumazet Cc: davem, ebiederm, herbert, kuba, kuni1840, kuniyu, netdev, pabeni, xemul From: Eric Dumazet <edumazet@google.com> Date: Mon, 27 Jun 2022 21:06:14 +0200 > On Mon, Jun 27, 2022 at 8:59 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > From: Eric Dumazet <edumazet@google.com> > > Date: Mon, 27 Jun 2022 20:40:24 +0200 > > > On Mon, Jun 27, 2022 at 8:30 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > > > > > From: Jakub Kicinski <kuba@kernel.org> > > > > Date: Mon, 27 Jun 2022 10:58:59 -0700 > > > > > On Sun, 26 Jun 2022 11:43:27 -0500 Eric W. Biederman wrote: > > > > > > Kuniyuki Iwashima <kuniyu@amazon.com> writes: > > > > > > > > > > > > > While setting up init_net's sysctl table, we need not duplicate the global > > > > > > > table and can use it directly. > > > > > > > > > > > > Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> > > > > > > > > > > > > I am not quite certain the savings of a single entry table justivies > > > > > > the complexity. But the looks correct. > > > > > > > > > > Yeah, the commit message is a little sparse. The "why" is not addressed. > > > > > Could you add more details to explain the motivation? > > > > > > > > I was working on a series which converts UDP/TCP hash tables into per-netns > > > > ones like AF_UNIX to speed up looking up sockets. It will consume much > > > > memory on a host with thousands of netns, but it can be waste if we do not > > > > have its protocol family's sockets. > > > > > > For the record, I doubt we will accept such a patch (per net-ns > > > TCP/UDP hash tables) > > > > Is it because it's risky? > > Because it will be very expensive. TCP hash tables are quite big. Yes, so I'm wondering if changing the size by sysctl makes sense. If we have per-netns hash tables, each table should have smaller amount of sockets and smaller size should be enough, I think. > > [ 4.917080] tcp_listen_portaddr_hash hash table entries: 65536 > (order: 8, 1048576 bytes, vmalloc) > [ 4.917260] TCP established hash table entries: 524288 (order: 10, > 4194304 bytes, vmalloc hugepage) > [ 4.917760] TCP bind hash table entries: 65536 (order: 8, 1048576 > bytes, vmalloc) > [ 4.917881] TCP: Hash tables configured (established 524288 bind 65536) > > > > > IIRC, you said we need per netns table for TCP in the future. > > Which ones exactly ? I guess you misunderstood. I think this. https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=04c494e68a13 ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 net] af_unix: Do not call kmemdup() for init_net's sysctl table. 2022-06-27 19:15 ` Kuniyuki Iwashima @ 2022-06-27 19:36 ` Eric Dumazet 2022-06-27 19:59 ` Kuniyuki Iwashima 0 siblings, 1 reply; 12+ messages in thread From: Eric Dumazet @ 2022-06-27 19:36 UTC (permalink / raw) To: Kuniyuki Iwashima Cc: David Miller, Eric W. Biederman, Herbert Xu, Jakub Kicinski, Kuniyuki Iwashima, netdev, Paolo Abeni, Pavel Emelyanov On Mon, Jun 27, 2022 at 9:16 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > From: Eric Dumazet <edumazet@google.com> > Date: Mon, 27 Jun 2022 21:06:14 +0200 > > On Mon, Jun 27, 2022 at 8:59 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > > > From: Eric Dumazet <edumazet@google.com> > > > Date: Mon, 27 Jun 2022 20:40:24 +0200 > > > > On Mon, Jun 27, 2022 at 8:30 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > > > > > > > From: Jakub Kicinski <kuba@kernel.org> > > > > > Date: Mon, 27 Jun 2022 10:58:59 -0700 > > > > > > On Sun, 26 Jun 2022 11:43:27 -0500 Eric W. Biederman wrote: > > > > > > > Kuniyuki Iwashima <kuniyu@amazon.com> writes: > > > > > > > > > > > > > > > While setting up init_net's sysctl table, we need not duplicate the global > > > > > > > > table and can use it directly. > > > > > > > > > > > > > > Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> > > > > > > > > > > > > > > I am not quite certain the savings of a single entry table justivies > > > > > > > the complexity. But the looks correct. > > > > > > > > > > > > Yeah, the commit message is a little sparse. The "why" is not addressed. > > > > > > Could you add more details to explain the motivation? > > > > > > > > > > I was working on a series which converts UDP/TCP hash tables into per-netns > > > > > ones like AF_UNIX to speed up looking up sockets. It will consume much > > > > > memory on a host with thousands of netns, but it can be waste if we do not > > > > > have its protocol family's sockets. > > > > > > > > For the record, I doubt we will accept such a patch (per net-ns > > > > TCP/UDP hash tables) > > > > > > Is it because it's risky? > > > > Because it will be very expensive. TCP hash tables are quite big. > > Yes, so I'm wondering if changing the size by sysctl makes sense. If we > have per-netns hash tables, each table should have smaller amount of > sockets and smaller size should be enough, I think. How can a sysctl be safely used if two different threads call "unshare -n" at the same time ? > > > > > [ 4.917080] tcp_listen_portaddr_hash hash table entries: 65536 > > (order: 8, 1048576 bytes, vmalloc) > > [ 4.917260] TCP established hash table entries: 524288 (order: 10, > > 4194304 bytes, vmalloc hugepage) > > [ 4.917760] TCP bind hash table entries: 65536 (order: 8, 1048576 > > bytes, vmalloc) > > [ 4.917881] TCP: Hash tables configured (established 524288 bind 65536) > > > > > > > > > IIRC, you said we need per netns table for TCP in the future. > > > > Which ones exactly ? I guess you misunderstood. > > I think this. > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=04c494e68a13 "might" is very different than "will" I would rather use the list of time_wait, instead of adding huge memory costs for hosts with hundreds of netns. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 net] af_unix: Do not call kmemdup() for init_net's sysctl table. 2022-06-27 19:36 ` Eric Dumazet @ 2022-06-27 19:59 ` Kuniyuki Iwashima 2022-06-27 20:04 ` Eric Dumazet 0 siblings, 1 reply; 12+ messages in thread From: Kuniyuki Iwashima @ 2022-06-27 19:59 UTC (permalink / raw) To: edumazet Cc: davem, ebiederm, herbert, kuba, kuni1840, kuniyu, netdev, pabeni, xemul From: Eric Dumazet <edumazet@google.com> Date: Mon, 27 Jun 2022 21:36:18 +0200 > On Mon, Jun 27, 2022 at 9:16 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > From: Eric Dumazet <edumazet@google.com> > > Date: Mon, 27 Jun 2022 21:06:14 +0200 > > > On Mon, Jun 27, 2022 at 8:59 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > > > > > From: Eric Dumazet <edumazet@google.com> > > > > Date: Mon, 27 Jun 2022 20:40:24 +0200 > > > > > On Mon, Jun 27, 2022 at 8:30 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > > > > > > > > > From: Jakub Kicinski <kuba@kernel.org> > > > > > > Date: Mon, 27 Jun 2022 10:58:59 -0700 > > > > > > > On Sun, 26 Jun 2022 11:43:27 -0500 Eric W. Biederman wrote: > > > > > > > > Kuniyuki Iwashima <kuniyu@amazon.com> writes: > > > > > > > > > > > > > > > > > While setting up init_net's sysctl table, we need not duplicate the global > > > > > > > > > table and can use it directly. > > > > > > > > > > > > > > > > Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> > > > > > > > > > > > > > > > > I am not quite certain the savings of a single entry table justivies > > > > > > > > the complexity. But the looks correct. > > > > > > > > > > > > > > Yeah, the commit message is a little sparse. The "why" is not addressed. > > > > > > > Could you add more details to explain the motivation? > > > > > > > > > > > > I was working on a series which converts UDP/TCP hash tables into per-netns > > > > > > ones like AF_UNIX to speed up looking up sockets. It will consume much > > > > > > memory on a host with thousands of netns, but it can be waste if we do not > > > > > > have its protocol family's sockets. > > > > > > > > > > For the record, I doubt we will accept such a patch (per net-ns > > > > > TCP/UDP hash tables) > > > > > > > > Is it because it's risky? > > > > > > Because it will be very expensive. TCP hash tables are quite big. > > > > Yes, so I'm wondering if changing the size by sysctl makes sense. If we > > have per-netns hash tables, each table should have smaller amount of > > sockets and smaller size should be enough, I think. > > How can a sysctl be safely used if two different threads call "unshare > -n" at the same time ? I think two unshare are safe. Each of them reads its parent netns's sysctl knob. Even when the parent is the same, they can read the same value. But I think we need READ_ONCE()/WRITE_ONCE() in such a sysctl. While creating a child netns, another one can change the value and there can be a data-race. So we have to use custome handler and pass write/read handler as conv of do_proc_douintvec(), like do_proc_douintvec_conv_lockless(). If there are some sysctls missing READ_ONCE/WRITE_ONCE(), I will add more general one, proc_douintvec_lockless(). > > > [ 4.917080] tcp_listen_portaddr_hash hash table entries: 65536 > > > (order: 8, 1048576 bytes, vmalloc) > > > [ 4.917260] TCP established hash table entries: 524288 (order: 10, > > > 4194304 bytes, vmalloc hugepage) > > > [ 4.917760] TCP bind hash table entries: 65536 (order: 8, 1048576 > > > bytes, vmalloc) > > > [ 4.917881] TCP: Hash tables configured (established 524288 bind 65536) > > > > > > > > > > > > > IIRC, you said we need per netns table for TCP in the future. > > > > > > Which ones exactly ? I guess you misunderstood. > > > > I think this. > > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=04c494e68a13 > > "might" is very different than "will" > > I would rather use the list of time_wait, instead of adding huge > memory costs for hosts with hundreds of netns. Sorry, my bad. I would give it a try only for TIME_WAIT. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 net] af_unix: Do not call kmemdup() for init_net's sysctl table. 2022-06-27 19:59 ` Kuniyuki Iwashima @ 2022-06-27 20:04 ` Eric Dumazet 2022-06-27 20:18 ` Kuniyuki Iwashima 0 siblings, 1 reply; 12+ messages in thread From: Eric Dumazet @ 2022-06-27 20:04 UTC (permalink / raw) To: Kuniyuki Iwashima Cc: David Miller, Eric W. Biederman, Herbert Xu, Jakub Kicinski, Kuniyuki Iwashima, netdev, Paolo Abeni, Pavel Emelyanov On Mon, Jun 27, 2022 at 10:00 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > From: Eric Dumazet <edumazet@google.com> > Date: Mon, 27 Jun 2022 21:36:18 +0200 > > On Mon, Jun 27, 2022 at 9:16 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > > > From: Eric Dumazet <edumazet@google.com> > > > Date: Mon, 27 Jun 2022 21:06:14 +0200 > > > > On Mon, Jun 27, 2022 at 8:59 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > > > > > > > From: Eric Dumazet <edumazet@google.com> > > > > > Date: Mon, 27 Jun 2022 20:40:24 +0200 > > > > > > On Mon, Jun 27, 2022 at 8:30 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > > > > > > > > > > > From: Jakub Kicinski <kuba@kernel.org> > > > > > > > Date: Mon, 27 Jun 2022 10:58:59 -0700 > > > > > > > > On Sun, 26 Jun 2022 11:43:27 -0500 Eric W. Biederman wrote: > > > > > > > > > Kuniyuki Iwashima <kuniyu@amazon.com> writes: > > > > > > > > > > > > > > > > > > > While setting up init_net's sysctl table, we need not duplicate the global > > > > > > > > > > table and can use it directly. > > > > > > > > > > > > > > > > > > Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> > > > > > > > > > > > > > > > > > > I am not quite certain the savings of a single entry table justivies > > > > > > > > > the complexity. But the looks correct. > > > > > > > > > > > > > > > > Yeah, the commit message is a little sparse. The "why" is not addressed. > > > > > > > > Could you add more details to explain the motivation? > > > > > > > > > > > > > > I was working on a series which converts UDP/TCP hash tables into per-netns > > > > > > > ones like AF_UNIX to speed up looking up sockets. It will consume much > > > > > > > memory on a host with thousands of netns, but it can be waste if we do not > > > > > > > have its protocol family's sockets. > > > > > > > > > > > > For the record, I doubt we will accept such a patch (per net-ns > > > > > > TCP/UDP hash tables) > > > > > > > > > > Is it because it's risky? > > > > > > > > Because it will be very expensive. TCP hash tables are quite big. > > > > > > Yes, so I'm wondering if changing the size by sysctl makes sense. If we > > > have per-netns hash tables, each table should have smaller amount of > > > sockets and smaller size should be enough, I think. > > > > How can a sysctl be safely used if two different threads call "unshare > > -n" at the same time ? > > I think two unshare are safe. Each of them reads its parent netns's sysctl > knob. Even when the parent is the same, they can read the same value. How can one thread create a netns with a TCP ehash table with 1024 buckets, and a second one create a netns with a TCP ehash table with 1 million buckets at the same time, if they share the same sysctl ??? > > But I think we need READ_ONCE()/WRITE_ONCE() in such a sysctl. Like all sysctls really. While > creating a child netns, another one can change the value and there can be > a data-race. So we have to use custome handler and pass write/read handler > as conv of do_proc_douintvec(), like do_proc_douintvec_conv_lockless(). > > If there are some sysctls missing READ_ONCE/WRITE_ONCE(), I will add > more general one, proc_douintvec_lockless(). Seriously, all sysctls can be set while being read. That is not something new. > > > > > > [ 4.917080] tcp_listen_portaddr_hash hash table entries: 65536 > > > > (order: 8, 1048576 bytes, vmalloc) > > > > [ 4.917260] TCP established hash table entries: 524288 (order: 10, > > > > 4194304 bytes, vmalloc hugepage) > > > > [ 4.917760] TCP bind hash table entries: 65536 (order: 8, 1048576 > > > > bytes, vmalloc) > > > > [ 4.917881] TCP: Hash tables configured (established 524288 bind 65536) > > > > > > > > > > > > > > > > > IIRC, you said we need per netns table for TCP in the future. > > > > > > > > Which ones exactly ? I guess you misunderstood. > > > > > > I think this. > > > https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=04c494e68a13 > > > > "might" is very different than "will" > > > > I would rather use the list of time_wait, instead of adding huge > > memory costs for hosts with hundreds of netns. > > Sorry, my bad. > I would give it a try only for TIME_WAIT. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 net] af_unix: Do not call kmemdup() for init_net's sysctl table. 2022-06-27 20:04 ` Eric Dumazet @ 2022-06-27 20:18 ` Kuniyuki Iwashima 0 siblings, 0 replies; 12+ messages in thread From: Kuniyuki Iwashima @ 2022-06-27 20:18 UTC (permalink / raw) To: edumazet Cc: davem, ebiederm, herbert, kuba, kuni1840, kuniyu, netdev, pabeni, xemul From: Eric Dumazet <edumazet@google.com> Date: Mon, 27 Jun 2022 22:04:07 +0200 > On Mon, Jun 27, 2022 at 10:00 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > From: Eric Dumazet <edumazet@google.com> > > Date: Mon, 27 Jun 2022 21:36:18 +0200 > > > On Mon, Jun 27, 2022 at 9:16 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > > > > > From: Eric Dumazet <edumazet@google.com> > > > > Date: Mon, 27 Jun 2022 21:06:14 +0200 > > > > > On Mon, Jun 27, 2022 at 8:59 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > > > > > > > > > From: Eric Dumazet <edumazet@google.com> > > > > > > Date: Mon, 27 Jun 2022 20:40:24 +0200 > > > > > > > On Mon, Jun 27, 2022 at 8:30 PM Kuniyuki Iwashima <kuniyu@amazon.com> wrote: > > > > > > > > > > > > > > > > From: Jakub Kicinski <kuba@kernel.org> > > > > > > > > Date: Mon, 27 Jun 2022 10:58:59 -0700 > > > > > > > > > On Sun, 26 Jun 2022 11:43:27 -0500 Eric W. Biederman wrote: > > > > > > > > > > Kuniyuki Iwashima <kuniyu@amazon.com> writes: > > > > > > > > > > > > > > > > > > > > > While setting up init_net's sysctl table, we need not duplicate the global > > > > > > > > > > > table and can use it directly. > > > > > > > > > > > > > > > > > > > > Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> > > > > > > > > > > > > > > > > > > > > I am not quite certain the savings of a single entry table justivies > > > > > > > > > > the complexity. But the looks correct. > > > > > > > > > > > > > > > > > > Yeah, the commit message is a little sparse. The "why" is not addressed. > > > > > > > > > Could you add more details to explain the motivation? > > > > > > > > > > > > > > > > I was working on a series which converts UDP/TCP hash tables into per-netns > > > > > > > > ones like AF_UNIX to speed up looking up sockets. It will consume much > > > > > > > > memory on a host with thousands of netns, but it can be waste if we do not > > > > > > > > have its protocol family's sockets. > > > > > > > > > > > > > > For the record, I doubt we will accept such a patch (per net-ns > > > > > > > TCP/UDP hash tables) > > > > > > > > > > > > Is it because it's risky? > > > > > > > > > > Because it will be very expensive. TCP hash tables are quite big. > > > > > > > > Yes, so I'm wondering if changing the size by sysctl makes sense. If we > > > > have per-netns hash tables, each table should have smaller amount of > > > > sockets and smaller size should be enough, I think. > > > > > > How can a sysctl be safely used if two different threads call "unshare > > > -n" at the same time ? > > > > I think two unshare are safe. Each of them reads its parent netns's sysctl > > knob. Even when the parent is the same, they can read the same value. > > How can one thread create a netns with a TCP ehash table with 1024 buckets, > and a second one create a netns with a TCP ehash table with 1 million > buckets at the same time, > if they share the same sysctl ??? Oh, I undertood. In the example, I added net.unix.hash_entries so we can confirm if the size is intended one, but yes, checking it and recreating netns is crazy... # sysctl -w net.unix.child_hash_entries=128 # ip net add test # created with the hash table size 128 # ip net exec test sh # sysctl net.unix.hash_entries # read-only 128 Do you have good idea? > > > > > But I think we need READ_ONCE()/WRITE_ONCE() in such a sysctl. > > Like all sysctls really. > > While > > creating a child netns, another one can change the value and there can be > > a data-race. So we have to use custome handler and pass write/read handler > > as conv of do_proc_douintvec(), like do_proc_douintvec_conv_lockless(). > > > > If there are some sysctls missing READ_ONCE/WRITE_ONCE(), I will add > > more general one, proc_douintvec_lockless(). > > Seriously, all sysctls can be set while being read. That is not something new. Ok, I added that on TODO. ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2022-06-27 20:18 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2022-06-26 8:23 [PATCH v2 net] af_unix: Do not call kmemdup() for init_net's sysctl table Kuniyuki Iwashima 2022-06-26 16:43 ` Eric W. Biederman 2022-06-27 17:58 ` Jakub Kicinski 2022-06-27 18:30 ` Kuniyuki Iwashima 2022-06-27 18:40 ` Eric Dumazet 2022-06-27 18:58 ` Kuniyuki Iwashima 2022-06-27 19:06 ` Eric Dumazet 2022-06-27 19:15 ` Kuniyuki Iwashima 2022-06-27 19:36 ` Eric Dumazet 2022-06-27 19:59 ` Kuniyuki Iwashima 2022-06-27 20:04 ` Eric Dumazet 2022-06-27 20:18 ` Kuniyuki Iwashima
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).