linux-fsdevel.vger.kernel.org archive mirror
* [PATCH v5 00/16] memcg accounting from OpenVZ
       [not found] <CALvZod66KF-8xKB1dyY2twizDE=svE8iXT_nqvsrfWg1a92f4A@mail.gmail.com>
@ 2021-07-19 10:44 ` Vasily Averin
  2021-07-26 18:59   ` [PATCH v6 00/16] memcg accounting from Vasily Averin
       [not found]   ` <cover.1627321321.git.vvs@virtuozzo.com>
       [not found] ` <cover.1626688654.git.vvs@virtuozzo.com>
  1 sibling, 2 replies; 22+ messages in thread
From: Vasily Averin @ 2021-07-19 10:44 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tejun Heo, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Shakeel Butt, Yutian Yang,
	Alexander Viro, Alexey Dobriyan, Andrei Vagin, Andrew Morton,
	Borislav Petkov, Christian Brauner, David Ahern, David S. Miller,
	Dmitry Safonov, Eric Dumazet, Eric W. Biederman,
	Greg Kroah-Hartman, Hideaki YOSHIFUJI, H. Peter Anvin,
	Ingo Molnar, Jakub Kicinski, J. Bruce Fields, Jeff Layton,
	Jens Axboe, Jiri Slaby, Kirill Tkhai, Oleg Nesterov, Serge Hallyn,
	Thomas Gleixner, Zefan Li, netdev, linux-fsdevel, LKML

OpenVZ has used memory accounting for 20+ years, since the v2.2.x Linux kernels.
Initially we used our own accounting subsystem, then partially merged
it upstream, and a few years ago switched to cgroups v1.
Now we're rebasing again, revising our old patches and trying to push
them upstream.

We try to protect the host system from any misuse of kernel memory 
allocation triggered by untrusted users inside the containers.

This patch set is addressed mostly to the cgroups maintainers and the cgroups@
mailing list, though I would be very grateful for any comments from maintainers
of affected subsystems or other people added in cc:

Compared to the upstream, we additionally account the following kernel objects:
- network devices and their Tx/Rx queues
- ipv4/v6 addresses and routing-related objects
- inet_bind_bucket cache objects
- VLAN group arrays
- ipv6/sit: ip_tunnel_prl
- scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
- nsproxy and the namespace objects themselves
- IPC objects: semaphores, message queues and shared memory segments
- mounts
- pollfd and select bits arrays
- signals and posix timers
- file locks
- fasync_struct used by the file lease code and drivers' fasync queues
- tty objects
- per-mm LDT

Our accounting for a few other kernel objects is incorrect, incomplete or
obsolete: sk_filter, af_packets, netlink and xt_counters for iptables.
They require rework and will probably be dropped altogether.

We also plan to add accounting for nft, but it is not ready yet.

We have not tested performance on upstream. However, our performance team
compared our current RHEL7-based production kernel with the corresponding
original RHEL7 kernel and reports that it performs at least no worse.

v5:
- rebased to v5.14-rc1
- updated ack tags

v4:
- improved description for tty patch
- minor cleanup in LDT patch
- rebased to v5.12
- resent to lkml@

v3:
- added new patches for other kinds of accounted objects
- combined patches for ip address/routing-related objects
- improved description
- re-ordered and rebased for linux 5.12-rc8

v2:
- squashed old patch 1 "accounting for allocations called with disabled BH"
   with old patch 2 "accounting for fib6_nodes cache", which uses this kind of
   memory allocation
- improved patch description
- subsystem maintainers added to cc:

Vasily Averin (16):
  memcg: enable accounting for net_device and Tx/Rx queues
  memcg: enable accounting for IP address and routing-related objects
  memcg: enable accounting for inet_bin_bucket cache
  memcg: enable accounting for VLAN group array
  memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs
    allocation
  memcg: enable accounting for scm_fp_list objects
  memcg: enable accounting for mnt_cache entries
  memcg: enable accounting for pollfd and select bits arrays
  memcg: enable accounting for file lock caches
  memcg: enable accounting for fasync_cache
  memcg: enable accounting for new namesapces and struct nsproxy
  memcg: enable accounting of ipc resources
  memcg: enable accounting for signals
  memcg: enable accounting for posix_timers_cache slab
  memcg: enable accounting for tty-related objects
  memcg: enable accounting for ldt_struct objects

 arch/x86/kernel/ldt.c      | 6 +++---
 drivers/tty/tty_io.c       | 4 ++--
 fs/fcntl.c                 | 3 ++-
 fs/locks.c                 | 6 ++++--
 fs/namespace.c             | 7 ++++---
 fs/select.c                | 4 ++--
 ipc/msg.c                  | 2 +-
 ipc/namespace.c            | 2 +-
 ipc/sem.c                  | 9 +++++----
 ipc/shm.c                  | 2 +-
 kernel/cgroup/namespace.c  | 2 +-
 kernel/nsproxy.c           | 2 +-
 kernel/pid_namespace.c     | 2 +-
 kernel/signal.c            | 2 +-
 kernel/time/namespace.c    | 4 ++--
 kernel/time/posix-timers.c | 4 ++--
 kernel/user_namespace.c    | 2 +-
 mm/memcontrol.c            | 2 +-
 net/8021q/vlan.c           | 2 +-
 net/core/dev.c             | 6 +++---
 net/core/fib_rules.c       | 4 ++--
 net/core/scm.c             | 4 ++--
 net/dccp/proto.c           | 2 +-
 net/ipv4/devinet.c         | 2 +-
 net/ipv4/fib_trie.c        | 4 ++--
 net/ipv4/tcp.c             | 4 +++-
 net/ipv6/addrconf.c        | 2 +-
 net/ipv6/ip6_fib.c         | 4 ++--
 net/ipv6/route.c           | 2 +-
 net/ipv6/sit.c             | 5 +++--
 30 files changed, 57 insertions(+), 49 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v5 07/16] memcg: enable accounting for mnt_cache entries
       [not found] ` <cover.1626688654.git.vvs@virtuozzo.com>
@ 2021-07-19 10:45   ` Vasily Averin
  2021-07-19 10:45   ` [PATCH v5 08/16] memcg: enable accounting for pollfd and select bits arrays Vasily Averin
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 22+ messages in thread
From: Vasily Averin @ 2021-07-19 10:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, linux-fsdevel,
	linux-kernel

The kernel allocates ~400 bytes of 'struct mount' for any new mount.
Creating a new mount namespace clones most of the parent mounts,
and this can be repeated many times. Additionally, each mount allocates
up to PATH_MAX=4096 bytes for mnt->mnt_devname.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/namespace.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index ab4174a..c6a74e5 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -203,7 +203,8 @@ static struct mount *alloc_vfsmnt(const char *name)
 			goto out_free_cache;
 
 		if (name) {
-			mnt->mnt_devname = kstrdup_const(name, GFP_KERNEL);
+			mnt->mnt_devname = kstrdup_const(name,
+							 GFP_KERNEL_ACCOUNT);
 			if (!mnt->mnt_devname)
 				goto out_free_id;
 		}
@@ -4222,7 +4223,7 @@ void __init mnt_init(void)
 	int err;
 
 	mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct mount),
-			0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+			0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
 
 	mount_hashtable = alloc_large_system_hash("Mount-cache",
 				sizeof(struct hlist_head),
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v5 08/16] memcg: enable accounting for pollfd and select bits arrays
       [not found] ` <cover.1626688654.git.vvs@virtuozzo.com>
  2021-07-19 10:45   ` [PATCH v5 07/16] memcg: enable accounting for mnt_cache entries Vasily Averin
@ 2021-07-19 10:45   ` Vasily Averin
  2021-07-19 10:45   ` [PATCH v5 09/16] memcg: enable accounting for file lock caches Vasily Averin
  2021-07-19 10:45   ` [PATCH v5 10/16] memcg: enable accounting for fasync_cache Vasily Averin
  3 siblings, 0 replies; 22+ messages in thread
From: Vasily Averin @ 2021-07-19 10:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, linux-fsdevel,
	linux-kernel

A user can call the select/poll system calls with a large number of
file descriptors and force the kernel to allocate up to several pages of
memory until these sleeping system calls return. These are long-lived,
unaccounted per-task allocations.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/select.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 945896d..e83e563 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -655,7 +655,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 			goto out_nofds;
 
 		alloc_size = 6 * size;
-		bits = kvmalloc(alloc_size, GFP_KERNEL);
+		bits = kvmalloc(alloc_size, GFP_KERNEL_ACCOUNT);
 		if (!bits)
 			goto out_nofds;
 	}
@@ -1000,7 +1000,7 @@ static int do_sys_poll(struct pollfd __user *ufds, unsigned int nfds,
 
 		len = min(todo, POLLFD_PER_PAGE);
 		walk = walk->next = kmalloc(struct_size(walk, entries, len),
-					    GFP_KERNEL);
+					    GFP_KERNEL_ACCOUNT);
 		if (!walk) {
 			err = -ENOMEM;
 			goto out_fds;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v5 09/16] memcg: enable accounting for file lock caches
       [not found] ` <cover.1626688654.git.vvs@virtuozzo.com>
  2021-07-19 10:45   ` [PATCH v5 07/16] memcg: enable accounting for mnt_cache entries Vasily Averin
  2021-07-19 10:45   ` [PATCH v5 08/16] memcg: enable accounting for pollfd and select bits arrays Vasily Averin
@ 2021-07-19 10:45   ` Vasily Averin
  2021-07-19 10:45   ` [PATCH v5 10/16] memcg: enable accounting for fasync_cache Vasily Averin
  3 siblings, 0 replies; 22+ messages in thread
From: Vasily Averin @ 2021-07-19 10:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, Jeff Layton,
	J. Bruce Fields, linux-fsdevel, linux-kernel

A user can create file locks on each open file and force the kernel
to allocate small but long-lived objects for every open file.

It makes sense to account for these objects to limit the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/locks.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 74b2a1d..1bc7ede 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -3056,10 +3056,12 @@ static int __init filelock_init(void)
 	int i;
 
 	flctx_cache = kmem_cache_create("file_lock_ctx",
-			sizeof(struct file_lock_context), 0, SLAB_PANIC, NULL);
+			sizeof(struct file_lock_context), 0,
+			SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	filelock_cache = kmem_cache_create("file_lock_cache",
-			sizeof(struct file_lock), 0, SLAB_PANIC, NULL);
+			sizeof(struct file_lock), 0,
+			SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	for_each_possible_cpu(i) {
 		struct file_lock_list_struct *fll = per_cpu_ptr(&file_lock_list, i);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v5 10/16] memcg: enable accounting for fasync_cache
       [not found] ` <cover.1626688654.git.vvs@virtuozzo.com>
                     ` (2 preceding siblings ...)
  2021-07-19 10:45   ` [PATCH v5 09/16] memcg: enable accounting for file lock caches Vasily Averin
@ 2021-07-19 10:45   ` Vasily Averin
  3 siblings, 0 replies; 22+ messages in thread
From: Vasily Averin @ 2021-07-19 10:45 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, Jeff Layton,
	J. Bruce Fields, linux-fsdevel, linux-kernel

fasync_struct is used by almost all character device drivers to set up
the fasync queue, and for regular files by the file lease code.
This structure is quite small but long-lived, and it can be allocated
for any open file.

It makes sense to account for its allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/fcntl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index dfc72f1..7941559 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1049,7 +1049,8 @@ static int __init fcntl_init(void)
 			__FMODE_EXEC | __FMODE_NONOTIFY));
 
 	fasync_cache = kmem_cache_create("fasync_cache",
-		sizeof(struct fasync_struct), 0, SLAB_PANIC, NULL);
+					 sizeof(struct fasync_struct), 0,
+					 SLAB_PANIC | SLAB_ACCOUNT, NULL);
 	return 0;
 }
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v6 00/16] memcg accounting from
  2021-07-19 10:44 ` [PATCH v5 00/16] memcg accounting from OpenVZ Vasily Averin
@ 2021-07-26 18:59   ` Vasily Averin
  2021-07-26 21:59     ` David Miller
       [not found]   ` <cover.1627321321.git.vvs@virtuozzo.com>
  1 sibling, 1 reply; 22+ messages in thread
From: Vasily Averin @ 2021-07-26 18:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tejun Heo, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Shakeel Butt, Yutian Yang,
	Alexander Viro, Alexey Dobriyan, Andrei Vagin, Andrew Morton,
	Borislav Petkov, Christian Brauner, David Ahern, David S. Miller,
	Dmitry Safonov, Eric Dumazet, Eric W. Biederman,
	Greg Kroah-Hartman, Hideaki YOSHIFUJI, H. Peter Anvin,
	Ingo Molnar, Jakub Kicinski, J. Bruce Fields, Jeff Layton,
	Jens Axboe, Jiri Slaby, Kirill Tkhai, Oleg Nesterov, Serge Hallyn,
	Thomas Gleixner, Zefan Li, netdev, linux-fsdevel, LKML

OpenVZ has used memory accounting for 20+ years, since the v2.2.x Linux kernels.
Initially we used our own accounting subsystem, then partially merged
it upstream, and a few years ago switched to cgroups v1.
Now we're rebasing again, revising our old patches and trying to push
them upstream.

We try to protect the host system from any misuse of kernel memory 
allocation triggered by untrusted users inside the containers.

This patch set is addressed mostly to the cgroups maintainers and the cgroups@
mailing list, though I would be very grateful for any comments from maintainers
of affected subsystems or other people added in cc:

Compared to the upstream, we additionally account the following kernel objects:
- network devices and their Tx/Rx queues
- ipv4/v6 addresses and routing-related objects
- inet_bind_bucket cache objects
- VLAN group arrays
- ipv6/sit: ip_tunnel_prl
- scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
- nsproxy and the namespace objects themselves
- IPC objects: semaphores, message queues and shared memory segments
- mounts
- pollfd and select bits arrays
- signals and posix timers
- file locks
- fasync_struct used by the file lease code and drivers' fasync queues
- tty objects
- per-mm LDT

Our accounting for a few other kernel objects is incorrect, incomplete or
obsolete: sk_filter, af_packets, netlink and xt_counters for iptables.
They require rework and will probably be dropped altogether.

We also plan to add accounting for nft, but it is not ready yet.

We have not tested performance on upstream. However, our performance team
compared our current RHEL7-based production kernel with the corresponding
original RHEL7 kernel and reports that it performs at least no worse.

v6:
- improved description of "memcg: enable accounting for signals"
  according to Eric Biederman's wishes
- added Reviewed-by tag from Shakeel Butt on the same patch

v5:
- rebased to v5.14-rc1
- updated ack tags

v4:
- improved description for tty patch
- minor cleanup in LDT patch
- rebased to v5.12
- resent to lkml@

v3:
- added new patches for other kinds of accounted objects
- combined patches for ip address/routing-related objects
- improved description
- re-ordered and rebased for linux 5.12-rc8

v2:
- squashed old patch 1 "accounting for allocations called with disabled BH"
   with old patch 2 "accounting for fib6_nodes cache", which uses this kind of
   memory allocation
- improved patch description
- subsystem maintainers added to cc:

Vasily Averin (16):
  memcg: enable accounting for net_device and Tx/Rx queues
  memcg: enable accounting for IP address and routing-related objects
  memcg: enable accounting for inet_bin_bucket cache
  memcg: enable accounting for VLAN group array
  memcg: ipv6/sit: account and don't WARN on ip_tunnel_prl structs
    allocation
  memcg: enable accounting for scm_fp_list objects
  memcg: enable accounting for mnt_cache entries
  memcg: enable accounting for pollfd and select bits arrays
  memcg: enable accounting for file lock caches
  memcg: enable accounting for fasync_cache
  memcg: enable accounting for new namesapces and struct nsproxy
  memcg: enable accounting of ipc resources
  memcg: enable accounting for signals
  memcg: enable accounting for posix_timers_cache slab
  memcg: enable accounting for tty-related objects
  memcg: enable accounting for ldt_struct objects

 arch/x86/kernel/ldt.c      | 6 +++---
 drivers/tty/tty_io.c       | 4 ++--
 fs/fcntl.c                 | 3 ++-
 fs/locks.c                 | 6 ++++--
 fs/namespace.c             | 7 ++++---
 fs/select.c                | 4 ++--
 ipc/msg.c                  | 2 +-
 ipc/namespace.c            | 2 +-
 ipc/sem.c                  | 9 +++++----
 ipc/shm.c                  | 2 +-
 kernel/cgroup/namespace.c  | 2 +-
 kernel/nsproxy.c           | 2 +-
 kernel/pid_namespace.c     | 2 +-
 kernel/signal.c            | 2 +-
 kernel/time/namespace.c    | 4 ++--
 kernel/time/posix-timers.c | 4 ++--
 kernel/user_namespace.c    | 2 +-
 mm/memcontrol.c            | 2 +-
 net/8021q/vlan.c           | 2 +-
 net/core/dev.c             | 6 +++---
 net/core/fib_rules.c       | 4 ++--
 net/core/scm.c             | 4 ++--
 net/dccp/proto.c           | 2 +-
 net/ipv4/devinet.c         | 2 +-
 net/ipv4/fib_trie.c        | 4 ++--
 net/ipv4/tcp.c             | 4 +++-
 net/ipv6/addrconf.c        | 2 +-
 net/ipv6/ip6_fib.c         | 4 ++--
 net/ipv6/route.c           | 2 +-
 net/ipv6/sit.c             | 5 +++--
 30 files changed, 57 insertions(+), 49 deletions(-)

-- 
1.8.3.1


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v6 07/16] memcg: enable accounting for mnt_cache entries
       [not found]   ` <cover.1627321321.git.vvs@virtuozzo.com>
@ 2021-07-26 19:00     ` Vasily Averin
  2021-07-26 19:00     ` [PATCH v6 08/16] memcg: enable accounting for pollfd and select bits arrays Vasily Averin
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 22+ messages in thread
From: Vasily Averin @ 2021-07-26 19:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, linux-fsdevel,
	linux-kernel

The kernel allocates ~400 bytes of 'struct mount' for any new mount.
Creating a new mount namespace clones most of the parent mounts,
and this can be repeated many times. Additionally, each mount allocates
up to PATH_MAX=4096 bytes for mnt->mnt_devname.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/namespace.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index ab4174a..c6a74e5 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -203,7 +203,8 @@ static struct mount *alloc_vfsmnt(const char *name)
 			goto out_free_cache;
 
 		if (name) {
-			mnt->mnt_devname = kstrdup_const(name, GFP_KERNEL);
+			mnt->mnt_devname = kstrdup_const(name,
+							 GFP_KERNEL_ACCOUNT);
 			if (!mnt->mnt_devname)
 				goto out_free_id;
 		}
@@ -4222,7 +4223,7 @@ void __init mnt_init(void)
 	int err;
 
 	mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct mount),
-			0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+			0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
 
 	mount_hashtable = alloc_large_system_hash("Mount-cache",
 				sizeof(struct hlist_head),
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v6 08/16] memcg: enable accounting for pollfd and select bits arrays
       [not found]   ` <cover.1627321321.git.vvs@virtuozzo.com>
  2021-07-26 19:00     ` [PATCH v6 07/16] memcg: enable accounting for mnt_cache entries Vasily Averin
@ 2021-07-26 19:00     ` Vasily Averin
  2021-07-26 19:01     ` [PATCH v6 09/16] memcg: enable accounting for file lock caches Vasily Averin
  2021-07-26 19:01     ` [PATCH v6 10/16] memcg: enable accounting for fasync_cache Vasily Averin
  3 siblings, 0 replies; 22+ messages in thread
From: Vasily Averin @ 2021-07-26 19:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, linux-fsdevel,
	linux-kernel

A user can call the select/poll system calls with a large number of
file descriptors and force the kernel to allocate up to several pages of
memory until these sleeping system calls return. These are long-lived,
unaccounted per-task allocations.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/select.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 945896d..e83e563 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -655,7 +655,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 			goto out_nofds;
 
 		alloc_size = 6 * size;
-		bits = kvmalloc(alloc_size, GFP_KERNEL);
+		bits = kvmalloc(alloc_size, GFP_KERNEL_ACCOUNT);
 		if (!bits)
 			goto out_nofds;
 	}
@@ -1000,7 +1000,7 @@ static int do_sys_poll(struct pollfd __user *ufds, unsigned int nfds,
 
 		len = min(todo, POLLFD_PER_PAGE);
 		walk = walk->next = kmalloc(struct_size(walk, entries, len),
-					    GFP_KERNEL);
+					    GFP_KERNEL_ACCOUNT);
 		if (!walk) {
 			err = -ENOMEM;
 			goto out_fds;
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v6 09/16] memcg: enable accounting for file lock caches
       [not found]   ` <cover.1627321321.git.vvs@virtuozzo.com>
  2021-07-26 19:00     ` [PATCH v6 07/16] memcg: enable accounting for mnt_cache entries Vasily Averin
  2021-07-26 19:00     ` [PATCH v6 08/16] memcg: enable accounting for pollfd and select bits arrays Vasily Averin
@ 2021-07-26 19:01     ` Vasily Averin
  2021-07-26 19:01     ` [PATCH v6 10/16] memcg: enable accounting for fasync_cache Vasily Averin
  3 siblings, 0 replies; 22+ messages in thread
From: Vasily Averin @ 2021-07-26 19:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, Jeff Layton,
	J. Bruce Fields, linux-fsdevel, linux-kernel

A user can create file locks on each open file and force the kernel
to allocate small but long-lived objects for every open file.

It makes sense to account for these objects to limit the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/locks.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 74b2a1d..1bc7ede 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -3056,10 +3056,12 @@ static int __init filelock_init(void)
 	int i;
 
 	flctx_cache = kmem_cache_create("file_lock_ctx",
-			sizeof(struct file_lock_context), 0, SLAB_PANIC, NULL);
+			sizeof(struct file_lock_context), 0,
+			SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	filelock_cache = kmem_cache_create("file_lock_cache",
-			sizeof(struct file_lock), 0, SLAB_PANIC, NULL);
+			sizeof(struct file_lock), 0,
+			SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	for_each_possible_cpu(i) {
 		struct file_lock_list_struct *fll = per_cpu_ptr(&file_lock_list, i);
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH v6 10/16] memcg: enable accounting for fasync_cache
       [not found]   ` <cover.1627321321.git.vvs@virtuozzo.com>
                       ` (2 preceding siblings ...)
  2021-07-26 19:01     ` [PATCH v6 09/16] memcg: enable accounting for file lock caches Vasily Averin
@ 2021-07-26 19:01     ` Vasily Averin
  3 siblings, 0 replies; 22+ messages in thread
From: Vasily Averin @ 2021-07-26 19:01 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, Jeff Layton,
	J. Bruce Fields, linux-fsdevel, linux-kernel

fasync_struct is used by almost all character device drivers to set up
the fasync queue, and for regular files by the file lease code.
This structure is quite small but long-lived, and it can be allocated
for any open file.

It makes sense to account for its allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/fcntl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index dfc72f1..7941559 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1049,7 +1049,8 @@ static int __init fcntl_init(void)
 			__FMODE_EXEC | __FMODE_NONOTIFY));
 
 	fasync_cache = kmem_cache_create("fasync_cache",
-		sizeof(struct fasync_struct), 0, SLAB_PANIC, NULL);
+					 sizeof(struct fasync_struct), 0,
+					 SLAB_PANIC | SLAB_ACCOUNT, NULL);
 	return 0;
 }
 
-- 
1.8.3.1


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH v6 00/16] memcg accounting from
  2021-07-26 18:59   ` [PATCH v6 00/16] memcg accounting from Vasily Averin
@ 2021-07-26 21:59     ` David Miller
  2021-07-27  4:44       ` [PATCH v6 00/16] memcg accounting from OpenVZ Vasily Averin
  0 siblings, 1 reply; 22+ messages in thread
From: David Miller @ 2021-07-26 21:59 UTC (permalink / raw)
  To: vvs
  Cc: akpm, tj, cgroups, mhocko, hannes, vdavydov.dev, guro, shakeelb,
	nglaive, viro, adobriyan, avagin, bp, christian.brauner, dsahern,
	0x7f454c46, edumazet, ebiederm, gregkh, yoshfuji, hpa, mingo,
	kuba, bfields, jlayton, axboe, jirislaby, ktkhai, oleg, serge,
	tglx, lizefan.x, netdev, linux-fsdevel, linux-kernel


This series does not apply cleanly to net-next, please respin.

Thank you.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v6 00/16] memcg accounting from OpenVZ
  2021-07-26 21:59     ` David Miller
@ 2021-07-27  4:44       ` Vasily Averin
  2021-07-27  5:33         ` [PATCH v7 00/10] " Vasily Averin
       [not found]         ` <cover.1627362057.git.vvs@virtuozzo.com>
  0 siblings, 2 replies; 22+ messages in thread
From: Vasily Averin @ 2021-07-27  4:44 UTC (permalink / raw)
  To: David Miller
  Cc: akpm, tj, cgroups, mhocko, hannes, vdavydov.dev, guro, shakeelb,
	nglaive, viro, adobriyan, avagin, bp, christian.brauner, dsahern,
	0x7f454c46, edumazet, ebiederm, gregkh, yoshfuji, hpa, mingo,
	kuba, bfields, jlayton, axboe, jirislaby, ktkhai, oleg, serge,
	tglx, lizefan.x, netdev, linux-fsdevel, linux-kernel

On 7/27/21 12:59 AM, David Miller wrote:
> 
> This series does not apply cleanly to net-next, please respin.

Dear David,
I found that you have already approved the net-related patches of this series and merged them into net-next.
So I'll respin v7 without these patches.

Thank you,
	Vasily Averin

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v7 00/10] memcg accounting from OpenVZ
  2021-07-27  4:44       ` [PATCH v6 00/16] memcg accounting from OpenVZ Vasily Averin
@ 2021-07-27  5:33         ` Vasily Averin
       [not found]         ` <cover.1627362057.git.vvs@virtuozzo.com>
  1 sibling, 0 replies; 22+ messages in thread
From: Vasily Averin @ 2021-07-27  5:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Tejun Heo, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Shakeel Butt, Yutian Yang,
	Alexander Viro, Alexey Dobriyan, Andrei Vagin, Borislav Petkov,
	Christian Brauner, Dmitry Safonov, Eric W. Biederman,
	Greg Kroah-Hartman, H. Peter Anvin, Ingo Molnar, J. Bruce Fields,
	Jeff Layton, Jens Axboe, Jiri Slaby, Kirill Tkhai, Oleg Nesterov,
	Serge Hallyn, Thomas Gleixner, Zefan Li, linux-fsdevel, LKML

OpenVZ has used memory accounting for 20+ years, since the v2.2.x Linux kernels.
Initially we used our own accounting subsystem, then partially merged
it upstream, and a few years ago switched to cgroups v1.
Now we're rebasing again, revising our old patches and trying to push
them upstream.

We try to protect the host system from any misuse of kernel memory 
allocation triggered by untrusted users inside the containers.

This patch set is addressed mostly to the cgroups maintainers and the cgroups@
mailing list, though I would be very grateful for any comments from maintainers
of affected subsystems or other people added in cc:

Compared to the upstream, we additionally account the following kernel objects:
- network devices and their Tx/Rx queues
- ipv4/v6 addresses and routing-related objects
- inet_bind_bucket cache objects
- VLAN group arrays
- ipv6/sit: ip_tunnel_prl
- scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
- nsproxy and the namespace objects themselves
- IPC objects: semaphores, message queues and shared memory segments
- mounts
- pollfd and select bits arrays
- signals and posix timers
- file locks
- fasync_struct used by the file lease code and drivers' fasync queues
- tty objects
- per-mm LDT

Our accounting for a few other kernel objects is incorrect, incomplete or
obsolete: sk_filter, af_packets, netlink and xt_counters for iptables.
They require rework and will probably be dropped altogether.

We also plan to add accounting for nft, but it is not ready yet.

We have not tested performance on upstream. However, our performance team
compared our current RHEL7-based production kernel with the corresponding
original RHEL7 kernel and reports that it performs at least no worse.

v7:
- net-related patches were approved and merged into the net-next tree
- rebased to v5.14-rc3
- added Acked-by tag from Kirill Tkhai on "memcg: enable accounting for
  new namesapces and struct nsproxy"

v6:
- improved description of "memcg: enable accounting for signals"
  according to Eric Biederman's wishes
- added Reviewed-by tag from Shakeel Butt on the same patch

v5:
- rebased to v5.14-rc1
- updated ack tags

v4:
- improved description for tty patch
- minor cleanup in LDT patch
- rebased to v5.12
- resent to lkml@

v3:
- added new patches for other kinds of accounted objects
- combined patches for ip address/routing-related objects
- improved description
- re-ordered and rebased for linux 5.12-rc8

v2:
- squashed old patch 1 "accounting for allocations called with disabled BH"
   with old patch 2 "accounting for fib6_nodes cache", which uses this kind of
   memory allocation
- improved patch description
- subsystem maintainers added to cc:

Vasily Averin (10):
  memcg: enable accounting for mnt_cache entries
  memcg: enable accounting for pollfd and select bits arrays
  memcg: enable accounting for file lock caches
  memcg: enable accounting for fasync_cache
  memcg: enable accounting for new namesapces and struct nsproxy
  memcg: enable accounting of ipc resources
  memcg: enable accounting for signals
  memcg: enable accounting for posix_timers_cache slab
  memcg: enable accounting for tty-related objects
  memcg: enable accounting for ldt_struct objects

 arch/x86/kernel/ldt.c      | 6 +++---
 drivers/tty/tty_io.c       | 4 ++--
 fs/fcntl.c                 | 3 ++-
 fs/locks.c                 | 6 ++++--
 fs/namespace.c             | 7 ++++---
 fs/select.c                | 4 ++--
 ipc/msg.c                  | 2 +-
 ipc/namespace.c            | 2 +-
 ipc/sem.c                  | 9 +++++----
 ipc/shm.c                  | 2 +-
 kernel/cgroup/namespace.c  | 2 +-
 kernel/nsproxy.c           | 2 +-
 kernel/pid_namespace.c     | 2 +-
 kernel/signal.c            | 2 +-
 kernel/time/namespace.c    | 4 ++--
 kernel/time/posix-timers.c | 4 ++--
 kernel/user_namespace.c    | 2 +-
 17 files changed, 34 insertions(+), 29 deletions(-)

-- 
1.8.3.1


* [PATCH v7 01/10] memcg: enable accounting for mnt_cache entries
       [not found]         ` <cover.1627362057.git.vvs@virtuozzo.com>
@ 2021-07-27  5:33           ` Vasily Averin
  2021-07-27  6:44             ` Shakeel Butt
  2021-07-27  7:21             ` Christian Brauner
  2021-07-27  5:33           ` [PATCH v7 02/10] memcg: enable accounting for pollfd and select bits arrays Vasily Averin
                             ` (2 subsequent siblings)
  3 siblings, 2 replies; 22+ messages in thread
From: Vasily Averin @ 2021-07-27  5:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, linux-fsdevel,
	linux-kernel

The kernel allocates ~400 bytes of 'strcut mount' for any new mount.
Creating a new mount namespace clones most of the parent mounts,
and this can be repeated many times. Additionally, each mount allocates
up to PATH_MAX=4096 bytes for mnt->mnt_devname.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/namespace.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index ab4174a..c6a74e5 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -203,7 +203,8 @@ static struct mount *alloc_vfsmnt(const char *name)
 			goto out_free_cache;
 
 		if (name) {
-			mnt->mnt_devname = kstrdup_const(name, GFP_KERNEL);
+			mnt->mnt_devname = kstrdup_const(name,
+							 GFP_KERNEL_ACCOUNT);
 			if (!mnt->mnt_devname)
 				goto out_free_id;
 		}
@@ -4222,7 +4223,7 @@ void __init mnt_init(void)
 	int err;
 
 	mnt_cache = kmem_cache_create("mnt_cache", sizeof(struct mount),
-			0, SLAB_HWCACHE_ALIGN | SLAB_PANIC, NULL);
+			0, SLAB_HWCACHE_ALIGN|SLAB_PANIC|SLAB_ACCOUNT, NULL);
 
 	mount_hashtable = alloc_large_system_hash("Mount-cache",
 				sizeof(struct hlist_head),
-- 
1.8.3.1


* [PATCH v7 02/10] memcg: enable accounting for pollfd and select bits arrays
       [not found]         ` <cover.1627362057.git.vvs@virtuozzo.com>
  2021-07-27  5:33           ` [PATCH v7 01/10] memcg: enable accounting for mnt_cache entries Vasily Averin
@ 2021-07-27  5:33           ` Vasily Averin
  2021-07-27 21:39             ` Shakeel Butt
  2021-07-27  5:33           ` [PATCH v7 03/10] memcg: enable accounting for file lock caches Vasily Averin
  2021-07-27  5:33           ` [PATCH v7 04/10] memcg: enable accounting for fasync_cache Vasily Averin
  3 siblings, 1 reply; 22+ messages in thread
From: Vasily Averin @ 2021-07-27  5:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, linux-fsdevel,
	linux-kernel

User can call select/poll system calls with a large number of assigned
file descriptors and force kernel to allocate up to several pages of memory
till end of these sleeping system calls. We have here long-living
unaccounted per-task allocations.

It makes sense to account for these allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/select.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 945896d..e83e563 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -655,7 +655,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
 			goto out_nofds;
 
 		alloc_size = 6 * size;
-		bits = kvmalloc(alloc_size, GFP_KERNEL);
+		bits = kvmalloc(alloc_size, GFP_KERNEL_ACCOUNT);
 		if (!bits)
 			goto out_nofds;
 	}
@@ -1000,7 +1000,7 @@ static int do_sys_poll(struct pollfd __user *ufds, unsigned int nfds,
 
 		len = min(todo, POLLFD_PER_PAGE);
 		walk = walk->next = kmalloc(struct_size(walk, entries, len),
-					    GFP_KERNEL);
+					    GFP_KERNEL_ACCOUNT);
 		if (!walk) {
 			err = -ENOMEM;
 			goto out_fds;
-- 
1.8.3.1


* [PATCH v7 03/10] memcg: enable accounting for file lock caches
       [not found]         ` <cover.1627362057.git.vvs@virtuozzo.com>
  2021-07-27  5:33           ` [PATCH v7 01/10] memcg: enable accounting for mnt_cache entries Vasily Averin
  2021-07-27  5:33           ` [PATCH v7 02/10] memcg: enable accounting for pollfd and select bits arrays Vasily Averin
@ 2021-07-27  5:33           ` Vasily Averin
  2021-07-27 21:41             ` Shakeel Butt
  2021-07-27  5:33           ` [PATCH v7 04/10] memcg: enable accounting for fasync_cache Vasily Averin
  3 siblings, 1 reply; 22+ messages in thread
From: Vasily Averin @ 2021-07-27  5:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, Jeff Layton,
	J. Bruce Fields, linux-fsdevel, linux-kernel

User can create file locks for each open file and force kernel
to allocate small but long-living objects per each open file.

It makes sense to account for these objects to limit the host's memory
consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/locks.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 74b2a1d..1bc7ede 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -3056,10 +3056,12 @@ static int __init filelock_init(void)
 	int i;
 
 	flctx_cache = kmem_cache_create("file_lock_ctx",
-			sizeof(struct file_lock_context), 0, SLAB_PANIC, NULL);
+			sizeof(struct file_lock_context), 0,
+			SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	filelock_cache = kmem_cache_create("file_lock_cache",
-			sizeof(struct file_lock), 0, SLAB_PANIC, NULL);
+			sizeof(struct file_lock), 0,
+			SLAB_PANIC | SLAB_ACCOUNT, NULL);
 
 	for_each_possible_cpu(i) {
 		struct file_lock_list_struct *fll = per_cpu_ptr(&file_lock_list, i);
-- 
1.8.3.1


* [PATCH v7 04/10] memcg: enable accounting for fasync_cache
       [not found]         ` <cover.1627362057.git.vvs@virtuozzo.com>
                             ` (2 preceding siblings ...)
  2021-07-27  5:33           ` [PATCH v7 03/10] memcg: enable accounting for file lock caches Vasily Averin
@ 2021-07-27  5:33           ` Vasily Averin
  2021-07-27 21:50             ` Shakeel Butt
  3 siblings, 1 reply; 22+ messages in thread
From: Vasily Averin @ 2021-07-27  5:33 UTC (permalink / raw)
  To: Andrew Morton
  Cc: cgroups, Michal Hocko, Shakeel Butt, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, Jeff Layton,
	J. Bruce Fields, linux-fsdevel, linux-kernel

fasync_struct is used by almost all character device drivers to set up
the fasync queue, and for regular files by the file lease code.
This structure is quite small but long-living and it can be assigned
for any open file.

It makes sense to account for its allocations to restrict the host's
memory consumption from inside the memcg-limited container.

Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
---
 fs/fcntl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index f946bec..714e7c9 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -1049,7 +1049,8 @@ static int __init fcntl_init(void)
 			__FMODE_EXEC | __FMODE_NONOTIFY));
 
 	fasync_cache = kmem_cache_create("fasync_cache",
-		sizeof(struct fasync_struct), 0, SLAB_PANIC, NULL);
+					 sizeof(struct fasync_struct), 0,
+					 SLAB_PANIC | SLAB_ACCOUNT, NULL);
 	return 0;
 }
 
-- 
1.8.3.1


* Re: [PATCH v7 01/10] memcg: enable accounting for mnt_cache entries
  2021-07-27  5:33           ` [PATCH v7 01/10] memcg: enable accounting for mnt_cache entries Vasily Averin
@ 2021-07-27  6:44             ` Shakeel Butt
  2021-07-27  7:21             ` Christian Brauner
  1 sibling, 0 replies; 22+ messages in thread
From: Shakeel Butt @ 2021-07-27  6:44 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, linux-fsdevel,
	LKML

On Mon, Jul 26, 2021 at 10:33 PM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> The kernel allocates ~400 bytes of 'strcut mount' for any new mount.

*struct mount*

> Creating a new mount namespace clones most of the parent mounts,
> and this can be repeated many times. Additionally, each mount allocates
> up to PATH_MAX=4096 bytes for mnt->mnt_devname.
>
> It makes sense to account for these allocations to restrict the host's
> memory consumption from inside the memcg-limited container.
>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

* Re: [PATCH v7 01/10] memcg: enable accounting for mnt_cache entries
  2021-07-27  5:33           ` [PATCH v7 01/10] memcg: enable accounting for mnt_cache entries Vasily Averin
  2021-07-27  6:44             ` Shakeel Butt
@ 2021-07-27  7:21             ` Christian Brauner
  1 sibling, 0 replies; 22+ messages in thread
From: Christian Brauner @ 2021-07-27  7:21 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, cgroups, Michal Hocko, Shakeel Butt,
	Johannes Weiner, Vladimir Davydov, Roman Gushchin, Alexander Viro,
	linux-fsdevel, linux-kernel

On Tue, Jul 27, 2021 at 08:33:12AM +0300, Vasily Averin wrote:
> The kernel allocates ~400 bytes of 'strcut mount' for any new mount.
> Creating a new mount namespace clones most of the parent mounts,
> and this can be repeated many times. Additionally, each mount allocates
> up to PATH_MAX=4096 bytes for mnt->mnt_devname.
> 
> It makes sense to account for these allocations to restrict the host's
> memory consumption from inside the memcg-limited container.
> 
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---

Looks good. Thank you!
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>

I wonder how much this increases reported memory consumption when you
boot full system containers that run systemd and a bunch of systemd
services that each use a separate mount namespace.

* Re: [PATCH v7 02/10] memcg: enable accounting for pollfd and select bits arrays
  2021-07-27  5:33           ` [PATCH v7 02/10] memcg: enable accounting for pollfd and select bits arrays Vasily Averin
@ 2021-07-27 21:39             ` Shakeel Butt
  0 siblings, 0 replies; 22+ messages in thread
From: Shakeel Butt @ 2021-07-27 21:39 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, linux-fsdevel,
	LKML

On Mon, Jul 26, 2021 at 10:33 PM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> User can call select/poll system calls with a large number of assigned
> file descriptors and force kernel to allocate up to several pages of memory
> till end of these sleeping system calls. We have here long-living
> unaccounted per-task allocations.
>
> It makes sense to account for these allocations to restrict the host's
> memory consumption from inside the memcg-limited container.
>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
> ---
>  fs/select.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/select.c b/fs/select.c
> index 945896d..e83e563 100644
> --- a/fs/select.c
> +++ b/fs/select.c
> @@ -655,7 +655,7 @@ int core_sys_select(int n, fd_set __user *inp, fd_set __user *outp,
>                         goto out_nofds;
>
>                 alloc_size = 6 * size;
> -               bits = kvmalloc(alloc_size, GFP_KERNEL);
> +               bits = kvmalloc(alloc_size, GFP_KERNEL_ACCOUNT);

What about the similar allocation in compat_core_sys_select()? Also
what about the allocation in poll_get_entry()?

>                 if (!bits)
>                         goto out_nofds;
>         }
> @@ -1000,7 +1000,7 @@ static int do_sys_poll(struct pollfd __user *ufds, unsigned int nfds,
>
>                 len = min(todo, POLLFD_PER_PAGE);
>                 walk = walk->next = kmalloc(struct_size(walk, entries, len),
> -                                           GFP_KERNEL);
> +                                           GFP_KERNEL_ACCOUNT);
>                 if (!walk) {
>                         err = -ENOMEM;
>                         goto out_fds;
> --
> 1.8.3.1
>

* Re: [PATCH v7 03/10] memcg: enable accounting for file lock caches
  2021-07-27  5:33           ` [PATCH v7 03/10] memcg: enable accounting for file lock caches Vasily Averin
@ 2021-07-27 21:41             ` Shakeel Butt
  0 siblings, 0 replies; 22+ messages in thread
From: Shakeel Butt @ 2021-07-27 21:41 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, Jeff Layton,
	J. Bruce Fields, linux-fsdevel, LKML

On Mon, Jul 26, 2021 at 10:33 PM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> User can create file locks for each open file and force kernel
> to allocate small but long-living objects per each open file.
>
> It makes sense to account for these objects to limit the host's memory
> consumption from inside the memcg-limited container.
>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

* Re: [PATCH v7 04/10] memcg: enable accounting for fasync_cache
  2021-07-27  5:33           ` [PATCH v7 04/10] memcg: enable accounting for fasync_cache Vasily Averin
@ 2021-07-27 21:50             ` Shakeel Butt
  0 siblings, 0 replies; 22+ messages in thread
From: Shakeel Butt @ 2021-07-27 21:50 UTC (permalink / raw)
  To: Vasily Averin
  Cc: Andrew Morton, Cgroups, Michal Hocko, Johannes Weiner,
	Vladimir Davydov, Roman Gushchin, Alexander Viro, Jeff Layton,
	J. Bruce Fields, linux-fsdevel, LKML

On Mon, Jul 26, 2021 at 10:33 PM Vasily Averin <vvs@virtuozzo.com> wrote:
>
> fasync_struct is used by almost all character device drivers to set up
> the fasync queue, and for regular files by the file lease code.
> This structure is quite small but long-living and it can be assigned
> for any open file.
>
> It makes sense to account for its allocations to restrict the host's
> memory consumption from inside the memcg-limited container.
>
> Signed-off-by: Vasily Averin <vvs@virtuozzo.com>

Reviewed-by: Shakeel Butt <shakeelb@google.com>

Thread overview: 22+ messages
     [not found] <CALvZod66KF-8xKB1dyY2twizDE=svE8iXT_nqvsrfWg1a92f4A@mail.gmail.com>
2021-07-19 10:44 ` [PATCH v5 00/16] memcg accounting from OpenVZ Vasily Averin
2021-07-26 18:59   ` [PATCH v6 00/16] memcg accounting from Vasily Averin
2021-07-26 21:59     ` David Miller
2021-07-27  4:44       ` [PATCH v6 00/16] memcg accounting from OpenVZ Vasily Averin
2021-07-27  5:33         ` [PATCH v7 00/10] " Vasily Averin
     [not found]         ` <cover.1627362057.git.vvs@virtuozzo.com>
2021-07-27  5:33           ` [PATCH v7 01/10] memcg: enable accounting for mnt_cache entries Vasily Averin
2021-07-27  6:44             ` Shakeel Butt
2021-07-27  7:21             ` Christian Brauner
2021-07-27  5:33           ` [PATCH v7 02/10] memcg: enable accounting for pollfd and select bits arrays Vasily Averin
2021-07-27 21:39             ` Shakeel Butt
2021-07-27  5:33           ` [PATCH v7 03/10] memcg: enable accounting for file lock caches Vasily Averin
2021-07-27 21:41             ` Shakeel Butt
2021-07-27  5:33           ` [PATCH v7 04/10] memcg: enable accounting for fasync_cache Vasily Averin
2021-07-27 21:50             ` Shakeel Butt
     [not found]   ` <cover.1627321321.git.vvs@virtuozzo.com>
2021-07-26 19:00     ` [PATCH v6 07/16] memcg: enable accounting for mnt_cache entries Vasily Averin
2021-07-26 19:00     ` [PATCH v6 08/16] memcg: enable accounting for pollfd and select bits arrays Vasily Averin
2021-07-26 19:01     ` [PATCH v6 09/16] memcg: enable accounting for file lock caches Vasily Averin
2021-07-26 19:01     ` [PATCH v6 10/16] memcg: enable accounting for fasync_cache Vasily Averin
     [not found] ` <cover.1626688654.git.vvs@virtuozzo.com>
2021-07-19 10:45   ` [PATCH v5 07/16] memcg: enable accounting for mnt_cache entries Vasily Averin
2021-07-19 10:45   ` [PATCH v5 08/16] memcg: enable accounting for pollfd and select bits arrays Vasily Averin
2021-07-19 10:45   ` [PATCH v5 09/16] memcg: enable accounting for file lock caches Vasily Averin
2021-07-19 10:45   ` [PATCH v5 10/16] memcg: enable accounting for fasync_cache Vasily Averin
