* Re: [PATCH v4 1/4] mm: support deterministic memory charging of filesystems
@ 2021-11-22 20:43 kernel test robot
0 siblings, 0 replies; 4+ messages in thread
From: kernel test robot @ 2021-11-22 20:43 UTC (permalink / raw)
To: kbuild
[-- Attachment #1: Type: text/plain, Size: 7596 bytes --]
CC: kbuild-all(a)lists.01.org
In-Reply-To: <20211120045011.3074840-2-almasrymina@google.com>
References: <20211120045011.3074840-2-almasrymina@google.com>
TO: Mina Almasry <almasrymina@google.com>
TO: Alexander Viro <viro@zeniv.linux.org.uk>
TO: Andrew Morton <akpm@linux-foundation.org>
CC: Linux Memory Management List <linux-mm@kvack.org>
TO: Johannes Weiner <hannes@cmpxchg.org>
TO: Michal Hocko <mhocko@kernel.org>
TO: Vladimir Davydov <vdavydov.dev@gmail.com>
TO: Hugh Dickins <hughd@google.com>
CC: Mina Almasry <almasrymina@google.com>
CC: Jonathan Corbet <corbet@lwn.net>
CC: Shuah Khan <skhan@linuxfoundation.org>
Hi Mina,
Thank you for the patch! Perhaps something to improve:
[auto build test WARNING on hnaz-mm/master]
[also build test WARNING on linus/master v5.16-rc2 next-20211118]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]
url: https://github.com/0day-ci/linux/commits/Mina-Almasry/Deterministic-charging-of-shared-memory/20211120-125229
base: https://github.com/hnaz/linux-mm master
:::::: branch date: 3 days ago
:::::: commit date: 3 days ago
config: i386-randconfig-s002-20211122 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce:
# apt-get install sparse
# sparse version: v0.6.4-dirty
# https://github.com/0day-ci/linux/commit/5ecf5e613f50d859803aae9bc6f8295cb199701d
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review Mina-Almasry/Deterministic-charging-of-shared-memory/20211120-125229
git checkout 5ecf5e613f50d859803aae9bc6f8295cb199701d
# save the attached .config to linux build tree
make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' ARCH=i386
If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>
sparse warnings: (new ones prefixed by >>)
mm/memcontrol.c:2644:17: sparse: sparse: incompatible types in comparison expression (different address spaces):
>> mm/memcontrol.c:2644:17: sparse: struct mem_cgroup [noderef] __rcu *
>> mm/memcontrol.c:2644:17: sparse: struct mem_cgroup *
mm/memcontrol.c:2699:17: sparse: sparse: incompatible types in comparison expression (different address spaces):
mm/memcontrol.c:2699:17: sparse: struct mem_cgroup [noderef] __rcu *
mm/memcontrol.c:2699:17: sparse: struct mem_cgroup *
mm/memcontrol.c:4192:21: sparse: sparse: incompatible types in comparison expression (different address spaces):
mm/memcontrol.c:4192:21: sparse: struct mem_cgroup_threshold_ary [noderef] __rcu *
mm/memcontrol.c:4192:21: sparse: struct mem_cgroup_threshold_ary *
mm/memcontrol.c:4194:21: sparse: sparse: incompatible types in comparison expression (different address spaces):
mm/memcontrol.c:4194:21: sparse: struct mem_cgroup_threshold_ary [noderef] __rcu *
mm/memcontrol.c:4194:21: sparse: struct mem_cgroup_threshold_ary *
mm/memcontrol.c:4350:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
mm/memcontrol.c:4350:9: sparse: struct mem_cgroup_threshold_ary [noderef] __rcu *
mm/memcontrol.c:4350:9: sparse: struct mem_cgroup_threshold_ary *
mm/memcontrol.c:4444:9: sparse: sparse: incompatible types in comparison expression (different address spaces):
mm/memcontrol.c:4444:9: sparse: struct mem_cgroup_threshold_ary [noderef] __rcu *
mm/memcontrol.c:4444:9: sparse: struct mem_cgroup_threshold_ary *
mm/memcontrol.c:6059:23: sparse: sparse: incompatible types in comparison expression (different address spaces):
mm/memcontrol.c:6059:23: sparse: struct task_struct [noderef] __rcu *
mm/memcontrol.c:6059:23: sparse: struct task_struct *
mm/memcontrol.c: note: in included file:
include/linux/memcontrol.h:779:9: sparse: sparse: context imbalance in 'folio_lruvec_lock' - wrong count at exit
include/linux/memcontrol.h:779:9: sparse: sparse: context imbalance in 'folio_lruvec_lock_irq' - wrong count at exit
include/linux/memcontrol.h:779:9: sparse: sparse: context imbalance in 'folio_lruvec_lock_irqsave' - wrong count at exit
mm/memcontrol.c:2019:6: sparse: sparse: context imbalance in 'folio_memcg_lock' - wrong count at exit
mm/memcontrol.c:2071:17: sparse: sparse: context imbalance in '__folio_memcg_unlock' - unexpected unlock
mm/memcontrol.c:5910:28: sparse: sparse: context imbalance in 'mem_cgroup_count_precharge_pte_range' - unexpected unlock
mm/memcontrol.c:6104:36: sparse: sparse: context imbalance in 'mem_cgroup_move_charge_pte_range' - unexpected unlock
vim +2644 mm/memcontrol.c
5ecf5e613f50d8 Mina Almasry 2021-11-19 2630
5ecf5e613f50d8 Mina Almasry 2021-11-19 2631 void mem_cgroup_put_name_in_seq(struct seq_file *m, struct super_block *sb)
5ecf5e613f50d8 Mina Almasry 2021-11-19 2632 {
5ecf5e613f50d8 Mina Almasry 2021-11-19 2633 struct mem_cgroup *memcg;
5ecf5e613f50d8 Mina Almasry 2021-11-19 2634 int ret = 0;
5ecf5e613f50d8 Mina Almasry 2021-11-19 2635 char *buf = __getname();
5ecf5e613f50d8 Mina Almasry 2021-11-19 2636 int len = PATH_MAX;
5ecf5e613f50d8 Mina Almasry 2021-11-19 2637
5ecf5e613f50d8 Mina Almasry 2021-11-19 2638 if (!buf)
5ecf5e613f50d8 Mina Almasry 2021-11-19 2639 return;
5ecf5e613f50d8 Mina Almasry 2021-11-19 2640
5ecf5e613f50d8 Mina Almasry 2021-11-19 2641 buf[0] = '\0';
5ecf5e613f50d8 Mina Almasry 2021-11-19 2642
5ecf5e613f50d8 Mina Almasry 2021-11-19 2643 rcu_read_lock();
5ecf5e613f50d8 Mina Almasry 2021-11-19 @2644 memcg = rcu_dereference(sb->s_memcg_to_charge);
5ecf5e613f50d8 Mina Almasry 2021-11-19 2645 if (memcg && !css_tryget_online(&memcg->css))
5ecf5e613f50d8 Mina Almasry 2021-11-19 2646 memcg = NULL;
5ecf5e613f50d8 Mina Almasry 2021-11-19 2647 rcu_read_unlock();
5ecf5e613f50d8 Mina Almasry 2021-11-19 2648
5ecf5e613f50d8 Mina Almasry 2021-11-19 2649 if (!memcg)
5ecf5e613f50d8 Mina Almasry 2021-11-19 2650 return;
5ecf5e613f50d8 Mina Almasry 2021-11-19 2651
5ecf5e613f50d8 Mina Almasry 2021-11-19 2652 ret = cgroup_path(memcg->css.cgroup, buf + len / 2, len / 2);
5ecf5e613f50d8 Mina Almasry 2021-11-19 2653 if (ret >= len / 2)
5ecf5e613f50d8 Mina Almasry 2021-11-19 2654 strcpy(buf, "?");
5ecf5e613f50d8 Mina Almasry 2021-11-19 2655 else {
5ecf5e613f50d8 Mina Almasry 2021-11-19 2656 char *p = mangle_path(buf, buf + len / 2, " \t\n\\");
5ecf5e613f50d8 Mina Almasry 2021-11-19 2657
5ecf5e613f50d8 Mina Almasry 2021-11-19 2658 if (p)
5ecf5e613f50d8 Mina Almasry 2021-11-19 2659 *p = '\0';
5ecf5e613f50d8 Mina Almasry 2021-11-19 2660 else
5ecf5e613f50d8 Mina Almasry 2021-11-19 2661 strcpy(buf, "?");
5ecf5e613f50d8 Mina Almasry 2021-11-19 2662 }
5ecf5e613f50d8 Mina Almasry 2021-11-19 2663
5ecf5e613f50d8 Mina Almasry 2021-11-19 2664 css_put(&memcg->css);
5ecf5e613f50d8 Mina Almasry 2021-11-19 2665 if (buf[0] != '\0')
5ecf5e613f50d8 Mina Almasry 2021-11-19 2666 seq_printf(m, ",memcg=%s", buf);
5ecf5e613f50d8 Mina Almasry 2021-11-19 2667
5ecf5e613f50d8 Mina Almasry 2021-11-19 2668 __putname(buf);
5ecf5e613f50d8 Mina Almasry 2021-11-19 2669 }
5ecf5e613f50d8 Mina Almasry 2021-11-19 2670
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all(a)lists.01.org
[-- Attachment #2: config.gz --]
[-- Type: application/gzip, Size: 38300 bytes --]
^ permalink raw reply [flat|nested] 4+ messages in thread* [PATCH v4 0/4] Deterministic charging of shared memory
@ 2021-11-20 4:50 Mina Almasry
[not found] ` <20211120045011.3074840-1-almasrymina-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 4+ messages in thread
From: Mina Almasry @ 2021-11-20 4:50 UTC (permalink / raw)
Cc: Mina Almasry, Jonathan Corbet, Alexander Viro, Andrew Morton,
Johannes Weiner, Michal Hocko, Vladimir Davydov, Hugh Dickins,
Shuah Khan, Shakeel Butt, Greg Thelen, Dave Chinner,
Matthew Wilcox, Roman Gushchin, Theodore Ts'o, linux-kernel,
linux-fsdevel, linux-mm
Problem:
Currently shared memory is charged to the memcg of the allocating
process. This makes memory usage of processes accessing shared memory
a bit unpredictable since whichever process accesses the memory first
will get charged. We have a number of use cases where our userspace
would like deterministic charging of shared memory:
1. System services allocating memory for client jobs:
We have services (namely a network access service[1]) that provide
functionality for clients running on the machine and allocate memory
to carry out these services. The memory usage of these services
depends on the number of jobs running on the machine and the nature of
the requests made to the service, which makes the memory usage of
these services hard to predict and thus hard to limit via memory.max.
These system services would like a way to allocate memory and instruct
the kernel to charge this memory to the client’s memcg.
2. Shared filesystem between subtasks of a large job
Our infrastructure has large meta jobs such as kubernetes which spawn
multiple subtasks which share a tmpfs mount. These jobs and its
subtasks use that tmpfs mount for various purposes such as data
sharing or persistent data between the subtask restarts. In kubernetes
terminology, the meta job is similar to pods and subtasks are
containers under pods. We want the shared memory to be
deterministically charged to the kubernetes's pod and independent to
the lifetime of containers under the pod.
3. Shared libraries and language runtimes shared between independent jobs.
We’d like to optimize memory usage on the machine by sharing libraries
and language runtimes of many of the processes running on our machines
in separate memcgs. This produces a side effect that one job may be
unlucky to be the first to access many of the libraries and may get
oom killed as all the cached files get charged to it.
Design:
My rough proposal to solve this problem is to simply add a
‘memcg=/path/to/memcg’ mount option for filesystems:
directing all the memory of the file system to be ‘remote charged’ to
cgroup provided by that memcg= option.
Caveats:
1. One complication to address is the behavior when the target memcg
hits its memory.max limit because of remote charging. In this case the
oom-killer will be invoked, but the oom-killer may not find anything
to kill in the target memcg being charged. Thera are a number of considerations
in this case:
1. It's not great to kill the allocating process since the allocating process
is not running in the memcg under oom, and killing it will not free memory
in the memcg under oom.
2. Pagefaults may hit the memcg limit, and we need to handle the pagefault
somehow. If not, the process will forever loop the pagefault in the upstream
kernel.
In this case, I propose simply failing the remote charge and returning an ENOSPC
to the caller. This will cause will cause the process executing the remote
charge to get an ENOSPC in non-pagefault paths, and get a SIGBUS on the pagefault
path. This will be documented behavior of remote charging, and this feature is
opt-in. Users can:
- Not opt-into the feature if they want.
- Opt-into the feature and accept the risk of received ENOSPC or SIGBUS and
abort if they desire.
- Gracefully handle any resulting ENOSPC or SIGBUS errors and continue their
operation without executing the remote charge if possible.
2. Only processes allowed the enter cgroup at mount time can mount a
tmpfs with memcg=<cgroup>. This is to prevent intential DoS of random cgroups
on the machine. However, once a filesysetem is mounted with memcg=<cgroup>, any
process with write access to this mount point will be able to charge memory to
<cgroup>. This is largely a non-issue because in configurations where there is
untrusted code running on the machine, mount point access needs to be
restricted to the intended users only regardless of whether the mount point
memory is deterministly charged or not.
[1] https://research.google/pubs/pub48630
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: linux-kernel@vger.kernel.org
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-mm@kvack.org
Mina Almasry (4):
mm: support deterministic memory charging of filesystems
mm/oom: handle remote ooms
mm, shmem: add filesystem memcg= option documentation
mm, shmem, selftests: add tmpfs memcg= mount option tests
Documentation/filesystems/tmpfs.rst | 28 ++++
fs/fs_context.c | 27 ++++
fs/proc_namespace.c | 4 +
fs/super.c | 9 ++
include/linux/fs.h | 5 +
include/linux/fs_context.h | 2 +
include/linux/memcontrol.h | 38 +++++
mm/filemap.c | 2 +-
mm/khugepaged.c | 3 +-
mm/memcontrol.c | 171 ++++++++++++++++++++++
mm/oom_kill.c | 9 ++
mm/shmem.c | 3 +-
tools/testing/selftests/vm/.gitignore | 1 +
tools/testing/selftests/vm/mmap_write.c | 103 +++++++++++++
tools/testing/selftests/vm/tmpfs-memcg.sh | 116 +++++++++++++++
15 files changed, 518 insertions(+), 3 deletions(-)
create mode 100644 tools/testing/selftests/vm/mmap_write.c
create mode 100755 tools/testing/selftests/vm/tmpfs-memcg.sh
--
2.34.0.rc2.393.gf8c9666880-goog
^ permalink raw reply [flat|nested] 4+ messages in thread[parent not found: <20211120045011.3074840-1-almasrymina-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>]
* [PATCH v4 1/4] mm: support deterministic memory charging of filesystems 2021-11-20 4:50 [PATCH v4 0/4] Deterministic charging of shared memory Mina Almasry @ 2021-11-20 4:50 ` Mina Almasry 0 siblings, 0 replies; 4+ messages in thread From: Mina Almasry @ 2021-11-20 4:50 UTC (permalink / raw) To: Alexander Viro, Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov, Hugh Dickins Cc: Mina Almasry, Jonathan Corbet, Shuah Khan, Shakeel Butt, Greg Thelen, Dave Chinner, Matthew Wilcox, Roman Gushchin, Theodore Ts'o, linux-kernel-u79uwXL29TY76Z2rM5mHXA, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA Users can specify a memcg= mount option option at mount time and all data page charges will be charged to the memcg supplied. This is useful to deterministicly charge the memory of the file system or memory shared via tmpfs for example. Implementation notes: - Add memcg= option parsing to fs common code. - We attach the memcg to charge for this filesystem data pages to the struct super_block. The memcg can be changed via a remount operation, and all future memcg charges in this filesystem will be charged to the new memcg. - We create a new interface mem_cgroup_charge_mapping(), which will check if the super_block in the mapping has a memcg to charge. It charges that, and falls back to the mm passed if there is no super_block memcg. - On filesystem data memory allocation paths, we call the new interface mem_cgroup_charge_mapping(). Caveats: - Processes are only allowed to direct filesystem charges to a cgroup that they themselves can enter and allocate memory in. This so that we do not introduce an attack vector where processes can DoS any cgroup in the system that they are not normally allowed to enter and allocate memory in. - In mem_cgroup_charge_mapping() we pay the cost of checking whether the super_block has a memcg to charge, regardless of whether the mount point was mounted with memcg=. This can be alleviated by putting the memcg to charge in the struct address_space, but, this increases the size of that struct and makes it difficult to support remounting the memcg= option, although remounting is of dubious value. - mem_cgroup_charge_mapping() simply returns any error received from the following charge_memcg() or mem_cgroup_charge() calls. There is a follow up patch in this series which closely examines and handles the behavior when hitting the limit of the remote memcg. Signed-off-by: Mina Almasry <almasrymina-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> --- Changes in v4: - Added cover letter and moved list of Cc's there. - Made memcg= option generic to all file systems. - Reverted to calling mem_cgroup_charge_mapping() for generic file system allocation paths, since this feature is not implemented for all filesystems. - Refactored some memcontrol interfaces slightly to reduce the number of "#ifdef CONFIG_MEMCG" needed in other files. Changes in v3: - Fixed build failures/warnings Reported-by: kernel test robot <lkp-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> Changes in v2: - Fixed Roman's email. - Added a new wrapper around charge_memcg() instead of __mem_cgroup_charge() - Merged the permission check into this patch as Roman suggested. - Instead of checking for a s_memcg_to_charge off the superblock in the filemap code, I set_active_memcg() before calling into the fs generic code as Dave suggests. - I have kept the s_memcg_to_charge in the superblock to keep the struct address_space pointer small and preserve the remount use case.. --- fs/fs_context.c | 27 +++++++ fs/proc_namespace.c | 4 ++ fs/super.c | 9 +++ include/linux/fs.h | 5 ++ include/linux/fs_context.h | 2 + include/linux/memcontrol.h | 32 +++++++++ mm/filemap.c | 2 +- mm/khugepaged.c | 3 +- mm/memcontrol.c | 142 +++++++++++++++++++++++++++++++++++++ mm/shmem.c | 3 +- 10 files changed, 226 insertions(+), 3 deletions(-) diff --git a/fs/fs_context.c b/fs/fs_context.c index b7e43a780a625..fe2449d5f1fbf 100644 --- a/fs/fs_context.c +++ b/fs/fs_context.c @@ -23,6 +23,7 @@ #include <asm/sections.h> #include "mount.h" #include "internal.h" +#include <linux/memcontrol.h> enum legacy_fs_param { LEGACY_FS_UNSET_PARAMS, @@ -108,6 +109,28 @@ int vfs_parse_fs_param_source(struct fs_context *fc, struct fs_parameter *param) } EXPORT_SYMBOL(vfs_parse_fs_param_source); +static int parse_param_memcg(struct fs_context *fc, struct fs_parameter *param) +{ + struct mem_cgroup *memcg; + + if (strcmp(param->key, "memcg") != 0) + return -ENOPARAM; + + if (param->type != fs_value_is_string) + return invalf(fc, "Non-string source"); + + if (fc->memcg) + return invalf(fc, "Multiple memcgs specified"); + + memcg = mem_cgroup_get_from_path(param->string); + if (IS_ERR(memcg)) + return invalf(fc, "Bad value for memcg"); + + fc->memcg = memcg; + param->string = NULL; + return 0; +} + /** * vfs_parse_fs_param - Add a single parameter to a superblock config * @fc: The filesystem context to modify @@ -148,6 +171,10 @@ int vfs_parse_fs_param(struct fs_context *fc, struct fs_parameter *param) return ret; } + ret = parse_param_memcg(fc, param); + if (ret != -ENOPARAM) + return ret; + /* If the filesystem doesn't take any arguments, give it the * default handling of source. */ diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c index 392ef5162655b..32e1647dcef43 100644 --- a/fs/proc_namespace.c +++ b/fs/proc_namespace.c @@ -12,6 +12,7 @@ #include <linux/security.h> #include <linux/fs_struct.h> #include <linux/sched/task.h> +#include <linux/memcontrol.h> #include "proc/internal.h" /* only for get_proc_task() in ->open() */ @@ -125,6 +126,9 @@ static int show_vfsmnt(struct seq_file *m, struct vfsmount *mnt) if (err) goto out; show_mnt_opts(m, mnt); + + mem_cgroup_put_name_in_seq(m, sb); + if (sb->s_op->show_options) err = sb->s_op->show_options(m, mnt_path.dentry); seq_puts(m, " 0 0\n"); diff --git a/fs/super.c b/fs/super.c index 3bfc0f8fbd5bc..06c972f80c529 100644 --- a/fs/super.c +++ b/fs/super.c @@ -24,6 +24,7 @@ #include <linux/export.h> #include <linux/slab.h> #include <linux/blkdev.h> +#include <linux/memcontrol.h> #include <linux/mount.h> #include <linux/security.h> #include <linux/writeback.h> /* for the emergency remount stuff */ @@ -180,6 +181,7 @@ static void destroy_unused_super(struct super_block *s) up_write(&s->s_umount); list_lru_destroy(&s->s_dentry_lru); list_lru_destroy(&s->s_inode_lru); + mem_cgroup_set_charge_target(s, NULL); security_sb_free(s); put_user_ns(s->s_user_ns); kfree(s->s_subtype); @@ -292,6 +294,7 @@ static void __put_super(struct super_block *s) WARN_ON(s->s_dentry_lru.node); WARN_ON(s->s_inode_lru.node); WARN_ON(!list_empty(&s->s_mounts)); + mem_cgroup_set_charge_target(s, NULL); security_sb_free(s); fscrypt_sb_free(s); put_user_ns(s->s_user_ns); @@ -904,6 +907,9 @@ int reconfigure_super(struct fs_context *fc) } } + if (fc->memcg) + mem_cgroup_set_charge_target(sb, fc->memcg); + if (fc->ops->reconfigure) { retval = fc->ops->reconfigure(fc); if (retval) { @@ -1528,6 +1534,9 @@ int vfs_get_tree(struct fs_context *fc) return error; } + if (fc->memcg) + mem_cgroup_set_charge_target(sb, fc->memcg); + /* * filesystems should never set s_maxbytes larger than MAX_LFS_FILESIZE * but s_maxbytes was an unsigned long long for many releases. Throw diff --git a/include/linux/fs.h b/include/linux/fs.h index 3afca821df32e..59407b3e7aee3 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1567,6 +1567,11 @@ struct super_block { struct workqueue_struct *s_dio_done_wq; struct hlist_head s_pins; +#ifdef CONFIG_MEMCG + /* memcg to charge for pages allocated to this filesystem */ + struct mem_cgroup *s_memcg_to_charge; +#endif + /* * Owning user namespace and default context in which to * interpret filesystem uids, gids, quotas, device nodes, diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h index 6b54982fc5f37..8e2cc1e554fa1 100644 --- a/include/linux/fs_context.h +++ b/include/linux/fs_context.h @@ -25,6 +25,7 @@ struct super_block; struct user_namespace; struct vfsmount; struct path; +struct mem_cgroup; enum fs_context_purpose { FS_CONTEXT_FOR_MOUNT, /* New superblock for explicit mount */ @@ -110,6 +111,7 @@ struct fs_context { bool need_free:1; /* Need to call ops->free() */ bool global:1; /* Goes into &init_user_ns */ bool oldapi:1; /* Coming from mount(2) */ + struct mem_cgroup *memcg; /* memcg to charge */ }; struct fs_context_operations { diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 0c5c403f4be6b..0a9b0bba5f3c8 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -27,6 +27,7 @@ struct obj_cgroup; struct page; struct mm_struct; struct kmem_cache; +struct super_block; /* Cgroup-specific page state, on top of universal node page state */ enum memcg_stat_item { @@ -923,6 +924,15 @@ static inline bool mem_cgroup_online(struct mem_cgroup *memcg) return !!(memcg->css.flags & CSS_ONLINE); } +void mem_cgroup_set_charge_target(struct super_block *sb, + struct mem_cgroup *memcg); + +int mem_cgroup_charge_mapping(struct folio *folio, struct mm_struct *mm, + gfp_t gfp, struct address_space *mapping); + +struct mem_cgroup *mem_cgroup_get_from_path(const char *path); +void mem_cgroup_put_name_in_seq(struct seq_file *seq, struct super_block *sb); + void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru, int zid, int nr_pages); @@ -1223,6 +1233,28 @@ static inline int mem_cgroup_charge(struct folio *folio, return 0; } +static inline void mem_cgroup_set_charge_target(struct super_block *sb, + struct mem_cgroup *memcg) +{ +} + +static inline int mem_cgroup_charge_mapping(struct folio *folio, + struct mm_struct *mm, gfp_t gfp, + struct address_space *mapping) +{ + return 0; +} + +static inline struct mem_cgroup *mem_cgroup_get_from_path(const char *path) +{ + return NULL; +} + +static inline void mem_cgroup_put_name_in_seq(struct seq_file *seq, + struct super_block *sb) +{ +} + static inline int mem_cgroup_swapin_charge_page(struct page *page, struct mm_struct *mm, gfp_t gfp, swp_entry_t entry) { diff --git a/mm/filemap.c b/mm/filemap.c index 6844c9816a864..3825cf12bc345 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -903,7 +903,7 @@ noinline int __filemap_add_folio(struct address_space *mapping, folio->index = index; if (!huge) { - error = mem_cgroup_charge(folio, NULL, gfp); + error = mem_cgroup_charge_mapping(folio, NULL, gfp, mapping); VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio); if (error) goto error; diff --git a/mm/khugepaged.c b/mm/khugepaged.c index e99101162f1ab..8468a3ad446b9 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1661,7 +1661,8 @@ static void collapse_file(struct mm_struct *mm, goto out; } - if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) { + if (unlikely(mem_cgroup_charge_mapping(page_folio(new_page), mm, gfp, + mapping))) { result = SCAN_CGROUP_CHARGE_FAIL; goto out; } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 781605e920153..c4ba7f364c214 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -62,6 +62,7 @@ #include <linux/tracehook.h> #include <linux/psi.h> #include <linux/seq_buf.h> +#include <linux/string.h> #include "internal.h" #include <net/sock.h> #include <net/ip.h> @@ -2580,6 +2581,129 @@ void mem_cgroup_handle_over_high(void) css_put(&memcg->css); } +/* + * Non error return value must eventually be released with css_put(). + */ +struct mem_cgroup *mem_cgroup_get_from_path(const char *path) +{ + static const char procs_filename[] = "/cgroup.procs"; + struct file *file, *procs; + struct cgroup_subsys_state *css; + struct mem_cgroup *memcg; + char *procs_path = + kmalloc(strlen(path) + sizeof(procs_filename), GFP_KERNEL); + + if (procs_path == NULL) + return ERR_PTR(-ENOMEM); + strcpy(procs_path, path); + strcat(procs_path, procs_filename); + + procs = filp_open(procs_path, O_WRONLY, 0); + kfree(procs_path); + + /* + * Restrict the capability for tasks to mount with memcg charging to the + * cgroup they could not join. For example, disallow: + * + * mount -t tmpfs -o memcg=root-cgroup nodev <MOUNT_DIR> + * + * if it is a non-root task. + */ + if (IS_ERR(procs)) + return (struct mem_cgroup *)procs; + fput(procs); + + file = filp_open(path, O_DIRECTORY | O_RDONLY, 0); + if (IS_ERR(file)) + return (struct mem_cgroup *)file; + + css = css_tryget_online_from_dir(file->f_path.dentry, + &memory_cgrp_subsys); + if (IS_ERR(css)) + memcg = (struct mem_cgroup *)css; + else + memcg = container_of(css, struct mem_cgroup, css); + + fput(file); + return memcg; +} + +void mem_cgroup_put_name_in_seq(struct seq_file *m, struct super_block *sb) +{ + struct mem_cgroup *memcg; + int ret = 0; + char *buf = __getname(); + int len = PATH_MAX; + + if (!buf) + return; + + buf[0] = '\0'; + + rcu_read_lock(); + memcg = rcu_dereference(sb->s_memcg_to_charge); + if (memcg && !css_tryget_online(&memcg->css)) + memcg = NULL; + rcu_read_unlock(); + + if (!memcg) + return; + + ret = cgroup_path(memcg->css.cgroup, buf + len / 2, len / 2); + if (ret >= len / 2) + strcpy(buf, "?"); + else { + char *p = mangle_path(buf, buf + len / 2, " \t\n\\"); + + if (p) + *p = '\0'; + else + strcpy(buf, "?"); + } + + css_put(&memcg->css); + if (buf[0] != '\0') + seq_printf(m, ",memcg=%s", buf); + + __putname(buf); +} + +/* + * Set or clear (if @memcg is NULL) charge association from file system to + * memcg. If @memcg != NULL, then a css reference must be held by the caller to + * ensure that the cgroup is not deleted during this operation, this reference + * is dropped after this operation. + */ +void mem_cgroup_set_charge_target(struct super_block *sb, + struct mem_cgroup *memcg) +{ + memcg = xchg(&sb->s_memcg_to_charge, memcg); + if (memcg) + css_put(&memcg->css); +} + +/* + * Returns the memcg to charge for inode pages. If non-NULL is returned, caller + * must drop reference with css_put(). NULL indicates that the inode does not + * have a memcg to charge, so the default process based policy should be used. + */ +static struct mem_cgroup * +mem_cgroup_mapping_get_charge_target(struct address_space *mapping) +{ + struct mem_cgroup *memcg; + + if (!mapping) + return NULL; + + rcu_read_lock(); + memcg = rcu_dereference(mapping->host->i_sb->s_memcg_to_charge); + if (memcg && !css_tryget_online(&memcg->css)) + memcg = NULL; + rcu_read_unlock(); + + return memcg; +} + static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, unsigned int nr_pages) { @@ -6678,6 +6802,24 @@ static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg, return ret; } +int mem_cgroup_charge_mapping(struct folio *folio, struct mm_struct *mm, + gfp_t gfp, struct address_space *mapping) +{ + struct mem_cgroup *mapping_memcg; + int ret = 0; + if (mem_cgroup_disabled()) + return 0; + + mapping_memcg = mem_cgroup_mapping_get_charge_target(mapping); + if (mapping_memcg) { + ret = charge_memcg(folio, mapping_memcg, gfp); + css_put(&mapping_memcg->css); + return ret; + } + + return mem_cgroup_charge(folio, mm, gfp); +} + int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp) { struct mem_cgroup *memcg; diff --git a/mm/shmem.c b/mm/shmem.c index 23c91a8beb781..e469da13a1b8a 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -709,7 +709,8 @@ static int shmem_add_to_page_cache(struct page *page, page->index = index; if (!PageSwapCache(page)) { - error = mem_cgroup_charge(page_folio(page), charge_mm, gfp); + error = mem_cgroup_charge_mapping(page_folio(page), charge_mm, + gfp, mapping); if (error) { if (PageTransHuge(page)) { count_vm_event(THP_FILE_FALLBACK); ^ permalink raw reply related [flat|nested] 4+ messages in thread
* [PATCH v4 1/4] mm: support deterministic memory charging of filesystems @ 2021-11-20 4:50 ` Mina Almasry 0 siblings, 0 replies; 4+ messages in thread From: Mina Almasry @ 2021-11-20 4:50 UTC (permalink / raw) To: Alexander Viro, Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov, Hugh Dickins Cc: Mina Almasry, Jonathan Corbet, Shuah Khan, Shakeel Butt, Greg Thelen, Dave Chinner, Matthew Wilcox, Roman Gushchin, Theodore Ts'o, linux-kernel, linux-fsdevel, linux-mm, cgroups Users can specify a memcg= mount option option at mount time and all data page charges will be charged to the memcg supplied. This is useful to deterministicly charge the memory of the file system or memory shared via tmpfs for example. Implementation notes: - Add memcg= option parsing to fs common code. - We attach the memcg to charge for this filesystem data pages to the struct super_block. The memcg can be changed via a remount operation, and all future memcg charges in this filesystem will be charged to the new memcg. - We create a new interface mem_cgroup_charge_mapping(), which will check if the super_block in the mapping has a memcg to charge. It charges that, and falls back to the mm passed if there is no super_block memcg. - On filesystem data memory allocation paths, we call the new interface mem_cgroup_charge_mapping(). Caveats: - Processes are only allowed to direct filesystem charges to a cgroup that they themselves can enter and allocate memory in. This so that we do not introduce an attack vector where processes can DoS any cgroup in the system that they are not normally allowed to enter and allocate memory in. - In mem_cgroup_charge_mapping() we pay the cost of checking whether the super_block has a memcg to charge, regardless of whether the mount point was mounted with memcg=. This can be alleviated by putting the memcg to charge in the struct address_space, but, this increases the size of that struct and makes it difficult to support remounting the memcg= option, although remounting is of dubious value. - mem_cgroup_charge_mapping() simply returns any error received from the following charge_memcg() or mem_cgroup_charge() calls. There is a follow up patch in this series which closely examines and handles the behavior when hitting the limit of the remote memcg. Signed-off-by: Mina Almasry <almasrymina@google.com> --- Changes in v4: - Added cover letter and moved list of Cc's there. - Made memcg= option generic to all file systems. - Reverted to calling mem_cgroup_charge_mapping() for generic file system allocation paths, since this feature is not implemented for all filesystems. - Refactored some memcontrol interfaces slightly to reduce the number of "#ifdef CONFIG_MEMCG" needed in other files. Changes in v3: - Fixed build failures/warnings Reported-by: kernel test robot <lkp@intel.com> Changes in v2: - Fixed Roman's email. - Added a new wrapper around charge_memcg() instead of __mem_cgroup_charge() - Merged the permission check into this patch as Roman suggested. - Instead of checking for a s_memcg_to_charge off the superblock in the filemap code, I set_active_memcg() before calling into the fs generic code as Dave suggests. - I have kept the s_memcg_to_charge in the superblock to keep the struct address_space pointer small and preserve the remount use case.. --- fs/fs_context.c | 27 +++++++ fs/proc_namespace.c | 4 ++ fs/super.c | 9 +++ include/linux/fs.h | 5 ++ include/linux/fs_context.h | 2 + include/linux/memcontrol.h | 32 +++++++++ mm/filemap.c | 2 +- mm/khugepaged.c | 3 +- mm/memcontrol.c | 142 +++++++++++++++++++++++++++++++++++++ mm/shmem.c | 3 +- 10 files changed, 226 insertions(+), 3 deletions(-) diff --git a/fs/fs_context.c b/fs/fs_context.c index b7e43a780a625..fe2449d5f1fbf 100644 --- a/fs/fs_context.c +++ b/fs/fs_context.c @@ -23,6 +23,7 @@ #include <asm/sections.h> #include "mount.h" #include "internal.h" +#include <linux/memcontrol.h> enum legacy_fs_param { LEGACY_FS_UNSET_PARAMS, @@ -108,6 +109,28 @@ int vfs_parse_fs_param_source(struct fs_context *fc, struct fs_parameter *param) } EXPORT_SYMBOL(vfs_parse_fs_param_source); +static int parse_param_memcg(struct fs_context *fc, struct fs_parameter *param) +{ + struct mem_cgroup *memcg; + + if (strcmp(param->key, "memcg") != 0) + return -ENOPARAM; + + if (param->type != fs_value_is_string) + return invalf(fc, "Non-string source"); + + if (fc->memcg) + return invalf(fc, "Multiple memcgs specified"); + + memcg = mem_cgroup_get_from_path(param->string); + if (IS_ERR(memcg)) + return invalf(fc, "Bad value for memcg"); + + fc->memcg = memcg; + param->string = NULL; + return 0; +} + /** * vfs_parse_fs_param - Add a single parameter to a superblock config * @fc: The filesystem context to modify @@ -148,6 +171,10 @@ int vfs_parse_fs_param(struct fs_context *fc, struct fs_parameter *param) return ret; } + ret = parse_param_memcg(fc, param); + if (ret != -ENOPARAM) + return ret; + /* If the filesystem doesn't take any arguments, give it the * default handling of source. */ diff --git a/fs/proc_namespace.c b/fs/proc_namespace.c index 392ef5162655b..32e1647dcef43 100644 --- a/fs/proc_namespace.c +++ b/fs/proc_namespace.c @@ -12,6 +12,7 @@ #include <linux/security.h> #include <linux/fs_struct.h> #include <linux/sched/task.h> +#include <linux/memcontrol.h> #include "proc/internal.h" /* only for get_proc_task() in ->open() */ @@ -125,6 +126,9 @@ static int show_vfsmnt(struct seq_file *m, struct vfsmount *mnt) if (err) goto out; show_mnt_opts(m, mnt); + + mem_cgroup_put_name_in_seq(m, sb); + if (sb->s_op->show_options) err = sb->s_op->show_options(m, mnt_path.dentry); seq_puts(m, " 0 0\n"); diff --git a/fs/super.c b/fs/super.c index 3bfc0f8fbd5bc..06c972f80c529 100644 --- a/fs/super.c +++ b/fs/super.c @@ -24,6 +24,7 @@ #include <linux/export.h> #include <linux/slab.h> #include <linux/blkdev.h> +#include <linux/memcontrol.h> #include <linux/mount.h> #include <linux/security.h> #include <linux/writeback.h> /* for the emergency remount stuff */ @@ -180,6 +181,7 @@ static void destroy_unused_super(struct super_block *s) up_write(&s->s_umount); list_lru_destroy(&s->s_dentry_lru); list_lru_destroy(&s->s_inode_lru); + mem_cgroup_set_charge_target(s, NULL); security_sb_free(s); put_user_ns(s->s_user_ns); kfree(s->s_subtype); @@ -292,6 +294,7 @@ static void __put_super(struct super_block *s) WARN_ON(s->s_dentry_lru.node); WARN_ON(s->s_inode_lru.node); WARN_ON(!list_empty(&s->s_mounts)); + mem_cgroup_set_charge_target(s, NULL); security_sb_free(s); fscrypt_sb_free(s); put_user_ns(s->s_user_ns); @@ -904,6 +907,9 @@ int reconfigure_super(struct fs_context *fc) } } + if (fc->memcg) + mem_cgroup_set_charge_target(sb, fc->memcg); + if (fc->ops->reconfigure) { retval = fc->ops->reconfigure(fc); if (retval) { @@ -1528,6 +1534,9 @@ int vfs_get_tree(struct fs_context *fc) return error; } + if (fc->memcg) + mem_cgroup_set_charge_target(sb, fc->memcg); + /* * filesystems should never set s_maxbytes larger than MAX_LFS_FILESIZE * but s_maxbytes was an unsigned long long for many releases. Throw diff --git a/include/linux/fs.h b/include/linux/fs.h index 3afca821df32e..59407b3e7aee3 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -1567,6 +1567,11 @@ struct super_block { struct workqueue_struct *s_dio_done_wq; struct hlist_head s_pins; +#ifdef CONFIG_MEMCG + /* memcg to charge for pages allocated to this filesystem */ + struct mem_cgroup *s_memcg_to_charge; +#endif + /* * Owning user namespace and default context in which to * interpret filesystem uids, gids, quotas, device nodes, diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h index 6b54982fc5f37..8e2cc1e554fa1 100644 --- a/include/linux/fs_context.h +++ b/include/linux/fs_context.h @@ -25,6 +25,7 @@ struct super_block; struct user_namespace; struct vfsmount; struct path; +struct mem_cgroup; enum fs_context_purpose { FS_CONTEXT_FOR_MOUNT, /* New superblock for explicit mount */ @@ -110,6 +111,7 @@ struct fs_context { bool need_free:1; /* Need to call ops->free() */ bool global:1; /* Goes into &init_user_ns */ bool oldapi:1; /* Coming from mount(2) */ + struct mem_cgroup *memcg; /* memcg to charge */ }; struct fs_context_operations { diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 0c5c403f4be6b..0a9b0bba5f3c8 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -27,6 +27,7 @@ struct obj_cgroup; struct page; struct mm_struct; struct kmem_cache; +struct super_block; /* Cgroup-specific page state, on top of universal node page state */ enum memcg_stat_item { @@ -923,6 +924,15 @@ static inline bool mem_cgroup_online(struct mem_cgroup *memcg) return !!(memcg->css.flags & CSS_ONLINE); } +void mem_cgroup_set_charge_target(struct super_block *sb, + struct mem_cgroup *memcg); + +int mem_cgroup_charge_mapping(struct folio *folio, struct mm_struct *mm, + gfp_t gfp, struct address_space *mapping); + +struct mem_cgroup *mem_cgroup_get_from_path(const char *path); +void mem_cgroup_put_name_in_seq(struct seq_file *seq, struct super_block *sb); + void mem_cgroup_update_lru_size(struct lruvec *lruvec, enum lru_list lru, int zid, int nr_pages); @@ -1223,6 +1233,28 @@ static inline int mem_cgroup_charge(struct folio *folio, return 0; } +static inline void mem_cgroup_set_charge_target(struct super_block *sb, + struct mem_cgroup *memcg) +{ +} + +static inline int mem_cgroup_charge_mapping(struct folio *folio, + struct mm_struct *mm, gfp_t gfp, + struct address_space *mapping) +{ + return 0; +} + +static inline struct mem_cgroup *mem_cgroup_get_from_path(const char *path) +{ + return NULL; +} + +static inline void mem_cgroup_put_name_in_seq(struct seq_file *seq, + struct super_block *sb) +{ +} + static inline int mem_cgroup_swapin_charge_page(struct page *page, struct mm_struct *mm, gfp_t gfp, swp_entry_t entry) { diff --git a/mm/filemap.c b/mm/filemap.c index 6844c9816a864..3825cf12bc345 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -903,7 +903,7 @@ noinline int __filemap_add_folio(struct address_space *mapping, folio->index = index; if (!huge) { - error = mem_cgroup_charge(folio, NULL, gfp); + error = mem_cgroup_charge_mapping(folio, NULL, gfp, mapping); VM_BUG_ON_FOLIO(index & (folio_nr_pages(folio) - 1), folio); if (error) goto error; diff --git a/mm/khugepaged.c b/mm/khugepaged.c index e99101162f1ab..8468a3ad446b9 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1661,7 +1661,8 @@ static void collapse_file(struct mm_struct *mm, goto out; } - if (unlikely(mem_cgroup_charge(page_folio(new_page), mm, gfp))) { + if (unlikely(mem_cgroup_charge_mapping(page_folio(new_page), mm, gfp, + mapping))) { result = SCAN_CGROUP_CHARGE_FAIL; goto out; } diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 781605e920153..c4ba7f364c214 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -62,6 +62,7 @@ #include <linux/tracehook.h> #include <linux/psi.h> #include <linux/seq_buf.h> +#include <linux/string.h> #include "internal.h" #include <net/sock.h> #include <net/ip.h> @@ -2580,6 +2581,129 @@ void mem_cgroup_handle_over_high(void) css_put(&memcg->css); } +/* + * Non error return value must eventually be released with css_put(). + */ +struct mem_cgroup *mem_cgroup_get_from_path(const char *path) +{ + static const char procs_filename[] = "/cgroup.procs"; + struct file *file, *procs; + struct cgroup_subsys_state *css; + struct mem_cgroup *memcg; + char *procs_path = + kmalloc(strlen(path) + sizeof(procs_filename), GFP_KERNEL); + + if (procs_path == NULL) + return ERR_PTR(-ENOMEM); + strcpy(procs_path, path); + strcat(procs_path, procs_filename); + + procs = filp_open(procs_path, O_WRONLY, 0); + kfree(procs_path); + + /* + * Restrict the capability for tasks to mount with memcg charging to the + * cgroup they could not join. For example, disallow: + * + * mount -t tmpfs -o memcg=root-cgroup nodev <MOUNT_DIR> + * + * if it is a non-root task. + */ + if (IS_ERR(procs)) + return (struct mem_cgroup *)procs; + fput(procs); + + file = filp_open(path, O_DIRECTORY | O_RDONLY, 0); + if (IS_ERR(file)) + return (struct mem_cgroup *)file; + + css = css_tryget_online_from_dir(file->f_path.dentry, + &memory_cgrp_subsys); + if (IS_ERR(css)) + memcg = (struct mem_cgroup *)css; + else + memcg = container_of(css, struct mem_cgroup, css); + + fput(file); + return memcg; +} + +void mem_cgroup_put_name_in_seq(struct seq_file *m, struct super_block *sb) +{ + struct mem_cgroup *memcg; + int ret = 0; + char *buf = __getname(); + int len = PATH_MAX; + + if (!buf) + return; + + buf[0] = '\0'; + + rcu_read_lock(); + memcg = rcu_dereference(sb->s_memcg_to_charge); + if (memcg && !css_tryget_online(&memcg->css)) + memcg = NULL; + rcu_read_unlock(); + + if (!memcg) + return; + + ret = cgroup_path(memcg->css.cgroup, buf + len / 2, len / 2); + if (ret >= len / 2) + strcpy(buf, "?"); + else { + char *p = mangle_path(buf, buf + len / 2, " \t\n\\"); + + if (p) + *p = '\0'; + else + strcpy(buf, "?"); + } + + css_put(&memcg->css); + if (buf[0] != '\0') + seq_printf(m, ",memcg=%s", buf); + + __putname(buf); +} + +/* + * Set or clear (if @memcg is NULL) charge association from file system to + * memcg. If @memcg != NULL, then a css reference must be held by the caller to + * ensure that the cgroup is not deleted during this operation, this reference + * is dropped after this operation. + */ +void mem_cgroup_set_charge_target(struct super_block *sb, + struct mem_cgroup *memcg) +{ + memcg = xchg(&sb->s_memcg_to_charge, memcg); + if (memcg) + css_put(&memcg->css); +} + +/* + * Returns the memcg to charge for inode pages. If non-NULL is returned, caller + * must drop reference with css_put(). NULL indicates that the inode does not + * have a memcg to charge, so the default process based policy should be used. + */ +static struct mem_cgroup * +mem_cgroup_mapping_get_charge_target(struct address_space *mapping) +{ + struct mem_cgroup *memcg; + + if (!mapping) + return NULL; + + rcu_read_lock(); + memcg = rcu_dereference(mapping->host->i_sb->s_memcg_to_charge); + if (memcg && !css_tryget_online(&memcg->css)) + memcg = NULL; + rcu_read_unlock(); + + return memcg; +} + static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, unsigned int nr_pages) { @@ -6678,6 +6802,24 @@ static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg, return ret; } +int mem_cgroup_charge_mapping(struct folio *folio, struct mm_struct *mm, + gfp_t gfp, struct address_space *mapping) +{ + struct mem_cgroup *mapping_memcg; + int ret = 0; + if (mem_cgroup_disabled()) + return 0; + + mapping_memcg = mem_cgroup_mapping_get_charge_target(mapping); + if (mapping_memcg) { + ret = charge_memcg(folio, mapping_memcg, gfp); + css_put(&mapping_memcg->css); + return ret; + } + + return mem_cgroup_charge(folio, mm, gfp); +} + int __mem_cgroup_charge(struct folio *folio, struct mm_struct *mm, gfp_t gfp) { struct mem_cgroup *memcg; diff --git a/mm/shmem.c b/mm/shmem.c index 23c91a8beb781..e469da13a1b8a 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -709,7 +709,8 @@ static int shmem_add_to_page_cache(struct page *page, page->index = index; if (!PageSwapCache(page)) { - error = mem_cgroup_charge(page_folio(page), charge_mm, gfp); + error = mem_cgroup_charge_mapping(page_folio(page), charge_mm, + gfp, mapping); if (error) { if (PageTransHuge(page)) { count_vm_event(THP_FILE_FALLBACK); -- 2.34.0.rc2.393.gf8c9666880-goog ^ permalink raw reply related [flat|nested] 4+ messages in thread
* Re: [PATCH v4 1/4] mm: support deterministic memory charging of filesystems 2021-11-20 4:50 ` Mina Almasry (?) @ 2021-11-20 7:53 ` Shakeel Butt -1 siblings, 0 replies; 4+ messages in thread From: Shakeel Butt @ 2021-11-20 7:53 UTC (permalink / raw) To: Mina Almasry Cc: Alexander Viro, Andrew Morton, Johannes Weiner, Michal Hocko, Vladimir Davydov, Hugh Dickins, Jonathan Corbet, Shuah Khan, Greg Thelen, Dave Chinner, Matthew Wilcox, Roman Gushchin, Theodore Ts'o, linux-kernel, linux-fsdevel, linux-mm, cgroups On Fri, Nov 19, 2021 at 8:50 PM Mina Almasry <almasrymina@google.com> wrote: > > Users can specify a memcg= mount option option at mount time and all *option > data page charges will be charged to the memcg supplied. This is useful > to deterministicly charge the memory of the file system or memory shared *deterministically > via tmpfs for example. > > Implementation notes: > - Add memcg= option parsing to fs common code. > - We attach the memcg to charge for this filesystem data pages to the > struct super_block. The memcg can be changed via a remount operation, > and all future memcg charges in this filesystem will be charged to > the new memcg. > - We create a new interface mem_cgroup_charge_mapping(), which will > check if the super_block in the mapping has a memcg to charge. It > charges that, and falls back to the mm passed if there is no > super_block memcg. > - On filesystem data memory allocation paths, we call the new interface > mem_cgroup_charge_mapping(). > > Caveats: > - Processes are only allowed to direct filesystem charges to a cgroup that I don't think 'Processes' is the right terminology here. The admin/user doing the mount operation with memcg option should have access to move processes into the target memcg. > they themselves can enter and allocate memory in. This so that we do not > introduce an attack vector where processes can DoS any cgroup in the > system that they are not normally allowed to enter and allocate memory in. > - In mem_cgroup_charge_mapping() we pay the cost of checking whether the > super_block has a memcg to charge, regardless of whether the mount > point was mounted with memcg=. This can be alleviated by putting the > memcg to charge in the struct address_space, but, this increases the > size of that struct and makes it difficult to support remounting the > memcg= option, although remounting is of dubious value. We can start simple with no remount option or maybe follow the memcg v2 semantics of process migrating from one memcg to another. The older memory of the process will remain charged to the older memcg and after migration, the new memory will be charged to the newer memcg. Similarly the older inode/mappings will still be linked to the older memcg and after remount the newer mappings will be linked with newer memcg. You would need to find out if each mapping should hold a memcg reference. [...] > > +static int parse_param_memcg(struct fs_context *fc, struct fs_parameter *param) > +{ > + struct mem_cgroup *memcg; > + > + if (strcmp(param->key, "memcg") != 0) > + return -ENOPARAM; > + > + if (param->type != fs_value_is_string) > + return invalf(fc, "Non-string source"); > + > + if (fc->memcg) > + return invalf(fc, "Multiple memcgs specified"); > + > + memcg = mem_cgroup_get_from_path(param->string); You need to drop this reference in put_fs_context() and give the super_block its own memcg reference. > + if (IS_ERR(memcg)) > + return invalf(fc, "Bad value for memcg"); > + > + fc->memcg = memcg; > + param->string = NULL; > + return 0; > +} > + > /** > * vfs_parse_fs_param - Add a single parameter to a superblock config > * @fc: The filesystem context to modify > @@ -148,6 +171,10 @@ int vfs_parse_fs_param(struct fs_context *fc, struct fs_parameter *param) > return ret; > } > > + ret = parse_param_memcg(fc, param); You might need to call this before fs specific handling (i.e. fc->ops->parse_param). > + if (ret != -ENOPARAM) > + return ret; > + > /* If the filesystem doesn't take any arguments, give it the > * default handling of source. > */ [...] > + > +void mem_cgroup_put_name_in_seq(struct seq_file *m, struct super_block *sb) > +{ > + struct mem_cgroup *memcg; > + int ret = 0; > + char *buf = __getname(); > + int len = PATH_MAX; > + > + if (!buf) > + return; > + > + buf[0] = '\0'; > + > + rcu_read_lock(); > + memcg = rcu_dereference(sb->s_memcg_to_charge); > + if (memcg && !css_tryget_online(&memcg->css)) No need to get an explicit reference. You can call cgroup_path within rcu. > + memcg = NULL; > + rcu_read_unlock(); > + > + if (!memcg) > + return; > + > + ret = cgroup_path(memcg->css.cgroup, buf + len / 2, len / 2); > + if (ret >= len / 2) > + strcpy(buf, "?"); > + else { > + char *p = mangle_path(buf, buf + len / 2, " \t\n\\"); > + > + if (p) > + *p = '\0'; > + else > + strcpy(buf, "?"); > + } > + > + css_put(&memcg->css); > + if (buf[0] != '\0') > + seq_printf(m, ",memcg=%s", buf); > + > + __putname(buf); > +} > + > +/* > + * Set or clear (if @memcg is NULL) charge association from file system to > + * memcg. If @memcg != NULL, then a css reference must be held by the caller to > + * ensure that the cgroup is not deleted during this operation, this reference > + * is dropped after this operation. The given reference is not really dropped after this operation but rather the reference is being stolen here. > + */ > +void mem_cgroup_set_charge_target(struct super_block *sb, > + struct mem_cgroup *memcg) > +{ > + memcg = xchg(&sb->s_memcg_to_charge, memcg); Don't borrow the reference, get a new one for sb. > + if (memcg) > + css_put(&memcg->css); > +} > + > +/* > + * Returns the memcg to charge for inode pages. If non-NULL is returned, caller > + * must drop reference with css_put(). NULL indicates that the inode does not > + * have a memcg to charge, * or the attached memcg is offline. > so the default process based policy should be used. > + */ > +static struct mem_cgroup * > +mem_cgroup_mapping_get_charge_target(struct address_space *mapping) > +{ > + struct mem_cgroup *memcg; > + > + if (!mapping) > + return NULL; > + > + rcu_read_lock(); > + memcg = rcu_dereference(mapping->host->i_sb->s_memcg_to_charge); > + if (memcg && !css_tryget_online(&memcg->css)) > + memcg = NULL; > + rcu_read_unlock(); > + > + return memcg; > +} > + ^ permalink raw reply [flat|nested] 4+ messages in thread
end of thread, other threads:[~2021-11-22 20:43 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2021-11-22 20:43 [PATCH v4 1/4] mm: support deterministic memory charging of filesystems kernel test robot
-- strict thread matches above, loose matches on Subject: below --
2021-11-20 4:50 [PATCH v4 0/4] Deterministic charging of shared memory Mina Almasry
[not found] ` <20211120045011.3074840-1-almasrymina-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
2021-11-20 4:50 ` [PATCH v4 1/4] mm: support deterministic memory charging of filesystems Mina Almasry
2021-11-20 4:50 ` Mina Almasry
2021-11-20 7:53 ` Shakeel Butt
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.