* Re: [PATCH 1/1] cgroup-v1: Grant CAP_SYS_NICE holders permission to move tasks between cgroups
From: Tejun Heo @ 2021-06-17 11:41 UTC (permalink / raw)
To: Lee Jones
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Zefan Li, Johannes Weiner,
cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20210617090941.340135-1-lee.jones-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>
Hello,
On Thu, Jun 17, 2021 at 10:09:41AM +0100, Lee Jones wrote:
> It should be possible for processes with CAP_SYS_NICE capabilities
> (privileges) to move lower priority tasks within the same namespace to
> different cgroups.
I'm not sure that "should" is justified that easily given that cgroup can
affect things like device access permissions and basic system organization.
> One extremely common example of this is Android's 'system_server',
> which moves processes around to different cgroups/cpusets, but should
> not require any other root privileges.
Why is this being brought up now after all the years? Isn't android moving
onto cgroup2 anyway?
Thanks.
--
tejun
^ permalink raw reply
* [PATCH 1/1] cgroup-v1: Grant CAP_SYS_NICE holders permission to move tasks between cgroups
From: Lee Jones @ 2021-06-17 9:09 UTC (permalink / raw)
To: lee.jones-QSEj5FYQhm4dnm+yROfE0A
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, Tejun Heo, Zefan Li,
Johannes Weiner, cgroups-u79uwXL29TY76Z2rM5mHXA
It should be possible for processes with CAP_SYS_NICE capabilities
(privileges) to move lower priority tasks within the same namespace to
different cgroups.
One extremely common example of this is Android's 'system_server',
which moves processes around to different cgroups/cpusets, but should
not require any other root privileges.
Cc: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Zefan Li <lizefan.x-EC8Uxl6Npydl57MIdRCFDg@public.gmane.org>
Cc: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
Cc: cgroups-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Signed-off-by: Lee Jones <lee.jones-QSEj5FYQhm4dnm+yROfE0A@public.gmane.org>
---
kernel/cgroup/cgroup-v1.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 1f274d7fc934e..56d0d91951f02 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -510,7 +510,8 @@ static ssize_t __cgroup1_procs_write(struct kernfs_open_file *of,
tcred = get_task_cred(task);
if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) &&
!uid_eq(cred->euid, tcred->uid) &&
- !uid_eq(cred->euid, tcred->suid))
+ !uid_eq(cred->euid, tcred->suid) &&
+ !ns_capable(tcred->user_ns, CAP_SYS_NICE))
ret = -EACCES;
put_cred(tcred);
if (ret)
--
2.32.0
^ permalink raw reply related
* Re: [PATCH 2/5] cgroup/cpuset: Add new cpus.partition type with no load balancing
From: Waiman Long @ 2021-06-17 2:57 UTC (permalink / raw)
To: Tejun Heo
Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-doc-u79uwXL29TY76Z2rM5mHXA,
linux-kselftest-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli
In-Reply-To: <YMpjbCWpSDIz4bHt-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org>
On 6/16/21 4:47 PM, Tejun Heo wrote:
> Hello,
>
> Generally looks fine to me.
>
> On Thu, Jun 03, 2021 at 05:24:13PM -0400, Waiman Long wrote:
>> @@ -1984,12 +1987,31 @@ static int update_prstate(struct cpuset *cs, int val)
>> goto out;
>>
>> err = update_parent_subparts_cpumask(cs, partcmd_enable,
>> - NULL, &tmp);
>> + NULL, &tmpmask);
>> +
>> if (err) {
>> update_flag(CS_CPU_EXCLUSIVE, cs, 0);
>> goto out;
>> + } else if (new_prs == PRS_ENABLED_NOLB) {
>> + /*
>> + * Disable the load balance flag should not return an
> ^ing
>
> and "else if" after "if (err) goto out" block is weird. The two conditions
> don't need to be tied together.
Yes, the else part is redundant in this case. Will remove it.
>
>> @@ -2518,6 +2547,9 @@ static int sched_partition_show(struct seq_file *seq, void *v)
>> case PRS_ENABLED:
>> seq_puts(seq, "root\n");
>> break;
>> + case PRS_ENABLED_NOLB:
>> + seq_puts(seq, "root-nolb\n");
>> + break;
>> case PRS_DISABLED:
>> seq_puts(seq, "member\n");
>> break;
>> @@ -2544,6 +2576,8 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
>> val = PRS_ENABLED;
>> else if (!strcmp(buf, "member"))
>> val = PRS_DISABLED;
>> + else if (!strcmp(buf, "root-nolb"))
>> + val = PRS_ENABLED_NOLB;
>> else
>> return -EINVAL;
> I wonder whether there's a better name than "root-nolb" because nolb isn't
> the most readable and we are using space as the delimiter for other names.
> Would something like "isolated" work?
Right. "isolated" is a better name and it corresponds better with the
isolcpus kernel command line option. Will change the name.
Thanks,
Longman
^ permalink raw reply
* Re: [PATCH 1/5] cgroup/cpuset: Don't call validate_change() for some flag changes
From: Waiman Long @ 2021-06-17 2:53 UTC (permalink / raw)
To: Tejun Heo
Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, cgroups,
linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli
In-Reply-To: <YMphhLAzmRRyD+cm@slm.duckdns.org>
On 6/16/21 4:39 PM, Tejun Heo wrote:
> Hello,
>
> On Thu, Jun 03, 2021 at 05:24:12PM -0400, Waiman Long wrote:
>> The update_flag() is called with one flag bit change and without change
>> in the various cpumasks in the cpuset. Moreover, not all changes in the
>> flag bits are validated in validate_change(). In particular, the load
>> balance flag and the two spread flags are not checked there. So there
>> is no point in calling validate_change() if those flag bits change.
> The fact that it's escaping validation conditionally from caller side is
> bothersome given that the idea is to have self-contained verifier to ensure
> correctness. I'd prefer to make the validation more complete and optimized
> (ie. detect or keep track of what changed) if really necessary rather than
> escaping partially because certain conditions aren't checked.
Thanks for the comments.
You are right. I will leave out this patch. Anyway, the rests of the
patchset don't have a strict dependency on it.
Cheers,
Longman
^ permalink raw reply
* Re: [PATCH 3/5] cgroup/cpuset: Allow non-top parent partition root to distribute out all CPUs
From: Tejun Heo @ 2021-06-16 20:57 UTC (permalink / raw)
To: Waiman Long
Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan,
cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-doc-u79uwXL29TY76Z2rM5mHXA,
linux-kselftest-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli
In-Reply-To: <20210603212416.25934-4-longman-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
Hello,
On Thu, Jun 03, 2021 at 05:24:14PM -0400, Waiman Long wrote:
> @@ -2181,6 +2192,13 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
> (cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed)))
> goto out_unlock;
>
> + /*
> + * On default hierarchy, task cannot be moved to a cpuset with empty
> + * effective cpus.
> + */
> + if (is_in_v2_mode() && cpumask_empty(cs->effective_cpus))
> + goto out_unlock;
> +
This is inconsistent with how other events which leave a root partition
empty is handled. Woudln't it be more consistent to switch the parent to
PRS_ERROR and behave accordingly but allow it to have valid child roots?
Thanks.
--
tejun
^ permalink raw reply
* Re: [PATCH 2/5] cgroup/cpuset: Add new cpus.partition type with no load balancing
From: Tejun Heo @ 2021-06-16 20:47 UTC (permalink / raw)
To: Waiman Long
Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, cgroups,
linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli
In-Reply-To: <20210603212416.25934-3-longman@redhat.com>
Hello,
Generally looks fine to me.
On Thu, Jun 03, 2021 at 05:24:13PM -0400, Waiman Long wrote:
> @@ -1984,12 +1987,31 @@ static int update_prstate(struct cpuset *cs, int val)
> goto out;
>
> err = update_parent_subparts_cpumask(cs, partcmd_enable,
> - NULL, &tmp);
> + NULL, &tmpmask);
> +
> if (err) {
> update_flag(CS_CPU_EXCLUSIVE, cs, 0);
> goto out;
> + } else if (new_prs == PRS_ENABLED_NOLB) {
> + /*
> + * Disable the load balance flag should not return an
^ing
and "else if" after "if (err) goto out" block is weird. The two conditions
don't need to be tied together.
> @@ -2518,6 +2547,9 @@ static int sched_partition_show(struct seq_file *seq, void *v)
> case PRS_ENABLED:
> seq_puts(seq, "root\n");
> break;
> + case PRS_ENABLED_NOLB:
> + seq_puts(seq, "root-nolb\n");
> + break;
> case PRS_DISABLED:
> seq_puts(seq, "member\n");
> break;
> @@ -2544,6 +2576,8 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
> val = PRS_ENABLED;
> else if (!strcmp(buf, "member"))
> val = PRS_DISABLED;
> + else if (!strcmp(buf, "root-nolb"))
> + val = PRS_ENABLED_NOLB;
> else
> return -EINVAL;
I wonder whether there's a better name than "root-nolb" because nolb isn't
the most readable and we are using space as the delimiter for other names.
Would something like "isolated" work?
Thanks.
--
tejun
^ permalink raw reply
* Re: [PATCH 1/5] cgroup/cpuset: Don't call validate_change() for some flag changes
From: Tejun Heo @ 2021-06-16 20:39 UTC (permalink / raw)
To: Waiman Long
Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Shuah Khan, cgroups,
linux-kernel, linux-doc, linux-kselftest, Andrew Morton,
Roman Gushchin, Phil Auld, Peter Zijlstra, Juri Lelli
In-Reply-To: <20210603212416.25934-2-longman@redhat.com>
Hello,
On Thu, Jun 03, 2021 at 05:24:12PM -0400, Waiman Long wrote:
> The update_flag() is called with one flag bit change and without change
> in the various cpumasks in the cpuset. Moreover, not all changes in the
> flag bits are validated in validate_change(). In particular, the load
> balance flag and the two spread flags are not checked there. So there
> is no point in calling validate_change() if those flag bits change.
The fact that it's escaping validation conditionally from caller side is
bothersome given that the idea is to have self-contained verifier to ensure
correctness. I'd prefer to make the validation more complete and optimized
(ie. detect or keep track of what changed) if really necessary rather than
escaping partially because certain conditions aren't checked.
Thanks.
--
tejun
^ permalink raw reply
* Re: [PATCH 0/3] add/update/fix BFQ docs
From: Jens Axboe @ 2021-06-16 17:32 UTC (permalink / raw)
To: Kir Kolyshkin, linux-doc; +Cc: Jonathan Corbet, tj, paolo.valente, cgroups
In-Reply-To: <20210611030737.1984343-1-kolyshkin@gmail.com>
On 6/10/21 9:07 PM, Kir Kolyshkin wrote:
> Improve BFQ docs: fix formatting, adding a missing piece.
>
> Fix cgroup v1 blkio docs which are not updated for CFQ -> BFQ.
>
> Patches are on top of the latest docs-next (commit acda97acb2e98c97895).
Applied, thanks.
--
Jens Axboe
^ permalink raw reply
* Re: [PATCH v1] proc: Implement /proc/self/meminfo
From: Eric W. Biederman @ 2021-06-16 16:17 UTC (permalink / raw)
To: Shakeel Butt
Cc: Alexey Gladkov, Christian Brauner, LKML, Linux Containers,
Linux Containers, Linux FS Devel, Linux MM, Andrew Morton,
Johannes Weiner, Michal Hocko, Chris Down, Cgroups
In-Reply-To: <CALvZod7po_fK9JpcUNVrN6PyyP9k=hdcyRfZmHjSVE5r_8Laqw@mail.gmail.com>
Shakeel Butt <shakeelb@google.com> writes:
> On Tue, Jun 15, 2021 at 5:47 AM Alexey Gladkov <legion@kernel.org> wrote:
>>
> [...]
>>
>> I made the second version of the patch [1], but then I had a conversation
>> with Eric W. Biederman offlist. He convinced me that it is a bad idea to
>> change all the values in meminfo to accommodate cgroups. But we agreed
>> that MemAvailable in /proc/meminfo should respect cgroups limits. This
>> field was created to hide implementation details when calculating
>> available memory. You can see that it is quite widely used [2].
>> So I want to try to move in that direction.
>>
>> [1] https://git.kernel.org/pub/scm/linux/kernel/git/legion/linux.git/log/?h=patchset/meminfo/v2.0
>> [2] https://codesearch.debian.net/search?q=MemAvailable%3A
>>
>
> Please see following two links on the previous discussion on having
> per-memcg MemAvailable stat.
>
> [1] https://lore.kernel.org/linux-mm/alpine.DEB.2.22.394.2006281445210.855265@chino.kir.corp.google.com/
> [2] https://lore.kernel.org/linux-mm/alpine.DEB.2.23.453.2007142018150.2667860@chino.kir.corp.google.com/
>
> MemAvailable itself is an imprecise metric and involving memcg makes
> this metric even more weird. The difference of semantics of swap
> accounting of v1 and v2 is one source of this weirdness (I have not
> checked your patch if it is handling this weirdness). The lazyfree and
> deferred split pages are another source.
>
> So, I am not sure if complicating an already imprecise metric will
> make it more useful.
Making a good guess at how much memory can be allocated without
triggering swapping or otherwise stressing the system is something that
requires understanding our mm internals.
To be able to continue changing the mm or even mm policy without
introducing regressions in userspace we need to export values that
userspace can use.
At a first approximation that seems to look like MemAvailable.
MemAvailable seems to have a good definition. Roughly the amount of
memory that can be allocated without triggering swapping. Updated
to include not trigger memory cgroup based swapping and I sounds good.
I don't know if it will work in practice but I think it is worth
exploring.
I do know that hiding the implementation details and providing userspace
with information it can directly use seems like the programming model
that needs to be explored. Most programs should not care if they are in
a memory cgroup, etc. Programs, load management systems, and even
balloon drivers have a legitimately interest in how much additional load
can be placed on a systems memory.
A version of this that I remember working fairly well is free space
on compressed filesystems. As I recall compressed filesystems report
the amount of uncompressed space that is available (an underestimate).
This results in the amount of space consumed going up faster than the
free space goes down.
We can't do exactly the same thing with our memory usability estimate,
but having our estimate be a reliable underestimate might be enough
to avoid problems with reporting too much memory as available to
userspace.
I know that MemAvailable already does that /2 so maybe it is already
aiming at being an underestimate. Perhaps we need some additional
accounting to help create a useful metric for userspace as well.
I don't know the final answer. I do know that not designing an
interface that userspace can use to deal with it's legitimate concerns
is sticking our collective heads in the sand and wishing the problem
will go away.
Eric
^ permalink raw reply
* Re: [PATCH 0/3] add/update/fix BFQ docs
From: Tejun Heo @ 2021-06-16 15:46 UTC (permalink / raw)
To: Kir Kolyshkin
Cc: linux-doc-u79uwXL29TY76Z2rM5mHXA, Jonathan Corbet,
axboe-tSWWG44O7X1aa/9Udqfwiw,
paolo.valente-QSEj5FYQhm4dnm+yROfE0A,
cgroups-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20210611030737.1984343-1-kolyshkin-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
On Thu, Jun 10, 2021 at 08:07:34PM -0700, Kir Kolyshkin wrote:
> Improve BFQ docs: fix formatting, adding a missing piece.
>
> Fix cgroup v1 blkio docs which are not updated for CFQ -> BFQ.
>
> Patches are on top of the latest docs-next (commit acda97acb2e98c97895).
Acked-by: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Prolly best to go through the block tree.
Thanks.
--
tejun
^ permalink raw reply
* Re: [PATCH] cgroup1: fix leaked context root causing sporadic NULL deref in LTP
From: Tejun Heo @ 2021-06-16 15:23 UTC (permalink / raw)
To: Paul Gortmaker
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
cgroups-u79uwXL29TY76Z2rM5mHXA, Al Viro, Zefan Li,
Johannes Weiner, stable-u79uwXL29TY76Z2rM5mHXA, Richard Purdie
In-Reply-To: <20210616125157.438837-1-paul.gortmaker-CWA4WttNNZF54TAoqtyWWQ@public.gmane.org>
Hello,
On Wed, Jun 16, 2021 at 08:51:57AM -0400, Paul Gortmaker wrote:
> A fix would be to not leave the stale reference in fc->root as follows:
>
> --------------
> dput(fc->root);
> + fc->root = NULL;
> deactivate_locked_super(sb);
> --------------
>
> ...but then we are just open-coding a duplicate of fc_drop_locked() so we
> simply use that instead.
As this is unlikely to be a real-world problem both in probability and
circumstances, I'm applying this to cgroup/for-5.14 instead of
cgroup/for-5.13-fixes.
Thanks.
--
tejun
^ permalink raw reply
* [PATCH] cgroup1: fix leaked context root causing sporadic NULL deref in LTP
From: Paul Gortmaker @ 2021-06-16 12:51 UTC (permalink / raw)
To: linux-kernel
Cc: cgroups, Paul Gortmaker, Al Viro, Tejun Heo, Zefan Li,
Johannes Weiner, stable, Richard Purdie
Richard reported sporadic (roughly one in 10 or so) null dereferences and
other strange behaviour for a set of automated LTP tests. Things like:
BUG: kernel NULL pointer dereference, address: 0000000000000008
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 0 PID: 1516 Comm: umount Not tainted 5.10.0-yocto-standard #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-48-gd9c812dda519-prebuilt.qemu.org 04/01/2014
RIP: 0010:kernfs_sop_show_path+0x1b/0x60
...or these others:
RIP: 0010:do_mkdirat+0x6a/0xf0
RIP: 0010:d_alloc_parallel+0x98/0x510
RIP: 0010:do_readlinkat+0x86/0x120
There were other less common instances of some kind of a general scribble
but the common theme was mount and cgroup and a dubious dentry triggering
the NULL dereference. I was only able to reproduce it under qemu by
replicating Richard's setup as closely as possible - I never did get it
to happen on bare metal, even while keeping everything else the same.
In commit 71d883c37e8d ("cgroup_do_mount(): massage calling conventions")
we see this as a part of the overall change:
--------------
struct cgroup_subsys *ss;
- struct dentry *dentry;
[...]
- dentry = cgroup_do_mount(&cgroup_fs_type, fc->sb_flags, root,
- CGROUP_SUPER_MAGIC, ns);
[...]
- if (percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
- struct super_block *sb = dentry->d_sb;
- dput(dentry);
+ ret = cgroup_do_mount(fc, CGROUP_SUPER_MAGIC, ns);
+ if (!ret && percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
+ struct super_block *sb = fc->root->d_sb;
+ dput(fc->root);
deactivate_locked_super(sb);
msleep(10);
return restart_syscall();
}
--------------
In changing from the local "*dentry" variable to using fc->root, we now
export/leave that dentry pointer in the file context after doing the dput()
in the unlikely "is_dying" case. With LTP doing a crazy amount of back to
back mount/unmount [testcases/bin/cgroup_regression_5_1.sh] the unlikely
becomes slightly likely and then bad things happen.
A fix would be to not leave the stale reference in fc->root as follows:
--------------
dput(fc->root);
+ fc->root = NULL;
deactivate_locked_super(sb);
--------------
...but then we are just open-coding a duplicate of fc_drop_locked() so we
simply use that instead.
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: stable@vger.kernel.org # v5.1+
Reported-by: Richard Purdie <richard.purdie@linuxfoundation.org>
Fixes: 71d883c37e8d ("cgroup_do_mount(): massage calling conventions")
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
diff --git a/fs/internal.h b/fs/internal.h
index 6aeae7ef3380..728f8d70d7f1 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -61,7 +61,6 @@ extern void __init chrdev_init(void);
*/
extern const struct fs_context_operations legacy_fs_context_ops;
extern int parse_monolithic_mount_data(struct fs_context *, void *);
-extern void fc_drop_locked(struct fs_context *);
extern void vfs_clean_context(struct fs_context *fc);
extern int finish_clean_context(struct fs_context *fc);
diff --git a/include/linux/fs_context.h b/include/linux/fs_context.h
index 37e1e8f7f08d..5b44b0195a28 100644
--- a/include/linux/fs_context.h
+++ b/include/linux/fs_context.h
@@ -139,6 +139,7 @@ extern int vfs_parse_fs_string(struct fs_context *fc, const char *key,
extern int generic_parse_monolithic(struct fs_context *fc, void *data);
extern int vfs_get_tree(struct fs_context *fc);
extern void put_fs_context(struct fs_context *fc);
+extern void fc_drop_locked(struct fs_context *fc);
/*
* sget() wrappers to be called from the ->get_tree() op.
diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c
index 1f274d7fc934..6e554743cf2b 100644
--- a/kernel/cgroup/cgroup-v1.c
+++ b/kernel/cgroup/cgroup-v1.c
@@ -1223,9 +1223,7 @@ int cgroup1_get_tree(struct fs_context *fc)
ret = cgroup_do_get_tree(fc);
if (!ret && percpu_ref_is_dying(&ctx->root->cgrp.self.refcnt)) {
- struct super_block *sb = fc->root->d_sb;
- dput(fc->root);
- deactivate_locked_super(sb);
+ fc_drop_locked(fc);
ret = 1;
}
--
2.31.1
^ permalink raw reply related
* Re: [PATCH v1] proc: Implement /proc/self/meminfo
From: Shakeel Butt @ 2021-06-16 1:09 UTC (permalink / raw)
To: Alexey Gladkov
Cc: Christian Brauner, LKML, Linux Containers, Linux Containers,
Linux FS Devel, Linux MM, Andrew Morton, Eric W . Biederman,
Johannes Weiner, Michal Hocko, Chris Down, Cgroups
In-Reply-To: <20210615124715.nzd5we5tl7xc2n2p@example.org>
On Tue, Jun 15, 2021 at 5:47 AM Alexey Gladkov <legion@kernel.org> wrote:
>
[...]
>
> I made the second version of the patch [1], but then I had a conversation
> with Eric W. Biederman offlist. He convinced me that it is a bad idea to
> change all the values in meminfo to accommodate cgroups. But we agreed
> that MemAvailable in /proc/meminfo should respect cgroups limits. This
> field was created to hide implementation details when calculating
> available memory. You can see that it is quite widely used [2].
> So I want to try to move in that direction.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/legion/linux.git/log/?h=patchset/meminfo/v2.0
> [2] https://codesearch.debian.net/search?q=MemAvailable%3A
>
Please see following two links on the previous discussion on having
per-memcg MemAvailable stat.
[1] https://lore.kernel.org/linux-mm/alpine.DEB.2.22.394.2006281445210.855265@chino.kir.corp.google.com/
[2] https://lore.kernel.org/linux-mm/alpine.DEB.2.23.453.2007142018150.2667860@chino.kir.corp.google.com/
MemAvailable itself is an imprecise metric and involving memcg makes
this metric even more weird. The difference of semantics of swap
accounting of v1 and v2 is one source of this weirdness (I have not
checked your patch if it is handling this weirdness). The lazyfree and
deferred split pages are another source.
So, I am not sure if complicating an already imprecise metric will
make it more useful.
^ permalink raw reply
* Re: [PATCH v2 2/2] memcg: periodically flush the memcg stats
From: Shakeel Butt @ 2021-06-15 21:52 UTC (permalink / raw)
To: Johannes Weiner
Cc: Tejun Heo, Muchun Song, Michal Hocko, Roman Gushchin,
Michal Koutný, Huang Ying, Andrew Morton, Cgroups, Linux MM,
LKML
In-Reply-To: <YMj/s26uF+cQOB2D-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
On Tue, Jun 15, 2021 at 12:29 PM Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org> wrote:
>
> Hey Shakeel,
>
> On Tue, Jun 15, 2021 at 10:44:35AM -0700, Shakeel Butt wrote:
> > At the moment memcg stats are read in four contexts:
> >
> > 1. memcg stat user interfaces
> > 2. dirty throttling
> > 3. page fault
> > 4. memory reclaim
> >
> > Currently the kernel flushes the stats for first two cases. Flushing the
> > stats for remaining two casese may have performance impact. Always
> > flushing the memcg stats on the page fault code path may negatively
> > impacts the performance of the applications. In addition flushing in the
> > memory reclaim code path, though treated as slowpath, can become the
> > source of contention for the global lock taken for stat flushing because
> > when system or memcg is under memory pressure, many tasks may enter the
> > reclaim path.
> >
> > Instead of synchronously flushing the stats, this patch adds support of
> > asynchronous periodic flushing of the memcg stats. For now the flushing
> > period is hardcoded to 2*HZ but that can be changed later through maybe
> > sysctl if need arise.
>
> I'm concerned that quite a lot can happen in terms of reclaim and page
> faults in 2 seconds. It's conceivable that the error of a fixed 2s
> flush can actually exceed the error of a fixed percpu batch size.
>
Yes, that is possible.
> The way the global vmstat implementation manages error is doing both:
> ratelimiting and timelimiting. It uses percpu batching to limit the
> error when it gets busy, and periodic flushing to limit the length of
> time consumers of those stats could be stuck trying to reach a state
> that the batching would otherwise prevent from being reflected.
>
> Maybe we can use a combination of ratelimiting and timelimiting too?
>
> We shouldn't flush on every fault, but what about a percpu ratelimit
> that would at least bound the error to NR_CPU instead of nr_cgroups?
>
Couple questions here:
First, to convert the error bound to NR_CPU from nr_cgroups, I think
we have to move from (delta > threshold) comparison to
(num_update_events > threshold). Previously an increment event
followed by decrement would keep the delta to 0 (or same) but after
this change num_update_events would be 2. Is that ok?
Second, do we want to synchronously flush the stats when we cross the
threshold on update or asynchronously by queuing the flush with zero
delay?
> For thundering herds during reclaim: as long as they all tried to
> flush from the root, only one of them would actually need to do the
> work, and we could use trylock. If the lock is already taken, you can
> move on knowing that somebody is already doing the shared flush work.
Yes, this makes sense.
^ permalink raw reply
* Re: [PATCH v2 2/2] memcg: periodically flush the memcg stats
From: Johannes Weiner @ 2021-06-15 19:29 UTC (permalink / raw)
To: Shakeel Butt
Cc: Tejun Heo, Muchun Song, Michal Hocko, Roman Gushchin,
Michal Koutný, Huang Ying, Andrew Morton,
cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
linux-kernel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <20210615174435.4174364-2-shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
Hey Shakeel,
On Tue, Jun 15, 2021 at 10:44:35AM -0700, Shakeel Butt wrote:
> At the moment memcg stats are read in four contexts:
>
> 1. memcg stat user interfaces
> 2. dirty throttling
> 3. page fault
> 4. memory reclaim
>
> Currently the kernel flushes the stats for first two cases. Flushing the
> stats for remaining two casese may have performance impact. Always
> flushing the memcg stats on the page fault code path may negatively
> impacts the performance of the applications. In addition flushing in the
> memory reclaim code path, though treated as slowpath, can become the
> source of contention for the global lock taken for stat flushing because
> when system or memcg is under memory pressure, many tasks may enter the
> reclaim path.
>
> Instead of synchronously flushing the stats, this patch adds support of
> asynchronous periodic flushing of the memcg stats. For now the flushing
> period is hardcoded to 2*HZ but that can be changed later through maybe
> sysctl if need arise.
I'm concerned that quite a lot can happen in terms of reclaim and page
faults in 2 seconds. It's conceivable that the error of a fixed 2s
flush can actually exceed the error of a fixed percpu batch size.
The way the global vmstat implementation manages error is doing both:
ratelimiting and timelimiting. It uses percpu batching to limit the
error when it gets busy, and periodic flushing to limit the length of
time consumers of those stats could be stuck trying to reach a state
that the batching would otherwise prevent from being reflected.
Maybe we can use a combination of ratelimiting and timelimiting too?
We shouldn't flush on every fault, but what about a percpu ratelimit
that would at least bound the error to NR_CPU instead of nr_cgroups?
For thundering herds during reclaim: as long as they all tried to
flush from the root, only one of them would actually need to do the
work, and we could use trylock. If the lock is already taken, you can
move on knowing that somebody is already doing the shared flush work.
^ permalink raw reply
* [PATCH v2 2/2] memcg: periodically flush the memcg stats
From: Shakeel Butt @ 2021-06-15 17:44 UTC (permalink / raw)
To: Tejun Heo, Johannes Weiner, Muchun Song
Cc: Michal Hocko, Roman Gushchin, Michal Koutný, Huang Ying,
Andrew Morton, cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Shakeel Butt
In-Reply-To: <20210615174435.4174364-1-shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
At the moment memcg stats are read in four contexts:
1. memcg stat user interfaces
2. dirty throttling
3. page fault
4. memory reclaim
Currently the kernel flushes the stats for first two cases. Flushing the
stats for remaining two casese may have performance impact. Always
flushing the memcg stats on the page fault code path may negatively
impacts the performance of the applications. In addition flushing in the
memory reclaim code path, though treated as slowpath, can become the
source of contention for the global lock taken for stat flushing because
when system or memcg is under memory pressure, many tasks may enter the
reclaim path.
Instead of synchronously flushing the stats, this patch adds support of
asynchronous periodic flushing of the memcg stats. For now the flushing
period is hardcoded to 2*HZ but that can be changed later through maybe
sysctl if need arise.
This patch does add the explicit flushing in the kswapd thread as the
number of kswapd threads which corresponds to the number of nodes on
realistic machines are usually low.
Signed-off-by: Shakeel Butt <shakeelb-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org>
---
Changes since v1:
- use system_unbound_wq for flushing the memcg stats
include/linux/memcontrol.h | 10 ++++++++++
mm/memcontrol.c | 14 ++++++++++++++
mm/vmscan.c | 6 ++++++
3 files changed, 30 insertions(+)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 0bfa0409af22..f34214382a1c 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -991,6 +991,12 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
return x;
}
+static inline void mem_cgroup_flush_stats(void)
+{
+ if (!mem_cgroup_disabled())
+ cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
+}
+
void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
int val);
void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val);
@@ -1400,6 +1406,10 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
return node_page_state(lruvec_pgdat(lruvec), idx);
}
+static inline void mem_cgroup_flush_stats(void)
+{
+}
+
static inline void __mod_memcg_lruvec_state(struct lruvec *lruvec,
enum node_stat_item idx, int val)
{
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6e24fd8c5301..5910658bac84 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -97,6 +97,10 @@ bool cgroup_memory_noswap __ro_after_init;
static DECLARE_WAIT_QUEUE_HEAD(memcg_cgwb_frn_waitq);
#endif
+/* Periodically flush memcg and lruvec stats. */
+static void flush_memcg_stats(struct work_struct *w);
+static DECLARE_DEFERRABLE_WORK(stats_flush, flush_memcg_stats);
+
/* Whether legacy memory+swap accounting is active */
static bool do_memsw_account(void)
{
@@ -5248,6 +5252,10 @@ static int mem_cgroup_css_online(struct cgroup_subsys_state *css)
/* Online state pins memcg ID, memcg ID pins CSS */
refcount_set(&memcg->id.ref, 1);
css_get(css);
+
+ if (unlikely(mem_cgroup_is_root(memcg)))
+ queue_delayed_work(system_unbound_wq, &stats_flush, 2UL*HZ);
+
return 0;
}
@@ -5339,6 +5347,12 @@ static void mem_cgroup_css_reset(struct cgroup_subsys_state *css)
memcg_wb_domain_size_changed(memcg);
}
+static void flush_memcg_stats(struct work_struct *w)
+{
+ cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
+ queue_delayed_work(system_unbound_wq, &stats_flush, 2UL*HZ);
+}
+
static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
{
struct mem_cgroup *memcg = mem_cgroup_from_css(css);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a7602f71ec04..7bf9a4241dd9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3893,6 +3893,12 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
sc.may_writepage = !laptop_mode && !nr_boost_reclaim;
sc.may_swap = !nr_boost_reclaim;
+ /*
+ * Flush the memory cgroup stats, so that we read accurate
+ * per-memcg lruvec stats for heuristics later.
+ */
+ mem_cgroup_flush_stats();
+
/*
* Do some background aging of the anon list, to give
* pages a chance to be referenced before reclaiming. All
--
2.32.0.272.g935e593368-goog
^ permalink raw reply related
* [PATCH v2 1/2] memcg: switch lruvec stats to rstat
From: Shakeel Butt @ 2021-06-15 17:44 UTC (permalink / raw)
To: Tejun Heo, Johannes Weiner, Muchun Song
Cc: Michal Hocko, Roman Gushchin, Michal Koutný, Huang Ying,
Andrew Morton, cgroups, linux-mm, linux-kernel, Shakeel Butt
The commit 2d146aa3aa84 ("mm: memcontrol: switch to rstat") but skipped
the conversion of the lruvec stats as such stats are read in the
performance critical code paths and flushing stats may have impacted the
performances of the applications. This patch converts the lruvec stats
to rstat and later patch adds the periodic flushing of the stats and
thus remove the need to synchronously flushing the stats in the
performance critical code paths.
The rstat conversion comes with the price i.e. memory cost. Effectively
this patch reverts the savings done by the commit f3344adf38bd ("mm:
memcontrol: optimize per-lruvec stats counter memory usage"). However
this cost is justified due to negative impact of the inaccurate lruvec
stats on many heuristics. One such case is reported in [1].
The memory reclaim code is filled with plethora of heuristics and many
of those heuristics reads the lruvec stats. So, inaccurate stats can
make such heuristics ineffective. [1] reports the impact of inaccurate
lruvec stats on the "cache trim mode" heuristic. Inaccurate lruvec stats
can impact the deactivation and aging anon heuristics as well.
[1] https://lore.kernel.org/linux-mm/20210311004449.1170308-1-ying.huang@intel.com/
Signed-off-by: Shakeel Butt <shakeelb@google.com>
---
Changes since v1:
- no changes
include/linux/memcontrol.h | 42 +++++++------
mm/memcontrol.c | 118 +++++++++++++------------------------
2 files changed, 60 insertions(+), 100 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index bfe5c486f4ad..0bfa0409af22 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -105,14 +105,6 @@ struct mem_cgroup_reclaim_iter {
unsigned int generation;
};
-struct lruvec_stat {
- long count[NR_VM_NODE_STAT_ITEMS];
-};
-
-struct batched_lruvec_stat {
- s32 count[NR_VM_NODE_STAT_ITEMS];
-};
-
/*
* Bitmap and deferred work of shrinker::id corresponding to memcg-aware
* shrinkers, which have elements charged to this memcg.
@@ -123,24 +115,30 @@ struct shrinker_info {
unsigned long *map;
};
+struct lruvec_stats_percpu {
+ /* Local (CPU and cgroup) state */
+ long state[NR_VM_NODE_STAT_ITEMS];
+
+ /* Delta calculation for lockless upward propagation */
+ long state_prev[NR_VM_NODE_STAT_ITEMS];
+};
+
+struct lruvec_stats {
+ /* Aggregated (CPU and subtree) state */
+ long state[NR_VM_NODE_STAT_ITEMS];
+
+ /* Pending child counts during tree propagation */
+ long state_pending[NR_VM_NODE_STAT_ITEMS];
+};
+
/*
* per-node information in memory controller.
*/
struct mem_cgroup_per_node {
struct lruvec lruvec;
- /*
- * Legacy local VM stats. This should be struct lruvec_stat and
- * cannot be optimized to struct batched_lruvec_stat. Because
- * the threshold of the lruvec_stat_cpu can be as big as
- * MEMCG_CHARGE_BATCH * PAGE_SIZE. It can fit into s32. But this
- * filed has no upper limit.
- */
- struct lruvec_stat __percpu *lruvec_stat_local;
-
- /* Subtree VM stats (batched updates) */
- struct batched_lruvec_stat __percpu *lruvec_stat_cpu;
- atomic_long_t lruvec_stat[NR_VM_NODE_STAT_ITEMS];
+ struct lruvec_stats_percpu __percpu *lruvec_stats_percpu;
+ struct lruvec_stats lruvec_stats;
unsigned long lru_zone_size[MAX_NR_ZONES][NR_LRU_LISTS];
@@ -965,7 +963,7 @@ static inline unsigned long lruvec_page_state(struct lruvec *lruvec,
return node_page_state(lruvec_pgdat(lruvec), idx);
pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
- x = atomic_long_read(&pn->lruvec_stat[idx]);
+ x = READ_ONCE(pn->lruvec_stats.state[idx]);
#ifdef CONFIG_SMP
if (x < 0)
x = 0;
@@ -985,7 +983,7 @@ static inline unsigned long lruvec_page_state_local(struct lruvec *lruvec,
pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
for_each_possible_cpu(cpu)
- x += per_cpu(pn->lruvec_stat_local->count[idx], cpu);
+ x += per_cpu(pn->lruvec_stats_percpu->state[idx], cpu);
#ifdef CONFIG_SMP
if (x < 0)
x = 0;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index dce5154fbb8f..6e24fd8c5301 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -671,46 +671,20 @@ static unsigned long memcg_page_state_local(struct mem_cgroup *memcg, int idx)
return x;
}
-static struct mem_cgroup_per_node *
-parent_nodeinfo(struct mem_cgroup_per_node *pn, int nid)
-{
- struct mem_cgroup *parent;
-
- parent = parent_mem_cgroup(pn->memcg);
- if (!parent)
- return NULL;
- return parent->nodeinfo[nid];
-}
-
void __mod_memcg_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx,
int val)
{
struct mem_cgroup_per_node *pn;
struct mem_cgroup *memcg;
- long x, threshold = MEMCG_CHARGE_BATCH;
pn = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
memcg = pn->memcg;
- /* Update memcg */
- __mod_memcg_state(memcg, idx, val);
-
/* Update lruvec */
- __this_cpu_add(pn->lruvec_stat_local->count[idx], val);
+ __this_cpu_add(pn->lruvec_stats_percpu->state[idx], val);
- if (vmstat_item_in_bytes(idx))
- threshold <<= PAGE_SHIFT;
-
- x = val + __this_cpu_read(pn->lruvec_stat_cpu->count[idx]);
- if (unlikely(abs(x) > threshold)) {
- pg_data_t *pgdat = lruvec_pgdat(lruvec);
- struct mem_cgroup_per_node *pi;
-
- for (pi = pn; pi; pi = parent_nodeinfo(pi, pgdat->node_id))
- atomic_long_add(x, &pi->lruvec_stat[idx]);
- x = 0;
- }
- __this_cpu_write(pn->lruvec_stat_cpu->count[idx], x);
+ /* Update memcg */
+ __mod_memcg_state(memcg, idx, val);
}
/**
@@ -2289,40 +2263,13 @@ static void drain_all_stock(struct mem_cgroup *root_memcg)
mutex_unlock(&percpu_charge_mutex);
}
-static void memcg_flush_lruvec_page_state(struct mem_cgroup *memcg, int cpu)
-{
- int nid;
-
- for_each_node(nid) {
- struct mem_cgroup_per_node *pn = memcg->nodeinfo[nid];
- unsigned long stat[NR_VM_NODE_STAT_ITEMS];
- struct batched_lruvec_stat *lstatc;
- int i;
-
- lstatc = per_cpu_ptr(pn->lruvec_stat_cpu, cpu);
- for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
- stat[i] = lstatc->count[i];
- lstatc->count[i] = 0;
- }
-
- do {
- for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++)
- atomic_long_add(stat[i], &pn->lruvec_stat[i]);
- } while ((pn = parent_nodeinfo(pn, nid)));
- }
-}
-
static int memcg_hotplug_cpu_dead(unsigned int cpu)
{
struct memcg_stock_pcp *stock;
- struct mem_cgroup *memcg;
stock = &per_cpu(memcg_stock, cpu);
drain_stock(stock);
- for_each_mem_cgroup(memcg)
- memcg_flush_lruvec_page_state(memcg, cpu);
-
return 0;
}
@@ -5126,17 +5073,9 @@ static int alloc_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
if (!pn)
return 1;
- pn->lruvec_stat_local = alloc_percpu_gfp(struct lruvec_stat,
- GFP_KERNEL_ACCOUNT);
- if (!pn->lruvec_stat_local) {
- kfree(pn);
- return 1;
- }
-
- pn->lruvec_stat_cpu = alloc_percpu_gfp(struct batched_lruvec_stat,
- GFP_KERNEL_ACCOUNT);
- if (!pn->lruvec_stat_cpu) {
- free_percpu(pn->lruvec_stat_local);
+ pn->lruvec_stats_percpu = alloc_percpu_gfp(struct lruvec_stats_percpu,
+ GFP_KERNEL_ACCOUNT);
+ if (!pn->lruvec_stats_percpu) {
kfree(pn);
return 1;
}
@@ -5157,8 +5096,7 @@ static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
if (!pn)
return;
- free_percpu(pn->lruvec_stat_cpu);
- free_percpu(pn->lruvec_stat_local);
+ free_percpu(pn->lruvec_stats_percpu);
kfree(pn);
}
@@ -5174,15 +5112,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
static void mem_cgroup_free(struct mem_cgroup *memcg)
{
- int cpu;
-
memcg_wb_domain_exit(memcg);
- /*
- * Flush percpu lruvec stats to guarantee the value
- * correctness on parent's and all ancestor levels.
- */
- for_each_online_cpu(cpu)
- memcg_flush_lruvec_page_state(memcg, cpu);
__mem_cgroup_free(memcg);
}
@@ -5415,7 +5345,7 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
struct mem_cgroup *parent = parent_mem_cgroup(memcg);
struct memcg_vmstats_percpu *statc;
long delta, v;
- int i;
+ int i, nid;
statc = per_cpu_ptr(memcg->vmstats_percpu, cpu);
@@ -5463,6 +5393,36 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu)
if (parent)
parent->vmstats.events_pending[i] += delta;
}
+
+ for_each_node_state(nid, N_MEMORY) {
+ struct mem_cgroup_per_node *pn = memcg->nodeinfo[nid];
+ struct mem_cgroup_per_node *ppn = NULL;
+ struct lruvec_stats_percpu *lstatc;
+
+ if (parent)
+ ppn = parent->nodeinfo[nid];
+
+ lstatc = per_cpu_ptr(pn->lruvec_stats_percpu, cpu);
+
+ for (i = 0; i < NR_VM_NODE_STAT_ITEMS; i++) {
+ delta = pn->lruvec_stats.state_pending[i];
+ if (delta)
+ pn->lruvec_stats.state_pending[i] = 0;
+
+ v = READ_ONCE(lstatc->state[i]);
+ if (v != lstatc->state_prev[i]) {
+ delta += v - lstatc->state_prev[i];
+ lstatc->state_prev[i] = v;
+ }
+
+ if (!delta)
+ continue;
+
+ pn->lruvec_stats.state[i] += delta;
+ if (ppn)
+ ppn->lruvec_stats.state_pending[i] += delta;
+ }
+ }
}
#ifdef CONFIG_MMU
@@ -6396,6 +6356,8 @@ static int memory_numa_stat_show(struct seq_file *m, void *v)
int i;
struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
+ cgroup_rstat_flush(memcg->css.cgroup);
+
for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
int nid;
--
2.32.0.272.g935e593368-goog
^ permalink raw reply related
* Re: [PATCH 0/4] cgroup/cpuset: Allow cpuset to bound displayed cpu info
From: Tejun Heo @ 2021-06-15 15:59 UTC (permalink / raw)
To: Waiman Long
Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, H. Peter Anvin, Greg Kroah-Hartman,
Rafael J. Wysocki, Luis Chamberlain, Kees Cook, Iurii Zaikin, x86,
cgroups, linux-kernel, linux-doc, linux-fsdevel
In-Reply-To: <0e21f16d-d91b-7cec-d832-4c401a713b10@redhat.com>
Hello, Waiman.
On Mon, Jun 14, 2021 at 10:53:53PM -0400, Waiman Long wrote:
> Thanks for your comment. I understand your point making change via cgroup
> interface files. However, this is not what the customers are asking for.
It's not like we can always follow what specific customers request. If there
are actual use-cases that can't be achieved with the existing interfaces and
features, we can look into how to provide those but making interface
decisions based on specific customer requests tends to lead to long term
pains.
> They are using tools that look at /proc/cpuinfo and the sysfs files. It is a
> much bigger effort to make all those tools look at a new cgroup file
> interface instead. It can be more efficiently done at the kernel level.
Short term, sure, it sure is more painful to adapt, but I don't think longer
term solution lies in the kernel trying to masquerage existing sytsem-wide
information interfaces. e.g. cpuset is one thing but what are we gonna do
about weight control or work-conserving memory controls? Pro-rate cpu count
and available memory?
> Anyway, I am OK if the consensus is that it is not a kernel problem and have
> to be handled in userspace.
I'd be happy to provide more information from kernel side as necessary but
the approach taken here doesn't seem generic or scalable at all.
> BTW, do you have any comment on another cpuset patch that I sent a week
> earlier?
>
> https://lore.kernel.org/lkml/20210603212416.25934-1-longman@redhat.com/
>
> I am looking forward for your feedback.
Sorry about the delay. Will take a look later today.
Thanks.
--
tejun
^ permalink raw reply
* Re: [PATCH v1] proc: Implement /proc/self/meminfo
From: Alexey Gladkov @ 2021-06-15 12:47 UTC (permalink / raw)
To: Christian Brauner
Cc: LKML, Linux Containers, Linux Containers, Linux FS Devel,
linux-mm, Andrew Morton, Eric W . Biederman, Johannes Weiner,
Michal Hocko, Chris Down, cgroups
In-Reply-To: <20210615113222.edzkaqfvrris4nth@wittgenstein>
On Tue, Jun 15, 2021 at 01:32:22PM +0200, Christian Brauner wrote:
> On Thu, Jun 03, 2021 at 12:43:07PM +0200, legion@kernel.org wrote:
> > From: Alexey Gladkov <legion@kernel.org>
> >
> > The /proc/meminfo contains information regardless of the cgroups
> > restrictions. This file is still widely used [1]. This means that all
> > these programs will not work correctly inside container [2][3][4]. Some
> > programs try to respect the cgroups limits, but not all of them
> > implement support for all cgroup versions [5].
> >
> > Correct information can be obtained from cgroups, but this requires the
> > cgroups to be available inside container and the correct version of
> > cgroups to be supported.
> >
> > There is lxcfs [6] that emulates /proc/meminfo using fuse to provide
> > information regarding cgroups. This patch can help them.
> >
> > This patch adds /proc/self/meminfo that contains a subset of
> > /proc/meminfo respecting cgroup restrictions.
> >
> > We cannot just create /proc/self/meminfo and make a symlink at the old
> > location because this will break the existing apparmor rules [7].
> > Therefore, the patch adds a separate file with the same format.
>
> Interesting work. Thanks. This is basically a variant of what I
> suggested at Plumbers and in [1].
I made the second version of the patch [1], but then I had a conversation
with Eric W. Biederman offlist. He convinced me that it is a bad idea to
change all the values in meminfo to accommodate cgroups. But we agreed
that MemAvailable in /proc/meminfo should respect cgroups limits. This
field was created to hide implementation details when calculating
available memory. You can see that it is quite widely used [2].
So I want to try to move in that direction.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/legion/linux.git/log/?h=patchset/meminfo/v2.0
[2] https://codesearch.debian.net/search?q=MemAvailable%3A
> Judging from the patches sent by Waiman Long in [2] to also virtualize
> /proc/cpuinfo and /sys/devices/system/cpu this is a larger push to
> provide virtualized system information to containers.
>
> Although somewhere in the thread here this veered off into apparently
> just being a way for a process to gather information about it's own
> resources. At which point I'm confused why looking at its cgroups
> isn't enough.
I think it's not enough. As an example:
$ mount -t cgroup2 none /sys/fs/cgroup
$ echo +memory > /sys/fs/cgroup/cgroup.subtree_control
$ mkdir /sys/fs/cgroup/mem0
$ echo +memory > /sys/fs/cgroup/mem0/cgroup.subtree_control
$ mkdir /sys/fs/cgroup/mem0/mem1
$ echo $$ > /sys/fs/cgroup/mem0/mem1/cgroup.procs
I didn't set a limit and just added the shell to the group.
$ cat /proc/self/cgroup
0::/mem0/mem1
$ cat /sys/fs/cgroup/mem0/mem1/memory.max
max
$ cat /sys/fs/cgroup/mem0/memory.max
max
In this case we need to use MemAvailable from /proc/meminfo.
Another example:
$ mount -t cgroup2 none /sys/fs/cgroup
$ echo +memory > /sys/fs/cgroup/cgroup.subtree_control
$ mkdir /sys/fs/cgroup/mem0
$ echo $(( 3 * 1024 * 1024 )) > /sys/fs/cgroup/mem0/memory.max
$ echo +memory > /sys/fs/cgroup/mem0/cgroup.subtree_control
$ mkdir /sys/fs/cgroup/mem0/mem1
$ echo $(( 3 * 1024 * 1024 * 1024 * 1024 )) > /sys/fs/cgroup/mem0/mem1/memory.max
$ echo $$ > /sys/fs/cgroup/mem0/mem1/cgroup.procs
$ head -3 /proc/meminfo
MemTotal: 1002348 kB
MemFree: 972712 kB
MemAvailable: 968100 kB
$ cat /sys/fs/cgroup/mem0{,/mem1}/memory.max
3145728
3298534883328
Now, I have cgroup limits, but you can write absolutely any value as a
limit. So how much memory is available to shell in this case? To get this
value, you need to take the minimum of MemAvailable and **/memory.max.
... or I fundamentally don't understand something.
> So /proc/self/meminfo seems to just be the start. And note the two
> approaches seem to diverge too. This provides a new file while the other
> patchset virtualizes existing proc files/folders.
>
> In any case it seems you might want to talk since afaict you're all at
> the same company but don't seem to be aware of each others work (Which
> happens of course.).
>
> For the sake of history such patchsets have been pushed for before by
> the Siteground people.
>
> Chris and Johannes made a good point that the information provided in
> this file can be gathered from cgroups already. So applications should
> probably switch to reading those out of their cgroup and most are doing
> that already.
>
> And reading values out of cgroups is pretty straightforward even with
> the differences between cgroup v1 and v2. Userspace is doing it all over
> the place all of the time and the code has now existed for years so the
> cgroup interface is a problem. And with cgroup v2 it keeps growing so
> much more useful metrics that looking at meminfo isn't really cutting it
> anyway.
>
> So I think the argument that applications should start looking at their
> cgroup info if they want to find out detailed info is a solid argument
> that shouldn't be easily brushed aside.
>
> What might be worth is knowing exactly what applications are looking at
> /proc/meminfo and /proc/cpuinfo and make decision based on that info.
> None of that is clearly outlined in the thread unfortunately.
>
> So I immediately see two types of applications that could benefit from
> this patchset. The first ones are legacy applications that aren't aware
> of cgroups and aren't actively maintained. Introducing such
> functionality for these applications seems a weak argument.
>
> The second type is new and maintained applications that look at global
> info such as /proc/meminfo and /proc/cpuinfo. So such applications have
> ignored cgroups for a decade now. This makes it very unconvincing that
> they will suddenly switch to a newly introduced file. Especially if the
> entries in a new file aren't a 1:1 mapping of the old file.
>
> Johannes made another good point about it not being clear what
> applications actually want. And he's very right in that. It seems
> straightforward to virtualize things like meminfo but it actually isn't.
> And it's something you quite often discover after the fact. We have
> extensive experience implementing it in LXCFS in userspace. People kept
> and keep arguing what information exactly is supposed to go into
> calculating those values based on what best helps their use-case.
>
> Swap was an especially contentious point. In fact, sometimes users want
> to turn of swap even though it exists on the host and there's a command
> line switch in LXCFS to control that behavior.
>
> Another example supporting Johannes worry is virtualizing /proc/cpuinfo
> where some people wanted to virtualize cpu counts based on cpu shares.
> So we have two modes to virtualize cpus: based on cpuset alone or based
> on cpuset and cpu shares. And both modes are actively used. And that all
> really depends on application and workload.
>
> Finally, although LXCFS is briefly referenced in the commit message but
> it isn't explained very well and what it does.
>
> And we should consider it since this is a full existing userspace
> solution to the problem solved in this patchset including Dan's JRE
> use-case.
>
> This is a project started in 2014 and it is in production use since 2014
> and it delivers the features of this patchset here and more.
>
> For example, it's used in the Linux susbystem of Chromebooks, it's used
> by Alibaba (see [3]) and it is used for the JRE use-case by Google's
> Anthos when migrating such legacy applications (see [4]).
>
> At first, I was convinced we could make use of /proc/self/meminfo in
> LXCFS which is why I held back but we can't. We can't simply bind-mount
> it over /proc/meminfo because it's not a 1:1 correspondence between all
> fields. We could potentially read some values we now calculate and
> display it in /proc/meminfo but we can't stop virtualizing /proc/meminfo
> itself. So we don't gain anything from this. When Alex asked me about it
> I tried to come up with good ways to integrate this but the gain is just
> too little for us.
>
> Because our experience tells us that applications that want this type of
> virtualization don't really care about heir own resources. They care
> about a virtualized view of the system's resources. And the system in
> question is often a container. But it get's very tricky since we don't
> really define what a container is. So what data the user wants to see
> depends on the used container runtime, type of container, and workload.
> An application container has very different needs than a system
> container that boots systemd. LXCFS can be very flexible here and
> virtualize according to the users preferences (see the split between
> cpuset and cpuset + cpu shares virtualization for cpu counts).
>
> In any case, LXCFS is a tiny FUSE filesystem which virtualizes various
> procfs and sysfs files for a container:
>
> /proc/cpuinfo
> /proc/diskstats
> /proc/meminfo
> /proc/stat
> /proc/swaps
> /proc/uptime
> /proc/slabinfo
> /sys/devices/system/cpu/*
> /sys/devices/system/cpu/online
>
> If you call top in a container that makes use of this it will display
> everything virtualized to the container (See [5] for an example of
> /proc/cpuinfo and /sys/devices/system/cpu/*.). And JRE will not
> overallocate resources. It's actively used for all of that.
>
> Below at [5] you can find an example where 2 cpus out of 8 have been
> assigned to the container's cpuset. The container values are virtualized
> as you can see.
>
> [1]: https://lkml.org/lkml/2020/6/4/951
> [2]: https://lore.kernel.org/lkml/YMe/cGV4JPbzFRk0@slm.duckdns.org
> [3]: https://www.alibabacloud.com/blog/kubernetes-demystified-using-lxcfs-to-improve-container-resource-visibility_594109
> [4]: https://cloud.google.com/blog/products/containers-kubernetes/migrate-for-anthos-streamlines-legacy-java-app-modernization
> [5]: ## /proc/cpuinfo
> #### Host
> brauner@wittgenstein|~
> > ls -al /sys/devices/system/cpu/ | grep cpu[[:digit:]]
> drwxr-xr-x 10 root root 0 Jun 14 21:22 cpu0
> drwxr-xr-x 10 root root 0 Jun 14 21:22 cpu1
> drwxr-xr-x 10 root root 0 Jun 14 21:22 cpu2
> drwxr-xr-x 10 root root 0 Jun 14 21:22 cpu3
> drwxr-xr-x 10 root root 0 Jun 14 21:22 cpu4
> drwxr-xr-x 10 root root 0 Jun 14 21:22 cpu5
> drwxr-xr-x 10 root root 0 Jun 14 21:22 cpu6
> drwxr-xr-x 10 root root 0 Jun 14 21:22 cpu7
>
> #### Container
> brauner@wittgenstein|~
> > lxc exec f1 -- ls -al /sys/devices/system/cpu/ | grep cpu[[:digit:]]
> drwxr-xr-x 2 nobody nogroup 0 Jun 15 10:22 cpu3
> drwxr-xr-x 2 nobody nogroup 0 Jun 15 10:22 cpu4
>
> ## /sys/devices/system/cpu/*
> #### Host
> brauner@wittgenstein|~
> > grep ^processor /proc/cpuinfo
> processor : 0
> processor : 1
> processor : 2
> processor : 3
> processor : 4
> processor : 5
> processor : 6
> processor : 7
>
> #### Container
> brauner@wittgenstein|~
> > lxc exec f1 -- grep ^processor /proc/cpuinfo
> processor : 0
> processor : 1
>
> ## top
> #### Host
> top - 13:16:47 up 15:54, 39 users, load average: 0,76, 0,47, 0,40
> Tasks: 434 total, 1 running, 433 sleeping, 0 stopped, 0 zombie
> %Cpu0 : 2,7 us, 2,4 sy, 0,0 ni, 94,5 id, 0,0 wa, 0,0 hi, 0,3 si, 0,0 st
> %Cpu1 : 3,3 us, 1,3 sy, 0,0 ni, 95,3 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
> %Cpu2 : 1,6 us, 9,1 sy, 0,0 ni, 89,3 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
> %Cpu3 : 2,3 us, 1,3 sy, 0,0 ni, 96,4 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
> %Cpu4 : 2,7 us, 1,7 sy, 0,0 ni, 95,7 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
> %Cpu5 : 2,9 us, 2,9 sy, 0,0 ni, 94,1 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
> %Cpu6 : 2,3 us, 1,0 sy, 0,0 ni, 96,3 id, 0,0 wa, 0,0 hi, 0,3 si, 0,0 st
> %Cpu7 : 3,3 us, 1,3 sy, 0,0 ni, 95,4 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
>
> #### Container
> top - 11:16:13 up 2:08, 0 users, load average: 0.27, 0.36, 0.36
> Tasks: 24 total, 1 running, 23 sleeping, 0 stopped, 0 zombie
> %Cpu0 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
> %Cpu1 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
>
--
Rgrds, legion
^ permalink raw reply
* Re: [PATCH v1] proc: Implement /proc/self/meminfo
From: Christian Brauner @ 2021-06-15 11:32 UTC (permalink / raw)
To: legion
Cc: LKML, Linux Containers, Linux Containers, Linux FS Devel,
linux-mm, Andrew Morton, Eric W . Biederman, Johannes Weiner,
Michal Hocko, Chris Down, cgroups
In-Reply-To: <ac070cd90c0d45b7a554366f235262fa5c566435.1622716926.git.legion@kernel.org>
On Thu, Jun 03, 2021 at 12:43:07PM +0200, legion@kernel.org wrote:
> From: Alexey Gladkov <legion@kernel.org>
>
> The /proc/meminfo contains information regardless of the cgroups
> restrictions. This file is still widely used [1]. This means that all
> these programs will not work correctly inside container [2][3][4]. Some
> programs try to respect the cgroups limits, but not all of them
> implement support for all cgroup versions [5].
>
> Correct information can be obtained from cgroups, but this requires the
> cgroups to be available inside container and the correct version of
> cgroups to be supported.
>
> There is lxcfs [6] that emulates /proc/meminfo using fuse to provide
> information regarding cgroups. This patch can help them.
>
> This patch adds /proc/self/meminfo that contains a subset of
> /proc/meminfo respecting cgroup restrictions.
>
> We cannot just create /proc/self/meminfo and make a symlink at the old
> location because this will break the existing apparmor rules [7].
> Therefore, the patch adds a separate file with the same format.
Interesting work. Thanks. This is basically a variant of what I
suggested at Plumbers and in [1].
Judging from the patches sent by Waiman Long in [2] to also virtualize
/proc/cpuinfo and /sys/devices/system/cpu this is a larger push to
provide virtualized system information to containers.
Although somewhere in the thread here this veered off into apparently
just being a way for a process to gather information about it's own
resources. At which point I'm confused why looking at its cgroups
isn't enough.
So /proc/self/meminfo seems to just be the start. And note the two
approaches seem to diverge too. This provides a new file while the other
patchset virtualizes existing proc files/folders.
In any case it seems you might want to talk since afaict you're all at
the same company but don't seem to be aware of each others work (Which
happens of course.).
For the sake of history such patchsets have been pushed for before by
the Siteground people.
Chris and Johannes made a good point that the information provided in
this file can be gathered from cgroups already. So applications should
probably switch to reading those out of their cgroup and most are doing
that already.
And reading values out of cgroups is pretty straightforward even with
the differences between cgroup v1 and v2. Userspace is doing it all over
the place all of the time and the code has now existed for years so the
cgroup interface is a problem. And with cgroup v2 it keeps growing so
much more useful metrics that looking at meminfo isn't really cutting it
anyway.
So I think the argument that applications should start looking at their
cgroup info if they want to find out detailed info is a solid argument
that shouldn't be easily brushed aside.
What might be worth is knowing exactly what applications are looking at
/proc/meminfo and /proc/cpuinfo and make decision based on that info.
None of that is clearly outlined in the thread unfortunately.
So I immediately see two types of applications that could benefit from
this patchset. The first ones are legacy applications that aren't aware
of cgroups and aren't actively maintained. Introducing such
functionality for these applications seems a weak argument.
The second type is new and maintained applications that look at global
info such as /proc/meminfo and /proc/cpuinfo. So such applications have
ignored cgroups for a decade now. This makes it very unconvincing that
they will suddenly switch to a newly introduced file. Especially if the
entries in a new file aren't a 1:1 mapping of the old file.
Johannes made another good point about it not being clear what
applications actually want. And he's very right in that. It seems
straightforward to virtualize things like meminfo but it actually isn't.
And it's something you quite often discover after the fact. We have
extensive experience implementing it in LXCFS in userspace. People kept
and keep arguing what information exactly is supposed to go into
calculating those values based on what best helps their use-case.
Swap was an especially contentious point. In fact, sometimes users want
to turn of swap even though it exists on the host and there's a command
line switch in LXCFS to control that behavior.
Another example supporting Johannes worry is virtualizing /proc/cpuinfo
where some people wanted to virtualize cpu counts based on cpu shares.
So we have two modes to virtualize cpus: based on cpuset alone or based
on cpuset and cpu shares. And both modes are actively used. And that all
really depends on application and workload.
Finally, although LXCFS is briefly referenced in the commit message but
it isn't explained very well and what it does.
And we should consider it since this is a full existing userspace
solution to the problem solved in this patchset including Dan's JRE
use-case.
This is a project started in 2014 and it is in production use since 2014
and it delivers the features of this patchset here and more.
For example, it's used in the Linux susbystem of Chromebooks, it's used
by Alibaba (see [3]) and it is used for the JRE use-case by Google's
Anthos when migrating such legacy applications (see [4]).
At first, I was convinced we could make use of /proc/self/meminfo in
LXCFS which is why I held back but we can't. We can't simply bind-mount
it over /proc/meminfo because it's not a 1:1 correspondence between all
fields. We could potentially read some values we now calculate and
display it in /proc/meminfo but we can't stop virtualizing /proc/meminfo
itself. So we don't gain anything from this. When Alex asked me about it
I tried to come up with good ways to integrate this but the gain is just
too little for us.
Because our experience tells us that applications that want this type of
virtualization don't really care about heir own resources. They care
about a virtualized view of the system's resources. And the system in
question is often a container. But it get's very tricky since we don't
really define what a container is. So what data the user wants to see
depends on the used container runtime, type of container, and workload.
An application container has very different needs than a system
container that boots systemd. LXCFS can be very flexible here and
virtualize according to the users preferences (see the split between
cpuset and cpuset + cpu shares virtualization for cpu counts).
In any case, LXCFS is a tiny FUSE filesystem which virtualizes various
procfs and sysfs files for a container:
/proc/cpuinfo
/proc/diskstats
/proc/meminfo
/proc/stat
/proc/swaps
/proc/uptime
/proc/slabinfo
/sys/devices/system/cpu/*
/sys/devices/system/cpu/online
If you call top in a container that makes use of this it will display
everything virtualized to the container (See [5] for an example of
/proc/cpuinfo and /sys/devices/system/cpu/*.). And JRE will not
overallocate resources. It's actively used for all of that.
Below at [5] you can find an example where 2 cpus out of 8 have been
assigned to the container's cpuset. The container values are virtualized
as you can see.
[1]: https://lkml.org/lkml/2020/6/4/951
[2]: https://lore.kernel.org/lkml/YMe/cGV4JPbzFRk0@slm.duckdns.org
[3]: https://www.alibabacloud.com/blog/kubernetes-demystified-using-lxcfs-to-improve-container-resource-visibility_594109
[4]: https://cloud.google.com/blog/products/containers-kubernetes/migrate-for-anthos-streamlines-legacy-java-app-modernization
[5]: ## /proc/cpuinfo
#### Host
brauner@wittgenstein|~
> ls -al /sys/devices/system/cpu/ | grep cpu[[:digit:]]
drwxr-xr-x 10 root root 0 Jun 14 21:22 cpu0
drwxr-xr-x 10 root root 0 Jun 14 21:22 cpu1
drwxr-xr-x 10 root root 0 Jun 14 21:22 cpu2
drwxr-xr-x 10 root root 0 Jun 14 21:22 cpu3
drwxr-xr-x 10 root root 0 Jun 14 21:22 cpu4
drwxr-xr-x 10 root root 0 Jun 14 21:22 cpu5
drwxr-xr-x 10 root root 0 Jun 14 21:22 cpu6
drwxr-xr-x 10 root root 0 Jun 14 21:22 cpu7
#### Container
brauner@wittgenstein|~
> lxc exec f1 -- ls -al /sys/devices/system/cpu/ | grep cpu[[:digit:]]
drwxr-xr-x 2 nobody nogroup 0 Jun 15 10:22 cpu3
drwxr-xr-x 2 nobody nogroup 0 Jun 15 10:22 cpu4
## /sys/devices/system/cpu/*
#### Host
brauner@wittgenstein|~
> grep ^processor /proc/cpuinfo
processor : 0
processor : 1
processor : 2
processor : 3
processor : 4
processor : 5
processor : 6
processor : 7
#### Container
brauner@wittgenstein|~
> lxc exec f1 -- grep ^processor /proc/cpuinfo
processor : 0
processor : 1
## top
#### Host
top - 13:16:47 up 15:54, 39 users, load average: 0,76, 0,47, 0,40
Tasks: 434 total, 1 running, 433 sleeping, 0 stopped, 0 zombie
%Cpu0 : 2,7 us, 2,4 sy, 0,0 ni, 94,5 id, 0,0 wa, 0,0 hi, 0,3 si, 0,0 st
%Cpu1 : 3,3 us, 1,3 sy, 0,0 ni, 95,3 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu2 : 1,6 us, 9,1 sy, 0,0 ni, 89,3 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu3 : 2,3 us, 1,3 sy, 0,0 ni, 96,4 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu4 : 2,7 us, 1,7 sy, 0,0 ni, 95,7 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu5 : 2,9 us, 2,9 sy, 0,0 ni, 94,1 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
%Cpu6 : 2,3 us, 1,0 sy, 0,0 ni, 96,3 id, 0,0 wa, 0,0 hi, 0,3 si, 0,0 st
%Cpu7 : 3,3 us, 1,3 sy, 0,0 ni, 95,4 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
#### Container
top - 11:16:13 up 2:08, 0 users, load average: 0.27, 0.36, 0.36
Tasks: 24 total, 1 running, 23 sleeping, 0 stopped, 0 zombie
%Cpu0 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
^ permalink raw reply
* Re: [PATCH 0/4] cgroup/cpuset: Allow cpuset to bound displayed cpu info
From: Christian Brauner @ 2021-06-15 9:14 UTC (permalink / raw)
To: Tejun Heo, Waiman Long, Greg Kroah-Hartman, Johannes Weiner
Cc: Zefan Li, Jonathan Corbet, Thomas Gleixner, Ingo Molnar,
Borislav Petkov,
H. Peter Anvin <hpa-YMNOUZJC4hwAvxtiuMwx3w@public.gmane.org>, Rafael J. Wysocki ,
Luis Chamberlain, Kees Cook, Iurii Zaikin,
x86-DgEjT+Ai2ygdnm+yROfE0A, cgroups-u79uwXL29TY76Z2rM5mHXA,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-doc-u79uwXL29TY76Z2rM5mHXA,
linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <YMe/cGV4JPbzFRk0-NiLfg/pYEd1N0TnZuCh8vA@public.gmane.org>
On Mon, Jun 14, 2021 at 04:43:28PM -0400, Tejun Heo wrote:
> Hello,
>
> On Mon, Jun 14, 2021 at 11:23:02AM -0400, Waiman Long wrote:
> > The current container management system is able to create the illusion
> > that applications running within a container have limited resources and
> > devices available for their use. However, one thing that is hard to hide
> > is the number of CPUs available in the system. In fact, the container
> > developers are asking for the kernel to provide such capability.
> >
> > There are two places where cpu information are available for the
> > applications to see - /proc/cpuinfo and /sys/devices/system/cpu sysfs
> > directory.
> >
> > This patchset introduces a new sysctl parameter cpuset_bound_cpuinfo
> > which, when set, will limit the amount of information disclosed by
> > /proc/cpuinfo and /sys/devices/system/cpu.
>
> The goal of cgroup has never been masquerading system information so that
> applications can pretend that they own the whole system and the proposed
> solution requires application changes anyway. The information being provided
> is useful but please do so within the usual cgroup interface - e.g.
> cpuset.stat. The applications (or libraries) that want to determine its
> confined CPU availability can locate the file through /proc/self/cgroup.
Fyi, there's another concurrent push going on to provide a new file
/proc/self/meminfo that is a subset of /proc/meminfo (cf. [1]) and
virtualizes based on cgroups as well.
But there it's a new file not virtualizing exisiting files and
directories so there things seem to be out of sync between these groups
at the same company.
In any case I would like to point out that this has a complete solution
in userspace. We have had this problem of providing virtualized
information to containers since they started existing. So we created
LXCFS in 2014 (cf. [2]) a tiny fuse fileystem to provide a virtualized
view based on cgroups and other information for containers.
The two patchsets seems like they're on the way trying to move 1:1 what
we're already doing in userspace into the kernel. LXCFS is quite well
known and widely used so it's suprising to not see it mentioned at all.
And the container people will want more then just the cpu and meminfo
stuff sooner or later. Just look at what we currently virtualize:
/proc/cpuinfo
/proc/diskstats
/proc/meminfo
/proc/stat
/proc/swaps
/proc/uptime
/proc/slabinfo
/sys/devices/system/cpu
## So for example /proc/cpuinfo
#### Host
brauner@wittgenstein|~
> grep ^processor /proc/cpuinfo
processor : 0
processor : 1
processor : 2
processor : 3
processor : 4
processor : 5
processor : 6
processor : 7
#### Container
brauner@wittgenstein|~
> lxc exec f1 -- grep ^processor /proc/cpuinfo
processor : 0
processor : 1
## and for /sys/devices/system/cpu
#### Host
brauner@wittgenstein|~
> ls -al /sys/devices/system/cpu/ | grep cpu[[:digit:]]
drwxr-xr-x 10 root root 0 Jun 14 21:22 cpu0
drwxr-xr-x 10 root root 0 Jun 14 21:22 cpu1
drwxr-xr-x 10 root root 0 Jun 14 21:22 cpu2
drwxr-xr-x 10 root root 0 Jun 14 21:22 cpu3
drwxr-xr-x 10 root root 0 Jun 14 21:22 cpu4
drwxr-xr-x 10 root root 0 Jun 14 21:22 cpu5
drwxr-xr-x 10 root root 0 Jun 14 21:22 cpu6
drwxr-xr-x 10 root root 0 Jun 14 21:22 cpu7
#### Container
brauner@wittgenstein|~
> lxc exec f1 -- ls -al /sys/devices/system/cpu/ | grep cpu[[:digit:]]
drwxr-xr-x 2 nobody nogroup 0 Jun 15 09:07 cpu3
drwxr-xr-x 2 nobody nogroup 0 Jun 15 09:07 cpu4
We have a wide variety of users from various distros.
[1]: https://lore.kernel.org/containers/f62b652c-3f6f-31ba-be0f-5f97b304599f-EcKl7qYKIbxeoWH0uzbU5w@public.gmane.org
[2]: https://github.com/lxc/lxcfs
^ permalink raw reply
* Re: [PATCH 0/4] cgroup/cpuset: Allow cpuset to bound displayed cpu info
From: Waiman Long @ 2021-06-15 2:53 UTC (permalink / raw)
To: Tejun Heo
Cc: Zefan Li, Johannes Weiner, Jonathan Corbet, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, H. Peter Anvin, Greg Kroah-Hartman,
Rafael J. Wysocki, Luis Chamberlain, Kees Cook, Iurii Zaikin, x86,
cgroups, linux-kernel, linux-doc, linux-fsdevel
In-Reply-To: <YMe/cGV4JPbzFRk0@slm.duckdns.org>
On 6/14/21 4:43 PM, Tejun Heo wrote:
> Hello,
>
> On Mon, Jun 14, 2021 at 11:23:02AM -0400, Waiman Long wrote:
>> The current container management system is able to create the illusion
>> that applications running within a container have limited resources and
>> devices available for their use. However, one thing that is hard to hide
>> is the number of CPUs available in the system. In fact, the container
>> developers are asking for the kernel to provide such capability.
>>
>> There are two places where cpu information are available for the
>> applications to see - /proc/cpuinfo and /sys/devices/system/cpu sysfs
>> directory.
>>
>> This patchset introduces a new sysctl parameter cpuset_bound_cpuinfo
>> which, when set, will limit the amount of information disclosed by
>> /proc/cpuinfo and /sys/devices/system/cpu.
> The goal of cgroup has never been masquerading system information so that
> applications can pretend that they own the whole system and the proposed
> solution requires application changes anyway. The information being provided
> is useful but please do so within the usual cgroup interface - e.g.
> cpuset.stat. The applications (or libraries) that want to determine its
> confined CPU availability can locate the file through /proc/self/cgroup.
Thanks for your comment. I understand your point making change via
cgroup interface files. However, this is not what the customers are
asking for. They are using tools that look at /proc/cpuinfo and the
sysfs files. It is a much bigger effort to make all those tools look at
a new cgroup file interface instead. It can be more efficiently done at
the kernel level.
Anyway, I am OK if the consensus is that it is not a kernel problem and
have to be handled in userspace.
BTW, do you have any comment on another cpuset patch that I sent a week
earlier?
https://lore.kernel.org/lkml/20210603212416.25934-1-longman@redhat.com/
I am looking forward for your feedback.
Cheers,
Longman
^ permalink raw reply
* [PATCH v2] docs: block/bfq: describe per-device weight
From: Kir Kolyshkin @ 2021-06-14 21:41 UTC (permalink / raw)
To: Jonathan Corbet, linux-doc
Cc: tj, axboe, paolo.valente, cgroups, Kir Kolyshkin
In-Reply-To: <8735tlbbml.fsf@meer.lwn.net>
The functionality of setting per-device weight for BFQ was added
in v5.4 (commit 795fe54c2a828099), but the documentation was never
updated.
While at it, improve formatting a bit.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
---
Documentation/block/bfq-iosched.rst | 38 ++++++++++++++++++++---------
1 file changed, 27 insertions(+), 11 deletions(-)
diff --git a/Documentation/block/bfq-iosched.rst b/Documentation/block/bfq-iosched.rst
index 66c5a4e54130..df3a8a47f58c 100644
--- a/Documentation/block/bfq-iosched.rst
+++ b/Documentation/block/bfq-iosched.rst
@@ -553,20 +553,36 @@ throughput sustainable with bfq, because updating the blkio.bfq.*
stats is rather costly, especially for some of the stats enabled by
CONFIG_BFQ_CGROUP_DEBUG.
-Parameters to set
------------------
+Parameters
+----------
-For each group, there is only the following parameter to set.
+For each group, the following parameters can be set:
-weight (namely blkio.bfq.weight or io.bfq-weight): the weight of the
-group inside its parent. Available values: 1..1000 (default 100). The
-linear mapping between ioprio and weights, described at the beginning
-of the tunable section, is still valid, but all weights higher than
-IOPRIO_BE_NR*10 are mapped to ioprio 0.
+ weight
+ This specifies the default weight for the cgroup inside its parent.
+ Available values: 1..1000 (default: 100).
-Recall that, if low-latency is set, then BFQ automatically raises the
-weight of the queues associated with interactive and soft real-time
-applications. Unset this tunable if you need/want to control weights.
+ For cgroup v1, it is set by writing the value to `blkio.bfq.weight`.
+
+ For cgroup v2, it is set by writing the value to `io.bfq.weight`.
+ (with an optional prefix of `default` and a space).
+
+ The linear mapping between ioprio and weights, described at the beginning
+ of the tunable section, is still valid, but all weights higher than
+ IOPRIO_BE_NR*10 are mapped to ioprio 0.
+
+ Recall that, if low-latency is set, then BFQ automatically raises the
+ weight of the queues associated with interactive and soft real-time
+ applications. Unset this tunable if you need/want to control weights.
+
+ weight_device
+ This specifies a per-device weight for the cgroup. The syntax is
+ `minor:major weight`. A weight of `0` may be used to reset to the default
+ weight.
+
+ For cgroup v1, it is set by writing the value to `blkio.bfq.weight_device`.
+
+ For cgroup v2, the file name is `io.bfq.weight`.
[1]
--
2.31.1
^ permalink raw reply related
* Re: [PATCH 1/3] docs: block/bfq: describe per-device weight
From: Kirill Kolyshkin @ 2021-06-14 21:03 UTC (permalink / raw)
To: Jonathan Corbet; +Cc: linux-doc, tj, axboe, paolo.valente, cgroups
In-Reply-To: <8735tlbbml.fsf@meer.lwn.net>
On Sun, Jun 13, 2021 at 4:14 PM Jonathan Corbet <corbet@lwn.net> wrote:
>
> Kir Kolyshkin <kolyshkin@gmail.com> writes:
>
> This work looks generally good, but...
>
> > The functionality of setting per-device weight for BFQ was added
> > in v5.4 (commit 795fe54c2a828099), but the documentation was never
> > updated.
> >
> > While at it, improve formatting a bit.
> >
> > Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
> > ---
> > Documentation/block/bfq-iosched.rst | 38 ++++++++++++++++++++---------
> > 1 file changed, 27 insertions(+), 11 deletions(-)
> >
> > diff --git a/Documentation/block/bfq-iosched.rst b/Documentation/block/bfq-iosched.rst
> > index 66c5a4e54130..7200152c461e 100644
> > --- a/Documentation/block/bfq-iosched.rst
> > +++ b/Documentation/block/bfq-iosched.rst
> > @@ -553,20 +553,36 @@ throughput sustainable with bfq, because updating the blkio.bfq.*
> > stats is rather costly, especially for some of the stats enabled by
> > CONFIG_BFQ_CGROUP_DEBUG.
> >
> > -Parameters to set
> > ------------------
> > +Parameters
> > +----------
> >
> > -For each group, there is only the following parameter to set.
> > +For each group, the following parameters cat be set:
>
> Could we please not introduce new typos while fixing other stuff?
Are you a dog person? :)
Thanks for catching this, corrected patch sent.
^ permalink raw reply
* [PATCH v2] docs: block/bfq: describe per-device weight
From: Kir Kolyshkin @ 2021-06-14 21:01 UTC (permalink / raw)
To: Jonathan Corbet, linux-doc
Cc: tj, axboe, paolo.valente, cgroups, Kir Kolyshkin
In-Reply-To: <8735tlbbml.fsf@meer.lwn.net>
The functionality of setting per-device weight for BFQ was added
in v5.4 (commit 795fe54c2a828099), but the documentation was never
updated.
While at it, improve formatting a bit.
Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
---
Documentation/block/bfq-iosched.rst | 38 ++++++++++++++++++++---------
1 file changed, 27 insertions(+), 11 deletions(-)
diff --git a/Documentation/block/bfq-iosched.rst b/Documentation/block/bfq-iosched.rst
index 66c5a4e54130..7200152c461e 100644
--- a/Documentation/block/bfq-iosched.rst
+++ b/Documentation/block/bfq-iosched.rst
@@ -553,20 +553,36 @@ throughput sustainable with bfq, because updating the blkio.bfq.*
stats is rather costly, especially for some of the stats enabled by
CONFIG_BFQ_CGROUP_DEBUG.
-Parameters to set
------------------
+Parameters
+----------
-For each group, there is only the following parameter to set.
+For each group, the following parameters cat be set:
-weight (namely blkio.bfq.weight or io.bfq-weight): the weight of the
-group inside its parent. Available values: 1..1000 (default 100). The
-linear mapping between ioprio and weights, described at the beginning
-of the tunable section, is still valid, but all weights higher than
-IOPRIO_BE_NR*10 are mapped to ioprio 0.
+ weight
+ This specifies the default weight for the cgroup inside its parent.
+ Available values: 1..1000 (default: 100).
-Recall that, if low-latency is set, then BFQ automatically raises the
-weight of the queues associated with interactive and soft real-time
-applications. Unset this tunable if you need/want to control weights.
+ For cgroup v1, it is set by writing the value to `blkio.bfq.weight`.
+
+ For cgroup v2, it is set by writing the value to `io.bfq.weight`.
+ (with an optional prefix of `default` and a space).
+
+ The linear mapping between ioprio and weights, described at the beginning
+ of the tunable section, is still valid, but all weights higher than
+ IOPRIO_BE_NR*10 are mapped to ioprio 0.
+
+ Recall that, if low-latency is set, then BFQ automatically raises the
+ weight of the queues associated with interactive and soft real-time
+ applications. Unset this tunable if you need/want to control weights.
+
+ weight_device
+ This specifies a per-device weight for the cgroup. The syntax is
+ `minor:major weight`. A weight of `0` may be used to reset to the default
+ weight.
+
+ For cgroup v1, it is set by writing the value to `blkio.bfq.weight_device`.
+
+ For cgroup v2, the file name is `io.bfq.weight`.
[1]
--
2.31.1
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox