cgroups.vger.kernel.org archive mirror
* Re: [PATCH v1] proc: Implement /proc/self/meminfo
       [not found] <ac070cd90c0d45b7a554366f235262fa5c566435.1622716926.git.legion@kernel.org>
@ 2021-06-15 11:32 ` Christian Brauner
  2021-06-15 12:47   ` Alexey Gladkov
  0 siblings, 1 reply; 7+ messages in thread
From: Christian Brauner @ 2021-06-15 11:32 UTC (permalink / raw)
  To: legion
  Cc: LKML, Linux Containers, Linux Containers, Linux FS Devel,
	linux-mm, Andrew Morton, Eric W . Biederman, Johannes Weiner,
	Michal Hocko, Chris Down, cgroups

On Thu, Jun 03, 2021 at 12:43:07PM +0200, legion@kernel.org wrote:
> From: Alexey Gladkov <legion@kernel.org>
> 
> /proc/meminfo contains information regardless of cgroup restrictions. The
> file is still widely used [1], which means that all these programs will
> not work correctly inside a container [2][3][4]. Some programs try to
> respect cgroup limits, but not all of them implement support for all
> cgroup versions [5].
> 
> Correct information can be obtained from cgroups, but this requires
> cgroups to be available inside the container and the right cgroup version
> to be supported.
> 
> There is lxcfs [6], which emulates /proc/meminfo using FUSE to provide
> cgroup-aware information. This patch can help it.
> 
> This patch adds /proc/self/meminfo that contains a subset of
> /proc/meminfo respecting cgroup restrictions.
> 
> We cannot just create /proc/self/meminfo and make a symlink at the old
> location because this will break the existing apparmor rules [7].
> Therefore, the patch adds a separate file with the same format.

Interesting work. Thanks. This is basically a variant of what I
suggested at Plumbers and in [1].

Judging from the patches sent by Waiman Long in [2] to also virtualize
/proc/cpuinfo and /sys/devices/system/cpu, this seems to be part of a
larger push to provide virtualized system information to containers.

Somewhere in the thread, though, this veered off into apparently just
being a way for a process to gather information about its own resources,
at which point I'm confused why looking at its cgroups isn't enough.

So /proc/self/meminfo seems to just be the start. And note that the two
approaches diverge: this one provides a new file while the other patchset
virtualizes existing procfs files and directories.

In any case it seems you might want to talk, since afaict you're all at
the same company but don't seem to be aware of each other's work (which
happens, of course).

For the sake of history, such patchsets have been pushed before by the
Siteground people.

Chris and Johannes made a good point that the information provided in
this file can already be gathered from cgroups. So applications should
probably switch to reading those values out of their cgroup, and most are
doing that already.

And reading values out of cgroups is pretty straightforward even with
the differences between cgroup v1 and v2. Userspace is doing it all over
the place all of the time and the code has existed for years, so the
cgroup interface is hardly a problem. And cgroup v2 keeps growing so
many more useful metrics that looking at meminfo isn't really cutting it
anyway.
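
To make that concrete, a minimal sketch of reading the caller's memory
limit; the mount points are the conventional ones and may differ per
distribution, a non-co-mounted v1 memory controller is assumed, and
error handling is omitted:

  if [ -f /sys/fs/cgroup/cgroup.controllers ]; then
      # cgroup v2 (unified hierarchy)
      cg=$(awk -F: '$1 == 0 { print $3 }' /proc/self/cgroup)
      cat "/sys/fs/cgroup${cg}/memory.max"
  else
      # cgroup v1 (legacy hierarchy)
      cg=$(awk -F: '$2 == "memory" { print $3 }' /proc/self/cgroup)
      cat "/sys/fs/cgroup/memory${cg}/memory.limit_in_bytes"
  fi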

So I think the argument that applications should start looking at their
cgroup info if they want detailed information is a solid one that
shouldn't be easily brushed aside.

What might be worthwhile is knowing exactly which applications are
looking at /proc/meminfo and /proc/cpuinfo and making a decision based on
that info. None of that is clearly outlined in the thread, unfortunately.

So I immediately see two types of applications that could benefit from
this patchset. The first are legacy applications that aren't aware of
cgroups and aren't actively maintained. Introducing such functionality
for these applications seems like a weak argument.

The second type is new and maintained applications that look at global
info such as /proc/meminfo and /proc/cpuinfo. Such applications have
ignored cgroups for a decade now, which makes it very unconvincing that
they would suddenly switch to a newly introduced file, especially if the
entries in the new file aren't a 1:1 mapping of the old one.

Johannes made another good point about it not being clear what
applications actually want. And he's very right in that. It seems
straightforward to virtualize things like meminfo but it actually isn't,
and it's something you quite often discover after the fact. We have
extensive experience implementing it in LXCFS in userspace. People kept,
and keep, arguing about exactly what information is supposed to go into
calculating those values, based on what best helps their use-case.

Swap was an especially contentious point. In fact, sometimes users want
to turn off swap even though it exists on the host, and there's a
command-line switch in LXCFS to control that behavior.

Another example supporting Johannes' worry is virtualizing /proc/cpuinfo,
where some people wanted to virtualize cpu counts based on cpu shares.
So we have two modes to virtualize cpus: based on cpuset alone, or based
on cpuset and cpu shares. Both modes are actively used, and that all
really depends on the application and workload.

Finally, LXCFS is briefly referenced in the commit message, but it isn't
explained very well what it is and what it does.

We should consider it, since it is a complete, existing userspace
solution to the problem this patchset solves, including Dan's JRE
use-case.

The project was started in 2014, has been in production use ever since,
and delivers the features of this patchset and more.

For example, it's used in the Linux subsystem of Chromebooks, it's used
by Alibaba (see [3]), and it is used for the JRE use-case by Google's
Anthos when migrating such legacy applications (see [4]).

At first, I was convinced we could make use of /proc/self/meminfo in
LXCFS, which is why I held back, but we can't. We can't simply bind-mount
it over /proc/meminfo because there isn't a 1:1 correspondence between
all fields. We could potentially read some of the values we now calculate
and display them in /proc/meminfo, but we can't stop virtualizing
/proc/meminfo itself, so we don't gain anything from this. When Alex
asked me about it I tried to come up with good ways to integrate this,
but the gain is just too little for us.

Our experience tells us that applications that want this type of
virtualization don't really care about their own resources. They care
about a virtualized view of the system's resources, and the system in
question is often a container. But it gets very tricky, since we don't
really define what a container is. So what data the user wants to see
depends on the container runtime used, the type of container, and the
workload. An application container has very different needs than a
system container that boots systemd. LXCFS can be very flexible here and
virtualize according to the user's preferences (see the split between
cpuset and cpuset + cpu shares virtualization for cpu counts).

In any case, LXCFS is a tiny FUSE filesystem which virtualizes various
procfs and sysfs files for a container:

/proc/cpuinfo
/proc/diskstats
/proc/meminfo
/proc/stat
/proc/swaps
/proc/uptime
/proc/slabinfo
/sys/devices/system/cpu/*
/sys/devices/system/cpu/online
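
Runtimes that don't set this up automatically typically just bind-mount
the LXCFS views over the corresponding procfs entries. A minimal sketch,
assuming the default /var/lib/lxcfs mountpoint and a sufficiently
privileged shell inside the container's mount namespace:

  for f in cpuinfo diskstats meminfo stat swaps uptime; do
      mount --bind /var/lib/lxcfs/proc/$f /proc/$f
  done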

If you run top in a container that makes use of this, it will display
everything virtualized to the container (see [5] for an example of
/proc/cpuinfo and /sys/devices/system/cpu/*), and the JRE will not
overallocate resources. It's actively used for all of that.

Below at [5] you can find an example where 2 cpus out of 8 have been
assigned to the container's cpuset. The container values are virtualized
as you can see.

[1]: https://lkml.org/lkml/2020/6/4/951
[2]: https://lore.kernel.org/lkml/YMe/cGV4JPbzFRk0@slm.duckdns.org
[3]: https://www.alibabacloud.com/blog/kubernetes-demystified-using-lxcfs-to-improve-container-resource-visibility_594109
[4]: https://cloud.google.com/blog/products/containers-kubernetes/migrate-for-anthos-streamlines-legacy-java-app-modernization
[5]: ## /sys/devices/system/cpu/*
     #### Host
     brauner@wittgenstein|~
     > ls -al /sys/devices/system/cpu/ | grep cpu[[:digit:]]
     drwxr-xr-x 10 root root    0 Jun 14 21:22 cpu0
     drwxr-xr-x 10 root root    0 Jun 14 21:22 cpu1
     drwxr-xr-x 10 root root    0 Jun 14 21:22 cpu2
     drwxr-xr-x 10 root root    0 Jun 14 21:22 cpu3
     drwxr-xr-x 10 root root    0 Jun 14 21:22 cpu4
     drwxr-xr-x 10 root root    0 Jun 14 21:22 cpu5
     drwxr-xr-x 10 root root    0 Jun 14 21:22 cpu6
     drwxr-xr-x 10 root root    0 Jun 14 21:22 cpu7
     
     #### Container
     brauner@wittgenstein|~
     > lxc exec f1 -- ls -al /sys/devices/system/cpu/ | grep cpu[[:digit:]]
     drwxr-xr-x  2 nobody nogroup   0 Jun 15 10:22 cpu3
     drwxr-xr-x  2 nobody nogroup   0 Jun 15 10:22 cpu4
     
     ## /proc/cpuinfo
     #### Host
     brauner@wittgenstein|~
     > grep ^processor /proc/cpuinfo
     processor       : 0
     processor       : 1
     processor       : 2
     processor       : 3
     processor       : 4
     processor       : 5
     processor       : 6
     processor       : 7
     
     #### Container
     brauner@wittgenstein|~
     > lxc exec f1 -- grep ^processor /proc/cpuinfo
     processor       : 0
     processor       : 1

     ## top
     #### Host
     top - 13:16:47 up 15:54, 39 users,  load average: 0,76, 0,47, 0,40
     Tasks: 434 total,   1 running, 433 sleeping,   0 stopped,   0 zombie
     %Cpu0  :  2,7 us,  2,4 sy,  0,0 ni, 94,5 id,  0,0 wa,  0,0 hi,  0,3 si,  0,0 st
     %Cpu1  :  3,3 us,  1,3 sy,  0,0 ni, 95,3 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
     %Cpu2  :  1,6 us,  9,1 sy,  0,0 ni, 89,3 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
     %Cpu3  :  2,3 us,  1,3 sy,  0,0 ni, 96,4 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
     %Cpu4  :  2,7 us,  1,7 sy,  0,0 ni, 95,7 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
     %Cpu5  :  2,9 us,  2,9 sy,  0,0 ni, 94,1 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
     %Cpu6  :  2,3 us,  1,0 sy,  0,0 ni, 96,3 id,  0,0 wa,  0,0 hi,  0,3 si,  0,0 st
     %Cpu7  :  3,3 us,  1,3 sy,  0,0 ni, 95,4 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st

     #### Container
     top - 11:16:13 up  2:08,  0 users,  load average: 0.27, 0.36, 0.36
     Tasks:  24 total,   1 running,  23 sleeping,   0 stopped,   0 zombie
     %Cpu0  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
     %Cpu1  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st


* Re: [PATCH v1] proc: Implement /proc/self/meminfo
  2021-06-15 11:32 ` [PATCH v1] proc: Implement /proc/self/meminfo Christian Brauner
@ 2021-06-15 12:47   ` Alexey Gladkov
  2021-06-16  1:09     ` Shakeel Butt
  0 siblings, 1 reply; 7+ messages in thread
From: Alexey Gladkov @ 2021-06-15 12:47 UTC (permalink / raw)
  To: Christian Brauner
  Cc: LKML, Linux Containers, Linux Containers, Linux FS Devel,
	linux-mm, Andrew Morton, Eric W . Biederman, Johannes Weiner,
	Michal Hocko, Chris Down, cgroups

On Tue, Jun 15, 2021 at 01:32:22PM +0200, Christian Brauner wrote:
> On Thu, Jun 03, 2021 at 12:43:07PM +0200, legion@kernel.org wrote:
> > From: Alexey Gladkov <legion@kernel.org>
> > 
> > /proc/meminfo contains information regardless of cgroup restrictions. The
> > file is still widely used [1], which means that all these programs will
> > not work correctly inside a container [2][3][4]. Some programs try to
> > respect cgroup limits, but not all of them implement support for all
> > cgroup versions [5].
> > 
> > Correct information can be obtained from cgroups, but this requires
> > cgroups to be available inside the container and the right cgroup version
> > to be supported.
> > 
> > There is lxcfs [6], which emulates /proc/meminfo using FUSE to provide
> > cgroup-aware information. This patch can help it.
> > 
> > This patch adds /proc/self/meminfo that contains a subset of
> > /proc/meminfo respecting cgroup restrictions.
> > 
> > We cannot just create /proc/self/meminfo and make a symlink at the old
> > location because this will break the existing apparmor rules [7].
> > Therefore, the patch adds a separate file with the same format.
> 
> Interesting work. Thanks. This is basically a variant of what I
> suggested at Plumbers and in [1].

I made the second version of the patch [1], but then I had a conversation
with Eric W. Biederman offlist. He convinced me that it is a bad idea to
change all the values in meminfo to accommodate cgroups. But we agreed
that MemAvailable in /proc/meminfo should respect cgroups limits. This
field was created to hide implementation details when calculating
available memory. You can see that it is quite widely used [2].
So I want to try to move in that direction.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/legion/linux.git/log/?h=patchset/meminfo/v2.0
[2] https://codesearch.debian.net/search?q=MemAvailable%3A
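
For reference, most of those hits boil down to a one-liner of this kind
(illustrative), and the value it returns today is always the system-wide
one, no matter which memory cgroup the reader is running in:

$ awk '/^MemAvailable:/ { print $2, $3 }' /proc/meminfo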

> Judging from the patches sent by Waiman Long in [2] to also virtualize
> /proc/cpuinfo and /sys/devices/system/cpu, this seems to be part of a
> larger push to provide virtualized system information to containers.
> 
> Somewhere in the thread, though, this veered off into apparently just
> being a way for a process to gather information about its own resources,
> at which point I'm confused why looking at its cgroups isn't enough.

I think it's not enough. As an example:

$ mount -t cgroup2 none /sys/fs/cgroup

$ echo +memory > /sys/fs/cgroup/cgroup.subtree_control
$ mkdir /sys/fs/cgroup/mem0

$ echo +memory > /sys/fs/cgroup/mem0/cgroup.subtree_control
$ mkdir /sys/fs/cgroup/mem0/mem1

$ echo $$ > /sys/fs/cgroup/mem0/mem1/cgroup.procs

I didn't set a limit and just added the shell to the group.

$ cat /proc/self/cgroup 
0::/mem0/mem1
$ cat /sys/fs/cgroup/mem0/mem1/memory.max 
max
$ cat /sys/fs/cgroup/mem0/memory.max 
max

In this case we need to use MemAvailable from /proc/meminfo.

Another example:

$ mount -t cgroup2 none /sys/fs/cgroup

$ echo +memory > /sys/fs/cgroup/cgroup.subtree_control
$ mkdir /sys/fs/cgroup/mem0
$ echo $(( 3 * 1024 * 1024 )) > /sys/fs/cgroup/mem0/memory.max

$ echo +memory > /sys/fs/cgroup/mem0/cgroup.subtree_control
$ mkdir /sys/fs/cgroup/mem0/mem1
$ echo $(( 3 * 1024 * 1024 * 1024 * 1024 )) > /sys/fs/cgroup/mem0/mem1/memory.max

$ echo $$ > /sys/fs/cgroup/mem0/mem1/cgroup.procs

$ head -3 /proc/meminfo  
MemTotal:        1002348 kB
MemFree:          972712 kB
MemAvailable:     968100 kB

$ cat /sys/fs/cgroup/mem0{,/mem1}/memory.max  
3145728
3298534883328

Now, I have cgroup limits, but you can write absolutely any value as a
limit. So how much memory is available to the shell in this case? To get
this value, you need to take the minimum of MemAvailable and every
memory.max up the hierarchy (**/memory.max).
... or I fundamentally don't understand something.
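
A minimal sketch of that calculation from userspace (cgroup v2 assumed
to be mounted at /sys/fs/cgroup; memory.current, memory.high, reclaim
protection and swap are all ignored, which is exactly the kind of detail
a kernel-provided value could hide):

  avail_kb=$(awk '/^MemAvailable:/ { print $2 }' /proc/meminfo)
  cg=$(awk -F: '$1 == 0 { print $3 }' /proc/self/cgroup)
  while [ -n "$cg" ]; do
      max=$(cat "/sys/fs/cgroup${cg}/memory.max" 2>/dev/null)
      if [ -n "$max" ] && [ "$max" != "max" ]; then
          max_kb=$((max / 1024))
          [ "$max_kb" -lt "$avail_kb" ] && avail_kb=$max_kb
      fi
      [ "$cg" = "/" ] && break
      cg=$(dirname "$cg")
  done
  echo "${avail_kb} kB available to this shell"

For the example above this ends up at 3072 kB (the 3M limit on mem0),
not anything close to what /proc/meminfo reports.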

-- 
Rgrds, legion



* Re: [PATCH v1] proc: Implement /proc/self/meminfo
  2021-06-15 12:47   ` Alexey Gladkov
@ 2021-06-16  1:09     ` Shakeel Butt
  2021-06-16 16:17       ` Eric W. Biederman
  0 siblings, 1 reply; 7+ messages in thread
From: Shakeel Butt @ 2021-06-16  1:09 UTC (permalink / raw)
  To: Alexey Gladkov
  Cc: Christian Brauner, LKML, Linux Containers, Linux Containers,
	Linux FS Devel, Linux MM, Andrew Morton, Eric W . Biederman,
	Johannes Weiner, Michal Hocko, Chris Down, Cgroups

On Tue, Jun 15, 2021 at 5:47 AM Alexey Gladkov <legion@kernel.org> wrote:
>
[...]
>
> I made the second version of the patch [1], but then I had a conversation
> with Eric W. Biederman offlist. He convinced me that it is a bad idea to
> change all the values in meminfo to accommodate cgroups. But we agreed
> that MemAvailable in /proc/meminfo should respect cgroups limits. This
> field was created to hide implementation details when calculating
> available memory. You can see that it is quite widely used [2].
> So I want to try to move in that direction.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/legion/linux.git/log/?h=patchset/meminfo/v2.0
> [2] https://codesearch.debian.net/search?q=MemAvailable%3A
>

Please see following two links on the previous discussion on having
per-memcg MemAvailable stat.

[1] https://lore.kernel.org/linux-mm/alpine.DEB.2.22.394.2006281445210.855265@chino.kir.corp.google.com/
[2] https://lore.kernel.org/linux-mm/alpine.DEB.2.23.453.2007142018150.2667860@chino.kir.corp.google.com/

MemAvailable itself is an imprecise metric, and involving memcg makes
it even more weird. The difference in swap accounting semantics between
v1 and v2 is one source of this weirdness (I have not checked whether
your patch handles it). The lazyfree and deferred-split pages are
another source.

So, I am not sure if complicating an already imprecise metric will
make it more useful.


* Re: [PATCH v1] proc: Implement /proc/self/meminfo
  2021-06-16  1:09     ` Shakeel Butt
@ 2021-06-16 16:17       ` Eric W. Biederman
  2021-06-18 17:03         ` Michal Hocko
  2021-06-18 23:38         ` Shakeel Butt
  0 siblings, 2 replies; 7+ messages in thread
From: Eric W. Biederman @ 2021-06-16 16:17 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Alexey Gladkov, Christian Brauner, LKML, Linux Containers,
	Linux Containers, Linux FS Devel, Linux MM, Andrew Morton,
	Johannes Weiner, Michal Hocko, Chris Down, Cgroups

Shakeel Butt <shakeelb@google.com> writes:

> On Tue, Jun 15, 2021 at 5:47 AM Alexey Gladkov <legion@kernel.org> wrote:
>>
> [...]
>>
>> I made the second version of the patch [1], but then I had a conversation
>> with Eric W. Biederman offlist. He convinced me that it is a bad idea to
>> change all the values in meminfo to accommodate cgroups. But we agreed
>> that MemAvailable in /proc/meminfo should respect cgroups limits. This
>> field was created to hide implementation details when calculating
>> available memory. You can see that it is quite widely used [2].
>> So I want to try to move in that direction.
>>
>> [1] https://git.kernel.org/pub/scm/linux/kernel/git/legion/linux.git/log/?h=patchset/meminfo/v2.0
>> [2] https://codesearch.debian.net/search?q=MemAvailable%3A
>>
>
> Please see following two links on the previous discussion on having
> per-memcg MemAvailable stat.
>
> [1] https://lore.kernel.org/linux-mm/alpine.DEB.2.22.394.2006281445210.855265@chino.kir.corp.google.com/
> [2] https://lore.kernel.org/linux-mm/alpine.DEB.2.23.453.2007142018150.2667860@chino.kir.corp.google.com/
>
> MemAvailable itself is an imprecise metric, and involving memcg makes
> it even more weird. The difference in swap accounting semantics between
> v1 and v2 is one source of this weirdness (I have not checked whether
> your patch handles it). The lazyfree and deferred-split pages are
> another source.
>
> So, I am not sure if complicating an already imprecise metric will
> make it more useful.

Making a good guess at how much memory can be allocated without
triggering swapping or otherwise stressing the system is something that
requires understanding our mm internals.

To be able to continue changing the mm or even mm policy without
introducing regressions in userspace we need to export values that
userspace can use.

At a first approximation that seems to look like MemAvailable.

MemAvailable seems to have a good definition.  Roughly the amount of
memory that can be allocated without triggering swapping. Updated to also
not trigger memory-cgroup-based swapping, and it sounds good.

I don't know if it will work in practice but I think it is worth
exploring.

I do know that hiding the implementation details and providing userspace
with information it can directly use seems like the programming model
that needs to be explored.  Most programs should not care if they are in
a memory cgroup, etc.  Programs, load management systems, and even
balloon drivers have a legitimate interest in how much additional load
can be placed on a system's memory.


A version of this that I remember working fairly well is free space
on compressed filesystems.  As I recall compressed filesystems report
the amount of uncompressed space that is available (an underestimate).
This results in the amount of space consumed going up faster than the
free space goes down.

We can't do exactly the same thing with our memory usability estimate,
but having our estimate be a reliable underestimate might be enough
to avoid problems with reporting too much memory as available to
userspace.

I know that MemAvailable already does that /2 so maybe it is already
aiming at being an underestimate.  Perhaps we need some additional
accounting to help create a useful metric for userspace as well.


I don't know the final answer.  I do know that not designing an
interface that userspace can use to deal with its legitimate concerns
is sticking our collective heads in the sand and wishing the problem
will go away.

Eric



* Re: [PATCH v1] proc: Implement /proc/self/meminfo
  2021-06-16 16:17       ` Eric W. Biederman
@ 2021-06-18 17:03         ` Michal Hocko
  2021-06-18 23:38         ` Shakeel Butt
  1 sibling, 0 replies; 7+ messages in thread
From: Michal Hocko @ 2021-06-18 17:03 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Shakeel Butt, Alexey Gladkov, Christian Brauner, LKML,
	Linux Containers, Linux Containers, Linux FS Devel, Linux MM,
	Andrew Morton, Johannes Weiner, Chris Down, Cgroups

On Wed 16-06-21 11:17:38, Eric W. Biederman wrote:
[...]
> MemAvailable seems to have a good definition.  Roughly the amount of
> memory that can be allocated without triggering swapping. Updated to also
> not trigger memory-cgroup-based swapping, and it sounds good.

Yes, this definition is at least understandable, but how do you want to
define it in the memcg scope? There are two different sources of memory
pressure when dealing with memcgs: an internal one when a limit is hit,
and an external one when the source of the reclaim comes from higher up
the hierarchy (including global memory pressure). The former would be
quite easy to mimic with the global semantic, but the latter gets much
more complex very quickly:
a) you would need a snapshot of the whole cgroup tree and evaluate it
   against the global memory state,
b) you would have to consider memory reclaim protection, and
c) the external memory pressure is distributed proportionally to the
   size most of the time, which is yet another complication.
And there are other challenges that have already been discussed.

That being said, this might be possible to implement, but I am not
really sure it is viable, and I strongly suspect that it will be
unreliable in many situations in the context of "how much you can
allocate without swapping".
-- 
Michal Hocko
SUSE Labs


* Re: [PATCH v1] proc: Implement /proc/self/meminfo
  2021-06-16 16:17       ` Eric W. Biederman
  2021-06-18 17:03         ` Michal Hocko
@ 2021-06-18 23:38         ` Shakeel Butt
  2021-06-21 18:20           ` Enrico Weigelt, metux IT consult
  1 sibling, 1 reply; 7+ messages in thread
From: Shakeel Butt @ 2021-06-18 23:38 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Alexey Gladkov, Christian Brauner, LKML, Linux Containers,
	Linux Containers, Linux FS Devel, Linux MM, Andrew Morton,
	Johannes Weiner, Michal Hocko, Chris Down, Cgroups

On Wed, Jun 16, 2021 at 9:17 AM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Shakeel Butt <shakeelb@google.com> writes:
>
> > On Tue, Jun 15, 2021 at 5:47 AM Alexey Gladkov <legion@kernel.org> wrote:
> >>
> > [...]
> >>
> >> I made the second version of the patch [1], but then I had a conversation
> >> with Eric W. Biederman offlist. He convinced me that it is a bad idea to
> >> change all the values in meminfo to accommodate cgroups. But we agreed
> >> that MemAvailable in /proc/meminfo should respect cgroups limits. This
> >> field was created to hide implementation details when calculating
> >> available memory. You can see that it is quite widely used [2].
> >> So I want to try to move in that direction.
> >>
> >> [1] https://git.kernel.org/pub/scm/linux/kernel/git/legion/linux.git/log/?h=patchset/meminfo/v2.0
> >> [2] https://codesearch.debian.net/search?q=MemAvailable%3A
> >>
> >
> > Please see following two links on the previous discussion on having
> > per-memcg MemAvailable stat.
> >
> > [1] https://lore.kernel.org/linux-mm/alpine.DEB.2.22.394.2006281445210.855265@chino.kir.corp.google.com/
> > [2] https://lore.kernel.org/linux-mm/alpine.DEB.2.23.453.2007142018150.2667860@chino.kir.corp.google.com/
> >
> > MemAvailable itself is an imprecise metric, and involving memcg makes
> > it even more weird. The difference in swap accounting semantics between
> > v1 and v2 is one source of this weirdness (I have not checked whether
> > your patch handles it). The lazyfree and deferred-split pages are
> > another source.
> >
> > So, I am not sure if complicating an already imprecise metric will
> > make it more useful.
>
> Making a good guess at how much memory can be allocated without
> triggering swapping or otherwise stressing the system is something that
> requires understanding our mm internals.
>
> To be able to continue changing the mm or even mm policy without
> introducing regressions in userspace we need to export values that
> userspace can use.

The issue is the dependence of such exported values on mm internals.
MM-internal code and policy changes will change this value, and there
is a potential for userspace regressions.

>
> At a first approximation that seems to look like MemAvailable.
>
> MemAvailable seems to have a good definition.  Roughly the amount of
> memory that can be allocated without triggering swapping.

Nowadays, I don't think MemAvailable giving the "amount of memory that
can be allocated without triggering swapping" is even roughly accurate.
Actually IMO "without triggering swap" is not something an application
should concern itself with, given that refaults from some swap types
(zswap/swap-on-zram) are much faster than refaults from disk.

> Updated to also
> not trigger memory-cgroup-based swapping, and it sounds good.
>
> I don't know if it will work in practice but I think it is worth
> exploring.

I agree.

>
> I do know that hiding the implementation details and providing userspace
> with information it can directly use seems like the programming model
> that needs to be explored.  Most programs should not care if they are in
> a memory cgroup, etc.  Programs, load management systems, and even
> balloon drivers have a legitimate interest in how much additional load
> can be placed on a system's memory.
>

How much additional load can be placed on a system *until what*. I
think we should focus more on the "until" part to make the problem
more tractable.

>
> A version of this that I remember working fairly well is free space
> on compressed filesystems.  As I recall compressed filesystems report
> the amount of uncompressed space that is available (an underestimate).
> This results in the amount of space consumed going up faster than the
> free space goes down.
>
> We can't do exactly the same thing with our memory usability estimate,
> but having our estimate be a reliable underestimate might be enough
> to avoid problems with reporting too much memory as available to
> userspace.
>
> I know that MemAvailable already does that /2 so maybe it is already
> aiming at being an underestimate.  Perhaps we need some additional
> accounting to help create a useful metric for userspace as well.
>

The real challenge here is that we are not 100% sure if a page is
reclaimable until we try to reclaim it. For example, we might have file
LRUs filled with lazyfree pages which might have been accessed.
MemAvailable will show half the size of the file LRUs, but once we try
to reclaim those pages, we have to move them back to the anon LRU and
MemAvailable drops drastically.

>
> I don't know the final answer.  I do know that not designing an
> interface that userspace can use to deal with its legitimate concerns
> is sticking our collective heads in the sand and wishing the problem
> will go away.

I am a bit skeptical that a single interface would be enough, but first
we should formalize what exactly the application wants, with some
concrete use-cases. More specifically, are the applications interested
in avoiding swapping, OOM, or stalls?

Second, is the reactive approach acceptable? Instead of an upfront
number representing the room for growth, how about just growing and
backing off when some event (OOM or stall) which we want to avoid is
about to happen? This is achievable today for OOM and stalls with PSI
and memory.high, and it avoids the hard problem of reliably estimating
the reclaimable memory.
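
A minimal sketch of that reactive loop, assuming cgroup v2 with PSI
enabled; the cgroup path, the 512M memory.high value, the 10% stall
threshold and the 1-second poll are all arbitrary examples:

  cg=/sys/fs/cgroup/mem0
  echo $((512 * 1024 * 1024)) > "$cg/memory.high"  # soft ceiling, not a hard limit
  while sleep 1; do
      # "some avg10": share of the last 10s in which at least one task stalled on memory
      avg10=$(awk '$1 == "some" { sub("avg10=", "", $2); print $2 }' "$cg/memory.pressure")
      if awk -v p="$avg10" 'BEGIN { exit !(p > 10.0) }'; then
          echo "memory pressure at ${avg10}%: back off / shrink caches"
      fi
  done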


* Re: [PATCH v1] proc: Implement /proc/self/meminfo
  2021-06-18 23:38         ` Shakeel Butt
@ 2021-06-21 18:20           ` Enrico Weigelt, metux IT consult
  0 siblings, 0 replies; 7+ messages in thread
From: Enrico Weigelt, metux IT consult @ 2021-06-21 18:20 UTC (permalink / raw)
  To: Shakeel Butt, Eric W. Biederman
  Cc: Alexey Gladkov, Christian Brauner, LKML, Linux Containers,
	Linux Containers, Linux FS Devel, Linux MM, Andrew Morton,
	Johannes Weiner, Michal Hocko, Chris Down, Cgroups

On 19.06.21 01:38, Shakeel Butt wrote:

> Nowadays, I don't think MemAvailable giving the "amount of memory that
> can be allocated without triggering swapping" is even roughly accurate.
> Actually IMO "without triggering swap" is not something an application
> should concern itself with, given that refaults from some swap types
> (zswap/swap-on-zram) are much faster than refaults from disk.

If we're talking about things like database workloads, there IMHO isn't
anything really better than doing measurements with the actual loads
and tuning incrementally.

But what is the actual optimization goal; why might an application want
to know where swapping begins? Compute performance? Caching + IO latency
or throughput? Network traffic (e.g. with iSCSI)? Power consumption?

>> I do know that hiding the implementation details and providing userspace
>> with information it can directly use seems like the programming model
>> that needs to be explored.  Most programs should not care if they are in
>> a memory cgroup, etc.  Programs, load management systems, and even
>> balloon drivers have a legitimate interest in how much additional load
>> can be placed on a system's memory.

What kind of load exactly? CPU? Disk IO? Network?

> How much additional load can be placed on a system *until what*. I
> think we should focus more on the "until" part to make the problem
> more tractable.

ACK. The interesting question is what to do in that case.

An obvious move by a database system could be e.g. filling only as much
cache as there is spare physical RAM, in order to avoid useless swapping
(since we'd potentially produce more IO load when a cache is written
out to swap instead of just being discarded).

But, this also depends ...

#1: the application doesn't know the actual performance of the swap
device, e.g. the already mentioned zswap and friends, or some fast nvmem
for swap vs. disk for storage.

#2: caches might also be implemented indirectly by mmap()ing the storage
file/device and thus using the kernel's page cache. In that case, the
kernel would automatically discard the pages without going to swap. Of
course that only works if the cache is nothing but a copy of pages from
storage into RAM.

A completely different scenario would be load management on a cluster
like k8s. Here we usually care about cluster performance (not so much
about individual nodes), but want to prevent individual nodes from being
overloaded. Since we usually don't know much about the individual
workload, we probably don't have much choice other than continuous
monitoring and acting when a node is getting too busy - or trying to
balance new workloads as they are started, based on current system load
(and other metrics). In that case, I don't see where this new proc file
would be of much help.

> Second, is the reactive approach acceptable? Instead of an upfront
> number representing the room for growth, how about just growing and
> backing off when some event (OOM or stall) which we want to avoid is
> about to happen? This is achievable today for OOM and stalls with PSI
> and memory.high, and it avoids the hard problem of reliably estimating
> the reclaimable memory.

I tend to believe that for certain use cases it would be helpful if an
application got notified when some of its pages are about to be swapped
out due to memory pressure. Then it could decide on its own whether it
should drop certain caches in order to prevent swapping.


--mtx

-- 
---
Note: unencrypted e-mails can easily be intercepted and manipulated!
For confidential communication, please send your GPG/PGP key.
---
Enrico Weigelt, metux IT consult
Free software and Linux embedded engineering
info@metux.net -- +49-151-27565287


end of thread

Thread overview: 7+ messages:
     [not found] <ac070cd90c0d45b7a554366f235262fa5c566435.1622716926.git.legion@kernel.org>
2021-06-15 11:32 ` [PATCH v1] proc: Implement /proc/self/meminfo Christian Brauner
2021-06-15 12:47   ` Alexey Gladkov
2021-06-16  1:09     ` Shakeel Butt
2021-06-16 16:17       ` Eric W. Biederman
2021-06-18 17:03         ` Michal Hocko
2021-06-18 23:38         ` Shakeel Butt
2021-06-21 18:20           ` Enrico Weigelt, metux IT consult
