Re: cgroup information proc file format


* Re: cgroup information proc file format
       [not found]     ` <20110811215238.GC17349@peqn>
@ 2011-10-03  8:15       ` Glauber Costa
  2011-10-04  2:42         ` Serge E. Hallyn
  0 siblings, 1 reply; 6+ messages in thread
From: Glauber Costa @ 2011-10-03  8:15 UTC (permalink / raw)
  To: Serge Hallyn; +Cc: Daniel Lezcano, linux-kernel, Balbir Singh, Paul Menage

On 08/12/2011 01:52 AM, Serge Hallyn wrote:
> Quoting Daniel Lezcano (daniel.lezcano@free.fr):
>> On 08/11/2011 11:30 PM, Glauber Costa wrote:
>>> On 08/11/2011 05:55 PM, Daniel Lezcano wrote:
>>>> Hi all,
>>>>
>>>> the cgroup cpuset and memory reduce access to a part of the resources on
>>>> the system. Some applications use the /proc/cpuinfo and /proc/meminfo to
>>>> allocate the resources. For instance, HPC jobs look at /proc/cpuinfo to
>>>> fork the number of cpu found in this file either look at /proc/meminfo
>>>> to allocate a big chunk of memory. Each process set the affinity on each
>>>> cpu, which in case a subset of cpus is used, some affinity will fail.
>>>>
>>>> In the case of the container, the cgroup is used to reduce the memory or
>>>> to assign a cpu to the container. Unfortunately, as this partitioning is
>>>> not reflected in /proc, the different system tools (ps, top, free, ...)
>>>> show a wrong information.
>>>>
>>>> I was wondering if that would make sense to create for the different
>>>> cgroup subsystem, when it is relevant, a proc formatted file we can bind
>>>> mount /proc.
>>>>
>>>> For example: /cgroup/memory.proc and /cgroup/cpuset.proc
>
> I think it's a great idea.
>
> -serge

[ sorry for those who are getting this twice:
   The containers mailing list still seems not to be working, and Paul
   and Balbir changed their addresses in the meantime. So I am resending
   it to lkml and the right addrs instead. ]

Food for thought:

In my last /proc-related series, in which most of you were copied, I 
tried to implement my understanding of this idea for /proc/stat.

For whoever didn't see it, you can find a slightly outdated but still 
valid version of it at http://lwn.net/Articles/460310/

While doing it, however, something occurred to me. I'd like to know what 
you think.

As much as I like the idea proposed by Daniel (bind-mounting proc files 
from the cgroup to inside the container namespace), what I dislike about 
it is the amount of setup involved (one bind mount per file) and the 
fact that we need to know in advance which files to expect (which I more 
or less tried to work around by adopting a directory-like naming
convention).
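
To make the setup cost concrete, here is a minimal sketch of the
one-bind-mount-per-file approach. The memory.proc and cpuset.proc names
come from Daniel's proposal; the mapping to /proc targets and the
container root path are assumptions made only for illustration. The
function builds the commands and does not run them.

```python
from pathlib import PurePosixPath

# Hypothetical mapping: cgroup-provided file -> proc file it would shadow.
# Every new file needs a new entry here, which is the "know in advance"
# problem described above.
PROC_TARGETS = {
    "memory.proc": "meminfo",
    "cpuset.proc": "cpuinfo",
}


def bind_mount_commands(cgroup_dir, container_root):
    """Return one `mount --bind` argv per file, sorted by source name."""
    cmds = []
    for src_name, proc_name in sorted(PROC_TARGETS.items()):
        src = PurePosixPath(cgroup_dir) / src_name
        dst = PurePosixPath(container_root) / "proc" / proc_name
        cmds.append(["mount", "--bind", str(src), str(dst)])
    return cmds
```

The list grows by one entry for every proc file that should be
virtualized, and a container manager has to be updated each time.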

In general, we are doing containers, using both namespaces and cgroups, 
two entities that are very loosely coupled. While I agree that such a 
loose coupling is not the end of the world - and quite desirable in the 
general case -  so far I don't feel 100 % comfortable with that. So, 
here it is: feel free to shoot to kill if you dislike the idea.

What if we try to couple them a bit more strongly? My idea is:

1) Naming a certain namespace. For starters, we could use any pid inside
a namespace to name it, usually the first one to be created, but really, 
any of them. (Or any other mechanism in the future)

2) Create standard cgroup files, like pid_namespace, net_namespace, etc.

3) If those files are empty, no coupling takes place. (Or maybe we forget 
about this special case and just have '1' as the default content.)

4) If a pid number is written to one of them, that particular namespace 
is considered tied to the cgroup. proc files that show per-ns information 
are already displayed per-ns. We would then proceed to classify the 
remainder according to the type of information they convey: net file, 
cpu file, memory file, io file, etc.

5) When a task inside a cgroup reads a file, it gets the data according 
to the namespace it belongs to.
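
Steps 1 to 5 can be sketched as a small userspace model. Nothing here
is kernel code: the Cgroup and Task classes, the ns_files and proc_data
fields, and the read_proc() helper are all hypothetical names chosen
for illustration. A namespace is "named" by a pid, as in step 1.

```python
from dataclasses import dataclass, field


@dataclass
class Cgroup:
    name: str
    # Hypothetical per-cgroup files (step 2): namespace kind -> pid that
    # names the bound namespace. A missing key means "file is empty".
    ns_files: dict = field(default_factory=dict)
    # Hypothetical per-cgroup view of proc files, keyed by file name.
    proc_data: dict = field(default_factory=dict)


@dataclass
class Task:
    pid: int
    cgroup: Cgroup
    # Namespace kind -> pid naming the namespace this task lives in.
    namespaces: dict


def read_proc(task, kind, filename, system_wide):
    """Return the cgroup view if the task's namespace is bound (steps 4-5)."""
    bound = task.cgroup.ns_files.get(kind)
    if bound is not None and bound == task.namespaces.get(kind):
        return task.cgroup.proc_data.get(filename, system_wide[filename])
    # Step 3: no binding, so no coupling; fall back to system-wide data.
    return system_wide[filename]
```

A task in a cgroup with no binding, or in a different namespace than
the bound one, keeps seeing the system-wide values.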

This idea is almost setup-free (with the exception of dumping pids into 
the cgroup files, but if the files are default for all cgroups, a 3-line 
loop can do it in a very future-proof way). But what really appeals 
to me about it is that it is a mechanism for coupling those two 
entities, which in our case should be the same. It provides stronger 
guarantees that we will never be able to see any data other than what 
we are entitled to, even if we get the bind mounts set up wrongly.
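
The short loop mentioned above could look roughly like this. It
assumes the proposed files follow a "*_namespace" naming pattern
(pid_namespace, net_namespace, ...) inside the cgroup directory; no
such files exist today, so the pattern is an assumption.

```python
import glob
import os


def bind_namespaces(cgroup_dir, pid):
    """Write `pid` into every *_namespace file found in cgroup_dir.

    Globbing instead of listing names means a namespace type added
    later is picked up without changing the loop.
    """
    paths = sorted(glob.glob(os.path.join(cgroup_dir, "*_namespace")))
    for path in paths:
        with open(path, "w") as f:
            f.write(f"{pid}\n")
    return paths
```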

(disclaimer: wild idea ahead)
We could, for instance, code it in such a way that if a certain proc file 
is per-namespace, the task gets no data at all unless a cgroup binding 
is set, providing stronger isolation guarantees.

It is also easy to check whether a task that does not belong to a 
namespace is present in a namespaced cgroup. We can easily disallow that, 
preventing a rogue process from escaping and eating resources from a 
container.

The list goes on.

Please tell me what you think.


* Re: cgroup information proc file format
  2011-10-03  8:15       ` cgroup information proc file format Glauber Costa
@ 2011-10-04  2:42         ` Serge E. Hallyn
  2011-10-04  6:17           ` Glauber Costa
  0 siblings, 1 reply; 6+ messages in thread
From: Serge E. Hallyn @ 2011-10-04  2:42 UTC (permalink / raw)
  To: Glauber Costa; +Cc: Daniel Lezcano, linux-kernel, Balbir Singh, Paul Menage

Quoting Glauber Costa (glommer@parallels.com):
> On 08/12/2011 01:52 AM, Serge Hallyn wrote:
> >Quoting Daniel Lezcano (daniel.lezcano@free.fr):
> >>On 08/11/2011 11:30 PM, Glauber Costa wrote:
> >>>On 08/11/2011 05:55 PM, Daniel Lezcano wrote:
> >>>>Hi all,
> >>>>
> >>>>the cgroup cpuset and memory reduce access to a part of the resources on
> >>>>the system. Some applications use the /proc/cpuinfo and /proc/meminfo to
> >>>>allocate the resources. For instance, HPC jobs look at /proc/cpuinfo to
> >>>>fork the number of cpu found in this file either look at /proc/meminfo
> >>>>to allocate a big chunk of memory. Each process set the affinity on each
> >>>>cpu, which in case a subset of cpus is used, some affinity will fail.
> >>>>
> >>>>In the case of the container, the cgroup is used to reduce the memory or
> >>>>to assign a cpu to the container. Unfortunately, as this partitioning is
> >>>>not reflected in /proc, the different system tools (ps, top, free, ...)
> >>>>show a wrong information.
> >>>>
> >>>>I was wondering if that would make sense to create for the different
> >>>>cgroup subsystem, when it is relevant, a proc formatted file we can bind
> >>>>mount /proc.
> >>>>
> >>>>For example: /cgroup/memory.proc and /cgroup/cpuset.proc
> >
> >I think it's a great idea.
> >
> >-serge
> 
> [ sorry for those who are getting this twice:
>   The containers mailing list seems to be still not working, and Paul
>   and Balbir changed their addresses in the mean time. So I am resending
>   it to lkml and the right addrs instead. ]
> 
> Food for thought:
> 
> In my last /proc-related series, in which most of you were copied, I
> tried to implement my understanding of this idea for /proc/stat.
> 
> For whoever didn't see it, you can find a slightly outdated but
> still valid version of it at http://lwn.net/Articles/460310/
> 
> While doing it, however, something occurred to me. I'd like to know
> what you think.
> 
> As much as I like the idea proposed by Daniel (bind-mounting proc
> files from the cgroup to inside the container namespace), what I
> dislike about it is the amount of setup involved - one bind mount
> per file -, and the fact that we need to know in advance which files
> to expect (which I more or less tried to work around by
> conventioning a directory-like naming).
> 
> In general, we are doing containers, using both namespaces and
> cgroups, two entities that are very loosely coupled. While I agree
> that such a loose coupling is not the end of the world - and quite
> desirable in the general case -  so far I don't feel 100 %
> comfortable with that. So, here it is: feel free to shoot to kill if
> you dislike the idea.
> 
> What if we try to couple them a bit more strongly ? My idea is:
> 
> 1) Naming a certain namespace. For starters, we could use any pid inside
> a namespace to name it, usually the first one to be created, but
> really, any of them. (Or any other mechanism in the future)

Naming namespaces is something we've been trying to avoid (because that
introduces a new namespace), but note that /proc/self/ns/ now has
files which you can use for comparing and entering some, and soon all,
namespaces.  Hopefully we can somehow use these rather than using pids
to identify namespaces?
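
As a rough illustration of comparing namespaces through these files:
on current kernels, namespaces(7) documents that two processes share a
namespace when the corresponding /proc/<pid>/ns/<kind> entries have the
same device ID and inode number. The helper names below are made up,
and behavior on kernels from the time of this thread may differ.

```python
import os


def ns_path(pid, kind):
    """Path of the namespace file for `pid`, e.g. /proc/1/ns/net."""
    return f"/proc/{pid}/ns/{kind}"


def same_object(path_a, path_b):
    """True if both paths resolve to the same (device, inode) pair."""
    a, b = os.stat(path_a), os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


def same_namespace(pid_a, pid_b, kind="net"):
    """Compare two tasks' namespaces of one kind without naming them."""
    return same_object(ns_path(pid_a, kind), ns_path(pid_b, kind))
```

No userspace-visible namespace name is needed for this check; only the
identity of the file matters.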

But actually... :

> 2) Create standard cgroup files, like pid_namespace, net_namespace, etc.
> 
> 3) If those files are empty, no coupling takes place (Or maybe we
> forget about this special case, and just have '1' as its default
> content.
> 
> 4) If there is a pid number written on it, that particular namespace
> is considered tied to a cgroup. proc files that shows per-ns
> information are already displayed per-ns. We would then proceed to
> classify the remainder according to the type of information they
> convey: net file, cpu file, memory file, io file, etc.
> 
> 5) When a task inside a cgroup reads a file, it gets the data
> according to the namespace it belongs.

I think Daniel has thought a bit along these lines as well.  I don't
think it needs to be particularly complicated.  We don't really need
userspace involved, so actually we shouldn't need (userspace-visible)
namespace identifiers, right?  Can't we just introduce the
/sys/fs/cgroup/memory/memory.proc etc files, and have the procfs code,
if cgroups are enabled and the task's memory cgroup != '/', return
the data from that file?

We might also want to have a /sys/fs/cgroup/memory/memory.show_proc_data
(etc) file which defaults to 1 (show the cgroup's file data in place of
/proc/meminfo), which can be set to 0 on the host so that the container,
if it wants, can see the host's data.
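
A sketch of the decision described here, as plain logic rather than
procfs code. The function name and its return values are invented for
illustration; memory.show_proc_data is the file name suggested above
and does not exist in any kernel.

```python
def meminfo_source(cgroups_enabled, task_memcg, show_proc_data=1):
    """Pick which data a read of /proc/meminfo would return.

    task_memcg is the task's memory cgroup path ('/' is the root).
    show_proc_data models the proposed per-cgroup toggle, default 1.
    """
    if cgroups_enabled and task_memcg != "/" and show_proc_data:
        return "cgroup"
    return "host"
```

With the default of 1, any task outside the root memory cgroup gets
the cgroup view; the host can write 0 to hand back the host view.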

> This idea is almost setup-free (with the exception of dumping pids
> into the cgroup files, but if the files are default for all cgroups,
> a 3-line loop can do it in a very future-proof way). But in reality,
> what appeals to me about it, is that it is a mechanism for coupling
> those two
> entities that in our case, should be the same. It provides stronger
> guarantees that we will never be able to see any data outside the
> ones we are untitled to, even we get the bind mounts setup wrongly.
>
> (disclaimer: wild idea ahead)
> If we, for instance, code in such a way that if a certain proc-file
> is per-namespace, the task could get no data at all unless a
> cgroup-binding is set, providing stronger isolation guarantees.

Are there good reasons to worry about guaranteeing this particular
isolation?  My impression was that this stuff is useful for the
application - the better it can calculate the resources available
to it, the better it can get along with others and avoid getting killed
later.  But I didn't think our goal was to try and hide the host
info from the container - we just want to give it the most meaningful
info.

(That's probably also why this stuff has been languishing - it's
rather low in priority because unlike other things it won't harm
the host)

> It is also easy to check if a task that do not belong to a namespace
> is present in a namespaced cgroup. We can easily disallow that,
> preventing rogue process to escape and eat resources from a
> container.
> 
> The list goes on.
> 
> Please tell me what you think.

thanks,
-serge


* Re: cgroup information proc file format
  2011-10-04  2:42         ` Serge E. Hallyn
@ 2011-10-04  6:17           ` Glauber Costa
  2011-10-04 14:05             ` Serge Hallyn
  0 siblings, 1 reply; 6+ messages in thread
From: Glauber Costa @ 2011-10-04  6:17 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Daniel Lezcano, linux-kernel, Balbir Singh, Paul Menage

On 10/04/2011 06:42 AM, Serge E. Hallyn wrote:
> Quoting Glauber Costa (glommer@parallels.com):
>> On 08/12/2011 01:52 AM, Serge Hallyn wrote:
>>> Quoting Daniel Lezcano (daniel.lezcano@free.fr):
>>>> On 08/11/2011 11:30 PM, Glauber Costa wrote:
>>>>> On 08/11/2011 05:55 PM, Daniel Lezcano wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> the cgroup cpuset and memory reduce access to a part of the resources on
>>>>>> the system. Some applications use the /proc/cpuinfo and /proc/meminfo to
>>>>>> allocate the resources. For instance, HPC jobs look at /proc/cpuinfo to
>>>>>> fork the number of cpu found in this file either look at /proc/meminfo
>>>>>> to allocate a big chunk of memory. Each process set the affinity on each
>>>>>> cpu, which in case a subset of cpus is used, some affinity will fail.
>>>>>>
>>>>>> In the case of the container, the cgroup is used to reduce the memory or
>>>>>> to assign a cpu to the container. Unfortunately, as this partitioning is
>>>>>> not reflected in /proc, the different system tools (ps, top, free, ...)
>>>>>> show a wrong information.
>>>>>>
>>>>>> I was wondering if that would make sense to create for the different
>>>>>> cgroup subsystem, when it is relevant, a proc formatted file we can bind
>>>>>> mount /proc.
>>>>>>
>>>>>> For example: /cgroup/memory.proc and /cgroup/cpuset.proc
>>>
>>> I think it's a great idea.
>>>
>>> -serge
>>
>> [ sorry for those who are getting this twice:
>>    The containers mailing list seems to be still not working, and Paul
>>    and Balbir changed their addresses in the mean time. So I am resending
>>    it to lkml and the right addrs instead. ]
>>
>> Food for thought:
>>
>> In my last /proc-related series, in which most of you were copied, I
>> tried to implement my understanding of this idea for /proc/stat.
>>
>> For whoever didn't see it, you can find a slightly outdated but
>> still valid version of it at http://lwn.net/Articles/460310/
>>
>> While doing it, however, something occurred to me. I'd like to know
>> what you think.
>>
>> As much as I like the idea proposed by Daniel (bind-mounting proc
>> files from the cgroup to inside the container namespace), what I
>> dislike about it is the amount of setup involved - one bind mount
>> per file -, and the fact that we need to know in advance which files
>> to expect (which I more or less tried to work around by
>> conventioning a directory-like naming).
>>
>> In general, we are doing containers, using both namespaces and
>> cgroups, two entities that are very loosely coupled. While I agree
>> that such a loose coupling is not the end of the world - and quite
>> desirable in the general case -  so far I don't feel 100 %
>> comfortable with that. So, here it is: feel free to shoot to kill if
>> you dislike the idea.
>>
>> What if we try to couple them a bit more strongly ? My idea is:
>>
>> 1) Naming a certain namespace. For starters, we could use any pid inside
>> a namespace to name it, usually the first one to be created, but
>> really, any of them. (Or any other mechanism in the future)
>
> Naming namespaces is something we've been trying to avoid (because that
> introduces a new namespace), but note that /proc/self/ns/ now has
> files which you can use for comparing and entering some, and soon all,
> namespaces.  Hopefully we can somehow use these rather than using pids
> to identify namespaces?
>
> But actually... :

Well, that is what I meant by "naming". Basically anything that would
let us compare and identify that we're in a given namespace. 
/proc/self/ns is good enough.

But actually... :
>
>> 2) Create standard cgroup files, like pid_namespace, net_namespace, etc.
>>
>> 3) If those files are empty, no coupling takes place (Or maybe we
>> forget about this special case, and just have '1' as its default
>> content.
>>
>> 4) If there is a pid number written on it, that particular namespace
>> is considered tied to a cgroup. proc files that shows per-ns
>> information are already displayed per-ns. We would then proceed to
>> classify the remainder according to the type of information they
>> convey: net file, cpu file, memory file, io file, etc.
>>
>> 5) When a task inside a cgroup reads a file, it gets the data
>> according to the namespace it belongs.
>
> I think Daniel has thought a bit along these lines as well.  I don't
> think it needs to be particularly complicated.
+1
> We don't really need
> userspace involved, so actually we shouldn't need (userspace-visible)
> namespace identifiers, right?

Ideally, right. But...
> Can't we just introduce the
> /sys/fs/cgroup/memory/memory.proc etc files, and have the procfs code,
> if cgroups are enabled and the task's memory cgroup != '/', return
> the data from that file?

First: If we're doing that, why do we need that file in the first place?
The file is useful if we're bind mounting, but if we're automatically 
displaying it according to any criteria, not that interesting. Well, it 
would allow the root container to view it, so maybe it is in fact 
interesting...

As for cgroup != '/', I am not sure if it works. Well, for containers, 
it works beautifully. But what we have in the kernel now is a mechanism 
for resource control (cgroups) and a mechanism for isolation 
(namespaces). Displaying data falls in the isolation realm. There are 
users using just the resource control part (think of systemd). I doubt 
they'd like to suddenly, after years expecting system-wide info, read 
per-cgroup data when querying a /proc file.

So it is because I'm all for automatic that I am proposing this. I think 
we need a mechanism to tie a cgroup to a namespace (or many, one of each 
kind).

I myself can settle for:
   * If namespace != '/' => show cgroup information instead of
     system-wide. (What do you think?)

The only reason I proposed anything more complicated than that is that 
I feared there were weirdos out there for whom "every process in a 
cgroup is in the same namespace" wouldn't hold, and they'd want to opt 
out of this. But I honestly think this is a very sick usecase.


> We might also want to have a /sys/fs/cgroup/memory/memory.show_proc_data
> (etc) file which defaults to 1 (show the cgroup's file data in place of
> /proc/meminfo), which can be set to 0 on the host so that the container,
> if it wants, can see the host's data.
>
>> This idea is almost setup-free (with the exception of dumping pids
>> into the cgroup files, but if the files are default for all cgroups,
>> a 3-line loop can do it in a very future-proof way). But in reality,
>> what appeals to me about it, is that it is a mechanism for coupling
>> those two
>> entities that in our case, should be the same. It provides stronger
>> guarantees that we will never be able to see any data outside the
>> ones we are untitled to, even we get the bind mounts setup wrongly.
>>
>> (disclaimer: wild idea ahead)
>> If we, for instance, code in such a way that if a certain proc-file
>> is per-namespace, the task could get no data at all unless a
>> cgroup-binding is set, providing stronger isolation guarantees.
>
> Are there good reasons to worry about guaranteeing this particular
> isolation?  My impression was that this stuff is useful for the
> application - the better it can calculate the resources available
> to it, the better it can get along with others avoid getting killed
> later.  But I didn't think our goal was to try and hide the host
> info from the container - we just want to give it most meaningful
> info.

First of all, note that I am not overly concerned about that.
But it may prove useful.
If I am in a container side by side with yours, I'd prefer that you not
be able to guess anything about me, including my workload type, memory 
usage, etc. That information could be used by clever exploiters.

Besides, /proc holds all sorts of stuff. Networking routing tables and 
connection status, for example. Those are not just statistics, and 
should maybe be totally hidden.
>
> (That's probably also why this stuff has been languishing - it's
> rather low in priority because unlike other things it won't harm
> the host)

Agreed about that. But hey, at some point it has to be done...

>> It is also easy to check if a task that do not belong to a namespace
>> is present in a namespaced cgroup. We can easily disallow that,
>> preventing rogue process to escape and eat resources from a
>> container.
>>
>> The list goes on.
>>
>> Please tell me what you think.
>
> thanks,
> -serge



* Re: cgroup information proc file format
  2011-10-04  6:17           ` Glauber Costa
@ 2011-10-04 14:05             ` Serge Hallyn
  2011-10-05  7:47               ` Glauber Costa
  0 siblings, 1 reply; 6+ messages in thread
From: Serge Hallyn @ 2011-10-04 14:05 UTC (permalink / raw)
  To: Glauber Costa; +Cc: Daniel Lezcano, linux-kernel, Balbir Singh, Paul Menage

Quoting Glauber Costa (glommer@parallels.com):
...

> >Can't we just introduce the
> >/sys/fs/cgroup/memory/memory.proc etc files, and have the procfs code,
> >if cgroups are enabled and the task's memory cgroup != '/', return
> >the data from that file?
> 
> First: If we're doing that, why do we need that file in the first place?

We might not :)  But we might, if we want to offer containers a choice of
whether /proc/meminfo is the host's or the container's.

> The file is useful if we're bind mounting, but if we're
> automatically displaying it according to any criteria, not that
> interesting. Well, it would allow the root container to view it, so
> maybe it is in fact interesting...
> 
> As for cgroup != '/', I am not sure if it works. Well, for
> containers, it works beautifully. But what we have in the kernel now
> is a mechanism for resource control (cgroups) and a mechanism for
> isolation (namespaces). Displaying data falls in the isolation
> realm. There are users using just the resource control part (think
> of systemd). I doubt they'd like to suddenly, after years expecting
> system-wide info, read per-cgroup data when querying a /proc file.

That's where the /sys/fs/cgroup/memory/memory.use_cgroup_as_proc file
I mentioned below would come in.  The host could choose to give
that application the host /proc/meminfo view.

Still, if the applications you are thinking of are having their
resources restricted, what harm would come of reporting their actual
allotted resources in place of an artificially inflated number?

> So, because I'm all for automatic, is that I am proposing this. I
> think we need a mechanism to tie a cgroup to a namespace (or many,
> one of each kind).
> 
> I myself can settle down for:
>   * If namespace != '/' => show cgroup information instead of
>     system-wide. (What do you think?)

I don't like it  :)

The namespaces are about name->object relations, not just about
isolation.  In contrast, the cgroups are precisely about resource
limitations.

> The only reason I proposed anything more complicated than that, is
> that I was fearing there were weirdos out there for whom "every
> process in a cgroup is in the same namespace" wouldn't hold, and

Absolutely.

> they'd want to opt this out. But I honestly think this is a very
> sick usecase.

:)

Don't get me wrong, I don't think it would hurt to always give them
the cgroup data.  I just think the check is not 'correct'.

> >We might also want to have a /sys/fs/cgroup/memory/memory.show_proc_data
> >(etc) file which defaults to 1 (show the cgroup's file data in place of
> >/proc/meminfo), which can be set to 0 on the host so that the container,
> >if it wants, can see the host's data.
> >
> >>This idea is almost setup-free (with the exception of dumping pids
> >>into the cgroup files, but if the files are default for all cgroups,
> >>a 3-line loop can do it in a very future-proof way). But in reality,
> >>what appeals to me about it, is that it is a mechanism for coupling
> >>those two
> >>entities that in our case, should be the same. It provides stronger
> >>guarantees that we will never be able to see any data outside the
> >>ones we are untitled to, even we get the bind mounts setup wrongly.
> >>
> >>(disclaimer: wild idea ahead)
> >>If we, for instance, code in such a way that if a certain proc-file
> >>is per-namespace, the task could get no data at all unless a
> >>cgroup-binding is set, providing stronger isolation guarantees.
> >
> >Are there good reasons to worry about guaranteeing this particular
> >isolation?  My impression was that this stuff is useful for the
> >application - the better it can calculate the resources available
> >to it, the better it can get along with others avoid getting killed
> >later.  But I didn't think our goal was to try and hide the host
> >info from the container - we just want to give it most meaningful
> >info.
> 
> First of all, note that I am not overly concerned about that.
> But it may prove useful.
> If I am in a container side by side with yours, I'd prefer you wouldn't
> be able to guess anything about me, including my workload type,
> memory usage, etc, and this could be used by clever exploiters.
> 
> Besides, /proc holds all sorts of stuff. Networking routing tables
> and connection status, for example. Those are not just statistics,
> and should maybe be totally hidden.

I think that should be done separately from this whole discussion - using
user namespaces.  Any task in a non-initial user namespace will only
get the world access rights to a procfile.  So if the file isn't world
readable, then a container won't be able to read it.
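
The rule described here reduces to a check of the "other" permission
bits. The sketch below only models that description of the planned
user namespace behavior; it is not a statement about how any released
kernel implements the check, and the function name is made up.

```python
import stat


def container_can_read(mode):
    """Model: outside the initial user namespace only world bits count."""
    return bool(mode & stat.S_IROTH)
```

Under this model a procfile with mode 0444 stays readable inside a
container, while one with mode 0400 or 0640 does not.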

> >(That's probably also why this stuff has been languishing - it's
> >rather low in priority because unlike other things it won't harm
> >the host)
> 
> Agreed about that. But hey, at some point it has to be done...

:)

-serge


* Re: cgroup information proc file format
  2011-10-04 14:05             ` Serge Hallyn
@ 2011-10-05  7:47               ` Glauber Costa
  2011-10-06 12:50                 ` Serge E. Hallyn
  0 siblings, 1 reply; 6+ messages in thread
From: Glauber Costa @ 2011-10-05  7:47 UTC (permalink / raw)
  To: Serge Hallyn; +Cc: Daniel Lezcano, linux-kernel, Balbir Singh, Paul Menage

On 10/04/2011 06:05 PM, Serge Hallyn wrote:
> Quoting Glauber Costa (glommer@parallels.com):
> ...
>
>>> Can't we just introduce the
>>> /sys/fs/cgroup/memory/memory.proc etc files, and have the procfs code,
>>> if cgroups are enabled and the task's memory cgroup != '/', return
>>> the data from that file?
>>
>> First: If we're doing that, why do we need that file in the first place?
>
> We might not :)  But we might, if we want to offer containers a choice of
> whether /proc/meminfo is the host's or the container's.

Hi,

Please allow me to clarify some points so we are on the same page (thus 
avoiding fragmentation =p )

Are you quoting /proc/meminfo as an example only, or are you concerned 
specifically with this file? I myself am talking about proc files in 
general.

We have to keep in mind that the myriad of them convey different kinds 
of information, belong to different subsystems and have different 
expected behavior.

That is important because for some of them, what you state about only 
allowing a group of processes to see the resources they have makes 
sense. For others, maybe not.

>> The file is useful if we're bind mounting, but if we're
>> automatically displaying it according to any criteria, not that
>> interesting. Well, it would allow the root container to view it, so
>> maybe it is in fact interesting...
>>
>> As for cgroup != '/', I am not sure if it works. Well, for
>> containers, it works beautifully. But what we have in the kernel now
>> is a mechanism for resource control (cgroups) and a mechanism for
>> isolation (namespaces). Displaying data falls in the isolation
>> realm. There are users using just the resource control part (think
>> of systemd). I doubt they'd like to suddenly, after years expecting
>> system-wide info, read per-cgroup data when querying a /proc file.
>
> That's where the /sys/fs/cgroup/memory/memory.use_cgroup_as_proc file
> I mentioned below would come in.  The host could choose to give
> that application the host /proc/meminfo view.
I am sorry, I think I missed you mentioning this file.

Correct me if I am wrong, but it seems to me now that we agree that 
there should be a mechanism determining whether or not to automatically 
show cgroup-restrained values in proc files.

This is a key point for me. What the mechanism is matters less, 
as long as it is a one-time shot.

>
> Still, if the applications you are thinking of are having their
> resources restricted, what harm would come of reporting their actual
> allotted resources in place of an artificially inflated number?
Think of /proc/stat, the file I am working on now, as an example.

Historically, this file shows, among other things, user ticks for all 
processes in the system. In a container system, we want this to 
represent only the set of processes inside a container.

But why on earth can we assume that everybody, in all use cases, 
wouldn't be harmed by having just your process' ticks displayed? I don't 
think we can.

Note that people are now using cgroups for other things, (think systemd).

They can serve as process grouping, simple restriction, etc.
So the less we assume, the better.
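
To illustrate the difference between the two views of user ticks, here
is a small sketch. The first four numeric fields of the "cpu" line in
/proc/stat are user, nice, system and idle; everything else below (the
function names, the per-task dictionary, summing over a pid set as a
stand-in for "the container's processes") is an assumption, and a real
per-cgroup counter would also have to account for tasks that have
already exited.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into named ticks."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("not the aggregate cpu line")
    names = ("user", "nice", "system", "idle")
    return dict(zip(names, (int(v) for v in fields[1:5])))


def user_ticks(per_task_utime, pids=None):
    """Sum user ticks over all tasks, or only over `pids` if given."""
    if pids is None:
        return sum(per_task_utime.values())
    return sum(per_task_utime[p] for p in pids)
```

A tool that expects the system-wide sum and is silently handed the
restricted one would compute wrong utilization figures, which is the
concern raised above.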

>
>> So, because I'm all for automatic, is that I am proposing this. I
>> think we need a mechanism to tie a cgroup to a namespace (or many,
>> one of each kind).
>>
>> I myself can settle down for:
>>    * If namespace != '/' =>  show cgroup information instead of
>>      system-wide. (What do you think?)
>
> I don't like it  :)
>
> The namespaces are about name->object relations, not just about
> isolation.  In contrast, the cgroups are precisely about resource
> limitations.
Right.

>> The only reason I proposed anything more complicated than that, is
>> that I was fearing there were weirdos out there for whom "every
>> process in a cgroup is in the same namespace" wouldn't hold, and
>
> Absolutely.
>
>> they'd want to opt this out. But I honestly think this is a very
>> sick usecase.
>
> :)
>
> Don't get me wrong, I don't think it would hurt to always give them
> the cgroup data.  I just think the check is not 'correct'.
>
>>> We might also want to have a /sys/fs/cgroup/memory/memory.show_proc_data
>>> (etc) file which defaults to 1 (show the cgroup's file data in place of
>>> /proc/meminfo), which can be set to 0 on the host so that the container,
>>> if it wants, can see the host's data.

A container can't want anything. I am more concerned here with the other 
types of use cases.

BTW, a file in each cgroup:

/sys/fs/cgroup/memory/memory.restrict_proc_data (or any other name)
/sys/fs/cgroup/cpu/cpu.restrict_proc_data (or any other name)
etc...

works for me as well.
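
A sketch of how such per-controller flags might be consulted. The
restrict_proc_data file name is the one floated above ("or any other
name"); the directory layout, the "1" means restricted convention, and
the helper name are assumptions.

```python
import os


def restricted_controllers(cgroup_root, cgroup_path,
                           controllers=("memory", "cpu")):
    """Return the controllers whose restrict_proc_data flag reads '1'.

    A missing flag file is treated as "not restricted", so cgroups
    used purely for resource control keep the system-wide view.
    """
    restricted = []
    for ctrl in controllers:
        flag = os.path.join(cgroup_root, ctrl, cgroup_path.lstrip("/"),
                            f"{ctrl}.restrict_proc_data")
        try:
            with open(flag) as f:
                if f.read().strip() == "1":
                    restricted.append(ctrl)
        except FileNotFoundError:
            pass
    return restricted
```

Keeping one flag per controller lets memory-related proc files be
restricted while CPU-related ones stay system-wide, or the reverse.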

>>>
>>>> This idea is almost setup-free (with the exception of dumping pids
>>>> into the cgroup files, but if the files are default for all cgroups,
>>>> a 3-line loop can do it in a very future-proof way). But in reality,
>>>> what appeals to me about it, is that it is a mechanism for coupling
>>>> those two
>>>> entities that in our case, should be the same. It provides stronger
>>>> guarantees that we will never be able to see any data outside the
>>>> ones we are untitled to, even we get the bind mounts setup wrongly.
>>>>
>>>> (disclaimer: wild idea ahead)
>>>> If we, for instance, code in such a way that if a certain proc-file
>>>> is per-namespace, the task could get no data at all unless a
>>>> cgroup-binding is set, providing stronger isolation guarantees.
>>>
>>> Are there good reasons to worry about guaranteeing this particular
>>> isolation?  My impression was that this stuff is useful for the
>>> application - the better it can calculate the resources available
>>> to it, the better it can get along with others avoid getting killed
>>> later.  But I didn't think our goal was to try and hide the host
>>> info from the container - we just want to give it most meaningful
>>> info.
>>
>> First of all, note that I am not overly concerned about that.
>> But it may prove useful.
>> If I am in a container side by side with yours, I'd prefer you wouldn't
>> be able to guess anything about me, including my workload type,
>> memory usage, etc, and this could be used by clever exploiters.
>>
>> Besides, /proc holds all sorts of stuff. Networking routing tables
>> and connection status, for example. Those are not just statistics,
>> and should maybe be totally hidden.
>
> I think that should be done separate from this whole discussion - using
> user namespaces.  Any task in a non-initial user namespace will only
> get the world access rights to a procfile.  So if the file isn't world
> readable, then a container won't be able to read it.

Yeah. Well, this was never part of the main discussion anyway =)
I agree with you here.

>>> (That's probably also why this stuff has been languishing - it's
>>> rather low in priority because unlike other things it won't harm
>>> the host)
>>
>> Agreed about that. But hey, at some point it has to be done...
>
> :)
>
> -serge



* Re: cgroup information proc file format
  2011-10-05  7:47               ` Glauber Costa
@ 2011-10-06 12:50                 ` Serge E. Hallyn
  0 siblings, 0 replies; 6+ messages in thread
From: Serge E. Hallyn @ 2011-10-06 12:50 UTC (permalink / raw)
  To: Glauber Costa; +Cc: Daniel Lezcano, linux-kernel, Balbir Singh, Paul Menage

Quoting Glauber Costa (glommer@parallels.com):
> On 10/04/2011 06:05 PM, Serge Hallyn wrote:
> >Quoting Glauber Costa (glommer@parallels.com):
> >...
> >
> >>>Can't we just introduce the
> >>>/sys/fs/cgroup/memory/memory.proc etc files, and have the procfs code,
> >>>if cgroups are enabled and the task's memory cgroup != '/', return
> >>>the data from that file?
> >>
> >>First: If we're doing that, why do we need that file in the first place?
> >
> >We might not :)  But we might, if we want to offer containers a choice of
> >whether /proc/meminfo is the host's or the container's.
> 
> Hi,
> 
> Please allow me to clarify some points so we are in the same page
> (thus avoiding fragmentation =p )
> 
> Are you quoting /proc/meminfo as an example only, or are you
> concerned specifically with this file? I myself am talking about
> proc files in general.

An example.  But as we are talking in terms of cgroups, I assumed this
was only about procfiles representing resources affected by cgroups -
like /proc/cpuinfo, /proc/meminfo, /proc/devices...

...

> Correct me if I am wrong, but it seems to me now that we agree that
> there should be a mechanism determining whether or not to
> automatically show cgroup-restrained values in proc files.

Agreed.

...

> BTW, A file in each cgroup:
> 
> /sys/fs/cgroup/memory/memory.restrict_proc_data (or any other name)
> /sys/fs/cgroup/cpu/cpu.restrict_proc_data (or any other name)
> etc...
> 
> works for me as well.

Cool.

-serge

