Control groups and Resource Management notes (part I)

* Control groups and Resource Management notes (part I)
@ 2008-08-01 13:54 ` Balbir Singh
  0 siblings, 0 replies; 11+ messages in thread
From: Balbir Singh @ 2008-08-01 13:54 UTC (permalink / raw)
  To: Linux Containers

Hi, All,

This is the first part of the resource management and control groups discussion.
I might have made mistakes while taking notes or typing them out, please feel
free to correct them for me or send me corrections.

The notes are really large, so they'll come in installments. This is the first
part of the notes.

Control Groups
==============

1. Multiphase locking - Paul brought up his multiphase locking design and
suggested approaches to implementing it. The problem with control groups
currently is that transactions cannot be committed atomically. If some
transactions fail (a can_attach() callback fails or returns an error), no
notification is sent to the groups that have already committed the transaction.

The suggested design includes
	- Acquiring locks across callbacks - Balbir opposed this approach
          stating that this would make it easier for subsystems to deadlock.
          Balbir instead suggested that each callback hold its own lock and
          that an undo operation that cannot fail (returns void) be added,
          since uncharging usually succeeds. Dave suggested doing the undo
          without holding any locks.
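
The undo-based alternative can be modelled outside the kernel. What follows is
a minimal Python sketch, not kernel code: the Subsystem class, its
undo_attach() method and the attach_task() helper are hypothetical names
chosen for illustration; only can_attach() comes from the discussion above.

```python
class Subsystem:
    """Toy model of a cgroup subsystem that charges one unit per attached task."""

    def __init__(self, name, limit):
        self.name = name
        self.limit = limit
        self.charged = 0

    def can_attach(self, task):
        # Phase 1: reserve the resource; this step is allowed to fail.
        if self.charged + 1 > self.limit:
            return False
        self.charged += 1
        return True

    def undo_attach(self, task):
        # Hypothetical undo callback: returns nothing and cannot fail,
        # because uncharging only gives back what was reserved.
        self.charged -= 1


def attach_task(task, subsystems):
    """Attach task to every subsystem, or roll back and leave none charged."""
    done = []
    for subsys in subsystems:
        if not subsys.can_attach(task):
            # Notify the subsystems that already committed, newest first.
            for earlier in reversed(done):
                earlier.undo_attach(task)
            return False
        done.append(subsys)
    return True
```

Because the undo step has no return value, the rollback loop has no failure
path of its own to handle, which is the property Balbir was asking for.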

2. Procs - Balbir and others have asked for an API to move all threads of a
process in one go from one control group to another. The question about doing it
in user space was asked. Doing it in user space is easy, but it can be expensive
(moving all threads one by one - acquiring the cgroup lock and releasing it for
every thread). What happens if another move is requested while a partial move is
in progress? Dave suggested that we have an abstract aggregation so that we
don't need to keep adding interfaces for every aggregation. Balbir mentioned
that the aggregations of interest are processes, process groups and sessions,
and that the kernel already knows about these (there are data structures to link
all elements together). Abstracting it is a good idea, but hard to implement.

Paul asked what the behaviour should be if a process being moved has several
threads belonging to different cgroups. The answer that came up was that they
should all be migrated to the destination cgroup.
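
For reference, the user space approach amounts to something like the sketch
below. It assumes the cgroup v1 layout, where writing a thread ID to a group's
"tasks" file moves that one thread; the function name and the proc_root
parameter (there only so the loop can be tried against a scratch directory)
are made up for this example.

```python
import os


def move_all_threads(pid, cgroup_dir, proc_root="/proc"):
    """Move every thread of pid into cgroup_dir, one write per thread.

    Returns the list of thread IDs that were written.
    """
    task_dir = os.path.join(proc_root, str(pid), "task")
    tasks_file = os.path.join(cgroup_dir, "tasks")
    moved = []
    # Each write moves a single thread, so the kernel takes and releases
    # its cgroup lock once per thread. Nothing here is atomic: another
    # move request, or a newly created thread, can interleave with the loop.
    for tid in sorted(int(name) for name in os.listdir(task_dir)):
        with open(tasks_file, "a") as f:
            f.write(f"{tid}\n")
        moved.append(tid)
    return moved
```

The cost and the race window both grow with the number of threads, which is
the argument for a single in-kernel operation.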

3. Cgroup lock - The cgroup lock is held at various places in the system. The
question is -- is cgroup_lock() becoming the next BKL? Several solutions were
discussed - making the lock per hierarchy or per cgroup, or using subsystem locks.
Paul mentioned that cgroups already use RCU.

4. Binary statistics - The question about binary statistics was raised. Since
control groups don't enforce any particular kind of API, is there a way to
generically handle control files and their parameters in the library? Paul
suggested his binary API approach, where every control group and its API is
documented in an api file. Eric suggested using an ASCII interface (since that
is very generic) and using one file per API. Balbir mentioned that this will
lead to too many dentries and the issues that come with a large number of them.

5. User space notifications - Kamezawa had requested user space notification
(through inotify) when, for example, a control group reaches its memory limit.
The question that was asked was: what happens if no one is listening for
notifications? Denis suggested using a FIFO mechanism. Balbir suggested using
netlink and building on top of cgroupstats. With netlink we can pass
type, value and length of arguments, making it more suitable for this kind of
information exchange. The only concern with netlink is that it can lose
messages. The general consensus was to add one FIFO per control group and use
that for all notifications related to the control group.
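
To make the type/length/value point concrete, here is a small Python sketch of
the netlink attribute layout (a 16-bit length, a 16-bit type, then the payload
padded to 4 bytes). It only packs and unpacks bytes and does not open a
socket. The attribute numbers used in the usage note are invented for
illustration; no such notification attributes exist.

```python
import struct

NLA_HDRLEN = 4      # two 16-bit fields: length and type
NLA_ALIGNTO = 4     # attributes are padded to a 4-byte boundary


def pack_attr(attr_type, payload):
    """Encode one attribute as length, type, value, plus alignment padding."""
    length = NLA_HDRLEN + len(payload)          # length excludes the padding
    padding = -length % NLA_ALIGNTO
    return struct.pack("=HH", length, attr_type) + payload + b"\x00" * padding


def unpack_attrs(buf):
    """Decode a buffer of back-to-back attributes into (type, payload) pairs."""
    attrs = []
    offset = 0
    while offset + NLA_HDRLEN <= len(buf):
        length, attr_type = struct.unpack_from("=HH", buf, offset)
        if length < NLA_HDRLEN or offset + length > len(buf):
            break                                # truncated or corrupt attribute
        attrs.append((attr_type, buf[offset + NLA_HDRLEN:offset + length]))
        # Step to the next attribute, rounding up to the alignment boundary.
        offset += (length + NLA_ALIGNTO - 1) & ~(NLA_ALIGNTO - 1)
    return attrs
```

A notification could then carry, say, a group path as hypothetical attribute 1
and a byte count as hypothetical attribute 2:
pack_attr(1, b"/grp1\x00") + pack_attr(2, struct.pack("=Q", 4096)).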

Resource management
===================
1. Memory controller - Balbir mentioned that this is best discussed at the
memory controller BoF
2. Device subsystem was discussed and it was decided that mount (filesystem)
namespace and device namespace are the best places to handle device subsystem
issues.
3. Memrlimit - Balbir discussed the memrlimit controller. Dave and Paul are
opposed to doing any limits based on virtual address space. Balbir mentioned
that it serves several purposes

a. It allows us to control swap usage
b. It allows us to build a generic rlimits infrastructure
c. It allows us to fail applications nicely

Paul mentioned that (c) was not useful since no applications handle it today.
Balbir disagreed that this was a sufficient reason to prevent future
applications from handling malloc()/mmap() failure. Balbir asked why overcommit
accounting was not useful.

There was general agreement that an mlock() controller would be useful.

4. CPU controller - There was a request for a hard limit feature. Peter opposed
the approach, stating that anyone wanting hard limits should use the real-time
group scheduler, and that a new EDF scheduler is being implemented. Denis
mentioned that without hard limits it is not possible for a service provider to
decide/plan how much capacity a single CPU can provide. Balbir mentioned that
with hard limits and SLAs the service provider could, on reaching the hard
limit, save power by limiting execution on a CPU that is meeting its SLA
requirements. Peter mentioned that hard limits would make the group scheduler
non-work-conserving.
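
The work-conserving point can be illustrated with a toy model. The sketch
below is not the group scheduler; it is a simplified weighted fair-share split
of a single CPU, written for this note, with allocate_cpu() and its parameters
being invented names.

```python
def allocate_cpu(shares, demand, hard_limit=None):
    """Split one CPU (1.0) among groups.

    shares:     relative weight per group
    demand:     fraction of the CPU each group wants to use
    hard_limit: optional per-group cap that holds even if the CPU is idle
    """
    hard_limit = hard_limit or {}
    # A group never receives more than it asks for or than its cap allows.
    ceiling = {g: min(demand[g], hard_limit.get(g, 1.0)) for g in shares}
    alloc = {g: 0.0 for g in shares}
    remaining = 1.0
    active = {g for g in shares if ceiling[g] > 0}
    while active and remaining > 1e-12:
        total = sum(shares[g] for g in active)
        satisfied = set()
        handed_out = 0.0
        for g in active:
            offer = remaining * shares[g] / total
            take = min(offer, ceiling[g] - alloc[g])
            alloc[g] += take
            handed_out += take
            if ceiling[g] - alloc[g] <= 1e-12:
                satisfied.add(g)
        remaining -= handed_out
        if not satisfied:
            break          # everyone still wants more: the CPU is fully used
        # Redistribute what satisfied groups left over among the rest.
        active -= satisfied
    return alloc
```

With two equal-share groups where only group "a" has work, the model gives "a"
the whole CPU when there is no cap, and leaves half of it idle when "a" has a
hard limit of 0.5. That idle half is what "non-work-conserving" refers to, and
also where the power saving would come from.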

Peter also updated everyone about the new load balancing patches that will make
it into the next merge window.

5. Kernel memory controller - The kernel memory controller was discussed
briefly. Pavel has not been actively working on it. Denis mentioned that it
would be nice to have a network buffer controller as well. The question was
asked whether the kernel memory controller should be merged with the existing
memory controller.

6. Swap subsystem - Daisuke mentioned that the swap subsystem works well for
fundamental operations and that he posted a version of the patch three weeks
ago. The patch controls swap entries to control the swap usage of a control
group. Paul mentioned that Google has a patch internally to link swap files to
cpusets. Balbir asked Serge about his swap namespace patches. The swap namespace
is a different issue altogether (compared to the swap controller). Currently
the swap controller is a part of the memory controller. There has been some
discussion about it being an independent controller.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

* Control groups and Resource Management notes (part II)
  2008-08-01 13:54 ` Balbir Singh
  (?)
@ 2008-08-02  1:10 ` Balbir Singh
  2008-08-05  7:45   ` KOSAKI Motohiro
  -1 siblings, 1 reply; 11+ messages in thread
From: Balbir Singh @ 2008-08-02  1:10 UTC (permalink / raw)
  To: Linux Containers; +Cc: linux kernel mailing list, libcg-devel

Here's part II (part I can be found at
https://lists.linux-foundation.org/pipermail/containers/2008-August/012128.html)

Resource management (cont'd)
============================
7. Disk IO controller - There was a general discussion on the various disk IO
controllers
	a. DM - IOBand
	b. IO throttle
	c. Anticipatory
	d. CFQ

It was decided that it would be best for all the stakeholders to work together
and let Jens Axboe and the block layer experts figure out what would be right
for the Linux kernel.

8. Network traffic control - Paul discussed network traffic control and the
approach followed by Google. The existing classifier mechanism can be easily
extended by adding a classifier id (based on the control group). This is used in
combination with netfilter. Balbir mentioned that Thomas Graf was also looking
at something similar and raised the issue of input bandwidth control. Balbir
also pointed people to CKRM, where the solution has been implemented. The OpenVZ
and Google teams will post their patches.

9. Network permissions - There was a recommendation to use security hooks for
network permissions. Paul explained that they use permissions with
	a. connect
	b. bind
	c. accept

The issue of using netlabels was brought up.

10. Freezer subsystem - The freezer system was discussed briefly. Serge
mentioned the patches and wanted to collect feedback (if any) on them.

11. OOM Handler - The OOM handler was discussed in detail. Balbir mentioned
certain shortcomings of the OOM handler
	a. Logic - it is based on total_vm, is that the correct metric for
                   OOMing?
	b. Concurrency - it kills several tasks at once

There was a discussion on moving the policy for OOM handling to user space. Paul
described how the OOM handler has been modified at Google to notify user space
when a cpuset runs out of memory. Balbir asked whether OOMing on reaching limits
is a good idea; the general feeling was that it might not be.

Control group library
=====================
Dhaval and Balbir introduced libcgroup, the purpose of the library and its
goals. Balbir described on paper what the current design looks like; it consists of

	1. API
	2. Test framework
	3. A configuration subsystem

Dhaval discussed the configuration syntax: XML versus a home-made format. The
issue of classification of tasks was brought up. The reason that we want to
classify tasks is that we want them to move at fork/exec time to the correct
cgroup so that

1. They don't consume resources in the parent's group
2. The movement is automatic

It was generally agreed upon that the classification should take place in user
space. Eric and others suggested having a wrapper to start the application in
the correct cgroup (wrapper around fork/exec). Dave suggested that one might
even go to the extent of a hack in which a process is ptraced after fork/exec,
moved to the correct group and resumed. Using SELinux contexts was also recommended.
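
The wrapper idea is simple enough to sketch. The following Python version is
an illustration only, assuming the cgroup v1 "tasks" file interface;
join_cgroup() and run_in_cgroup() are invented names and are not part of
libcgroup.

```python
import os


def join_cgroup(cgroup_dir, pid=None):
    """Write pid (default: the caller) to the group's tasks file."""
    pid = os.getpid() if pid is None else pid
    with open(os.path.join(cgroup_dir, "tasks"), "a") as f:
        f.write(f"{pid}\n")
    return pid


def run_in_cgroup(cgroup_dir, argv):
    """Start argv in cgroup_dir and return its exit status (-1 if signalled)."""
    child = os.fork()
    if child == 0:
        try:
            # The child moves itself before exec, so the program never
            # consumes resources in the parent's group.
            join_cgroup(cgroup_dir)
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)   # reached only if the move or the exec failed
    _, status = os.waitpid(child, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1
```

Since the move happens between fork and exec, the application itself needs no
changes, but anything that is not started through the wrapper is missed.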

Vivek brought up using PAM plugins to do classification; this suggestion was
well received. The decision was to do classification in user space and then
think of kernel space if it cannot be done in user space. Denis suggested that
classification is useful. In OpenVZ they classify all apache children to a
different group. Balbir asked Denis to post their classification infrastructure
as an RFC.

Balbir asked for contributions to libcgroup. Libcgroup will affect system design
and both administrators and application administrators. Now is a good time to
get *involved*.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL

* Re: Control groups and Resource Management notes (part II)
  2008-08-02  1:10 ` Control groups and Resource Management notes (part II) Balbir Singh
@ 2008-08-05  7:45   ` KOSAKI Motohiro
       [not found]     ` <20080805160709.A88B.KOSAKI.MOTOHIRO-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: KOSAKI Motohiro @ 2008-08-05  7:45 UTC (permalink / raw)
  To: balbir
  Cc: kosaki.motohiro, Linux Containers, linux kernel mailing list,
	libcg-devel

Hi balbir-san,

Thank you for the nice minutes.
They are very helpful for non-invited people (including me).


> 10. Freezer subsystem - The freezer system was discussed briefly. Serge
> mentioned the patches and wanted to collect feedback (if any) on them.

Who uses it?

AFAIK the freezer is mostly used by HPC people,
but they think MPI processes must be frozen.

Unfortunately, open source MPI implementations use various inter-process
communication methods (e.g. SYSV IPC, sockets, ptrace),

so a general freezer implementation is very difficult.


> 11. OOM Handler - The OOM handler was discussed in detail. Balbir mentioned
> certain short comings of the OOM handler
> 	a. Logic - it is based on total_vm, is that the correct metric for
>                    OOMing?
> 	b. Concurrency - it kills several tasks at once
> 
> There was a discussion on moving the policy for OOM handling to user space. Paul
> described how the OOM handler has been modified at google to notify user space
> when a CPUSet runs out of memory. Balbir asked if OOMing on reaching limits is a
> good idea, it was generally discussed that it might not be such a good idea.

A cpuset-based limitation is (slightly) hard to use.
A memcgroup-based one is better.

In addition, notification on reaching a limit can be very generic.

Various limits (e.g. cpu time, memory usage), various notifications
(e.g. kill process, send signal, inotify) and various targets
(each process in the cgroup or a manager process) can be thought of.



> Control group library
> =====================
> Dhaval and Balbir introduced libcgroups and the purpose of the library and the
> goals. Balbir described on paper what the current design looks like, it consists of
> 
> 	1. API
> 	2. Test framework
> 	3. A configuration subsystem
> 
> Dhaval discussed configuration syntax of XML versus home made. The issue of
> classification of tasks was brought up. The reason that we want to classify
> tasks is that we want them to move at fork/exec time to the correct cgroup so that

I don't recommend XML, because XML is a tree-based syntax but we want more
flexible classification. I also guess XML reduces human readability.


> 1. They don't consume resources in the parents group
> 2. The movement is automatic
> 
> It was generally agreed upon that the classification should take place in user
> space. Eric and others suggested having a wrapper to start the application in
> the correct cgroup (wrapper around fork/exec). Dave suggested that one might
> even go the extent of hacking, such that a process is ptraced after fork/exec,
> moved to the correct group and resumed. Using SELinux contexts was also recommended.
> 
> Vivek brought up using PAM plugins to do classifications, this suggestion was
> nicely received. The decision was to do classification in user space and then
> think of kernel space if it cannot be done in user space. Denis suggested that
> classification is useful. In OpenVZ they classify all apache children to a
> different group. Balbir asked Denis to post their classification infrastructure
> as RFC.

I'm not sure about this issue,
but I like the PAM approach.

[parent not found: <20080805160709.A88B.KOSAKI.MOTOHIRO-+CUm20s59erQFUHtdCDX3A@public.gmane.org>]

* Re: [Libcg-devel] Control groups and Resource Management notes (part II)
  2008-08-05  7:45   ` KOSAKI Motohiro
@ 2008-08-05 13:30         ` Vivek Goyal
  0 siblings, 0 replies; 11+ messages in thread
From: Vivek Goyal @ 2008-08-05 13:30 UTC (permalink / raw)
  To: KOSAKI Motohiro
  Cc: Linux Containers, libcg-devel, linux kernel mailing list,
	balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8

On Tue, Aug 05, 2008 at 04:45:30PM +0900, KOSAKI Motohiro wrote:
> Hi balbir-san,
> 
> Thank you for nice minutes.
> it is very helpful for non invited people (include me).
> 
> 
> > 10. Freezer subsystem - The freezer system was discussed briefly. Serge
> > mentioned the patches and wanted to collect feedback (if any) on them.
> 
> Who use it?
> 
> AFAIK the freezer is used by HPC guys in general.
> but they think MPI process must be freezed.
> 
> Unfortunately, Opensource MPI implementation use various inter-process
> communication method (e.g. SYSV IPC, socket, ptrace)
> 
> then, general freezer implementaion is very difficult.
> 
> 
> > 11. OOM Handler - The OOM handler was discussed in detail. Balbir mentioned
> > certain short comings of the OOM handler
> > 	a. Logic - it is based on total_vm, is that the correct metric for
> >                    OOMing?
> > 	b. Concurrency - it kills several tasks at once
> > 
> > There was a discussion on moving the policy for OOM handling to user space. Paul
> > described how the OOM handler has been modified at google to notify user space
> > when a CPUSet runs out of memory. Balbir asked if OOMing on reaching limits is a
> > good idea, it was generally discussed that it might not be such a good idea.
> 
> CPUSET based limitation is not easy to use (slightly).
> memcgroup based is better.
> 
> In addition, notification on reaching limit can be very generic.
> 
> various limit (e.g. cpu time, memory usage), various notification
> (e.g. kill process, send signal, inotify), various target
> (each process on the cgroup or manager process) can be tought.
> 
> 
> 
> > Control group library
> > =====================
> > Dhaval and Balbir introduced libcgroups and the purpose of the library and the
> > goals. Balbir described on paper what the current design looks like, it consists of
> > 
> > 	1. API
> > 	2. Test framework
> > 	3. A configuration subsystem
> > 
> > Dhaval discussed configuration syntax of XML versus home made. The issue of
> > classification of tasks was brought up. The reason that we want to classify
> > tasks is that we want them to move at fork/exec time to the correct cgroup so that
> 
> I don't recommend XML, because XML is tree based syntax but we want more fulexible
> classification. then I guess XML reduce human readability.
> 
> 
> > 1. They don't consume resources in the parents group
> > 2. The movement is automatic
> > 
> > It was generally agreed upon that the classification should take place in user
> > space. Eric and others suggested having a wrapper to start the application in
> > the correct cgroup (wrapper around fork/exec). Dave suggested that one might
> > even go the extent of hacking, such that a process is ptraced after fork/exec,
> > moved to the correct group and resumed. Using SELinux contexts was also recommended.
> > 
> > Vivek brought up using PAM plugins to do classifications, this suggestion was
> > nicely received. The decision was to do classification in user space and then
> > think of kernel space if it cannot be done in user space. Denis suggested that
> > classification is useful. In OpenVZ they classify all apache children to a
> > different group. Balbir asked Denis to post their classification infrastructure
> > as RFC.
> 
> I'm not sure about this issue.
> but I like PAM approach.
> 

Thanks, Balbir, for the nice summary.

Well, it was Rik Van Riel's idea to use PAM plugins so that processes
are put into the right user cgroups upon login.

Is PAM-based classification alone sufficient? I noticed a couple of
cases that bypass PAM. For example:

- If one starts apache with "service httpd start", the httpd threads change
  their uid/gid to "apache/apache". But these threads will continue to
  run in the cgroup belonging to root and will not go into the apache cgroup.

- apache also offers the "suexec" tool, which execs a CGI script under a
  different user than the one who launched the web server. I quickly
  grepped the source code of suexec and it does not seem to be using
  PAM. That means CGI scripts running under some other user name will
  continue to run in the cgroup where apache is running.

I am not sure how many more such corner cases there are. These cases can
be covered either by modifying the application, by using some kind of
wrapper around the application, or by writing a classification daemon.
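
As a rough idea of what the core of such a classification daemon might look
like, here is a Python sketch of a uid-based lookup. It assumes only the
"Uid:" line of /proc/<pid>/status; the rules table, the directory names and
both function names are hypothetical, and how the daemon would learn about
uid changes is left open.

```python
import os


def read_uid(pid, proc_root="/proc"):
    """Return the effective UID of pid from the Uid: line of its status file."""
    with open(os.path.join(proc_root, str(pid), "status")) as f:
        for line in f:
            if line.startswith("Uid:"):
                # Fields are real, effective, saved and filesystem UID.
                return int(line.split()[2])
    raise ValueError(f"no Uid line for pid {pid}")


def classify(pid, rules, default, proc_root="/proc"):
    """Pick the cgroup directory for pid from a uid -> directory table."""
    return rules.get(read_uid(pid, proc_root), default)
```

In the httpd case above, a rule keyed on the apache uid would select the
apache group once the threads have dropped privileges, without touching
apache itself.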

Do we really need a classification daemon to cover such cases, or is the wrapper
approach sufficient? I remember somebody at the minisummit mentioning
that it should work without any apache modifications.

Thanks
Vivek

[parent not found: <20080805133007.GC15193-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>]

* Re: [Libcg-devel] Control groups and Resource Management notes (part II)
  2008-08-05 13:30         ` Vivek Goyal
@ 2008-08-06  1:05             ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 11+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-08-06  1:05 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Linux Containers, balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
	libcg-devel, linux kernel mailing list

On Tue, 5 Aug 2008 09:30:07 -0400
Vivek Goyal <vgoyal-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:

> On Tue, Aug 05, 2008 at 04:45:30PM +0900, KOSAKI Motohiro wrote:
> > > Control group library
> > > =====================
> > > Dhaval and Balbir introduced libcgroups and the purpose of the library and the
> > > goals. Balbir described on paper what the current design looks like, it consists of
> > > 
> > > 	1. API
> > > 	2. Test framework
> > > 	3. A configuration subsystem
> > > 
> > > Dhaval discussed configuration syntax of XML versus home made. The issue of
> > > classification of tasks was brought up. The reason that we want to classify
> > > tasks is that we want them to move at fork/exec time to the correct cgroup so that
> > 
> > I don't recommend XML, because XML is tree based syntax but we want more fulexible
> > classification. then I guess XML reduce human readability.
> > 
> > 
> > > 1. They don't consume resources in the parents group
> > > 2. The movement is automatic
> > > 
> > > It was generally agreed upon that the classification should take place in user
> > > space. Eric and others suggested having a wrapper to start the application in
> > > the correct cgroup (wrapper around fork/exec). Dave suggested that one might
> > > even go the extent of hacking, such that a process is ptraced after fork/exec,
> > > moved to the correct group and resumed. Using SELinux contexts was also recommended.
> > > 
> > > Vivek brought up using PAM plugins to do classifications, this suggestion was
> > > nicely received. The decision was to do classification in user space and then
> > > think of kernel space if it cannot be done in user space. Denis suggested that
> > > classification is useful. In OpenVZ they classify all apache children to a
> > > different group. Balbir asked Denis to post their classification infrastructure
> > > as RFC.
> > 
> > I'm not sure about this issue.
> > but I like PAM approach.
> > 
> 
> Thanks balbir for nice summary.
> 
Thanks, too.

> Well, it was Rik Van Riel's idea to use PAM plugins so that processes
> are put into right user cgroups upon login.
> 
> Is pam based classification alone is sufficient? I noticed couple of
> instances which will avoid pam. For example.
> 
> - If one starts apache "service httpd start", then httpd threads change
>   their uid/gid to "apache/apache". But these threads will continue to
>   run in the cgroup belonging to root and will not go into apache cgroup.
> 
> - apache also offers "suexec" tool which execs a CGI script under a 
>   different user than the user who has launched web server. I quickly
>   grepped for source code of suexec and it does not seem to be using
>   pam. That means CGI scripts running under some other user name will
>   continue to run in cgroup where apache is running.
> 
> I am not sure how many more such corner cases are there. These cases can
> either be covered by modification of application or using some kind of
> wrapper around application or by writing classification daemon.
> 
> Do we really need classification daemon to cover such cases or wrapper
> approach is sufficient? I remember somebody in minisummit was mentioning
> that it should work without any apache modifications.
> 

We can go ahead step by step. I think PAM support is the first step.
The daemon is the second.

1. PAM
2. A daemon for task placement (via netlink?)

I think developing "a daemon for task placement" is important,
but it cannot be a perfect solution for every situation.

The third step is

3. Modify applications (in newer versions of them).

"Should work without any apache modifications" is (maybe) necessary, but
it is not enough for perfect control. The library should provide an easy
way to modify applications.

I think the development cost for "2" is bigger than for "1" and "3". If
"2" is hard, starting from "1" and adding support functions for "3" is an
option. Once support for "3" is ready, someone may be able to implement
"2" more easily.

Thanks,
-Kame

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Libcg-devel] Control groups and Resource Management notes (part II)
  2008-08-06  1:05             ` KAMEZAWA Hiroyuki
  (?)
@ 2008-08-06 13:00             ` Vivek Goyal
  -1 siblings, 0 replies; 11+ messages in thread
From: Vivek Goyal @ 2008-08-06 13:00 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: KOSAKI Motohiro, Linux Containers, libcg-devel,
	linux kernel mailing list, balbir

On Wed, Aug 06, 2008 at 10:05:00AM +0900, KAMEZAWA Hiroyuki wrote:

[..]
> > > > Vivek brought up using PAM plugins to do classifications, this suggestion was
> > > > nicely received. The decision was to do classification in user space and then
> > > > think of kernel space if it cannot be done in user space. Denis suggested that
> > > > classification is useful. In OpenVZ they classify all apache children to a
> > > > different group. Balbir asked Denis to post their classification infrastructure
> > > > as RFC.
> > > 
> > > I'm not sure about this issue.
> > > but I like PAM approach.
> > > 
> > 
> > Thanks balbir for nice summary.
> > 
> Thanks, too.
> 
> > Well, it was Rik Van Riel's idea to use PAM plugins so that processes
> > are put into right user cgroups upon login.
> > 
> > Is pam based classification alone is sufficient? I noticed couple of
> > instances which will avoid pam. For example.
> > 
> > - If one starts apache "service httpd start", then httpd threads change
> >   their uid/gid to "apache/apache". But these threads will continue to
> >   run in the cgroup belonging to root and will not go into apache cgroup.
> > 
> > - apache also offers "suexec" tool which execs a CGI script under a 
> >   different user than the user who has launched web server. I quickly
> >   grepped for source code of suexec and it does not seem to be using
> >   pam. That means CGI scripts running under some other user name will
> >   continue to run in cgroup where apache is running.
> > 
> > I am not sure how many more such corner cases are there. These cases can
> > either be covered by modification of application or using some kind of
> > wrapper around application or by writing classification daemon.
> > 
> > Do we really need classification daemon to cover such cases or wrapper
> > approach is sufficient? I remember somebody in minisummit was mentioning
> > that it should work without any apache modifications.
> > 
> 
> We can go ahead step by step. I think PAM support is the first step.
> The daemon is the second.
> 
> 1. PAM
> 2. A daemon for task placement (via netlink ?)
> 
> I think developping "a daemon for task placement" is important.
> but cannot be perfect solution for any situations.
> 
> The third step is
> 
> 3. Modify applications (in newer version of them.)
> 
> "should work without any apache modifications" is (maybe) necessary. But for 
> perfect control, it's not enough. We should support a method to modify
> applications easily in library. 
> 
> I think develpment cost for "2" is bigger than "1" and "3". If "2" is hard,
> starting from "1" and support funcs for "3" is a choice.
> If support for "3" is ready, someone may start implementation of "2" in easier
> way.
> 

A phase-wise approach makes sense. I already have working patches for the
following things.

1. PAM module for placement of tasks
2. Modification of init scripts and a tool "cgclassify" so that at boot up
   time "init" and other system services are moved to "admin"'s group.
3. A libcgroup API that applications can use to place forked children
   in the right cgroup before doing exec.
4. A command line tool "execcg" which helps a user launch an application
   in a specific cgroup.

5. A classification daemon (based on netlink as of today; it should
   probably move to a cgroup fs based notification mechanism).
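
As an illustration of items 3 and 4, the following is a rough sketch of a
launcher that forks, moves the child into a cgroup, and only then execs the
program. It is not the actual "execcg" or libcgroup code. It assumes a cgroup
hierarchy in which writing a pid to a group's "tasks" file moves that task
into the group; the function names and the example path are hypothetical.

```python
# Sketch only: assumes writing a pid to <cgroup_dir>/tasks moves that task
# into the group. Function names and paths are hypothetical.
import os
import sys


def attach_pid(cgroup_dir, pid):
    """Move one task into the cgroup by writing its pid to the tasks file."""
    with open(os.path.join(cgroup_dir, "tasks"), "w") as tasks:
        tasks.write(str(pid))


def exec_in_cgroup(cgroup_dir, argv):
    """Fork, place the child in cgroup_dir, then exec argv in the child."""
    pid = os.fork()
    if pid == 0:
        try:
            # The child moves itself before exec, so the new program never
            # runs in the parent's group.
            attach_pid(cgroup_dir, os.getpid())
            os.execvp(argv[0], argv)
        finally:
            # Reached only if attaching or exec failed.
            os._exit(127)
    _, status = os.waitpid(pid, 0)
    return status


if __name__ == "__main__" and len(sys.argv) > 2:
    # Example: launcher.py /cgroup/apache httpd
    result = exec_in_cgroup(sys.argv[1], sys.argv[2:])
    sys.exit(os.WEXITSTATUS(result) if os.WIFEXITED(result) else 1)
```

Doing the move in the child between fork and exec is what keeps the launched
program from ever consuming resources in the parent's group.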

I think in phase 1 we can get the first 4 items merged and stabilized, and
then work on the daemon in phase 2 (if need be).

One issue with the daemon was raised with respect to containers: it will
also interfere with the placement of container threads, and this is not
desired.

This will have to be worked out.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 11+ messages in thread

[parent not found: <489315B2.2080506-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>]

* Re: Control groups and Resource Management notes (part I)
       [not found] ` <489315B2.2080506-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2008-08-05  7:06   ` KOSAKI Motohiro
  2008-08-06 17:38   ` [Libcg-devel] " Dhaval Giani
  1 sibling, 0 replies; 11+ messages in thread
From: KOSAKI Motohiro @ 2008-08-05  7:06 UTC (permalink / raw)
  To: balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8; +Cc: Linux Containers

Hi

Nice minutes!
Below are just my notes.

> Control Groups
> ==============
> 
> 1. Multiphase locking - Paul brought up his multi phase locking design and
> suggested approaches to implementing them. The problem with control groups
> currently is that transactions cannot be atomically committed. If some
> transactions fail (can_attach() callback fails or returns error), then there is
> no notification sent out to groups that already committed the transaction
> 
> The suggested design includes
> 	- Acquiring locks across callbacks - Balbir opposed this approach
>           stating that this would make it easier for subsystems to deadlock.
>           Balbir instead suggested that each callback hold it's own lock and
>           add an undo operation that cannot fail (returns void), since
>           uncharging usually succeeds. Dave suggested doing undo without holding
>           any locks.

The task_limit cgroup has one problem related to atomicity.
task_limit checks the number of tasks when can_attach() is called and
increments the number of tasks when attach() is called.
Thus, it has a race: if two attach operations run in parallel, the number
of tasks can exceed the task limit.
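
One possible way to close this kind of check-then-increment race, in the
spirit of the non-failing undo operation from the quoted notes, is to check
and reserve in a single step and roll back if a later group refuses. The
following is only a user-space toy model of that idea; the class and function
names are hypothetical and this is not kernel code.

```python
import threading


class TaskLimit:
    """Toy model of a task counter with a hard limit (names hypothetical)."""

    def __init__(self, limit):
        self.limit = limit
        self.count = 0
        self._lock = threading.Lock()

    def try_charge(self):
        """Check the limit and reserve a slot in one atomic step."""
        with self._lock:
            if self.count >= self.limit:
                return False
            self.count += 1
            return True

    def uncharge(self):
        """Undo a reservation; it has no failure path and returns nothing."""
        with self._lock:
            self.count -= 1


def attach(groups):
    """Reserve a slot in every group, rolling back if any group refuses."""
    charged = []
    for group in groups:
        if not group.try_charge():
            for done in charged:
                done.uncharge()
            return False
        charged.append(group)
    return True
```

Each group holds only its own lock and never holds it across a call into
another group, which matches the concern about deadlocks raised in the notes.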

> 4. Binary statistics - The question about binary statistics was raised. Since
> control groups don't enforce any particular kind of API, is there a way to
> generically handle control files and their parameters in the library? Paul
> suggested his binary API approach, where every control group and it's API is
> documented in an api file. Eric suggested using an ASCII interface (since that
> is very generic) and using one file per API. Balbir mentioned that this will
> lead to too many dentries and issues related to having extensive number of dentries.

If too many dentries cause trouble, shouldn't we fix that problem itself?
I feel a binary interface is only a workaround.

But if any cgroup needs an atomic operation and it is difficult to
implement on a sysfs-like interface, I'll advocate the binary API.

> 5. User space notifications - Kamezawa had requested for user space notification
> (through inotify) when a control group reaches it's memory limit for example.
> The questions that were asked were, what happens if no one is listening in on
> notifications? Denis suggested using a FIFO mechanism. Balbir suggested using
> netlinks and building stuff on top of cgroupstats. With netlink we can pass
> type, value and length of arguments, making it more suitable for this kind of
> information exchange. The only concern with netlink is that it can lose
> messages. The general consensus was to add one FIFO per control group and use
> that for all notifications related to the control group.

At the least, HPC-like batch systems need some notification (e.g. when
elapsed time, CPU time, or memory consumption is exceeded).

In addition, some embedded people want a userland OOM manager.
It gets a notification when system memory runs short and shrinks
process memory appropriately, because the kernel can't know how much
droppable cache a user process has
(e.g. browser cache, free lists in malloc, GUI bitmap cache).

If we consider system memory shortage, a FIFO is not such a good idea:
it accelerates memory starvation further.
And netlink uses kmalloc, so it doesn't work properly
in a memory starvation state.

But should we think about that case?

btw, I guess Peter Zijlstra's memory reservation framework can solve
the above netlink issue, but I'm not sure about it.
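
For reference, the "type, value and length" style of message mentioned in the
quoted notes can be sketched as below. This is a simplified, hypothetical
framing for illustration only; it is not the real netlink attribute layout,
and the event type constant is made up.

```python
import struct

# Hypothetical header: 16-bit type followed by 16-bit payload length,
# both in network byte order.
HEADER = struct.Struct("!HH")
EVENT_MEMORY_LIMIT = 1  # hypothetical event type


def encode(event_type, payload):
    """Frame one notification as header plus raw payload bytes."""
    return HEADER.pack(event_type, len(payload)) + payload


def decode_all(buffer):
    """Split a byte string into a list of (type, payload) records."""
    records = []
    offset = 0
    while offset < len(buffer):
        event_type, length = HEADER.unpack_from(buffer, offset)
        offset += HEADER.size
        records.append((event_type, buffer[offset:offset + length]))
        offset += length
    return records
```

The same framing could be carried over a per-group FIFO or over netlink; the
memory starvation concern above applies to the transport, not to the framing.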

> 4. CPU controller - There was a request for hard limit feature. Peter opposed
> the approach stating that anyone wanting hard limits should use the real time
> group scheduler and a new EDF scheduler is being implemented. Denis mentioned
> that without hard limits it is not possible for a service provider to
> decide/plan how much capacity a single CPU can provide. Balbir mentioned that
> with hard limits and SLA's the service provider could on reaching the hard limit
> can save power by hard limiting execution on a CPU that is meeting its SLA
> requirements. Peter mentioned that hard limits would make the group scheduler,
> non work conserving.

What's SLA?

> 5. Kernel memory controller - The kernel memory controller was discussed
> briefly. Pavel has not been actively working on it. Denis mentioned that it
> would be nice to have a network buffer controller as well. Questions were asked
> if the kernel memory controller should be merged with the existing memory
> controller?

I hope it is not merged.
I think network buffer control is useful, but the kernel memory
controller is not, because it requires the administrator to know the
kernel implementation, and that is too difficult.

A Swiss-army-knife approach pushes every problem down onto the
administrator.

I know embedded people like the kernel memory controller,
because they know the kernel internals very well and
they don't want to create a custom kernel.
But is that a general assumption?

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [Libcg-devel] Control groups and Resource Management notes (part I)
       [not found] ` <489315B2.2080506-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
  2008-08-05  7:06   ` Control groups and Resource Management notes (part I) KOSAKI Motohiro
@ 2008-08-06 17:38   ` Dhaval Giani
  1 sibling, 0 replies; 11+ messages in thread
From: Dhaval Giani @ 2008-08-06 17:38 UTC (permalink / raw)
  To: Balbir Singh; +Cc: Linux Containers, menage-hpIqsD4AKlfQT0dZR+AlfA, Balaji Rao

On Fri, Aug 01, 2008 at 07:24:58PM +0530, Balbir Singh wrote:

> 4. Binary statistics - The question about binary statistics was raised. Since
> control groups don't enforce any particular kind of API, is there a way to
> generically handle control files and their parameters in the library? Paul
> suggested his binary API approach, where every control group and it's API is
> documented in an api file. Eric suggested using an ASCII interface (since that
> is very generic) and using one file per API. Balbir mentioned that this will
> lead to too many dentries and issues related to having extensive number of dentries.

So where are we heading with respect to this issue? Was any consensus
reached? I plan to look at enabling statistics in libcgroup soon. I also
believe Balaji is looking at cgroupstats.
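
If the ASCII route is taken, generic handling in the library could be as
simple as the following sketch, which assumes a control file made of
"name value" lines with integer values. The function name is hypothetical and
this is not libcgroup code.

```python
def parse_stat(text):
    """Parse "name value" lines into a dict of integers."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or unexpected lines
        name, value = fields
        stats[name] = int(value)
    return stats
```

For example, parse_stat("cache 4096\nrss 8192\n") gives a dict with the keys
"cache" and "rss".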

Thanks,
-- 
regards,
Dhaval

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2008-08-06 17:38 UTC | newest]

Thread overview: 11+ messages
2008-08-01 13:54 Control groups and Resource Management notes (part I) Balbir Singh
2008-08-01 13:54 ` Balbir Singh
2008-08-02  1:10 ` Control groups and Resource Management notes (part II) Balbir Singh
2008-08-05  7:45   ` KOSAKI Motohiro
     [not found]     ` <20080805160709.A88B.KOSAKI.MOTOHIRO-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
2008-08-05 13:30       ` [Libcg-devel] " Vivek Goyal
2008-08-05 13:30         ` Vivek Goyal
     [not found]         ` <20080805133007.GC15193-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2008-08-06  1:05           ` KAMEZAWA Hiroyuki
2008-08-06  1:05             ` KAMEZAWA Hiroyuki
2008-08-06 13:00             ` Vivek Goyal
     [not found] ` <489315B2.2080506-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2008-08-05  7:06   ` Control groups and Resource Management notes (part I) KOSAKI Motohiro
2008-08-06 17:38   ` [Libcg-devel] " Dhaval Giani
