* Control groups and Resource Management notes (part I)
@ 2008-08-01 13:54 ` Balbir Singh
0 siblings, 0 replies; 11+ messages in thread
From: Balbir Singh @ 2008-08-01 13:54 UTC (permalink / raw)
To: Linux Containers
Hi, All,
This is the first part of my notes from the resource management and control
groups discussion. I may have made mistakes while taking the notes or typing
them out; please feel free to correct them or send me corrections.
The notes are quite long, so they will come in installments.
Control Groups
==============
1. Multiphase locking - Paul brought up his multiphase locking design and
suggested approaches to implementing it. The problem with control groups
currently is that transactions cannot be committed atomically. If part of a
transaction fails (a can_attach() callback fails or returns an error), no
notification is sent to the groups that have already committed the transaction.
The suggested design includes
- Acquiring locks across callbacks - Balbir opposed this approach,
stating that it would make it easier for subsystems to deadlock.
Balbir instead suggested that each callback hold its own lock and
that an undo operation be added that cannot fail (returns void), since
uncharging usually succeeds. Dave suggested doing the undo without
holding any locks.
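To make the undo idea concrete, here is a minimal user-space sketch in Python.
It is an illustration only, not kernel code and not something presented at the
meeting: the Subsystem class, the cancel_attach() name and the numeric "cost"
are assumptions made for the example. Each subsystem takes only its own lock,
and the undo step returns nothing and cannot fail.

```python
import threading


class Subsystem:
    """Hypothetical stand-in for a cgroup subsystem (not a kernel API)."""

    def __init__(self, name, limit):
        self.name = name
        self.limit = limit
        self.charged = 0
        self.lock = threading.Lock()  # each subsystem holds only its own lock

    def can_attach(self, cost):
        # Phase 1: try to reserve the charge; this step is allowed to fail.
        with self.lock:
            if self.charged + cost > self.limit:
                return False
            self.charged += cost
            return True

    def cancel_attach(self, cost):
        # Undo: returns nothing and cannot fail; it only gives the charge back.
        with self.lock:
            self.charged -= cost


def attach(subsystems, cost):
    """Reserve in every subsystem, undoing earlier reservations on failure."""
    reserved = []
    for ss in subsystems:
        if not ss.can_attach(cost):
            # Roll back in reverse order; no lock is held across subsystems.
            for prev in reversed(reserved):
                prev.cancel_attach(cost)
            return False
        reserved.append(ss)
    return True
```

With a "memory" subsystem limited to 10 and a "cpu" subsystem limited to 1,
attach([memory, cpu], 5) returns False and leaves both charges at 0.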
2. Procs - Balbir and others have asked for an API to move all threads of a
process from one control group to another in one go. The question of doing it
in user space was raised. Doing it in user space is easy, but it can be
expensive (the threads are moved one by one, acquiring and releasing the cgroup
lock for every thread). What happens if another move is requested while a
partial move is in progress? Dave suggested that we have an abstract
aggregation so that we don't need to keep adding interfaces for every
aggregation. Balbir mentioned that the aggregations of interest are processes,
process groups and sessions, and the kernel already knows about these (there
are data structures linking all the elements together). Abstracting it is a
good idea, but hard to implement.
Paul asked what the behaviour should be if a process being moved has several
threads belonging to different cgroups. The answer that came up was that they
should all be migrated to the destination cgroup.
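For reference, the user-space approach looks roughly like the Python sketch
below. It assumes the usual layout (thread IDs listed under /proc/<pid>/task
and a per-cgroup "tasks" file); the function names and the proc_root parameter
are made up for the example. Note that it has exactly the weakness discussed
above: a thread created after the listing is read is not moved.

```python
import os


def list_threads(pid, proc_root="/proc"):
    """Return the thread IDs of `pid`, read from <proc_root>/<pid>/task."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    return sorted(int(name) for name in os.listdir(task_dir) if name.isdigit())


def move_process(pid, cgroup_dir, proc_root="/proc"):
    """Write every thread ID of `pid` to <cgroup_dir>/tasks, one at a time."""
    tasks_file = os.path.join(cgroup_dir, "tasks")
    moved = []
    for tid in list_threads(pid, proc_root):
        # One open/write per thread: each write is a separate move request,
        # which is why this approach gets expensive for many threads.
        with open(tasks_file, "a") as f:
            f.write(f"{tid}\n")
        moved.append(tid)
    return moved
```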
3. Cgroup lock - The cgroup lock is held at various places in the system. The
question is -- is cgroup_lock() becoming the next BKL? Several solutions were
discussed: making the lock per hierarchy or per cgroup, or using subsystem
locks. Paul mentioned that cgroups already use RCU.
4. Binary statistics - The question of binary statistics was raised. Since
control groups don't enforce any particular kind of API, is there a way to
generically handle control files and their parameters in the library? Paul
suggested his binary API approach, where every control group and its API is
documented in an api file. Eric suggested using an ASCII interface (since that
is very generic) with one file per API. Balbir mentioned that this would lead
to too many dentries and to the issues that come with a large number of
dentries.
5. User space notifications - Kamezawa had requested user space notification
(through inotify) when, for example, a control group reaches its memory limit.
The question that was asked was: what happens if no one is listening for
notifications? Denis suggested using a FIFO mechanism. Balbir suggested using
netlink and building on top of cgroupstats. With netlink we can pass the
type, value and length of arguments, making it more suitable for this kind of
information exchange. The only concern with netlink is that it can lose
messages. The general consensus was to add one FIFO per control group and use
that for all notifications related to the control group.
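As an illustration of what passing "type, value and length" means, here is a
small Python sketch of a type-length-value encoding. It is a simplified
example and not the actual netlink attribute layout; the attribute type
numbers (1 for a controller name, 2 for a limit in bytes) are invented for
the example.

```python
import struct

# Hypothetical header: 16-bit type, 16-bit payload length, little endian.
HEADER = struct.Struct("<HH")


def pack_tlv(attr_type, payload):
    """Encode one attribute, padding the payload to a 4-byte boundary."""
    pad = (-len(payload)) % 4
    return HEADER.pack(attr_type, len(payload)) + payload + b"\x00" * pad


def unpack_tlvs(buf):
    """Decode a buffer of concatenated attributes into (type, payload) pairs."""
    attrs = []
    offset = 0
    while offset + HEADER.size <= len(buf):
        attr_type, length = HEADER.unpack_from(buf, offset)
        start = offset + HEADER.size
        attrs.append((attr_type, buf[start:start + length]))
        # Skip the payload and its padding to reach the next header.
        offset = start + length + ((-length) % 4)
    return attrs
```

A "memory limit reached" event could then be sent as
pack_tlv(1, b"memory") + pack_tlv(2, struct.pack("<Q", limit)), and a listener
that does not know attribute type 2 can still skip over it using the length.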
Resource management
===================
1. Memory controller - Balbir mentioned that this is best discussed at the
memory controller BoF.
2. Device subsystem was discussed and it was decided that mount (filesystem)
namespace and device namespace are the best places to handle device subsystem
issues.
3. Memrlimit - Balbir discussed the memrlimit controller. Dave and Paul are
opposed to any limits based on virtual address space. Balbir mentioned
that it serves several purposes
a. It allows us to control swap usage
b. It allows us to build a generic rlimits infrastructure
c. It allows us to fail applications nicely
Paul mentioned that (c) was not useful since no applications handle it today.
Balbir disagreed that this argument was sufficient reason to prevent future
applications from handling malloc()/mmap() failure. Balbir also asked why
overcommit accounting was not useful.
There was general agreement that an mlock() controller would be useful.
4. CPU controller - There was a request for a hard limit feature. Peter opposed
the approach, stating that anyone wanting hard limits should use the real-time
group scheduler, and that a new EDF scheduler is being implemented. Denis
mentioned that without hard limits it is not possible for a service provider to
decide/plan how much capacity a single CPU can provide. Balbir mentioned that
with hard limits and SLAs, the service provider could save power by hard
limiting execution on a CPU that is already meeting its SLA requirements.
Peter mentioned that hard limits would make the group scheduler
non-work-conserving.
Peter also updated everyone about the new load balancing patches that will make
it into the next merge window.
5. Kernel memory controller - The kernel memory controller was discussed
briefly. Pavel has not been actively working on it. Denis mentioned that it
would be nice to have a network buffer controller as well. The question was
raised whether the kernel memory controller should be merged with the existing
memory controller.
6. Swap subsystem - Daisuke mentioned that the swap subsystem works well for
fundamental operations and that he posted a version of the patch three weeks
ago. The patch controls swap entries in order to control the swap usage of a
control group. Paul mentioned that Google has an internal patch to link swap
files to cpusets. Balbir asked Serge about his swap namespace patches. The swap
namespace is a different issue altogether (compared to the swap controller).
Currently the swap controller is a part of the memory controller. There has
been some discussion about making it an independent controller.
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
^ permalink raw reply [flat|nested] 11+ messages in thread
* Control groups and Resource Management notes (part II)
2008-08-01 13:54 ` Balbir Singh
(?)
@ 2008-08-02 1:10 ` Balbir Singh
2008-08-05 7:45 ` KOSAKI Motohiro
-1 siblings, 1 reply; 11+ messages in thread
From: Balbir Singh @ 2008-08-02 1:10 UTC (permalink / raw)
To: Linux Containers; +Cc: linux kernel mailing list, libcg-devel
Here's part II (part I can be found at
https://lists.linux-foundation.org/pipermail/containers/2008-August/012128.html)
Resource management (cont'd)
============================
7. Disk IO controller - There was a general discussion of the various disk IO
controllers
a. DM - IOBand
b. IO throttle
c. Anticipatory
d. CFQ
It was decided that it would be best for all the stakeholders to work together
and let Jens Axboe and the block layer experts figure out what would be right
for the Linux kernel.
8. Network traffic control - Paul discussed network traffic control and the
approach followed by Google. The existing classifier mechanism can easily be
extended by adding a classifier id (based on the control group). This is used
in combination with netfilter. Balbir mentioned that Thomas Graf was also
looking at something similar and raised the issue of input bandwidth control.
Balbir also pointed people to CKRM, where the solution has been implemented.
The OpenVZ and Google teams will post their patches.
9. Network permissions - There was a recommendation to use security hooks for
network permissions. Paul explained which operations they apply permissions to
a. connect
b. bind
c. accept
The issue of using netlabels was brought up.
10. Freezer subsystem - The freezer system was discussed briefly. Serge
mentioned the patches and wanted to collect feedback (if any) on them.
11. OOM Handler - The OOM handler was discussed in detail. Balbir mentioned
certain shortcomings of the OOM handler
a. Logic - it is based on total_vm; is that the correct metric for
OOMing?
b. Concurrency - it kills several tasks at once
There was a discussion on moving the policy for OOM handling to user space. Paul
described how the OOM handler has been modified at Google to notify user space
when a cpuset runs out of memory. Balbir asked if OOMing on reaching limits is a
good idea; the general feeling was that it might not be such a good idea.
Control group library
=====================
Dhaval and Balbir introduced libcgroup and explained the purpose and goals of
the library. Balbir described on paper what the current design looks like. It
consists of
1. API
2. Test framework
3. A configuration subsystem
Dhaval discussed the configuration syntax: XML versus a home-made format. The
issue of classification of tasks was brought up. The reason that we want to
classify tasks is that we want them to move to the correct cgroup at fork/exec
time so that
1. They don't consume resources in the parent's group
2. The movement is automatic
It was generally agreed that the classification should take place in user
space. Eric and others suggested having a wrapper to start the application in
the correct cgroup (a wrapper around fork/exec). Dave suggested that one might
even go to the extent of a hack in which a process is ptraced after fork/exec,
moved to the correct group and resumed. Using SELinux contexts was also
recommended.
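A wrapper of the kind Eric suggested could look roughly like the Python sketch
below. It is only an illustration: the function names are invented, and it
assumes that a task joins a cgroup by having its PID written to the cgroup's
"tasks" file. Because the wrapper joins the cgroup before calling exec, the
application never runs in the parent's group.

```python
import os


def enter_cgroup(cgroup_dir, pid=None):
    """Write `pid` (default: the calling process) to <cgroup_dir>/tasks."""
    pid = os.getpid() if pid is None else pid
    with open(os.path.join(cgroup_dir, "tasks"), "a") as f:
        f.write(f"{pid}\n")
    return pid


def run_in_cgroup(cgroup_dir, argv):
    """Join the cgroup first, then replace this process with the program."""
    enter_cgroup(cgroup_dir)
    # execvp() does not return on success; the program starts in the cgroup.
    os.execvp(argv[0], argv)
```

For example, run_in_cgroup("/cgroup/web", ["httpd"]) would start httpd inside
the (hypothetical) /cgroup/web group.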
Vivek brought up using PAM plugins to do the classification; this suggestion
was well received. The decision was to do classification in user space and
to consider kernel space only if it cannot be done in user space. Denis
mentioned that classification is useful: in OpenVZ they classify all apache
children into a different group. Balbir asked Denis to post their
classification infrastructure as an RFC.
Balbir asked for contributions to libcgroup. Libcgroup will affect system design
and both administrators and application administrators. Now is a good time to
get *involved*.
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Control groups and Resource Management notes (part II)
2008-08-02 1:10 ` Control groups and Resource Management notes (part II) Balbir Singh
@ 2008-08-05 7:45 ` KOSAKI Motohiro
[not found] ` <20080805160709.A88B.KOSAKI.MOTOHIRO-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
0 siblings, 1 reply; 11+ messages in thread
From: KOSAKI Motohiro @ 2008-08-05 7:45 UTC (permalink / raw)
To: balbir
Cc: kosaki.motohiro, Linux Containers, linux kernel mailing list,
libcg-devel
Hi balbir-san,
Thank you for the nice minutes.
They are very helpful for people who were not invited (including me).
> 10. Freezer subsystem - The freezer system was discussed briefly. Serge
> mentioned the patches and wanted to collect feedback (if any) on them.
Who uses it?
AFAIK the freezer is mostly used by HPC people,
but they need MPI processes to be frozen as well.
Unfortunately, open source MPI implementations use various inter-process
communication methods (e.g. SysV IPC, sockets, ptrace),
so a general freezer implementation is very difficult.
> 11. OOM Handler - The OOM handler was discussed in detail. Balbir mentioned
> certain short comings of the OOM handler
> a. Logic - it is based on total_vm, is that the correct metric for
> OOMing?
> b. Concurrency - it kills several tasks at once
>
> There was a discussion on moving the policy for OOM handling to user space. Paul
> described how the OOM handler has been modified at google to notify user space
> when a CPUSet runs out of memory. Balbir asked if OOMing on reaching limits is a
> good idea, it was generally discussed that it might not be such a good idea.
A cpuset-based limit is slightly harder to use;
a memcgroup-based one is better.
In addition, notification on reaching a limit can be made very generic.
Various limits (e.g. CPU time, memory usage), various notification actions
(e.g. kill the process, send a signal, inotify) and various targets
(each process in the cgroup, or a manager process) can be thought of.
> Control group library
> =====================
> Dhaval and Balbir introduced libcgroups and the purpose of the library and the
> goals. Balbir described on paper what the current design looks like, it consists of
>
> 1. API
> 2. Test framework
> 3. A configuration subsystem
>
> Dhaval discussed configuration syntax of XML versus home made. The issue of
> classification of tasks was brought up. The reason that we want to classify
> tasks is that we want them to move at fork/exec time to the correct cgroup so that
I don't recommend XML, because XML is a tree-based syntax but we want more
flexible classification. I also guess XML reduces human readability.
> 1. They don't consume resources in the parents group
> 2. The movement is automatic
>
> It was generally agreed upon that the classification should take place in user
> space. Eric and others suggested having a wrapper to start the application in
> the correct cgroup (wrapper around fork/exec). Dave suggested that one might
> even go the extent of hacking, such that a process is ptraced after fork/exec,
> moved to the correct group and resumed. Using SELinux contexts was also recommended.
>
> Vivek brought up using PAM plugins to do classifications, this suggestion was
> nicely received. The decision was to do classification in user space and then
> think of kernel space if it cannot be done in user space. Denis suggested that
> classification is useful. In OpenVZ they classify all apache children to a
> different group. Balbir asked Denis to post their classification infrastructure
> as RFC.
I'm not sure about this issue,
but I like the PAM approach.
^ permalink raw reply [flat|nested] 11+ messages in thread
[parent not found: <489315B2.2080506-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>]
* Re: Control groups and Resource Management notes (part I)
[not found] ` <489315B2.2080506-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
@ 2008-08-05 7:06 ` KOSAKI Motohiro
2008-08-06 17:38 ` [Libcg-devel] " Dhaval Giani
1 sibling, 0 replies; 11+ messages in thread
From: KOSAKI Motohiro @ 2008-08-05 7:06 UTC (permalink / raw)
To: balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8; +Cc: Linux Containers
Hi
nice minutes!
below is just my note.
> Control Groups
> ==============
>
> 1. Multiphase locking - Paul brought up his multi phase locking design and
> suggested approaches to implementing them. The problem with control groups
> currently is that transactions cannot be atomically committed. If some
> transactions fail (can_attach() callback fails or returns error), then there is
> no notification sent out to groups that already committed the transaction
>
> The suggested design includes
> - Acquiring locks across callbacks - Balbir opposed this approach
> stating that this would make it easier for subsystems to deadlock.
> Balbir instead suggested that each callback hold it's own lock and
> add an undo operation that cannot fail (returns void), since
> uncharging usually succeeds. Dave suggested doing undo without holding
> any locks.
The task_limit cgroup has one problem related to atomicity.
task_limit checks the number of tasks when can_attach() is called and
increments the number of tasks when attach() is called.
Thus, it has a race: if two attach operations run in parallel, the number of
tasks can exceed the task limit.
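The race can be replayed in a few lines of Python. This is only a user-space
illustration with invented names (TaskLimit, try_reserve), not the task_limit
code itself. The first function replays the bad ordering by hand; try_reserve()
shows one possible alternative, where the check and the increment happen under
one lock.

```python
import threading


class TaskLimit:
    """Hypothetical model of a task-count limit (not the kernel code)."""

    def __init__(self, limit):
        self.limit = limit
        self.count = 0
        self.lock = threading.Lock()

    def can_attach(self):
        # Check only; nothing is reserved for the caller.
        return self.count < self.limit

    def attach(self):
        # Increment only; the earlier check may be stale by now.
        self.count += 1

    def try_reserve(self):
        # Alternative: check and increment as a single atomic step.
        with self.lock:
            if self.count >= self.limit:
                return False
            self.count += 1
            return True


def racy_interleaving(group):
    """Replay the bad ordering: both checks run before either increment."""
    first_ok = group.can_attach()
    second_ok = group.can_attach()
    if first_ok:
        group.attach()
    if second_ok:
        group.attach()
    return group.count
```

With a limit of 1, racy_interleaving() ends with a count of 2, while two
calls to try_reserve() admit only the first task.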
> 4. Binary statistics - The question about binary statistics was raised. Since
> control groups don't enforce any particular kind of API, is there a way to
> generically handle control files and their parameters in the library? Paul
> suggested his binary API approach, where every control group and it's API is
> documented in an api file. Eric suggested using an ASCII interface (since that
> is very generic) and using one file per API. Balbir mentioned that this will
> lead to too many dentries and issues related to having extensive number of dentries.
If too many dentries cause trouble, shouldn't we attack that problem itself?
I feel a binary interface is a detour rather than a solution.
But if any cgroup needs an atomic operation whose implementation is
difficult with a sysfs-like interface, I'll advocate the binary API.
> 5. User space notifications - Kamezawa had requested for user space notification
> (through inotify) when a control group reaches it's memory limit for example.
> The questions that were asked were, what happens if no one is listening in on
> notifications? Denis suggested using a FIFO mechanism. Balbir suggested using
> netlinks and building stuff on top of cgroupstats. With netlink we can pass
> type, value and length of arguments, making it more suitable for this kind of
> information exchange. The only concern with netlink is that it can lose
> messages. The general consensus was to add one FIFO per control group and use
> that for all notifications related to the control group.
At the least, HPC-like batch systems need some notifications (e.g. elapsed
time, CPU time or memory consumption exceeded).
In addition, some embedded people want a userland OOM manager.
It gets a notification when system memory runs short and shrinks process
memory appropriately,
because the kernel can't know how much droppable cache a user process has
(e.g. browser cache, free lists in malloc, GUI bitmap cache).
If we consider system memory shortage, a FIFO is not such a good idea:
it accelerates memory starvation further.
And netlink uses some kmalloc, so it doesn't work properly
in a memory starvation state.
But should we think about that case?
Btw, I guess Peter Zijlstra's memory reservation framework can solve
the above netlink issue, but I'm not sure.
> 4. CPU controller - There was a request for hard limit feature. Peter opposed
> the approach stating that anyone wanting hard limits should use the real time
> group scheduler and a new EDF scheduler is being implemented. Denis mentioned
> that without hard limits it is not possible for a service provider to
> decide/plan how much capacity a single CPU can provide. Balbir mentioned that
> with hard limits and SLA's the service provider could on reaching the hard limit
> can save power by hard limiting execution on a CPU that is meeting its SLA
> requirements. Peter mentioned that hard limits would make the group scheduler,
> non work conserving.
What's SLA?
> 5. Kernel memory controller - The kernel memory controller was discussed
> briefly. Pavel has not been actively working on it. Denis mentioned that it
> would be nice to have a network buffer controller as well. Questions were asked
> if the kernel memory controller should be merged with the existing memory
> controller?
I hope it is not merged.
I think network buffer control is useful, but a kernel memory controller is not,
because it requires the administrator to know the kernel implementation,
and that is too difficult.
A Swiss-army-knife-like approach pushes every trouble onto the administrator.
I know embedded people like the kernel memory controller,
because they know the kernel internals very well and
they don't want to create a custom kernel.
But is that a general assumption?
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [Libcg-devel] Control groups and Resource Management notes (part I)
[not found] ` <489315B2.2080506-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
2008-08-05 7:06 ` Control groups and Resource Management notes (part I) KOSAKI Motohiro
@ 2008-08-06 17:38 ` Dhaval Giani
1 sibling, 0 replies; 11+ messages in thread
From: Dhaval Giani @ 2008-08-06 17:38 UTC (permalink / raw)
To: Balbir Singh; +Cc: Linux Containers, menage-hpIqsD4AKlfQT0dZR+AlfA, Balaji Rao
On Fri, Aug 01, 2008 at 07:24:58PM +0530, Balbir Singh wrote:
> 4. Binary statistics - The question about binary statistics was raised. Since
> control groups don't enforce any particular kind of API, is there a way to
> generically handle control files and their parameters in the library? Paul
> suggested his binary API approach, where every control group and it's API is
> documented in an api file. Eric suggested using an ASCII interface (since that
> is very generic) and using one file per API. Balbir mentioned that this will
> lead to too many dentries and issues related to having extensive number of dentries.
So where are we heading with respect to this issue? Was any consensus
reached? I plan to look at enabling statistics in libcgroup soon. I
believe Balaji is also looking at cgroupstats.
Thanks,
--
regards,
Dhaval
^ permalink raw reply [flat|nested] 11+ messages in thread