diff for duplicates of <489315B2.2080506@linux.vnet.ibm.com>

diff --git a/a/1.txt b/N1/1.txt
index 069dbc1..2026558 100644
--- a/a/1.txt
+++ b/N1/1.txt
@@ -1,121 +1,241 @@
 Hi, All,
 
+
+
 This is the first part of the resource management and control groups discussion.
+
 I might have made mistakes while taking notes or typing them out, please feel
+
 free to correct them for me or send me corrections.
 
+
+
 The notes are really large, so they'll come in installments. This is the first
+
 part of the notes.
 
+
+
 Control Groups
+
 ==============
 
+
+
 1. Multiphase locking - Paul brought up his multi phase locking design and
+
 suggested approaches to implementing them. The problem with control groups
+
 currently is that transactions cannot be atomically committed. If some
+
 transactions fail (can_attach() callback fails or returns error), then there is
+
 no notification sent out to groups that already committed the transaction
 
+
+
 The suggested design includes
+
 	- Acquiring locks across callbacks - Balbir opposed this approach
+
   stating that this would make it easier for subsystems to deadlock.
+
   Balbir instead suggested that each callback hold it's own lock and
+
   add an undo operation that cannot fail (returns void), since
+
   uncharging usually succeeds. Dave suggested doing undo without holding
+
   any locks.
 
+
+
 2. Procs - Balbir and others have asked for an API to move all threads of a
+
 process in one go from one control group to another. The question about doing it
+
 in user space was asked. Doing it in user space is easy, but it can be expensive
+
 (moving all threads one by one - acquiring the cgroup lock and releasing it for
+
 every thread). What happens if another move is requested while a partial move is
+
 in progress? Dave suggested that we have an abstract aggregation so that we
+
 don't need to keep adding interfaces for every aggregation. Balbir mentioned
+
 that the aggregation of interest are process, process groups and sessions and
+
 the kernel already knows about these (there are data structures to link all
+
 elements together). Abstracting it is a good idea, but hard to implement.
 
+
+
 Paul asked what the behaviour should be, if a process being moved has several
+
 threads belong to different cgroups. The answer that came up was that they
+
 should all be migrated to the destination cgroup
 
+
+
 3. Cgroup lock - The cgroup lock is held at various places in the system. The
+
 question is -- is cgroup_lock() becoming the next BKL? Several solutions were
+
 discussed - making the lock per hierarchy or per cgroup or use subsystem locks.
+
 Paul mentioned that cgroups already use RCU.
 
+
+
 4. Binary statistics - The question about binary statistics was raised. Since
+
 control groups don't enforce any particular kind of API, is there a way to
+
 generically handle control files and their parameters in the library? Paul
+
 suggested his binary API approach, where every control group and it's API is
+
 documented in an api file. Eric suggested using an ASCII interface (since that
+
 is very generic) and using one file per API. Balbir mentioned that this will
+
 lead to too many dentries and issues related to having extensive number of dentries.
 
+
+
 5. User space notifications - Kamezawa had requested for user space notification
+
 (through inotify) when a control group reaches it's memory limit for example.
+
 The questions that were asked were, what happens if no one is listening in on
+
 notifications? Denis suggested using a FIFO mechanism. Balbir suggested using
+
 netlinks and building stuff on top of cgroupstats. With netlink we can pass
+
 type, value and length of arguments, making it more suitable for this kind of
+
 information exchange. The only concern with netlink is that it can lose
+
 messages. The general consensus was to add one FIFO per control group and use
+
 that for all notifications related to the control group.
 
+
+
 Resource management
+
 ===================
+
 1. Memory controller - Balbir mentioned that this is best discussed at the
+
 memory controller BoF
+
 2. Device subsystem was discussed and it was decided that mount (filesystem)
+
 namespace and device namespace are the best places to handle device subsystem
+
 issues.
+
 3. Memrlimit - Balbir discussed the memrlimit controller. Dave and Paul are
+
 opposed to doing any limits based on virtual address space. Balbir mentioned
+
 that it serves several purposes
 
+
+
 a. It allows us to control swap usage
+
 b. It allows us to build a generic rlimits infrastructure
+
 c. It allows us to fail applications nicely
 
+
+
 Paul mentioned that (c) was not useful since no applications handle it today.
+
 Balbir disagreed with that argument as being sufficient to prevent future
+
 applications to handle malloc()/mmap() failure. Balbir asked why overcommit
+
 accounting was not useful?
 
+
+
 There was general agreement that a mlock() controller would be useful.
 
+
+
 4. CPU controller - There was a request for hard limit feature. Peter opposed
+
 the approach stating that anyone wanting hard limits should use the real time
+
 group scheduler and a new EDF scheduler is being implemented. Denis mentioned
+
 that without hard limits it is not possible for a service provider to
+
 decide/plan how much capacity a single CPU can provide. Balbir mentioned that
+
 with hard limits and SLA's the service provider could on reaching the hard limit
+
 can save power by hard limiting execution on a CPU that is meeting its SLA
+
 requirements. Peter mentioned that hard limits would make the group scheduler,
+
 non work conserving.
 
+
+
 Peter also updated everyone about the new load balancing patches that will make
+
 it into the next merge window.
 
+
+
 5. Kernel memory controller - The kernel memory controller was discussed
+
 briefly. Pavel has not been actively working on it. Denis mentioned that it
+
 would be nice to have a network buffer controller as well. Questions were asked
+
 if the kernel memory controller should be merged with the existing memory
+
 controller?
 
+
+
 6. Swap subsystem - Daisuke mentioned that the swap subsystem works well for
+
 fundamental operations and that he posted a version of the patch three weeks
+
 ago. The patch controls swap entries to control the swap usage of a control
+
 group. Paul mentioned that google has a patch internally to link swap files to
+
 cpusets. Balbir asked Serge about his swap namespace patches. The swap namespace
+
 is a different issue all together (compared to the swap controller). Currently
+
 the swap controller is a part of the memory controller. There has been some
+
 discussion about it being an independent controller.
 
 
 
+
+
+
+
 -- 
+
 	Warm Regards,
+
 	Balbir Singh
+
 	Linux Technology Center
+
 	IBM, ISTL
diff --git a/a/content_digest b/N1/content_digest
index c115a4a..b68256d 100644
--- a/a/content_digest
+++ b/N1/content_digest
@@ -1,129 +1,249 @@
- "From\0Balbir Singh <balbir-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>\0"
+ "From\0Balbir Singh <balbir@linux.vnet.ibm.com>\0"
  "Subject\0Control groups and Resource Management notes (part I)\0"
  "Date\0Fri, 01 Aug 2008 19:24:58 +0530\0"
- "To\0Linux Containers <containers-qjLDD68F18O7TbgM5vRIOg@public.gmane.org>\0"
+ "To\0Linux Containers <containers@lists.osdl.org>\0"
  "\00:1\0"
  "b\0"
  "Hi, All,\n"
  "\n"
+ "\n"
+ "\n"
  "This is the first part of the resource management and control groups discussion.\n"
+ "\n"
  "I might have made mistakes while taking notes or typing them out, please feel\n"
+ "\n"
  "free to correct them for me or send me corrections.\n"
  "\n"
+ "\n"
+ "\n"
  "The notes are really large, so they'll come in installments. This is the first\n"
+ "\n"
  "part of the notes.\n"
  "\n"
+ "\n"
+ "\n"
  "Control Groups\n"
+ "\n"
  "==============\n"
  "\n"
+ "\n"
+ "\n"
  "1. Multiphase locking - Paul brought up his multi phase locking design and\n"
+ "\n"
  "suggested approaches to implementing them. The problem with control groups\n"
+ "\n"
  "currently is that transactions cannot be atomically committed. If some\n"
+ "\n"
  "transactions fail (can_attach() callback fails or returns error), then there is\n"
+ "\n"
  "no notification sent out to groups that already committed the transaction\n"
  "\n"
+ "\n"
+ "\n"
  "The suggested design includes\n"
+ "\n"
  "\t- Acquiring locks across callbacks - Balbir opposed this approach\n"
+ "\n"
  " stating that this would make it easier for subsystems to deadlock.\n"
+ "\n"
  " Balbir instead suggested that each callback hold it's own lock and\n"
+ "\n"
  " add an undo operation that cannot fail (returns void), since\n"
+ "\n"
  " uncharging usually succeeds. Dave suggested doing undo without holding\n"
+ "\n"
  " any locks.\n"
  "\n"
+ "\n"
+ "\n"
  "2. Procs - Balbir and others have asked for an API to move all threads of a\n"
+ "\n"
  "process in one go from one control group to another. The question about doing it\n"
+ "\n"
  "in user space was asked. Doing it in user space is easy, but it can be expensive\n"
+ "\n"
  "(moving all threads one by one - acquiring the cgroup lock and releasing it for\n"
+ "\n"
  "every thread). What happens if another move is requested while a partial move is\n"
+ "\n"
  "in progress? Dave suggested that we have an abstract aggregation so that we\n"
+ "\n"
  "don't need to keep adding interfaces for every aggregation. Balbir mentioned\n"
+ "\n"
  "that the aggregation of interest are process, process groups and sessions and\n"
+ "\n"
  "the kernel already knows about these (there are data structures to link all\n"
+ "\n"
  "elements together). Abstracting it is a good idea, but hard to implement.\n"
  "\n"
+ "\n"
+ "\n"
  "Paul asked what the behaviour should be, if a process being moved has several\n"
+ "\n"
  "threads belong to different cgroups. The answer that came up was that they\n"
+ "\n"
  "should all be migrated to the destination cgroup\n"
  "\n"
+ "\n"
+ "\n"
  "3. Cgroup lock - The cgroup lock is held at various places in the system. The\n"
+ "\n"
  "question is -- is cgroup_lock() becoming the next BKL? Several solutions were\n"
+ "\n"
  "discussed - making the lock per hierarchy or per cgroup or use subsystem locks.\n"
+ "\n"
  "Paul mentioned that cgroups already use RCU.\n"
  "\n"
+ "\n"
+ "\n"
  "4. Binary statistics - The question about binary statistics was raised. Since\n"
+ "\n"
  "control groups don't enforce any particular kind of API, is there a way to\n"
+ "\n"
  "generically handle control files and their parameters in the library? Paul\n"
+ "\n"
  "suggested his binary API approach, where every control group and it's API is\n"
+ "\n"
  "documented in an api file. Eric suggested using an ASCII interface (since that\n"
+ "\n"
  "is very generic) and using one file per API. Balbir mentioned that this will\n"
+ "\n"
  "lead to too many dentries and issues related to having extensive number of dentries.\n"
  "\n"
+ "\n"
+ "\n"
  "5. User space notifications - Kamezawa had requested for user space notification\n"
+ "\n"
  "(through inotify) when a control group reaches it's memory limit for example.\n"
+ "\n"
  "The questions that were asked were, what happens if no one is listening in on\n"
+ "\n"
  "notifications? Denis suggested using a FIFO mechanism. Balbir suggested using\n"
+ "\n"
  "netlinks and building stuff on top of cgroupstats. With netlink we can pass\n"
+ "\n"
  "type, value and length of arguments, making it more suitable for this kind of\n"
+ "\n"
  "information exchange. The only concern with netlink is that it can lose\n"
+ "\n"
  "messages. The general consensus was to add one FIFO per control group and use\n"
+ "\n"
  "that for all notifications related to the control group.\n"
  "\n"
+ "\n"
+ "\n"
  "Resource management\n"
+ "\n"
  "===================\n"
+ "\n"
  "1. Memory controller - Balbir mentioned that this is best discussed at the\n"
+ "\n"
  "memory controller BoF\n"
+ "\n"
  "2. Device subsystem was discussed and it was decided that mount (filesystem)\n"
+ "\n"
  "namespace and device namespace are the best places to handle device subsystem\n"
+ "\n"
  "issues.\n"
+ "\n"
  "3. Memrlimit - Balbir discussed the memrlimit controller. Dave and Paul are\n"
+ "\n"
  "opposed to doing any limits based on virtual address space. Balbir mentioned\n"
+ "\n"
  "that it serves several purposes\n"
  "\n"
+ "\n"
+ "\n"
  "a. It allows us to control swap usage\n"
+ "\n"
  "b. It allows us to build a generic rlimits infrastructure\n"
+ "\n"
  "c. It allows us to fail applications nicely\n"
  "\n"
+ "\n"
+ "\n"
  "Paul mentioned that (c) was not useful since no applications handle it today.\n"
+ "\n"
  "Balbir disagreed with that argument as being sufficient to prevent future\n"
+ "\n"
  "applications to handle malloc()/mmap() failure. Balbir asked why overcommit\n"
+ "\n"
  "accounting was not useful?\n"
  "\n"
+ "\n"
+ "\n"
  "There was general agreement that a mlock() controller would be useful.\n"
  "\n"
+ "\n"
+ "\n"
  "4. CPU controller - There was a request for hard limit feature. Peter opposed\n"
+ "\n"
  "the approach stating that anyone wanting hard limits should use the real time\n"
+ "\n"
  "group scheduler and a new EDF scheduler is being implemented. Denis mentioned\n"
+ "\n"
  "that without hard limits it is not possible for a service provider to\n"
+ "\n"
  "decide/plan how much capacity a single CPU can provide. Balbir mentioned that\n"
+ "\n"
  "with hard limits and SLA's the service provider could on reaching the hard limit\n"
+ "\n"
  "can save power by hard limiting execution on a CPU that is meeting its SLA\n"
+ "\n"
  "requirements. Peter mentioned that hard limits would make the group scheduler,\n"
+ "\n"
  "non work conserving.\n"
  "\n"
+ "\n"
+ "\n"
  "Peter also updated everyone about the new load balancing patches that will make\n"
+ "\n"
  "it into the next merge window.\n"
  "\n"
+ "\n"
+ "\n"
  "5. Kernel memory controller - The kernel memory controller was discussed\n"
+ "\n"
  "briefly. Pavel has not been actively working on it. Denis mentioned that it\n"
+ "\n"
  "would be nice to have a network buffer controller as well. Questions were asked\n"
+ "\n"
  "if the kernel memory controller should be merged with the existing memory\n"
+ "\n"
  "controller?\n"
  "\n"
+ "\n"
+ "\n"
  "6. Swap subsystem - Daisuke mentioned that the swap subsystem works well for\n"
+ "\n"
  "fundamental operations and that he posted a version of the patch three weeks\n"
+ "\n"
  "ago. The patch controls swap entries to control the swap usage of a control\n"
+ "\n"
  "group. Paul mentioned that google has a patch internally to link swap files to\n"
+ "\n"
  "cpusets. Balbir asked Serge about his swap namespace patches. The swap namespace\n"
+ "\n"
  "is a different issue all together (compared to the swap controller). Currently\n"
+ "\n"
  "the swap controller is a part of the memory controller. There has been some\n"
+ "\n"
  "discussion about it being an independent controller.\n"
  "\n"
  "\n"
  "\n"
+ "\n"
+ "\n"
+ "\n"
+ "\n"
  "-- \n"
+ "\n"
  "\tWarm Regards,\n"
+ "\n"
  "\tBalbir Singh\n"
+ "\n"
  "\tLinux Technology Center\n"
+ "\n"
  "\tIBM, ISTL"
 
-7eae0cb133fbf36c642079ace58a9085dfc5c85dae7e4f55ee1375200d057061
+fe2084aac6c42b5a04f2929bc7f860dd5b8be866ef5867ae0d5a39c3c7c0136f
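
Item 2 (Procs) of the notes quoted in the diff above says that moving all threads of a process from user space is easy but costs one cgroup operation per thread. A minimal sketch of that loop follows, assuming the cgroup v1 layout in which each cgroup directory exposes a `tasks` file that accepts one thread ID per write. The function name `move_all_threads` and the `proc_root` parameter are hypothetical, introduced only so the sketch can be run against ordinary directories; this is an illustration, not code from the discussion.

```python
import os


def move_all_threads(pid, tasks_file, proc_root="/proc"):
    """Write every thread ID of `pid` into a cgroup v1 `tasks` file.

    Hypothetical helper: `tasks_file` is assumed to be the `tasks` file of
    the destination cgroup, and `proc_root` exists only to make testing easy.
    Returns the list of thread IDs that were written.
    """
    task_dir = os.path.join(proc_root, str(pid), "task")
    moved = []
    # /proc/<pid>/task holds one directory per thread, named by its TID.
    for tid in sorted(os.listdir(task_dir), key=int):
        # One open/write/close per thread, so each move is a separate
        # operation. Append mode is used so the sketch also behaves
        # sensibly when pointed at a regular file.
        with open(tasks_file, "a") as f:
            f.write(tid + "\n")
        moved.append(int(tid))
    return moved
```

Each thread costs a separate write, and any thread created while the loop is running is missed, which is the partial-move problem the notes raise.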