* Re: [patch00/05]: Containers(V2)- Introduction
2006-09-20 16:25 ` Christoph Lameter
@ 2006-09-20 16:44 ` Nick Piggin
2006-09-20 16:48 ` Christoph Lameter
2006-09-20 17:26 ` Rohit Seth
` (3 subsequent siblings)
4 siblings, 1 reply; 125+ messages in thread
From: Nick Piggin @ 2006-09-20 16:44 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Rohit Seth, CKRM-Tech, devel, pj, linux-kernel
On Wed, Sep 20, 2006 at 09:25:03AM -0700, Christoph Lameter wrote:
> On Tue, 19 Sep 2006, Rohit Seth wrote:
>
> > For example, a user can run a batch job like backup inside containers.
> > This job, if run unconstrained, could step over most of the memory present
> > in the system, impacting other workloads running on the system at that
> > time. But when the same job is run inside a container, the backup
> > job runs within the container's limits.
>
> I just saw this for the first time since linux-mm was not cced. We have
> discussed a similar mechanism on linux-mm.
>
> We already have such functionality in the kernel; it's called a cpuset. A
> container could be created simply by creating a fake node that then
> allows constraining applications to this node. We already track the
> types of pages per node. The statistics you want already exist.
> See /proc/zoneinfo and /sys/devices/system/node/node*/*.
>
> > We use the term container to indicate a structure against which we track
> > and charge utilization of system resources like memory, tasks etc for a
> > workload. Containers will allow system admins to customize the
> > underlying platform for different applications based on their
> > performance and HW resource utilization needs. Containers contain
> > enough infrastructure to allow optimal resource utilization without
> bogging down the rest of the kernel. A system admin should be able to
> > create, manage and free containers easily.
>
> Right, that's what cpusets do, and they have been working fine for years. Maybe
> Paul can help you if you find anything missing in the existing means to
> control resources.
What I like about Rohit's patches is the page tracking stuff which
seems quite simple but capable.
I suspect cpusets don't quite provide enough features for non-exclusive
use of memory (eg. page tracking for directed reclaim).
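To make that concrete, here is a compilable toy model of the kind of
per-page tracking being described -- struct and function names are
invented for illustration, not taken from the patches. Each page
records the container it was charged to, which is exactly the
information a directed reclaim pass would need:

#include <stdio.h>

/*
 * Toy model of per-container page tracking -- all names invented
 * for illustration, not lifted from the patches.
 */
struct container {
	const char *name;
	long pages_used;	/* pages currently charged to us */
	long page_limit;	/* hard cap for this container */
};

struct page {
	struct container *ctn;	/* who was charged for this page */
};

/* Returns 0 on success; the real thing would try directed reclaim
 * against c's pages before failing. */
static int container_charge(struct container *c, struct page *pg)
{
	if (c->pages_used >= c->page_limit)
		return -1;
	c->pages_used++;
	pg->ctn = c;
	return 0;
}

static void container_uncharge(struct page *pg)
{
	if (pg->ctn) {
		pg->ctn->pages_used--;
		pg->ctn = NULL;
	}
}

int main(void)
{
	struct container backup = { "backup", 0, 2 };
	struct page p1 = { NULL }, p2 = { NULL }, p3 = { NULL };

	container_charge(&backup, &p1);
	container_charge(&backup, &p2);
	if (container_charge(&backup, &p3) < 0)
		printf("over limit: reclaim from '%s'\n", backup.name);
	container_uncharge(&p1);
	printf("in use: %ld of %ld pages\n", backup.pages_used,
	       backup.page_limit);
	return 0;
}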
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [patch00/05]: Containers(V2)- Introduction
2006-09-20 16:44 ` Nick Piggin
@ 2006-09-20 16:48 ` Christoph Lameter
2006-09-20 17:07 ` Nick Piggin
0 siblings, 1 reply; 125+ messages in thread
From: Christoph Lameter @ 2006-09-20 16:48 UTC (permalink / raw)
To: Nick Piggin; +Cc: Rohit Seth, CKRM-Tech, devel, pj, linux-kernel
On Wed, 20 Sep 2006, Nick Piggin wrote:
> > Right, that's what cpusets do, and they have been working fine for years. Maybe
> > Paul can help you if you find anything missing in the existing means to
> > control resources.
>
> What I like about Rohit's patches is the page tracking stuff which
> seems quite simple but capable.
>
> I suspect cpusets don't quite provide enough features for non-exclusive
> use of memory (eg. page tracking for directed reclaim).
Look at the VM statistics please. We have detailed page statistics per
zone these days. If there is anything missing then this would best be put
into general functionality. When I looked at it, I saw page statistics
that were replicating things that we already track per zone. All these
would become available if a container is realized via a node and we would
be using proven VM code.
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [patch00/05]: Containers(V2)- Introduction
2006-09-20 16:48 ` Christoph Lameter
@ 2006-09-20 17:07 ` Nick Piggin
2006-09-20 17:12 ` Christoph Lameter
0 siblings, 1 reply; 125+ messages in thread
From: Nick Piggin @ 2006-09-20 17:07 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Rohit Seth, CKRM-Tech, devel, pj, linux-kernel
On Wed, Sep 20, 2006 at 09:48:13AM -0700, Christoph Lameter wrote:
> On Wed, 20 Sep 2006, Nick Piggin wrote:
>
> > > Right, that's what cpusets do, and they have been working fine for years. Maybe
> > > Paul can help you if you find anything missing in the existing means to
> > > control resources.
> >
> > What I like about Rohit's patches is the page tracking stuff which
> > seems quite simple but capable.
> >
> > I suspect cpusets don't quite provide enough features for non-exclusive
> > use of memory (eg. page tracking for directed reclaim).
>
> Look at the VM statistics please. We have detailed page statistics per
> zone these days. If there is anything missing then this would best be put
> into general functionality. When I looked at it, I saw page statistics
> that were replicating things that we already track per zone. All these
> would become available if a container is realized via a node and we would
> be using proven VM code.
Look at what the patches do. These are not only for hard partitioning
of memory per container but also for containers that share memory (eg. you might
want each to share 100MB of memory, up to a max of 80MB for an individual
container).
The nodes+cpusets stuff doesn't seem to help with that, because you
fundamentally need to track pages on a per-container basis; otherwise
you don't know who's got what.
Now if, in practice, it turns out that nobody really needed these
features then of course I would prefer the cpuset+nodes approach. My
point is that I am not in a position to know who wants what, so I
hope people will come out and discuss some of these issues.
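For the shared-100MB/capped-80MB case above, the admission check would
look roughly like this compilable toy (names and MB units are invented
for illustration):

#include <stdio.h>

/* Containers draw from a shared pool (say 100MB) but each is also
 * individually capped (say 80MB). Purely illustrative. */
struct pool { long used_mb, total_mb; };
struct container { struct pool *pool; long used_mb, max_mb; };

static int can_charge(struct container *c, long mb)
{
	if (c->used_mb + mb > c->max_mb)
		return 0;	/* would exceed the per-container cap */
	if (c->pool->used_mb + mb > c->pool->total_mb)
		return 0;	/* would exceed the shared pool */
	return 1;
}

static void charge(struct container *c, long mb)
{
	c->used_mb += mb;
	c->pool->used_mb += mb;
}

int main(void)
{
	struct pool shared = { 0, 100 };
	struct container a = { &shared, 0, 80 };
	struct container b = { &shared, 0, 80 };

	charge(&a, 70);
	/* b is under its own 80MB cap, but the shared pool is short: */
	printf("b wants 40MB: %s\n", can_charge(&b, 40) ? "ok" : "refused");
	return 0;
}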
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [patch00/05]: Containers(V2)- Introduction
2006-09-20 17:07 ` Nick Piggin
@ 2006-09-20 17:12 ` Christoph Lameter
2006-09-20 22:27 ` Paul Jackson
0 siblings, 1 reply; 125+ messages in thread
From: Christoph Lameter @ 2006-09-20 17:12 UTC (permalink / raw)
To: Nick Piggin; +Cc: Rohit Seth, CKRM-Tech, devel, pj, linux-kernel
On Wed, 20 Sep 2006, Nick Piggin wrote:
> Look at what the patches do. These are not only for hard partitioning
> of memory per container but also for containers that share memory (eg. you might
> want each to share 100MB of memory, up to a max of 80MB for an individual
> container).
So far I have not been able to find the hooks to the VM. The sharing
would also work with nodes. Just create a couple of nodes with the sizes you
want and then put the node with the shared memory into the cpusets for the
apps sharing them.
> The nodes+cpusets stuff doesn't seem to help with that, because you
> fundamentally need to track pages on a per-container basis; otherwise
> you don't know who's got what.
Hmmm... That gets into issues of knowing how many pages are in use by an
application and that is fundamentally difficult to do due to pages being
shared between processes.
> Now if, in practice, it turns out that nobody really needed these
> features then of course I would prefer the cpuset+nodes approach. My
> point is that I am not in a position to know who wants what, so I
> hope people will come out and discuss some of these issues.
I'd really be interested in proper tracking of memory use by
processes, but I am not sure what use the containers would be there.
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [patch00/05]: Containers(V2)- Introduction
2006-09-20 17:12 ` Christoph Lameter
@ 2006-09-20 22:27 ` Paul Jackson
2006-09-20 22:59 ` Christoph Lameter
0 siblings, 1 reply; 125+ messages in thread
From: Paul Jackson @ 2006-09-20 22:27 UTC (permalink / raw)
To: Christoph Lameter; +Cc: npiggin, rohitseth, ckrm-tech, devel, linux-kernel
Christoph, responding to Nick:
> > Look at what the patches do. These are not only for hard partitioning
> > of memory per container but also for containers that share memory (eg. you might
> > want each to share 100MB of memory, up to a max of 80MB for an individual
> > container).
>
> So far I have not been able to find the hooks to the VM. The sharing
> would also work with nodes. Just create a couple of nodes with the sizes you
> want and then put the node with the shared memory into the cpusets for the
> apps sharing them.
Cpusets certainly allow for sharing - in the sense that multiple
tasks can each be allowed to allocate from the same node (fake
or real).
However, this is not sharing quite in the sense that Nick describes it.
In cpuset sharing, it is predetermined which pages are allowed to be
allocated by which tasks. Not "how many" pages, but "just which" pages.
Let's say we carve these 100 MBs up into 5 cpusets of 20 MBs each, and
allow each of our many tasks to allocate from some specified 4 of these
5 cpusets. Then, even if some of those 100 MBs were still free, and
a task was well below its allowed 80 MBs, the task might still not
be able to use that free memory, if that free memory happened to be in
whichever was the 5th cpuset that it was not allowed to use.
Seth:
Could your container proposal handle the above example, and let that
task have some of that memory, up to 80 MBs if available, but not
more, regardless of what node the free memory was on?
I presume so.
Another example that highlights this difference - airline overbooking.
If an airline has to preassign every seat, it can't overbook, short of
putting two passengers in the same seat and hoping one is a no-show,
which is pretty cutthroat. If an airline is willing to bet that
seldom more than 90% of the ticketed passengers will show up, and it
doesn't preassign all seats, it can wait until flight time, see who
shows up, and hand out the seats then. It can preassign some seats,
but it needs some passengers showing up unassigned, free to take what's
left over.
Cpusets preassign which nodes a task is allowed. If not all the
pages on a node are allocated by one of the tasks it is preassigned to,
those pages "fly empty" -- they remain unallocated. This happens regardless
of how overbooked the memory on other nodes is.
If you just want to avoid fisticuffs at the gate between overbooked
passengers, cpusets are enough. If you further want to maximize utilization,
then you need the capacity management of resource groups, or some such.
> > The nodes+cpusets stuff doesn't seem to help with that, because you
> > fundamentally need to track pages on a per-container basis; otherwise
> > you don't know who's got what.
>
> Hmmm... That gets into issues of knowing how many pages are in use by an
> application and that is fundamentally difficult to do due to pages being
> shared between processes.
Fundamentally difficult or not, it seems to be required for what Nick
describes, and for sure cpusets doesn't do it (track memory usage per
container.)
> > Now if, in practice, it turns out that nobody really needed these
> > features then of course I would prefer the cpuset+nodes approach. My
> > point is that I am not in a position to know who wants what, so I
> > hope people will come out and discuss some of these issues.
I don't know either ;).
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [patch00/05]: Containers(V2)- Introduction
2006-09-20 22:27 ` Paul Jackson
@ 2006-09-20 22:59 ` Christoph Lameter
0 siblings, 0 replies; 125+ messages in thread
From: Christoph Lameter @ 2006-09-20 22:59 UTC (permalink / raw)
To: Paul Jackson; +Cc: npiggin, rohitseth, ckrm-tech, devel, linux-kernel
On Wed, 20 Sep 2006, Paul Jackson wrote:
> > Hmmm... That gets into issues of knowing how many pages are in use by an
> > application and that is fundamentally difficult to do due to pages being
> > shared between processes.
>
> Fundamentally difficult or not, it seems to be required for what Nick
> describes, and for sure cpusets doesn't do it (track memory usage per
> container.)
We have the memory usage on a node. So in that sense we track memory
usage. We also have VM counters for that node or nodes that describe
resource usage.
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [patch00/05]: Containers(V2)- Introduction
2006-09-20 16:25 ` Christoph Lameter
2006-09-20 16:44 ` Nick Piggin
@ 2006-09-20 17:26 ` Rohit Seth
2006-09-20 17:37 ` [ckrm-tech] " Paul Menage
` (2 more replies)
2006-09-20 17:34 ` Alan Cox
` (2 subsequent siblings)
4 siblings, 3 replies; 125+ messages in thread
From: Rohit Seth @ 2006-09-20 17:26 UTC (permalink / raw)
To: Christoph Lameter; +Cc: CKRM-Tech, devel, pj, npiggin, linux-kernel
On Wed, 2006-09-20 at 09:25 -0700, Christoph Lameter wrote:
> On Tue, 19 Sep 2006, Rohit Seth wrote:
>
> > For example, a user can run a batch job like backup inside containers.
> > This job, if run unconstrained, could step over most of the memory present
> > in the system, impacting other workloads running on the system at that
> > time. But when the same job is run inside a container, the backup
> > job runs within the container's limits.
>
> I just saw this for the first time since linux-mm was not cced. We have
> discussed a similar mechanism on linux-mm.
>
> We already have such functionality in the kernel; it's called a cpuset. A
> container could be created simply by creating a fake node that then
> allows constraining applications to this node. We already track the
> types of pages per node. The statistics you want already exist.
> See /proc/zoneinfo and /sys/devices/system/node/node*/*.
>
> > We use the term container to indicate a structure against which we track
> > and charge utilization of system resources like memory, tasks etc for a
> > workload. Containers will allow system admins to customize the
> > underlying platform for different applications based on their
> > performance and HW resource utilization needs. Containers contain
> > enough infrastructure to allow optimal resource utilization without
> bogging down the rest of the kernel. A system admin should be able to
> > create, manage and free containers easily.
>
> Right, that's what cpusets do, and they have been working fine for years. Maybe
> Paul can help you if you find anything missing in the existing means to
> control resources.
cpusets provide cpu and memory NODE binding to tasks. And I think they
work great for NUMA machines where you have different nodes, each with its
own set of CPUs and memory. The number of those nodes on commodity HW
is still 1. And they can have 8-16 CPUs and in excess of 100G of
memory. You may start using fake nodes (untested territory) to
translate a single-node machine into N different nodes. But I am not sure
if this number of nodes can change dynamically on a running machine or
a reboot is required to change the number of nodes.
Though when you want to have in excess of 100 containers, the cpuset
function starts popping up on the oprofile chart very aggressively. And
this is a cost that shouldn't have to be paid (particularly) on a
single-node machine.
And what happens when you want to have a cpuset with memory that needs to
be even more fine-grained than a node?
Containers also provide a mechanism to move files to containers. Any
further references to such a file are charged to that container rather than
to the container which is bringing in a new page.
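A compilable toy of what that implies -- the file's mapping (think
address_space) carries a container pointer, and page charges for the
file go there instead of to the faulting task's container. Names are
invented here, not necessarily the patch's:

#include <stdio.h>

struct container { const char *name; long pages; };
struct task { struct container *ctn; };
struct mapping { struct container *ctn; };	/* per file, like address_space */

static void charge_page(struct mapping *m, struct task *t)
{
	/* The file's container, if set, wins over the task's. */
	struct container *c = m->ctn ? m->ctn : t->ctn;

	c->pages++;
}

int main(void)
{
	struct container libs = { "libs", 0 }, job = { "job", 0 };
	struct task t = { &job };
	struct mapping bound = { &libs }, unbound = { NULL };

	charge_page(&bound, &t);	/* charged to "libs" */
	charge_page(&unbound, &t);	/* charged to the task's "job" */
	printf("libs=%ld job=%ld\n", libs.pages, job.pages);
	return 0;
}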
In the future there will be more handlers, like CPU and disk, that can be
easily embedded into this container infrastructure.
-rohit
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-20 17:26 ` Rohit Seth
@ 2006-09-20 17:37 ` Paul Menage
2006-09-20 17:38 ` Christoph Lameter
2006-09-20 22:51 ` Paul Jackson
2 siblings, 0 replies; 125+ messages in thread
From: Paul Menage @ 2006-09-20 17:37 UTC (permalink / raw)
To: rohitseth; +Cc: Christoph Lameter, npiggin, pj, linux-kernel, devel, CKRM-Tech
On 9/20/06, Rohit Seth <rohitseth@google.com> wrote:
>
> cpusets provide cpu and memory NODE binding to tasks. And I think they
> work great for NUMA machines where you have different nodes, each with its
> own set of CPUs and memory. The number of those nodes on commodity HW
> is still 1. And they can have 8-16 CPUs and in excess of 100G of
> memory. You may start using fake nodes (untested territory) to
I've been experimenting with this, and it's looking promising.
>
> Containers also provide a mechanism to move files to containers. Any
> further references to such a file are charged to that container rather than
> to the container which is bringing in a new page.
This could be done with memory nodes too - a vma can specify its
memory policy, so binding individual files to nodes shouldn't be hard.
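Roughly like this, for instance -- a sketch using the existing mbind(2)
call from libnuma's numaif.h (link with -lnuma). It assumes a node 1
exists (e.g. a fake one); whether the page cache fully honors the vma
policy when another task touches the file first is exactly the open
question:

#include <fcntl.h>
#include <numaif.h>		/* mbind(2), MPOL_BIND; link with -lnuma */
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a file and bind the mapping's pages to node 1, so faults
 * through this vma allocate there. */
int main(int argc, char **argv)
{
	struct stat st;
	int fd;

	if (argc < 2)
		return 1;
	fd = open(argv[1], O_RDONLY);
	if (fd < 0 || fstat(fd, &st) < 0) {
		perror("open/fstat");
		return 1;
	}
	char *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	unsigned long nodemask = 1UL << 1;	/* node 1 only */
	if (mbind(p, st.st_size, MPOL_BIND, &nodemask,
		  sizeof(nodemask) * 8, 0) < 0) {
		perror("mbind");
		return 1;
	}
	for (off_t off = 0; off < st.st_size; off += sysconf(_SC_PAGESIZE))
		(void)*(volatile char *)(p + off);	/* fault pages in */
	return 0;
}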
>
> In the future there will be more handlers, like CPU and disk, that can be
> easily embedded into this container infrastructure.
I think that at least the userspace API for adding more handlers would
need to be defined before actually committing a container patch, even
if the kernel code isn't yet extensible. The CKRM/RG interface is a
good start towards this.
Paul
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [patch00/05]: Containers(V2)- Introduction
2006-09-20 17:26 ` Rohit Seth
2006-09-20 17:37 ` [ckrm-tech] " Paul Menage
@ 2006-09-20 17:38 ` Christoph Lameter
2006-09-20 17:42 ` [ckrm-tech] " Paul Menage
2006-09-20 18:07 ` Rohit Seth
2006-09-20 22:51 ` Paul Jackson
2 siblings, 2 replies; 125+ messages in thread
From: Christoph Lameter @ 2006-09-20 17:38 UTC (permalink / raw)
To: Rohit Seth; +Cc: CKRM-Tech, devel, pj, npiggin, linux-kernel
On Wed, 20 Sep 2006, Rohit Seth wrote:
> cpusets provide cpu and memory NODE binding to tasks. And I think they
> work great for NUMA machines where you have different nodes, each with its
> own set of CPUs and memory. The number of those nodes on commodity HW
> is still 1. And they can have 8-16 CPUs and in excess of 100G of
> memory. You may start using fake nodes (untested territory) to
See linux-mm. We just went through a series of tests and functionality-wise
it worked just fine.
> translate a single-node machine into N different nodes. But I am not sure
> if this number of nodes can change dynamically on a running machine or
> a reboot is required to change the number of nodes.
This is commonly discussed under the subject of memory hotplug.
> Though when you want to have in excess of 100 containers, the cpuset
> function starts popping up on the oprofile chart very aggressively. And
> this is a cost that shouldn't have to be paid (particularly) on a
> single-node machine.
Yes, this is a new way of using cpusets, but it works, and we could fix the
scalability issues rather than adding new subsystems.
> And what happens when you want to have a cpuset with memory that needs to
> be even more fine-grained than a node?
New node?
> Containers also provide a mechanism to move files to containers. Any
> further references to such a file are charged to that container rather than
> to the container which is bringing in a new page.
Hmmmm... That is interesting.
> In the future there will be more handlers, like CPU and disk, that can be
> easily embedded into this container infrastructure.
I think we should have one container mechanism instead of multiple. Maybe
merge the two? The cpuset functionality is well established and working
right.
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-20 17:38 ` Christoph Lameter
@ 2006-09-20 17:42 ` Paul Menage
2006-09-20 18:07 ` Rohit Seth
1 sibling, 0 replies; 125+ messages in thread
From: Paul Menage @ 2006-09-20 17:42 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Rohit Seth, npiggin, pj, linux-kernel, devel, CKRM-Tech
On 9/20/06, Christoph Lameter <clameter@sgi.com> wrote:
>
> I think we should have one container mechanism instead of multiple. Maybe
> merge the two? The cpuset functionality is well established and working
> right.
The basic container abstraction provided by cpusets is very nice -
maybe rename it from "cpuset" to "container", since it already
provides access to memory nodes as well as cpus and could be extended
to handle other resource types too (network, disk)?
Paul
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [patch00/05]: Containers(V2)- Introduction
2006-09-20 17:38 ` Christoph Lameter
2006-09-20 17:42 ` [ckrm-tech] " Paul Menage
@ 2006-09-20 18:07 ` Rohit Seth
2006-09-20 19:51 ` Christoph Lameter
` (2 more replies)
1 sibling, 3 replies; 125+ messages in thread
From: Rohit Seth @ 2006-09-20 18:07 UTC (permalink / raw)
To: Christoph Lameter; +Cc: CKRM-Tech, devel, pj, npiggin, linux-kernel
On Wed, 2006-09-20 at 10:38 -0700, Christoph Lameter wrote:
> On Wed, 20 Sep 2006, Rohit Seth wrote:
>
> > cpusets provide cpu and memory NODE binding to tasks. And I think they
> > work great for NUMA machines where you have different nodes, each with its
> > own set of CPUs and memory. The number of those nodes on commodity HW
> > is still 1. And they can have 8-16 CPUs and in excess of 100G of
> > memory. You may start using fake nodes (untested territory) to
>
> See linux-mm. We just went through a series of tests and functionality-wise
> it worked just fine.
>
I thought the fake NUMA support still does not work on the x86_64 baseline
kernel, though Paul and Andrew have patches to make it work.
> translate a single-node machine into N different nodes. But I am not sure
> if this number of nodes can change dynamically on a running machine or
> a reboot is required to change the number of nodes.
>
> This is commonly discussed under the subject of memory hotplug.
>
So now we depend on getting memory hot-plug to work for faking up these
nodes ...for the memory that is already present in the system. It just
does not sound logical.
> Though when you want to have in excess of 100 containers, the cpuset
> function starts popping up on the oprofile chart very aggressively. And
> this is a cost that shouldn't have to be paid (particularly) on a
> single-node machine.
>
> Yes, this is a new way of using cpusets, but it works, and we could fix the
> scalability issues rather than adding new subsystems.
>
I think when you have 100's of zones, then the cost of allocating a page will
include cpuset validation and longer zonelist traversals and
checks...unless there is major surgery.
> And what happens when you want to have a cpuset with memory that needs to
> be even more fine-grained than a node?
>
> New node?
>
I am not clear how this is possible. Could you or Paul please explain?
> Containers also provide a mechanism to move files to containers. Any
> further references to such a file are charged to that container rather than
> to the container which is bringing in a new page.
>
> Hmmmm... That is interesting.
>
> In the future there will be more handlers, like CPU and disk, that can be
> easily embedded into this container infrastructure.
>
> I think we should have one container mechanism instead of multiple. Maybe
> merge the two? The cpuset functionality is well established and working
> right.
>
I agree that we will need one container subsystem in the long run.
Something that can easily adapt to different configurations.
-rohit
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [patch00/05]: Containers(V2)- Introduction
2006-09-20 18:07 ` Rohit Seth
@ 2006-09-20 19:51 ` Christoph Lameter
2006-09-20 20:06 ` Paul Jackson
2006-09-20 22:58 ` Paul Jackson
2 siblings, 0 replies; 125+ messages in thread
From: Christoph Lameter @ 2006-09-20 19:51 UTC (permalink / raw)
To: Rohit Seth; +Cc: CKRM-Tech, devel, pj, npiggin, linux-kernel
On Wed, 20 Sep 2006, Rohit Seth wrote:
> I thought the fake NUMA support still does not work on the x86_64 baseline
> kernel, though Paul and Andrew have patches to make it work.
Read linux-mm. There is work in progress.
> > This is commonly discussed under the subject of memory hotplug.
> So now we depend on getting memory hot-plug to work for faking up these
> nodes ...for the memory that is already present in the system. It just
> does not sound logical.
It is logical since nodes are containers of memory today and we have
established VM functionality to deal with these containers. If you read
the latest linux-mm then you will find that this was tested and works fine.
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [patch00/05]: Containers(V2)- Introduction
2006-09-20 18:07 ` Rohit Seth
2006-09-20 19:51 ` Christoph Lameter
@ 2006-09-20 20:06 ` Paul Jackson
2006-09-20 22:58 ` Paul Jackson
2 siblings, 0 replies; 125+ messages in thread
From: Paul Jackson @ 2006-09-20 20:06 UTC (permalink / raw)
To: rohitseth; +Cc: clameter, ckrm-tech, devel, npiggin, linux-kernel
Seth wrote:
> I thought the fake NUMA support still does not work on the x86_64 baseline
> kernel, though Paul and Andrew have patches to make it work.
It works. Having long zonelists, where one expects to have to scan a
long way down the list, hits a performance glitch: the linear scan in the
get_page_from_freelist() code sucks. We don't want to be doing a
linear scan of a long list on this code path.
The cpuset_zone_allowed() routine happens to be the most obvious canary
in this linear scan loop (google 'canary in the mine shaft' for the
idiom), so it shows up the problem first.
We don't have patches yet to fix this (well, we might, I still haven't
digested the last couple of days' worth of postings.) But we are pursuing
Andrew's suggestion to cache the zone that we found memory on last time
around, so as to dramatically reduce the chance we have to rescan the
entire dang zonelist every time through this code.
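In toy form the idea is just this (names invented; the real change
would sit in get_page_from_freelist(), next to the
cpuset_zone_allowed() check):

#include <stdio.h>

struct zone { long free_pages; };

struct zonelist {
	struct zone *zones[64];	/* NULL-terminated, possibly long */
	int cached;		/* slot that satisfied the last allocation */
};

static struct zone *alloc_pages_toy(struct zonelist *zl, long want)
{
	struct zone *z = zl->zones[zl->cached];
	int i;

	/* Fast path: the zone that worked last time usually still has
	 * memory, so we skip the long linear scan (and its repeated
	 * per-zone cpuset checks) entirely. */
	if (z && z->free_pages >= want) {
		z->free_pages -= want;
		return z;
	}
	/* Slow path: the old linear scan, remembering where we won. */
	for (i = 0; zl->zones[i]; i++) {
		z = zl->zones[i];
		if (z->free_pages >= want) {
			z->free_pages -= want;
			zl->cached = i;
			return z;
		}
	}
	return NULL;
}

int main(void)
{
	struct zone za = { 0 }, zb = { 1000 };
	struct zonelist zl = { { &za, &zb, NULL }, 0 };

	alloc_pages_toy(&zl, 10);	/* scans, caches slot 1 */
	alloc_pages_toy(&zl, 10);	/* hits the fast path */
	printf("cached slot: %d, zb free: %ld\n", zl.cached, zb.free_pages);
	return 0;
}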
Initially these zonelists had been designed to handle the various
kinds of dma, main and upper memory on common PC architectures, then
they were (ab)used to handle multiple Non-Uniform Memory Nodes (NUMA)
on bigger boxen. So it is not entirely surprising that we hit a
performance speed bump when further (ab)using them to handle multiple
Uniform sub-nodes as part of a memory containerization effort. Each
different kind of use hits these algorithms and data structures
differently.
It seems pretty clear by now that we will be able to pave over this
speed bump without doing any major reconstruction.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [patch00/05]: Containers(V2)- Introduction
2006-09-20 18:07 ` Rohit Seth
2006-09-20 19:51 ` Christoph Lameter
2006-09-20 20:06 ` Paul Jackson
@ 2006-09-20 22:58 ` Paul Jackson
2006-09-20 23:02 ` Christoph Lameter
2006-09-20 23:26 ` Rohit Seth
2 siblings, 2 replies; 125+ messages in thread
From: Paul Jackson @ 2006-09-20 22:58 UTC (permalink / raw)
To: rohitseth; +Cc: clameter, ckrm-tech, devel, npiggin, linux-kernel
Seth wrote:
> So now we depend on getting memory hot-plug to work for faking up these
> nodes ...for the memory that is already present in the system. It just
> does not sound logical.
It's logical to me. Part of memory hotplug is adding physical memory,
which is not an issue here. Part of it is adding another logical
memory node (turning on another bit in node_online_map) and fixing up
any code that thought a system's memory nodes were baked in at boottime.
Perhaps the hardest part is the memory hot-un-plug, which would become
more urgently needed with such use of fake numa nodes. The assumption
that memory doesn't just up and vanish is non-trivial to remove from
the kernel. A useful memory containerization should (IMHO) allow for
both adding and removing such containers.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [patch00/05]: Containers(V2)- Introduction
2006-09-20 22:58 ` Paul Jackson
@ 2006-09-20 23:02 ` Christoph Lameter
2006-09-20 23:33 ` Rohit Seth
2006-09-20 23:26 ` Rohit Seth
1 sibling, 1 reply; 125+ messages in thread
From: Christoph Lameter @ 2006-09-20 23:02 UTC (permalink / raw)
To: Paul Jackson; +Cc: rohitseth, ckrm-tech, devel, npiggin, linux-kernel
On Wed, 20 Sep 2006, Paul Jackson wrote:
> the kernel. A useful memory containerization should (IMHO) allow for
> both adding and removing such containers.
How does the containers implementation under discussion behave if a
process is part of a container and the container is removed?
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [patch00/05]: Containers(V2)- Introduction
2006-09-20 23:02 ` Christoph Lameter
@ 2006-09-20 23:33 ` Rohit Seth
2006-09-20 23:36 ` Christoph Lameter
0 siblings, 1 reply; 125+ messages in thread
From: Rohit Seth @ 2006-09-20 23:33 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Paul Jackson, ckrm-tech, devel, npiggin, linux-kernel
On Wed, 2006-09-20 at 16:02 -0700, Christoph Lameter wrote:
> On Wed, 20 Sep 2006, Paul Jackson wrote:
>
> > the kernel. A useful memory containerization should (IMHO) allow for
> > both adding and removing such containers.
>
> How does the containers implementation under discussion behave if a
> process is part of a container and the container is removed?
>
It first removes all the tasks belonging to this container (which means
resetting the container pointers in task_struct and then the per-page
container pointers belonging to anonymous pages). It then clears the
container pointers in the mapping structure and also in the pages
belonging to these files.
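In sketch form, with toy types and no locking (the real code walks the
kernel's own task and page lists):

struct container;

struct task { struct container *ctn; struct task *next; };
struct page { struct container *ctn; };
struct mapping {
	struct container *ctn;
	struct page **pages;
	int nr_pages;
};

struct container {
	struct task *tasks;	/* tasks charged to this container */
	struct mapping **files;	/* files moved into this container */
	int nr_files;
};

static void container_destroy(struct container *c)
{
	struct task *t;
	int f, i;

	/* 1. Detach every task (the per-page pointers of its anonymous
	 *    pages are cleared the same way, omitted here). */
	for (t = c->tasks; t; t = t->next)
		t->ctn = NULL;

	/* 2. Clear the container pointer in each file's mapping and in
	 *    the pages belonging to those files. */
	for (f = 0; f < c->nr_files; f++) {
		struct mapping *m = c->files[f];

		m->ctn = NULL;
		for (i = 0; i < m->nr_pages; i++)
			m->pages[i]->ctn = NULL;
	}
}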
-rohit
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [patch00/05]: Containers(V2)- Introduction
2006-09-20 23:33 ` Rohit Seth
@ 2006-09-20 23:36 ` Christoph Lameter
2006-09-20 23:39 ` Rohit Seth
0 siblings, 1 reply; 125+ messages in thread
From: Christoph Lameter @ 2006-09-20 23:36 UTC (permalink / raw)
To: Rohit Seth; +Cc: Paul Jackson, ckrm-tech, devel, npiggin, linux-kernel
On Wed, 20 Sep 2006, Rohit Seth wrote:
> > How does the containers implementation under discussion behave if a
> > process is part of a container and the container is removed?
> It first removes all the tasks belonging to this container (which means
> resetting the container pointers in task_struct and then the per-page
> container pointers belonging to anonymous pages). It then clears the
> container pointers in the mapping structure and also in the pages
> belonging to these files.
So the application continues to run unharmed?
Could we equip containers with restrictions on processors and nodes for
NUMA?
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [patch00/05]: Containers(V2)- Introduction
2006-09-20 23:36 ` Christoph Lameter
@ 2006-09-20 23:39 ` Rohit Seth
2006-09-20 23:51 ` Christoph Lameter
0 siblings, 1 reply; 125+ messages in thread
From: Rohit Seth @ 2006-09-20 23:39 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Paul Jackson, ckrm-tech, devel, npiggin, linux-kernel
On Wed, 2006-09-20 at 16:36 -0700, Christoph Lameter wrote:
> On Wed, 20 Sep 2006, Rohit Seth wrote:
>
> > > How does the containers implementation under discussion behave if a
> > > process is part of a container and the container is removed?
> > It first removes all the tasks belonging to this container (which means
> > resetting the container pointers in task_struct and then the per-page
> > container pointers belonging to anonymous pages). It then clears the
> > container pointers in the mapping structure and also in the pages
> > belonging to these files.
>
> So the application continues to run unharmed?
>
It will hit a one-time penalty of getting those pointers reset, but
besides that it will continue to run fine.
> Could we equip containers with restrictions on processors and nodes for
> NUMA?
>
Yes. That is something we will have to do (I think part of CPU
handler-TBD).
-rohit
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [patch00/05]: Containers(V2)- Introduction
2006-09-20 23:39 ` Rohit Seth
@ 2006-09-20 23:51 ` Christoph Lameter
2006-09-21 0:05 ` Paul Jackson
0 siblings, 1 reply; 125+ messages in thread
From: Christoph Lameter @ 2006-09-20 23:51 UTC (permalink / raw)
To: Rohit Seth; +Cc: Paul Jackson, ckrm-tech, devel, npiggin, linux-kernel
On Wed, 20 Sep 2006, Rohit Seth wrote:
> > Could we equip containers with restrictions on processors and nodes for
> > NUMA?
> Yes. That is something we will have to do (I think part of CPU
> handler-TBD).
Paul: Will we still need cpusets if that is there?
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [patch00/05]: Containers(V2)- Introduction
2006-09-20 23:51 ` Christoph Lameter
@ 2006-09-21 0:05 ` Paul Jackson
2006-09-21 0:09 ` [ckrm-tech] " Paul Menage
0 siblings, 1 reply; 125+ messages in thread
From: Paul Jackson @ 2006-09-21 0:05 UTC (permalink / raw)
To: Christoph Lameter; +Cc: rohitseth, ckrm-tech, devel, npiggin, linux-kernel
> > > Could we equip containers with restrictions on processors and nodes for
> > > NUMA?
> > Yes. That is something we will have to do (I think part of CPU
> > handler-TBD).
>
> Paul: Will we still need cpusets if that is there?
Yes. There's quite a bit more to cpusets than just some form,
any form, of CPU and Memory restriction. I can't imagine that
Containers, in any form, are going to replicate that API.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-21 0:05 ` Paul Jackson
@ 2006-09-21 0:09 ` Paul Menage
0 siblings, 0 replies; 125+ messages in thread
From: Paul Menage @ 2006-09-21 0:09 UTC (permalink / raw)
To: Paul Jackson
Cc: Christoph Lameter, rohitseth, npiggin, linux-kernel, devel,
ckrm-tech
On 9/20/06, Paul Jackson <pj@sgi.com> wrote:
>
> Yes. There's quite a bit more to cpusets than just some form,
> any form, of CPU and Memory restriction. I can't imagine that
> Containers, in any form, are going to replicate that API.
>
That would be one of the nice aspects of a generic process container
abstraction linked to different resource controllers - you wouldn't
need to replicate the cpuset support, you could use it in parallel
with other resource controllers. (So e.g. use the cpusets support to
pin a group of processes on to a given set of CPU/memory nodes, and
then use the CKRM/RG CPU and disk/IO controllers to limit resource
usage within those nodes)
Paul
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [patch00/05]: Containers(V2)- Introduction
2006-09-20 22:58 ` Paul Jackson
2006-09-20 23:02 ` Christoph Lameter
@ 2006-09-20 23:26 ` Rohit Seth
2006-09-20 23:31 ` Christoph Lameter
1 sibling, 1 reply; 125+ messages in thread
From: Rohit Seth @ 2006-09-20 23:26 UTC (permalink / raw)
To: Paul Jackson; +Cc: clameter, ckrm-tech, devel, npiggin, linux-kernel
On Wed, 2006-09-20 at 15:58 -0700, Paul Jackson wrote:
> Seth wrote:
> > So now we depend on getting memory hot-plug to work for faking up these
> > nodes ...for the memory that is already present in the system. It just
> > does not sound logical.
>
> It's logical to me. Part of memory hotplug is adding physical memory,
> which is not an issue here. Part of it is adding another logical
> memory node (turning on another bit in node_online_map) and fixing up
> any code that thought a system's memory nodes were baked in at boottime.
> Perhaps the hardest part is the memory hot-un-plug, which would become
> more urgently needed with such use of fake numa nodes. The assumption
> that memory doesn't just up and vanish is non-trivial to remove from
> the kernel. A useful memory containerization should (IMHO) allow for
> both adding and removing such containers.
>
Absolutely. Since these containers are not (hard) partitioning the
memory in any way, it is easy to change the limits (effectively
reducing and increasing the memory limits for tasks belonging to
containers). As you said, memory hot-un-plug is important and it is a
non-trivial amount of work.
-rohit
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [patch00/05]: Containers(V2)- Introduction
2006-09-20 23:26 ` Rohit Seth
@ 2006-09-20 23:31 ` Christoph Lameter
2006-09-21 0:51 ` [Lhms-devel] " KAMEZAWA Hiroyuki
0 siblings, 1 reply; 125+ messages in thread
From: Christoph Lameter @ 2006-09-20 23:31 UTC (permalink / raw)
To: lhms-devel
Cc: Rohit Seth, Paul Jackson, ckrm-tech, devel, npiggin, linux-kernel
On Wed, 20 Sep 2006, Rohit Seth wrote:
> Absolutely. Since these containers are not (hard) partitioning the
> memory in any way, it is easy to change the limits (effectively
> reducing and increasing the memory limits for tasks belonging to
> containers). As you said, memory hot-un-plug is important and it is a
> non-trivial amount of work.
Maybe the hotplug guys want to contribute to the discussion?
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [Lhms-devel] [patch00/05]: Containers(V2)- Introduction
2006-09-20 23:31 ` Christoph Lameter
@ 2006-09-21 0:51 ` KAMEZAWA Hiroyuki
2006-09-21 1:33 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 125+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-09-21 0:51 UTC (permalink / raw)
To: Christoph Lameter
Cc: lhms-devel, npiggin, ckrm-tech, linux-kernel, pj, rohitseth,
devel
On Wed, 20 Sep 2006 16:31:22 -0700 (PDT)
Christoph Lameter <clameter@sgi.com> wrote:
> On Wed, 20 Sep 2006, Rohit Seth wrote:
>
> > Absolutely. Since these containers are not (hard) partitioning the
> > memory in any way, it is easy to change the limits (effectively
> > reducing and increasing the memory limits for tasks belonging to
> > containers). As you said, memory hot-un-plug is important and it is a
> > non-trivial amount of work.
>
> Maybe the hotplug guys want to contribute to the discussion?
>
Ah, I'm reading these threads with interest.
I think this discussion is about using fake nodes ('struct pgdat')
to divide the system's memory into some chunks. Your thought is that
the memory-hotplug code may be useful for resizing/adding/removing fake
pgdats. Correct?
Now, memory hotplug manages all memory by 'section' and allows adding
(and removing) sections to a pgdat.
Does this section-size handling meet the container people's requirements?
And do we need to free pages when a pgdat is removed?
I think at least SPARSEMEM is useful for fake nodes because 'struct page'
is not tied to the pgdat. (DISCONTIGMEM uses node_start_pfn; SPARSEMEM does not.)
-Kame
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [Lhms-devel] [patch00/05]: Containers(V2)- Introduction
2006-09-21 0:51 ` [Lhms-devel] " KAMEZAWA Hiroyuki
@ 2006-09-21 1:33 ` KAMEZAWA Hiroyuki
2006-09-21 1:36 ` [ckrm-tech] " Paul Menage
0 siblings, 1 reply; 125+ messages in thread
From: KAMEZAWA Hiroyuki @ 2006-09-21 1:33 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: clameter, lhms-devel, npiggin, ckrm-tech, linux-kernel, pj,
rohitseth, devel
Self-response..
On Thu, 21 Sep 2006 09:51:00 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Wed, 20 Sep 2006 16:31:22 -0700 (PDT)
> Christoph Lameter <clameter@sgi.com> wrote:
>
> > On Wed, 20 Sep 2006, Rohit Seth wrote:
> >
> > > Absolutely. Since these containers are not (hard) partitioning the
> > > memory in any way, it is easy to change the limits (effectively
> > > reducing and increasing the memory limits for tasks belonging to
> > > containers). As you said, memory hot-un-plug is important and it is a
> > > non-trivial amount of work.
> >
> > Maybe the hotplug guys want to contribute to the discussion?
> >
> Ah, I'm reading threads with interest.
I wonder whether it is good to use pgdat for resource control.
For example, in the following scenario,
==
(1). add <pid> > /mnt/configfs/containers/my_container/add_task
(2). <pid> does some work.
(3). echo <pid> > /mnt/configfs/containers/my_container/rm_task
(4). echo <pid> > /mnt/configfs/containers/my_container2/add_task
==
(if fake-pgdat/memory-hotplug is used)
The pages used by <pid> in (2) will still be accounted to 'my_container' after (3).
Is this the user's wrong use of the system?
-Kame
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [Lhms-devel] [patch00/05]: Containers(V2)- Introduction
2006-09-21 1:33 ` KAMEZAWA Hiroyuki
@ 2006-09-21 1:36 ` Paul Menage
0 siblings, 0 replies; 125+ messages in thread
From: Paul Menage @ 2006-09-21 1:36 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: npiggin, ckrm-tech, linux-kernel, pj, lhms-devel, rohitseth,
devel, clameter
On 9/20/06, KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> For example, in the following scenario,
> ==
> (1). add <pid> > /mnt/configfs/containers/my_container/add_task
> (2). <pid> does some work.
> (3). echo <pid> > /mnt/configfs/containers/my_container/rm_task
> (4). echo <pid> > /mnt/configfs/containers/my_container2/add_task
> ==
> (if fake-pgdat/memory-hotplug is used)
> The pages used by <pid> in (2) will still be accounted to 'my_container' after (3).
> Is this the user's wrong use of the system?
Yes. You can't use memory node partitioning for file pages in this way
unless you have strict controls over who can access the data sets in
question, and are careful to prevent people from moving between
containers. So it's not suitable for all uses of resource-isolating
containers.
Who is to say that the pages allocated in (2) *should* be reaccounted
to my_container2 after (3)? Some people might want that, others
(including me) might not.
Paul
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [patch00/05]: Containers(V2)- Introduction
2006-09-20 17:26 ` Rohit Seth
2006-09-20 17:37 ` [ckrm-tech] " Paul Menage
2006-09-20 17:38 ` Christoph Lameter
@ 2006-09-20 22:51 ` Paul Jackson
2006-09-20 23:01 ` Christoph Lameter
2006-09-20 23:22 ` Rohit Seth
2 siblings, 2 replies; 125+ messages in thread
From: Paul Jackson @ 2006-09-20 22:51 UTC (permalink / raw)
To: rohitseth; +Cc: clameter, ckrm-tech, devel, npiggin, linux-kernel
Seth wrote:
> But I am not sure
> if this number of nodes can change dynamically on a running machine or
> a reboot is required to change the number of nodes.
The current numa=fake=N kernel command line option is just boottime,
and just x86_64.
I presume we'd have to remove these two constraints for this to be
generally usable to containerize memory.
We also, in my current opinion, need to fix up the node_distance
between such fake numa sibling nodes, to correctly reflect that they
are on the same real node (LOCAL_DISTANCE).
And some non-trivial, arch-specific, zonelist sorting and reconstruction
work will be needed.
And an API devised for the above mentioned dynamic changing.
And this will push on the memory hotplug/unplug technology.
All in all, it could avoid anything more than trivial changes to the
existing memory allocation code hot paths. But the infrastructure
needed for managing this mechanism needs some non-trivial work.
> Though when you want to have in excess of 100 containers, the cpuset
> function starts popping up on the oprofile chart very aggressively.
As the linux-mm discussion last weekend examined in detail, we can
eliminate this performance speed bump, probably by caching the
last zone on which we found some memory. The linear search that was
implicit in __alloc_pages()'s use of zonelists for many years finally
became explicit with this new usage pattern.
> Containers also provide a mechanism to move files to containers. Any
> further references to such a file are charged to that container rather than
> to the container which is bringing in a new page.
I haven't read these patches enough to quite make sense of this, but I
suspect that this is not a distinction between cpusets and these
containers, for the basic reason that cpusets doesn't need to 'move'
a file's references because it has no clue what those are.
> In the future there will be more handlers, like CPU and disk, that can be
> easily embedded into this container infrastructure.
This may be a deciding point.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [patch00/05]: Containers(V2)- Introduction
2006-09-20 22:51 ` Paul Jackson
@ 2006-09-20 23:01 ` Christoph Lameter
2006-09-20 23:22 ` Rohit Seth
1 sibling, 0 replies; 125+ messages in thread
From: Christoph Lameter @ 2006-09-20 23:01 UTC (permalink / raw)
To: Paul Jackson; +Cc: rohitseth, ckrm-tech, devel, npiggin, linux-kernel
On Wed, 20 Sep 2006, Paul Jackson wrote:
> All in all, it could avoid anything more than trivial changes to the
> existing memory allocation code hot paths. But the infrastructure
> needed for managing this mechanism needs some non-trivial work.
This is material we have to do anyway for hotplug support. Adding a real
node or a virtual node, whatever it is, is fundamentally the same process,
requiring a regeneration of the zonelists in the system.
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [patch00/05]: Containers(V2)- Introduction
2006-09-20 22:51 ` Paul Jackson
2006-09-20 23:01 ` Christoph Lameter
@ 2006-09-20 23:22 ` Rohit Seth
2006-09-20 23:45 ` Paul Jackson
1 sibling, 1 reply; 125+ messages in thread
From: Rohit Seth @ 2006-09-20 23:22 UTC (permalink / raw)
To: Paul Jackson; +Cc: clameter, ckrm-tech, devel, npiggin, linux-kernel
On Wed, 2006-09-20 at 15:51 -0700, Paul Jackson wrote:
> Seth wrote:
> > But I am not sure
> > if this number of nodes can change dynamically on a running machine or
> > a reboot is required to change the number of nodes.
>
> The current numa=fake=N kernel command line option is just boottime,
> and just x86_64.
>
Ah okay.
> I presume we'd have to remove these two constraints for this to be
> generally usable to containerize memory.
>
Right.
> We also, in my current opinion, need to fix up the node_distance
> between such fake numa sibling nodes, to correctly reflect that they
> are on the same real node (LOCAL_DISTANCE).
>
> And some non-trivial, arch-specific, zonelist sorting and reconstruction
> work will be needed.
>
> And an API devised for the above mentioned dynamic changing.
>
> And this will push on the memory hotplug/unplug technology.
>
Yes, if we use the existing notion of nodes for other purposes then you
have captured the right set of changes that will be needed to make that
happen. Such changes are not required for container patches as such.
> All in all, it could avoid anything more than trivial changes to the
> existing memory allocation code hot paths. But the infrastructure
> needed for managing this mechanism needs some non-trivial work.
>
>
> > Though when you want to have in excess of 100 containers, the cpuset
> > function starts popping up on the oprofile chart very aggressively.
>
> As the linux-mm discussion last weekend examined in detail, we can
> eliminate this performance speed bump, probably by caching the
> last zone on which we found some memory. The linear search that was
> implicit in __alloc_pages()'s use of zonelists for many years finally
> became explicit with this new usage pattern.
>
Okay.
>
> > Containers also provide a mechanism to move files to containers. Any
> > further references to such a file are charged to that container rather than
> > to the container which is bringing in a new page.
>
> I haven't read these patches enough to quite make sense of this, but I
> suspect that this is not a distinction between cpusets and these
> containers, for the basic reason that cpusets doesn't need to 'move'
> a file's references because it has no clue what those are.
>
But container support will allow certain files' pages to come from
the same container irrespective of who is using them. Something useful
for shared libs etc.
-rohit
>
> In the future there will be more handlers, like CPU and disk, that can be
> easily embedded into this container infrastructure.
>
> This may be a deciding point.
>
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [patch00/05]: Containers(V2)- Introduction
2006-09-20 23:22 ` Rohit Seth
@ 2006-09-20 23:45 ` Paul Jackson
0 siblings, 0 replies; 125+ messages in thread
From: Paul Jackson @ 2006-09-20 23:45 UTC (permalink / raw)
To: rohitseth; +Cc: clameter, ckrm-tech, devel, npiggin, linux-kernel
Seth wrote:
> But container support will allow certain files' pages to come from
> the same container irrespective of who is using them. Something useful
> for shared libs etc.
Yes - that is useful for shared libs, and your container patch
apparently does that cleanly, while cpuset+fakenuma containers can
only provide a compromise kludge.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [patch00/05]: Containers(V2)- Introduction
2006-09-20 16:25 ` Christoph Lameter
2006-09-20 16:44 ` Nick Piggin
2006-09-20 17:26 ` Rohit Seth
@ 2006-09-20 17:34 ` Alan Cox
2006-09-20 17:15 ` Christoph Lameter
2006-09-20 17:30 ` [ckrm-tech] " Paul Menage
2006-09-20 18:34 ` Chandra Seetharaman
2006-09-20 19:09 ` Chandra Seetharaman
4 siblings, 2 replies; 125+ messages in thread
From: Alan Cox @ 2006-09-20 17:34 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Rohit Seth, CKRM-Tech, devel, pj, npiggin, linux-kernel
On Wed, 2006-09-20 at 09:25 -0700, Christoph Lameter wrote:
> We already have such functionality in the kernel; it's called a cpuset. A
> container could be created simply by creating a fake node that then
> allows constraining applications to this node. We already track the
> types of pages per node. The statistics you want already exist.
> See /proc/zoneinfo and /sys/devices/system/node/node*/*.
CPUsets don't appear to scale to large numbers of containers (say 5000,
with 200-500 doing stuff at a time). They also don't appear to do any
tracking of kernel side resource objects, which is critical to
containment. Indeed for some of us the CPU management and user memory
management angle is mostly uninteresting.
I'm also not clear how you handle shared pages correctly under the fake
node system; can you perhaps explain further how this works for, say,
a single apache/php/glibc shared page set across 5000 containers, each a
web site?
Alan
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [patch00/05]: Containers(V2)- Introduction
2006-09-20 17:34 ` Alan Cox
@ 2006-09-20 17:15 ` Christoph Lameter
2006-09-20 17:48 ` Alan Cox
2006-09-20 23:18 ` Paul Jackson
2006-09-20 17:30 ` [ckrm-tech] " Paul Menage
1 sibling, 2 replies; 125+ messages in thread
From: Christoph Lameter @ 2006-09-20 17:15 UTC (permalink / raw)
To: Alan Cox; +Cc: Rohit Seth, CKRM-Tech, devel, pj, npiggin, linux-kernel
On Wed, 20 Sep 2006, Alan Cox wrote:
> Ar Mer, 2006-09-20 am 09:25 -0700, ysgrifennodd Christoph Lameter:
> > We already have such functionality in the kernel; it's called a cpuset. A
> > container could be created simply by creating a fake node that then
> > allows constraining applications to this node. We already track the
> > types of pages per node. The statistics you want already exist.
> > See /proc/zoneinfo and /sys/devices/system/node/node*/*.
>
> CPUsets don't appear to scale to large numbers of containers (say 5000,
> with 200-500 doing stuff at a time). They also don't appear to do any
> tracking of kernel side resource objects, which is critical to
> containment. Indeed for some of us the CPU management and user memory
> management angle is mostly uninteresting.
The scalability issues can certainly be managed. See the discussions on
linux-mm. Kernel side resource objects? Slab pages? Those are tracked.
> I'm also not clear how you handle shared pages correctly under the fake
> node system; can you perhaps explain further how this works for, say,
> a single apache/php/glibc shared page set across 5000 containers, each a
> web site?
Cpusets can share nodes. I am not sure what the problem would be. Paul may
be able to give you more details.
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [patch00/05]: Containers(V2)- Introduction
2006-09-20 17:15 ` Christoph Lameter
@ 2006-09-20 17:48 ` Alan Cox
2006-09-20 17:35 ` Christoph Lameter
2006-09-20 23:29 ` Paul Jackson
2006-09-20 23:18 ` Paul Jackson
1 sibling, 2 replies; 125+ messages in thread
From: Alan Cox @ 2006-09-20 17:48 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Rohit Seth, CKRM-Tech, devel, pj, npiggin, linux-kernel
On Wed, 2006-09-20 at 10:15 -0700, Christoph Lameter wrote:
> The scalability issues can certainly be managed. See the discussions on
> linux-mm.
I'll take a look at a web archive of it, I don't follow -mm.
> Kernel side resource objects? slab pages? Those are tracked.
Slab pages aren't a useful tracking tool for two reasons. The first is
that some resources are genuinely a shared, kernel-managed pool and
should be treated that way - that's obviously easy to sort out.
The second is that slab pages are not the granularity of allocation, so
it becomes possible (and deliberately influenceable) to make someone else
allocate the pages all the time so you don't pay the cost. Hence the
beancounters track the real objects.
> Cpusets can share nodes. I am not sure what the problem would be. Paul may
> be able to give you more details.
If it can do it in a human-understandable way, configured at runtime
with dynamic sharing, overcommit and reconfiguration of sizes, then
great. Let's see what Paul has to say.
Alan
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [patch00/05]: Containers(V2)- Introduction
2006-09-20 17:48 ` Alan Cox
@ 2006-09-20 17:35 ` Christoph Lameter
2006-09-20 23:29 ` Paul Jackson
1 sibling, 0 replies; 125+ messages in thread
From: Christoph Lameter @ 2006-09-20 17:35 UTC (permalink / raw)
To: Alan Cox; +Cc: Rohit Seth, CKRM-Tech, devel, pj, npiggin, linux-kernel
On Wed, 20 Sep 2006, Alan Cox wrote:
> If it can do it in a human-understandable way, configured at runtime
> with dynamic sharing, overcommit and reconfiguration of sizes, then
> great. Let's see what Paul has to say.
You have full VM support; this means overcommit and everything else you
are used to.
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [patch00/05]: Containers(V2)- Introduction
2006-09-20 17:48 ` Alan Cox
2006-09-20 17:35 ` Christoph Lameter
@ 2006-09-20 23:29 ` Paul Jackson
1 sibling, 0 replies; 125+ messages in thread
From: Paul Jackson @ 2006-09-20 23:29 UTC (permalink / raw)
To: Alan Cox; +Cc: clameter, rohitseth, ckrm-tech, devel, npiggin, linux-kernel
Alan replying to Christoph:
> > Cpusets can share nodes. I am not sure what the problem would be. Paul may
> > be able to give you more details.
>
> If it can do it in a human-understandable way, configured at runtime
> with dynamic sharing, overcommit and reconfiguration of sizes, then
> great. Let's see what Paul has to say.
Unless I'm missing something (a frequent occurrence), such a use of
cpusets loses on the understandability, is hobbled on the overcommit, and
has to make do with a somewhat oddly limited and not trivial to
configure approximation of the dynamic sharing. And the
reconfiguration would seem to be a great exercise of memory hot-unplug
(echoes of the original motivation for fake numa - exercising cpusets ;).
Not exactly passing with flying colors ;).
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [patch00/05]: Containers(V2)- Introduction
2006-09-20 17:15 ` Christoph Lameter
2006-09-20 17:48 ` Alan Cox
@ 2006-09-20 23:18 ` Paul Jackson
1 sibling, 0 replies; 125+ messages in thread
From: Paul Jackson @ 2006-09-20 23:18 UTC (permalink / raw)
To: Christoph Lameter
Cc: alan, rohitseth, ckrm-tech, devel, npiggin, linux-kernel
Christoph, responding to Alan:
> > I'm also not clear how you handle shared pages correctly under the fake
> > node system; can you perhaps explain further how this works for, say,
> > a single apache/php/glibc shared page set across 5000 containers, each a
> > web site?
>
> Cpusets can share nodes. I am not sure what the problem would be. Paul may
> be able to give you more details.
Cpusets share pre-assigned nodes, but not anonymous proportions of the
total system memory.
So sharing an apache/php/glibc page set across 5000 containers using
cpusets would be awkward. Unless I'm missing something, you'd have to
pre-page in that page set, from some task allowed that many pages in
its own cpuset, then you'd run each of the 5000 web servers in smaller
cpusets that allowed space for the remainder of whatever that web
server was provisioned with, not counting the shared pages.
wouldn't count, because cpusets doesn't ding you for using a page that
is already in memory -- it just keeps you from allocating fresh pages
on certain nodes. When it came time to do rolling upgrades to new
versions of the software, and add a marketing driven list of 57
additional applications that the customers could use to build their
website, this could become an official nightmare.
Overbooking (selling say 10 MB of memory for each server, even though
there is less than 5000 * 10 MB total RAM in the system) would also be
awkward. One could simulate it with overlapping sets of fake numa nodes,
as I described in an earlier post today (the one that gave each task
some four of the five 20 MB fake cpusets.) But there would still be
false resource conflicts, and the (ab)use of the cpuset apparatus for
this seems unintuitive, in my opinion.
I imagine that a web site supporting 5000 web servers would be very
interested in overbooking working well. I'm sure the $7.99/month
cheap as dirt virtual web servers of which I am a customer overbook.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-20 17:34 ` Alan Cox
2006-09-20 17:15 ` Christoph Lameter
@ 2006-09-20 17:30 ` Paul Menage
2006-09-20 23:37 ` Paul Jackson
1 sibling, 1 reply; 125+ messages in thread
From: Paul Menage @ 2006-09-20 17:30 UTC (permalink / raw)
To: Alan Cox
Cc: Christoph Lameter, npiggin, CKRM-Tech, linux-kernel, pj,
Rohit Seth, devel
On 9/20/06, Alan Cox <alan@lxorguk.ukuu.org.uk> wrote:
>
> I'm also not clear how you handle shared pages correctly under the fake
> node system, can you perhaps explain that further how this works for say
> a single apache/php/glibc shared page set across 5000 containers each a
> web site.
If you can associate files with containers, you can have a "shared
libraries" container that the libraries/binaries for apache/php/glibc
are associated with - all pages from those files are then accounted to
the shared container. So you can see that there are 5000 apaches each
using say 10MB privately, and sharing a container with 100MB of file
data. This can also be O(1) in the number of apache containers.
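To make the bookkeeping concrete, a rough sketch of the charging rule
that implies - the container pointers on the address_space and page
below are invented for illustration, not from any posted patch:

/* Illustrative only: mapping->container and page->container are
 * hypothetical fields. */
struct container {
	atomic_long_t pages;		/* pages currently charged here */
};

void charge_page_cache(struct page *page, struct address_space *mapping,
		       struct container *task_cont)
{
	/* Charge a pagecache page to the container its file is bound
	 * to, falling back to the faulting task's container. */
	struct container *cont = mapping->container ?: task_cont;

	atomic_long_inc(&cont->pages);
	page->container = cont;		/* remembered for uncharging */
}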
Paul
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-20 17:30 ` [ckrm-tech] " Paul Menage
@ 2006-09-20 23:37 ` Paul Jackson
2006-09-20 23:53 ` Paul Menage
0 siblings, 1 reply; 125+ messages in thread
From: Paul Jackson @ 2006-09-20 23:37 UTC (permalink / raw)
To: Paul Menage
Cc: alan, clameter, npiggin, ckrm-tech, linux-kernel, rohitseth,
devel
Paul M., responding to Alan:
> > I'm also not clear how you handle shared pages correctly under the fake
> > node system, can you perhaps explain that further how this works for say
> > a single apache/php/glibc shared page set across 5000 containers each a
> > web site.
>
> If you can associate files with containers, you can have a "shared
> libraries" container that the libraries/binaries for apache/php/glibc
> are associated with - all pages from those files are then accounted to
> the shared container.
The way you "associate" a file with a cpuset is to have some task in
that cpuset open that file and touch its pages -- where that task does
so before any other would-be user of the file. Then so long as those
pages have any users or aren't reclaimed, they stay in memory or swap,
free for anyone to reference (free so far as cpusets cares, which is
not in the slightest.)
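A minimal sketch of such a pre-toucher, run from a task in the cpuset
you want the pages allocated in (illustrative, not from any patch):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Fault in every page of a file, so the pages get allocated on the
 * nodes of the cpuset this task is currently confined to. */
int pretouch(const char *path)
{
	struct stat st;
	char *p;
	long pagesize = sysconf(_SC_PAGESIZE);
	volatile char sum = 0;
	off_t off;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}
	p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		close(fd);
		return -1;
	}
	for (off = 0; off < st.st_size; off += pagesize)
		sum += p[off];	/* read fault pulls the page into cache */
	munmap(p, st.st_size);
	close(fd);
	return 0;
}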
Such pre-touching of files is a common occurrence in the HPC (High
Performance Computing) apps that run on the big honkin' NUMA iron where
cpusets were born. I'm guessing that someone hosting 5000 web servers
would rather not deal with that particular hassle.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-20 23:37 ` Paul Jackson
@ 2006-09-20 23:53 ` Paul Menage
2006-09-21 0:07 ` Paul Jackson
0 siblings, 1 reply; 125+ messages in thread
From: Paul Menage @ 2006-09-20 23:53 UTC (permalink / raw)
To: Paul Jackson
Cc: alan, clameter, npiggin, ckrm-tech, linux-kernel, rohitseth,
devel
On 9/20/06, Paul Jackson <pj@sgi.com> wrote:
>
> The way you "associate" a file with a cpuset is to have some task in
> that cpuset open that file and touch its pages -- where that task does
> so before any other would be user of the file.
An alternative would be a way of binding files (or directory
hierarchies) to a particular set of memory nodes. Then you wouldn't
need to pre-fault the data. Extended attributes might be one way of
doing it.
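One possible shape for that - setxattr(2) itself is real, but the
"user.mem_nodes" key and any kernel interpretation of it are made up
here:

#include <stdio.h>
#include <sys/xattr.h>

int main(void)
{
	/* Hypothetical: tag a library so its pagecache pages are
	 * placed on (and accounted to) memory nodes 4-7. */
	if (setxattr("/lib/libc.so.6", "user.mem_nodes", "4-7",
		     sizeof("4-7") - 1, 0) < 0) {
		perror("setxattr");
		return 1;
	}
	return 0;
}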
>
> Such pre-touching of files is a common occurrence in the HPC (High
> Performance Computing) apps that run on the big honkin' NUMA iron where
> cpusets were born. I'm guessing that someone hosting 5000 web servers
> would rather not deal with that particular hassle.
I'm looking at it from the perspective of job control systems that
need to have a good idea what big datasets the jobs running under them
are touching/sharing.
Paul
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-20 23:53 ` Paul Menage
@ 2006-09-21 0:07 ` Paul Jackson
2006-09-21 0:10 ` Paul Menage
0 siblings, 1 reply; 125+ messages in thread
From: Paul Jackson @ 2006-09-21 0:07 UTC (permalink / raw)
To: Paul Menage
Cc: alan, clameter, npiggin, ckrm-tech, linux-kernel, rohitseth,
devel
Paul M wrote:
> An alternative would be a way of binding files (or directory
> hierarchies) to a particular set of memory nodes. Then you wouldn't
> need to pre-fault the data. Extended attributes might be one way of
> doing it.
Some of the file system folks have considered such use of extended
attributes, yes.
I remain unaware that any relation between that work and cpusets
exists or should exist.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-21 0:07 ` Paul Jackson
@ 2006-09-21 0:10 ` Paul Menage
2006-09-21 0:17 ` Paul Jackson
0 siblings, 1 reply; 125+ messages in thread
From: Paul Menage @ 2006-09-21 0:10 UTC (permalink / raw)
To: Paul Jackson
Cc: alan, clameter, npiggin, ckrm-tech, linux-kernel, rohitseth,
devel
On 9/20/06, Paul Jackson <pj@sgi.com> wrote:
> Some of the file system folks have considered such use of extended
> attributes, yes.
>
> I remain unaware that any relation between that work and cpusets
> exists or should exist.
It doesn't have to be linked to cpusets - but userspace could use it
in conjunction with cpusets to control/account pagecache memory
sharing.
Paul
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-21 0:10 ` Paul Menage
@ 2006-09-21 0:17 ` Paul Jackson
0 siblings, 0 replies; 125+ messages in thread
From: Paul Jackson @ 2006-09-21 0:17 UTC (permalink / raw)
To: Paul Menage
Cc: alan, clameter, npiggin, ckrm-tech, linux-kernel, rohitseth,
devel
Paul M:
> It doesn't have to be linked to cpusets - but userspace could use it
> in conjunction with cpusets to control/account pagecache memory
> sharing.
Could be. Right now I'm feeling too lazy to think about this hard
enough to be useful.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-20 16:25 ` Christoph Lameter
` (2 preceding siblings ...)
2006-09-20 17:34 ` Alan Cox
@ 2006-09-20 18:34 ` Chandra Seetharaman
2006-09-20 18:43 ` Paul Menage
2006-09-20 19:52 ` Christoph Lameter
2006-09-20 19:09 ` Chandra Seetharaman
4 siblings, 2 replies; 125+ messages in thread
From: Chandra Seetharaman @ 2006-09-20 18:34 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Rohit Seth, npiggin, pj, linux-kernel, devel, CKRM-Tech
On Wed, 2006-09-20 at 09:25 -0700, Christoph Lameter wrote:
> On Tue, 19 Sep 2006, Rohit Seth wrote:
>
> > For example, a user can run a batch job like backup inside containers.
> > This job if run unconstrained could step over most of the memory present
> > in system thus impacting other workloads running on the system at that
> > time. But when the same job is run inside containers then the backup
> > job is run within container limits.
>
> I just saw this for the first time since linux-mm was not cced. We have
> discussed a similar mechanism on linux-mm.
>
> We already have such functionality in the kernel; it's called a cpuset. A
Christoph,
There have been multiple discussions in the past (as recently as Aug 18,
2006) where we (Paul and the CKRM/RG folks) concluded that cpusets and
resource management are orthogonal features.
cpuset provides "resource isolation", and what we, the resource
management guys, want is work-conserving resource control.
cpuset partitions resources, and hence the resources that are assigned to
a node are not available to other cpusets, which is not good for
"resource management".
chandra
PS:
Aug 18 link: http://marc.theaimsgroup.com/?l=linux-kernel&m=115593114408336&w=2
Feb 2005 thread: http://marc.theaimsgroup.com/?l=ckrm-tech&m=110790400330617&w=2
> container could be created simply by creating a fake node that then
> allows constraining applications to this node. We already track the
> types of pages per node. The statistics you want are already existing.
> See /proc/zoneinfo and /sys/devices/system/node/node*/*.
>
> > We use the term container to indicate a structure against which we track
> > and charge utilization of system resources like memory, tasks etc for a
> > workload. Containers will allow system admins to customize the
> > underlying platform for different applications based on their
> > performance and HW resource utilization needs. Containers contain
> > enough infrastructure to allow optimal resource utilization without
> > bogging down rest of the kernel. A system admin should be able to
> > create, manage and free containers easily.
>
> Right, that's what cpusets do, and it has been working fine for years. Maybe
> Paul can help you if you find anything missing in the existing means to
> control resources.
--
----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- sekharan@us.ibm.com | .......you may get it.
----------------------------------------------------------------------
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-20 18:34 ` Chandra Seetharaman
@ 2006-09-20 18:43 ` Paul Menage
2006-09-20 18:54 ` Chandra Seetharaman
2006-09-20 20:11 ` Paul Jackson
2006-09-20 19:52 ` Christoph Lameter
1 sibling, 2 replies; 125+ messages in thread
From: Paul Menage @ 2006-09-20 18:43 UTC (permalink / raw)
To: sekharan
Cc: Christoph Lameter, npiggin, CKRM-Tech, linux-kernel, pj,
Rohit Seth, devel
On 9/20/06, Chandra Seetharaman <sekharan@us.ibm.com> wrote:
> > We already have such functionality in the kernel; it's called a cpuset. A
>
> Christoph,
>
> There had been multiple discussions in the past (as recent as Aug 18,
> 2006), where we (Paul and CKRM/RG folks) have concluded that cpuset and
> resource management are orthogonal features.
>
> cpuset provides "resource isolation", and what we, the resource
> management guys want is work-conserving resource control.
CPUset provides two things:
- a generic process container abstraction
- "resource controllers" for CPU masks and memory nodes.
Rather than adding a new process container abstraction, wouldn't it
make more sense to change cpuset to make it more extensible (more
separation between resource controllers), possibly rename it to
"containers", and let the various resource controllers fight it out
(e.g. zone/node-based memory controller vs multiple LRU controller,
CPU masks vs a properly QoS-based CPU scheduler, etc)
Or more specifically, what would need to be added to cpusets to make
it possible to bolt the CKRM/RG resource controllers on to it?
Paul
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-20 18:43 ` Paul Menage
@ 2006-09-20 18:54 ` Chandra Seetharaman
2006-09-20 19:25 ` Paul Menage
` (2 more replies)
2006-09-20 20:11 ` Paul Jackson
1 sibling, 3 replies; 125+ messages in thread
From: Chandra Seetharaman @ 2006-09-20 18:54 UTC (permalink / raw)
To: Paul Menage
Cc: npiggin, CKRM-Tech, linux-kernel, pj, Rohit Seth, devel,
Christoph Lameter
On Wed, 2006-09-20 at 11:43 -0700, Paul Menage wrote:
> On 9/20/06, Chandra Seetharaman <sekharan@us.ibm.com> wrote:
> > > We already have such functionality in the kernel; it's called a cpuset. A
> >
> > Christoph,
> >
> > There had been multiple discussions in the past (as recent as Aug 18,
> > 2006), where we (Paul and CKRM/RG folks) have concluded that cpuset and
> > resource management are orthogonal features.
> >
> > cpuset provides "resource isolation", and what we, the resource
> > management guys want is work-conserving resource control.
>
> CPUset provides two things:
>
> - a generic process container abstraction
>
> - "resource controllers" for CPU masks and memory nodes.
>
> Rather than adding a new process container abstraction, wouldn't it
> make more sense to change cpuset to make it more extensible (more
> separation between resource controllers), possibly rename it to
> "containers", and let the various resource controllers fight it out
> (e.g. zone/node-based memory controller vs multiple LRU controller,
> CPU masks vs a properly QoS-based CPU scheduler, etc)
>
> Or more specifically, what would need to be added to cpusets to make
> it possible to bolt the CKRM/RG resource controllers on to it?
Paul,
We had this discussion more than 18 months back and concluded that it is
not the right thing to do. Here is the link to the thread:
http://marc.theaimsgroup.com/?t=109173653100001&r=1&w=2
chandra
>
> Paul
--
----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- sekharan@us.ibm.com | .......you may get it.
----------------------------------------------------------------------
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-20 18:54 ` Chandra Seetharaman
@ 2006-09-20 19:25 ` Paul Menage
2006-09-20 19:35 ` Chandra Seetharaman
2006-09-20 20:49 ` Paul Jackson
2006-09-20 19:55 ` Christoph Lameter
2006-09-20 20:27 ` Paul Jackson
2 siblings, 2 replies; 125+ messages in thread
From: Paul Menage @ 2006-09-20 19:25 UTC (permalink / raw)
To: sekharan
Cc: npiggin, CKRM-Tech, linux-kernel, pj, Rohit Seth, devel,
Christoph Lameter
On 9/20/06, Chandra Seetharaman <sekharan@us.ibm.com> wrote:
>
> We had this discussion more than 18 months back and concluded that it is
> not the right thing to do. Here is the link to the thread:
Even if the resource control portions aren't totally compatible,
having two separate process container abstractions in the kernel is
sub-optimal, both in terms of efficiency and userspace management. How
about splitting out the container portions of cpuset from the actual
resource control, so that CKRM/RG can hang off of it too? Creation of
a cpuset or a resource group would be driven by creation of a
container; at fork time, a task inherits its parent's container, and
hence its cpuset and/or resource groups.
At its most crude, this could be something like:
struct container {
#ifdef CONFIG_CPUSETS
struct cpuset cs;
#endif
#ifdef CONFIG_RES_GROUPS
struct resource_group rg;
#endif
};
but at least it would be sharing some of the abstractions.
Paul
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-20 19:25 ` Paul Menage
@ 2006-09-20 19:35 ` Chandra Seetharaman
2006-09-20 19:57 ` Paul Menage
2006-09-20 20:49 ` Paul Jackson
1 sibling, 1 reply; 125+ messages in thread
From: Chandra Seetharaman @ 2006-09-20 19:35 UTC (permalink / raw)
To: Paul Menage
Cc: npiggin, CKRM-Tech, linux-kernel, pj, Rohit Seth, devel,
Christoph Lameter
On Wed, 2006-09-20 at 12:25 -0700, Paul Menage wrote:
> On 9/20/06, Chandra Seetharaman <sekharan@us.ibm.com> wrote:
> >
> > We had this discussion more than 18 months back and concluded that it is
> > not the right thing to do. Here is the link to the thread:
>
> Even if the resource control portions aren't totally compatible,
> having two separate process container abstractions in the kernel is
> sub-optimal, both in terms of efficiency and userspace management. How
> about splitting out the container portions of cpuset from the actual
> resource control, so that CKRM/RG can hang off of it too? Creation of
> a cpuset or a resource group would be driven by creation of a
> container; at fork time, a task inherits its parent's container, and
> hence its cpuset and/or resource groups.
>
> At its most crude, this could be something like:
>
> struct container {
> #ifdef CONFIG_CPUSETS
> struct cpuset cs;
> #endif
> #ifdef CONFIG_RES_GROUPS
> struct resource_group rg;
> #endif
> };
Won't it restrict the user to choosing one of these, and not both?
It will also prevent the possibility of having resource groups within a
cpuset.
>
> but at least it would be sharing some of the abstractions.
>
> Paul
--
----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- sekharan@us.ibm.com | .......you may get it.
----------------------------------------------------------------------
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-20 19:35 ` Chandra Seetharaman
@ 2006-09-20 19:57 ` Paul Menage
2006-09-21 0:30 ` Chandra Seetharaman
0 siblings, 1 reply; 125+ messages in thread
From: Paul Menage @ 2006-09-20 19:57 UTC (permalink / raw)
To: sekharan
Cc: npiggin, CKRM-Tech, linux-kernel, pj, Rohit Seth, devel,
Christoph Lameter
On 9/20/06, Chandra Seetharaman <sekharan@us.ibm.com> wrote:
> > At its most crude, this could be something like:
> >
> > struct container {
> > #ifdef CONFIG_CPUSETS
> > struct cpuset cs;
> > #endif
> > #ifdef CONFIG_RES_GROUPS
> > struct resource_group rg;
> > #endif
> > };
>
> > Won't it restrict the user to choosing one of these, and not both?
Not necessarily - you could have both compiled in, and each would only
worry about the resource management that it cared about - e.g. you
could use the memory node isolation portion of cpusets (in conjunction
with fake numa nodes/zones) for memory containment, but give every
cpuset access to all CPUs and control CPU usage via the resource
groups CPU controller.
The generic code would take care of details like container
creation/destruction (with appropriate callbacks into cpuset and/or
res_group code), tracking task membership of containers, etc.
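As a rough sketch, the hooks such a generic layer might offer each
subsystem could look something like this (all names invented for
illustration):

/* Illustrative only: per-subsystem callbacks invoked by a generic
 * container layer. */
struct container_subsys {
	const char *name;			/* "cpuset", "res_group", ... */
	int  (*create)(struct container *cont);	/* mkdir in the fs */
	void (*destroy)(struct container *cont);	/* rmdir */
	void (*attach)(struct container *cont,
		       struct task_struct *tsk);	/* task moved in */
	void (*fork)(struct container *cont,
		     struct task_struct *child);	/* child inherits */
};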
Paul
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-20 19:57 ` Paul Menage
@ 2006-09-21 0:30 ` Chandra Seetharaman
2006-09-21 0:33 ` Paul Jackson
2006-09-21 0:34 ` Paul Menage
0 siblings, 2 replies; 125+ messages in thread
From: Chandra Seetharaman @ 2006-09-21 0:30 UTC (permalink / raw)
To: Paul Menage
Cc: npiggin, CKRM-Tech, linux-kernel, pj, Rohit Seth, devel,
Christoph Lameter
On Wed, 2006-09-20 at 12:57 -0700, Paul Menage wrote:
> On 9/20/06, Chandra Seetharaman <sekharan@us.ibm.com> wrote:
> > > At its most crude, this could be something like:
> > >
> > > struct container {
> > > #ifdef CONFIG_CPUSETS
> > > struct cpuset cs;
> > > #endif
> > > #ifdef CONFIG_RES_GROUPS
> > > struct resource_group rg;
> > > #endif
> > > };
> >
> > Won't it restrict the user to choose one of these, and not both.
>
> Not necessarily - you could have both compiled in, and each would only
> worry about the resource management that they cared about - e.g. you
> could use the memory node isolation portion of cpusets (in conjunction
> with fake numa nodes/zones) for memory containment, but give every
> cpuset access to all CPUs and control CPU usage via the resource
> groups CPU controller.
>
> The generic code would take care of details like container
> creation/destruction (with appropriate callbacks into cpuset and/or
> res_group code, tracking task membership of containers, etc.
What I am wondering is whether the tight coupling of rg and cpuset
(into a container data structure) is OK.
>
> Paul
--
----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- sekharan@us.ibm.com | .......you may get it.
----------------------------------------------------------------------
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-21 0:30 ` Chandra Seetharaman
@ 2006-09-21 0:33 ` Paul Jackson
2006-09-21 0:50 ` Chandra Seetharaman
2006-09-21 0:34 ` Paul Menage
1 sibling, 1 reply; 125+ messages in thread
From: Paul Jackson @ 2006-09-21 0:33 UTC (permalink / raw)
To: sekharan
Cc: menage, npiggin, ckrm-tech, linux-kernel, rohitseth, devel,
clameter
Chandra wrote:
> What I am wondering is whether the tight coupling of rg and cpuset
> (into a container data structure) is OK.
Just guessing wildly here, but I'd anticipate that at best we
(resource groups and cpusets) would share container mechanisms,
but not share the same container instances.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-21 0:33 ` Paul Jackson
@ 2006-09-21 0:50 ` Chandra Seetharaman
0 siblings, 0 replies; 125+ messages in thread
From: Chandra Seetharaman @ 2006-09-21 0:50 UTC (permalink / raw)
To: Paul Jackson
Cc: menage, npiggin, ckrm-tech, linux-kernel, rohitseth, devel,
clameter
On Wed, 2006-09-20 at 17:33 -0700, Paul Jackson wrote:
> Chandra wrote:
> > What I am wondering is whether the tight coupling of rg and cpuset
> > (into a container data structure) is OK.
>
> Just guessing wildly here, but I'd anticipate that at best we
> (resource groups and cpusets) would share container mechanisms,
> but not share the same container instances.
That is what I was thinking too.
>
--
----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- sekharan@us.ibm.com | .......you may get it.
----------------------------------------------------------------------
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-21 0:30 ` Chandra Seetharaman
2006-09-21 0:33 ` Paul Jackson
@ 2006-09-21 0:34 ` Paul Menage
1 sibling, 0 replies; 125+ messages in thread
From: Paul Menage @ 2006-09-21 0:34 UTC (permalink / raw)
To: sekharan
Cc: npiggin, CKRM-Tech, linux-kernel, pj, Rohit Seth, devel,
Christoph Lameter
On 9/20/06, Chandra Seetharaman <sekharan@us.ibm.com> wrote:
>
> What I am wondering is whether the tight coupling of rg and cpuset
> (into a container data structure) is OK.
Can you suggest a realistic scenario in which it's not? Don't forget
that since the container abstraction is hierarchical, you don't have
to use both at the same level. So you could easily e.g. have a parent
container that was bound to a set of memory/cpu nodes but had no
rg limits, and several subcontainers where you configured nothing
special for cpuset parameters (so they inherited the parent params) but
tweaked different rg parameters.
Paul
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-20 19:25 ` Paul Menage
2006-09-20 19:35 ` Chandra Seetharaman
@ 2006-09-20 20:49 ` Paul Jackson
2006-09-20 20:51 ` Paul Menage
2006-09-21 0:45 ` Chandra Seetharaman
1 sibling, 2 replies; 125+ messages in thread
From: Paul Jackson @ 2006-09-20 20:49 UTC (permalink / raw)
To: Paul Menage
Cc: sekharan, npiggin, ckrm-tech, linux-kernel, rohitseth, devel,
clameter
Paul M wrote:
> Even if the resource control portions aren't totally compatible,
> having two separate process container abstractions in the kernel is
> sub-optimal
At heart, CKRM (née Resource Groups) is (well, has been until now)
different from cpusets.
Cpusets answers the question 'where', and Resource Groups 'how much'.
The fundamental motivation behind cpusets was to be able to enforce
job isolation. A job can get dedicated use of specified resources,
-even- if it means those resources are severely underutilized by that
job.
The fundamental motivation (Chandra or others correct me if I'm wrong)
of Resource Groups is to improve capacity utilization while limiting
starvation due to greedy, competing users for the same resources.
Cpusets seeks maximum isolation. Resource Groups seeks maximum
capacity utilization while preserving guaranteed levels of quality
of service.
Cpusets are that wall between you and the neighbor you might not
trust. Resource groups are a large family of modest wealth sitting
down to share a meal.
It seems that cpusets can mimic memory resource groups. I don't
see how cpusets could mimic other resource groups. But maybe I'm
just being a dim bulb.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-20 20:49 ` Paul Jackson
@ 2006-09-20 20:51 ` Paul Menage
2006-09-20 21:04 ` Paul Jackson
2006-09-21 0:45 ` Chandra Seetharaman
1 sibling, 1 reply; 125+ messages in thread
From: Paul Menage @ 2006-09-20 20:51 UTC (permalink / raw)
To: Paul Jackson
Cc: sekharan, npiggin, ckrm-tech, linux-kernel, rohitseth, devel,
clameter
On 9/20/06, Paul Jackson <pj@sgi.com> wrote:
>
> It seems that cpusets can mimic memory resource groups. I don't
> see how cpusets could mimic other resource groups. But maybe I'm
> just being a dim bulb.
>
I'm not saying that they can - but they could be parallel types of
resource controller for a generic container abstraction, so that
userspace can create a container, and use e.g. memory node isolation
from the cpusets code in conjunction with the resource groups %-based
CPU scheduler.
Paul
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-20 20:51 ` Paul Menage
@ 2006-09-20 21:04 ` Paul Jackson
[not found] ` <6599ad830609201605s2fc1ccbdse31e3e60a50d56bc@mail.google.com>
0 siblings, 1 reply; 125+ messages in thread
From: Paul Jackson @ 2006-09-20 21:04 UTC (permalink / raw)
To: Paul Menage
Cc: sekharan, npiggin, ckrm-tech, linux-kernel, rohitseth, devel,
clameter
Paul M. wrote:
> I'm not saying that they can - but they could be parallel types of
> resource controller for a generic container abstraction,
When there are a sufficiently large number of sufficiently
similar types of objects, such as for example file systems,
then a 'generic container abstraction' such as vfs in the
file system case becomes well worth it, even essential.
I'll be surprised if we have enough such similarity between
cpusets and resource groups to be able to find a useful abstract
generalization that is common to them both.
But if someone finds a way to rewrite resource groups using
cpusets and can convince the resource group folks of this,
I'm game to consider it.
Just because some abstract generalizations are good doesn't
mean they all are.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-20 20:49 ` Paul Jackson
2006-09-20 20:51 ` Paul Menage
@ 2006-09-21 0:45 ` Chandra Seetharaman
2006-09-21 0:51 ` Paul Jackson
1 sibling, 1 reply; 125+ messages in thread
From: Chandra Seetharaman @ 2006-09-21 0:45 UTC (permalink / raw)
To: Paul Jackson
Cc: Paul Menage, npiggin, ckrm-tech, linux-kernel, rohitseth, devel,
clameter
On Wed, 2006-09-20 at 13:49 -0700, Paul Jackson wrote:
I concur with most of the comments (except as noted below)
> Paul M wrote:
> > Even if the resource control portions aren't totally compatible,
> > having two separate process container abstractions in the kernel is
> > sub-optimal
>
> At heart, CKRM (née Resource Groups) is (well, has been until now)
> different from cpusets.
>
> Cpusets answers the question 'where', and Resource Groups 'how much'.
>
> The fundamental motivation behind cpusets was to be able to enforce
> job isolation. A job can get dedicated use of specified resources,
> -even- if it means those resources are severely underutilized by that
> job.
>
> The fundamental motivation (Chandra or others correct me if I'm wrong)
> of Resource Groups is to improve capacity utilization while limiting
> starvation due to greedy, competing users for the same resources.
>
> Cpusets seeks maximum isolation. Resource Groups seeks maximum
> capacity utilization while preserving guaranteed levels of quality
> of service.
>
> Cpusets are that wall between you and the neighbor you might not
> trust. Resource groups are a large family of modest wealth sitting
> down to share a meal.
I am thinking hard about how to bring guarantees into this picture :).
>
> It seems that cpusets can mimic memory resource groups. I don't
I am a little confused w.r.t. how cpusets can mimic memory resource
groups. How can cpusets provide support for overcommit?
> see how cpusets could mimic other resource groups. But maybe I'm
> just being a dim bulb.
>
--
----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- sekharan@us.ibm.com | .......you may get it.
----------------------------------------------------------------------
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-21 0:45 ` Chandra Seetharaman
@ 2006-09-21 0:51 ` Paul Jackson
0 siblings, 0 replies; 125+ messages in thread
From: Paul Jackson @ 2006-09-21 0:51 UTC (permalink / raw)
To: sekharan
Cc: menage, npiggin, ckrm-tech, linux-kernel, rohitseth, devel,
clameter
Chandra wrote:
> > It seems that cpusets can mimic memory resource groups. I don't
>
> I am a little confused w.r.t. how cpusets can mimic memory resource
> groups. How can cpusets provide support for overcommit?
I didn't say "mimic well" ;).
I had no clue cpusets could do overcommit at all, though Paul Menage just
posted a notion of how to mimic overcommit, with his post beginning:
> I have some patches locally that basically let you give out a small
> set of nodes initially to a cpuset, and if memory pressure in
> try_to_free_pages() passes a specified threshold, automatically
> allocate one of the parent cpuset's unused memory nodes to the child
> cpuset, up to specified limit.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-20 18:54 ` Chandra Seetharaman
2006-09-20 19:25 ` Paul Menage
@ 2006-09-20 19:55 ` Christoph Lameter
2006-09-20 20:27 ` Paul Jackson
2 siblings, 0 replies; 125+ messages in thread
From: Christoph Lameter @ 2006-09-20 19:55 UTC (permalink / raw)
To: Chandra Seetharaman
Cc: Paul Menage, npiggin, CKRM-Tech, linux-kernel, pj, Rohit Seth,
devel
On Wed, 20 Sep 2006, Chandra Seetharaman wrote:
> We had this discussion more than 18 months back and concluded that it is
> not the right thing to do. Here is the link to the thread:
Recent discussions on linux-mm sounded very different. I also brought this
up at the VM summit. Could you have a look at cpusets and the discussion
on linux-mm and then think about how this could be done in a less
VM-invasive way?
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-20 18:54 ` Chandra Seetharaman
2006-09-20 19:25 ` Paul Menage
2006-09-20 19:55 ` Christoph Lameter
@ 2006-09-20 20:27 ` Paul Jackson
2006-09-21 17:02 ` Srivatsa Vaddagiri
2 siblings, 1 reply; 125+ messages in thread
From: Paul Jackson @ 2006-09-20 20:27 UTC (permalink / raw)
To: sekharan
Cc: menage, npiggin, ckrm-tech, linux-kernel, rohitseth, devel,
clameter
Chandra wrote:
> We had this discussion more than 18 months back and concluded that it is
> not the right thing to do. Here is the link to the thread:
Because it is easy enough to carve memory up into nice little nameable
chunks, it might be the case that we can manage the percentage of
memory used by the expedient of something like cpusets and fake nodes.
Indeed, that seems to be doable, based on this latest work of Andrew
and others (David, some_bright_spark@jp, Magnus, ...). There are
still a bunch of wrinkles that remain to be ironed out.
For other resources, such as CPU cycles and network bandwidth, unless
another bright spark comes up with an insight, I don't see how to
express the "percentage used" semantics provided by something such
as CKRM, using anything resembling cpusets.
... Can one imagine having the scheduler subdivide each second of
time available on a CPU into several fake-CPUs, each one of which
speaks for one of those sub-second fake-CPU slices? Sounds too
weird to me, and a bit too rigid to be a serviceable CKRM substitute.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-20 20:27 ` Paul Jackson
@ 2006-09-21 17:02 ` Srivatsa Vaddagiri
2006-09-21 19:29 ` Paul Jackson
0 siblings, 1 reply; 125+ messages in thread
From: Srivatsa Vaddagiri @ 2006-09-21 17:02 UTC (permalink / raw)
To: Paul Jackson
Cc: sekharan, npiggin, ckrm-tech, linux-kernel, rohitseth, menage,
devel, clameter
On Wed, Sep 20, 2006 at 01:27:34PM -0700, Paul Jackson wrote:
> For other resources, such as CPU cycles and network bandwidth, unless
> another bright spark comes up with an insight, I don't see how to
> express the "percentage used" semantics provided by something such
> as CKRM, using anything resembling cpusets.
How about metered cpusets? Each child cpuset of a metered cpuset
represents a fraction of the CPU time allotted to the tasks of the child
cpuset.
> ... Can one imagine having the scheduler subdivide each second of
> time available on a CPU into several fake-CPUs, each one of which
> speaks for one of those sub-second fake-CPU slices? Sounds too
> weird to me, and a bit too rigid to be a servicable CKRM substitute.
--
Regards,
vatsa
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-21 17:02 ` Srivatsa Vaddagiri
@ 2006-09-21 19:29 ` Paul Jackson
0 siblings, 0 replies; 125+ messages in thread
From: Paul Jackson @ 2006-09-21 19:29 UTC (permalink / raw)
To: vatsa
Cc: npiggin, sekharan, ckrm-tech, linux-kernel, rohitseth, menage,
devel, clameter
Vatsa wrote:
> How about metered cpusets? Each child cpuset of a metered cpuset
> represents a fraction of the CPU time allotted to the tasks of the child
> cpuset.
Ah yes - they might work. Sorry I didn't think of your
meter_cpu controller patch with its cpuset interface
when I wrote the above.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-20 18:43 ` Paul Menage
2006-09-20 18:54 ` Chandra Seetharaman
@ 2006-09-20 20:11 ` Paul Jackson
2006-09-20 20:17 ` Paul Menage
1 sibling, 1 reply; 125+ messages in thread
From: Paul Jackson @ 2006-09-20 20:11 UTC (permalink / raw)
To: Paul Menage
Cc: sekharan, clameter, npiggin, ckrm-tech, linux-kernel, rohitseth,
devel
Paul M. wrote:
> Rather than adding a new process container abstraction, wouldn't it
> make more sense to change cpuset to make it more extensible (more
> separation between resource controllers), possibly rename it to
> "containers",
Without commenting one way or the other on the overall advisability
of this (for lack of sufficient clues), if we did this and renamed
"cpusets" to "containers", we would still want to export the /dev/cpuset
interface to just the CPU/Memory controllers. Perhaps the "container"
pseudo-filesystem could optionally be mounted with a "cpuset" option,
that just exposed the cpuset relevant interface, or some such thing.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-20 20:11 ` Paul Jackson
@ 2006-09-20 20:17 ` Paul Menage
0 siblings, 0 replies; 125+ messages in thread
From: Paul Menage @ 2006-09-20 20:17 UTC (permalink / raw)
To: Paul Jackson
Cc: sekharan, clameter, npiggin, ckrm-tech, linux-kernel, rohitseth,
devel
On 9/20/06, Paul Jackson <pj@sgi.com> wrote:
> Paul M. wrote:
> > Rather than adding a new process container abstraction, wouldn't it
> > make more sense to change cpuset to make it more extensible (more
> > separation between resource controllers), possibly rename it to
> > "containers",
>
> Without commenting one way or the other on the overall advisability
> of this (for lack of sufficient clues), if we did this and renamed
> "cpusets" to "containers", we would still want to export the /dev/cpuset
> interface to just the CPU/Memory controllers. Perhaps the "container"
> pseudo-filesystem could optionally be mounted with a "cpuset" option,
> that just exposed the cpuset relevant interface, or some such thing.
Absolutely - I was thinking that as a first cut, any subsystem (e.g.
cpusets, res_groups, etc) that wanted to use per-task containers could
declare what files it wanted a container dir populated with, so you
could have it looking just like cpusets if you wanted to, and mount it
on /dev/cpuset and use it exactly as before. If you then added the
res_group patch to your kernel, you would also get the appropriate
resource group files appearing in each directory, but the cpuset
support would work as before.
Longer term we'd probably want to figure out a better naming
partitioning scheme, or maybe just a convention that each directory
entry was prefixed with the subsystem name. Also, maybe have a
convention that control files and subcontainer names be in different
namespaces (e.g. all control files start with ".", all subcontainer
names start with something else).
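As a sketch of the sort of declaration a subsystem might make under
that scheme (names and signatures invented for illustration):

/* Illustrative only: a subsystem registers the control files it wants
 * in every container directory; the generic layer does the VFS
 * plumbing and could prefix each name with the subsystem's. */
struct cftype {
	char name[32];			/* e.g. "cpuset.mems" */
	ssize_t (*read)(struct container *cont, char *buf, size_t len);
	ssize_t (*write)(struct container *cont, const char *buf,
			 size_t len);
};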
Paul
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-20 18:34 ` Chandra Seetharaman
2006-09-20 18:43 ` Paul Menage
@ 2006-09-20 19:52 ` Christoph Lameter
2006-09-21 0:31 ` Chandra Seetharaman
1 sibling, 1 reply; 125+ messages in thread
From: Christoph Lameter @ 2006-09-20 19:52 UTC (permalink / raw)
To: Chandra Seetharaman
Cc: Rohit Seth, npiggin, pj, linux-kernel, devel, CKRM-Tech
On Wed, 20 Sep 2006, Chandra Seetharaman wrote:
> cpuset partitions resources, and hence the resources that are assigned to
> a node are not available to other cpusets, which is not good for "resource
> management".
cpusets can have one node in multiple cpusets.
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-20 19:52 ` Christoph Lameter
@ 2006-09-21 0:31 ` Chandra Seetharaman
2006-09-21 0:36 ` Paul Jackson
0 siblings, 1 reply; 125+ messages in thread
From: Chandra Seetharaman @ 2006-09-21 0:31 UTC (permalink / raw)
To: Christoph Lameter; +Cc: npiggin, CKRM-Tech, linux-kernel, pj, Rohit Seth, devel
On Wed, 2006-09-20 at 12:52 -0700, Christoph Lameter wrote:
> On Wed, 20 Sep 2006, Chandra Seetharaman wrote:
>
> > cpuset partitions resources, and hence the resources that are assigned to
> > a node are not available to other cpusets, which is not good for "resource
> > management".
>
> cpusets can have one node in multiple cpusets.
AFAICS, that doesn't help me in overcommitting resources.
--
----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- sekharan@us.ibm.com | .......you may get it.
----------------------------------------------------------------------
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-21 0:31 ` Chandra Seetharaman
@ 2006-09-21 0:36 ` Paul Jackson
2006-09-21 0:42 ` Paul Menage
0 siblings, 1 reply; 125+ messages in thread
From: Paul Jackson @ 2006-09-21 0:36 UTC (permalink / raw)
To: sekharan; +Cc: clameter, npiggin, ckrm-tech, linux-kernel, rohitseth, devel
Chandra wrote:
> AFAICS, that doesn't help me in overcommitting resources.
I agree - I don't think cpusets plus fake numa ... handles overcommit.
You could hack up a cheap substitute, but it wouldn't do the job.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-21 0:36 ` Paul Jackson
@ 2006-09-21 0:42 ` Paul Menage
2006-09-21 1:45 ` Chandra Seetharaman
0 siblings, 1 reply; 125+ messages in thread
From: Paul Menage @ 2006-09-21 0:42 UTC (permalink / raw)
To: Paul Jackson
Cc: sekharan, npiggin, ckrm-tech, linux-kernel, rohitseth, devel,
clameter
On 9/20/06, Paul Jackson <pj@sgi.com> wrote:
> Chandra wrote:
> > AFAICS, that doesn't help me in overcommitting resources.
>
> I agree - I don't think cpusets plus fake numa ... handles overcommit.
> You could hack up a cheap substitute, but it wouldn't do the job.
I have some patches locally that basically let you give out a small
set of nodes initially to a cpuset, and if memory pressure in
try_to_free_pages() passes a specified threshold, automatically
allocate one of the parent cpuset's unused memory nodes to the child
cpuset, up to a specified limit. It's a bit ugly, but it lets you trade
off performance vs memory footprint on a per-job basis (when combined
with fake numa to give lots of small nodes).
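Guessing at the shape of the expansion check from the description
above (this is not the actual patch; find_unused_node(),
update_tasks_nodemask() and the pressure fields are invented):

/* Called from the reclaim path when a cpuset-constrained
 * allocation is under sustained memory pressure. */
void maybe_grow_cpuset(struct cpuset *cs)
{
	int node;

	if (cs->pressure < cs->grow_threshold)
		return;
	if (nodes_weight(cs->mems_allowed) >= cs->mems_limit)
		return;				/* already at its limit */

	node = find_unused_node(cs->parent);
	if (node >= 0) {
		node_set(node, cs->mems_allowed);	/* hand over a node */
		update_tasks_nodemask(cs);
	}
}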
Paul
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-21 0:42 ` Paul Menage
@ 2006-09-21 1:45 ` Chandra Seetharaman
2006-09-21 1:52 ` Paul Menage
0 siblings, 1 reply; 125+ messages in thread
From: Chandra Seetharaman @ 2006-09-21 1:45 UTC (permalink / raw)
To: Paul Menage
Cc: Paul Jackson, npiggin, ckrm-tech, linux-kernel, rohitseth, devel,
clameter
On Wed, 2006-09-20 at 17:42 -0700, Paul Menage wrote:
> On 9/20/06, Paul Jackson <pj@sgi.com> wrote:
> > Chandra wrote:
> > > AFAICS, That doesn't help me in over committing resources.
> >
> > I agree - I don't think cpusets plus fake numa ... handles over commit.
> > You might could hack up a cheap substitute, but it wouldn't do the job.
>
> I have some patches locally that basically let you give out a small
> set of nodes initially to a cpuset, and if memory pressure in
> try_to_free_pages() passes a specified threshold, automatically
> allocate one of the parent cpuset's unused memory nodes to the child
> cpuset, up to a specified limit. It's a bit ugly, but it lets you trade off
> performance vs memory footprint on a per-job basis (when combined with
> fake numa to give lots of small nodes).
Interesting. So you could set up the fake node with "guarantee" and let
it grow till "limit"?
BTW, can you do these with fake nodes:
- dynamic creation
- dynamic removal
- dynamic change of size
Also, how could we account when a process moves from one node to
another?
>
> Paul
--
----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- sekharan@us.ibm.com | .......you may get it.
----------------------------------------------------------------------
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-21 1:45 ` Chandra Seetharaman
@ 2006-09-21 1:52 ` Paul Menage
2006-09-21 20:06 ` Chandra Seetharaman
0 siblings, 1 reply; 125+ messages in thread
From: Paul Menage @ 2006-09-21 1:52 UTC (permalink / raw)
To: sekharan
Cc: Paul Jackson, npiggin, ckrm-tech, linux-kernel, rohitseth, devel,
clameter
On 9/20/06, Chandra Seetharaman <sekharan@us.ibm.com> wrote:
>
> Interesting. So you could set up the fake node with "guarantee" and let
> it grow till "limit"?
Sure - that works great. (Theoretically you could do this all in
userspace - start by assigning "guarantee" nodes to a
container/cpuset, and when it gets close to its memory limit assign
more nodes to it. But in practice userspace can't keep up with rapid
memory allocators.)
>
> BTW, can you do these with fake nodes:
> - dynamic creation
> - dynamic removal
> - dynamic change of size
The current fake numa support requires you to choose your node layout
at boot time - I've been working with 64 fake nodes of 128M each,
which gives a reasonable granularity for dividing a machine between
multiple different sized jobs.
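(For reference, that layout comes from the boot command line - on
x86_64 something like:

	numa=fake=64

so changing the granularity means a reboot.)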
>
> Also, how could we account when a process moves from one node to
> another?
If you want to do that (the systems I'm working on don't really) you
could probably do it with the migrate_pages() syscall. It might not be
that efficient though.
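For what it's worth, a sketch of that call - migrate_pages(2) itself is
real (merged in 2.6.16), but the pid and node numbers here are made up:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	/* Move the pages of pid 1234 from node 3 to node 5.  Bit i of
	 * each mask stands for node i; maxnode is the mask width. */
	unsigned long old_nodes = 1UL << 3;
	unsigned long new_nodes = 1UL << 5;

	if (syscall(SYS_migrate_pages, 1234, 8 * sizeof(unsigned long),
		    &old_nodes, &new_nodes) < 0) {
		perror("migrate_pages");
		return 1;
	}
	return 0;
}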
Paul
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-21 1:52 ` Paul Menage
@ 2006-09-21 20:06 ` Chandra Seetharaman
2006-09-21 20:10 ` Paul Menage
0 siblings, 1 reply; 125+ messages in thread
From: Chandra Seetharaman @ 2006-09-21 20:06 UTC (permalink / raw)
To: Paul Menage
Cc: Paul Jackson, npiggin, ckrm-tech, linux-kernel, rohitseth, devel,
clameter
On Wed, 2006-09-20 at 18:52 -0700, Paul Menage wrote:
> On 9/20/06, Chandra Seetharaman <sekharan@us.ibm.com> wrote:
> >
> > Interesting. So you could set up the fake node with "guarantee" and let
> > it grow till "limit" ?
>
> Sure - that works great. (Theoretically you could do this all in
> userspace - start by assigning "guarantee" nodes to a
> container/cpuset and when it gets close to its memory limit assign
> more nodes to it. But in practice userspace can't keep up with rapid
> memory allocators.
>
I agree, especially when one of your main objectives is resource
utilization. Think about the magnitude of this when you have to deal
with 100s of containers.
> >
> > BTW, can you do these with fake nodes:
> > - dynamic creation
> > - dynamic removal
> > - dynamic change of size
>
> The current fake numa support requires you to choose your node layout
> at boot time - I've been working with 64 fake nodes of 128M each,
> which gives a reasonable granularity for dividing a machine between
> multiple different sized jobs.
It still will not satisfy what OpenVZ/Container folks are looking for:
100s of containers.
>
> >
> > Also, how could we account when a process moves from one node to
> > another?
>
> If you want to do that (the systems I'm working on don't really) you
> could probably do it with the migrate_pages() syscall. It might not be
> that efficient though.
Totally agree, that will be very costly.
>
> Paul
--
----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- sekharan@us.ibm.com | .......you may get it.
----------------------------------------------------------------------
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-21 20:06 ` Chandra Seetharaman
@ 2006-09-21 20:10 ` Paul Menage
2006-09-21 21:44 ` Chandra Seetharaman
2006-09-21 21:59 ` Paul Jackson
0 siblings, 2 replies; 125+ messages in thread
From: Paul Menage @ 2006-09-21 20:10 UTC (permalink / raw)
To: sekharan
Cc: Paul Jackson, npiggin, ckrm-tech, linux-kernel, rohitseth, devel,
clameter
On 9/21/06, Chandra Seetharaman <sekharan@us.ibm.com> wrote:
> > The current fake numa support requires you to choose your node layout
> > at boot time - I've been working with 64 fake nodes of 128M each,
> > which gives a reasonable granularity for dividing a machine between
> > multiple different sized jobs.
>
> It still will not satisfy what OpenVZ/Container folks are looking for:
> 100s of containers.
Right - so fake-numa is not the right solution for everyone, and I
never suggested that it is. (Having said that, there are discussions
underway to make the zone-based approach more practical - if you could
have dynamically-resizable nodes, this would be more applicable to
openvz).
But, there's no reason that the OpenVZ resource control mechanisms
couldn't be hooked into a generic process container mechanism along
with cpusets and RG.
Paul
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-21 20:10 ` Paul Menage
@ 2006-09-21 21:44 ` Chandra Seetharaman
2006-09-21 22:09 ` Paul Menage
2006-09-21 21:59 ` Paul Jackson
1 sibling, 1 reply; 125+ messages in thread
From: Chandra Seetharaman @ 2006-09-21 21:44 UTC (permalink / raw)
To: Paul Menage
Cc: Paul Jackson, npiggin, ckrm-tech, linux-kernel, rohitseth, devel,
clameter
On Thu, 2006-09-21 at 13:10 -0700, Paul Menage wrote:
> On 9/21/06, Chandra Seetharaman <sekharan@us.ibm.com> wrote:
> > > The current fake numa support requires you to choose your node layout
> > > at boot time - I've been working with 64 fake nodes of 128M each,
> > > which gives a reasonable granularity for dividing a machine between
> > > multiple different sized jobs.
> >
> > It still will not satisfy what OpenVZ/Container folks are looking for:
> > 100s of containers.
>
> Right - so fake-numa is not the right solution for everyone, and I
> never suggested that it is. (Having said that, there are discussions
> underway to make the zone-based approach more practical - if you could
> have dynamically-resizable nodes, this would be more applicable to
> openvz).
It would still have the other issue you pointed out, i.e. userspace
being able to cope with memory allocator dynamics.
>
> But, there's no reason that the OpenVZ resource control mechanisms
> couldn't be hooked into a generic process container mechanism along
> with cpusets and RG.
Isn't that one of the things we are trying to avoid (each one having
their own solution, especially when we _can_ have a common solution)?
>
> Paul
--
----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- sekharan@us.ibm.com | .......you may get it.
----------------------------------------------------------------------
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-21 21:44 ` Chandra Seetharaman
@ 2006-09-21 22:09 ` Paul Menage
2006-09-22 0:06 ` Chandra Seetharaman
0 siblings, 1 reply; 125+ messages in thread
From: Paul Menage @ 2006-09-21 22:09 UTC (permalink / raw)
To: sekharan
Cc: npiggin, ckrm-tech, linux-kernel, Paul Jackson, rohitseth, devel,
clameter
On 9/21/06, Chandra Seetharaman <sekharan@us.ibm.com> wrote:
>
> >
> > But, there's no reason that the OpenVZ resource control mechanisms
> > couldn't be hooked into a generic process container mechanism along
> > with cpusets and RG.
>
> Isn't that one of the things we are trying to avoid (each one having
> their own solution, especially when we _can_ have a common solution)?
Can we actually have a single common solution that works for everyone,
no matter what their needs? It's already apparent that there are
multiple different and subtly incompatible definitions of what "memory
controller" means and needs to do. Maybe these can be resolved - but
maybe it's better to have, say, two simple but very different memory
controllers that the user can pick between, rather than one big and
complicated one that tries to please everyone.
Paul
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-21 22:09 ` Paul Menage
@ 2006-09-22 0:06 ` Chandra Seetharaman
2006-09-22 0:13 ` Paul Menage
2006-09-22 0:24 ` Paul Jackson
0 siblings, 2 replies; 125+ messages in thread
From: Chandra Seetharaman @ 2006-09-22 0:06 UTC (permalink / raw)
To: Paul Menage
Cc: npiggin, ckrm-tech, linux-kernel, Paul Jackson, rohitseth, devel,
clameter
On Thu, 2006-09-21 at 15:09 -0700, Paul Menage wrote:
> On 9/21/06, Chandra Seetharaman <sekharan@us.ibm.com> wrote:
> >
> > >
> > > But, there's no reason that the OpenVZ resource control mechanisms
> > > couldn't be hooked into a generic process container mechanism along
> > > with cpusets and RG.
> >
> > Isn't that one of the things we are trying to avoid (each one having
> > their own solution, especially when we _can_ have a common solution)?
>
> Can we actually have a single common solution that works for everyone,
> no matter what their needs? It's already apparent that there are
> multiple different and subtly incompatible definitions of what "memory
> controller" means and needs to do. Maybe these can be resolved - but
> maybe it's better to have, say, two simple but very different memory
> controllers that the user can pick between, rather than one big and
> complicated one that tries to please everyone.
Paul,
Think about what will be available to customers through a distro.
There are two (competing) memory controllers in the kernel, but a distro
can turn only one ON. Which in turn means
- there will be a debate between the two controllers' users/advocates and
the distro (a headache for the distro) about which one to turn ON
- one party will _not_ get what they want, and hence there was no point in
getting their feature into the mainline in the first place
(dissatisfaction of the users/original implementors of one solution).
So, IMHO, it is better to sort out the differences before we get things
into the mainline kernel.
>
> Paul
--
----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- sekharan@us.ibm.com | .......you may get it.
----------------------------------------------------------------------
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-22 0:06 ` Chandra Seetharaman
@ 2006-09-22 0:13 ` Paul Menage
2006-09-22 0:55 ` Chandra Seetharaman
2006-09-22 0:24 ` Paul Jackson
1 sibling, 1 reply; 125+ messages in thread
From: Paul Menage @ 2006-09-22 0:13 UTC (permalink / raw)
To: sekharan
Cc: npiggin, ckrm-tech, linux-kernel, Paul Jackson, rohitseth, devel,
clameter
On 9/21/06, Chandra Seetharaman <sekharan@us.ibm.com> wrote:
> Think about what will be available to customers through a distro.
>
> There are two (competing) memory controllers in the kernel, but a distro
> can turn only one ON. Which in turn means
Why's that? I don't see why cpuset memory nodemasks can't coexist
with, say, the RG memory controller. They're attempting to solve
different problems, and I can see situations where you might want to
use both at once.
>
> So, IMHO, it is better to sort out the differences before we get things
> into the mainline kernel.
Agreed, if we can come up with a definition of, e.g., a memory controller
that everyone agrees is suitable for their needs. You're assuming
that's so a priori; I'm not yet convinced.
And I'm not trying to get another memory controller into the kernel,
I'm just trying to get a standard process aggregation into the kernel
(or rather, take the one that's already in the kernel and make it
possible to hook other controller frameworks into it), so that the
various memory controllers can become less intrusive patches in their
own right.
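To make that concrete, here is a rough sketch of the sort of hook layer
I have in mind (all names and fields below are illustrative only, not
from any posted patch):

/* Illustrative sketch only -- not actual cpuset or RG code. */
struct container;	/* one instance per task grouping ("bucket") */
struct task_struct;

struct container_subsys {
	const char *name;
	/* lifecycle callbacks for a grouping */
	int  (*create)(struct container *cont);
	void (*destroy)(struct container *cont);
	/* called when a task is moved into a grouping */
	void (*attach)(struct container *cont, struct task_struct *tsk);
};

/* Each controller (cpusets, RG, UBC, ...) registers itself once: */
int container_register_subsys(struct container_subsys *ss);

A controller then only implements its callbacks and hangs its per-group
state off the container; the grouping and task-movement plumbing would
be shared.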
Paul
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-22 0:13 ` Paul Menage
@ 2006-09-22 0:55 ` Chandra Seetharaman
0 siblings, 0 replies; 125+ messages in thread
From: Chandra Seetharaman @ 2006-09-22 0:55 UTC (permalink / raw)
To: Paul Menage
Cc: npiggin, ckrm-tech, linux-kernel, Paul Jackson, rohitseth, devel,
clameter
On Thu, 2006-09-21 at 17:13 -0700, Paul Menage wrote:
> On 9/21/06, Chandra Seetharaman <sekharan@us.ibm.com> wrote:
> > Think about what will be available to customers through a distro.
> >
> > There are two (competing) memory controllers in the kernel, but a distro
> > can turn only one ON. Which in turn means
>
> Why's that? I don't see why cpuset memory nodemasks can't coexist
> with, say, the RG memory controller. They're attempting to solve
> different problems, and I can see situations where you might want to
> use both at once.
Yes, they are two different solutions and I agree that there is no
competition.
Where I see the competition is w.r.t. memory controllers from different
resource management solutions (RG, UBC, Rohit's containers, etc.). That
is what I was referring to. Sorry for the confusion.
>
> >
> > So, IMHO, it is better to sort out the differences before we get things
> > into the mainline kernel.
>
> Agreed, if we can come up with a definition of, e.g., a memory controller
> that everyone agrees is suitable for their needs. You're assuming
> that's so a priori; I'm not yet convinced.
>
> And I'm not trying to get another memory controller into the kernel,
> I'm just trying to get a standard process aggregation into the kernel
> (or rather, take the one that's already in the kernel and make it
> possible to hook other controller frameworks into it), so that the
> various memory controllers can become less intrusive patches in their
> own right.
I wasn't talking about the competition issue in this context.
Let me clearly state it for the record: I support your effort in
providing an independent process aggregation :)
>
> Paul
>
--
----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- sekharan@us.ibm.com | .......you may get it.
----------------------------------------------------------------------
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-22 0:06 ` Chandra Seetharaman
2006-09-22 0:13 ` Paul Menage
@ 2006-09-22 0:24 ` Paul Jackson
2006-09-22 0:57 ` Chandra Seetharaman
1 sibling, 1 reply; 125+ messages in thread
From: Paul Jackson @ 2006-09-22 0:24 UTC (permalink / raw)
To: sekharan
Cc: menage, npiggin, ckrm-tech, linux-kernel, rohitseth, devel,
clameter
Chandra wrote:
> There are two (competing) memory controllers in the kernel, but a distro
> can turn only one ON.
Huh - time for me to play the dummy again ...
My (fog shrouded) vision of the future has:
1) mempolicy - provides fine-grained memory placement for a task on itself
2) cpuset - provides system-wide cpu and memory placement for unrelated tasks
3) some form of resource groups - measures and limits the proportion of various
resources used, including cpu cycles, memory pages and network bandwidth,
by collections of tasks.
Both (2) and (3) need to group tasks in flexible ways distinct from the
existing task groupings supported by the kernel.
I thought that Paul M suggested (2) and (3) use common underlying
grouping or 'bucket' technology - the infrastructure that separates
tasks into buckets and can be used to associate various resource
metrics and limits with each bucket.
I can't quite figure out which of these you have in mind above:
* a conflict between two competing memory controllers for (3),
* or a conflict between cpusets and one memory controller for (3).
And either way, I don't see what that has to do with the underlying
bucket technology - how we group tasks generically.
Guess I am missing something ...
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-22 0:24 ` Paul Jackson
@ 2006-09-22 0:57 ` Chandra Seetharaman
2006-09-22 1:11 ` Paul Jackson
0 siblings, 1 reply; 125+ messages in thread
From: Chandra Seetharaman @ 2006-09-22 0:57 UTC (permalink / raw)
To: Paul Jackson
Cc: npiggin, ckrm-tech, linux-kernel, rohitseth, menage, devel,
clameter
On Thu, 2006-09-21 at 17:24 -0700, Paul Jackson wrote:
> Chandra wrote:
> > There are two (competing) memory controllers in the kernel, but a distro
> > can turn only one ON.
>
> Huh - time for me to play the dummy again ...
>
> My (fog shrouded) vision of the future has:
> 1) mempolicy - provides fine-grained memory placement for a task on itself
> 2) cpuset - provides system-wide cpu and memory placement for unrelated tasks
> 3) some form of resource groups - measures and limits the proportion of various
> resources used, including cpu cycles, memory pages and network bandwidth,
> by collections of tasks.
>
> Both (2) and (3) need to group tasks in flexible ways distinct from the
> existing task groupings supported by the kernel.
>
> I thought that Paul M suggested (2) and (3) use common underlying
> grouping or 'bucket' technology - the infrastructure that separates
> tasks into buckets and can be used to associate various resource
> metrics and limits with each bucket.
>
> I can't quite figure out which of these you have in mind above:
> * a conflict between two competing memory controllers for (3),
Yes.
> * or a conflict between cpusets and one memory controller for (3).
No.
>
> And either way, I don't see what that has to do with the underlying
> bucket technology - how we group tasks generically.
True. I clarified it in the reply to Paul M.
>
> Guess I am missing something ...
>
--
----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- sekharan@us.ibm.com | .......you may get it.
----------------------------------------------------------------------
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-21 20:10 ` Paul Menage
2006-09-21 21:44 ` Chandra Seetharaman
@ 2006-09-21 21:59 ` Paul Jackson
2006-09-21 22:07 ` Paul Menage
1 sibling, 1 reply; 125+ messages in thread
From: Paul Jackson @ 2006-09-21 21:59 UTC (permalink / raw)
To: Paul Menage
Cc: sekharan, npiggin, ckrm-tech, linux-kernel, rohitseth, devel,
clameter
Paul wrote:
> But, there's no reason that the OpenVZ resource control mechanisms
> couldn't be hooked into a generic process container mechanism along
> with cpusets and RG.
Can the generic container avoid performance bottlenecks due to locks
or other hot cache lines on the main code paths for fork, exit, page
allocation and task scheduling?
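For the accounting part at least, there is a standard trick - sketched
below with purely illustrative names - of per-cpu counters folded into
the shared total in batches, so the hot path touches no shared cache
line:

/* Sketch: per-cpu charge counters, folded into the shared total
 * only when a local batch threshold is crossed. */
#define CHARGE_BATCH	32

struct container_counter {
	atomic_long_t total;	/* shared; written rarely */
	long *pcpu;		/* from alloc_percpu(long); hot path */
};

static void container_charge(struct container_counter *c, long n)
{
	long *p = get_cpu_ptr(c->pcpu);

	*p += n;
	if (*p > CHARGE_BATCH || *p < -CHARGE_BATCH) {
		atomic_long_add(*p, &c->total);
		*p = 0;
	}
	put_cpu_ptr(c->pcpu);
}

Whether the limit checks themselves can be made that cheap is the
harder question.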
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-21 21:59 ` Paul Jackson
@ 2006-09-21 22:07 ` Paul Menage
2006-09-21 22:48 ` Paul Jackson
0 siblings, 1 reply; 125+ messages in thread
From: Paul Menage @ 2006-09-21 22:07 UTC (permalink / raw)
To: Paul Jackson
Cc: sekharan, npiggin, ckrm-tech, linux-kernel, rohitseth, devel,
clameter
On 9/21/06, Paul Jackson <pj@sgi.com> wrote:
>
> Can the generic container avoid performance bottlenecks due to locks
> or other hot cache lines on the main code paths for fork, exit, page
> allocation and task scheduling?
Page allocation and task scheduling are resource controller issues,
not generic process container issues. The generic process containers
would have essentially the same overheads for fork/exit that cpusets
have currently.
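For reference, the entire fork-path cost in cpusets today is
essentially one atomic increment on the inherited group; a simplified
sketch (field names illustrative):

/* Simplified sketch of the fork-path cost, modeled on what
 * cpusets do today. */
void container_fork(struct task_struct *child)
{
	child->container = current->container;	/* inherit parent's group */
	atomic_inc(&child->container->count);	/* one atomic op, no locks */
}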
Paul
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-21 22:07 ` Paul Menage
@ 2006-09-21 22:48 ` Paul Jackson
0 siblings, 0 replies; 125+ messages in thread
From: Paul Jackson @ 2006-09-21 22:48 UTC (permalink / raw)
To: Paul Menage
Cc: sekharan, npiggin, ckrm-tech, linux-kernel, rohitseth, devel,
clameter
Paul M wrote:
> Page allocation and task scheduling are resource controller issues,
> not generic process container issues.
But when a process is moved to a different container, its page
allocation and task scheduling constraints and metrics move too.
One of the essential differences, for example, between the two memory
constraint mechanisms we have now, mempolicy.c and cpuset.c, is that
mempolicy only affects the current task, so has an easier time of
the locking and its hooks in the page allocation code path, whereas
cpusets allows any task to change any other tasks memory constraints.
This made the cpuset hooks in the page allocation code path more
difficult -- and as you have recently shown, we aren't done working
that code path yet ;).
This is likely true in general for resource controllers. One of
their more challenging design aspects is the hooks they require in
the code paths that handle the various controlled resources.
One has to use these hooks to access the container on these fairly
hot code paths. And since the container can be changing in parallel,
it can be challenging to handle the necessary locking without forcing
a system-wide lock there.
Doable, I presume. But challenging.
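One plausible pattern - strictly a sketch, and it assumes the task's
container pointer is only ever updated with rcu_assign_pointer() when
a task is moved - keeps the hot-path read side lock-free:

/* Sketch: lock-free read of a task's container on a hot path.
 * Assumes containers are freed only after an RCU grace period. */
static inline int container_may_charge_page(struct task_struct *tsk)
{
	struct container *cont;
	int ok;

	rcu_read_lock();
	cont = rcu_dereference(tsk->container);
	ok = atomic_long_read(&cont->pages_used) < cont->page_limit;
	rcu_read_unlock();
	return ok;
}

Even then, keeping the charge/uncharge pairs consistent across a
concurrent move is the part that takes real care.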
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.925.600.0401
^ permalink raw reply [flat|nested] 125+ messages in thread
* Re: [ckrm-tech] [patch00/05]: Containers(V2)- Introduction
2006-09-20 16:25 ` Christoph Lameter
` (3 preceding siblings ...)
2006-09-20 18:34 ` Chandra Seetharaman
@ 2006-09-20 19:09 ` Chandra Seetharaman
4 siblings, 0 replies; 125+ messages in thread
From: Chandra Seetharaman @ 2006-09-20 19:09 UTC (permalink / raw)
To: Christoph Lameter; +Cc: Rohit Seth, npiggin, pj, linux-kernel, devel, CKRM-Tech
On Wed, 2006-09-20 at 09:25 -0700, Christoph Lameter wrote:
For some reason the email I sent about 30 minutes back didn't make it...
here is a resend.
> On Tue, 19 Sep 2006, Rohit Seth wrote:
>
> > For example, a user can run a batch job like backup inside containers.
> > This job if run unconstrained could step over most of the memory present
> > in system thus impacting other workloads running on the system at that
> > time. But when the same job is run inside containers then the backup
> > job is run within container limits.
>
> I just saw this for the first time since linux-mm was not cced. We have
> discussed a similar mechanism on linux-mm.
>
> We already have such a functionality in the kernel its called a cpuset. A
Christoph,
There have been multiple discussions in the past (as recently as Aug 18,
2006) where we (Paul and the CKRM/RG folks) concluded that cpusets and
resource management are orthogonal features.
cpusets provide "resource isolation", and what we, the resource
management guys, want is work-conserving resource control.
cpusets partition resources, so the resources assigned to one cpuset
are not available to other cpusets, which is not good for "resource
management".
Chandra
PS:
Aug 18 link: http://marc.theaimsgroup.com/?l=linux-kernel&m=115593114408336&w=2
Feb 2005 thread: http://marc.theaimsgroup.com/?l=ckrm-tech&m=110790400330617&w=2
> container could be created simply by creating a fake node that then
> allows constraining applications to this node. We already track the
> types of pages per node. The statistics you want are already existing.
> See /proc/zoneinfo and /sys/devices/system/node/node*/*.
>
> > We use the term container to indicate a structure against which we track
> > and charge utilization of system resources like memory, tasks etc for a
> > workload. Containers will allow system admins to customize the
> > underlying platform for different applications based on their
> > performance and HW resource utilization needs. Containers contain
> > enough infrastructure to allow optimal resource utilization without
> > bogging down rest of the kernel. A system admin should be able to
> > create, manage and free containers easily.
>
> Right thats what cpusets do and it has been working fine for years. Maybe
> Paul can help you if you find anything missing in the existing means to
> control resources.
>
--
----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- sekharan@us.ibm.com | .......you may get it.
----------------------------------------------------------------------
^ permalink raw reply [flat|nested] 125+ messages in thread