public inbox for linux-kernel@vger.kernel.org
* [RFC] Resource Management - Infrastructure choices
@ 2006-10-30 10:33 Srivatsa Vaddagiri
  2006-10-30 10:34 ` RFC: Memory Controller Balbir Singh
                   ` (4 more replies)
  0 siblings, 5 replies; 135+ messages in thread
From: Srivatsa Vaddagiri @ 2006-10-30 10:33 UTC (permalink / raw)
  To: dev, sekharan, menage
  Cc: pj, akpm, ckrm-tech, rohitseth, balbir, dipankar, matthltc,
	haveblue, linux-kernel

Over the last couple of months, we have seen a number of proposals for
resource management infrastructure/controllers and also good discussions
surrounding those proposals. These discussions have resulted in a few
consensus points and a few other points that are still being debated.

This RFC is an attempt to:

	o summarize various proposals to date for infrastructure

	o summarize consensus/debated points for infrastructure

	o (more importantly) get various stakeholders to agree on what is
	  a good compromise for infrastructure going forward

A couple of questions that I am trying to address in this RFC:

	- Do we wait till controllers are worked out before merging
	  any infrastructure?

		IMHO, it's good if we can merge some basic infrastructure now
		and incrementally enhance it, adding controllers based on it.
		This perspective leads to the second question below ..

	- Paul Menage's patches present a rework of existing code, which makes 
	  it simpler to get it in. Does it meet container (Openvz/Linux
	  VServer) and resource management requirements?

		Paul has ported over the CKRM code on top of his patches. So I 
		am optimistic that it meets resource management requirements in 
		general.

	  	One shortcoming I have seen in it is that it lacks an
		efficient method to retrieve the tasks associated with a
		group. A few controller implementations may need this if
		they have to support, say, a change of resource limits.
		This, however, could be addressed quite easily (by a
		linked list hanging off each container structure).

Resource Management - Goals
---------------------------

Develop mechanisms for isolating use of shared resources like cpu, memory 
between various applications. This includes:

	- mechanism to group tasks by some attribute (e.g. containers,
	  CKRM/RG classes, cpusets)

	- mechanism to monitor and control usage of a variety of resources by 
	  such groups of tasks

Resources to be managed:

	- Memory, CPU and disk I/O bandwidth (of high interest perhaps)
	- network bandwidth, number of tasks/file-descriptors/sockets etc.


Proposals to date for infrastructure
------------------------------------

	- CKRM/RG
	- UBC
	- Container implementation (by Paul Menage) based on a
	  generalization of cpusets.


A. Class-based Kernel Resource Management/Resource Groups

	Framework to monitor/control use of various resources by a group of 
	tasks as per specified guarantee/limits.

	Provides a config-fs based interface to:

		- create/delete task-groups
		- allow a task to change its (or some other task's) association
		  from one group to another (provided it has the right
		  privileges). New children of the affected task inherit the
		  same group association.
		- list tasks present in a group (A group can exist without any 
		  tasks associated with it)
		- specify group's min/max use of various resources. A special 
		  value "DONT_CARE" specifies that the group doesn't care for 
		  how much resource it gets.
		- obtain resource usage statistics
		- Supports a hierarchy depth of 1 (??)

	In addition to this user-interface, it provides a framework for 
	controllers to:

		- register/deregister themselves
		- be notified of changes in resource allocation for a group
		- be notified of task movements between groups
		- be notified of creation/deletion of groups
		- know which group a task belongs to

B. UBC

	Framework to account and limit usage of various resources by a 
	container (group of tasks).

	Provides a system call based interface to:

		- set a task's beancounter id. If the id does not exist, a new 
		  beancounter object is created
		- change a task's association from one beancounter to another
		- return beancounter id to which the calling task belongs
		- set limits of consumption of a particular resource by a 
		  beancounter
		- return statistics information for a given beancounter and 
		  resource.


	Provides a framework for controllers to:

		- register various resources
		- lookup beancounter object given a particular id
		- charge/uncharge usage of some resource to a beancounter by
		  some amount
			- also know if the resulting usage is above the allowed
			  soft/hard limit
		- change a task's accounting beancounter (useful in, say,
		  interrupt handling)
		- know when the resource limits change for a beancounter

C. Paul Menage's container patches

	Provides a generic hierarchical process-grouping mechanism based on
	cpusets, which can be used for resource management purposes.

	Provides a filesystem-based interface to:

		- create/destroy containers
		- change a task's association from one container to another
		- retrieve all the tasks associated with a container
		- know which container a task belongs to (from /proc)
		- know when the last task belonging to a container has exited


Consensus/Debated Points
------------------------

Consensus:

	- Provide resource control over a group of tasks 
	- Support movement of a task from one resource group to another
	- Don't support hierarchy for now
	- Support limits (soft and/or hard depending on the resource
	  type) in controllers. The guarantee feature could be met
	  indirectly through limits.

Debated:
	- syscall vs configfs interface
	- Interaction of resource controllers, containers and cpusets
		- Should we support, for instance, creation of resource
		  groups/containers under a cpuset?
	- Should we have different groupings for different resources?
	- Should we support moving all threads of a process from one
	  group to another atomically?

-- 
Regards,
vatsa

^ permalink raw reply	[flat|nested] 135+ messages in thread

* RFC: Memory Controller
  2006-10-30 10:33 [RFC] Resource Management - Infrastructure choices Srivatsa Vaddagiri
@ 2006-10-30 10:34 ` Balbir Singh
  2006-10-30 11:04   ` Paul Menage
  2006-10-30 15:58   ` Pavel Emelianov
  2006-10-30 10:43 ` [RFC] Resource Management - Infrastructure choices Paul Jackson
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 135+ messages in thread
From: Balbir Singh @ 2006-10-30 10:34 UTC (permalink / raw)
  To: vatsa
  Cc: dev, sekharan, menage, pj, akpm, ckrm-tech, rohitseth, dipankar,
	matthltc, haveblue, linux-kernel

We've seen a lot of discussion lately on the memory controller. The RFC below
provides a summary of the discussions so far. The goal of this RFC is to bring
together the thoughts so far, build consensus and agree on a path forward.

NOTE: I have tried to keep the information as accurate and current as possible.
Please point out any omissions/corrections you notice. I would like to
keep this summary document accurate, current and live.

Summary of Memory Controller Discussions and Patches

1. Accounting

The patches submitted so far agree that the following memory
should be accounted for

Reclaimable memory

(i)   Anonymous pages - pages allocated by user space and mapped into the
      user page tables, but not backed by a file.
(ii)  File mapped pages - File mapped pages map a portion of a file
(iii) Page Cache Pages - Consists of the following

    (a) Pages used during IPC via shmfs
    (b) Pages of a user-mode process that are swapped out
    (c) Pages from block read/write operations
    (d) Pages from file read/write operations

Non-reclaimable memory

This memory is not reclaimable until it is explicitly released by its
allocator. Examples of such memory include slab-allocated memory and
memory allocated by kernel components in process context. mlock()'ed
memory is also considered non-reclaimable, but it is usually handled
as a separate resource.

(i)  Slabs
(ii) Kernel pages and page_tables allocated on behalf of a task.

2. Control considerations for the memory controller

Control can be implemented using either

(i)  Limits
     A limit caps usage of the resource at the specified value. If
     resource usage crosses the limit, the group might be penalized
     or restricted. Soft limits can be exceeded by the group as long as
     the resource is still available. A hard limit is the cut-off point:
     no additional resources may be allocated beyond it.

(ii) Guarantees
     Guarantees come in two forms:

     (a) A soft guarantee is a best-effort service that tries to provide
      the group with the specified amount of the resource. In this form
      resources can be shared among groups (the unutilized resources of
      one group can be used by other groups), and a group is allowed to
      exceed its guarantee when the resource is available (i.e. no other
      group is unable to meet its guarantee). When a group is unable to
      meet its guarantee, the system tries to provide it with its
      guaranteed resources by reclaiming from other groups that have
      exceeded their guarantee. If, in spite of this best effort, the
      system is unable to meet the specified guarantee, the group's
      guarantee-failed statistic is incremented. This form of guarantee
      is best suited for non-reclaimable resources.

     (b) A hard guarantee is a more deterministic method of providing
     QoS. Resources need to be allocated in advance to ensure that the
     group is always able to meet its guarantee. This form is
     undesirable as it leads to resource under-utilization. Another
     approach is to allow sharing of resources, but to OOM-kill a group
     that has exceeded its guarantee whenever some group is unable to
     meet its own. Hard guarantees are more difficult to provide for
     non-reclaimable resources, but might be easier to provide for
     reclaimable resources.

NOTE: It has been argued that guarantees can be implemented using
limits. See http://wiki.openvz.org/Guarantees_for_resources

3. Memory Controller Alternatives

(i)   Beancounters
(ii)  Containers
(iii) Resource groups (aka CKRM)
(iv)  Fake Nodes

+----+---------+------+---------+------------+----------------+-----------+
| No |Guarantee| Limit| User I/F| Controllers| New Controllers|Statistics |
+----+---------+------+---------+------------+----------------+-----------+
| i  |  No     | Yes  | syscall | Memory     | No framework   |   Yes     |
|    |         |      |         |            | to write new   |           |
|    |         |      |         |            | controllers    |           |
+----+---------+------+---------+------------+----------------+-----------+
|ii  |  No     | Yes  | configfs| Memory,    | Plans to       |   Yes     |
|    |         |      |         | task limit.| provide a      |           |
|    |         |      |         | Plans to   | framework      |           |
|    |         |      |         | allow      | to write new   |           |
|    |         |      |         | CPU and I/O| controllers    |           |
+----+---------+------+---------+------------+----------------+-----------+
|iii |  Yes    | Yes  | configfs| CPU, task  | Provides a     |   Yes     |
|    |         |      |         | limit &    | framework to   |           |
|    |         |      |         | Memory     | add new        |           |
|    |         |      |         | controller.| controllers    |           |
|    |         |      |         | I/O contr  |                |           |
|    |         |      |         | oller for  |                |           |
|    |         |      |         | older      |                |           |
|    |         |      |         | revisions  |                |           |
+----+---------+------+---------+------------+----------------+-----------+

4. Existing accounting

a. Beancounters currently account for the following resources

(i)   kmemsize - memory obtained through alloc_pages() with __GFP_BC flag set.
(ii)  physpages - Resident set size of the tasks in the group.
      Reclaim support is provided for this resource.
(iii) lockedpages - User pages locked in memory
(iv)  slabs - slabs allocated with kmem_cache_alloc_bc are accounted and
      controlled.

Beancounters provide some support for event notification (limit/barrier hit).

b. Containers account for the following resources

(i)   mapped pages
(ii)  anonymous pages
(iii) file pages (from the page cache)
(iv)  active pages

There is some support for reclaiming pages; the code is in the early stages of
development.

c. CKRM/RG Memory Controller

(i)   Tracks active pages
(ii)  Supports reclaim of LRU pages
(iii) Shared pages are not tracked

This controller provides its own res_zone, to aid reclaim and tracking of pages.

d. Fake NUMA Nodes

This approach was suggested while discussing the memory controller

Advantages

(i)   Accounting for zones is already present
(ii)  Reclaim code can directly deal with zones

Disadvantages

(i)   The approach leads to hard partitioning of memory.
(ii)  It is complex to resize a node. Resizing is required to allow
      limits to be changed for resource management.
(iii) Addition/deletion of a resource group would require memory-hotplug
      support to add/delete a node. On deletion of a node, its memory is
      not utilized until a new node of the same or lesser size is created.
      Adding a node requires reserving memory for it upfront.

5. Open issues

(i)    Can we allow threads belonging to the same process to belong
       to two different resource groups? Does this mean we need to do
       per-thread VM accounting now?
(ii)   There is an overhead associated with adding a pointer in struct page.
       Can this be reduced/avoided? One solution suggested is to use a
       mirror mem_map.
(iii)  How do we distribute the remaining resources among resource-hungry
       groups? The Resource Group implementation uses the ratio of the
       limits to decide the ratio in which they are distributed.
(iv)   How do we account for shared pages? Should it be charged to the first
       container which touches the page or should it be charged equally among
       all containers sharing the page?
(v)    Definition of RSS (see http://lkml.org/lkml/2006/10/10/130)

6. Going forward

(i)    Agree on requirements (there has been some agreement already, please
       see http://lkml.org/lkml/2006/9/6/102 and the BOF summary [7])
(ii)   Agree on minimum accounting and hooks in the kernel. It might be
       a good idea to take this up in phases
       phase 1 - account for user space memory
       phase 2 - account for kernel memory allocated on behalf of the user/task
(iii)  Infrastructure - There is a separate RFC on that.

7. References

1. http://www.openvz.org
2. http://lkml.org/lkml/2006/9/19/283 (Containers patches)
3. http://lwn.net/Articles/200073/ (Another Container Implementation)
4. http://ckrm.sf.net (Resource Groups)
5. http://lwn.net/Articles/197433/ (Resource Beancounters)
6. http://lwn.net/Articles/182369/ (CKRM Rebranded)
7. http://lkml.org/lkml/2006/7/26/237 (OLS BoF on Resource Management (NOTES))


* Re: [RFC] Resource Management - Infrastructure choices
  2006-10-30 10:33 [RFC] Resource Management - Infrastructure choices Srivatsa Vaddagiri
  2006-10-30 10:34 ` RFC: Memory Controller Balbir Singh
@ 2006-10-30 10:43 ` Paul Jackson
  2006-10-30 14:19   ` [ckrm-tech] " Pavel Emelianov
  2006-10-30 17:09   ` Srivatsa Vaddagiri
  2006-10-30 10:51 ` Paul Menage
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 135+ messages in thread
From: Paul Jackson @ 2006-10-30 10:43 UTC (permalink / raw)
  To: vatsa
  Cc: dev, sekharan, menage, akpm, ckrm-tech, rohitseth, balbir,
	dipankar, matthltc, haveblue, linux-kernel

vatsa wrote:
> C. Paul Menage's container patches
> 
> 	Provides a generic hierarchical ...
> 
> Consensus/Debated Points
> ------------------------
> 
> Consensus:
> 	...
> 	- Don't support hierarchy for now

Looks like this item can be dropped from the consensus ... ;).

I for one would recommend getting the hierarchy right from the
beginning.

Though I can appreciate that others were trying to "keep it simple"
and postpone dealing with such complications, I don't agree.

Such stuff as this deeply affects all that sits on it.  Get the
basic data shape presented by the kernel-user API right up front.
The rest will follow much more easily.

Good review of the choices - thanks.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [RFC] Resource Management - Infrastructure choices
  2006-10-30 10:33 [RFC] Resource Management - Infrastructure choices Srivatsa Vaddagiri
  2006-10-30 10:34 ` RFC: Memory Controller Balbir Singh
  2006-10-30 10:43 ` [RFC] Resource Management - Infrastructure choices Paul Jackson
@ 2006-10-30 10:51 ` Paul Menage
  2006-10-30 11:06   ` [ckrm-tech] " Paul Jackson
                     ` (2 more replies)
  2006-10-30 14:08 ` Pavel Emelianov
  2006-11-01  9:30 ` Pavel Emelianov
  4 siblings, 3 replies; 135+ messages in thread
From: Paul Menage @ 2006-10-30 10:51 UTC (permalink / raw)
  To: vatsa
  Cc: dev, sekharan, pj, akpm, ckrm-tech, rohitseth, balbir, dipankar,
	matthltc, haveblue, linux-kernel

On 10/30/06, Srivatsa Vaddagiri <vatsa@in.ibm.com> wrote:
>
>         - Paul Menage's patches present a rework of existing code, which makes
>           it simpler to get it in. Does it meet container (Openvz/Linux
>           VServer) and resource management requirements?
>
>                 Paul has ported over the CKRM code on top of his patches. So I
>                 am optimistic that it meets resource management requirements in
>                 general.
>
>                 One shortcoming I have seen in it is that it lacks an
>                 efficient method to retrieve the tasks associated with a
>                 group. A few controller implementations may need this if
>                 they have to support, say, a change of resource limits.
>                 This, however, could be addressed quite easily (by a
>                 linked list hanging off each container structure).

The cpusets code which this was based on simply locked the task list,
and traversed it to find threads in the cpuset of interest; you could
do the same thing in any other resource controller.

Not keeping a list of tasks in the container makes fork/exit more
efficient, and I assume that is the reason cpusets made that design
decision. If we really wanted to keep a list of tasks in a container
it wouldn't be hard, but it should probably be conditional on at least
one of the registered resource controllers requesting it, to avoid
unnecessary overhead when none of the controllers actually cares (in a
similar manner to the fork/exit callbacks, which only take the
container callback mutex if some container subsystem is interested in
fork/exit events).

>
>                 - register/deregister themselves
>                 - be notified of changes in resource allocation for a group
>                 - be notified of task movements between groups
>                 - be notified of creation/deletion of groups
>                 - know which group a task belongs to

Apart from the deregister, my generic containers patch provides all of
these as well.

How important is it for controllers/subsystems to be able to
deregister themselves, do you think? I could add it relatively easily,
but it seemed unnecessary in general.

>
> B. UBC
>
>         Framework to account and limit usage of various resources by a
>         container (group of tasks).
>
>         Provides a system call based interface to:
>
>                 - set a task's beancounter id. If the id does not exist, a new
>                   beancounter object is created
>                 - change a task's association from one beancounter to other
>                 - return beancounter id to which the calling task belongs
>                 - set limits of consumption of a particular resource by a
>                   beancounter
>                 - return statistics information for a given beancounter and
>                   resource.

I've not really played with it yet, but I don't see any reason why the
beancounter resource-control concept couldn't also be built over
generic containers. The user interface would be different, of course
(filesystem vs. syscall), but maybe even that could be emulated if
there was a need for backwards compatibility.

>
> Consensus:
>
>         - Provide resource control over a group of tasks
>         - Support movement of task from one resource group to another
>         - Don't support hierarchy for now

Both CKRM/RG and generic containers support a hierarchy.

>         - Support limit (soft and/or hard depending on the resource
>           type) in controllers. Guarantee feature could be indirectly
>           met thr limits.

That's an issue for resource controllers, rather than the underlying
infrastructure, I think.

>
> Debated:
>         - syscall vs configfs interface
>         - Interaction of resource controllers, containers and cpusets
>                 - Should we support, for instance, creation of resource
>                   groups/containers under a cpuset?
>         - Should we have different groupings for different resources?

I've played around with the idea of making the hierarchies of
resource-controller entities distinct from the hierarchy of process
containers.

The simplest form of this would be that at each level in the hierarchy
the user could indicate, for each resource controller, whether child
containers would inherit the same resource entity for that controller
or have a new one created. E.g. you could determine whether, when you
create a child container, tasks in that container would be in the same
cpuset as the parent or in a fresh cpuset; this would be independent
of whether they were in the same disk I/O scheduling domain or in a
fresh child domain, etc. This would be an extension of the "X_enabled"
files that appear in the top-level container directory for each
container subsystem in my current patch.

At a more complex level, the resource controller entity tree for each
resource controller could be independent, and the mapping from
containers to resource controller nodes could be arbitrary and
different for each controller - so every process would belong to
exactly one container, but the user could pick e.g. any cpuset and any
disk I/O scheduling domain for each container.

Both of these seem a little complex for a first cut of the code, though.

Paul


* Re: RFC: Memory Controller
  2006-10-30 10:34 ` RFC: Memory Controller Balbir Singh
@ 2006-10-30 11:04   ` Paul Menage
  2006-10-30 13:27     ` [ckrm-tech] " Balbir Singh
  2006-10-30 15:58   ` Pavel Emelianov
  1 sibling, 1 reply; 135+ messages in thread
From: Paul Menage @ 2006-10-30 11:04 UTC (permalink / raw)
  To: balbir
  Cc: vatsa, dev, sekharan, pj, akpm, ckrm-tech, rohitseth, dipankar,
	matthltc, haveblue, linux-kernel

On 10/30/06, Balbir Singh <balbir@in.ibm.com> wrote:
> +----+---------+------+---------+------------+----------------+-----------+
> |ii  |  No     | Yes  | configfs| Memory,    | Plans to       |   Yes     |
> |    |         |      |         | task limit.| provide a      |           |
> |    |         |      |         | Plans to   | framework      |           |
> |    |         |      |         | allow      | to write new   |           |
> |    |         |      |         | CPU and I/O| controllers    |           |

I have a port of Rohit's memory controller to run over my generic containers.

>
> d. Fake NUMA Nodes
>
> This approach was suggested while discussing the memory controller
>
> Advantages
>
> (i)   Accounting for zones is already present
> (ii)  Reclaim code can directly deal with zones
>
> Disadvantages
>
> (i)   The approach leads to hard partitioning of memory.
> (ii)  It is complex to resize a node. Resizing is required to allow
>       limits to be changed for resource management.
> (iii) Addition/deletion of a resource group would require memory-hotplug
>       support to add/delete a node. On deletion of a node, its memory is
>       not utilized until a new node of the same or lesser size is created.
>       Adding a node requires reserving memory for it upfront.

A much simpler way of adding/deleting/resizing resource groups is to
partition the system at boot time into a large number of fake NUMA
nodes (say one node per 64MB in the system) and then use cpusets to
assign the appropriate number of nodes to each group. We're finding a
few inefficiencies in the current code when using such a large number
of small nodes (e.g. slab alien node caches), but we're confident that
we can iron those out.

> (iv)   How do we account for shared pages? Should it be charged to the first
>        container which touches the page or should it be charged equally among
>        all containers sharing the page?

A third option is to allow inodes to be associated with containers in
their own right, and charge all pages for those inodes to the
associated container. So if several different containers are sharing a
large data file, you can put that file in its own container, and you
then have an exact count of how many pages are in use in that shared
file.

This is cheaper than having to keep track of multiple users of a page,
and is also useful when you're trying to do scheduling, to decide who
to evict. Suppose you have two jobs each allocating 100M of anonymous
memory and each accessing all of a 1G shared file, and you need to
free up 500M of memory in order to run a higher-priority job.

If you charge the first user, then it will appear that the first job
is using 1.1G of memory and the second is using 100M of memory. So you
might evict the first job, thinking it would free up 1.1G of memory -
but it would actually only free up 100M of memory, since the shared
pages would still be in use by the second job.

If you share the charge between both users, then it would appear that
each job is using 600M of memory - but it's still the case that
evicting either one would only free up 100M of memory.

If you can see that the shared file that they're both using is
accounting for 1G of the memory total, and that they're each using
100M of anon memory, then it's easier to see that you'd need to evict
*both* jobs in order to free up 500M of memory.

Paul


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-30 10:51 ` Paul Menage
@ 2006-10-30 11:06   ` Paul Jackson
  2006-10-30 12:07     ` Paul Menage
  2006-10-30 11:15   ` Paul Jackson
  2006-11-01 17:33   ` Srivatsa Vaddagiri
  2 siblings, 1 reply; 135+ messages in thread
From: Paul Jackson @ 2006-10-30 11:06 UTC (permalink / raw)
  To: Paul Menage
  Cc: vatsa, dev, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	matthltc, dipankar, rohitseth

Paul M wrote:
> The cpusets code which this was based on simply locked the task list,
> and traversed it to find threads in the cpuset of interest; you could
> do the same thing in any other resource controller.

I get away with this in the cpuset code because:
 1) I have the cpuset pointer directly in 'task_struct', so don't
    have to chase down anything, for each task, while scanning the
    task list.  I just have to ask, for each task, if its cpuset
    pointer points to the cpuset of interest.
 2) I don't care if I get an inconsistent answer, so I don't have
    to lock each task, nor do I even lock out the rest of the cpuset
    code.  All I know, at the end of the scan, is that each task that
    I claim is attached to the cpuset in question was attached to it at
    some point during my scan, not necessarily all at the same time.
 3) It's not a flaming disaster if the kmalloc() of enough memory
    to hold all the pids I collect in a single array fails.  That
    just means that some hapless user's open-for-read of a cpuset
    'tasks' file failed with -ENOMEM.  Oh well ...

If someone is actually trying to manage system resources accurately,
they probably can't get away with being as fast and loose as this.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-30 10:51 ` Paul Menage
  2006-10-30 11:06   ` [ckrm-tech] " Paul Jackson
@ 2006-10-30 11:15   ` Paul Jackson
  2006-10-30 12:04     ` Paul Menage
  2006-11-01 17:33   ` Srivatsa Vaddagiri
  2 siblings, 1 reply; 135+ messages in thread
From: Paul Jackson @ 2006-10-30 11:15 UTC (permalink / raw)
  To: Paul Menage
  Cc: vatsa, dev, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	matthltc, dipankar, rohitseth

Paul M wrote:
> I've played around with the idea where the hierarchies of resource
> controller entities was distinct from the hierarchy of process
> containers.

It would be nice, me thinks, if the underlying container technology
didn't really care whether we had one hierarchy or seven.  Let the
users (such as CKRM/RG, cpusets, ...) of this container infrastructure
determine when and where they need separate hierarchies, and when and
where they are better off sharing the same hierarchy.

The question of one or more separate hierarchies is one of those long
term questions that should be driven by the basic semantics of what we
are trying to model, not by transient infrastructure expediencies.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-30 11:15   ` Paul Jackson
@ 2006-10-30 12:04     ` Paul Menage
  2006-10-30 12:27       ` Paul Jackson
  0 siblings, 1 reply; 135+ messages in thread
From: Paul Menage @ 2006-10-30 12:04 UTC (permalink / raw)
  To: Paul Jackson
  Cc: vatsa, dev, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	matthltc, dipankar, rohitseth

On 10/30/06, Paul Jackson <pj@sgi.com> wrote:
> It would be nice, me thinks, if the underlying container technology
> didn't really care whether we had one hierarchy or seven.  Let the
> users (such as CKRM/RG, cpusets, ...)

I was thinking that it would be even better if the actual (human)
users could determine this; have the container infrastructure make it
practical to have flexible hierarchy mappings, and have the resource
controller subsystems not have to care about how they were being used.

Paul


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-30 11:06   ` [ckrm-tech] " Paul Jackson
@ 2006-10-30 12:07     ` Paul Menage
  2006-10-30 12:28       ` Paul Jackson
  0 siblings, 1 reply; 135+ messages in thread
From: Paul Menage @ 2006-10-30 12:07 UTC (permalink / raw)
  To: Paul Jackson
  Cc: vatsa, dev, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	matthltc, dipankar, rohitseth

On 10/30/06, Paul Jackson <pj@sgi.com> wrote:
> I get away with this in the cpuset code because:
>  1) I have the cpuset pointer directly in 'task_struct', so don't
>     have to chase down anything, for each task, while scanning the
>     task list.  I just have to ask, for each task, if its cpuset
>     pointer points to the cpuset of interest.

That's the same when it's transferred to containers - each task_struct
now has a container pointer, and you can just see whether the
container pointer matches the container that you're interested in.

>  2) I don't care if I get an inconsistent answer, so I don't have
>     to lock each task, nor do I even lockout the rest of the cpuset
>     code.  All I know, at the end of the scan, is that each task that
>     I claim is attached to the cpuset in question was attached to it at
>     some point during my scan, not necessarily all at the same time.

Well, anything that can be accomplished from within the tasklist_lock
can get a consistent result without any additional lists or
synchronization - it seems that it would be good to come up with a
real-world example of something that *can't* make do with this before
adding extra book-keeping.

Paul


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-30 12:04     ` Paul Menage
@ 2006-10-30 12:27       ` Paul Jackson
  2006-10-30 17:53         ` Paul Menage
  0 siblings, 1 reply; 135+ messages in thread
From: Paul Jackson @ 2006-10-30 12:27 UTC (permalink / raw)
  To: Paul Menage
  Cc: dev, vatsa, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	matthltc, dipankar, rohitseth

Paul M wrote:
> I was thinking that it would be even better if the actual (human)
> users could determine this; have the container infrastructure make it

You mean let, say, the system admin of a system determine
whether CKRM/RG and cpusets have one shared, or two
separate, hierarchies?

Wow - I think my brain just exploded.

Oh well ... I'll have to leave it an open issue for the moment;
I'm focusing on something else right now.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-30 12:07     ` Paul Menage
@ 2006-10-30 12:28       ` Paul Jackson
  0 siblings, 0 replies; 135+ messages in thread
From: Paul Jackson @ 2006-10-30 12:28 UTC (permalink / raw)
  To: Paul Menage
  Cc: vatsa, dev, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	matthltc, dipankar, rohitseth

Paul M wrote:
> it seems that it would be good to come up with a
> real-world example of something that *can't* make do with this before
> adding extra book-keeping.

that seems reasonable enough ...

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [ckrm-tech] RFC: Memory Controller
  2006-10-30 11:04   ` Paul Menage
@ 2006-10-30 13:27     ` Balbir Singh
  2006-10-30 18:14       ` Paul Menage
  0 siblings, 1 reply; 135+ messages in thread
From: Balbir Singh @ 2006-10-30 13:27 UTC (permalink / raw)
  To: Paul Menage
  Cc: dev, vatsa, sekharan, ckrm-tech, haveblue, linux-kernel, pj,
	matthltc, dipankar, rohitseth

Paul Menage wrote:
> On 10/30/06, Balbir Singh <balbir@in.ibm.com> wrote:
>> +----+---------+------+---------+------------+----------------+-----------+
>> |ii  |  No     | Yes  | configfs| Memory,    | Plans to       |   Yes     |
>> |    |         |      |         | task limit.| provide a      |           |
>> |    |         |      |         | Plans to   | framework      |           |
>> |    |         |      |         | allow      | to write new   |           |
>> |    |         |      |         | CPU and I/O| controllers    |           |
> 
> I have a port of Rohit's memory controller to run over my generic containers.

Cool!

> 
>> d. Fake NUMA Nodes
>>
>> This approach was suggested while discussing the memory controller
>>
>> Advantages
>>
>> (i)   Accounting for zones is already present
>> (ii)  Reclaim code can directly deal with zones
>>
>> Disadvantages
>>
>> (i)   The approach leads to hard partitioning of memory.
>> (ii)  It's complex to resize a node. Resizing is required to allow
>>       change of limits for resource management.
>> (iii) Addition/deletion of a resource group would require memory hotplug
>>       support to add/delete a node. On deletion of a node, its memory is
>>       not utilized until a new node of the same or lesser size is created.
>>       Addition of a node requires reserving memory for it upfront.
> 
> A much simpler way of adding/deleting/resizing resource groups is to
> partition the system at boot time into a large number of fake numa
> nodes (say one node per 64MB in the system) and then use cpusets to
> assign the appropriate number of nodes each group. We're finding a few
> ineffiencies in the current code when using such a large number of
> small nodes (e.g. slab alien node caches), but we're confident that we
> can iron those out.
> 

You'll also end up with per-zone page cache pools, and per-zone lists of
active/inactive pages (which will split up the global LRU list).
What about the hard partitioning? If a container/cpuset is not using its full
64MB of a fake node, can some other container use the memory? Also, won't you
end up with a big zonelist?

>> (iv)   How do we account for shared pages? Should it be charged to the first
>>        container which touches the page or should it be charged equally among
>>        all containers sharing the page?
> 
> A third option is to allow inodes to be associated with containers in
> their own right, and charge all pages for those inodes to the
> associated container. So if several different containers are sharing a
> large data file, you can put that file in its own container, and you
> then have an exact count of how many pages are in use in that shared
> file.
> 
> This is cheaper than having to keep track of multiple users of a page,
> and is also useful when you're trying to do scheduling, to decide who
> to evict. Suppose you have two jobs each allocating 100M of anonymous
> memory and each accessing all of a 1G shared file, and you need to
> free up 500M of memory in order to run a higher-priority job.
> 
> If you charge the first user, then it will appear that the first job
> is using 1.1G of memory and the second is using 100M of memory. So you
> might evict the first job, thinking it would free up 1.1G of memory -
> but it would actually only free up 100M of memory, since the shared
> pages would still be in use by the second job.
> 
> If you share the charge between both users, then it would appear that
> each job is using 600M of memory - but it's still the case that
> evicting either one would only free up 100M of memory.
> 
> If you can see that the shared file that they're both using is
> accounting for 1G of the memory total, and that they're each using
> 100M of anon memory, then it's easier to see that you'd need to evict
> *both* jobs in order to free up 500M of memory.

Consider the other side of the story. Let's say we have a shared library used
by quite a few containers. We limit the usage of the inode containing
the shared library to 50M. Tasks A and B use some part of the library
and cause the container "C" to reach the limit. Container C is charged
for all usage of the shared library. Now no other task, irrespective of which
container it belongs to, can touch any new pages of the shared library.

We might also be interested in limiting the page cache usage of a container.
In such cases, this solution might not work out to be the best.

What you are suggesting is to virtually group the inodes by container rather
than task. It might make sense in some cases, but not all.

We could consider implementing the controllers in phases

1. RSS control (anon + mapped pages)
2. Page Cache control
3. Kernel accounting and control




-- 
	Cheers,
	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-30 10:33 [RFC] Resource Management - Infrastructure choices Srivatsa Vaddagiri
                   ` (2 preceding siblings ...)
  2006-10-30 10:51 ` Paul Menage
@ 2006-10-30 14:08 ` Pavel Emelianov
  2006-10-30 14:23   ` Paul Jackson
                     ` (2 more replies)
  2006-11-01  9:30 ` Pavel Emelianov
  4 siblings, 3 replies; 135+ messages in thread
From: Pavel Emelianov @ 2006-10-30 14:08 UTC (permalink / raw)
  To: vatsa
  Cc: dev, sekharan, menage, ckrm-tech, balbir, haveblue, linux-kernel,
	pj, matthltc, dipankar, rohitseth, devel

[snip]

> 
> Consensus/Debated Points
> ------------------------
> 
> Consensus:
> 
> 	- Provide resource control over a group of tasks 
> 	- Support movement of task from one resource group to another
> 	- Don't support hierarchy for now
> 	- Support limits (soft and/or hard depending on the resource
> 	  type) in controllers. The guarantee feature could be indirectly
> 	  met through limits.
> 
> Debated:
> 	- syscall vs configfs interface

1. One of the major configfs ideas is that lifetime of
   the objects is completely driven by userspace.
   Resource controller shouldn't live as long as user
   want. It "may", but not "must"! As you have seen from
   our (beancounters) patches, beancounters disappeared
   as soon as the last reference was dropped. Removing
   configfs entries on a beancounter's automatic destruction
   is possible, but it breaks the logic of configfs.

2. Having configfs as the only interface doesn't allow
   people to have the resource control facility w/o configfs.
   A resource controller must not depend on any "feature".

3. Configfs may be easily implemented later as an additional
   interface. I propose the following solution:
     - First we make an interface via any common kernel
       facility (syscall, ioctl, etc.);
     - Later we may extend this with configfs. This will
       allow one to have the configfs interface built as a module.

> 	- Interaction of resource controllers, containers and cpusets
> 		- Should we support, for instance, creation of resource
> 		  groups/containers under a cpuset?
> 	- Should we have different groupings for different resources?

This breaks the idea of group isolation.

> 	- Support movement of all threads of a process from one group
> 	  to another atomically?

This is not a critical question. The difference amounts to

-	move_task_to_container(task);
+	do_each_thread_all(g, p) {
+		if (p->mm == task->mm)
+			move_task_to_container(p);
+	} while_each_thread_all(g, p);

or similar. If we have an infrastructure for accounting and
moving one task_struct into a group, then the decision of how many
tasks to move in one syscall can be taken later, but not the other
way round.


I also add devel@openvz.org to Cc. Please keep it on your replies.


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-30 10:43 ` [RFC] Resource Management - Infrastructure choices Paul Jackson
@ 2006-10-30 14:19   ` Pavel Emelianov
  2006-10-30 14:29     ` Paul Jackson
  2006-10-30 17:09   ` Srivatsa Vaddagiri
  1 sibling, 1 reply; 135+ messages in thread
From: Pavel Emelianov @ 2006-10-30 14:19 UTC (permalink / raw)
  To: Paul Jackson
  Cc: vatsa, dev, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	matthltc, dipankar, rohitseth, menage

Paul Jackson wrote:
> vatsa wrote:
>> C. Paul Menage's container patches
>>
>> 	Provides a generic hierarchical ...
>>
>> Consensus/Debated Points
>> ------------------------
>>
>> Consensus:
>> 	...
>> 	- Don't support hierarchy for now
> 
> Looks like this item can be dropped from the consensus ... ;).

Agree.

> 
> I for one would recommend getting the hierarchy right from the
> beginning.
> 
> Though I can appreciate that others were trying to "keep it simple"
> and postpone dealing with such complications.  I don't agree.
> 
> Such stuff as this deeply affects all that sits on it.  Get the

I can share our experience with it.
Hierarchy support over beancounters was done in one patch.
This patch altered only three places - the charge/uncharge routines,
the beancounter creation/destruction code and BC's /proc entry.
The rest of the code was not modified.

My point is that a good infrastructure doesn't care whether
or not a beancounter (group controller) has a parent.

> basic data shape presented by the kernel-user API right up front.
> The rest will follow, much easier.
> 
> Good review of the choices - thanks.
> 



* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-30 14:08 ` Pavel Emelianov
@ 2006-10-30 14:23   ` Paul Jackson
  2006-10-30 14:38     ` Pavel Emelianov
  2006-10-30 18:01   ` Paul Menage
  2006-10-31 16:34   ` Srivatsa Vaddagiri
  2 siblings, 1 reply; 135+ messages in thread
From: Paul Jackson @ 2006-10-30 14:23 UTC (permalink / raw)
  To: Pavel Emelianov
  Cc: vatsa, dev, sekharan, menage, ckrm-tech, balbir, haveblue,
	linux-kernel, matthltc, dipankar, rohitseth, devel

Pavel wrote:
> 1. One of the major configfs ideas is that lifetime of
>    the objects is completely driven by userspace.
>    Resource controller shouldn't live as long as user
>    want. It "may", but not "must"!

I had trouble understanding what you are saying here.

What does the phrase "live as long as user want" mean?


> 2. Having configfs as the only interface doesn't allow
>    people to have the resource control facility w/o configfs.
>    A resource controller must not depend on any "feature".
>
> 3. Configfs may be easily implemented later as an additional
>    interface. I propose the following solution:
>      - First we make an interface via any common kernel
>        facility (syscall, ioctl, etc);
>      - Later we may extend this with configfs. This will
>        allow one to have the configfs interface built as a module.

So you would add bloat to the kernel, with two interfaces
to the same facility, because you don't want the resource
controller to depend on configfs.

I am familiar with what is wrong with kernel bloat.

Can you explain to me what is wrong with having resource
groups depend on configfs?  Is there something wrong with
configfs that would be a significant problem for some systems
needing resource groups?

It is better where possible, I would think, to reuse common
infrastructure and minimize redundancy.  If there is something
wrong with configfs that makes this a problem, perhaps we
should fix that.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-30 14:19   ` [ckrm-tech] " Pavel Emelianov
@ 2006-10-30 14:29     ` Paul Jackson
  0 siblings, 0 replies; 135+ messages in thread
From: Paul Jackson @ 2006-10-30 14:29 UTC (permalink / raw)
  To: Pavel Emelianov
  Cc: vatsa, dev, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	matthltc, dipankar, rohitseth, menage

Pavel wrote:
> My point is that a good infrastructure doesn't care whether
> or not a beancounter (group controller) has a parent.

I am far more interested in the API, including the shape
of the data model, that we present to the user across the
kernel-user boundary.

Getting one, good, stable API for the long haul is worth a lot.

Whether or not some substantial semantic change in this, such
as going from a flat to a tree shape, can be done in a single
line of kernel code, or a thousand lines, is less important.

What is the right long term kernel-user API and data model?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-30 14:23   ` Paul Jackson
@ 2006-10-30 14:38     ` Pavel Emelianov
  2006-10-30 15:18       ` Paul Jackson
  0 siblings, 1 reply; 135+ messages in thread
From: Pavel Emelianov @ 2006-10-30 14:38 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Pavel Emelianov, vatsa, dev, sekharan, menage, ckrm-tech, balbir,
	haveblue, linux-kernel, matthltc, dipankar, rohitseth, devel

Paul Jackson wrote:
> Pavel wrote:
>> 1. One of the major configfs ideas is that lifetime of
>>    the objects is completely driven by userspace.
>>    Resource controller shouldn't live as long as user
>>    want. It "may", but not "must"!
> 
> I had trouble understanding what you are saying here.
> 
> What does the phrase "live as long as user want" mean?

What if a user creates a controller (configfs directory)
and doesn't remove it at all? Should the controller stay in memory
even if nobody uses it?

> 
> 
>> 2. Having configfs as the only interface doesn't allow
>>    people to have the resource control facility w/o configfs.
>>    A resource controller must not depend on any "feature".
>>
>> 3. Configfs may be easily implemented later as an additional
>>    interface. I propose the following solution:
>>      - First we make an interface via any common kernel
>>        facility (syscall, ioctl, etc);
>>      - Later we may extend this with configfs. This will
>>        allow one to have the configfs interface built as a module.
> 
> So you would add bloat to the kernel, with two interfaces
> to the same facility, because you don't want the resource
> controller to depend on configfs.
> 
> I am familiar with what is wrong with kernel bloat.
> 
> Can you explain to me what is wrong with having resource
> groups depend on configfs?  Is there something wrong with

A resource controller has nothing in common with configfs.
That's the same as making netfilter depend on procfs.

> configfs that would be a significant problem for some systems
> needing resource groups?

Why introduce a dependency if we can avoid it?

> It is better where possible, I would think, to reuse common
> infrastructure and minimize redundancy.  If there is something
> wrong with configfs that makes this a problem, perhaps we
> should fix that.

The same can be said about a system call interface, can't it?


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-30 14:38     ` Pavel Emelianov
@ 2006-10-30 15:18       ` Paul Jackson
  2006-10-30 15:26         ` Pavel Emelianov
  0 siblings, 1 reply; 135+ messages in thread
From: Paul Jackson @ 2006-10-30 15:18 UTC (permalink / raw)
  To: Pavel Emelianov
  Cc: xemul, vatsa, dev, sekharan, menage, ckrm-tech, balbir, haveblue,
	linux-kernel, matthltc, dipankar, rohitseth, devel

Pavel wrote:
> >> 3. Configfs may be easily implemented later as an additional
> >>    interface. I propose the following solution:
> >>      ...
> >
> A resource controller has nothing in common with configfs.
> That's the same as making netfilter depend on procfs.

Well ... if you used configfs as an interface to resource
controllers, as you said was easily done, then they would
have something to do with each other, right ;)?

Choose the right data structure for the job, and then reuse
what fits for that choice.

Neither avoiding nor encouraging code reuse is the key question.

What's the best fit, long term, for the style of kernel-user
API, for this use?  That's the key question.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-30 15:18       ` Paul Jackson
@ 2006-10-30 15:26         ` Pavel Emelianov
  2006-10-31  0:26           ` Matt Helsley
  2006-10-31 16:33           ` Chris Friesen
  0 siblings, 2 replies; 135+ messages in thread
From: Pavel Emelianov @ 2006-10-30 15:26 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Pavel Emelianov, vatsa, dev, sekharan, menage, ckrm-tech, balbir,
	haveblue, linux-kernel, matthltc, dipankar, rohitseth, devel

Paul Jackson wrote:
> Pavel wrote:
>>>> 3. Configfs may be easily implemented later as an additional
>>>>    interface. I propose the following solution:
>>>>      ...
>> A resource controller has nothing in common with configfs.
>> That's the same as making netfilter depend on procfs.
> 
> Well ... if you used configfs as an interface to resource
> controllers, as you said was easily done, then they would
> have something to do with each other, right ;)?

Right. We'll create a dependency that is not needed.

> Choose the right data structure for the job, and then reuse
> what fits for that choice.
> 
> Neither avoiding nor encouraging code reuse is the key question.
> 
> What's the best fit, long term, for the style of kernel-user
> API, for this use?  That's the key question.

I agree, but you've cut some important questions away,
so I ask them again:

 > What if a user creates a controller (configfs directory)
 > and doesn't remove it at all? Should the controller stay in
 > memory even if nobody uses it?

This is important to solve now - whether or not we want to
keep "empty" beancounters in memory. If we do not, then configfs
usage is not acceptable.

 > The same can be said about a system call interface, can't it?

I haven't seen any objections against system calls yet.


* Re: [ckrm-tech] RFC: Memory Controller
  2006-10-30 10:34 ` RFC: Memory Controller Balbir Singh
  2006-10-30 11:04   ` Paul Menage
@ 2006-10-30 15:58   ` Pavel Emelianov
  2006-10-30 17:39     ` Balbir Singh
  2006-10-30 18:20     ` Paul Menage
  1 sibling, 2 replies; 135+ messages in thread
From: Pavel Emelianov @ 2006-10-30 15:58 UTC (permalink / raw)
  To: balbir
  Cc: vatsa, dev, sekharan, ckrm-tech, haveblue, linux-kernel, pj,
	matthltc, dipankar, rohitseth, menage

[snip]

> Reclaimable memory
> 
> (i)   Anonymous pages - Anonymous pages are pages allocated by the user space,
>       they are mapped into the user page tables, but not backed by a file.

I do not agree with such a classification.
When one maps a file, the kernel can remove the page from the address
space, as there is already space on disk for it. When one
maps an anonymous page, the kernel may not be able to remove the page
at all, as the system may simply be configured to be swapless.

I also remind you that the beancounter code keeps all the logic
of memory classification in one place, so changing this
would require minimal changes.

[snip]

> 
> (i)  Slabs
> (ii) Kernel pages and page_tables allocated on behalf of a task.

I'd pay more attention to kernel memory accounting and less
to user memory, as having the kmemsize resource actually protects
the system from DoS attacks. Accounting and limiting user
pages doesn't protect the system from anything.

[snip]

>      (b) Hard guarantees is a more deterministic method of providing QoS.
>      Resources need to be allocated in advance, to ensure that the group
>      is always able to meet its guarantee. This form is undesirable as

How would you allocate memory on NUMA in advance?

[snip]

> +----+---------+------+---------+------------+----------------+-----------+
> | No |Guarantee| Limit| User I/F| Controllers| New Controllers|Statistics |
> +----+---------+------+---------+------------+----------------+-----------+
> | i  |  No     | Yes  | syscall | Memory     | No framework   |   Yes     |
> |    |         |      |         |            | to write new   |           |
> |    |         |      |         |            | controllers    |           |

The latest Beancounter patches do provide a framework for
new controllers.

[snip]

> a. Beancounters currently account for the following resources
> 
> (i)   kmemsize - memory obtained through alloc_pages() with __GFP_BC flag set.
> (ii)  physpages - Resident set size of the tasks in the group.
>       Reclaim support is provided for this resource.
> (iii) lockedpages - User pages locked in memory
> (iv)  slabs - slabs allocated with kmem_cache_alloc_bc are accounted and
>       controlled.

This is also not true now. The latest beancounter code accounts for:
1. kmemsize - this includes slab and vmalloc objects and "raw" pages
   allocated directly from the buddy allocator.
2. unreclaimable memory - this accounts for the total length of
   mappings of a certain type. It is the _mappings_ that are
   accounted, since limiting mappings limits memory usage and allows
   graceful rejects (-ENOMEM returned from sys_mmap), whereas with
   unlimited mappings you may limit memory usage with SIGKILL only.
3. physical pages - this includes pages mapped in page faults;
   hitting the physpages limit starts page reclamation.

[snip]

> 5. Open issues
> 
> (i)    Can we allow threads belonging to the same process to belong
>        to two different resource groups? Does this mean we need to do per-thread
>        VM accounting now?

Solving this question is the same as solving "how would we account for
pages that are shared between several processes?".

> (ii)   There is an overhead associated with adding a pointer in struct page.
>        Can this be reduced/avoided? One solution suggested is to use a
>        mirror mem_map.

This does not affect the infrastructure, right? E.g. the current beancounter
code uses the page_bc() macro to get the BC pointer from a page. Changing it
from
#define page_bc(page) ((page)->page_bc)
to
#define page_bc(page) (bc_mmap[page_to_pfn(page)])
or similar may be done at any moment.

We may decide that "each page has an associated BC pointer" and
go on with further discussion (e.g. which interface to use). The decision
of where to store this pointer may be taken after we agree on all the
rest.

Since we're not going to discuss right now what kind of locking
we are going to have, let's delay the discussion of anything that
is code-dependent.

> (iii)  How do we distribute the remaining resources among resource hungry
>        groups? The Resource Group implementation used the ratio of the limits
>        to decide on the ratio according to which they are distributed.
> (iv)   How do we account for shared pages? Should it be charged to the first
>        container which touches the page or should it be charged equally among
>        all containers sharing the page?
> (v)    Definition of RSS (see http://lkml.org/lkml/2006/10/10/130)
> 
> 6. Going forward
> 
> (i)    Agree on requirements (there has been some agreement already, please
>        see http://lkml.org/lkml/2006/9/6/102 and the BOF summary [7])
> (ii)   Agree on minimum accounting and hooks in the kernel. It might be
>        a good idea to take this up in phases
>        phase 1 - account for user space memory
>        phase 2 - account for kernel memory allocated on behalf of the user/task

I'd raise the priority of kernel memory accounting.

I see that everyone agrees that we want to see three resources:
  1. kernel memory
  2. unreclaimable memory
  3. reclaimable memory
If this is right, then let's save it somewhere
(e.g. http://wiki.openvz.org/Containers/UBC_discussion)
and go on discussing the next question - the interface.

Right now this is the most difficult one and there are two
candidates - syscalls and configfs. I've pointed out my objections
against configfs and haven't seen any against system calls...


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-30 10:43 ` [RFC] Resource Management - Infrastructure choices Paul Jackson
  2006-10-30 14:19   ` [ckrm-tech] " Pavel Emelianov
@ 2006-10-30 17:09   ` Srivatsa Vaddagiri
  2006-10-30 17:16     ` Dave McCracken
  1 sibling, 1 reply; 135+ messages in thread
From: Srivatsa Vaddagiri @ 2006-10-30 17:09 UTC (permalink / raw)
  To: Paul Jackson
  Cc: dev, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	matthltc, dipankar, rohitseth, menage, devel

On Mon, Oct 30, 2006 at 02:43:20AM -0800, Paul Jackson wrote:
> > Consensus:
> > 	...
> > 	- Don't support hierarchy for now
> 
> Looks like this item can be dropped from the consensus ... ;).
> 
> I for one would recommend getting the hierarchy right from the
> beginning.
> 
> Though I can appreciate that others were trying to "keep it simple"
> and postpone dealing with such complications.  I don't agree.
> 
> Such stuff as this deeply affects all that sits on it.  Get the
> basic data shape presented by the kernel-user API right up front.
> The rest will follow, much easier.

Hierarchy has implications not just for the kernel-user API, but also for
the controller design. I would prefer to progressively enhance the
controller, not supporting hierarchy in the beginning.

However you do have a valid concern that, if we don't design the user-kernel
API keeping hierarchy in mind, then we may break this interface when we
later add hierarchy support, which would be bad.

One possibility is to design a user-kernel interface that supports hierarchy,
but not support creating hierarchies deeper than one level in the initial
versions. Would that work?

-- 
Regards,
vatsa


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-30 17:09   ` Srivatsa Vaddagiri
@ 2006-10-30 17:16     ` Dave McCracken
  2006-10-30 18:07       ` Paul Menage
  0 siblings, 1 reply; 135+ messages in thread
From: Dave McCracken @ 2006-10-30 17:16 UTC (permalink / raw)
  To: ckrm-tech; +Cc: dev, linux-kernel, devel

On Monday 30 October 2006 11:09 am, Srivatsa Vaddagiri wrote:
> Hierarchy has implications not just for the kernel-user API, but also for
> the controller design. I would prefer to progressively enhance the
> controller, not supporting hierarchy in the beginning.
> 
> However you do have a valid concern that, if we don't design the user-kernel
> API keeping hierarchy in mind, then we may break this interface when we
> later add hierarchy support, which would be bad.
>
> One possibility is to design a user-kernel interface that supports
> hierarchy, but not support creating hierarchies deeper than one level in
> the initial versions. Would that work?

Is there any user demand for hierarchy right now?  I agree that we should 
design the API to allow hierarchy, but unless there is a current need for it 
I think we should not support actually creating hierarchies.  In addition to 
the reduction in code complexity, it will simplify the paradigm presented to 
the users.  I'm a firm believer in not giving users options they will never 
use.

Dave McCracken


* Re: [ckrm-tech] RFC: Memory Controller
  2006-10-30 15:58   ` Pavel Emelianov
@ 2006-10-30 17:39     ` Balbir Singh
  2006-10-30 18:07       ` Balbir Singh
  2006-10-31  8:48       ` Pavel Emelianov
  2006-10-30 18:20     ` Paul Menage
  1 sibling, 2 replies; 135+ messages in thread
From: Balbir Singh @ 2006-10-30 17:39 UTC (permalink / raw)
  To: Pavel Emelianov
  Cc: vatsa, dev, sekharan, ckrm-tech, haveblue, linux-kernel, pj,
	matthltc, dipankar, rohitseth, menage

Pavel Emelianov wrote:
> [snip]
> 
>> Reclaimable memory
>>
>> (i)   Anonymous pages - Anonymous pages are pages allocated by the user space,
>>       they are mapped into the user page tables, but not backed by a file.
> 
> I do not agree with such a classification.
> When one maps a file, the kernel can remove the page from the address
> space, as there is already space on disk for it. When one
> maps an anonymous page, the kernel may not be able to remove the page
> at all, as the system may simply be configured to be swapless.

Yes, I agree that if there is no swap space, then anonymous memory is pinned.
Assuming that we'll end up using an abstraction on top of the
existing reclaim mechanism, the mechanism would know whether a particular
type of memory is reclaimable or not.

But your point is well taken.

[snip]

>> (i)  Slabs
>> (ii) Kernel pages and page_tables allocated on behalf of a task.
> 
> I'd pay more attention to kernel memory accounting and less
> to user one as having kmemsize resource actually protects
> the system from DoS attacks. Accounting and limiting user
> pages doesn't protect system from anything.
> 

Please see my comments at the end

[snip]

> 
>> +----+---------+------+---------+------------+----------------+-----------+
>> | No |Guarantee| Limit| User I/F| Controllers| New Controllers|Statistics |
>> +----+---------+------+---------+------------+----------------+-----------+
>> | i  |  No     | Yes  | syscall | Memory     | No framework   |   Yes     |
>> |    |         |      |         |            | to write new   |           |
>> |    |         |      |         |            | controllers    |           |
> 
> The latest Beancounter patches do provide a framework for
> new controllers.
> 

I'll update the RFC.

> [snip]
> 
>> a. Beancounters currently account for the following resources
>>
>> (i)   kmemsize - memory obtained through alloc_pages() with __GFP_BC flag set.
>> (ii)  physpages - Resident set size of the tasks in the group.
>>       Reclaim support is provided for this resource.
>> (iii) lockedpages - User pages locked in memory
>> (iv)  slabs - slabs allocated with kmem_cache_alloc_bc are accounted and
>>       controlled.
> 
> This is also not true now. The latest beancounter code accounts for
> 1. kmemsize - this includes slab and vmalloc objects and "raw" pages
>    allocated directly from the buddy allocator.

This is what I said: pages marked with __GFP_BC. So far on i386 I see
slab, vmalloc, PTE & LDT allocations marked with the flag.


> 2. unreclaimable memory - this accounts for the total length of
>    mappings of a certain type. It is _mappings_ that are
>    accounted, since limiting mappings limits memory usage and allows
>    a graceful reject (-ENOMEM returned from sys_mmap), while with
>    unlimited mappings you may limit memory usage with SIGKILL only.

ok, I'll add this too.

> 3. physical pages - these include pages mapped in page faults, and
>    hitting the physpages limit starts page reclamation.
> 

Yep, that's what I said.

> [snip]
> 
>> 5. Open issues
>>
>> (i)    Can we allow threads belonging to the same process to belong
>>        to two different resource groups? Does this mean we need to do per-thread
>>        VM accounting now?
> 
> Solving this question is the same as solving "how would we account for
> pages that are shared between several processes?".
> 

Yes and that's an open issue too :)

>> (ii)   There is an overhead associated with adding a pointer in struct page.
>>        Can this be reduced/avoided? One solution suggested is to use a
>>        mirror mem_map.
> 
> This does not affect infrastructure, right? E.g. current beancounter
> code uses page_bc() macro to get BC pointer from page. Changing it
> from
> #define page_bc(page) ((page)->page_bc)
> to
> #define page_bc(page) (bc_mmap[page_to_pfn(page)])
> or similar may be done at any moment.

The goal of the RFC is to discuss the controller. In the OLS BOF
on resource management, it was agreed that the controllers should be
discussed and designed first, so that the proper infrastructure
could be put in place. Please see http://lkml.org/lkml/2006/7/26/237.

> 
> We may decide that "each page has an associated BC pointer" and
> go on with further discussion (e.g. which interface to use). The
> decision of where to store this pointer may be taken after we agree
> on all the rest.


Yes, what you point out is an abstraction mechanism that abstracts
away the implementation detail for now. I think it's a good starting
point for further discussion.

[snip]

>>
>> 6. Going forward
>>
>> (i)    Agree on requirements (there has been some agreement already, please
>>        see http://lkml.org/lkml/2006/9/6/102 and the BOF summary [7])
>> (ii)   Agree on minimum accounting and hooks in the kernel. It might be
>>        a good idea to take this up in phases
>>        phase 1 - account for user space memory
>>        phase 2 - account for kernel memory allocated on behalf of the user/task
> 
> I'd raise the priority of kernel memory accounting.
> 
> I see that everyone agrees that we want to see three resources:
>   1. kernel memory
>   2. unreclaimable memory
>   3. reclaimable memory
> if this is right then let's save it somewhere
> (e.g. http://wiki.openvz.org/Containers/UBC_discussion)
> and go on discussing the next question - interface.

I understand that kernel memory accounting is the first priority for
containers, but accounting kernel memory requires too many changes
to the VM core, hence I was hesitant to put it up as first priority.

But in general I agree, these are the three important resources for
accounting and control.

[snip]

-- 

	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-30 12:27       ` Paul Jackson
@ 2006-10-30 17:53         ` Paul Menage
  2006-10-30 20:36           ` Paul Jackson
  0 siblings, 1 reply; 135+ messages in thread
From: Paul Menage @ 2006-10-30 17:53 UTC (permalink / raw)
  To: Paul Jackson
  Cc: dev, vatsa, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	matthltc, dipankar, rohitseth

On 10/30/06, Paul Jackson <pj@sgi.com> wrote:
>
> You mean let the system admin, say, of a system determine
> whether or not CKRM/RG and cpusets have one shared, or two
> separate, hierarchies?

Yes - let the sysadmin define the process groupings, and how those
groupings get associated with resource control entities. The default
should be that all the hierarchies are the same, since I think that's
likely to be the common case.

Paul

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-30 14:08 ` Pavel Emelianov
  2006-10-30 14:23   ` Paul Jackson
@ 2006-10-30 18:01   ` Paul Menage
  2006-10-31  8:31     ` Pavel Emelianov
  2006-10-31 16:34   ` Srivatsa Vaddagiri
  2 siblings, 1 reply; 135+ messages in thread
From: Paul Menage @ 2006-10-30 18:01 UTC (permalink / raw)
  To: Pavel Emelianov
  Cc: vatsa, dev, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	pj, matthltc, dipankar, rohitseth, devel

On 10/30/06, Pavel Emelianov <xemul@openvz.org> wrote:
> > Debated:
> >       - syscall vs configfs interface
>
> 1. One of the major configfs ideas is that the lifetime of
>    the objects is completely driven by userspace.
>    A resource controller shouldn't have to live as long as the
>    user wants. It "may", but not "must"! As you have seen from
>    our (beancounters) patches, beancounters disappeared
>    as soon as the last reference was dropped.

Why is this an important feature for beancounters? All the other
resource control approaches seem to prefer having userspace handle
removing empty/dead groups/containers.

> 2. Having configfs as the only interface doesn't allow
>    people to have the resource control facility w/o configfs.
>    The resource controller must not depend on any "feature".

Why is depending on a feature like configfs worse than depending on a
feature of being able to extend the system call interface?

> >       - Interaction of resource controllers, containers and cpusets
> >               - Should we support, for instance, creation of resource
> >                 groups/containers under a cpuset?
> >       - Should we have different groupings for different resources?
>
> This breaks the idea of groups isolation.

That's fine - some people don't want total isolation. If we're looking
for a solution that fits all the different requirements, then we need
that flexibility. I agree that the default would probably want to be
that the groupings be the same for all resource controllers /
subsystems.

Paul

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-30 17:16     ` Dave McCracken
@ 2006-10-30 18:07       ` Paul Menage
  2006-10-30 20:41         ` Paul Jackson
  0 siblings, 1 reply; 135+ messages in thread
From: Paul Menage @ 2006-10-30 18:07 UTC (permalink / raw)
  To: Dave McCracken; +Cc: ckrm-tech, dev, linux-kernel, devel

On 10/30/06, Dave McCracken <dmccr@us.ibm.com> wrote:
>
> Is there any user demand for hierarchy right now?  I agree that we should
> design the API to allow hierarchy, but unless there is a current need for it
> I think we should not support actually creating hierarchies.  In addition to
> the reduction in code complexity, it will simplify the paradigm presented to
> the users.  I'm a firm believer in not giving users options they will never
> use.

The current CPUsets code supports hierarchies, and I believe that
there are people out there who depend on them (right, PaulJ?) Since
CPUsets are at heart a form of resource controller, it would be nice
to have them use the same resource control infrastructure as other
resource controllers (see the generic container patches that I sent
out as an example of this). So that would be at least one user that
requires a hierarchy.

Paul

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] RFC: Memory Controller
  2006-10-30 17:39     ` Balbir Singh
@ 2006-10-30 18:07       ` Balbir Singh
  2006-10-31  8:57         ` Pavel Emelianov
  2006-10-31  8:48       ` Pavel Emelianov
  1 sibling, 1 reply; 135+ messages in thread
From: Balbir Singh @ 2006-10-30 18:07 UTC (permalink / raw)
  To: balbir
  Cc: Pavel Emelianov, vatsa, dev, sekharan, ckrm-tech, haveblue,
	linux-kernel, pj, matthltc, dipankar, rohitseth, menage, linux-mm

Balbir Singh wrote:
[snip]

>
>> I see that everyone agrees that we want to see three resources:
>>   1. kernel memory
>>   2. unreclaimable memory
>>   3. reclaimable memory
>> if this is right then let's save it somewhere
>> (e.g. http://wiki.openvz.org/Containers/UBC_discussion)
>> and go on discussing the next question - interface.
> 
> I understand that kernel memory accounting is the first priority for
> containers, but accounting kernel memory requires too many changes
> to the VM core, hence I was hesitant to put it up as first priority.
> 
> But in general I agree, these are the three important resources for
> accounting and control

I forgot to mention: I hope you were including the page cache in
your definition of reclaimable memory.

> 
> [snip]
> 


-- 

	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] RFC: Memory Controller
  2006-10-30 13:27     ` [ckrm-tech] " Balbir Singh
@ 2006-10-30 18:14       ` Paul Menage
  2006-10-31 17:07         ` Balbir Singh
  0 siblings, 1 reply; 135+ messages in thread
From: Paul Menage @ 2006-10-30 18:14 UTC (permalink / raw)
  To: balbir
  Cc: dev, vatsa, sekharan, ckrm-tech, haveblue, linux-kernel, pj,
	matthltc, dipankar, rohitseth

On 10/30/06, Balbir Singh <balbir@in.ibm.com> wrote:
>
> You'll also end up with per zone page cache pools for each zone. A list of
> active/inactive pages per zone (which will split up the global LRU list).

Yes, these are some of the inefficiencies that we're ironing out.

> What about the hard-partitioning. If a container/cpuset is not using its full
> 64MB of a fake node, can some other node use it?

No. So the granularity at which you can divide up the system depends
on how big your fake nodes are. For our purposes, we figure that 64MB
granularity should be fine.

> Also, won't you end up
> with a big zonelist?

Yes - but PaulJ's recent patch to speed up the zone selection helped
reduce the overhead of this a lot.

>
> Consider the other side of the story. lets say we have a shared lib shared
> among quite a few containers. We limit the usage of the inode containing
> the shared library to 50M. Tasks A and B use some part of the library
> and cause the container "C" to reach the limit. Container C is charged
> for all usage of the shared library. Now no other task, irrespective of which
> container it belongs to, can touch any new pages of the shared library.

Well, if the pages aren't mlocked then presumably some of the existing
pages can be flushed out to disk and replaced with other pages.

>
> What you are suggesting is to virtually group the inodes by container rather
> than task. It might make sense in some cases, but not all.

Right - I think it's an important feature to be able to support, but I
agree that it's not suitable for all situations.

>
> We could consider implementing the controllers in phases
>
> 1. RSS control (anon + mapped pages)
> 2. Page Cache control

Page cache control is actually more essential than RSS control, in our
experience - it's pretty easy to track RSS values from userspace, and
react reasonably quickly to kill things that go over their limit, but
determining page cache usage (i.e. determining which job on the system
is flooding the page cache with dirty buffers) is pretty much
impossible currently.

Paul

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] RFC: Memory Controller
  2006-10-30 15:58   ` Pavel Emelianov
  2006-10-30 17:39     ` Balbir Singh
@ 2006-10-30 18:20     ` Paul Menage
  2006-10-30 21:38       ` Paul Jackson
  1 sibling, 1 reply; 135+ messages in thread
From: Paul Menage @ 2006-10-30 18:20 UTC (permalink / raw)
  To: Pavel Emelianov
  Cc: balbir, vatsa, dev, sekharan, ckrm-tech, haveblue, linux-kernel,
	pj, matthltc, dipankar, rohitseth

On 10/30/06, Pavel Emelianov <xemul@openvz.org> wrote:
> and go on discussing the next question - interface.
>
> Right now this is the most diffucult one and there are two
> candidates - syscalls and configfs. I've pointed out my objections
> against configfs and haven't seen any against system calls...
>

Some objections:

- they require touching every architecture to add the new system calls
- they're harder to debug from userspace, since you can't use useful
tools such as echo and cat
- changing the interface is harder since it's (presumably) a binary API

Paul

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-30 17:53         ` Paul Menage
@ 2006-10-30 20:36           ` Paul Jackson
  2006-10-30 20:47             ` Paul Menage
  0 siblings, 1 reply; 135+ messages in thread
From: Paul Jackson @ 2006-10-30 20:36 UTC (permalink / raw)
  To: Paul Menage
  Cc: dev, vatsa, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	matthltc, dipankar, rohitseth

> Yes - let the sysadmin define the process groupings, and how those
> groupings get associated with resource control entities. The default
> should be that all the hierarchies are the same, since I think that's
> likely to be the common case.

Ah - I had thought earlier you were saying let the user define whether
or not (speaking metaphorically) their car had multiple gears in its
transmission, or just one gear.  That would have been kind of insane.

You meant we deliver a car with multiple gears, and its up to the user
when and if to ever shift.  That makes more sense.

In other words you are recommending delivering a system that internally
tracks separate hierarchies for each resource control entity, but where
the user can conveniently overlap some of these hierarchies and deal
with them as a single hierarchy.

What you are suggesting goes beyond the question of whether the kernel
has just and exactly and nevermore than one hierarchy, to suggest that
not only should the kernel support multiple hierarchies for different
resource control entities, but furthermore the kernel should make it
convenient for users to "bind" two or more of these hierarchies and
treat them as one.

Ok.  Sounds useful.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-30 18:07       ` Paul Menage
@ 2006-10-30 20:41         ` Paul Jackson
  0 siblings, 0 replies; 135+ messages in thread
From: Paul Jackson @ 2006-10-30 20:41 UTC (permalink / raw)
  To: Paul Menage; +Cc: dmccr, dev, linux-kernel, devel, ckrm-tech

>  I believe that
> there are people out there who depend on them (right, PaulJ?)

Yes.  For example a common usage pattern has the system admin carve
off a big chunk of CPUs and Memory Nodes into a cpuset for the batch
scheduler to manage, within which the batch scheduler creates child
cpusets, roughly one for each job under its control.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-30 20:36           ` Paul Jackson
@ 2006-10-30 20:47             ` Paul Menage
  2006-10-30 20:56               ` Paul Jackson
                                 ` (3 more replies)
  0 siblings, 4 replies; 135+ messages in thread
From: Paul Menage @ 2006-10-30 20:47 UTC (permalink / raw)
  To: Paul Jackson
  Cc: dev, vatsa, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	matthltc, dipankar, rohitseth

On 10/30/06, Paul Jackson <pj@sgi.com> wrote:
>
> In other words you are recommending delivering a system that internally
> tracks separate hierarchies for each resource control entity, but where
> the user can conveniently overlap some of these hierarchies and deal
> with them as a single hierarchy.

More or less. More concretely:

- there is a single hierarchy of process containers
- each process is a member of exactly one process container

- for each resource controller, there's a hierarchy of resource "nodes"
- each process container is associated with exactly one resource node
of each type

- by default, the process container hierarchy and the resource node
hierarchies are isomorphic, but that can be controlled by userspace.

Paul

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-30 20:47             ` Paul Menage
@ 2006-10-30 20:56               ` Paul Jackson
  2006-10-30 21:03               ` Paul Menage
                                 ` (2 subsequent siblings)
  3 siblings, 0 replies; 135+ messages in thread
From: Paul Jackson @ 2006-10-30 20:56 UTC (permalink / raw)
  To: Paul Menage
  Cc: dev, vatsa, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	matthltc, dipankar, rohitseth

> More concretely:
> 
> - there is a single hierarchy of process containers
> - each process is a member of exactly one process container
> 
> - for each resource controller, there's a hierarchy of resource "nodes"
> - each process container is associated with exactly one resource node
>   of each type
> 
> - by default, the process container hierarchy and the resource node
>   hierarchies are isomorphic, but that can be controlled by userspace.

nice

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-30 20:47             ` Paul Menage
  2006-10-30 20:56               ` Paul Jackson
@ 2006-10-30 21:03               ` Paul Menage
  2006-10-31 11:53               ` Srivatsa Vaddagiri
  2006-11-01  4:39               ` David Rientjes
  3 siblings, 0 replies; 135+ messages in thread
From: Paul Menage @ 2006-10-30 21:03 UTC (permalink / raw)
  To: Paul Jackson
  Cc: dev, vatsa, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	matthltc, dipankar, rohitseth

On 10/30/06, Paul Menage <menage@google.com> wrote:
>
> - there is a single hierarchy of process containers
> - each process is a member of exactly one process container
>
> - for each resource controller, there's a hierarchy of resource "nodes"
> - each process container is associated with exactly one resource node
> of each type
>
> - by default, the process container hierarchy and the resource node
> hierarchies are isomorphic, but that can be controlled by userspace.

A simpler alternative that I thought about would be to restrict the
resource controller hierarchies to be strict subtrees of the process
container hierarchy - so at each stage in the hierarchy, a container
could either inherit its parent's node for a given resource or have a
new child node (i.e. be in the same cpuset or be in a fresh child
cpuset).

This is a much simpler abstraction to present to userspace (simply one
flag for each resource controller in each process container) and might
be sufficient for all reasonable scenarios.
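As a sketch of that flag-based scheme (names hypothetical), resolving
which resource node governs a container just means walking up until a
container that opted for a fresh node:

```c
#include <stddef.h>

#define MAX_SUBSYS 4            /* illustrative */

struct res_node {
	int id;                 /* stand-in for real per-controller state */
};

struct container {
	struct container *parent;
	unsigned fresh;         /* bit i set => own node for subsystem i */
	struct res_node *node[MAX_SUBSYS];
};

/* Find the resource node governing container 'c' for 'subsys':
 * either this container's own node, or the nearest ancestor's that
 * has the "fresh node" flag set (the root always has its own). */
struct res_node *find_node(struct container *c, int subsys)
{
	while (c->parent && !(c->fresh & (1u << subsys)))
		c = c->parent;
	return c->node[subsys];
}
```

So a child cpuset that inherits its parent's memory controller simply
leaves that bit clear and carries no node of its own for it.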

Paul

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] RFC: Memory Controller
  2006-10-30 18:20     ` Paul Menage
@ 2006-10-30 21:38       ` Paul Jackson
  0 siblings, 0 replies; 135+ messages in thread
From: Paul Jackson @ 2006-10-30 21:38 UTC (permalink / raw)
  To: Paul Menage
  Cc: xemul, dev, vatsa, sekharan, ckrm-tech, balbir, haveblue,
	linux-kernel, matthltc, dipankar, rohitseth

> - they require touching every architecture to add the new system calls
> - they're harder to debug from userspace, since you can't use useful
>   tools such as echo and cat
> - changing the interface is harder since it's (presumably) a binary API

To my mind these are rather secondary selection criteria.

If say we were adding a single, per-thread scalar value that each
thread could query of and perhaps modify for itself, then a system call
would be an alternative worthy of further consideration.

Representing complicated, nested, structured information via custom
system calls is a pain.  We have more luck using classic file system
structures, and abstracting the representation a layer up.  Of course
there are still system calls, but they are the classic Unix calls such
as mkdir, chdir, rmdir, creat, unlink, open, read, write and close.
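To illustrate (the helper and the knob names are purely hypothetical),
with a filesystem-style interface, creating a resource group and setting
one of its limits needs nothing beyond those classic calls:

```c
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

/* Create a resource group directory and write a value into one of its
 * control files -- mkdir/open/write is the whole "API". */
int set_knob(const char *group, const char *knob, const char *val)
{
	char path[512];
	int fd;

	mkdir(group, 0755);     /* EEXIST is harmless for this sketch */
	snprintf(path, sizeof(path), "%s/%s", group, knob);
	/* O_CREAT only so the sketch runs on a plain filesystem; a real
	 * control filesystem would pre-create its files. */
	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;
	if (write(fd, val, strlen(val)) < 0) {
		close(fd);
		return -1;
	}
	return close(fd);
}
```

Against a mounted control filesystem this would be, say,
set_knob("/dev/res/batch", "cpus", "4-7") - and the same operation works
from a shell with mkdir and echo, which is exactly the debugging
convenience being argued for.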

The same thing happens in designing network and web services.  There
are always low level protocols, such as physical and link and IP.
And sometimes these have to be extended, such as IPv4 versus IPv6.
But we don't code AJAX down at that level - AJAX sits on top of things
like Javascript and XML, higher up in the protocol stack.

And we didn't start coding AJAX as a series of IP hacks, saying we can
add a higher level protocol alternative later on.  That would have been
useless.

Figuring out where in the protocol stack one is targeting ones new
feature work is a foundation decision.  Get it right, up front,
forevermore, or risk ending up in Documentation/ABI/obsolete or
Documentation/ABI/removed in a few years, if your work doesn't
just die sooner without a whimper.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-30 15:26         ` Pavel Emelianov
@ 2006-10-31  0:26           ` Matt Helsley
  2006-10-31  8:34             ` Pavel Emelianov
  2006-10-31 16:33           ` Chris Friesen
  1 sibling, 1 reply; 135+ messages in thread
From: Matt Helsley @ 2006-10-31  0:26 UTC (permalink / raw)
  To: Pavel Emelianov
  Cc: Paul Jackson, vatsa, dev, sekharan, menage, ckrm-tech, balbir,
	haveblue, linux-kernel, dipankar, rohitseth, devel

On Mon, 2006-10-30 at 18:26 +0300, Pavel Emelianov wrote:
> Paul Jackson wrote:
> > Pavel wrote:
> >>>> 3. Configfs may be easily implemented later as an additional
> >>>>    interface. I propose the following solution:
> >>>>      ...
> >> The resource controller has nothing in common with configfs.
> >> That's the same as if we made netfilter depend on procfs.
> > 
> > Well ... if you used configfs as an interface to resource
> > controllers, as you said was easily done, then they would
> > have something to do with each other, right ;)?
> 
> Right. We'll create a dependency that is not needed.
> 
> > Choose the right data structure for the job, and then reuse
> > what fits for that choice.
> > 
> > Neither avoid nor encouraging code reuse is the key question.
> > 
> > What's the best fit, long term, for the style of kernel-user
> > API, for this use?  That's the key question.
> 
> I agree, but you've cut some important questions away,
> so I ask them again:
> 
>  > What if a user creates a controller (configfs directory)
>  > and doesn't remove it at all? Should the controller stay in
>  > memory even if nobody uses it?

	Yes. The controller should stay in memory until userspace decides that
control of the resource is no longer desired. Though not all controllers
should be removable since that may impose unreasonable restrictions on
what useful/performant controllers can be implemented.

	That doesn't mean that the controller couldn't reclaim memory it uses
when it's no longer needed.

<snip>

Cheers,
	-Matt Helsley


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-30 18:01   ` Paul Menage
@ 2006-10-31  8:31     ` Pavel Emelianov
  2006-10-31 16:34       ` Paul Menage
  0 siblings, 1 reply; 135+ messages in thread
From: Pavel Emelianov @ 2006-10-31  8:31 UTC (permalink / raw)
  To: Paul Menage
  Cc: Pavel Emelianov, vatsa, dev, sekharan, ckrm-tech, balbir,
	haveblue, linux-kernel, pj, matthltc, dipankar, rohitseth, devel

Paul Menage wrote:
> On 10/30/06, Pavel Emelianov <xemul@openvz.org> wrote:
>> > Debated:
>> >       - syscall vs configfs interface
>>
>> 1. One of the major configfs ideas is that the lifetime of
>>    the objects is completely driven by userspace.
>>    A resource controller shouldn't have to live as long as the
>>    user wants. It "may", but not "must"! As you have seen from
>>    our (beancounters) patches, beancounters disappeared
>>    as soon as the last reference was dropped.
> 
> Why is this an important feature for beancounters? All the other
> resource control approaches seem to prefer having userspace handle
> removing empty/dead groups/containers.

That's functionality a user may want. I agree that some users
may want to create some kind of "persistent" beancounters, but
this must not be the only way to control them. I like the way
TUN devices are done. Each has a TUN_PERSIST flag controlling
whether or not to destroy the device right on close. I think that
we may have something similar - a flag BC_PERSISTENT to keep
beancounters with a zero refcount in memory for reuse.

Objections?
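A minimal userspace model of that refcount-plus-flag lifetime
(BC_PERSISTENT is the proposed name above, not an existing kernel flag):

```c
#include <stdlib.h>

#define BC_PERSISTENT 0x1       /* proposed flag, analogous to TUN_PERSIST */

struct beancounter {
	int refcount;
	unsigned flags;
	/* resource charges would live here */
};

struct beancounter *bc_create(unsigned flags)
{
	struct beancounter *bc = calloc(1, sizeof(*bc));

	if (bc) {
		bc->refcount = 1;
		bc->flags = flags;
	}
	return bc;
}

/* Drop one reference. Returns 1 if the beancounter was freed, 0 if it
 * survives (still referenced, or marked persistent for later reuse). */
int bc_put(struct beancounter *bc)
{
	if (--bc->refcount > 0)
		return 0;
	if (bc->flags & BC_PERSISTENT)
		return 0;       /* kept in memory at refcount zero */
	free(bc);
	return 1;
}
```

A configfs-style interface could then simply be one way of taking and
dropping such references, rather than the sole owner of the lifetime.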

>> 2. Having configfs as the only interface doesn't allow
>>    people to have the resource control facility w/o configfs.
>>    The resource controller must not depend on any "feature".
> 
> Why is depending on a feature like configfs worse than depending on a
> feature of being able to extend the system call interface?

Because configfs is a _feature_, while the system call interface is
a mandatory part of the kernel. Since "resource beancounters" are a
core thing, they shouldn't depend on "optional" kernel stuff. E.g.
procfs is the way userspace gets information about running tasks,
but disabling procfs doesn't disable such core functionality
as fork-ing and execve-ing.

Moreover, I hope you agree that beancounters can't be made a
module. If so, the user will have to build configfs in, and thus
CONFIG_CONFIGFS_FS essentially becomes a "bool", not a "tristate".

I have nothing against using configfs as additional, optional
interface, but I do object using it as the only window inside
BC world.

>> >       - Interaction of resource controllers, containers and cpusets
>> >               - Should we support, for instance, creation of resource
>> >                 groups/containers under a cpuset?
>> >       - Should we have different groupings for different resources?
>>
>> This breaks the idea of groups isolation.
> 
> That's fine - some people don't want total isolation. If we're looking
> for a solution that fits all the different requirements, then we need
> that flexibility. I agree that the default would probably want to be
> that the groupings be the same for all resource controllers /
> subsystems.

Hm... OK, I don't mind, although I don't see any reasonable use for it.
Thus we add one more point to our "agreement" list
http://wiki.openvz.org/Containers/UBC_discussion

 - all resource groups are independent

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-31  0:26           ` Matt Helsley
@ 2006-10-31  8:34             ` Pavel Emelianov
  0 siblings, 0 replies; 135+ messages in thread
From: Pavel Emelianov @ 2006-10-31  8:34 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Pavel Emelianov, Paul Jackson, vatsa, dev, sekharan, menage,
	ckrm-tech, balbir, haveblue, linux-kernel, dipankar, rohitseth,
	devel

[snip]

> 	Yes. The controller should stay in memory until userspace decides that
> control of the resource is no longer desired. Though not all controllers
> should be removable since that may impose unreasonable restrictions on
> what useful/performant controllers can be implemented.
> 
> 	That doesn't mean that the controller couldn't reclaim memory it uses
> when it's no longer needed.
> 


I've already answered Paul Menage about this. In short:

... I agree that some users may want to create some
kind of "persistent" beancounters, but this must not be
the only way to control them...
... I think that we may have something [like this] - a flag
BC_PERSISTENT to keep beancounters with a zero refcount in
memory for reuse...
... I have nothing against using configfs as additional,
optional interface, but I do object using it as the only
window inside BC world...

Please, refer to my full reply for comments.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] RFC: Memory Controller
  2006-10-30 17:39     ` Balbir Singh
  2006-10-30 18:07       ` Balbir Singh
@ 2006-10-31  8:48       ` Pavel Emelianov
  2006-10-31 10:54         ` Balbir Singh
  2006-10-31 17:04         ` Dave Hansen
  1 sibling, 2 replies; 135+ messages in thread
From: Pavel Emelianov @ 2006-10-31  8:48 UTC (permalink / raw)
  To: balbir
  Cc: Pavel Emelianov, vatsa, dev, sekharan, ckrm-tech, haveblue,
	linux-kernel, pj, matthltc, dipankar, rohitseth, menage

Balbir Singh wrote:
> Pavel Emelianov wrote:
>> [snip]
>>
>>> Reclaimable memory
>>>
>>> (i)   Anonymous pages - Anonymous pages are pages allocated by the user space,
>>>       they are mapped into the user page tables, but not backed by a file.
>> I do not agree with such classification.
>> When one maps a file, the kernel can remove the page from the address
>> space, as there is already space on disk for it. When one
>> maps an anonymous page, the kernel won't remove this page
>> for sure, as the system may simply be configured to be swapless.
> 
> Yes, I agree that if there is no swap space, then anonymous memory is pinned.
> Assuming that we'll end up using an abstraction on top of the
> existing reclaim mechanism, the mechanism would know whether a particular
> type of memory is reclaimable or not.

If memory is considered to be unreclaimable, then action should be
taken at mmap() time, not later! Rejecting mmap() is the only way to
limit a user's unreclaimable memory consumption.

> But your point is well taken.

Thank you.

[snip]

>> This is also not true now. The latest beancounter code accounts for
>> 1. kmemsie - this includes slab and vmalloc objects and "raw" pages
>>    allocated directly from buddy allocator.
> 
> This is what I said, pages marked with __GFP_BC, so far on i386 I see
> slab, vmalloc, PTE & LDT entries marked with the flag.

Yes. I just wanted to keep all the things together.

[snip]

> I understand that kernel memory accounting is the first priority for
> containers, but accounting kernel memory requires too many changes
> to the VM core, hence I was hesitant to put it up as first priority.

Among all the kernel-code-intrusive patches in the BC patch set,
the kmemsize hooks are the most "conservative" - only one place
is heavily patched, and that is the slab allocator. The buddy
allocator is patched too, but _significantly_ less. The rest of the
patch adds the __GFP_BC flag to some allocations and SLAB_BC to
some kmem_caches.

The user memory controlling patch is much heavier...

I'd set the development priorities this way:

1. core infrastructure (mainly headers)
2. interface
3. kernel memory hooks and accounting
4. mappings hooks and accounting
5. physical pages hooks and accounting
6. user pages reclamation
7. moving threads between beancounters
8. make beancounter persistent

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] RFC: Memory Controller
  2006-10-30 18:07       ` Balbir Singh
@ 2006-10-31  8:57         ` Pavel Emelianov
  2006-10-31  9:19           ` Balbir Singh
  0 siblings, 1 reply; 135+ messages in thread
From: Pavel Emelianov @ 2006-10-31  8:57 UTC (permalink / raw)
  To: balbir
  Cc: vatsa, dev, sekharan, ckrm-tech, haveblue, linux-kernel, pj,
	matthltc, dipankar, rohitseth, menage, linux-mm

[snip]

>> But in general I agree, these are the three important resources for
>> accounting and control
> 
> I missed out to mention, I hope you were including the page cache in
> your definition of reclaimable memory.

As far as the page cache is concerned, my opinion is the following.
(If I misunderstood you, please correct me.)

The page cache is designed to keep as many pages in memory as
possible to optimize performance. If we start limiting page
cache usage, we cut performance. What is to be controlled is
_used_ resources (touched pages, opened file descriptors, mapped
areas, etc.), not the cached ones. I see nothing bad if a
page that belongs to a file, but is not used by ANY task in the BC,
stays in memory. I think this is normal. If the kernel wants, it may
push this page out easily; it won't even need to try_to_unmap()
it. So cached pages must not be accounted.


I've also noticed that you've [snip]-ed on one of my questions.

 > How would you allocate memory on NUMA in advance?

Please, clarify this.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] RFC: Memory Controller
  2006-10-31  8:57         ` Pavel Emelianov
@ 2006-10-31  9:19           ` Balbir Singh
  2006-10-31  9:25             ` Pavel Emelianov
  2006-10-31  9:42             ` Andrew Morton
  0 siblings, 2 replies; 135+ messages in thread
From: Balbir Singh @ 2006-10-31  9:19 UTC (permalink / raw)
  To: Pavel Emelianov
  Cc: vatsa, dev, sekharan, ckrm-tech, haveblue, linux-kernel, pj,
	matthltc, dipankar, rohitseth, menage, linux-mm, Vaidyanathan S

Pavel Emelianov wrote:
> [snip]
> 
>>> But in general I agree, these are the three important resources for
>>> accounting and control
>> I missed out to mention, I hope you were including the page cache in
>> your definition of reclaimable memory.
> 
> As far as page cache is concerned my opinion is the following.
> (If I misunderstood you, please correct me.)
> 
> Page cache is designed to keep in memory as much pages as
> possible to optimize performance. If we start limiting the page
> cache usage we cut the performance. What is to be controlled is
> _used_ resources (touched pages, opened file descriptors, mapped
> areas, etc), but not the cached ones. I see nothing bad if the
> page that belongs to a file, but is not used by ANY task in BC,
> stays in memory. I think this is normal. If kernel wants it may
> push this page out easily it won't event need to try_to_unmap()
> it. So cached pages must not be accounted.
> 

The idea behind limiting the page cache is this:

1. Let's say one container fills up the page cache.
2. The other containers will not be able to allocate memory (even
though they are within their limits) without the overhead of having
to flush the page cache and free the occupied cache. The kernel
will have to pageout() the dirty pages in the page cache.

Since it is easy to push a page out (as you said), it should be
easy to impose a limit on the page cache usage of a container.

> 
> I've also noticed that you've [snip]-ed on one of my questions.
> 
>  > How would you allocate memory on NUMA in advance?
> 
> Please, clarify this.

I am not quite sure I understand the question. Could you please rephrase
it and highlight some of the difficulty?

-- 

	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] RFC: Memory Controller
  2006-10-31  9:19           ` Balbir Singh
@ 2006-10-31  9:25             ` Pavel Emelianov
  2006-10-31 10:10               ` Balbir Singh
  2006-10-31  9:42             ` Andrew Morton
  1 sibling, 1 reply; 135+ messages in thread
From: Pavel Emelianov @ 2006-10-31  9:25 UTC (permalink / raw)
  To: balbir
  Cc: Pavel Emelianov, vatsa, dev, sekharan, ckrm-tech, haveblue,
	linux-kernel, pj, matthltc, dipankar, rohitseth, menage, linux-mm,
	Vaidyanathan S

Balbir Singh wrote:
> Pavel Emelianov wrote:
>> [snip]
>>
>>>> But in general I agree, these are the three important resources for
>>>> accounting and control
>>> I missed out to mention, I hope you were including the page cache in
>>> your definition of reclaimable memory.
>> As far as page cache is concerned my opinion is the following.
>> (If I misunderstood you, please correct me.)
>>
>> Page cache is designed to keep in memory as much pages as
>> possible to optimize performance. If we start limiting the page
>> cache usage we cut the performance. What is to be controlled is
>> _used_ resources (touched pages, opened file descriptors, mapped
>> areas, etc), but not the cached ones. I see nothing bad if the
>> page that belongs to a file, but is not used by ANY task in BC,
>> stays in memory. I think this is normal. If kernel wants it may
>> push this page out easily it won't event need to try_to_unmap()
>> it. So cached pages must not be accounted.
>>
> 
> The idea behind limiting the page cache is this
> 
> 1. Lets say one container fills up the page cache.
> 2. The other containers will not be able to allocate memory (even
> though they are within their limits) without the overhead of having
> to flush the page cache and freeing up occupied cache. The kernel
> will have to pageout() the dirty pages in the page cache.
> 
> Since it is easy to push the page out (as you said), it should be
> easy to impose a limit on the page cache usage of a container.

If a group is limited in memory _consumption_, it won't fill
the page cache...

>> I've also noticed that you've [snip]-ed on one of my questions.
>>
>>  > How would you allocate memory on NUMA in advance?
>>
>> Please, clarify this.
> 
> I am not quite sure I understand the question. Could you please rephrase
> it and highlight some of the difficulty?

I'd like to provide a guarantee for a newly created group. According
to your idea, I have to preallocate some pages in advance. OK. How do I
select a NUMA node to allocate them from?

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] RFC: Memory Controller
  2006-10-31  9:19           ` Balbir Singh
  2006-10-31  9:25             ` Pavel Emelianov
@ 2006-10-31  9:42             ` Andrew Morton
  2006-10-31 10:36               ` Balbir Singh
  1 sibling, 1 reply; 135+ messages in thread
From: Andrew Morton @ 2006-10-31  9:42 UTC (permalink / raw)
  To: balbir
  Cc: Pavel Emelianov, vatsa, dev, sekharan, ckrm-tech, haveblue,
	linux-kernel, pj, matthltc, dipankar, rohitseth, menage, linux-mm,
	Vaidyanathan S

On Tue, 31 Oct 2006 14:49:12 +0530
Balbir Singh <balbir@in.ibm.com> wrote:

> The idea behind limiting the page cache is this
> 
> 1. Lets say one container fills up the page cache.
> 2. The other containers will not be able to allocate memory (even
> though they are within their limits) without the overhead of having
> to flush the page cache and freeing up occupied cache. The kernel
> will have to pageout() the dirty pages in the page cache.

There's a vast difference between clean pagecache and dirty pagecache in this
context.  It is terribly imprecise to use the term "pagecache".  And it would be
a poor implementation which failed to distinguish between clean pagecache and
dirty pagecache.
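Andrew's distinction can be made concrete with a toy model (names invented; this is not kernel code): a clean pagecache page can simply be dropped, while a dirty one first costs a pageout(), so any pagecache accounting ought to track the two classes separately.

```c
#include <assert.h>

struct pagecache {
    unsigned long clean;      /* droppable at essentially no cost */
    unsigned long dirty;      /* must be written back first */
    unsigned long writebacks; /* pageout()s forced by reclaim */
};

/* Reclaim up to 'want' pages, preferring cheap clean pages;
 * returns the number of pages actually freed. */
static unsigned long pc_shrink(struct pagecache *pc, unsigned long want)
{
    unsigned long freed = 0;

    while (freed < want && pc->clean) {
        pc->clean--;          /* just drop it */
        freed++;
    }
    while (freed < want && pc->dirty) {
        pc->dirty--;          /* pageout(): write back, then free */
        pc->writebacks++;
        freed++;
    }
    return freed;
}
```

A limit that lumps both together would charge a container the same for pages that cost nothing to evict and pages that each force an I/O.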


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] RFC: Memory Controller
  2006-10-31  9:25             ` Pavel Emelianov
@ 2006-10-31 10:10               ` Balbir Singh
  2006-10-31 10:19                 ` Pavel Emelianov
  0 siblings, 1 reply; 135+ messages in thread
From: Balbir Singh @ 2006-10-31 10:10 UTC (permalink / raw)
  To: Pavel Emelianov
  Cc: vatsa, dev, sekharan, ckrm-tech, haveblue, linux-kernel, pj,
	matthltc, dipankar, rohitseth, menage, linux-mm, Vaidyanathan S

Pavel Emelianov wrote:
> Balbir Singh wrote:
>> Pavel Emelianov wrote:
>>> [snip]
>>>
>>>>> But in general I agree, these are the three important resources for
>>>>> accounting and control
>>>> I missed out to mention, I hope you were including the page cache in
>>>> your definition of reclaimable memory.
>>> As far as page cache is concerned my opinion is the following.
>>> (If I misunderstood you, please correct me.)
>>>
>>> Page cache is designed to keep in memory as much pages as
>>> possible to optimize performance. If we start limiting the page
>>> cache usage we cut the performance. What is to be controlled is
>>> _used_ resources (touched pages, opened file descriptors, mapped
>>> areas, etc), but not the cached ones. I see nothing bad if the
>>> page that belongs to a file, but is not used by ANY task in BC,
>>> stays in memory. I think this is normal. If kernel wants it may
>>> push this page out easily it won't event need to try_to_unmap()
>>> it. So cached pages must not be accounted.
>>>
>> The idea behind limiting the page cache is this
>>
>> 1. Lets say one container fills up the page cache.
>> 2. The other containers will not be able to allocate memory (even
>> though they are within their limits) without the overhead of having
>> to flush the page cache and freeing up occupied cache. The kernel
>> will have to pageout() the dirty pages in the page cache.
>>
>> Since it is easy to push the page out (as you said), it should be
>> easy to impose a limit on the page cache usage of a container.
> 
> If a group is limited with memory _consumption_ it won't fill
> the page cache...
> 

So you mean the memory _consumption_ limit already controls
the page cache? That's exactly the ability we need - for a container
not to be able to fill up the page cache :)

I don't remember exactly - do you account for dirty page cache usage in
the latest BC patches?

>>> I've also noticed that you've [snip]-ed on one of my questions.
>>>
>>>  > How would you allocate memory on NUMA in advance?
>>>
>>> Please, clarify this.
>> I am not quite sure I understand the question. Could you please rephrase
>> it and highlight some of the difficulty?
> 
> I'd like to provide a guarantee for a newly created group. According
> to your idea I have to preallocate some pages in advance. OK. How to
> select a NUMA node to allocate them from?

The idea of pre-allocation was discussed as a possibility in case
somebody needed hard guarantees, but most of us don't need it.
It was in the RFC for the sake of completeness.

Coming back to your question

Why do you need to select a NUMA node? For performance?

-- 

	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] RFC: Memory Controller
  2006-10-31 10:10               ` Balbir Singh
@ 2006-10-31 10:19                 ` Pavel Emelianov
  0 siblings, 0 replies; 135+ messages in thread
From: Pavel Emelianov @ 2006-10-31 10:19 UTC (permalink / raw)
  To: balbir
  Cc: Pavel Emelianov, vatsa, dev, sekharan, ckrm-tech, haveblue,
	linux-kernel, pj, matthltc, dipankar, rohitseth, menage, linux-mm,
	Vaidyanathan S


[snip]

>>> Since it is easy to push the page out (as you said), it should be
>>> easy to impose a limit on the page cache usage of a container.
>> If a group is limited with memory _consumption_ it won't fill
>> the page cache...
>>
> 
> So you mean the memory _consumption_ limit is already controlling
> the page cache? That's what we need the ability for a container
> not to fill up the page cache :)

I mean page cache limiting is not needed. We need to make
sure a group eats less than N physical pages. That can be
achieved by controlling page faults, setup_arg_pages(), etc.
The page cache is not to be touched.
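A sketch of the scheme described here, with invented names: the group is charged for physical pages at fault time; once at its limit it must first push one of its own pages out, so the page cache itself is never limited directly.

```c
#include <assert.h>

struct bc_pages {
    unsigned long held;        /* physical pages charged */
    unsigned long limit;       /* max physical pages allowed */
    unsigned long reclaimable; /* pages the group could push out */
};

/* Called from the page-fault path; returns 0 if the fault may
 * proceed, -1 if the group is over limit with nothing left to
 * reclaim. */
static int bc_page_fault_charge(struct bc_pages *bc)
{
    if (bc->held < bc->limit) {
        bc->held++;
        return 0;
    }
    if (bc->reclaimable) {
        bc->reclaimable--;  /* push one of the group's own pages out */
        return 0;           /* held stays pinned at the limit */
    }
    return -1;              /* nothing to reclaim: fail the fault */
}
```

The point of hooking the fault path rather than the cache is that only pages a task actually touches count against it; file pages merely sitting in the cache cost the group nothing.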

> I don't remember correctly, but do you account for dirty page cache usage in
> the latest patches of BC?

We do not account for page cache itself. We track only
physical pages regardless of where they are.

[snip]

> The idea of pre-allocation was discussed as a possibility in the case
> that somebody needed hard guarantees, but most of us don't need it.
> I was in the RFC for the sake of completeness.
> 
> Coming back to your question
> 
> Why do you need to select a NUMA node? For performance?

Of course! Otherwise why would we need kmem_cache_alloc_node() and
similar calls in the kernel?

The second question is: what if two processes from different
beancounters try to share one page? I remember that the current
solution is to take the page from the first user's reserve. OK.
Consider then that this first user stops using the page. When
this happens, one page must be put back into its reserve, right?
But where do we get this page from?

Note that making guarantees through limiting doesn't care about
where the page comes from.
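The shared-page question can be made concrete with a small model (struct names invented): the first user's beancounter is charged; when that user drops the page, the charge has to migrate to a remaining user, and under a guarantee-by-reserves scheme the page to refill the old reserve has to come from somewhere.

```c
#include <assert.h>
#include <stddef.h>

struct bc {
    unsigned long held;          /* pages currently charged */
};

struct shared_page {
    struct bc *charged_to;       /* beancounter paying for the page */
    struct bc *users[4];
    int nusers;
};

/* The first user to touch the page gets charged for it. */
static void page_add_user(struct shared_page *pg, struct bc *bc)
{
    pg->users[pg->nusers++] = bc;
    if (!pg->charged_to) {
        pg->charged_to = bc;
        bc->held++;
    }
}

/* When the charged user drops the page, the charge must migrate
 * to a remaining user - this is exactly where the "where does
 * the page for the old reserve come from?" question arises. */
static void page_del_user(struct shared_page *pg, struct bc *bc)
{
    int i;

    for (i = 0; i < pg->nusers; i++)
        if (pg->users[i] == bc)
            break;
    for (; i < pg->nusers - 1; i++)
        pg->users[i] = pg->users[i + 1];
    pg->nusers--;

    if (pg->charged_to == bc) {
        bc->held--;
        pg->charged_to = pg->nusers ? pg->users[0] : NULL;
        if (pg->charged_to)
            pg->charged_to->held++;
    }
}
```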

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] RFC: Memory Controller
  2006-10-31  9:42             ` Andrew Morton
@ 2006-10-31 10:36               ` Balbir Singh
  0 siblings, 0 replies; 135+ messages in thread
From: Balbir Singh @ 2006-10-31 10:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Pavel Emelianov, vatsa, dev, sekharan, ckrm-tech, haveblue,
	linux-kernel, pj, matthltc, dipankar, rohitseth, menage, linux-mm,
	Vaidyanathan S

Andrew Morton wrote:
> On Tue, 31 Oct 2006 14:49:12 +0530
> Balbir Singh <balbir@in.ibm.com> wrote:
> 
>> The idea behind limiting the page cache is this
>>
>> 1. Lets say one container fills up the page cache.
>> 2. The other containers will not be able to allocate memory (even
>> though they are within their limits) without the overhead of having
>> to flush the page cache and freeing up occupied cache. The kernel
>> will have to pageout() the dirty pages in the page cache.
> 
> There's a vast difference between clean pagecache and dirty pagecache in this
> context.  It is terribly imprecise to use the term "pagecache".  And it would be
> a poor implementation which failed to distinguish between clean pagecache and
> dirty pagecache.
> 

Yes, I agree, it will be a good idea to distinguish between the two.

-- 

	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] RFC: Memory Controller
  2006-10-31  8:48       ` Pavel Emelianov
@ 2006-10-31 10:54         ` Balbir Singh
  2006-10-31 11:15           ` Pavel Emelianov
  2006-10-31 17:04         ` Dave Hansen
  1 sibling, 1 reply; 135+ messages in thread
From: Balbir Singh @ 2006-10-31 10:54 UTC (permalink / raw)
  To: Pavel Emelianov
  Cc: vatsa, dev, sekharan, ckrm-tech, haveblue, linux-kernel, pj,
	matthltc, dipankar, rohitseth, menage, linux-mm

Pavel Emelianov wrote:
> Balbir Singh wrote:
>> Pavel Emelianov wrote:
>>> [snip]
>>>
>>>> Reclaimable memory
>>>>
>>>> (i)   Anonymous pages - Anonymous pages are pages allocated by the user space,
>>>>       they are mapped into the user page tables, but not backed by a file.
>>> I do not agree with such classification.
>>> When one maps file then kernel can remove page from address
>>> space as there is already space on disk for it. When one
>>> maps an anonymous page then kernel won't remove this page
>>> for sure as system may simply be configured to be swapless.
>> Yes, I agree if there is no swap space, then anonymous memory is pinned.
>> Assuming that we'll end up using a an abstraction on top of the
>> existing reclaim mechanism, the mechanism would know if a particular
>> type of memory is reclaimable or not.
> 
> If memory is considered to be unreclaimable then actions should be
> taken at mmap() time, not later! Rejecting mmap() is the only way to
> limit user in unreclaimable memory consumption.

That's like disabling memory over-commit in the regular kernel.
Don't you think this should again be based on the system's configuration
of over-commit?

[snip]

> 
>> I understand that kernel memory accounting is the first priority for
>> containers, but accounting kernel memory requires too many changes
>> to the VM core, hence I was hesitant to put it up as first priority.
> 
> Among all the kernel-code-intrusive patches in BC patch set
> kmemsize hooks are the most "conservative" - only one place
> is heavily patched - this is slab allocator. Buddy is patched,
> but _significantly_ smaller. The rest of the patch adds __GFP_BC
> flags to some allocations and SLAB_BC to some kmem_caches.
> 
> User memory controlling patch is much heavier...
> 

Please see Rohit's memory controller for how it patches user
memory accounting. It seems much simpler.

> I'd set priorities of development that way:
> 
> 1. core infrastructure (mainly headers)
> 2. interface
> 3. kernel memory hooks and accounting
> 4. mappings hooks and accounting
> 5. physical pages hooks and accounting
> 6. user pages reclamation
> 7. moving threads between beancounters
> 8. make beancounter persistent

I would prefer a different order.

1 & 2: for now we could use any interface and then start developing the
controller. As we develop the new controller, we are likely to find the
need to add to/enhance the interface, so freezing 1 & 2 now might not be
a good idea.

I would put 4, 5 and 6 ahead of 3, based on the changes I see in Rohit's
memory controller.

Then take up the rest.

-- 

	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] RFC: Memory Controller
  2006-10-31 10:54         ` Balbir Singh
@ 2006-10-31 11:15           ` Pavel Emelianov
  2006-10-31 12:39             ` Balbir Singh
                               ` (2 more replies)
  0 siblings, 3 replies; 135+ messages in thread
From: Pavel Emelianov @ 2006-10-31 11:15 UTC (permalink / raw)
  To: balbir
  Cc: Pavel Emelianov, vatsa, dev, sekharan, ckrm-tech, haveblue,
	linux-kernel, pj, matthltc, dipankar, rohitseth, menage, linux-mm


[snip]

> That's like disabling memory over-commit in the regular kernel.

Nope. We limit only unreclaimable mappings. Allowing user
to break limits breaks the sense of limit.

Or you do not agree that allowing unlimited unreclaimable
mappings doesn't alow you the way to cut groups gracefully?

[snip]

> Please see the patching of Rohit's memory controller for user
> level patching. It seems much simpler.

Could you send me a URL where I can get the patch from, please,
or the patch itself directly to me. Thank you.

[snip]

> I would prefer a different set
> 
> 1 & 2, for now we could use any interface and then start developing the
> controller. As we develop the new controller, we are likely to find the
> need to add/enhance the interface, so freezing in on 1 & 2 might not be
> a good idea.

Paul Menage won't agree. He believes that the interface must come first.
I also remind you that the latest beancounter patch provides all the
stuff we're discussing. It can move tasks, limit all three resources
discussed, reclaim memory and so on. And a configfs interface could be
attached easily.

> I would put 4, 5 and 6 ahead of 3, based on the changes I see in Rohit's
> memory controller.
> 
> Then take up the rest.

I'll review Rohit's patches and comment.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-30 20:47             ` Paul Menage
  2006-10-30 20:56               ` Paul Jackson
  2006-10-30 21:03               ` Paul Menage
@ 2006-10-31 11:53               ` Srivatsa Vaddagiri
  2006-10-31 13:31                 ` Srivatsa Vaddagiri
  2006-10-31 16:46                 ` Paul Menage
  2006-11-01  4:39               ` David Rientjes
  3 siblings, 2 replies; 135+ messages in thread
From: Srivatsa Vaddagiri @ 2006-10-31 11:53 UTC (permalink / raw)
  To: Paul Menage
  Cc: Paul Jackson, dev, sekharan, ckrm-tech, balbir, haveblue,
	linux-kernel, matthltc, dipankar, rohitseth

On Mon, Oct 30, 2006 at 12:47:59PM -0800, Paul Menage wrote:
> On 10/30/06, Paul Jackson <pj@sgi.com> wrote:
> >
> >In other words you are recommending delivering a system that internally
> >tracks separate hierarchies for each resource control entity, but where
> >the user can conveniently overlap some of these hierarchies and deal
> >with them as a single hierarchy.
> 
> More or less. More concretely:
> 
> - there is a single hierarchy of process containers
> - each process is a member of exactly one process container
> 
> - for each resource controller, there's a hierarchy of resource "nodes"
> - each process container is associated with exactly one resource node
> of each type
> 
> - by default, the process container hierarchy and the resource node
> hierarchies are isomorphic, but that can be controlled by userspace.

For the case where the resource node hierarchy is different from the
process container hierarchy, I am trying to make sense of why we need to
maintain two hierarchies - one the actual hierarchy used for resource
control purposes, the other the process container hierarchy. What purpose
does maintaining the process container hierarchy (in addition to the
resource controller hierarchy) serve?

I am thinking we can avoid maintaining these two hierarchies, by 
something on these lines:

	mkdir /dev/cpu
	mount -t container -ocpu container /dev/cpu

		-> Represents a hierarchy for cpu control purpose.

		   tsk->cpurc	= represent the node in the cpu
				  controller hierarchy. Also maintains 
				  resource allocation information for
				  this node.

		   tsk->cpurc->parent = parent node.

	mkdir /dev/mem
	mount -t container -omem container /dev/mem

		-> Represents a hierarchy for mem control purpose.

		   tsk->memrc	= represent the node in the mem
				  controller hierarchy. Also maintains 
				  resource allocation information for
				  this node.

		   tsk->memrc->parent = parent node.


	mkdir /dev/containers
	mount -t container -ocontainer container /dev/container

		-> Represents a (mostly flat?) hierarchy for the real 
	  	   container (virtualization) purpose.

		   tsk->container = represent the node in the container
				    hierarchy. Also maintains relavant 
				    container information for this node.

		   tsk->container->parent = parent node.


I suspect this may simplify the "container" filesystem, since it doesn't
have to track multiple hierarchies at the same time, and improve lock
contention too (modifying the cpu controller hierarchy can take a different
lock than the mem controller hierarchy).
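The layout above can be sketched in C (the tsk->cpurc and tsk->memrc field names come from the mail; the struct shapes are invented): each controller owns its hierarchy, and moving a task in one hierarchy never touches the other.

```c
#include <assert.h>
#include <stddef.h>

struct rc_node {
    struct rc_node *parent;   /* NULL for the hierarchy root */
    unsigned long alloc;      /* controller-specific allocation data */
};

struct task {
    struct rc_node *cpurc;    /* node in the cpu controller hierarchy */
    struct rc_node *memrc;    /* node in the mem controller hierarchy */
};

/* Reattaching a task in one hierarchy is a single pointer swap
 * and needs only that controller's lock. */
static void task_move_cpu(struct task *tsk, struct rc_node *node)
{
    tsk->cpurc = node;
}
```

Because the two hierarchies share no nodes, the cpu and mem filesystems can each take their own lock, which is the contention benefit claimed above.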

-- 
Regards,
vatsa

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] RFC: Memory Controller
  2006-10-31 11:15           ` Pavel Emelianov
@ 2006-10-31 12:39             ` Balbir Singh
  2006-10-31 14:19               ` Pavel Emelianov
  2006-10-31 16:54             ` Paul Menage
  2006-11-01  6:00             ` David Rientjes
  2 siblings, 1 reply; 135+ messages in thread
From: Balbir Singh @ 2006-10-31 12:39 UTC (permalink / raw)
  To: Pavel Emelianov
  Cc: vatsa, dev, sekharan, ckrm-tech, haveblue, linux-kernel, pj,
	matthltc, dipankar, rohitseth, menage, linux-mm

Pavel Emelianov wrote:
>> That's like disabling memory over-commit in the regular kernel.
> 
> Nope. We limit only unreclaimable mappings. Allowing user
> to break limits breaks the sense of limit.
> 
> Or you do not agree that allowing unlimited unreclaimable
> mappings doesn't alow you the way to cut groups gracefully?
> 


A quick code review showed that most of the accounting is the
same.

I see that most of the mmap accounting code seems to do
the equivalent of security_vm_enough_memory() when VM_ACCOUNT
is set. Maybe we could merge the accounting code to handle
containers as well.

I looked at:

do_mmap_pgoff
acct_stack_growth
__do_brk
do_mremap
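The merge suggested here could look roughly like this (all names invented as a sketch): route both the global overcommit check that security_vm_enough_memory() performs for VM_ACCOUNT vmas and the per-container charge through one helper, so the call sites listed above need only a single hook.

```c
#include <assert.h>
#include <stddef.h>

static unsigned long committed_global; /* what overcommit tracks today */
#define GLOBAL_LIMIT 1000UL            /* stand-in for the commit limit */

struct container_acct {
    unsigned long held;
    unsigned long limit;
};

/* One helper for both checks; a NULL container means only the
 * global overcommit rule applies, as in today's kernel. */
static int vm_acct_charge(struct container_acct *c, unsigned long pages)
{
    if (committed_global + pages > GLOBAL_LIMIT)
        return -1;                     /* global overcommit refusal */
    if (c && c->held + pages > c->limit)
        return -1;                     /* per-container refusal */
    committed_global += pages;
    if (c)
        c->held += pages;
    return 0;
}
```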


> [snip]
> 
>> Please see the patching of Rohit's memory controller for user
>> level patching. It seems much simpler.
> 
> Could you send me an URL where to get the patch from, please.
> Or the patch itself directly to me. Thank you.

Please see http://lkml.org/lkml/2006/9/19/283

> 
> [snip]
> 
>> I would prefer a different set
>>
>> 1 & 2, for now we could use any interface and then start developing the
>> controller. As we develop the new controller, we are likely to find the
>> need to add/enhance the interface, so freezing in on 1 & 2 might not be
>> a good idea.
> 
> Paul Menage won't agree. He believes that interface must come first.
> I also remind you that the latest beancounter patch provides all the
> stuff we're discussing. It may move tasks, limit all three resources
> discussed, reclaim memory and so on. And configfs interface could be
> attached easily.
> 

I think the interface should depend on the controllers and not
the other way around. I fear that the infrastructure discussion might
hold us back and no fruitful work will happen on the controllers.
Once we add and agree on a controller, we can then look at the
interface requirements (like persistence if kernel memory is being
tracked, etc.). What do you think?

>> I would put 4, 5 and 6 ahead of 3, based on the changes I see in Rohit's
>> memory controller.
>>
>> Then take up the rest.
> 
> I'll review Rohit's patches and comment.

ok



-- 
	Thanks,
	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-31 11:53               ` Srivatsa Vaddagiri
@ 2006-10-31 13:31                 ` Srivatsa Vaddagiri
  2006-10-31 16:46                 ` Paul Menage
  1 sibling, 0 replies; 135+ messages in thread
From: Srivatsa Vaddagiri @ 2006-10-31 13:31 UTC (permalink / raw)
  To: Paul Menage
  Cc: dev, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	Paul Jackson, matthltc, dipankar, rohitseth

On Tue, Oct 31, 2006 at 05:23:43PM +0530, Srivatsa Vaddagiri wrote:
> 	mount -t container -ocpu container /dev/cpu
> 
> 		-> Represents a hierarchy for cpu control purpose.
> 
> 		   tsk->cpurc	= represent the node in the cpu
> 				  controller hierarchy. Also maintains 
> 				  resource allocation information for
> 				  this node.

I suspect this will lead to code like:

	if (something->..->options == cpu)
		tsk->cpurc = ..
	else if (something->..->options == mem)
		tsk->memrc = ..

I don't know enough about filesystems at the moment to say whether such
code is avoidable.



-- 
Regards,
vatsa

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] RFC: Memory Controller
  2006-10-31 12:39             ` Balbir Singh
@ 2006-10-31 14:19               ` Pavel Emelianov
  0 siblings, 0 replies; 135+ messages in thread
From: Pavel Emelianov @ 2006-10-31 14:19 UTC (permalink / raw)
  To: balbir, menage
  Cc: Pavel Emelianov, vatsa, dev, sekharan, ckrm-tech, haveblue,
	linux-kernel, pj, matthltc, dipankar, rohitseth, linux-mm

[snip]

> A quick code review showed that most of the accounting is the
> same.
> 
> I see that most of the mmap accounting code, it seems to do
> the equivalent of security_vm_enough_memory() when VM_ACCOUNT
> is set. May be we could merge the accounting code to handle
> even containers.
> 
> I looked at
> 
> do_mmap_pgoff
> acct_stack_growth
> __do_brk (
> do_mremap

I'm sure this is possible. I'll take this into account
in the next patch series. Thank you.

>> [snip]
>>
>>> Please see the patching of Rohit's memory controller for user
>>> level patching. It seems much simpler.
>> Could you send me an URL where to get the patch from, please.
>> Or the patch itself directly to me. Thank you.
> 
> Please see http://lkml.org/lkml/2006/9/19/283

Thanks. I'll review it in a couple of days and comment.

[snip]

> I think the interface should depend on the controllers and not
> the other way around. I fear that the infrastructure discussion might
> hold us back and no fruitful work will happen on the controllers.
> Once we add and agree on the controller, we can then look at the
> interface requirements (like persistence if kernel memory is being
> tracked, etc). What do you think?

I do agree with you. But we have to reach an agreement with
Paul on this as well...

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-30 15:26         ` Pavel Emelianov
  2006-10-31  0:26           ` Matt Helsley
@ 2006-10-31 16:33           ` Chris Friesen
  1 sibling, 0 replies; 135+ messages in thread
From: Chris Friesen @ 2006-10-31 16:33 UTC (permalink / raw)
  To: Pavel Emelianov
  Cc: Paul Jackson, vatsa, dev, sekharan, menage, ckrm-tech, balbir,
	haveblue, linux-kernel, matthltc, dipankar, rohitseth, devel

Pavel Emelianov wrote:
> Paul Jackson wrote:

I agree, but you've cut some important questions away,
so I ask them again:
> 
>  > What if if user creates a controller (configfs directory)
>  > and doesn't remove it at all. Should controller stay in
>  > memory even if nobody uses it?
> 
This is important to solve now - whether or not we want to
keep "empty" beancounters in memory. If we do not, then configfs
usage is not acceptable.

I can certainly see scenarios where we would want to keep "empty" 
beancounters around.

For instance, I move all the tasks out of a group but still want to be 
able to obtain stats on how much cpu time the group has used.

Maybe we can do that without persisting the actual beancounters...I'm 
not familiar enough with the code to say.

Chris

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-30 14:08 ` Pavel Emelianov
  2006-10-30 14:23   ` Paul Jackson
  2006-10-30 18:01   ` Paul Menage
@ 2006-10-31 16:34   ` Srivatsa Vaddagiri
  2006-11-01  8:01     ` Pavel Emelianov
  2 siblings, 1 reply; 135+ messages in thread
From: Srivatsa Vaddagiri @ 2006-10-31 16:34 UTC (permalink / raw)
  To: Pavel Emelianov
  Cc: dev, sekharan, menage, ckrm-tech, balbir, haveblue, linux-kernel,
	pj, matthltc, dipankar, rohitseth, devel

On Mon, Oct 30, 2006 at 05:08:03PM +0300, Pavel Emelianov wrote:
> 1. One of the major configfs ideas is that lifetime of
>    the objects is completely driven by userspace.
>    Resource controller shouldn't live as long as user
>    want. It "may", but not "must"! As you have seen from
>    our (beancounters) patches beancounters disapeared
>    as soon as the last reference was dropped. Removing
>    configfs entries on beancounter's automatic destruction
>    is possible, but it breaks the logic of configfs.

cpusets have a neat flag called notify_on_release. If set, a userspace
agent is invoked when the last task exits from a cpuset.

Can't we use a similar flag as a configfs file and (if set) invoke a
userspace agent (to clean up) upon the last reference drop? How would this
violate the logic of configfs?

> 2. Having configfs as the only interface doesn't allow
>    people to have a resource control facility w/o configfs.
>    A resource controller must not depend on any "feature".

One flexibility configfs (and any fs-based interface) offers is, as Matt
pointed out some time back, the ability to delegate management of a
sub-tree to a particular user (without requiring root permission).

For ex:

			/
			|
		 -----------------
		|		  |
	       vatsa (70%)	linux (20%)
		|
	 ----------------------------------
	|	         | 	          |
      browser (10%)   compile (50%)    editor (10%)

In this, group 'vatsa' has been allotted a 70% share of the cpu. Also, user
'vatsa' has been given permission to manage this share as he wants. If
the cpu controller supports hierarchy, user 'vatsa' can create further
sub-groups (browser, compile etc.) -without- requiring root access.

Also, it is convenient to manipulate the resource hierarchy/parameters through a
shell script if the interface is fs-based.
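As a rough illustration of what delegated, hierarchical share management
implies for a controller (a userspace sketch; the struct/function names and
the fixed child limit are invented, not from any posted patch), each attempt
to create or resize a sub-group would have to be validated against the
parent's delegated share:

```c
#include <assert.h>

#define MAX_CHILDREN 8

struct rc_group {
	int share;			/* percent of the parent's cpu */
	int nr_children;
	struct rc_group *child[MAX_CHILDREN];
};

/* Sum of the shares requested by the direct children. */
static int rc_children_share(const struct rc_group *g)
{
	int i, sum = 0;

	for (i = 0; i < g->nr_children; i++)
		sum += g->child[i]->share;
	return sum;
}

/* A mkdir + "echo share" sequence in an fs interface would map to a
 * check like this: refuse a child share that overcommits the parent,
 * so a delegated (non-root) owner can never exceed his own allotment. */
static int rc_add_child(struct rc_group *parent, struct rc_group *child)
{
	if (parent->nr_children >= MAX_CHILDREN)
		return -1;
	if (rc_children_share(parent) + child->share > parent->share)
		return -1;		/* would exceed the delegated share */
	parent->child[parent->nr_children++] = child;
	return 0;
}
```

With this check in place, the 'vatsa' sub-tree above can be split into
browser/compile/editor (10+50+10 <= 70) without root involvement, while a
fourth group asking for 20% would be rejected.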

> 3. Configfs may be easily implemented later as an additional
>    interface. I propose the following solution:

Ideally we should have one interface - either syscall or configfs - and
not both.

Assuming your requirement of auto-deleting objects in configfs can be
met through something similar to cpuset's notify_on_release, what other
killer problem do you think configfs will pose?


> > 	- Should we have different groupings for different resources?
> 
> This breaks the idea of groups isolation.

Sorry, I don't get you here. Are you saying we should support different
groupings for different controllers?

> > 	- Support movement of all threads of a process from one group
> > 	  to another atomically?
> 
> This is not a critical question. This is something that
> has difference in

It can be a significant pain for some workloads. I have heard that
workload management products often encounter processes with anywhere
between 200 and 700 threads. Moving all those threads one by
one from user space can suck.
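The race that makes one-by-one movement painful can be sketched as below
(purely illustrative; group membership is modelled as one int per thread,
and the function names are invented): if a thread forks while userspace is
midway through moving tids, the child can land in the old group, whereas an
in-kernel all-threads move leaves no such window:

```c
#include <assert.h>

#define NTHREADS 8

/* group[i] is the group id of thread i; a forked child inherits the
 * group of its parent thread at fork time. */

/* Userspace loop: one write per tid.  If a fork happens at step
 * 'fork_at' off a not-yet-moved thread, the child stays behind. */
static void move_one_by_one(int group[], int n, int dst,
			    int fork_at, int *forked)
{
	int i;

	for (i = 0; i < n; i++) {
		if (i == fork_at)
			*forked = group[n - 1];	/* parent not moved yet */
		group[i] = dst;
	}
}

/* In-kernel variant: all threads move under one lock, so any later
 * fork can only observe the new group. */
static void move_atomic(int group[], int n, int dst, int *forked)
{
	int i;

	for (i = 0; i < n; i++)
		group[i] = dst;
	*forked = dst;
}
```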


-- 
Regards,
vatsa

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-31  8:31     ` Pavel Emelianov
@ 2006-10-31 16:34       ` Paul Menage
  2006-10-31 16:57         ` Srivatsa Vaddagiri
  2006-11-01  7:58         ` Pavel Emelianov
  0 siblings, 2 replies; 135+ messages in thread
From: Paul Menage @ 2006-10-31 16:34 UTC (permalink / raw)
  To: Pavel Emelianov
  Cc: dev, vatsa, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	pj, matthltc, dipankar, rohitseth, devel

On 10/31/06, Pavel Emelianov <xemul@openvz.org> wrote:
>
> That's functionality a user may want. I agree that some users
> may want to create some kind of "persistent" beancounters, but
> this must not be the only way to control them. I like the way
> TUN devices are done. Each has a TUN_PERSIST flag controlling
> whether or not to destroy the device right on closing. I think that
> we may have something similar - a flag BC_PERSISTENT to keep
> beancounters with a zero refcount in memory to reuse them.

How about the cpusets approach, where once a cpuset has no children
and no processes, a usermode helper can be executed - this could
immediately remove the container/bean-counter if that's what the user
wants. My generic containers patch copies this from cpusets.
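The two lifetime policies being discussed could be sketched roughly as
below (a userspace simulation; the flag names, the struct, and the callback
standing in for the usermode helper are all illustrative, not from any
posted patch): a TUN_PERSIST-style flag keeps an unreferenced group in
memory, while a cpuset-style notify_on_release flag instead hands the
decision to a release agent, which may remove the group immediately:

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_PERSISTENT	0x1
#define GROUP_NOTIFY_ON_RELEASE	0x2

struct group {
	int refcount;
	int flags;
	int destroyed;
	void (*release_agent)(struct group *);	/* usermode helper stand-in */
};

static void group_get(struct group *g)
{
	g->refcount++;
}

static void group_put(struct group *g)
{
	if (--g->refcount > 0)
		return;
	if (g->flags & GROUP_PERSISTENT)
		return;				/* kept in memory for reuse */
	if ((g->flags & GROUP_NOTIFY_ON_RELEASE) && g->release_agent)
		g->release_agent(g);		/* policy decided "in userspace" */
	else
		g->destroyed = 1;		/* automatic destruction */
}

/* What a release agent might choose to do. */
static void remove_now(struct group *g)
{
	g->destroyed = 1;
}
```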

>
> Moreover, I hope you agree that beancounters can't be made as
> module. If so user will have to built-in configfs, and thus
> CONFIG_CONFIGFS_FS essentially becomes "bool", not a "tristate".

How about a small custom filesystem as part of the containers support,
then? I'm not wedded to using configfs itself, but I do think that a
filesystem interface is much more debuggable and extensible than a
system call interface, and the simple filesystem is only a couple of
hundred lines.

Paul

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-31 11:53               ` Srivatsa Vaddagiri
  2006-10-31 13:31                 ` Srivatsa Vaddagiri
@ 2006-10-31 16:46                 ` Paul Menage
  2006-11-01 17:25                   ` Srivatsa Vaddagiri
  1 sibling, 1 reply; 135+ messages in thread
From: Paul Menage @ 2006-10-31 16:46 UTC (permalink / raw)
  To: vatsa
  Cc: Paul Jackson, dev, sekharan, ckrm-tech, balbir, haveblue,
	linux-kernel, matthltc, dipankar, rohitseth

On 10/31/06, Srivatsa Vaddagiri <vatsa@in.ibm.com> wrote:
> For the case where resource node hierarchy is different from process
> container hierarchy, I am trying to make sense of "why do we need to
> maintain two hierarchies" - one the actual hierarchy used for resource
> control purpose, another the process container hierarchy. What purpose
> does maintaining the process container hierarchy (in addition to the
> resource controller hierarchy) solve?

The idea is that in general, people aren't going to want to have
separate hierarchies for different resources - they're going to have
the hierarchies be the same for all resources. So in general when they
move a process from one container to another, they're going to want to
move that task to use all the new resource limits/guarantees
simultaneously.

Having completely independent hierarchies makes this more difficult -
you have to manually maintain multiple different hierarchies from
userspace. Suppose a task forks while you're moving it from one
container to another? With the approach that each process is in one
container, and each container is in a set of resource nodes, at least
the child task is either entirely in the new resource limits or
entirely in the old limits - if userspace has to update several
hierarchies at once non-atomically then a freshly forked child could
end up with a mixture of resource nodes.

>
> I am thinking we can avoid maintaining these two hierarchies, by
> something on these lines:
>
>         mkdir /dev/cpu
>         mount -t container -ocpu container /dev/cpu
>
>                 -> Represents a hierarchy for cpu control purpose.
>
>                    tsk->cpurc   = represent the node in the cpu
>                                   controller hierarchy. Also maintains
>                                   resource allocation information for
>                                   this node.
>

If we were going to do something like this, hopefully it would look
more like an array of generic container subsystems, rather than a
separate named pointer for each subsystem.
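Such an array-based layout might look roughly like this (a userspace
sketch; the subsystem ids, struct names, and fields are invented for
illustration): adding a new controller then means registering a new
subsystem id rather than adding yet another named pointer to the task
struct:

```c
#include <assert.h>
#include <stddef.h>

enum subsys_id {
	CPU_SUBSYS,
	MEM_SUBSYS,
	MAX_SUBSYS,
};

struct resource_node {
	struct resource_node *parent;	/* node in the controller hierarchy */
	long limit;			/* controller-specific state */
};

struct task {
	/* one slot per registered subsystem, instead of named fields
	 * like tsk->cpurc and tsk->memrc; struct task never changes
	 * when a controller is added */
	struct resource_node *subsys[MAX_SUBSYS];
};

static struct resource_node *task_subsys_node(struct task *t,
					      enum subsys_id id)
{
	return t->subsys[id];
}
```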

>
>         mkdir /dev/mem
>         mount -t container -omem container /dev/mem
>
>                 -> Represents a hierarchy for mem control purpose.
>
>                    tsk->memrc   = represent the node in the mem
>                                   controller hierarchy. Also maintains
>                                   resource allocation information for
>                                   this node.
>
>                    tsk->memrc->parent = parent node.
>
>
>         mkdir /dev/containers
>         mount -t container -ocontainer container /dev/container
>
>                 -> Represents a (mostly flat?) hierarchy for the real
>                    container (virtualization) purpose.

I think we have an overloading of terminology here. By "container" I
just mean "group of processes tracked for resource control and other
purposes". Can we use a term like "virtual server" if you're doing
virtualization? I.e., a virtual server would be a specialization of a
container (effectively analogous to a resource controller).

>
> I suspect this may simplify the "container" filesystem, since it doesn't
> have to track multiple hierarchies at the same time, and improve lock
> contention too (modifying the cpu controller hierarchy can take a different
> lock than the mem controller hierarchy).

Do you think that lock contention when modifying hierarchies is
generally going to be an issue - how often do tasks get moved around
in the hierarchy, compared to the other operations going on on the
system?

Paul

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] RFC: Memory Controller
  2006-10-31 11:15           ` Pavel Emelianov
  2006-10-31 12:39             ` Balbir Singh
@ 2006-10-31 16:54             ` Paul Menage
  2006-11-01  6:00             ` David Rientjes
  2 siblings, 0 replies; 135+ messages in thread
From: Paul Menage @ 2006-10-31 16:54 UTC (permalink / raw)
  To: Pavel Emelianov
  Cc: balbir, vatsa, dev, sekharan, ckrm-tech, haveblue, linux-kernel,
	pj, matthltc, dipankar, rohitseth, linux-mm

On 10/31/06, Pavel Emelianov <xemul@openvz.org> wrote:
>
> Paul Menage won't agree. He believes that the interface must come first.

No, I'm just trying to get agreement on the generic infrastructure for
process containers and extensibility - the actual API to the memory
controller (i.e. what limits, what to track, etc) can presumably be
fitted into  the generic mechanism fairly easily (or else the
infrastructure probably isn't generic enough).

Paul

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-31 16:34       ` Paul Menage
@ 2006-10-31 16:57         ` Srivatsa Vaddagiri
  2006-11-01  7:58         ` Pavel Emelianov
  1 sibling, 0 replies; 135+ messages in thread
From: Srivatsa Vaddagiri @ 2006-10-31 16:57 UTC (permalink / raw)
  To: Paul Menage
  Cc: Pavel Emelianov, dev, sekharan, ckrm-tech, balbir, haveblue,
	linux-kernel, pj, matthltc, dipankar, rohitseth, devel

On Tue, Oct 31, 2006 at 08:34:52AM -0800, Paul Menage wrote:
> How about the cpusets approach, where once a cpuset has no children
> and no processes, a usermode helper can be executed - this could
> immediately remove the container/bean-counter if that's what the user
> wants. My generic containers patch copies this from cpusets.

Bingo. We crossed mails!

Kirill/Pavel,
	As I mentioned at the beginning of this thread, one of the
objectives of this RFC is to seek consensus on what could be a good
compromise for the infrastructure going forward. Paul Menage's
patches, being a rework of existing code, are attractive to maintainers like
Andrew.

From that perspective, how well do you think the container
infrastructure patches meet your needs?

-- 
Regards,
vatsa

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] RFC: Memory Controller
  2006-10-31  8:48       ` Pavel Emelianov
  2006-10-31 10:54         ` Balbir Singh
@ 2006-10-31 17:04         ` Dave Hansen
  2006-11-01  7:57           ` Pavel Emelianov
  1 sibling, 1 reply; 135+ messages in thread
From: Dave Hansen @ 2006-10-31 17:04 UTC (permalink / raw)
  To: Pavel Emelianov
  Cc: balbir, vatsa, dev, sekharan, ckrm-tech, linux-kernel, pj,
	matthltc, dipankar, rohitseth, menage

On Tue, 2006-10-31 at 11:48 +0300, Pavel Emelianov wrote:
> If memory is considered to be unreclaimable then actions should be
> taken at mmap() time, not later! Rejecting mmap() is the only way to
> limit a user's unreclaimable memory consumption.

I don't think this is necessarily true.  Today, if a kernel exceeds its
allocation limits (runs out of memory), it gets killed.  Doing the
limiting at mmap() time instead of fault time will keep sparse memory
applications from even being able to run.

Now, failing an mmap() is a wee bit more graceful than a SIGBUS, but it
certainly introduces its own set of problems.
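The trade-off can be sketched as follows (a userspace simulation with
invented names and sizes, not from any posted patch): charging a group at
mmap() time bills a sparse mapping for its entire size, so a large mapping
is rejected up front, while charging at fault time bills only the pages
actually touched, deferring any failure to fault time:

```c
#include <assert.h>

#define PAGE_SIZE 4096L

/* mmap()-time policy: the whole mapping is charged immediately.
 * Returns 0 if accepted, -1 if mmap() itself would fail. */
static int charge_at_mmap(long *used, long limit, long map_bytes)
{
	if (*used + map_bytes > limit)
		return -1;
	*used += map_bytes;
	return 0;
}

/* Fault-time policy: one page charged on first touch.  A failure
 * here surfaces later, as a SIGBUS/OOM rather than an mmap() error. */
static int charge_at_fault(long *used, long limit)
{
	if (*used + PAGE_SIZE > limit)
		return -1;
	*used += PAGE_SIZE;
	return 0;
}
```

A sparse application mapping 1024 pages but touching only 4 of them is
rejected outright by the first policy yet runs comfortably under the
second, which is the point made above.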

-- Dave


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] RFC: Memory Controller
  2006-10-30 18:14       ` Paul Menage
@ 2006-10-31 17:07         ` Balbir Singh
  2006-10-31 17:22           ` Paul Menage
  0 siblings, 1 reply; 135+ messages in thread
From: Balbir Singh @ 2006-10-31 17:07 UTC (permalink / raw)
  To: Paul Menage
  Cc: dev, vatsa, sekharan, ckrm-tech, haveblue, linux-kernel, pj,
	matthltc, dipankar, rohitseth

Paul Menage wrote:
> On 10/30/06, Balbir Singh <balbir@in.ibm.com> wrote:
>> You'll also end up with per zone page cache pools for each zone. A list of
>> active/inactive pages per zone (which will split up the global LRU list).
> 
> Yes, these are some of the inefficiencies that we're ironing out.
> 
>> What about the hard-partitioning. If a container/cpuset is not using its full
>> 64MB of a fake node, can some other node use it?
> 
> No. So the granularity at which you can divide up the system depends
> on how big your fake nodes are. For our purposes, we figure that 64MB
> granularity should be fine.
> 

I am still a little concerned about how limit size changes will be implemented.
Will the cpuset "mems" field change to reflect the changed limits?

>> Also, won't you end up
>> with a big zonelist?
> 
> Yes - but PaulJ's recent patch to speed up the zone selection helped
> reduce the overhead of this a lot.

Great! Let me find those patches.

> 
>> Consider the other side of the story. Let's say we have a shared lib shared
>> among quite a few containers. We limit the usage of the inode containing
>> the shared library to 50M. Tasks A and B use some part of the library
>> and cause the container "C" to reach the limit. Container C is charged
>> for all usage of the shared library. Now no other task, irrespective of which
>> container it belongs to, can touch any new pages of the shared library.
> 
> Well, if the pages aren't mlocked then presumably some of the existing
> pages can be flushed out to disk and replaced with other pages.
> 
>> What you are suggesting is to virtually group the inodes by container rather
>> than task. It might make sense in some cases, but not all.
> 
> Right - I think it's an important feature to be able to support, but I
> agree that it's not suitable for all situations.

>> We could consider implementing the controllers in phases
>>
>> 1. RSS control (anon + mapped pages)
>> 2. Page Cache control
> 
> Page cache control is actually more essential than RSS control, in our
> experience - it's pretty easy to track RSS values from userspace, and
> react reasonably quickly to kill things that go over their limit, but
> determining page cache usage (i.e. determining which job on the system
> is flooding the page cache with dirty buffers) is pretty much
> impossible currently.
> 

Hmm... interesting. Why do you think it's impossible? What are the kinds of
issues you've run into?


> Paul
> 

-- 

	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] RFC: Memory Controller
  2006-10-31 17:07         ` Balbir Singh
@ 2006-10-31 17:22           ` Paul Menage
  2006-10-31 18:16             ` Badari Pulavarty
  2006-11-01  7:05             ` Balbir Singh
  0 siblings, 2 replies; 135+ messages in thread
From: Paul Menage @ 2006-10-31 17:22 UTC (permalink / raw)
  To: balbir
  Cc: dev, vatsa, sekharan, ckrm-tech, haveblue, linux-kernel, pj,
	matthltc, dipankar, rohitseth

On 10/31/06, Balbir Singh <balbir@in.ibm.com> wrote:
>
> I am still a little concerned about how limit size changes will be implemented.
> Will the cpuset "mems" field change to reflect the changed limits?

That's how we've been doing it - increasing limits is easy, shrinking
them is harder ...

> > Page cache control is actually more essential that RSS control, in our
> > experience - it's pretty easy to track RSS values from userspace, and
> > react reasonably quickly to kill things that go over their limit, but
> > determining page cache usage (i.e. determining which job on the system
> > is flooding the page cache with dirty buffers) is pretty much
> > impossible currently.
> >
>
> Hmm... interesting. Why do you think its impossible, what are the kinds of
> issues you've run into?
>

Issues such as:

- determining from userspace how much of the page cache is really
"free" memory that can be given out to new jobs without impacting the
performance of existing jobs

- determining which job on the system is flooding the page cache with
dirty buffers

- accounting the active pagecache usage of a job as part of its memory
footprint (if a process is only 1MB large but is seeking randomly
through a 1GB file, treating it as only using/needing 1MB isn't
practical).
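The last point can be sketched as follows (a scaled-down userspace
simulation; the names and sizes are invented): a job's RSS stays tiny while
the set of distinct page-cache pages it has touched keeps growing, so a
controller would want to account both as part of the job's footprint:

```c
#include <assert.h>
#include <string.h>

#define FILE_PAGES 256		/* stand-in for a large file */

struct job_acct {
	long rss_pages;			/* anonymous/mapped memory */
	long pagecache_pages;		/* distinct file pages touched */
	unsigned char touched[FILE_PAGES];
};

/* Charge each page-cache page to the job once, on first touch. */
static void job_read_page(struct job_acct *j, int pgoff)
{
	if (!j->touched[pgoff]) {
		j->touched[pgoff] = 1;
		j->pagecache_pages++;
	}
}

/* The footprint a controller would account: RSS alone badly
 * underestimates a small process seeking through a huge file. */
static long job_footprint(const struct job_acct *j)
{
	return j->rss_pages + j->pagecache_pages;
}
```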

Paul

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] RFC: Memory Controller
  2006-10-31 17:22           ` Paul Menage
@ 2006-10-31 18:16             ` Badari Pulavarty
  2006-11-01  7:05             ` Balbir Singh
  1 sibling, 0 replies; 135+ messages in thread
From: Badari Pulavarty @ 2006-10-31 18:16 UTC (permalink / raw)
  To: Paul Menage
  Cc: balbir, dev, vatsa, sekharan, ckrm-tech, haveblue, lkml, pj,
	matthltc, dipankar, rohitseth

On Tue, 2006-10-31 at 09:22 -0800, Paul Menage wrote:

> >
> > Hmm... interesting. Why do you think its impossible, what are the kinds of
> > issues you've run into?
> >
> 
> Issues such as:
> 
> - determining from userspace how much of the page cache is really
> "free" memory that can be given out to new jobs without impacting the
> performance of existing jobs
> 
> - determining which job on the system is flooding the page cache with
> dirty buffers
> 

Interesting .. these are exactly the questions our database people
have been asking us for a few years :)

Thanks,
Badari


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-30 20:47             ` Paul Menage
                                 ` (2 preceding siblings ...)
  2006-10-31 11:53               ` Srivatsa Vaddagiri
@ 2006-11-01  4:39               ` David Rientjes
  2006-11-01  9:50                 ` Paul Jackson
                                   ` (2 more replies)
  3 siblings, 3 replies; 135+ messages in thread
From: David Rientjes @ 2006-11-01  4:39 UTC (permalink / raw)
  To: Paul Menage
  Cc: Paul Jackson, dev, vatsa, sekharan, ckrm-tech, balbir, haveblue,
	linux-kernel, matthltc, dipankar, rohitseth

On Mon, 30 Oct 2006, Paul Menage wrote:

> More or less. More concretely:
> 
> - there is a single hierarchy of process containers
> - each process is a member of exactly one process container
> 
> - for each resource controller, there's a hierarchy of resource "nodes"
> - each process container is associated with exactly one resource node
> of each type
> 
> - by default, the process container hierarchy and the resource node
> hierarchies are isomorphic, but that can be controlled by userspace.
> 

This approach appears to be the most complete and extensible 
implementation of containers for all practical uses.  Not only can you use 
these process containers in conjunction with your choice of memory 
controllers, network controllers, disk I/O controllers, etc, but you can 
also pick and choose your own modular controller of choice to meet your 
needs.

So here's our three process containers, A, B, and C, with our tasks m-t:

	-----A-----	-----B-----	-----C-----
	|    |    |     |    |    |     |    |
	m    n    o	p    q    r	s    t

Here's our memory controller groups D and E and our containers set within 
them:

	-----D-----	-----E-----
	|         |	|
	A         B	C

 [ My memory controller E is for my real-time processes so I set its
   attributes appropriately so that it never breaks. ]

And our network controller groups F, G, and H:

	-----F-----	-----G-----
			|         |
		   -----H-----    C
		   |         |
		   A	     B

 [ I'm going to leave my network controller F open for my customer's
   WWW browsing, but nobody is using it right now. ]

I choose not to control disk I/O so there is no change from current behavior 
for any of the processes listed above.

There's two things I notice about this approach (my use of the word 
"container" refers to the process containers A, B, and C; my use of the 
word "controller" refers to memory, disk I/O, network, etc controllers):

 - While the process containers are only single-level, the controllers are
   _inherently_ hierarchical just like a filesystem.  So it appears that
   the manipulation of these controllers would most effectively be done
   from userspace with a filesystem approach.  While it may not be best
   served by forcing CONFIG_CONFIGFS_FS to be enabled, I see no objection to
   giving it its own filesystem capability, apart from configfs, through 
   the kernel.  The filesystem manipulation tools that everybody is
   familiar with make the implementation of controllers simple and, more
   importantly, easier to _use_.

 - The process containers will need to be set up as desired following
   boot.  So if the current approach of cpusets is used, where the
   functionality is enabled on mount, all processes will originally belong
   to the default container that encompasses the entire system.  Since
   each process must belong to only one process container as per Paul
   Menage's proposal, a new container will need to be created and
   processes _moved_ to it for later use by controllers.  So it appears
   that the manipulation of containers would most effectively be done from
   userspace by a syscall approach.

In this scenario, it is not necessary for network controller groups F and 
G above to be limited (or guaranteed) to 100% of our network load.  It is 
quite possible that we do not assign every container to a network 
controller so that they receive the remainder of the bandwidth that is not 
already attributed to F and G.  The same is true with any controller.  Our 
controllers should only seek to limit or guarantee a certain amount of 
resources, not force each system process to be a member of one group or 
another to receive the resources.

Two questions also arise:

 - Why do I need to create (i.e. mount the filesystem) the container in
   the first place?  Since the use of these containers is entirely on the 
   shoulders of the optional controllers, there should be no interference 
   with current behavior if I choose not to use any controller.  So why 
   not take the approach that NUMA did, whereby if we're on a UMA machine, 
   all of memory belongs to node 0?  In our case, all processes will 
   inherently belong to a system-wide container similar to procfs.  In
   fact, procfs is how this can be implemented apart from configfs
   following the criticism from UBC.

 - How is forking handled with the various controllers?  Do child 
   processes automatically inherit all the controller groups of their
   parent?  If not (or if it is dependent on a user-configured attribute
   of the controller), what happens when I want forked processes to
   belong to a new network controller group from container A in the
   illustration above?  Certainly that new controller cannot be
   created as a sibling of F and G; and determining the limit on
   network for a third child of H would be non-trivial because then
   the network resources allocated to A and B would be scaled back
   perhaps in an undesired manner.

So the container abstraction looks appropriate for a syscall interface 
whereas a controller abstraction looks appropriate for a filesystem 
interface.  If Paul Menage's proposal above is adopted, it seems like 
the design and implementation of containers is the first milestone; how 
far does the current patchset get us to what is described above?  Does it 
still support a hierarchy just like cpusets?

And following that, it seems like the next milestone would be to design 
the different characteristics that the various modular controllers could 
support such as notify_on_release, limits/guarantees, behavior on fork, 
and privileges.

		David

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] RFC: Memory Controller
  2006-10-31 11:15           ` Pavel Emelianov
  2006-10-31 12:39             ` Balbir Singh
  2006-10-31 16:54             ` Paul Menage
@ 2006-11-01  6:00             ` David Rientjes
  2006-11-01  8:05               ` Pavel Emelianov
  2 siblings, 1 reply; 135+ messages in thread
From: David Rientjes @ 2006-11-01  6:00 UTC (permalink / raw)
  To: Pavel Emelianov
  Cc: balbir, vatsa, dev, sekharan, ckrm-tech, haveblue, linux-kernel,
	pj, matthltc, dipankar, rohitseth, menage, linux-mm

On Tue, 31 Oct 2006, Pavel Emelianov wrote:

> Paul Menage won't agree. He believes that interface must come first.
> I also remind you that the latest beancounter patch provides all the
> stuff we're discussing. It may move tasks, limit all three resources
> discussed, reclaim memory and so on. And configfs interface could be
> attached easily.
> 

There are really two different interfaces: those to the controller and those 
to the container.  While configfs (or a simpler fs implementation solely 
for our purposes) is the most logical because of its inherent hierarchical 
nature, it seems like the only criticism of that has come from UBC.  From 
my understanding of beancounters, they could be implemented on top of any 
such container abstraction anyway.

		David

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] RFC: Memory Controller
  2006-10-31 17:22           ` Paul Menage
  2006-10-31 18:16             ` Badari Pulavarty
@ 2006-11-01  7:05             ` Balbir Singh
  2006-11-01  7:07               ` Paul Menage
  1 sibling, 1 reply; 135+ messages in thread
From: Balbir Singh @ 2006-11-01  7:05 UTC (permalink / raw)
  To: Paul Menage
  Cc: dev, vatsa, sekharan, ckrm-tech, haveblue, linux-kernel, pj,
	matthltc, dipankar, rohitseth

Paul Menage wrote:
> On 10/31/06, Balbir Singh <balbir@in.ibm.com> wrote:
>> I am still a little concerned about how limit size changes will be implemented.
>> Will the cpuset "mems" field change to reflect the changed limits?
> 
> That's how we've been doing it - increasing limits is easy, shrinking
> them is harder ...
> 
>>> Page cache control is actually more essential that RSS control, in our
>>> experience - it's pretty easy to track RSS values from userspace, and
>>> react reasonably quickly to kill things that go over their limit, but
>>> determining page cache usage (i.e. determining which job on the system
>>> is flooding the page cache with dirty buffers) is pretty much
>>> impossible currently.
>>>
>> Hmm... interesting. Why do you think its impossible, what are the kinds of
>> issues you've run into?
>>
> 
> Issues such as:
> 
> - determining from userspace how much of the page cache is really
> "free" memory that can be given out to new jobs without impacting the
> performance of existing jobs
> 
> - determining which job on the system is flooding the page cache with
> dirty buffers
> 
> - accounting the active pagecache usage of a job as part of its memory
> footprint (if a process is only 1MB large but is seeking randomly
> through a 1GB file, treating it as only using/needing 1MB isn't
> practical).
> 
> Paul
> 

Thanks for the info!

I thought this would be hard to do in general, but with a page -->
container mapping that will come as a result of the memory controller,
will it still be that hard?

I'll dig deeper.

-- 

	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] RFC: Memory Controller
  2006-11-01  7:05             ` Balbir Singh
@ 2006-11-01  7:07               ` Paul Menage
  2006-11-01  7:44                 ` Balbir Singh
  0 siblings, 1 reply; 135+ messages in thread
From: Paul Menage @ 2006-11-01  7:07 UTC (permalink / raw)
  To: balbir
  Cc: dev, vatsa, sekharan, ckrm-tech, haveblue, linux-kernel, pj,
	matthltc, dipankar, rohitseth

On 10/31/06, Balbir Singh <balbir@in.ibm.com> wrote:
>
> I thought this would be hard to do in general, but with a page -->
> container mapping that will come as a result of the memory controller,
> will it still be that hard?

I meant that it's pretty much impossible with the current APIs
provided by the kernel. That's why one of the most useful things that
a memory controller can provide is accounting and limiting of page
cache usage.

Paul

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] RFC: Memory Controller
  2006-11-01  7:07               ` Paul Menage
@ 2006-11-01  7:44                 ` Balbir Singh
  2006-11-01 12:23                   ` Paul Jackson
  0 siblings, 1 reply; 135+ messages in thread
From: Balbir Singh @ 2006-11-01  7:44 UTC (permalink / raw)
  To: Paul Menage
  Cc: dev, vatsa, sekharan, ckrm-tech, haveblue, linux-kernel, pj,
	matthltc, dipankar, rohitseth

Paul Menage wrote:
> On 10/31/06, Balbir Singh <balbir@in.ibm.com> wrote:
>> I thought this would be hard to do in general, but with a page -->
>> container mapping that will come as a result of the memory controller,
>> will it still be that hard?
> 
> I meant that it's pretty much impossible with the current APIs
> provided by the kernel. That's why one of the most useful things that
> a memory controller can provide is accounting and limiting of page
> cache usage.
> 
> Paul

Thanks for clarifying that! I completely agree, page cache control is
very important!

-- 

	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] RFC: Memory Controller
  2006-10-31 17:04         ` Dave Hansen
@ 2006-11-01  7:57           ` Pavel Emelianov
  0 siblings, 0 replies; 135+ messages in thread
From: Pavel Emelianov @ 2006-11-01  7:57 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Pavel Emelianov, balbir, vatsa, dev, sekharan, ckrm-tech,
	linux-kernel, pj, matthltc, dipankar, rohitseth, menage

Dave Hansen wrote:
> On Tue, 2006-10-31 at 11:48 +0300, Pavel Emelianov wrote:
>> If memory is considered to be unreclaimable then actions should be
>> taken at mmap() time, not later! Rejecting mmap() is the only way to
>> limit a user's unreclaimable memory consumption.
> 
> I don't think this is necessarily true.  Today, if a kernel exceeds its
> allocation limits (runs out of memory) it gets killed.  Doing the
> limiting at mmap() time instead of fault time will keep sparse memory
> applications from even being able to run.

If we limit _every_ mapping it will, but when limiting only
"private" mappings there are no problems at all. The BC code has lived
for more than 3 years already, with no complaints from users on this
question yet.

> Now, failing an mmap() is a wee bit more graceful than a SIGBUS, but it
> certainly introduces its own set of problems.
> 
> -- Dave
> 
> 


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-31 16:34       ` Paul Menage
  2006-10-31 16:57         ` Srivatsa Vaddagiri
@ 2006-11-01  7:58         ` Pavel Emelianov
  1 sibling, 0 replies; 135+ messages in thread
From: Pavel Emelianov @ 2006-11-01  7:58 UTC (permalink / raw)
  To: Paul Menage
  Cc: Pavel Emelianov, dev, vatsa, sekharan, ckrm-tech, balbir,
	haveblue, linux-kernel, pj, matthltc, dipankar, rohitseth, devel

Paul Menage wrote:
> On 10/31/06, Pavel Emelianov <xemul@openvz.org> wrote:
>>
>> That's functionality a user may want. I agree that some users
>> may want to create some kind of "persistent" beancounters, but
>> this must not be the only way to control them. I like the way
>> TUN devices are done. Each has a TUN_PERSIST flag controlling
>> whether or not to destroy the device right on closing. I think that
>> we may have something similar - a flag BC_PERSISTENT to keep
>> beancounters with a zero refcount in memory to reuse them.
> 
> How about the cpusets approach, where once a cpuset has no children
> and no processes, a usermode helper can be executed - this could

Hmm... Sounds good. I'll think this over.

> immediately remove the container/bean-counter if that's what the user
> wants. My generic containers patch copies this from cpusets.
> 
>>
>> Moreover, I hope you agree that beancounters can't be made a
>> module. If so, users will have to build configfs in, and thus
>> CONFIG_CONFIGFS_FS essentially becomes a "bool", not a "tristate".
> 
> How about a small custom filesystem as part of the containers support,
> then? I'm not wedded to using configfs itself, but I do think that a
> filesystem interface is much more debuggable and extensible than a
> system call interface, and the simple filesystem is only a couple of
> hundred lines.

This sounds more reasonable to me than using configfs.

> Paul
> 



* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-31 16:34   ` Srivatsa Vaddagiri
@ 2006-11-01  8:01     ` Pavel Emelianov
  2006-11-01 16:04       ` Matt Helsley
  2006-11-01 17:50       ` Srivatsa Vaddagiri
  0 siblings, 2 replies; 135+ messages in thread
From: Pavel Emelianov @ 2006-11-01  8:01 UTC (permalink / raw)
  To: vatsa
  Cc: Pavel Emelianov, dev, sekharan, menage, ckrm-tech, balbir,
	haveblue, linux-kernel, pj, matthltc, dipankar, rohitseth, devel

[snip]

>> 2. Having configfs as the only interface doesn't allow
>>    people to have the resource control facility without configfs.
>>    A resource controller must not depend on any "feature".
> 
> One flexibility configfs (and any fs-based interface) offers is, as Matt
> had pointed out some time back, the ability to delegate management of a
> sub-tree to a particular user (without requiring root permission).
> 
> For ex:
> 
> 			/
> 			|
> 		 -----------------
> 		|		  |
> 	       vatsa (70%)	linux (20%)
> 		|
> 	 ----------------------------------
> 	|	         | 	          |
>       browser (10%)   compile (50%)    editor (10%)
> 
> In this, group 'vatsa' has been allotted a 70% share of the cpu. Also,
> user 'vatsa' has been given permission to manage this share as he wants.
> If the cpu controller supports hierarchy, user 'vatsa' can create further
> sub-groups (browser, compile, etc.) -without- requiring root access.

I can do the same using bcctl tool and sudo :)

> Also it is convenient to manipulate the resource hierarchy/parameters
> through a shell script if it is fs-based.
> 
>> 3. Configfs may be easily implemented later as an additional
>>    interface. I propose the following solution:
> 
> Ideally we should have one interface - either syscall or configfs - and
> not both.

Agree.

> Assuming your requirement of auto-deleting objects in configfs can be
> met through something similar to cpuset's notify_on_release, what other
> killer problem do you think configfs will pose?
> 
> 
>>> 	- Should we have different groupings for different resources?
>> This breaks the idea of groups isolation.
> 
> Sorry, I don't get you here. Are you saying we should support different
> groupings for different controllers?

Not me, but other people in this thread.


* Re: [ckrm-tech] RFC: Memory Controller
  2006-11-01  6:00             ` David Rientjes
@ 2006-11-01  8:05               ` Pavel Emelianov
  2006-11-01  8:35                 ` David Rientjes
  0 siblings, 1 reply; 135+ messages in thread
From: Pavel Emelianov @ 2006-11-01  8:05 UTC (permalink / raw)
  To: David Rientjes
  Cc: Pavel Emelianov, balbir, vatsa, dev, sekharan, ckrm-tech,
	haveblue, linux-kernel, pj, matthltc, dipankar, rohitseth, menage,
	linux-mm

David Rientjes wrote:
> On Tue, 31 Oct 2006, Pavel Emelianov wrote:
> 
>> Paul Menage won't agree. He believes that interface must come first.
>> I also remind you that the latest beancounter patch provides all the
>> stuff we're discussing. It may move tasks, limit all three resources
>> discussed, reclaim memory and so on. And configfs interface could be
>> attached easily.
>>
> 
There are really two different interfaces: those to the controller and those 
to the container.  While configfs (or a simpler fs implementation solely 
for our purposes) is the most logical because of its inherent hierarchical 
nature, it seems like the only criticism of it has come from UBC.  From 
my understanding of beancounters, they could be implemented on top of any 
such container abstraction anyway.

beancounters may be implemented on top of any (or nearly any) userspace
interface, no question. But since we're trying to come to an agreement
here, I'm just stating my point of view.

I don't mind having a filesystem-based interface; I just believe that
configfs is not a good fit for it. I've already answered that having
our own filesystem sounds better than using configfs.

Maybe we can summarize what we have come to?

> 		David
> 



* Re: [ckrm-tech] RFC: Memory Controller
  2006-11-01  8:05               ` Pavel Emelianov
@ 2006-11-01  8:35                 ` David Rientjes
  0 siblings, 0 replies; 135+ messages in thread
From: David Rientjes @ 2006-11-01  8:35 UTC (permalink / raw)
  To: Pavel Emelianov
  Cc: balbir, vatsa, dev, sekharan, ckrm-tech, haveblue, linux-kernel,
	pj, matthltc, dipankar, rohitseth, menage, linux-mm

On Wed, 1 Nov 2006, Pavel Emelianov wrote:

> beancounters may be implemented on top of any (or nearly any) userspace
> interface, no question. But since we're trying to come to an agreement
> here, I'm just stating my point of view.
> 
> I don't mind having a filesystem-based interface; I just believe that
> configfs is not a good fit for it. I've already answered that having
> our own filesystem sounds better than using configfs.
> 
> Maybe we can summarize what we have come to?
> 

I've seen nothing but praise for Paul Menage's suggestion of implementing 
a single-level containers abstraction for processes and attaching 
these to various resource controller (disk, network, memory, cpu) nodes.  
The question of whether or not to use configfs is really at the forefront 
of that discussion, because making any progress on the implementation is 
difficult without first deciding upon it, and the containers abstraction 
patchset uses configfs as its interface.

The original objection to configfs concerned the lifetime of the resource 
controller.  But this is actually a two-part question, since there are two 
interfaces: one for the containers, one for the controllers.  At present it 
seems like the only discussion taking place is that of the container, so 
this objection can wait.  After boot, there are two options:

 - require the user to mount the configfs filesystem with a single
   system-wide container as default

    i. include all processes in that container by default

   ii. include no processes in that container, force the user to add them

 - create the entire container abstraction upon boot and attach all
   processes to it in a manner similar to procfs

 [ In both scenarios, kernel behavior is unchanged if no resource
   controller node is attached to any container as if the container(s)
   didn't exist. ]

Another objection to configfs was the fact that you must enable 
CONFIG_CONFIGFS_FS to use CONFIG_CONTAINERS.  This objection does not make 
much sense, since it seems we are heading in the direction of abandoning 
the syscall approach here and looking toward an fs approach in the first 
place.  So CONFIG_CONTAINERS will need to include its own lightweight 
filesystem if we cannot use CONFIG_CONFIGFS_FS, but that seems redundant, 
since this is exactly what configfs is for: a configurable filesystem to 
interface to the kernel.  We definitely do not want two or more interfaces 
to _containers_, so building our own filesystem would just reimplement an 
already existing infrastructure.

The criticism that users can create containers and then not use them 
shouldn't be an issue if it is carefully implemented.  In fact, I proposed 
that all processes are initially attached to a single system-wide 
container at boot, regardless of whether you've loaded any controllers, 
just like how UMA machines work with node 0 for system-wide memory.  We 
should incur no overhead for having empty or _full_ containers if we 
haven't loaded controllers or have configured them properly to include the 
right containers.

So if we re-read Paul Menage's patchset that abstracts containers away 
from cpusets using configfs, we can see that we are almost there, with the 
exception of making it a single-layer "hierarchy" as he has already 
proposed.  Resource controller "nodes" that these containers can be 
attached to are a separate issue at this point and shouldn't be confused.

		David


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-30 10:33 [RFC] Resource Management - Infrastructure choices Srivatsa Vaddagiri
                   ` (3 preceding siblings ...)
  2006-10-30 14:08 ` Pavel Emelianov
@ 2006-11-01  9:30 ` Pavel Emelianov
  2006-11-01  9:53   ` David Rientjes
  2006-11-01 18:12   ` Srivatsa Vaddagiri
  4 siblings, 2 replies; 135+ messages in thread
From: Pavel Emelianov @ 2006-11-01  9:30 UTC (permalink / raw)
  To: vatsa
  Cc: dev, sekharan, menage, ckrm-tech, balbir, haveblue, linux-kernel,
	pj, matthltc, dipankar, rohitseth

> Consensus/Debated Points
> ------------------------
> 
> Consensus:
> 
> 	- Provide resource control over a group of tasks 
> 	- Support movement of a task from one resource group to another
> 	- Don't support hierarchy for now
> 	- Support limits (soft and/or hard depending on the resource
> 	  type) in controllers. The guarantee feature could be
> 	  indirectly met through limits.
> 
> Debated:
> 	- syscall vs configfs interface

OK. Let's settle on the configfs interface so we can move on...

> 	- Interaction of resource controllers, containers and cpusets
> 		- Should we support, for instance, creation of resource
> 		  groups/containers under a cpuset?
> 	- Should we have different groupings for different resources?

I propose to discuss this question as this is the most important
now from my point of view.

I believe this can be done, but can't imagine how to use this...

> 	- Support movement of all threads of a process from one group
> 	  to another atomically?

I propose the following solution: if a user asks to move /proc/<pid>,
then move the whole task with all its threads.
If the user asks to move /proc/<pid>/task/<tid>, then move just
a single thread.

What do you think?
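The proposed rule can be sketched as a tiny userspace model (the paths follow the real procfs layout, but the decision function itself is hypothetical):

```python
# Decide move granularity from the /proc path being acted on, as proposed:
#   /proc/<pid>            -> move the whole thread group
#   /proc/<pid>/task/<tid> -> move only that one thread
import re

def move_scope(path):
    m = re.fullmatch(r"/proc/(\d+)/task/(\d+)", path)
    if m:
        return ("thread", int(m.group(2)))
    m = re.fullmatch(r"/proc/(\d+)", path)
    if m:
        return ("process", int(m.group(1)))
    raise ValueError("not a movable proc path: " + path)

assert move_scope("/proc/1234") == ("process", 1234)
assert move_scope("/proc/1234/task/1236") == ("thread", 1236)
```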


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-01  4:39               ` David Rientjes
@ 2006-11-01  9:50                 ` Paul Jackson
  2006-11-01  9:58                   ` David Rientjes
  2006-11-01 15:59                 ` Srivatsa Vaddagiri
  2006-11-01 18:19                 ` Srivatsa Vaddagiri
  2 siblings, 1 reply; 135+ messages in thread
From: Paul Jackson @ 2006-11-01  9:50 UTC (permalink / raw)
  To: David Rientjes
  Cc: menage, dev, vatsa, sekharan, ckrm-tech, balbir, haveblue,
	linux-kernel, matthltc, dipankar, rohitseth

David wrote:
>  - While the process containers are only single-level, the controllers are
>    _inherently_ hierarchical, just like a filesystem.  So it appears that

Cpusets certainly enjoys what I would call hierarchical process
containers.  I can't tell if your flat container space is just
a "for instance", or you're recommending we only have a flat
container space.

If the latter, I disagree.

> So it appears
>   that the manipulation of containers would most effectively be done from
>   userspace by a syscall approach.

Yup - sure sounds like you're advocating a flat container space
accessed by system calls.

Sure doesn't sound right to me.  I like hierarchical containers,
accessed via something like a file system.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-01  9:30 ` Pavel Emelianov
@ 2006-11-01  9:53   ` David Rientjes
  2006-11-01 22:23     ` Matt Helsley
  2006-11-01 18:12   ` Srivatsa Vaddagiri
  1 sibling, 1 reply; 135+ messages in thread
From: David Rientjes @ 2006-11-01  9:53 UTC (permalink / raw)
  To: Pavel Emelianov
  Cc: vatsa, dev, sekharan, menage, ckrm-tech, balbir, haveblue,
	linux-kernel, pj, matthltc, dipankar, rohitseth

On Wed, 1 Nov 2006, Pavel Emelianov wrote:

> > 	- Interaction of resource controllers, containers and cpusets
> > 		- Should we support, for instance, creation of resource
> > 		  groups/containers under a cpuset?
> > 	- Should we have different groupings for different resources?
> 
> I propose to discuss this question as this is the most important
> now from my point of view.
> 
> I believe this can be done, but can't imagine how to use this...
> 

I think cpusets, as abstracted away from containers by Paul Menage, simply 
become a client of the container configfs.  Cpusets would become more of a 
NUMA-type controller by default.

Different groupings for different resources was already discussed.  If we 
use the approach of a single-level "hierarchy" for process containers and 
then attach them each to a "node" of a controller, then the groupings have 
been achieved.  It's possible to change the network controller of a 
container or move processes from container to container easily through the 
filesystem.

> > 	- Support movement of all threads of a process from one group
> > 	  to another atomically?
> 
> I propose the following solution: if a user asks to move /proc/<pid>,
> then move the whole task with all its threads.
> If the user asks to move /proc/<pid>/task/<tid>, then move just
> a single thread.
> 
> What do you think?

This seems to use my proposal of using procfs as an abstraction of process 
containers.  I haven't looked at the implementation details, but it seems 
like the most appropriate place given what it currently supports.  
Naturally it should be an atomic move, but I don't think it's the most 
important detail in terms of efficiency, because moving threads should not 
be such a frequent occurrence anyway.  This raises the question of how 
forks are handled for processes with regard to the various controllers 
that could be implemented, and whether they should all be descendants of 
the parent container by default or have the option of spawning a new 
controller altogether.  This would be an attribute of controllers and 
not containers, however.

		David


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-01  9:50                 ` Paul Jackson
@ 2006-11-01  9:58                   ` David Rientjes
  0 siblings, 0 replies; 135+ messages in thread
From: David Rientjes @ 2006-11-01  9:58 UTC (permalink / raw)
  To: Paul Jackson
  Cc: menage, dev, vatsa, sekharan, ckrm-tech, balbir, haveblue,
	linux-kernel, matthltc, dipankar, rohitseth

On Wed, 1 Nov 2006, Paul Jackson wrote:

> David wrote:
> >  - While the process containers are only single-level, the controllers are
> >    _inherently_ hierarchical, just like a filesystem.  So it appears that
> 
> Cpusets certainly enjoys what I would call hierarchical process
> containers.  I can't tell if your flat container space is just
> a "for instance", or you're recommending we only have a flat
> container space.
> 

This was using the recommendation of "each process belongs to a single 
container that can be attached to controller nodes later."  So while it is 
indeed possible for the controllers, whatever they are, to be hierarchical 
(and most assuredly they should be), what is the objection to grouping 
processes in single-level containers?  The only difference is that now, 
when we assign processes to specific controllers with their attributes set 
as we desire, we are assigning a container (or group) of processes instead 
of individual ones.

		David


* Re: [ckrm-tech] RFC: Memory Controller
  2006-11-01  7:44                 ` Balbir Singh
@ 2006-11-01 12:23                   ` Paul Jackson
  2006-11-02  0:09                     ` Paul Menage
  0 siblings, 1 reply; 135+ messages in thread
From: Paul Jackson @ 2006-11-01 12:23 UTC (permalink / raw)
  To: balbir
  Cc: menage, dev, vatsa, sekharan, ckrm-tech, haveblue, linux-kernel,
	matthltc, dipankar, rohitseth

Balbir wrote:
> Paul Menage wrote:
> > On 10/31/06, Balbir Singh <balbir@in.ibm.com> wrote:
> >> I thought this would be hard to do in general, but with a page -->
> >> container mapping that will come as a result of the memory controller,
> >> will it still be that hard?
> > 
> > I meant that it's pretty much impossible with the current APIs
> > provided by the kernel. That's why one of the most useful things that
> > a memory controller can provide is accounting and limiting of page
> > cache usage.
> > 
> > Paul
> 
> Thanks for clarifying that! I completely agree, page cache control is
> very important!

Doesn't "zone_reclaim" (added by Christoph Lameter over the last
several months) go a long way toward resolving this page cache control
problem?

Essentially, if my understanding is correct, zone reclaim makes tasks
that are asking for memory first do some work toward keeping enough
memory free, such as reclaiming slab memory and pushing swap and
pushing dirty buffers to disk.

Tasks must help out as needed to keep the per-node free memory above
the watermarks.

This way, you don't actually have to account for who owns what, with
all the problems arbitrating between claims on shared resources.

Rather, you just charge the next customer who comes in the front
door (aka, mm/page_alloc.c:__alloc_pages()) a modest overhead if
they happen to show up when free memory supplies are running short.

On average, it has the same effect as a strict accounting system:
charging the heavy users more (more CPU cycles in kernel vmscan
code and clock cycles waiting on disk heads).  But it does so without
any need for accurate per-user accounting.
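The "charge the next customer" idea can be modelled in a few lines (illustrative only; the watermark, the reclaim cost, and the work accounting are invented numbers, not the actual zone_reclaim code):

```python
# Toy allocator: whoever allocates while free memory is below the
# watermark pays the reclaim cost, so heavy users statistically pay more
# even though nobody tracks per-user ownership of pages.

class Zone:
    def __init__(self, free_pages, watermark):
        self.free = free_pages
        self.watermark = watermark
        self.work = {}                 # task -> reclaim work units paid

    def alloc(self, task, pages):
        if self.free - pages < self.watermark:
            # the caller does reclaim work before its allocation proceeds
            self.work[task] = self.work.get(task, 0) + pages
            self.free += pages         # pretend reclaim freed that much
        self.free -= pages

zone = Zone(free_pages=100, watermark=50)
for _ in range(40):
    zone.alloc("heavy", 2)             # 80 pages total
zone.alloc("light", 2)                 # 2 pages total

# The heavy allocator ends up paying most of the reclaim work, without
# the zone ever accounting who owns which page.
assert zone.work.get("heavy", 0) > zone.work.get("light", 0)
```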

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-01  4:39               ` David Rientjes
  2006-11-01  9:50                 ` Paul Jackson
@ 2006-11-01 15:59                 ` Srivatsa Vaddagiri
  2006-11-01 16:31                   ` Srivatsa Vaddagiri
  2006-11-01 21:05                   ` David Rientjes
  2006-11-01 18:19                 ` Srivatsa Vaddagiri
  2 siblings, 2 replies; 135+ messages in thread
From: Srivatsa Vaddagiri @ 2006-11-01 15:59 UTC (permalink / raw)
  To: David Rientjes
  Cc: Paul Menage, Paul Jackson, dev, sekharan, ckrm-tech, balbir,
	haveblue, linux-kernel, matthltc, dipankar, rohitseth

On Tue, Oct 31, 2006 at 08:39:27PM -0800, David Rientjes wrote:
> So here's our three process containers, A, B, and C, with our tasks m-t:
> 
> 	-----A-----	-----B-----	-----C-----
> 	|    |    |     |    |    |     |    |
> 	m    n    o	p    q    r	s    t
> 
> Here's our memory controller groups D and E and our containers set within 
> them:
> 
> 	-----D-----	-----E-----
> 	|         |	|
> 	A         B	C

This would force all tasks in container A to belong to the same mem/io ctlr 
groups. What if that is not desired? How would we achieve something like
this:

	tasks (m) should belong to mem ctlr group D,
	tasks (n, o) should belong to mem ctlr group E
	tasks (m, n, o) should belong to i/o ctlr group G

(this example breaks the required condition/assumption that a task belongs 
to exactly one process container).

Is this an unrealistic requirement? I suspect not, and we should provide
this flexibility if we ever have to support task groupings that are unique
to each resource. Fundamentally, process groupings exist because of the
various resources, and not otherwise.

At this point, what purpose does having/exposing-to-the-user the generic
process container abstraction A, B and C achieve?
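Independent groupings per resource, as in the example above, amount to one task-to-group map per controller; a tiny model (names m/n/o and D/E/G taken from the example, the data structure itself invented):

```python
# Per-resource grouping: each controller keeps its own task -> group map,
# so a task's memory group is independent of its i/o group.

groups = {
    "mem": {"m": "D", "n": "E", "o": "E"},
    "io":  {"m": "G", "n": "G", "o": "G"},
}

def group_of(task, resource):
    return groups[resource][task]

# m and n share an i/o group but sit in different memory groups -
# something a single process-container membership cannot express.
assert group_of("m", "io") == group_of("n", "io") == "G"
assert group_of("m", "mem") != group_of("n", "mem")
```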

IMHO, what is more practical is to let res ctlr groups (like D, E, F, G)
be composed of individual tasks (rather than containers).

Note that none of this is saying that Paul Menage's patches are
pointless. In fact, his generalization of cpusets to achieve process
grouping is indeed a good idea. I am only saying that his mechanism
should be used to define groups-of-tasks under each resource, rather
than to have groups-of-containers under each resource.


-- 
Regards,
vatsa


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-01  8:01     ` Pavel Emelianov
@ 2006-11-01 16:04       ` Matt Helsley
  2006-11-01 17:51         ` Srivatsa Vaddagiri
  2006-11-01 17:50       ` Srivatsa Vaddagiri
  1 sibling, 1 reply; 135+ messages in thread
From: Matt Helsley @ 2006-11-01 16:04 UTC (permalink / raw)
  To: Pavel Emelianov
  Cc: vatsa, dev, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	pj, dipankar, rohitseth, menage, devel

On Wed, 2006-11-01 at 11:01 +0300, Pavel Emelianov wrote:
> [snip]
> 
> >> 2. Having configfs as the only interface doesn't allow
> >>    people to have the resource control facility without configfs.
> >>    A resource controller must not depend on any "feature".

	That's not true. It's possible for a resource control system that uses
a filesystem interface to operate without its filesystem interface. In
fact, for performance reasons I think it's necessary.

	Even assuming your point is true, since you agree there should be only
one interface, does it matter that choosing one prevents implementing
another?

	Why must a resource controller never depend on another "feature"?

> > One flexibility configfs (and any fs-based interface) offers is, as Matt
> > had pointed out some time back, the ability to delegate management of a
> > sub-tree to a particular user (without requiring root permission).
> > 
> > For ex:
> > 
> > 			/
> > 			|
> > 		 -----------------
> > 		|		  |
> > 	       vatsa (70%)	linux (20%)
> > 		|
> > 	 ----------------------------------
> > 	|	         | 	          |
> >       browser (10%)   compile (50%)    editor (10%)
> > 
> > In this, group 'vatsa' has been allotted a 70% share of the cpu. Also,
> > user 'vatsa' has been given permission to manage this share as he wants.
> > If the cpu controller supports hierarchy, user 'vatsa' can create further
> > sub-groups (browser, compile, etc.) -without- requiring root access.
> 
> I can do the same using bcctl tool and sudo :)

bcctl and, to a lesser extent, sudo are more esoteric.

Open, read, write, mkdir, unlink, etc. are all system calls so it seems
we all agree that system calls are the way to go. ;) Now if only we
could all agree on which system calls...

> > Also it is convenient to manipulate the resource hierarchy/parameters
> > through a shell script if it is fs-based.
> > 
> >> 3. Configfs may be easily implemented later as an additional
> >>    interface. I propose the following solution:
> > 
> > Ideally we should have one interface - either syscall or configfs - and
> > not both.

To incorporate all feedback perhaps we should replace "configfs" with
"filesystem".

Cheers,
	-Matt Helsley



* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-01 15:59                 ` Srivatsa Vaddagiri
@ 2006-11-01 16:31                   ` Srivatsa Vaddagiri
  2006-11-01 21:05                   ` David Rientjes
  1 sibling, 0 replies; 135+ messages in thread
From: Srivatsa Vaddagiri @ 2006-11-01 16:31 UTC (permalink / raw)
  To: David Rientjes
  Cc: dev, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	Paul Jackson, matthltc, dipankar, rohitseth, Paul Menage

On Wed, Nov 01, 2006 at 09:29:37PM +0530, Srivatsa Vaddagiri wrote:
> This would force all tasks in container A to belong to the same mem/io ctlr 
> groups. What if that is not desired? How would we achieve something like
> this:
> 
> 	tasks (m) should belong to mem ctlr group D,
> 	tasks (n, o) should belong to mem ctlr group E
> 	tasks (m, n, o) should belong to i/o ctlr group G
> 
> (this example breaks the required condition/assumption that a task belongs 
> to exactly one process container).
> 
> Is this an unrealistic requirement? I suspect not, and we should provide
> this flexibility if we ever have to support task groupings that are unique
> to each resource. Fundamentally, process groupings exist because of the
> various resources, and not otherwise.

In this article, http://lwn.net/Articles/94573/, Linus is quoted as
wanting something close to the above example, I think.

-- 
Regards,
vatsa


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-31 16:46                 ` Paul Menage
@ 2006-11-01 17:25                   ` Srivatsa Vaddagiri
  2006-11-01 23:37                     ` Paul Menage
  0 siblings, 1 reply; 135+ messages in thread
From: Srivatsa Vaddagiri @ 2006-11-01 17:25 UTC (permalink / raw)
  To: Paul Menage
  Cc: Paul Jackson, dev, sekharan, ckrm-tech, balbir, haveblue,
	linux-kernel, matthltc, dipankar, rohitseth

On Tue, Oct 31, 2006 at 08:46:00AM -0800, Paul Menage wrote:
> The idea is that in general, people aren't going to want to have
> separate hierarchies for different resources - they're going to have
> the hierarchies be the same for all resources. So in general when they
> move a process from one container to another, they're going to want to
> move that task to use all the new resources limits/guarantees
> simultaneously.

Sure, a reasonable enough requirement.

> Having completely independent hierarchies makes this more difficult -
> you have to manually maintain multiple different hierarchies from
> userspace.

I suspect we can avoid maintaining separate hierarchies if not required.

        mkdir /dev/res_groups
        mount -t container -o cpu,mem,io none /dev/res_groups
        mkdir /dev/res_groups/A
        mkdir /dev/res_groups/B

Directories A and B would now contain res ctl files associated with all
resources (viz. cpu, mem, io) and also a 'members' file listing the tasks
belonging to those groups.

Do you think the above mechanism is implementable? Even if it is, I don't 
know how complicated the implementation will get because of this requirement.

> Suppose a task forks while you're moving it from one
> container to another? With the approach that each process is in one
> container, and each container is in a set of resource nodes, at least

This requirement that each process should be in exactly one process container
is perhaps not good, since it will not give the flexibility to define groups
unique to each resource (see my earlier reply to David Rientjes).

> the child task is either entirely in the new resource limits or
> entirely in the old limits - if userspace has to update several
> hierarchies at once non-atomically then a freshly forked child could
> end up with a mixture of resource nodes.

If the user intended to have a common grouping hierarchy for all
resources, then this movement of tasks can be "atomic" as far as the user
is concerned, as per the above example:

        echo task_pid > /dev/res_groups/B/members

should cause the task to transition to the new group in one shot?
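A sketch of that single-hierarchy semantics (userspace model; the subsystem names and the 'members' file come from the example above, the data structures are invented): binding cpu, mem, and io to one mount means one membership write retargets the task for all three controllers at once.

```python
# One hierarchy bound to several subsystems: writing a pid to a group's
# 'members' file moves the task for every bound controller atomically.

class Hierarchy:
    def __init__(self, subsystems):
        self.subsystems = subsystems       # e.g. ["cpu", "mem", "io"]
        self.membership = {}               # pid -> group name

    def write_members(self, group, pid):
        self.membership[pid] = group       # single atomic update

    def group_for(self, pid, subsystem):
        assert subsystem in self.subsystems
        return self.membership[pid]

h = Hierarchy(["cpu", "mem", "io"])
h.write_members("A", 1234)
h.write_members("B", 1234)                 # echo 1234 > B/members

# After the one write, all three controllers see the task in group B.
assert [h.group_for(1234, s) for s in h.subsystems] == ["B", "B", "B"]
```

With separate per-resource hierarchies the same move would be three writes, and a fork between them could leave the child with a mixture of nodes.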

> >I am thinking we can avoid maintaining these two hierarchies, by
> >something on these lines:
> >
> >        mkdir /dev/cpu
> >        mount -t container -ocpu container /dev/cpu
> >
> >                -> Represents a hierarchy for cpu control purpose.
> >
> >                   tsk->cpurc   = represent the node in the cpu
> >                                  controller hierarchy. Also maintains
> >                                  resource allocation information for
> >                                  this node.
> >
> 
> If we were going to do something like this, hopefully it would look
> more like an array of generic container subsystems, rather than a
> separate named pointer for each subsystem.

Sounds good.

> I think we have an overloading of terminology here. By "container" I
> just mean "group of processes tracked for resource control and other
> purposes". Can we use a term like "virtual server" if you're doing
> virtualization? I.e. a virtual server would be a specialization of a
> container (effectively analagous to a resource controller)

Ok, sure.

> >I suspect this may simplify the "container" filesystem, since it doesn't
> >have to track multiple hierarchies at the same time, and improve lock
> >contention too (modifying the cpu controller hierarchy can take a different
> >lock than the mem controller hierarchy).
> 
> Do you think that lock contention when modifying hierarchies is
> generally going to be an issue - how often do tasks get moved around
> in the hierarchy, compared to the other operations going on on the
> system?

I suspect the manipulation of the resource group hierarchy (and the
resulting lock contention) will be more frequent than that of the cpuset 
hierarchy, if we have to support scenarios like the one here:

	http://lkml.org/lkml/2006/9/5/178

I will try to get a better picture of how frequent such task migration
would be in practice from a few people I know who are interested in this
feature within IBM.

-- 
Regards,
vatsa


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-10-30 10:51 ` Paul Menage
  2006-10-30 11:06   ` [ckrm-tech] " Paul Jackson
  2006-10-30 11:15   ` Paul Jackson
@ 2006-11-01 17:33   ` Srivatsa Vaddagiri
  2006-11-01 21:18     ` Chris Friesen
  2 siblings, 1 reply; 135+ messages in thread
From: Srivatsa Vaddagiri @ 2006-11-01 17:33 UTC (permalink / raw)
  To: Paul Menage
  Cc: dev, sekharan, ckrm-tech, balbir, haveblue, linux-kernel, pj,
	matthltc, dipankar, rohitseth, devel

On Mon, Oct 30, 2006 at 02:51:24AM -0800, Paul Menage wrote:
> The cpusets code which this was based on simply locked the task list,
> and traversed it to find threads in the cpuset of interest; you could
> do the same thing in any other resource controller.

Sure, the point was about efficiency (whether you plough through
thousands of tasks to find the 10 tasks which belong to a group, or
you have a list which gets to those 10 tasks immediately). But then the
cost of maintaining such a list is noted.
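The trade-off in miniature (Python model; the task and group structures are invented): scanning the global task list costs O(total tasks) per query, while a per-group list answers in O(group size) but must be updated on every fork and exit.

```python
# Two ways to find the tasks of a group: scan everything, or keep a list.

all_tasks = [("t%d" % i, "G1" if i % 1000 == 0 else "G0")
             for i in range(10000)]        # 10 tasks in G1, 9990 in G0

def members_by_scan(group):
    # cpusets-style: walk the whole task list under the tasklist lock
    return [t for t, g in all_tasks if g == group]

per_group = {}                             # maintained at fork/exit time
for t, g in all_tasks:
    per_group.setdefault(g, []).append(t)

def members_by_list(group):
    return per_group[group]

# Both answer the same question; only the cost profile differs.
assert members_by_scan("G1") == members_by_list("G1")
assert len(members_by_list("G1")) == 10
```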

> Not keeping a list of tasks in the container makes fork/exit more
> efficient, and I assume is the reason that cpusets made that design
> decision. If we really wanted to keep a list of tasks in a container
> it wouldn't be hard, but should probably be conditional on at least
> one of the registered resource controllers to avoid unnecessary
> overhead when none of the controllers actually care (in a similar
> manner to the fork/exit callbacks, which only take the container
> callback mutex if some container subsystem is interested in fork/exit
> events).

Makes sense.

> How important is it for controllers/subsystems to be able to
> deregister themselves, do you think? I could add it relatively easily,
> but it seemed unnecessary in general.

Not very important perhaps.

> I've not really played with it yet, but I don't see any reason why the
> beancounter resource control concept couldn't also be built over
> generic containers. The user interface would be different, of course
> (filesystem vs syscall), but maybe even that could be emulated if
> there was a need for backwards compatibility.

Hmm, cpusets is in mainline already and hence we do need to worry about
backward compatibility. If we were to go ahead with your patches, do we have 
the same backward-compatibility concern for beancounters as well? :)

> > Consensus:
> >
> >         - Provide resource control over a group of tasks
> >         - Support movement of task from one resource group to another
> >         - Dont support heirarchy for now
> 
> Both CKRM/RG and generic containers support a hierarchy.

I guess the consensus (as reached at the OLS BoF:
 http://lkml.org/lkml/2006/7/26/237) was more about the controllers than
the infrastructure.

> 
> >         - Support limit (soft and/or hard depending on the resource
> >           type) in controllers. Guarantee feature could be indirectly
> >           met thr limits.
> 
> That's an issue for resource controllers, rather than the underlying
> infrastructure, I think.

Hmm .. I don't think so. If we were to support both guarantee and limit,
then the infrastructure has to provide interfaces to set both values
for a group.

-- 
Regards,
vatsa

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-01  8:01     ` Pavel Emelianov
  2006-11-01 16:04       ` Matt Helsley
@ 2006-11-01 17:50       ` Srivatsa Vaddagiri
  2006-11-02  8:42         ` Pavel Emelianov
  1 sibling, 1 reply; 135+ messages in thread
From: Srivatsa Vaddagiri @ 2006-11-01 17:50 UTC (permalink / raw)
  To: Pavel Emelianov
  Cc: dev, sekharan, menage, ckrm-tech, balbir, haveblue, linux-kernel,
	pj, matthltc, dipankar, rohitseth, devel

On Wed, Nov 01, 2006 at 11:01:31AM +0300, Pavel Emelianov wrote:
> > Sorry dont get you here. Are you saying we should support different
> > grouping for different controllers?
> 
> Not me, but other people in this thread.

Hmm .. I thought the OpenVZ folks were interested in having different
groupings for different resources, i.e. the grouping for CPU should be
independent of the grouping for memory.

	http://lkml.org/lkml/2006/8/18/98

Isn't that true?

-- 
Regards,
vatsa


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-01 16:04       ` Matt Helsley
@ 2006-11-01 17:51         ` Srivatsa Vaddagiri
  0 siblings, 0 replies; 135+ messages in thread
From: Srivatsa Vaddagiri @ 2006-11-01 17:51 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Pavel Emelianov, dev, sekharan, ckrm-tech, balbir, haveblue,
	linux-kernel, pj, dipankar, rohitseth, menage, devel

On Wed, Nov 01, 2006 at 08:04:01AM -0800, Matt Helsley wrote:
> > >> 3. Configfs may be easily implemented later as an additional
> > >>    interface. I propose the following solution:
> > > 
> > > Ideally we should have one interface - either syscall or configfs - and
> > > not both.
> 
> To incorporate all feedback perhaps we should replace "configfs" with
> "filesystem".

Yes, you are right.

-- 
Regards,
vatsa


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-01  9:30 ` Pavel Emelianov
  2006-11-01  9:53   ` David Rientjes
@ 2006-11-01 18:12   ` Srivatsa Vaddagiri
  2006-11-01 22:19     ` Matt Helsley
  2006-11-02  8:52     ` Pavel Emelianov
  1 sibling, 2 replies; 135+ messages in thread
From: Srivatsa Vaddagiri @ 2006-11-01 18:12 UTC (permalink / raw)
  To: Pavel Emelianov
  Cc: dev, sekharan, menage, ckrm-tech, balbir, haveblue, linux-kernel,
	pj, matthltc, dipankar, rohitseth

On Wed, Nov 01, 2006 at 12:30:13PM +0300, Pavel Emelianov wrote:
> > Debated:
> > 	- syscall vs configfs interface
> 
> OK. Let's stop at configfs interface to move...

Excellent!

> > 	- Should we have different groupings for different resources?
> 
> I propose to discuss this question as this is the most important
> now from my point of view.
> 
> I believe this can be done, but can't imagine how to use this...

As I mentioned in my earlier mail, I thought the OpenVZ folks did want this
flexibility:

	http://lkml.org/lkml/2006/8/18/98

Also:

	http://lwn.net/Articles/94573/

But I am ok if we don't support this feature in the initial round of
development.

Having different groupings for different resources could be hairy to deal
with and could easily mess up applications (for example, a process in an 80%
CPU class but a 10% memory class could see its CPU share underutilized,
because it cannot allocate memory as fast as it wants to run); it is
assumed that the administrator will carefully manage these settings.

> > 	- Support movement of all threads of a process from one group
> > 	  to another atomically?
> 
> I propose such a solution: if a user asks to move /proc/<pid>
> then move the whole task with threads.
> If user asks to move /proc/<pid>/task/<tid> then move just
> a single thread.
> 
> What do you think?

Isn't <pid> itself also listed among /proc/<pid>/task/<tid>?

For ex:

	# ls /proc/2906/task
	2906  2907  2908  2909

2906 is the main thread which created the remaining threads.

This would lead to an ambiguity when a user does something like this:

	echo 2906 > /some_res_file_system/some_new_group

Is he intending to move just the main thread, 2906, to the new group or
all the threads? It could be either.

This needs some more thought ...

-- 
Regards,
vatsa


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-01  4:39               ` David Rientjes
  2006-11-01  9:50                 ` Paul Jackson
  2006-11-01 15:59                 ` Srivatsa Vaddagiri
@ 2006-11-01 18:19                 ` Srivatsa Vaddagiri
  2 siblings, 0 replies; 135+ messages in thread
From: Srivatsa Vaddagiri @ 2006-11-01 18:19 UTC (permalink / raw)
  To: David Rientjes
  Cc: Paul Menage, Paul Jackson, dev, sekharan, ckrm-tech, balbir,
	haveblue, linux-kernel, matthltc, dipankar, rohitseth

On Tue, Oct 31, 2006 at 08:39:27PM -0800, David Rientjes wrote:
>  - How is forking handled with the various controllers?  Do child 
>    processes automatically inherit all the controller groups of its
>    parent?  If not (or if its dependant on a user-configured attribute

I think it would be simpler to go with the assumption that a child process 
should automatically inherit the same resource controller groups as its parent.

Although I think CKRM did attempt to provide the flexibility to change
this behavior using a rule-based classification engine (Matt/Chandra,
correct me if I am wrong here).

-- 
Regards,
vatsa


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-01 15:59                 ` Srivatsa Vaddagiri
  2006-11-01 16:31                   ` Srivatsa Vaddagiri
@ 2006-11-01 21:05                   ` David Rientjes
  2006-11-01 23:43                     ` Paul Menage
  1 sibling, 1 reply; 135+ messages in thread
From: David Rientjes @ 2006-11-01 21:05 UTC (permalink / raw)
  To: Srivatsa Vaddagiri
  Cc: Paul Menage, Paul Jackson, dev, sekharan, ckrm-tech, balbir,
	haveblue, linux-kernel, matthltc, dipankar, rohitseth

On Wed, 1 Nov 2006, Srivatsa Vaddagiri wrote:

> This would forces all tasks in container A to belong to the same mem/io ctlr 
> groups. What if that is not desired? How would we achieve something like
> this:
> 
> 	tasks (m) should belong to mem ctlr group D,
> 	tasks (n, o) should belong to mem ctlr group E
>   	tasks (m, n, o) should belong to i/o ctlr group G
> 

With the example you would need to place task m in one container called 
A_m and tasks n and o in another container called A_n,o.  Then join A_m to 
D, A_n,o to E, and both to G.

I agree that this doesn't appear to be very easy to setup by the sysadmin 
or any automated means.  But in terms of the kernel, each of these tasks 
would have a pointer back to its container and that container would point 
to its assigned resource controller.  So it's still a double dereference 
to access the controller from any task_struct.

So if we proposed a hierarchy of containers, we could have the following:

                        ----------A----------
                        |         |         |
                   -----B-----    m         C
                   |         |              |
                   n    -----D-----         o
                        |         |
                        p         q

So instead we make the requirement that only one container can be attached 
to any given controller.  So if container A is attached to a disk I/O 
controller, for example, then it includes all processes.  If D is attached 
to it instead, only p and q are affected by its constraints.

This would be possible by adding a field to struct container that points 
to its parent cpu, net, mem, etc. container, or NULL if the container is 
itself the root for that resource.

The difference:

	Single-level container hierarchy

		struct task_struct {
			...
			struct container *my_container;
		}
		struct container {
			...
			struct controller *my_cpu_controller;
			struct controller *my_mem_controller;
		}

	Multi-level container hierarchy

		struct task_struct {
			...
			struct container *my_container;
		}
		struct container {
			...
			/* Root containers, NULL if itself */
			struct container *my_cpu_root_container;
			struct container *my_mem_root_container;
			/* Controllers, NULL if has parent */
			struct controller *my_cpu_controller;
			struct controller *my_mem_controller;
		}

This eliminates the need to put a pointer to each resource controller 
within each task_struct.
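David's multi-level scheme can be illustrated with a small compilable sketch (the struct names follow his example above, but the field set is trimmed to CPU only and nothing here is taken from an actual patch):

```c
#include <assert.h>
#include <stddef.h>

struct controller { int cpu_shares; };

struct container {
	struct container *my_cpu_root_container; /* NULL if itself the root */
	struct controller *my_cpu_controller;    /* NULL if it has a parent root */
};

struct task_struct { struct container *my_container; };

/* Resolve the CPU controller for a task: at most a double dereference. */
static struct controller *task_cpu_controller(const struct task_struct *t)
{
	struct container *c = t->my_container;
	if (c->my_cpu_root_container)	/* hop to the root container ... */
		c = c->my_cpu_root_container;
	return c->my_cpu_controller;	/* ... which holds the controller */
}
```

With a flat hierarchy, `my_cpu_root_container` is always NULL and this degenerates to the single-level double dereference described above.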

> (this example breaks the required condition/assumption that a task belong to 
> exactly only one process container).
> 

Yes, and that was the requirement that the above example was based upon.

		David


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-01 17:33   ` Srivatsa Vaddagiri
@ 2006-11-01 21:18     ` Chris Friesen
  2006-11-01 23:01       ` [Devel] " Kir Kolyshkin
  2006-11-01 23:48       ` Paul Menage
  0 siblings, 2 replies; 135+ messages in thread
From: Chris Friesen @ 2006-11-01 21:18 UTC (permalink / raw)
  To: vatsa
  Cc: Paul Menage, dev, sekharan, ckrm-tech, balbir, haveblue,
	linux-kernel, pj, matthltc, dipankar, rohitseth, devel

Srivatsa Vaddagiri wrote:

>>>        - Support limit (soft and/or hard depending on the resource
>>>          type) in controllers. Guarantee feature could be indirectly
>>>          met thr limits.

I just thought I'd weigh in on this.  As far as our usage pattern is 
concerned, guarantees cannot be met via limits.

I want to give "x" cpu to container X, "y" cpu to container Y, and "z" 
cpu to container Z.

If these are percentages, x+y+z must be less than 100.

However, if Y does not use its share of the cpu, I would like the 
leftover cpu time to be made available to X and Z, in a ratio based on 
their allocated weights.

With limits, I don't see how I can get the ability for containers to 
make opportunistic use of cpu that becomes available.

I can see that with things like memory this could become tricky (How do 
you free up memory that was allocated to X when Y decides that it really 
wants it after all?) but for CPU I think it's a valid scenario.

Chris


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-01 18:12   ` Srivatsa Vaddagiri
@ 2006-11-01 22:19     ` Matt Helsley
  2006-11-01 23:50       ` Paul Menage
  2006-11-02  8:52     ` Pavel Emelianov
  1 sibling, 1 reply; 135+ messages in thread
From: Matt Helsley @ 2006-11-01 22:19 UTC (permalink / raw)
  To: vatsa
  Cc: Pavel Emelianov, dev, sekharan, ckrm-tech, balbir, haveblue,
	linux-kernel, pj, dipankar, rohitseth, menage

On Wed, 2006-11-01 at 23:42 +0530, Srivatsa Vaddagiri wrote:
> On Wed, Nov 01, 2006 at 12:30:13PM +0300, Pavel Emelianov wrote:

<snip>

> > > 	- Support movement of all threads of a process from one group
> > > 	  to another atomically?
> > 
> > I propose such a solution: if a user asks to move /proc/<pid>
> > then move the whole task with threads.
> > If user asks to move /proc/<pid>/task/<tid> then move just
> > a single thread.
> > 
> > What do you think?
> 
> Isnt /proc/<pid> listed also in /proc/<pid>/task/<tid>?
> 
> For ex:
> 
> 	# ls /proc/2906/task
> 	2906  2907  2908  2909
> 
> 2906 is the main thread which created the remaining threads.
> 
> This would lead to an ambiguity when user does something like below:
> 
> 	echo 2906 > /some_res_file_system/some_new_group
> 
> Is he intending to move just the main thread, 2906, to the new group or
> all the threads? It could be either.
> 
> This needs some more thought ...

	I thought the idea was to take in a proc path instead of a single
number. You could then distinguish between the whole thread group and
individual threads by parsing the string. You'd move a single thread if
you find both the tgid and the tid. If you only get a tgid you'd move
the whole thread group. So:

<pid>                   -> if it's a thread group leader move the whole
			   thread group, otherwise just move the thread
/proc/<tgid>            -> move the whole thread group
/proc/<tgid>/task/<tid> -> move the thread


	Alternatives that come to mind are:

1. Read a flag with the pid
2. Use a special file which expects only thread groups as input 
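The path convention above could be parsed along these lines. This userspace sketch is purely illustrative (not from any posted patch); in particular, the bare-pid case genuinely needs a task lookup to tell a group leader from an ordinary thread, so the sketch only reports it as ambiguous:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Return values: 1 = move one thread (tgid+tid filled in),
 *                2 = move the whole thread group (tgid filled in),
 *                3 = bare pid (a real implementation would look the task
 *                    up: whole group if it is the group leader, a single
 *                    thread otherwise),
 *                0 = unrecognized input.
 */
static int parse_move_target(const char *s, int *tgid, int *tid)
{
	char extra;

	if (sscanf(s, "/proc/%d/task/%d%c", tgid, tid, &extra) == 2)
		return 1;
	if (sscanf(s, "/proc/%d%c", tgid, &extra) == 1)
		return 2;
	if (sscanf(s, "%d%c", tgid, &extra) == 1)
		return 3;
	return 0;
}
```

The trailing `%c` conversions reject strings with garbage after the expected path.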

Cheers,
	-Matt Helsley



* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-01  9:53   ` David Rientjes
@ 2006-11-01 22:23     ` Matt Helsley
  0 siblings, 0 replies; 135+ messages in thread
From: Matt Helsley @ 2006-11-01 22:23 UTC (permalink / raw)
  To: David Rientjes
  Cc: Pavel Emelianov, dev, vatsa, sekharan, ckrm-tech, balbir,
	haveblue, linux-kernel, pj, dipankar, rohitseth, menage

On Wed, 2006-11-01 at 01:53 -0800, David Rientjes wrote:
> On Wed, 1 Nov 2006, Pavel Emelianov wrote:
> 
> > > 	- Interaction of resource controllers, containers and cpusets
> > > 		- Should we support, for instance, creation of resource
> > > 		  groups/containers under a cpuset?
> > > 	- Should we have different groupings for different resources?
> > 
> > I propose to discuss this question as this is the most important
> > now from my point of view.
> > 
> > I believe this can be done, but can't imagine how to use this...
> > 
> 
> I think cpusets, as abstracted away from containers by Paul Menage, simply 
> become a client of the container configfs.  Cpusets would become more of a 
> NUMA-type controller by default.
> 
> Different groupings for different resources was already discussed.  If we 
> use the approach of a single-level "hierarchy" for process containers and 

At least in my mental model the depth of the hierarchy has nothing to do
with different groupings for different resources. They are just separate
hierarchies and where they are mounted does not affect their behavior.

> then attach them each to a "node" of a controller, then the groupings have 
> been achieved.  It's possible to change the network controller of a 
> container or move processes from container to container easily through the 
> filesystem.
> 
> > > 	- Support movement of all threads of a process from one group
> > > 	  to another atomically?
> > 
> > I propose such a solution: if a user asks to move /proc/<pid>
> > then move the whole task with threads.
> > If user asks to move /proc/<pid>/task/<tid> then move just
> > a single thread.
> > 
> > What do you think?
> 
> This seems to use my proposal of using procfs as an abstraction of process 
> containers.  I haven't looked at the implementation details, but it seems 
> like the most appropriate place given what it currently supports.  

I'm not so sure procfs is the right mechanism.

> Naturally it should be an atomic move but I don't think it's the most 
> important detail in terms of efficiency because moving threads should not 
> be such a frequent occurrence anyway.  This begs the question about how 
> forks are handled for processes with regard to the various controllers 
> that could be implemented and whether they should all be decendants of the 
> parent container by default or have the option of spawning a new 
> controller all together.  This would be an attribute of controllers and 

"spawning a new controller"?? Did you mean a new container?

> not containers, however.
> 
> 		David

	I don't follow. You seem to be mixing and separating the terms
"controller" and "container" and it doesn't fit with the uses of those
terms that I'm familiar with.

Cheers,
	-Matt Helsley



* Re: [Devel] Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-01 21:18     ` Chris Friesen
@ 2006-11-01 23:01       ` Kir Kolyshkin
  2006-11-02  0:31         ` Matt Helsley
  2006-11-01 23:48       ` Paul Menage
  1 sibling, 1 reply; 135+ messages in thread
From: Kir Kolyshkin @ 2006-11-01 23:01 UTC (permalink / raw)
  To: devel
  Cc: vatsa, dev, sekharan, ckrm-tech, balbir, linux-kernel, pj,
	matthltc, dipankar, rohitseth, Paul Menage, Chris Friesen

Chris Friesen wrote:
> Srivatsa Vaddagiri wrote:
>
>>>>        - Support limit (soft and/or hard depending on the resource
>>>>          type) in controllers. Guarantee feature could be indirectly
>>>>          met thr limits.
>
> I just thought I'd weigh in on this.  As far as our usage pattern is
> concerned, guarantees cannot be met via limits.
>
> I want to give "x" cpu to container X, "y" cpu to container Y, and "z"
> cpu to container Z.
>
> If these are percentages, x+y+z must be less than 100.
>
> However, if Y does not use its share of the cpu, I would like the
> leftover cpu time to be made available to X and Z, in a ratio based on
> their allocated weights.
>
> With limits, I don't see how I can get the ability for containers to
> make opportunistic use of cpu that becomes available.
This is basically how "cpuunits" in OpenVZ works. It is not limiting a
container in any way, just assigns some relative "units" to it, with sum
of all units across all containers equal to 100% CPU. Thus, if we have
cpuunits 10, 20, and 30 assigned to containers X, Y, and Z, and run some
CPU-intensive tasks in all the containers, X will be given
10/(10+20+30), or 20% of CPU time, Y -- 20/50, i.e. 40%, while Z gets
60%. Now, if Z is not using CPU, X will be given 33% and Y -- 66%. The
scheduler used is based on per-VE runqueues, is quite fair, and works
well even for, say, the uneven case of 3 containers on a 4-CPU box.

OpenVZ also has a "cpulimit" resource, which is, naturally, a hard limit
of CPU usage for a VE. Still, given that cpuunits works just
fine, cpulimit is rarely needed -- it makes sense only in special scenarios
where you want to see how an app runs on a slow box, or in the case of some
proprietary software licensed per CPU MHz, or something like that.

Looks like this is what you need, right?
> I can see that with things like memory this could become tricky (How
> do you free up memory that was allocated to X when Y decides that it
> really wants it after all?) but for CPU I think it's a valid scenario.
Yes, the CPU controller is quite different from other resource controllers.

Kir.


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-01 17:25                   ` Srivatsa Vaddagiri
@ 2006-11-01 23:37                     ` Paul Menage
  2006-11-06 12:49                       ` Srivatsa Vaddagiri
  0 siblings, 1 reply; 135+ messages in thread
From: Paul Menage @ 2006-11-01 23:37 UTC (permalink / raw)
  To: vatsa
  Cc: Paul Jackson, dev, sekharan, ckrm-tech, balbir, haveblue,
	linux-kernel, matthltc, dipankar, rohitseth

On 11/1/06, Srivatsa Vaddagiri <vatsa@in.ibm.com> wrote:
>
> I suspect we can avoid maintaining separate hierarchies if not required.
>
>         mkdir /dev/res_groups
>         mount -t container -o cpu,mem,io none /dev/res_groups
>         mkdir /dev/res_groups/A
>         mkdir /dev/res_groups/B
>
> Directories A and B would now contain res ctl files associated with all
> resources (viz cpu, mem, io) and also a 'members' file listing the tasks
> belonging to those groups.
>
> Do you think the above mechanism is implementable? Even if it is, I dont know
> how the implementation will get complicated because of this requirement.

Yes, certainly implementable, and I don't think it would complicate
the code too much. I alluded to it as a possibility when I first sent
out my patches  - I think my main issue with it was the fact that it
results in multiple container pointers per process at compile time,
which could be wasteful.

>
> This requirement that each process should be exactly in one process container
> is perhaps not good, since it will not give the fleixibility to define groups
> unique to each resource (see my reply earlier to David Rientjes).

I saw your example, but can you give a concrete example of a situation
when you might want to do that?

For simplicity combined with flexibility, I think I still favour the
following model:

- all processes are a member of one container
- for each resource type, each container is either in the same
resource node as its parent or in a fresh child node of the parent
resource node (determined at container creation time)

This is a subset of my more complex model, but it's pretty easy to
understand from userspace and to implement in the kernel.
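The simplified model above can be sketched as follows; all names and the two-resource setup are illustrative assumptions, not from the posted patches. The key point is that the share-or-fresh decision is made per resource, once, at container creation time:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_RESOURCES 2		/* illustrative: 0 = cpu, 1 = mem */

struct res_node { int shares; };

struct container {
	struct container *parent;
	/* per-resource node: shared with the parent, or a fresh child node */
	struct res_node *node[NUM_RESOURCES];
};

/* At creation, each resource slot either inherits the parent's node or
 * points at a freshly allocated child node (here supplied by the caller). */
static void container_init(struct container *c, struct container *parent,
			   const bool fresh[NUM_RESOURCES],
			   struct res_node *fresh_nodes)
{
	c->parent = parent;
	for (int i = 0; i < NUM_RESOURCES; i++)
		c->node[i] = (parent && !fresh[i]) ? parent->node[i]
						   : &fresh_nodes[i];
}
```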

>
> > the child task is either entirely in the new resource limits or
> > entirely in the old limits - if userspace has to update several
> > hierarchies at once non-atomically then a freshly forked child could
> > end up with a mixture of resource nodes.
>
> If the user intended to have a common grouping hierarchy for all
> resources, then this movement of tasks can be "atomic" as far as user is
> concerned, as per the above example:
>
>         echo task_pid > /dev/res_groups/B/members
>
> should cause the task transition to the new group in one shot?
>

Yes, if we took that model. But if someone does want to have
non-identical hierarchies, then in that model they're still forced
into a non-atomic update situation.

What objections do you have to David's suggestion that if you want some
processes in a container to be in one resource node and others in
another resource node, then you should just subdivide into two
containers, such that all processes in a container are in the same set
of resource nodes?

Paul


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-01 21:05                   ` David Rientjes
@ 2006-11-01 23:43                     ` Paul Menage
  0 siblings, 0 replies; 135+ messages in thread
From: Paul Menage @ 2006-11-01 23:43 UTC (permalink / raw)
  To: David Rientjes
  Cc: Srivatsa Vaddagiri, Paul Jackson, dev, sekharan, ckrm-tech,
	balbir, haveblue, linux-kernel, matthltc, dipankar, rohitseth

>
> So instead we make the requirement that only one container can be attached
> to any given controller.  So if container A is attached to a disk I/O
> controller, for example, then it includes all processes.  If D is attached
> to it instead, only p and q are affected by its constraints.

If by "controller" you mean "resource node" this looks on second
glance very similar in concept to the simplified approach I outlined
in my last email. Except that I'd still include a pointer from e.g. D
to the resource node for fast lookup.

Paul


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-01 21:18     ` Chris Friesen
  2006-11-01 23:01       ` [Devel] " Kir Kolyshkin
@ 2006-11-01 23:48       ` Paul Menage
  2006-11-02  3:28         ` Chris Friesen
  1 sibling, 1 reply; 135+ messages in thread
From: Paul Menage @ 2006-11-01 23:48 UTC (permalink / raw)
  To: Chris Friesen
  Cc: vatsa, dev, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	pj, matthltc, dipankar, rohitseth, devel

On 11/1/06, Chris Friesen <cfriesen@nortel.com> wrote:
>
> I just thought I'd weigh in on this.  As far as our usage pattern is
> concerned, guarantees cannot be met via limits.
>
> I want to give "x" cpu to container X, "y" cpu to container Y, and "z"
> cpu to container Z.

I agree that these are issues - but they don't really affect the
container framework directly.

The framework should be flexible enough to let controllers register
any control parameters (via the filesystem?) that they need, but it
shouldn't contain explicit concepts like guarantees and limits. Some
controllers won't even have this concept (cpusets doesn't really, for
instance, and containers don't have to be solely about
quantitative resource control).

I sent out a patch a while ago that showed how ResGroups could be
turned into effectively a library on top of a generic container system
- so ResGroups controllers could write to the ResGroups interface, and
let the library handle setting up control parameters and parsing
limits and guarantees. I expect the same thing could be done for UBC.

Paul


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-01 22:19     ` Matt Helsley
@ 2006-11-01 23:50       ` Paul Menage
  2006-11-02  0:30         ` Matt Helsley
  2006-11-02  9:08         ` Pavel Emelianov
  0 siblings, 2 replies; 135+ messages in thread
From: Paul Menage @ 2006-11-01 23:50 UTC (permalink / raw)
  To: Matt Helsley
  Cc: vatsa, Pavel Emelianov, dev, sekharan, ckrm-tech, balbir,
	haveblue, linux-kernel, pj, dipankar, rohitseth

On 11/1/06, Matt Helsley <matthltc@us.ibm.com> wrote:
> On Wed, 2006-11-01 at 23:42 +0530, Srivatsa Vaddagiri wrote:
> > On Wed, Nov 01, 2006 at 12:30:13PM +0300, Pavel Emelianov wrote:
>
> <snip>
>
> > > >   - Support movement of all threads of a process from one group
> > > >     to another atomically?
> > >
> > > I propose such a solution: if a user asks to move /proc/<pid>
> > > then move the whole task with threads.
> > > If user asks to move /proc/<pid>/task/<tid> then move just
> > > a single thread.
> > >
> > > What do you think?
> >
> > Isnt /proc/<pid> listed also in /proc/<pid>/task/<tid>?
> >
> > For ex:
> >
> >       # ls /proc/2906/task
> >       2906  2907  2908  2909
> >
> > 2906 is the main thread which created the remaining threads.
> >
> > This would lead to an ambiguity when user does something like below:
> >
> >       echo 2906 > /some_res_file_system/some_new_group
> >
> > Is he intending to move just the main thread, 2906, to the new group or
> > all the threads? It could be either.
> >
> > This needs some more thought ...
>
>         I thought the idea was to take in a proc path instead of a single
> number. You could then distinguish between the whole thread group and
> individual threads by parsing the string. You'd move a single thread if
> you find both the tgid and the tid. If you only get a tgid you'd move
> the whole thread group. So:
>
> <pid>                   -> if it's a thread group leader move the whole
>                            thread group, otherwise just move the thread
> /proc/<tgid>            -> move the whole thread group
> /proc/<tgid>/task/<tid> -> move the thread
>
>
>         Alternatives that come to mind are:
>
> 1. Read a flag with the pid
> 2. Use a special file which expects only thread groups as input

I think that having a "tasks" file and a "threads" file in each
container directory would be a clean way to handle it:

"tasks" : read/write complete process members
"threads" : read/write individual thread members

Paul


* Re: [ckrm-tech] RFC: Memory Controller
  2006-11-01 12:23                   ` Paul Jackson
@ 2006-11-02  0:09                     ` Paul Menage
  2006-11-02  0:39                       ` Paul Jackson
  0 siblings, 1 reply; 135+ messages in thread
From: Paul Menage @ 2006-11-02  0:09 UTC (permalink / raw)
  To: Paul Jackson
  Cc: balbir, dev, vatsa, sekharan, ckrm-tech, haveblue, linux-kernel,
	matthltc, dipankar, rohitseth

On 11/1/06, Paul Jackson <pj@sgi.com> wrote:
>
> Essentially, if my understanding is correct, zone reclaim has tasks
> that are asking for memory first do some work towards keeping enough
> memory free, such as doing some work reclaiming slab memory and pushing
> swap and pushing dirty buffers to disk.

True, it would help keep the machine alive.

But when one task is allocating memory, it's still going to be pushing
out pages with random owners, rather than pushing out its own pages
when it hits its memory limit. That can negatively affect the
performance of other tasks, which is what we're trying to prevent.

You can't just say that the biggest user should get penalised. You
might want to use 75% of a machine for an important production server,
and have the remaining 25% available for random batch jobs - they
shouldn't be able to impact the production server.

Paul


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-01 23:50       ` Paul Menage
@ 2006-11-02  0:30         ` Matt Helsley
  2006-11-02  5:33           ` Balbir Singh
  2006-11-02  9:08         ` Pavel Emelianov
  1 sibling, 1 reply; 135+ messages in thread
From: Matt Helsley @ 2006-11-02  0:30 UTC (permalink / raw)
  To: Paul Menage
  Cc: vatsa, Pavel Emelianov, dev, sekharan, ckrm-tech, balbir,
	haveblue, linux-kernel, pj, dipankar, rohitseth

On Wed, 2006-11-01 at 15:50 -0800, Paul Menage wrote:
> On 11/1/06, Matt Helsley <matthltc@us.ibm.com> wrote:
> > On Wed, 2006-11-01 at 23:42 +0530, Srivatsa Vaddagiri wrote:
> > > On Wed, Nov 01, 2006 at 12:30:13PM +0300, Pavel Emelianov wrote:
> >
> > <snip>
> >
> > > > >   - Support movement of all threads of a process from one group
> > > > >     to another atomically?
> > > >
> > > > I propose such a solution: if a user asks to move /proc/<pid>
> > > > then move the whole task with threads.
> > > > If user asks to move /proc/<pid>/task/<tid> then move just
> > > > a single thread.
> > > >
> > > > What do you think?
> > >
> > > Isnt /proc/<pid> listed also in /proc/<pid>/task/<tid>?
> > >
> > > For ex:
> > >
> > >       # ls /proc/2906/task
> > >       2906  2907  2908  2909
> > >
> > > 2906 is the main thread which created the remaining threads.
> > >
> > > This would lead to an ambiguity when user does something like below:
> > >
> > >       echo 2906 > /some_res_file_system/some_new_group
> > >
> > > Is he intending to move just the main thread, 2906, to the new group or
> > > all the threads? It could be either.
> > >
> > > This needs some more thought ...
> >
> >         I thought the idea was to take in a proc path instead of a single
> > number. You could then distinguish between the whole thread group and
> > individual threads by parsing the string. You'd move a single thread if
> > you find both the tgid and the tid. If you only get a tgid you'd move
> > the whole thread group. So:
> >
> > <pid>                   -> if it's a thread group leader move the whole
> >                            thread group, otherwise just move the thread
> > /proc/<tgid>            -> move the whole thread group
> > /proc/<tgid>/task/<tid> -> move the thread
> >
> >
> >         Alternatives that come to mind are:
> >
> > 1. Read a flag with the pid
> > 2. Use a special file which expects only thread groups as input
> 
> I think that having a "tasks" file and a "threads" file in each
> container directory would be a clean way to handle it:
> 
> "tasks" : read/write complete process members
> "threads" : read/write individual thread members
> 
> Paul

Seems like a good idea to me -- that certainly avoids complex parsing.

Cheers,
	-Matt Helsley



* Re: [Devel] Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-01 23:01       ` [Devel] " Kir Kolyshkin
@ 2006-11-02  0:31         ` Matt Helsley
  2006-11-02  8:34           ` Kir Kolyshkin
  0 siblings, 1 reply; 135+ messages in thread
From: Matt Helsley @ 2006-11-02  0:31 UTC (permalink / raw)
  To: Kir Kolyshkin
  Cc: devel, vatsa, dev, sekharan, ckrm-tech, balbir, linux-kernel, pj,
	dipankar, rohitseth, Paul Menage, Chris Friesen

On Thu, 2006-11-02 at 02:01 +0300, Kir Kolyshkin wrote:
> Chris Friesen wrote:
> > Srivatsa Vaddagiri wrote:
> >
> >>>>        - Support limit (soft and/or hard depending on the resource
> >>>>          type) in controllers. Guarantee feature could be indirectly
> >>>>          met thr limits.
> >
> > I just thought I'd weigh in on this.  As far as our usage pattern is
> > concerned, guarantees cannot be met via limits.
> >
> > I want to give "x" cpu to container X, "y" cpu to container Y, and "z"
> > cpu to container Z.
> >
> > If these are percentages, x+y+z must be less than 100.
> >
> > However, if Y does not use its share of the cpu, I would like the
> > leftover cpu time to be made available to X and Z, in a ratio based on
> > their allocated weights.
> >
> > With limits, I don't see how I can get the ability for containers to
> > make opportunistic use of cpu that becomes available.
> This is basically how "cpuunits" in OpenVZ works. It is not limiting a
> container in any way, just assigns some relative "units" to it, with sum
> of all units across all containers equal to 100% CPU. Thus, if we have

	So the user doesn't really specify percentages but values that feed into
ratios used by the underlying controller? If so, then it's not terribly
different from the "shares" of a single level of Resource Groups.

	Resource Groups goes one step further and defines a denominator for
child groups to use. This allows the shares to be connected vertically
so that changes don't need to propagate beyond the parent and child
groups.

> cpuunits 10, 20, and 30 assigned to containers X, Y, and Z, and run some
> CPU-intensive tasks in all the containers, X will be given
> 10/(10+20+30), or 20% of CPU time, Y -- 20/50, i.e. 40%, while Z gets

nit: I don't think this math is correct.

Shouldn't they all have the same denominator (60), or am I
misunderstanding something?

If so then it should be:
X = 10/60      16.666...%
Y = 20/60      33.333...%
Z = 30/60      50.0%
Total:        100.0%

> 60%. Now, if Z is not using CPU, X will be given 33% and Y -- 66%. The
> scheduler used is based on a per-VE runqueues, is quite fair, and works
> fine and fair for, say, uneven case of 3 containers on a 4 CPU box.

<snip>

Cheers,
	-Matt Helsley



* Re: [ckrm-tech] RFC: Memory Controller
  2006-11-02  0:09                     ` Paul Menage
@ 2006-11-02  0:39                       ` Paul Jackson
  0 siblings, 0 replies; 135+ messages in thread
From: Paul Jackson @ 2006-11-02  0:39 UTC (permalink / raw)
  To: Paul Menage
  Cc: dev, vatsa, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	matthltc, dipankar, rohitseth

Paul M wrote:
> That can negatively affect the
> performance of other tasks, which is what we're trying to prevent.

That sounds like a worthwhile goal.  I agree that zone_reclaim
doesn't do that.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-01 23:48       ` Paul Menage
@ 2006-11-02  3:28         ` Chris Friesen
  2006-11-02  7:40           ` Paul Menage
  0 siblings, 1 reply; 135+ messages in thread
From: Chris Friesen @ 2006-11-02  3:28 UTC (permalink / raw)
  To: Paul Menage
  Cc: vatsa, dev, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	pj, matthltc, dipankar, rohitseth, devel

Paul Menage wrote:

> The framework should be flexible enough to let controllers register
> any control parameters (via the filesystem?) that they need, but it
> shouldn't contain explicit concepts like guarantees and limits.

If the framework was able to handle arbitrary control parameters, that 
would certainly be interesting.

Presumably there would be some way for the controllers to be called from 
the framework to validate those parameters?

Chris


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-02  0:30         ` Matt Helsley
@ 2006-11-02  5:33           ` Balbir Singh
  0 siblings, 0 replies; 135+ messages in thread
From: Balbir Singh @ 2006-11-02  5:33 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Paul Menage, vatsa, Pavel Emelianov, dev, sekharan, ckrm-tech,
	haveblue, linux-kernel, pj, dipankar, rohitseth

Matt Helsley wrote:
> On Wed, 2006-11-01 at 15:50 -0800, Paul Menage wrote:
>> On 11/1/06, Matt Helsley <matthltc@us.ibm.com> wrote:
>>> On Wed, 2006-11-01 at 23:42 +0530, Srivatsa Vaddagiri wrote:
>>>> On Wed, Nov 01, 2006 at 12:30:13PM +0300, Pavel Emelianov wrote:
>>> <snip>
>>>
>>>>>>   - Support movement of all threads of a process from one group
>>>>>>     to another atomically?
>>>>> I propose such a solution: if a user asks to move /proc/<pid>
>>>>> then move the whole task with threads.
>>>>> If user asks to move /proc/<pid>/task/<tid> then move just
>>>>> a single thread.
>>>>>
>>>>> What do you think?
>>>> Isnt /proc/<pid> listed also in /proc/<pid>/task/<tid>?
>>>>
>>>> For ex:
>>>>
>>>>       # ls /proc/2906/task
>>>>       2906  2907  2908  2909
>>>>
>>>> 2906 is the main thread which created the remaining threads.
>>>>
>>>> This would lead to an ambiguity when user does something like below:
>>>>
>>>>       echo 2906 > /some_res_file_system/some_new_group
>>>>
>>>> Is he intending to move just the main thread, 2906, to the new group or
>>>> all the threads? It could be either.
>>>>
>>>> This needs some more thought ...
>>>         I thought the idea was to take in a proc path instead of a single
>>> number. You could then distinguish between the whole thread group and
>>> individual threads by parsing the string. You'd move a single thread if
>>> you find both the tgid and the tid. If you only get a tgid you'd move
>>> the whole thread group. So:
>>>
>>> <pid>                   -> if it's a thread group leader move the whole
>>>                            thread group, otherwise just move the thread
>>> /proc/<tgid>            -> move the whole thread group
>>> /proc/<tgid>/task/<tid> -> move the thread
>>>
>>>
>>>         Alternatives that come to mind are:
>>>
>>> 1. Read a flag with the pid
>>> 2. Use a special file which expects only thread groups as input
>> I think that having a "tasks" file and a "threads" file in each
>> container directory would be a clean way to handle it:
>>
>> "tasks" : read/write complete process members
>> "threads" : read/write individual thread members
>>
>> Paul
> 
> Seems like a good idea to me -- that certainly avoids complex parsing.
> 
> Cheers,
> 	-Matt Helsley
> 

Yeah, sounds like a good idea. We need to give the controllers some control
over whether they support task movement, thread movement or both.

-- 

	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-02  3:28         ` Chris Friesen
@ 2006-11-02  7:40           ` Paul Menage
  0 siblings, 0 replies; 135+ messages in thread
From: Paul Menage @ 2006-11-02  7:40 UTC (permalink / raw)
  To: Chris Friesen
  Cc: vatsa, dev, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	pj, matthltc, dipankar, rohitseth, devel

On 11/1/06, Chris Friesen <cfriesen@nortel.com> wrote:
> Paul Menage wrote:
>
> > The framework should be flexible enough to let controllers register
> > any control parameters (via the filesystem?) that they need, but it
> > shouldn't contain explicit concepts like guarantees and limits.
>
> If the framework was able to handle arbitrary control parameters, that
> would certainly be interesting.
>
> Presumably there would be some way for the controllers to be called from
> the framework to validate those parameters?

The approach that I had in mind was that each controller could
register what ever control files it wanted, which would appear in the
filesystem directories for each container; reads and writes on those
files would invoke handlers in the controller. The framework wouldn't
care about the semantics of those control files. See the containers
patch that I posted last month for some examples of this.

Paul


* Re: [Devel] Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-02  0:31         ` Matt Helsley
@ 2006-11-02  8:34           ` Kir Kolyshkin
  0 siblings, 0 replies; 135+ messages in thread
From: Kir Kolyshkin @ 2006-11-02  8:34 UTC (permalink / raw)
  To: Matt Helsley
  Cc: Kir Kolyshkin, devel, vatsa, dev, sekharan, ckrm-tech, balbir,
	linux-kernel, pj, dipankar, rohitseth, Paul Menage, Chris Friesen

Matt Helsley wrote:
> On Thu, 2006-11-02 at 02:01 +0300, Kir Kolyshkin wrote:
>   
>> cpuunits 10, 20, and 30 assigned to containers X, Y, and Z, and run some
>> CPU-intensive tasks in all the containers, X will be given
>> 10/(10+20+30), or 20% of CPU time, Y -- 20/50, i.e. 40%, while Z gets
>>     
>
> nit: I don't think this math is correct.
>
> Shouldn't they all have the same denominator (60), or am I
> misunderstanding something?
>
> If so then it should be:
> X = 10/60      16.666...%
> Y = 20/60      33.333...%
> Z = 30/60      50.0%
> Total:        100.0%
>   
Ughm. You are totally correct of course - I must've been very tired
last night :-\




* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-01 17:50       ` Srivatsa Vaddagiri
@ 2006-11-02  8:42         ` Pavel Emelianov
  2006-11-03  1:29           ` David Rientjes
  0 siblings, 1 reply; 135+ messages in thread
From: Pavel Emelianov @ 2006-11-02  8:42 UTC (permalink / raw)
  To: vatsa
  Cc: Pavel Emelianov, dev, sekharan, menage, ckrm-tech, balbir,
	haveblue, linux-kernel, pj, matthltc, dipankar, rohitseth, devel

Srivatsa Vaddagiri wrote:
> On Wed, Nov 01, 2006 at 11:01:31AM +0300, Pavel Emelianov wrote:
>>> Sorry dont get you here. Are you saying we should support different
>>> grouping for different controllers?
>> Not me, but other people in this thread.
> 
> Hmm ..I thought OpenVz folks were interested in having different
> groupings for different resources i.e grouping for CPU should be
> independent of the grouping for memory.
> 
> 	http://lkml.org/lkml/2006/8/18/98
> 
> Isnt that true?

That's true. We don't mind having different groupings for
different resources. But what I was saying in this thread is
"I didn't *propose* this thing, I just *agreed* that this
might be useful for someone."

So if we're going to have different groupings for different
resources, what's the use of a "container" grouping all "controllers"
together? I see this situation as each task_struct carrying
pointers to a kmemsize controller, private pages controller,
physical pages controller, CPU time controller, disk bandwidth
controller, etc. Right? Or did I miss something?


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-01 18:12   ` Srivatsa Vaddagiri
  2006-11-01 22:19     ` Matt Helsley
@ 2006-11-02  8:52     ` Pavel Emelianov
  1 sibling, 0 replies; 135+ messages in thread
From: Pavel Emelianov @ 2006-11-02  8:52 UTC (permalink / raw)
  To: vatsa
  Cc: Pavel Emelianov, dev, sekharan, menage, ckrm-tech, balbir,
	haveblue, linux-kernel, pj, matthltc, dipankar, rohitseth

>> I believe this can be done, but can't imagine how to use this...
> 
> As I mentioned in my earlier mail, I thought openvz folks did want this
> flexibility:
> 
> 	http://lkml.org/lkml/2006/8/18/98
> 
> Also:
> 
> 	http://lwn.net/Articles/94573/
> 
> But I am ok if we dont support this feature in the initial round of
> development.

Yes. Let's start with that - no separate groupings for a while.

BTW I think that hierarchy is a good (and easier-to-implement)
replacement for separate groupings. Say I want two groups to
have separate CPU shares and a common kmemsize; this is the same
as one group for kmemsize with two kids - one for X% of
CPU share and the other for Y%. And this (hierarchy) provides
more flexibility than plain, separate groupings.
Moreover configfs can provide a clean interface for it. E.g.
 $ mkdir /configfs/beancounters/0
 $ mkdir /configfs/beancounters/0/1
 $ mkdir /configfs/beancounters/0/2
and each task_struct will have a single pointer - its current
container - rather than 10, one for each controller.

What do you think?

> Having grouping for different resources could be a hairy to deal
> with and could easily mess up applications (for ex: a process in a 80%

That's it... One more thing against separate grouping.

[snip]

> Isnt /proc/<pid> listed also in /proc/<pid>/task/<tid>?
> 
> For ex:
> 
> 	# ls /proc/2906/task
> 	2906  2907  2908  2909
> 
> 2906 is the main thread which created the remaining threads.
> 
> This would lead to an ambiguity when user does something like below:
> 
> 	echo 2906 > /some_res_file_system/some_new_group
> 
> Is he intending to move just the main thread, 2906, to the new group or
> all the threads? It could be either.
> 
> This needs some more thought ...

I agree with Paul Menage that having
/configfs/beancounters/<id>/tasks and /.../threads is perfect.


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-01 23:50       ` Paul Menage
  2006-11-02  0:30         ` Matt Helsley
@ 2006-11-02  9:08         ` Pavel Emelianov
  2006-11-02 11:26           ` Matt Helsley
  1 sibling, 1 reply; 135+ messages in thread
From: Pavel Emelianov @ 2006-11-02  9:08 UTC (permalink / raw)
  To: Paul Menage
  Cc: Matt Helsley, vatsa, Pavel Emelianov, dev, sekharan, ckrm-tech,
	balbir, haveblue, linux-kernel, pj, dipankar, rohitseth

[snip]

> I think that having a "tasks" file and a "threads" file in each
> container directory would be a clean way to handle it:
> 
> "tasks" : read/write complete process members
> "threads" : read/write individual thread members

I've just thought of it.

A beancounter may have more than 409 tasks, while configfs
doesn't allow attributes to return more than PAGE_SIZE bytes
on read. So how would you fit so many tasks in one page?

I like the idea of writing pids/tids to these files, but
printing them back is not that easy.

> 
> Paul



* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-02  9:08         ` Pavel Emelianov
@ 2006-11-02 11:26           ` Matt Helsley
  2006-11-02 13:04             ` Pavel Emelianov
  0 siblings, 1 reply; 135+ messages in thread
From: Matt Helsley @ 2006-11-02 11:26 UTC (permalink / raw)
  To: Pavel Emelianov
  Cc: Paul Menage, vatsa, dev, sekharan, ckrm-tech, balbir, haveblue,
	linux-kernel, pj, dipankar, rohitseth

On Thu, 2006-11-02 at 12:08 +0300, Pavel Emelianov wrote:
> [snip]
> 
> > I think that having a "tasks" file and a "threads" file in each
> > container directory would be a clean way to handle it:
> > 
> > "tasks" : read/write complete process members
> > "threads" : read/write individual thread members
> 
> I've just thought of it.
> 
> Beancounter may have more than 409 tasks, while configfs
> doesn't allow attributes to store more than PAGE_SIZE bytes
> on read. So how would you fill so many tasks in one page?

	To be clear that's a limitation of configfs as an interface. In the
Resource Groups code, for example, there is no hard limitation on length
of the underlying list. This is why we're talking about a filesystem
interface and not necessarily a configfs interface.

> I like the idea of writing pids/tids to these files, but
> printing them back is not that easy.

	That depends on how you do it. For instance, if you don't have an
explicit list of tasks in the group (rough cost: 1 list head per task)
then yes, it could be difficult.

Cheers,
	-Matt Helsley



* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-02 11:26           ` Matt Helsley
@ 2006-11-02 13:04             ` Pavel Emelianov
  2006-11-03  1:29               ` David Rientjes
  0 siblings, 1 reply; 135+ messages in thread
From: Pavel Emelianov @ 2006-11-02 13:04 UTC (permalink / raw)
  To: Matt Helsley, vatsa, balbir
  Cc: Pavel Emelianov, Paul Menage, dev, sekharan, ckrm-tech, haveblue,
	linux-kernel, pj, dipankar, rohitseth

Matt Helsley wrote:
> On Thu, 2006-11-02 at 12:08 +0300, Pavel Emelianov wrote:
>> [snip]
>>
>>> I think that having a "tasks" file and a "threads" file in each
>>> container directory would be a clean way to handle it:
>>>
>>> "tasks" : read/write complete process members
>>> "threads" : read/write individual thread members
>> I've just thought of it.
>>
>> Beancounter may have more than 409 tasks, while configfs
>> doesn't allow attributes to store more than PAGE_SIZE bytes
>> on read. So how would you fill so many tasks in one page?
> 
> 	To be clear that's a limitation of configfs as an interface. In the
> Resource Groups code, for example, there is no hard limitation on length
> of the underlying list. This is why we're talking about a filesystem
> interface and not necessarily a configfs interface.

David Rientjes persuaded me that writing our own file system would be
reimplementing an existing thing. If we've agreed on a filesystem
interface, then configfs may be used. But the limitations I've
pointed out must be discussed.

Let me remind you:
1. limitation on the size of data read out of configfs;
2. when configfs is a module, users won't be able to
   use beancounters.

and one new one:
3. right now in beancounters we have a /proc/user_beancounters
   file that shows the complete statistics on BCs. This
   includes all the beancounters in the system with all
   resources' held/maxheld/failcounters/etc. This is very
   handy and vivid: a simple 'cat' shows you all you
   need. With configfs we lack this very handy feature.

>> I like the idea of writing pids/tids to these files, but
>> printing them back is not that easy.
> 
> 	That depends on how you do it. For instance, if you don't have an
> explicit list of tasks in the group (rough cost: 1 list head per task)
> then yes, it could be difficult.

I propose not to have a list of tasks associated with a beancounter
(what for?) but to extend /proc/<pid>/status with a 'bcid: <id>' field.
The /configfs/beancounters/<id>/(tasks|threads) files should be write-only
then.

What do you think?


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-02  8:42         ` Pavel Emelianov
@ 2006-11-03  1:29           ` David Rientjes
  0 siblings, 0 replies; 135+ messages in thread
From: David Rientjes @ 2006-11-03  1:29 UTC (permalink / raw)
  To: Pavel Emelianov
  Cc: vatsa, dev, sekharan, menage, ckrm-tech, balbir, haveblue,
	linux-kernel, pj, matthltc, dipankar, rohitseth, devel

On Thu, 2 Nov 2006, Pavel Emelianov wrote:

> So if we're going to have different groupings for different
> resources what's the use of "container" grouping all "controllers"
> together? I see this situation like each task_struct carries
> pointers to kmemsize controller, pivate pages controller,
> physical pages controller, CPU time controller, disk bandwidth
> controller, etc. Right? Or did I miss something?

My understanding is that the only addition to the task_struct is a pointer 
to the struct container it belongs to.  Then, the various controllers can 
register the control files through the fs-based container interface and 
all the manipulation can be done at that level.  Having each task_struct 
contain pointers to individual resource nodes was never proposed.

		David


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-02 13:04             ` Pavel Emelianov
@ 2006-11-03  1:29               ` David Rientjes
  0 siblings, 0 replies; 135+ messages in thread
From: David Rientjes @ 2006-11-03  1:29 UTC (permalink / raw)
  To: Pavel Emelianov
  Cc: Matt Helsley, vatsa, balbir, Paul Menage, dev, sekharan,
	ckrm-tech, haveblue, linux-kernel, pj, dipankar, rohitseth

On Thu, 2 Nov 2006, Pavel Emelianov wrote:

> >> Beancounter may have more than 409 tasks, while configfs
> >> doesn't allow attributes to store more than PAGE_SIZE bytes
> >> on read. So how would you fill so many tasks in one page?
> > 
> > 	To be clear that's a limitation of configfs as an interface. In the
> > Resource Groups code, for example, there is no hard limitation on length
> > of the underlying list. This is why we're talking about a filesystem
> > interface and not necessarily a configfs interface.
> 
> David Rientjes persuaded me that writing our own file system is
> reimplementing the existing thing. If we've agreed with file system
> interface then configfs may be used. But the limitations I've
> pointed out must be discussed.
> 

What are we really discussing here?  The original issue that you raised 
with the infrastructure was an fs vs. syscall interface, and I simply 
argued in favor of an fs-based approach because containers are inherently 
hierarchical.  As Paul Jackson mentioned, this is one of the advantages 
that cpusets has had since its inclusion in the kernel, and the abstraction 
of cpusets from containers makes a convincing case for how beneficial it 
has been and will continue to be.

Whether configfs is specifically used for this particular purpose is 
irrelevant in deciding fs vs. syscall.  Certainly it could be used for 
lightweight purposes, but it is by no means the only possibility 
for containers.  I have observed no further advocacy for a syscall 
interface; it seems like a no-brainer that if the configfs limitations 
you have pointed out would be disadvantageous to containers, another 
fs implementation would suffice.

> Let me remind:
> 1. limitation of size of data written out of configfs;
> 2. when configfs is a module user won't be able to
>    use beancounters.
> 
> and one new
> 3. now in beancounters we have /proc/user_beancounters
>    file that shows the complete statistics on BC. This
>    includes all then beancounters in the system with all
>    resources' held/maxheld/failcounters/etc. This is very
>    handy and "vividly": a simple 'cat' shows you all you
>    need. With configfs we lack this very handy feature.
> 

Ok, so each of these issues includes a specific criticism against configfs 
for containers.  So a different fs-based interface similar to the cpuset 
abstraction from containers is certainly appropriate.

		David


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-01 23:37                     ` Paul Menage
@ 2006-11-06 12:49                       ` Srivatsa Vaddagiri
  2006-11-06 20:23                         ` Paul Menage
  0 siblings, 1 reply; 135+ messages in thread
From: Srivatsa Vaddagiri @ 2006-11-06 12:49 UTC (permalink / raw)
  To: Paul Menage
  Cc: Paul Jackson, dev, sekharan, ckrm-tech, balbir, haveblue,
	linux-kernel, matthltc, dipankar, rohitseth

On Wed, Nov 01, 2006 at 03:37:12PM -0800, Paul Menage wrote:
> I saw your example, but can you give a concrete example of a situation
> when you might want to do that?

Paul,
	Firstly, after some more thought on this, we can use your current
proposal, if it makes the implementation simpler.

Secondly, regarding how separate per-resource grouping *may be* useful,
consider this scenario.

A large university server has various users - students, professors,
system tasks etc. The resource planning for this server could be along these lines:

	CPU : 		Top cpuset 
			/	\   
		CPUSet1 	CPUSet2
		   |		  |
		(Profs)		(Students)

		In addition (system tasks) are attached to topcpuset (so
		that they can run anywhere) with a limit of 20%

	Memory : Professors (50%), students (30%), system (20%)

	Disk : Prof (50%), students (30%), system (20%)

	Network : WWW browsing (20%), Network File System (60%), others (20%)
				/ \
 			Prof (15%) students (5%)

Browsers like firefox/lynx go into the WWW network class, while (k)nfsd goes 
into the NFS network class.

At the same time firefox/lynx will share an appropriate CPU/Memory class 
depending on who launched it (prof/student).

If we had the ability to write pids directly to these resource classes,
then the admin can easily set up a script which receives exec notifications
and, depending on who is launching the browser, he can
	
	# echo browser_pid > approp_resource_class

With your proposal, he would now have to create a separate container for
every browser launched and associate it with the appropriate network and
other resource classes. This may lead to a proliferation of such containers.

Also, let's say that the administrator would like to temporarily give
enhanced network access to a student's browser (since it is night and the
user wants to do online gaming :) OR give one of the students' simulation
apps enhanced CPU power.

With the ability to write pids directly to resource classes, it's just a
matter of:

	# echo pid > new_cpu/network_class
	(after some time)
	# echo pid > old_cpu/network_class

Without this ability, he will have to split the container into a
separate one and then associate the container with the new resource
classes.

So yes, the end result is perhaps achievable either way; the big
difference I see is ease of use.

> For simplicity combined with flexibility, I think I still favour the
> following model:
> 
> - all processes are a member of one container
> - for each resource type, each container is either in the same
> resource node as its parent or a freshly child node of the parent
> resource node (determined at container creation time)
> 
> This is a subset of my more complex model, but it's pretty easy to
> understand from userspace and to implement in the kernel.

If this model makes the implementation simpler, then I am for it, until
we have gained better insight into its use.

> What objections do you have to David's suggestion that if you want some
> processes in a container to be in one resource node and others in
> another resource node, then you should just subdivide into two
> containers, such that all processes in a container are in the same set
> of resource nodes?

One observation is the ease of use (as some of the examples above
point out). The other is that it could lead to more containers than
necessary.

-- 
Regards,
vatsa


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-06 12:49                       ` Srivatsa Vaddagiri
@ 2006-11-06 20:23                         ` Paul Menage
  2006-11-07 13:20                           ` Srivatsa Vaddagiri
                                             ` (2 more replies)
  0 siblings, 3 replies; 135+ messages in thread
From: Paul Menage @ 2006-11-06 20:23 UTC (permalink / raw)
  To: vatsa
  Cc: Paul Jackson, dev, sekharan, ckrm-tech, balbir, haveblue,
	linux-kernel, matthltc, dipankar, rohitseth

On 11/6/06, Srivatsa Vaddagiri <vatsa@in.ibm.com> wrote:
> On Wed, Nov 01, 2006 at 03:37:12PM -0800, Paul Menage wrote:
> > I saw your example, but can you give a concrete example of a situation
> > when you might want to do that?
>
> Paul,
>         Firstly, after some more thought on this, we can use your current
> proposal, if it makes the implementation simpler.

It does, but I'm more in favour of getting the abstractions right the
first time if we can, rather than implementation simplicity.

>
> Secondly, regarding how separate grouping per-resource *maybe* usefull,
> consider this scenario.
>
> A large university server has various users - students, professors,
> system tasks etc. The resource planning for this server could be on these lines:
>
>         CPU :           Top cpuset
>                         /       \
>                 CPUSet1         CPUSet2
>                    |              |
>                 (Profs)         (Students)
>
>                 In addition (system tasks) are attached to topcpuset (so
>                 that they can run anywhere) with a limit of 20%
>
>         Memory : Professors (50%), students (30%), system (20%)
>
>         Disk : Prof (50%), students (30%), system (20%)
>
>         Network : WWW browsing (20%), Network File System (60%), others (20%)
>                                 / \
>                         Prof (15%) students (5%)
>
> Browsers like firefox/lynx go into the WWW network class, while (k)nfsd go
> into NFS network class.
>
> At the same time firefox/lynx will share an appropriate CPU/Memory class
> depending on who launched it (prof/student).
>
> If we had the ability to write pids directly to these resource classes,
> then admin can easily setup a script which receives exec notifications
> and depending on who is launching the browser he can
>
>         # echo browser_pid > approp_resource_class
>
> With your proposal, he now would have to create a separate container for
> every browser launched and associate it with approp network and other
> resource class. This may lead to proliferation of such containers.

Or create one container per combination (so in this case four,
prof/www, prof/other, student/www, student/other) - then processes can
be moved between the containers to get the appropriate QoS of each
type.

So the setup would look something like:

top-level: prof vs student vs system, with new child nodes for cpu,
memory and disk, and no new node for network

second-level, within the prof and student classes: www vs other, with
new child nodes for network, and no new child nodes for cpu.

In terms of the commands to set it up, it might look like (from the top-level)

echo network > inherit
mkdir prof student system
echo disk,cpu,memory > prof/inherit
mkdir prof/www prof/other
echo disk,cpu,memory > student/inherit
mkdir student/www student/other

> Also lets say that the administrator would like to give enhanced network
> access temporarily to a student's browser (since it is night and the user
> wants to do online gaming :)  OR give one of the students simulation
> apps enhanced CPU power,
>
> With ability to write pids directly to resource classes, its just a
> matter of :
>
>         # echo pid > new_cpu/network_class
>         (after some time)
>         # echo pid > old_cpu/network_class
>
> Without this ability, he will have to split the container into a
> separate one and then associate the container with the new resource
> classes.

In practice though, do you think the admin would really want to
have to move individual processes around by hand? Sure, it's possible,
but wouldn't it make more sense to just give the entire student/www
class more network bandwidth? Or more generically, how often are
people going to need to move individual processes from one QoS
class to another, rather than changing the QoS for the existing class?

Paul


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-06 20:23                         ` Paul Menage
@ 2006-11-07 13:20                           ` Srivatsa Vaddagiri
  2006-11-07 18:41                           ` Paul Jackson
  2006-11-10 14:57                           ` Srivatsa Vaddagiri
  2 siblings, 0 replies; 135+ messages in thread
From: Srivatsa Vaddagiri @ 2006-11-07 13:20 UTC (permalink / raw)
  To: Paul Menage
  Cc: Paul Jackson, dev, sekharan, ckrm-tech, balbir, haveblue,
	linux-kernel, matthltc, dipankar, rohitseth

On Mon, Nov 06, 2006 at 12:23:44PM -0800, Paul Menage wrote:
> In practice though, do you think the admin would really want to have
> to move individual processes around by hand? Sure, it's possible,
> but wouldn't it make more sense to just give the entire student/www
> class more network bandwidth? Or more generically, how often are

Wouldn't that cause -all- browsers to get enhanced network access? This
is when your intention was to give one particular student's browser
enhanced network access (to do online gaming) while retaining its
existing cpu/mem/io limits, or another particular student's simulation
app enhanced CPU access while retaining its existing mem/io limits.

> people going to be needing to move individual processes from one QoS
> class to another, rather than changing the QoS for the existing class?

If we are talking of tasks moving from one QoS class to another, then it
can be pretty frequent in the case of threaded databases and webservers.
I have been told that, at least in the case of databases, depending on
the workload, tasks may migrate from one group to another on every
request. In general, request durations fall within the milliseconds-to-
seconds range. So, IMO, the design should support frequent task migration.

Also, the requirement to tune individual resource availability for
specific apps/processes (ex: boost its CPU usage but retain other existing
limits) may not be unrealistic.

-- 
Regards,
vatsa


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-06 20:23                         ` Paul Menage
  2006-11-07 13:20                           ` Srivatsa Vaddagiri
@ 2006-11-07 18:41                           ` Paul Jackson
  2006-11-07 19:07                             ` Paul Menage
  2006-11-10 14:57                           ` Srivatsa Vaddagiri
  2 siblings, 1 reply; 135+ messages in thread
From: Paul Jackson @ 2006-11-07 18:41 UTC (permalink / raw)
  To: Paul Menage
  Cc: vatsa, dev, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	matthltc, dipankar, rohitseth

Paul M. wrote:
> It does, but I'm more in favour of getting the abstractions right the
> first time if we can, rather than implementation simplicity.

Yup.

The CONFIG_CPUSETS_LEGACY_API config option is still sticking in my
craw.  Binding things at mount time, as you did, seems more useful.

Srivatsa wrote:
> Secondly, regarding how separate grouping per-resource *may be* useful,
> consider this scenario.

Yeah - I tend to agree that we should allow for such possibilities.

I see the following usage patterns -- I wonder if we can see a way to
provide for all these.  I will speak in terms of just cpusets and
resource groups, as exemplars of the variety of controllers that might
make good use of Paul M's containers:

Could we (Paul M?) find a way to build a single kernel that supports:

 1) Someone just using cpusets wants to do:
	mount -t cpuset cpuset /dev/cpuset
    and then see the existing cpuset API.  Perhaps other files show
    up in the cpuset directories, but at least all the existing
    ones provided by the current cpuset API, with their existing
    behaviours, are all there.

 2) Someone wanting a good CKRM/ResourceGroup interface, doing
    whatever those fine folks are wont to do, binding some other
    resource group controller to a container hierarchy.

 3) Someone, in the future, wanting to "bind" cpusets and resource
    groups together, with a single container based name hierarchy
    of sets of tasks, providing both the cpuset and resource group
    control mechanisms.  Code written for (1) or (2) should work,
    though there is a little wiggle room for API 'refinements' if
    need be.

 4) Someone doing (1) and (2) separately and independently on the
    same system at the same time, with separate and independent
    partitions (aka container hierarchies) of that system's tasks.

If we found usage pattern (4) too difficult to provide cleanly, I might
be willing to drop that one.  I'm not sure yet.

Intuitively, I find (3) very attractive, though I don't have any actual
customer requirements for it in hand (we are operating a little past
our customers' awareness in this present discussion).

The initial customer needs are for (1), which preserves an existing
kernel API, and on separate systems, for (2).  Providing for both on
the same system, as in (3) with a single container hierarchy or even
(4) with multiple independent hierarchies, is an enhancement.

I foresee a day when user-level software, such as batch schedulers, is
written to take advantage of (3), once the kernel supports binding
multiple controllers to a common task container hierarchy.  Initially,
some systems will need cpusets, and some will need resource groups, and
the intersection of these requiring both, whether bound as in (3), or
independent as in (4), will be pretty much empty.

In general then, we will have several controllers (need a good way
for user space to list what controllers, such as cpusets and resource
groups, are available on a live system!) and user space code should be
able to create at least one, if not multiple as in (4) above, container
hierarchies, each bound to one or more of these controllers.

Likely some, if not all, controllers will be singular - at most one such
controller of a given type at a time on a system. Though if someone has a really
big brain, and wants to generalize that constraint, that could be
amusing.  I guess I could have added a (5) above - allow for multiple
instances of a given controller, each bound to different container
hierarchies.  But I'm guessing that is too hard, and not worth the
effort, so I didn't list it.

The notify_on_release mechanism should be elaborated, so that when
multiple controllers (e.g. cpusets and resource groups) are bound to
a common container hierarchy, then user space code can (using API's that
don't exist currently) separately control these exit hooks for each of
these bound controllers.  Perhaps simply enabling 'notify_on_release'
for a container invokes the exit hooks (user space callbacks) for -all-
the controllers bound to that container, whereas some new API's enable
picking and choosing which controllers exit hooks are active.  For
example, there might be a per-cpuset boolean flag file called
'cpuset_notify_on_release', for controlling that exit hook, separately
from any other exit hooks, and a 'cpuset_notify_on_release_path' file
for setting the path of the executable to invoke on release.
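
To make the shape of that interface concrete, here is a sketch against a
mock directory. Both file names are hypothetical, taken straight from the
example above; only a kernel carrying such a patch would expose them in
real container directories, so ordinary files stand in for them here:

```shell
# Mock stand-in for one container directory; the two file names below
# are hypothetical per-controller release-hook files, per the text above.
d="$(mktemp -d)/student/www"
mkdir -p "$d" && cd "$d"

# Enable only the cpuset exit hook, independent of other controllers,
# and point it at the executable to invoke on release.
echo 1 > cpuset_notify_on_release
echo /sbin/cpuset_release_agent > cpuset_notify_on_release_path

cat cpuset_notify_on_release
```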

I would expect one kernel build CONFIG option for each controller type.
If any one or more of these controller options was enabled, then you
get containers in your build too - no option about it.  I guess that
means that we have a CONFIG option for containers, to mark that code as
conditionally compiled, but that this container CONFIG option is
automatically set iff one or more controllers are included in the build.

Perhaps the interface to binding multiple controllers to a single container
hierarchy is via multiple mount commands, each of type 'container', with
different options specifying which controller(s) to bind.  Then the
command 'mount -t cpuset cpuset /dev/cpuset' gets remapped to the command
'mount -t container -o controller=cpuset /dev/cpuset'.
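
Spelled out as commands, the remapping might look like the following.
The 'controller=' option is only the suggestion above, not an existing
mount option, and the second controller name and mount point are
illustrative; all of these need a kernel with the container patches:

```shell
# Existing cpuset-only interface:
mount -t cpuset cpuset /dev/cpuset

# Hypothetical equivalent via a generic container filesystem:
mount -t container -o controller=cpuset cpuset /dev/cpuset

# Binding several controllers to one hierarchy could then be a matter
# of listing them in the mount options:
mount -t container -o controller=cpuset,rgroups container /mnt/tasks
```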

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-07 18:41                           ` Paul Jackson
@ 2006-11-07 19:07                             ` Paul Menage
  2006-11-07 19:11                               ` Paul Jackson
                                                 ` (3 more replies)
  0 siblings, 4 replies; 135+ messages in thread
From: Paul Menage @ 2006-11-07 19:07 UTC (permalink / raw)
  To: Paul Jackson
  Cc: vatsa, dev, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	matthltc, dipankar, rohitseth

On 11/7/06, Paul Jackson <pj@sgi.com> wrote:
>
> I see the following usage patterns -- I wonder if we can see a way to
> provide for all these.  I will speak in terms of just cpusets and
> resource groups, as exemplars of the variety of controllers that might
> make good use of Paul M's containers:
>
> Could we (Paul M?) find a way to build a single kernel that supports:
>
>  1) Someone just using cpusets wants to do:
>         mount -t cpuset cpuset /dev/cpuset
>     and then see the existing cpuset API.  Perhaps other files show
>     up in the cpuset directories, but at least all the existing
>     ones provided by the current cpuset API, with their existing
>     behaviours, are all there.

This will happen if you configure CONFIG_CPUSETS_LEGACY_API

>
>  2) Someone wanting a good CKRM/ResourceGroup interface, doing
>     whatever those fine folks are wont to do, binding some other
>     resource group controller to a container hierarchy.

This works now.

>
>  3) Someone, in the future, wanting to "bind" cpusets and resource
>     groups together, with a single container based name hierarchy
>     of sets of tasks, providing both the cpuset and resource group
>     control mechanisms.  Code written for (1) or (2) should work,
>     though there is a little wiggle room for API 'refinements' if
>     need be.

That works now.

>
>  4) Someone doing (1) and (2) separately and independently on the
>     same system at the same time, with separate and independent
>     partitions (aka container hierarchies) of that system's tasks.

Right now you can't have multiple independent hierarchies - each
subsystem either has the same hierarchy as all the other subsystems,
or has just a single node and doesn't participate in the hierarchy.

>
> The initial customer needs are for (1), which preserves an existing
> kernel API, and on separate systems, for (2).  Providing for both on
> the same system, as in (3) with a single container hierarchy or even
> (4) with multiple independent hierarchies, is an enhancement.
>
> I foresee a day when user-level software, such as batch schedulers, is
> written to take advantage of (3), once the kernel supports binding
> multiple controllers to a common task container hierarchy.  Initially,
> some systems will need cpusets, and some will need resource groups, and
> the intersection of these requiring both, whether bound as in (3), or
> independent as in (4), will be pretty much empty.

I don't know about group (4), but we certainly have a big need for (3).

>
> In general then, we will have several controllers (need a good way
> for user space to list what controllers, such as cpusets and resource
> groups,

I think it's better to treat resource groups as a common framework for
resource controllers, rather than a resource controller itself.
Otherwise we'll have the same issues of wanting to treat separate
resources in separate hierarchies - by treating each RG controller as
a separate entity sharing a common resource metaphor and user API,
you get the multiple hierarchy support for free.

>
> Perhaps the interface to binding multiple controllers to a single container
> hierarchy is via multiple mount commands, each of type 'container', with
> different options specifying which controller(s) to bind.  Then the
> command 'mount -t cpuset cpuset /dev/cpuset' gets remapped to the command
> 'mount -t container -o controller=cpuset /dev/cpuset'.

Yes, that's the approach that I'm thinking of currently. It should
require only fairly robotic changes to the existing code.

One of the issues that crops up with it is what do you put in
/proc/<pid>/container if there are multiple hierarchies?

Paul


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-07 19:07                             ` Paul Menage
@ 2006-11-07 19:11                               ` Paul Jackson
  2006-11-07 19:24                                 ` Paul Menage
  2006-11-07 20:02                               ` Paul Jackson
                                                 ` (2 subsequent siblings)
  3 siblings, 1 reply; 135+ messages in thread
From: Paul Jackson @ 2006-11-07 19:11 UTC (permalink / raw)
  To: Paul Menage
  Cc: vatsa, dev, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	matthltc, dipankar, rohitseth

> This will happen if you configure CONFIG_CPUSETS_LEGACY_API

So why is this CONFIG_* option separate?  When would I ever not
want it?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-07 19:11                               ` Paul Jackson
@ 2006-11-07 19:24                                 ` Paul Menage
  2006-11-07 19:58                                   ` Paul Jackson
  0 siblings, 1 reply; 135+ messages in thread
From: Paul Menage @ 2006-11-07 19:24 UTC (permalink / raw)
  To: Paul Jackson
  Cc: vatsa, dev, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	matthltc, dipankar, rohitseth

On 11/7/06, Paul Jackson <pj@sgi.com> wrote:
> > This will happen if you configure CONFIG_CPUSETS_LEGACY_API
>
> So why is this CONFIG_* option separate?  When would I ever not
> want it?

If you weren't bothered about having the legacy semantics. The main
issue is that it adds an extra file to /proc/<pid>. I guess the other
stuff could be made unconditional without breaking anyone who didn't
try to mount cpusetfs.

Paul


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-07 19:24                                 ` Paul Menage
@ 2006-11-07 19:58                                   ` Paul Jackson
  2006-11-07 20:00                                     ` Paul Menage
  0 siblings, 1 reply; 135+ messages in thread
From: Paul Jackson @ 2006-11-07 19:58 UTC (permalink / raw)
  To: Paul Menage
  Cc: dev, vatsa, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	matthltc, dipankar, rohitseth

> > So why is this CONFIG_* option separate?  When would I ever not
> > want it?
> 
> If you weren't bothered about having the legacy semantics.

You mean if I wasn't bothered about -not- having the legacy semantics?

Let me put this another way - could you drop the
CONFIG_CPUSETS_LEGACY_API option, and make whatever is needed to
preserve the current cpuset API always present (if CPUSETS themselves
are configured, of course)?

If you're reluctant to do so, why?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-07 19:58                                   ` Paul Jackson
@ 2006-11-07 20:00                                     ` Paul Menage
  0 siblings, 0 replies; 135+ messages in thread
From: Paul Menage @ 2006-11-07 20:00 UTC (permalink / raw)
  To: Paul Jackson
  Cc: dev, vatsa, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	matthltc, dipankar, rohitseth

On 11/7/06, Paul Jackson <pj@sgi.com> wrote:
> > > So why is this CONFIG_* option separate?  When would I ever not
> > > want it?
> >
> > If you weren't bothered about having the legacy semantics.
>
> You mean if I wasn't bothered about -not- having the legacy semantics?
>
> Let me put this another way - could you drop the
> CONFIG_CPUSETS_LEGACY_API option, and make whatever is needed to
> preserve the current cpuset API always present (if CPUSETS themselves
> are configured, of course)?

Yes.

>
> If you're reluctant to do so, why?

As I said, mainly /proc pollution.

But it's not a big deal, so I can drop it unless there's a strong
argument from others in favour of keeping it.

Paul


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-07 19:07                             ` Paul Menage
  2006-11-07 19:11                               ` Paul Jackson
@ 2006-11-07 20:02                               ` Paul Jackson
  2006-11-08  2:47                                 ` Matt Helsley
  2006-11-07 20:34                               ` Paul Jackson
  2006-11-07 22:21                               ` Paul Menage
  3 siblings, 1 reply; 135+ messages in thread
From: Paul Jackson @ 2006-11-07 20:02 UTC (permalink / raw)
  To: Paul Menage
  Cc: vatsa, dev, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	matthltc, dipankar, rohitseth

Paul M wrote:
> I think it's better to treat resource groups as a common framework for
> resource controllers, rather than a resource controller itself.

You could well be right here - I was just using resource groups
as another good example of a controller.  I'll let others decide
if that's one or several controllers.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-07 19:07                             ` Paul Menage
  2006-11-07 19:11                               ` Paul Jackson
  2006-11-07 20:02                               ` Paul Jackson
@ 2006-11-07 20:34                               ` Paul Jackson
  2006-11-07 20:41                                 ` Paul Menage
  2006-11-07 22:21                               ` Paul Menage
  3 siblings, 1 reply; 135+ messages in thread
From: Paul Jackson @ 2006-11-07 20:34 UTC (permalink / raw)
  To: Paul Menage
  Cc: vatsa, dev, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	matthltc, dipankar, rohitseth

Paul M wrote:
> One of the issues that crops up with it is what do you put in
> /proc/<pid>/container if there are multiple hierarchies?

Thanks for your rapid responses - good.

How about /proc/<pid>/containers being a directory, with each
controller having one regular file entry (so long as we haven't done
the multiple controller instances in my item (5)) containing the path,
relative to some container file system mount point (which container
mount is up to user space code to track), of the container that
contains that task?

Or how about each controller type, such as cpusets, having its own
/proc/<pid>/<controller-type> file, with no generic file
/proc/<pid>/container at all.  Just extend the current model
seen in /proc/<pid>/cpuset ?

Actually, I rather like that last alternative - forcing the word
'container' into these /proc/<pid>/??? pathnames strikes me as
an exercise in branding, not in technical necessity.  But that
could just mean I am still missing a big fat clue somewhere ...

Feel free to keep hitting me with clue sticks, as need be.

It will take a while (as in a year or two) for me and others to train
all the user level code that 'knows' that cpusets are always mounted at
"/dev/cpuset" to find the mount point for the container handling
cpusets anywhere else.

I knew when I hardcoded the "/dev/cpuset" path in various places
in user space that I might need to revisit that, but my crystal
ball wasn't good enough to predict what form this generalization
would take.  So I followed one of my favorite maxims - if you can't
get it right, at least keep it stupid, simple, so that whomever does
have to fix it up has the least amount of legacy mechanism to rip out.

However this fits in nicely with my expectation that we will have
only limited need, if any, in the short term, to run systems with
both cpusets and resource groups at the same time.  Systems just
needing cpusets can jolly well continue to mount at /dev/cpuset,
in perpetuity.  Systems needing other or fancier combinations of
controllers will need to handle alternative mount points, and keep
track somehow in user space of what's mounted where.

And while we're here, how about each controller naming itself with a
well known string compiled into its kernel code, and a file such
as /proc/containers listing what controllers are known to it?  Not
surprisingly, I claim the word "cpuset" to name the cpuset controller ;)

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-07 20:34                               ` Paul Jackson
@ 2006-11-07 20:41                                 ` Paul Menage
  2006-11-07 21:50                                   ` Paul Jackson
  0 siblings, 1 reply; 135+ messages in thread
From: Paul Menage @ 2006-11-07 20:41 UTC (permalink / raw)
  To: Paul Jackson
  Cc: vatsa, dev, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	matthltc, dipankar, rohitseth

On 11/7/06, Paul Jackson <pj@sgi.com> wrote:
> How about /proc/<pid>/containers being a directory, with each
> controller having one regular file entry (so long as we haven't done
> the multiple controller instances in my item (5)) containing the path,
> relative to some container file system mount point (which container
> mount is up to user space code to track) of the container that contains
> that task?

Hmm. Seems a bit fancier than necessary, but maybe reasonable. I'll
probably start with a single file listing all the different container
associations and we can turn it into a directory later as a finishing
touch.

>
> Or how about each controller type, such as cpusets, having its own
> /proc/<pid>/<controller-type> file, with no generic file
> /proc</pid>/container at all.  Just extend the current model
> seen in /proc/<pid>/cpuset ?

Is it possible to dynamically extend the /proc/<pid>/ directory? If
not, then every container subsystem would involve a patch in
fs/proc/base.c, which seems a bit nasty.

> However this fits in nicely with my expectation that we will have
> only limited need, if any, in the short term, to run systems with
> both cpusets and resource groups at the same time.

We're currently planning on using cpusets for the memory node
isolation properties, but we have a whole bunch of other resource
controllers that we'd like to be able to hang off the same
infrastructure, so I don't think the need is that limited.

>
> And while we're here, how about each controller naming itself with a
> well known string compiled into its kernel code, and a file such
> as /proc/containers listing what controllers are known to it?  Not

The naming is already in my patch. You can tell from the top-level
directory which containers are registered, since each one has an
xxx_enabled file to control whether it's in use; there's not a
separate /proc/containers file yet.

Paul


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-07 20:41                                 ` Paul Menage
@ 2006-11-07 21:50                                   ` Paul Jackson
  0 siblings, 0 replies; 135+ messages in thread
From: Paul Jackson @ 2006-11-07 21:50 UTC (permalink / raw)
  To: Paul Menage
  Cc: vatsa, dev, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	matthltc, dipankar, rohitseth

Paul M wrote:
> Is it possible to dynamically extend the /proc/<pid>/ directory?

Not that I know of -- sounds like a nice idea for a patch.

> We're currently planning on using cpusets for the memory node
> isolation properties, but we have a whole bunch of other resource
> controllers that we'd like to be able to hang off the same
> infrastructure, so I don't think the need is that limited.

So long as you can update the code in your user space stack that
knows about this, then you should have nothing stopping you.

I've got a major (albeit not well publicized) open source user space
C library for working with cpusets which I will have to fix up.

> The naming is already in my patch. You can tell from the top-level
> directory which containers are registered, since each one has an
> xxx_enabled file to control whether it's in use;

But there are other *_enabled per-cpuset flags, not naming controllers,
so that is not a robust way to list container types.

Right now, I'm rather fond of the /proc/containers (or should it
be /proc/controllers?) idea.  Though since I don't have time to code
the patch today, I'll have to shut up.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-07 19:07                             ` Paul Menage
                                                 ` (2 preceding siblings ...)
  2006-11-07 20:34                               ` Paul Jackson
@ 2006-11-07 22:21                               ` Paul Menage
  2006-11-08  3:15                                 ` Paul Jackson
  3 siblings, 1 reply; 135+ messages in thread
From: Paul Menage @ 2006-11-07 22:21 UTC (permalink / raw)
  To: Paul Jackson
  Cc: vatsa, dev, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	matthltc, dipankar, rohitseth

On 11/7/06, Paul Menage <menage@google.com> wrote:
> > Perhaps the interface to binding multiple controllers to a single container
> > hierarchy is via multiple mount commands, each of type 'container', with
> > different options specifying which controller(s) to bind.  Then the
> > command 'mount -t cpuset cpuset /dev/cpuset' gets remapped to the command
> > 'mount -t container -o controller=cpuset /dev/cpuset'.
>
> Yes, that's the aproach that I'm thinking of currently. It should
> require pretty reasonably robotic changes to the existing code.

One drawback to this that I can see is the following:

- suppose you mount a containerfs with controllers cpuset and cpu, and
create some nodes, and then unmount it, what happens? do the resource
nodes stick around still?

- suppose you then try to mount a containerfs with controllers cpuset
and diskio, but the resource nodes are still around, what happens
then?

Is there any way to prevent unmounting (at the dentry level) a busy filesystem?

If we enforced a completely separate hierarchy for each resource
controller (i.e. one resource controller per mount), then it wouldn't
be too hard to hang the node structure off the controller itself, and
there would never be a problem with mounting two controllers with
existing inconsistent hierarchies on the same mount. But that would
rule out being able to hook several resource controllers together in
the same container node.

One alternative to this would be to have a fixed number of container
hierarchies; at mount time you'd mount hierarchy N, and optionally
bind a set of resource controllers to it, or else use the existing
set. Then the hierarchy can be hung off the appropriate entry in the
hierarchy array even when the fs isn't mounted.

Paul


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-07 20:02                               ` Paul Jackson
@ 2006-11-08  2:47                                 ` Matt Helsley
  0 siblings, 0 replies; 135+ messages in thread
From: Matt Helsley @ 2006-11-08  2:47 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Paul Menage, vatsa, dev, sekharan, ckrm-tech, balbir, haveblue,
	linux-kernel, dipankar, rohitseth

On Tue, 2006-11-07 at 12:02 -0800, Paul Jackson wrote:
> Paul M wrote:
> > I think it's better to treat resource groups as a common framework for
> > resource controllers, rather than a resource controller itself.
> 
> You could well be right here - I was just using resource groups
> as another good example of a controller.  I'll let others decide
> if that's one or several controllers.

	At various stages different controllers were available with the core
patches or separately. The numtasks, cpu, io, socket accept queue, and
memory controllers were available for early CKRM patches. More recently
(April 2006) numtasks, cpu, and memory controllers were available for
Resource Groups.

	So I'd say "several".

Cheers,
	-Matt Helsley



* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-07 22:21                               ` Paul Menage
@ 2006-11-08  3:15                                 ` Paul Jackson
  2006-11-08  4:15                                   ` Paul Menage
  2006-11-08  5:12                                   ` Srivatsa Vaddagiri
  0 siblings, 2 replies; 135+ messages in thread
From: Paul Jackson @ 2006-11-08  3:15 UTC (permalink / raw)
  To: Paul Menage
  Cc: dev, vatsa, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	matthltc, dipankar, rohitseth

Paul M wrote:
> One drawback to this that I can see is the following:
> 
> - suppose you mount a containerfs with controllers cpuset and cpu, and
> create some nodes, and then unmount it, what happens? do the resource
> nodes stick around still?

Sorry - I let interfaces get confused with objects and operations.

Let me back up a step.  I think I have beat on your proposal enough
to translate it into the more abstract terms that I prefer using when
determining objects, operations and semantics.

It goes like this ... grab a cup of coffee.

We have a container mechanism (the code you extracted from cpusets)
and the potential for one or more instantiations of that mechanism
as container file systems.

This container mechanism provides a pseudo file system structure
that names not disk files, but partitions of the set of tasks
in the system.  As always, by partition I mean a set of subsets,
covering and non-intersecting.  Each element of a partition of the
set of tasks in a system is a subset of that systems tasks.

The container mechanism gives names and permissions to the elements
(subsets of tasks) of the partition, and provides a convenient place
to attach attributes to those partition elements. The directories in
such a container file system always map one-to-one with the elements of
the partition, and the regular files in each such directory represent
the per-element (per-cpuset, for example) attributes.

Each directory in a container file system has a file called 'tasks'
listing the pids of the tasks (newline separated decimal ASCII format)
in that partition element.

Each container file system needs a name.  This corresponds to the
/dev/sda1 style raw device used to name disk based file systems
independent of where or if they are mounted.

Each task should list in its /proc/<pid> directory, for each such
named container file system in the system, the container file system
relative path of whichever directory in that container (element in
the partition it defines) that task belongs to.  (An earlier proposal
I made to have an entry for each -controller- in each /proc/<pid>
directory was bogus.)

Because containers define a partition of the tasks in a system, each
task will always be in exactly one of the partition elements of a
container file system.  Tasks are moved from one partition element
to another by writing their pid (decimal ASCII) into the 'tasks'
file of the receiving directory.
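
The move protocol can be exercised against a mock tree of ordinary
directories (group names and the pid are made up; a real container fs
would remove the pid from the old element atomically inside the kernel,
so the mock has to do that last step by hand):

```shell
# Mock hierarchy: each directory is one partition element, holding a
# 'tasks' file of newline-separated decimal pids.
root="$(mktemp -d)"
mkdir -p "$root/prof" "$root/student"
: > "$root/prof/tasks"
: > "$root/student/tasks"

# A task starts out in the 'prof' element ...
echo 1234 >> "$root/prof/tasks"

# ... and moves by writing its pid into the receiving 'tasks' file.
echo 1234 >> "$root/student/tasks"

# The kernel would drop the pid from the old element as part of that
# write; plain files force us to do it explicitly here.
grep -v '^1234$' "$root/prof/tasks" > "$root/prof/tasks.new" || true
mv "$root/prof/tasks.new" "$root/prof/tasks"
```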

For some set of events, to include at least the 'release' of a
container element, the user can request that a callout be made to
a user executable.  This carries forth a feature previously known
as 'notify_on_release.'

We have several controllers, each of which can be instantiated and
bound to a container file system.  One of these controllers provides
for NUMA processor and memory placement control, and is called cpusets.

Perhaps in the future some controllers will support multiple instances,
bound to different container file systems, at the same time.
By different here, I meant not just different mounts of the same
container file system, but different partitions that divide up the
tasks of the system in different ways.

Each controller specifies a set of attributes to be associated with
each partition element of a container.  The act of associating a
controller's attributes with partition elements I will call "binding".

We need to be able to create, destroy and list container file systems,
and for each such container file system, we need to be able to bind
and unbind controller instances thereto.

We need to be able to list what controller types exist in the system
capable of being bound to containers.  We need to be able to list
for each container file system what controllers are bound to it.

And we need to be able to mount and unmount container file systems
from specific mount point paths in the file system.

We definitely need to be able to bind more than one controller to a
given container file system at the same time.  This was my item (3)
in an earlier post today.

We might like to support multiple container file systems at one time.
This seems like a good idea to at least anticipate doing, even if it
turns out to be more work than we can deliver immediately.  This was
my item (4) in that earlier post.

We will probably have some controllers in the future that are able
to be bound to more than one container file system at the same time,
and we have now, and likely will always have, some controllers, such
as cpusets, that must be singular - at most one bound instance at a
time in the system.  This relates to my (buried) item (5) from that
earlier post.  The container code may or may not be able to support
controllers that bind to more than one file system at a time; I don't
know yet either how valuable or difficult this would be.

Overloading all these operations on the mount/umount commands seems
cumbersome, obscure and confusing.  The essential thing a mount does
is bind a kernel object (such as one of our container instances) to
a mount point (path) in the filesystem.  By the way, we should allow
for the possibility that one container instance might be mounted on
multiple points at the same time.

So it seems we need additional API's to support the creation and
destruction of containers, and binding controllers to them.

All controllers define an initial default state, and all tasks
can reference, while in that task's context in the kernel, for any
controller type built into the system (or loadable module ?!), the
per-task state of that controller, getting at least this default state
even if the controller is not bound.

If a controller is not bound to any container file system, or
immediately after such a binding, before any of its per-container
attribute files have been modified via the container file system API,
the state of the controller as seen by a task will be this default state.
When a controller is unbound, then the state it presented to each
task in the system reverts to this default state.
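
A toy model of this default-state rule (all names invented; not a
proposal for the actual kernel data structures):

```python
# Sketch of the default-state semantics: a task's view of a controller
# attribute is the built-in default unless the controller is bound and
# the task's partition element carries an override.  Names invented.

class Controller:
    def __init__(self, defaults):
        self.defaults = dict(defaults)
        self.bound = False
        self.per_element = {}     # element name -> attribute overrides

    def bind(self):
        self.bound = True

    def unbind(self):
        self.bound = False
        self.per_element.clear()  # every task reverts to the defaults

    def state_for(self, element, attr):
        if self.bound:
            overrides = self.per_element.get(element, {})
            if attr in overrides:
                return overrides[attr]
        return self.defaults[attr]

cpu = Controller({"shares": 1024})
print(cpu.state_for("profs", "shares"))   # 1024: unbound -> default
cpu.bind()
cpu.per_element["profs"] = {"shares": 2048}
print(cpu.state_for("profs", "shares"))   # 2048: bound, overridden
cpu.unbind()
print(cpu.state_for("profs", "shares"))   # 1024: back to the default
```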

Container file systems can be unmounted and remounted all the
while retaining their partitioning and any binding to controllers.
Unmounting a container file system just retracts the API mechanism
required to query and manipulate the partitioning and the state per
partition element of bound controllers.

A basic scenario exemplifying these operations might go like this
(notice I've still given no hint of some of the API's involved):

  1) Given a system with controllers Con1, Con2 and Con3, list them.
  2) List the currently defined container file systems, finding none.
  3) Define a container file system CFS1.
  4) Bind controller Con2 to CFS1.
  5) Mount CFS1 on /dev/container.
  6) Bind controller Con3 to CFS1.
  7) List the currently defined container file systems, finding CFS1.
  8) List the controllers bound to CFS1, finding Con2 and Con3.
  9) Mount CFS1 on a second mount point, say /foo/bar/container.
     This gives us two pathnames to refer to the same thing.
 10) Refine and modify the partition defined by CFS1, by making
     subdirectories and moving tasks about.
 11) Define a second container file system - this might fail if our
     implementation doesn't support multiple container file systems
     at the same time yet.  Call this CFS2.
 12) Bind controller Con1 to CFS2.  This should work.
 13) Mount CFS2 on /baz.
 14) Bind controller Con2 to CFS2.  This may well fail if that
     controller must be singular.
 15) Unbind controller Con2 from CFS1.  After this, any task referencing
     its Con2 controller will find the minimal default state.
 16) If (14) failed, try it again.  We should be able to bind Con2 to
     CFS2 now, if not earlier.
 17) List the mount points in the system (cat /proc/mounts).  Observe
     two entries of type container, for CFS1 and CFS2.
 18) List the controllers bound to CFS2, finding Con1 and Con2.
 19) Unmount CFS2.  Its structure remains, however one lacks any API to
     observe this.
 20) List the controllers bound to CFS2 - still Con1 and Con2.
 21) Remount CFS2 on /bornagain.
 22) Observe its structure and the binding of Con1 and Con2 to it remain.
 23) Unmount CFS2 again.
 24) Ask to delete CFS2 - this fails because it has controllers
     bound to it.
 25) Unbind controllers Con1 and Con2 from CFS2.
 26) Ask to delete CFS2 - this succeeds this time.
 27) List the currently defined container file systems, once again
     finding just CFS1.
 28) List the controllers bound to CFS1, finding just Con3.
 29) Examine the regular files in the directory /dev/container, where
     CFS1 is currently mounted.  Find the files representing the
     attributes of controller Con3.

If you indulged in enough coffee to stay awake through all that,
you noticed that I invented some rules on what would or would not
work in certain situations.  For example, I decreed in (24) that one
could not delete a container file system if it had any controllers
bound to it.  I just made these rules up ...
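
Two of those made-up rules - a singular controller binds to at most
one container file system, and a container file system cannot be
deleted while controllers remain bound - can be modeled in a few lines
(Python purely as pseudocode, all names invented):

```python
# Toy simulation of two of the decreed rules.  Nothing here reflects
# real kernel code; the names are invented for illustration.

class BindError(Exception):
    pass

class Ctrl:
    def __init__(self, name, singular=False):
        self.name, self.singular, self.bound_to = name, singular, set()

class CFS:
    def __init__(self, name):
        self.name, self.controllers = name, set()

def bind(ctrl, cfs):
    if ctrl.singular and ctrl.bound_to:
        raise BindError(f"{ctrl.name} is singular and already bound")
    ctrl.bound_to.add(cfs.name)
    cfs.controllers.add(ctrl.name)

def unbind(ctrl, cfs):
    ctrl.bound_to.discard(cfs.name)
    cfs.controllers.discard(ctrl.name)

def delete(cfs):
    if cfs.controllers:
        raise BindError(f"{cfs.name} still has controllers bound")

con2 = Ctrl("Con2", singular=True)
cfs1, cfs2 = CFS("CFS1"), CFS("CFS2")
bind(con2, cfs1)
try:
    bind(con2, cfs2)         # binding a singular controller twice fails
except BindError as e:
    print(e)
unbind(con2, cfs1)
bind(con2, cfs2)             # succeeds once unbound from CFS1
try:
    delete(cfs2)             # deleting with a controller bound fails
except BindError as e:
    print(e)
unbind(con2, cfs2)
delete(cfs2)                 # succeeds once nothing is bound
print("CFS2 deleted")
```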

I find it usually works best if I turn the objects and operations
around in my head a bit, before inventing API's to realize them.

So I don't yet have any firmly jelled views on what the additional
API's to manipulate container file systems and controller binding
should look like.

Perhaps someone else will beat me to it.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-08  3:15                                 ` Paul Jackson
@ 2006-11-08  4:15                                   ` Paul Menage
  2006-11-08  4:16                                     ` Paul Jackson
  2006-11-08 11:22                                     ` Paul Jackson
  2006-11-08  5:12                                   ` Srivatsa Vaddagiri
  1 sibling, 2 replies; 135+ messages in thread
From: Paul Menage @ 2006-11-08  4:15 UTC (permalink / raw)
  To: Paul Jackson
  Cc: dev, vatsa, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	matthltc, dipankar, rohitseth

On 11/7/06, Paul Jackson <pj@sgi.com> wrote:
> Paul M wrote:
> > One drawback to this that I can see is the following:
> >
> > - suppose you mount a containerfs with controllers cpuset and cpu, and
> > create some nodes, and then unmount it, what happens? do the resource
> > nodes stick around still?
>
> Sorry - I let interfaces get confused with objects and operations.
>
> Let me back up a step.  I think I have beat on your proposal enough
> to translate it into the more abstract terms that I prefer using when
> determining objects, operations and semantics.
>
> It goes like this ... grab a cup of coffee.
>

That's pretty much what I was envisioning, except for the fact that I
was trying to fit the controller/container bindings into the same
mount/umount interface. I still think that might be possible with
judicious use of mount options, but if not we should probably just use
configfs or something like that as a binding API.

Paul


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-08  4:15                                   ` Paul Menage
@ 2006-11-08  4:16                                     ` Paul Jackson
  2006-11-08 11:22                                     ` Paul Jackson
  1 sibling, 0 replies; 135+ messages in thread
From: Paul Jackson @ 2006-11-08  4:16 UTC (permalink / raw)
  To: Paul Menage
  Cc: dev, vatsa, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	matthltc, dipankar, rohitseth

> That's pretty much what I was envisioning, 

Good.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-08  3:15                                 ` Paul Jackson
  2006-11-08  4:15                                   ` Paul Menage
@ 2006-11-08  5:12                                   ` Srivatsa Vaddagiri
  2006-11-08  5:36                                     ` Paul Jackson
  2006-11-08 19:25                                     ` Chris Friesen
  1 sibling, 2 replies; 135+ messages in thread
From: Srivatsa Vaddagiri @ 2006-11-08  5:12 UTC (permalink / raw)
  To: Paul Jackson
  Cc: Paul Menage, dev, sekharan, ckrm-tech, balbir, haveblue,
	linux-kernel, matthltc, dipankar, rohitseth

On Tue, Nov 07, 2006 at 07:15:18PM -0800, Paul Jackson wrote:
> It goes like this ... grab a cup of coffee.

Thanks for the nice and big writeup!

> Each directory in a container file system has a file called 'tasks'
> listing the pids of the tasks (newline separated decimal ASCII format)
> in that partition element.

As was discussed in a previous thread, having a 'threads' file also will
be good.

	http://lkml.org/lkml/2006/11/1/386

> Because containers define a partition of the tasks in a system, each
> task will always be in exactly one of the partition elements of a
> container file system.  Tasks are moved from one partition element
> to another by writing their pid (decimal ASCII) into the 'tasks'
> file of the receiving directory.

Writing to 'tasks' file will move that single thread to the new
container. Writing to 'threads' file will move all the threads of the
process into the new container.

-- 
Regards,
vatsa


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-08  5:12                                   ` Srivatsa Vaddagiri
@ 2006-11-08  5:36                                     ` Paul Jackson
  2006-11-09  5:39                                       ` Balbir Singh
  2006-11-08 19:25                                     ` Chris Friesen
  1 sibling, 1 reply; 135+ messages in thread
From: Paul Jackson @ 2006-11-08  5:36 UTC (permalink / raw)
  To: vatsa
  Cc: dev, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	matthltc, dipankar, rohitseth, menage

Srivatsa wrote:
> As was discussed in a previous thread, having a 'threads' file also will
> be good.
> 
> 	http://lkml.org/lkml/2006/11/1/386
> 
> Writing to 'tasks' file will move that single thread to the new
> container. Writing to 'threads' file will move all the threads of the
> process into the new container.

Yup - agreed.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-08  4:15                                   ` Paul Menage
  2006-11-08  4:16                                     ` Paul Jackson
@ 2006-11-08 11:22                                     ` Paul Jackson
  1 sibling, 0 replies; 135+ messages in thread
From: Paul Jackson @ 2006-11-08 11:22 UTC (permalink / raw)
  To: Paul Menage
  Cc: dev, vatsa, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	matthltc, dipankar, rohitseth

Paul M wrote:
> except for the fact that I
> was trying to fit the controller/container bindings into the same
> mount/umount interface.

Of course, if you come up with an API using mount for this stuff
that looks nice and intuitive, don't hesitate to propose it.

I don't have any fundamental opposition to just using mount options
here; just a pretty strong guess that it won't be very intuitive
by the time all the necessary operations are encoded.

And this sort of abstractified pseudo meta containerized code is
just the sort of thing that drives normal humans up a wall, or
should I say, into a fog of confusion.

Not only is it worth a bit of work getting the abstractions right,
as you have noted, it's also worth a bit of work to try to get the
API as transparent as we are able.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.925.600.0401


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-08  5:12                                   ` Srivatsa Vaddagiri
  2006-11-08  5:36                                     ` Paul Jackson
@ 2006-11-08 19:25                                     ` Chris Friesen
  2006-11-09  3:54                                       ` Srivatsa Vaddagiri
  1 sibling, 1 reply; 135+ messages in thread
From: Chris Friesen @ 2006-11-08 19:25 UTC (permalink / raw)
  To: vatsa
  Cc: Paul Jackson, Paul Menage, dev, sekharan, ckrm-tech, balbir,
	haveblue, linux-kernel, matthltc, dipankar, rohitseth

Srivatsa Vaddagiri wrote:

> As was discussed in a previous thread, having a 'threads' file also will
> be good.
> 
> 	http://lkml.org/lkml/2006/11/1/386

> Writing to 'tasks' file will move that single thread to the new
> container. Writing to 'threads' file will move all the threads of the
> process into the new container.

That's exactly backwards to the proposal that you linked to.

Chris


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-08 19:25                                     ` Chris Friesen
@ 2006-11-09  3:54                                       ` Srivatsa Vaddagiri
  0 siblings, 0 replies; 135+ messages in thread
From: Srivatsa Vaddagiri @ 2006-11-09  3:54 UTC (permalink / raw)
  To: Chris Friesen
  Cc: dev, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	Paul Jackson, matthltc, dipankar, rohitseth, Paul Menage

On Wed, Nov 08, 2006 at 01:25:18PM -0600, Chris Friesen wrote:
> Srivatsa Vaddagiri wrote:
> 
> > As was discussed in a previous thread, having a 'threads' file also will
> > be good.
> > 
> > 	http://lkml.org/lkml/2006/11/1/386
> 
> > Writing to 'tasks' file will move that single thread to the new
> > container. Writing to 'threads' file will move all the threads of the
> > process into the new container.
> 
> That's exactly backwards to the proposal that you linked to.

Oops ..yes. Thanks for correcting me!

-- 
Regards,
vatsa


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-08  5:36                                     ` Paul Jackson
@ 2006-11-09  5:39                                       ` Balbir Singh
  0 siblings, 0 replies; 135+ messages in thread
From: Balbir Singh @ 2006-11-09  5:39 UTC (permalink / raw)
  To: Paul Jackson
  Cc: vatsa, dev, sekharan, ckrm-tech, haveblue, linux-kernel, matthltc,
	dipankar, rohitseth, menage

Paul Jackson wrote:
> Srivatsa wrote:
>> As was discussed in a previous thread, having a 'threads' file also will
>> be good.
>>
>> 	http://lkml.org/lkml/2006/11/1/386
>>
>> Writing to 'tasks' file will move that single thread to the new
>> container. Writing to 'threads' file will move all the threads of the
>> process into the new container.
> 
> Yup - agreed.
> 


Referring to the discussion at

	http://lkml.org/lkml/2006/10/31/210

Which lead to

	http://lkml.org/lkml/2006/11/1/101

If OpenVZ is ok with the notify_on_release approach, we can close in on
any further objections to the containers approach of implementing resource
grouping and be open to ideas for extending and enhancing it :)

-- 

	Balbir Singh,
	Linux Technology Center,
	IBM Software Labs


* Re: [ckrm-tech] [RFC] Resource Management - Infrastructure choices
  2006-11-06 20:23                         ` Paul Menage
  2006-11-07 13:20                           ` Srivatsa Vaddagiri
  2006-11-07 18:41                           ` Paul Jackson
@ 2006-11-10 14:57                           ` Srivatsa Vaddagiri
  2 siblings, 0 replies; 135+ messages in thread
From: Srivatsa Vaddagiri @ 2006-11-10 14:57 UTC (permalink / raw)
  To: Paul Menage
  Cc: dev, sekharan, ckrm-tech, balbir, haveblue, linux-kernel,
	Paul Jackson, matthltc, dipankar, rohitseth

On Mon, Nov 06, 2006 at 12:23:44PM -0800, Paul Menage wrote:
> > Secondly, regarding how separate grouping per-resource *maybe* usefull,
> > consider this scenario.
> >
> > A large university server has various users - students, professors,
> > system tasks etc. The resource planning for this server could be on these lines:
> >
> >         CPU :           Top cpuset
> >                         /       \
> >                 CPUSet1         CPUSet2
> >                    |              |
> >                 (Profs)         (Students)
> >
> >                 In addition (system tasks) are attached to topcpuset (so
> >                 that they can run anywhere) with a limit of 20%
> >
> >         Memory : Professors (50%), students (30%), system (20%)
> >
> >         Disk : Prof (50%), students (30%), system (20%)
> >
> >         Network : WWW browsing (20%), Network File System (60%), others (20%)
> >                                 / \
> >                         Prof (15%) students (5%)

Lets say that network resource controller supports only one level
hierarchy, and hence you can only split it as:

        Network : WWW browsing (20%), Network File System (60%), others (20%)

> > Browsers like firefox/lynx go into the WWW network class, while (k)nfsd go
> > into NFS network class.
> >
> > At the same time firefox/lynx will share an appropriate CPU/Memory class
> > depending on who launched it (prof/student).
> >
> > If we had the ability to write pids directly to these resource classes,
> > then admin can easily setup a script which receives exec notifications
> > and depending on who is launching the browser he can
> >
> >         # echo browser_pid > approp_resource_class
> >
> > With your proposal, he now would have to create a separate container for
> > every browser launched and associate it with approp network and other
> > resource class. This may lead to proliferation of such containers.
> 
> Or create one container per combination (so in this case four,
> prof/www, prof/other, student/www, student/other) - then processes can
> be moved between the containers to get the appropriate qos of each
> type.
> 
> So the setup would look something like:
> 
> top-level: prof vs student vs system, with new child nodes for cpu,
> memory and disk, and no  new node for network
> 
> second-level, within the prof and student classes: www vs other, with
> new child nodes for network, and no new child nodes for cpu.
> 
> In terms of the commands to set it up, it might look like (from the top-level)
> 
> echo network > inherit
> mkdir prof student system
> echo disk,cpu,memory > prof/inherit
> mkdir prof/www prof/other
> echo disk,cpu,memory > student/inherit
> mkdir student/www student/other

By these commands, we would forcibly split the WWW bandwidth of 20%
between prof/www and student/www, when it was actually not needed (as
per the new requirement above). This forced split may be fine for a renewable 
resource like network bandwidth, but would be inconvenient for something like 
RSS, disk quota etc.

(I thought of a scheme where you can avoid this forced split by
maintaining soft/hard links to resource nodes from the container nodes.
Essentially each resource can have its own hierarchy of resource nodes.
Each resource node provides allocation information like min/max shares.
Container nodes point to one or more such resource nodes, implemented
as soft/hard links. This will avoid the forced split I mentioned above.
But I suspect we will run into atomicity issues again when modifying the 
container hierarchy).
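
A rough sketch of that linking scheme (invented names, Python purely
as illustration): two container nodes share one network resource node
while holding distinct cpu nodes, so the www share is not forcibly
split:

```python
# Toy illustration of container nodes linking to per-resource nodes.
# Each resource keeps its own hierarchy of resource nodes; a container
# node holds one link per resource type.  All names are invented.

class ResNode:
    def __init__(self, name, share):
        self.name, self.share = name, share

net_www  = ResNode("www", 20)          # one shared network node
cpu_prof = ResNode("prof-cpu", 50)
cpu_stud = ResNode("student-cpu", 30)

# Container nodes "link" to resource nodes, like soft/hard links.
prof_www    = {"cpu": cpu_prof, "network": net_www}
student_www = {"cpu": cpu_stud, "network": net_www}

# Both containers see the full 20% www share -- no forced split --
# while their cpu allocations stay distinct.
print(prof_www["network"] is student_www["network"])   # True
print(prof_www["network"].share)                       # 20
```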

Essentially, by restricting ourselves to a single hierarchy, we lose
the flexibility of "viewing" each resource's usage differently
(network by traffic type, cpu by users, etc).

In practice, I believe most workload management tools would be fine
living with this restriction.  AFAIK containers can also use this
model without much loss of flexibility.  But if you are considering
long-term user-interface stability, then this is something I would
definitely think hard about.

-- 
Regards,
vatsa

