* [RFC] Resource Management - Infrastructure choices
@ 2006-10-30 10:33 Srivatsa Vaddagiri
From: Srivatsa Vaddagiri @ 2006-10-30 10:33 UTC (permalink / raw)
  To: dev, sekharan, menage
  Cc: pj, akpm, ckrm-tech, rohitseth, balbir, dipankar, matthltc,
	haveblue, linux-kernel

Over the last couple of months, we have seen a number of proposals for
resource management infrastructure/controllers and also good discussions
surrounding those proposals. These discussions have resulted in a few
consensus points and a few other points that are still being debated.

This RFC is an attempt to:

	o summarize various proposals to date for infrastructure

	o summarize consensus/debated points for infrastructure

	o (more importantly) get various stakeholders to agree on what is a
	  good compromise for infrastructure going forward

A couple of questions that I am trying to address in this RFC:

	- Do we wait till controllers are worked out before merging
	  any infrastructure?

		IMHO, it's good if we can merge some basic infrastructure now
		and incrementally enhance it and add controllers based on it. 
		This perspective leads to the second question below ..

	- Paul Menage's patches present a rework of existing code, which makes 
	  it simpler to get it in. Does it meet container (Openvz/Linux
	  VServer) and resource management requirements?

		Paul has ported over the CKRM code on top of his patches. So I 
		am optimistic that it meets resource management requirements in 
		general.

		One shortcoming I have seen in it is that it lacks an
		efficient method to retrieve tasks associated with a group.
		This may be needed by a few controller implementations if they
		have to support, say, change of resource limits. This,
		however, could I think be addressed quite easily (by a linked
		list hanging off each container structure).
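
		A minimal userspace sketch of the linked-list idea above,
		assuming simplified stand-in structures. The names
		`struct container`, `struct task`, `container_attach` and
		`container_for_each_task` are hypothetical and are not taken
		from the container patches; a kernel version would presumably
		use list_head and appropriate locking.

```c
#include <assert.h>
#include <stddef.h>

struct container;

/* Hypothetical, simplified stand-ins; not the structures from the patches. */
struct task {
	int pid;
	struct container *group;   /* group the task currently belongs to */
	struct task *next, *prev;  /* links in the owning group's task list */
};

struct container {
	struct task *tasks;        /* head of the per-group task list */
	unsigned long limit;       /* example resource limit */
};

/* Unlink a task from its current group's list, if it has one. */
static void container_detach(struct task *t)
{
	struct container *c = t->group;

	if (!c)
		return;
	if (t->prev)
		t->prev->next = t->next;
	else
		c->tasks = t->next;
	if (t->next)
		t->next->prev = t->prev;
	t->next = t->prev = NULL;
	t->group = NULL;
}

/* Move a task into group c, inserting it at the head of c's list. */
static void container_attach(struct container *c, struct task *t)
{
	assert(c != NULL && t != NULL);
	container_detach(t);
	t->group = c;
	t->prev = NULL;
	t->next = c->tasks;
	if (c->tasks)
		c->tasks->prev = t;
	c->tasks = t;
}

/*
 * Visit only the tasks of one group (e.g. after its limit changed),
 * without scanning every task in the system. Returns the task count.
 */
static int container_for_each_task(struct container *c,
				   void (*fn)(struct task *, void *),
				   void *arg)
{
	int n = 0;

	for (struct task *t = c->tasks; t; t = t->next) {
		if (fn)
			fn(t, arg);
		n++;
	}
	return n;
}
```

		With such a list, a controller reacting to a limit change
		walks only the affected group's tasks, at the cost of two
		extra pointers per task and list maintenance on every move.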

Resource Management - Goals
---------------------------

Develop mechanisms for isolating use of shared resources, such as CPU and
memory, between various applications. This includes:

	- mechanism to group tasks by some attribute (ex: containers, 
	  CKRM/RG class, cpuset etc)

	- mechanism to monitor and control usage of a variety of resources by 
	  such groups of tasks

Resources to be managed:

	- Memory, CPU and disk I/O bandwidth (of high interest perhaps)
	- network bandwidth, number of tasks/file-descriptors/sockets etc.


Proposals to date for infrastructure
------------------------------------

	- CKRM/RG
	- UBC
	- Container implementation (by Paul Menage) based on generalization of 	
	  cpusets.


A. Class-based Kernel Resource Management/Resource Groups

	Framework to monitor/control use of various resources by a group of 
	tasks as per specified guarantee/limits.

	Provides a config-fs based interface to:

		- create/delete task-groups
		- allow a task to change its (or some other task's) association 
		  from one group to other (provided it has the right 
		  privileges). New children of the affected task inherit the 
		  same group association.
		- list tasks present in a group (A group can exist without any 
		  tasks associated with it)
		- specify group's min/max use of various resources. A special 
		  value "DONT_CARE" specifies that the group doesn't care for 
		  how much resource it gets.
		- obtain resource usage statistics
		- Supports hierarchy depth of 1 (??)

	In addition to this user-interface, it provides a framework for 
	controllers to:

		- register/deregister themselves
		- be notified about changes in resource allocation for a group
		- be notified about task movements between groups
		- be notified about creation/deletion of groups
		- know which group a task belongs to
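
	A rough sketch of what such a controller framework could look like,
	assuming a fixed-size table of callback structures. Every name here
	(`struct res_controller`, `register_controller`, `notify_task_moved`,
	`counting_controller`) is hypothetical and chosen for illustration;
	none of it is the actual CKRM/RG API.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CONTROLLERS 8

struct task;        /* opaque in this sketch */
struct task_group;  /* opaque in this sketch */

/* Hypothetical callback table; names are illustrative, not the CKRM/RG API. */
struct res_controller {
	const char *name;
	void (*group_created)(struct task_group *g);
	void (*group_deleted)(struct task_group *g);
	void (*alloc_changed)(struct task_group *g);
	void (*task_moved)(struct task *t, struct task_group *from,
			   struct task_group *to);
};

static struct res_controller *controllers[MAX_CONTROLLERS];

/* Register a controller; returns its slot, or -1 if the table is full. */
static int register_controller(struct res_controller *rc)
{
	assert(rc != NULL);
	for (int i = 0; i < MAX_CONTROLLERS; i++) {
		if (!controllers[i]) {
			controllers[i] = rc;
			return i;
		}
	}
	return -1;
}

/* Deregister a controller; returns 0 on success, -1 if not registered. */
static int unregister_controller(struct res_controller *rc)
{
	if (!rc)
		return -1;
	for (int i = 0; i < MAX_CONTROLLERS; i++) {
		if (controllers[i] == rc) {
			controllers[i] = NULL;
			return 0;
		}
	}
	return -1;
}

/* Tell every interested controller that a task changed groups. */
static int notify_task_moved(struct task *t, struct task_group *from,
			     struct task_group *to)
{
	int notified = 0;

	for (int i = 0; i < MAX_CONTROLLERS; i++) {
		if (controllers[i] && controllers[i]->task_moved) {
			controllers[i]->task_moved(t, from, to);
			notified++;
		}
	}
	return notified;
}

/* Example controller that only counts task movements. */
static int moves_seen;

static void count_move(struct task *t, struct task_group *from,
		       struct task_group *to)
{
	(void)t; (void)from; (void)to;
	moves_seen++;
}

static struct res_controller counting_controller = {
	.name = "count",
	.task_moved = count_move,
};
```

	Callbacks a controller leaves NULL are simply skipped, so a
	controller only pays for the events it cares about.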

B. UBC

	Framework to account and limit usage of various resources by a 
	container (group of tasks).

	Provides a system call based interface to:

		- set a task's beancounter id. If the id does not exist, a new 
		  beancounter object is created
		- change a task's association from one beancounter to other
		- return beancounter id to which the calling task belongs
		- set limits of consumption of a particular resource by a 
		  beancounter
		- return statistics information for a given beancounter and 
		  resource.


	Provides a framework for controllers to:

		- register various resources
		- lookup beancounter object given a particular id
		- charge/uncharge usage of some resource to a beancounter by 
	 	  some amount
			- also know if the resulting usage is above the allowed 
			  soft/hard limit.
		- change a task's accounting beancounter (useful in, say,
		  interrupt handling)
		- know when the resource limits change for a beancounter
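
	The charge/uncharge step with soft and hard limits can be sketched
	as follows. This is an illustration under assumed names
	(`struct bc_resource`, `bc_charge`, `bc_uncharge`, `failcnt`), not
	the UBC code: a charge that would cross the hard limit is refused,
	while one that crosses only the soft limit succeeds and is reported.

```c
#include <assert.h>

enum charge_result { CHARGE_OK, CHARGE_OVER_SOFT, CHARGE_DENIED };

/* Hypothetical per-resource accounting state of one beancounter. */
struct bc_resource {
	unsigned long held;        /* current usage */
	unsigned long soft_limit;  /* may be exceeded, but is reported */
	unsigned long hard_limit;  /* never exceeded by a charge */
	unsigned long failcnt;     /* number of refused charges */
};

/* Try to charge `amount` units; usage is unchanged if the charge is denied. */
static enum charge_result bc_charge(struct bc_resource *r, unsigned long amount)
{
	assert(r->soft_limit <= r->hard_limit);

	/* Written this way to avoid overflow in held + amount. */
	if (r->held > r->hard_limit || amount > r->hard_limit - r->held) {
		r->failcnt++;
		return CHARGE_DENIED;
	}
	r->held += amount;
	return r->held > r->soft_limit ? CHARGE_OVER_SOFT : CHARGE_OK;
}

/* Give back `amount` units, clamping so usage never goes below zero. */
static void bc_uncharge(struct bc_resource *r, unsigned long amount)
{
	if (amount > r->held)
		amount = r->held;
	r->held -= amount;
}
```

	A caller that gets CHARGE_OVER_SOFT could, for instance, start
	reclaim for that group; what to do there is a policy choice this
	sketch does not make.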

C. Paul Menage's container patches

	Provides a generic hierarchical process grouping mechanism based on 
	cpusets, which can be used for resource management purposes.

	Provides a filesystem-based interface to:

		- create/destroy containers
		- change a task's association from one container to other
		- retrieve all the tasks associated with a container
		- know which container a task belongs to (from /proc)
		- know when the last task belonging to a container has exited


Consensus/Debated Points
------------------------

Consensus:

	- Provide resource control over a group of tasks 
	- Support movement of task from one resource group to another
	- Don't support hierarchy for now
	- Support limit (soft and/or hard depending on the resource
	  type) in controllers. Guarantee feature could be indirectly
	  met through limits.

Debated:
	- syscall vs configfs interface
	- Interaction of resource controllers, containers and cpusets
		- Should we support, for instance, creation of resource
		  groups/containers under a cpuset?
	- Should we have different groupings for different resources?
	- Support movement of all threads of a process from one group
	  to another atomically?

-- 
Regards,
vatsa

* RFC: Memory Controller
@ 2006-10-30 15:57 Balbir Singh
From: Balbir Singh @ 2006-10-30 15:57 UTC (permalink / raw)
  To: linux-mm

I missed linux-mm in the cc of the first post.

Balbir

----- Forwarded message from Balbir Singh <balbir@in.ibm.com> -----

We've seen a lot of discussion lately on the memory controller. The RFC below
provides a summary of the discussions so far. The goal of this RFC is to bring
together the thoughts so far, build consensus and agree on a path forward.

NOTE: I have tried to keep the information as accurate and current as possible.
Please point out any omissions/corrections if you notice them. I would like to
keep this summary document accurate, current and live.

Summary of Memory Controller Discussions and Patches

1. Accounting

The patches submitted so far agree that the following memory
should be accounted for

Reclaimable memory

(i)   Anonymous pages - Anonymous pages are pages allocated by the user space,
      they are mapped into the user page tables, but not backed by a file.
(ii)  File mapped pages - File mapped pages map a portion of a file
(iii) Page Cache Pages - Consists of the following

    (a) Pages used during IPC using shmfs
    (b) Pages of a user mode process that are swapped out
    (c) Pages from block read/write operations
    (d) Pages from file read/write operations

Non Reclaimable memory

This memory is not reclaimable until it is explicitly released by the
allocator. Examples of such memory include slab allocated memory and
memory allocated by the kernel components in process context. mlock()'ed
memory is also considered as non-reclaimable, but it is usually handled
as a separate resource.

(i)  Slabs
(ii) Kernel pages and page_tables allocated on behalf of a task.

2. Control considerations for the memory controller

Control can be implemented using either

(i)  Limits
     Limits restrict the usage of the resource to the specified value. If the
     resource usage crosses the limit, then the group might be penalized
     or restricted. Soft limits can be exceeded by the group as long as
     the resource is still available. Hard limits are usually the cut-off
     point: no additional resources may be allocated beyond the hard limit.

(ii) Guarantees
     Guarantees come in two forms

     (a) Soft guarantees are a best-effort service to provide the group
      with the specified guarantee of resource availability. In this form
      resources can be shared among groups (the unutilized resources of one
      group can be used by other groups), and groups are allowed to
      exceed their guarantee when the resource is available (that is, when
      no other group is unable to meet its guarantee). When a group is unable
      to meet its guarantee, the system tries to provide it with its
      guaranteed resources by trying to reclaim from other groups that
      have exceeded their guarantee. If, in spite of its best effort, the
      system is unable to meet the specified guarantee, the guarantee
      failed statistic of the group is incremented. This form of guarantee
      is best suited for non-reclaimable resources.

     (b) Hard guarantees are a more deterministic method of providing QoS.
     Resources need to be allocated in advance, to ensure that the group
     is always able to meet its guarantee. This form is undesirable as
     it leads to resource underutilization. Another approach is to
     allow sharing of resources, but when a group is unable to meet its
     guarantee, the system will OOM kill a group that exceeds its
     guarantee.  Hard guarantees are more difficult to provide for
     non-reclaimable resources, but might be easier to provide for
     reclaimable resources.

NOTE: It has been argued that guarantees can be implemented using
limits. See http://wiki.openvz.org/Guarantees_for_resources
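
One way to read the "guarantees through limits" argument, as far as I
understand it: if every other group is capped by a hard limit, then a group
is implicitly guaranteed whatever the others cannot take, i.e. the total
minus the sum of the other groups' limits (and never more than its own
limit). The sketch below computes that value; `implied_guarantee` and its
parameters are hypothetical names, and the formula is my reading of the
argument, not code from any of the patch sets.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Guarantee implied for group i when every group j has hard limit
 * limits[j] and `total` units of the resource exist:
 *
 *     min(limits[i], total - sum of limits[j] for j != i)
 *
 * Returns 0 when the other groups' limits alone can exhaust the resource.
 */
static unsigned long implied_guarantee(const unsigned long *limits, size_t n,
				       size_t i, unsigned long total)
{
	unsigned long others = 0;   /* invariant: others < total, or both 0 */
	unsigned long left;

	assert(i < n);
	for (size_t j = 0; j < n; j++) {
		if (j == i)
			continue;
		/* Compare before adding so the sum cannot overflow. */
		if (limits[j] >= total - others)
			return 0;
		others += limits[j];
	}
	left = total - others;
	return left < limits[i] ? left : limits[i];
}
```

For example, with 1000 units and limits of 300, 200 and 400, the first group
is implicitly guaranteed its full 300, since the other two can take at most
600. With two groups limited to 600 each, either one is only guaranteed 400.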

3. Memory Controller Alternatives

(i)   Beancounters
(ii)  Containers
(iii) Resource groups (aka CKRM)
(iv)  Fake Nodes

+----+---------+------+---------+------------+----------------+-----------+
| No |Guarantee| Limit| User I/F| Controllers| New Controllers|Statistics |
+----+---------+------+---------+------------+----------------+-----------+
| i  |  No     | Yes  | syscall | Memory     | No framework   |   Yes     |
|    |         |      |         |            | to write new   |           |
|    |         |      |         |            | controllers    |           |
+----+---------+------+---------+------------+----------------+-----------+
|ii  |  No     | Yes  | configfs| Memory,    | Plans to       |   Yes     |
|    |         |      |         | task limit.| provide a      |           |
|    |         |      |         | Plans to   | framework      |           |
|    |         |      |         | allow      | to write new   |           |
|    |         |      |         | CPU and I/O| controllers    |           |
+----+---------+------+---------+------------+----------------+-----------+
|iii |  Yes    | Yes  | configfs| CPU, task  | Provides a     |   Yes     |
|    |         |      |         | limit &    | framework to   |           |
|    |         |      |         | Memory     | add new        |           |
|    |         |      |         | controller.| controllers    |           |
|    |         |      |         | I/O        |                |           |
|    |         |      |         | controller |                |           |
|    |         |      |         | for older  |                |           |
|    |         |      |         | revisions  |                |           |
+----+---------+------+---------+------------+----------------+-----------+

4. Existing accounting

a. Beancounters currently account for the following resources

(i)   kmemsize - memory obtained through alloc_pages() with __GFP_BC flag set.
(ii)  physpages - Resident set size of the tasks in the group.
      Reclaim support is provided for this resource.
(iii) lockedpages - User pages locked in memory
(iv)  slabs - slabs allocated with kmem_cache_alloc_bc are accounted and
      controlled.

Beancounters provide some support for event notification (limit/barrier hit).

b. Containers account for the following resources

(i)   mapped pages
(ii)  anonymous pages
(iii) file pages (from the page cache)
(iv)  active pages

There is some support for reclaiming pages, the code is in the early stages of
development.

c. CKRM/RG Memory Controller

(i)   Tracks active pages
(ii)  Supports reclaim of LRU pages
(iii) Shared pages are not tracked

This controller provides its own res_zone, to aid reclaim and tracking of pages.

d. Fake NUMA Nodes

This approach was suggested while discussing the memory controller

Advantages

(i)   Accounting for zones is already present
(ii)  Reclaim code can directly deal with zones

Disadvantages

(i)   The approach leads to hard partitioning of memory.
(ii)  It's complex to resize the node. Resizing is required to allow
      change of limits for resource management.
(iii) Addition/deletion of a resource group would require memory hotplug
      support to add/delete a node. On deletion of a node, its memory is
      not utilized until a new node of the same or lesser size is created.
      Addition of a node requires reserving memory for it upfront.

5. Open issues

(i)    Can we allow threads belonging to the same process to belong
       to two different resource groups? Does this mean we need to do per-thread
       VM accounting now?
(ii)   There is an overhead associated with adding a pointer in struct page.
       Can this be reduced/avoided? One solution suggested is to use a
       mirror mem_map.
(iii)  How do we distribute the remaining resources among resource hungry
       groups? The Resource Group implementation used the ratio of the limits
       to decide on the ratio according to which they are distributed.
(iv)   How do we account for shared pages? Should it be charged to the first
       container which touches the page or should it be charged equally among
       all containers sharing the page?
(v)    Definition of RSS (see http://lkml.org/lkml/2006/10/10/130)
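
For open issue (iii), distributing spare resources in the ratio of the
groups' limits can be sketched as below. This is only an illustration of the
proportional rule mentioned above, with hypothetical names
(`distribute_spare`, `limits`, `share`); it is not the Resource Group code,
and it assumes spare * limit fits in 64 bits.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Split `spare` units among n groups in proportion to their limits:
 *
 *     share[j] = spare * limits[j] / (sum of all limits)
 *
 * Integer division rounds down, so up to n - 1 units may stay
 * undistributed. If all limits are zero, nobody gets anything.
 * Assumes spare * limits[j] does not overflow unsigned long long.
 */
static void distribute_spare(unsigned long spare, const unsigned long *limits,
			     unsigned long *share, size_t n)
{
	unsigned long long sum = 0;

	assert(limits != NULL && share != NULL);
	for (size_t j = 0; j < n; j++)
		sum += limits[j];

	for (size_t j = 0; j < n; j++) {
		if (sum == 0)
			share[j] = 0;
		else
			share[j] = (unsigned long)
				((unsigned long long)spare * limits[j] / sum);
	}
}
```

With 600 spare units and limits of 100, 200 and 300, the groups receive 100,
200 and 300 units respectively. A real implementation would also have to
decide what to do with the rounding remainder and with groups that cannot
use their full share.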

6. Going forward

(i)    Agree on requirements (there has been some agreement already, please
       see http://lkml.org/lkml/2006/9/6/102 and the BOF summary [7])
(ii)   Agree on minimum accounting and hooks in the kernel. It might be
       a good idea to take this up in phases
       phase 1 - account for user space memory
       phase 2 - account for kernel memory allocated on behalf of the user/task
(iii)  Infrastructure - There is a separate RFC on that.

7. References

1. http://www.openvz.org
2. http://lkml.org/lkml/2006/9/19/283 (Containers patches)
3. http://lwn.net/Articles/200073/ (Another Container Implementation)
4. http://ckrm.sf.net (Resource Groups)
5. http://lwn.net/Articles/197433/ (Resource Beancounters)
6. http://lwn.net/Articles/182369/ (CKRM Rebranded)
7. http://lkml.org/lkml/2006/7/26/237 (OLS BoF on Resource Management (NOTES))

----- End forwarded message -----

