2009 kernel summit preparation for 'containers end-game' discussion

From: Serge E. Hallyn @ 2009-10-06 15:56 UTC
  To: Linux Containers, cgroup-r/Jw6+rmf7HQT0dZR+AlfA,
	Eric W. Biederman, Balbir Singh, Sukadev

Hi,

the kernel summit is rapidly approaching. One of the agenda
items is 'the containers end-game and how do we get there.'
As of now I don't yet know who will be there to represent the
containers community in that discussion.  I hope there is
someone planning on that?  In the hopes that there is, here is
a summary of the info I gathered in June, in case that is
helpful.  If it doesn't look like anyone will be attending
ksummit representing containers, then I'll send the final
version of this info to the ksummit mailing list so that someone
can stand in.

1. There will be an IO controller minisummit before KS.  I
trust someone (Balbir?) will be sending meeting notes to
the cgroup list, so that highlights can be mentioned at KS?

2. There was a checkpoint/restart BOF plus talk at plumber's.
Notes on the BOF are here:
https://lists.linux-foundation.org/pipermail/containers/2009-September/020915.html

3. There was an OOM notification talk or BOF at plumber's.
Dave or Balbir, are there any notes about that meeting?

4. The actual title of the KS discussion is 'containers end-game'.
The containers-specific info I gathered in June was mainly about
additional resources which we might containerize.  I expect that
will be useful in helping the KS community decide how far down
the containerization path they are willing to go - i.e. whether
we want to call what we have good enough and say you must use kvm
for anything more, whether we want to be able to provide all the
features of a full VM with containers, or something in between,
say targeting specific uses (perhaps only expand on cooperative
resource management containers).  With that in mind, here are
some items that were mentioned in June as candidates for
more containerization work:

	1. Cpu hard limits, memory soft limits (Balbir)
	2. Large pages, mlock, shared page accounting (Balbir)
	3. Oom notification (Balbir - was anything decided on this
		at plumber's?)
	4. There is agreement on getting rid of the ns cgroup,
		provided that:
		a. user namespaces can provide container confinement
		guarantees
		b. a compatibility flag is created to clone parent
		cgroup when creating a new cgroup (Paul and Daniel)
	5. Poweroff/reboot handling in containers (Daniel)
	6. Full user namespaces to segregate uids in different
		containers and confine root users in containers, i.e.
		with respect to file systems like cgroupfs.
	7. Checkpoint/restart (c/r) will want time virtualization (Daniel)
	8. C/r will want inode virtualization (Daniel)
	9. Sunrpc containerization (required to allow multiple
		containers separate NFS client access to the same server)
	10. Sysfs tagging, support for physical netifs to migrate
		network namespaces, and /sys/class/net virtualization

Again the point of this list isn't to ask for discussion about
whether or how to implement each at this KS, but rather to give
an idea of how much work is left to do.  Though let the discussion
lead where it may of course.

I don't have it here, but maybe it would also be useful to
have a list ready of things we can do today with containerization?
Both with upstream, and with under-development patchsets.

I also hope that someone will take notes on the ksummit
discussion to send to the containers and cgroup lists.
I expect there will be a good LWN writeup, but a more
containers-focused set of notes will probably be useful
too.

thanks,
-serge

* Re: 2009 kernel summit preparation for 'containers end-game' discussion
From: Ying Han @ 2009-10-06 16:53 UTC
  To: Serge E. Hallyn
  Cc: Dave Hansen, cgroup-r/Jw6+rmf7HQT0dZR+AlfA, Eric W. Biederman,
	Linux Containers, Pavel Emelyanov

On Tue, Oct 6, 2009 at 8:56 AM, Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote:
> Hi,
>
> the kernel summit is rapidly approaching. One of the agenda
> items is 'the containers end-game and how do we get there.'
> As of now I don't yet know who will be there to represent the
> containers community in that discussion.  I hope there is
> someone planning on that?  In the hopes that there is, here is
> a summary of the info I gathered in June, in case that is
> helpful.  If it doesn't look like anyone will be attending
> ksummit representing containers, then I'll send the final
> version of this info to the ksummit mailing list so that someone
> can stand in.
>
> 1. There will be an IO controller minisummit before KS.  I
> trust someone (Balbir?) will be sending meeting notes to
> the cgroup list, so that highlights can be mentioned at KS?
>
> 2. There was a checkpoint/restart BOF plus talk at plumber's.
> Notes on the BOF are here:
> https://lists.linux-foundation.org/pipermail/containers/2009-September/020915.html
>
> 3. There was an OOM notification talk or BOF at plumber's.
> Dave or Balbir, are there any notes about that meeting?
Serge:
Here are some notes I took from Dave's OOM talk:

Change the OOM killer's policy.

The current goal of the OOM killer is to kill a rogue memory-hogging task,
which frees memory and allows the system or container to resume normal
operation. Under an OOM condition, the kernel scans the task list of the
system or container and scores each task with a heuristic. The task with the
highest score is picked to be killed. The kernel also provides the
/proc/<pid>/oom_adj API for adding user policy on top of the score; it allows
the admin to tune the "badness" on a per-task basis.
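
As a rough illustration of that scoring step, here is a minimal Python sketch
of how a per-task adjustment could be folded into a heuristic score before the
victim is picked. The shift-based rule and the -17 "disable" value are
assumptions modeled on 2.6-era kernels and are not taken from the talk; the
function names and the sample tasks are hypothetical.

```python
OOM_DISABLE = -17  # assumed sentinel: the task is exempt from the OOM killer


def adjusted_badness(points, oom_adj):
    """Fold a per-task oom_adj value into a heuristic badness score."""
    if oom_adj == OOM_DISABLE:
        return 0
    if oom_adj > 0:
        # Positive values make the task a more likely victim.
        return max(points, 1) << oom_adj
    # Zero leaves the score unchanged; negative values shrink it.
    return points >> -oom_adj


def pick_victim(tasks):
    """tasks: iterable of (name, points, oom_adj); return the top scorer or None."""
    scored = [(adjusted_badness(points, adj), name) for name, points, adj in tasks]
    scored = [entry for entry in scored if entry[0] > 0]
    return max(scored)[1] if scored else None


# Hypothetical task list: sshd is protected, so firefox is chosen.
print(pick_victim([("sshd", 500, OOM_DISABLE), ("firefox", 300, 0), ("cron", 50, 1)]))
```

In this model the adjustment only biases the heuristic; it does not give the
admin a strict kill order.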

Linux Theory: A free page is a wasted page of RAM, and Linux will always fill
up memory with disk caches. When we time the run of an application, we
normally follow the sequence "flush cache - time - run app - time - flush
cache". So being OOM is normal; it is not a bug.

Linux-mm has a list describing the possible OOM conditions.
http://linux-mm.org/OOM

User Perspectives:
High Performance Computing: I will take as much memory as can be given; please
tell me how much memory that is. In these systems, swapping is the devil.

Enterprise: Applications do their own memory management. If the system gets
low on memory, I want the kernel to tell me, and I will give some of mine
back. The memory notification system drew a lot of attention. A couple of
proposals have been posted to linux-mm, and none of them seems to fulfill all
the requirements.

Desktop: This is what the OOM killer was designed for. When OpenOffice/Firefox
blows up, please just kill it quickly; I will reopen it in a minute. Also,
please don't kill sshd.

Memory Reclaim
If there is no free memory, we scan the LRU and try to free pages. Recent work
on page reclaim focuses on scalability. In 1991, with 4M of DRAM, we had 1024
pages to scan. In 2009, with 4G of DRAM, we have 1048576 pages to scan. The
growth in memory size makes the reclaim job harder and harder.
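
The page counts above follow directly from dividing memory size by page size.
A small Python sketch of that arithmetic, assuming 4K base pages for both
figures; the helper name is made up, and the 64K line is added only for
comparison with the larger page size mentioned in these notes:

```python
MiB = 1 << 20
GiB = 1 << 30


def pages_to_scan(mem_bytes, page_bytes=4096):
    """Number of pages the reclaim code may have to look at for a given RAM size."""
    return mem_bytes // page_bytes


print(pages_to_scan(4 * MiB))             # 1024
print(pages_to_scan(4 * GiB))             # 1048576
print(pages_to_scan(4 * GiB, 64 * 1024))  # 65536
```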

Beat the LRU into shape
* Never run out of memory, never reclaim and never look at the LRU.
* Use a larger page size. IBM uses 64K pages instead of 4K pages. "IBM uses
64K page, more on the kernel issue change than userspace change if they use
libc"
* Keep troublesome pages off the LRU lists, including unreclaimable pages
(anon, mlock, shm, slab, dirty pages) and hugetlbfs pages, which are not
counted in RSS.
* Split up the LRU lists. This includes the NUMA implementation as well as the
unevictable patch from Rik (~2.6.28).

What is next:

Having the OOM killer always pick the "right" application to kill is a tough
problem, and it has been a hot topic upstream with several patches posted.
The notification system got a lot of attention during the talk; here is a
summary of the currently posted patches:

Linux killed Kenny, bastard!
Evgeniy Polyakov posted the patch early this year. It provides an API that
lets the admin specify the OOM victim by process name. No one on linux-mm
liked the patch. The argument behind it is that the current mechanism for
calculating the "badness score" is far too complex for an admin to determine
which task will be killed. Alan Cox answered simply: "its always heuristic",
and he also pointed out "What you actually need is notifiers to work on
/proc. In fact containers are probably the right way to do it".

Cgroup based OOM killer controller
Nikanth Karthikesan re-posted the patch, which adds cgroup support. The
patch adds an adjustable value "oom.victim" for each oom cgroup. The OOM
killer would kill all the processes in a cgroup with a higher oom.victim
value before killing a process in a cgroup with a lower oom.victim value.
Among tasks with the same oom.victim value, the usual "badness" heuristics
would be applied.
It goes one step further by making use of the cgroup hierarchy for the OOM
killer subsystem. However, the same question was raised: "What is the
difference between oom_adj and this oom.victim to user?". Nikanth answered
"Using this oom.victim users can specify the exact order to kill
processes.". In other words, oom_adj works as a hint to the kernel, while
oom.victim gives a strict order.
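
The ordering described for this patch can be sketched as a two-level sort:
cgroup priority first, heuristic score second. This is only an illustration
of the rule as summarized here, not the patch's code; the Task tuple, the
cgroup names, and the numbers are hypothetical.

```python
from collections import namedtuple

# Hypothetical task record: name, owning cgroup, heuristic badness score.
Task = namedtuple("Task", "name cgroup badness")


def kill_order(tasks, oom_victim):
    """Sort tasks so higher oom.victim cgroups go first, then higher badness."""
    return sorted(
        tasks,
        key=lambda t: (oom_victim[t.cgroup], t.badness),
        reverse=True,
    )


tasks = [
    Task("db", "prod", 900),
    Task("batch1", "batch", 100),
    Task("batch2", "batch", 300),
    Task("web", "prod", 200),
]
# The batch cgroup is marked as the preferred victim.
order = kill_order(tasks, {"prod": 1, "batch": 5})
print([t.name for t in order])  # ['batch2', 'batch1', 'db', 'web']
```

Note that "db" has the highest badness score but is still killed after both
batch tasks, which is the strict-order property that a per-task hint cannot
guarantee.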

Per-cgroup OOM handler
Ying Han posted the Google in-house patch to linux-mm, which defers OOM kill
decisions to userspace. It allows userspace to respond to the OOM by adding
nodes, dropping caches, raising the memcg limit, or sending a signal. An
alternative is to use /dev/mem_notify, which David Rientjes proposed on
linux-mm. The idea is similar: instead of waiting on oom_await, userspace
can poll for the information during a low-memory condition and respond
accordingly.
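
Both approaches boil down to a userspace task that waits on a file
descriptor and reacts when the kernel reports memory pressure. The Python
sketch below shows only that generic wait-and-respond shape. It does not use
oom_await or /dev/mem_notify, whose interfaces are not described here; an
ordinary pipe stands in for the notification source, and the function names
are hypothetical.

```python
import os
import select


def wait_for_event(fd, timeout_s):
    """Block until fd is readable or the timeout expires; True if an event arrived."""
    readable, _, _ = select.select([fd], [], [], timeout_s)
    return bool(readable)


def handle_lowmem(fd, respond, timeout_s=1.0):
    """Wait for one notification on fd, consume it, then run the response."""
    if not wait_for_event(fd, timeout_s):
        return False
    os.read(fd, 4096)  # consume the pending notification
    respond()          # e.g. drop caches, raise a limit, or signal a task
    return True


# Stand-in for the kernel side: write to a pipe to simulate a notification.
read_end, write_end = os.pipe()
os.write(write_end, b"1")
handle_lowmem(read_end, lambda: print("responding to low memory"))
```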

Vladislav Buzov posted a patch which extends memcg by adding a notification
system for the system low-memory condition. The feedback looks promising this
time, although there are still lots of changes that need to be made.
Discussion focused on the implementation of the notification mechanism.
Balbir Singh mentioned cgroupstats - a genetlink-based mechanism for
event delivery and request/response applications. Paul Menage proposed a
couple of options, including a new ioctl on cgroup files, a new syscall, and
a new per-cgroup file.

--Ying Han


* Re: 2009 kernel summit preparation for 'containers end-game' discussion
From: Serge E. Hallyn @ 2009-10-06 18:21 UTC
  To: Ying Han
  Cc: Dave Hansen, Eric W. Biederman, Linux Containers, Pavel Emelyanov

Wow, detailed notes - thanks, I'm still looking through them.  If you don't
mind, I'll use a link to the archive of this email
(https://lists.linux-foundation.org/pipermail/containers/2009-October/021227.html)
in the final summary.

thanks,
-serge


* Re: 2009 kernel summit preparation for 'containers end-game' discussion
From: Ying Han @ 2009-10-06 18:54 UTC
  To: Serge E. Hallyn
  Cc: Dave Hansen, Eric W. Biederman, Linux Containers, Pavel Emelyanov

On Tue, Oct 6, 2009 at 11:21 AM, Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> wrote:

> Wow, detailed notes - thanks, I'm still looking through them.  If you don't
> mind, I'll use a link to the archive of this email
> (https://lists.linux-foundation.org/pipermail/containers/2009-October/021227.html)
> in the final summary.

Sure. The archive works for me.   :)

--Ying

> thanks,
> -serge
>
> Quoting Ying Han (yinghan-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org):
> > On Tue, Oct 6, 2009 at 8:56 AM, Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> wrote:
> > > Hi,
> > >
> > > the kernel summit is rapidly approaching. One of the agenda
> > > items is 'the containers end-game and how do we get there.'
> > > As of now I don't yet know who will be there to represent the
> > > containers community in that discussion.  I hope there is
> > > someone planning on that?  In the hopes that there is, here is
> > > a summary of the info I gathered in June, in case that is
> > > helpful.  If it doesn't look like anyone will be attending
> > > ksummit representing containers, then I'll send the final
> > > version of this info to the ksummit mailing list so that someone
> > > can stand in.
> > >
> > > 1. There will be an IO controller minisummit before KS.  I
> > > trust someone (Balbir?) will be sending meeting notes to
> > > the cgroup list, so that highlights can be mentioned at KS?
> > >
> > > 2. There was a checkpoint/restart BOF plus talk at plumber's.
> > > Notes on the BOF are here:
> > >
> >
> https://lists.linux-foundation.org/pipermail/containers/2009-September/020915.html
> > >
> > > 3. There was an OOM notification talk or BOF at plumber's.
> > > Dave or Balbir, are there any notes about that meeting?
> > Serge:
> > Here are some notes I took from Dave's OOM talk:
> >
> > Change the OOM killer's policy.
> >
> > The current goal of OOM killer is to kill a rogue memory hogging task
> which
> > will lead to future memory freeing, and allow the system or container to
> > resume normal operation. Under OOM condition, kernel scans the tasklist
> of
> > the system or container and scores each task based on heuristic
> mechanism.
> > The task with highest score is picked to kill. Also kernel provides
> > /proc/pid/oom_adj API for adding user policy on top of the score, it
> allows
> > admin to tune the "badness" on task basis.
> >
> > Linux Theory: A free page is a wasted page of RAM and Linux will always
> fill
> > up memory with disk caches. When we time stamp the running time of an
> > application, we normally follow the sequence "flush cache - time - run
> app -
> > time - flush cache". So being OOM is normal and it is not a bug.
> >
> > Linux-mm has a list descripting the possible OOM conditions.
> > http://linux-mm.org/OOM
> >
> > User Perspectives:
> > High Performance Computing: I will take as much memory can be given,
> Please
> > tell me how much memory that is. In these systems, swapping is the devil.
> >
> > Enterprise: Applications do their own memory management.If the system
> gets
> > lowmem, I want the the kernel to tell me, and I will give some of mine
> back.
> > Memory notification system brings up lots of attention. Couple of
> proposals
> > have been posted in linux-mm and none of them seems fulfill all the
> > requirements.
> >
> > Desktop: This is the OOM designed for. When OpenOffice/Firefox flows up,
> > please just kill it quickly, i will reopen it in a minute. Besides,
> Please
> > don't kill sshd.
> >
> > Memory Reclaim
> > If no free memory, we scan the LRU and try to free pages. Recent issues
> on
> > page reclaim focuses on scalability. In 1991 with 4M of DRAM, We have
> 1024
> > pages to scan. In 2009 with 4G of DRAM, we have1048576 pages to scan. The
> > increasing of the memory size makes reclaim job harder and harder.
> >
> > Beat the LRU into shape
> > * Never run out of memory, never reclaim and never look at the LRU.
> > * Use large size pagesize. IBM uses 64k page instead of 4k page. "IBM
> uses
> > 64K page, more on the kernel issue change than userpace change if they
> use
> > libc"
> > * Keep troublesome pages off the LRU lists including unreclaimable pages
> > (anon, mlock, shm, slab, dirty pages)
> > and Hugetlbfs which are not counted on RSS.
> > * Split up the LRU lists. It includes the NUMA implementation as well as
> the
> > unevictable patch from Rik (~2.6.28)
> > What is next:
> >
> > Having the OOM killer always pick the "right" application to kill is a
> tough
> > problem and it has been the hot topic in upstream with several patches
> > posted. Notification system has lots of attention during the talk, here
> are
> > the summary of current posted patches:
> >
> > Linux killed Kenny, bastard!
> > Evgeniy Polyakov posted the patch early this year. What the patch does is
> to
> > provide an API that admin can specify the oom victim by the process name.
> > No one likes the patch in linux-mm. The argument is on the current
> mechanism
> > of caculating "badness score" which is way complex for admin to determin
> > which task to kill. Alan Cox simply answered the question: "its
> > always heuristic", and he also pointed out "What you actually need is
> > notifiers to work on /proc. In fact containers are probably the right way
> to
> > do it".
> >
> > Cgroup based OOM killer controller
> > Nikanth Karthikesan re-posted the patch which adding the cgroup support.
> The
> > patch added an adjustable value "oom.victim" for each oom cgroup. The OOM
> > killer would kill all the processes in a cgruop with a higher oom.victim
> > value before killing a process in a cgroup with lower oom.victim value.
> > Among those tasks with the same oom.victim value, the usual "badness"
> > heuristics would be applied.
> > It is one step further which takes use of the cgroup hireachy for the OOM
> > killer subsystem. However, the same question had been raised "What is the
> > difference between oom_adj and this oom.victim to user?". Nikanth
> answered
> > to that question "Using this oom.victim users can specify the exact order
> to
> > kill processes.". Another word, oom_adj works as a hint to the kernel
> while
> > oom_victim gives strict order.
> >
> > Per-cgroup OOM handler
> > Ying Han posted the google in-house patch into linux-mm which defers the
> OOM
> > kill decisions to userspace. It allows userspace to respond the OOM by
> > adding nodes, dropping caches, elevating memcg limit or sending signal.
> An
> > alternative is to use /dev/mem_notify which David Rientjes proposed in
> > linux-mm. The idea is similar, instead of waiting on oom_await, userspace
> > can poll the information during lowmem condition and respond
> > correspondingly.
> >
> > Vladislav Buzov posted the patch which extends the memcg by adding the
> > notification system on system lowmem condition. The feedbacks looks
> > promising this time, Although there still lots of changes needs to be
> done.
> > Discussions focused on the implementation of the notification mechanism.
> > Balbir Singh mentioned the cgroupstats - a genetlink based mechanism for
> > event delivery and request/respondse applications. Paul Menage proposed
> > couple of options including new ioctl on cgroup files, new syscall and
> new
> > per-cgroup file.
> >
> > --Ying Han
> >
> > >
> > > 4. The actual title of the KS discussion is 'containers end-game'.
> > > The containers-specific info I gathered in June was mainly about
> > > additional resources which we might containerize.  I expect that
> > > will be useful in helping the KS community decide how far down
> > > the containerization path they are willing to go - i.e. whether
> > > we want to call what we have good enough and say you must use kvm
> > > for anything more, whether we want to be able to provide all the
> > > features of a full VM with containers, or something in between,
> > > say targeting specific uses (perhaps only expand on cooperative
> > > resource management containers).  With that in mind, here are
> > > some items that were mentioned in June as candidates for
> > > more containerization work
> > >
> > >        1. Cpu hard limits, memory soft limits (Balbir)
> > >        2. Large pages, mlock, shared page accounting (Balbir)
> > >        3. Oom notification (Balbir - was anything decided on this
> > >                at plumber's?)
> > >        4. There is agreement on getting rid of the ns cgroup,
> > >                provided that:
> > >                a. user namespaces can provide container confinement
> > >                guarantees
> > >                b. a compatibility flag is created to clone parent
> > >                cgroup when creating a new cgroup (Paul and Daniel)
> > >        5. Poweroff/reboot handling in containers (Daniel)
> > >        6. Full user namespaces to segregate uids in different
> > >                containers and confine root users in containers, i.e.
> > >                with respect to file systems like cgroupfs.
> > >        7. Checkpoint/restart (c/r) will want time virtualization (Daniel)
> > >        8. C/r will want inode virtualization (Daniel)
> > >        9. Sunrpc containerization (required to allow multiple
> > >                containers separate NFS client access to the same server)
> > >        10. Sysfs tagging, support for physical netifs to migrate
> > >                network namespaces, and /sys/class/net virtualization
> > >
> > > Again the point of this list isn't to ask for discussion about
> > > whether or how to implement each at this KS, but rather to give
> > > an idea of how much work is left to do.  Though let the discussion
> > > lead where it may of course.
> > >
> > > I don't have it here, but maybe it would also be useful to
> > > have a list ready of things we can do today with containerization?
> > > Both with upstream, and with under-development patchsets.
> > >
> > > I also hope that someone will take notes on the ksummit
> > > discussion to send to the containers and cgroup lists.
> > > I expect there will be a good LWN writeup, but a more
> > > containers-focused set of notes will probably be useful
> > > too.
> > >
> > > thanks,
> > > -serge
> > > _______________________________________________
> > > Containers mailing list
> > > Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> > > https://lists.linux-foundation.org/mailman/listinfo/containers
> > >
>

* Re: 2009 kernel summit preparation for 'containers end-game' discussion
       [not found] ` <20091006155637.GA14761-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-10-06 16:53   ` Ying Han
@ 2009-10-12 18:49   ` Oren Laadan
       [not found]     ` <4AD37A3C.8020408-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
  1 sibling, 1 reply; 8+ messages in thread
From: Oren Laadan @ 2009-10-12 18:49 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Dave Hansen, cgroup-r/Jw6+rmf7HQT0dZR+AlfA, Eric W. Biederman,
	Linux Containers, Pavel Emelyanov

Hi,

Serge E. Hallyn wrote:
> Hi,
> 
> the kernel summit is rapidly approaching. One of the agenda
> items is 'the containers end-game and how do we get there.'
> As of now I don't yet know who will be there to represent the
> containers community in that discussion.  I hope there is
> someone planning on that?  In the hopes that there is, here is
> a summary of the info I gathered in June, in case that is
> helpful.  If it doesn't look like anyone will be attending
> ksummit representing containers, then I'll send the final
> version of this info to the ksummit mailing list so that someone
> can stand in.
> 
> 1. There will be an IO controller minisummit before KS.  I
> trust someone (Balbir?) will be sending meeting notes to
> the cgroup list, so that highlights can be mentioned at KS?
> 
> 2. There was a checkpoint/restart BOF plus talk at plumber's.
> Notes on the BOF are here:
> https://lists.linux-foundation.org/pipermail/containers/2009-September/020915.html

Based on Suka's post, I updated the linux-cr wiki page with the
notes from the BOF here:

	http://ckpt.wiki.kernel.org/index.php/LPC2009

> 
> 3. There was an OOM notification talk or BOF at plumber's.
> Dave or Balbir, are there any notes about that meeting?
> 
> 4. The actual title of the KS discussion is 'containers end-game'.
> The containers-specific info I gathered in June was mainly about
> additional resources which we might containerize.  I expect that
> will be useful in helping the KS community decide how far down
> the containerization path they are willing to go - i.e. whether
> we want to call what we have good enough and say you must use kvm
> for anything more, whether we want to be able to provide all the
> features of a full VM with containers, or something in between,
> say targeting specific uses (perhaps only expand on cooperative
> resource management containers).  With that in mind, here are
> some items that were mentioned in June as candidates for
> more containerization work
> 
> 	1. Cpu hard limits, memory soft limits (Balbir)
> 	2. Large pages, mlock, shared page accounting (Balbir)
> 	3. Oom notification (Balbir - was anything decided on this
> 		at plumber's?)
> 	4. There is agreement on getting rid of the ns cgroup,
> 		provided that:
> 		a. user namespaces can provide container confinement
> 		guarantees
> 		b. a compatibility flag is created to clone parent
> 		cgroup when creating a new cgroup (Paul and Daniel)
> 	5. Poweroff/reboot handling in containers (Daniel)
> 	6. Full user namespaces to segregate uids in different
> 		containers and confine root users in containers, i.e.
> 		with respect to file systems like cgroupfs.
> 	7. Checkpoint/restart (c/r) will want time virtualization (Daniel)
> 	8. C/r will want inode virtualization (Daniel)

What is the status of device namespaces/virtualization? The first few
I have in mind are per-container /dev/rtc, /dev/ttyX, and even
/dev/urandom (isolated entropy pools?).

The first two are important for containers that hold user sessions
(e.g. a Linux terminal server). Is anyone pushing this use case in the
context of the containers end-game?

Oren.

> 	9. Sunrpc containerization (required to allow multiple
> 		containers separate NFS client access to the same server)
> 	10. Sysfs tagging, support for physical netifs to migrate
> 		network namespaces, and /sys/class/net virtualization
> 
> Again the point of this list isn't to ask for discussion about
> whether or how to implement each at this KS, but rather to give
> an idea of how much work is left to do.  Though let the discussion
> lead where it may of course.
> 
> I don't have it here, but maybe it would also be useful to
> have a list ready of things we can do today with containerization?
> Both with upstream, and with under-development patchsets.
> 
> I also hope that someone will take notes on the ksummit
> discussion to send to the containers and cgroup lists.
> I expect there will be a good LWN writeup, but a more
> containers-focused set of notes will probably be useful
> too.
> 
> thanks,
> -serge

* Re: 2009 kernel summit preparation for 'containers end-game' discussion
       [not found]     ` <4AD37A3C.8020408-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
@ 2009-10-12 19:04       ` Serge E. Hallyn
       [not found]         ` <20091012190416.GA15143-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Serge E. Hallyn @ 2009-10-12 19:04 UTC (permalink / raw)
  To: Oren Laadan
  Cc: Dave Hansen, cgroup-r/Jw6+rmf7HQT0dZR+AlfA, Eric W. Biederman,
	Linux Containers, Pavel Emelyanov

Quoting Oren Laadan (orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org):
> Hi,
> 
> Serge E. Hallyn wrote:
> > Hi,
> > 
> > the kernel summit is rapidly approaching. One of the agenda
> > items is 'the containers end-game and how do we get there.'
> > As of now I don't yet know who will be there to represent the
> > containers community in that discussion.  I hope there is
> > someone planning on that?  In the hopes that there is, here is
> > a summary of the info I gathered in June, in case that is
> > helpful.  If it doesn't look like anyone will be attending
> > ksummit representing containers, then I'll send the final
> > version of this info to the ksummit mailing list so that someone
> > can stand in.
> > 
> > 1. There will be an IO controller minisummit before KS.  I
> > trust someone (Balbir?) will be sending meeting notes to
> > the cgroup list, so that highlights can be mentioned at KS?
> > 
> > 2. There was a checkpoint/restart BOF plus talk at plumber's.
> > Notes on the BOF are here:
> > https://lists.linux-foundation.org/pipermail/containers/2009-September/020915.html
> 
> Based on Suka's post, I updated the linux-cr wiki page with the
> notes from the BOF here:
> 
> 	http://ckpt.wiki.kernel.org/index.php/LPC2009

Thanks.

> > 3. There was an OOM notification talk or BOF at plumber's.
> > Dave or Balbir, are there any notes about that meeting?
> > 
> > 4. The actual title of the KS discussion is 'containers end-game'.
> > The containers-specific info I gathered in June was mainly about
> > additional resources which we might containerize.  I expect that
> > will be useful in helping the KS community decide how far down
> > the containerization path they are willing to go - i.e. whether
> > we want to call what we have good enough and say you must use kvm
> > for anything more, whether we want to be able to provide all the
> > features of a full VM with containers, or something in between,
> > say targeting specific uses (perhaps only expand on cooperative
> > resource management containers).  With that in mind, here are
> > some items that were mentioned in June as candidates for
> > more containerization work
> > 
> > 	1. Cpu hard limits, memory soft limits (Balbir)
> > 	2. Large pages, mlock, shared page accounting (Balbir)
> > 	3. Oom notification (Balbir - was anything decided on this
> > 		at plumber's?)
> > 	4. There is agreement on getting rid of the ns cgroup,
> > 		provided that:
> > 		a. user namespaces can provide container confinement
> > 		guarantees
> > 		b. a compatibility flag is created to clone parent
> > 		cgroup when creating a new cgroup (Paul and Daniel)
> > 	5. Poweroff/reboot handling in containers (Daniel)
> > 	6. Full user namespaces to segregate uids in different
> > 		containers and confine root users in containers, i.e.
> > 		with respect to file systems like cgroupfs.
> > 	7. Checkpoint/restart (c/r) will want time virtualization (Daniel)
> > 	8. C/r will want inode virtualization (Daniel)
> 
> What is the status of device namespaces/virtualization? The first few
> I have in mind are per-container /dev/rtc, /dev/ttyX, and even
> /dev/urandom (isolated entropy pools?).

They sound like good ideas.  I think the status is unstarted :)

> The first two are important for containers that hold user sessions
> (e.g. a Linux terminal server). Is anyone pushing this use case in the
> context of the containers end-game?

/me hopes someone chimes in and says "I am".

BTW, containers end-game is off the ksummit agenda now.

-serge

* Re: 2009 kernel summit preparation for 'containers end-game' discussion
       [not found]         ` <20091012190416.GA15143-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-10-12 19:39           ` Eric W. Biederman
       [not found]             ` <m18wfgjtaq.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
  0 siblings, 1 reply; 8+ messages in thread
From: Eric W. Biederman @ 2009-10-12 19:39 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Dave Hansen, cgroup-r/Jw6+rmf7HQT0dZR+AlfA, Linux Containers,
	Pavel Emelyanov

"Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:

> Quoting Oren Laadan (orenl-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org):
>> Hi,
>> 
>> Serge E. Hallyn wrote:
>> > Hi,
>> > 
>> > the kernel summit is rapidly approaching. One of the agenda
>> > items is 'the containers end-game and how do we get there.'
>> > As of now I don't yet know who will be there to represent the
>> > containers community in that discussion.  I hope there is
>> > someone planning on that?  In the hopes that there is, here is
>> > a summary of the info I gathered in June, in case that is
>> > helpful.  If it doesn't look like anyone will be attending
>> > ksummit representing containers, then I'll send the final
>> > version of this info to the ksummit mailing list so that someone
>> > can stand in.
>> > 
>> > 1. There will be an IO controller minisummit before KS.  I
>> > trust someone (Balbir?) will be sending meeting notes to
>> > the cgroup list, so that highlights can be mentioned at KS?
>> > 
>> > 2. There was a checkpoint/restart BOF plus talk at plumber's.
>> > Notes on the BOF are here:
>> > https://lists.linux-foundation.org/pipermail/containers/2009-September/020915.html
>> 
>> Based on Suka's post, I updated the linux-cr wiki page with the
>> notes from the BOF here:
>> 
>> 	http://ckpt.wiki.kernel.org/index.php/LPC2009
>
> Thanks.
>
>> > 3. There was an OOM notification talk or BOF at plumber's.
>> > Dave or Balbir, are there any notes about that meeting?
>> > 
>> > 4. The actual title of the KS discussion is 'containers end-game'.
>> > The containers-specific info I gathered in June was mainly about
>> > additional resources which we might containerize.  I expect that
>> > will be useful in helping the KS community decide how far down
>> > the containerization path they are willing to go - i.e. whether
>> > we want to call what we have good enough and say you must use kvm
>> > for anything more, whether we want to be able to provide all the
>> > features of a full VM with containers, or something in between,
>> > say targeting specific uses (perhaps only expand on cooperative
>> > resource management containers).  With that in mind, here are
>> > some items that were mentioned in June as candidates for
>> > more containerization work
>> > 
>> > 	1. Cpu hard limits, memory soft limits (Balbir)
>> > 	2. Large pages, mlock, shared page accounting (Balbir)
>> > 	3. Oom notification (Balbir - was anything decided on this
>> > 		at plumber's?)
>> > 	4. There is agreement on getting rid of the ns cgroup,
>> > 		provided that:
>> > 		a. user namespaces can provide container confinement
>> > 		guarantees
>> > 		b. a compatibility flag is created to clone parent
>> > 		cgroup when creating a new cgroup (Paul and Daniel)
>> > 	5. Poweroff/reboot handling in containers (Daniel)
>> > 	6. Full user namespaces to segregate uids in different
>> > 		containers and confine root users in containers, i.e.
>> > 		with respect to file systems like cgroupfs.
>> > 	7. Checkpoint/restart (c/r) will want time virtualization (Daniel)
>> > 	8. C/r will want inode virtualization (Daniel)
>> 
>> What is the status of device namespaces/virtualization? The first few
>> I have in mind are per-container /dev/rtc, /dev/ttyX, and even
>> /dev/urandom (isolated entropy pools?).
>
> They sound like good ideas.  I think the status is unstarted :)
>
>> The first two are important for containers that hold user sessions
>> (e.g. a Linux terminal server). Is anyone pushing this use case in the
>> context of the containers end-game?
>
> /me hopes someone chimes in and says "I am".
>
> BTW, containers end-game is off the ksummit agenda now.

I am still slowly poking at the sysfs cleanups/changes.

For me the priorities are roughly:
- bug fixes in the existing namespaces
- sysfs cleanups
- sysfs for the network namespace (and likely others)
- a complete user namespace (I am tired of running everything as root).

I have a bunch of generally unrelated hotplug changes I am working on as
well.

Now that the network namespace has stabilized I am hoping to have a bit more
time for the others.

Eric

* Re: 2009 kernel summit preparation for 'containers end-game' discussion
       [not found]             ` <m18wfgjtaq.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
@ 2009-10-13 14:52               ` Serge E. Hallyn
  0 siblings, 0 replies; 8+ messages in thread
From: Serge E. Hallyn @ 2009-10-13 14:52 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Dave Hansen, cgroup-r/Jw6+rmf7HQT0dZR+AlfA, Linux Containers,
	Pavel Emelyanov

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> "Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:
> I am still slowly poking at the sysfs cleanups/changes.
> 
> For me the priorities are roughly:
> - bug fixes in the existing namespaces

Do you have a list of bugs you're looking at?

Do you mean true existing bugs, or also shortcomings?  (i.e.
sunrpc vs. netns, kernel interfaces not yet properly namespaced,
like autofs4 vs. pidns)

It might be worth keeping both a bug-list and shortcomings-list
under ckpt.wiki.kernel.org.

> - sysfs cleanups
> - sysfs for the network namespace ( and likely others )
> - a complete user namespace (I am tired of running everything as root).
> 
> I have a bunch of generally unrelated hotplug changes I am working on as
> well.
> 
> Now that the network namespace has stabilized I am hoping to have a bit more
> time for the others.

Cool.

thanks,
-serge

Thread overview: 8+ messages
2009-10-06 15:56 2009 kernel summit preparation for 'containers end-game' discussion Serge E. Hallyn
     [not found] ` <20091006155637.GA14761-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-06 16:53   ` Ying Han
     [not found]     ` <604427e00910060953l2d14fa8ci3923320dfaeb6490-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2009-10-06 18:21       ` Serge E. Hallyn
     [not found]         ` <20091006182154.GB18694-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-06 18:54           ` Ying Han
2009-10-12 18:49   ` Oren Laadan
     [not found]     ` <4AD37A3C.8020408-RdfvBDnrOixBDgjK7y7TUQ@public.gmane.org>
2009-10-12 19:04       ` Serge E. Hallyn
     [not found]         ` <20091012190416.GA15143-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-10-12 19:39           ` Eric W. Biederman
     [not found]             ` <m18wfgjtaq.fsf-+imSwln9KH6u2/kzUuoCbdi2O/JbrIOy@public.gmane.org>
2009-10-13 14:52               ` Serge E. Hallyn
