* C/R minisummit notes
@ 2008-07-23 11:30 Daniel Lezcano
From: Daniel Lezcano @ 2008-07-23 11:30 UTC (permalink / raw)
To: Linux Containers
* What are the problems that the Linux community can solve with
checkpoint/restart ?

Eric Biederman reminds us that at the previous OLS nobody complained about
checkpoint/restart

Pavel Emelyanov : Oracle takes some minutes to start up. If we
checkpoint just after startup, Oracle can later be restarted from that
point, giving a fast startup

Oren Laadan : Time travel, we can take monotonic snapshots and go back to
one of these snapshots.

Eric Biederman : Priority running, checkpoint/kill an application and
run another application with a higher priority

Denis Lunev : Task migration, move an application from one host to another host

Daniel Lezcano : SSI (task migration)

* Preparing the kernel internals

OL : Can we implement a kernel module and move the CR functionality into
the kernel itself later ?

EB : Better to add a little CR functionality to the kernel itself
and add more afterwards.

DLu : Problem with kernel versions

OL : Compatibility with intermediate kernel versions should be possible
with userspace conversion tools

DLu : A non-sequential file for the checkpoint statefile is a challenge

OL : Yes, but possible and useful for compression/encryption

We showed that there are five steps to realize a checkpoint:

1 - Pre-dump
2 - Freeze
3 - Dump
4 - Resume/kill
5 - Post-dump

At this point we stated that we want to create a proof of concept and
checkpoint/restart the simplest application.
We will iteratively add more and more kernel resources.

Process hierarchy created from kernel or userspace ?

OL : Seems better to send a chunk of data to the kernel, which then
restores the process hierarchy
PE : Agreed
OL : We should be able to checkpoint from inside the container, keep
that in mind for later.

=> we need a syscall or an ioctl

The first items to address before implementing the checkpoint are:
1 - Make a container object (the context)
2 - Freeze the container (extend cgroup freezer ?)
3 - syscall | ioctl

First step:
* simplest application : a single process, without any file, no
checkpoint of text file (same file system for restart), no signals, no
syscall in the application, no ipc/no msgq, no network

Second step:
* multiple processes + zombie state

Third step:
* files, pipe, signals, socketpair ?

This proof of concept must come with documentation describing what is
supported, what is not supported and what we plan to do.

* Re: C/R minisummit notes
@ 2008-07-23 14:20 Eric W. Biederman
From: Eric W. Biederman @ 2008-07-23 14:20 UTC (permalink / raw)
To: Daniel Lezcano; +Cc: Linux Containers

Daniel Lezcano <dlezcano-NmTC/0ZBporQT0dZR+AlfA@public.gmane.org> writes:

> * What are the problems that the linux community can solve with the
> checkpoint/restart ?
>
> Eric Biederman reminds at the previous OLS nobody complained about the
> checkpoint/restart

Kernel summit. Not OLS. Which is a room packed full of maintainers.
It isn't an endorsement, but it also isn't such a scary idea that people
immediately rejected it.

Eric

* Re: C/R minisummit notes
@ 2008-07-23 18:55 Oren Laadan
From: Oren Laadan @ 2008-07-23 18:55 UTC (permalink / raw)
To: Daniel Lezcano; +Cc: Linux Containers

Hi,

I've placed a somewhat more detailed summary on the wiki:
http://wiki.openvz.org/Containers/Mini-summit_2008_notes

(also accessible from: http://wiki.openvz.org/Containers/Mini-summit_2008)

To further discuss technical details, let's schedule to meet while we are
here for the OLS. I suggest the following for a start:

1) Dinner tonight at 7:30pm. Suggestions for a venue are welcome :)
2) Breakfast tomorrow before the OLS, at 8:30am at the congress center.

Please confirm your participation.

Thanks,

Oren.

* Re: C/R minisummit notes
@ 2008-07-23 20:18 Serge E. Hallyn
From: Serge E. Hallyn @ 2008-07-23 20:18 UTC (permalink / raw)
To: Oren Laadan; +Cc: Linux Containers, Daniel Lezcano

Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> To further discuss technical details, let's schedule to meet while we are
> here for the OLS. I suggest the following for a start:
>
> 1) Dinner tonight at 7:30pm. Suggestions for a venue are welcome :)
> 2) Breakfast tomorrow before the OLS, at 8:30am at the congress center.
>
> Please confirm your participation.

Hey,

I've committed to dinner with another group. I'm definitely up for
breakfast. So I'll be at the 3rd floor of the congress center, at the mall
entrance, at 8:30am.

thanks,
-serge

* Re: [Devel] Re: C/R minisummit notes
@ 2008-07-23 20:23 Denis V. Lunev
From: Denis V. Lunev @ 2008-07-23 20:23 UTC (permalink / raw)
To: Oren Laadan; +Cc: Linux Containers, Daniel Lezcano

On Wed, 2008-07-23 at 14:55 -0400, Oren Laadan wrote:
> To further discuss technical details, let's schedule to meet while we are
> here for the OLS. I suggest the following for a start:
>
> 1) Dinner tonight at 7:30pm. Suggestions for a venue are welcome :)

+1

I think that we could meet at the registration desk. Any objections?

> 2) Breakfast tomorrow before the OLS, at 8:30am at the congress center.

+1

Regards,
Den

* Re: C/R minisummit notes
@ 2008-07-23 20:24 Daniel Lezcano
From: Daniel Lezcano @ 2008-07-23 20:24 UTC (permalink / raw)
To: Oren Laadan; +Cc: Linux Containers

Oren Laadan wrote:
> To further discuss technical details, let's schedule to meet while we are
> here for the OLS. I suggest the following for a start:
>
> 1) Dinner tonight at 7:30pm. Suggestions for a venue are welcome :)
> 2) Breakfast tomorrow before the OLS, at 8:30am at the congress center.
>
> Please confirm your participation.

Benjamin, Dave and I will come. Pavel, Denis and Andrey will come too.
I am looking for the Kerrighed guys to come too.

We meet in the hall of the congress at 7:00pm

* Re: C/R minisummit notes
@ 2008-07-23 21:18 Serge E. Hallyn
From: Serge E. Hallyn @ 2008-07-23 21:18 UTC (permalink / raw)
To: Daniel Lezcano; +Cc: Linux Containers

Quoting Daniel Lezcano (dlezcano-NmTC/0ZBporQT0dZR+AlfA@public.gmane.org):
> We showed that there are five steps to realize a checkpoint:
>
> 1 - Pre-dump

I'd just add here that the pre-dump is where you might start writing
memory to disk, trying to get disk and memory closer and closer to
being the same until, at some point, you decide they are close enough
that you can go on to step two, and attempt the freeze+dump+migrate/kill
with minimal downtime.

Coming into the discussion my primary concern had been that doing a
sys_checkpoint() system call would be tough to augment to provide this
kind of incremental checkpoint, but this breakdown is great for that.

> 2 - Freeze
> 3 - Dump
> 4 - Resume/kill
> 5 - Post-dump
>
> At this point we state we want create a proof of concept and
> checkpoint/restart the simplest application.

By which we mean, start with a piece of step 3 (and maybe a bit of
step 4).

Step 2 was pretty widely accepted to be the freezer subsystem, but
no one seemed to be sure quite what the status of that was.

Matt, can you remind us how the freezer cgroup is doing?

> This proof of concept must came with a documentation describing what is
> supported, what is not supported and what we plan to do.

And there was talk of making sure that if you attempt to checkpoint an
app using unsupported resources, we return -EAGAIN. There had been
murmurings about giving more meaningful feedback, but I have no idea
what that would look like.

-serge
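
A minimal sketch in C of the -EAGAIN behaviour mentioned above. Everything
here is hypothetical and invented for illustration: the CR_RES_* flags, the
CR_SUPPORTED mask and cr_check_supported() are not proposed anywhere in the
thread. The optional out-parameter is only one guess at what "more meaningful
feedback" might look like.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical resource classes, loosely following the staged plan. */
#define CR_RES_TASK     (1u << 0)  /* a single process (first step) */
#define CR_RES_MULTI    (1u << 1)  /* multiple processes, zombies (second step) */
#define CR_RES_FILES    (1u << 2)  /* files, pipes (third step) */
#define CR_RES_SIGNALS  (1u << 3)
#define CR_RES_IPC      (1u << 4)
#define CR_RES_NETWORK  (1u << 5)

/* What this imaginary proof of concept is able to save so far. */
#define CR_SUPPORTED    (CR_RES_TASK)

/*
 * Return 0 if every resource class in `in_use` can be checkpointed,
 * -EAGAIN otherwise.  If `unsupported` is not NULL it receives the
 * offending bits, so the caller can report what blocked the checkpoint.
 */
static int cr_check_supported(unsigned int in_use, unsigned int *unsupported)
{
    unsigned int bad = in_use & ~CR_SUPPORTED;

    if (unsupported != NULL)
        *unsupported = bad;
    return bad ? -EAGAIN : 0;
}
```

A caller would run such a check before freezing anything, so that an
unsupported application is refused without being disturbed.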

* Re: C/R minisummit notes
@ 2008-07-23 21:38 Oren Laadan
From: Oren Laadan @ 2008-07-23 21:38 UTC (permalink / raw)
To: Serge E. Hallyn; +Cc: Linux Containers, Daniel Lezcano

Serge E. Hallyn wrote:
> Quoting Daniel Lezcano (dlezcano-NmTC/0ZBporQT0dZR+AlfA@public.gmane.org):
>> 2 - Freeze
>> 3 - Dump
>> 4 - Resume/kill
>> 5 - Post-dump
>>
>> At this point we state we want create a proof of concept and
>> checkpoint/restart the simplest application.
>
> By which we mean, start with a piece of step 3 (and maybe a bit of
> step 4).

step 4 is also part of the freezer -- it's the unfreeze operation
(or force a SIGKILL to all processes in the container).

> And there was talk of making sure that if you attempt to checkpoint an
> app using unsupported resources, we return -EAGAIN. There had been
> murmurings about giving more meaningful feedback, but I have no idea
> what that would look like.

yes. some of it is mentioned in the notes that I put in the wiki.

* Re: C/R minisummit notes
@ 2008-07-24 1:41 sukadev-r/Jw6+rmf7HQT0dZR+AlfA
From: sukadev-r/Jw6+rmf7HQT0dZR+AlfA @ 2008-07-24 1:41 UTC (permalink / raw)
To: Oren Laadan; +Cc: Linux Containers, Daniel Lezcano

Oren Laadan [orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org] wrote:
| Serge E. Hallyn wrote:
| > By which we mean, start with a piece of step 3 (and maybe a bit of
| > step 4).
|
| step 4 is also part of the freezer -- it's the unfreeze operation
| (or force a SIGKILL to all processes in the container).

Are steps 1-5 considered part of the sys_checkpoint() system call and
if successful sys_checkpoint() returns after step 5 ?

If so, like Serge points out, it would be harder to optimize for
incremental checkpoints (as each sys_checkpoint() would be independent) ?

But may not be something to worry about for POC.

* Re: C/R minisummit notes
@ 2008-07-24 3:26 Serge E. Hallyn
From: Serge E. Hallyn @ 2008-07-24 3:26 UTC (permalink / raw)
To: sukadev-r/Jw6+rmf7HQT0dZR+AlfA; +Cc: Linux Containers, Daniel Lezcano

Quoting sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org (sukadev-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org):
> Are steps 1-5 considered part of the sys_checkpoint() system call and
> if successful sys_checkpoint() returns after step 5 ?
>
> If so, like Serge points out, it would be harder to optimize for
> incremental checkpoints (as each sys_checkpoint() would be independent) ?

No no, the idea (IIUC) is that if you want to do a very short-downtime
migrate, you stay in step 1 for a long time, writing the container
memory to disk, checking how different the disk img is from the memory
image, updating the version on disk, checking again, etc. Then when
you decide that the disk and memory are very close together, you
quickly do steps 2-4, where 4 in this case is kill. In the meantime
you would have been loading the disk data into memory ahead of time
at the new machine, so you can also quickly complete the restart.

So 3, 'Dump', in this case really becomes "dump the metadata and any
more changes that have happened." Presumably, if when you get to 3,
you find that there was suddenly a lot of activity and there is too
much data to write quickly, you bail on the migrate and step 4 is
a resume rather than kill. Then you start again at step 1.

At least that was my understanding.

> But may not be something to worry about for POC.
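
The loop described above can be sketched as a small simulation in C. This is
only an illustration under stated assumptions: the function names, the
"fraction of pages re-dirtied per round" model and the thresholds are all
invented here, and no real kernel interface is implied.

```c
#include <stddef.h>

/* Outcome of step 4 in the migrate case. */
enum cr_result { CR_KILLED, CR_RESUMED };

/*
 * Simulate step 1 (pre-dump).  Each round writes every dirty page to
 * disk; while that happens the still-running container dirties
 * dirty * num / den pages again.  Returns the number of rounds needed
 * to get at or below `threshold` dirty pages, or -1 if `max_rounds`
 * rounds were not enough (the disk and memory images never converged).
 */
static int cr_predump_rounds(size_t dirty, size_t num, size_t den,
                             size_t threshold, int max_rounds)
{
    int rounds = 0;

    if (den == 0)
        return -1;
    while (dirty > threshold) {
        if (rounds == max_rounds)
            return -1;
        dirty = dirty * num / den;
        rounds++;
    }
    return rounds;
}

/*
 * Steps 2-4: after the freeze, dump what is left.  If a burst of
 * activity left more dirty pages than can be written quickly
 * (`budget`), bail out and resume instead of killing.
 */
static enum cr_result cr_finish(size_t dirty_at_freeze, size_t budget)
{
    return dirty_at_freeze <= budget ? CR_KILLED : CR_RESUMED;
}
```

With 1000 dirty pages, half of them re-dirtied per round and a threshold of
10, the simulated pre-dump needs seven rounds before the freeze is worth
attempting. A workload that re-dirties everything never converges, which is
the case where one would resume and start again at step 1.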

* Re: C/R minisummit notes
@ 2008-07-24 9:58 Eric W. Biederman
From: Eric W. Biederman @ 2008-07-24 9:58 UTC (permalink / raw)
To: Serge E. Hallyn; +Cc: Linux Containers, Daniel Lezcano

"Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:

> So 3, 'Dump', in this case really becomes "dump the metadata and any
> more changes that have happened." Presumably, if when you get to 3,
> you find that there was suddenly a lot of activity and there is too
> much data to write quickly, you bail on the migrate and step 4 is
> a resume rather than kill. Then you start again at step 1.
>
> At least that was my understanding.

Yes. To some extent you need those steps separate in the kernel so
you can coordinate with filesystem snapshots and the like. Despite
being in one large syscall we still have a few small other pieces of
userspace we need to coordinate with.

Eric

* Re: C/R minisummit notes (namespace naming)
@ 2008-07-24 9:55 Eric W. Biederman
From: Eric W. Biederman @ 2008-07-24 9:55 UTC (permalink / raw)
To: Daniel Lezcano; +Cc: Linux Containers

Currently we have three possibilities on how to name pid namespaces.
- indirect via processes
- pids
- names in the filesystem

We discussed this a bit in the hallway track and pids look like the way
to go. Pavel has a patch in progress to help sort this out.

The practical problem we have today is that we need a way to wait for the network
namespace in particular and namespaces in general to exit.

At first glance waitid(P_NS, <pid>,....) looks like a useful way to achieve
this. After looking at wait a bit more it really is fundamentally just an exit
status reaper of zombies, that has the option of blocking when the zombies
do not yet exist. In any kind of event loop you would wait for SIGCHLD either
as a signal or with signalfd.

So how shall we wait for a namespace to exit? My brainstorm tonight suggests
inotify_add_watch(ifd, "/proc/ns/<pid>", IN_DELETE);

Eric

* Re: C/R minisummit notes (namespace naming)
@ 2008-07-25 19:13 Serge E. Hallyn
From: Serge E. Hallyn @ 2008-07-25 19:13 UTC (permalink / raw)
To: Eric W. Biederman; +Cc: Linux Containers, Daniel Lezcano

Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
> The practical problem we have today is that we need a way to wait for the network
> namespace in particular and namespaces in general to exit.
>
> So how shall we wait for a namespace to exit? My brainstorm tonight suggests
> inotify_add_watch(ifd, "/proc/ns/<pid>", IN_DELETE);
>
> Eric

I'm sorry, I'm still not quite clear on...

Why?

You care about when the tasks exit, and you care about when network
devices, for instance, need to be deleted (which you can presumably
get uevents for, when they get moved back into init_net_ns).

Why do you care when the struct net actually gets deleted?

-serge

* Re: C/R minisummit notes (namespace naming)
@ 2008-07-25 19:26 Daniel Lezcano
From: Daniel Lezcano @ 2008-07-25 19:26 UTC (permalink / raw)
To: Serge E. Hallyn; +Cc: Linux Containers, Eric W. Biederman

Serge E. Hallyn wrote:
> You care about when the tasks exit, and you care about when network
> devices, for instance, need to be deleted (which you can presumably
> get uevents for, when they get moved back into init_net_ns).
>
> Why do you care when the struct net actually gets deleted?

IMO, if we consider a container to be an aggregation of different
namespaces, we should consider that the container dies when all the
namespaces are dead.

One good example is an application running inside a container and doing
bulk data writes over the network. When the application finishes its last
call to "send", it exits. At this point, there are no more processes
running inside the container, but we cannot consider the container
dead because there is still some pending data in the socket to be
delivered to the peer.

Eric will post a patch to automatically destroy the virtual devices when
the netns is destroyed, so there is no way to know whether a network
namespace is dead or not, as the uevent socket will not deliver an event
outside of the container.
[parent not found: <488A28E4.6080902-NmTC/0ZBporQT0dZR+AlfA@public.gmane.org>]
* Re: C/R minisummit notes (namespace naming)
[not found] ` <488A28E4.6080902-NmTC/0ZBporQT0dZR+AlfA@public.gmane.org>
@ 2008-07-25 19:34 ` Serge E. Hallyn
[not found] ` <20080725193458.GA12356-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 20+ messages in thread
From: Serge E. Hallyn @ 2008-07-25 19:34 UTC (permalink / raw)
To: Daniel Lezcano; +Cc: Linux Containers, Eric W. Biederman

Quoting Daniel Lezcano (dlezcano-NmTC/0ZBporQT0dZR+AlfA@public.gmane.org):
> Serge E. Hallyn wrote:
>> Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
>>> Currently we have three possibilities on how to name pid namespaces.
>>> - indirect via processes
>>> - pids
>>> - names in the filesystem
>>>
>>> We discussed this a bit in the hallway track and pids look like the way
>>> to go. Pavel has a patch in progress to help sort this out.
>>>
>>> The practical problem we have today is that we need a way to wait for the network
>>> namespace in particular and namespaces in general to exit.
>>>
>>> At a first glance waitid(P_NS, <pid>,....) looks like a useful way to achieve
>>> this. After looking at wait a bit more it really is fundamentally just an exit
>>> status reaper of zombies, that has the option of blocking when the zombies
>>> do not yet exist. In any kind of event loop you would wait for SIGCHLD either
>>> as a signal or with signalfd.
>>>
>>> So how shall we wait for a namespace to exit? My brainstorm tonight suggests
>>> inotify_add_watch(ifd, "/proc/ns/<pid>", IN_DELETE);
>>>
>>> Eric
>>
>> I'm sorry, I'm still not quite clear on...
>>
>> Why?
>>
>> You care about when the tasks exit, and you care about when network
>> devices, for instance, need to be deleted (which you can presumably
>> get uevents for, when they get moved back into init_net_ns).
>>
>> Why do you care when the struct net actually gets deleted?
>
> IMO, if we consider a container to be an aggregation of different
> namespaces, we should consider that the container dies when all the
> namespaces are dead.
>
> One good example is an application run inside a container and doing a
> bulk data write over the network. When the application finishes its last
> call to "send", it will exit. At this point, there are no more processes
> running inside the container, but we cannot consider the container dead
> because there is still some pending data in the socket to be delivered
> to the peer.
>
> Eric will post a patch to automatically destroy the virtual devices when
> the netns is destroyed, so there is no way to know if a network
> namespace is dead or not as the uevent socket will not deliver an event
> outside of the container.

My question remains: who cares?

-serge
[parent not found: <20080725193458.GA12356-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>]
* Re: C/R minisummit notes (namespace naming)
[not found] ` <20080725193458.GA12356-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2008-07-25 19:52 ` Oren Laadan
2008-07-25 20:09 ` Daniel Lezcano
1 sibling, 0 replies; 20+ messages in thread
From: Oren Laadan @ 2008-07-25 19:52 UTC (permalink / raw)
To: Serge E. Hallyn; +Cc: Linux Containers, Daniel Lezcano, Eric W. Biederman

Serge E. Hallyn wrote:
> Quoting Daniel Lezcano (dlezcano-NmTC/0ZBporQT0dZR+AlfA@public.gmane.org):
>> Serge E. Hallyn wrote:
>>> Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
>>>> Currently we have three possibilities on how to name pid namespaces.
>>>> - indirect via processes
>>>> - pids
>>>> - names in the filesystem
>>>>
>>>> We discussed this a bit in the hallway track and pids look like the way
>>>> to go. Pavel has a patch in progress to help sort this out.
>>>>
>>>> The practical problem we have today is that we need a way to wait for the network
>>>> namespace in particular and namespaces in general to exit.
>>>>
>>>> At a first glance waitid(P_NS, <pid>,....) looks like a useful way to achieve
>>>> this. After looking at wait a bit more it really is fundamentally just an exit
>>>> status reaper of zombies, that has the option of blocking when the zombies
>>>> do not yet exist. In any kind of event loop you would wait for SIGCHLD either
>>>> as a signal or with signalfd.
>>>>
>>>> So how shall we wait for a namespace to exit? My brainstorm tonight suggests
>>>> inotify_add_watch(ifd, "/proc/ns/<pid>", IN_DELETE);
>>>>
>>>> Eric
>>> I'm sorry, I'm still not quite clear on...
>>>
>>> Why?
>>>
>>> You care about when the tasks exit, and you care about when network
>>> devices, for instance, need to be deleted (which you can presumably
>>> get uevents for, when they get moved back into init_net_ns).
>>>
>>> Why do you care when the struct net actually gets deleted?
>> IMO, if we consider a container to be an aggregation of different
>> namespaces, we should consider that the container dies when all the
>> namespaces are dead.
>>
>> One good example is an application run inside a container and doing a
>> bulk data write over the network. When the application finishes its last
>> call to "send", it will exit. At this point, there are no more processes
>> running inside the container, but we cannot consider the container dead
>> because there is still some pending data in the socket to be delivered
>> to the peer.
>>
>> Eric will post a patch to automatically destroy the virtual devices when
>> the netns is destroyed, so there is no way to know if a network
>> namespace is dead or not as the uevent socket will not deliver an event
>> outside of the container.
>
> My question remains: who cares?
>

In the context of CR, you'd care if you migrate a container including
its network stack. In that case, you wanna make sure that:
(1) you save sockets that have data in their (send) queue but are
otherwise not attached to any specific process, and
(2) you disable these sockets at the source machine as soon as you
enable the container on the target machine.

Rethinking this, Serge is probably right because once you migrate the
network to the target node, you disable the network (of that container)
on the source node, so you don't care about #2 there anymore...

Oren.
* Re: C/R minisummit notes (namespace naming)
[not found] ` <20080725193458.GA12356-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2008-07-25 19:52 ` Oren Laadan
@ 2008-07-25 20:09 ` Daniel Lezcano
[not found] ` <488A32FC.7020803-NmTC/0ZBporQT0dZR+AlfA@public.gmane.org>
1 sibling, 1 reply; 20+ messages in thread
From: Daniel Lezcano @ 2008-07-25 20:09 UTC (permalink / raw)
To: Serge E. Hallyn; +Cc: Linux Containers, Eric W. Biederman

Serge E. Hallyn wrote:
> Quoting Daniel Lezcano (dlezcano-NmTC/0ZBporQT0dZR+AlfA@public.gmane.org):
>> Serge E. Hallyn wrote:
>>> Quoting Eric W. Biederman (ebiederm-aS9lmoZGLiVWk0Htik3J/w@public.gmane.org):
>>>> Currently we have three possibilities on how to name pid namespaces.
>>>> - indirect via processes
>>>> - pids
>>>> - names in the filesystem
>>>>
>>>> We discussed this a bit in the hallway track and pids look like the way
>>>> to go. Pavel has a patch in progress to help sort this out.
>>>>
>>>> The practical problem we have today is that we need a way to wait for the network
>>>> namespace in particular and namespaces in general to exit.
>>>>
>>>> At a first glance waitid(P_NS, <pid>,....) looks like a useful way to achieve
>>>> this. After looking at wait a bit more it really is fundamentally just an exit
>>>> status reaper of zombies, that has the option of blocking when the zombies
>>>> do not yet exist. In any kind of event loop you would wait for SIGCHLD either
>>>> as a signal or with signalfd.
>>>>
>>>> So how shall we wait for a namespace to exit? My brainstorm tonight suggests
>>>> inotify_add_watch(ifd, "/proc/ns/<pid>", IN_DELETE);
>>>>
>>>> Eric
>>> I'm sorry, I'm still not quite clear on...
>>>
>>> Why?
>>>
>>> You care about when the tasks exit, and you care about when network
>>> devices, for instance, need to be deleted (which you can presumably
>>> get uevents for, when they get moved back into init_net_ns).
>>>
>>> Why do you care when the struct net actually gets deleted?
>> IMO, if we consider a container to be an aggregation of different
>> namespaces, we should consider that the container dies when all the
>> namespaces are dead.
>>
>> One good example is an application run inside a container and doing a
>> bulk data write over the network. When the application finishes its last
>> call to "send", it will exit. At this point, there are no more processes
>> running inside the container, but we cannot consider the container dead
>> because there is still some pending data in the socket to be delivered
>> to the peer.
>>
>> Eric will post a patch to automatically destroy the virtual devices when
>> the netns is destroyed, so there is no way to know if a network
>> namespace is dead or not as the uevent socket will not deliver an event
>> outside of the container.
>
> My question remains: who cares?

The container implementation in userspace. Let's imagine it sets some
routes outside of the container to route the traffic to the container.
It should remove these routes when the container dies. And the container
should be considered dead when the network has died, not when the last
process of the container exits.
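The lifetime argument made in this message can be illustrated with a toy
model. The sketch below is only an assumption-laden illustration, not kernel
or container-tool code: the Container class, its hold/release methods, and
the cleanup list are hypothetical names. It models each namespace as a
reference count pinned by tasks and by sockets with queued data, and it runs
host-side cleanup (such as removing routes) only once every namespace has
been released.

```python
from typing import Callable, Dict, List


class Container:
    """Toy model: a container is an aggregation of reference-counted namespaces."""

    def __init__(self, namespaces: List[str]) -> None:
        # One counter per namespace; tasks and sockets with pending data hold references.
        self.refs: Dict[str, int] = {name: 0 for name in namespaces}
        # Host-side actions to run when the container dies, e.g. removing routes.
        self.cleanup: List[Callable[[], None]] = []
        self.dead = False

    def hold(self, ns: str) -> None:
        """Take a reference on a namespace (a task starts, a socket queues data)."""
        self.refs[ns] += 1

    def release(self, ns: str) -> None:
        """Drop a reference; the container dies when no namespace is referenced."""
        if self.refs[ns] == 0:
            raise ValueError(f"namespace {ns!r} has no reference to release")
        self.refs[ns] -= 1
        if not self.dead and all(count == 0 for count in self.refs.values()):
            self.dead = True
            for callback in self.cleanup:
                callback()
```

In this model, the exit of the last task drops the pid namespace reference
and one net namespace reference, but a socket that still has data to send
keeps the net namespace, and therefore the container, alive until it drains.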
[parent not found: <488A32FC.7020803-NmTC/0ZBporQT0dZR+AlfA@public.gmane.org>]
* Re: C/R minisummit notes (namespace naming)
[not found] ` <488A32FC.7020803-NmTC/0ZBporQT0dZR+AlfA@public.gmane.org>
@ 2008-07-26 7:32 ` Eric W. Biederman
0 siblings, 0 replies; 20+ messages in thread
From: Eric W. Biederman @ 2008-07-26 7:32 UTC (permalink / raw)
To: Daniel Lezcano; +Cc: Linux Containers

Daniel Lezcano <dlezcano-NmTC/0ZBporQT0dZR+AlfA@public.gmane.org> writes:

>>> Eric will post a patch to automatically destroy the virtual devices when the
>>> netns is destroyed, so there is no way to know if a network namespace is
>>> dead or not as the uevent socket will not deliver an event outside of the
>>> container.
>>
>> My question remains: who cares?
>
> The container implementation in userspace. Let's imagine it sets some routes
> outside of the container to route the traffic to the container. It should remove
> these routes when the container dies. And the container should be considered as
> dead when the network has died and not when the last process of the container
> exits.

Namespaces can definitely live on long past the time when there are any
tasks that point to them from nsproxy, and knowing when that happens
would be nice. So settling on pids for names would be nice, as that
would allow us to restructure /proc so that we could see those kinds of
things.

That said, I am less certain of the need to actually wait for a network
namespace to exit once we start killing virtual network devices. It was
mentioned that ip over ip tunnels don't currently have a dellink method,
so we will still need a wait to handle that case. Similarly, in general
we need to wait until the network namespace exits to ensure we flush all
of the outgoing packets at container shutdown.

So I propose we remove merge the code to wait on delete virtual devices
and then recheck to see what uses we actually have left.

Eric
* Re: C/R minisummit notes
[not found] ` <4887163F.5090801-NmTC/0ZBporQT0dZR+AlfA@public.gmane.org>
` (3 preceding siblings ...)
2008-07-24 9:55 ` C/R minisummit notes (namespace naming) Eric W. Biederman
@ 2008-07-24 20:28 ` Oren Laadan
[not found] ` <4888E5D3.807-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
4 siblings, 1 reply; 20+ messages in thread
From: Oren Laadan @ 2008-07-24 20:28 UTC (permalink / raw)
To: Daniel Lezcano; +Cc: Linux Containers

Let's have some more breakfast, tomorrow - Friday - morning.
Same place, same time. If it doesn't rain we'll go outside ;)

Oren.
[parent not found: <4888E5D3.807-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>]
* Re: C/R minisummit notes
[not found] ` <4888E5D3.807-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2008-07-25 2:14 ` Daniel Lezcano
0 siblings, 0 replies; 20+ messages in thread
From: Daniel Lezcano @ 2008-07-25 2:14 UTC (permalink / raw)
To: Oren Laadan; +Cc: Linux Containers

Oren Laadan wrote:
> Let's have some more breakfast, tomorrow - Friday - morning.
> Same place, same time. If it doesn't rain we'll go outside ;)
>

Acked-by: Daniel Lezcano <dlezcano-NmTC/0ZBporQT0dZR+AlfA@public.gmane.org>
end of thread, other threads:[~2008-07-26 7:32 UTC | newest]
Thread overview: 20+ messages
2008-07-23 11:30 C/R minisummit notes Daniel Lezcano
[not found] ` <4887163F.5090801-NmTC/0ZBporQT0dZR+AlfA@public.gmane.org>
2008-07-23 14:20 ` Eric W. Biederman
2008-07-23 18:55 ` Oren Laadan
[not found] ` <48877EA7.1050206-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2008-07-23 20:18 ` Serge E. Hallyn
2008-07-23 20:23 ` [Devel] " Denis V. Lunev
2008-07-23 20:24 ` Daniel Lezcano
2008-07-23 21:18 ` Serge E. Hallyn
[not found] ` <20080723211818.GA10295-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2008-07-23 21:38 ` Oren Laadan
[not found] ` <4887A4CC.5070009-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2008-07-24 1:41 ` sukadev-r/Jw6+rmf7HQT0dZR+AlfA
[not found] ` <20080724014122.GA23105-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2008-07-24 3:26 ` Serge E. Hallyn
[not found] ` <20080724032616.GB9839-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2008-07-24 9:58 ` Eric W. Biederman
2008-07-24 9:55 ` C/R minisummit notes (namespace naming) Eric W. Biederman
[not found] ` <m1zlo7a9nq.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
2008-07-25 19:13 ` Serge E. Hallyn
[not found] ` <20080725191356.GE28136-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2008-07-25 19:26 ` Daniel Lezcano
[not found] ` <488A28E4.6080902-NmTC/0ZBporQT0dZR+AlfA@public.gmane.org>
2008-07-25 19:34 ` Serge E. Hallyn
[not found] ` <20080725193458.GA12356-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2008-07-25 19:52 ` Oren Laadan
2008-07-25 20:09 ` Daniel Lezcano
[not found] ` <488A32FC.7020803-NmTC/0ZBporQT0dZR+AlfA@public.gmane.org>
2008-07-26 7:32 ` Eric W. Biederman
2008-07-24 20:28 ` C/R minisummit notes Oren Laadan
[not found] ` <4888E5D3.807-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2008-07-25 2:14 ` Daniel Lezcano