Linux Container Development


* Re: What can OpenVZ do?
From: Alexey Dobriyan @ 2009-02-13 22:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Dave Hansen, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, Matt Mackall,
	Andrew Morton, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	tglx-hfZtesqFncYOwBW4kG4KsQ, Pavel Emelyanov
In-Reply-To: <20090213114503.GG15679-X9Un+BFzKDI@public.gmane.org>

On Fri, Feb 13, 2009 at 12:45:03PM +0100, Ingo Molnar wrote:
> 
> * Alexey Dobriyan <adobriyan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> 
> > On Fri, Feb 13, 2009 at 11:27:32AM +0100, Ingo Molnar wrote:
> > > Merging checkpoints instead might give them the incentive to get
> > > their act together.
> > 
> > Knowing how much time it takes to beat CPT back into usable shape every time
> > a big kernel rebase is done, OpenVZ/Virtuozzo have every single damn incentive
> > to have CPT mainlined.
> 
> So where is the bottleneck? I suspect the effort in having forward ported
> it across 4 major kernel releases in a single year is already larger than
> the technical effort it would take to upstream it. Any unreasonable upstream
> resistance/passivity you are bumping into?

People were busy with netns/containers stuff and OpenVZ/Virtuozzo bugs.


* Re: [PATCH] User namespaces: Only put the userns when we unhash the uid
From: KOSAKI Motohiro @ 2009-02-13 15:48 UTC (permalink / raw)
  To: David Howells
  Cc: torvalds, jmorris, akpm, serue, mingo, linux-kernel,
	vegard.nossum, containers, snakebyte
In-Reply-To: <20090213140421.5698.86876.stgit@warthog.procyon.org.uk>

> From: Serge E. Hallyn <serue@us.ibm.com>
>
> uids in namespaces other than init don't get a sysfs entry.
>
> For those in the init namespace, while we're waiting to remove
> the sysfs entry for the uid the uid is still hashed, and
> alloc_uid() may re-grab that uid without getting a new
> reference to the user_ns, which we've already put in free_user
> before scheduling remove_user_sysfs_dir().
>
> Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
> Acked-by: David Howells <dhowells@redhat.com>
> Tested-by: Ingo Molnar <mingo@elte.hu>

OK, thanks, good patch.
  Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>


* Re: [PATCH] User namespaces: Only put the userns when we unhash the uid
From: Ingo Molnar @ 2009-02-13 14:21 UTC (permalink / raw)
  To: David Howells
  Cc: Rafael J. Wysocki, vegard.nossum-Re5JQEeQqe8AvxtiuMwx3w,
	snakebyte-Mmb7MZpHnFY, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	jmorris-JSZrJrUvVKb1P9xLtpHBDw, torvalds-3NddpPZAyC0,
	containers-qjLDD68F18O7TbgM5vRIOg,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b
In-Reply-To: <20090213140421.5698.86876.stgit-S6HVgzuS8uM4Awkfq6JHfwNdhmdF6hFW@public.gmane.org>


* David Howells <dhowells-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:

> From: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> 
> uids in namespaces other than init don't get a sysfs entry.
> 
> For those in the init namespace, while we're waiting to remove
> the sysfs entry for the uid the uid is still hashed, and
> alloc_uid() may re-grab that uid without getting a new
> reference to the user_ns, which we've already put in free_user
> before scheduling remove_user_sysfs_dir().
> 
> Reported-by: KOSAKI Motohiro <kosaki.motohiro-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> Signed-off-by: Serge E. Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
> Acked-by: David Howells <dhowells-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
> Tested-by: Ingo Molnar <mingo-X9Un+BFzKDI@public.gmane.org>

beyond the crashes, this should resolve this slab corruption
regression too:

Bug-Entry       : http://bugzilla.kernel.org/show_bug.cgi?id=12503
Subject         : [slab corruption] BUG key_jar: Poison overwritten
Submitter       : Ingo Molnar <mingo-X9Un+BFzKDI@public.gmane.org>
Date            : 2009-01-15 18:16 (25 days old)
References      : http://marc.info/?l=linux-kernel&m=123204353425825&w=4
Handled-By      : David Howells <dhowells-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

	Ingo


* [PATCH] User namespaces: Only put the userns when we unhash the uid
From: David Howells @ 2009-02-13 14:04 UTC (permalink / raw)
  To: torvalds, jmorris
  Cc: akpm, dhowells, serue, mingo, kosaki.motohiro, linux-kernel,
	vegard.nossum, containers, snakebyte

From: Serge E. Hallyn <serue@us.ibm.com>

uids in namespaces other than init don't get a sysfs entry.

For those in the init namespace, while we're waiting to remove
the sysfs entry for the uid the uid is still hashed, and
alloc_uid() may re-grab that uid without getting a new
reference to the user_ns, which we've already put in free_user
before scheduling remove_user_sysfs_dir().

Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: David Howells <dhowells@redhat.com>
Tested-by: Ingo Molnar <mingo@elte.hu>
---

 kernel/user.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)


diff --git a/kernel/user.c b/kernel/user.c
index 477b666..3551ac7 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -72,6 +72,7 @@ static void uid_hash_insert(struct user_struct *up, struct hlist_head *hashent)
 static void uid_hash_remove(struct user_struct *up)
 {
 	hlist_del_init(&up->uidhash_node);
+	put_user_ns(up->user_ns);
 }
 
 static struct user_struct *uid_hash_find(uid_t uid, struct hlist_head *hashent)
@@ -334,7 +335,6 @@ static void free_user(struct user_struct *up, unsigned long flags)
 	atomic_inc(&up->__count);
 	spin_unlock_irqrestore(&uidhash_lock, flags);
 
-	put_user_ns(up->user_ns);
 	INIT_WORK(&up->work, remove_user_sysfs_dir);
 	schedule_work(&up->work);
 }
@@ -357,7 +357,6 @@ static void free_user(struct user_struct *up, unsigned long flags)
 	sched_destroy_user(up);
 	key_put(up->uid_keyring);
 	key_put(up->session_keyring);
-	put_user_ns(up->user_ns);
 	kmem_cache_free(uid_cachep, up);
 }
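
The reference imbalance described in the changelog above can be modeled
in a few lines of userspace C. This is only a sketch under stated
assumptions: the struct and function names below (user_ns, user,
user_create, user_regrab, user_unhash, user_free_old) are hypothetical
stand-ins, not the code in kernel/user.c, and locking and the deferred
sysfs work are left out entirely. It contrasts "put the namespace when
the last user reference goes away" with "put the namespace when the
entry is unhashed".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel objects; none of this is kernel code. */
struct user_ns {
	int refs;
};

struct user {
	struct user_ns *ns;
	int count;	/* plays the role of up->__count */
	bool hashed;	/* still findable through the uid hash? */
};

/* Slow path of an alloc_uid()-like lookup: a new entry takes its own
 * reference on the namespace. */
static void user_create(struct user *up, struct user_ns *ns)
{
	up->ns = ns;
	up->count = 1;
	up->hashed = true;
	ns->refs++;
}

/* Fast path: an entry that is still hashed is reused, and no new
 * namespace reference is taken. */
static bool user_regrab(struct user *up)
{
	if (!up->hashed)
		return false;
	up->count++;
	return true;
}

/* Rule after the patch: the namespace reference is dropped exactly
 * when the entry leaves the hash, so it cannot be re-grabbed later. */
static void user_unhash(struct user *up)
{
	assert(up->hashed);
	up->hashed = false;
	up->ns->refs--;
}

/* Rule before the patch, modeled for contrast: the namespace is put
 * while the entry is still hashed. */
static void user_free_old(struct user *up)
{
	up->ns->refs--;
}
```

With the old rule, a create / free / re-grab / free sequence puts the
namespace twice for one get. With the new rule the put is tied to the
unhash, after which the fast path can no longer find the entry.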
 


* Re: What can OpenVZ do?
From: Ingo Molnar @ 2009-02-13 11:45 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Dave Hansen, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, Matt Mackall,
	Andrew Morton, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	tglx-hfZtesqFncYOwBW4kG4KsQ, Pavel Emelyanov
In-Reply-To: <20090213113248.GA15275-2ev+ksY9ol182hYKe6nXyg@public.gmane.org>


* Alexey Dobriyan <adobriyan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:

> On Fri, Feb 13, 2009 at 11:27:32AM +0100, Ingo Molnar wrote:
> > 
> > * Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
> > 
> > > > If so, perhaps that can be used as a guide.  Will the planned feature
> > > > have a similar design?  If not, how will it differ?  To what extent can
> > > > we use that implementation as a tool for understanding what this new
> > > > implementation will look like?
> > > 
> > > Yes, we can certainly use it as a guide.  However, there are some
> > > barriers to being able to do that:
> > > 
> > > dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... | diffstat | tail -1
> > >  628 files changed, 59597 insertions(+), 2927 deletions(-)
> > > dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... | wc 
> > >   84887  290855 2308745
> > > 
> > > Unfortunately, the git tree doesn't have that great of a history.  It
> > > appears that the forward-ports are just applications of huge single
> > > patches which then get committed into git.  This tree has also
> > > historically contained a bunch of stuff not directly related to
> > > checkpoint/restart like resource management.
> > 
> > Really, OpenVZ/Virtuozzo does not seem to have enough incentive to merge
> > upstream, they only seem to forward-port, keep their tree messy, do minimal
> > work to reduce the cross section to the rest of the kernel (so that they can
> > manage the forward ports) but otherwise are happy with their carved-out
> > niche market. [which niche is also spiced with some proprietary add-ons,
> > last i checked, not exactly the contribution environment that breeds a
> > healthy flow of patches towards the upstream kernel.]
> 
> Oh, cut the crap!
> 
> > Merging checkpoints instead might give them the incentive to get
> > their act together.
> 
> Knowing how much time it takes to beat CPT back into usable shape every time
> a big kernel rebase is done, OpenVZ/Virtuozzo have every single damn incentive
> to have CPT mainlined.

So where is the bottleneck? I suspect the effort in having forward ported
it across 4 major kernel releases in a single year is already larger than
the technical effort it would take to upstream it. Any unreasonable upstream
resistance/passivity you are bumping into?

	Ingo


* Re: What can OpenVZ do?
From: Alexey Dobriyan @ 2009-02-13 11:32 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Dave Hansen, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, Matt Mackall,
	Andrew Morton, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	tglx-hfZtesqFncYOwBW4kG4KsQ, Pavel Emelyanov
In-Reply-To: <20090213102732.GB4608-X9Un+BFzKDI@public.gmane.org>

On Fri, Feb 13, 2009 at 11:27:32AM +0100, Ingo Molnar wrote:
> 
> * Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
> 
> > > If so, perhaps that can be used as a guide.  Will the planned feature
> > > have a similar design?  If not, how will it differ?  To what extent can
> > > we use that implementation as a tool for understanding what this new
> > > implementation will look like?
> > 
> > Yes, we can certainly use it as a guide.  However, there are some
> > barriers to being able to do that:
> > 
> > dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... | diffstat | tail -1
> >  628 files changed, 59597 insertions(+), 2927 deletions(-)
> > dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... | wc 
> >   84887  290855 2308745
> > 
> > Unfortunately, the git tree doesn't have that great of a history.  It
> > appears that the forward-ports are just applications of huge single
> > patches which then get committed into git.  This tree has also
> > historically contained a bunch of stuff not directly related to
> > checkpoint/restart like resource management.
> 
> Really, OpenVZ/Virtuozzo does not seem to have enough incentive to merge
> upstream, they only seem to forward-port, keep their tree messy, do minimal
> work to reduce the cross section to the rest of the kernel (so that they can
> manage the forward ports) but otherwise are happy with their carved-out
> niche market. [which niche is also spiced with some proprietary add-ons,
> last i checked, not exactly the contribution environment that breeds a
> healthy flow of patches towards the upstream kernel.]

Oh, cut the crap!

> Merging checkpoints instead might give them the incentive to get
> their act together.

Knowing how much time it takes to beat CPT back into usable shape every time
a big kernel rebase is done, OpenVZ/Virtuozzo have every single damn incentive
to have CPT mainlined.

If someone is afraid of long config options, there are always CONFIG_CPT and
CONFIG_CR available.


* Re: [PATCH 4/4] keys: make procfiles per-user-namespace
From: David Howells @ 2009-02-13 11:03 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: dhowells, lkml, Eric W. Biederman, Linux Containers
In-Reply-To: <20090109225313.GB15599@us.ibm.com>

Serge E. Hallyn <serue@us.ibm.com> wrote:

> Restrict the /proc/keys and /proc/key-users output to keys
> belonging to the same user namespace as the reading task.
> 
> We may want to make this more complicated - so that any
> keys in a user-namespace which belongs to the reading
> task are also shown.  But let's see if anyone wants that
> first.

Hmmm...  I wonder if we can do better by making the file position indicate the
key ID rather than being a count of the number of keys read.  It might make
this cleaner.

David
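
A rough userspace sketch of the "file position is the key ID" idea,
assuming the visible keys can be walked in ascending serial order. The
names key_table and key_next_by_serial are made up for illustration and
are not taken from security/keys/proc.c. The point it tries to show is
that a position holding "last serial + 1" stays meaningful when keys
are filtered out or removed between reads, which a plain count of keys
already read does not.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical view of the keys visible to the reader: serial numbers
 * sorted in ascending order. */
struct key_table {
	const unsigned int *serials;
	size_t n;
};

/*
 * Return the index of the first key whose serial is >= *pos, or -1 if
 * there is none.  On success *pos is advanced to serial + 1, so the
 * next call resumes right after the key just returned.
 */
static long key_next_by_serial(const struct key_table *t, unsigned long *pos)
{
	size_t lo = 0, hi = t->n;

	/* Binary search for the lower bound of *pos. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (t->serials[mid] < *pos)
			lo = mid + 1;
		else
			hi = mid;
	}
	if (lo == t->n)
		return -1;

	*pos = (unsigned long)t->serials[lo] + 1;
	return (long)lo;
}
```

If a key disappears between two reads, the next call simply lands on
the following serial; nothing is skipped or shown twice.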


* Re: What can OpenVZ do?
From: Ingo Molnar @ 2009-02-13 10:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Dave Hansen, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, mpm-VDJrAJ4Gl5ZBDgjK7y7TUQ,
	tglx-hfZtesqFncYOwBW4kG4KsQ,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	xemul-GEFAQzZX7r8dnm+yROfE0A
In-Reply-To: <20090212141014.2cd3d54d.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>

* Andrew Morton <akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org> wrote:

> Now, we've gone in blind before - most notably on the
> containers/cgroups/namespaces stuff.  That hail mary pass worked out
> acceptably, I think.  Maybe we got lucky.  I thought that
> net-namespaces in particular would never get there, but it did.
> 
> That was a very large and quite long-term-important user-visible
> feature.
> 
> checkpoint/restart/migration is also a long-term-...-feature.  But if
> at all possible I do think that we should go into it with our eyes a
> little less shut.

IMO, s/.../important/

More important than containers in fact. Being able to detach all
software state from the hw state and being able to reattach it:

   1) at a later point in time,                   or
   2) in a different piece of hardware,           or
   3) [future] in a different kernel

... is powerful stuff on a very conceptual level IMO.

The reason we don't have it in every OS is not that it's unwanted,
but that it's very, very hard to do on a wide scale. People would
love it even if it adds (some) overhead.

This kind of featureset is actually the main motivator for virtualization.

If the native kernel was able to do checkpointing we'd have not only
near-zero-cost virtualization done at the right abstraction level
(when combined with containers/control-groups), but we'd also have
a few future feature items like:

  1) Kernel upgrades done intelligently: transparent reboot into an
     upgraded kernel.

  2) Downgrade-on-regressions done sanely: transparent downgrade+reboot
     to a known-working kernel. (as long as the regression is app
     misbehavior or a performance problem - not a kernel crash. Most
     regressions on kernel upgrades are not actual crashes or data
     corruption but functional and performance regressions - i.e. it's
     safely checkpointable and downgradeable.)

  3) Hibernation done intelligently: checkpoint everything, turn off
     system. Turn on system, restore everything from the checkpoint.

  4) Backups done intelligently: full "backups" of long-running
     computational jobs, maybe even of complex things like databases
     or desktop sessions.

  5) Remote debugging done intelligently: got a crashed session?
     Checkpoint the whole app in its anomalous state and upload the
     image (as long as you can trust the developer with that image
     and with the filesystem state that goes with it).

I don't see many long-term dragons here. The kernel is obviously always
able to do near-zero-overhead checkpointing: it knows about all its
own data structures, can enumerate them and knows how they map to
user-space objects.

The rest is performance considerations: do we want to embed
checkpointing helpers in certain runtime codepaths, to make
checkpointing faster? But if that is undesirable (serialization,
etc.), we can always fall back to the dumbest, zero-overhead methods.

There is _one_ interim runtime cost: the "can we checkpoint or not"
decision that the kernel has to make while the feature is not complete.

That, if this feature takes off, is just a short-term worry - as
basically everything will be checkpointable in the long run.

In any case, by designing checkpointing to reuse the existing LSM
callbacks, we'd hit multiple birds with the same stone. (One of
which is the constant complaints about the runtime costs of the LSM
callbacks - with checkpointing we get an independent, non-security
user of the facility which is a nice touch.)

So all things considered it does not look like a bad deal to me - but
i might be missing something nasty.

	Ingo


* Re: What can OpenVZ do?
From: Ingo Molnar @ 2009-02-13 10:27 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Matt Mackall, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, hpa-YMNOUZJC4hwAvxtiuMwx3w,
	Andrew Morton, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	tglx-hfZtesqFncYOwBW4kG4KsQ, Pavel Emelyanov
In-Reply-To: <1234475483.30155.194.camel@nimitz>


* Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:

> > If so, perhaps that can be used as a guide.  Will the planned feature
> > have a similar design?  If not, how will it differ?  To what extent can
> > we use that implementation as a tool for understanding what this new
> > implementation will look like?
> 
> Yes, we can certainly use it as a guide.  However, there are some
> barriers to being able to do that:
> 
> dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... | diffstat | tail -1
>  628 files changed, 59597 insertions(+), 2927 deletions(-)
> dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... | wc 
>   84887  290855 2308745
> 
> Unfortunately, the git tree doesn't have that great of a history.  It
> appears that the forward-ports are just applications of huge single
> patches which then get committed into git.  This tree has also
> historically contained a bunch of stuff not directly related to
> checkpoint/restart like resource management.

Really, OpenVZ/Virtuozzo does not seem to have enough incentive to merge
upstream, they only seem to forward-port, keep their tree messy, do minimal
work to reduce the cross section to the rest of the kernel (so that they can
manage the forward ports) but otherwise are happy with their carved-out
niche market. [which niche is also spiced with some proprietary add-ons,
last i checked, not exactly the contribution environment that breeds a
healthy flow of patches towards the upstream kernel.]

Merging checkpoints instead might give them the incentive to get
their act together.

	Ingo


* Re: [RFC v13][PATCH 00/14] Kernel based checkpoint/restart
From: Ingo Molnar @ 2009-02-13 10:20 UTC (permalink / raw)
  To: Dave Hansen
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Andrew Morton,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	tglx-hfZtesqFncYOwBW4kG4KsQ
In-Reply-To: <1234462283.30155.173.camel@nimitz>

* Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:

> > What is it good for right now, and what are the known weaknesses and
> > quirks you can think of. Declaring them upfront is a bonus - not talking
> > about them and us discovering them later at the patch integration stage
> > is a sure recipe for upstream grumpiness.
> 
> That's a fair enough point, and I do agree with you on it.
> 
> Right now, it is good for very little.  An app has to basically be
> either specifically designed to work, or be pretty puny in its
> capabilities.  Any fds that are open can only be restored if a simple
> open();lseek(); would have been sufficient to get it back into a good
> state.  The process must be single-threaded.  Shared memory, hugetlbfs,
> VM_NONLINEAR are not supported.  

That is OK as a starting point, as long as:

> > For example, one of the critical corner points: can an app programmatically 
> > determine whether it can support checkpoint/restart safely? Are there 
> > warnings/signals/helpers in place that make it a well-defined space, and
> > make the implementation of missing features directly actionable?
> > 
> > ( instead of: 'silent breakage' and a wishy-washy boundary between the
> >   working and non-working space. Without clear boundaries there's no
> >   clear dynamics that extends the 'working' space beyond the demo stage. )
> 
> Patch 12/14 is supposed to address this *concept*.  But, it hasn't been
> carried through so that it currently works.  My expectation was that we
> would go through and add things over time.  I'll go make sure I push it
> to the point that it actually works for at least the simple test
> programs that we have.
> 
> What I will probably do is something BKL-style.  Basically put a "this
> can't be checkpointed" marker over most everything I can think of and
> selectively remove it as we add features.  

An app really has to know whether it can reliably checkpoint+restart.

Otherwise it won't ever get past the toy stage and people will waste a
lot of time if their designed-for-checkpoints app accidentally runs
into some kernel feature or other side-effect that is not supported.

I personally wouldn't mind sprinkling the kernel with markers, as long
as you can make it really cheap even with CONFIG_CHECKPOINT_RESTART=y.

Btw., i don't think it's all that much work, nor is it really intrusive:
have you thought of reusing all the existing security callbacks? You'd
have instant coverage of basically every system call and kernel
functionality that matters, and you could have a finegrained set of
policies.

The only drawback is that you have to enable CONFIG_SECURITY for it,
but in practice most distros enable that, so the callback overhead is
already there - you just have to enable it. (Also, some care has to
be taken to properly stack it to existing LSM modules, but that is
solvable too.)

Sidenote: CONFIG_CHECKPOINT_RESTART is IMO an uncomfortably long name,
i'd suggest to rename it to CONFIG_CHECKPOINTS or so. [the concept of a
checkpoint is good enough to mention - if there's a checkpoint then a
restart is logically implied.]

	Ingo
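
A minimal sketch of the "this can't be checkpointed" marker discussed
above, written as plain userspace C. Everything in it is an assumption
made for illustration: the cr_task structure, the CR_DENY_* reasons
(picked from the unsupported features listed earlier in the thread) and
the helper names do not come from the posted patch set. It only shows
the shape of the idea: unsupported features set a bit, supporting a
feature clears it, and both the kernel and the app can ask one cheap
question.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reasons why a task cannot be checkpointed right now. */
enum {
	CR_DENY_THREADS   = 1u << 0,	/* more than one thread */
	CR_DENY_SHMEM     = 1u << 1,	/* shared memory in use */
	CR_DENY_HUGETLB   = 1u << 2,	/* hugetlbfs mapping */
	CR_DENY_NONLINEAR = 1u << 3,	/* VM_NONLINEAR mapping */
};

/* Stand-in for the per-task state that would carry the marker. */
struct cr_task {
	unsigned int deny_mask;
};

/* Called where an unsupported feature starts being used. */
static void cr_deny_set(struct cr_task *t, unsigned int why)
{
	t->deny_mask |= why;
}

/* Called when the feature is released (or becomes supported). */
static void cr_deny_clear(struct cr_task *t, unsigned int why)
{
	t->deny_mask &= ~why;
}

/* The cheap runtime check: a single comparison against zero. */
static bool cr_may_checkpoint(const struct cr_task *t)
{
	return t->deny_mask == 0;
}

/* Lets a caller report exactly which feature blocks checkpointing. */
static bool cr_denied_for(const struct cr_task *t, unsigned int why)
{
	return (t->deny_mask & why) != 0;
}
```

Keeping the reason bits instead of a single flag is one way to make a
refusal "directly actionable": the app can be told which feature got in
the way rather than just that something did.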


* Re: [cgroup or VFS ?] WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
From: Li Zefan @ 2009-02-13  7:26 UTC (permalink / raw)
  To: Al Viro
  Cc: containers-qjLDD68F18O7TbgM5vRIOg, LKML, Paul Menage,
	Andrew Morton, Arjan van de Ven
In-Reply-To: <20090213071816.GK28946-3bDd1+5oDREiFSDQTTA3OLVCufUGDwFn@public.gmane.org>

Al Viro wrote:
> On Fri, Feb 13, 2009 at 06:41:35AM +0000, Al Viro wrote:
> 
> Aaaargh...
> 
>         /*
>          * We don't have to hold all of the locks at the
>          * same time here because we know that we're the
>          * last reference to mnt and that no new writers
>          * can come in.
>          */
>         for_each_possible_cpu(cpu) {
>                 struct mnt_writer *cpu_writer = &per_cpu(mnt_writers, cpu);
>                 if (cpu_writer->mnt != mnt)
>                         continue;
>                 spin_lock(&cpu_writer->lock);
> 
> is *almost* OK.  Modulo SMP cache coherency.  We know that nothing should
> be setting ->mnt to ours anymore, that's fine.  But we do not know if
> we've seen an *earlier* change done on the CPU in question (not the one
> we are running __mntput() on).
> 
> I would probably still like to use a milder solution in the long run, but
> for now let's check if turning that into
> 
>                 struct mnt_writer *cpu_writer = &per_cpu(mnt_writers, cpu);
>                 spin_lock(&cpu_writer->lock);
>                 if (cpu_writer->mnt != mnt) {
>                         spin_unlock(&cpu_writer->lock);
>                         continue;
>                 }
> prevents the problem, OK?
> 

Sure, I'll try. :)

BTW, thread2's rmdir failed:

rmdir: /cgroup/0: No such file or directory


* Re: [cgroup or VFS ?] WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
From: Al Viro @ 2009-02-13  7:18 UTC (permalink / raw)
  To: Li Zefan; +Cc: containers, Paul Menage, Arjan van de Ven, Andrew Morton, LKML
In-Reply-To: <20090213064135.GJ28946@ZenIV.linux.org.uk>

On Fri, Feb 13, 2009 at 06:41:35AM +0000, Al Viro wrote:

Aaaargh...

        /*
         * We don't have to hold all of the locks at the
         * same time here because we know that we're the
         * last reference to mnt and that no new writers
         * can come in.
         */
        for_each_possible_cpu(cpu) {
                struct mnt_writer *cpu_writer = &per_cpu(mnt_writers, cpu);
                if (cpu_writer->mnt != mnt)
                        continue;
                spin_lock(&cpu_writer->lock);

is *almost* OK.  Modulo SMP cache coherency.  We know that nothing should
be setting ->mnt to ours anymore, that's fine.  But we do not know if
we've seen an *earlier* change done on the CPU in question (not the one
we are running __mntput() on).

I would probably still like to use a milder solution in the long run, but
for now let's check if turning that into

                struct mnt_writer *cpu_writer = &per_cpu(mnt_writers, cpu);
                spin_lock(&cpu_writer->lock);
                if (cpu_writer->mnt != mnt) {
                        spin_unlock(&cpu_writer->lock);
                        continue;
                }
prevents the problem, OK?
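
The "take the lock first, then compare" pattern proposed above can be
sketched in userspace with pthread mutexes standing in for the per-CPU
spinlocks. This is a toy under stated assumptions, not fs/namespace.c:
writer_slot, NR_SLOTS, slots_init and drain_writers are invented names,
and a plain array replaces the per-CPU data. What it illustrates is
that the slot's mnt field is only read while the slot's lock is held,
so the reader is ordered against whoever last updated that slot.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NR_SLOTS 4	/* stand-in for the number of possible CPUs */

struct mount {
	int id;
};

/* One slot per "CPU": which mount it is caching writes for, and how
 * many writers it has counted locally. */
struct writer_slot {
	pthread_mutex_t lock;
	const struct mount *mnt;
	unsigned long count;
};

static struct writer_slot slots[NR_SLOTS];

static void slots_init(void)
{
	for (size_t i = 0; i < NR_SLOTS; i++) {
		pthread_mutex_init(&slots[i].lock, NULL);
		slots[i].mnt = NULL;
		slots[i].count = 0;
	}
}

/*
 * Collect the per-slot counts that belong to mnt.  The comparison
 * against slot->mnt happens only under slot->lock, mirroring the
 * reordering suggested in the mail above.
 */
static unsigned long drain_writers(const struct mount *mnt)
{
	unsigned long total = 0;

	for (size_t i = 0; i < NR_SLOTS; i++) {
		struct writer_slot *s = &slots[i];

		pthread_mutex_lock(&s->lock);
		if (s->mnt != mnt) {
			pthread_mutex_unlock(&s->lock);
			continue;
		}
		total += s->count;
		s->count = 0;
		s->mnt = NULL;	/* this slot no longer refers to mnt */
		pthread_mutex_unlock(&s->lock);
	}
	return total;
}
```

The cost is one lock round trip per slot even when the slot does not
match, which is presumably why a milder solution is preferred for the
long run.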


* Re: [cgroup or VFS ?] WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
From: Al Viro @ 2009-02-13  6:41 UTC (permalink / raw)
  To: Li Zefan
  Cc: containers-qjLDD68F18O7TbgM5vRIOg, LKML, Paul Menage,
	Andrew Morton, Arjan van de Ven
In-Reply-To: <49950F3D.3030704-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>

On Fri, Feb 13, 2009 at 02:12:13PM +0800, Li Zefan wrote:
> Al Viro wrote:
> > On Fri, Feb 13, 2009 at 01:09:17PM +0800, Li Zefan wrote:
> > 
> >> I ran the following testcase, and triggered the warning in 1 hour:
> >>
> >> thread 1:
> >> for ((; ;))
> >> {
> >>         mount --bind /cgroup /mnt > /dev/null 2>&1
> >>         umount /mnt > /dev/null 2>&1
> >> }
> >>
> >> thread 2:
> >> for ((; ;))
> >> {
> >>         mount -t cgroup -o cpu xxx /cgroup > /dev/null 2>&1
> >>         mkdir /cgroup/0 > /dev/null 2>&1
> >>         rmdir /cgroup/0 > /dev/null 2>&1
> >>         umount -l /cgroup > /dev/null 2>&1
> >> }
> > 
> > Wow.  You know, at that point these redirects could probably be removed.
> 
> Ah, yes.
> 
> > If anything in there ends up producing an output, we very much want to
> > see that.  Actually, I'd even make that
> > > 	mount --bind /cgroup /mnt || (echo mount1: ; date)
> > > etc., so we'd see when they fail and which one fails (if any)...
> >  
> > Which umount has failed in the above, BTW?
> > 
> > 
> 
> the first one sometimes failed, and the second one hasn't failed:

> mount: wrong fs type, bad option, bad superblock on /cgroup,
>        missing codepage or helper program, or other error
>        In some cases useful info is found in syslog - try
>        dmesg | tail  or so
> 
> mount1

Hold on.  In your last example the first one was doing mount --bind;
has _that_ failed?  Oh, wait...  It can fail, all right, if lookup on
/cgroup gives you your filesystem with the second thread managing to
detach it before we get the namespace_sem.  Then we'll fail that way -
and clean up properly.

Oh, well...  The original question still stands: with those two
scripts, which umount produces that WARN_ON?  The trivial way
to check would be to have a copy of /sbin/umount under a different
name and use _that_ in one of the threads instead of umount.
Then reproduce the WARN_ON and look at the process name in dmesg...


* Re: [cgroup or VFS ?] WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
From: Li Zefan @ 2009-02-13  6:31 UTC (permalink / raw)
  To: Al Viro; +Cc: containers, LKML, Paul Menage, Andrew Morton, Arjan van de Ven
In-Reply-To: <49950F3D.3030704@cn.fujitsu.com>

Li Zefan wrote:
> Al Viro wrote:
>> On Fri, Feb 13, 2009 at 01:09:17PM +0800, Li Zefan wrote:
>>
>>> I ran the following testcase, and triggered the warning in 1 hour:
>>>
>>> thread 1:
>>> for ((; ;))
>>> {
>>>         mount --bind /cgroup /mnt > /dev/null 2>&1
>>>         umount /mnt > /dev/null 2>&1
>>> }
>>>
>>> thread 2:
>>> for ((; ;))
>>> {
>>>         mount -t cgroup -o cpu xxx /cgroup > /dev/null 2>&1
>>>         mkdir /cgroup/0 > /dev/null 2>&1
>>>         rmdir /cgroup/0 > /dev/null 2>&1
>>>         umount -l /cgroup > /dev/null 2>&1
>>> }
>> Wow.  You know, at that point these redirects could probably be removed.
> 
> Ah, yes.
> 
>> If anything in there ends up producing an output, we very much want to
>> see that.  Actually, I'd even make that
>> 	mount --bind /cgroup /mnt || (echo mount1: ; date)
>> etc., so we'd see when they fail and which one fails (if any)...
>>  
>> Which umount has failed in the above, BTW?
>>
>>
> 
> the first one sometimes failed, and the second one hasn't failed:
> 

Just triggered the warning at about:
	Fri Feb 13 14:26:03 CST 2009

But neither of the 2 threads was failing at that time:

mount: wrong fs type, bad option, bad superblock on /cgroup,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

mount1
Fri Feb 13 14:25:21 CST 2009
umount: /mnt: not mounted
mount: wrong fs type, bad option, bad superblock on /cgroup,
...

mount1
Fri Feb 13 14:26:32 CST 2009
umount: /mnt: not mounted
mount: wrong fs type, bad option, bad superblock on /cgroup,
...


* Re: [cgroup or VFS ?] WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
From: Li Zefan @ 2009-02-13  6:12 UTC (permalink / raw)
  To: Al Viro; +Cc: containers, Paul Menage, Arjan van de Ven, Andrew Morton, LKML
In-Reply-To: <20090213054751.GI28946@ZenIV.linux.org.uk>

Al Viro wrote:
> On Fri, Feb 13, 2009 at 01:09:17PM +0800, Li Zefan wrote:
> 
>> I ran the following testcase, and triggered the warning in 1 hour:
>>
>> thread 1:
>> for ((; ;))
>> {
>>         mount --bind /cgroup /mnt > /dev/null 2>&1
>>         umount /mnt > /dev/null 2>&1
>> }
>>
>> thread 2:
>> for ((; ;))
>> {
>>         mount -t cgroup -o cpu xxx /cgroup > /dev/null 2>&1
>>         mkdir /cgroup/0 > /dev/null 2>&1
>>         rmdir /cgroup/0 > /dev/null 2>&1
>>         umount -l /cgroup > /dev/null 2>&1
>> }
> 
> Wow.  You know, at that point these redirects could probably be removed.

Ah, yes.

> If anything in there ends up producing an output, we very much want to
> see that.  Actually, I'd even make that
> 	mount --bind /cgroup /mnt || (echo mount1: ; date)
> etc., so we'd see when they fail and which one fails (if any)...
>  
> Which umount has failed in the above, BTW?
> 
> 

the first one sometimes failed, and the second one hasn't failed:

mount: wrong fs type, bad option, bad superblock on /cgroup,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so

mount1
Fri Feb 13 14:05:37 CST 2009
umount: /mnt: not mounted
mount: wrong fs type, bad option, bad superblock on /cgroup,
...

mount1
Fri Feb 13 14:08:34 CST 2009
umount: /mnt: not mounted
mount: wrong fs type, bad option, bad superblock on /cgroup,
...

mount1
Fri Feb 13 14:08:43 CST 2009
umount: /mnt: not mounted
mount: wrong fs type, bad option, bad superblock on /cgroup,
...


* Re: [cgroup or VFS ?] WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
From: Al Viro @ 2009-02-13  5:47 UTC (permalink / raw)
  To: Li Zefan
  Cc: containers-qjLDD68F18O7TbgM5vRIOg, LKML, Paul Menage,
	Andrew Morton, Arjan van de Ven
In-Reply-To: <4995007D.7040101-BthXqXjhjHXQFUHtdCDX3A@public.gmane.org>

On Fri, Feb 13, 2009 at 01:09:17PM +0800, Li Zefan wrote:

> I ran the following testcase, and triggered the warning in 1 hour:
> 
> thread 1:
> for ((; ;))
> {
>         mount --bind /cgroup /mnt > /dev/null 2>&1
>         umount /mnt > /dev/null 2>&1
> }
> 
> thread 2:
> for ((; ;))
> {
>         mount -t cgroup -o cpu xxx /cgroup > /dev/null 2>&1
>         mkdir /cgroup/0 > /dev/null 2>&1
>         rmdir /cgroup/0 > /dev/null 2>&1
>         umount -l /cgroup > /dev/null 2>&1
> }

Wow.  You know, at that point these redirects could probably be removed.
If anything in there ends up producing an output, we very much want to
see that.  Actually, I'd even make that
	mount --bind /cgroup /mnt || (echo mount1: ; date)
etc., so we'd see when they fail and which one fails (if any)...
 
Which umount has failed in the above, BTW?
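
A minimal sketch of the suggestion above, for anyone rerunning the testcase: the redirects are dropped, and each command reports a label plus a timestamp when it fails. The helper name `report` and the label `umount1` are assumptions made for illustration (only `mount1` appears in the thread), and `thread1` needs root and the same /cgroup and /mnt paths as the original testcase.

```shell
# report LABEL CMD...: run CMD with its output left visible; if it fails,
# print "LABEL:" and the current date, like "|| (echo mount1: ; date)".
report() {
    report_label=$1
    shift
    "$@" || { echo "$report_label:"; date; }
}

# Thread 1 of the testcase without the /dev/null redirects.
# Not called here: it loops forever and needs root.
thread1() {
    while :; do
        report mount1 mount --bind /cgroup /mnt
        report umount1 umount /mnt
    done
}
```

Thread 2 could be wrapped the same way, one label per command, so that the log shows which of the mount/mkdir/rmdir/umount steps fails and when.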


* Re: [cgroup or VFS ?] WARNING: at fs/namespace.c:636 mntput_no_expire+0xac/0xf2()
From: Li Zefan @ 2009-02-13  5:09 UTC (permalink / raw)
  To: Al Viro; +Cc: containers, Paul Menage, Arjan van de Ven, Andrew Morton, LKML
In-Reply-To: <20090212070729.GF28946@ZenIV.linux.org.uk>

>> thread 1:
>> for ((; ;))
>> {
>> 	mount -t cgroup -o ns xxx cgroup/ > /dev/null 2>&1
>> 	# remove the dirs generated by cgroup_clone()
>> 	rmdir cgroup/[1-9]* > /dev/null 2>&1
>> 	umount cgroup/ > /dev/null 2>&1
>> }
>>
>>
>> thread 2:
>>
>> int foo(void *arg)
>> { return 0; }
>>
>> char *stack[4096];
>>
>> int main(int argc, char **argv)
>> {
>>         int usec = DEFAULT_USEC;
>>         while (1) {
>>                 usleep(usec);
>>                 /* cgroup_clone() will be called */
>>                 clone(foo, stack+4096, CLONE_NEWNS, NULL);
>>         }
>>
>>         return 0;
>> }
> 
> Uh-oh...  That clone() will do more, actually - it will clone a bunch
> of vfsmounts.  What happens if you create a separate namespace for the
> first thread, so that the second one would not have our vfsmount to
> play with?
> 

The warning can still be triggered, but it seems harder (it took me 1 hour).

> Alternatively, what if the second thread is doing
> 	mount --bind cgroup foo
> 	umount foo
> in a loop?
> 

I ran the following testcase, and triggered the warning in 1 hour:

thread 1:
for ((; ;))
{
        mount --bind /cgroup /mnt > /dev/null 2>&1
        umount /mnt > /dev/null 2>&1
}

thread 2:
for ((; ;))
{
        mount -t cgroup -o cpu xxx /cgroup > /dev/null 2>&1
        mkdir /cgroup/0 > /dev/null 2>&1
        rmdir /cgroup/0 > /dev/null 2>&1
        umount -l /cgroup > /dev/null 2>&1
}

> Another one: does turning the umount in the first thread into umount -l
> affect anything?
> 

For this one, I ran the test for the whole night, but failed to hit the warning.


* Re: [RFC v13][PATCH 00/14] Kernel based checkpoint/restart
From: Dave Hansen @ 2009-02-12 23:13 UTC (permalink / raw)
  To: Matt Mackall
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Alexey Dobriyan, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Cedric Le Goater, Thomas Gleixner,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Andrew Morton,
	Pavel Emelyanov
In-Reply-To: <1234479924.3152.13.camel@calx>

On Thu, 2009-02-12 at 17:05 -0600, Matt Mackall wrote:
> On Thu, 2009-02-12 at 14:57 -0800, Dave Hansen wrote:
> > > Also, what happens if I checkpoint a process in 2.6.30 and restore it in
> > > 2.6.31 which has an expanded idea of what should be restored? Do your
> > > file formats handle this sort of forward compatibility or am I
> > > restricted to one kernel?
> > 
> > In general, you're restricted to one kernel.  But, people have mentioned
> > that, if the formats change, we should be able to write in-userspace
> > converters for the checkpoint files.  
> 
> I mentioned this because it seems like a key use case is upgrading
> kernels out from under long-lived applications.

The key users as I envision them aren't really kernel hackers who are
always running 2.6-next and running radically different kernels from
moment to moment. :)

Distros are pretty picky about changing things internal to the kernel
during errata updates or even service packs.  While that can be a pain
for some of us developers trying to get features and fixes in, it is a
godsend for trying to do something like process migration across an
update.

My random speculation would be that if a kernel upgrade can be
performed with ksplice (http://www.ksplice.com/) -- the original
non-fancy version at least -- we can probably migrate across the
upgrade.

-- Dave


* Re: [RFC v13][PATCH 00/14] Kernel based checkpoint/restart
From: Matt Mackall @ 2009-02-12 23:05 UTC (permalink / raw)
  To: Dave Hansen
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Alexey Dobriyan, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Cedric Le Goater, Thomas Gleixner,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Andrew Morton,
	Pavel Emelyanov
In-Reply-To: <1234479457.30155.214.camel@nimitz>

On Thu, 2009-02-12 at 14:57 -0800, Dave Hansen wrote:
> > Also, what happens if I checkpoint a process in 2.6.30 and restore it in
> > 2.6.31 which has an expanded idea of what should be restored? Do your
> > file formats handle this sort of forward compatibility or am I
> > restricted to one kernel?
> 
> In general, you're restricted to one kernel.  But, people have mentioned
> that, if the formats change, we should be able to write in-userspace
> converters for the checkpoint files.  

I mentioned this because it seems like a key use case is upgrading
kernels out from under long-lived applications.

-- 
http://selenic.com : development and support for Mercurial and Linux


* How much of a mess does OpenVZ make? ;) Was: What can OpenVZ do?
From: Dave Hansen @ 2009-02-12 23:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mpm-VDJrAJ4Gl5ZBDgjK7y7TUQ, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Alexey Dobriyan, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, hpa-YMNOUZJC4hwAvxtiuMwx3w,
	mingo-X9Un+BFzKDI, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	tglx-hfZtesqFncYOwBW4kG4KsQ, xemul-GEFAQzZX7r8dnm+yROfE0A
In-Reply-To: <20090212141014.2cd3d54d.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>

On Thu, 2009-02-12 at 14:10 -0800, Andrew Morton wrote:
> On Thu, 12 Feb 2009 13:51:23 -0800
> Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:
> 
> > On Thu, 2009-02-12 at 11:42 -0800, Andrew Morton wrote:
> > > On Thu, 12 Feb 2009 13:30:35 -0600
> > > Matt Mackall <mpm-VDJrAJ4Gl5ZBDgjK7y7TUQ@public.gmane.org> wrote:
> > > 
> > > > On Thu, 2009-02-12 at 10:11 -0800, Dave Hansen wrote:
> > > > 
> > > > > > - In bullet-point form, what features are missing, and should be added?
> > > > > 
> > > > >  * support for more architectures than i386
> > > > >  * file descriptors:
> > > > >   * sockets (network, AF_UNIX, etc...)
> > > > > >   * device files
> > > > >   * shmfs, hugetlbfs
> > > > >   * epoll
> > > > >   * unlinked files
> > > > 
> > > > >  * Filesystem state
> > > > >   * contents of files
> > > > >   * mount tree for individual processes
> > > > >  * flock
> > > > >  * threads and sessions
> > > > >  * CPU and NUMA affinity
> > > > >  * sys_remap_file_pages()
> > > > 
> > > > I think the real question is: where are the dragons hiding? Some of
> > > > these are known to be hard. And some of them are critical for checkpointing
> > > > typical applications. If you have plans or theories for implementing all
> > > > of the above, then great. But this list doesn't really give any sense of
> > > > whether we should be scared of what lurks behind those doors.
> > > 
> > > How close has OpenVZ come to implementing all of this?  I think the
> > > implementation is fairly complete?
> > 
> > I also believe it is "fairly complete".  At least able to be used
> > practically.
> > 
> > > If so, perhaps that can be used as a guide.  Will the planned feature
> > > have a similar design?  If not, how will it differ?  To what extent can
> > > we use that implementation as a tool for understanding what this new
> > > implementation will look like?
> > 
> > Yes, we can certainly use it as a guide.  However, there are some
> > barriers to being able to do that:
> > 
> > dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... | diffstat | tail -1
> >  628 files changed, 59597 insertions(+), 2927 deletions(-)
> > dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... | wc 
> >   84887  290855 2308745
> > 
> > Unfortunately, the git tree doesn't have that great of a history.  It
> > appears that the forward-ports are just applications of huge single
> > patches which then get committed into git.  This tree has also
> > historically contained a bunch of stuff not directly related to
> > checkpoint/restart like resource management.
> > 
> > We'd be idiots not to take a hard look at what has been done in OpenVZ.
> > But, for the time being, we have absolutely no shortage of things that
> > we know are important and know have to be done.  Our largest problem is
> > not finding things to do, but is our large out-of-tree patch that is
> > growing by the day. :(
> > 
> 
> Well we have a chicken-and-eggish thing.  The patchset will keep
> growing until we understand how much of this:
> 
> > dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... | diffstat | tail -1
> >  628 files changed, 59597 insertions(+), 2927 deletions(-)
> 
> we will be committed to if we were to merge the current patchset.

Here's the measurement that Alexey suggested:

dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... kernel/cpt/ | diffstat 
 Makefile        |   53 +
 cpt_conntrack.c |  365 ++++++++++++
 cpt_context.c   |  257 ++++++++
 cpt_context.h   |  215 +++++++
 cpt_dump.c      | 1250 ++++++++++++++++++++++++++++++++++++++++++
 cpt_dump.h      |   16 
 cpt_epoll.c     |  113 +++
 cpt_exports.c   |   13 
 cpt_files.c     | 1626 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 cpt_files.h     |   71 ++
 cpt_fsmagic.h   |   16 
 cpt_inotify.c   |  144 ++++
 cpt_kernel.c    |  177 ++++++
 cpt_kernel.h    |   99 +++
 cpt_mm.c        |  923 +++++++++++++++++++++++++++++++
 cpt_mm.h        |   35 +
 cpt_net.c       |  614 ++++++++++++++++++++
 cpt_net.h       |    7 
 cpt_obj.c       |  162 +++++
 cpt_obj.h       |   62 ++
 cpt_proc.c      |  595 ++++++++++++++++++++
 cpt_process.c   | 1369 ++++++++++++++++++++++++++++++++++++++++++++++
 cpt_process.h   |   13 
 cpt_socket.c    |  790 ++++++++++++++++++++++++++
 cpt_socket.h    |   33 +
 cpt_socket_in.c |  450 +++++++++++++++
 cpt_syscalls.h  |  101 +++
 cpt_sysvipc.c   |  403 +++++++++++++
 cpt_tty.c       |  215 +++++++
 cpt_ubc.c       |  132 ++++
 cpt_ubc.h       |   23 
 cpt_x8664.S     |   67 ++
 rst_conntrack.c |  283 +++++++++
 rst_context.c   |  323 ++++++++++
 rst_epoll.c     |  169 +++++
 rst_files.c     | 1648 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 rst_inotify.c   |  196 ++++++
 rst_mm.c        | 1151 +++++++++++++++++++++++++++++++++++++++
 rst_net.c       |  741 +++++++++++++++++++++++++
 rst_proc.c      |  580 +++++++++++++++++++
 rst_process.c   | 1640 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 rst_socket.c    |  918 +++++++++++++++++++++++++++++++
 rst_socket_in.c |  489 ++++++++++++++++
 rst_sysvipc.c   |  633 +++++++++++++++++++++
 rst_tty.c       |  384 +++++++++++++
 rst_ubc.c       |  131 ++++
 rst_undump.c    | 1007 ++++++++++++++++++++++++++++++++++
 47 files changed, 20702 insertions(+)

One important thing that leaves out is the interaction that this code
has with the rest of the kernel.  That's critically important when
considering long-term maintenance, and I'd be curious how the OpenVZ
folks view it. 

> Now, we've gone in blind before - most notably on the
> containers/cgroups/namespaces stuff.  That hail mary pass worked out
> acceptably, I think.  Maybe we got lucky.  I thought that
> net-namespaces in particular would never get there, but it did.
> 
> That was a very large and quite long-term-important user-visible
> feature.
> 
> checkpoint/restart/migration is also a long-term-...-feature.  But if
> at all possible I do think that we should go into it with our eyes a
> little less shut.

One thing Ingo has asked for that I understand a bit more clearly is a
programmatic statement of what is and is not covered by this current
code.  That's certainly one eye-opening activity which I'll get to
immediately.

-- Dave


* Re: [RFC v13][PATCH 00/14] Kernel based checkpoint/restart
From: Dave Hansen @ 2009-02-12 22:57 UTC (permalink / raw)
  To: Matt Mackall
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	hpa-YMNOUZJC4hwAvxtiuMwx3w, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	Alexey Dobriyan, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	Cedric Le Goater, Thomas Gleixner,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	linux-api-u79uwXL29TY76Z2rM5mHXA, Ingo Molnar,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, Andrew Morton,
	Pavel Emelyanov
In-Reply-To: <1234467035.3243.538.camel@calx>

On Thu, 2009-02-12 at 13:30 -0600, Matt Mackall wrote:
> On Thu, 2009-02-12 at 10:11 -0800, Dave Hansen wrote:
...
> >  * Filesystem state
> >   * contents of files
> >   * mount tree for individual processes
> >  * flock
> >  * threads and sessions
> >  * CPU and NUMA affinity
> >  * sys_remap_file_pages()
> 
> I think the real question is: where are the dragons hiding? Some of
> these are known to be hard. And some of them are critical for checkpointing
> typical applications. If you have plans or theories for implementing all
> of the above, then great. But this list doesn't really give any sense of
> whether we should be scared of what lurks behind those doors.

This is probably a better question for people like Pavel, Alexey and
Cedric to answer.  

> Some of these things we probably don't have to care too much about. For
> instance, contents of files - these can legitimately change for a
> running process. Open TCP/IP sockets can legitimately get reset as well.
> But others are a bigger deal.

Legitimately, yes.  But, practically, these are things that we need to
handle because we want to make any checkpoint/restart as transparent as
possible.  Resetting people's network connections is not exactly illegal
but not very nice or transparent either.

> Also, what happens if I checkpoint a process in 2.6.30 and restore it in
> 2.6.31 which has an expanded idea of what should be restored? Do your
> file formats handle this sort of forward compatibility or am I
> restricted to one kernel?

In general, you're restricted to one kernel.  But, people have mentioned
that, if the formats change, we should be able to write in-userspace
converters for the checkpoint files.  

-- Dave


* Re: What can OpenVZ do?
From: Alexey Dobriyan @ 2009-02-12 22:17 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Matt Mackall, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, tglx-hfZtesqFncYOwBW4kG4KsQ,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, hpa-YMNOUZJC4hwAvxtiuMwx3w,
	Andrew Morton, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	mingo-X9Un+BFzKDI, Pavel Emelyanov
In-Reply-To: <1234475483.30155.194.camel@nimitz>

On Thu, Feb 12, 2009 at 01:51:23PM -0800, Dave Hansen wrote:
> On Thu, 2009-02-12 at 11:42 -0800, Andrew Morton wrote:
> > On Thu, 12 Feb 2009 13:30:35 -0600
> > Matt Mackall <mpm-VDJrAJ4Gl5ZBDgjK7y7TUQ@public.gmane.org> wrote:
> > 
> > > On Thu, 2009-02-12 at 10:11 -0800, Dave Hansen wrote:
> > > 
> > > > > - In bullet-point form, what features are missing, and should be added?
> > > > 
> > > >  * support for more architectures than i386
> > > >  * file descriptors:
> > > >   * sockets (network, AF_UNIX, etc...)
> > > > >   * device files
> > > >   * shmfs, hugetlbfs
> > > >   * epoll
> > > >   * unlinked files
> > > 
> > > >  * Filesystem state
> > > >   * contents of files
> > > >   * mount tree for individual processes
> > > >  * flock
> > > >  * threads and sessions
> > > >  * CPU and NUMA affinity
> > > >  * sys_remap_file_pages()
> > > 
> > > I think the real question is: where are the dragons hiding? Some of
> > > these are known to be hard. And some of them are critical for checkpointing
> > > typical applications. If you have plans or theories for implementing all
> > > of the above, then great. But this list doesn't really give any sense of
> > > whether we should be scared of what lurks behind those doors.
> > 
> > How close has OpenVZ come to implementing all of this?  I think the
> > implementation is fairly complete?
> 
> I also believe it is "fairly complete".  At least able to be used
> practically.
> 
> > If so, perhaps that can be used as a guide.  Will the planned feature
> > have a similar design?  If not, how will it differ?  To what extent can
> > we use that implementation as a tool for understanding what this new
> > implementation will look like?
> 
> Yes, we can certainly use it as a guide.  However, there are some
> barriers to being able to do that:
> 
> dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... | diffstat | tail -1
>  628 files changed, 59597 insertions(+), 2927 deletions(-)
> dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... | wc 
>   84887  290855 2308745

	git-diff -- kernel/cpt/

should give a more realistic picture.

> Unfortunately, the git tree doesn't have that great of a history.  It
> appears that the forward-ports are just applications of huge single
> patches which then get committed into git.  This tree has also
> historically contained a bunch of stuff not directly related to
> checkpoint/restart like resource management.

> We'd be idiots not to take a hard look at what has been done in OpenVZ.
> But, for the time being, we have absolutely no shortage of things that
> we know are important and know have to be done.  Our largest problem is
> not finding things to do, but is our large out-of-tree patch that is
> growing by the day. :(


* Re: What can OpenVZ do?
From: Andrew Morton @ 2009-02-12 22:10 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	mpm-VDJrAJ4Gl5ZBDgjK7y7TUQ, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, hpa-YMNOUZJC4hwAvxtiuMwx3w,
	mingo-X9Un+BFzKDI, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	tglx-hfZtesqFncYOwBW4kG4KsQ, xemul-GEFAQzZX7r8dnm+yROfE0A
In-Reply-To: <1234475483.30155.194.camel@nimitz>

On Thu, 12 Feb 2009 13:51:23 -0800
Dave Hansen <dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org> wrote:

> On Thu, 2009-02-12 at 11:42 -0800, Andrew Morton wrote:
> > On Thu, 12 Feb 2009 13:30:35 -0600
> > Matt Mackall <mpm-VDJrAJ4Gl5ZBDgjK7y7TUQ@public.gmane.org> wrote:
> > 
> > > On Thu, 2009-02-12 at 10:11 -0800, Dave Hansen wrote:
> > > 
> > > > > - In bullet-point form, what features are missing, and should be added?
> > > > 
> > > >  * support for more architectures than i386
> > > >  * file descriptors:
> > > >   * sockets (network, AF_UNIX, etc...)
> > > > >   * device files
> > > >   * shmfs, hugetlbfs
> > > >   * epoll
> > > >   * unlinked files
> > > 
> > > >  * Filesystem state
> > > >   * contents of files
> > > >   * mount tree for individual processes
> > > >  * flock
> > > >  * threads and sessions
> > > >  * CPU and NUMA affinity
> > > >  * sys_remap_file_pages()
> > > 
> > > I think the real question is: where are the dragons hiding? Some of
> > > these are known to be hard. And some of them are critical for checkpointing
> > > typical applications. If you have plans or theories for implementing all
> > > of the above, then great. But this list doesn't really give any sense of
> > > whether we should be scared of what lurks behind those doors.
> > 
> > How close has OpenVZ come to implementing all of this?  I think the
> > implementation is fairly complete?
> 
> I also believe it is "fairly complete".  At least able to be used
> practically.
> 
> > If so, perhaps that can be used as a guide.  Will the planned feature
> > have a similar design?  If not, how will it differ?  To what extent can
> > we use that implementation as a tool for understanding what this new
> > implementation will look like?
> 
> Yes, we can certainly use it as a guide.  However, there are some
> barriers to being able to do that:
> 
> dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... | diffstat | tail -1
>  628 files changed, 59597 insertions(+), 2927 deletions(-)
> dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... | wc 
>   84887  290855 2308745
> 
> Unfortunately, the git tree doesn't have that great of a history.  It
> appears that the forward-ports are just applications of huge single
> patches which then get committed into git.  This tree has also
> historically contained a bunch of stuff not directly related to
> checkpoint/restart like resource management.
> 
> We'd be idiots not to take a hard look at what has been done in OpenVZ.
> But, for the time being, we have absolutely no shortage of things that
> we know are important and know have to be done.  Our largest problem is
> not finding things to do, but is our large out-of-tree patch that is
> growing by the day. :(
> 

Well we have a chicken-and-eggish thing.  The patchset will keep
growing until we understand how much of this:

> dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... | diffstat | tail -1
>  628 files changed, 59597 insertions(+), 2927 deletions(-)

we will be committed to if we were to merge the current patchset.


Now, we've gone in blind before - most notably on the
containers/cgroups/namespaces stuff.  That hail mary pass worked out
acceptably, I think.  Maybe we got lucky.  I thought that
net-namespaces in particular would never get there, but it did.

That was a very large and quite long-term-important user-visible
feature.

checkpoint/restart/migration is also a long-term-...-feature.  But if
at all possible I do think that we should go into it with our eyes a
little less shut.

Interestingly, there was also prior-art for
containers/cgroups/namespaces within OpenVZ.  But we decided up-front
(I think) that the eventual implementation would have little in common
with preceding implementations.


Oh, and I'd disagree with your new Subject:.  It's pretty easy to find
out what OpenVZ can do.  The more important question here is "how much
of a mess did it make when it did it?"


* What can OpenVZ do?
From: Dave Hansen @ 2009-02-12 21:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Matt Mackall, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, hpa-YMNOUZJC4hwAvxtiuMwx3w,
	mingo-X9Un+BFzKDI, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	tglx-hfZtesqFncYOwBW4kG4KsQ, Pavel Emelyanov
In-Reply-To: <20090212114207.e1c2de82.akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>

On Thu, 2009-02-12 at 11:42 -0800, Andrew Morton wrote:
> On Thu, 12 Feb 2009 13:30:35 -0600
> Matt Mackall <mpm-VDJrAJ4Gl5ZBDgjK7y7TUQ@public.gmane.org> wrote:
> 
> > On Thu, 2009-02-12 at 10:11 -0800, Dave Hansen wrote:
> > 
> > > > - In bullet-point form, what features are missing, and should be added?
> > > 
> > >  * support for more architectures than i386
> > >  * file descriptors:
> > >   * sockets (network, AF_UNIX, etc...)
> > >   * device files
> > >   * shmfs, hugetlbfs
> > >   * epoll
> > >   * unlinked files
> > 
> > >  * Filesystem state
> > >   * contents of files
> > >   * mount tree for individual processes
> > >  * flock
> > >  * threads and sessions
> > >  * CPU and NUMA affinity
> > >  * sys_remap_file_pages()
> > 
> > I think the real question is: where are the dragons hiding? Some of
> > these are known to be hard. And some of them are critical for checkpointing
> > typical applications. If you have plans or theories for implementing all
> > of the above, then great. But this list doesn't really give any sense of
> > whether we should be scared of what lurks behind those doors.
> 
> How close has OpenVZ come to implementing all of this?  I think the
> implementation is fairly complete?

I also believe it is "fairly complete".  At least able to be used
practically.

> If so, perhaps that can be used as a guide.  Will the planned feature
> have a similar design?  If not, how will it differ?  To what extent can
> we use that implementation as a tool for understanding what this new
> implementation will look like?

Yes, we can certainly use it as a guide.  However, there are some
barriers to being able to do that:

dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... | diffstat | tail -1
 628 files changed, 59597 insertions(+), 2927 deletions(-)
dave@nimitz:~/kernels/linux-2.6-openvz$ git diff v2.6.27.10... | wc 
  84887  290855 2308745

Unfortunately, the git tree doesn't have that great of a history.  It
appears that the forward-ports are just applications of huge single
patches which then get committed into git.  This tree has also
historically contained a bunch of stuff not directly related to
checkpoint/restart like resource management.

We'd be idiots not to take a hard look at what has been done in OpenVZ.
But, for the time being, we have absolutely no shortage of things that
we know are important and know have to be done.  Our largest problem is
not finding things to do, but is our large out-of-tree patch that is
growing by the day. :(

-- Dave


* Re: [RFC v13][PATCH 00/14] Kernel based checkpoint/restart
From: Serge E. Hallyn @ 2009-02-12 20:48 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-api-u79uwXL29TY76Z2rM5mHXA,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-mm-Bw31MaZKKs3YtjvyW6yDsg, tglx-hfZtesqFncYOwBW4kG4KsQ,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn, hpa-YMNOUZJC4hwAvxtiuMwx3w,
	Ingo Molnar, torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	Andrew Morton
In-Reply-To: <1234462283.30155.173.camel@nimitz>

Quoting Dave Hansen (dave-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org):
> Patch 12/14 is supposed to address this *concept*.  But, it hasn't been
> carried through so that it currently works.  My expectation was that we
> would go through and add things over time.  I'll go make sure I push it
> to the point that it actually works for at least the simple test
> programs that we have.
> 
> What I will probably do is something BKL-style.  Basically put a "this
> can't be checkpointed" marker over most everything I can think of and
> selectively remove it as we add features.  

So the question is: when can we unset the uncheckpointable flag?

In your patch you suggest clone(CLONE_NEWPID).  But that would
require that we at that point do a slew of checks for other
things like open files of a type which are not supported.

I'm wondering whether we should instead stick to calculating
whether a task is checkpointable or not at checkpoint time.
To help an application figure out whether it can be checkpointed,
we can hook /proc/$$/checkpointable to the same function, and
have the file output list all of the reasons the task is not
checkpointable, e.g.:

	mmap MAP_SHARED file which is not yet supported
	open file from another mounts namespace
	open TCP socket which is not yet supported
	open epoll fd which is not yet supported
	TASK NOT FROZEN

So now every time we do a checkpoint we have to do all these
checks, but that's better than at clone time.

You suggested on IRC having a fops->is_checkpointable()
function, which is IMO a good idea to help implement the above.
The default can be a function returning false.  I suppose
we want to pass back a char * with the file type as well.

-serge
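
As a rough illustration of how a script might consume such a file, here is a minimal sketch. It is hypothetical throughout: /proc/$$/checkpointable is only a proposal in this thread and does not exist in mainline kernels, the one-reason-per-line format is assumed from the example list above, and the function name `checkpoint_blockers` is made up. The file path is passed as an argument so the sketch can be tried against an ordinary text file.

```shell
# checkpoint_blockers FILE: print every reason listed in FILE.
# Returns 0 if FILE lists no reasons (nothing blocks a checkpoint),
# 1 if at least one reason is listed, 2 if FILE cannot be read.
# FILE stands in for the proposed (non-existent) /proc/$$/checkpointable.
checkpoint_blockers() {
    blockers_file=$1
    if [ ! -r "$blockers_file" ]; then
        echo "cannot read $blockers_file" >&2
        return 2
    fi
    blockers_count=0
    # The "|| [ -n ... ]" part keeps a final line that lacks a newline.
    while IFS= read -r reason || [ -n "$reason" ]; do
        [ -n "$reason" ] || continue
        echo "not checkpointable: $reason"
        blockers_count=$((blockers_count + 1))
    done < "$blockers_file"
    [ "$blockers_count" -eq 0 ]
}
```

A caller would run something like `checkpoint_blockers /proc/$$/checkpointable && echo ok` if the interface existed; doing the check at checkpoint time, as argued above, means the list always reflects the task's current state.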
