Banning checkpoint (was: Re: What can OpenVZ do?)

* Banning checkpoint (was: Re: What can OpenVZ do?)
  2009-02-18 23:15                 ` Ingo Molnar
@ 2009-02-19 19:06                   ` Alexey Dobriyan
  2009-02-19 19:11                     ` Dave Hansen
  0 siblings, 1 reply; 7+ messages in thread
From: Alexey Dobriyan @ 2009-02-19 19:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Dave Hansen, Nathan Lynch, linux-api, containers, mpm,
	linux-kernel, linux-mm, viro, hpa, Andrew Morton, torvalds, tglx,
	xemul

I think that all these efforts to abort checkpoint "intelligently" by
banning it early are completely misguided.

The "checkpointable" property isn't a one-way ticket like the "tainted"
flag, so treating it like the tainted variable isn't right, atomic or not,
SMP-safe or not.

With filesystems, one has the ->f_op field to compare against banned
filesystems; one more flag isn't necessary.

Inotify isn't supported yet? You do

	if (!list_empty(&inode->inotify_watches))
		return -E;

without hooking into inotify syscalls.

ptrace(2) isn't supported -- look at struct task_struct::ptraced and
friends.

And so on.

System call (or whatever) does something with some piece of kernel
internals. We look at this "something" when walking data structures and
abort if it's scary enough.
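
A minimal userspace sketch of the checkpoint-time approach described
above. It is not kernel code: the model_* types, BANNED_FOPS and the
errno values are hypothetical stand-ins (the mail leaves the error code
unspecified as -E), chosen only to show that the checks can be made while
walking the data structures, with no hooks in the syscalls themselves.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel objects named above. */
struct model_inode {
	int nr_inotify_watches;	/* models !list_empty(&inode->inotify_watches) */
};

struct model_file {
	const void *f_op;	/* models file->f_op */
	struct model_inode *inode;
};

struct model_task {
	bool has_ptracees;	/* models a non-empty task_struct::ptraced list */
	struct model_file **files;
	size_t nr_files;
};

/* Tag standing in for the f_op of an unsupported filesystem (assumption). */
static const int banned_fops_tag = 0;
#define BANNED_FOPS ((const void *)&banned_fops_tag)

/*
 * Walk one task's state at checkpoint time.
 * Returns 0 if nothing scary was found, a negative errno otherwise.
 */
int model_check_task(const struct model_task *t)
{
	size_t i;

	assert(t != NULL);
	if (t->has_ptracees)
		return -EBUSY;		/* ptrace(2) not supported */

	for (i = 0; i < t->nr_files; i++) {
		const struct model_file *f = t->files[i];

		if (f->f_op == BANNED_FOPS)
			return -EINVAL;	/* file on a banned filesystem */
		if (f->inode->nr_inotify_watches > 0)
			return -EBUSY;	/* inotify not supported yet */
	}
	return 0;
}
```

The point of the sketch is that closing the watched file or detaching the
tracer makes the same walk succeed again, without any flag to reset.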

Please, show at least one counter-example.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

* Re: Banning checkpoint (was: Re: What can OpenVZ do?)
  2009-02-19 19:06                   ` Banning checkpoint (was: Re: What can OpenVZ do?) Alexey Dobriyan
@ 2009-02-19 19:11                     ` Dave Hansen
  2009-02-24  4:47                       ` Alexey Dobriyan
  0 siblings, 1 reply; 7+ messages in thread
From: Dave Hansen @ 2009-02-19 19:11 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Ingo Molnar, Nathan Lynch, linux-api, containers, mpm,
	linux-kernel, linux-mm, viro, hpa, Andrew Morton, torvalds, tglx,
	xemul

On Thu, 2009-02-19 at 22:06 +0300, Alexey Dobriyan wrote:
> Inotify isn't supported yet? You do
> 
>         if (!list_empty(&inode->inotify_watches))
>                 return -E;
> 
> without hooking into inotify syscalls.
> 
> ptrace(2) isn't supported -- look at struct task_struct::ptraced and
> friends.
> 
> And so on.
> 
> System call (or whatever) does something with some piece of kernel
> internals. We look at this "something" when walking data structures and
> abort if it's scary enough.
> 
> Please, show at least one counter-example.

Alexey, I agree with you here.  I've been fighting myself internally
about these two somewhat opposing approaches.  Of *course* we can
determine the "checkpointability" at sys_checkpoint() time by checking
all the various bits of state.

The problem that I think Ingo is trying to address here is that doing it
then makes it hard to figure out _when_ you went wrong.  That's the
single most critical piece of finding out how to go address it.

I see where you are coming from.  Ingo's suggestion has the *huge*
downside that we've got to go muck with a lot of generic code and hook
into all the things we don't support.

I think what I posted is a decent compromise.  It gets you those
warnings at runtime and is a one-way trip for any given process.  But,
it does detect in certain cases (fork() and unshare(FILES)) when it is
safe to make the trip back to the "I'm checkpointable" state again.
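
One possible reading of that compromise, as a userspace sketch rather
than the posted patch. The model_* names, the warning text and the rule
used to recompute the flag are all assumptions made for illustration; the
only parts taken from the description above are the runtime warning, the
one-way flag, and the reset points at fork() and unshare(FILES).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-task state. */
struct model_task {
	bool may_checkpoint;	/* the runtime flag */
	int nr_unsupported_fds;	/* open files the checkpoint code cannot save */
};

/* Called from a hook at the moment the task does something unsupported. */
void model_deny_checkpoint(struct model_task *t, const char *why)
{
	assert(t != NULL);
	if (t->may_checkpoint)
		fprintf(stderr, "task became uncheckpointable: %s\n", why);
	t->may_checkpoint = false;
}

void model_open_unsupported(struct model_task *t)
{
	t->nr_unsupported_fds++;
	model_deny_checkpoint(t, "unsupported file opened");
}

/* Closing does not clear the flag: a one-way trip for this task. */
void model_close_unsupported(struct model_task *t)
{
	assert(t->nr_unsupported_fds > 0);
	t->nr_unsupported_fds--;
}

/* Models unshare(FILES): the private file table is rescanned (assumption). */
void model_unshare_files(struct model_task *t)
{
	t->may_checkpoint = (t->nr_unsupported_fds == 0);
}

/* Models fork(): the child's flag is recomputed from what it inherits. */
struct model_task model_fork(const struct model_task *parent)
{
	struct model_task child = *parent;

	child.may_checkpoint = (child.nr_unsupported_fds == 0);
	return child;
}
```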

-- Dave

* Re: Banning checkpoint (was: Re: What can OpenVZ do?)
  2009-02-19 19:11                     ` Dave Hansen
@ 2009-02-24  4:47                       ` Alexey Dobriyan
       [not found]                         ` <20090224044752.GB3202-2ev+ksY9ol182hYKe6nXyg@public.gmane.org>
  0 siblings, 1 reply; 7+ messages in thread
From: Alexey Dobriyan @ 2009-02-24  4:47 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ingo Molnar, Nathan Lynch, linux-api, containers, mpm,
	linux-kernel, linux-mm, viro, hpa, Andrew Morton, torvalds, tglx,
	xemul

On Thu, Feb 19, 2009 at 11:11:54AM -0800, Dave Hansen wrote:
> On Thu, 2009-02-19 at 22:06 +0300, Alexey Dobriyan wrote:
> > Inotify isn't supported yet? You do
> > 
> >         if (!list_empty(&inode->inotify_watches))
> >                 return -E;
> > 
> > without hooking into inotify syscalls.
> > 
> > ptrace(2) isn't supported -- look at struct task_struct::ptraced and
> > friends.
> > 
> > And so on.
> > 
> > System call (or whatever) does something with some piece of kernel
> > internals. We look at this "something" when walking data structures and
> > abort if it's scary enough.
> > 
> > Please, show at least one counter-example.
> 
> Alexey, I agree with you here.  I've been fighting myself internally
> about these two somewhat opposing approaches.  Of *course* we can
> determine the "checkpointability" at sys_checkpoint() time by checking
> all the various bits of state.
> 
> The problem that I think Ingo is trying to address here is that doing it
> then makes it hard to figure out _when_ you went wrong.  That's the
> single most critical piece of finding out how to go address it.
> 
> I see where you are coming from.  Ingo's suggestion has the *huge*
> downside that we've got to go muck with a lot of generic code and hook
> into all the things we don't support.
> 
> I think what I posted is a decent compromise.  It gets you those
> warnings at runtime and is a one-way trip for any given process.  But,
> it does detect in certain cases (fork() and unshare(FILES)) when it is
> safe to make the trip back to the "I'm checkpointable" state again.

"Checkpointable" is not even per-process property.

Imagine, set of SAs (struct xfrm_state) and SPDs (struct xfrm_policy).
They are a) per-netns, b) persistent.

You can hook into socketcalls to mark process as uncheckpointable,
but since SAs and SPDs are persistent, original process already exited.
You're going to walk every process with same netns as SA adder and mark
it as uncheckpointable. Definitely doable, but ugly, isn't it?

Same for iptable rules.
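
A small userspace model of the SA case, to make the problem concrete. The
model_* types and the -EBUSY return value are hypothetical; the sketch
only assumes what is said above, that the state hangs off the netns and
outlives the task that created it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of per-netns IPsec state. */
struct model_netns {
	int nr_xfrm_states;	/* SAs, models struct xfrm_state objects */
	int nr_xfrm_policies;	/* SPD entries, models struct xfrm_policy */
};

struct model_task {
	struct model_netns *netns;
	int alive;
};

/* A task adds an SA: the object is owned by the netns, not by the task. */
void model_add_sa(struct model_task *t)
{
	assert(t != NULL && t->netns != NULL);
	t->netns->nr_xfrm_states++;
}

/* Task exit leaves the netns state untouched. */
void model_exit(struct model_task *t)
{
	t->alive = 0;
}

/* Checkpoint-time check on the netns itself; no per-task flag involved. */
int model_check_netns(const struct model_netns *ns)
{
	assert(ns != NULL);
	if (ns->nr_xfrm_states > 0 || ns->nr_xfrm_policies > 0)
		return -EBUSY;
	return 0;
}
```

A second task sharing the netns never makes a socketcall of its own, yet
model_check_netns() on its netns fails after the adder has exited. A
per-task flag would have to be propagated to it by hand.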

"Checkpointable" is container property, OK?
--
To unsubscribe from this list: send the line "unsubscribe linux-api" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: Banning checkpoint (was: Re: What can OpenVZ do?)
       [not found]                         ` <20090224044752.GB3202-2ev+ksY9ol182hYKe6nXyg@public.gmane.org>
@ 2009-02-24  5:11                           ` Dave Hansen
  2009-02-24 15:43                             ` Serge E. Hallyn
  2009-02-24 20:09                             ` Alexey Dobriyan
  0 siblings, 2 replies; 7+ messages in thread
From: Dave Hansen @ 2009-02-24  5:11 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: Ingo Molnar, Nathan Lynch, linux-api, containers, mpm,
	linux-kernel, linux-mm, viro, hpa, Andrew Morton, torvalds, tglx,
	xemul

On Tue, 2009-02-24 at 07:47 +0300, Alexey Dobriyan wrote:
> > I think what I posted is a decent compromise.  It gets you those
> > warnings at runtime and is a one-way trip for any given process.  But,
> > it does detect in certain cases (fork() and unshare(FILES)) when it is
> > safe to make the trip back to the "I'm checkpointable" state again.
> 
> "Checkpointable" is not even per-process property.
> 
> Imagine, set of SAs (struct xfrm_state) and SPDs (struct xfrm_policy).
> They are a) per-netns, b) persistent.
> 
> You can hook into socketcalls to mark process as uncheckpointable,
> but since SAs and SPDs are persistent, original process already exited.
> You're going to walk every process with same netns as SA adder and mark
> it as uncheckpointable. Definitely doable, but ugly, isn't it?
> 
> Same for iptable rules.
> 
> "Checkpointable" is container property, OK?

Ideally, I completely agree.

But, we don't currently have a concept of a true container in the
kernel.  Do you have any suggestions for any current objects that we
could use in its place for a while?

-- Dave

* Re: Banning checkpoint (was: Re: What can OpenVZ do?)
       [not found]                   ` <c8Xp3-3TH-3@gated-at.bofh.it>
@ 2009-02-24 13:00                     ` Bodo Eggert
  0 siblings, 0 replies; 7+ messages in thread
From: Bodo Eggert @ 2009-02-24 13:00 UTC (permalink / raw)
  To: Alexey Dobriyan, Ingo Molnar, Nathan Lynch,
	linux-api, containers, mpm

Alexey Dobriyan wrote:
> On Thu, Feb 19, 2009 at 11:11:54AM -0800, Dave Hansen wrote:
>> On Thu, 2009-02-19 at 22:06 +0300, Alexey Dobriyan wrote:

>> Alexey, I agree with you here.  I've been fighting myself internally
>> about these two somewhat opposing approaches.  Of *course* we can
>> determine the "checkpointability" at sys_checkpoint() time by checking
>> all the various bits of state.
>> 
>> The problem that I think Ingo is trying to address here is that doing it
>> then makes it hard to figure out _when_ you went wrong.  That's the
>> single most critical piece of finding out how to go address it.
>> 
>> I see where you are coming from.  Ingo's suggestion has the *huge*
>> downside that we've got to go muck with a lot of generic code and hook
>> into all the things we don't support.
>> 
>> I think what I posted is a decent compromise.  It gets you those
>> warnings at runtime and is a one-way trip for any given process.  But,
>> it does detect in certain cases (fork() and unshare(FILES)) when it is
>> safe to make the trip back to the "I'm checkpointable" state again.
> 
> "Checkpointable" is not even per-process property.
> 
> Imagine, set of SAs (struct xfrm_state) and SPDs (struct xfrm_policy).
> They are a) per-netns, b) persistent.
> 
> You can hook into socketcalls to mark process as uncheckpointable,
> but since SAs and SPDs are persistent, original process already exited.
> You're going to walk every process with same netns as SA adder and mark
> it as uncheckpointable. Definitely doable, but ugly, isn't it?
> 
> Same for iptable rules.
> 
> "Checkpointable" is container property, OK?

IMO: Everything around the process may change as long as you can do the same
using 'kill -STOP $PID; ...; kill -CONT $PID;'. E.g. changing iptables rules
can be done to a normal process, so this should not prevent checkpointing
(unless you checkpoint iptables, but don't do that then?).

BTW1: I might want to checkpoint something like seti@home. It will connect
to a server from time to time, and send/receive a packet. If having opened
a socket once in a lifetime would prevent checkpointing, this won't be
possible. I see the benefit of the one-way flag in forcing all syscalls
to be made checkpointable, but this won't work for sockets.

Therefore I think you need something in between. Some syscalls (etc.) are not
supported, so just make the process uncheckpointable. But some syscalls
enter and leave non-checkpointable states by design; they need at least
counters.
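
A minimal sketch of that split, as a userspace model. The struct and
function names are hypothetical; the only idea taken from the paragraph
above is a sticky flag for unsupported operations combined with a counter
for states that are entered and left again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-process checkpoint state. */
struct model_ckpt_state {
	bool unsupported;	/* sticky: set once, never cleared */
	int busy;		/* transient: e.g. number of open sockets */
};

/* An unsupported syscall was used: permanently uncheckpointable. */
void model_mark_unsupported(struct model_ckpt_state *s)
{
	assert(s != NULL);
	s->unsupported = true;
}

/* Entering a temporarily non-checkpointable state, e.g. socket opened. */
void model_enter_busy(struct model_ckpt_state *s)
{
	assert(s != NULL);
	s->busy++;
}

/* Leaving it again, e.g. socket closed. */
void model_leave_busy(struct model_ckpt_state *s)
{
	assert(s != NULL && s->busy > 0);
	s->busy--;
}

bool model_may_checkpoint(const struct model_ckpt_state *s)
{
	assert(s != NULL);
	return !s->unsupported && s->busy == 0;
}
```

With this, a client that opens a socket once in a while is only
uncheckpointable while the connection is open.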

Maybe you'll want to let the application decide if it's OK to be checkpointed
on some conditions, too. The Seti client might know how to handle broken
connections, and doing duplicate transfers or skipping them is expected, too.
So the Seti client might declare the socket to be checkpointable, instead of
making the do-the-checkpoint application wait until it's closed.

BTW2: There is the problem of invalidating checkpoints, too. If a browser did
an HTTP PUT, you don't want to restore the checkpoint where it was just about
to start the PUT request. The application should be able to signal this to
a checkpointing daemon. There will be a race, so having a signal "Invalidate
checkpoints" won't work, but if the application sends a stable hash value,
the duplicate can be detected. (Of course you'd say "don't do that then" for
browsers, but it's just an example. Of course 2: the checkpoint daemon is
only needed for advanced setups; a simple "checkpoint $povray --store jobfile"
should just work.)

* Re: Banning checkpoint (was: Re: What can OpenVZ do?)
  2009-02-24  5:11                           ` Dave Hansen
@ 2009-02-24 15:43                             ` Serge E. Hallyn
  2009-02-24 20:09                             ` Alexey Dobriyan
  1 sibling, 0 replies; 7+ messages in thread
From: Serge E. Hallyn @ 2009-02-24 15:43 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Alexey Dobriyan, hpa, linux-api, containers, Nathan Lynch,
	linux-kernel, linux-mm, tglx, viro, mpm, Ingo Molnar, torvalds,
	Andrew Morton, xemul

Quoting Dave Hansen (dave@linux.vnet.ibm.com):
> On Tue, 2009-02-24 at 07:47 +0300, Alexey Dobriyan wrote:
> > > I think what I posted is a decent compromise.  It gets you those
> > > warnings at runtime and is a one-way trip for any given process.  But,
> > > it does detect in certain cases (fork() and unshare(FILES)) when it is
> > > safe to make the trip back to the "I'm checkpointable" state again.
> > 
> > "Checkpointable" is not even per-process property.
> > 
> > Imagine, set of SAs (struct xfrm_state) and SPDs (struct xfrm_policy).
> > They are a) per-netns, b) persistent.
> > 
> > You can hook into socketcalls to mark process as uncheckpointable,
> > but since SAs and SPDs are persistent, original process already exited.
> > You're going to walk every process with same netns as SA adder and mark
> > it as uncheckpointable. Definitely doable, but ugly, isn't it?
> > 
> > Same for iptable rules.
> > 
> > "Checkpointable" is container property, OK?
> 
> Ideally, I completely agree.
> 
> But, we don't currently have a concept of a true container in the
> kernel.  Do you have any suggestions for any current objects that we
> could use in its place for a while?

I think the main point is that it makes the concept of marking a task as
uncheckpointable unworkable.  So at sys_checkpoint() time or when we cat
/proc/$$/checkpointable, we can check for all of the uncheckpointable
state of both $$ and its container (including whether $$ is a container
init).  But we can't expect that (to use Alexey's example) when one task
in a netns does a certain sys_socketcall, all tasks in the container
will be marked uncheckpointable.  Or at least we don't want to.

Which means task->uncheckpointable can't be the big stick which I think
you were hoping it would be.

-serge

* Re: Banning checkpoint (was: Re: What can OpenVZ do?)
  2009-02-24  5:11                           ` Dave Hansen
  2009-02-24 15:43                             ` Serge E. Hallyn
@ 2009-02-24 20:09                             ` Alexey Dobriyan
  1 sibling, 0 replies; 7+ messages in thread
From: Alexey Dobriyan @ 2009-02-24 20:09 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ingo Molnar, Nathan Lynch, linux-api, containers, mpm,
	linux-kernel, linux-mm, viro, hpa, Andrew Morton, torvalds, tglx,
	xemul

On Mon, Feb 23, 2009 at 09:11:25PM -0800, Dave Hansen wrote:
> On Tue, 2009-02-24 at 07:47 +0300, Alexey Dobriyan wrote:
> > > I think what I posted is a decent compromise.  It gets you those
> > > warnings at runtime and is a one-way trip for any given process.  But,
> > > it does detect in certain cases (fork() and unshare(FILES)) when it is
> > > safe to make the trip back to the "I'm checkpointable" state again.
> > 
> > "Checkpointable" is not even per-process property.
> > 
> > Imagine, set of SAs (struct xfrm_state) and SPDs (struct xfrm_policy).
> > They are a) per-netns, b) persistent.
> > 
> > You can hook into socketcalls to mark process as uncheckpointable,
> > but since SAs and SPDs are persistent, original process already exited.
> > You're going to walk every process with same netns as SA adder and mark
> > it as uncheckpointable. Definitely doable, but ugly, isn't it?
> > 
> > Same for iptable rules.
> > 
> > "Checkpointable" is container property, OK?
> 
> Ideally, I completely agree.
> 
> But, we don't currently have a concept of a true container in the
> kernel.  Do you have any suggestions for any current objects that we
> could use in its place for a while?

After all the foo_ns changes, struct nsproxy is such a thing.

More specifically, a process with a fully cloned nsproxy acting as init,
and all its children. In terms of data structures: every task_struct in such
a tree, every nsproxy of them, every foo_ns, and so on down to lower levels.
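
A rough userspace sketch of walking such a tree at checkpoint time. The
model_* types, their fields and the -EBUSY return value are hypothetical
simplifications; in particular the per-namespace detail is collapsed into
a single counter per nsproxy.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model: one counter stands in for every foo_ns below. */
struct model_nsproxy {
	int nr_unsupported_objects;
};

struct model_task {
	struct model_nsproxy *nsproxy;
	int unsupported_state;	/* nonzero if the task itself holds such state */
	struct model_task **children;
	size_t nr_children;
};

/*
 * Walk the container starting at its init task: check the task, then its
 * nsproxy, then recurse into the children. Returns 0 or a negative errno.
 */
int model_check_container(const struct model_task *t)
{
	size_t i;
	int err;

	assert(t != NULL && t->nsproxy != NULL);
	if (t->unsupported_state)
		return -EBUSY;
	if (t->nsproxy->nr_unsupported_objects > 0)
		return -EBUSY;

	for (i = 0; i < t->nr_children; i++) {
		err = model_check_container(t->children[i]);
		if (err)
			return err;
	}
	return 0;
}
```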

end of thread, other threads:[~2009-02-24 20:09 UTC | newest]

Thread overview: 7+ messages
2009-02-24 13:00                     ` Banning checkpoint (was: Re: What can OpenVZ do?) Bodo Eggert
2009-02-13 10:53 What can OpenVZ do? Ingo Molnar
2009-02-16 20:51 ` Dave Hansen
2009-02-17 22:23   ` Ingo Molnar
2009-02-17 22:30     ` Dave Hansen
2009-02-18  0:32       ` Ingo Molnar
2009-02-18  0:40         ` Dave Hansen
2009-02-18  5:11           ` Alexey Dobriyan
2009-02-18 18:16             ` Ingo Molnar
2009-02-18 21:27               ` Dave Hansen
2009-02-18 23:15                 ` Ingo Molnar
2009-02-19 19:06                   ` Banning checkpoint (was: Re: What can OpenVZ do?) Alexey Dobriyan
2009-02-19 19:11                     ` Dave Hansen
2009-02-24  4:47                       ` Alexey Dobriyan
     [not found]                         ` <20090224044752.GB3202-2ev+ksY9ol182hYKe6nXyg@public.gmane.org>
2009-02-24  5:11                           ` Dave Hansen
2009-02-24 15:43                             ` Serge E. Hallyn
2009-02-24 20:09                             ` Alexey Dobriyan
