* bugs with ckpt-v15-dev
@ 2009-05-18 19:23 Nathan Lynch
       [not found] ` <m3my9amczw.fsf-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 13+ messages in thread
From: Nathan Lynch @ 2009-05-18 19:23 UTC (permalink / raw)
  To: Containers

Last commit is ed3b275 "allow error string during checkpoint while
holding a spinlock".

# bash -c 'exec <&- >&- 2>&- ; while : ; do : ; done' &
[1] 2269
# ckpt $! > /tmp/bash.ckpt

BUG: sleeping function called from invalid context at mm/slub.c:1595
in_atomic(): 1, irqs_disabled(): 0, pid: 2270, name: ckpt
1 lock held by ckpt/2270:
 #0:  (tasklist_lock){.+.+.+}, at: [<c03911e6>] tree_count_tasks+0x2a/0x2a2
Pid: 2270, comm: ckpt Not tainted 2.6.30-rc3-00074-ged3b275 #30
Call Trace:
 [<c024b6f9>] ? __debug_show_held_locks+0x1e/0x20
 [<c02234da>] __might_sleep+0x100/0x107
 [<c02a9372>] kmem_cache_alloc+0x35/0x11f
 [<c039100f>] ? __ckpt_generate_err+0x25/0x12b
 [<c024a9c7>] ? put_lock_stats+0x1e/0x29
 [<c039100f>] __ckpt_generate_err+0x25/0x12b
 [<c0203703>] ? ftrace_call+0x5/0x8
 [<c03911ba>] __ckpt_write_err+0x16/0x18
 [<c03912ae>] tree_count_tasks+0xf2/0x2a2
 [<c03915ae>] do_checkpoint+0x150/0x5f2
 [<c0390cd8>] ? kzalloc+0x10/0x12
 [<c0390d0f>] ? ckpt_obj_hash_alloc+0x35/0x60
 [<c039033d>] ? ckpt_ctx_alloc+0x77/0x99
 [<c0390465>] sys_checkpoint+0x6c/0x82
 [<c0202ce5>] syscall_call+0x7/0xb
------------[ cut here ]------------
kernel BUG at checkpoint/checkpoint.c:136!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
last sysfs file: /sys/block/sda/size
Modules linked in:

Pid: 2270, comm: ckpt Not tainted (2.6.30-rc3-00074-ged3b275 #30) 
EIP: 0060:[<c03910dc>] EFLAGS: 00010246 CPU: 0
EIP is at __ckpt_generate_err+0xf2/0x12b
EAX: df051300 EBX: deb72f30 ECX: df051530 EDX: 0000001c
ESI: df051430 EDI: deb72f28 EBP: deb72f10 ESP: deb72ef8
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process ckpt (pid: 2270, ti=deb72000 task=df9adf60 task.ti=deb72000)
Stack:
 c072ce85 df051300 0000001c deb75600 df9ad1c0 00000000 deb72f18 c03911ba
 deb72f50 c03912ae df051300 c072ce85 000008dd df9ad4ec df051300 df9ad1c0
 00000000 00000000 00000000 deb75600 deb75604 df051300 deb72f98 c03915ae
Call Trace:
 [<c03911ba>] ? __ckpt_write_err+0x16/0x18
 [<c03912ae>] ? tree_count_tasks+0xf2/0x2a2
 [<c03915ae>] ? do_checkpoint+0x150/0x5f2
 [<c0390cd8>] ? kzalloc+0x10/0x12
 [<c0390d0f>] ? ckpt_obj_hash_alloc+0x35/0x60
 [<c039033d>] ? ckpt_ctx_alloc+0x77/0x99
 [<c0390465>] ? sys_checkpoint+0x6c/0x82
 [<c0202ce5>] ? syscall_call+0x7/0xb
Code: 08 0c 8b c0 03 74 1b f6 05 c2 8f ff c0 20 74 12 f6 05 c9 8f ff c0 10 74 09 80 3d 47 94 83 c0 00 75 1d 8b 45 ec 83 78 2c 00 75 04 <0f> 0b eb fe 8b 55 ec 31 c0 89 72 2c 8d 65 f4 5b 5e 5f 5d c3 31 
EIP: [<c03910dc>] __ckpt_generate_err+0xf2/0x12b SS:ESP 0068:deb72ef8
---[ end trace d54433b47f0c4829 ]---
note: ckpt[2270] exited with preempt_count 1
BUG: scheduling while atomic: ckpt/2270/0x10000002
INFO: lockdep is turned off.
Modules linked in:
Pid: 2270, comm: ckpt Tainted: G      D    2.6.30-rc3-00074-ged3b275 #30
Call Trace:
 [<c0223f6b>] __schedule_bug+0x63/0x6a
 [<c05ec7dc>] __schedule+0x8f/0x7ac
 [<c024d299>] ? print_lock_contention_bug+0x14/0xd7
 [<c0298093>] ? unmap_vmas+0x1e1/0x518
 [<c0203703>] ? ftrace_call+0x5/0x8
 [<c0203703>] ? ftrace_call+0x5/0x8
 [<c05ecf10>] schedule+0x17/0x38
 [<c0224738>] __cond_resched+0x26/0x3b
 [<c05ed034>] _cond_resched+0x2c/0x37
 [<c0298379>] unmap_vmas+0x4c7/0x518
 [<c029b81b>] exit_mmap+0x6c/0xb7
 [<c022906a>] mmput+0x3c/0x8f
 [<c022c8a0>] exit_mm+0xe3/0xeb
 [<c022e0e2>] do_exit+0x188/0x64b
 [<c05ec415>] ? printk+0x14/0x16
 [<c022b08d>] ? oops_exit+0x28/0x2d
 [<c05efbe7>] oops_end+0x92/0x9a
 [<c020560f>] die+0x59/0x5f
 [<c05ef56b>] do_trap+0x89/0xa2
 [<c02039fc>] ? do_invalid_op+0x0/0x80
 [<c0203a72>] do_invalid_op+0x76/0x80
 [<c03910dc>] ? __ckpt_generate_err+0xf2/0x12b
 [<c0203703>] ? ftrace_call+0x5/0x8
 [<c039c95d>] ? strnlen+0x8/0x1f
 [<c039b8bd>] ? string+0x34/0x82
 [<c039c14a>] ? vsnprintf+0x173/0x311
 [<c039c05a>] ? vsnprintf+0x83/0x311
 [<c039c9d0>] ? trace_hardirqs_off_thunk+0xc/0x10
 [<c05ef322>] error_code+0x72/0x78
 [<c02039fc>] ? do_invalid_op+0x0/0x80
 [<c03910dc>] ? __ckpt_generate_err+0xf2/0x12b
 [<c03911ba>] __ckpt_write_err+0x16/0x18
 [<c03912ae>] tree_count_tasks+0xf2/0x2a2
 [<c03915ae>] do_checkpoint+0x150/0x5f2
 [<c0390cd8>] ? kzalloc+0x10/0x12
 [<c0390d0f>] ? ckpt_obj_hash_alloc+0x35/0x60
 [<c039033d>] ? ckpt_ctx_alloc+0x77/0x99
 [<c0390465>] sys_checkpoint+0x6c/0x82
 [<c0202ce5>] syscall_call+0x7/0xb

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: bugs with ckpt-v15-dev
       [not found] ` <m3my9amczw.fsf-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
@ 2009-05-18 21:10   ` Serge E. Hallyn
       [not found]     ` <20090518211041.GA20781-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-05-20  5:28   ` Oren Laadan
  1 sibling, 1 reply; 13+ messages in thread
From: Serge E. Hallyn @ 2009-05-18 21:10 UTC (permalink / raw)
  To: Nathan Lynch; +Cc: Containers

Quoting Nathan Lynch (ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org):
> Last commit is ed3b275 "allow error string during checkpoint while
> holding a spinlock".
> 
> # bash -c 'exec <&- >&- 2>&- ; while : ; do : ; done' &
> [1] 2269
> # ckpt $! > /tmp/bash.ckpt
> 
> BUG: sleeping function called from invalid context at mm/slub.c:1595

Yeah, not only does ckpt_write_err() get called under task_lock, but
the fn returns without ever doing put_task_struct.  (I'd generate and
send the quick trivial patch, but my git tree is in a bit of a debugme
state right now)
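
A minimal user-space sketch of the two problems described above, not the
actual linux-cr code: the error path must still drop the task reference,
and anything that can sleep (such as allocating the error string) has to
wait until the lock is released.  All names here (model_lock,
sleeping_alloc, count_one_task) are hypothetical stand-ins, and
atomic_depth merely imitates the kernel's notion of atomic context.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical user-space model; none of these names exist in linux-cr. */
struct model_task {
	int refcount;	/* stands in for the task_struct usage count */
	int frozen;	/* nonzero if the task may be checkpointed */
};

static int atomic_depth;	/* nonzero while a modelled spinlock is held */

static void model_lock(void)   { atomic_depth++; }
static void model_unlock(void) { atomic_depth--; }

/* An allocation that may sleep: it refuses to run in atomic context. */
static void *sleeping_alloc(size_t n)
{
	if (atomic_depth)
		return NULL;	/* the kernel would warn or BUG here */
	return malloc(n);
}

/*
 * Returns 0 if the task is acceptable, -1 otherwise.  On failure *errmsg
 * points to a heap-allocated string (or NULL if allocation failed) that
 * the caller must free().
 */
static int count_one_task(struct model_task *t, char **errmsg)
{
	int failed;

	*errmsg = NULL;

	model_lock();
	t->refcount++;		/* get_task_struct() analogue */
	failed = !t->frozen;	/* only record the failure under the lock */
	t->refcount--;		/* put_task_struct() analogue, on every path */
	model_unlock();

	if (!failed)
		return 0;

	/* Safe to sleep now: the lock has been dropped. */
	*errmsg = sleeping_alloc(32);
	if (*errmsg)
		snprintf(*errmsg, 32, "task not frozen");
	return -1;
}
```

The design point is simply that the locked region records a result and
nothing more; reporting happens after the unlock, with the reference
count already balanced.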

Now mind you this shows that your ckpt program isn't sending
CHECKPOINT_SUBTREE with flags.  This in turn means you are probably
not using the ckpt-v15-dev version of user-cr, and if that is
the case it makes your problems with gconf shared file mapping more
suspect as well...?

-serge


* Re: bugs with ckpt-v15-dev
       [not found]     ` <20090518211041.GA20781-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-05-18 21:36       ` Nathan Lynch
       [not found]         ` <m3y6suhz5g.fsf-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 13+ messages in thread
From: Nathan Lynch @ 2009-05-18 21:36 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Containers

"Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:

> Quoting Nathan Lynch (ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org):
>> Last commit is ed3b275 "allow error string during checkpoint while
>> holding a spinlock".
>> 
>> # bash -c 'exec <&- >&- 2>&- ; while : ; do : ; done' &
>> [1] 2269
>> # ckpt $! > /tmp/bash.ckpt
>> 
>> BUG: sleeping function called from invalid context at mm/slub.c:1595
>
> Yeah, not only does ckpt_write_err() get called under task_lock, but
> the fn returns without ver doing put_task_struct.  (I'd generate and
> send the quick trivial patch, but my git tree is in a bit of a debugme
> state right now)

Would prefer to just rip that thing out, it's cost me more trouble than
it's worth.


> Now mind you this shows that your ckpt program isn't sending
> CHECKPOINT_SUBTREE with flags.

I don't follow.  There is "user error" here in that I'm not freezing the
task before checkpointing[1], but my ckpt command is passing the subtree
flag (0x4) afaict:

SYS_335(0x9ec, 0x1, 0x4, 0xbfdc6200, 0[2542:c/r:may_checkpoint_task] check 2540
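
For reference, the traced call can be decoded as sketched below.  This
assumes the three leading arguments are (pid, fd, flags); the macro name
MODEL_CHECKPOINT_SUBTREE and the struct are made up here, and only the
0x4 value comes from the text above.

```c
#include <stdio.h>

/* Value taken from the message above; the macro name is illustrative. */
#define MODEL_CHECKPOINT_SUBTREE 0x4UL

/* Hypothetical holder for the first three traced syscall arguments. */
struct ckpt_args {
	long pid;
	int fd;
	unsigned long flags;
};

/* Returns nonzero if the subtree bit is set in the flags argument. */
static int has_subtree_flag(const struct ckpt_args *a)
{
	return (a->flags & MODEL_CHECKPOINT_SUBTREE) != 0;
}

/* Prints the decoded arguments for comparison with a trace line. */
static void print_ckpt_args(const struct ckpt_args *a)
{
	printf("pid=%ld fd=%d flags=%#lx subtree=%d\n",
	       a->pid, a->fd, a->flags, has_subtree_flag(a));
}
```

Filled with { 0x9ec, 0x1, 0x4 }, the pid decodes to 2540, which matches
the "check 2540" debug output, and the subtree bit is set.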


> This in turns means you are probably
> not using the ckpt-v15-dev version of user-cr, and if that is
> the case it makes your problems with gconf shared file mapping more
> suspect ask well...?

After updating to the latest user-cr I get the same BUGs.

[1] Should CONFIG_CHECKPOINT depend on CONFIG_CGROUPS and/or
CONFIG_CGROUPS_FREEZER?  We require tasks to be put in frozen state
before checkpoint, is there any mechanism apart from
cgroup/freezer.state to do this?


* Re: bugs with ckpt-v15-dev
       [not found]         ` <m3y6suhz5g.fsf-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
@ 2009-05-18 22:39           ` Serge E. Hallyn
       [not found]             ` <20090518223919.GA24826-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-05-18 22:51           ` Matt Helsley
  1 sibling, 1 reply; 13+ messages in thread
From: Serge E. Hallyn @ 2009-05-18 22:39 UTC (permalink / raw)
  To: Nathan Lynch; +Cc: Containers

Quoting Nathan Lynch (ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org):
> "Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:
> 
> > Quoting Nathan Lynch (ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org):
> >> Last commit is ed3b275 "allow error string during checkpoint while
> >> holding a spinlock".
> >> 
> >> # bash -c 'exec <&- >&- 2>&- ; while : ; do : ; done' &
> >> [1] 2269
> >> # ckpt $! > /tmp/bash.ckpt
> >> 
> >> BUG: sleeping function called from invalid context at mm/slub.c:1595
> >
> > Yeah, not only does ckpt_write_err() get called under task_lock, but
> > the fn returns without ver doing put_task_struct.  (I'd generate and
> > send the quick trivial patch, but my git tree is in a bit of a debugme
> > state right now)
> 
> Would prefer to just rip that thing out, it's cost me more trouble then
> it's worth.

Which thing - CHECKPOINT_SUBTREE, freezer check, or ckpt_write_err?

> > Now mind you this shows that your ckpt program isn't sending
> > CHECKPOINT_SUBTREE with flags.
> 
> I don't follow.  There is "user error" here in that I'm not freezing the
> task before checkpointing[1], but my ckpt command is passing the subtree
> flag (0x4) afaict:
> 
> SYS_335(0x9ec, 0x1, 0x4, 0xbfdc6200, 0[2542:c/r:may_checkpoint_task] check 2540

Oh, it's the freezer test in may_checkpoint_task you're getting the
error on?  (in my git tree I'd commented that one out temporarily so I
just assumed it was the subtree check in get_container :)

> > This in turns means you are probably
> > not using the ckpt-v15-dev version of user-cr, and if that is
> > the case it makes your problems with gconf shared file mapping more
> > suspect ask well...?
> 
> After updating to the latest user-cr I get the same BUGs.
> 
> [1] Should CONFIG_CHECKPOINT depend on CONFIG_CGROUPS and/or
> CONFIG_CGROUPS_FREEZER?  We require tasks to be put in frozen state
> before checkpoint, is there any mechanism apart from
> cgroup/freezer.state to do this?

A task can self-checkpoint without the freezer though.

-serge


* Re: bugs with ckpt-v15-dev
       [not found]         ` <m3y6suhz5g.fsf-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
  2009-05-18 22:39           ` Serge E. Hallyn
@ 2009-05-18 22:51           ` Matt Helsley
       [not found]             ` <20090518225100.GC28083-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 13+ messages in thread
From: Matt Helsley @ 2009-05-18 22:51 UTC (permalink / raw)
  To: Nathan Lynch; +Cc: Containers

On Mon, May 18, 2009 at 04:36:11PM -0500, Nathan Lynch wrote:
> "Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:
> 
> > Quoting Nathan Lynch (ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org):
> >> Last commit is ed3b275 "allow error string during checkpoint while
> >> holding a spinlock".
> >> 
> >> # bash -c 'exec <&- >&- 2>&- ; while : ; do : ; done' &
> >> [1] 2269
> >> # ckpt $! > /tmp/bash.ckpt
> >> 
> >> BUG: sleeping function called from invalid context at mm/slub.c:1595
> >
> > Yeah, not only does ckpt_write_err() get called under task_lock, but
> > the fn returns without ver doing put_task_struct.  (I'd generate and
> > send the quick trivial patch, but my git tree is in a bit of a debugme
> > state right now)
> 
> Would prefer to just rip that thing out, it's cost me more trouble then
> it's worth.
> 
> 
> > Now mind you this shows that your ckpt program isn't sending
> > CHECKPOINT_SUBTREE with flags.
> 
> I don't follow.  There is "user error" here in that I'm not freezing the
> task before checkpointing[1], but my ckpt command is passing the subtree
> flag (0x4) afaict:
> 
> SYS_335(0x9ec, 0x1, 0x4, 0xbfdc6200, 0[2542:c/r:may_checkpoint_task] check 2540
> 
> 
> > This in turns means you are probably
> > not using the ckpt-v15-dev version of user-cr, and if that is
> > the case it makes your problems with gconf shared file mapping more
> > suspect ask well...?
> 
> After updating to the latest user-cr I get the same BUGs.
> 
> [1] Should CONFIG_CHECKPOINT depend on CONFIG_CGROUPS and/or
> CONFIG_CGROUPS_FREEZER?  We require tasks to be put in frozen state
> before checkpoint, is there any mechanism apart from
> cgroup/freezer.state to do this?

Have you tried sending all of the tasks SIGSTOP? It won't 100% freeze the tasks
-- they'd still be capable of responding to some signals (CONT, TERM..). Also 
they'd presumably be placed in the stopped state upon restart so a SIGCONT will
be needed. In the case of bash, at least, that will technically change what
happens upon restart. My guess is that in many cases it won't matter but there
are some where it will. 

The freezer documentation shows an example of what happens with bash
when attempting to use only STOP/CONT rather than the freezer. gdb might
also present interesting cases when just utilizing STOP/CONT signals..
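
A small user-space sketch of the STOP/CONT approach, assuming a single
forked child stands in for the tasks to be checkpointed.
stop_and_resume_child is a hypothetical helper; it shows only the
signalling and performs no checkpoint.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Forks a child, stops it with SIGSTOP, then resumes and reaps it.
 * Returns 1 if the child was observed in the stopped state, 0 if not,
 * and -1 if fork() failed.
 */
static int stop_and_resume_child(void)
{
	int status;
	int seen_stopped = 0;
	pid_t child = fork();

	if (child < 0)
		return -1;
	if (child == 0) {
		for (;;)
			pause();	/* idle until a signal arrives */
	}

	/* WUNTRACED makes waitpid() report the stop, not only an exit. */
	if (kill(child, SIGSTOP) == 0 &&
	    waitpid(child, &status, WUNTRACED) == child &&
	    WIFSTOPPED(status))
		seen_stopped = 1;

	/* A stopped task still reacts to SIGCONT, unlike a frozen one. */
	kill(child, SIGCONT);
	kill(child, SIGKILL);		/* clean up the helper child */
	waitpid(child, &status, 0);

	return seen_stopped;
}
```

The resume step is where this differs from the freezer: any process
allowed to signal the task can send SIGCONT at any time.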

Cheers,
	-Matt Helsley


* Re: bugs with ckpt-v15-dev
       [not found]             ` <20090518223919.GA24826-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-05-18 23:02               ` Nathan Lynch
  0 siblings, 0 replies; 13+ messages in thread
From: Nathan Lynch @ 2009-05-18 23:02 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Containers

"Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:

> Quoting Nathan Lynch (ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org):
>> "Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:
>> 
>> > Quoting Nathan Lynch (ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org):
>> >> Last commit is ed3b275 "allow error string during checkpoint while
>> >> holding a spinlock".
>> >> 
>> >> # bash -c 'exec <&- >&- 2>&- ; while : ; do : ; done' &
>> >> [1] 2269
>> >> # ckpt $! > /tmp/bash.ckpt
>> >> 
>> >> BUG: sleeping function called from invalid context at mm/slub.c:1595
>> >
>> > Yeah, not only does ckpt_write_err() get called under task_lock, but
>> > the fn returns without ver doing put_task_struct.  (I'd generate and
>> > send the quick trivial patch, but my git tree is in a bit of a debugme
>> > state right now)
>> 
>> Would prefer to just rip that thing out, it's cost me more trouble then
>> it's worth.
>
> Which thing - CHECKPOINT_SUBTREE, freezer check, or ckpt_write_err?

ckpt_write_err.  I've yet to witness it perform its intended function
without triggering a WARN_ON or BUG.


>> > Now mind you this shows that your ckpt program isn't sending
>> > CHECKPOINT_SUBTREE with flags.
>> 
>> I don't follow.  There is "user error" here in that I'm not freezing the
>> task before checkpointing[1], but my ckpt command is passing the subtree
>> flag (0x4) afaict:
>> 
>> SYS_335(0x9ec, 0x1, 0x4, 0xbfdc6200, 0[2542:c/r:may_checkpoint_task] check 2540
>
> Oh, it's the freezer test in may_checkpoint_task you're getting the
> error on?  (in my git tree I'd commented that one out temporarily so I
> just assumed it was the subtree check in get_container :)

Yes, the frozen test is failing, afaik.


* Re: bugs with ckpt-v15-dev
       [not found]             ` <20090518225100.GC28083-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-05-18 23:21               ` Nathan Lynch
       [not found]                 ` <m3zldagfpp.fsf-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 13+ messages in thread
From: Nathan Lynch @ 2009-05-18 23:21 UTC (permalink / raw)
  To: Matt Helsley; +Cc: Containers

Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:

> On Mon, May 18, 2009 at 04:36:11PM -0500, Nathan Lynch wrote:
>> 
>> [1] Should CONFIG_CHECKPOINT depend on CONFIG_CGROUPS and/or
>> CONFIG_CGROUPS_FREEZER?  We require tasks to be put in frozen state
>> before checkpoint, is there any mechanism apart from
>> cgroup/freezer.state to do this?
>
> Have you tried sending all of the tasks SIGSTOP? It won't 100% freeze
> the tasks -- they'd still be capable of responding to some signals
> (CONT, TERM..). Also they'd presumably be placed in the stopped state
> upon restart so a SIGCONT will be needed. In the case of bash, at
> least, that will technically change what happens upon restart. My
> guess is that in many cases it won't matter but there are some where
> it will.

Hmm, I'm having trouble understanding your suggestion.  The current
checkpoint implementation requires non-self tasks to be frozen (p->flags
& PF_FROZEN), which is not equivalent to stopped state (task->state &
__TASK_STOPPED).  That is, it would refuse to checkpoint tasks in
stopped state.  See may_checkpoint_task().
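
The distinction can be modelled in user space roughly as follows.  This
is a sketch, not the code in may_checkpoint_task(): the struct, the
MODEL_* bit values and the -EBUSY return code are all assumptions made
for illustration.

```c
#include <errno.h>

/* Illustrative bit values only; see the kernel headers for the real ones. */
#define MODEL_PF_FROZEN    0x00010000u	/* models p->flags & PF_FROZEN */
#define MODEL_TASK_STOPPED 0x4u		/* models task->state & __TASK_STOPPED */

/* Hypothetical stand-in for the two task fields discussed above. */
struct ckpt_task_model {
	unsigned int flags;
	unsigned int state;
};

/*
 * Returns 0 if the task may be checkpointed, -EBUSY otherwise.
 * The error code is an assumption of this sketch.
 */
static int may_checkpoint_model(const struct ckpt_task_model *t, int is_self)
{
	if (is_self)
		return 0;	/* self-checkpoint does not need the freezer */
	if (!(t->flags & MODEL_PF_FROZEN))
		return -EBUSY;	/* stopped but not frozen is still refused */
	return 0;
}
```

A task with only MODEL_TASK_STOPPED set in state is rejected here,
because the check never looks at state at all.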


* Re: bugs with ckpt-v15-dev
       [not found]                 ` <m3zldagfpp.fsf-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
@ 2009-05-19  1:09                   ` Matt Helsley
       [not found]                     ` <20090519010911.GD28083-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 13+ messages in thread
From: Matt Helsley @ 2009-05-19  1:09 UTC (permalink / raw)
  To: Nathan Lynch; +Cc: Containers

On Mon, May 18, 2009 at 06:21:22PM -0500, Nathan Lynch wrote:
> Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:
> 
> > On Mon, May 18, 2009 at 04:36:11PM -0500, Nathan Lynch wrote:
> >> 
> >> [1] Should CONFIG_CHECKPOINT depend on CONFIG_CGROUPS and/or
> >> CONFIG_CGROUPS_FREEZER?  We require tasks to be put in frozen state
> >> before checkpoint, is there any mechanism apart from
> >> cgroup/freezer.state to do this?
> >
> > Have you tried sending all of the tasks SIGSTOP? It won't 100% freeze
> > the tasks -- they'd still be capable of responding to some signals
> > (CONT, TERM..). Also they'd presumably be placed in the stopped state
> > upon restart so a SIGCONT will be needed. In the case of bash, at
> > least, that will technically change what happens upon restart. My
> > guess is that in many cases it won't matter but there are some where
> > it will.
> 
> Hmm, I'm having trouble understanding your suggestion.  The current
> checkpoint implementation requires non-self tasks to be frozen (p->flags
> & PF_FROZEN), which is not equivalent to stopped state (task->state &
> __TASK_STOPPED).  That is, it would refuse to checkpoint tasks in
> stopped state.  See may_checkpoint_task().

Oops. You're right. That would require changing may_checkpoint_task() to include
__TASK_STOPPED -- not something we'd want in the final code. I had assumed
you wanted to try a different mechanism for debugging purposes.

Cheers,
	-Matt Helsley


* Re: bugs with ckpt-v15-dev
       [not found] ` <m3my9amczw.fsf-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
  2009-05-18 21:10   ` Serge E. Hallyn
@ 2009-05-20  5:28   ` Oren Laadan
  1 sibling, 0 replies; 13+ messages in thread
From: Oren Laadan @ 2009-05-20  5:28 UTC (permalink / raw)
  To: Nathan Lynch; +Cc: Containers

Nathan,

Thanks for insisting on this ... I believe it's now fixed in the
ckpt-v15-dev branch.

In particular, error reporting works better, and there is a new
utility "ckptinfo" which can do basic parsing of the checkpoint
image. If given the switch '-e' it will display error strings
found in the image.

The checkpoint image format has changed so you need to pull both
linux-cr and user-cr.

Oren.

Nathan Lynch wrote:
> Last commit is ed3b275 "allow error string during checkpoint while
> holding a spinlock".
> 
> # bash -c 'exec <&- >&- 2>&- ; while : ; do : ; done' &
> [1] 2269
> # ckpt $! > /tmp/bash.ckpt
> 
> BUG: sleeping function called from invalid context at mm/slub.c:1595
> in_atomic(): 1, irqs_disabled(): 0, pid: 2270, name: ckpt
> 1 lock held by ckpt/2270:
>  #0:  (tasklist_lock){.+.+.+}, at: [<c03911e6>] tree_count_tasks+0x2a/0x2a2
> Pid: 2270, comm: ckpt Not tainted 2.6.30-rc3-00074-ged3b275 #30
> Call Trace:
>  [<c024b6f9>] ? __debug_show_held_locks+0x1e/0x20
>  [<c02234da>] __might_sleep+0x100/0x107
>  [<c02a9372>] kmem_cache_alloc+0x35/0x11f
>  [<c039100f>] ? __ckpt_generate_err+0x25/0x12b
>  [<c024a9c7>] ? put_lock_stats+0x1e/0x29
>  [<c039100f>] __ckpt_generate_err+0x25/0x12b
>  [<c0203703>] ? ftrace_call+0x5/0x8
>  [<c03911ba>] __ckpt_write_err+0x16/0x18
>  [<c03912ae>] tree_count_tasks+0xf2/0x2a2
>  [<c03915ae>] do_checkpoint+0x150/0x5f2
>  [<c0390cd8>] ? kzalloc+0x10/0x12
>  [<c0390d0f>] ? ckpt_obj_hash_alloc+0x35/0x60
>  [<c039033d>] ? ckpt_ctx_alloc+0x77/0x99
>  [<c0390465>] sys_checkpoint+0x6c/0x82
>  [<c0202ce5>] syscall_call+0x7/0xb
> ------------[ cut here ]------------
> kernel BUG at checkpoint/checkpoint.c:136!
> invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> last sysfs file: /sys/block/sda/size
> Modules linked in:
> 
> Pid: 2270, comm: ckpt Not tainted (2.6.30-rc3-00074-ged3b275 #30) 
> EIP: 0060:[<c03910dc>] EFLAGS: 00010246 CPU: 0
> EIP is at __ckpt_generate_err+0xf2/0x12b
> EAX: df051300 EBX: deb72f30 ECX: df051530 EDX: 0000001c
> ESI: df051430 EDI: deb72f28 EBP: deb72f10 ESP: deb72ef8
>  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> Process ckpt (pid: 2270, ti=deb72000 task=df9adf60 task.ti=deb72000)
> Stack:
>  c072ce85 df051300 0000001c deb75600 df9ad1c0 00000000 deb72f18 c03911ba
>  deb72f50 c03912ae df051300 c072ce85 000008dd df9ad4ec df051300 df9ad1c0
>  00000000 00000000 00000000 deb75600 deb75604 df051300 deb72f98 c03915ae
> Call Trace:
>  [<c03911ba>] ? __ckpt_write_err+0x16/0x18
>  [<c03912ae>] ? tree_count_tasks+0xf2/0x2a2
>  [<c03915ae>] ? do_checkpoint+0x150/0x5f2
>  [<c0390cd8>] ? kzalloc+0x10/0x12
>  [<c0390d0f>] ? ckpt_obj_hash_alloc+0x35/0x60
>  [<c039033d>] ? ckpt_ctx_alloc+0x77/0x99
>  [<c0390465>] ? sys_checkpoint+0x6c/0x82
>  [<c0202ce5>] ? syscall_call+0x7/0xb
> Code: 08 0c 8b c0 03 74 1b f6 05 c2 8f ff c0 20 74 12 f6 05 c9 8f ff c0 10 74 09 80 3d 47 94 83 c0 00 75 1d 8b 45 ec 83 78 2c 00 75 04 <0f> 0b eb fe 8b 55 ec 31 c0 89 72 2c 8d 65 f4 5b 5e 5f 5d c3 31 
> EIP: [<c03910dc>] __ckpt_generate_err+0xf2/0x12b SS:ESP 0068:deb72ef8
> ---[ end trace d54433b47f0c4829 ]---
> note: ckpt[2270] exited with preempt_count 1
> BUG: scheduling while atomic: ckpt/2270/0x10000002
> INFO: lockdep is turned off.
> Modules linked in:
> Pid: 2270, comm: ckpt Tainted: G      D    2.6.30-rc3-00074-ged3b275 #30
> Call Trace:
>  [<c0223f6b>] __schedule_bug+0x63/0x6a
>  [<c05ec7dc>] __schedule+0x8f/0x7ac
>  [<c024d299>] ? print_lock_contention_bug+0x14/0xd7
>  [<c0298093>] ? unmap_vmas+0x1e1/0x518
>  [<c0203703>] ? ftrace_call+0x5/0x8
>  [<c0203703>] ? ftrace_call+0x5/0x8
>  [<c05ecf10>] schedule+0x17/0x38
>  [<c0224738>] __cond_resched+0x26/0x3b
>  [<c05ed034>] _cond_resched+0x2c/0x37
>  [<c0298379>] unmap_vmas+0x4c7/0x518
>  [<c029b81b>] exit_mmap+0x6c/0xb7
>  [<c022906a>] mmput+0x3c/0x8f
>  [<c022c8a0>] exit_mm+0xe3/0xeb
>  [<c022e0e2>] do_exit+0x188/0x64b
>  [<c05ec415>] ? printk+0x14/0x16
>  [<c022b08d>] ? oops_exit+0x28/0x2d
>  [<c05efbe7>] oops_end+0x92/0x9a
>  [<c020560f>] die+0x59/0x5f
>  [<c05ef56b>] do_trap+0x89/0xa2
>  [<c02039fc>] ? do_invalid_op+0x0/0x80
>  [<c0203a72>] do_invalid_op+0x76/0x80
>  [<c03910dc>] ? __ckpt_generate_err+0xf2/0x12b
>  [<c0203703>] ? ftrace_call+0x5/0x8
>  [<c039c95d>] ? strnlen+0x8/0x1f
>  [<c039b8bd>] ? string+0x34/0x82
>  [<c039c14a>] ? vsnprintf+0x173/0x311
>  [<c039c05a>] ? vsnprintf+0x83/0x311
>  [<c039c9d0>] ? trace_hardirqs_off_thunk+0xc/0x10
>  [<c05ef322>] error_code+0x72/0x78
>  [<c02039fc>] ? do_invalid_op+0x0/0x80
>  [<c03910dc>] ? __ckpt_generate_err+0xf2/0x12b
>  [<c03911ba>] __ckpt_write_err+0x16/0x18
>  [<c03912ae>] tree_count_tasks+0xf2/0x2a2
>  [<c03915ae>] do_checkpoint+0x150/0x5f2
>  [<c0390cd8>] ? kzalloc+0x10/0x12
>  [<c0390d0f>] ? ckpt_obj_hash_alloc+0x35/0x60
>  [<c039033d>] ? ckpt_ctx_alloc+0x77/0x99
>  [<c0390465>] sys_checkpoint+0x6c/0x82
>  [<c0202ce5>] syscall_call+0x7/0xb
> 


* Re: bugs with ckpt-v15-dev
       [not found]                     ` <20090519010911.GD28083-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-05-20  5:30                       ` Oren Laadan
       [not found]                         ` <4A13955E.2040301-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  0 siblings, 1 reply; 13+ messages in thread
From: Oren Laadan @ 2009-05-20  5:30 UTC (permalink / raw)
  To: Matt Helsley; +Cc: Containers, Nathan Lynch



Matt Helsley wrote:
> On Mon, May 18, 2009 at 06:21:22PM -0500, Nathan Lynch wrote:
>> Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:
>>
>>> On Mon, May 18, 2009 at 04:36:11PM -0500, Nathan Lynch wrote:
>>>> [1] Should CONFIG_CHECKPOINT depend on CONFIG_CGROUPS and/or
>>>> CONFIG_CGROUPS_FREEZER?  We require tasks to be put in frozen state
>>>> before checkpoint, is there any mechanism apart from
>>>> cgroup/freezer.state to do this?
>>> Have you tried sending all of the tasks SIGSTOP? It won't 100% freeze
>>> the tasks -- they'd still be capable of responding to some signals
>>> (CONT, TERM..). Also they'd presumably be placed in the stopped state
>>> upon restart so a SIGCONT will be needed. In the case of bash, at
>>> least, that will technically change what happens upon restart. My
>>> guess is that in many cases it won't matter but there are some where
>>> it will.
>> Hmm, I'm having trouble understanding your suggestion.  The current
>> checkpoint implementation requires non-self tasks to be frozen (p->flags
>> & PF_FROZEN), which is not equivalent to stopped state (task->state &
>> __TASK_STOPPED).  That is, it would refuse to checkpoint tasks in
>> stopped state.  See may_checkpoint_task().
> 
> Oops. You're right. That would require changing may_checkpoint_task() to include
> __TASK_STOPPED -- not something we'd want in the final code. I had assumed
> you wanted to try a different mechanism for debugging purposes.
> 

Allowing checkpoint of stopped tasks is actually not such a bad
idea, IMHO.

Oren.


* Re: bugs with ckpt-v15-dev
       [not found]                         ` <4A13955E.2040301-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2009-05-20 13:14                           ` Serge E. Hallyn
       [not found]                             ` <20090520131457.GB25989-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 13+ messages in thread
From: Serge E. Hallyn @ 2009-05-20 13:14 UTC (permalink / raw)
  To: Oren Laadan; +Cc: Containers, Nathan Lynch

Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> 
> 
> Matt Helsley wrote:
> > On Mon, May 18, 2009 at 06:21:22PM -0500, Nathan Lynch wrote:
> >> Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:
> >>
> >>> On Mon, May 18, 2009 at 04:36:11PM -0500, Nathan Lynch wrote:
> >>>> [1] Should CONFIG_CHECKPOINT depend on CONFIG_CGROUPS and/or
> >>>> CONFIG_CGROUPS_FREEZER?  We require tasks to be put in frozen state
> >>>> before checkpoint, is there any mechanism apart from
> >>>> cgroup/freezer.state to do this?
> >>> Have you tried sending all of the tasks SIGSTOP? It won't 100% freeze
> >>> the tasks -- they'd still be capable of responding to some signals
> >>> (CONT, TERM..). Also they'd presumably be placed in the stopped state
> >>> upon restart so a SIGCONT will be needed. In the case of bash, at
> >>> least, that will technically change what happens upon restart. My
> >>> guess is that in many cases it won't matter but there are some where
> >>> it will.
> >> Hmm, I'm having trouble understanding your suggestion.  The current
> >> checkpoint implementation requires non-self tasks to be frozen (p->flags
> >> & PF_FROZEN), which is not equivalent to stopped state (task->state &
> >> __TASK_STOPPED).  That is, it would refuse to checkpoint tasks in
> >> stopped state.  See may_checkpoint_task().
> > 
> > Oops. You're right. That would require changing may_checkpoint_task() to include
> > __TASK_STOPPED -- not something we'd want in the final code. I had assumed
> > you wanted to try a different mechanism for debugging purposes.
> > 
> 
> Allowing checkpoint of stopped tasks is actually not such a bad
> idea, IMHO.

Well, it might be bad for the same reason that Matt is pursuing the
CHECKPOINTING freezer state:  the task might get kicked alive in
the middle of the checkpoint.

So it might be ok so long as we still move the task to CHECKPOINTING
state.  But I'm just not sure it's worth worrying about.

-serge


* Re: bugs with ckpt-v15-dev
       [not found]                             ` <20090520131457.GB25989-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2009-05-20 13:21                               ` Oren Laadan
  2009-05-20 21:10                               ` Matt Helsley
  1 sibling, 0 replies; 13+ messages in thread
From: Oren Laadan @ 2009-05-20 13:21 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Containers, Nathan Lynch



Serge E. Hallyn wrote:
> Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
>>
>> Matt Helsley wrote:
>>> On Mon, May 18, 2009 at 06:21:22PM -0500, Nathan Lynch wrote:
>>>> Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:
>>>>
>>>>> On Mon, May 18, 2009 at 04:36:11PM -0500, Nathan Lynch wrote:
>>>>>> [1] Should CONFIG_CHECKPOINT depend on CONFIG_CGROUPS and/or
>>>>>> CONFIG_CGROUPS_FREEZER?  We require tasks to be put in frozen state
>>>>>> before checkpoint, is there any mechanism apart from
>>>>>> cgroup/freezer.state to do this?
>>>>> Have you tried sending all of the tasks SIGSTOP? It won't 100% freeze
>>>>> the tasks -- they'd still be capable of responding to some signals
>>>>> (CONT, TERM..). Also they'd presumably be placed in the stopped state
>>>>> upon restart so a SIGCONT will be needed. In the case of bash, at
>>>>> least, that will technically change what happens upon restart. My
>>>>> guess is that in many cases it won't matter but there are some where
>>>>> it will.
>>>> Hmm, I'm having trouble understanding your suggestion.  The current
>>>> checkpoint implementation requires non-self tasks to be frozen (p->flags
>>>> & PF_FROZEN), which is not equivalent to stopped state (task->state &
>>>> __TASK_STOPPED).  That is, it would refuse to checkpoint tasks in
>>>> stopped state.  See may_checkpoint_task().
>>> Oops. You're right. That would require changing may_checkpoint_task() to include
>>> __TASK_STOPPED -- not something we'd want in the final code. I had assumed
>>> you wanted to try a different mechanism for debugging purposes.
>>>
>> Allowing checkpoint of stopped tasks is actually not such a bad
>> idea, IMHO.
> 
> Well, it might be bad for the same reason that Matt is pursuing the
> CHECKPOINTING freezer state:  the task might get kicked alive in
> the middle of the checkpoint.

Yes, that was my concern and I try to make the code safe with regard
to such behavior. And if that is achieved, then at worst the checkpoint
will either fail or yield meaningless results. On the other hand, it
can allow c/r without requiring cgroups/freezer, with some additional
restrictions.

> 
> So it might be ok so long as we still move the task to CHECKPOINTING
> state.  But I'm just not sure it's worth worrying about.

Probably not at the moment, except for "lowering the barrier" for
people to try it out.

Oren.


* Re: bugs with ckpt-v15-dev
       [not found]                             ` <20090520131457.GB25989-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2009-05-20 13:21                               ` Oren Laadan
@ 2009-05-20 21:10                               ` Matt Helsley
  1 sibling, 0 replies; 13+ messages in thread
From: Matt Helsley @ 2009-05-20 21:10 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: Containers, Nathan Lynch

On Wed, May 20, 2009 at 08:14:57AM -0500, Serge E. Hallyn wrote:
> Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> > 
> > 
> > Matt Helsley wrote:
> > > On Mon, May 18, 2009 at 06:21:22PM -0500, Nathan Lynch wrote:
> > >> Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:
> > >>
> > >>> On Mon, May 18, 2009 at 04:36:11PM -0500, Nathan Lynch wrote:
> > >>>> [1] Should CONFIG_CHECKPOINT depend on CONFIG_CGROUPS and/or
> > >>>> CONFIG_CGROUPS_FREEZER?  We require tasks to be put in frozen state
> > >>>> before checkpoint, is there any mechanism apart from
> > >>>> cgroup/freezer.state to do this?
> > >>> Have you tried sending all of the tasks SIGSTOP? It won't 100% freeze
> > >>> the tasks -- they'd still be capable of responding to some signals
> > >>> (CONT, TERM..). Also they'd presumably be placed in the stopped state
> > >>> upon restart so a SIGCONT will be needed. In the case of bash, at
> > >>> least, that will technically change what happens upon restart. My
> > >>> guess is that in many cases it won't matter but there are some where
> > >>> it will.
> > >> Hmm, I'm having trouble understanding your suggestion.  The current
> > >> checkpoint implementation requires non-self tasks to be frozen (p->flags
> > >> & PF_FROZEN), which is not equivalent to stopped state (task->state &
> > >> __TASK_STOPPED).  That is, it would refuse to checkpoint tasks in
> > >> stopped state.  See may_checkpoint_task().
> > > 
> > > Oops. You're right. That would require changing may_checkpoint_task() to include
> > > __TASK_STOPPED -- not something we'd want in the final code. I had assumed
> > > you wanted to try a different mechanism for debugging purposes.
> > > 
> > 
> > Allowing checkpoint of stopped tasks is actually not such a bad
> > idea, IMHO.
> 
> Well, it might be bad for the same reason that Matt is pursuing the
> CHECKPOINTING freezer state:  the task might get kicked alive in
> the middle of the checkpoint.
> 
> So it might be ok so long as we still move the task to CHECKPOINTING
> state.  But I'm just not sure it's worth worrying about.

FYI: currently there is no CHECKPOINTING task state. CHECKPOINTING is
specific to the freezer.state -- the tasks themselves still appear
"frozen", in the D state. This works since nothing else unfreezes these
tasks.
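
For anyone trying this out, a rough C sketch of driving the freezer
through freezer.state follows. It assumes the cgroup freezer interface of
writing FROZEN or THAWED to the file and reading the state back. The
function names are hypothetical, and the path to freezer.state depends on
where the freezer cgroup is mounted, so it is left as a parameter.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write a state string such as "FROZEN" or "THAWED" to a freezer.state
 * file. Returns 0 on success, -1 on failure. */
static int freezer_write_state(const char *state_path, const char *state)
{
    assert(state_path != NULL && state != NULL);

    FILE *f = fopen(state_path, "w");
    if (f == NULL)
        return -1;

    int ok = fputs(state, f) >= 0;
    /* fclose() flushes, so a write error may only show up here. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Read the first line of a freezer.state file into buf, without the
 * trailing newline. Returns 0 on success, -1 on failure. */
static int freezer_read_state(const char *state_path, char *buf, size_t len)
{
    assert(state_path != NULL && buf != NULL);
    if (len == 0)
        return -1;

    FILE *f = fopen(state_path, "r");
    if (f == NULL)
        return -1;

    char *line = fgets(buf, (int)len, f);
    fclose(f);
    if (line == NULL)
        return -1;

    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}
```

A caller would typically write FROZEN and then poll freezer_read_state()
until it no longer reports FREEZING before attempting the checkpoint.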

Cheers,
	-Matt Helsley


end of thread, other threads:[~2009-05-20 21:10 UTC | newest]

Thread overview: 13+ messages
2009-05-18 19:23 bugs with ckpt-v15-dev Nathan Lynch
     [not found] ` <m3my9amczw.fsf-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
2009-05-18 21:10   ` Serge E. Hallyn
     [not found]     ` <20090518211041.GA20781-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-18 21:36       ` Nathan Lynch
     [not found]         ` <m3y6suhz5g.fsf-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
2009-05-18 22:39           ` Serge E. Hallyn
     [not found]             ` <20090518223919.GA24826-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-18 23:02               ` Nathan Lynch
2009-05-18 22:51           ` Matt Helsley
     [not found]             ` <20090518225100.GC28083-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-18 23:21               ` Nathan Lynch
     [not found]                 ` <m3zldagfpp.fsf-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
2009-05-19  1:09                   ` Matt Helsley
     [not found]                     ` <20090519010911.GD28083-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-20  5:30                       ` Oren Laadan
     [not found]                         ` <4A13955E.2040301-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-05-20 13:14                           ` Serge E. Hallyn
     [not found]                             ` <20090520131457.GB25989-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-20 13:21                               ` Oren Laadan
2009-05-20 21:10                               ` Matt Helsley
2009-05-20  5:28   ` Oren Laadan
