* bugs with ckpt-v15-dev
@ 2009-05-18 19:23 Nathan Lynch
From: Nathan Lynch @ 2009-05-18 19:23 UTC
To: Containers
Last commit is ed3b275 "allow error string during checkpoint while
holding a spinlock".
# bash -c 'exec <&- >&- 2>&- ; while : ; do : ; done' &
[1] 2269
# ckpt $! > /tmp/bash.ckpt
BUG: sleeping function called from invalid context at mm/slub.c:1595
in_atomic(): 1, irqs_disabled(): 0, pid: 2270, name: ckpt
1 lock held by ckpt/2270:
#0: (tasklist_lock){.+.+.+}, at: [<c03911e6>] tree_count_tasks+0x2a/0x2a2
Pid: 2270, comm: ckpt Not tainted 2.6.30-rc3-00074-ged3b275 #30
Call Trace:
[<c024b6f9>] ? __debug_show_held_locks+0x1e/0x20
[<c02234da>] __might_sleep+0x100/0x107
[<c02a9372>] kmem_cache_alloc+0x35/0x11f
[<c039100f>] ? __ckpt_generate_err+0x25/0x12b
[<c024a9c7>] ? put_lock_stats+0x1e/0x29
[<c039100f>] __ckpt_generate_err+0x25/0x12b
[<c0203703>] ? ftrace_call+0x5/0x8
[<c03911ba>] __ckpt_write_err+0x16/0x18
[<c03912ae>] tree_count_tasks+0xf2/0x2a2
[<c03915ae>] do_checkpoint+0x150/0x5f2
[<c0390cd8>] ? kzalloc+0x10/0x12
[<c0390d0f>] ? ckpt_obj_hash_alloc+0x35/0x60
[<c039033d>] ? ckpt_ctx_alloc+0x77/0x99
[<c0390465>] sys_checkpoint+0x6c/0x82
[<c0202ce5>] syscall_call+0x7/0xb
------------[ cut here ]------------
kernel BUG at checkpoint/checkpoint.c:136!
invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
last sysfs file: /sys/block/sda/size
Modules linked in:
Pid: 2270, comm: ckpt Not tainted (2.6.30-rc3-00074-ged3b275 #30)
EIP: 0060:[<c03910dc>] EFLAGS: 00010246 CPU: 0
EIP is at __ckpt_generate_err+0xf2/0x12b
EAX: df051300 EBX: deb72f30 ECX: df051530 EDX: 0000001c
ESI: df051430 EDI: deb72f28 EBP: deb72f10 ESP: deb72ef8
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process ckpt (pid: 2270, ti=deb72000 task=df9adf60 task.ti=deb72000)
Stack:
c072ce85 df051300 0000001c deb75600 df9ad1c0 00000000 deb72f18 c03911ba
deb72f50 c03912ae df051300 c072ce85 000008dd df9ad4ec df051300 df9ad1c0
00000000 00000000 00000000 deb75600 deb75604 df051300 deb72f98 c03915ae
Call Trace:
[<c03911ba>] ? __ckpt_write_err+0x16/0x18
[<c03912ae>] ? tree_count_tasks+0xf2/0x2a2
[<c03915ae>] ? do_checkpoint+0x150/0x5f2
[<c0390cd8>] ? kzalloc+0x10/0x12
[<c0390d0f>] ? ckpt_obj_hash_alloc+0x35/0x60
[<c039033d>] ? ckpt_ctx_alloc+0x77/0x99
[<c0390465>] ? sys_checkpoint+0x6c/0x82
[<c0202ce5>] ? syscall_call+0x7/0xb
Code: 08 0c 8b c0 03 74 1b f6 05 c2 8f ff c0 20 74 12 f6 05 c9 8f ff c0 10 74 09 80 3d 47 94 83 c0 00 75 1d 8b 45 ec 83 78 2c 00 75 04 <0f> 0b eb fe 8b 55 ec 31 c0 89 72 2c 8d 65 f4 5b 5e 5f 5d c3 31
EIP: [<c03910dc>] __ckpt_generate_err+0xf2/0x12b SS:ESP 0068:deb72ef8
---[ end trace d54433b47f0c4829 ]---
note: ckpt[2270] exited with preempt_count 1
BUG: scheduling while atomic: ckpt/2270/0x10000002
INFO: lockdep is turned off.
Modules linked in:
Pid: 2270, comm: ckpt Tainted: G D 2.6.30-rc3-00074-ged3b275 #30
Call Trace:
[<c0223f6b>] __schedule_bug+0x63/0x6a
[<c05ec7dc>] __schedule+0x8f/0x7ac
[<c024d299>] ? print_lock_contention_bug+0x14/0xd7
[<c0298093>] ? unmap_vmas+0x1e1/0x518
[<c0203703>] ? ftrace_call+0x5/0x8
[<c0203703>] ? ftrace_call+0x5/0x8
[<c05ecf10>] schedule+0x17/0x38
[<c0224738>] __cond_resched+0x26/0x3b
[<c05ed034>] _cond_resched+0x2c/0x37
[<c0298379>] unmap_vmas+0x4c7/0x518
[<c029b81b>] exit_mmap+0x6c/0xb7
[<c022906a>] mmput+0x3c/0x8f
[<c022c8a0>] exit_mm+0xe3/0xeb
[<c022e0e2>] do_exit+0x188/0x64b
[<c05ec415>] ? printk+0x14/0x16
[<c022b08d>] ? oops_exit+0x28/0x2d
[<c05efbe7>] oops_end+0x92/0x9a
[<c020560f>] die+0x59/0x5f
[<c05ef56b>] do_trap+0x89/0xa2
[<c02039fc>] ? do_invalid_op+0x0/0x80
[<c0203a72>] do_invalid_op+0x76/0x80
[<c03910dc>] ? __ckpt_generate_err+0xf2/0x12b
[<c0203703>] ? ftrace_call+0x5/0x8
[<c039c95d>] ? strnlen+0x8/0x1f
[<c039b8bd>] ? string+0x34/0x82
[<c039c14a>] ? vsnprintf+0x173/0x311
[<c039c05a>] ? vsnprintf+0x83/0x311
[<c039c9d0>] ? trace_hardirqs_off_thunk+0xc/0x10
[<c05ef322>] error_code+0x72/0x78
[<c02039fc>] ? do_invalid_op+0x0/0x80
[<c03910dc>] ? __ckpt_generate_err+0xf2/0x12b
[<c03911ba>] __ckpt_write_err+0x16/0x18
[<c03912ae>] tree_count_tasks+0xf2/0x2a2
[<c03915ae>] do_checkpoint+0x150/0x5f2
[<c0390cd8>] ? kzalloc+0x10/0x12
[<c0390d0f>] ? ckpt_obj_hash_alloc+0x35/0x60
[<c039033d>] ? ckpt_ctx_alloc+0x77/0x99
[<c0390465>] sys_checkpoint+0x6c/0x82
[<c0202ce5>] syscall_call+0x7/0xb
* Re: bugs with ckpt-v15-dev
@ 2009-05-18 21:10 Serge E. Hallyn
From: Serge E. Hallyn @ 2009-05-18 21:10 UTC
To: Nathan Lynch; +Cc: Containers

Quoting Nathan Lynch (ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org):
> Last commit is ed3b275 "allow error string during checkpoint while
> holding a spinlock".
>
> # bash -c 'exec <&- >&- 2>&- ; while : ; do : ; done' &
> [1] 2269
> # ckpt $! > /tmp/bash.ckpt
>
> BUG: sleeping function called from invalid context at mm/slub.c:1595

Yeah, not only does ckpt_write_err() get called under task_lock, but
the fn returns without ever doing put_task_struct. (I'd generate and
send the quick trivial patch, but my git tree is in a bit of a debugme
state right now)

Now mind you this shows that your ckpt program isn't sending
CHECKPOINT_SUBTREE with flags. This in turn means you are probably
not using the ckpt-v15-dev version of user-cr, and if that is
the case it makes your problems with gconf shared file mapping more
suspect as well...?

-serge
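A minimal userspace sketch of one way to avoid the "sleeping function called from invalid context" report discussed above: format the error string into a preallocated buffer while the lock is held, and defer anything that may sleep until after the lock is dropped. Everything here is an assumption made for illustration. The names (sketch_ctx, sketch_record_err, sketch_flush_err) are hypothetical, the "lock" is only a counter, malloc() stands in for a sleeping kernel allocation, and the message text is made up. It is not the ckpt-v15-dev code.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_ERR_MAX 128

/* Hypothetical context; not the kernel's checkpoint context. */
struct sketch_ctx {
    int  atomic_depth;             /* > 0 while the pretend spinlock is held */
    int  err_pending;              /* set when err_buf holds a message */
    char err_buf[SKETCH_ERR_MAX];  /* preallocated, so no allocation under the lock */
};

static void sketch_lock(struct sketch_ctx *c)   { c->atomic_depth++; }
static void sketch_unlock(struct sketch_ctx *c) { c->atomic_depth--; }

/* Safe while "atomic": only formats into the preallocated buffer. */
static void sketch_record_err(struct sketch_ctx *c, const char *fmt, ...)
{
    va_list ap;

    va_start(ap, fmt);
    vsnprintf(c->err_buf, sizeof(c->err_buf), fmt, ap);
    va_end(ap);
    c->err_pending = 1;
}

/*
 * Allocates, so it refuses to run while the pretend lock is held.
 * Returns a heap copy of the pending message (caller frees), or NULL
 * if nothing is pending, the lock is held, or allocation fails.
 */
static char *sketch_flush_err(struct sketch_ctx *c)
{
    char *copy;

    if (c->atomic_depth > 0 || !c->err_pending)
        return NULL;
    copy = malloc(strlen(c->err_buf) + 1);
    if (copy)
        strcpy(copy, c->err_buf);
    c->err_pending = 0;
    return copy;
}
```

A caller would record the error inside the locked region, unlock, and only then call sketch_flush_err() to do the part that may sleep.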
* Re: bugs with ckpt-v15-dev
@ 2009-05-18 21:36 Nathan Lynch
From: Nathan Lynch @ 2009-05-18 21:36 UTC
To: Serge E. Hallyn; +Cc: Containers

"Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:

> Quoting Nathan Lynch (ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org):
>> Last commit is ed3b275 "allow error string during checkpoint while
>> holding a spinlock".
>>
>> # bash -c 'exec <&- >&- 2>&- ; while : ; do : ; done' &
>> [1] 2269
>> # ckpt $! > /tmp/bash.ckpt
>>
>> BUG: sleeping function called from invalid context at mm/slub.c:1595
>
> Yeah, not only does ckpt_write_err() get called under task_lock, but
> the fn returns without ever doing put_task_struct. (I'd generate and
> send the quick trivial patch, but my git tree is in a bit of a debugme
> state right now)

Would prefer to just rip that thing out, it's cost me more trouble than
it's worth.

> Now mind you this shows that your ckpt program isn't sending
> CHECKPOINT_SUBTREE with flags.

I don't follow. There is "user error" here in that I'm not freezing the
task before checkpointing[1], but my ckpt command is passing the subtree
flag (0x4) afaict:

SYS_335(0x9ec, 0x1, 0x4, 0xbfdc6200, 0[2542:c/r:may_checkpoint_task] check 2540

> This in turn means you are probably
> not using the ckpt-v15-dev version of user-cr, and if that is
> the case it makes your problems with gconf shared file mapping more
> suspect as well...?

After updating to the latest user-cr I get the same BUGs.

[1] Should CONFIG_CHECKPOINT depend on CONFIG_CGROUPS and/or
CONFIG_CGROUPS_FREEZER? We require tasks to be put in frozen state
before checkpoint, is there any mechanism apart from
cgroup/freezer.state to do this?
* Re: bugs with ckpt-v15-dev
@ 2009-05-18 22:39 Serge E. Hallyn
From: Serge E. Hallyn @ 2009-05-18 22:39 UTC
To: Nathan Lynch; +Cc: Containers

Quoting Nathan Lynch (ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org):
> "Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:
>
> > Quoting Nathan Lynch (ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org):
> >> Last commit is ed3b275 "allow error string during checkpoint while
> >> holding a spinlock".
> >>
> >> # bash -c 'exec <&- >&- 2>&- ; while : ; do : ; done' &
> >> [1] 2269
> >> # ckpt $! > /tmp/bash.ckpt
> >>
> >> BUG: sleeping function called from invalid context at mm/slub.c:1595
> >
> > Yeah, not only does ckpt_write_err() get called under task_lock, but
> > the fn returns without ever doing put_task_struct. (I'd generate and
> > send the quick trivial patch, but my git tree is in a bit of a debugme
> > state right now)
>
> Would prefer to just rip that thing out, it's cost me more trouble than
> it's worth.

Which thing - CHECKPOINT_SUBTREE, freezer check, or ckpt_write_err?

> > Now mind you this shows that your ckpt program isn't sending
> > CHECKPOINT_SUBTREE with flags.
>
> I don't follow. There is "user error" here in that I'm not freezing the
> task before checkpointing[1], but my ckpt command is passing the subtree
> flag (0x4) afaict:
>
> SYS_335(0x9ec, 0x1, 0x4, 0xbfdc6200, 0[2542:c/r:may_checkpoint_task] check 2540

Oh, it's the freezer test in may_checkpoint_task you're getting the
error on? (in my git tree I'd commented that one out temporarily so I
just assumed it was the subtree check in get_container :)

> > This in turn means you are probably
> > not using the ckpt-v15-dev version of user-cr, and if that is
> > the case it makes your problems with gconf shared file mapping more
> > suspect as well...?
>
> After updating to the latest user-cr I get the same BUGs.
>
> [1] Should CONFIG_CHECKPOINT depend on CONFIG_CGROUPS and/or
> CONFIG_CGROUPS_FREEZER? We require tasks to be put in frozen state
> before checkpoint, is there any mechanism apart from
> cgroup/freezer.state to do this?

A task can self-checkpoint without the freezer though.

-serge
* Re: bugs with ckpt-v15-dev
@ 2009-05-18 23:02 Nathan Lynch
From: Nathan Lynch @ 2009-05-18 23:02 UTC
To: Serge E. Hallyn; +Cc: Containers

"Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:

> Quoting Nathan Lynch (ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org):
>> "Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:
>>
>> > Quoting Nathan Lynch (ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org):
>> >> Last commit is ed3b275 "allow error string during checkpoint while
>> >> holding a spinlock".
>> >>
>> >> # bash -c 'exec <&- >&- 2>&- ; while : ; do : ; done' &
>> >> [1] 2269
>> >> # ckpt $! > /tmp/bash.ckpt
>> >>
>> >> BUG: sleeping function called from invalid context at mm/slub.c:1595
>> >
>> > Yeah, not only does ckpt_write_err() get called under task_lock, but
>> > the fn returns without ever doing put_task_struct. (I'd generate and
>> > send the quick trivial patch, but my git tree is in a bit of a debugme
>> > state right now)
>>
>> Would prefer to just rip that thing out, it's cost me more trouble than
>> it's worth.
>
> Which thing - CHECKPOINT_SUBTREE, freezer check, or ckpt_write_err?

ckpt_write_err. I've yet to witness it perform its intended function
without triggering a WARN_ON or BUG.

>> > Now mind you this shows that your ckpt program isn't sending
>> > CHECKPOINT_SUBTREE with flags.
>>
>> I don't follow. There is "user error" here in that I'm not freezing the
>> task before checkpointing[1], but my ckpt command is passing the subtree
>> flag (0x4) afaict:
>>
>> SYS_335(0x9ec, 0x1, 0x4, 0xbfdc6200, 0[2542:c/r:may_checkpoint_task] check 2540
>
> Oh, it's the freezer test in may_checkpoint_task you're getting the
> error on? (in my git tree I'd commented that one out temporarily so I
> just assumed it was the subtree check in get_container :)

Yes, the frozen test is failing, afaik.
* Re: bugs with ckpt-v15-dev
@ 2009-05-18 22:51 Matt Helsley
From: Matt Helsley @ 2009-05-18 22:51 UTC
To: Nathan Lynch; +Cc: Containers

On Mon, May 18, 2009 at 04:36:11PM -0500, Nathan Lynch wrote:
> "Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:
>
> > Quoting Nathan Lynch (ntl-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org):
> >> Last commit is ed3b275 "allow error string during checkpoint while
> >> holding a spinlock".
> >>
> >> # bash -c 'exec <&- >&- 2>&- ; while : ; do : ; done' &
> >> [1] 2269
> >> # ckpt $! > /tmp/bash.ckpt
> >>
> >> BUG: sleeping function called from invalid context at mm/slub.c:1595
> >
> > Yeah, not only does ckpt_write_err() get called under task_lock, but
> > the fn returns without ever doing put_task_struct. (I'd generate and
> > send the quick trivial patch, but my git tree is in a bit of a debugme
> > state right now)
>
> Would prefer to just rip that thing out, it's cost me more trouble than
> it's worth.
>
>
> > Now mind you this shows that your ckpt program isn't sending
> > CHECKPOINT_SUBTREE with flags.
>
> I don't follow. There is "user error" here in that I'm not freezing the
> task before checkpointing[1], but my ckpt command is passing the subtree
> flag (0x4) afaict:
>
> SYS_335(0x9ec, 0x1, 0x4, 0xbfdc6200, 0[2542:c/r:may_checkpoint_task] check 2540
>
>
> > This in turn means you are probably
> > not using the ckpt-v15-dev version of user-cr, and if that is
> > the case it makes your problems with gconf shared file mapping more
> > suspect as well...?
>
> After updating to the latest user-cr I get the same BUGs.
>
> [1] Should CONFIG_CHECKPOINT depend on CONFIG_CGROUPS and/or
> CONFIG_CGROUPS_FREEZER? We require tasks to be put in frozen state
> before checkpoint, is there any mechanism apart from
> cgroup/freezer.state to do this?

Have you tried sending all of the tasks SIGSTOP? It won't 100% freeze
the tasks -- they'd still be capable of responding to some signals
(CONT, TERM..). Also they'd presumably be placed in the stopped state
upon restart so a SIGCONT will be needed. In the case of bash, at
least, that will technically change what happens upon restart. My
guess is that in many cases it won't matter but there are some where
it will.

The freezer documentation shows an example of what happens with bash
when attempting to use only STOP/CONT rather than the freezer. gdb
might also present interesting cases when just utilizing STOP/CONT
signals..

Cheers,
	-Matt Helsley
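For reference, the SIGSTOP/SIGCONT approach suggested above can be tried from userspace with plain POSIX calls. This is a minimal sketch under stated assumptions: the helper names (sketch_spawn_idle, sketch_stop_child, sketch_resume_child) are hypothetical, and the target is a child of the caller so that waitpid() can confirm the stop. Stopping a task this way only puts it in the stopped state; it says nothing about whether a checkpoint of that task would be accepted.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that idles until signalled. Returns the pid, or -1 on error. */
static pid_t sketch_spawn_idle(void)
{
    pid_t pid = fork();

    if (pid == 0) {
        for (;;)
            pause();    /* child: sleep until a signal arrives */
    }
    return pid;
}

/* Send SIGSTOP and wait until the child reports it. Returns 1 if stopped. */
static int sketch_stop_child(pid_t pid)
{
    int status;

    if (kill(pid, SIGSTOP) != 0)
        return 0;
    if (waitpid(pid, &status, WUNTRACED) != pid)
        return 0;
    return WIFSTOPPED(status) ? 1 : 0;
}

/* Send SIGCONT to resume a stopped child. Returns 1 if the signal was sent. */
static int sketch_resume_child(pid_t pid)
{
    return kill(pid, SIGCONT) == 0 ? 1 : 0;
}
```

Because the stopped child can still be woken by SIGCONT from any sufficiently privileged process, nothing in this sketch prevents it from resuming at an inconvenient moment.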
* Re: bugs with ckpt-v15-dev
@ 2009-05-18 23:21 Nathan Lynch
From: Nathan Lynch @ 2009-05-18 23:21 UTC
To: Matt Helsley; +Cc: Containers

Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:

> On Mon, May 18, 2009 at 04:36:11PM -0500, Nathan Lynch wrote:
>>
>> [1] Should CONFIG_CHECKPOINT depend on CONFIG_CGROUPS and/or
>> CONFIG_CGROUPS_FREEZER? We require tasks to be put in frozen state
>> before checkpoint, is there any mechanism apart from
>> cgroup/freezer.state to do this?
>
> Have you tried sending all of the tasks SIGSTOP? It won't 100% freeze
> the tasks -- they'd still be capable of responding to some signals
> (CONT, TERM..). Also they'd presumably be placed in the stopped state
> upon restart so a SIGCONT will be needed. In the case of bash, at
> least, that will technically change what happens upon restart. My
> guess is that in many cases it won't matter but there are some where
> it will.

Hmm, I'm having trouble understanding your suggestion. The current
checkpoint implementation requires non-self tasks to be frozen (p->flags
& PF_FROZEN), which is not equivalent to stopped state (task->state &
__TASK_STOPPED). That is, it would refuse to checkpoint tasks in
stopped state. See may_checkpoint_task().
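The distinction drawn above can be shown with a small standalone sketch. The struct, the helper name, and the bit values are stand-ins chosen for illustration; they are not copied from kernel headers or from may_checkpoint_task() itself. Only the rule comes from the message above: a non-self task must have the frozen flag set, and being in the stopped state does not count.

```c
#include <stdbool.h>

/* Illustrative bit values for the sketch; assumptions, not kernel definitions. */
#define SKETCH_PF_FROZEN    0x00010000u
#define SKETCH_TASK_STOPPED 0x00000004u

/* Hypothetical, cut-down task descriptor. */
struct sketch_task {
    unsigned int flags;  /* per-task flags, where the frozen bit lives */
    unsigned int state;  /* scheduler state, where the stopped bit lives */
    bool is_self;        /* true when the task is checkpointing itself */
};

/* Returns true when the task passes the frozen test described above. */
static bool sketch_may_checkpoint(const struct sketch_task *t)
{
    if (t->is_self)
        return true;  /* self-checkpoint does not need the freezer */

    /* The stopped bit in t->state is deliberately not consulted. */
    return (t->flags & SKETCH_PF_FROZEN) != 0;
}
```

With this rule a task stopped by SIGSTOP is rejected, while the same task placed in a frozen state is accepted.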
* Re: bugs with ckpt-v15-dev
@ 2009-05-19 1:09 Matt Helsley
From: Matt Helsley @ 2009-05-19 1:09 UTC
To: Nathan Lynch; +Cc: Containers

On Mon, May 18, 2009 at 06:21:22PM -0500, Nathan Lynch wrote:
> Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:
>
> > On Mon, May 18, 2009 at 04:36:11PM -0500, Nathan Lynch wrote:
> >>
> >> [1] Should CONFIG_CHECKPOINT depend on CONFIG_CGROUPS and/or
> >> CONFIG_CGROUPS_FREEZER? We require tasks to be put in frozen state
> >> before checkpoint, is there any mechanism apart from
> >> cgroup/freezer.state to do this?
> >
> > Have you tried sending all of the tasks SIGSTOP? It won't 100% freeze
> > the tasks -- they'd still be capable of responding to some signals
> > (CONT, TERM..). Also they'd presumably be placed in the stopped state
> > upon restart so a SIGCONT will be needed. In the case of bash, at
> > least, that will technically change what happens upon restart. My
> > guess is that in many cases it won't matter but there are some where
> > it will.
>
> Hmm, I'm having trouble understanding your suggestion. The current
> checkpoint implementation requires non-self tasks to be frozen (p->flags
> & PF_FROZEN), which is not equivalent to stopped state (task->state &
> __TASK_STOPPED). That is, it would refuse to checkpoint tasks in
> stopped state. See may_checkpoint_task().

Oops. You're right. That would require changing may_checkpoint_task() to include
__TASK_STOPPED -- not something we'd want in the final code. I had assumed
you wanted to try a different mechanism for debugging purposes.

Cheers,
	-Matt Helsley
* Re: bugs with ckpt-v15-dev
@ 2009-05-20 5:30 Oren Laadan
From: Oren Laadan @ 2009-05-20 5:30 UTC
To: Matt Helsley; +Cc: Containers, Nathan Lynch

Matt Helsley wrote:
> On Mon, May 18, 2009 at 06:21:22PM -0500, Nathan Lynch wrote:
>> Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:
>>
>>> On Mon, May 18, 2009 at 04:36:11PM -0500, Nathan Lynch wrote:
>>>> [1] Should CONFIG_CHECKPOINT depend on CONFIG_CGROUPS and/or
>>>> CONFIG_CGROUPS_FREEZER? We require tasks to be put in frozen state
>>>> before checkpoint, is there any mechanism apart from
>>>> cgroup/freezer.state to do this?
>>> Have you tried sending all of the tasks SIGSTOP? It won't 100% freeze
>>> the tasks -- they'd still be capable of responding to some signals
>>> (CONT, TERM..). Also they'd presumably be placed in the stopped state
>>> upon restart so a SIGCONT will be needed. In the case of bash, at
>>> least, that will technically change what happens upon restart. My
>>> guess is that in many cases it won't matter but there are some where
>>> it will.
>> Hmm, I'm having trouble understanding your suggestion. The current
>> checkpoint implementation requires non-self tasks to be frozen (p->flags
>> & PF_FROZEN), which is not equivalent to stopped state (task->state &
>> __TASK_STOPPED). That is, it would refuse to checkpoint tasks in
>> stopped state. See may_checkpoint_task().
>
> Oops. You're right. That would require changing may_checkpoint_task() to include
> __TASK_STOPPED -- not something we'd want in the final code. I had assumed
> you wanted to try a different mechanism for debugging purposes.
>

Allowing checkpoint of stopped tasks is actually not such a bad
idea, IMHO.

Oren.
* Re: bugs with ckpt-v15-dev
@ 2009-05-20 13:14 Serge E. Hallyn
From: Serge E. Hallyn @ 2009-05-20 13:14 UTC
To: Oren Laadan; +Cc: Containers, Nathan Lynch

Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
>
>
> Matt Helsley wrote:
> > On Mon, May 18, 2009 at 06:21:22PM -0500, Nathan Lynch wrote:
> >> Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:
> >>
> >>> On Mon, May 18, 2009 at 04:36:11PM -0500, Nathan Lynch wrote:
> >>>> [1] Should CONFIG_CHECKPOINT depend on CONFIG_CGROUPS and/or
> >>>> CONFIG_CGROUPS_FREEZER? We require tasks to be put in frozen state
> >>>> before checkpoint, is there any mechanism apart from
> >>>> cgroup/freezer.state to do this?
> >>> Have you tried sending all of the tasks SIGSTOP? It won't 100% freeze
> >>> the tasks -- they'd still be capable of responding to some signals
> >>> (CONT, TERM..). Also they'd presumably be placed in the stopped state
> >>> upon restart so a SIGCONT will be needed. In the case of bash, at
> >>> least, that will technically change what happens upon restart. My
> >>> guess is that in many cases it won't matter but there are some where
> >>> it will.
> >> Hmm, I'm having trouble understanding your suggestion. The current
> >> checkpoint implementation requires non-self tasks to be frozen (p->flags
> >> & PF_FROZEN), which is not equivalent to stopped state (task->state &
> >> __TASK_STOPPED). That is, it would refuse to checkpoint tasks in
> >> stopped state. See may_checkpoint_task().
> >
> > Oops. You're right. That would require changing may_checkpoint_task() to include
> > __TASK_STOPPED -- not something we'd want in the final code. I had assumed
> > you wanted to try a different mechanism for debugging purposes.
> >
>
> Allowing checkpoint of stopped tasks is actually not such a bad
> idea, IMHO.

Well, it might be bad for the same reason that Matt is pursuing the
CHECKPOINTING freezer state: the task might get kicked alive in
the middle of the checkpoint.

So it might be ok so long as we still move the task to CHECKPOINTING
state. But I'm just not sure it's worth worrying about.

-serge
* Re: bugs with ckpt-v15-dev
@ 2009-05-20 13:21 Oren Laadan
From: Oren Laadan @ 2009-05-20 13:21 UTC
To: Serge E. Hallyn; +Cc: Containers, Nathan Lynch

Serge E. Hallyn wrote:
> Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
>>
>> Matt Helsley wrote:
>>> On Mon, May 18, 2009 at 06:21:22PM -0500, Nathan Lynch wrote:
>>>> Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:
>>>>
>>>>> On Mon, May 18, 2009 at 04:36:11PM -0500, Nathan Lynch wrote:
>>>>>> [1] Should CONFIG_CHECKPOINT depend on CONFIG_CGROUPS and/or
>>>>>> CONFIG_CGROUPS_FREEZER? We require tasks to be put in frozen state
>>>>>> before checkpoint, is there any mechanism apart from
>>>>>> cgroup/freezer.state to do this?
>>>>> Have you tried sending all of the tasks SIGSTOP? It won't 100% freeze
>>>>> the tasks -- they'd still be capable of responding to some signals
>>>>> (CONT, TERM..). Also they'd presumably be placed in the stopped state
>>>>> upon restart so a SIGCONT will be needed. In the case of bash, at
>>>>> least, that will technically change what happens upon restart. My
>>>>> guess is that in many cases it won't matter but there are some where
>>>>> it will.
>>>> Hmm, I'm having trouble understanding your suggestion. The current
>>>> checkpoint implementation requires non-self tasks to be frozen (p->flags
>>>> & PF_FROZEN), which is not equivalent to stopped state (task->state &
>>>> __TASK_STOPPED). That is, it would refuse to checkpoint tasks in
>>>> stopped state. See may_checkpoint_task().
>>> Oops. You're right. That would require changing may_checkpoint_task() to include
>>> __TASK_STOPPED -- not something we'd want in the final code. I had assumed
>>> you wanted to try a different mechanism for debugging purposes.
>>>
>> Allowing checkpoint of stopped tasks is actually not such a bad
>> idea, IMHO.
>
> Well, it might be bad for the same reason that Matt is pursuing the
> CHECKPOINTING freezer state: the task might get kicked alive in
> the middle of the checkpoint.

Yes, that was my concern and I try to make the code safe with regard
to such behavior. And if that is achieved, then at worst the checkpoint
will either fail or yield meaningless results.

On the other hand, it can allow c/r without requiring cgroups/freezer,
with some additional restrictions.

>
> So it might be ok so long as we still move the task to CHECKPOINTING
> state. But I'm just not sure it's worth worrying about.

Probably not at the moment, except for "lowering the barrier" for
people to try it out.

Oren.
* Re: bugs with ckpt-v15-dev
@ 2009-05-20 21:10 Matt Helsley
From: Matt Helsley @ 2009-05-20 21:10 UTC
To: Serge E. Hallyn; +Cc: Containers, Nathan Lynch

On Wed, May 20, 2009 at 08:14:57AM -0500, Serge E. Hallyn wrote:
> Quoting Oren Laadan (orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org):
> >
> >
> > Matt Helsley wrote:
> > > On Mon, May 18, 2009 at 06:21:22PM -0500, Nathan Lynch wrote:
> > >> Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:
> > >>
> > >>> On Mon, May 18, 2009 at 04:36:11PM -0500, Nathan Lynch wrote:
> > >>>> [1] Should CONFIG_CHECKPOINT depend on CONFIG_CGROUPS and/or
> > >>>> CONFIG_CGROUPS_FREEZER? We require tasks to be put in frozen state
> > >>>> before checkpoint, is there any mechanism apart from
> > >>>> cgroup/freezer.state to do this?
> > >>> Have you tried sending all of the tasks SIGSTOP? It won't 100% freeze
> > >>> the tasks -- they'd still be capable of responding to some signals
> > >>> (CONT, TERM..). Also they'd presumably be placed in the stopped state
> > >>> upon restart so a SIGCONT will be needed. In the case of bash, at
> > >>> least, that will technically change what happens upon restart. My
> > >>> guess is that in many cases it won't matter but there are some where
> > >>> it will.
> > >> Hmm, I'm having trouble understanding your suggestion. The current
> > >> checkpoint implementation requires non-self tasks to be frozen (p->flags
> > >> & PF_FROZEN), which is not equivalent to stopped state (task->state &
> > >> __TASK_STOPPED). That is, it would refuse to checkpoint tasks in
> > >> stopped state. See may_checkpoint_task().
> > >
> > > Oops. You're right. That would require changing may_checkpoint_task() to include
> > > __TASK_STOPPED -- not something we'd want in the final code. I had assumed
> > > you wanted to try a different mechanism for debugging purposes.
> > >
> >
> > Allowing checkpoint of stopped tasks is actually not such a bad
> > idea, IMHO.
>
> Well, it might be bad for the same reason that Matt is pursuing the
> CHECKPOINTING freezer state: the task might get kicked alive in
> the middle of the checkpoint.
>
> So it might be ok so long as we still move the task to CHECKPOINTING
> state. But I'm just not sure it's worth worrying about.

FYI: currently there is no CHECKPOINTING state. CHECKPOINTING is
specific to the freezer.state -- the tasks still appear "frozen" in the
D state. This works since nothing else unfreezes these tasks.

Cheers,
	-Matt Helsley
* Re: bugs with ckpt-v15-dev
@ 2009-05-20 5:28 Oren Laadan
From: Oren Laadan @ 2009-05-20 5:28 UTC
To: Nathan Lynch; +Cc: Containers

Nathan,

Thanks for insisting on this ... I believe it's now fixed in the
ckpt-v15-dev branch.

In particular, error reporting works better, and there is a new
utility "ckptinfo" which can do basic parsing of the checkpoint image.
If given the switch '-e' it will display error strings found in the
image.

The checkpoint image format has changed so you need to pull both
linux-cr and user-cr.

Oren.

Nathan Lynch wrote:
> Last commit is ed3b275 "allow error string during checkpoint while
> holding a spinlock".
>
> # bash -c 'exec <&- >&- 2>&- ; while : ; do : ; done' &
> [1] 2269
> # ckpt $! > /tmp/bash.ckpt
>
> BUG: sleeping function called from invalid context at mm/slub.c:1595
> in_atomic(): 1, irqs_disabled(): 0, pid: 2270, name: ckpt
> 1 lock held by ckpt/2270:
> #0: (tasklist_lock){.+.+.+}, at: [<c03911e6>] tree_count_tasks+0x2a/0x2a2
> Pid: 2270, comm: ckpt Not tainted 2.6.30-rc3-00074-ged3b275 #30
> Call Trace:
> [<c024b6f9>] ? __debug_show_held_locks+0x1e/0x20
> [<c02234da>] __might_sleep+0x100/0x107
> [<c02a9372>] kmem_cache_alloc+0x35/0x11f
> [<c039100f>] ? __ckpt_generate_err+0x25/0x12b
> [<c024a9c7>] ? put_lock_stats+0x1e/0x29
> [<c039100f>] __ckpt_generate_err+0x25/0x12b
> [<c0203703>] ? ftrace_call+0x5/0x8
> [<c03911ba>] __ckpt_write_err+0x16/0x18
> [<c03912ae>] tree_count_tasks+0xf2/0x2a2
> [<c03915ae>] do_checkpoint+0x150/0x5f2
> [<c0390cd8>] ? kzalloc+0x10/0x12
> [<c0390d0f>] ? ckpt_obj_hash_alloc+0x35/0x60
> [<c039033d>] ? ckpt_ctx_alloc+0x77/0x99
> [<c0390465>] sys_checkpoint+0x6c/0x82
> [<c0202ce5>] syscall_call+0x7/0xb
> ------------[ cut here ]------------
> kernel BUG at checkpoint/checkpoint.c:136!
> invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> last sysfs file: /sys/block/sda/size
> Modules linked in:
>
> Pid: 2270, comm: ckpt Not tainted (2.6.30-rc3-00074-ged3b275 #30)
> EIP: 0060:[<c03910dc>] EFLAGS: 00010246 CPU: 0
> EIP is at __ckpt_generate_err+0xf2/0x12b
> EAX: df051300 EBX: deb72f30 ECX: df051530 EDX: 0000001c
> ESI: df051430 EDI: deb72f28 EBP: deb72f10 ESP: deb72ef8
> DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
> Process ckpt (pid: 2270, ti=deb72000 task=df9adf60 task.ti=deb72000)
> Stack:
> c072ce85 df051300 0000001c deb75600 df9ad1c0 00000000 deb72f18 c03911ba
> deb72f50 c03912ae df051300 c072ce85 000008dd df9ad4ec df051300 df9ad1c0
> 00000000 00000000 00000000 deb75600 deb75604 df051300 deb72f98 c03915ae
> Call Trace:
> [<c03911ba>] ? __ckpt_write_err+0x16/0x18
> [<c03912ae>] ? tree_count_tasks+0xf2/0x2a2
> [<c03915ae>] ? do_checkpoint+0x150/0x5f2
> [<c0390cd8>] ? kzalloc+0x10/0x12
> [<c0390d0f>] ? ckpt_obj_hash_alloc+0x35/0x60
> [<c039033d>] ? ckpt_ctx_alloc+0x77/0x99
> [<c0390465>] ? sys_checkpoint+0x6c/0x82
> [<c0202ce5>] ? syscall_call+0x7/0xb
> Code: 08 0c 8b c0 03 74 1b f6 05 c2 8f ff c0 20 74 12 f6 05 c9 8f ff c0 10 74 09 80 3d 47 94 83 c0 00 75 1d 8b 45 ec 83 78 2c 00 75 04 <0f> 0b eb fe 8b 55 ec 31 c0 89 72 2c 8d 65 f4 5b 5e 5f 5d c3 31
> EIP: [<c03910dc>] __ckpt_generate_err+0xf2/0x12b SS:ESP 0068:deb72ef8
> ---[ end trace d54433b47f0c4829 ]---
> note: ckpt[2270] exited with preempt_count 1
> BUG: scheduling while atomic: ckpt/2270/0x10000002
> INFO: lockdep is turned off.
> Modules linked in:
> Pid: 2270, comm: ckpt Tainted: G D 2.6.30-rc3-00074-ged3b275 #30
> Call Trace:
> [<c0223f6b>] __schedule_bug+0x63/0x6a
> [<c05ec7dc>] __schedule+0x8f/0x7ac
> [<c024d299>] ? print_lock_contention_bug+0x14/0xd7
> [<c0298093>] ? unmap_vmas+0x1e1/0x518
> [<c0203703>] ? ftrace_call+0x5/0x8
> [<c0203703>] ? ftrace_call+0x5/0x8
> [<c05ecf10>] schedule+0x17/0x38
> [<c0224738>] __cond_resched+0x26/0x3b
> [<c05ed034>] _cond_resched+0x2c/0x37
> [<c0298379>] unmap_vmas+0x4c7/0x518
> [<c029b81b>] exit_mmap+0x6c/0xb7
> [<c022906a>] mmput+0x3c/0x8f
> [<c022c8a0>] exit_mm+0xe3/0xeb
> [<c022e0e2>] do_exit+0x188/0x64b
> [<c05ec415>] ? printk+0x14/0x16
> [<c022b08d>] ? oops_exit+0x28/0x2d
> [<c05efbe7>] oops_end+0x92/0x9a
> [<c020560f>] die+0x59/0x5f
> [<c05ef56b>] do_trap+0x89/0xa2
> [<c02039fc>] ? do_invalid_op+0x0/0x80
> [<c0203a72>] do_invalid_op+0x76/0x80
> [<c03910dc>] ? __ckpt_generate_err+0xf2/0x12b
> [<c0203703>] ? ftrace_call+0x5/0x8
> [<c039c95d>] ? strnlen+0x8/0x1f
> [<c039b8bd>] ? string+0x34/0x82
> [<c039c14a>] ? vsnprintf+0x173/0x311
> [<c039c05a>] ? vsnprintf+0x83/0x311
> [<c039c9d0>] ? trace_hardirqs_off_thunk+0xc/0x10
> [<c05ef322>] error_code+0x72/0x78
> [<c02039fc>] ? do_invalid_op+0x0/0x80
> [<c03910dc>] ? __ckpt_generate_err+0xf2/0x12b
> [<c03911ba>] __ckpt_write_err+0x16/0x18
> [<c03912ae>] tree_count_tasks+0xf2/0x2a2
> [<c03915ae>] do_checkpoint+0x150/0x5f2
> [<c0390cd8>] ? kzalloc+0x10/0x12
> [<c0390d0f>] ? ckpt_obj_hash_alloc+0x35/0x60
> [<c039033d>] ? ckpt_ctx_alloc+0x77/0x99
> [<c0390465>] sys_checkpoint+0x6c/0x82
> [<c0202ce5>] syscall_call+0x7/0xb
>
Thread overview: 13+ messages
2009-05-18 19:23 bugs with ckpt-v15-dev Nathan Lynch
[not found] ` <m3my9amczw.fsf-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
2009-05-18 21:10 ` Serge E. Hallyn
[not found] ` <20090518211041.GA20781-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-18 21:36 ` Nathan Lynch
[not found] ` <m3y6suhz5g.fsf-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
2009-05-18 22:39 ` Serge E. Hallyn
[not found] ` <20090518223919.GA24826-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-18 23:02 ` Nathan Lynch
2009-05-18 22:51 ` Matt Helsley
[not found] ` <20090518225100.GC28083-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-18 23:21 ` Nathan Lynch
[not found] ` <m3zldagfpp.fsf-e+AXbWqSrlAAvxtiuMwx3w@public.gmane.org>
2009-05-19 1:09 ` Matt Helsley
[not found] ` <20090519010911.GD28083-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-20 5:30 ` Oren Laadan
[not found] ` <4A13955E.2040301-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2009-05-20 13:14 ` Serge E. Hallyn
[not found] ` <20090520131457.GB25989-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2009-05-20 13:21 ` Oren Laadan
2009-05-20 21:10 ` Matt Helsley
2009-05-20 5:28 ` Oren Laadan