* multi-threaded app fails to restart
@ 2010-07-19 19:36 John Paul Walters
       [not found] ` <AANLkTilxfsYGyYLwO__VmDLSFQ_s_Qe03G49kIEztVja-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: John Paul Walters @ 2010-07-19 19:36 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

I have a very simple multi-threaded application that I'm testing with,
but I'm unable to get a restart to complete.  I've tried both version
21 and version 22-dev.  I'm using a 32-bit Debian install inside a
VMware Fusion virtual machine.  The problem seems to be limited to
threads, as I'm able to checkpoint and restart the multitask test
application.  The steps that I'm executing are:

./pthread_test  &
[1] 3982

 ps -efL | grep pthread_test
jwalters  3982  3357  3982  0    2 19:21 pts/0    00:00:00 ./pthread_test
jwalters  3982  3357  3983  0    2 19:21 pts/0    00:00:00 ./pthread_test

for i in 3982 3983; do echo $i > /containers/1/tasks ; done

echo FROZEN > /containers/1/freezer.state

cat /containers/1/freezer.state
FROZEN

 ./checkpoint 3982 > checkpoint_out
(there aren't any unusual looking messages in the dmesg output at this point)

After thawing and killing off the running instance, I attempt to restart:
./restart -d < checkpoint_out
...

<4030>c/r read input 16384
<4030>c/r read input 16384
<4030>c/r read input 12789
<4030>c/r read input 0
<4029>restart succeeded
<4029>SIGCHLD: already collected
<4029>task terminated with signal 11
<4029>c/r succeeded
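
The freeze step earlier in this message (writing FROZEN to the container's freezer.state file) can also be done from C. This is a minimal sketch, assuming a cgroup-v1 freezer mounted at a path like the one used in the commands above; set_freezer_state() is a hypothetical helper name, not part of the checkpoint/restart tools.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write a state string such as "FROZEN" or "THAWED" to a freezer.state
 * file.  Returns 0 on success and -1 if the file cannot be opened or the
 * write fails.  (Hypothetical helper, for illustration only.)
 */
int set_freezer_state(const char *state_path, const char *state)
{
    FILE *f;
    size_t len, written;

    assert(state_path != NULL && state != NULL);

    f = fopen(state_path, "w");
    if (f == NULL)
        return -1;

    len = strlen(state);
    written = fwrite(state, 1, len, f);

    /* fclose() flushes the buffer, so a failed write can also show up here. */
    if (fclose(f) != 0 || written != len)
        return -1;
    return 0;
}
```

A caller would use something like set_freezer_state("/containers/1/freezer.state", "FROZEN") before the checkpoint and pass "THAWED" afterwards.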

The tail end of the syslog also contains:
[ 3210.327177] [4029:4029:c/r:do_restart:1451] sys_restart returns 0
[ 3210.327190] [4033:4033:c/r:wait_task_sync:919] task sync done (errno 0)
[ 3210.327192] [4033:4033:c/r:clear_task_ctx:852] task 4033 clear checkpoint_ctx
[ 3210.327194] [4033:4033:c/r:do_restart:1451] sys_restart returns -516
[ 3210.327227] pthread_test[4033]: segfault at b781f424 ip b781f424 sp b75cc1c0 error 4
[ 3210.330254] [4031:4031:c/r:wait_task_sync:919] task sync done (errno 0)
[ 3210.330257] [4031:4031:c/r:clear_task_ctx:852] task 4031 clear checkpoint_ctx
[ 3210.330259] [4031:4031:c/r:restore_debug_free:144] 4 tasks registered, nr_tasks was 0 nr_total 0
[ 3210.330261] [4031:4031:c/r:restore_debug_free:147] active pid was 2, ctx->errno 0
[ 3210.330263] [4031:4031:c/r:restore_debug_free:149] kflags 22 uflags 0 oflags 1
[ 3210.330265] [4031:4031:c/r:restore_debug_free:151] task[0] to run 4031
[ 3210.330267] [4031:4031:c/r:restore_debug_free:151] task[1] to run 4033
[ 3210.330269] [4031:4031:c/r:restore_debug_free:176] pid 4029 type Coord state Success
[ 3210.330272] [4031:4031:c/r:restore_debug_free:176] pid 4031 type Root state Success
[ 3210.330274] [4031:4031:c/r:restore_debug_free:176] pid 4033 type Task state Success
[ 3210.330276] [4031:4031:c/r:restore_debug_free:176] pid 4032 type Ghost state Success
[ 3210.330285] [4031:4031:c/r:pgarr_release_pages:102] total pages 0
[ 3210.330288] [4031:4031:c/r:do_restart:1451] sys_restart returns -512

Any thoughts?

best regards,
JP


* Re: multi-threaded app fails to restart
       [not found] ` <AANLkTilxfsYGyYLwO__VmDLSFQ_s_Qe03G49kIEztVja-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-07-19 19:54   ` Nathan Lynch
  2010-07-19 20:27     ` John Paul Walters
  0 siblings, 1 reply; 15+ messages in thread
From: Nathan Lynch @ 2010-07-19 19:54 UTC (permalink / raw)
  To: John Paul Walters; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Mon, 2010-07-19 at 15:36 -0400, John Paul Walters wrote:
> [ 3210.327227] pthread_test[4033]: segfault at b781f424 ip b781f424 sp
> b75cc1c0 error 4
> [ 3210.330254] [4031:4031:c/r:wait_task_sync:919] task sync done (errno 0)
> [ 3210.330257] [4031:4031:c/r:clear_task_ctx:852] task 4031 clear checkpoint_ctx
> [ 3210.330259] [4031:4031:c/r:restore_debug_free:144] 4 tasks
> registered, nr_tasks was 0 nr_total 0
> [ 3210.330261] [4031:4031:c/r:restore_debug_free:147] active pid was
> 2, ctx->errno 0
> [ 3210.330263] [4031:4031:c/r:restore_debug_free:149] kflags 22 uflags
> 0 oflags 1
> [ 3210.330265] [4031:4031:c/r:restore_debug_free:151] task[0] to run 4031
> [ 3210.330267] [4031:4031:c/r:restore_debug_free:151] task[1] to run 4033
> [ 3210.330269] [4031:4031:c/r:restore_debug_free:176] pid 4029 type
> Coord state Success
> [ 3210.330272] [4031:4031:c/r:restore_debug_free:176] pid 4031 type
> Root state Success
> [ 3210.330274] [4031:4031:c/r:restore_debug_free:176] pid 4033 type
> Task state Success
> [ 3210.330276] [4031:4031:c/r:restore_debug_free:176] pid 4032 type
> Ghost state Success
> [ 3210.330285] [4031:4031:c/r:pgarr_release_pages:102] total pages 0
> [ 3210.330288] [4031:4031:c/r:do_restart:1451] sys_restart returns -512
> 
> Any thoughts?

There were two patches posted to the containers list on 11 July - "fix
task tree traversal for threads" and "save/restore 'sysenter_return' for
threads".  Can you try with those on top of ckpt-v22-dev?


* Re: multi-threaded app fails to restart
  2010-07-19 19:54   ` Nathan Lynch
@ 2010-07-19 20:27     ` John Paul Walters
       [not found]       ` <AANLkTimpXSXQr1wew1wvZKnBFsOXD7f2tblY4EGmJoFM-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: John Paul Walters @ 2010-07-19 20:27 UTC (permalink / raw)
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

>> Ghost state Success
>> [ 3210.330285] [4031:4031:c/r:pgarr_release_pages:102] total pages 0
>> [ 3210.330288] [4031:4031:c/r:do_restart:1451] sys_restart returns -512
>>
>> Any thoughts?
>
> There were two patches posted to the containers list on 11 July - "fix
> task tree traversal for threads" and "save/restore 'sysenter_return' for
> threads".  Can you try with those on top of ckpt-v22-dev?
>
>
>

Hi Nathan,

Thanks for your help.  I applied the two patches as you suggested.
They fixed the first of the two bad sys_restart return values, but the
final one (quoted above, for what it's worth) still returns -512.
When I use the -d -v switches to restart, it appears to work (no error
messages are returned), but only the main thread is restored while the
second thread is not.

JP


* Re: multi-threaded app fails to restart
       [not found]       ` <AANLkTimpXSXQr1wew1wvZKnBFsOXD7f2tblY4EGmJoFM-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-07-20  3:24         ` Oren Laadan
       [not found]           ` <4C4516DD.1000809-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Oren Laadan @ 2010-07-20  3:24 UTC (permalink / raw)
  To: John Paul Walters; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA


On 07/19/2010 04:27 PM, John Paul Walters wrote:
>>> Ghost state Success
>>> [ 3210.330285] [4031:4031:c/r:pgarr_release_pages:102] total pages 0
>>> [ 3210.330288] [4031:4031:c/r:do_restart:1451] sys_restart returns -512
>>>
>>> Any thoughts?
>>
>> There were two patches posted to the containers list on 11 July - "fix
>> task tree traversal for threads" and "save/restore 'sysenter_return' for
>> threads".  Can you try with those on top of ckpt-v22-dev?
>>
>>
>>
> 
> Hi Nathan,
> 
> Thanks for your help.  I applied the two patches as you suggested.
> They fixed the first of the two bad sys_restart return values, but the
> final one (quoted above, for what it's worth) still returns -512.
> When I use the -d -v switches to restart, it appears to work (no error
> messages are returned), but only the main thread is restored while the
> second thread is not.

Hi John,

I just pushed a few more fixes related to signals to ckpt-v22-dev.
Can you please see if they fix your problem ?

Also, can you please post the test program that you are using, so
we can try to replicate the problem ?

Note that it is usually ok for sys_restart() to return -512 -- it
means that the process/thread was interrupted in the middle of a
system call when the checkpoint was taken, and it will now retry the
same syscall from that point.
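
For anyone decoding the numbers in the logs in this thread: as far as I know, -512, -516 and -4 correspond to ERESTARTSYS, ERESTART_RESTARTBLOCK and EINTR. The first two are kernel-internal codes from include/linux/errno.h that are not exported to userspace headers, so the sketch below redefines them under hypothetical CR_ names. cr_retval_name() is an illustrative helper, not part of the checkpoint/restart tools.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Kernel-internal restart codes, with the values used in the kernel's
 * include/linux/errno.h.  Userspace headers do not define them, so they
 * are redefined here under made-up CR_ names (an assumption of this sketch).
 */
#define CR_ERESTARTSYS           512
#define CR_ERESTARTNOINTR        513
#define CR_ERESTARTNOHAND        514
#define CR_ERESTART_RESTARTBLOCK 516

/* Map a sys_restart() return value from the logs to a symbolic name. */
const char *cr_retval_name(long ret)
{
    if (ret >= 0)
        return "success";

    switch (-ret) {
    case EINTR:                    return "EINTR";
    case CR_ERESTARTSYS:           return "ERESTARTSYS";
    case CR_ERESTARTNOINTR:        return "ERESTARTNOINTR";
    case CR_ERESTARTNOHAND:        return "ERESTARTNOHAND";
    case CR_ERESTART_RESTARTBLOCK: return "ERESTART_RESTARTBLOCK";
    default:                       return "unknown";
    }
}
```

For example, cr_retval_name(-512) yields "ERESTARTSYS" and cr_retval_name(-4) yields "EINTR".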

You can use the -F (--freezer) switch of restart(1) to freeze the
restarted tasks/threads before they are allowed to run in userspace.
Using it you can tell whether the other thread dies immediately
after restart, or is not at all restarted.
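
One way to make that check from userspace, while the restarted tasks are still frozen, is to count the entries under /proc/<pid>/task, which lists one directory per thread. The sketch below assumes a Linux /proc filesystem; count_threads() is a hypothetical helper name used only for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Count the threads of a process by listing /proc/<pid>/task.
 * Returns the number of thread entries, or -1 if the directory
 * cannot be opened (no such process, or no /proc available).
 */
int count_threads(pid_t pid)
{
    char path[64];
    DIR *dir;
    struct dirent *de;
    int n = 0;

    snprintf(path, sizeof(path), "/proc/%ld/task", (long) pid);

    dir = opendir(path);
    if (dir == NULL)
        return -1;

    /* Thread entries are named by numeric TID; skip "." and "..". */
    while ((de = readdir(dir)) != NULL) {
        if (isdigit((unsigned char) de->d_name[0]))
            n++;
    }

    closedir(dir);
    return n;
}
```

For the two-thread test case discussed here, a result of 2 for the restarted root pid would suggest both threads were recreated, while 1 would suggest the second thread was never restarted.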

Thanks,

Oren.


* Re: multi-threaded app fails to restart
       [not found]           ` <4C4516DD.1000809-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2010-07-20 18:58             ` John Paul Walters
       [not found]               ` <AANLkTimPENgm-LSh6iMv2uxegRdHEivbGMTYmEfiOEJG-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: John Paul Walters @ 2010-07-20 18:58 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

>
> Hi John,
>
> I just pushed a few more fixes related to signals to ckpt-v22-dev.
> Can you please see if they fix your problem ?
>
> Also, can you please post the test program that you are using, so
> we can try to replicate the problem ?
>
> Note that it is usually ok for sys_restart() to return -512 -- it
> means that the process/thread was interrupted in the middle of a
> system call when the checkpoint was taken, and it will now retry the
> same syscall from that point.
>
> You can use the -F (--freezer) switch of restart(1) to freeze the
> restarted tasks/threads before they are allowed to run in userspace.
> Using it you can tell whether the other thread dies immediately
> after restart, or is not at all restarted.
>
> Thanks,
>
> Oren.
>

Hi Oren,

I grabbed the most recent v22-dev that includes the updates.  I'm
still experiencing the same issue.  Testing with -F indicates that the
second thread isn't being restarted.  The code that I'm using is:

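#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/syscall.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

#define OUTFILE "/tmp/cr-self.out"

/* Worker thread: write an incrementing counter to OUTFILE every 2 seconds. */
void *
func (void *arg)
{
  FILE *file;
  int counter = 0;

  (void) arg;

  file = fopen(OUTFILE, "w+");
  if (file == NULL)
    return NULL;  /* stderr is already closed, so fail silently */

  while (1) {
    sleep(2);
    counter++;
    fprintf(file, "Count %d\n", counter);
    fflush(file);
  }

  return NULL;
}

int
main (int argc, char **argv)
{
  pthread_t thread;

  (void) argc;
  (void) argv;

  /* Close the inherited stdin, stdout and stderr. */
  close (0);
  close (1);
  close (2);
  unlink (OUTFILE);

  if (pthread_create(&thread, NULL, func, NULL) != 0)
    return EXIT_FAILURE;
  pthread_join(thread, NULL);
  return 0;
}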

Thanks for your help,
JP


* Re: multi-threaded app fails to restart
       [not found]               ` <AANLkTimPENgm-LSh6iMv2uxegRdHEivbGMTYmEfiOEJG-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-07-20 23:12                 ` Oren Laadan
       [not found]                   ` <Pine.LNX.4.64.1007201906370.15255-CXF6herHY6ykSYb+qCZC/1i27PF6R63G9nwVQlTi/Pw@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Oren Laadan @ 2010-07-20 23:12 UTC (permalink / raw)
  To: John Paul Walters; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA


Hi John

In your program, it is a thread of the root task (of the hierarchy)
that is missed. Indeed, the previous patch was incomplete - it fixed
the non-root-threads case but broke the root-threads case.
That was silly... well, can you try this little patch:

Thanks for following up, it was very helpful!

Oren.

---
diff --git a/kernel/checkpoint/sys.c b/kernel/checkpoint/sys.c
index 171c867..3288af0 100644
--- a/kernel/checkpoint/sys.c
+++ b/kernel/checkpoint/sys.c
@@ -605,13 +605,13 @@ int walk_task_subtree(struct task_struct *root,
 			continue;
 		}
 
+		/* if not last thread - proceed with thread */
+		task = next_thread(task);
+		if (!thread_group_leader(task))
+			continue;
+
 		/* by definition, skip siblings of root */
 		while (task != root) {
-			/* if not last thread - proceed with thread */
-			task = next_thread(task);
-			if (!thread_group_leader(task))
-				break;
-
 			/* if has sibling - proceed with sibling */
 			if (!list_is_last(&task->sibling, &parent->children)) {
 				task = list_entry(task->sibling.next,
---
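
As I read the diff, the thread step (next_thread() followed by the thread_group_leader() test) moves out of the "while (task != root)" loop, so it also runs when the current task is the root itself. The toy model below only illustrates that walk over a single thread group in plain userspace C. struct toy_task and visit_thread_group() are made-up names, and the circular next_thread pointer is an assumption standing in for the kernel's thread list.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a task: threads of one group form a circular list. */
struct toy_task {
    int pid;
    int is_group_leader;
    struct toy_task *next_thread;
};

/*
 * Visit every thread in the group, starting at the leader and following
 * next_thread until the walk wraps around to the leader again.  Stores up
 * to max pids in pids[] and returns the number of threads seen.
 */
int visit_thread_group(struct toy_task *leader, int *pids, int max)
{
    struct toy_task *t = leader;
    int n = 0;

    assert(leader != NULL && leader->is_group_leader);

    do {
        if (n < max)
            pids[n] = t->pid;
        n++;
        t = t->next_thread;          /* "proceed with thread" */
    } while (!t->is_group_leader);   /* back at the leader: group is done */

    return n;
}
```

With a leader 3982 and one extra thread 3983 linked in a circle, the walk reports two tasks, which matches the two entries in the ps -efL output at the start of the thread.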


* Re: multi-threaded app fails to restart
       [not found]                   ` <Pine.LNX.4.64.1007201906370.15255-CXF6herHY6ykSYb+qCZC/1i27PF6R63G9nwVQlTi/Pw@public.gmane.org>
@ 2010-07-21  0:03                     ` John Paul Walters
       [not found]                       ` <AANLkTinZYiWPtSegjRJWnlc6hipFAZyujr8-2ug6ettF-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: John Paul Walters @ 2010-07-21  0:03 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Tue, Jul 20, 2010 at 7:12 PM, Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> wrote:
>
> Hi John
>
> In your program, it is a thread of the root task (of the hierarchy)
> that is missed. Indeed the previous patch was incomplete - it did
> fix the non-root-threads case but spoiled the root-threads case.
> That was silly... well, can you try this little patch:
>
> Thanks for following up, was very helpful !
>
> Oren.

Hi Oren,

I'm still unable to fully restart the application with your patch, but
the result is now different.  If I attempt to restart using  --pidns
and -F, both threads are created and frozen.  However, as soon as I
thaw them I get a segfault.  If I attempt to restart them without the
--pidns option, I get a message from restart indicating that it's
about to call sys_restart and restart hangs.  I also have the
following in my syslog:


[ 1482.348060] [3753:3753:c/r:walk_task_subtree:633] total 2 ret 1
[ 1482.348060] [3753:3753:c/r:prepare_descendants:1148] nr 2/2
[ 1482.348060] [3753:3753:c/r:do_restore_coord:1320] restore prepare: 2
[ 1541.864073] [err -512][pos 419][E @ do_ghost_task:973]ghost restart failed
[ 1541.864343] [err -512][pos 419][E @ do_restore_task:1084]task restart failed
[ 1541.864346] [3755:3755:c/r:clear_task_ctx:852] task 3755 clear checkpoint_ctx
[ 1541.864349] [3755:3755:c/r:do_restart:1444] restart err -4, exiting
[ 1541.864352] [3755:3755:c/r:do_restart:1451] sys_restart returns -4
[ 1541.864366] [3757:3757:c/r:wait_checkpoint_ctx:938] wait_checkpoint_ctx: failed (-512)
[ 1541.864368] [3757:3757:c/r:do_restart:1444] restart err -4, exiting
[ 1541.864371] [3757:3757:c/r:do_restart:1451] sys_restart returns -4
[ 1541.864689] [3753:3753:c/r:wait_all_tasks_finish:1173] final sync kflags 0x1a (ret 0)
[ 1541.864692] [3753:3753:c/r:do_restore_coord:1325] restore finish: 0
[ 1541.864694] [3753:3753:c/r:do_restore_coord:1331] restore deferqueue: 0
[ 1541.864698] [err -512][pos 419][E @ ckpt_read_obj_type:426]Expecting to read type 9001
[ 1541.864700] [3753:3753:c/r:do_restore_coord:1336] restore tail: -512
[ 1541.864703] [err -512][pos 419][E @ do_restore_coord:1350]restart failed (coordinator)
[ 1541.864706] [3753:3753:c/r:walk_task_subtree:633] total 0 ret 0
[ 1541.864709] [3753:3753:c/r:clear_task_ctx:852] task 3753 clear checkpoint_ctx
[ 1541.864715] [3753:3753:c/r:do_restart:1451] sys_restart returns -4
[ 1541.864718] [3753:3753:c/r:restore_debug_free:144] 3 tasks registered, nr_tasks was 0 nr_total 1
[ 1541.864721] [3753:3753:c/r:restore_debug_free:147] active pid was 0, ctx->errno -512
[ 1541.864723] [3753:3753:c/r:restore_debug_free:149] kflags 26 uflags 0 oflags 1
[ 1541.864726] [3753:3753:c/r:restore_debug_free:151] task[0] to run 3755
[ 1541.864728] [3753:3753:c/r:restore_debug_free:151] task[1] to run 3757
[ 1541.864731] [3753:3753:c/r:restore_debug_free:176] pid 3753 type Coord state Failed
[ 1541.864735] [3753:3753:c/r:restore_debug_free:176] pid 3755 type Root state Failed
[ 1541.864737] [3753:3753:c/r:restore_debug_free:176] pid 3756 type Ghost state Failed

thanks,
JP


* Re: multi-threaded app fails to restart
       [not found]                       ` <AANLkTinZYiWPtSegjRJWnlc6hipFAZyujr8-2ug6ettF-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-07-21  5:54                         ` Oren Laadan
       [not found]                           ` <Pine.LNX.4.64.1007210143120.22870-CXF6herHY6ykSYb+qCZC/1i27PF6R63G9nwVQlTi/Pw@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Oren Laadan @ 2010-07-21  5:54 UTC (permalink / raw)
  To: John Paul Walters; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Tue, 20 Jul 2010, John Paul Walters wrote:

> On Tue, Jul 20, 2010 at 7:12 PM, Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> wrote:
> >
> > Hi John
> >
> > In your program, it is a thread of the root task (of the hierarchy)
> > that is missed. Indeed the previous patch was incomplete - it did
> > fix the non-root-threads case but spoiled the root-threads case.
> > That was silly... well, can you try this little patch:
> >
> > Thanks for following up, was very helpful !
> >
> > Oren.
> 
> Hi Oren,
> 
> I'm still unable to fully restart the application with your patch, but
> the result is now different.  If I attempt to restart using  --pidns
> and -F, both threads are created and frozen.  However, as soon as I
> thaw them I get a segfault.  If I attempt to restart them without the
> --pidns option, I get a message from restart indicating that it's
> about to call sys_restart and restart hangs.  I also have the
> following in my syslog:

Hi John,

I assume the log below is for the --no-pidns case, right ?
Can you also post the output of 'restart -vd ...' ?
(Unfortunately I won't have a chance to try it until the weekend)

Thanks,

Oren.

> 
> 
> [ 1482.348060] [3753:3753:c/r:walk_task_subtree:633] total 2 ret 1
> [ 1482.348060] [3753:3753:c/r:prepare_descendants:1148] nr 2/2
> [ 1482.348060] [3753:3753:c/r:do_restore_coord:1320] restore prepare: 2
> [ 1541.864073] [err -512][pos 419][E @ do_ghost_task:973]ghost restart failed
> [ 1541.864343] [err -512][pos 419][E @ do_restore_task:1084]task restart failed
> [ 1541.864346] [3755:3755:c/r:clear_task_ctx:852] task 3755 clear checkpoint_ctx
> [ 1541.864349] [3755:3755:c/r:do_restart:1444] restart err -4, exiting
> [ 1541.864352] [3755:3755:c/r:do_restart:1451] sys_restart returns -4
> [ 1541.864366] [3757:3757:c/r:wait_checkpoint_ctx:938]
> wait_checkpoint_ctx: failed (-512)
> [ 1541.864368] [3757:3757:c/r:do_restart:1444] restart err -4, exiting
> [ 1541.864371] [3757:3757:c/r:do_restart:1451] sys_restart returns -4
> [ 1541.864689] [3753:3753:c/r:wait_all_tasks_finish:1173] final sync
> kflags 0x1a (ret 0)
> [ 1541.864692] [3753:3753:c/r:do_restore_coord:1325] restore finish: 0
> [ 1541.864694] [3753:3753:c/r:do_restore_coord:1331] restore deferqueue: 0
> [ 1541.864698] [err -512][pos 419][E @
> ckpt_read_obj_type:426]Expecting to read type 9001
> [ 1541.864700] [3753:3753:c/r:do_restore_coord:1336] restore tail: -512
> [ 1541.864703] [err -512][pos 419][E @ do_restore_coord:1350]restart
> failed (coordinator)
> [ 1541.864706] [3753:3753:c/r:walk_task_subtree:633] total 0 ret 0
> [ 1541.864709] [3753:3753:c/r:clear_task_ctx:852] task 3753 clear checkpoint_ctx
> [ 1541.864715] [3753:3753:c/r:do_restart:1451] sys_restart returns -4
> [ 1541.864718] [3753:3753:c/r:restore_debug_free:144] 3 tasks
> registered, nr_tasks was 0 nr_total 1
> [ 1541.864721] [3753:3753:c/r:restore_debug_free:147] active pid was
> 0, ctx->errno -512
> [ 1541.864723] [3753:3753:c/r:restore_debug_free:149] kflags 26 uflags
> 0 oflags 1
> [ 1541.864726] [3753:3753:c/r:restore_debug_free:151] task[0] to run 3755
> [ 1541.864728] [3753:3753:c/r:restore_debug_free:151] task[1] to run 3757
> [ 1541.864731] [3753:3753:c/r:restore_debug_free:176] pid 3753 type
> Coord state Failed
> [ 1541.864735] [3753:3753:c/r:restore_debug_free:176] pid 3755 type
> Root state Failed
> [ 1541.864737] [3753:3753:c/r:restore_debug_free:176] pid 3756 type
> Ghost state Failed
> 
> thanks,
> JP
> 
> >
> > ---
> > diff --git a/kernel/checkpoint/sys.c b/kernel/checkpoint/sys.c
> > index 171c867..3288af0 100644
> > --- a/kernel/checkpoint/sys.c
> > +++ b/kernel/checkpoint/sys.c
> > @@ -605,13 +605,13 @@ int walk_task_subtree(struct task_struct *root,
> >                        continue;
> >                }
> >
> > +               /* if not last thread - proceed with thread */
> > +               task = next_thread(task);
> > +               if (!thread_group_leader(task))
> > +                       continue;
> > +
> >                /* by definition, skip siblings of root */
> >                while (task != root) {
> > -                       /* if not last thread - proceed with thread */
> > -                       task = next_thread(task);
> > -                       if (!thread_group_leader(task))
> > -                               break;
> > -
> >                        /* if has sibling - proceed with sibling */
> >                        if (!list_is_last(&task->sibling, &parent->children)) {
> >                                task = list_entry(task->sibling.next,
> > ---
> 
> 


* Re: multi-threaded app fails to restart
       [not found]                           ` <Pine.LNX.4.64.1007210143120.22870-CXF6herHY6ykSYb+qCZC/1i27PF6R63G9nwVQlTi/Pw@public.gmane.org>
@ 2010-07-21 12:52                             ` John Paul Walters
       [not found]                               ` <AANLkTinOFIzK8RZnp9NHouKv-WA7Omr08pPTGfrfVLfP-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: John Paul Walters @ 2010-07-21 12:52 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

>>
>> Hi Oren,
>>
>> I'm still unable to fully restart the application with your patch, but
>> the result is now different.  If I attempt to restart using  --pidns
>> and -F, both threads are created and frozen.  However, as soon as I
>> thaw them I get a segfault.  If I attempt to restart them without the
>> --pidns option, I get a message from restart indicating that it's
>> about to call sys_restart and restart hangs.  I also have the
>> following in my syslog:
>
> Hi John,
>
> I assume the log below is for the --no-pidns case, right ?
> Can you also post the output of 'restart -vd ...' ?
> (Unfortunately I won't have a chance to try it until the weekend)
>

Hi Oren,

That's correct, the original log was for the --no-pidns case.  Below
I've included the restart log up to the point where it hangs at
sys_restart.  Thanks again for all of your help.

best,
JP

./restart -v -d --no-pidns < checkpoint_out
<4124>number of tasks: 2
<4124>number of vpids: 0
<4124>total tasks (including ghosts): 3
<4124>pid 3583: thread tgid 3582
<4124>pid 3583: creator set to 3582
<4124>pid 1: propagate session 3582
<4124>pid 1: creator set to 3582
<4124>pid 1: set session
<4124>pid 1: moving up to 3582
<4124>====== TASKS
<4124>	[0] pid 3582 ppid 3349 sid 0 creator 0
<4124>	[1] pid 3583 ppid 3349 sid 0 creator 3582 prev 1 T
<4124>	[2] pid 1 ppid 3582 sid 3582 creator 3582 next 3583   S G
<4124>............
<4124>task[0].vidx = -1
<4124>task[1].vidx = -1
<4124>subtree (existing pidns)
<4124>forking child vpid 3582 flags 0x1
<4124>task 3582 forking with flags 11 numpids 1
<4124>task 3582 pid[0]=0
<4124>forked child vpid 4126 (asked 3582)
<4126>root task pid 4126
<4126>pid 3582: pid 4126 sid 3386 parent 4124
<4126>pid 3582: fork child 1 with session
<4126>forking child vpid 1 flags 0x12
<4126>task 1 forking with flags 11 numpids 1
<4126>task 1 pid[0]=0
<4126>forked child vpid 4127 (asked 1)
<4126>pid 3582: fork child 3583 without session
<4126>forking child vpid 3583 flags 0x4
<4126>task 3583 forking with flags 10911 numpids 1
<4126>task 3583 pid[0]=0
<4126>forked child vpid 4128 (asked 3583)
<4126>about to call sys_restart(), flags 0
<4125>====== PIDS ARRAY
<4125>[0] pid 3582 ppid 1 sid 1 pgid 3582
<4125>[1] pid 3583 ppid 1 sid 1 pgid 3582
<4125>............
<4125>c/r swap old 3582 new 4126
<4128>pid 3583: pid 4128 sid 3386 parent 4124
<4128>about to call sys_restart(), flags 0
<4125>c/r swap old 3583 new 4128
<4127>pid 1: pid 4127 sid 3386 parent 4126
<4125>c/r swap old 1 new 4127
<4125>====== PIDS ARRAY (swaped)
<4125>[0] pid 4126 ppid 1 sid 4127 pgid 4126
<4125>[1] pid 4128 ppid 1 sid 4127 pgid 4126
<4125>............
<4125>c/r read input 16384
<4127>about to call sys_restart(), flags 0x4
<4125>c/r read input 16384
<4125>c/r read input 16384
<4125>c/r read input 16384
<4125>c/r read input 16384







* Re: multi-threaded app fails to restart
       [not found]                               ` <AANLkTinOFIzK8RZnp9NHouKv-WA7Omr08pPTGfrfVLfP-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-07-22  1:04                                 ` Oren Laadan
       [not found]                                   ` <Pine.LNX.4.64.1007212102010.6257-CXF6herHY6ykSYb+qCZC/1i27PF6R63G9nwVQlTi/Pw@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Oren Laadan @ 2010-07-22  1:04 UTC (permalink / raw)
  To: John Paul Walters; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Hi John,

This is a bit embarrassing, the behavior sounds too familiar --
please try the following patch:

--
diff --git a/arch/x86/kernel/checkpoint.c b/arch/x86/kernel/checkpoint.c
index 3fb9deb..b770f70 100644
--- a/arch/x86/kernel/checkpoint.c
+++ b/arch/x86/kernel/checkpoint.c
@@ -104,7 +104,7 @@ int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t)
 	h->gdt_entry_tls_entries = GDT_ENTRY_TLS_ENTRIES;
 	h->sizeof_tls_array = tls_size;
 	h->sysenter_return = (__u64) (unsigned long)
-		task_thread_info(current)->sysenter_return;
+		task_thread_info(t)->sysenter_return;
 
 	/* For simplicity dump the entire array */
 	memcpy(h + 1, t->thread.tls_array, tls_size);
--
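
For readers of the archive: the one-line change above reads
sysenter_return from the task being checkpointed (t) rather than from
the task that is performing the checkpoint (current). A minimal
userspace sketch of that class of bug follows. It is only an
illustration under stated assumptions: toy_task, toy_thread_info,
toy_current, save_buggy, save_fixed and toy_demo_mismatch are
hypothetical stand-ins, not kernel code, and the addresses are made up.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's thread_info / task_struct. */
struct toy_thread_info {
	unsigned long sysenter_return;
};

struct toy_task {
	int pid;
	struct toy_thread_info info;
};

/* Stand-in for the kernel's `current`: the task running the checkpoint. */
static struct toy_task *toy_current;

/* Buggy variant: ignores t and records the checkpointer's own value. */
unsigned long save_buggy(const struct toy_task *t)
{
	assert(t != NULL);
	return toy_current->info.sysenter_return;
}

/* Fixed variant: records the value that belongs to the saved task. */
unsigned long save_fixed(const struct toy_task *t)
{
	assert(t != NULL);
	return t->info.sysenter_return;
}

/*
 * Returns 1 when the buggy variant saves the wrong value for the target
 * while the fixed variant saves the right one.
 */
int toy_demo_mismatch(void)
{
	struct toy_task checkpointer = { 100, { 0xaaaa0000UL } };
	struct toy_task target = { 200, { 0xbbbb0000UL } };

	toy_current = &checkpointer;
	return save_buggy(&target) != target.info.sysenter_return &&
	       save_fixed(&target) == target.info.sysenter_return;
}
```

Called from a small main(), toy_demo_mismatch() returns 1: whenever the
two tasks hold different values, the buggy variant stores the
checkpointer's value in the image instead of the target's.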

On Wed, 21 Jul 2010, John Paul Walters wrote:

> >>
> >> Hi Oren,
> >>
> >> I'm still unable to fully restart the application with your patch, but
> >> the result is now different.  If I attempt to restart using  --pidns
> >> and -F, both threads are created and frozen.  However, as soon as I
> >> thaw them I get a segfault.  If I attempt to restart them without the
> >> --pidns option, I get a message from restart indicating that it's
> >> about to call sys_restart and restart hangs.  I also have the
> >> following in my syslog:
> >
> > Hi John,
> >
> > I assume the log below is for the --no-pidns case, right ?
> > Can you also post the output of 'restart -vd ...' ?
> > (Unfortunately I won't have a chance to try it until the weekend)
> >
> 
> Hi Oren,
> 
> That's correct, the original log was for the --no-pidns case.  Below
> I've included the restart log up to the point where it hangs at
> sys_restart.  Thanks again for all of your help.
> 
> best,
> JP
> 
> ./restart -v -d --no-pidns < checkpoint_out
> <4124>number of tasks: 2
> <4124>number of vpids: 0
> <4124>total tasks (including ghosts): 3
> <4124>pid 3583: thread tgid 3582
> <4124>pid 3583: creator set to 3582
> <4124>pid 1: propagate session 3582
> <4124>pid 1: creator set to 3582
> <4124>pid 1: set session
> <4124>pid 1: moving up to 3582
> <4124>====== TASKS
> <4124>	[0] pid 3582 ppid 3349 sid 0 creator 0
> <4124>	[1] pid 3583 ppid 3349 sid 0 creator 3582 prev 1 T
> <4124>	[2] pid 1 ppid 3582 sid 3582 creator 3582 next 3583   S G
> <4124>............
> <4124>task[0].vidx = -1
> <4124>task[1].vidx = -1
> <4124>subtree (existing pidns)
> <4124>forking child vpid 3582 flags 0x1
> <4124>task 3582 forking with flags 11 numpids 1
> <4124>task 3582 pid[0]=0
> <4124>forked child vpid 4126 (asked 3582)
> <4126>root task pid 4126
> <4126>pid 3582: pid 4126 sid 3386 parent 4124
> <4126>pid 3582: fork child 1 with session
> <4126>forking child vpid 1 flags 0x12
> <4126>task 1 forking with flags 11 numpids 1
> <4126>task 1 pid[0]=0
> <4126>forked child vpid 4127 (asked 1)
> <4126>pid 3582: fork child 3583 without session
> <4126>forking child vpid 3583 flags 0x4
> <4126>task 3583 forking with flags 10911 numpids 1
> <4126>task 3583 pid[0]=0
> <4126>forked child vpid 4128 (asked 3583)
> <4126>about to call sys_restart(), flags 0
> <4125>====== PIDS ARRAY
> <4125>[0] pid 3582 ppid 1 sid 1 pgid 3582
> <4125>[1] pid 3583 ppid 1 sid 1 pgid 3582
> <4125>............
> <4125>c/r swap old 3582 new 4126
> <4128>pid 3583: pid 4128 sid 3386 parent 4124
> <4128>about to call sys_restart(), flags 0
> <4125>c/r swap old 3583 new 4128
> <4127>pid 1: pid 4127 sid 3386 parent 4126
> <4125>c/r swap old 1 new 4127
> <4125>====== PIDS ARRAY (swaped)
> <4125>[0] pid 4126 ppid 1 sid 4127 pgid 4126
> <4125>[1] pid 4128 ppid 1 sid 4127 pgid 4126
> <4125>............
> <4125>c/r read input 16384
> <4127>about to call sys_restart(), flags 0x4
> <4125>c/r read input 16384
> <4125>c/r read input 16384
> <4125>c/r read input 16384
> <4125>c/r read input 16384
> 
> 
> 
> 
> 
> 
> > Thanks,
> >
> > Oren.
> >
> >>
> >>
> >> [ 1482.348060] [3753:3753:c/r:walk_task_subtree:633] total 2 ret 1
> >> [ 1482.348060] [3753:3753:c/r:prepare_descendants:1148] nr 2/2
> >> [ 1482.348060] [3753:3753:c/r:do_restore_coord:1320] restore prepare: 2
> >> [ 1541.864073] [err -512][pos 419][E @ do_ghost_task:973]ghost restart failed
> >> [ 1541.864343] [err -512][pos 419][E @ do_restore_task:1084]task restart failed
> >> [ 1541.864346] [3755:3755:c/r:clear_task_ctx:852] task 3755 clear checkpoint_ctx
> >> [ 1541.864349] [3755:3755:c/r:do_restart:1444] restart err -4, exiting
> >> [ 1541.864352] [3755:3755:c/r:do_restart:1451] sys_restart returns -4
> >> [ 1541.864366] [3757:3757:c/r:wait_checkpoint_ctx:938]
> >> wait_checkpoint_ctx: failed (-512)
> >> [ 1541.864368] [3757:3757:c/r:do_restart:1444] restart err -4, exiting
> >> [ 1541.864371] [3757:3757:c/r:do_restart:1451] sys_restart returns -4
> >> [ 1541.864689] [3753:3753:c/r:wait_all_tasks_finish:1173] final sync
> >> kflags 0x1a (ret 0)
> >> [ 1541.864692] [3753:3753:c/r:do_restore_coord:1325] restore finish: 0
> >> [ 1541.864694] [3753:3753:c/r:do_restore_coord:1331] restore deferqueue: 0
> >> [ 1541.864698] [err -512][pos 419][E @
> >> ckpt_read_obj_type:426]Expecting to read type 9001
> >> [ 1541.864700] [3753:3753:c/r:do_restore_coord:1336] restore tail: -512
> >> [ 1541.864703] [err -512][pos 419][E @ do_restore_coord:1350]restart
> >> failed (coordinator)
> >> [ 1541.864706] [3753:3753:c/r:walk_task_subtree:633] total 0 ret 0
> >> [ 1541.864709] [3753:3753:c/r:clear_task_ctx:852] task 3753 clear checkpoint_ctx
> >> [ 1541.864715] [3753:3753:c/r:do_restart:1451] sys_restart returns -4
> >> [ 1541.864718] [3753:3753:c/r:restore_debug_free:144] 3 tasks
> >> registered, nr_tasks was 0 nr_total 1
> >> [ 1541.864721] [3753:3753:c/r:restore_debug_free:147] active pid was
> >> 0, ctx->errno -512
> >> [ 1541.864723] [3753:3753:c/r:restore_debug_free:149] kflags 26 uflags
> >> 0 oflags 1
> >> [ 1541.864726] [3753:3753:c/r:restore_debug_free:151] task[0] to run 3755
> >> [ 1541.864728] [3753:3753:c/r:restore_debug_free:151] task[1] to run 3757
> >> [ 1541.864731] [3753:3753:c/r:restore_debug_free:176] pid 3753 type
> >> Coord state Failed
> >> [ 1541.864735] [3753:3753:c/r:restore_debug_free:176] pid 3755 type
> >> Root state Failed
> >> [ 1541.864737] [3753:3753:c/r:restore_debug_free:176] pid 3756 type
> >> Ghost state Failed
> >>
> >> thanks,
> >> JP
> >>
> >> >
> >> > ---
> >> > diff --git a/kernel/checkpoint/sys.c b/kernel/checkpoint/sys.c
> >> > index 171c867..3288af0 100644
> >> > --- a/kernel/checkpoint/sys.c
> >> > +++ b/kernel/checkpoint/sys.c
> >> > @@ -605,13 +605,13 @@ int walk_task_subtree(struct task_struct *root,
> >> >                        continue;
> >> >                }
> >> >
> >> > +               /* if not last thread - proceed with thread */
> >> > +               task = next_thread(task);
> >> > +               if (!thread_group_leader(task))
> >> > +                       continue;
> >> > +
> >> >                /* by definition, skip siblings of root */
> >> >                while (task != root) {
> >> > -                       /* if not last thread - proceed with thread */
> >> > -                       task = next_thread(task);
> >> > -                       if (!thread_group_leader(task))
> >> > -                               break;
> >> > -
> >> >                        /* if has sibling - proceed with sibling */
> >> >                        if (!list_is_last(&task->sibling, &parent->children)) {
> >> >                                task = list_entry(task->sibling.next,
> >> > ---
> >>
> >>
> 
> 

[-- Attachment #2: Type: text/plain, Size: 206 bytes --]

_______________________________________________
Containers mailing list
Containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
https://lists.linux-foundation.org/mailman/listinfo/containers

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: multi-threaded app fails to restart
       [not found]                                   ` <Pine.LNX.4.64.1007212102010.6257-CXF6herHY6ykSYb+qCZC/1i27PF6R63G9nwVQlTi/Pw@public.gmane.org>
@ 2010-07-22 16:23                                     ` John Paul Walters
       [not found]                                       ` <AANLkTimW98q0sFZeCAk3xHsEfBV9yhL4kUKHjNGxn_2P-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: John Paul Walters @ 2010-07-22 16:23 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Hi Oren,

Thanks for the patch.  For the --pidns case, that seems to have solved
the problem.  In the case of --no-pidns, restart still hangs as
described before.  Should this work in the --no-pidns case, or is
it expected to fail in this case?

JP

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: multi-threaded app fails to restart
       [not found]                                       ` <AANLkTimW98q0sFZeCAk3xHsEfBV9yhL4kUKHjNGxn_2P-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-07-26 11:18                                         ` Oren Laadan
       [not found]                                           ` <Pine.LNX.4.64.1007260711310.1050-CXF6herHY6ykSYb+qCZC/1i27PF6R63G9nwVQlTi/Pw@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Oren Laadan @ 2010-07-26 11:18 UTC (permalink / raw)
  To: John Paul Walters; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

[-- Attachment #1: Type: TEXT/PLAIN, Size: 9692 bytes --]

Hi John,

Please try the following patch - it should be applied _instead_ of the 
patch I sent on 7/20.

The previous patch was still insufficient when the root task has not only 
threads, but also a child (the child was a "ghost" task used temporarily 
during restart). I believe this patch correctly addresses the problem, and 
I tested against your program with and without --pidns.

I'll wait for your confirmation before pushing the fix to ckpt-v22-dev.

Thanks !

Oren.

---
diff --git a/kernel/checkpoint/sys.c b/kernel/checkpoint/sys.c
index 171c867..c5517c2 100644
--- a/kernel/checkpoint/sys.c
+++ b/kernel/checkpoint/sys.c
@@ -625,8 +625,11 @@ int walk_task_subtree(struct task_struct *root,
 		}
 
 		/* if we arrive at root again -- done */
-		if (task == root)
-			break;
+		if (task == root) {
+			/* if not last thread - proceed with thread */
+			task = root = next_thread(task);
+			if (thread_group_leader(task))
+				break;
 	}
 	read_unlock(&tasklist_lock);
 
---
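
For readers of the archive: the patch above makes the walk move on to
the next thread of the root's thread group when it arrives back at
root, and stop only once it wraps around to the thread group leader.
A minimal userspace sketch of that idea follows. It is an assumption-
laden illustration, not the kernel code: it uses recursion instead of
the kernel's iterative loop, takes no locks, and toy_task, count_thread,
count_group, toy_build and the toy_demo_* helpers are hypothetical
names. The task layout mirrors the restart log quoted in this thread
(leader 3582, thread 3583, ghost child pid 1).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified task: just enough links to walk a subtree. */
struct toy_task {
	int pid;
	int is_leader;                 /* thread group leader? */
	struct toy_task *next_thread;  /* circular list of the thread group */
	struct toy_task *first_child;  /* children forked by this thread */
	struct toy_task *next_sibling; /* next child of the same parent */
};

int count_group(const struct toy_task *leader);

/* Count one thread plus every thread group forked beneath it. */
int count_thread(const struct toy_task *t)
{
	const struct toy_task *c;
	int total = 1;

	for (c = t->first_child; c != NULL; c = c->next_sibling)
		total += count_group(c);
	return total;
}

/* Count a whole thread group: visit each thread until we wrap around. */
int count_group(const struct toy_task *leader)
{
	const struct toy_task *t = leader;
	int total = 0;

	assert(leader->is_leader);
	do {
		total += count_thread(t);
		t = t->next_thread;
	} while (!t->is_leader);
	return total;
}

/* Build the layout from the log: leader, one extra thread, one ghost child. */
void toy_build(struct toy_task *leader, struct toy_task *thread,
	       struct toy_task *ghost)
{
	*leader = (struct toy_task){ 3582, 1, thread, ghost, NULL };
	*thread = (struct toy_task){ 3583, 0, leader, NULL, NULL };
	*ghost = (struct toy_task){ 1, 1, ghost, NULL, NULL };
}

/* Walk that follows the thread group: sees all three tasks. */
int toy_demo_total(void)
{
	struct toy_task leader, thread, ghost;

	toy_build(&leader, &thread, &ghost);
	return count_group(&leader);
}

/* Walk that stops at the leader: misses the second thread. */
int toy_demo_leader_only(void)
{
	struct toy_task leader, thread, ghost;

	toy_build(&leader, &thread, &ghost);
	return count_thread(&leader);
}
```

In this toy layout toy_demo_total() returns 3, which matches the
"total tasks (including ghosts): 3" line in the restart output, while
toy_demo_leader_only() returns 2 because it never visits thread 3583.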

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: multi-threaded app fails to restart
       [not found]                                           ` <Pine.LNX.4.64.1007260711310.1050-CXF6herHY6ykSYb+qCZC/1i27PF6R63G9nwVQlTi/Pw@public.gmane.org>
@ 2010-07-26 17:11                                             ` Dan Smith
       [not found]                                               ` <8739v6tbgj.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: Dan Smith @ 2010-07-26 17:11 UTC (permalink / raw)
  To: Oren Laadan; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

OL> diff --git a/kernel/checkpoint/sys.c b/kernel/checkpoint/sys.c
OL> index 171c867..c5517c2 100644
OL> --- a/kernel/checkpoint/sys.c
OL> +++ b/kernel/checkpoint/sys.c
OL> @@ -625,8 +625,11 @@ int walk_task_subtree(struct task_struct *root,
OL>  		}

OL>  		/* if we arrive at root again -- done */
OL> -		if (task == root)
OL> -			break;
OL> +		if (task == root) {
OL> +			/* if not last thread - proceed with thread */
OL> +			task = root = next_thread(task);
OL> +			if (thread_group_leader(task))
OL> +				break;

                } // Need to close this block

Otherwise it seems to work for me:

Tested-by: Dan Smith <danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

-- 
Dan Smith
IBM Linux Technology Center
email: danms-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: multi-threaded app fails to restart
       [not found]                                               ` <8739v6tbgj.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
@ 2010-07-26 17:56                                                 ` John Paul Walters
       [not found]                                                   ` <AANLkTikaaxCdjgKywJ6SvHpez_R1PNiW5LzNYAdAONxr-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 15+ messages in thread
From: John Paul Walters @ 2010-07-26 17:56 UTC (permalink / raw)
  To: Dan Smith; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

It works for me as well.  Thanks for your help Oren.

JP

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: multi-threaded app fails to restart
       [not found]                                                   ` <AANLkTikaaxCdjgKywJ6SvHpez_R1PNiW5LzNYAdAONxr-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-07-26 18:18                                                     ` Oren Laadan
  0 siblings, 0 replies; 15+ messages in thread
From: Oren Laadan @ 2010-07-26 18:18 UTC (permalink / raw)
  To: John Paul Walters
  Cc: Dan Smith, containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA


Great.
Pushed fixes to ckpt-v22-dev.

Oren.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2010-07-26 18:18 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-07-19 19:36 multi-threaded app fails to restart John Paul Walters
     [not found] ` <AANLkTilxfsYGyYLwO__VmDLSFQ_s_Qe03G49kIEztVja-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-07-19 19:54   ` Nathan Lynch
2010-07-19 20:27     ` John Paul Walters
     [not found]       ` <AANLkTimpXSXQr1wew1wvZKnBFsOXD7f2tblY4EGmJoFM-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-07-20  3:24         ` Oren Laadan
     [not found]           ` <4C4516DD.1000809-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2010-07-20 18:58             ` John Paul Walters
     [not found]               ` <AANLkTimPENgm-LSh6iMv2uxegRdHEivbGMTYmEfiOEJG-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-07-20 23:12                 ` Oren Laadan
     [not found]                   ` <Pine.LNX.4.64.1007201906370.15255-CXF6herHY6ykSYb+qCZC/1i27PF6R63G9nwVQlTi/Pw@public.gmane.org>
2010-07-21  0:03                     ` John Paul Walters
     [not found]                       ` <AANLkTinZYiWPtSegjRJWnlc6hipFAZyujr8-2ug6ettF-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-07-21  5:54                         ` Oren Laadan
     [not found]                           ` <Pine.LNX.4.64.1007210143120.22870-CXF6herHY6ykSYb+qCZC/1i27PF6R63G9nwVQlTi/Pw@public.gmane.org>
2010-07-21 12:52                             ` John Paul Walters
     [not found]                               ` <AANLkTinOFIzK8RZnp9NHouKv-WA7Omr08pPTGfrfVLfP-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-07-22  1:04                                 ` Oren Laadan
     [not found]                                   ` <Pine.LNX.4.64.1007212102010.6257-CXF6herHY6ykSYb+qCZC/1i27PF6R63G9nwVQlTi/Pw@public.gmane.org>
2010-07-22 16:23                                     ` John Paul Walters
     [not found]                                       ` <AANLkTimW98q0sFZeCAk3xHsEfBV9yhL4kUKHjNGxn_2P-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-07-26 11:18                                         ` Oren Laadan
     [not found]                                           ` <Pine.LNX.4.64.1007260711310.1050-CXF6herHY6ykSYb+qCZC/1i27PF6R63G9nwVQlTi/Pw@public.gmane.org>
2010-07-26 17:11                                             ` Dan Smith
     [not found]                                               ` <8739v6tbgj.fsf-FLMGYpZoEPULwtHQx/6qkW3U47Q5hpJU@public.gmane.org>
2010-07-26 17:56                                                 ` John Paul Walters
     [not found]                                                   ` <AANLkTikaaxCdjgKywJ6SvHpez_R1PNiW5LzNYAdAONxr-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2010-07-26 18:18                                                     ` Oren Laadan