From: "Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
To: Jiro SEKIBA <jir-Xy3Dp9s2+bNGIRItUzBvX16hYfS7NtTn@public.gmane.org>
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
Subject: Re: Linux Checkpoint-Restart - v19
Date: Mon, 29 Mar 2010 22:05:35 -0500
Message-ID: <20100330030535.GA13362@us.ibm.com>
In-Reply-To: <BC2CC354-59BA-465A-A863-0CDCD921A99A-Xy3Dp9s2+bNGIRItUzBvX16hYfS7NtTn@public.gmane.org>

Quoting Jiro SEKIBA (jir-Xy3Dp9s2+bNGIRItUzBvX16hYfS7NtTn@public.gmane.org):
> Hi
> 
> On 2010/03/25, at 1:47, Serge E. Hallyn wrote:
> 
> > Quoting Jiro SEKIBA (jir-Xy3Dp9s2+bNGIRItUzBvX16hYfS7NtTn@public.gmane.org):
> >>> If it doesn't work, can you please describe again the exact order of
> >>> commands that you use and the reported error(s) ?
> >>> 
> >> I'll let you know in any cases.
> >> 
> >> Thank you very much for the advice
> > 
> > Hi Jiro,
> > 
> > Can you fetch the latest cr_tests
> > 	(git clone git://git.sr71.net/~hallyn/cr_tests)
> > 
> > and
> > 	cd cr_tests; make; cd simple
> > 	sh runtests.sh
> > 
> > and tell me whether the second (restart --self) test succeeds?
> > If it fails, can you send me the cr_*/log2 contents?
> > 
> 
> I've tried on ckpt-v20 and the above test looks OK.
> And looks like self_checkpointing is working fine so far.
> 
> However, I'm still not able to restart external checkpoint correctly.
> 
> Here are the program and scripts I used for the test.
> I used user-cr ckpt-v20 branch for checkpoint/restart program.
> 
> This time I disconnect the program from tty completely.
> 
> ----------8<----------8<----------test.c----------8<----------8<----------
> #include <stdio.h>
> #include <unistd.h>
> 
> int main(void)
> {
>   FILE *fp;
>   int i;
>   pid_t pid;
>   int st;
> 
>   if(fork()) {
>     return 0;

That's an odd thing to do; I'm not sure whether you had a reason for it.
Still, it should be fine :)

>   } else {
>     waitpid(getppid(), &st, NULL);
> 
>     close(0);
>     close(1);
>     close(2);
>     setsid();
> 
>     if(fork()) {
>       return 0;
>     } else 
>       waitpid(getppid(), &st, NULL);
>   }
> 
>   //unlink("/tmp/test.out");
>   fp = fopen("/tmp/test.out","w");
> 
>   for(i=0;i<10;i++) {
>     fprintf(fp,"%d\n",i);
>     fflush(fp);
>     sleep(1);
>   }
> 
>   fclose(fp);
>   return 0;
> }
> ----------8<----------8<----------test.c----------8<----------8<----------
> 
> ----------8<----------8<----------checkpoint.sh----------8<----------8<----------
> #!/bin/sh
> 
> CLOG=checkpoint.log
> RLOG=restart.log
> rm -f $CLOG $RLOG
> 
> ./test &
> sleep 1
> PID=$(ps x | grep test | grep -v grep |cut -f 2 -d' ')
> 
> sleep 2
> echo $PID > /cgroup/0/tasks
> 
> echo FROZEN > /cgroup/0/freezer.state
> ./checkpoint -l $CLOG -v $PID > ckpt.image
> 
> mv /tmp/test.out /tmp/test.out.orig
> cp /tmp/test.out.orig /tmp/test.out
> 
> echo THAWED > /cgroup/0/freezer.state
> 
> ./restart --pidns -l $RLOG -v -i ckpt.image;
> ----------8<----------8<----------checkpoint.sh----------8<----------8<----------
> 
> When I run the above script, I got following:
> 
> # mount -t cgroup -o freezer cgroup /cgroup
> # mkdir /cgroup/0
> # sh checkpoint.sh
> checkpoint id 8
> Success
> 
> Then, I'm expecting to see number 0 to 9 in /tmp/test.out, but
> I only got 0 to 3, which is the state I froze and checkpointed the process.
> 
> checkpoint.log and restart.log are empty.
> I guess it means the programs worked fine.
> 
> I attached the dmesg I got by the single session of the script.
> It looks the restart tries to reopen /tmp/test.out.
> 
> Could you give me any clues that I should check with?

Hmm, with ckpt-v20 of both kernel and user, on a powerpc system, I get:

elm3b203:/usr/src/jiro # sh checkpoint.sh
checkpoint id 146
Success
elm3b203:/usr/src/jiro # ls
checkpoint.log  checkpoint.sh  ckpt.image  restart.log  test  test.c
elm3b203:/usr/src/jiro # cat /tmp/test.out
0
1
2
3
4
5
6
7
8
9

> My environment is Virtualbox VM.
> I tried both with VT and without VT.
> No virtualbox guest module is installed.

What distro are you on?

Anyway, there are two things to try.  First, add '-d' to your restart
flags, so the command becomes

restart --pidns -l $RLOG -vd -i ckpt.image

That will give you debugging info.  For instance, I get:

checkpoint id 147
<2507>number of tasks: 1
<2507>total tasks (including ghosts): 1
<2507>====== TASKS
<2507>  [0] pid 2497 ppid 1 sid 0 creator 0       
<2507>............
<2507>new pidns without init
<2507>forking coordinator in new pidns
<2508>====== PIDS ARRAY
<2508>[0] pid 2497 ppid 1 sid 0 pgid 0
<2508>............
<1>forking child vpid 2497 flags 0x1
<1>forked child vpid 2497 (asked 2497)
<2497>root task pid 2497
<2497>pid 2497: pid 2497 sid 0 parent 1
<2497>about to call sys_restart(), flags 0
<2508>c/r read input 16384
<2508>c/r read input 16384
<2508>c/r read input 16384
<2508>c/r read input 16384
<2508>c/r read input 16384
<2508>c/r read input 16384
<2508>c/r read input 16384
<2508>c/r read input 16384
<2508>c/r read input 16384
<2508>c/r read input 16384
<2508>c/r read input 16384
<2508>c/r read input 8336
<2508>c/r read input 0
Success
<1>restart succeeded
<1>SIGCHLD: already collected
<1>task exited with status 0
<1>mimic ret 0
<1>c/r succeeded
<2507>SIGCHLD: already collected
<2507>task exited with status 0
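
Each line of that debug output is tagged with the pid of the process that
printed it, so the output of the coordinator, the restarted task and the
outer restart process is interleaved.  As a rough sketch only (the
function name and the restart.debug file name are made up for
illustration and are not part of user-cr), once you have saved the
output to a file you could pull out one process's lines like this:

```shell
#!/bin/sh
# Sketch: filter saved 'restart -vd' output by its <pid> tag.
# debug_lines_for is a hypothetical helper, not a user-cr command.

# Print only the lines whose tag is exactly <$1>; input comes from stdin.
# The closing '>' in the pattern keeps pid 1 from also matching pid 10.
debug_lines_for() {
    grep "^<$1>"
}

# Example use (restart.debug is an assumed file name):
#   debug_lines_for 2497 < restart.debug
```

In the output above, that would separate the lines of the restarted task
(2497) from those of the coordinator (2508).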


The other thing is to restart the task frozen and attach strace or gdb to
the restarted test before thawing it.  So perhaps

# cc -g -o test test.c
# sh checkpoint.sh

Then when that has failed, do

# mkdir /cgroup/1
# restart -F /cgroup/1 -i ckpt.image

That will hang.  Then, in another terminal, you can run

# gdb -se test -p `pidof test`

and in a third terminal,

# echo THAWED > /cgroup/1/freezer.state

Now in gdb you can figure out where the task is and step through
to see where it dies.
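
If you end up repeating that sequence a lot, the freezer steps can be
wrapped in a couple of small helpers.  This is only a sketch: the
function names are mine, and the polling loop assumes that
freezer.state can briefly read back an intermediate state (FREEZING)
before it settles, so it checks rather than trusting the write.

```shell
#!/bin/sh
# Sketch: helpers around a freezer cgroup's freezer.state file.
# freezer_set and freezer_wait are hypothetical names for illustration.

# Request a state.  $1 = cgroup directory, $2 = FROZEN or THAWED
freezer_set() {
    echo "$2" > "$1/freezer.state"
}

# Poll until freezer.state reads back the wanted state.
# $1 = cgroup directory, $2 = wanted state, $3 = max tries (default 10)
# Returns 0 once the state matches, 1 if it never does.
freezer_wait() {
    tries=${3:-10}
    while [ "$tries" -gt 0 ]; do
        if [ "$(cat "$1/freezer.state")" = "$2" ]; then
            return 0
        fi
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Example use for the third terminal, after gdb has attached:
#   freezer_set /cgroup/1 THAWED && freezer_wait /cgroup/1 THAWED
```

The same pair could be used in checkpoint.sh to confirm that /cgroup/0
really reads FROZEN before running checkpoint.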

thanks,
-serge
