From: "Serge E. Hallyn" <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
To: Jiro SEKIBA <jir-Xy3Dp9s2+bNGIRItUzBvX16hYfS7NtTn@public.gmane.org>
Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
Subject: Re: Linux Checkpoint-Restart - v19
Date: Mon, 29 Mar 2010 22:05:35 -0500
Message-ID: <20100330030535.GA13362@us.ibm.com>
In-Reply-To: <BC2CC354-59BA-465A-A863-0CDCD921A99A-Xy3Dp9s2+bNGIRItUzBvX16hYfS7NtTn@public.gmane.org>

Quoting Jiro SEKIBA (jir-Xy3Dp9s2+bNGIRItUzBvX16hYfS7NtTn@public.gmane.org):
> Hi
>
> On 2010/03/25, at 1:47, Serge E. Hallyn wrote:
>
> > Quoting Jiro SEKIBA (jir-Xy3Dp9s2+bNGIRItUzBvX16hYfS7NtTn@public.gmane.org):
> >>> If it doesn't work, can you please describe again the exact order of
> >>> commands that you use and the reported error(s) ?
> >>>
> >> I'll let you know in any case.
> >>
> >> Thank you very much for the advice
> >
> > Hi Jiro,
> >
> > Can you fetch the latest cr_tests
> > (git clone git://git.sr71.net/~hallyn/cr_tests)
> >
> > and
> > cd cr_tests; make; cd simple
> > sh runtests.sh
> >
> > and tell me whether the second (restart --self) test succeeds?
> > If it fails, can you send me the cr_*/log2 contents?
> >
>
> I tried on ckpt-v20 and the above test looks OK.
> It looks like self_checkpointing is working fine so far.
>
> However, I'm still not able to restart an external checkpoint correctly.
>
> Here are the program and script I used for the test.
> I used the user-cr ckpt-v20 branch for the checkpoint/restart programs.
>
> This time I disconnected the program from the tty completely.
>
> ----------8<----------8<----------test.c----------8<----------8<----------
> #include <stdio.h>
> #include <unistd.h>
>
> int main(void)
> {
>     FILE *fp;
>     int i;
>     pid_t pid;
>     int st;
>
>     if(fork()) {
>         return 0;

Odd thing to do, not sure if you had a reason for it. Still,
should be fine :)

>     } else {
>         waitpid(getppid(), &st, NULL);
>
>         close(0);
>         close(1);
>         close(2);
>         setsid();
>
>         if(fork()) {
>             return 0;
>         } else
>             waitpid(getppid(), &st, NULL);
>     }
>
>     //unlink("/tmp/test.out");
>     fp = fopen("/tmp/test.out","w");
>
>     for(i=0;i<10;i++) {
>         fprintf(fp,"%d\n",i);
>         fflush(fp);
>         sleep(1);
>     }
>
>     fclose(fp);
>     return 0;
> }
> ----------8<----------8<----------test.c----------8<----------8<----------
>
> ----------8<----------8<----------checkpoint.sh----------8<----------8<----------
> #!/bin/sh
>
> CLOG=checkpoint.log
> RLOG=restart.log
> rm -f $CLOG $RLOG
>
> ./test &
> sleep 1
> PID=$(ps x | grep test | grep -v grep |cut -f 2 -d' ')
>
> sleep 2
> echo $PID > /cgroup/0/tasks
>
> echo FROZEN > /cgroup/0/freezer.state
> ./checkpoint -l $CLOG -v $PID > ckpt.image
>
> mv /tmp/test.out /tmp/test.out.orig
> cp /tmp/test.out.orig /tmp/test.out
>
> echo THAWED > /cgroup/0/freezer.state
>
> ./restart --pidns -l $RLOG -v -i ckpt.image;
> ----------8<----------8<----------checkpoint.sh----------8<----------8<----------
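
A side note on the script above, offered as a sketch rather than a diagnosis: two of its steps look fragile. `cut -f 2 -d' '` depends on how many leading spaces ps prints in front of the PID, so it can pick the wrong field when the PID width changes, and the checkpoint starts right after writing FROZEN without checking that the freezer has left its transitional FREEZING state. The helper names `find_pid_by_comm` and `wait_for_state` below are made up for illustration; they are not part of user-cr.

```shell
#!/bin/sh
# Hypothetical helpers for checkpoint.sh; names are illustrative only.

# Read "PID COMMAND" lines on stdin and print the PIDs whose command
# name matches $1 exactly. awk splits on runs of blanks, so leading
# padding in front of the PID does not matter.
find_pid_by_comm() {
    awk -v name="$1" '$2 == name { print $1 }'
}

# Poll a freezer.state file until it reads $2 (e.g. FROZEN).
# $3 is the number of attempts (default 10), one second apart.
# Returns 0 once the state matches, 1 if it never does.
wait_for_state() {
    state_file=$1
    want=$2
    tries=${3:-10}
    while :; do
        if [ "$(cat "$state_file" 2>/dev/null)" = "$want" ]; then
            return 0
        fi
        tries=$((tries - 1))
        if [ "$tries" -le 0 ]; then
            return 1
        fi
        sleep 1
    done
}
```

In checkpoint.sh these might be used as `PID=$(ps -e -o pid= -o comm= | find_pid_by_comm test)` and, after the `echo FROZEN` line, `wait_for_state /cgroup/0/freezer.state FROZEN || exit 1`.
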
>
> When I ran the above script, I got the following:
>
> # mount -t cgroup -o freezer cgroup /cgroup
> # mkdir /cgroup/0
> # sh checkpoint.sh
> checkpoint id 8
> Success
>
> I expected to see the numbers 0 to 9 in /tmp/test.out, but
> I only got 0 to 3, which is the state at which I froze and checkpointed
> the process.
>
> checkpoint.log and restart.log are empty.
> I guess that means the programs worked fine.
>
> I attached the dmesg output from a single run of the script.
> It looks like the restart tries to reopen /tmp/test.out.
>
> Could you give me any clues about what I should check?

Hmm, with ckpt-v20 of both kernel and user, on a powerpc system, I get:

    elm3b203:/usr/src/jiro # sh checkpoint.sh
    checkpoint id 146
    Success
    elm3b203:/usr/src/jiro # ls
    checkpoint.log checkpoint.sh ckpt.image restart.log test test.c
    elm3b203:/usr/src/jiro # cat /tmp/test.out
    0
    1
    2
    3
    4
    5
    6
    7
    8
    9

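
To make the pass/fail check mechanical instead of eyeballing the file, something like the following could be appended to checkpoint.sh. It is only a sketch: `check_counter_file` is a hypothetical name, and it assumes test.c writes exactly the lines 0 through 9, one per line, as in the quoted listing.

```shell
#!/bin/sh
# Hypothetical helper: return 0 if file $1 holds exactly the integers
# 0..$2 (default 9), one per line, and non-zero otherwise (including
# when the file is missing).
check_counter_file() {
    file=$1
    last=${2:-9}
    i=0
    # Generate the expected contents and compare them with the file.
    while [ "$i" -le "$last" ]; do
        echo "$i"
        i=$((i + 1))
    done | cmp -s - "$file"
}
```

For example, `check_counter_file /tmp/test.out || echo "restart did not finish the loop"` after the restart line would flag the 0 to 3 case described above.
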
> My environment is a VirtualBox VM.
> I tried both with VT and without VT.
> No VirtualBox guest modules are installed.

What distro are you on?

Anyway, two things to do. First, add '-d' to your restart flags, so

    restart --pidns -l $RLOG -vd -i ckpt.image

That will give you debugging info. For instance I get:

checkpoint id 147
<2507>number of tasks: 1
<2507>total tasks (including ghosts): 1
<2507>====== TASKS
<2507> [0] pid 2497 ppid 1 sid 0 creator 0
<2507>............
<2507>new pidns without init
<2507>forking coordinator in new pidns
<2508>====== PIDS ARRAY
<2508>[0] pid 2497 ppid 1 sid 0 pgid 0
<2508>............
<1>forking child vpid 2497 flags 0x1
<1>forked child vpid 2497 (asked 2497)
<2497>root task pid 2497
<2497>pid 2497: pid 2497 sid 0 parent 1
<2497>about to call sys_restart(), flags 0
<2508>c/r read input 16384
<2508>c/r read input 16384
<2508>c/r read input 16384
<2508>c/r read input 16384
<2508>c/r read input 16384
<2508>c/r read input 16384
<2508>c/r read input 16384
<2508>c/r read input 16384
<2508>c/r read input 16384
<2508>c/r read input 16384
<2508>c/r read input 16384
<2508>c/r read input 8336
<2508>c/r read input 0
Success
<1>restart succeeded
<1>SIGCHLD: already collected
<1>task exited with status 0
<1>mimic ret 0
<1>c/r succeeded
<2507>SIGCHLD: already collected
<2507>task exited with status 0
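
If the debug output is saved to a file (for instance by redirecting both stdout and stderr of the restart command, which is an assumption about where it is written), a couple of small helpers can summarize it. This is a sketch based only on the sample output above: it assumes the line formats `c/r read input N` and `c/r succeeded` stay as shown, and the function names are hypothetical. Summing the read sizes gives a number that can be compared against the size of ckpt.image, presumably a quick way to see whether the whole image was consumed.

```shell
#!/bin/sh
# Hypothetical helpers for a saved 'restart -vd' debug log.

# Print the total of all "c/r read input N" sizes found in file $1.
# Prints 0 if no such line is present.
sum_read_input() {
    awk '/c\/r read input/ { total += $NF } END { print total + 0 }' "$1"
}

# Return 0 if the log in file $1 contains the "c/r succeeded" line.
restart_succeeded() {
    grep -q 'c/r succeeded' "$1"
}
```

For the sample output above, the sum would be 11 x 16384 + 8336 = 188560.
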

The other thing is to restart frozen and attach strace or gdb to the
restarted test before thawing. So perhaps

    # cc -g -o test test.c
    # sh checkpoint.sh

Then when that has failed, do

    # mkdir /cgroup/1
    # restart -F /cgroup/1 -i ckpt.image

That will hang. Then in another terminal, you can

    # gdb -se test -p `pidof test`

and in a third terminal,

    # echo THAWED > /cgroup/1/freezer.state

Now in gdb you can figure out where the task is and step through
to see where it dies.

thanks,
-serge