* user-cr: Extra unshare() calls ?
@ 2010-03-08 20:13 Sukadev Bhattiprolu
[not found] ` <20100308201338.GA16446-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 5+ messages in thread
From: Sukadev Bhattiprolu @ 2010-03-08 20:13 UTC (permalink / raw)
To: Oren Laadan; +Cc: Containers
Came across this while testing LXC.
1. Does ckpt_remount_proc() need to unshare() ? Or can we have the
clone() that calls __ckpt_coordinator() clone with CLONE_NEWNS|CLONE_FS
instead ?
The problem with the unshare() in ckpt_remount_proc() is that it
creates an extra level in cgroup hierarchy (see below) after restart.
So applications expecting the cgroup hierarchy before chckpoint will
be surprised.
2. When --mount-pty (or --mntns) is specified, do we need to unshare()
in the parent process ? Considering only the full-container restart
for now (ignore self-restart and subtree restart), can we just
specify (CLONE_NEWNS|CLONE_FS) at the time of creating the first
restarted process ?
Here is an example (using LXC) that shows the problems I am running into
Attached is a quick hack to point out the unshare() calls I am referring
to.
If I create a simple container with LXC
$ lxc-execute --name foo --rcfile lxc-macvlan.conf -- /bin/sleep 1000
It creates the following three processes:
PID PPID CMD
3350 3239 lxc-execute --name foo -- /bin/sleep 1000
3353 3350 /usr/local/libexec/lxc-init -- /bin/sleep 1000
3357 3353 /bin/sleep 1000
A new cgroup is created named 'foo' (which is basically a user-space
rename of the pid of the lxc-init). This cgroup is in the root cgroup
directory and has two tasks (lxc-init, sleep)
$ cat /cgroup/foo/tasks
3353
3357
When I checkpoint and restart this container (using the equivalent of
--pidns --pids --mount-pty options to /bin/restart). I get three
processes:
3434 3375 ./lxc_restart --name bar --statefile=/root/foo.ckpt
3436 3434 /usr/local/libexec/lxc-init -- /bin/sleep 1000
3437 3436 /bin/sleep 1000
But the directory in /cgroup referring to lxc-init is 3 levels deep:
ls /cgroup/3434/3436/1
cgroup.procs freezer.state notify_on_release tasks
Here is the complete hierarchy created after the restart:
$ ls -R /cgroup/3434
/cgroup/3434:
3436 cgroup.procs freezer.state notify_on_release tasks
/cgroup/3434/3436:
1 cgroup.procs freezer.state notify_on_release tasks
/cgroup/3434/3436/1:
cgroup.procs freezer.state notify_on_release tasks
$ cat /cgroup/3434/tasks
3434
$ cat /cgroup/3434/3436/tasks # empty
$ cat /cgroup/3434/3436/1/tasks
3436
3437
I think we get the directory /cgroup/3434 due to the following unshare()
/* private mounts namespace ? */
if (args->mntns && unshare(CLONE_NEWNS | CLONE_FS) < 0) {
ckpt_perror("unshare");
exit(1);
}
And we get the "3436/1" directory due to the unshare() in ckpt_remount_proc().
Following hack seems to fix both the levels and the lxc_restart command
correctly creates just the "/cgroup/3436" (which LXC renames to "/cgroup/bar"
cgroup).
---
From: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Date: Mon, 8 Mar 2010 12:03:46 -0800
Subject: [PATCH 1/1] Minimize unshare() calls
---
restart.c | 9 ++++++++-
1 files changed, 8 insertions(+), 1 deletions(-)
diff --git a/restart.c b/restart.c
index c82de21..6ac51e3 100644
--- a/restart.c
+++ b/restart.c
@@ -459,10 +459,12 @@ int app_restart(struct app_restart_args *args)
exit(1);
/* private mounts namespace ? */
+#if 0
if (args->mntns && unshare(CLONE_NEWNS | CLONE_FS) < 0) {
ckpt_perror("unshare");
exit(1);
}
+#endif
/* chroot ? */
if (args->root && chroot(args->root) < 0) {
@@ -717,10 +719,12 @@ static int ckpt_probe_child(pid_t pid, char *str)
*/
static int ckpt_remount_proc(struct ckpt_ctx *ctx)
{
+#if 0
if (unshare(CLONE_NEWNS | CLONE_FS) < 0) {
ckpt_perror("unshare");
return -1;
}
+#endif
/* this is unlikely, but we don't want to fail */
if (umount2("/proc", MNT_DETACH) < 0) {
if (ckpt_cond_fail(ctx, CKPT_COND_MNTPROC)) {
@@ -778,6 +782,7 @@ static int ckpt_coordinator_pidns(struct ckpt_ctx *ctx)
int copy, ret;
genstack stk;
void *sp;
+ unsigned long flags = SIGCHLD;
ckpt_dbg("forking coordinator in new pidns\n");
@@ -802,7 +807,9 @@ static int ckpt_coordinator_pidns(struct ckpt_ctx *ctx)
copy = ctx->args->copy_status;
ctx->args->copy_status = 1;
- coord_pid = clone(__ckpt_coordinator, sp, CLONE_NEWPID|SIGCHLD, ctx);
+ flags |= CLONE_NEWPID|CLONE_NEWNS|CLONE_FS;
+
+ coord_pid = clone(__ckpt_coordinator, sp, flags, ctx);
genstack_release(stk);
if (coord_pid < 0) {
ckpt_perror("clone coordinator");
--
1.6.6.1
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: user-cr: Extra unshare() calls ?
[not found] ` <20100308201338.GA16446-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2010-03-08 20:58 ` Serge E. Hallyn
[not found] ` <20100308205851.GB21490-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 5+ messages in thread
From: Serge E. Hallyn @ 2010-03-08 20:58 UTC (permalink / raw)
To: Sukadev Bhattiprolu; +Cc: Containers
Quoting Sukadev Bhattiprolu (sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org):
>
> Came across this while testing LXC.
>
>
> 1. Does ckpt_remount_proc() need to unshare() ? Or can we have the
> clone() that calls __ckpt_coordinator() clone with CLONE_NEWNS|CLONE_FS
> instead ?
>
> The problem with the unshare() in ckpt_remount_proc() is that it
> creates an extra level in cgroup hierarchy (see below) after restart.
> So applications expecting the cgroup hierarchy before chckpoint will
> be surprised.
>
> 2. When --mount-pty (or --mntns) is specified, do we need to unshare()
> in the parent process ? Considering only the full-container restart
> for now (ignore self-restart and subtree restart), can we just
> specify (CLONE_NEWNS|CLONE_FS) at the time of creating the first
> restarted process ?
And then move remounting of devpts into ckpt_remount_proc() called
from __ckpt_coordinator() as well?
-serge
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: user-cr: Extra unshare() calls ?
[not found] ` <20100308205851.GB21490-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2010-03-08 21:11 ` Sukadev Bhattiprolu
[not found] ` <20100308211135.GB14607-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 5+ messages in thread
From: Sukadev Bhattiprolu @ 2010-03-08 21:11 UTC (permalink / raw)
To: Serge E. Hallyn; +Cc: Containers
Serge E. Hallyn [serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org] wrote:
| Quoting Sukadev Bhattiprolu (sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org):
| >
| > Came across this while testing LXC.
| >
| >
| > 1. Does ckpt_remount_proc() need to unshare() ? Or can we have the
| > clone() that calls __ckpt_coordinator() clone with CLONE_NEWNS|CLONE_FS
| > instead ?
| >
| > The problem with the unshare() in ckpt_remount_proc() is that it
| > creates an extra level in cgroup hierarchy (see below) after restart.
| > So applications expecting the cgroup hierarchy before chckpoint will
| > be surprised.
| >
| > 2. When --mount-pty (or --mntns) is specified, do we need to unshare()
| > in the parent process ? Considering only the full-container restart
| > for now (ignore self-restart and subtree restart), can we just
| > specify (CLONE_NEWNS|CLONE_FS) at the time of creating the first
| > restarted process ?
|
| And then move remounting of devpts into ckpt_remount_proc() called
| from __ckpt_coordinator() as well?
Yes, I missed that in the hack.
Thanks,
Sukadev
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: user-cr: Extra unshare() calls ?
[not found] ` <20100308211135.GB14607-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2010-03-12 1:58 ` Sukadev Bhattiprolu
[not found] ` <20100312015823.GB6444-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 5+ messages in thread
From: Sukadev Bhattiprolu @ 2010-03-12 1:58 UTC (permalink / raw)
To: Serge E. Hallyn; +Cc: Containers
Sukadev Bhattiprolu [sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org] wrote:
| Serge E. Hallyn [serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org] wrote:
| | Quoting Sukadev Bhattiprolu (sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org):
| | >
| | > Came across this while testing LXC.
| | >
| | >
| | > 1. Does ckpt_remount_proc() need to unshare() ? Or can we have the
| | > clone() that calls __ckpt_coordinator() clone with CLONE_NEWNS|CLONE_FS
| | > instead ?
| | >
| | > The problem with the unshare() in ckpt_remount_proc() is that it
| | > creates an extra level in cgroup hierarchy (see below) after restart.
| | > So applications expecting the cgroup hierarchy before chckpoint will
| | > be surprised.
| | >
| | > 2. When --mount-pty (or --mntns) is specified, do we need to unshare()
| | > in the parent process ? Considering only the full-container restart
| | > for now (ignore self-restart and subtree restart), can we just
| | > specify (CLONE_NEWNS|CLONE_FS) at the time of creating the first
| | > restarted process ?
| |
| | And then move remounting of devpts into ckpt_remount_proc() called
| | from __ckpt_coordinator() as well?
|
| Yes, I missed that in the hack.
Here is a more involved fix, but not sure if there is a more efficient
way to solve.
---
From 6813961bc1f357516685145279f9f8ce883c31a4 Mon Sep 17 00:00:00 2001
From: Sukadev Bhattiprolu <sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org>
Date: Mon, 8 Mar 2010 12:03:46 -0800
Subject: [PATCH 1/1] Minimize unshare() calls
We currently have a few unshare() calls at different points in the
code. While these don't affect the restart application itself, the
excess calls create additional levels in the cgroup hierarchy, which
can surprise the administrator (or other users of the hierarchy
such as LXC.
Rather than several unshare() calls, can we instead specify the
appropriate clone_flags while creating the coordinator/root process
of the application tree ? When this root process is created it can
remount /proc, remount devpts, chroot() etc if necessary.
Note that for "new-container with init" and "subtree restart", the
first process is also the root of the application process tree.
In the case of "new-container without init", the coordinator process
which acts as the container-init can do the setup.
In case of self-restart, the main process itself can do the unshare.
This patch has been very gently tested :-) but wanted to get more feedback
on the direction and see if there is an easier way.
---
restart.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++++------------
1 files changed, 62 insertions(+), 16 deletions(-)
diff --git a/restart.c b/restart.c
index 63e1881..87a30dc 100644
--- a/restart.c
+++ b/restart.c
@@ -127,6 +127,9 @@ struct task zero_task;
#define TASK_SESSION 0x10 /* inherits creator's original sid */
#define TASK_NEWPID 0x20 /* starts a new pid namespace */
#define TASK_DEAD 0x40 /* dead task (dummy) */
+#define TASK_NEWROOT 0x80 /* task must chroot() */
+#define TASK_NEWPTS 0x100 /* remount devpts */
+#define TASK_NEWNS 0x200 /* unshare namespace/file-system */
struct ckpt_ctx {
pid_t root_pid;
@@ -458,24 +461,24 @@ int app_restart(struct app_restart_args *args)
if (args->freezer && freezer_prepare(&ctx) < 0)
exit(1);
- /* private mounts namespace ? */
- if (args->mntns && unshare(CLONE_NEWNS | CLONE_FS) < 0) {
- ckpt_perror("unshare");
- exit(1);
- }
+ /* self-restart ends here: */
+ if (args->self) {
+ /* private mounts namespace ? */
+ if (args->mntns && unshare(CLONE_NEWNS | CLONE_FS) < 0) {
+ ckpt_perror("unshare");
+ exit(1);
+ }
- /* chroot ? */
- if (args->root && chroot(args->root) < 0) {
- ckpt_perror("chroot");
- exit(1);
- }
+ /* chroot ? */
+ if (args->root && chroot(args->root) < 0) {
+ ckpt_perror("chroot");
+ exit(1);
+ }
- /* remount /dev/pts ? */
- if (args->mnt_pty && ckpt_remount_devpts(&ctx) < 0)
- exit(1);
+ /* remount /dev/pts ? */
+ if (args->mnt_pty && ckpt_remount_devpts(&ctx) < 0)
+ exit(1);
- /* self-restart ends here: */
- if (args->self) {
restart(getpid(), STDIN_FILENO, RESTART_TASKSELF, args->klogfd);
/* reach here if restart(2) failed ! */
ckpt_perror("restart");
@@ -520,10 +523,30 @@ int app_restart(struct app_restart_args *args)
if (ret < 0)
exit(1);
+ /*
+ * Have the first child in the restarted process tree
+ * setup devpts, root-dir and /proc if necessary, ...
+ */
+ if (ctx.args->mnt_pty)
+ ctx.tasks_arr[0].flags |= TASK_NEWPTS;
+
+ if (ctx.args->mntns)
+ ctx.tasks_arr[0].flags |= TASK_NEWNS;
+
+ if (ctx.args->root)
+ ctx.tasks_arr[0].flags |= TASK_NEWROOT;
+
if (ctx.args->pidns && ctx.tasks_arr[0].pid != 1) {
ckpt_dbg("new pidns without init\n");
if (global_send_sigint == -1)
global_send_sigint = SIGINT;
+ /*
+ * ...unless we have an explicit coordinator, in which case
+ * the coordinator should set up the filesystems and
+ * not the first process in the application process tree.
+ */
+ ctx.tasks_arr[0].flags &= ~(TASK_NEWPTS | TASK_NEWROOT | \
+ TASK_NEWNS);
ret = ckpt_coordinator_pidns(&ctx);
} else if (ctx.args->pidns) {
ckpt_dbg("new pidns with init\n");
@@ -717,10 +740,12 @@ static int ckpt_probe_child(pid_t pid, char *str)
*/
static int ckpt_remount_proc(struct ckpt_ctx *ctx)
{
+#if 0
if (unshare(CLONE_NEWNS | CLONE_FS) < 0) {
ckpt_perror("unshare");
return -1;
}
+#endif
/* this is unlikely, but we don't want to fail */
if (umount2("/proc", MNT_DETACH) < 0) {
if (ckpt_cond_fail(ctx, CKPT_COND_MNTPROC)) {
@@ -743,9 +768,18 @@ static int __ckpt_coordinator(void *arg)
{
struct ckpt_ctx *ctx = (struct ckpt_ctx *) arg;
+ /* chroot ? */
+ if (ctx->args->root && chroot(ctx->args->root) < 0) {
+ ckpt_perror("chroot");
+ exit(1);
+ }
+
if (ckpt_remount_proc(ctx) < 0)
return -1;
+ if (ctx->args->mnt_pty && ckpt_remount_devpts(ctx) < 0)
+ return -1;
+
if (!ctx->args->wait)
close(ctx->pipe_coord[0]);
@@ -778,6 +812,7 @@ static int ckpt_coordinator_pidns(struct ckpt_ctx *ctx)
int copy, ret;
genstack stk;
void *sp;
+ unsigned long flags = SIGCHLD;
ckpt_dbg("forking coordinator in new pidns\n");
@@ -802,7 +837,9 @@ static int ckpt_coordinator_pidns(struct ckpt_ctx *ctx)
copy = ctx->args->copy_status;
ctx->args->copy_status = 1;
- coord_pid = clone(__ckpt_coordinator, sp, CLONE_NEWPID|SIGCHLD, ctx);
+ flags |= CLONE_NEWPID|CLONE_NEWNS;
+
+ coord_pid = clone(__ckpt_coordinator, sp, flags, ctx);
genstack_release(stk);
if (coord_pid < 0) {
ckpt_perror("clone coordinator");
@@ -1610,10 +1647,16 @@ int ckpt_fork_stub(void *data)
struct task *task = (struct task *) data;
struct ckpt_ctx *ctx = task->ctx;
+ if ((task->flags & TASK_NEWROOT) && chroot(ctx->args->root) < 0)
+ return -1;
+
/* tasks with new pid-ns need new /proc mount */
if ((task->flags & TASK_NEWPID) && ckpt_remount_proc(ctx) < 0)
return -1;
+ if ((task->flags & TASK_NEWPTS) && ckpt_remount_devpts(ctx) < 0)
+ return -1;
+
/*
* In restart into a new pid namespace (--pidns), coordinator
* is the container init, hence if it terminated permatutely
@@ -1691,6 +1734,9 @@ static pid_t ckpt_fork_child(struct ckpt_ctx *ctx, struct task *child)
pid = 0;
}
#endif
+ if (child->flags & TASK_NEWNS) {
+ flags |= CLONE_NEWNS;
+ }
if (child->flags & (TASK_SIBLING | TASK_THREAD))
child->real_parent = getppid();
--
1.6.6.1
^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: user-cr: Extra unshare() calls ?
[not found] ` <20100312015823.GB6444-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2010-03-12 14:44 ` Serge E. Hallyn
0 siblings, 0 replies; 5+ messages in thread
From: Serge E. Hallyn @ 2010-03-12 14:44 UTC (permalink / raw)
To: Sukadev Bhattiprolu; +Cc: Containers
Quoting Sukadev Bhattiprolu (sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org):
> Sukadev Bhattiprolu [sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org] wrote:
> | Serge E. Hallyn [serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org] wrote:
> | | Quoting Sukadev Bhattiprolu (sukadev-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8@public.gmane.org):
> | | >
> | | > Came across this while testing LXC.
> | | >
> | | >
> | | > 1. Does ckpt_remount_proc() need to unshare() ? Or can we have the
> | | > clone() that calls __ckpt_coordinator() clone with CLONE_NEWNS|CLONE_FS
> | | > instead ?
> | | >
> | | > The problem with the unshare() in ckpt_remount_proc() is that it
> | | > creates an extra level in cgroup hierarchy (see below) after restart.
> | | > So applications expecting the cgroup hierarchy before chckpoint will
> | | > be surprised.
> | | >
> | | > 2. When --mount-pty (or --mntns) is specified, do we need to unshare()
> | | > in the parent process ? Considering only the full-container restart
> | | > for now (ignore self-restart and subtree restart), can we just
> | | > specify (CLONE_NEWNS|CLONE_FS) at the time of creating the first
> | | > restarted process ?
> | |
> | | And then move remounting of devpts into ckpt_remount_proc() called
> | | from __ckpt_coordinator() as well?
> |
> | Yes, I missed that in the hack.
>
> Here is a more involved fix, but not sure if there is a more efficient
> way to solve.
I don't know what you mean by more efficient - this looks good to me.
(Apart from probably also pulling the self-restart stuff off into a helper)
-serge
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2010-03-12 14:44 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2010-03-08 20:13 user-cr: Extra unshare() calls ? Sukadev Bhattiprolu
[not found] ` <20100308201338.GA16446-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2010-03-08 20:58 ` Serge E. Hallyn
[not found] ` <20100308205851.GB21490-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2010-03-08 21:11 ` Sukadev Bhattiprolu
[not found] ` <20100308211135.GB14607-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2010-03-12 1:58 ` Sukadev Bhattiprolu
[not found] ` <20100312015823.GB6444-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2010-03-12 14:44 ` Serge E. Hallyn
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.