[PATCH v21 001/100] eclone (1/11): Factor out code to allocate pidmap page

public inbox for linux-s390@vger.kernel.org

* [PATCH v21 001/100] eclone (1/11): Factor out code to allocate pidmap page
       [not found] <1272723382-19470-1-git-send-email-orenl@cs.columbia.edu>
@ 2010-05-01 14:14 ` Oren Laadan
  2010-05-01 22:10   ` David Miller
  2010-05-04 14:43   ` David Howells
  2010-05-01 14:14 ` [PATCH v21 002/100] eclone (2/11): Have alloc_pidmap() return actual error code Oren Laadan
                   ` (14 subsequent siblings)
  15 siblings, 2 replies; 26+ messages in thread
From: Oren Laadan @ 2010-05-01 14:14 UTC
  To: Andrew Morton
  Cc: linux-s390, linux-api, containers, x86, linux-kernel,
	linuxppc-dev, Matt Helsley, Serge Hallyn, Sukadev Bhattiprolu,
	Pavel Emelyanov

From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

To simplify alloc_pidmap(), move code to allocate a pid map page to a
separate function.
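
For readers unfamiliar with the pattern being factored out, here is a
minimal userspace sketch of it, not the kernel code: allocate outside the
lock, install under the lock only if the slot is still empty, and free the
allocation if another thread won the race. The names struct slot, slot_lock
and slot_alloc_page() are hypothetical, and a pthread mutex and calloc()
stand in for pidmap_lock and kzalloc().

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

#define SLOT_PAGE_SIZE 4096	/* illustrative size, stands in for PAGE_SIZE */

struct slot {
	void *page;		/* NULL until a page has been installed */
};

static pthread_mutex_t slot_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns 0 if slot->page is populated on return, -ENOMEM otherwise. */
static int slot_alloc_page(struct slot *slot)
{
	void *page;

	assert(slot != NULL);

	/*
	 * Unlocked fast-path check, mirroring the shape of the kernel code.
	 * Strictly portable threaded userspace code would use an atomic load.
	 */
	if (slot->page)
		return 0;

	/* Allocate before taking the lock so the lock is held only briefly. */
	page = calloc(1, SLOT_PAGE_SIZE);

	pthread_mutex_lock(&slot_lock);
	if (!slot->page) {
		slot->page = page;
		page = NULL;	/* ownership moved to the slot */
	}
	pthread_mutex_unlock(&slot_lock);

	/* Frees our copy if we lost the race; free(NULL) is a no-op. */
	free(page);

	return slot->page ? 0 : -ENOMEM;
}
```

A second call on an already populated slot returns 0 without allocating,
which is why the caller can invoke the helper unconditionally.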

Changelog[v4]:
	- [Oren Laadan] Adapt to kernel 2.6.33-rc5
Changelog[v3]:
	- Earlier versions of the patchset called alloc_pidmap_page() from
	  two places, but now it's called from only one place. Even so, moving
	  this code out into a separate function simplifies alloc_pidmap().
Changelog[v2]:
	- (Matt Helsley, Dave Hansen) Have alloc_pidmap_page() return
	  -ENOMEM on error instead of -1.

Cc: linux-api@vger.kernel.org
Cc: x86@kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linuxppc-dev@ozlabs.org
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 kernel/pid.c |   41 ++++++++++++++++++++++++++---------------
 1 files changed, 26 insertions(+), 15 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index aebb30d..52a371a 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -122,6 +122,30 @@ static void free_pidmap(struct upid *upid)
 	atomic_inc(&map->nr_free);
 }
 
+static int alloc_pidmap_page(struct pidmap *map)
+{
+	void *page;
+
+	if (likely(map->page))
+		return 0;
+
+	page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+	/*
+	 * Free the page if someone raced with us installing it:
+	 */
+	spin_lock_irq(&pidmap_lock);
+	if (!map->page) {
+		map->page = page;
+		page = NULL;
+	}
+	spin_unlock_irq(&pidmap_lock);
+	kfree(page);
+	if (unlikely(!map->page))
+		return -ENOMEM;
+
+	return 0;
+}
+
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
@@ -134,22 +158,9 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
 	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
 	for (i = 0; i <= max_scan; ++i) {
-		if (unlikely(!map->page)) {
-			void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
-			/*
-			 * Free the page if someone raced with us
-			 * installing it:
-			 */
-			spin_lock_irq(&pidmap_lock);
-			if (!map->page) {
-				map->page = page;
-				page = NULL;
-			}
-			spin_unlock_irq(&pidmap_lock);
-			kfree(page);
-			if (unlikely(!map->page))
+		if (unlikely(!map->page))
+			if (alloc_pidmap_page(map) < 0)
 				break;
-		}
 		if (likely(atomic_read(&map->nr_free))) {
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
-- 
1.6.3.3


* [PATCH v21 002/100] eclone (2/11): Have alloc_pidmap() return actual error code
       [not found] <1272723382-19470-1-git-send-email-orenl@cs.columbia.edu>
  2010-05-01 14:14 ` [PATCH v21 001/100] eclone (1/11): Factor out code to allocate pidmap page Oren Laadan
@ 2010-05-01 14:14 ` Oren Laadan
  2010-05-01 14:14 ` [PATCH v21 003/100] eclone (3/11): Define set_pidmap() function Oren Laadan
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Oren Laadan @ 2010-05-01 14:14 UTC
  To: Andrew Morton
  Cc: linux-s390, linux-api, containers, x86, linux-kernel,
	linuxppc-dev, Matt Helsley, Serge Hallyn, Sukadev Bhattiprolu,
	Pavel Emelyanov

From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

alloc_pidmap() can fail either because all pid numbers are in use or
because memory allocation failed.  With support for setting a specific
pid number, alloc_pidmap() would also fail if either the given pid
number is invalid or in use.

Rather than have callers assume -ENOMEM, have alloc_pidmap() return
the actual error.
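
As background, the error-pointer convention this patch relies on can be
sketched in userspace as follows. This is a simplified re-implementation
modeled on the kernel's ERR_PTR()/IS_ERR()/PTR_ERR() helpers, not the
kernel's own definitions; the lowercase helper names and alloc_id() are
hypothetical, and it assumes small negative errno values never collide
with valid pointers.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095	/* errno values live in the top 4095 addresses */

/* Encode a negative errno value as a pointer. */
static inline void *err_ptr(long error)
{
	assert(error < 0 && error >= -MAX_ERRNO);
	return (void *)(intptr_t)error;
}

/* True if the pointer is really an encoded errno value. */
static inline int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Decode the errno value from an error pointer. */
static inline long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Hypothetical allocator: reports why it failed instead of returning NULL. */
static int *alloc_id(int slots_free)
{
	int *id;

	if (slots_free <= 0)
		return err_ptr(-EBUSY);		/* nothing left to hand out */

	id = malloc(sizeof(*id));
	if (!id)
		return err_ptr(-ENOMEM);	/* the allocation itself failed */

	*id = slots_free;
	return id;
}
```

A caller tests the result with is_err() and, on failure, propagates
ptr_err() rather than assuming a single error code.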

Changelog[v1]:
	- [Oren Laadan] Rebase to kernel 2.6.33

Cc: linux-api@vger.kernel.org
Cc: x86@kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linuxppc-dev@ozlabs.org
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 kernel/fork.c |    5 +++--
 kernel/pid.c  |   10 ++++++----
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 44b0791..afdfb08 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1147,10 +1147,11 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
-		retval = -ENOMEM;
 		pid = alloc_pid(p->nsproxy->pid_ns);
-		if (!pid)
+		if (IS_ERR(pid)) {
+			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_io;
+		}
 
 		if (clone_flags & CLONE_NEWPID) {
 			retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
diff --git a/kernel/pid.c b/kernel/pid.c
index 52a371a..8330488 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -160,7 +160,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	for (i = 0; i <= max_scan; ++i) {
 		if (unlikely(!map->page))
 			if (alloc_pidmap_page(map) < 0)
-				break;
+				return -ENOMEM;
 		if (likely(atomic_read(&map->nr_free))) {
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
@@ -191,7 +191,7 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 		}
 		pid = mk_pid(pid_ns, map, offset);
 	}
-	return -1;
+	return -EBUSY;
 }
 
 int next_pidmap(struct pid_namespace *pid_ns, int last)
@@ -260,8 +260,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 	struct upid *upid;
 
 	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
-	if (!pid)
+	if (!pid) {
+		pid = ERR_PTR(-ENOMEM);
 		goto out;
+	}
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
@@ -295,7 +297,7 @@ out_free:
 		free_pidmap(pid->numbers + i);
 
 	kmem_cache_free(ns->pid_cachep, pid);
-	pid = NULL;
+	pid = ERR_PTR(nr);
 	goto out;
 }
 
-- 
1.6.3.3


* [PATCH v21 003/100] eclone (3/11): Define set_pidmap() function
       [not found] <1272723382-19470-1-git-send-email-orenl@cs.columbia.edu>
  2010-05-01 14:14 ` [PATCH v21 001/100] eclone (1/11): Factor out code to allocate pidmap page Oren Laadan
  2010-05-01 14:14 ` [PATCH v21 002/100] eclone (2/11): Have alloc_pidmap() return actual error code Oren Laadan
@ 2010-05-01 14:14 ` Oren Laadan
  2010-05-01 14:14 ` [PATCH v21 004/100] eclone (4/11): Add target_pids parameter to alloc_pid() Oren Laadan
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Oren Laadan @ 2010-05-01 14:14 UTC
  To: Andrew Morton
  Cc: linux-s390, linux-api, containers, x86, linux-kernel,
	Sukadev Bhattiprolu, linuxppc-dev, Matt Helsley, Serge Hallyn,
	Sukadev Bhattiprolu, Pavel Emelyanov

From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

Define a set_pidmap() interface which is like alloc_pidmap(), except that
the caller specifies the pid number to be assigned.
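
The relationship between the two entry points can be illustrated with a
small userspace sketch. It is a simplification under stated assumptions,
not the kernel algorithm: a flat bool array replaces the paged bitmap, the
reserved-range check on the target is omitted, and struct id_space,
do_alloc_id(), alloc_id(), set_id(), ID_MAX and ID_RESERVED are all
hypothetical names. What it keeps is the idea that a specific id is
requested by narrowing the search range to [target, target + 1).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define ID_MAX      64	/* illustrative; stands in for pid_max */
#define ID_RESERVED 4	/* illustrative; stands in for RESERVED_PIDS */

struct id_space {
	bool used[ID_MAX];
	int last;		/* most recently auto-assigned id */
};

/* Scan [min, max) for a free id, starting just after 'last' and wrapping. */
static int do_alloc_id(struct id_space *s, int last, int min, int max)
{
	int span = max - min;
	int start = last + 1;
	int i;

	assert(s && min >= 0 && min < max && max <= ID_MAX);
	if (start < min || start >= max)
		start = min;

	for (i = 0; i < span; i++) {
		int id = min + (start - min + i) % span;

		if (!s->used[id]) {
			s->used[id] = true;
			return id;
		}
	}
	return -EBUSY;		/* every id in the range is taken */
}

/* Pick any free id and remember it as the new starting point. */
static int alloc_id(struct id_space *s)
{
	int id = do_alloc_id(s, s->last, ID_RESERVED, ID_MAX);

	if (id >= 0)
		s->last = id;
	return id;
}

/* Request one specific id; a target of 0 means "any id". */
static int set_id(struct id_space *s, int target)
{
	if (!target)
		return alloc_id(s);
	if (target < 0 || target >= ID_MAX)
		return -EINVAL;
	/* A one-element range: either 'target' is free or we get -EBUSY. */
	return do_alloc_id(s, target - 1, target, target + 1);
}
```

Note that only alloc_id() updates 'last'; an explicitly requested id does
not move the point from which later automatic allocations resume.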

Changelog[v13]:
	- Don't let do_alloc_pidmap return 0 if it failed to find a pid.
Changelog[v9]:
	- Completely rewrote this patch based on Eric Biederman's code.
Changelog[v7]:
	- [Eric Biederman] Generalize alloc_pidmap() to take a range of pids.
Changelog[v6]:
	- Separate target_pid > 0 case to minimize the number of checks needed.
Changelog[v3]:
	- (Eric Biederman): Avoid set_pidmap() function. Added a couple of
	  checks for target_pid in alloc_pidmap() itself.
Changelog[v2]:
	- (Serge Hallyn) Check for 'pid < 0' in set_pidmap(). (Code
	  actually checks for 'pid <= 0' for completeness).

Cc: linux-api@vger.kernel.org
Cc: x86@kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linuxppc-dev@ozlabs.org
Signed-off-by: Sukadev Bhattiprolu <sukadev@us.ibm.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 kernel/pid.c |   41 +++++++++++++++++++++++++++++++++--------
 1 files changed, 33 insertions(+), 8 deletions(-)

diff --git a/kernel/pid.c b/kernel/pid.c
index 8330488..4eaf975 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -146,17 +146,18 @@ static int alloc_pidmap_page(struct pidmap *map)
 	return 0;
 }
 
-static int alloc_pidmap(struct pid_namespace *pid_ns)
+static int do_alloc_pidmap(struct pid_namespace *pid_ns, int last, int min,
+		int max)
 {
-	int i, offset, max_scan, pid, last = pid_ns->last_pid;
+	int i, offset, max_scan, pid;
 	struct pidmap *map;
 
 	pid = last + 1;
 	if (pid >= pid_max)
-		pid = RESERVED_PIDS;
+		pid = min;
 	offset = pid & BITS_PER_PAGE_MASK;
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
-	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
+	max_scan = (max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
 	for (i = 0; i <= max_scan; ++i) {
 		if (unlikely(!map->page))
 			if (alloc_pidmap_page(map) < 0)
@@ -165,7 +166,6 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
 					atomic_dec(&map->nr_free);
-					pid_ns->last_pid = pid;
 					return pid;
 				}
 				offset = find_next_offset(map, offset);
@@ -176,16 +176,16 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 			 * bitmap block and the final block was the same
 			 * as the starting point, pid is before last_pid.
 			 */
-			} while (offset < BITS_PER_PAGE && pid < pid_max &&
+			} while (offset < BITS_PER_PAGE && pid < max &&
 					(i != max_scan || pid < last ||
 					    !((last+1) & BITS_PER_PAGE_MASK)));
 		}
-		if (map < &pid_ns->pidmap[(pid_max-1)/BITS_PER_PAGE]) {
+		if (map < &pid_ns->pidmap[(max-1)/BITS_PER_PAGE]) {
 			++map;
 			offset = 0;
 		} else {
 			map = &pid_ns->pidmap[0];
-			offset = RESERVED_PIDS;
+			offset = min;
 			if (unlikely(last == offset))
 				break;
 		}
@@ -194,6 +194,31 @@ static int alloc_pidmap(struct pid_namespace *pid_ns)
 	return -EBUSY;
 }
 
+static int alloc_pidmap(struct pid_namespace *pid_ns)
+{
+	int nr;
+
+	nr = do_alloc_pidmap(pid_ns, pid_ns->last_pid, RESERVED_PIDS, pid_max);
+	if (nr >= 0)
+		pid_ns->last_pid = nr;
+	return nr;
+}
+
+static int set_pidmap(struct pid_namespace *pid_ns, int target)
+{
+	if (!target)
+		return alloc_pidmap(pid_ns);
+
+	if (target >= pid_max)
+		return -EINVAL;
+
+	if ((target < 0) || (target < RESERVED_PIDS &&
+				pid_ns->last_pid >= RESERVED_PIDS))
+		return -EINVAL;
+
+	return do_alloc_pidmap(pid_ns, target - 1, target, target + 1);
+}
+
 int next_pidmap(struct pid_namespace *pid_ns, int last)
 {
 	int offset;
-- 
1.6.3.3


* [PATCH v21 004/100] eclone (4/11): Add target_pids parameter to alloc_pid()
       [not found] <1272723382-19470-1-git-send-email-orenl@cs.columbia.edu>
                   ` (2 preceding siblings ...)
  2010-05-01 14:14 ` [PATCH v21 003/100] eclone (3/11): Define set_pidmap() function Oren Laadan
@ 2010-05-01 14:14 ` Oren Laadan
  2010-05-01 14:14 ` [PATCH v21 005/100] eclone (5/11): Add target_pids parameter to copy_process() Oren Laadan
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Oren Laadan @ 2010-05-01 14:14 UTC
  To: Andrew Morton
  Cc: containers, linux-kernel, Serge Hallyn, Matt Helsley,
	Pavel Emelyanov, Sukadev Bhattiprolu, linux-api, x86, linux-s390,
	linuxppc-dev

From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

This parameter is currently NULL, but will be used in a follow-on patch.

Cc: linux-api@vger.kernel.org
Cc: x86@kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linuxppc-dev@ozlabs.org
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 include/linux/pid.h |    2 +-
 kernel/fork.c       |    3 ++-
 kernel/pid.c        |    9 +++++++--
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/include/linux/pid.h b/include/linux/pid.h
index 49f1c2f..914185d 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -119,7 +119,7 @@ extern struct pid *find_get_pid(int nr);
 extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
 int next_pidmap(struct pid_namespace *pid_ns, int last);
 
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids);
 extern void free_pid(struct pid *pid);
 
 /*
diff --git a/kernel/fork.c b/kernel/fork.c
index afdfb08..62018c8 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -962,6 +962,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
+	pid_t *target_pids = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1147,7 +1148,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
-		pid = alloc_pid(p->nsproxy->pid_ns);
+		pid = alloc_pid(p->nsproxy->pid_ns, target_pids);
 		if (IS_ERR(pid)) {
 			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_io;
diff --git a/kernel/pid.c b/kernel/pid.c
index 4eaf975..57f1344 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -276,13 +276,14 @@ void free_pid(struct pid *pid)
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
 
-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t *target_pids)
 {
 	struct pid *pid;
 	enum pid_type type;
 	int i, nr;
 	struct pid_namespace *tmp;
 	struct upid *upid;
+	pid_t tpid;
 
 	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
 	if (!pid) {
@@ -292,7 +293,11 @@ struct pid *alloc_pid(struct pid_namespace *ns)
 
 	tmp = ns;
 	for (i = ns->level; i >= 0; i--) {
-		nr = alloc_pidmap(tmp);
+		tpid = 0;
+		if (target_pids)
+			tpid = target_pids[i];
+
+		nr = set_pidmap(tmp, tpid);
 		if (nr < 0)
 			goto out_free;
 
-- 
1.6.3.3


* [PATCH v21 005/100] eclone (5/11): Add target_pids parameter to copy_process()
       [not found] <1272723382-19470-1-git-send-email-orenl@cs.columbia.edu>
                   ` (3 preceding siblings ...)
  2010-05-01 14:14 ` [PATCH v21 004/100] eclone (4/11): Add target_pids parameter to alloc_pid() Oren Laadan
@ 2010-05-01 14:14 ` Oren Laadan
  2010-05-01 14:14 ` [PATCH v21 006/100] eclone (6/11): Check invalid clone flags Oren Laadan
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Oren Laadan @ 2010-05-01 14:14 UTC
  To: Andrew Morton
  Cc: containers, linux-kernel, Serge Hallyn, Matt Helsley,
	Pavel Emelyanov, Sukadev Bhattiprolu, linux-api, x86, linux-s390,
	linuxppc-dev, Oleg Nesterov

From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

Add a 'target_pids' parameter to copy_process().  The new parameter will be
used in a follow-on patch when eclone() is implemented.

Cc: linux-api@vger.kernel.org
Cc: x86@kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linuxppc-dev@ozlabs.org
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 kernel/fork.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 62018c8..9d2b57e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -957,12 +957,12 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 					unsigned long stack_size,
 					int __user *child_tidptr,
 					struct pid *pid,
+					pid_t *target_pids,
 					int trace)
 {
 	int retval;
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
-	pid_t *target_pids = NULL;
 
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
@@ -1339,7 +1339,7 @@ struct task_struct * __cpuinit fork_idle(int cpu)
 	struct pt_regs regs;
 
 	task = copy_process(CLONE_VM, 0, idle_regs(&regs), 0, NULL,
-			    &init_struct_pid, 0);
+			    &init_struct_pid, NULL, 0);
 	if (!IS_ERR(task))
 		init_idle(task, cpu);
 
@@ -1362,6 +1362,7 @@ long do_fork(unsigned long clone_flags,
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
+	pid_t *target_pids = NULL;
 
 	/*
 	 * Do some preliminary argument and permissions checking before we
@@ -1402,7 +1403,7 @@ long do_fork(unsigned long clone_flags,
 		trace = tracehook_prepare_clone(clone_flags);
 
 	p = copy_process(clone_flags, stack_start, regs, stack_size,
-			 child_tidptr, NULL, trace);
+			 child_tidptr, NULL, target_pids, trace);
 	/*
 	 * Do this prior waking up the new thread - the thread pointer
 	 * might get invalid after that point, if the thread exits quickly.
-- 
1.6.3.3


* [PATCH v21 006/100] eclone (6/11): Check invalid clone flags
       [not found] <1272723382-19470-1-git-send-email-orenl@cs.columbia.edu>
                   ` (4 preceding siblings ...)
  2010-05-01 14:14 ` [PATCH v21 005/100] eclone (5/11): Add target_pids parameter to copy_process() Oren Laadan
@ 2010-05-01 14:14 ` Oren Laadan
  2010-05-01 14:14 ` [PATCH v21 007/100] eclone (7/11): Define do_fork_with_pids() Oren Laadan
                   ` (9 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Oren Laadan @ 2010-05-01 14:14 UTC
  To: Andrew Morton
  Cc: linux-s390, linux-api, containers, x86, linux-kernel,
	Oleg Nesterov, linuxppc-dev, Matt Helsley, Serge Hallyn,
	Sukadev Bhattiprolu, Pavel Emelyanov

From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

As pointed out by Oren Laadan, we want to ensure that unused bits in the
clone-flags remain unused and available for future use. To ensure this, define
a mask of valid clone-flags and check the flags in the clone() system calls.
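
The idea of rejecting bits outside a mask of known flags can be sketched
in userspace as below. The OPT_* names and values and check_opts() are
hypothetical and chosen only for illustration; they are not the kernel's
CLONE_* definitions. The second test in check_opts() mimics the style of
the existing CLONE_NEWNS|CLONE_FS combination check.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag bits for illustration; not the kernel's CLONE_* values. */
#define OPT_SHARE_VM    0x01UL
#define OPT_SHARE_FS    0x02UL
#define OPT_LEGACY_HOLE 0x04UL	/* unused today, tolerated for old callers */
#define OPT_NEW_NS      0x08UL

#define VALID_OPTS (OPT_SHARE_VM | OPT_SHARE_FS | OPT_LEGACY_HOLE | OPT_NEW_NS)

/* C11 compile-time check that the mask covers only the four bits above. */
static_assert((VALID_OPTS & ~0x0fUL) == 0, "mask must stay within defined bits");

/* Returns 0 if 'opts' is acceptable, -EINVAL otherwise. */
static int check_opts(unsigned long opts)
{
	/* Any bit outside the mask is reserved for future use: reject it now. */
	if (opts & ~VALID_OPTS)
		return -EINVAL;

	/* Reject a combination of individually valid but conflicting flags. */
	if ((opts & (OPT_NEW_NS | OPT_SHARE_FS)) == (OPT_NEW_NS | OPT_SHARE_FS))
		return -EINVAL;

	return 0;
}
```

Keeping the legacy hole inside the mask means callers that happened to set
that bit keep working, while genuinely undefined bits fail early.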

Changelog[v9]:
	- Include the unused clone-flag (CLONE_UNUSED) to VALID_CLONE_FLAGS
	  to avoid breaking any applications that may have set it. IOW, this
	  patch/check only applies to clone-flags bits 33 and higher.

Changelog[v8]:
	- New patch in set

Cc: linux-api@vger.kernel.org
Cc: x86@kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linuxppc-dev@ozlabs.org
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
---
 include/linux/sched.h |   12 ++++++++++++
 kernel/fork.c         |    3 +++
 2 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index dad7f66..5de3ce5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -29,6 +29,18 @@
 #define CLONE_NEWNET		0x40000000	/* New network namespace */
 #define CLONE_IO		0x80000000	/* Clone io context */
 
+#define CLONE_UNUSED		0x00001000	/* Can be reused ? */
+
+#define VALID_CLONE_FLAGS	(CSIGNAL | CLONE_VM | CLONE_FS | CLONE_FILES |\
+				 CLONE_SIGHAND | CLONE_UNUSED | CLONE_PTRACE |\
+				 CLONE_VFORK  | CLONE_PARENT | CLONE_THREAD  |\
+				 CLONE_NEWNS  | CLONE_SYSVSEM | CLONE_SETTLS |\
+				 CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID  |\
+				 CLONE_DETACHED | CLONE_UNTRACED             |\
+				 CLONE_CHILD_SETTID | CLONE_STOPPED          |\
+				 CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWUSER |\
+				 CLONE_NEWPID | CLONE_NEWNET | CLONE_IO)
+
 /*
  * Scheduling policies
  */
diff --git a/kernel/fork.c b/kernel/fork.c
index 9d2b57e..e41b3d1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -964,6 +964,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	struct task_struct *p;
 	int cgroup_callbacks_done = 0;
 
+	if (clone_flags & ~VALID_CLONE_FLAGS)
+		return ERR_PTR(-EINVAL);
+
 	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
 		return ERR_PTR(-EINVAL);
 
-- 
1.6.3.3


* [PATCH v21 007/100] eclone (7/11): Define do_fork_with_pids()
       [not found] <1272723382-19470-1-git-send-email-orenl@cs.columbia.edu>
                   ` (5 preceding siblings ...)
  2010-05-01 14:14 ` [PATCH v21 006/100] eclone (6/11): Check invalid clone flags Oren Laadan
@ 2010-05-01 14:14 ` Oren Laadan
  2010-05-01 14:14 ` [PATCH v21 008/100] eclone (8/11): Implement sys_eclone for x86 (32, 64) Oren Laadan
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Oren Laadan @ 2010-05-01 14:14 UTC
  To: Andrew Morton
  Cc: containers, linux-kernel, Serge Hallyn, Matt Helsley,
	Pavel Emelyanov, Sukadev Bhattiprolu, linux-api, x86, linux-s390,
	linuxppc-dev, Oleg Nesterov

From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

do_fork_with_pids() is the same as do_fork(), except that it takes two
additional parameters, 'num_pids' and 'upids'. These parameters, currently
unused, specify the set of target pids of the process in each of its pid
namespaces.

Changelog[v7]:
	- Drop 'struct pid_set' object and pass in 'pid_t *target_pids'
	  instead of 'struct pid_set *'.

Changelog[v6]:
	- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
	  Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
	  is constant across architectures.
	- (Nathan Lynch) Change 'pid_set.num_pids' to 'unsigned int'.

Changelog[v4]:
	- Rename 'struct target_pid_set' to 'struct pid_set' since it may
	  be useful in other contexts.

Changelog[v3]:
	- Fix "long-line" warning from checkpatch.pl

Changelog[v2]:
	- To facilitate moving architecture-independent code to kernel/fork.c
	  pass in 'struct target_pid_set __user *' to do_fork_with_pids()
	  rather than 'pid_t *' (next patch moves the arch-independent
	  code to kernel/fork.c)

Cc: linux-api@vger.kernel.org
Cc: x86@kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linuxppc-dev@ozlabs.org
Cc: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
Reviewed-by: Oren Laadan <orenl@cs.columbia.edu>
---
 include/linux/sched.h |    3 +++
 kernel/fork.c         |   17 +++++++++++++++--
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5de3ce5..f4ae3e3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2129,6 +2129,9 @@ extern int disallow_signal(int);
 
 extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
 extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
+extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *,
+				unsigned long, int __user *, int __user *,
+				unsigned int, pid_t __user *);
 struct task_struct *fork_idle(int);
 
 extern void set_task_comm(struct task_struct *tsk, char *from);
diff --git a/kernel/fork.c b/kernel/fork.c
index e41b3d1..2559d7a 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1355,12 +1355,14 @@ struct task_struct * __cpuinit fork_idle(int cpu)
  * It copies the process, and if successful kick-starts
  * it and waits for it to finish using the VM if required.
  */
-long do_fork(unsigned long clone_flags,
+long do_fork_with_pids(unsigned long clone_flags,
 	      unsigned long stack_start,
 	      struct pt_regs *regs,
 	      unsigned long stack_size,
 	      int __user *parent_tidptr,
-	      int __user *child_tidptr)
+	      int __user *child_tidptr,
+	      unsigned int num_pids,
+	      pid_t __user *upids)
 {
 	struct task_struct *p;
 	int trace = 0;
@@ -1463,6 +1465,17 @@ long do_fork(unsigned long clone_flags,
 	return nr;
 }
 
+long do_fork(unsigned long clone_flags,
+	      unsigned long stack_start,
+	      struct pt_regs *regs,
+	      unsigned long stack_size,
+	      int __user *parent_tidptr,
+	      int __user *child_tidptr)
+{
+	return do_fork_with_pids(clone_flags, stack_start, regs, stack_size,
+			parent_tidptr, child_tidptr, 0, NULL);
+}
+
 #ifndef ARCH_MIN_MMSTRUCT_ALIGN
 #define ARCH_MIN_MMSTRUCT_ALIGN 0
 #endif
-- 
1.6.3.3


* [PATCH v21 008/100] eclone (8/11): Implement sys_eclone for x86 (32, 64)
       [not found] <1272723382-19470-1-git-send-email-orenl@cs.columbia.edu>
                   ` (6 preceding siblings ...)
  2010-05-01 14:14 ` [PATCH v21 007/100] eclone (7/11): Define do_fork_with_pids() Oren Laadan
@ 2010-05-01 14:14 ` Oren Laadan
  2010-05-01 14:14 ` [PATCH v21 009/100] eclone (9/11): Implement sys_eclone for s390 Oren Laadan
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Oren Laadan @ 2010-05-01 14:14 UTC
  To: Andrew Morton
  Cc: linux-s390, linux-api, containers, x86, linux-kernel,
	linuxppc-dev, Matt Helsley, Serge Hallyn, Sukadev Bhattiprolu,
	Pavel Emelyanov

From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

Container restart requires that a task have the same pid it had when it was
checkpointed. When containers are nested the tasks within the containers
exist in multiple pid namespaces and hence have multiple pids to specify
during restart.

eclone(), intended for use during restart, is the same as
clone(), except that it takes a 'pids' parameter. This parameter lets
the caller choose specific pid numbers for the child process, in the
process's active and ancestor pid namespaces. (Descendant pid namespaces
in general don't matter since processes don't have pids in them anyway,
but see comments in copy_target_pids() regarding CLONE_NEWPID).

eclone() also attempts to address a second limitation of the
clone() system call. clone() is restricted to 32 clone flags and all but
one of these are in use. If more new clone flags are needed, we will be
forced to define a new variant of the clone() system call. To address
this, eclone() allows at least 64 clone flags with some room
for more if necessary.

To prevent unprivileged processes from misusing this interface,
eclone() currently needs CAP_SYS_ADMIN, when the 'pids' parameter
is non-NULL.

See Documentation/eclone in next patch for more details and an
example of its usage.

NOTE:
	- System calls are restricted to 6 parameters and the number and sizes
	  of parameters needed for eclone() exceed 6 integers. The new
	  prototype works around this restriction while providing some
	  flexibility if eclone() needs to be further extended in the
	  future.
TODO:
	- We should convert clone-flags to 64-bit value in all architectures.
	  It's probably best to do that as a separate patchset since clone_flags
	  touches several functions and that patchset seems independent of this
	  new system call.

Changelog[v14]:
	- [Oren Laadan] Rebase to kernel 2.6.33
	  * introduce PTREGSCALL4 for sys_eclone
	  * consolidate syscall definitions for 32/64 bit
	- [Oren Laadan] Merge x86_64 (trivial patch) with current
	- [Serge Hallyn] Add eclone stub for ia32 eclone

Changelog[v13]:
	- [Dave Hansen]: Reorg to enable sharing code between x86 and x86-64.
	- [Arnd Bergmann]: With args_size parameter, ->reserved1 is redundant
	  and can be removed.
	- [Nathan Lynch]: stop warnings about assigning u64 to a (32-bit) int*.
	- [Nathan Lynch, Serge Hallyn] Rename ->child_stack_base to
	  ->child_stack and ensure ->child_stack_size is 0 on architectures
	  that don't need it (see comments in types.h for details).

Changelog[v12]:
	- [Serge Hallyn] Ignore ->child_stack_size if ->child_stack_base
	  is NULL.
	- [Oren Laadan, Serge Hallyn] Rename clone_with_pids() to eclone()
Changelog[v11]:
	- [Dave Hansen] Move clone_args validation checks to arch-independent
	  code.
	- [Oren Laadan] Make args_size a parameter to system call and remove
	  it from 'struct clone_args'

Changelog[v10]:
	- Rename clone3() to clone_with_pids()
	- [Linus Torvalds] Use PTREGSCALL() rather than the generic syscall
	  implementation

Changelog[v9]:
	- [Roland McGrath, H. Peter Anvin] To avoid confusion on 64-bit
	  architectures split the new clone-flags into 'low' and 'high'
	  words and pass in the 'lower' flags as the first argument.
	  This would maintain similarity of the clone3() with clone()/
	  clone2(). Also has the side-effect of the name matching the
	  number of parameters :-)
	- [Roland McGrath] Rename structure to 'clone_args' and add a
	  'child_stack_size' field

Changelog[v8]:
	- [Oren Laadan] parent_tid and child_tid fields in 'struct clone_arg'
	  must be 64-bit.
	- clone2() is in use in IA64. Rename system call to clone3().

Changelog[v7]:
	- [Peter Zijlstra, Arnd Bergmann] Rename system call to clone2()
	  and group parameters into a new 'struct clone_struct' object.

Changelog[v6]:
	- (Nathan Lynch, Arnd Bergmann, H. Peter Anvin, Linus Torvalds)
	  Change 'pid_set.pids' to a 'pid_t pids[]' so size of 'struct pid_set'
	  is constant across architectures.
	- (Nathan Lynch) Change pid_set.num_pids to unsigned and remove
	  'unum_pids < 0' check.

Changelog[v4]:
	- (Oren Laadan) rename 'struct target_pid_set' to 'struct pid_set'

Changelog[v3]:
	- (Oren Laadan) Allow CLONE_NEWPID flag (by allocating an extra pid
	  in the target_pids[] list and setting it 0. See copy_target_pids()).
	- (Oren Laadan) Specified target pids should apply only to youngest
	  pid-namespaces (see copy_target_pids())
	- (Matt Helsley) Update patch description.

Changelog[v2]:
	- Remove unnecessary printk and add a note to callers of
	  copy_target_pids() to free target_pids.
	- (Serge Hallyn) Mention CAP_SYS_ADMIN restriction in patch description.
	- (Oren Laadan) Add checks for 'num_pids < 0' (return -EINVAL) and
	  'num_pids == 0' (fall back to normal clone()).
	- Move arch-independent code (sanity checks and copy-in of target-pids)
	  into kernel/fork.c and simplify sys_clone_with_pids()

Changelog[v1]:
	- Fixed some compile errors (had fixed these errors earlier in my
	  git tree but had not refreshed patches before emailing them)

Cc: linux-api@vger.kernel.org
Cc: x86@kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linuxppc-dev@ozlabs.org
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
---
 arch/x86/ia32/ia32entry.S          |    2 +
 arch/x86/include/asm/syscalls.h    |    2 +
 arch/x86/include/asm/unistd_32.h   |    3 +-
 arch/x86/include/asm/unistd_64.h   |    2 +
 arch/x86/kernel/entry_32.S         |   14 ++++
 arch/x86/kernel/entry_64.S         |    1 +
 arch/x86/kernel/process.c          |   40 +++++++++++-
 arch/x86/kernel/syscall_table_32.S |    1 +
 include/linux/sched.h              |    2 +
 include/linux/types.h              |   16 +++++
 kernel/fork.c                      |  124 +++++++++++++++++++++++++++++++++++-
 11 files changed, 204 insertions(+), 3 deletions(-)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index 59b4556..b7f3f34 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -477,6 +477,7 @@ quiet_ni_syscall:
 	PTREGSCALL stub32_clone, sys32_clone, %rdx
 	PTREGSCALL stub32_vfork, sys_vfork, %rdi
 	PTREGSCALL stub32_iopl, sys_iopl, %rsi
+	PTREGSCALL stub32_eclone, sys_eclone, %r8
 
 ENTRY(ia32_ptregs_common)
 	popq %r11
@@ -842,4 +843,5 @@ ia32_sys_call_table:
 	.quad compat_sys_rt_tgsigqueueinfo	/* 335 */
 	.quad sys_perf_event_open
 	.quad compat_sys_recvmmsg
+	.quad stub32_eclone
 ia32_syscall_end:
diff --git a/arch/x86/include/asm/syscalls.h b/arch/x86/include/asm/syscalls.h
index 5c044b4..d525677 100644
--- a/arch/x86/include/asm/syscalls.h
+++ b/arch/x86/include/asm/syscalls.h
@@ -27,6 +27,8 @@ long sys_execve(char __user *, char __user * __user *,
 		char __user * __user *, struct pt_regs *);
 long sys_clone(unsigned long, unsigned long, void __user *,
 	       void __user *, struct pt_regs *);
+long sys_eclone(unsigned flags_low, struct clone_args __user *uca,
+		int args_size, pid_t __user *pids, struct pt_regs *regs);
 
 /* kernel/ldt.c */
 asmlinkage int sys_modify_ldt(int, void __user *, unsigned long);
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index beb9b5f..e543b0e 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -343,10 +343,11 @@
 #define __NR_rt_tgsigqueueinfo	335
 #define __NR_perf_event_open	336
 #define __NR_recvmmsg		337
+#define __NR_eclone		338
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 338
+#define NR_syscalls 339
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index ff4307b..1cd16af 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -663,6 +663,8 @@ __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
 __SYSCALL(__NR_perf_event_open, sys_perf_event_open)
 #define __NR_recvmmsg				299
 __SYSCALL(__NR_recvmmsg, sys_recvmmsg)
+#define __NR_eclone				300
+__SYSCALL(__NR_eclone, stub_eclone)
 
 #ifndef __NO_STUBS
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index 44a8e0d..65e1735 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -758,6 +758,19 @@ ptregs_##name: \
 	addl $4,%esp; \
 	ret
 
+#define PTREGSCALL4(name) \
+	ALIGN; \
+ptregs_##name: \
+	leal 4(%esp),%eax; \
+	pushl %eax; \
+	pushl PT_ESI(%eax); \
+	movl PT_EDX(%eax),%ecx; \
+	movl PT_ECX(%eax),%edx; \
+	movl PT_EBX(%eax),%eax; \
+	call sys_##name; \
+	addl $8,%esp; \
+	ret
+
 PTREGSCALL1(iopl)
 PTREGSCALL0(fork)
 PTREGSCALL0(vfork)
@@ -767,6 +780,7 @@ PTREGSCALL0(sigreturn)
 PTREGSCALL0(rt_sigreturn)
 PTREGSCALL2(vm86)
 PTREGSCALL1(vm86old)
+PTREGSCALL4(eclone)
 
 /* Clone is an oddball.  The 4th arg is in %edi */
 	ALIGN;
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 0697ff1..216681e 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -698,6 +698,7 @@ END(\label)
 	PTREGSCALL stub_vfork, sys_vfork, %rdi
 	PTREGSCALL stub_sigaltstack, sys_sigaltstack, %rdx
 	PTREGSCALL stub_iopl, sys_iopl, %rsi
+	PTREGSCALL stub_eclone, sys_eclone, %r8
 
 ENTRY(ptregscall_common)
 	DEFAULT_FRAME 1 8	/* offset 8: return address */
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 28ad9f4..5abad20 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -259,6 +259,45 @@ sys_clone(unsigned long clone_flags, unsigned long newsp,
 	return do_fork(clone_flags, newsp, regs, 0, parent_tid, child_tid);
 }
 
+long
+sys_eclone(unsigned flags_low, struct clone_args __user *uca,
+	   int args_size, pid_t __user *pids, struct pt_regs *regs)
+{
+	int rc;
+	struct clone_args kca;
+	unsigned long flags;
+	int __user *parent_tidp;
+	int __user *child_tidp;
+	unsigned long __user stack;
+	unsigned long stack_size;
+
+	rc = fetch_clone_args_from_user(uca, args_size, &kca);
+	if (rc)
+		return rc;
+
+	/*
+	 * TODO: Convert 'clone-flags' to 64-bits on all architectures.
+	 * TODO: When ->clone_flags_high is non-zero, copy it in to the
+	 *	 higher word(s) of 'flags':
+	 *
+	 *	 flags = (kca.clone_flags_high << 32) | flags_low;
+	 */
+	flags = flags_low;
+	parent_tidp = (int __user *)(unsigned long)kca.parent_tid_ptr;
+	child_tidp = (int __user *)(unsigned long)kca.child_tid_ptr;
+
+	stack_size = (unsigned long)kca.child_stack_size;
+	if (stack_size)
+		return -EINVAL;
+
+	stack = (unsigned long)kca.child_stack;
+	if (!stack)
+		stack = regs->sp;
+
+	return do_fork_with_pids(flags, stack, regs, stack_size, parent_tidp,
+				child_tidp, kca.nr_pids, pids);
+}
+
 /*
  * This gets run with %si containing the
  * function to call, and %di containing
@@ -700,4 +739,3 @@ unsigned long arch_randomize_brk(struct mm_struct *mm)
 	unsigned long range_end = mm->brk + 0x02000000;
 	return randomize_range(mm->brk, range_end, 0) ? : mm->brk;
 }
-
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index 8b37293..0c92570 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -337,3 +337,4 @@ ENTRY(sys_call_table)
 	.long sys_rt_tgsigqueueinfo	/* 335 */
 	.long sys_perf_event_open
 	.long sys_recvmmsg
+	.long ptregs_eclone
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f4ae3e3..8593051 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2129,6 +2129,8 @@ extern int disallow_signal(int);
 
 extern int do_execve(char *, char __user * __user *, char __user * __user *, struct pt_regs *);
 extern long do_fork(unsigned long, unsigned long, struct pt_regs *, unsigned long, int __user *, int __user *);
+extern int fetch_clone_args_from_user(struct clone_args __user *, int,
+				struct clone_args *);
 extern long do_fork_with_pids(unsigned long, unsigned long, struct pt_regs *,
 				unsigned long, int __user *, int __user *,
 				unsigned int, pid_t __user *);
diff --git a/include/linux/types.h b/include/linux/types.h
index c42724f..d8bfd6b 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -204,6 +204,22 @@ struct ustat {
 	char			f_fpack[6];
 };
 
+struct clone_args {
+	u64 clone_flags_high;
+	/*
+	 * Architectures can use child_stack for either the stack pointer or
+	 * the base of the stack. If child_stack is used as the stack pointer,
+	 * child_stack_size must be 0. Otherwise child_stack_size must be
+	 * set to size of allocated stack.
+	 */
+	u64 child_stack;
+	u64 child_stack_size;
+	u64 parent_tid_ptr;
+	u64 child_tid_ptr;
+	u32 nr_pids;
+	u32 reserved0;
+};
+
 #endif	/* __KERNEL__ */
 #endif /*  __ASSEMBLY__ */
 #endif /* _LINUX_TYPES_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index 2559d7a..9d5be5c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1350,6 +1350,114 @@ struct task_struct * __cpuinit fork_idle(int cpu)
 }
 
 /*
+ * If user specified any 'target-pids' in @upid_setp, copy them from
+ * user and return a pointer to a local copy of the list of pids. The
+ * caller must free the list, when they are done using it.
+ *
+ * If user did not specify any target pids, return NULL (caller should
+ * treat this like normal clone).
+ *
+ * On any errors, return the error code
+ */
+static pid_t *copy_target_pids(int unum_pids, pid_t __user *upids)
+{
+	int j;
+	int rc;
+	int size;
+	int knum_pids;		/* # of pids needed in kernel */
+	pid_t *target_pids;
+
+	if (!unum_pids)
+		return NULL;
+
+	knum_pids = task_pid(current)->level + 1;
+	if (unum_pids < 0 || unum_pids > knum_pids)
+		return ERR_PTR(-EINVAL);
+
+	/*
+	 * To keep alloc_pid() simple, allocate an extra pid_t in target_pids[]
+	 * and set it to 0. This last entry in target_pids[] corresponds to the
+	 * (yet-to-be-created) descendant pid-namespace if CLONE_NEWPID was
+	 * specified. If CLONE_NEWPID was not specified, this last entry will
+	 * simply be ignored.
+	 */
+	target_pids = kzalloc((knum_pids + 1) * sizeof(pid_t), GFP_KERNEL);
+	if (!target_pids)
+		return ERR_PTR(-ENOMEM);
+
+	/*
+	 * A process running in a level 2 pid namespace has three pid namespaces
+	 * and hence three pid numbers. If this process is checkpointed,
+	 * information about these three namespaces is saved. We refer to these
+	 * namespaces as 'known namespaces'.
+	 *
+	 * If this checkpointed process is however restarted in a level 3 pid
+	 * namespace, the restarted process has an extra ancestor pid namespace
+	 * (i.e 'unknown namespace') and 'knum_pids' exceeds 'unum_pids'.
+	 *
+	 * During restart, the process requests specific pids for its 'known
+	 * namespaces' and lets kernel assign pids to its 'unknown namespaces'.
+	 *
+	 * Since the requested-pids correspond to 'known namespaces' and since
+	 * 'known-namespaces' are younger than (i.e descendants of) 'unknown-
+	 * namespaces', copy requested pids to the back-end of target_pids[]
+	 * (i.e before the last entry for CLONE_NEWPID mentioned above).
+	 * Any entries in target_pids[] not corresponding to a requested pid
+	 * will be set to zero and kernel assigns a pid in those namespaces.
+	 *
+	 * NOTE: The order of pids in target_pids[] is oldest pid namespace
+	 * to youngest (target_pids[0] corresponds to init_pid_ns), i.e. the
+	 * order is:
+	 *
+	 *   - pids for 'unknown-namespaces' (if any)
+	 *   - pids for 'known-namespaces' (requested pids)
+	 *   - 0 in the last entry (for CLONE_NEWPID).
+	 */
+	j = knum_pids - unum_pids;
+	size = unum_pids * sizeof(pid_t);
+
+	rc = copy_from_user(&target_pids[j], upids, size);
+	if (rc) {
+		rc = -EFAULT;
+		goto out_free;
+	}
+
+	return target_pids;
+
+out_free:
+	kfree(target_pids);
+	return ERR_PTR(rc);
+}
+
+int
+fetch_clone_args_from_user(struct clone_args __user *uca, int args_size,
+			struct clone_args *kca)
+{
+	int rc;
+
+	/*
+	 * TODO: If size of clone_args is not what the kernel expects, it
+	 * could be that kernel is newer and has an extended structure.
+	 * When that happens, this check needs to be smarter.  For now,
+	 * assume exact match.
+	 */
+	if (args_size != sizeof(struct clone_args))
+		return -EINVAL;
+
+	rc = copy_from_user(kca, uca, args_size);
+	if (rc)
+		return -EFAULT;
+
+	/*
+	 * To avoid future compatibility issues, ensure unused fields are 0.
+	 */
+	if (kca->reserved0 || kca->clone_flags_high)
+		return -EINVAL;
+
+	return 0;
+}
+
+/*
  *  Ok, this is the main fork-routine.
  *
  * It copies the process, and if successful kick-starts
@@ -1367,7 +1475,7 @@ long do_fork_with_pids(unsigned long clone_flags,
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
-	pid_t *target_pids = NULL;
+	pid_t *target_pids;
 
 	/*
 	 * Do some preliminary argument and permissions checking before we
@@ -1401,6 +1509,16 @@ long do_fork_with_pids(unsigned long clone_flags,
 		}
 	}
 
+	target_pids = copy_target_pids(num_pids, upids);
+	if (target_pids) {
+		if (IS_ERR(target_pids))
+			return PTR_ERR(target_pids);
+
+		nr = -EPERM;
+		if (!capable(CAP_SYS_ADMIN))
+			goto out_free;
+	}
+
 	/*
 	 * When called from kernel_thread, don't do user tracing stuff.
 	 */
@@ -1462,6 +1580,10 @@ long do_fork_with_pids(unsigned long clone_flags,
 	} else {
 		nr = PTR_ERR(p);
 	}
+
+out_free:
+	kfree(target_pids);
+
 	return nr;
 }
 
-- 
1.6.3.3

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v21 009/100] eclone (9/11): Implement sys_eclone for s390
       [not found] <1272723382-19470-1-git-send-email-orenl@cs.columbia.edu>
                   ` (7 preceding siblings ...)
  2010-05-01 14:14 ` [PATCH v21 008/100] eclone (8/11): Implement sys_eclone for x86 (32, 64) Oren Laadan
@ 2010-05-01 14:14 ` Oren Laadan
  2010-05-01 14:14 ` [PATCH v21 010/100] eclone (10/11): Implement sys_eclone for powerpc Oren Laadan
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Oren Laadan @ 2010-05-01 14:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-s390, linux-api, containers, x86, linux-kernel,
	linuxppc-dev, Matt Helsley, Serge Hallyn, Pavel Emelyanov

From: Serge E. Hallyn <serue@us.ibm.com>

Implement the s390 hook for sys_eclone().

Changelog:
	Nov 24: Removed user-space code from commit log. See user-cr git tree.
	Nov 17: remove redundant flags_high check
	Nov 13: As suggested by Heiko, convert eclone to take its
		parameters via registers.

Cc: linux-api@vger.kernel.org
Cc: x86@kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linuxppc-dev@ozlabs.org
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
---
 arch/s390/include/asm/unistd.h    |    3 ++-
 arch/s390/kernel/compat_linux.c   |   17 +++++++++++++++++
 arch/s390/kernel/compat_wrapper.S |    8 ++++++++
 arch/s390/kernel/process.c        |   37 +++++++++++++++++++++++++++++++++++++
 arch/s390/kernel/syscalls.S       |    1 +
 5 files changed, 65 insertions(+), 1 deletions(-)

diff --git a/arch/s390/include/asm/unistd.h b/arch/s390/include/asm/unistd.h
index 5f00751..ff13be1 100644
--- a/arch/s390/include/asm/unistd.h
+++ b/arch/s390/include/asm/unistd.h
@@ -269,7 +269,8 @@
 #define	__NR_pwritev		329
 #define __NR_rt_tgsigqueueinfo	330
 #define __NR_perf_event_open	331
-#define NR_syscalls 332
+#define __NR_eclone		332
+#define NR_syscalls 333
 
 /* 
  * There are some system calls that are not present on 64 bit, some
diff --git a/arch/s390/kernel/compat_linux.c b/arch/s390/kernel/compat_linux.c
index 73b624e..1f70d6f 100644
--- a/arch/s390/kernel/compat_linux.c
+++ b/arch/s390/kernel/compat_linux.c
@@ -663,6 +663,23 @@ asmlinkage long sys32_write(unsigned int fd, char __user * buf, size_t count)
 	return sys_write(fd, buf, count);
 }
 
+asmlinkage long sys32_clone(void)
+{
+	struct pt_regs *regs = task_pt_regs(current);
+	unsigned long clone_flags;
+	unsigned long newsp;
+	int __user *parent_tidptr, *child_tidptr;
+
+	clone_flags = regs->gprs[3] & 0xffffffffUL;
+	newsp = regs->orig_gpr2 & 0x7fffffffUL;
+	parent_tidptr = compat_ptr(regs->gprs[4]);
+	child_tidptr = compat_ptr(regs->gprs[5]);
+	if (!newsp)
+		newsp = regs->gprs[15];
+	return do_fork(clone_flags, newsp, regs, 0,
+		       parent_tidptr, child_tidptr);
+}
+
 /*
  * 31 bit emulation wrapper functions for sys_fadvise64/fadvise64_64.
  * These need to rewrite the advise values for POSIX_FADV_{DONTNEED,NOREUSE}
diff --git a/arch/s390/kernel/compat_wrapper.S b/arch/s390/kernel/compat_wrapper.S
index 672ce52..b7bedfa 100644
--- a/arch/s390/kernel/compat_wrapper.S
+++ b/arch/s390/kernel/compat_wrapper.S
@@ -1847,6 +1847,14 @@ sys_clone_wrapper:
 	llgtr	%r5,%r5			# int *
 	jg	sys_clone		# branch to system call
 
+	.globl	sys_eclone_wrapper
+sys_eclone_wrapper:
+	llgfr	%r2,%r2			# unsigned int
+	llgtr	%r3,%r3			# struct clone_args *
+	lgfr	%r4,%r4			# int
+	llgtr	%r5,%r5			# pid_t *
+	jg	sys_eclone		# branch to system call
+
 	.globl	sys32_execve_wrapper
 sys32_execve_wrapper:
 	llgtr	%r2,%r2			# char *
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index 1039fde..799cbb0 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -240,6 +240,43 @@ SYSCALL_DEFINE4(clone, unsigned long, newsp, unsigned long, clone_flags,
 		       parent_tidptr, child_tidptr);
 }
 
+SYSCALL_DEFINE4(eclone, unsigned int, flags_low, struct clone_args __user *,
+		uca, int, args_size, pid_t __user *, pids)
+{
+	int rc;
+	struct pt_regs *regs = task_pt_regs(current);
+	struct clone_args kca;
+	int __user *parent_tid_ptr;
+	int __user *child_tid_ptr;
+	unsigned long flags;
+	unsigned long __user child_stack;
+	unsigned long stack_size;
+
+	rc = fetch_clone_args_from_user(uca, args_size, &kca);
+	if (rc)
+		return rc;
+
+	flags = flags_low;
+	parent_tid_ptr = (int __user *) kca.parent_tid_ptr;
+	child_tid_ptr =  (int __user *) kca.child_tid_ptr;
+
+	stack_size = (unsigned long) kca.child_stack_size;
+	if (stack_size)
+		return -EINVAL;
+
+	child_stack = (unsigned long) kca.child_stack;
+	if (!child_stack)
+		child_stack = regs->gprs[15];
+
+	/*
+	 * TODO: On 32-bit systems, clone_flags is passed in as 32-bit value
+	 * to several functions. Need to convert clone_flags to 64-bit.
+	 */
+	return do_fork_with_pids(flags, child_stack, regs, stack_size,
+				parent_tid_ptr, child_tid_ptr, kca.nr_pids,
+				pids);
+}
+
 /*
  * This is trivial, and on the face of it looks like it
  * could equally well be done in user mode.
diff --git a/arch/s390/kernel/syscalls.S b/arch/s390/kernel/syscalls.S
index 201ce6b..08eab1d 100644
--- a/arch/s390/kernel/syscalls.S
+++ b/arch/s390/kernel/syscalls.S
@@ -340,3 +340,4 @@ SYSCALL(sys_preadv,sys_preadv,compat_sys_preadv_wrapper)
 SYSCALL(sys_pwritev,sys_pwritev,compat_sys_pwritev_wrapper)
 SYSCALL(sys_rt_tgsigqueueinfo,sys_rt_tgsigqueueinfo,compat_sys_rt_tgsigqueueinfo_wrapper) /* 330 */
 SYSCALL(sys_perf_event_open,sys_perf_event_open,sys_perf_event_open_wrapper)
+SYSCALL(sys_eclone,sys_eclone,sys_eclone_wrapper)
-- 
1.6.3.3

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v21 010/100] eclone (10/11): Implement sys_eclone for powerpc
       [not found] <1272723382-19470-1-git-send-email-orenl@cs.columbia.edu>
                   ` (8 preceding siblings ...)
  2010-05-01 14:14 ` [PATCH v21 009/100] eclone (9/11): Implement sys_eclone for s390 Oren Laadan
@ 2010-05-01 14:14 ` Oren Laadan
  2010-05-01 14:14 ` [PATCH v21 011/100] eclone (11/11): Document sys_eclone Oren Laadan
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Oren Laadan @ 2010-05-01 14:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-s390, linux-api, containers, x86, linux-kernel,
	linuxppc-dev, Nathan Lynch, Matt Helsley, Serge Hallyn,
	Pavel Emelyanov

From: Nathan Lynch <ntl@pobox.com>

Wired up for both ppc32 and ppc64, but tested only with the latter.

Changelog:
  - Jan 20: (ntl) fix 32-bit build
  - Nov 17: (serge) remove redundant flags_high check, and
    	    don't fold it into flags.

Cc: linux-api@vger.kernel.org
Cc: x86@kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linuxppc-dev@ozlabs.org
Signed-off-by: Nathan Lynch <ntl@pobox.com>
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
---
 arch/powerpc/include/asm/syscalls.h |    6 ++++
 arch/powerpc/include/asm/systbl.h   |    1 +
 arch/powerpc/include/asm/unistd.h   |    3 +-
 arch/powerpc/kernel/entry_32.S      |    8 +++++
 arch/powerpc/kernel/entry_64.S      |    5 +++
 arch/powerpc/kernel/process.c       |   54 ++++++++++++++++++++++++++++++++++-
 6 files changed, 75 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/syscalls.h b/arch/powerpc/include/asm/syscalls.h
index 4084e56..920cefd 100644
--- a/arch/powerpc/include/asm/syscalls.h
+++ b/arch/powerpc/include/asm/syscalls.h
@@ -23,6 +23,12 @@ asmlinkage int sys_execve(unsigned long a0, unsigned long a1,
 asmlinkage int sys_clone(unsigned long clone_flags, unsigned long usp,
 		int __user *parent_tidp, void __user *child_threadptr,
 		int __user *child_tidp, int p6, struct pt_regs *regs);
+asmlinkage int sys_eclone(unsigned long flags_low,
+			  struct clone_args __user *args,
+			  size_t args_size,
+			  pid_t __user *pids,
+			  unsigned long p5, unsigned long p6,
+			  struct pt_regs *regs);
 asmlinkage int sys_fork(unsigned long p1, unsigned long p2,
 		unsigned long p3, unsigned long p4, unsigned long p5,
 		unsigned long p6, struct pt_regs *regs);
diff --git a/arch/powerpc/include/asm/systbl.h b/arch/powerpc/include/asm/systbl.h
index a5ee345..f94fc43 100644
--- a/arch/powerpc/include/asm/systbl.h
+++ b/arch/powerpc/include/asm/systbl.h
@@ -326,3 +326,4 @@ SYSCALL_SPU(perf_event_open)
 COMPAT_SYS_SPU(preadv)
 COMPAT_SYS_SPU(pwritev)
 COMPAT_SYS(rt_tgsigqueueinfo)
+PPC_SYS(eclone)
diff --git a/arch/powerpc/include/asm/unistd.h b/arch/powerpc/include/asm/unistd.h
index f0a1026..4cdbd5c 100644
--- a/arch/powerpc/include/asm/unistd.h
+++ b/arch/powerpc/include/asm/unistd.h
@@ -345,10 +345,11 @@
 #define __NR_preadv		320
 #define __NR_pwritev		321
 #define __NR_rt_tgsigqueueinfo	322
+#define __NR_eclone		323
 
 #ifdef __KERNEL__
 
-#define __NR_syscalls		323
+#define __NR_syscalls		324
 
 #define __NR__exit __NR_exit
 #define NR_syscalls	__NR_syscalls
diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index 1175a85..579f1da 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -586,6 +586,14 @@ ppc_clone:
 	stw	r0,_TRAP(r1)		/* register set saved */
 	b	sys_clone
 
+	.globl	ppc_eclone
+ppc_eclone:
+	SAVE_NVGPRS(r1)
+	lwz	r0,_TRAP(r1)
+	rlwinm	r0,r0,0,0,30		/* clear LSB to indicate full */
+	stw	r0,_TRAP(r1)		/* register set saved */
+	b	sys_eclone
+
 	.globl	ppc_swapcontext
 ppc_swapcontext:
 	SAVE_NVGPRS(r1)
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 07109d8..b763340 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -344,6 +344,11 @@ _GLOBAL(ppc_clone)
 	bl	.sys_clone
 	b	syscall_exit
 
+_GLOBAL(ppc_eclone)
+	bl	.save_nvgprs
+	bl	.sys_eclone
+	b	syscall_exit
+
 _GLOBAL(ppc32_swapcontext)
 	bl	.save_nvgprs
 	bl	.compat_sys_swapcontext
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index e4d71ce..b183287 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -961,7 +961,59 @@ int sys_clone(unsigned long clone_flags, unsigned long usp,
 		child_tidp = TRUNC_PTR(child_tidp);
 	}
 #endif
- 	return do_fork(clone_flags, usp, regs, 0, parent_tidp, child_tidp);
+	return do_fork(clone_flags, usp, regs, 0, parent_tidp, child_tidp);
+}
+
+int sys_eclone(unsigned long clone_flags_low,
+	       struct clone_args __user *uclone_args,
+	       size_t size,
+	       pid_t __user *upids,
+	       unsigned long p5, unsigned long p6,
+	       struct pt_regs *regs)
+{
+	struct clone_args kclone_args;
+	unsigned long stack_base;
+	int __user *parent_tidp;
+	int __user *child_tidp;
+	unsigned long stack_sz;
+	unsigned int nr_pids;
+	unsigned long flags;
+	unsigned long usp;
+	int rc;
+
+	CHECK_FULL_REGS(regs);
+
+	rc = fetch_clone_args_from_user(uclone_args, size, &kclone_args);
+	if (rc)
+		return rc;
+
+	stack_sz = kclone_args.child_stack_size;
+	stack_base = kclone_args.child_stack;
+
+	/* powerpc doesn't do anything useful with the stack size */
+	if (stack_sz)
+		return -EINVAL;
+
+	/* Interpret stack_base as the child sp if it is set. */
+	usp = regs->gpr[1];
+	if (stack_base)
+		usp = stack_base;
+
+	flags = clone_flags_low;
+
+	nr_pids = kclone_args.nr_pids;
+
+	parent_tidp = (int __user *)(unsigned long)kclone_args.parent_tid_ptr;
+	child_tidp = (int __user *)(unsigned long)kclone_args.child_tid_ptr;
+
+#ifdef CONFIG_PPC64
+	if (test_thread_flag(TIF_32BIT)) {
+		parent_tidp = TRUNC_PTR(parent_tidp);
+		child_tidp = TRUNC_PTR(child_tidp);
+	}
+#endif
+	return do_fork_with_pids(flags, usp, regs, stack_sz,
+				 parent_tidp, child_tidp, nr_pids, upids);
 }
 
 int sys_fork(unsigned long p1, unsigned long p2, unsigned long p3,
-- 
1.6.3.3

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v21 011/100] eclone (11/11): Document sys_eclone
       [not found] <1272723382-19470-1-git-send-email-orenl@cs.columbia.edu>
                   ` (9 preceding siblings ...)
  2010-05-01 14:14 ` [PATCH v21 010/100] eclone (10/11): Implement sys_eclone for powerpc Oren Laadan
@ 2010-05-01 14:14 ` Oren Laadan
  2010-05-05 21:14   ` Randy Dunlap
  2010-05-01 14:14 ` [PATCH v21 012/100] c/r: extend arch_setup_additional_pages() Oren Laadan
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 26+ messages in thread
From: Oren Laadan @ 2010-05-01 14:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: containers, linux-kernel, Serge Hallyn, Matt Helsley,
	Pavel Emelyanov, Sukadev Bhattiprolu, linux-api, x86, linux-s390,
	linuxppc-dev

From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

This gives a brief overview of the eclone() system call.  We should
eventually describe more details in existing clone(2) man page or in
a new man page.

Changelog[v13]:
	- [Nathan Lynch, Serge Hallyn] Rename ->child_stack_base to
	  ->child_stack and ensure ->child_stack_size is 0 on architectures
	  that don't need it.
	- [Arnd Bergmann] Remove ->reserved1 field
	- [Louis Rilling, Dave Hansen] Combine the two asm statements in the
	  example into one and use memory constraint to avoid unnecessary copies.
Changelog[v12]:
	- [Serge Hallyn] Fix/simplify stack-setup in the example code
	- [Serge Hallyn, Oren Laadan] Rename syscall to eclone()

Changelog[v11]:
	- [Dave Hansen] Move clone_args validation checks to arch-independent
	  code.
	- [Oren Laadan] Make args_size a parameter to system call and remove
	  it from 'struct clone_args'
	- [Oren Laadan] Fix some typos and clarify the order of pids in the
	  @pids parameter.

Changelog[v10]:
	- Rename clone3() to clone_with_pids() and fix some typos.
	- Modify example to show usage with the ptregs implementation.
Changelog[v9]:
	- [Pavel Machek]: Fix an inconsistency and rename new file to
	  Documentation/clone3.
	- [Roland McGrath, H. Peter Anvin] Updates to description and
	  example to reflect new prototype of clone3() and the updated/
	  renamed 'struct clone_args'.

Changelog[v8]:
	- clone2() is already in use in IA64. Rename syscall to clone3()
	- Add notes to say that we return -EINVAL if invalid clone flags
	  are specified or if the reserved fields are not 0.
Changelog[v7]:
	- Rename clone_with_pids() to clone2()
	- Changes to reflect new prototype of clone2() (using clone_struct).

Cc: linux-api@vger.kernel.org
Cc: x86@kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linuxppc-dev@ozlabs.org
Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
---
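
Note: the argument checks that Documentation/eclone describes (the size
must match exactly, reserved fields must be 0) are the ones done by
fetch_clone_args_from_user() in kernel/fork.c. They can be modelled in
user space roughly as follows. This is a sketch only: struct
clone_args_model and check_clone_args() are hypothetical names, and
memcpy() stands in for copy_from_user().

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Mirrors the layout of struct clone_args from this series. */
struct clone_args_model {
	uint64_t clone_flags_high;
	uint64_t child_stack;
	uint64_t child_stack_size;
	uint64_t parent_tid_ptr;
	uint64_t child_tid_ptr;
	uint32_t nr_pids;
	uint32_t reserved0;
};

/*
 * Hypothetical model of the arch-independent checks: returns 0 if the
 * arguments would be accepted, -EINVAL otherwise.
 */
static int check_clone_args(const void *uca, int args_size,
			    struct clone_args_model *kca)
{
	/* For now the size must match the kernel's view exactly. */
	if (args_size != (int)sizeof(*kca))
		return -EINVAL;

	memcpy(kca, uca, sizeof(*kca));

	/* Fields reserved for future use must be 0. */
	if (kca->reserved0 || kca->clone_flags_high)
		return -EINVAL;

	return 0;
}
```

A caller would zero the whole structure, fill in only the fields it
needs, and pass sizeof(struct clone_args) as the size argument.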
 Documentation/eclone |  348 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 348 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/eclone

diff --git a/Documentation/eclone b/Documentation/eclone
new file mode 100644
index 0000000..c2f1b4b
--- /dev/null
+++ b/Documentation/eclone
@@ -0,0 +1,348 @@
+
+struct clone_args {
+	u64 clone_flags_high;
+	u64 child_stack;
+	u64 child_stack_size;
+	u64 parent_tid_ptr;
+	u64 child_tid_ptr;
+	u32 nr_pids;
+	u32 reserved0;
+};
+
+
+sys_eclone(u32 flags_low, struct clone_args * __user cargs, int cargs_size,
+		pid_t * __user pids)
+
+	In addition to doing everything that clone() system call does, the
+	eclone() system call:
+
+		- allows additional clone flags (31 of 32 bits in the flags
+		  parameter to clone() are in use)
+
+		- allows user to specify a pid for the child process in its
+		  active and ancestor pid namespaces.
+
+	This system call is meant to be used when restarting an application
+	from a checkpoint. Such restart requires that the processes in the
+	application have the same pids they had when the application was
+	checkpointed. When containers are nested, the processes within the
+	containers exist in multiple pid namespaces and hence have multiple
+	pids to specify during restart.
+
+	The @flags_low parameter is identical to the 'clone_flags' parameter
+	in existing clone() system call.
+
+	The fields in 'struct clone_args' are meant to be used as follows:
+
+	u64 clone_flags_high:
+
+		When eclone() supports more than 32 flags, the additional bits
+		in the clone_flags should be specified in this field. This
+		field is currently unused and must be set to 0.
+
+	u64 child_stack;
+	u64 child_stack_size;
+
+		These two fields correspond to the 'child_stack' fields in
+		clone() and clone2() (on IA64) system calls. The usage of
+		these two fields depends on the processor architecture.
+
+		Most architectures use ->child_stack to pass-in a stack-pointer
+		itself and don't need the ->child_stack_size field. On these
+		architectures the ->child_stack_size field must be 0.
+
+		Some architectures, e.g. IA64, use ->child_stack to pass-in the
+		base of the region allocated for stack. These architectures
+		must pass in the size of the stack-region in ->child_stack_size.
+
+	u64 parent_tid_ptr;
+	u64 child_tid_ptr;
+
+		These two fields correspond to the 'parent_tid_ptr' and
+		'child_tid_ptr' parameters of the clone() system call.
+
+	u32 nr_pids;
+
+		nr_pids specifies the number of pids in the @pids array
+		parameter to eclone() (see below). nr_pids must not exceed
+		the number of pid namespaces the calling process has a pid
+		in (i.e. if the process is in init_pid_ns, nr_pids cannot
+		exceed 1; if the process is in a pid namespace that is a
+		child of init_pid_ns, nr_pids cannot exceed 2, and so on).
+
+	u32 reserved0;
+
+		This field is reserved for extending the functionality of
+		eclone() in the future, while preserving backward
+		compatibility. It must be set to 0 for now; the kernel
+		returns -EINVAL otherwise.
+
+	The @cargs_size parameter specifies sizeof(struct clone_args) and
+	is intended to enable extending this structure in the future, while
+	preserving backward compatibility.  For now, this field must be set
+	to the sizeof(struct clone_args) and this size must match the kernel's
+	view of the structure.
+
+	The @pids parameter defines the set of pids that should be assigned to
+	the child process in its active and ancestor pid namespaces. The
+	descendant pid namespaces do not matter since a process does not have a
+	pid in descendant namespaces, unless the process is in a new pid
+	namespace in which case the process is a container-init (and must have
+	the pid 1 in that namespace).
+
+	See CLONE_NEWPID section of clone(2) man page for details about pid
+	namespaces.
+
+	If a pid in the @pids list is 0, the kernel will assign the next
+	available pid in the pid namespace.
+
+	If a pid in the @pids list is non-zero, the kernel tries to assign
+	the specified pid in that namespace.  If that pid is already in use
+	by another process, the system call fails (see EBUSY below).
+
+	The order of pids in @pids is oldest pid namespace in pids[0] to
+	youngest pid namespace in pids[nr_pids-1]. If the number of pids
+	specified in the @pids list is fewer than the nesting level of the
+	process, the pids are applied starting from the youngest namespace,
+	i.e. if the process is nested in a level-6 pid namespace and @pids
+	only specifies 3 pids, the 3 pids are applied to levels 6, 5 and 4.
+	In levels 0 through 3 the kernel assigns the pid, as if '0' was given.
+
+	On success, the system call returns the pid of the child process in
+	the parent's active pid namespace.
+
+	On failure, eclone() returns -1 and sets 'errno' to one of following
+	values (the child process is not created).
+
+	EPERM	Caller does not have the CAP_SYS_ADMIN privilege needed to
+		specify the pids in this call (if pids are not specified
+		CAP_SYS_ADMIN is not required).
+
+	EINVAL	The number of pids specified in 'clone_args.nr_pids' exceeds
+		the current nesting level of parent process
+
+	EINVAL	Not all specified clone-flags are valid.
+
+	EINVAL	The reserved fields in the clone_args argument are not 0.
+
+	EINVAL	The child_stack_size field is not 0 (on architectures that
+		pass in a stack pointer in ->child_stack field)
+
+	EBUSY	A requested pid is in use by another process in that namespace.
+
+---
+/*
+ * Example eclone() usage - Create a child process with pid CHILD_TID1 in
+ * the current pid namespace. The child gets the usual "random" pid in any
+ * ancestor pid namespaces.
+ */
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <signal.h>
+#include <errno.h>
+#include <unistd.h>
+#include <wait.h>
+#include <sys/syscall.h>
+
+#define __NR_eclone		338
+#define CLONE_NEWPID            0x20000000
+#define CLONE_CHILD_SETTID      0x01000000
+#define CLONE_PARENT_SETTID     0x00100000
+#define CLONE_UNUSED		0x00001000
+
+#define STACKSIZE		8192
+
+typedef unsigned long long u64;
+typedef unsigned int u32;
+typedef int pid_t;
+struct clone_args {
+	u64 clone_flags_high;
+	u64 child_stack;
+	u64 child_stack_size;
+
+	u64 parent_tid_ptr;
+	u64 child_tid_ptr;
+
+	u32 nr_pids;
+
+	u32 reserved0;
+};
+
+#define exit		_exit
+
+/*
+ * Following eclone() is based on code posted by Oren Laadan at:
+ * https://lists.linux-foundation.org/pipermail/containers/2009-June/018463.html
+ */
+#if defined(__i386__) && defined(__NR_eclone)
+
+int eclone(u32 flags_low, struct clone_args *clone_args, int args_size,
+		int *pids)
+{
+	long retval;
+
+	__asm__ __volatile__(
+		 "movl %3, %%ebx\n\t"	/* flags_low -> 1st (ebx) */
+		 "movl %4, %%ecx\n\t"	/* clone_args -> 2nd (ecx)*/
+		 "movl %5, %%edx\n\t"	/* args_size -> 3rd (edx) */
+		 "movl %6, %%edi\n\t"	/* pids -> 4th (edi)*/
+
+		 "pushl %%ebp\n\t"	/* save value of ebp */
+		 "int $0x80\n\t"	/* Linux/i386 system call */
+		 "testl %0,%0\n\t"	/* check return value */
+		 "jne 1f\n\t"		/* jump if parent */
+
+		 "popl %%esi\n\t"	/* get subthread function */
+		 "call *%%esi\n\t"	/* start subthread function */
+		 "movl %2,%0\n\t"
+		 "int $0x80\n"		/* exit system call: exit subthread */
+		 "1:\n\t"
+		 "popl %%ebp\t"		/* restore parent's ebp */
+
+		:"=a" (retval)
+
+		:"0" (__NR_eclone),
+		 "i" (__NR_exit),
+		 "m" (flags_low),
+		 "m" (clone_args),
+		 "m" (args_size),
+		 "m" (pids)
+		);
+
+	if (retval < 0) {
+		errno = -retval;
+		retval = -1;
+	}
+	return retval;
+}
+
+/*
+ * Allocate a stack for the clone-child and arrange to have the child
+ * execute @child_fn with @child_arg as the argument.
+ */
+void *setup_stack(int (*child_fn)(void *), void *child_arg, int size)
+{
+	void *stack_base;
+	void **stack_top;
+
+	stack_base = malloc(size + size);
+	if (!stack_base) {
+		perror("malloc()");
+		exit(1);
+	}
+
+	stack_top = (void **)((char *)stack_base + (size - 4));
+	*--stack_top = child_arg;
+	*--stack_top = child_fn;
+
+	return stack_top;
+}
+#endif
+
+/* gettid() is a bit more useful than getpid() when messing with clone() */
+int gettid()
+{
+	int rc;
+
+	rc = syscall(__NR_gettid, 0, 0, 0);
+	if (rc < 0) {
+		printf("rc %d, errno %d\n", rc, errno);
+		exit(1);
+	}
+	return rc;
+}
+
+#define CHILD_TID1	377
+#define CHILD_TID2	1177
+#define CHILD_TID3	2799
+
+struct clone_args clone_args;
+void *child_arg = &clone_args;
+int child_tid;
+
+int do_child(void *arg)
+{
+	struct clone_args *cs = (struct clone_args *)arg;
+	int ctid;
+
+	/* Verify we pushed the arguments correctly on the stack... */
+	if (arg != child_arg)  {
+		printf("Child: Incorrect child arg pointer, expected %p, "
+				"actual %p\n", child_arg, arg);
+		exit(1);
+	}
+
+	/* ... and that we got the thread-id we expected */
+	ctid = *((int *)(unsigned long)cs->child_tid_ptr);
+	if (ctid != CHILD_TID1) {
+		printf("Child: Incorrect child tid, expected %d, actual %d\n",
+				CHILD_TID1, ctid);
+		exit(1);
+	} else {
+		printf("Child got the expected tid, %d\n", gettid());
+	}
+	sleep(2);
+
+	printf("[%d, %d]: Child exiting\n", getpid(), ctid);
+	exit(0);
+}
+
+static int do_clone(int (*child_fn)(void *), void *child_arg,
+		unsigned int flags_low, int nr_pids, pid_t *pids_list)
+{
+	int rc;
+	void *stack;
+	struct clone_args *ca = &clone_args;
+	int args_size;
+
+	stack = setup_stack(child_fn, child_arg, STACKSIZE);
+
+	memset(ca, 0, sizeof(*ca));
+
+	ca->child_stack		= (u64)(unsigned long)stack;
+	ca->child_stack_size	= (u64)0;
+	ca->child_tid_ptr	= (u64)(unsigned long)&child_tid;
+	ca->nr_pids		= nr_pids;
+
+	args_size = sizeof(struct clone_args);
+	rc = eclone(flags_low, ca, args_size, pids_list);
+
+	printf("[%d, %d]: eclone() returned %d, error %d\n", getpid(), gettid(),
+				rc, errno);
+	return rc;
+}
+
+/*
+ * Multiple pid_t values in pids_list[] here are just for illustration.
+ * The test case creates a child in the current pid namespace and uses only
+ * the first value, CHILD_TID1.
+ */
+pid_t pids_list[] = { CHILD_TID1, CHILD_TID2, CHILD_TID3 };
+int main()
+{
+	int rc, pid, status;
+	unsigned long flags;
+	int nr_pids = 1;
+
+	flags = SIGCHLD|CLONE_CHILD_SETTID;
+
+	pid = do_clone(do_child, &clone_args, flags, nr_pids, pids_list);
+
+	printf("[%d, %d]: Parent waiting for %d\n", getpid(), gettid(), pid);
+
+	rc = waitpid(pid, &status, __WALL);
+	if (rc < 0) {
+		printf("waitpid(): rc %d, error %d\n", rc, errno);
+	} else {
+		printf("[%d, %d]: child %d:\n\t wait-status 0x%x\n", getpid(),
+			 gettid(), rc, status);
+
+		if (WIFEXITED(status)) {
+			printf("\t EXITED, %d\n", WEXITSTATUS(status));
+		} else if (WIFSIGNALED(status)) {
+			printf("\t SIGNALED, %d\n", WTERMSIG(status));
+		}
+	}
+	return 0;
+}
-- 
1.6.3.3

* [PATCH v21 012/100] c/r: extend arch_setup_additional_pages()
       [not found] <1272723382-19470-1-git-send-email-orenl@cs.columbia.edu>
                   ` (10 preceding siblings ...)
  2010-05-01 14:14 ` [PATCH v21 011/100] eclone (11/11): Document sys_eclone Oren Laadan
@ 2010-05-01 14:14 ` Oren Laadan
  2010-05-01 14:15 ` [PATCH v21 021/100] c/r: create syscalls: sys_checkpoint, sys_restart Oren Laadan
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Oren Laadan @ 2010-05-01 14:14 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-s390, Oren Laadan, containers, x86, linux-kernel,
	linuxppc-dev, Matt Helsley, Serge Hallyn, Alexey Dobriyan,
	Pavel Emelyanov

From: Alexey Dobriyan <adobriyan@gmail.com>

Add "start" argument, to request to map vDSO to a specific place,
and fail the operation if not.

This is useful for restart(2) to ensure that memory layout is restored
exactly as needed.
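
To illustrate the intended semantics, here is a minimal user-space sketch
(not kernel code). struct fake_mm, fake_get_unmapped_area() and place_vdso()
are hypothetical names made up for this example; only the handling of
"start" mirrors what the patch does: use "start" as the hint when it is
non-zero, and fail if the allocator hands back anything else.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical address space with a single occupied range. */
struct fake_mm {
	unsigned long busy_start;	/* first byte of the occupied range */
	unsigned long busy_end;		/* one past the occupied range */
	unsigned long fallback;		/* address returned when the hint is taken */
};

/*
 * Hypothetical stand-in for get_unmapped_area(): honour the hint if the
 * range is free, otherwise return some other address.
 */
unsigned long fake_get_unmapped_area(const struct fake_mm *mm,
				     unsigned long hint, unsigned long len)
{
	if (hint + len <= mm->busy_start || hint >= mm->busy_end)
		return hint;
	return mm->fallback;
}

/*
 * Sketch of the "start" logic: start == 0 means "anywhere is fine"
 * (the exec path), start != 0 means "exactly here or fail" (restart).
 */
long place_vdso(const struct fake_mm *mm, unsigned long start,
		unsigned long default_hint, unsigned long len,
		unsigned long *out)
{
	unsigned long hint = start ? start : default_hint;
	unsigned long addr;

	assert(out != NULL);
	addr = fake_get_unmapped_area(mm, hint, len);

	/* for restart, double check that we got what we asked for */
	if (start && addr != start)
		return -EINVAL;

	*out = addr;
	return 0;
}
```

With start == 0 the caller accepts whatever address comes back, which is
why the binfmt_elf call site can simply pass 0 and keep its old behavior.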

Changelog[v19]:
  - [serge hallyn] Fix potential use-before-set ret
Changelog[v2]:
  - [ntl] powerpc: vdso build fix (ckpt-v17)

Cc: x86@kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linuxppc-dev@ozlabs.org
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
---
 arch/powerpc/include/asm/elf.h     |    1 +
 arch/powerpc/kernel/vdso.c         |   13 ++++++++++++-
 arch/s390/include/asm/elf.h        |    2 +-
 arch/s390/kernel/vdso.c            |   13 ++++++++++++-
 arch/sh/include/asm/elf.h          |    1 +
 arch/sh/kernel/vsyscall/vsyscall.c |    2 +-
 arch/x86/include/asm/elf.h         |    3 ++-
 arch/x86/vdso/vdso32-setup.c       |    9 +++++++--
 arch/x86/vdso/vma.c                |   11 ++++++++---
 fs/binfmt_elf.c                    |    2 +-
 10 files changed, 46 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/elf.h b/arch/powerpc/include/asm/elf.h
index c376eda..0b06255 100644
--- a/arch/powerpc/include/asm/elf.h
+++ b/arch/powerpc/include/asm/elf.h
@@ -266,6 +266,7 @@ extern int ucache_bsize;
 #define ARCH_HAS_SETUP_ADDITIONAL_PAGES
 struct linux_binprm;
 extern int arch_setup_additional_pages(struct linux_binprm *bprm,
+				       unsigned long start,
 				       int uses_interp);
 #define VDSO_AUX_ENT(a,b) NEW_AUX_ENT(a,b);
 
diff --git a/arch/powerpc/kernel/vdso.c b/arch/powerpc/kernel/vdso.c
index d84d192..74210ab 100644
--- a/arch/powerpc/kernel/vdso.c
+++ b/arch/powerpc/kernel/vdso.c
@@ -188,7 +188,8 @@ static void dump_vdso_pages(struct vm_area_struct * vma)
  * This is called from binfmt_elf, we create the special vma for the
  * vDSO and insert it into the mm struct tree
  */
-int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
+int arch_setup_additional_pages(struct linux_binprm *bprm,
+				unsigned long start, int uses_interp)
 {
 	struct mm_struct *mm = current->mm;
 	struct page **vdso_pagelist;
@@ -220,6 +221,10 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	vdso_base = VDSO32_MBASE;
 #endif
 
+	/* in case restart(2) mandates a specific location */
+	if (start)
+		vdso_base = start;
+
 	current->mm->context.vdso_base = 0;
 
 	/* vDSO has a problem and was disabled, just don't "enable" it for the
@@ -249,6 +254,12 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	/* Add required alignment. */
 	vdso_base = ALIGN(vdso_base, VDSO_ALIGNMENT);
 
+	/* for restart(2), double check that we got what we asked for */
+	if (start && vdso_base != start) {
+		rc = -EBUSY;
+		goto fail_mmapsem;
+	}
+
 	/*
 	 * Put vDSO base into mm struct. We need to do this before calling
 	 * install_special_mapping or the perf counter mmap tracking code
diff --git a/arch/s390/include/asm/elf.h b/arch/s390/include/asm/elf.h
index 354d426..5081938 100644
--- a/arch/s390/include/asm/elf.h
+++ b/arch/s390/include/asm/elf.h
@@ -216,6 +216,6 @@ do {									    \
 struct linux_binprm;
 
 #define ARCH_HAS_SETUP_ADDITIONAL_PAGES 1
-int arch_setup_additional_pages(struct linux_binprm *, int);
+int arch_setup_additional_pages(struct linux_binprm *, unsigned long, int);
 
 #endif
diff --git a/arch/s390/kernel/vdso.c b/arch/s390/kernel/vdso.c
index 6bc9c19..54dad2f 100644
--- a/arch/s390/kernel/vdso.c
+++ b/arch/s390/kernel/vdso.c
@@ -195,7 +195,8 @@ static void vdso_init_cr5(void)
  * This is called from binfmt_elf, we create the special vma for the
  * vDSO and insert it into the mm struct tree
  */
-int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
+int arch_setup_additional_pages(struct linux_binprm *bprm,
+				unsigned long start, int uses_interp)
 {
 	struct mm_struct *mm = current->mm;
 	struct page **vdso_pagelist;
@@ -226,6 +227,10 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	vdso_pages = vdso32_pages;
 #endif
 
+	/* in case restart(2) mandates a specific location */
+	if (start)
+		vdso_base = start;
+
 	/*
 	 * vDSO has a problem and was disabled, just don't "enable" it for
 	 * the process
@@ -248,6 +253,12 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 		goto out_up;
 	}
 
+	/* for restart(2), double check that we got what we asked for */
+	if (start && vdso_base != start) {
+		rc = -EINVAL;
+		goto out_up;
+	}
+
 	/*
 	 * Put vDSO base into mm struct. We need to do this before calling
 	 * install_special_mapping or the perf counter mmap tracking code
diff --git a/arch/sh/include/asm/elf.h b/arch/sh/include/asm/elf.h
index ce830fa..4128c30 100644
--- a/arch/sh/include/asm/elf.h
+++ b/arch/sh/include/asm/elf.h
@@ -201,6 +201,7 @@ do {									\
 #define ARCH_HAS_SETUP_ADDITIONAL_PAGES
 struct linux_binprm;
 extern int arch_setup_additional_pages(struct linux_binprm *bprm,
+				       unsigned long start,
 				       int uses_interp);
 
 extern unsigned int vdso_enabled;
diff --git a/arch/sh/kernel/vsyscall/vsyscall.c b/arch/sh/kernel/vsyscall/vsyscall.c
index 242117c..6dbdfe1 100644
--- a/arch/sh/kernel/vsyscall/vsyscall.c
+++ b/arch/sh/kernel/vsyscall/vsyscall.c
@@ -58,7 +58,7 @@ int __init vsyscall_init(void)
 }
 
 /* Setup a VMA at program startup for the vsyscall page */
-int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
+int arch_setup_additional_pages(struct linux_binprm *bprm, unsigned long start, int uses_interp)
 {
 	struct mm_struct *mm = current->mm;
 	unsigned long addr;
diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index f2ad216..3761be8 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -312,9 +312,10 @@ struct linux_binprm;
 
 #define ARCH_HAS_SETUP_ADDITIONAL_PAGES 1
 extern int arch_setup_additional_pages(struct linux_binprm *bprm,
+				       unsigned long start,
 				       int uses_interp);
 
-extern int syscall32_setup_pages(struct linux_binprm *, int exstack);
+extern int syscall32_setup_pages(struct linux_binprm *, unsigned long start, int exstack);
 #define compat_arch_setup_additional_pages	syscall32_setup_pages
 
 extern unsigned long arch_randomize_brk(struct mm_struct *mm);
diff --git a/arch/x86/vdso/vdso32-setup.c b/arch/x86/vdso/vdso32-setup.c
index 02b442e..62043c1 100644
--- a/arch/x86/vdso/vdso32-setup.c
+++ b/arch/x86/vdso/vdso32-setup.c
@@ -310,7 +310,8 @@ int __init sysenter_setup(void)
 }
 
 /* Setup a VMA at program startup for the vsyscall page */
-int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
+int arch_setup_additional_pages(struct linux_binprm *bprm,
+				unsigned long start, int uses_interp)
 {
 	struct mm_struct *mm = current->mm;
 	unsigned long addr;
@@ -331,13 +332,17 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
 	if (compat)
 		addr = VDSO_HIGH_BASE;
 	else {
-		addr = get_unmapped_area(NULL, 0, PAGE_SIZE, 0, 0);
+		addr = get_unmapped_area(NULL, start, PAGE_SIZE, 0, 0);
 		if (IS_ERR_VALUE(addr)) {
 			ret = addr;
 			goto up_fail;
 		}
 	}
 
+	/* for restart(2), double check that we got what we asked for */
+	if (start && addr != start)
+		goto up_fail;
+
 	current->mm->context.vdso = (void *)addr;
 
 	if (compat_uses_vma || !compat) {
diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
index ac74869..b813286 100644
--- a/arch/x86/vdso/vma.c
+++ b/arch/x86/vdso/vma.c
@@ -100,23 +100,28 @@ static unsigned long vdso_addr(unsigned long start, unsigned len)
 
 /* Setup a VMA at program startup for the vsyscall page.
    Not called for compat tasks */
-int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
+int arch_setup_additional_pages(struct linux_binprm *bprm,
+				unsigned long start, int uses_interp)
 {
 	struct mm_struct *mm = current->mm;
 	unsigned long addr;
-	int ret;
+	int ret = -EINVAL;
 
 	if (!vdso_enabled)
 		return 0;
 
 	down_write(&mm->mmap_sem);
-	addr = vdso_addr(mm->start_stack, vdso_size);
+	addr = start ? : vdso_addr(mm->start_stack, vdso_size);
 	addr = get_unmapped_area(NULL, addr, vdso_size, 0, 0);
 	if (IS_ERR_VALUE(addr)) {
 		ret = addr;
 		goto up_fail;
 	}
 
+	/* for restart(2), double check that we got what we asked for */
+	if (start && addr != start)
+		goto up_fail;
+
 	current->mm->context.vdso = (void *)addr;
 
 	ret = install_special_mapping(mm, addr, vdso_size,
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 535e763..6434003 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -923,7 +923,7 @@ static int load_elf_binary(struct linux_binprm *bprm, struct pt_regs *regs)
 	set_binfmt(&elf_format);
 
 #ifdef ARCH_HAS_SETUP_ADDITIONAL_PAGES
-	retval = arch_setup_additional_pages(bprm, !!elf_interpreter);
+	retval = arch_setup_additional_pages(bprm, 0, !!elf_interpreter);
 	if (retval < 0) {
 		send_sig(SIGKILL, current, 0);
 		goto out;
-- 
1.6.3.3

* [PATCH v21 021/100] c/r: create syscalls: sys_checkpoint, sys_restart
       [not found] <1272723382-19470-1-git-send-email-orenl@cs.columbia.edu>
                   ` (11 preceding siblings ...)
  2010-05-01 14:14 ` [PATCH v21 012/100] c/r: extend arch_setup_additional_pages() Oren Laadan
@ 2010-05-01 14:15 ` Oren Laadan
  2010-05-01 14:15 ` [PATCH v21 058/100] c/r: (s390): expose a constant for the number of words (CRs) Oren Laadan
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 26+ messages in thread
From: Oren Laadan @ 2010-05-01 14:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: containers, linux-kernel, Serge Hallyn, Matt Helsley,
	Pavel Emelyanov, Oren Laadan, linux-api, x86, linux-s390,
	linuxppc-dev, Dave Hansen

Create trivial sys_checkpoint and sys_restart system calls. They will
make it possible to checkpoint and restart an entire container, to and
from a checkpoint image file descriptor.

The syscalls take a pid, a file descriptor (for the image file) and
flags as arguments. The pid identifies the top-most (root) task in the
process tree, e.g. the container init: for sys_checkpoint the first
argument identifies the pid of the target container/subtree; for
sys_restart it identifies the pid of the restarting root task.

A checkpoint, much like a process coredump, dumps the state of multiple
processes at once, including the state of the container. The checkpoint
image is written to (and read from) the file descriptor directly from
the kernel. This way the data is generated and then pushed out naturally
as resources and tasks are scanned to save their state. This is the
approach taken by, e.g., Zap and OpenVZ.

By using a return value and not a file descriptor, we can distinguish
between a return from checkpoint, a return from restart (in case of a
checkpoint that includes self, i.e. a task checkpointing its own
container, or itself), and an error condition, in a manner analogous
to a fork() call.
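
As a rough illustration of the fork()-like convention, a caller could
tell the three outcomes apart as sketched below. The enum and the
classify_checkpoint_ret() helper are hypothetical names used only for
this example; the mapping (positive on checkpoint success, 0 when
returning from restart, negative on error) follows the kernel-doc
comment for sys_checkpoint in this patch. Note that a libc-style wrapper
would report errors as -1 plus errno rather than a raw negative value.

```c
#include <assert.h>

/* Hypothetical outcome labels, for illustration only. */
enum ckpt_outcome {
	CKPT_OUTCOME_ERROR,		/* negative return value */
	CKPT_OUTCOME_FROM_RESTART,	/* 0: we are the restarted copy */
	CKPT_OUTCOME_CHECKPOINTED	/* positive identifier: checkpoint done */
};

/* Classify the raw return value of a checkpoint call. */
enum ckpt_outcome classify_checkpoint_ret(long ret)
{
	if (ret < 0)
		return CKPT_OUTCOME_ERROR;
	if (ret == 0)
		return CKPT_OUTCOME_FROM_RESTART;
	return CKPT_OUTCOME_CHECKPOINTED;
}

/* Human-readable name, handy for logging in a test program. */
const char *ckpt_outcome_name(enum ckpt_outcome o)
{
	switch (o) {
	case CKPT_OUTCOME_ERROR:
		return "error";
	case CKPT_OUTCOME_FROM_RESTART:
		return "returned from restart";
	case CKPT_OUTCOME_CHECKPOINTED:
		return "checkpoint taken";
	}
	assert(0 && "unknown outcome");
	return "?";
}
```

This is the same pattern as checking fork()'s return value: the task
that took the checkpoint and the task that was later restarted from the
image both come back from the same call site and branch on the result.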

We don't use copy_from_user()/copy_to_user() because they require
holding the entire image in user space, and does not make sense for
restart.  Also, we don't use a pipe, pseudo-fs file and the like,
because they work by generating data on demand as the user pulls it
(unless the entire image is buffered in the kernel) and would require
more complex logic.  They also would significantly complicate
checkpoint that includes self.

Changelog[v21-rc3]:
  - Reorganize code:move checkpoint/* to kernel/checkpoint/*
Changelog[v19-rc1]:
  - Add 'int logfd' to prototype of sys_{checkpoint,restart}
Changelog[v18]:
  - [John Dykstra] Fix no-dot-config-targets pattern in linux/Makefile
Changelog[v17]:
  - Move checkpoint closer to namespaces (kconfig)
  - Kill "Enable" in c/r config option
Changelog[v16]:
  - Change sys_restart() first argument to be 'pid_t pid'
Changelog[v14]:
  - Change CONFIG_CHEKCPOINT_RESTART to CONFIG_CHECKPOINT (Ingo)
  - Remove line 'def_bool n' (default is already 'n')
  - Add CHECKPOINT_SUPPORT in Kconfig (Nathan Lynch)
Changelog[v5]:
  - Config is 'def_bool n' by default

Cc: linux-api@vger.kernel.org
Cc: x86@kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linuxppc-dev@ozlabs.org
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Signed-off-by: Dave Hansen <dave@linux.vnet.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
 Makefile                           |    2 +-
 arch/x86/Kconfig                   |    4 +++
 arch/x86/include/asm/unistd_32.h   |    4 ++-
 arch/x86/kernel/syscall_table_32.S |    2 +
 include/linux/syscalls.h           |    4 +++
 init/Kconfig                       |    2 +
 kernel/Makefile                    |    1 +
 kernel/checkpoint/Kconfig          |   14 +++++++++++
 kernel/checkpoint/Makefile         |    5 ++++
 kernel/checkpoint/sys.c            |   45 ++++++++++++++++++++++++++++++++++++
 kernel/sys_ni.c                    |    4 +++
 11 files changed, 85 insertions(+), 2 deletions(-)
 create mode 100644 kernel/checkpoint/Kconfig
 create mode 100644 kernel/checkpoint/Makefile
 create mode 100644 kernel/checkpoint/sys.c

diff --git a/Makefile b/Makefile
index fa1db90..93be4e1 100644
--- a/Makefile
+++ b/Makefile
@@ -409,7 +409,7 @@ endif
 # of make so .config is not included in this case either (for *config).
 
 no-dot-config-targets := clean mrproper distclean \
-			 cscope TAGS tags help %docs check% \
+			 cscope TAGS tags help %docs checkstack \
 			 include/linux/version.h headers_% \
 			 kernelrelease kernelversion
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 9458685..0874484 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -93,6 +93,10 @@ config STACKTRACE_SUPPORT
 config HAVE_LATENCYTOP_SUPPORT
 	def_bool y
 
+config CHECKPOINT_SUPPORT
+	bool
+	default y if X86_32
+
 config MMU
 	def_bool y
 
diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index e543b0e..007d7cd 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -344,10 +344,12 @@
 #define __NR_perf_event_open	336
 #define __NR_recvmmsg		337
 #define __NR_eclone		338
+#define __NR_checkpoint		339
+#define __NR_restart		340
 
 #ifdef __KERNEL__
 
-#define NR_syscalls 339
+#define NR_syscalls 341
 
 #define __ARCH_WANT_IPC_PARSE_VERSION
 #define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index 0c92570..2d5a6b0 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -338,3 +338,5 @@ ENTRY(sys_call_table)
 	.long sys_perf_event_open
 	.long sys_recvmmsg
 	.long ptregs_eclone
+	.long sys_checkpoint
+	.long sys_restart		/* 340 */
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 057929b..d1d1703 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -834,6 +834,10 @@ asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,
 asmlinkage long sys_ppoll(struct pollfd __user *, unsigned int,
 			  struct timespec __user *, const sigset_t __user *,
 			  size_t);
+asmlinkage long sys_checkpoint(pid_t pid, int fd, unsigned long flags,
+			       int logfd);
+asmlinkage long sys_restart(pid_t pid, int fd, unsigned long flags,
+			    int logfd);
 
 int kernel_execve(const char *filename, char *const argv[], char *const envp[]);
 
diff --git a/init/Kconfig b/init/Kconfig
index bd8174f..2345902 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -715,6 +715,8 @@ config NET_NS
 	  Allow user space to create what appear to be multiple instances
 	  of the network stack.
 
+source "kernel/checkpoint/Kconfig"
+
 config BLK_DEV_INITRD
 	bool "Initial RAM filesystem and RAM disk (initramfs/initrd) support"
 	depends on BROKEN || !FRV
diff --git a/kernel/Makefile b/kernel/Makefile
index a987aa1..1b78cca 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -105,6 +105,7 @@ obj-$(CONFIG_PERF_EVENTS) += perf_event.o
 obj-$(CONFIG_HAVE_HW_BREAKPOINT) += hw_breakpoint.o
 obj-$(CONFIG_USER_RETURN_NOTIFIER) += user-return-notifier.o
 obj-$(CONFIG_PADATA) += padata.o
+obj-$(CONFIG_CHECKPOINT) += checkpoint/
 
 ifneq ($(CONFIG_SCHED_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
diff --git a/kernel/checkpoint/Kconfig b/kernel/checkpoint/Kconfig
new file mode 100644
index 0000000..ef7d406
--- /dev/null
+++ b/kernel/checkpoint/Kconfig
@@ -0,0 +1,14 @@
+# Architectures should define CHECKPOINT_SUPPORT when they have
+# implemented the hooks for processor state etc. needed by the
+# core checkpoint/restart code.
+
+config CHECKPOINT
+	bool "Checkpoint/restart (EXPERIMENTAL)"
+	depends on CHECKPOINT_SUPPORT && EXPERIMENTAL
+	help
+	  Application checkpoint/restart is the ability to save the
+	  state of a running application so that it can later resume
+	  its execution from the time at which it was checkpointed.
+
+	  Turning this option on will enable checkpoint and restart
+	  functionality in the kernel.
diff --git a/kernel/checkpoint/Makefile b/kernel/checkpoint/Makefile
new file mode 100644
index 0000000..8a32c6f
--- /dev/null
+++ b/kernel/checkpoint/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for linux checkpoint/restart.
+#
+
+obj-$(CONFIG_CHECKPOINT) += sys.o
diff --git a/kernel/checkpoint/sys.c b/kernel/checkpoint/sys.c
new file mode 100644
index 0000000..a81750a
--- /dev/null
+++ b/kernel/checkpoint/sys.c
@@ -0,0 +1,45 @@
+/*
+ *  Generic container checkpoint-restart
+ *
+ *  Copyright (C) 2008-2009 Oren Laadan
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/sched.h>
+#include <linux/kernel.h>
+#include <linux/syscalls.h>
+
+/**
+ * sys_checkpoint - checkpoint a container
+ * @pid: pid of the container init(1) process
+ * @fd: file to which to dump the checkpoint image
+ * @flags: checkpoint operation flags
+ * @logfd: fd to which to dump debug and error messages
+ *
+ * Returns positive identifier on success, 0 when returning from restart
+ * or negative value on error
+ */
+SYSCALL_DEFINE4(checkpoint, pid_t, pid, int, fd,
+		unsigned long, flags, int, logfd)
+{
+	return -ENOSYS;
+}
+
+/**
+ * sys_restart - restart a container
+ * @pid: pid of task root (in coordinator's namespace), or 0
+ * @fd: file from which to read the checkpoint image
+ * @flags: restart operation flags
+ * @logfd: fd to which to dump debug and error messages
+ *
+ * Returns negative value on error, or otherwise returns in the realm
+ * of the original checkpoint
+ */
+SYSCALL_DEFINE4(restart, pid_t, pid, int, fd,
+		unsigned long, flags, int, logfd)
+{
+	return -ENOSYS;
+}
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 70f2ea7..0206aca 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -181,3 +181,7 @@ cond_syscall(sys_eventfd2);
 
 /* performance counters: */
 cond_syscall(sys_perf_event_open);
+
+/* checkpoint/restart */
+cond_syscall(sys_checkpoint);
+cond_syscall(sys_restart);
-- 
1.6.3.3

* [PATCH v21 058/100] c/r: (s390): expose a constant for the number of words (CRs)
       [not found] <1272723382-19470-1-git-send-email-orenl@cs.columbia.edu>
                   ` (12 preceding siblings ...)
  2010-05-01 14:15 ` [PATCH v21 021/100] c/r: create syscalls: sys_checkpoint, sys_restart Oren Laadan
@ 2010-05-01 14:15 ` Oren Laadan
  2010-05-01 14:15 ` [PATCH v21 059/100] c/r: add CKPT_COPY() macro Oren Laadan
  2010-05-01 14:15 ` [PATCH v21 060/100] c/r: define s390-specific checkpoint-restart code Oren Laadan
  15 siblings, 0 replies; 26+ messages in thread
From: Oren Laadan @ 2010-05-01 14:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: containers, linux-kernel, Serge Hallyn, Matt Helsley,
	Pavel Emelyanov, Oren Laadan, linux-s390, Dan Smith

We need to use this value in the checkpoint/restart code and would like to
have a constant instead of a magic '3'.

Changelog:
    Jan 20:
            . Define s390x sys_restart wrapper
    Mar 30:
            . Add CHECKPOINT_SUPPORT in Kconfig (Nathan Lynch)
    Mar 03:
            . Picked up additional use of magic '3' in ptrace.h

Cc: linux-s390@vger.kernel.org
Signed-off-by: Dan Smith <danms@us.ibm.com>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>
---
 arch/s390/Kconfig          |    4 ++++
 arch/s390/kernel/process.c |    8 ++++++++
 2 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 0d8cd9b..b358e63 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -49,6 +49,10 @@ config GENERIC_TIME_VSYSCALL
 config GENERIC_CLOCKEVENTS
 	def_bool y
 
+config CHECKPOINT_SUPPORT
+	bool
+	default y if 64BIT
+
 config GENERIC_BUG
 	bool
 	depends on BUG
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index 799cbb0..15f5719 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -240,6 +240,14 @@ SYSCALL_DEFINE4(clone, unsigned long, newsp, unsigned long, clone_flags,
 		       parent_tidptr, child_tidptr);
 }
 
+#ifdef CONFIG_CHECKPOINT
+SYSCALL_DEFINE4(restart, pid_t, pid, int, fd, unsigned long, flags,
+		int, logfd)
+{
+	return do_sys_restart(pid, fd, flags, logfd);
+}
+#endif
+
 SYSCALL_DEFINE4(eclone, unsigned int, flags_low, struct clone_args __user *,
 		uca, int, args_size, pid_t __user *, pids)
 {
-- 
1.6.3.3

* [PATCH v21 059/100] c/r: add CKPT_COPY() macro
       [not found] <1272723382-19470-1-git-send-email-orenl@cs.columbia.edu>
                   ` (13 preceding siblings ...)
  2010-05-01 14:15 ` [PATCH v21 058/100] c/r: (s390): expose a constant for the number of words (CRs) Oren Laadan
@ 2010-05-01 14:15 ` Oren Laadan
  2010-05-01 14:15 ` [PATCH v21 060/100] c/r: define s390-specific checkpoint-restart code Oren Laadan
  15 siblings, 0 replies; 26+ messages in thread
From: Oren Laadan @ 2010-05-01 14:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: containers, linux-kernel, Serge Hallyn, Matt Helsley,
	Pavel Emelyanov, Dan Smith, linux-s390, Oren Laadan

From: Dan Smith <danms@us.ibm.com>

As suggested by Dave[1], this provides us a way to make the copy-in and
copy-out processes symmetric.  CKPT_COPY_ARRAY() provides us a way to do
the same thing but for arrays.  It's not critical, but it helps us unify
the checkpoint and restart paths for some things.
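
A small user-space sketch of how the two macros let one function serve
both directions. The macros are reduced versions of the ones added by
this patch: the kernel-only __must_be_array() check is dropped and
BUILD_BUG_ON() is replaced by C11 _Static_assert, which is an assumption
of this example. struct live_regs, struct saved_regs and copy_regs() are
hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <string.h>

#define CKPT_CPT 1	/* checkpoint: live state -> image */
#define CKPT_RST 2	/* restart:    image -> live state */

#define CKPT_COPY(op, SAVE, LIVE)				\
	do {							\
		if ((op) == CKPT_CPT)				\
			SAVE = LIVE;				\
		else						\
			LIVE = SAVE;				\
	} while (0)

#define CKPT_COPY_ARRAY(op, SAVE, LIVE, count)				\
	do {								\
		_Static_assert(sizeof(*(SAVE)) == sizeof(*(LIVE)),	\
			       "element size mismatch");		\
		if ((op) == CKPT_CPT)					\
			memcpy(SAVE, LIVE, (count) * sizeof(*(SAVE)));	\
		else							\
			memcpy(LIVE, SAVE, (count) * sizeof(*(SAVE)));	\
	} while (0)

/* Hypothetical live register state and its image-header counterpart. */
struct live_regs  { unsigned long psw_addr; unsigned long gprs[4]; };
struct saved_regs { unsigned long psw_addr; unsigned long gprs[4]; };

/* One routine for both directions; 'op' selects which way data flows. */
void copy_regs(int op, struct saved_regs *h, struct live_regs *regs)
{
	assert(h && regs);
	CKPT_COPY(op, h->psw_addr, regs->psw_addr);
	CKPT_COPY_ARRAY(op, h->gprs, regs->gprs, 4);
}
```

Calling copy_regs(CKPT_CPT, ...) fills the image from the live state and
copy_regs(CKPT_RST, ...) does the reverse, so the field list is written
once and cannot drift between the checkpoint and restart paths.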

Changelog:
    Mar 04:
            . Removed semicolons
            . Added build-time check for __must_be_array in CKPT_COPY_ARRAY
    Feb 27:
            . Changed CKPT_COPY() to use assignment, eliminating the need
              for the CKPT_COPY_BIT() macro
            . Add CKPT_COPY_ARRAY() macro to help copying register arrays,
              etc
            . Move the macro definitions inside the CR #ifdef
    Feb 25:
            . Changed WARN_ON() to BUILD_BUG_ON()

Cc: linux-s390@vger.kernel.org
Signed-off-by: Dan Smith <danms@us.ibm.com>
Signed-off-by: Oren Laadan <orenl@cs.columbia.edu>
Acked-by: Serge E. Hallyn <serue@us.ibm.com>
Tested-by: Serge E. Hallyn <serue@us.ibm.com>

1: https://lists.linux-foundation.org/pipermail/containers/2009-February/015821.html (all the way at the bottom)
---
 include/linux/checkpoint.h |   28 ++++++++++++++++++++++++++++
 1 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/include/linux/checkpoint.h b/include/linux/checkpoint.h
index 2f72a6a..cd76f70 100644
--- a/include/linux/checkpoint.h
+++ b/include/linux/checkpoint.h
@@ -246,6 +246,34 @@ static inline int ckpt_validate_errno(int errno)
 	return (errno >= 0) && (errno < MAX_ERRNO);
 }
 
+/* useful macros to copy fields and buffers to/from ckpt_hdr_xxx structures */
+#define CKPT_CPT 1
+#define CKPT_RST 2
+
+#define CKPT_COPY(op, SAVE, LIVE)				        \
+	do {							\
+		if (op == CKPT_CPT)				\
+			SAVE = LIVE;				\
+		else						\
+			LIVE = SAVE;				\
+	} while (0)
+
+/*
+ * Copy @count items from @LIVE to @SAVE if op is CKPT_CPT (otherwise,
+ * copy in the reverse direction)
+ */
+#define CKPT_COPY_ARRAY(op, SAVE, LIVE, count)				\
+	do {								\
+		(void)__must_be_array(SAVE);				\
+		(void)__must_be_array(LIVE);				\
+		BUILD_BUG_ON(sizeof(*SAVE) != sizeof(*LIVE));		\
+		if (op == CKPT_CPT)					\
+			memcpy(SAVE, LIVE, count * sizeof(*SAVE));	\
+		else							\
+			memcpy(LIVE, SAVE, count * sizeof(*SAVE));	\
+	} while (0)
+
+
 /* debugging flags */
 #define CKPT_DBASE	0x1		/* anything */
 #define CKPT_DSYS	0x2		/* generic (system) */
-- 
1.6.3.3

* [PATCH v21 060/100] c/r: define s390-specific checkpoint-restart code
       [not found] <1272723382-19470-1-git-send-email-orenl@cs.columbia.edu>
                   ` (14 preceding siblings ...)
  2010-05-01 14:15 ` [PATCH v21 059/100] c/r: add CKPT_COPY() macro Oren Laadan
@ 2010-05-01 14:15 ` Oren Laadan
  15 siblings, 0 replies; 26+ messages in thread
From: Oren Laadan @ 2010-05-01 14:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: containers, linux-kernel, Serge Hallyn, Matt Helsley,
	Pavel Emelyanov, Dan Smith, linux-s390

From: Dan Smith <danms@us.ibm.com>

Implement the s390 arch-specific checkpoint/restart helpers.  This
is on top of Oren Laadan's c/r code.

With these, I am able to checkpoint and restart simple programs as per
Oren's patch intro.  While on x86 I never had to freeze a single task
to checkpoint it, on s390 I do need to.  That is a prereq for consistent
snapshots (esp with multiple processes) anyway so I don't see that as
a problem.

Changelog [v21]:
  - Do not include checkpoint_hdr.h explicitly
Changelog [v20]:
  - [Serge Hallyn] save_access_regs for self-checkpoint
  - [Serge Hallyn] send uses_interp=1 to arch_setup_additional_pages
Changelog [v19]:
  - [Serge Hallyn] Move get_signal_to_deliver() up in do_signal
Changelog [v19-rc3]:
  - [Serge Hallyn] Use simpler test_task_thread to test current ti flags
  - [Serge Hallyn] Fix 31-bit s390 checkpoint/restart wrappers
  - [Serge Hallyn] Update sys_checkpoint (do_sys_checkpoint on all archs)
  - [Oren Laadan] Move checkpoint.c from arch/s390/mm->arch/s390/kernel
Changelog [v19-rc1]:
  - [Matt Helsley] Add cpp definitions for enums
Changelog:
    Jun 15:
            . Fix checkpoint and restart compat wrappers
    May 28:
            . Export asm/checkpoint_hdr.h to userspace
            . Define CKPT_ARCH_ID for S390
    Apr 11:
            . Introduce ckpt_arch_vdso()
    Feb 27:
            . Add checkpoint_s390.h
            . Fixed up save and restore of PSW, with the non-address bits
              properly masked out
    Feb 25:
            . Make checkpoint_hdr.h safe for inclusion in userspace
            . Replace comment about vdso code
            . Add comment about restoring access registers
            . Write and read an empty ckpt_hdr_head_arch record to appease
              code (mktree) that expects it to be there
            . Utilize NUM_CKPT_WORDS in checkpoint_hdr.h
    Feb 24:
            . Use CKPT_COPY() to unify the un/loading of cpu and mm state
            . Fix fprs definition in ckpt_hdr_cpu
            . Remove debug WARN_ON() from checkpoint.c
    Feb 23:
            . Macro-ize the un/packing of trace flags
            . Fix the crash when externally-linked
            . Break out the restart functions into restart.c
            . Remove unneeded s390_enable_sie() call
    Jan 30:
            . Switched types in ckpt_hdr_cpu to __u64 etc.
              (Per Oren suggestion)
            . Replaced direct inclusion of structs in
              ckpt_hdr_cpu with the struct members.
              (Per Oren suggestion)
            . Also ended up adding a bunch of new things
              into restart (mm_segment, ksp, etc) in vain
              attempt to get code using fpu to not segfault
              after restart.

Cc: linux-s390@vger.kernel.org
Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
Signed-off-by: Dan Smith <danms@us.ibm.com>
Acked-by: Oren Laadan <orenl@cs.columbia.edu>
---
 arch/s390/include/asm/Kbuild           |    1 +
 arch/s390/include/asm/checkpoint_hdr.h |   98 ++++++++++++
 arch/s390/include/asm/thread_info.h    |    2 +
 arch/s390/include/asm/unistd.h         |    4 +-
 arch/s390/kernel/Makefile              |    1 +
 arch/s390/kernel/checkpoint.c          |  254 ++++++++++++++++++++++++++++++++
 arch/s390/kernel/compat_wrapper.S      |   16 ++
 arch/s390/kernel/process.c             |   19 +++
 arch/s390/kernel/signal.c              |   16 ++
 arch/s390/kernel/syscalls.S            |    2 +
 arch/s390/mm/Makefile                  |    1 +
 arch/s390/mm/checkpoint_s390.h         |   23 +++
 include/linux/checkpoint_hdr.h         |    3 +
 mm/mmap.c                              |    9 +-
 14 files changed, 447 insertions(+), 2 deletions(-)
 create mode 100644 arch/s390/include/asm/checkpoint_hdr.h
 create mode 100644 arch/s390/kernel/checkpoint.c
 create mode 100644 arch/s390/mm/checkpoint_s390.h

diff --git a/arch/s390/include/asm/Kbuild b/arch/s390/include/asm/Kbuild
index 63a2341..3282a6e 100644
--- a/arch/s390/include/asm/Kbuild
+++ b/arch/s390/include/asm/Kbuild
@@ -8,6 +8,7 @@ header-y += ucontext.h
 header-y += vtoc.h
 header-y += zcrypt.h
 header-y += chsc.h
+header-y += checkpoint_hdr.h
 
 unifdef-y += cmb.h
 unifdef-y += debug.h
diff --git a/arch/s390/include/asm/checkpoint_hdr.h b/arch/s390/include/asm/checkpoint_hdr.h
new file mode 100644
index 0000000..e3312c0
--- /dev/null
+++ b/arch/s390/include/asm/checkpoint_hdr.h
@@ -0,0 +1,98 @@
+#ifndef __ASM_S390_CKPT_HDR_H
+#define __ASM_S390_CKPT_HDR_H
+/*
+ *  Checkpoint/restart - architecture specific headers s/390
+ *
+ *  Copyright IBM Corp. 2009
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#ifndef _CHECKPOINT_CKPT_HDR_H_
+#error asm/checkpoint_hdr.h included directly
+#endif
+
+#include <linux/types.h>
+#include <asm/ptrace.h>
+
+#ifdef __KERNEL__
+#include <asm/processor.h>
+#endif
+
+#ifdef CONFIG_64BIT
+#define CKPT_ARCH_ID	CKPT_ARCH_S390X
+/* else - if we ever support 32bit - CKPT_ARCH_S390 */
+#endif
+
+/*
+ * Notes
+ * NUM_GPRS defined in <asm/ptrace.h> to be 16
+ * NUM_FPRS defined in <asm/ptrace.h> to be 16
+ * NUM_ACRS defined in <asm/ptrace.h> to be 16
+ * NUM_CR_WORDS defined in <asm/ptrace.h> to be 3
+ *              but is not yet in glibc headers.
+ */
+
+#ifndef NUM_CR_WORDS
+#define NUM_CR_WORDS 3
+#endif
+
+struct ckpt_hdr_cpu {
+	struct ckpt_hdr h;
+	__u64 args[1];
+	__u64 gprs[NUM_GPRS];
+	__u64 orig_gpr2;
+	__u16 svcnr;
+	__u16 ilc;
+	__u32 acrs[NUM_ACRS];
+	__u64 ieee_instruction_pointer;
+
+	/* psw_t */
+	__u64 psw_t_mask;
+	__u64 psw_t_addr;
+
+	/* s390_fp_regs_t */
+	__u32 fpc;
+	union {
+		float f;
+		double d;
+		__u64 ui;
+		struct {
+			__u32 fp_hi;
+			__u32 fp_lo;
+		} fp;
+	} fprs[NUM_FPRS];
+
+	/* per_struct */
+	__u64 per_control_regs[NUM_CR_WORDS];
+	__u64 starting_addr;
+	__u64 ending_addr;
+	__u64 address;
+	__u16 perc_atmid;
+	__u8 access_id;
+	__u8 single_step;
+	__u8 instruction_fetch;
+};
+
+struct ckpt_hdr_thread {
+	struct ckpt_hdr h;
+	__u64 thread_info_flags;
+};
+
+struct ckpt_hdr_mm_context {
+	struct ckpt_hdr h;
+	unsigned long vdso_base;
+	int noexec;
+	int has_pgste;
+	int alloc_pgste;
+	unsigned long asce_bits;
+	unsigned long asce_limit;
+};
+
+struct ckpt_hdr_header_arch {
+	struct ckpt_hdr h;
+};
+
+#endif /* __ASM_S390_CKPT_HDR_H */
diff --git a/arch/s390/include/asm/thread_info.h b/arch/s390/include/asm/thread_info.h
index 34f0873..60f932e 100644
--- a/arch/s390/include/asm/thread_info.h
+++ b/arch/s390/include/asm/thread_info.h
@@ -99,6 +99,7 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_MEMDIE		18
 #define TIF_RESTORE_SIGMASK	19	/* restore signal mask in do_signal() */
 #define TIF_FREEZE		20	/* thread is freezing for suspend */
+#define TIF_SIG_RESTARTBLOCK	23	/* restart must set TIF_RESTART_SVC */
 
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
 #define _TIF_RESTORE_SIGMASK	(1<<TIF_RESTORE_SIGMASK)
@@ -114,6 +115,7 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_POLLING_NRFLAG	(1<<TIF_POLLING_NRFLAG)
 #define _TIF_31BIT		(1<<TIF_31BIT)
 #define _TIF_FREEZE		(1<<TIF_FREEZE)
+#define _TIF_SIG_RESTARTBLOCK	(1<<TIF_SIG_RESTARTBLOCK)
 
 #endif /* __KERNEL__ */
 
diff --git a/arch/s390/include/asm/unistd.h b/arch/s390/include/asm/unistd.h
index ff13be1..79a4178 100644
--- a/arch/s390/include/asm/unistd.h
+++ b/arch/s390/include/asm/unistd.h
@@ -270,7 +270,9 @@
 #define __NR_rt_tgsigqueueinfo	330
 #define __NR_perf_event_open	331
 #define __NR_eclone		332
-#define NR_syscalls 333
+#define __NR_checkpoint		333
+#define __NR_restart		334
+#define NR_syscalls 335
 
 /* 
  * There are some system calls that are not present on 64 bit, some
diff --git a/arch/s390/kernel/Makefile b/arch/s390/kernel/Makefile
index 64230bc..b472855 100644
--- a/arch/s390/kernel/Makefile
+++ b/arch/s390/kernel/Makefile
@@ -36,6 +36,7 @@ obj-$(CONFIG_SMP)		+= smp.o topology.o
 obj-$(CONFIG_SMP)		+= $(if $(CONFIG_64BIT),switch_cpu64.o, \
 							switch_cpu.o)
 obj-$(CONFIG_HIBERNATION)	+= suspend.o swsusp_asm64.o
+obj-$(CONFIG_CHECKPOINT)	+= checkpoint.o
 obj-$(CONFIG_AUDIT)		+= audit.o
 compat-obj-$(CONFIG_AUDIT)	+= compat_audit.o
 obj-$(CONFIG_COMPAT)		+= compat_linux.o compat_signal.o \
diff --git a/arch/s390/kernel/checkpoint.c b/arch/s390/kernel/checkpoint.c
new file mode 100644
index 0000000..c0de4cd
--- /dev/null
+++ b/arch/s390/kernel/checkpoint.c
@@ -0,0 +1,254 @@
+/*
+ *  Checkpoint/restart - architecture specific support for s390
+ *
+ *  Copyright IBM Corp. 2009
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#include <linux/kernel.h>
+#include <asm/system.h>
+#include <asm/pgtable.h>
+#include <asm/elf.h>
+#include <asm/unistd.h>
+
+#include <linux/checkpoint.h>
+
+/**************************************************************************
+ * Checkpoint
+ */
+
+static void s390_copy_regs(int op, struct ckpt_hdr_cpu *h,
+			   struct task_struct *t)
+{
+	struct pt_regs *regs = task_pt_regs(t);
+	struct thread_struct *thr = &t->thread;
+
+	/* Save the whole PSW to facilitate forensic debugging, but only
+	 * restore the address portion to avoid letting userspace do
+	 * bad things by manipulating its value.
+	 */
+	if (op == CKPT_CPT) {
+		CKPT_COPY(op, h->psw_t_addr, regs->psw.addr);
+	} else {
+		regs->psw.addr &= ~PSW_ADDR_INSN;
+		regs->psw.addr |= h->psw_t_addr;
+	}
+
+	CKPT_COPY(op, h->args[0], regs->args[0]);
+	CKPT_COPY(op, h->orig_gpr2, regs->orig_gpr2);
+	CKPT_COPY(op, h->svcnr, regs->svcnr);
+	CKPT_COPY(op, h->ilc, regs->ilc);
+	CKPT_COPY(op, h->ieee_instruction_pointer,
+		thr->ieee_instruction_pointer);
+	CKPT_COPY(op, h->psw_t_mask, regs->psw.mask);
+	CKPT_COPY(op, h->fpc, thr->fp_regs.fpc);
+	CKPT_COPY(op, h->starting_addr, thr->per_info.starting_addr);
+	CKPT_COPY(op, h->ending_addr, thr->per_info.ending_addr);
+	CKPT_COPY(op, h->address, thr->per_info.lowcore.words.address);
+	CKPT_COPY(op, h->perc_atmid, thr->per_info.lowcore.words.perc_atmid);
+	CKPT_COPY(op, h->access_id, thr->per_info.lowcore.words.access_id);
+	CKPT_COPY(op, h->single_step, thr->per_info.single_step);
+	CKPT_COPY(op, h->instruction_fetch, thr->per_info.instruction_fetch);
+
+	CKPT_COPY_ARRAY(op, h->gprs, regs->gprs, NUM_GPRS);
+	/*
+	 * for checkpoint in process context (from within a container),
+	 * the actual syscall is taking place at this very moment; so
+	 * we (optimistically) substitute the future return value (0) of
+	 * this syscall into gprs[2], so that upon restart it will
+	 * succeed (or it will endlessly retry checkpoint...)
+	 */
+	if (op == CKPT_CPT && t == current) {
+		BUG_ON(h->gprs[2] < 0);
+		h->gprs[2] = 0;
+	}
+
+	/*
+	 * if a checkpoint was taken while interrupted from a restartable
+	 * syscall, then treat this as though signr==0 (since we did not
+	 * handle the signal) and finish the last part of do_signal
+	 */
+	if (op == CKPT_RST && test_thread_flag(TIF_SIG_RESTARTBLOCK)) {
+		regs->gprs[2] = __NR_restart_syscall;
+		set_thread_flag(TIF_RESTART_SVC);
+		clear_thread_flag(TIF_SIG_RESTARTBLOCK);
+	}
+
+	CKPT_COPY_ARRAY(op, h->fprs, thr->fp_regs.fprs, NUM_FPRS);
+	if (op == CKPT_CPT && t == current)
+		save_access_regs(h->acrs);
+	else
+		CKPT_COPY_ARRAY(op, h->acrs, thr->acrs, NUM_ACRS);
+	CKPT_COPY_ARRAY(op, h->per_control_regs,
+		      thr->per_info.control_regs.words.cr, NUM_CR_WORDS);
+}
+
+static void s390_mm(int op, struct ckpt_hdr_mm_context *h,
+		    struct mm_struct *mm)
+{
+	CKPT_COPY(op, h->noexec, mm->context.noexec);
+	CKPT_COPY(op, h->has_pgste, mm->context.has_pgste);
+	CKPT_COPY(op, h->alloc_pgste, mm->context.alloc_pgste);
+	CKPT_COPY(op, h->asce_bits, mm->context.asce_bits);
+	CKPT_COPY(op, h->asce_limit, mm->context.asce_limit);
+}
+
+int checkpoint_thread(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_thread *h;
+	int ret;
+
+	/* we will eventually support this, as we do on x86-64 */
+	if (test_tsk_thread_flag(t, TIF_31BIT)) {
+		ckpt_err(ctx, -EINVAL, "checkpoint of 31-bit task\n");
+		return -EINVAL;
+	}
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_THREAD);
+	if (!h)
+		return -ENOMEM;
+
+	h->thread_info_flags = task_thread_info(t)->flags;
+	ret = ckpt_write_obj(ctx, &h->h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+/* dump the cpu state and registers of a given task */
+int checkpoint_cpu(struct ckpt_ctx *ctx, struct task_struct *t)
+{
+	struct ckpt_hdr_cpu *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_CPU);
+	if (!h)
+		return -ENOMEM;
+
+	s390_copy_regs(CKPT_CPT, h, t);
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+/* Write an empty header since it is assumed to be there */
+int checkpoint_write_header_arch(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header_arch *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_HEADER_ARCH);
+	if (!h)
+		return -ENOMEM;
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+int checkpoint_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	struct ckpt_hdr_mm_context *h;
+	int ret;
+
+	h = ckpt_hdr_get_type(ctx, sizeof(*h), CKPT_HDR_MM_CONTEXT);
+	if (!h)
+		return -ENOMEM;
+
+	s390_mm(CKPT_CPT, h, mm);
+
+	ret = ckpt_write_obj(ctx, (struct ckpt_hdr *) h);
+	ckpt_hdr_put(ctx, h);
+
+	return ret;
+}
+
+/**************************************************************************
+ * Restart
+ */
+
+int restore_thread(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_thread *h;
+
+	/* a 31-bit task cannot call sys_restart right now */
+	if (test_thread_flag(TIF_31BIT)) {
+		ckpt_err(ctx, -EINVAL, "restart from 31-bit task\n");
+		return -EINVAL;
+	}
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_THREAD);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	/*
+	 * If we were checkpointed while do_signal() interrupted a
+	 * syscall with restart blocks, then we have some cleanup to
+	 * do at end of restart, in order to finish our pretense of
+	 * having handled signr==0.  (See last part of do_signal).
+	 *
+	 * We can't set gprs[2] here bc we haven't copied registers
+	 * yet, that happens later in restore_cpu().  So re-set the
+	 * TIF_SIG_RESTARTBLOCK thread flag so we can detect it
+	 * at restore_cpu()->s390_copy_regs() and do the right thing.
+	 */
+	if (h->thread_info_flags & _TIF_SIG_RESTARTBLOCK)
+		set_thread_flag(TIF_SIG_RESTARTBLOCK);
+
+	ckpt_hdr_put(ctx, h);
+
+	return 0;
+}
+
+int restore_cpu(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_cpu *h;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_CPU);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	s390_copy_regs(CKPT_RST, h, current);
+
+	/* s390 does not restore the access registers after a syscall,
+	 * but does on a task switch.  Since we're switching tasks (in
+	 * a way), we need to replicate that behavior here.
+	 */
+	restore_access_regs(h->acrs);
+
+	ckpt_hdr_put(ctx, h);
+	return 0;
+}
+
+int restore_read_header_arch(struct ckpt_ctx *ctx)
+{
+	struct ckpt_hdr_header_arch *h;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_HEADER_ARCH);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	ckpt_hdr_put(ctx, h);
+	return 0;
+}
+
+
+int restore_mm_context(struct ckpt_ctx *ctx, struct mm_struct *mm)
+{
+	struct ckpt_hdr_mm_context *h;
+
+	h = ckpt_read_obj_type(ctx, sizeof(*h), CKPT_HDR_MM_CONTEXT);
+	if (IS_ERR(h))
+		return PTR_ERR(h);
+
+	s390_mm(CKPT_RST, h, mm);
+
+	ckpt_hdr_put(ctx, h);
+	return 0;
+}
diff --git a/arch/s390/kernel/compat_wrapper.S b/arch/s390/kernel/compat_wrapper.S
index b7bedfa..d57e5e0 100644
--- a/arch/s390/kernel/compat_wrapper.S
+++ b/arch/s390/kernel/compat_wrapper.S
@@ -1861,3 +1861,19 @@ sys32_execve_wrapper:
 	llgtr	%r3,%r3			# compat_uptr_t *
 	llgtr	%r4,%r4			# compat_uptr_t *
 	jg	sys32_execve		# branch to system call
+
+	.globl sys_checkpoint_wrapper
+sys_checkpoint_wrapper:
+	lgfr	%r2,%r2			# pid_t
+	lgfr	%r3,%r3			# int
+	llgfr	%r4,%r4			# unsigned long
+	lgfr	%r5,%r5			# int
+	jg	sys_checkpoint
+
+	.globl sys_restart_wrapper
+sys_restart_wrapper:
+	lgfr	%r2,%r2			# pid_t
+	lgfr	%r3,%r3			# int
+	llgfr	%r4,%r4			# unsigned long
+	lgfr	%r5,%r5			# int
+	jg	sys_restart
diff --git a/arch/s390/kernel/process.c b/arch/s390/kernel/process.c
index 15f5719..efc7e8a 100644
--- a/arch/s390/kernel/process.c
+++ b/arch/s390/kernel/process.c
@@ -32,6 +32,7 @@
 #include <linux/kernel_stat.h>
 #include <linux/syscalls.h>
 #include <linux/compat.h>
+#include <linux/checkpoint.h>
 #include <asm/compat.h>
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -241,11 +242,29 @@ SYSCALL_DEFINE4(clone, unsigned long, newsp, unsigned long, clone_flags,
 }
 
 #ifdef CONFIG_CHECKPOINT
+SYSCALL_DEFINE4(checkpoint, pid_t, pid, int, fd, unsigned long, flags,
+		int, logfd)
+{
+	return do_sys_checkpoint(pid, fd, flags, logfd);
+}
+
 SYSCALL_DEFINE4(restart, pid_t, pid, int, fd, unsigned long, flags,
 		int, logfd)
 {
 	return do_sys_restart(pid, fd, flags, logfd);
 }
+#else
+SYSCALL_DEFINE4(checkpoint, pid_t, pid, int, fd, unsigned long, flags,
+		int, logfd)
+{
+	return -ENOSYS;
+}
+
+SYSCALL_DEFINE4(restart, pid_t, pid, int, fd, unsigned long, flags,
+		int, logfd)
+{
+	return -ENOSYS;
+}
 #endif
 
 SYSCALL_DEFINE4(eclone, unsigned int, flags_low, struct clone_args __user *,
diff --git a/arch/s390/kernel/signal.c b/arch/s390/kernel/signal.c
index 6289945..83425b1 100644
--- a/arch/s390/kernel/signal.c
+++ b/arch/s390/kernel/signal.c
@@ -459,6 +459,16 @@ void do_signal(struct pt_regs *regs)
 			break;
 		case -ERESTART_RESTARTBLOCK:
 			regs->gprs[2] = -EINTR;
+			/*
+			 * This condition is the only one which requires
+			 * special care after handling a signr==0.  So if
+			 * we get frozen and checkpointed at the
+			 * get_signal_to_deliver() below, then we need
+			 * to convey this condition to sys_restart() so it
+			 * can set the restored thread up to run the restart
+			 * block.
+			 */
+			set_thread_flag(TIF_SIG_RESTARTBLOCK);
 		}
 		regs->svcnr = 0;	/* Don't deal with this again. */
 	}
@@ -467,6 +477,12 @@ void do_signal(struct pt_regs *regs)
 	   the debugger may change all our registers ... */
 	signr = get_signal_to_deliver(&info, &ka, regs, NULL);
 
+	/*
+	 * we won't get frozen past this so clear the thread flag hinting
+	 * to sys_restart that TIF_RESTART_SVC must be set.
+	 */
+	clear_thread_flag(TIF_SIG_RESTARTBLOCK);
+
 	/* Depending on the signal settings we may need to revert the
 	   decision to restart the system call. */
 	if (signr > 0 && regs->psw.addr == restart_addr) {
diff --git a/arch/s390/kernel/syscalls.S b/arch/s390/kernel/syscalls.S
index 08eab1d..9f1f28e 100644
--- a/arch/s390/kernel/syscalls.S
+++ b/arch/s390/kernel/syscalls.S
@@ -341,3 +341,5 @@ SYSCALL(sys_pwritev,sys_pwritev,compat_sys_pwritev_wrapper)
 SYSCALL(sys_rt_tgsigqueueinfo,sys_rt_tgsigqueueinfo,compat_sys_rt_tgsigqueueinfo_wrapper) /* 330 */
 SYSCALL(sys_perf_event_open,sys_perf_event_open,sys_perf_event_open_wrapper)
 SYSCALL(sys_eclone,sys_eclone,sys_eclone_wrapper)
+SYSCALL(sys_checkpoint,sys_checkpoint,sys_checkpoint_wrapper)
+SYSCALL(sys_restart,sys_restart,sys_restart_wrapper)
diff --git a/arch/s390/mm/Makefile b/arch/s390/mm/Makefile
index eec0544..359a3bc 100644
--- a/arch/s390/mm/Makefile
+++ b/arch/s390/mm/Makefile
@@ -6,3 +6,4 @@ obj-y	 := init.o fault.o extmem.o mmap.o vmem.o pgtable.o maccess.o \
 	    page-states.o
 obj-$(CONFIG_CMM) += cmm.o
 obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
+obj-$(CONFIG_PAGE_STATES) += page-states.o
diff --git a/arch/s390/mm/checkpoint_s390.h b/arch/s390/mm/checkpoint_s390.h
new file mode 100644
index 0000000..c3bf24d
--- /dev/null
+++ b/arch/s390/mm/checkpoint_s390.h
@@ -0,0 +1,23 @@
+/*
+ *  Checkpoint/restart - architecture specific support for s390
+ *
+ *  Copyright IBM Corp. 2009
+ *
+ *  This file is subject to the terms and conditions of the GNU General Public
+ *  License.  See the file COPYING in the main directory of the Linux
+ *  distribution for more details.
+ */
+
+#ifndef _S390_CHECKPOINT_H
+#define _S390_CHECKPOINT_H
+
+#include <linux/checkpoint_hdr.h>
+#include <linux/sched.h>
+#include <linux/mm_types.h>
+
+extern void checkpoint_s390_regs(int op, struct ckpt_hdr_cpu *h,
+				 struct task_struct *t);
+extern void checkpoint_s390_mm(int op, struct ckpt_hdr_mm_context *h,
+			       struct mm_struct *mm);
+
+#endif /* _S390_CHECKPOINT_H */
diff --git a/include/linux/checkpoint_hdr.h b/include/linux/checkpoint_hdr.h
index 3ae2d7d..8f81da1 100644
--- a/include/linux/checkpoint_hdr.h
+++ b/include/linux/checkpoint_hdr.h
@@ -133,10 +133,13 @@ enum {
 
 /* architecture */
 enum {
+	/* do not change order (will break ABI) */
 	CKPT_ARCH_X86_32 = 1,
 #define CKPT_ARCH_X86_32 CKPT_ARCH_X86_32
 	CKPT_ARCH_X86_64,
 #define CKPT_ARCH_X86_64 CKPT_ARCH_X86_64
+	CKPT_ARCH_S390X,
+#define CKPT_ARCH_S390X CKPT_ARCH_S390X
 };
 
 /* shared objrects (objref) */
diff --git a/mm/mmap.c b/mm/mmap.c
index ddbe589..4b36b2a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2495,7 +2495,14 @@ int special_mapping_restore(struct ckpt_ctx *ctx,
 		ret = syscall32_setup_pages(NULL, h->vm_start, 0);
 	else
 #endif
-	ret = arch_setup_additional_pages(NULL, h->vm_start, 0);
+	/*
+	 * Pass uses_interp=1 to arch_setup_additional_pages because
+	 * s390 won't set up vdso if it is 0. Here it's safe to assume
+	 * that there is a vdso in the checkpoint image, and therefore
+	 * the checkpointed program was dynamically linked. The other
+	 * architectures ignore uses_interp altogether.
+	 */
+	ret = arch_setup_additional_pages(NULL, h->vm_start, 1);
 #endif
 
 	return ret;
-- 
1.6.3.3
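
For readers following s390_copy_regs() and s390_mm() in the patch above:
both take an "op" argument and use CKPT_COPY() so that one function body
serves checkpoint and restart. The macro itself comes from an earlier patch
in the series and is not shown here, so the following is only a minimal
userspace sketch of the idea. The macro bodies, the enum values, and all
demo_* names are assumptions made for illustration, not the kernel's
definitions.

```c
#include <stdint.h>
#include <string.h>

/* Direction selectors; the names mirror the patch, the values are assumed. */
enum { CKPT_CPT = 1, CKPT_RST = 2 };

/* Copy LIVE into SAVE on checkpoint, SAVE into LIVE on restart. */
#define CKPT_COPY(op, SAVE, LIVE)		\
	do {					\
		if ((op) == CKPT_CPT)		\
			(SAVE) = (LIVE);	\
		else				\
			(LIVE) = (SAVE);	\
	} while (0)

/* Same idea for arrays of "count" elements. */
#define CKPT_COPY_ARRAY(op, SAVE, LIVE, count)				\
	do {								\
		if ((op) == CKPT_CPT)					\
			memcpy((SAVE), (LIVE),				\
			       (count) * sizeof(*(SAVE)));		\
		else							\
			memcpy((LIVE), (SAVE),				\
			       (count) * sizeof(*(SAVE)));		\
	} while (0)

#define DEMO_NUM_GPRS 4

/* Hypothetical stand-ins for ckpt_hdr_cpu and the live task state. */
struct demo_image { uint32_t fpc; uint64_t gprs[DEMO_NUM_GPRS]; };
struct demo_live  { uint32_t fpc; uint64_t gprs[DEMO_NUM_GPRS]; };

/* One body handles both directions, as s390_copy_regs() does. */
static void demo_copy_regs(int op, struct demo_image *h, struct demo_live *t)
{
	CKPT_COPY(op, h->fpc, t->fpc);
	CKPT_COPY_ARRAY(op, h->gprs, t->gprs, DEMO_NUM_GPRS);
}

/* Checkpoint, clobber the live state, restore. Returns 0 if it matches. */
int demo_roundtrip(void)
{
	struct demo_live live = { .fpc = 0x1234, .gprs = { 1, 2, 3, 4 } };
	struct demo_live saved = live;
	struct demo_image img;

	memset(&img, 0, sizeof(img));
	demo_copy_regs(CKPT_CPT, &img, &live);
	memset(&live, 0, sizeof(live));
	demo_copy_regs(CKPT_RST, &img, &live);

	return (live.fpc == saved.fpc &&
		memcmp(live.gprs, saved.gprs, sizeof(live.gprs)) == 0) ? 0 : 1;
}
```

Keeping the field list in one place means a field cannot be saved without
also being restored. Fields that need asymmetric treatment, such as the PSW
address in the patch, still get an explicit "if (op == CKPT_CPT)" branch.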

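The compat wrappers added to compat_wrapper.S above use lgfr for the pid_t
and int arguments and llgfr for the unsigned long argument. For anyone not
fluent in s390 assembly, lgfr sign-extends the low 32 bits of a register to
64 bits and llgfr zero-extends them. A rough C equivalent of the two
conversions is sketched below; the demo_* function names are made up for
this note and are not part of the patch.

```c
#include <stdint.h>

/*
 * Like lgfr: treat the low 32 bits as a signed value and sign-extend it.
 * The narrowing cast to int32_t is implementation-defined in older C
 * standards, but wraps as two's complement on the usual compilers.
 */
int64_t demo_sign_extend32(uint64_t reg)
{
	return (int64_t)(int32_t)(uint32_t)reg;
}

/* Like llgfr: keep the low 32 bits and clear the upper 32. */
uint64_t demo_zero_extend32(uint64_t reg)
{
	return (uint64_t)(uint32_t)reg;
}
```

A 31-bit caller leaves the upper half of each 64-bit register undefined, so
the wrapper has to normalize every argument before branching to the 64-bit
system call.
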
* Re: [PATCH v21 001/100] eclone (1/11): Factor out code to allocate pidmap page
  2010-05-01 14:14 ` [PATCH v21 001/100] eclone (1/11): Factor out code to allocate pidmap page Oren Laadan
@ 2010-05-01 22:10   ` David Miller
  2010-05-02  0:14     ` Josh Boyer
                       ` (3 more replies)
  2010-05-04 14:43   ` David Howells
  1 sibling, 4 replies; 26+ messages in thread
From: David Miller @ 2010-05-01 22:10 UTC (permalink / raw)
  To: orenl
  Cc: linux-s390, linux-api, containers, x86, linux-kernel,
	linuxppc-dev, matthltc, serue, akpm, sukadev, xemul

NO WAY, there is no way in the world you should post 100 patches
at a time to any mailing list, especially those at vger.kernel.org
that have thousands upon thousands of subscribers.

Post only small, well contained, sets of patches at a time.  At most
10 or so in one go.

Do you realize how much mail traffic you generate by posting so many
patches at one time, and how unlikely it is for anyone to actually
sift through and review your patches after you've spammed them by
posting so many at one time?

A second infraction and I will have no choice but to block you at the
SMTP level at vger.kernel.org so please do not do it again.

Thanks.

* Re: [PATCH v21 001/100] eclone (1/11): Factor out code to allocate pidmap page
  2010-05-01 22:10   ` David Miller
@ 2010-05-02  0:14     ` Josh Boyer
  2010-05-02  0:25     ` Matt Helsley
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 26+ messages in thread
From: Josh Boyer @ 2010-05-02  0:14 UTC (permalink / raw)
  To: David Miller
  Cc: linux-s390, orenl, containers, x86, linux-kernel, linuxppc-dev,
	matthltc, linux-api, serue, akpm, sukadev, xemul

On Sat, May 01, 2010 at 03:10:22PM -0700, David Miller wrote:
>
>NO WAY, there is no way in the world you should post 100 patches
>at a time to any mailing list, especially those at vger.kernel.org
>that have thousands upon thousands of subscribers.
>
>Post only small, well contained, sets of patches at a time.  At most
>10 or so in one go.
>
>Do you realize how much mail traffic you generate by posting so many
>patches at one time, and how unlikely it is for anyone to actually
>sift through and review your patches after you've spammed them by
>posting so many at one time?
>
>A second infraction and I will have no choice but to block you at the
>SMTP level at vger.kernel.org so please do not do it again.

So I really agree with everything you said here, but I do wonder why you haven't
sent a similar rant about the often 100+ patchsets for the -stable series.  We
are supposed to review those and follow up on them to be sure they're suitable
for a stable release.

Or the 100+ emails about regressions from version to version.  Etc, etc.

I'm not saying you're wrong, but it does seem a bit odd that you choose to reply
to this one, and not the other umpteen cases I often see.  Maybe it isn't about
the size or volume of the emails, and more about the fact that it's 100 patches
to implement _one_ thing?  If so, then I don't really think it's about list
traffic at all...

josh

* Re: [PATCH v21 001/100] eclone (1/11): Factor out code to allocate pidmap page
  2010-05-01 22:10   ` David Miller
  2010-05-02  0:14     ` Josh Boyer
@ 2010-05-02  0:25     ` Matt Helsley
  2010-05-03  8:48     ` Brian K. White
  2010-05-03 21:02     ` Dave Hansen
  3 siblings, 0 replies; 26+ messages in thread
From: Matt Helsley @ 2010-05-02  0:25 UTC (permalink / raw)
  To: David Miller
  Cc: linux-s390, orenl, containers, x86, linux-kernel, linuxppc-dev,
	matthltc, linux-api, serue, akpm, sukadev, xemul

On Sat, May 01, 2010 at 03:10:22PM -0700, David Miller wrote:
> NO WAY, there is no way in the world you should post 100 patches
> at a time to any mailing list, especially those at vger.kernel.org
> that have thousands upon thousands of subscribers.

I am sorry we concluded that sending these 100 patches at once was a
good idea. I will try, again, to find ways to divide the
set up into more manageable pieces. Regardless of how that goes
the whole set will not be submitted to LKML/vger all at once in the
future.

If anyone would like to offer more specific constructive suggestions
on subdividing the patches I'd be happy to try them.

That said, for anyone who's curious, we faced a few dilemmas which
pointed us down the wrong path here.

http://lkml.org/lkml/2010/3/1/422

Specifically the last part is rather hard to misinterpret:

"I'd suggest waiting until very shortly after 2.6.34-rc1 then please
send all the patches onto the list and let's get to work."

(ok, it's not shortly after 2.6.34-rc1 -- we were asked to reorganize
the code and we did...)

But even if one decides to ignore the common sense interpretation of
Andrew's reply there was more:

Standard procedure is to post to LKML when pushing patches upstream.

We were asked to create a useful implementation of checkpoint/restart
yet when we tried to submit a digestible piece we were told that
submitting it by itself was pointless (eclone). The rest of the code
was even more checkpoint/restart-specific so the same logic seemed to
apply.

We have public git trees and used the containers@ mailing list to post
patches for review but rarely received outside feedback on patches
there. Not even requests to divide the set.

So clearly we needed to post to relevant external lists and
reviewers. We tried that earlier and received complaints that lists
hadn't been Cc'd on some of the patches (e.g. fsdevel). So clearly we
needed to expand the Cc list for v21.

We looked at dividing the set but it always came down to trimming
functionality -- this conflicted with the "useful implementation"
we were asked for.

In summary: We've been given a fair number of conflicting instructions
		and we failed to find the right balance in following them.

> Post only small, well contained, sets of patches at a time.  At most
> 10 or so in one go.

We've tried to keep the individual patches small and reviewable. That
has the opposite effect on patch count unfortunately.

>
> Do you realize how much mail traffic you generate by posting so many
> patches at one time, and how unlikely it is for anyone to actually
> sift through and review your patches after you've spammed them by
> posting so many at one time?
>
> A second infraction and I will have no choice but to block you at the
> SMTP level at vger.kernel.org so please do not do it again.

We will not post nearly this many at once again.

I'm thinking we'll just provide URLs to git trees or quilt series
if subdividing is impossible and/or anyone needs wider context than
the 10 or so we post at a time.

Sorry again,
	-Matt Helsley

* Re: [PATCH v21 001/100] eclone (1/11): Factor out code to allocate pidmap page
  2010-05-01 22:10   ` David Miller
  2010-05-02  0:14     ` Josh Boyer
  2010-05-02  0:25     ` Matt Helsley
@ 2010-05-03  8:48     ` Brian K. White
  2010-05-03 21:02     ` Dave Hansen
  3 siblings, 0 replies; 26+ messages in thread
From: Brian K. White @ 2010-05-03  8:48 UTC (permalink / raw)
  To: David Miller
  Cc: linux-s390, orenl, containers, x86, linux-kernel, linuxppc-dev,
	linux-api, akpm, sukadev, xemul

David Miller wrote:
> NO WAY, there is no way in the world you should post 100 patches
> at a time to any mailing list, especially those at vger.kernel.org
> that have thousands upon thousands of subscribers.
> 
> Post only small, well contained, sets of patches at a time.  At most
> 10 or so in one go.
> 
> Do you realize how much mail traffic you generate by posting so many
> patches at one time, and how unlikely it is for anyone to actually
> sift through and review your patches after you've spammed them by
> posting so many at one time?
> 
> A second infraction and I will have no choice but to block you at the
> SMTP level at vger.kernel.org so please do not do it again.

Some people, you give them gold and they just complain it's heavy.

-- 
bkw

* Re: [PATCH v21 001/100] eclone (1/11): Factor out code to allocate pidmap page
  2010-05-01 22:10   ` David Miller
                       ` (2 preceding siblings ...)
  2010-05-03  8:48     ` Brian K. White
@ 2010-05-03 21:02     ` Dave Hansen
  2010-05-03 21:12       ` David Miller
  3 siblings, 1 reply; 26+ messages in thread
From: Dave Hansen @ 2010-05-03 21:02 UTC (permalink / raw)
  To: David Miller
  Cc: linux-s390, orenl, containers, x86, linux-kernel, linuxppc-dev,
	matthltc, linux-api, serue, akpm, sukadev, xemul

On Sat, 2010-05-01 at 15:10 -0700, David Miller wrote:
> NO WAY, there is no way in the world you should post 100 patches
> at a time to any mailing list, especially those at vger.kernel.org
> that have thousands upon thousands of subscribers.
> 
> Post only small, well contained, sets of patches at a time.  At most
> 10 or so in one go.

Hi Dave,

I really do apologize if these caused undue traffic on vger.  It
certainly wasn't our intention to cause any problems.

We've had a fear all along that we'll just go back into our little
containers@lists.linux-foundation.org, and go astray of what the
community wants done with these patches.  It's also important to
remember that these really do affect the entire kernel.  Unfortunately,
it's not really a single feature or something that fits well on any
other mailing list.  It has implications _everywhere_.  I think Andrew
Morton also asked that these continue coming to LKML, although his
request probably came at a time when the set was a wee bit smaller.

> Do you realize how much mail traffic you generate by posting so many
> patches at one time, and how unlikely it is for anyone to actually
> sift through and review your patches after you've spammed them by
> posting so many at one time?

I honestly don't expect people to be reading the whole thing at once.
But, I do hope that people can take a look at their individual bits that
are touched.

> A second infraction and I will have no choice but to block you at the
> SMTP level at vger.kernel.org so please do not do it again.

I know these patches are certainly not at the same level of importance as,
say, the pata or -stable updates, but they're certainly not of
unprecedented scale.  I've seen a number of patchbombs of this size
recently.

I hope Andrew pulls this set into -mm so this doesn't even come up
again.  But, if it does, can you make some suggestions on how to be more
kind to vger in the process?

-- Dave

* Re: [PATCH v21 001/100] eclone (1/11): Factor out code to allocate pidmap page
  2010-05-03 21:02     ` Dave Hansen
@ 2010-05-03 21:12       ` David Miller
  0 siblings, 0 replies; 26+ messages in thread
From: David Miller @ 2010-05-03 21:12 UTC (permalink / raw)
  To: dave
  Cc: linux-s390, orenl, containers, x86, linux-kernel, linuxppc-dev,
	matthltc, linux-api, serue, akpm, sukadev, xemul

From: Dave Hansen <dave@linux.vnet.ibm.com>
Date: Mon, 03 May 2010 14:02:31 -0700

> It has implications _everywhere_.

That does not remove the responsibility to break things up into
manageable pieces, nor does it make such a task impossible or
even hard to do.

You post sets of 10 to 15 at a time, once those are agreed to
and to everyone's general liking, you toss them into a GIT tree
and you say "here's the next 10 to 15 and they are relative to
the changes in GIT tree X which have already been fully reviewed"

And so on and so forth.

And this is the only logical thing to do, because if someone wants
a change in patch 7, it can affect patch 23 so it's pointless to
post for review a patch that's going to end up changing anyways.
That's a waste of reviewer resources.

To be honest, I'm really tired of what tends to be people's knee-jerk
reaction to these situations, which is a lot of people doing nothing
but defending themselves.  Even if it did not violate documented
policy (it did), it violates common sense.  So, can people do
something more constructive than trying to defend themselves on this?

It's stupid and shouldn't have been done, and we should move on.


* Re: [PATCH v21 001/100] eclone (1/11): Factor out code to allocate pidmap page
  2010-05-01 14:14 ` [PATCH v21 001/100] eclone (1/11): Factor out code to allocate pidmap page Oren Laadan
  2010-05-01 22:10   ` David Miller
@ 2010-05-04 14:43   ` David Howells
  2010-05-05 15:13     ` Oren Laadan
  1 sibling, 1 reply; 26+ messages in thread
From: David Howells @ 2010-05-04 14:43 UTC (permalink / raw)
  To: Oren Laadan
  Cc: linux-s390, linux-api, containers, x86, linux-kernel,
	linuxppc-dev, Matt Helsley, Serge Hallyn, Andrew Morton,
	Sukadev Bhattiprolu, Pavel Emelyanov


With a huge patch series like this, can you post a cover note at the front
(usually patch 0) saying what the point of the whole series is?

David


* Re: [PATCH v21 001/100] eclone (1/11): Factor out code to allocate pidmap page
  2010-05-04 14:43   ` David Howells
@ 2010-05-05 15:13     ` Oren Laadan
  0 siblings, 0 replies; 26+ messages in thread
From: Oren Laadan @ 2010-05-05 15:13 UTC (permalink / raw)
  To: David Howells
  Cc: linux-s390, linux-api, containers, x86, linux-kernel,
	linuxppc-dev, Matt Helsley, Serge Hallyn, Andrew Morton,
	Sukadev Bhattiprolu, Pavel Emelyanov

Hi David,

I suppose you are looking for more details than those found in the
current patch-0 (http://lkml.org/lkml/2010/5/1/140).

We omitted them for brevity's sake; here is a link to patch-0 of a
previous post of the patchset: http://lkml.org/lkml/2009/9/23/423

Thanks,

Oren.

David Howells wrote:
> With a huge patch series like this, can you post a cover note at the front
> (usually patch 0) saying what the point of the whole series is?
> 
> David
> 


* Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone
  2010-05-01 14:14 ` [PATCH v21 011/100] eclone (11/11): Document sys_eclone Oren Laadan
@ 2010-05-05 21:14   ` Randy Dunlap
  2010-05-05 22:25     ` Sukadev Bhattiprolu
  0 siblings, 1 reply; 26+ messages in thread
From: Randy Dunlap @ 2010-05-05 21:14 UTC (permalink / raw)
  To: Oren Laadan
  Cc: linux-s390, linux-api, containers, x86, linux-kernel,
	linuxppc-dev, Matt Helsley, Serge Hallyn, Andrew Morton,
	Sukadev Bhattiprolu, Pavel Emelyanov

On Sat,  1 May 2010 10:14:53 -0400 Oren Laadan wrote:

> From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
> 
> This gives a brief overview of the eclone() system call.  We should
> eventually describe more details in existing clone(2) man page or in
> a new man page.
> 
> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
> Acked-by: Serge E. Hallyn <serue@us.ibm.com>
> Acked-by: Oren Laadan  <orenl@cs.columbia.edu>
> ---
>  Documentation/eclone |  348 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 348 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/eclone
> 
> diff --git a/Documentation/eclone b/Documentation/eclone
> new file mode 100644
> index 0000000..c2f1b4b
> --- /dev/null
> +++ b/Documentation/eclone
> @@ -0,0 +1,348 @@
> +
> +struct clone_args {
> +	u64 clone_flags_high;
> +	u64 child_stack;
> +	u64 child_stack_size;
> +	u64 parent_tid_ptr;
> +	u64 child_tid_ptr;
> +	u32 nr_pids;
> +	u32 reserved0;
> +};
> +
> +
> +sys_eclone(u32 flags_low, struct clone_args * __user cargs, int cargs_size,
> +		pid_t * __user pids)
> +
> +	In addition to doing everything that clone() system call does, the

	                                that the clone()

> +	eclone() system call:
> +
> +		- allows additional clone flags (31 of 32 bits in the flags
> +		  parameter to clone() are in use)
> +
> +		- allows user to specify a pid for the child process in its
> +		  active and ancestor pid namespaces.
> +
> +	This system call is meant to be used when restarting an application
> +	from a checkpoint. Such restart requires that the processes in the
> +	application have the same pids they had when the application was
> +	checkpointed. When containers are nested, the processes within the
> +	containers exist in multiple pid namespaces and hence have multiple
> +	pids to specify during restart.
> +
> +	The @flags_low parameter is identical to the 'clone_flags' parameter
> +	in existing clone() system call.

	in the existing

> +
> +	The fields in 'struct clone_args' are meant to be used as follows:
> +
> +	u64 clone_flags_high:
> +
> +		When eclone() supports more than 32 flags, the additional bits
> +		in the clone_flags should be specified in this field. This
> +		field is currently unused and must be set to 0.
> +
> +	u64 child_stack;
> +	u64 child_stack_size;
> +
> +		These two fields correspond to the 'child_stack' fields in
> +		clone() and clone2() (on IA64) system calls. The usage of
> +		these two fields depends on the processor architecture.
> +
> +		Most architectures use ->child_stack to pass-in a stack-pointer

		                                     to pass in

> +		itself and don't need the ->child_stack_size field. On these
> +		architectures the ->child_stack_size field must be 0.
> +
> +		Some architectures, eg IA64, use ->child_stack to pass-in the

		                    e.g.                        to pass in

> +		base of the region allocated for stack. These architectures
> +		must pass in the size of the stack-region in ->child_stack_size.

		                             stack region

Seems unfortunate that different architectures use the fields differently.

> +
> +	u64 parent_tid_ptr;
> +	u64 child_tid_ptr;
> +
> +		These two fields correspond to the 'parent_tid_ptr' and
> +		'child_tid_ptr' fields in the clone() system call

		                                      system call.

> +
> +	u32 nr_pids;
> +
> +		nr_pids specifies the number of pids in the @pids array
> +		parameter to eclone() (see below). nr_pids should not exceed
> +		the current nesting level of the calling process (i.e if the

		                                                  i.e.

> +		process is in init_pid_ns, nr_pids must be 1, if process is
> +		in a pid namespace that is a child of init-pid-ns, nr_pids
> +		cannot exceed 2, and so on).
> +
> +	u32 reserved0;
> +	u64 reserved1;
> +
> +		These fields are intended to extend the functionality of the
> +		eclone() in the future, while preserving backward compatibility.
> +		They must be set to 0 for now.

The struct does not have a reserved1 field AFAICT.

> +	The @cargs_size parameter specifes the sizeof(struct clone_args) and
> +	is intended to enable extending this structure in the future, while
> +	preserving backward compatibility.  For now, this field must be set
> +	to the sizeof(struct clone_args) and this size must match the kernel's
> +	view of the structure.
> +
> +	The @pids parameter defines the set of pids that should be assigned to
> +	the child process in its active and ancestor pid namespaces. The
> +	descendant pid namespaces do not matter since a process does not have a
> +	pid in descendant namespaces, unless the process is in a new pid
> +	namespace in which case the process is a container-init (and must have
> +	the pid 1 in that namespace).
> +
> +	See CLONE_NEWPID section of clone(2) man page for details about pid

	                         of the clone(2)

> +	namespaces.
> +
> +	If a pid in the @pids list is 0, the kernel will assign the next
> +	available pid in the pid namespace.
> +
> +	If a pid in the @pids list is non-zero, the kernel tries to assign
> +	the specified pid in that namespace.  If that pid is already in use
> +	by another process, the system call fails (see EBUSY below).
> +
> +	The order of pids in @pids is oldest in pids[0] to youngest pid
> +	namespace in pids[nr_pids-1]. If the number of pids specified in the
> +	@pids list is fewer than the nesting level of the process, the pids
> +	are applied from youngest namespace. i.e if the process is nested in

	                 the youngest namespace. I.e.

> +	a level-6 pid namespace and @pids only specifies 3 pids, the 3 pids
> +	are applied to levels 6, 5 and 4. Levels 0 through 3 are assumed to
> +	have a pid of '0' (the kernel will assign a pid in those namespaces).
> +
> +	On success, the system call returns the pid of the child process in
> +	the parent's active pid namespace.
> +
> +	On failure, eclone() returns -1 and sets 'errno' to one of following
> +	values (the child process is not created).
> +
> +	EPERM	Caller does not have the CAP_SYS_ADMIN privilege needed to
> +		specify the pids in this call (if pids are not specifed
> +		CAP_SYS_ADMIN is not required).
> +
> +	EINVAL	The number of pids specified in 'clone_args.nr_pids' exceeds
> +		the current nesting level of parent process

		                                    process.

> +
> +	EINVAL	Not all specified clone-flags are valid.
> +
> +	EINVAL	The reserved fields in the clone_args argument are not 0.
> +
> +	EINVAL	The child_stack_size field is not 0 (on architectures that
> +		pass in a stack pointer in ->child_stack field)

		                                         field).

> +
> +	EBUSY	A requested pid is in use by another process in that namespace.
> +
> +---


Is this example program meant to build only on i386?

On x86_64 I get:

eclone-syscall-test.c: In function 'do_clone':
eclone-syscall-test.c:166: warning: assignment makes pointer from integer without a cast
/tmp/cc0OrhU3.o: In function `do_clone':
eclone-syscall-test.c:(.text+0x173): undefined reference to `setup_stack'
eclone-syscall-test.c:(.text+0x1e2): undefined reference to `eclone'


> +/*
> + * Example eclone() usage - Create a child process with pid CHILD_TID1 in
> + * the current pid namespace. The child gets the usual "random" pid in any
> + * ancestor pid namespaces.
> + */
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <string.h>
> +#include <signal.h>
> +#include <errno.h>
> +#include <unistd.h>
> +#include <wait.h>
> +#include <sys/syscall.h>
> +
> +#define __NR_eclone		337
> +#define CLONE_NEWPID            0x20000000
> +#define CLONE_CHILD_SETTID      0x01000000
> +#define CLONE_PARENT_SETTID     0x00100000
> +#define CLONE_UNUSED		0x00001000
> +
> +#define STACKSIZE		8192
> +
> +typedef unsigned long long u64;
> +typedef unsigned int u32;
> +typedef int pid_t;
> +struct clone_args {
> +	u64 clone_flags_high;
> +	u64 child_stack;
> +	u64 child_stack_size;
> +
> +	u64 parent_tid_ptr;
> +	u64 child_tid_ptr;
> +
> +	u32 nr_pids;
> +
> +	u32 reserved0;
> +};
> +
> +#define exit		_exit
> +
> +/*
> + * Following eclone() is based on code posted by Oren Laadan at:
> + * https://lists.linux-foundation.org/pipermail/containers/2009-June/018463.html
> + */
> +#if defined(__i386__) && defined(__NR_eclone)
> +
> +int eclone(u32 flags_low, struct clone_args *clone_args, int args_size,
> +		int *pids)
> +{
> +	long retval;
> +
> +	__asm__ __volatile__(
> +		 "movl %3, %%ebx\n\t"	/* flags_low -> 1st (ebx) */
> +		 "movl %4, %%ecx\n\t"	/* clone_args -> 2nd (ecx)*/
> +		 "movl %5, %%edx\n\t"	/* args_size -> 3rd (edx) */
> +		 "movl %6, %%edi\n\t"	/* pids -> 4th (edi)*/
> +
> +		 "pushl %%ebp\n\t"	/* save value of ebp */
> +		 "int $0x80\n\t"	/* Linux/i386 system call */
> +		 "testl %0,%0\n\t"	/* check return value */
> +		 "jne 1f\n\t"		/* jump if parent */
> +
> +		 "popl %%esi\n\t"	/* get subthread function */
> +		 "call *%%esi\n\t"	/* start subthread function */
> +		 "movl %2,%0\n\t"
> +		 "int $0x80\n"		/* exit system call: exit subthread */
> +		 "1:\n\t"
> +		 "popl %%ebp\t"		/* restore parent's ebp */
> +
> +		:"=a" (retval)
> +
> +		:"0" (__NR_eclone),
> +		 "i" (__NR_exit),
> +		 "m" (flags_low),
> +		 "m" (clone_args),
> +		 "m" (args_size),
> +		 "m" (pids)
> +		);
> +
> +	if (retval < 0) {
> +		errno = -retval;
> +		retval = -1;
> +	}
> +	return retval;
> +}
> +
> +/*
> + * Allocate a stack for the clone-child and arrange to have the child
> + * execute @child_fn with @child_arg as the argument.
> + */
> +void *setup_stack(int (*child_fn)(void *), void *child_arg, int size)
> +{
> +	void *stack_base;
> +	void **stack_top;
> +
> +	stack_base = malloc(size + size);
> +	if (!stack_base) {
> +		perror("malloc()");
> +		exit(1);
> +	}
> +
> +	stack_top = (void **)((char *)stack_base + (size - 4));
> +	*--stack_top = child_arg;
> +	*--stack_top = child_fn;
> +
> +	return stack_top;
> +}
> +#endif
> +
> +/* gettid() is a bit more useful than getpid() when messing with clone() */
> +int gettid()
> +{
> +	int rc;
> +
> +	rc = syscall(__NR_gettid, 0, 0, 0);
> +	if (rc < 0) {
> +		printf("rc %d, errno %d\n", rc, errno);
> +		exit(1);
> +	}
> +	return rc;
> +}
> +
> +#define CHILD_TID1	377
> +#define CHILD_TID2	1177
> +#define CHILD_TID3	2799
> +
> +struct clone_args clone_args;
> +void *child_arg = &clone_args;
> +int child_tid;
> +
> +int do_child(void *arg)
> +{
> +	struct clone_args *cs = (struct clone_args *)arg;
> +	int ctid;
> +
> +	/* Verify we pushed the arguments correctly on the stack... */
> +	if (arg != child_arg)  {
> +		printf("Child: Incorrect child arg pointer, expected %p,"
> +				"actual %p\n", child_arg, arg);
> +		exit(1);
> +	}
> +
> +	/* ... and that we got the thread-id we expected */
> +	ctid = *((int *)(unsigned long)cs->child_tid_ptr);
> +	if (ctid != CHILD_TID1) {
> +		printf("Child: Incorrect child tid, expected %d, actual %d\n",
> +				CHILD_TID1, ctid);
> +		exit(1);
> +	} else {
> +		printf("Child got the expected tid, %d\n", gettid());
> +	}
> +	sleep(2);
> +
> +	printf("[%d, %d]: Child exiting\n", getpid(), ctid);
> +	exit(0);
> +}
> +
> +static int do_clone(int (*child_fn)(void *), void *child_arg,
> +		unsigned int flags_low, int nr_pids, pid_t *pids_list)
> +{
> +	int rc;
> +	void *stack;
> +	struct clone_args *ca = &clone_args;
> +	int args_size;
> +
> +	stack = setup_stack(child_fn, child_arg, STACKSIZE);
> +
> +	memset(ca, 0, sizeof(*ca));
> +
> +	ca->child_stack		= (u64)(unsigned long)stack;
> +	ca->child_stack_size	= (u64)0;
> +	ca->child_tid_ptr	= (u64)(unsigned long)&child_tid;
> +	ca->nr_pids		= nr_pids;
> +
> +	args_size = sizeof(struct clone_args);
> +	rc = eclone(flags_low, ca, args_size, pids_list);
> +
> +	printf("[%d, %d]: eclone() returned %d, error %d\n", getpid(), gettid(),
> +				rc, errno);
> +	return rc;
> +}
> +
> +/*
> + * Multiple pid_t pid_t values in pids_list[] here are just for illustration.
> + * The test case creates a child in the current pid namespace and uses only
> + * the first value, CHILD_TID1.
> + */
> +pid_t pids_list[] = { CHILD_TID1, CHILD_TID2, CHILD_TID3 };
> +int main()
> +{
> +	int rc, pid, status;
> +	unsigned long flags;
> +	int nr_pids = 1;
> +
> +	flags = SIGCHLD|CLONE_CHILD_SETTID;
> +
> +	pid = do_clone(do_child, &clone_args, flags, nr_pids, pids_list);
> +
> +	printf("[%d, %d]: Parent waiting for %d\n", getpid(), gettid(), pid);
> +
> +	rc = waitpid(pid, &status, __WALL);
> +	if (rc < 0) {
> +		printf("waitpid(): rc %d, error %d\n", rc, errno);
> +	} else {
> +		printf("[%d, %d]: child %d:\n\t wait-status 0x%x\n", getpid(),
> +			 gettid(), rc, status);
> +
> +		if (WIFEXITED(status)) {
> +			printf("\t EXITED, %d\n", WEXITSTATUS(status));
> +		} else if (WIFSIGNALED(status)) {
> +			printf("\t SIGNALED, %d\n", WTERMSIG(status));
> +		}
> +	}
> +	return 0;
> +}
> -- 


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***


* Re: [PATCH v21 011/100] eclone (11/11): Document sys_eclone
  2010-05-05 21:14   ` Randy Dunlap
@ 2010-05-05 22:25     ` Sukadev Bhattiprolu
  0 siblings, 0 replies; 26+ messages in thread
From: Sukadev Bhattiprolu @ 2010-05-05 22:25 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: linux-s390, Oren Laadan, containers, x86, linux-kernel,
	linuxppc-dev, Matt Helsley, linux-api, Serge Hallyn,
	Andrew Morton, Pavel Emelyanov

Randy Dunlap [randy.dunlap@oracle.com] wrote:
| > +		base of the region allocated for stack. These architectures
| > +		must pass in the size of the stack-region in ->child_stack_size.
| 
| 		                             stack region
| 
| Seems unfortunate that different architectures use the fields differently.

Yes and no. The field still has a single purpose, just that some architectures
may not need it. We enforce that if unused on an architecture, the field must
be 0. It looked like the easiest way to keep the API common across
architectures.

| 
| Is this example program meant to build only on i386?

Yes. Will add a pointer to the clone*.[chS] and libeclone.a files in

	git://git.ncl.cs.columbia.edu/pub/git/user-cr.git

for other architectures (currently x86_64, ppc, s390).

Thanks for the review. Will fix the errors and repost.

Sukadev


end of thread

Thread overview: 26+ messages
     [not found] <1272723382-19470-1-git-send-email-orenl@cs.columbia.edu>
2010-05-01 14:14 ` [PATCH v21 001/100] eclone (1/11): Factor out code to allocate pidmap page Oren Laadan
2010-05-01 22:10   ` David Miller
2010-05-02  0:14     ` Josh Boyer
2010-05-02  0:25     ` Matt Helsley
2010-05-03  8:48     ` Brian K. White
2010-05-03 21:02     ` Dave Hansen
2010-05-03 21:12       ` David Miller
2010-05-04 14:43   ` David Howells
2010-05-05 15:13     ` Oren Laadan
2010-05-01 14:14 ` [PATCH v21 002/100] eclone (2/11): Have alloc_pidmap() return actual error code Oren Laadan
2010-05-01 14:14 ` [PATCH v21 003/100] eclone (3/11): Define set_pidmap() function Oren Laadan
2010-05-01 14:14 ` [PATCH v21 004/100] eclone (4/11): Add target_pids parameter to alloc_pid() Oren Laadan
2010-05-01 14:14 ` [PATCH v21 005/100] eclone (5/11): Add target_pids parameter to copy_process() Oren Laadan
2010-05-01 14:14 ` [PATCH v21 006/100] eclone (6/11): Check invalid clone flags Oren Laadan
2010-05-01 14:14 ` [PATCH v21 007/100] eclone (7/11): Define do_fork_with_pids() Oren Laadan
2010-05-01 14:14 ` [PATCH v21 008/100] eclone (8/11): Implement sys_eclone for x86 (32, 64) Oren Laadan
2010-05-01 14:14 ` [PATCH v21 009/100] eclone (9/11): Implement sys_eclone for s390 Oren Laadan
2010-05-01 14:14 ` [PATCH v21 010/100] eclone (10/11): Implement sys_eclone for powerpc Oren Laadan
2010-05-01 14:14 ` [PATCH v21 011/100] eclone (11/11): Document sys_eclone Oren Laadan
2010-05-05 21:14   ` Randy Dunlap
2010-05-05 22:25     ` Sukadev Bhattiprolu
2010-05-01 14:14 ` [PATCH v21 012/100] c/r: extend arch_setup_additional_pages() Oren Laadan
2010-05-01 14:15 ` [PATCH v21 021/100] c/r: create syscalls: sys_checkpoint, sys_restart Oren Laadan
2010-05-01 14:15 ` [PATCH v21 058/100] c/r: (s390): expose a constant for the number of words (CRs) Oren Laadan
2010-05-01 14:15 ` [PATCH v21 059/100] c/r: add CKPT_COPY() macro Oren Laadan
2010-05-01 14:15 ` [PATCH v21 060/100] c/r: define s390-specific checkpoint-restart code Oren Laadan
