[RFC PATCH 5/5] use next syscall data to predefine the file descriptor value

* [RFC PATCH 5/5] use next syscall data to predefine the file descriptor value
  2008-07-03 14:40 [RFC PATCH 0/5] Resend " Nadia.Derbey-6ktuUTfB/bM
@ 2008-07-03 14:40 ` Nadia.Derbey-6ktuUTfB/bM
  0 siblings, 0 replies; 32+ messages in thread
From: Nadia.Derbey-6ktuUTfB/bM @ 2008-07-03 14:40 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Nadia Derbey, pavel-+ZI9xUNit7I

[-- Attachment #1: fileopen_use_next_syscall_data.patch --]
[-- Type: text/plain, Size: 2589 bytes --]

[PATCH 05/05]

This patch uses the value written into the next_syscall_data proc file
as a target file descriptor for the next file to be opened.

This makes it easy to restart a process with the same fds as the ones it was
using during the checkpoint phase, instead of (1) opening the file and then
(2) dup2()'ing the resulting file descriptor onto the target fd.
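
For comparison, the two-step approach that this patch avoids could be sketched
in userspace roughly as follows. This is only an illustration under stated
assumptions: open_at_fd() is a hypothetical helper name, not something the
patch provides, and a call using O_CREAT would need an extra mode argument.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: open `path` and make it available on `target_fd`
 * using only existing syscalls. Returns target_fd on success, -1 with
 * errno set on failure.
 * Note: dup2() silently closes whatever was open on target_fd before.
 */
int open_at_fd(const char *path, int flags, int target_fd)
{
	int fd = open(path, flags);

	if (fd < 0)
		return -1;
	if (fd == target_fd)		/* already in the right slot */
		return fd;

	if (dup2(fd, target_fd) < 0) {
		int saved = errno;

		close(fd);
		errno = saved;
		return -1;
	}
	close(fd);			/* drop the temporary descriptor */
	return target_fd;
}
```

With the patch applied, the same result would be obtained by writing the
target fd into next_syscall_data and then issuing a single open().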

The following syscalls are impacted if next_syscall_data is set:
. open()
. openat()

Signed-off-by: Nadia Derbey <Nadia.Derbey-6ktuUTfB/bM@public.gmane.org>

---
 fs/open.c |   58 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 57 insertions(+), 1 deletion(-)

Index: linux-2.6.26-rc5-mm3/fs/open.c
===================================================================
--- linux-2.6.26-rc5-mm3.orig/fs/open.c	2008-06-25 17:11:06.000000000 +0200
+++ linux-2.6.26-rc5-mm3/fs/open.c	2008-07-01 17:51:53.000000000 +0200
@@ -967,6 +967,55 @@ struct file *dentry_open(struct dentry *
 EXPORT_SYMBOL(dentry_open);
 
 /*
+ * Marks a given file descriptor entry as busy (should not be busy when this
+ * routine is called).
+ *
+ * files->next_fd is not updated: this lets the potentially created hole be
+ * filled up on next calls to get_unused_fd_flags.
+ *
+ * Returns the specified fd if successful, -errno else.
+ */
+static int get_predefined_fd_flags(int fd, int flags)
+{
+	struct files_struct *files = current->files;
+	int error;
+	struct fdtable *fdt;
+
+	error = -EINVAL;
+	if (fd < 0)
+		goto out;
+
+	error = -EMFILE;
+	if (fd >= current->signal->rlim[RLIMIT_NOFILE].rlim_cur)
+		goto out;
+
+	spin_lock(&files->file_lock);
+	error = expand_files(files, fd);
+	if (error < 0)
+		goto out_unlock;
+	/* expand_files() may reallocate the fdtable: look it up afterwards */
+	fdt = files_fdtable(files);
+
+	error = -EBUSY;
+	if (FD_ISSET(fd, fdt->open_fds))
+		goto out_unlock;
+
+	FD_SET(fd, fdt->open_fds);
+	if (flags & O_CLOEXEC)
+		FD_SET(fd, fdt->close_on_exec);
+	else
+		FD_CLR(fd, fdt->close_on_exec);
+
+	BUG_ON(fdt->fd[fd] != NULL);
+
+	error = fd;
+out_unlock:
+	spin_unlock(&files->file_lock);
+out:
+	return error;
+}
+
+/*
  * Find an empty file descriptor entry, and mark it busy.
  */
 int get_unused_fd_flags(int flags)
@@ -1081,7 +1130,14 @@ long do_sys_open(int dfd, const char __u
 	int fd = PTR_ERR(tmp);
 
 	if (!IS_ERR(tmp)) {
-		fd = get_unused_fd_flags(flags);
+		if (next_data_set(current)) {
+			int next_fd = get_next_data(current);
+
+			fd = get_predefined_fd_flags(next_fd, flags);
+			reset_next_syscall_data(current);
+		} else
+			fd = get_unused_fd_flags(flags);
+
 		if (fd >= 0) {
 			struct file *f = do_filp_open(dfd, tmp, flags, mode);
 			if (IS_ERR(f)) {

--


* [RFC PATCH 0/5] Resend -v2 - Use procfs to change a syscall behavior
@ 2008-07-08 11:24 Nadia.Derbey-6ktuUTfB/bM
  2008-07-08 11:24 ` [RFC PATCH 1/5] adds the procfs facilities Nadia.Derbey-6ktuUTfB/bM
                   ` (5 more replies)
  0 siblings, 6 replies; 32+ messages in thread
From: Nadia.Derbey-6ktuUTfB/bM @ 2008-07-08 11:24 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Resending after fixing the issues pointed out by Serge.

Also ported to 2.6.26-rc8-mm1.

Regards,
Nadia

--------------

This patchset is part of an effort to change the behavior of some syscalls for
checkpoint/restart.

When restarting an object that has previously been checkpointed, its state
should be unchanged compared to the checkpointed image.
For example, a restarted process should have the same upid nr as the one it
used to have when being checkpointed; an ipc object should have the same id
as the one it had when the checkpoint occurred.
System V IPC objects should also be restored with the same state (e.g. the
pid of the last operation).

This means that several syscalls should not behave in a default mode when
they are called during a restart phase.

One solution consists of defining a new syscall for each syscall that is
called during restart:
 . sys_fork_with_id() would fork a process with a predefined id.
 . sys_msgget_with_id() would create a msg queue with a predefined id.
 . sys_semget_with_id() would create a semaphore set with a predefined id.
 . etc.

This solution requires defining a new syscall each time we need an existing
syscall to behave in a non-default way.

An alternative to this solution consists of defining a new field in the
task structure (let's call it next_syscall_data) that, if set, would change
the behavior of the next syscall to be called. The sys_fork_with_id()
previously cited can be replaced by:
 1) set next_syscall_data to a target upid nr
 2) call fork().

This patch series implements the 2nd solution. I already sent it some time
ago, and the discussion ended with Pavel complaining about the "ugly
interface" (see
https://lists.linux-foundation.org/pipermail/containers/2008-April/010909.html).

Now, I'm resending the series because this 2nd solution has the advantage of
being easily reusable for many subsystems: the only thing needed is just
to set a field in the task structure and rewrite the code portion that is
sensitive to this field being set (it's successfully being used in cryo code -
git tree at git://git.sr71.net/~hallyn/cryodev.git).

The patches have been ported to 2.6.26-rc8-mm1 and the open() syscall is now
covered.

A new file is created in procfs: /proc/self/task/<my_tid>/next_syscall_data.
Because the file is per thread, several threads belonging to the same process
cannot race on it.

Setting a value into this file fills in the next_syscall_data in the task
structure.

The following subsystems have been changed to take this value into account:
1) sysvipc:
   . if there's a value in next_syscall_data when msgget() is called, msgget()
     creates a msg queue with that value as an id
   . this applies to semget() and shmget().
   . if next_syscall_data is set to 1 when msgctl(IPC_SET) is called, msgctl()
     sets more than the usual permission fields for the target msg queue (it
     sets the time fields, and the pid of last operation fields).
   . this applies to semctl() and shmctl().
2) process creation:
   . if there's a value in next_syscall_data when fork() is called, fork()
     creates a process with that value as a pid.
   . this applies to vfork() and clone().
3) file descriptors:
   . if there's a value in next_syscall_data when open() is called, open()
     uses that value as the file descriptor for the open file

The syntax is:
# echo "LONG1 XX" > /proc/self/task/<my_tid>/next_syscall_data
     next object to be created will have an id set to XX

Today, the ids are specified as long, but having a type string specified in
the next_syscall_data file makes it possible to cover more types in the future,
if needed.
Also, only a single value can be set. But the number that immediately follows
the type string makes it possible to specify more values in the future, if
needed. This could be used, e.g., to predefine all the upid nrs for a process
that belongs to nested namespaces.
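
From C, the echo above could be sketched as shown below. This is a minimal
sketch, not part of the series: nsd_format() and nsd_set() are hypothetical
helper names, and nsd_set() can only succeed on a kernel carrying these
patches, since the proc file does not exist otherwise.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Build the "LONG1 <value>" request line; returns its length or -1. */
int nsd_format(char *buf, size_t len, long value)
{
	int n = snprintf(buf, len, "LONG1 %ld", value);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Write the request into the calling thread's next_syscall_data file.
 * Returns 0 on success, -1 on failure (always the case on a kernel
 * without this patch series).
 */
int nsd_set(long value)
{
	char path[64], line[32];
	long tid = syscall(SYS_gettid);	/* the file is per thread: use the tid */
	int len, fd, rc;

	len = nsd_format(line, sizeof(line), value);
	if (len < 0)
		return -1;
	snprintf(path, sizeof(path),
		 "/proc/self/task/%ld/next_syscall_data", tid);

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	/* Send the whole line in one write() starting at offset 0. */
	rc = (write(fd, line, (size_t)len) == (ssize_t)len) ? 0 : -1;
	close(fd);
	return rc;
}
```

A restart tool would call nsd_set(id) immediately before the fork(), msgget()
or open() call whose result it wants to pin, from the same thread.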

These patches should be applied to 2.6.26-rc8-mm1, in the following order:

[PATCH 1/5] : next_syscall_data_proc_file.patch
[PATCH 2/5] : ipccreate_use_next_syscall_data.patch
[PATCH 3/5] : proccreate_use_next_syscall_data.patch
[PATCH 4/5] : ipcset_use_next_syscall_data.patch
[PATCH 5/5] : fileopen_use_next_syscall_data.patch

Any comments and/or suggestions are welcome.

Regards,
Nadia

--


* [RFC PATCH 1/5] adds the procfs facilities
  2008-07-08 11:24 [RFC PATCH 0/5] Resend -v2 - Use procfs to change a syscall behavior Nadia.Derbey-6ktuUTfB/bM
@ 2008-07-08 11:24 ` Nadia.Derbey-6ktuUTfB/bM
       [not found]   ` <20080708112457.994105000-6ktuUTfB/bM@public.gmane.org>
  2008-07-08 11:24 ` [RFC PATCH 2/5] use next syscall data to predefine ipc objects ids Nadia.Derbey-6ktuUTfB/bM
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 32+ messages in thread
From: Nadia.Derbey-6ktuUTfB/bM @ 2008-07-08 11:24 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA; +Cc: Nadia Derbey

[-- Attachment #1: next_syscall_data_proc_file.patch --]
[-- Type: text/plain, Size: 11331 bytes --]

[PATCH 01/05]

This patch adds the procfs facility needed to feed some data for the
next syscall to be called.

The effect of issuing
echo "LONG<Y> <XX>" > /proc/self/task/<tid>/next_syscall_data
is that <XX> will be stored in a new field of the task structure
(next_syscall_data). This field, in turn, will be taken as the data to feed
the next syscall that supports the feature.

<Y> is the number of values provided on the line.
For the sake of simplicity it is now fixed to 1, but this can be extended as
needed, in the future.
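
The line format can be illustrated with a small userspace parser. This is a
simplified sketch that only mirrors the format described above; it is
stricter than the kernel code in this patch (for instance it rejects a count
of 0), and parse_next_syscall_data() and struct nsd_request are hypothetical
names used for illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define NDATA 1			/* same limit as in the patch */

struct nsd_request {
	int ndata;		/* 0 means "unset" */
	long data[NDATA];
};

/*
 * Parse "LONG<Y> <XX>" or "RESET" into *req.
 * Returns 0 on success, -EINVAL on malformed input; *req is only
 * modified on success.
 */
int parse_next_syscall_data(const char *line, struct nsd_request *req)
{
	char *end;
	long n, value;

	if (strncmp(line, "RESET", 5) == 0) {
		req->ndata = 0;
		return 0;
	}
	if (strncmp(line, "LONG", 4) != 0)
		return -EINVAL;

	/* <Y>: number of values; must be followed by a single space. */
	n = strtol(line + 4, &end, 0);
	if (end == line + 4 || *end != ' ')
		return -EINVAL;
	if (n < 1 || n > NDATA)
		return -EINVAL;

	/* <XX>: the value itself; only trailing whitespace may follow. */
	line = end + 1;
	value = strtol(line, &end, 0);
	if (end == line)
		return -EINVAL;
	while (*end == ' ' || *end == '\n')
		end++;
	if (*end)
		return -EINVAL;

	req->ndata = (int)n;
	req->data[0] = value;
	return 0;
}
```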

This is particularly useful when restarting an application, as we sometimes
need syscalls to have a non-default behavior.

Signed-off-by: Nadia Derbey <Nadia.Derbey-6ktuUTfB/bM@public.gmane.org>

---
 fs/exec.c                         |    6 +
 fs/proc/base.c                    |   75 ++++++++++++++++++
 include/linux/next_syscall_data.h |   32 ++++++++
 include/linux/sched.h             |    6 +
 kernel/Makefile                   |    3 
 kernel/exit.c                     |    4 +
 kernel/fork.c                     |    2 
 kernel/next_syscall_data.c        |  151 ++++++++++++++++++++++++++++++++++++++
 8 files changed, 278 insertions(+), 1 deletion(-)

Index: linux-2.6.26-rc8-mm1/include/linux/sched.h
===================================================================
--- linux-2.6.26-rc8-mm1.orig/include/linux/sched.h	2008-07-08 09:04:21.000000000 +0200
+++ linux-2.6.26-rc8-mm1/include/linux/sched.h	2008-07-08 09:13:43.000000000 +0200
@@ -87,6 +87,7 @@ struct sched_param {
 #include <linux/task_io_accounting.h>
 #include <linux/kobject.h>
 #include <linux/latencytop.h>
+#include <linux/next_syscall_data.h>
 
 #include <asm/processor.h>
 
@@ -1296,6 +1297,11 @@ struct task_struct {
 	int latency_record_count;
 	struct latency_record latency_record[LT_SAVECOUNT];
 #endif
+	/*
+	 * If non-NULL indicates that next operation will be forced, e.g.
+	 * that next object to be created will have a predefined id.
+	 */
+	struct next_syscall_data *nsd;
 };
 
 /*
Index: linux-2.6.26-rc8-mm1/include/linux/next_syscall_data.h
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.26-rc8-mm1/include/linux/next_syscall_data.h	2008-07-08 09:24:38.000000000 +0200
@@ -0,0 +1,32 @@
+/*
+ * include/linux/next_syscall_data.h
+ *
+ * Definitions to support fixed data for next syscall to be called.
+ */
+
+#ifndef _LINUX_NEXT_SYSCALL_DATA_H
+#define _LINUX_NEXT_SYSCALL_DATA_H
+
+#define NDATA 1
+
+/*
+ * If this structure is pointed to by a task_struct, next syscall to be called
+ * by the task will have a non-default behavior.
+ * For example, it can be used to pre-set the id of the object to be created
+ * by next syscall.
+ */
+struct next_syscall_data {
+	int ndata;
+	long data[NDATA];
+};
+
+extern ssize_t get_next_syscall_data(struct task_struct *, char *, size_t);
+extern int set_next_syscall_data(struct task_struct *, char *);
+extern void reset_next_syscall_data(struct task_struct *);
+
+static inline void exit_next_syscall_data(struct task_struct *tsk)
+{
+	reset_next_syscall_data(tsk);
+}
+
+#endif /* _LINUX_NEXT_SYSCALL_DATA_H */
Index: linux-2.6.26-rc8-mm1/fs/proc/base.c
===================================================================
--- linux-2.6.26-rc8-mm1.orig/fs/proc/base.c	2008-07-08 09:05:13.000000000 +0200
+++ linux-2.6.26-rc8-mm1/fs/proc/base.c	2008-07-08 09:18:12.000000000 +0200
@@ -1158,6 +1158,76 @@ static const struct file_operations proc
 };
 #endif
 
+static ssize_t next_syscall_data_read(struct file *file, char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	struct task_struct *task;
+	char *page;
+	ssize_t length;
+
+	task = get_proc_task(file->f_path.dentry->d_inode);
+	if (!task)
+		return -ESRCH;
+
+	if (count >= PAGE_SIZE)
+		count = PAGE_SIZE - 1;
+
+	length = -ENOMEM;
+	page = (char *) __get_free_page(GFP_TEMPORARY);
+	if (!page)
+		goto out;
+
+	length = get_next_syscall_data(task, (char *) page, count);
+	if (length >= 0)
+		length = simple_read_from_buffer(buf, count, ppos,
+						(char *)page, length);
+	free_page((unsigned long) page);
+
+out:
+	put_task_struct(task);
+	return length;
+}
+
+static ssize_t next_syscall_data_write(struct file *file,
+				const char __user *buf,
+				size_t count, loff_t *ppos)
+{
+	struct inode *inode = file->f_path.dentry->d_inode;
+	char *page;
+	ssize_t length;
+
+	if (pid_task(proc_pid(inode), PIDTYPE_PID) != current)
+		return -EPERM;
+
+	if (count >= PAGE_SIZE)
+		count = PAGE_SIZE - 1;
+
+	if (*ppos != 0) {
+		/* No partial writes. */
+		return -EINVAL;
+	}
+	page = (char *)__get_free_page(GFP_TEMPORARY);
+	if (!page)
+		return -ENOMEM;
+	length = -EFAULT;
+	if (copy_from_user(page, buf, count))
+		goto out_free_page;
+
+	page[count] = '\0';
+
+	length = set_next_syscall_data(current, page);
+	if (!length)
+		length = count;
+
+out_free_page:
+	free_page((unsigned long) page);
+	return length;
+}
+
+static const struct file_operations proc_next_syscall_data_operations = {
+	.read		= next_syscall_data_read,
+	.write		= next_syscall_data_write,
+};
 
 #ifdef CONFIG_SCHED_DEBUG
 /*
@@ -2853,6 +2923,11 @@ static const struct pid_entry tid_base_s
 #ifdef CONFIG_TASK_IO_ACCOUNTING
 	INF("io",	S_IRUGO, tid_io_accounting),
 #endif
+	/*
+	 * NOTE that this file is not added into tgid_base_stuff[] since it
+	 * has to be specified on a per-thread basis.
+	 */
+	REG("next_syscall_data", S_IRUGO|S_IWUSR, next_syscall_data),
 };
 
 static int proc_tid_base_readdir(struct file * filp,
Index: linux-2.6.26-rc8-mm1/kernel/Makefile
===================================================================
--- linux-2.6.26-rc8-mm1.orig/kernel/Makefile	2008-07-08 09:04:35.000000000 +0200
+++ linux-2.6.26-rc8-mm1/kernel/Makefile	2008-07-08 09:19:14.000000000 +0200
@@ -9,7 +9,8 @@ obj-y     = sched.o fork.o exec_domain.o
 	    rcupdate.o extable.o params.o posix-timers.o \
 	    kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
 	    hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
-	    notifier.o ksysfs.o pm_qos_params.o sched_clock.o
+	    notifier.o ksysfs.o pm_qos_params.o sched_clock.o \
+	    next_syscall_data.o
 
 CFLAGS_REMOVE_sched.o = -pg -mno-spe
 
Index: linux-2.6.26-rc8-mm1/kernel/next_syscall_data.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.26-rc8-mm1/kernel/next_syscall_data.c	2008-07-08 09:35:27.000000000 +0200
@@ -0,0 +1,151 @@
+/*
+ * linux/kernel/next_syscall_data.c
+ *
+ *
+ * Provide the get_next_syscall_data() / set_next_syscall_data() routines
+ * (called from fs/proc/base.c).
+ * They allow specifying some particular data for the next syscall to be
+ * called.
+ * E.g. they can be used to specify the id for the next resource to be
+ * allocated, instead of letting the allocator set it for us.
+ */
+
+#include <linux/sched.h>
+#include <linux/ctype.h>
+
+
+
+ssize_t get_next_syscall_data(struct task_struct *task, char *buffer,
+				size_t size)
+{
+	struct next_syscall_data *nsd;
+	char *bufptr = buffer;
+	ssize_t rc, count = 0;
+	int i;
+
+	nsd = task->nsd;
+	if (!nsd || !nsd->ndata)
+		return snprintf(buffer, size, "UNSET\n");
+
+	count = snprintf(bufptr, size, "LONG%d ", nsd->ndata);
+
+	for (i = 0; i < nsd->ndata - 1; i++) {
+		rc = snprintf(&bufptr[count], size - count, "%ld ",
+				nsd->data[i]);
+		if (rc >= size - count)
+			return -ENOMEM;
+		count += rc;
+	}
+
+	rc = snprintf(&bufptr[count], size - count, "%ld\n", nsd->data[i]);
+	if (rc >= size - count)
+		return -ENOMEM;
+	count += rc;
+
+	return count;
+}
+
+static int fill_next_syscall_data(struct task_struct *task, int ndata,
+				char *buffer)
+{
+	char *token, *buff = buffer;
+	char *end;
+	struct next_syscall_data *nsd = task->nsd;
+	int i;
+
+	if (!nsd) {
+		nsd = kmalloc(sizeof(*nsd), GFP_KERNEL);
+		if (!nsd)
+			return -ENOMEM;
+		task->nsd = nsd;
+	}
+
+	nsd->ndata = ndata;
+
+	i = 0;
+	while ((token = strsep(&buff, " ")) != NULL && i < ndata) {
+		long data;
+
+		if (!*token)
+			goto out_free;
+		data = simple_strtol(token, &end, 0);
+		if (end == token || (*end && !isspace(*end)))
+			goto out_free;
+		nsd->data[i] = data;
+		i++;
+	}
+
+	if (i != ndata)
+		goto out_free;
+
+	return 0;
+
+out_free:
+	kfree(nsd);
+	task->nsd = NULL;
+	return -EINVAL;
+}
+
+/*
+ * Parses a line with the following format:
+ * <x> <id0> ... <idx-1>
+ * Currently, only x=1 is accepted.
+ * Any trailing character on the line is skipped.
+ */
+static int do_set_next_syscall_data(struct task_struct *task, char *nb,
+					char *buffer)
+{
+	int ndata;
+	char *end;
+
+	ndata = simple_strtol(nb, &end, 0);
+	if (*end)
+		return -EINVAL;
+
+	if (ndata > NDATA)
+		return -EINVAL;
+
+	return fill_next_syscall_data(task, ndata, buffer);
+}
+
+void reset_next_syscall_data(struct task_struct *task)
+{
+	struct next_syscall_data *nsd = task->nsd;
+
+	if (nsd) {
+		task->nsd = NULL;
+		kfree(nsd);
+	}
+}
+
+#define LONG_STR	"LONG"
+#define RESET_STR	"RESET"
+
+/*
+ * Parses a line written to /proc/self/task/<my_tid>/next_syscall_data.
+ * this line has the following format:
+ * LONG<x> id              --> a sequence of id(s) is specified
+ *                             currently, only x=1 is accepted
+ */
+int set_next_syscall_data(struct task_struct *task, char *buffer)
+{
+	char *token, *out = buffer;
+	size_t sz;
+
+	if (!out)
+		return -EINVAL;
+
+	token = strsep(&out, " ");
+
+	sz = strlen(LONG_STR);
+
+	if (!strncmp(token, LONG_STR, sz))
+		return do_set_next_syscall_data(task, token + sz, out);
+
+	if (!strncmp(token, RESET_STR, strlen(RESET_STR))) {
+		reset_next_syscall_data(task);
+		return 0;
+	}
+
+	return -EINVAL;
+}
Index: linux-2.6.26-rc8-mm1/kernel/fork.c
===================================================================
--- linux-2.6.26-rc8-mm1.orig/kernel/fork.c	2008-07-08 09:04:35.000000000 +0200
+++ linux-2.6.26-rc8-mm1/kernel/fork.c	2008-07-08 09:25:35.000000000 +0200
@@ -1085,6 +1085,8 @@ static struct task_struct *copy_process(
 	p->blocked_on = NULL; /* not blocked yet */
 #endif
 
+	p->nsd = NULL;	/* no next syscall data is the default */
+
 	/* Perform scheduler related setup. Assign this task to a CPU. */
 	sched_fork(p, clone_flags);
 
Index: linux-2.6.26-rc8-mm1/fs/exec.c
===================================================================
--- linux-2.6.26-rc8-mm1.orig/fs/exec.c	2008-07-08 09:05:13.000000000 +0200
+++ linux-2.6.26-rc8-mm1/fs/exec.c	2008-07-08 09:26:21.000000000 +0200
@@ -1016,6 +1016,12 @@ int flush_old_exec(struct linux_binprm *
 	flush_signal_handlers(current, 0);
 	flush_old_files(current->files);
 
+	/*
+	 * the next syscall data is not inherited across execve()
+	 */
+	if (unlikely(current->nsd))
+		reset_next_syscall_data(current);
+
 	return 0;
 
 out:
Index: linux-2.6.26-rc8-mm1/kernel/exit.c
===================================================================
--- linux-2.6.26-rc8-mm1.orig/kernel/exit.c	2008-07-08 09:04:35.000000000 +0200
+++ linux-2.6.26-rc8-mm1/kernel/exit.c	2008-07-08 09:27:31.000000000 +0200
@@ -1066,6 +1066,10 @@ NORET_TYPE void do_exit(long code)
 
 	proc_exit_connector(tsk);
 	exit_notify(tsk, group_dead);
+
+	if (unlikely(tsk->nsd))
+		exit_next_syscall_data(tsk);
+
 #ifdef CONFIG_NUMA
 	mpol_put(tsk->mempolicy);
 	tsk->mempolicy = NULL;

--


* [RFC PATCH 2/5] use next syscall data to predefine ipc objects ids
  2008-07-08 11:24 [RFC PATCH 0/5] Resend -v2 - Use procfs to change a syscall behavior Nadia.Derbey-6ktuUTfB/bM
  2008-07-08 11:24 ` [RFC PATCH 1/5] adds the procfs facilities Nadia.Derbey-6ktuUTfB/bM
@ 2008-07-08 11:24 ` Nadia.Derbey-6ktuUTfB/bM
       [not found]   ` <20080708112458.416998000-6ktuUTfB/bM@public.gmane.org>
  2008-07-08 11:24 ` [RFC PATCH 3/5] use next syscall data to predefine process ids Nadia.Derbey-6ktuUTfB/bM
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 32+ messages in thread
From: Nadia.Derbey-6ktuUTfB/bM @ 2008-07-08 11:24 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA; +Cc: Nadia Derbey

[-- Attachment #1: ipccreate_use_next_syscall_data.patch --]
[-- Type: text/plain, Size: 3579 bytes --]

[PATCH 02/05]

This patch uses the value written into the next_syscall_data proc file
as a target id for the next IPC object to be created.
The following syscalls have a new behavior if next_syscall_data is set:
. msgget()
. semget()
. shmget()
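
The requested id is split into an idr index and a sequence number before
being installed. The arithmetic can be sketched in userspace as below. This
assumes SEQ_MULTIPLIER is 32768 (IPCMNI), which is believed to be its value
in kernels of this era; split_ipc_id() and build_ipc_id() are hypothetical
names, and the negative-id check is an addition of this sketch.

```c
#include <assert.h>
#include <errno.h>

/* Assumed value: SEQ_MULTIPLIER is defined as IPCMNI (32768). */
#define SEQ_MULTIPLIER 32768

struct ipc_id_parts {
	int lid;		/* index requested from the idr */
	unsigned long seq;	/* sequence number stored in the object */
};

/* Split a requested id, following the arithmetic used in ipc_addid(). */
int split_ipc_id(long next_id, struct ipc_id_parts *out)
{
	if (next_id < 0)
		return -EINVAL;
	out->lid = (int)(next_id % SEQ_MULTIPLIER);
	out->seq = (unsigned long)(next_id / SEQ_MULTIPLIER);
	return 0;
}

/* Inverse operation: the id is seq * SEQ_MULTIPLIER + lid. */
long build_ipc_id(int lid, unsigned long seq)
{
	return (long)(seq * SEQ_MULTIPLIER) + lid;
}
```

If the index obtained from the idr differs from the requested one, the slot
is already in use and the patch returns -EBUSY.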

Signed-off-by: Nadia Derbey <Nadia.Derbey-6ktuUTfB/bM@public.gmane.org>

---
 include/linux/next_syscall_data.h |   17 ++++++++++++++--
 ipc/util.c                        |   39 ++++++++++++++++++++++++++++++--------
 2 files changed, 46 insertions(+), 10 deletions(-)

Index: linux-2.6.26-rc8-mm1/include/linux/next_syscall_data.h
===================================================================
--- linux-2.6.26-rc8-mm1.orig/include/linux/next_syscall_data.h	2008-07-08 09:24:38.000000000 +0200
+++ linux-2.6.26-rc8-mm1/include/linux/next_syscall_data.h	2008-07-08 12:12:39.000000000 +0200
@@ -1,7 +1,10 @@
 /*
  * include/linux/next_syscall_data.h
  *
- * Definitions to support fixed data for next syscall to be called.
+ * Definitions to support fixed data for next syscall to be called. The
+ * following is supported today:
+ *    . object creation with a predefined id
+ *         . for a sysv ipc object
  */
 
 #ifndef _LINUX_NEXT_SYSCALL_DATA_H
@@ -13,13 +16,23 @@
  * If this structure is pointed to by a task_struct, next syscall to be called
  * by the task will have a non-default behavior.
  * For example, it can be used to pre-set the id of the object to be created
- * by next syscall.
+ * by next syscall. The following syscalls support this feature:
+ *    . msgget(), semget(), shmget()
  */
 struct next_syscall_data {
 	int ndata;
 	long data[NDATA];
 };
 
+/*
+ * Returns true if tsk has some data set in its next_syscall_data, 0 else
+ */
+#define next_data_set(tsk)	((tsk)->nsd				\
+					? ((tsk)->nsd->ndata ? 1 : 0)	\
+					: 0)
+
+#define get_next_data(tsk)	((tsk)->nsd->data[0])
+
 extern ssize_t get_next_syscall_data(struct task_struct *, char *, size_t);
 extern int set_next_syscall_data(struct task_struct *, char *);
 extern void reset_next_syscall_data(struct task_struct *);
Index: linux-2.6.26-rc8-mm1/ipc/util.c
===================================================================
--- linux-2.6.26-rc8-mm1.orig/ipc/util.c	2008-07-08 09:05:09.000000000 +0200
+++ linux-2.6.26-rc8-mm1/ipc/util.c	2008-07-08 12:13:40.000000000 +0200
@@ -266,20 +266,43 @@ int ipc_addid(struct ipc_ids* ids, struc
 	if (ids->in_use >= size)
 		return -ENOSPC;
 
-	err = idr_get_new(&ids->ipcs_idr, new, &id);
-	if (err)
-		return err;
+	if (unlikely(next_data_set(current))) {
+		/* There is a target id specified, try to use it */
+		int next_id = get_next_data(current);
+		int new_lid = next_id % SEQ_MULTIPLIER;
+		unsigned long new_seq = next_id / SEQ_MULTIPLIER;
+
+		reset_next_syscall_data(current);
+
+		if (next_id != (new_lid + (new_seq * SEQ_MULTIPLIER)))
+			return -EINVAL;
+
+		err = idr_get_new_above(&ids->ipcs_idr, new, new_lid, &id);
+		if (err)
+			return err;
+		if (id != new_lid) {
+			idr_remove(&ids->ipcs_idr, id);
+			return -EBUSY;
+		}
+
+		new->id = next_id;
+		new->seq = new_seq;
+	} else {
+		err = idr_get_new(&ids->ipcs_idr, new, &id);
+		if (err)
+			return err;
+
+		new->seq = ids->seq++;
+		if (ids->seq > ids->seq_max)
+			ids->seq = 0;
+		new->id = ipc_buildid(id, new->seq);
+	}
 
 	ids->in_use++;
 
 	new->cuid = new->uid = current->euid;
 	new->gid = new->cgid = current->egid;
 
-	new->seq = ids->seq++;
-	if(ids->seq > ids->seq_max)
-		ids->seq = 0;
-
-	new->id = ipc_buildid(id, new->seq);
 	spin_lock_init(&new->lock);
 	new->deleted = 0;
 	rcu_read_lock();

--


* [RFC PATCH 3/5] use next syscall data to predefine process ids
  2008-07-08 11:24 [RFC PATCH 0/5] Resend -v2 - Use procfs to change a syscall behavior Nadia.Derbey-6ktuUTfB/bM
  2008-07-08 11:24 ` [RFC PATCH 1/5] adds the procfs facilities Nadia.Derbey-6ktuUTfB/bM
  2008-07-08 11:24 ` [RFC PATCH 2/5] use next syscall data to predefine ipc objects ids Nadia.Derbey-6ktuUTfB/bM
@ 2008-07-08 11:24 ` Nadia.Derbey-6ktuUTfB/bM
       [not found]   ` <20080708112458.946320000-6ktuUTfB/bM@public.gmane.org>
  2008-07-08 11:24 ` [RFC PATCH 4/5] use next syscall data to change the behavior of IPC_SET Nadia.Derbey-6ktuUTfB/bM
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 32+ messages in thread
From: Nadia.Derbey-6ktuUTfB/bM @ 2008-07-08 11:24 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA; +Cc: Nadia Derbey

[-- Attachment #1: proccreate_use_next_syscall_data.patch --]
[-- Type: text/plain, Size: 6253 bytes --]

[PATCH 03/05]

This patch uses the value written into the next_syscall_data proc file
as a target upid nr for the next process to be created.
The following syscalls have a new behavior if next_syscall_data is set:
. fork()
. vfork()
. clone()

In the current version, if the process belongs to nested namespaces, only
the upper namespace level upid nr is allowed to be predefined, since there
is not yet a way to take a snapshot of upid nrs at all namespaces levels.

But this can easily be extended in the future.
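
The core of the change is claiming one specific number in an allocation
bitmap instead of searching for a free one. A single-threaded userspace
sketch of that idea follows. The names alloc_fixed_id() and free_id() and the
limits ID_MAX and ID_RESERVED are assumptions made for illustration; the
kernel code works on per-namespace pidmap pages and uses an atomic
test_and_set_bit().

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

#define ID_MAX		32768	/* assumed bound, like the default pid_max */
#define ID_RESERVED	300	/* assumed floor, like RESERVED_PIDS */
#define BITS_PER_WORD	((int)(sizeof(unsigned long) * CHAR_BIT))

static unsigned long id_map[ID_MAX / BITS_PER_WORD];

/*
 * Claim exactly `nr`.
 * Returns 0 on success, -EINVAL if out of range, -EBUSY if already taken.
 */
int alloc_fixed_id(int nr)
{
	unsigned long *word, mask;

	if (nr < ID_RESERVED || nr >= ID_MAX)
		return -EINVAL;

	word = &id_map[nr / BITS_PER_WORD];
	mask = 1UL << (nr % BITS_PER_WORD);
	if (*word & mask)
		return -EBUSY;	/* not atomic here, unlike the kernel */
	*word |= mask;
	return 0;
}

/* Release a number previously claimed with alloc_fixed_id(). */
void free_id(int nr)
{
	if (nr >= 0 && nr < ID_MAX)
		id_map[nr / BITS_PER_WORD] &= ~(1UL << (nr % BITS_PER_WORD));
}
```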

Signed-off-by: Nadia Derbey <Nadia.Derbey-6ktuUTfB/bM@public.gmane.org>

---
 include/linux/next_syscall_data.h |    2 
 kernel/fork.c                     |    5 -
 kernel/pid.c                      |  116 +++++++++++++++++++++++++++++++-------
 3 files changed, 102 insertions(+), 21 deletions(-)

Index: linux-2.6.26-rc8-mm1/kernel/pid.c
===================================================================
--- linux-2.6.26-rc8-mm1.orig/kernel/pid.c	2008-07-08 12:12:39.000000000 +0200
+++ linux-2.6.26-rc8-mm1/kernel/pid.c	2008-07-08 12:24:04.000000000 +0200
@@ -122,6 +122,26 @@ static void free_pidmap(struct upid *upi
 	atomic_inc(&map->nr_free);
 }
 
+static inline int alloc_pidmap_page(struct pidmap *map)
+{
+	if (unlikely(!map->page)) {
+		void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
+		/*
+		 * Free the page if someone raced with us
+		 * installing it:
+		 */
+		spin_lock_irq(&pidmap_lock);
+		if (map->page)
+			kfree(page);
+		else
+			map->page = page;
+		spin_unlock_irq(&pidmap_lock);
+		if (unlikely(!map->page))
+			return -1;
+	}
+	return 0;
+}
+
 static int alloc_pidmap(struct pid_namespace *pid_ns)
 {
 	int i, offset, max_scan, pid, last = pid_ns->last_pid;
@@ -134,21 +154,8 @@ static int alloc_pidmap(struct pid_names
 	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
 	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
 	for (i = 0; i <= max_scan; ++i) {
-		if (unlikely(!map->page)) {
-			void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
-			/*
-			 * Free the page if someone raced with us
-			 * installing it:
-			 */
-			spin_lock_irq(&pidmap_lock);
-			if (map->page)
-				kfree(page);
-			else
-				map->page = page;
-			spin_unlock_irq(&pidmap_lock);
-			if (unlikely(!map->page))
-				break;
-		}
+		if (unlikely(alloc_pidmap_page(map)))
+			break;
 		if (likely(atomic_read(&map->nr_free))) {
 			do {
 				if (!test_and_set_bit(offset, map->page)) {
@@ -182,6 +189,33 @@ static int alloc_pidmap(struct pid_names
 	return -1;
 }
 
+/*
+ * Return 0 if successful (i.e. next_nr could be assigned as a upid nr).
+ * -errno else
+ */
+static int alloc_fixed_pidmap(struct pid_namespace *pid_ns, int next_nr)
+{
+	int offset;
+	struct pidmap *map;
+
+	if (next_nr < RESERVED_PIDS || next_nr >= pid_max)
+		return -EINVAL;
+
+	map = &pid_ns->pidmap[next_nr / BITS_PER_PAGE];
+
+	if (unlikely(alloc_pidmap_page(map)))
+		return -ENOMEM;
+
+	offset = next_nr & BITS_PER_PAGE_MASK;
+	if (test_and_set_bit(offset, map->page))
+		return -EBUSY;
+
+	atomic_dec(&map->nr_free);
+	pid_ns->last_pid = max(pid_ns->last_pid, next_nr);
+
+	return 0;
+}
+
 int next_pidmap(struct pid_namespace *pid_ns, int last)
 {
 	int offset;
@@ -239,6 +273,24 @@ void free_pid(struct pid *pid)
 	call_rcu(&pid->rcu, delayed_put_pid);
 }
 
+/*
+ * Sets a predefined upid nr for the process' upper namespace level
+ */
+static int set_predefined_pid(struct pid_namespace *ns, struct pid *pid,
+				int next_nr)
+{
+	int i = ns->level;
+	int rc;
+
+	rc = alloc_fixed_pidmap(ns, next_nr);
+	if (rc < 0)
+		return rc;
+
+	pid->numbers[i].nr = next_nr;
+	pid->numbers[i].ns = ns;
+	return 0;
+}
+
 struct pid *alloc_pid(struct pid_namespace *ns)
 {
 	struct pid *pid;
@@ -248,14 +300,41 @@ struct pid *alloc_pid(struct pid_namespa
 	struct upid *upid;
 
 	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
-	if (!pid)
+	if (!pid) {
+		pid = ERR_PTR(-ENOMEM);
 		goto out;
+	}
 
 	tmp = ns;
-	for (i = ns->level; i >= 0; i--) {
+	i = ns->level;
+	if (unlikely(next_data_set(current))) {
+		/*
+		 * There is a upid nr specified, use it instead of letting
+		 * the kernel choose it for us.
+		 */
+		int next_nr = get_next_data(current);
+		int rc;
+
+		reset_next_syscall_data(current);
+		rc = set_predefined_pid(tmp, pid, next_nr);
+		if (rc < 0) {
+			pid = ERR_PTR(rc);
+			goto out_free;
+		}
+		/* Go up one level */
+		tmp = tmp->parent;
+		i--;
+	}
+
+	/*
+	 * Let the lower levels upid nrs be automatically allocated
+	 */
+	for ( ; i >= 0; i--) {
 		nr = alloc_pidmap(tmp);
-		if (nr < 0)
+		if (nr < 0) {
+			pid = ERR_PTR(-ENOMEM);
 			goto out_free;
+		}
 
 		pid->numbers[i].nr = nr;
 		pid->numbers[i].ns = tmp;
@@ -284,7 +363,6 @@ out_free:
 		free_pidmap(pid->numbers + i);
 
 	kmem_cache_free(ns->pid_cachep, pid);
-	pid = NULL;
 	goto out;
 }
 
Index: linux-2.6.26-rc8-mm1/kernel/fork.c
===================================================================
--- linux-2.6.26-rc8-mm1.orig/kernel/fork.c	2008-07-08 12:12:39.000000000 +0200
+++ linux-2.6.26-rc8-mm1/kernel/fork.c	2008-07-08 12:22:47.000000000 +0200
@@ -1118,10 +1118,11 @@ static struct task_struct *copy_process(
 		goto bad_fork_cleanup_io;
 
 	if (pid != &init_struct_pid) {
-		retval = -ENOMEM;
 		pid = alloc_pid(task_active_pid_ns(p));
-		if (!pid)
+		if (IS_ERR(pid)) {
+			retval = PTR_ERR(pid);
 			goto bad_fork_cleanup_io;
+		}
 
 		if (clone_flags & CLONE_NEWPID) {
 			retval = pid_ns_prepare_proc(task_active_pid_ns(p));
Index: linux-2.6.26-rc8-mm1/include/linux/next_syscall_data.h
===================================================================
--- linux-2.6.26-rc8-mm1.orig/include/linux/next_syscall_data.h	2008-07-08 12:12:39.000000000 +0200
+++ linux-2.6.26-rc8-mm1/include/linux/next_syscall_data.h	2008-07-08 12:22:47.000000000 +0200
@@ -5,6 +5,7 @@
  * following is supported today:
  *    . object creation with a predefined id
  *         . for a sysv ipc object
+ *         . for a process
  */
 
 #ifndef _LINUX_NEXT_SYSCALL_DATA_H
@@ -18,6 +19,7 @@
  * For example, it can be used to pre-set the id of the object to be created
  * by next syscall. The following syscalls support this feature:
  *    . msgget(), semget(), shmget()
+ *    . fork(), vfork(), clone()
  */
 struct next_syscall_data {
 	int ndata;

--


* [RFC PATCH 4/5] use next syscall data to change the behavior of IPC_SET
  2008-07-08 11:24 [RFC PATCH 0/5] Resend -v2 - Use procfs to change a syscall behavior Nadia.Derbey-6ktuUTfB/bM
                   ` (2 preceding siblings ...)
  2008-07-08 11:24 ` [RFC PATCH 3/5] use next syscall data to predefine process ids Nadia.Derbey-6ktuUTfB/bM
@ 2008-07-08 11:24 ` Nadia.Derbey-6ktuUTfB/bM
       [not found]   ` <20080708112459.231249000-6ktuUTfB/bM@public.gmane.org>
  2008-07-08 11:24 ` [RFC PATCH 5/5] use next syscall data to predefine the file descriptor value Nadia.Derbey-6ktuUTfB/bM
       [not found] ` <20080708112422.164370000-6ktuUTfB/bM@public.gmane.org>
  5 siblings, 1 reply; 32+ messages in thread
From: Nadia.Derbey-6ktuUTfB/bM @ 2008-07-08 11:24 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA; +Cc: Nadia Derbey

[-- Attachment #1: ipcset_use_next_syscall_data.patch --]
[-- Type: text/plain, Size: 5433 bytes --]

[PATCH 04/05]

This patch uses the value written into the next_syscall_data proc file
as a flag to change the way msgctl(IPC_SET), semctl(IPC_SET) and
shmctl(IPC_SET) behave.

When "LONG1 1" is echoed to this file, xxxctl(IPC_SET) will set the time
fields and the pid fields according to what is specified in the input
parameter (while currently only the permission fields are allowed to be set).
The following syscalls are impacted:
. msgctl(IPC_SET)
. semctl(IPC_SET)
. shmctl(IPC_SET)

This makes it easy to restart an ipc object exactly as it was during the
checkpoint phase.
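
For illustration only, a restart tool could drive this from userspace roughly
as sketched below. This is a minimal sketch, not part of the patch: the helper
names format_next_syscall_data(), arm_next_syscall() and restore_msgq() are
hypothetical, the proc file only exists on a kernel carrying this series, and
the caller is assumed to pass in the /proc/self/task/<tid>/next_syscall_data
path of the calling thread.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

/*
 * Build the single-value command line described above, e.g. "LONG1 1".
 * Returns the number of characters written, or -1 if buf is too small.
 * (Hypothetical helper, not provided by the patch.)
 */
int format_next_syscall_data(char *buf, size_t len, long value)
{
	int n = snprintf(buf, len, "LONG1 %ld", value);

	return (n < 0 || (size_t)n >= len) ? -1 : n;
}

/*
 * Write the command to the per-thread proc file given by path.
 * Returns 0 on success, -1 on failure (for instance on a kernel
 * that does not carry this series, where the file does not exist).
 */
int arm_next_syscall(const char *path, long value)
{
	char cmd[32];
	FILE *f;
	int rc;

	if (format_next_syscall_data(cmd, sizeof(cmd), value) < 0)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	rc = (fputs(cmd, f) == EOF) ? -1 : 0;
	if (fclose(f) == EOF)
		rc = -1;
	return rc;
}

/*
 * Restore a checkpointed struct msqid_ds into an existing queue.
 * IPC_SET is only issued if the proc file could be written, so the
 * time and pid fields are never silently dropped.
 */
int restore_msgq(const char *path, int msqid, struct msqid_ds *saved)
{
	if (arm_next_syscall(path, 1) < 0)
		return -1;
	return msgctl(msqid, IPC_SET, saved);
}
```

Note that in such a sketch nothing else that consumes next_syscall_data (an
open(), a fork(), ...) may be issued by the thread between the write to the
proc file and the msgctl() call, otherwise the flag would be used up by the
wrong syscall.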

Signed-off-by: Nadia Derbey <Nadia.Derbey-6ktuUTfB/bM@public.gmane.org>

---
 include/linux/next_syscall_data.h |   12 ++++++++++++
 ipc/msg.c                         |   19 ++++++++++++++++++-
 ipc/sem.c                         |   16 +++++++++++++++-
 ipc/shm.c                         |   19 ++++++++++++++++++-
 4 files changed, 63 insertions(+), 3 deletions(-)

Index: linux-2.6.26-rc8-mm1/include/linux/next_syscall_data.h
===================================================================
--- linux-2.6.26-rc8-mm1.orig/include/linux/next_syscall_data.h	2008-07-08 12:22:47.000000000 +0200
+++ linux-2.6.26-rc8-mm1/include/linux/next_syscall_data.h	2008-07-08 12:24:29.000000000 +0200
@@ -6,6 +6,7 @@
  *    . object creation with a predefined id
  *         . for a sysv ipc object
  *         . for a process
+ *    . set more than the usual ipc_perm fields during an IPC_SET operation.
  */
 
 #ifndef _LINUX_NEXT_SYSCALL_DATA_H
@@ -20,6 +21,10 @@
  * by next syscall. The following syscalls support this feature:
  *    . msgget(), semget(), shmget()
  *    . fork(), vfork(), clone()
+ *
+ * If it is set to a non null value before a call to:
+ *    . msgctl(IPC_SET), semctl(IPC_SET), shmctl(IPC_SET),
+ * this means that we are going to set more than the usual ipc_perm fields.
  */
 struct next_syscall_data {
 	int ndata;
@@ -35,6 +40,13 @@ struct next_syscall_data {
 
 #define get_next_data(tsk)	((tsk)->nsd->data[0])
 
+/*
+ * Returns true if next call to xxxctl(IPC_SET) should have a non-default
+ * behavior.
+ */
+#define ipc_set_all(tsk)	(next_data_set(tsk) ? get_next_data(tsk) : 0)
+
+
 extern ssize_t get_next_syscall_data(struct task_struct *, char *, size_t);
 extern int set_next_syscall_data(struct task_struct *, char *);
 extern void reset_next_syscall_data(struct task_struct *);
Index: linux-2.6.26-rc8-mm1/ipc/msg.c
===================================================================
--- linux-2.6.26-rc8-mm1.orig/ipc/msg.c	2008-07-08 12:12:36.000000000 +0200
+++ linux-2.6.26-rc8-mm1/ipc/msg.c	2008-07-08 12:26:03.000000000 +0200
@@ -446,7 +446,24 @@ static int msgctl_down(struct ipc_namesp
 		msq->q_qbytes = msqid64.msg_qbytes;
 
 		ipc_update_perm(&msqid64.msg_perm, ipcp);
-		msq->q_ctime = get_seconds();
+		if (unlikely(ipc_set_all(current))) {
+			/*
+			 * If this field is set in the task struct, this
+			 * means that we want to set more than the usual
+			 * fields. Particularly useful to restart a msgq
+			 * in the same state as it was before being
+			 * checkpointed.
+			 */
+			msq->q_stime = msqid64.msg_stime;
+			msq->q_rtime = msqid64.msg_rtime;
+			msq->q_ctime = msqid64.msg_ctime;
+			msq->q_lspid = msqid64.msg_lspid;
+			msq->q_lrpid = msqid64.msg_lrpid;
+
+			reset_next_syscall_data(current);
+		} else
+			msq->q_ctime = get_seconds();
+
 		/* sleeping receivers might be excluded by
 		 * stricter permissions.
 		 */
Index: linux-2.6.26-rc8-mm1/ipc/sem.c
===================================================================
--- linux-2.6.26-rc8-mm1.orig/ipc/sem.c	2008-07-08 12:12:36.000000000 +0200
+++ linux-2.6.26-rc8-mm1/ipc/sem.c	2008-07-08 12:27:06.000000000 +0200
@@ -874,7 +874,21 @@ static int semctl_down(struct ipc_namesp
 		goto out_up;
 	case IPC_SET:
 		ipc_update_perm(&semid64.sem_perm, ipcp);
-		sma->sem_ctime = get_seconds();
+
+		if (unlikely(ipc_set_all(current))) {
+			/*
+			 * If this field is set in the task struct, this
+			 * means that we want to set more than the usual
+			 * fields. Particularly useful to restart a semaphore
+			 * in the same state as it was before being
+			 * checkpointed.
+			 */
+			sma->sem_ctime = semid64.sem_ctime;
+			sma->sem_otime = semid64.sem_otime;
+
+			reset_next_syscall_data(current);
+		} else
+			sma->sem_ctime = get_seconds();
 		break;
 	default:
 		err = -EINVAL;
Index: linux-2.6.26-rc8-mm1/ipc/shm.c
===================================================================
--- linux-2.6.26-rc8-mm1.orig/ipc/shm.c	2008-07-08 12:12:36.000000000 +0200
+++ linux-2.6.26-rc8-mm1/ipc/shm.c	2008-07-08 12:27:32.000000000 +0200
@@ -609,7 +609,24 @@ static int shmctl_down(struct ipc_namesp
 		goto out_up;
 	case IPC_SET:
 		ipc_update_perm(&shmid64.shm_perm, ipcp);
-		shp->shm_ctim = get_seconds();
+
+		if (unlikely(ipc_set_all(current))) {
+			/*
+			 * If this field is set in the task struct, this
+			 * means that we want to set more than the usual
+			 * fields. Particularly useful to restart a shm seg
+			 * in the same state as it was before being
+			 * checkpointed.
+			 */
+			shp->shm_atim = shmid64.shm_atime;
+			shp->shm_dtim = shmid64.shm_dtime;
+			shp->shm_ctim = shmid64.shm_ctime;
+			shp->shm_cprid = shmid64.shm_cpid;
+			shp->shm_lprid = shmid64.shm_lpid;
+
+			reset_next_syscall_data(current);
+		} else
+			shp->shm_ctim = get_seconds();
 		break;
 	default:
 		err = -EINVAL;

--


* [RFC PATCH 5/5] use next syscall data to predefine the file descriptor value
  2008-07-08 11:24 [RFC PATCH 0/5] Resend -v2 - Use procfs to change a syscall behavior Nadia.Derbey-6ktuUTfB/bM
                   ` (3 preceding siblings ...)
  2008-07-08 11:24 ` [RFC PATCH 4/5] use next syscall data to change the behavior of IPC_SET Nadia.Derbey-6ktuUTfB/bM
@ 2008-07-08 11:24 ` Nadia.Derbey-6ktuUTfB/bM
       [not found]   ` <20080708112459.632357000-6ktuUTfB/bM@public.gmane.org>
       [not found] ` <20080708112422.164370000-6ktuUTfB/bM@public.gmane.org>
  5 siblings, 1 reply; 32+ messages in thread
From: Nadia.Derbey-6ktuUTfB/bM @ 2008-07-08 11:24 UTC (permalink / raw)
  To: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA; +Cc: Nadia Derbey

[-- Attachment #1: fileopen_use_next_syscall_data.patch --]
[-- Type: text/plain, Size: 2715 bytes --]

[PATCH 05/05]

This patch uses the value written into the next_syscall_data proc file
as a target file descriptor for the next file to be opened.

This makes it easy to restart a process with the same fds as the ones it was
using during the checkpoint phase, instead of 1. opening the file, 2. dup2'ing
the open file descriptor.

The following syscalls are impacted if next_syscall_data is set:
. open()
. openat()
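
For comparison, the two-step userspace sequence mentioned above (open the
file, then dup2 it onto the wanted descriptor) can be sketched as follows.
This is only an illustration of what the patch aims to replace; open_at_fd()
is a hypothetical helper name, not something provided by this series.

```c
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Open path and make it available on descriptor target_fd using the
 * classic open() + dup2() sequence.
 * Returns target_fd on success, -1 on failure.
 */
int open_at_fd(const char *path, int flags, mode_t mode, int target_fd)
{
	int fd = open(path, flags, mode);

	if (fd < 0)
		return -1;

	/* The kernel happened to hand out the wanted descriptor. */
	if (fd == target_fd)
		return fd;

	if (dup2(fd, target_fd) < 0) {
		close(fd);
		return -1;
	}

	/* Only the duplicate on target_fd is kept. */
	close(fd);
	return target_fd;
}
```

Two differences with the patch are worth keeping in mind: dup2() silently
closes target_fd if it is already in use, whereas get_predefined_fd_flags()
returns -EBUSY in that case, and dup2() does not carry the close-on-exec flag
over to the new descriptor.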

Signed-off-by: Nadia Derbey <Nadia.Derbey-6ktuUTfB/bM@public.gmane.org>

---
 fs/open.c |   62 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 61 insertions(+), 1 deletion(-)

Index: linux-2.6.26-rc8-mm1/fs/open.c
===================================================================
--- linux-2.6.26-rc8-mm1.orig/fs/open.c	2008-07-08 12:12:34.000000000 +0200
+++ linux-2.6.26-rc8-mm1/fs/open.c	2008-07-08 13:23:03.000000000 +0200
@@ -974,6 +974,59 @@ struct file *dentry_open(struct dentry *
 EXPORT_SYMBOL(dentry_open);
 
 /*
+ * Marks a given file descriptor entry as busy (should not be busy when this
+ * routine is called).
+ *
+ * files->next_fd is not updated: this lets the potentially created hole be
+ * filled up on next calls to get_unused_fd_flags.
+ *
+ * Returns the specified fd if successful, -errno else.
+ */
+static int get_predefined_fd_flags(int fd, int flags)
+{
+	struct files_struct *files = current->files;
+	int error;
+	struct fdtable *fdt;
+
+	error = -EINVAL;
+	if (fd < 0)
+		goto out;
+
+	error = -EMFILE;
+	if (fd >= current->signal->rlim[RLIMIT_NOFILE].rlim_cur)
+		goto out;
+
+	spin_lock(&files->file_lock);
+	error = expand_files(files, fd);
+	if (error < 0)
+		goto out_unlock;
+
+	fdt = files_fdtable(files);	/* re-read: the table may have grown */
+
+	error = -EBUSY;
+	if (FD_ISSET(fd, fdt->open_fds))
+		goto out_unlock;
+
+	FD_SET(fd, fdt->open_fds);
+	if (flags & O_CLOEXEC)
+		FD_SET(fd, fdt->close_on_exec);
+	else
+		FD_CLR(fd, fdt->close_on_exec);
+
+	/* Sanity check */
+	if (fdt->fd[fd] != NULL) {
+		printk(KERN_WARNING "get_predefined_fd_flags: slot %d not NULL!\n", fd);
+		fdt->fd[fd] = NULL;
+	}
+
+	error = fd;
+out_unlock:
+	spin_unlock(&files->file_lock);
+out:
+	return error;
+}
+
+/*
  * Find an empty file descriptor entry, and mark it busy.
  */
 int get_unused_fd_flags(int flags)
@@ -1088,7 +1141,14 @@ long do_sys_open(int dfd, const char __u
 	int fd = PTR_ERR(tmp);
 
 	if (!IS_ERR(tmp)) {
-		fd = get_unused_fd_flags(flags);
+		if (unlikely(next_data_set(current))) {
+			int next_fd = get_next_data(current);
+
+			fd = get_predefined_fd_flags(next_fd, flags);
+			reset_next_syscall_data(current);
+		} else
+			fd = get_unused_fd_flags(flags);
+
 		if (fd >= 0) {
 			struct file *f = do_filp_open(dfd, tmp, flags, mode);
 			if (IS_ERR(f)) {

--


* Re: [RFC PATCH 1/5] adds the procfs facilities
       [not found]   ` <20080708112457.994105000-6ktuUTfB/bM@public.gmane.org>
@ 2008-07-08 19:32     ` Serge E. Hallyn
  0 siblings, 0 replies; 32+ messages in thread
From: Serge E. Hallyn @ 2008-07-08 19:32 UTC (permalink / raw)
  To: Nadia.Derbey-6ktuUTfB/bM
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Quoting Nadia.Derbey-6ktuUTfB/bM@public.gmane.org (Nadia.Derbey-6ktuUTfB/bM@public.gmane.org):
> [PATCH 01/05]
> 
> This patch adds the procfs facility needed to feed some data for the
> next syscall to be called.
> 
> The effect of issuing
> echo "LONG<Y> <XX>" > /proc/self/task/<tid>/next_syscall_data
> is that <XX> will be stored in a new field of the task structure
> (next_syscall_data). This field, in turn will be taken as the data to feed
> next syscall that supports the feature.
> 
> <Y> is the number of values provided on the line.
> For the sake of simplicity it is now fixed to 1, but this can be extended as
> needed, in the future.
> 
> This is particularly useful when restarting an application, as we
> sometimes need the syscalls to have a non-default behavior.
> 
> Signed-off-by: Nadia Derbey <Nadia.Derbey-6ktuUTfB/bM@public.gmane.org>

Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

thanks,
-serge

> 
> ---
>  fs/exec.c                         |    6 +
>  fs/proc/base.c                    |   75 ++++++++++++++++++
>  include/linux/next_syscall_data.h |   32 ++++++++
>  include/linux/sched.h             |    6 +
>  kernel/Makefile                   |    3 
>  kernel/exit.c                     |    4 +
>  kernel/fork.c                     |    2 
>  kernel/next_syscall_data.c        |  151 ++++++++++++++++++++++++++++++++++++++
>  8 files changed, 278 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6.26-rc8-mm1/include/linux/sched.h
> ===================================================================
> --- linux-2.6.26-rc8-mm1.orig/include/linux/sched.h	2008-07-08 09:04:21.000000000 +0200
> +++ linux-2.6.26-rc8-mm1/include/linux/sched.h	2008-07-08 09:13:43.000000000 +0200
> @@ -87,6 +87,7 @@ struct sched_param {
>  #include <linux/task_io_accounting.h>
>  #include <linux/kobject.h>
>  #include <linux/latencytop.h>
> +#include <linux/next_syscall_data.h>
> 
>  #include <asm/processor.h>
> 
> @@ -1296,6 +1297,11 @@ struct task_struct {
>  	int latency_record_count;
>  	struct latency_record latency_record[LT_SAVECOUNT];
>  #endif
> +	/*
> +	 * If non-NULL indicates that next operation will be forced, e.g.
> +	 * that next object to be created will have a predefined id.
> +	 */
> +	struct next_syscall_data *nsd;
>  };
> 
>  /*
> Index: linux-2.6.26-rc8-mm1/include/linux/next_syscall_data.h
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6.26-rc8-mm1/include/linux/next_syscall_data.h	2008-07-08 09:24:38.000000000 +0200
> @@ -0,0 +1,32 @@
> +/*
> + * include/linux/next_syscall_data.h
> + *
> + * Definitions to support fixed data for next syscall to be called.
> + */
> +
> +#ifndef _LINUX_NEXT_SYSCALL_DATA_H
> +#define _LINUX_NEXT_SYSCALL_DATA_H
> +
> +#define NDATA 1
> +
> +/*
> + * If this structure is pointed to by a task_struct, next syscall to be called
> + * by the task will have a non-default behavior.
> + * For example, it can be used to pre-set the id of the object to be created
> + * by next syscall.
> + */
> +struct next_syscall_data {
> +	int ndata;
> +	long data[NDATA];
> +};
> +
> +extern ssize_t get_next_syscall_data(struct task_struct *, char *, size_t);
> +extern int set_next_syscall_data(struct task_struct *, char *);
> +extern void reset_next_syscall_data(struct task_struct *);
> +
> +static inline void exit_next_syscall_data(struct task_struct *tsk)
> +{
> +	reset_next_syscall_data(tsk);
> +}
> +
> +#endif /* _LINUX_NEXT_SYSCALL_DATA_H */
> Index: linux-2.6.26-rc8-mm1/fs/proc/base.c
> ===================================================================
> --- linux-2.6.26-rc8-mm1.orig/fs/proc/base.c	2008-07-08 09:05:13.000000000 +0200
> +++ linux-2.6.26-rc8-mm1/fs/proc/base.c	2008-07-08 09:18:12.000000000 +0200
> @@ -1158,6 +1158,76 @@ static const struct file_operations proc
>  };
>  #endif
> 
> +static ssize_t next_syscall_data_read(struct file *file, char __user *buf,
> +				size_t count, loff_t *ppos)
> +{
> +	struct task_struct *task;
> +	char *page;
> +	ssize_t length;
> +
> +	task = get_proc_task(file->f_path.dentry->d_inode);
> +	if (!task)
> +		return -ESRCH;
> +
> +	if (count >= PAGE_SIZE)
> +		count = PAGE_SIZE - 1;
> +
> +	length = -ENOMEM;
> +	page = (char *) __get_free_page(GFP_TEMPORARY);
> +	if (!page)
> +		goto out;
> +
> +	length = get_next_syscall_data(task, (char *) page, count);
> +	if (length >= 0)
> +		length = simple_read_from_buffer(buf, count, ppos,
> +						(char *)page, length);
> +	free_page((unsigned long) page);
> +
> +out:
> +	put_task_struct(task);
> +	return length;
> +}
> +
> +static ssize_t next_syscall_data_write(struct file *file,
> +				const char __user *buf,
> +				size_t count, loff_t *ppos)
> +{
> +	struct inode *inode = file->f_path.dentry->d_inode;
> +	char *page;
> +	ssize_t length;
> +
> +	if (pid_task(proc_pid(inode), PIDTYPE_PID) != current)
> +		return -EPERM;
> +
> +	if (count >= PAGE_SIZE)
> +		count = PAGE_SIZE - 1;
> +
> +	if (*ppos != 0) {
> +		/* No partial writes. */
> +		return -EINVAL;
> +	}
> +	page = (char *)__get_free_page(GFP_TEMPORARY);
> +	if (!page)
> +		return -ENOMEM;
> +	length = -EFAULT;
> +	if (copy_from_user(page, buf, count))
> +		goto out_free_page;
> +
> +	page[count] = '\0';
> +
> +	length = set_next_syscall_data(current, page);
> +	if (!length)
> +		length = count;
> +
> +out_free_page:
> +	free_page((unsigned long) page);
> +	return length;
> +}
> +
> +static const struct file_operations proc_next_syscall_data_operations = {
> +	.read		= next_syscall_data_read,
> +	.write		= next_syscall_data_write,
> +};
> 
>  #ifdef CONFIG_SCHED_DEBUG
>  /*
> @@ -2853,6 +2923,11 @@ static const struct pid_entry tid_base_s
>  #ifdef CONFIG_TASK_IO_ACCOUNTING
>  	INF("io",	S_IRUGO, tid_io_accounting),
>  #endif
> +	/*
> +	 * NOTE that this file is not added into tgid_base_stuff[] since it
> +	 * has to be specified on a per-thread basis.
> +	 */
> +	REG("next_syscall_data", S_IRUGO|S_IWUSR, next_syscall_data),
>  };
> 
>  static int proc_tid_base_readdir(struct file * filp,
> Index: linux-2.6.26-rc8-mm1/kernel/Makefile
> ===================================================================
> --- linux-2.6.26-rc8-mm1.orig/kernel/Makefile	2008-07-08 09:04:35.000000000 +0200
> +++ linux-2.6.26-rc8-mm1/kernel/Makefile	2008-07-08 09:19:14.000000000 +0200
> @@ -9,7 +9,8 @@ obj-y     = sched.o fork.o exec_domain.o
>  	    rcupdate.o extable.o params.o posix-timers.o \
>  	    kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
>  	    hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
> -	    notifier.o ksysfs.o pm_qos_params.o sched_clock.o
> +	    notifier.o ksysfs.o pm_qos_params.o sched_clock.o \
> +	    next_syscall_data.o
> 
>  CFLAGS_REMOVE_sched.o = -pg -mno-spe
> 
> Index: linux-2.6.26-rc8-mm1/kernel/next_syscall_data.c
> ===================================================================
> --- /dev/null	1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6.26-rc8-mm1/kernel/next_syscall_data.c	2008-07-08 09:35:27.000000000 +0200
> @@ -0,0 +1,151 @@
> +/*
> + * linux/kernel/next_syscall_data.c
> + *
> + *
> + * Provide the get_next_syscall_data() / set_next_syscall_data() routines
> + * (called from fs/proc/base.c).
> + * They allow to specify some particular data for the next syscall to be
> + * called.
> + * E.g. they can be used to specify the id for the next resource to be
> + * allocated, instead of letting the allocator set it for us.
> + */
> +
> +#include <linux/sched.h>
> +#include <linux/ctype.h>
> +
> +
> +
> +ssize_t get_next_syscall_data(struct task_struct *task, char *buffer,
> +				size_t size)
> +{
> +	struct next_syscall_data *nsd;
> +	char *bufptr = buffer;
> +	ssize_t rc, count = 0;
> +	int i;
> +
> +	nsd = task->nsd;
> +	if (!nsd || !nsd->ndata)
> +		return snprintf(buffer, size, "UNSET\n");
> +
> +	count = snprintf(bufptr, size, "LONG%d ", nsd->ndata);
> +
> +	for (i = 0; i < nsd->ndata - 1; i++) {
> +		rc = snprintf(&bufptr[count], size - count, "%ld ",
> +				nsd->data[i]);
> +		if (rc >= size - count)
> +			return -ENOMEM;
> +		count += rc;
> +	}
> +
> +	rc = snprintf(&bufptr[count], size - count, "%ld\n", nsd->data[i]);
> +	if (rc >= size - count)
> +		return -ENOMEM;
> +	count += rc;
> +
> +	return count;
> +}
> +
> +static int fill_next_syscall_data(struct task_struct *task, int ndata,
> +				char *buffer)
> +{
> +	char *token, *buff = buffer;
> +	char *end;
> +	struct next_syscall_data *nsd = task->nsd;
> +	int i;
> +
> +	if (!nsd) {
> +		nsd = kmalloc(sizeof(*nsd), GFP_KERNEL);
> +		if (!nsd)
> +			return -ENOMEM;
> +		task->nsd = nsd;
> +	}
> +
> +	nsd->ndata = ndata;
> +
> +	i = 0;
> +	while ((token = strsep(&buff, " ")) != NULL && i < ndata) {
> +		long data;
> +
> +		if (!*token)
> +			goto out_free;
> +		data = simple_strtol(token, &end, 0);
> +		if (end == token || (*end && !isspace(*end)))
> +			goto out_free;
> +		nsd->data[i] = data;
> +		i++;
> +	}
> +
> +	if (i != ndata)
> +		goto out_free;
> +
> +	return 0;
> +
> +out_free:
> +	kfree(nsd);
> +	task->nsd = NULL;
> +	return -EINVAL;
> +}
> +
> +/*
> + * Parses a line with the following format:
> + * <x> <id0> ... <idx-1>
> + * Currently, only x=1 is accepted.
> + * Any trailing character on the line is skipped.
> + */
> +static int do_set_next_syscall_data(struct task_struct *task, char *nb,
> +					char *buffer)
> +{
> +	int ndata;
> +	char *end;
> +
> +	ndata = simple_strtol(nb, &end, 0);
> +	if (*end)
> +		return -EINVAL;
> +
> +	if (ndata > NDATA)
> +		return -EINVAL;
> +
> +	return fill_next_syscall_data(task, ndata, buffer);
> +}
> +
> +void reset_next_syscall_data(struct task_struct *task)
> +{
> +	struct next_syscall_data *nsd = task->nsd;
> +
> +	if (nsd) {
> +		task->nsd = NULL;
> +		kfree(nsd);
> +	}
> +}
> +
> +#define LONG_STR	"LONG"
> +#define RESET_STR	"RESET"
> +
> +/*
> + * Parses a line written to /proc/self/task/<my_tid>/next_syscall_data.
> + * this line has the following format:
> + * LONG<x> id              --> a sequence of id(s) is specified
> + *                             currently, only x=1 is accepted
> + */
> +int set_next_syscall_data(struct task_struct *task, char *buffer)
> +{
> +	char *token, *out = buffer;
> +	size_t sz;
> +
> +	if (!out)
> +		return -EINVAL;
> +
> +	token = strsep(&out, " ");
> +
> +	sz = strlen(LONG_STR);
> +
> +	if (!strncmp(token, LONG_STR, sz))
> +		return do_set_next_syscall_data(task, token + sz, out);
> +
> +	if (!strncmp(token, RESET_STR, strlen(RESET_STR))) {
> +		reset_next_syscall_data(task);
> +		return 0;
> +	}
> +
> +	return -EINVAL;
> +}
> Index: linux-2.6.26-rc8-mm1/kernel/fork.c
> ===================================================================
> --- linux-2.6.26-rc8-mm1.orig/kernel/fork.c	2008-07-08 09:04:35.000000000 +0200
> +++ linux-2.6.26-rc8-mm1/kernel/fork.c	2008-07-08 09:25:35.000000000 +0200
> @@ -1085,6 +1085,8 @@ static struct task_struct *copy_process(
>  	p->blocked_on = NULL; /* not blocked yet */
>  #endif
> 
> +	p->nsd = NULL;	/* no next syscall data is the default */
> +
>  	/* Perform scheduler related setup. Assign this task to a CPU. */
>  	sched_fork(p, clone_flags);
> 
> Index: linux-2.6.26-rc8-mm1/fs/exec.c
> ===================================================================
> --- linux-2.6.26-rc8-mm1.orig/fs/exec.c	2008-07-08 09:05:13.000000000 +0200
> +++ linux-2.6.26-rc8-mm1/fs/exec.c	2008-07-08 09:26:21.000000000 +0200
> @@ -1016,6 +1016,12 @@ int flush_old_exec(struct linux_binprm *
>  	flush_signal_handlers(current, 0);
>  	flush_old_files(current->files);
> 
> +	/*
> +	 * the next syscall data is not inherited across execve()
> +	 */
> +	if (unlikely(current->nsd))
> +		reset_next_syscall_data(current);
> +
>  	return 0;
> 
>  out:
> Index: linux-2.6.26-rc8-mm1/kernel/exit.c
> ===================================================================
> --- linux-2.6.26-rc8-mm1.orig/kernel/exit.c	2008-07-08 09:04:35.000000000 +0200
> +++ linux-2.6.26-rc8-mm1/kernel/exit.c	2008-07-08 09:27:31.000000000 +0200
> @@ -1066,6 +1066,10 @@ NORET_TYPE void do_exit(long code)
> 
>  	proc_exit_connector(tsk);
>  	exit_notify(tsk, group_dead);
> +
> +	if (unlikely(tsk->nsd))
> +		exit_next_syscall_data(tsk);
> +
>  #ifdef CONFIG_NUMA
>  	mpol_put(tsk->mempolicy);
>  	tsk->mempolicy = NULL;
> 
> --


* Re: [RFC PATCH 2/5] use next syscall data to predefine ipc objects ids
       [not found]   ` <20080708112458.416998000-6ktuUTfB/bM@public.gmane.org>
@ 2008-07-08 19:38     ` Serge E. Hallyn
  0 siblings, 0 replies; 32+ messages in thread
From: Serge E. Hallyn @ 2008-07-08 19:38 UTC (permalink / raw)
  To: Nadia.Derbey-6ktuUTfB/bM
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Quoting Nadia.Derbey-6ktuUTfB/bM@public.gmane.org (Nadia.Derbey-6ktuUTfB/bM@public.gmane.org):
> [PATCH 02/05]
> 
> This patch uses the value written into the next_syscall_data proc file
> as a target id for the next IPC object to be created.
> The following syscalls have a new behavior if next_syscall_data is set:
> . msgget()
> . semget()
> . shmget()
> 
> Signed-off-by: Nadia Derbey <Nadia.Derbey-6ktuUTfB/bM@public.gmane.org>

Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

thanks,
-serge

> 
> ---
>  include/linux/next_syscall_data.h |   17 ++++++++++++++--
>  ipc/util.c                        |   39 ++++++++++++++++++++++++++++++--------
>  2 files changed, 46 insertions(+), 10 deletions(-)
> 
> Index: linux-2.6.26-rc8-mm1/include/linux/next_syscall_data.h
> ===================================================================
> --- linux-2.6.26-rc8-mm1.orig/include/linux/next_syscall_data.h	2008-07-08 09:24:38.000000000 +0200
> +++ linux-2.6.26-rc8-mm1/include/linux/next_syscall_data.h	2008-07-08 12:12:39.000000000 +0200
> @@ -1,7 +1,10 @@
>  /*
>   * include/linux/next_syscall_data.h
>   *
> - * Definitions to support fixed data for next syscall to be called.
> + * Definitions to support fixed data for next syscall to be called. The
> + * following is supported today:
> + *    . object creation with a predefined id
> + *         . for a sysv ipc object
>   */
> 
>  #ifndef _LINUX_NEXT_SYSCALL_DATA_H
> @@ -13,13 +16,23 @@
>   * If this structure is pointed to by a task_struct, next syscall to be called
>   * by the task will have a non-default behavior.
>   * For example, it can be used to pre-set the id of the object to be created
> - * by next syscall.
> + * by next syscall. The following syscalls support this feature:
> + *    . msgget(), semget(), shmget()
>   */
>  struct next_syscall_data {
>  	int ndata;
>  	long data[NDATA];
>  };
> 
> +/*
> + * Returns true if tsk has some data set in its next_syscall_data, 0 else
> + */
> +#define next_data_set(tsk)	((tsk)->nsd				\
> +					? ((tsk)->nsd->ndata ? 1 : 0)	\
> +					: 0)
> +
> +#define get_next_data(tsk)	((tsk)->nsd->data[0])
> +
>  extern ssize_t get_next_syscall_data(struct task_struct *, char *, size_t);
>  extern int set_next_syscall_data(struct task_struct *, char *);
>  extern void reset_next_syscall_data(struct task_struct *);
> Index: linux-2.6.26-rc8-mm1/ipc/util.c
> ===================================================================
> --- linux-2.6.26-rc8-mm1.orig/ipc/util.c	2008-07-08 09:05:09.000000000 +0200
> +++ linux-2.6.26-rc8-mm1/ipc/util.c	2008-07-08 12:13:40.000000000 +0200
> @@ -266,20 +266,43 @@ int ipc_addid(struct ipc_ids* ids, struc
>  	if (ids->in_use >= size)
>  		return -ENOSPC;
> 
> -	err = idr_get_new(&ids->ipcs_idr, new, &id);
> -	if (err)
> -		return err;
> +	if (unlikely(next_data_set(current))) {
> +		/* There is a target id specified, try to use it */
> +		int next_id = get_next_data(current);
> +		int new_lid = next_id % SEQ_MULTIPLIER;
> +		unsigned long new_seq = next_id / SEQ_MULTIPLIER;
> +
> +		reset_next_syscall_data(current);
> +
> +		if (next_id != (new_lid + (new_seq * SEQ_MULTIPLIER)))
> +			return -EINVAL;
> +
> +		err = idr_get_new_above(&ids->ipcs_idr, new, new_lid, &id);
> +		if (err)
> +			return err;
> +		if (id != new_lid) {
> +			idr_remove(&ids->ipcs_idr, id);
> +			return -EBUSY;
> +		}
> +
> +		new->id = next_id;
> +		new->seq = new_seq;
> +	} else {
> +		err = idr_get_new(&ids->ipcs_idr, new, &id);
> +		if (err)
> +			return err;
> +
> +		new->seq = ids->seq++;
> +		if (ids->seq > ids->seq_max)
> +			ids->seq = 0;
> +		new->id = ipc_buildid(id, new->seq);
> +	}
> 
>  	ids->in_use++;
> 
>  	new->cuid = new->uid = current->euid;
>  	new->gid = new->cgid = current->egid;
> 
> -	new->seq = ids->seq++;
> -	if(ids->seq > ids->seq_max)
> -		ids->seq = 0;
> -
> -	new->id = ipc_buildid(id, new->seq);
>  	spin_lock_init(&new->lock);
>  	new->deleted = 0;
>  	rcu_read_lock();
> 
> --


* Re: [RFC PATCH 3/5] use next syscall data to predefine process ids
       [not found]   ` <20080708112458.946320000-6ktuUTfB/bM@public.gmane.org>
@ 2008-07-08 19:49     ` Serge E. Hallyn
  2008-07-10  0:27     ` Eric W. Biederman
  1 sibling, 0 replies; 32+ messages in thread
From: Serge E. Hallyn @ 2008-07-08 19:49 UTC (permalink / raw)
  To: Nadia.Derbey-6ktuUTfB/bM
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Quoting Nadia.Derbey-6ktuUTfB/bM@public.gmane.org (Nadia.Derbey-6ktuUTfB/bM@public.gmane.org):
> [PATCH 03/05]
> 
> This patch uses the value written into the next_syscall_data proc file
> as a target upid nr for the next process to be created.
> The following syscalls have a new behavior if next_syscall_data is set:
> . fork()
> . vfork()
> . clone()
> 
> In the current version, if the process belongs to nested namespaces, only
> the upper namespace level upid nr is allowed to be predefined, since there
> is not yet a way to take a snapshot of upid nrs at all namespaces levels.
> 
> But this can easily be extended in the future.
> 
> Signed-off-by: Nadia Derbey <Nadia.Derbey-6ktuUTfB/bM@public.gmane.org>

Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

thanks,
-serge

> 
> ---
>  include/linux/next_syscall_data.h |    2 
>  kernel/fork.c                     |    5 -
>  kernel/pid.c                      |  116 +++++++++++++++++++++++++++++++-------
>  3 files changed, 102 insertions(+), 21 deletions(-)
> 
> Index: linux-2.6.26-rc8-mm1/kernel/pid.c
> ===================================================================
> --- linux-2.6.26-rc8-mm1.orig/kernel/pid.c	2008-07-08 12:12:39.000000000 +0200
> +++ linux-2.6.26-rc8-mm1/kernel/pid.c	2008-07-08 12:24:04.000000000 +0200
> @@ -122,6 +122,26 @@ static void free_pidmap(struct upid *upi
>  	atomic_inc(&map->nr_free);
>  }
> 
> +static inline int alloc_pidmap_page(struct pidmap *map)
> +{
> +	if (unlikely(!map->page)) {
> +		void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
> +		/*
> +		 * Free the page if someone raced with us
> +		 * installing it:
> +		 */
> +		spin_lock_irq(&pidmap_lock);
> +		if (map->page)
> +			kfree(page);
> +		else
> +			map->page = page;
> +		spin_unlock_irq(&pidmap_lock);
> +		if (unlikely(!map->page))
> +			return -1;
> +	}
> +	return 0;
> +}
> +
>  static int alloc_pidmap(struct pid_namespace *pid_ns)
>  {
>  	int i, offset, max_scan, pid, last = pid_ns->last_pid;
> @@ -134,21 +154,8 @@ static int alloc_pidmap(struct pid_names
>  	map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
>  	max_scan = (pid_max + BITS_PER_PAGE - 1)/BITS_PER_PAGE - !offset;
>  	for (i = 0; i <= max_scan; ++i) {
> -		if (unlikely(!map->page)) {
> -			void *page = kzalloc(PAGE_SIZE, GFP_KERNEL);
> -			/*
> -			 * Free the page if someone raced with us
> -			 * installing it:
> -			 */
> -			spin_lock_irq(&pidmap_lock);
> -			if (map->page)
> -				kfree(page);
> -			else
> -				map->page = page;
> -			spin_unlock_irq(&pidmap_lock);
> -			if (unlikely(!map->page))
> -				break;
> -		}
> +		if (unlikely(alloc_pidmap_page(map)))
> +			break;
>  		if (likely(atomic_read(&map->nr_free))) {
>  			do {
>  				if (!test_and_set_bit(offset, map->page)) {
> @@ -182,6 +189,33 @@ static int alloc_pidmap(struct pid_names
>  	return -1;
>  }
> 
> +/*
> + * Return 0 if successful (i.e. next_nr could be assigned as a upid nr).
> + * -errno else
> + */
> +static int alloc_fixed_pidmap(struct pid_namespace *pid_ns, int next_nr)
> +{
> +	int offset;
> +	struct pidmap *map;
> +
> +	if (next_nr < RESERVED_PIDS || next_nr >= pid_max)
> +		return -EINVAL;
> +
> +	map = &pid_ns->pidmap[next_nr / BITS_PER_PAGE];
> +
> +	if (unlikely(alloc_pidmap_page(map)))
> +		return -ENOMEM;
> +
> +	offset = next_nr & BITS_PER_PAGE_MASK;
> +	if (test_and_set_bit(offset, map->page))
> +		return -EBUSY;
> +
> +	atomic_dec(&map->nr_free);
> +	pid_ns->last_pid = max(pid_ns->last_pid, next_nr);
> +
> +	return 0;
> +}
> +
>  int next_pidmap(struct pid_namespace *pid_ns, int last)
>  {
>  	int offset;
> @@ -239,6 +273,24 @@ void free_pid(struct pid *pid)
>  	call_rcu(&pid->rcu, delayed_put_pid);
>  }
> 
> +/*
> + * Sets a predefined upid nr for the process' upper namespace level
> + */
> +static int set_predefined_pid(struct pid_namespace *ns, struct pid *pid,
> +				int next_nr)
> +{
> +	int i = ns->level;
> +	int rc;
> +
> +	rc = alloc_fixed_pidmap(ns, next_nr);
> +	if (rc < 0)
> +		return rc;
> +
> +	pid->numbers[i].nr = next_nr;
> +	pid->numbers[i].ns = ns;
> +	return 0;
> +}
> +
>  struct pid *alloc_pid(struct pid_namespace *ns)
>  {
>  	struct pid *pid;
> @@ -248,14 +300,41 @@ struct pid *alloc_pid(struct pid_namespa
>  	struct upid *upid;
> 
>  	pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
> -	if (!pid)
> +	if (!pid) {
> +		pid = ERR_PTR(-ENOMEM);
>  		goto out;
> +	}
> 
>  	tmp = ns;
> -	for (i = ns->level; i >= 0; i--) {
> +	i = ns->level;
> +	if (unlikely(next_data_set(current))) {
> +		/*
> +		 * There is a upid nr specified, use it instead of letting
> +		 * the kernel choose it for us.
> +		 */
> +		int next_nr = get_next_data(current);
> +		int rc;
> +
> +		reset_next_syscall_data(current);
> +		rc = set_predefined_pid(tmp, pid, next_nr);
> +		if (rc < 0) {
> +			pid = ERR_PTR(rc);
> +			goto out_free;
> +		}
> +		/* Go up one level */
> +		tmp = tmp->parent;
> +		i--;
> +	}
> +
> +	/*
> +	 * Let the lower levels upid nrs be automatically allocated
> +	 */
> +	for ( ; i >= 0; i--) {
>  		nr = alloc_pidmap(tmp);
> -		if (nr < 0)
> +		if (nr < 0) {
> +			pid = ERR_PTR(-ENOMEM);
>  			goto out_free;
> +		}
> 
>  		pid->numbers[i].nr = nr;
>  		pid->numbers[i].ns = tmp;
> @@ -284,7 +363,6 @@ out_free:
>  		free_pidmap(pid->numbers + i);
> 
>  	kmem_cache_free(ns->pid_cachep, pid);
> -	pid = NULL;
>  	goto out;
>  }
> 
> Index: linux-2.6.26-rc8-mm1/kernel/fork.c
> ===================================================================
> --- linux-2.6.26-rc8-mm1.orig/kernel/fork.c	2008-07-08 12:12:39.000000000 +0200
> +++ linux-2.6.26-rc8-mm1/kernel/fork.c	2008-07-08 12:22:47.000000000 +0200
> @@ -1118,10 +1118,11 @@ static struct task_struct *copy_process(
>  		goto bad_fork_cleanup_io;
> 
>  	if (pid != &init_struct_pid) {
> -		retval = -ENOMEM;
>  		pid = alloc_pid(task_active_pid_ns(p));
> -		if (!pid)
> +		if (IS_ERR(pid)) {
> +			retval = PTR_ERR(pid);
>  			goto bad_fork_cleanup_io;
> +		}
> 
>  		if (clone_flags & CLONE_NEWPID) {
>  			retval = pid_ns_prepare_proc(task_active_pid_ns(p));
> Index: linux-2.6.26-rc8-mm1/include/linux/next_syscall_data.h
> ===================================================================
> --- linux-2.6.26-rc8-mm1.orig/include/linux/next_syscall_data.h	2008-07-08 12:12:39.000000000 +0200
> +++ linux-2.6.26-rc8-mm1/include/linux/next_syscall_data.h	2008-07-08 12:22:47.000000000 +0200
> @@ -5,6 +5,7 @@
>   * following is supported today:
>   *    . object creation with a predefined id
>   *         . for a sysv ipc object
> + *         . for a process
>   */
> 
>  #ifndef _LINUX_NEXT_SYSCALL_DATA_H
> @@ -18,6 +19,7 @@
>   * For example, it can be used to pre-set the id of the object to be created
>   * by next syscall. The following syscalls support this feature:
>   *    . msgget(), semget(), shmget()
> + *    . fork(), vfork(), clone()
>   */
>  struct next_syscall_data {
>  	int ndata;
> 
> --

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 4/5] use next syscall data to change the behavior of IPC_SET
       [not found]   ` <20080708112459.231249000-6ktuUTfB/bM@public.gmane.org>
@ 2008-07-08 19:56     ` Serge E. Hallyn
  0 siblings, 0 replies; 32+ messages in thread
From: Serge E. Hallyn @ 2008-07-08 19:56 UTC (permalink / raw)
  To: Nadia.Derbey-6ktuUTfB/bM
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Quoting Nadia.Derbey-6ktuUTfB/bM@public.gmane.org (Nadia.Derbey-6ktuUTfB/bM@public.gmane.org):
> [PATCH 04/05]
> 
> This patch uses the value written into the next_syscall_data proc file
> as a flag to change the way msgctl(IPC_SET), semctl(IPC_SET) and
> shmctl(IPC_SET) behave.
> 
> When "LONG1 1" is echoed to this file, xxxctl(IPC_SET) will set the time
> fields and the pid fields according to what is specified in the input
> parameter (while currently only the permission fields are allowed to be set).
> The following syscalls are impacted:
> . msgctl(IPC_SET)
> . semctl(IPC_SET)
> . shmctl(IPC_SET)
> 
> This makes it easy to restart an ipc object exactly as it was during the
> checkpoint phase.
> 
> Signed-off-by: Nadia Derbey <Nadia.Derbey-6ktuUTfB/bM@public.gmane.org>

Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

thanks,
-serge

> ---
>  include/linux/next_syscall_data.h |   12 ++++++++++++
>  ipc/msg.c                         |   19 ++++++++++++++++++-
>  ipc/sem.c                         |   16 +++++++++++++++-
>  ipc/shm.c                         |   19 ++++++++++++++++++-
>  4 files changed, 63 insertions(+), 3 deletions(-)
> 
> Index: linux-2.6.26-rc8-mm1/include/linux/next_syscall_data.h
> ===================================================================
> --- linux-2.6.26-rc8-mm1.orig/include/linux/next_syscall_data.h	2008-07-08 12:22:47.000000000 +0200
> +++ linux-2.6.26-rc8-mm1/include/linux/next_syscall_data.h	2008-07-08 12:24:29.000000000 +0200
> @@ -6,6 +6,7 @@
>   *    . object creation with a predefined id
>   *         . for a sysv ipc object
>   *         . for a process
> + *    . set more than the usual ipc_perm fields during an IPC_SET operation.
>   */
> 
>  #ifndef _LINUX_NEXT_SYSCALL_DATA_H
> @@ -20,6 +21,10 @@
>   * by next syscall. The following syscalls support this feature:
>   *    . msgget(), semget(), shmget()
>   *    . fork(), vfork(), clone()
> + *
> + * If it is set to a non null value before a call to:
> + *    . msgctl(IPC_SET), semctl(IPC_SET), shmctl(IPC_SET),
> + * this means that we are going to set more than the usual ipc_perms fields.
>   */
>  struct next_syscall_data {
>  	int ndata;
> @@ -35,6 +40,13 @@ struct next_syscall_data {
> 
>  #define get_next_data(tsk)	((tsk)->nsd->data[0])
> 
> +/*
> + * Returns true if next call to xxxctl(IPC_SET) should have a non-default
> + * behavior.
> + */
> +#define ipc_set_all(tsk)	(next_data_set(tsk) ? get_next_data(tsk) : 0)
> +
> +
>  extern ssize_t get_next_syscall_data(struct task_struct *, char *, size_t);
>  extern int set_next_syscall_data(struct task_struct *, char *);
>  extern void reset_next_syscall_data(struct task_struct *);
> Index: linux-2.6.26-rc8-mm1/ipc/msg.c
> ===================================================================
> --- linux-2.6.26-rc8-mm1.orig/ipc/msg.c	2008-07-08 12:12:36.000000000 +0200
> +++ linux-2.6.26-rc8-mm1/ipc/msg.c	2008-07-08 12:26:03.000000000 +0200
> @@ -446,7 +446,24 @@ static int msgctl_down(struct ipc_namesp
>  		msq->q_qbytes = msqid64.msg_qbytes;
> 
>  		ipc_update_perm(&msqid64.msg_perm, ipcp);
> -		msq->q_ctime = get_seconds();
> +		if (unlikely(ipc_set_all(current))) {
> +			/*
> +			 * If this field is set in the task struct, this
> +			 * means that we want to set more than the usual
> +			 * fields. Particularly useful to restart a msgq
> +			 * in the same state as it was before being
> +			 * checkpointed.
> +			 */
> +			msq->q_stime = msqid64.msg_stime;
> +			msq->q_rtime = msqid64.msg_rtime;
> +			msq->q_ctime = msqid64.msg_ctime;
> +			msq->q_lspid = msqid64.msg_lspid;
> +			msq->q_lrpid = msqid64.msg_lrpid;
> +
> +			reset_next_syscall_data(current);
> +		} else
> +			msq->q_ctime = get_seconds();
> +
>  		/* sleeping receivers might be excluded by
>  		 * stricter permissions.
>  		 */
> Index: linux-2.6.26-rc8-mm1/ipc/sem.c
> ===================================================================
> --- linux-2.6.26-rc8-mm1.orig/ipc/sem.c	2008-07-08 12:12:36.000000000 +0200
> +++ linux-2.6.26-rc8-mm1/ipc/sem.c	2008-07-08 12:27:06.000000000 +0200
> @@ -874,7 +874,21 @@ static int semctl_down(struct ipc_namesp
>  		goto out_up;
>  	case IPC_SET:
>  		ipc_update_perm(&semid64.sem_perm, ipcp);
> -		sma->sem_ctime = get_seconds();
> +
> +		if (unlikely(ipc_set_all(current))) {
> +			/*
> +			 * If this field is set in the task struct, this
> +			 * means that we want to set more than the usual
> +			 * fields. Particularly useful to restart a semaphore
> +			 * in the same state as it was before being
> +			 * checkpointed.
> +			 */
> +			sma->sem_ctime = semid64.sem_ctime;
> +			sma->sem_otime = semid64.sem_otime;
> +
> +			reset_next_syscall_data(current);
> +		} else
> +			sma->sem_ctime = get_seconds();
>  		break;
>  	default:
>  		err = -EINVAL;
> Index: linux-2.6.26-rc8-mm1/ipc/shm.c
> ===================================================================
> --- linux-2.6.26-rc8-mm1.orig/ipc/shm.c	2008-07-08 12:12:36.000000000 +0200
> +++ linux-2.6.26-rc8-mm1/ipc/shm.c	2008-07-08 12:27:32.000000000 +0200
> @@ -609,7 +609,24 @@ static int shmctl_down(struct ipc_namesp
>  		goto out_up;
>  	case IPC_SET:
>  		ipc_update_perm(&shmid64.shm_perm, ipcp);
> -		shp->shm_ctim = get_seconds();
> +
> +		if (unlikely(ipc_set_all(current))) {
> +			/*
> +			 * If this field is set in the task struct, this
> +			 * means that we want to set more than the usual
> +			 * fields. Particularly useful to restart a shm seg
> +			 * in the same state as it was before being
> +			 * checkpointed.
> +			 */
> +			shp->shm_atim = shmid64.shm_atime;
> +			shp->shm_dtim = shmid64.shm_dtime;
> +			shp->shm_ctim = shmid64.shm_ctime;
> +			shp->shm_cprid = shmid64.shm_cpid;
> +			shp->shm_lprid = shmid64.shm_lpid;
> +
> +			reset_next_syscall_data(current);
> +		} else
> +			shp->shm_ctim = get_seconds();
>  		break;
>  	default:
>  		err = -EINVAL;
> 
> --

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 5/5] use next syscall data to predefine the file descriptor value
       [not found]   ` <20080708112459.632357000-6ktuUTfB/bM@public.gmane.org>
@ 2008-07-08 20:14     ` Serge E. Hallyn
       [not found]       ` <20080708201452.GE22904-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
  2008-07-10  0:32     ` Eric W. Biederman
  1 sibling, 1 reply; 32+ messages in thread
From: Serge E. Hallyn @ 2008-07-08 20:14 UTC (permalink / raw)
  To: Nadia.Derbey-6ktuUTfB/bM, Kathy Staples
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Quoting Nadia.Derbey-6ktuUTfB/bM@public.gmane.org (Nadia.Derbey-6ktuUTfB/bM@public.gmane.org):
> [PATCH 05/05]
> 
> This patch uses the value written into the next_syscall_data proc file
> as a target file descriptor for the next file to be opened.
> 
> This makes it easy to restart a process with the same fds as the ones it was
> using during the checkpoint phase, instead of 1. opening the file, 2. dup2'ing
> the open file descriptor.
> 
> The following syscalls are impacted if next_syscall_data is set:
> . open()
> . openat()

Oh, neat, I somehow missed the fact that you had this in your previous
posting  :)

> Signed-off-by: Nadia Derbey <Nadia.Derbey-6ktuUTfB/bM@public.gmane.org>

It'd be nice if get_predefined_fd_flags() could share a helper
with get_unused_fd_flags() (in particular because the "/* Sanity check */"
at the end sits inside an '#if 1', which sounds like it may one day be
removed), but I'm not sure offhand of the best way to do that.  So for now

Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>

Thanks, Nadia.

Kathy, I'd love to see a -lxc release with this patchset so we can test
it with cryo.

Suka, the open with specified id here might help you simplify your pipe
c/r patches for cryo?

-serge

> ---
>  fs/open.c |   62 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 61 insertions(+), 1 deletion(-)
> 
> Index: linux-2.6.26-rc8-mm1/fs/open.c
> ===================================================================
> --- linux-2.6.26-rc8-mm1.orig/fs/open.c	2008-07-08 12:12:34.000000000 +0200
> +++ linux-2.6.26-rc8-mm1/fs/open.c	2008-07-08 13:23:03.000000000 +0200
> @@ -974,6 +974,59 @@ struct file *dentry_open(struct dentry *
>  EXPORT_SYMBOL(dentry_open);
> 
>  /*
> + * Marks a given file descriptor entry as busy (should not be busy when this
> + * routine is called).
> + *
> + * files->next_fd is not updated: this lets the potentially created hole be
> + * filled up on next calls to get_unused_fd_flags.
> + *
> + * Returns the specified fd if successful, -errno else.
> + */
> +static int get_predefined_fd_flags(int fd, int flags)
> +{
> +	struct files_struct *files = current->files;
> +	int error;
> +	struct fdtable *fdt;
> +
> +	error = -EINVAL;
> +	if (fd < 0)
> +		goto out;
> +
> +	error = -EMFILE;
> +	if (fd >= current->signal->rlim[RLIMIT_NOFILE].rlim_cur)
> +		goto out;
> +
> +	spin_lock(&files->file_lock);
> +	fdt = files_fdtable(files);
> +
> +	error = expand_files(files, fd);
> +	if (error < 0)
> +		goto out_unlock;
> +
> +	error = -EBUSY;
> +	if (FD_ISSET(fd, fdt->open_fds))
> +		goto out_unlock;
> +
> +	FD_SET(fd, fdt->open_fds);
> +	if (flags & O_CLOEXEC)
> +		FD_SET(fd, fdt->close_on_exec);
> +	else
> +		FD_CLR(fd, fdt->close_on_exec);
> +
> +	/* Sanity check */
> +	if (fdt->fd[fd] != NULL) {
> +		printk(KERN_WARNING "get_unused_fd: slot %d not NULL!\n", fd);
> +		fdt->fd[fd] = NULL;
> +	}
> +
> +	error = fd;
> +out_unlock:
> +	spin_unlock(&files->file_lock);
> +out:
> +	return error;
> +}
> +
> +/*
>   * Find an empty file descriptor entry, and mark it busy.
>   */
>  int get_unused_fd_flags(int flags)
> @@ -1088,7 +1141,14 @@ long do_sys_open(int dfd, const char __u
>  	int fd = PTR_ERR(tmp);
> 
>  	if (!IS_ERR(tmp)) {
> -		fd = get_unused_fd_flags(flags);
> +		if (unlikely(next_data_set(current))) {
> +			int next_fd = get_next_data(current);
> +
> +			fd = get_predefined_fd_flags(next_fd, flags);
> +			reset_next_syscall_data(current);
> +		} else
> +			fd = get_unused_fd_flags(flags);
> +
>  		if (fd >= 0) {
>  			struct file *f = do_filp_open(dfd, tmp, flags, mode);
>  			if (IS_ERR(f)) {
> 
> --

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 5/5] use next syscall data to predefine the file descriptor value
       [not found]       ` <20080708201452.GE22904-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
@ 2008-07-09  5:00         ` kathys
       [not found]           ` <487445E4.6060107-8fk3Idey6ehBDgjK7y7TUQ@public.gmane.org>
  0 siblings, 1 reply; 32+ messages in thread
From: kathys @ 2008-07-09  5:00 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Kathy Staples,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Nadia.Derbey-6ktuUTfB/bM

Hi Nadia,

I am trying, with great difficulty, to incorporate these patches into the
existing lxc-tree on 2.6.26-rc8-mm1-lxc1; they conflict with a
number of other patches from checkpoint/. Serge has asked me to include
them in the next lxc release, so I need to know how to make them fit.

I will put out 2.6.26-rc8-mm1-lxc1 without your patches because it's
taking me too long. I will endeavor to include them in
2.6.26-rc8-mm1-lxc2, so could you have a look at them against the
next release of lxc, which I hope to get out by tomorrow (Thursday)
afternoon?

Thanks,

Kathy

Serge E. Hallyn wrote:
> Quoting Nadia.Derbey-6ktuUTfB/bM@public.gmane.org (Nadia.Derbey-6ktuUTfB/bM@public.gmane.org):
>   
>> [PATCH 05/05]
>>
>> This patch uses the value written into the next_syscall_data proc file
>> as a target file descriptor for the next file to be opened.
>>
>> This makes it easy to restart a process with the same fds as the ones it was
>> using during the checkpoint phase, instead of 1. opening the file, 2. dup2'ing
>> the open file descriptor.
>>
>> The following syscalls are impacted if next_syscall_data is set:
>> . open()
>> . openat()
>>     
>
> Oh, neat, I somehow missed the fact that you had this in your previous
> posting  :)
>
>   
>> Signed-off-by: Nadia Derbey <Nadia.Derbey-6ktuUTfB/bM@public.gmane.org>
>>     
>
> It'd be nice if get_predefined_fd_flags() could share a helper
> with get_unused_fd_flags() (in particular because the "/* Sanity check */"
> at the end sits inside an '#if 1', which sounds like it may one day be
> removed), but I'm not sure offhand of the best way to do that.  So for now
>
> Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
>
> Thanks, Nadia.
>
> Kathy, I'd love to see a -lxc release with this patchset so we can test
> it with cryo.
>
> Suka, the open with specified id here might help you simplify your pipe
> c/r patches for cryo?
>
> -serge
>
>   
>> ---
>>  fs/open.c |   62 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 61 insertions(+), 1 deletion(-)
>>
>> Index: linux-2.6.26-rc8-mm1/fs/open.c
>> ===================================================================
>> --- linux-2.6.26-rc8-mm1.orig/fs/open.c	2008-07-08 12:12:34.000000000 +0200
>> +++ linux-2.6.26-rc8-mm1/fs/open.c	2008-07-08 13:23:03.000000000 +0200
>> @@ -974,6 +974,59 @@ struct file *dentry_open(struct dentry *
>>  EXPORT_SYMBOL(dentry_open);
>>
>>  /*
>> + * Marks a given file descriptor entry as busy (should not be busy when this
>> + * routine is called).
>> + *
>> + * files->next_fd is not updated: this lets the potentially created hole be
>> + * filled up on next calls to get_unused_fd_flags.
>> + *
>> + * Returns the specified fd if successful, -errno else.
>> + */
>> +static int get_predefined_fd_flags(int fd, int flags)
>> +{
>> +	struct files_struct *files = current->files;
>> +	int error;
>> +	struct fdtable *fdt;
>> +
>> +	error = -EINVAL;
>> +	if (fd < 0)
>> +		goto out;
>> +
>> +	error = -EMFILE;
>> +	if (fd >= current->signal->rlim[RLIMIT_NOFILE].rlim_cur)
>> +		goto out;
>> +
>> +	spin_lock(&files->file_lock);
>> +	fdt = files_fdtable(files);
>> +
>> +	error = expand_files(files, fd);
>> +	if (error < 0)
>> +		goto out_unlock;
>> +
>> +	error = -EBUSY;
>> +	if (FD_ISSET(fd, fdt->open_fds))
>> +		goto out_unlock;
>> +
>> +	FD_SET(fd, fdt->open_fds);
>> +	if (flags & O_CLOEXEC)
>> +		FD_SET(fd, fdt->close_on_exec);
>> +	else
>> +		FD_CLR(fd, fdt->close_on_exec);
>> +
>> +	/* Sanity check */
>> +	if (fdt->fd[fd] != NULL) {
>> +		printk(KERN_WARNING "get_unused_fd: slot %d not NULL!\n", fd);
>> +		fdt->fd[fd] = NULL;
>> +	}
>> +
>> +	error = fd;
>> +out_unlock:
>> +	spin_unlock(&files->file_lock);
>> +out:
>> +	return error;
>> +}
>> +
>> +/*
>>   * Find an empty file descriptor entry, and mark it busy.
>>   */
>>  int get_unused_fd_flags(int flags)
>> @@ -1088,7 +1141,14 @@ long do_sys_open(int dfd, const char __u
>>  	int fd = PTR_ERR(tmp);
>>
>>  	if (!IS_ERR(tmp)) {
>> -		fd = get_unused_fd_flags(flags);
>> +		if (unlikely(next_data_set(current))) {
>> +			int next_fd = get_next_data(current);
>> +
>> +			fd = get_predefined_fd_flags(next_fd, flags);
>> +			reset_next_syscall_data(current);
>> +		} else
>> +			fd = get_unused_fd_flags(flags);
>> +
>>  		if (fd >= 0) {
>>  			struct file *f = do_filp_open(dfd, tmp, flags, mode);
>>  			if (IS_ERR(f)) {
>>
>> --
>>     
>
>   

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Devel] [RFC PATCH 0/5] Resend -v2 - Use procfs to change a syscall behavior
       [not found] ` <20080708112422.164370000-6ktuUTfB/bM@public.gmane.org>
@ 2008-07-09 22:10   ` Alexey Dobriyan
       [not found]     ` <20080709221028.GA4926-QDJVlCTZ4KWTKS93B3g+7KFoa47nwP16@public.gmane.org>
  2008-07-10  0:36   ` Eric W. Biederman
  1 sibling, 1 reply; 32+ messages in thread
From: Alexey Dobriyan @ 2008-07-09 22:10 UTC (permalink / raw)
  To: Nadia.Derbey-6ktuUTfB/bM
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On Tue, Jul 08, 2008 at 01:24:22PM +0200, Nadia.Derbey-6ktuUTfB/bM@public.gmane.org wrote:
> # echo "LONG1 XX" > /proc/self/task/<my_tid>/next_syscall_data

Same stuff.

There is struct task_struct::did_exec , what about it?

Also, the patches are about de-serializing; what does serializing from userspace look like?
You froze a group of processes, then what?

How, for example, do you dump all VMAs correctly?
[prepares counter-example]

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 3/5] use next syscall data to predefine process ids
       [not found]   ` <20080708112458.946320000-6ktuUTfB/bM@public.gmane.org>
  2008-07-08 19:49     ` Serge E. Hallyn
@ 2008-07-10  0:27     ` Eric W. Biederman
       [not found]       ` <m1hcayfusi.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
  1 sibling, 1 reply; 32+ messages in thread
From: Eric W. Biederman @ 2008-07-10  0:27 UTC (permalink / raw)
  To: Nadia.Derbey-6ktuUTfB/bM
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Nadia.Derbey-6ktuUTfB/bM@public.gmane.org writes:

> [PATCH 03/05]
>
> This patch uses the value written into the next_syscall_data proc file
> as a target upid nr for the next process to be created.
> The following syscalls have a new behavior if next_syscall_data is set:
> . fork()
> . vfork()
> . clone()
>
> In the current version, if the process belongs to nested namespaces, only
> the upper namespace level upid nr is allowed to be predefined, since there
> is not yet a way to take a snapshot of upid nrs at all namespaces levels.
>
> But this can easily be extended in the future.

This patch is unnecessary, and a mess.  The existing limits on the pid range should
be enough.  We may need to export it via /proc/sys.

Eric

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 5/5] use next syscall data to predefine the file descriptor value
       [not found]   ` <20080708112459.632357000-6ktuUTfB/bM@public.gmane.org>
  2008-07-08 20:14     ` Serge E. Hallyn
@ 2008-07-10  0:32     ` Eric W. Biederman
       [not found]       ` <m1tzeyefz9.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
  1 sibling, 1 reply; 32+ messages in thread
From: Eric W. Biederman @ 2008-07-10  0:32 UTC (permalink / raw)
  To: Nadia.Derbey-6ktuUTfB/bM
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Nadia.Derbey-6ktuUTfB/bM@public.gmane.org writes:

> [PATCH 05/05]
>
> This patch uses the value written into the next_syscall_data proc file
> as a target file descriptor for the next file to be opened.
>
> This makes it easy to restart a process with the same fds as the ones it was
> using during the checkpoint phase, instead of 1. opening the file, 2. dup2'ing
> the open file descriptor.

As it happens the behavior of open is deterministic.  So if you open
the files in the right order you should not need this.  dup2 is only needed
if there is a gap in the fds used.
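
A rough userspace sketch of that argument, for illustration only (it is not
code from the patchset, and open_at_fd() is a made-up helper name): open()
returns the lowest free descriptor, so a restart tool that replays its opens
in ascending fd order lands on the right numbers by itself and needs dup2()
only to jump over a gap.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: open 'path' and make sure it ends up on
 * descriptor 'target'.  Returns 'target' on success, -1 on error.
 *
 * Caveats: dup2() silently closes 'target' if it was already open,
 * and the duplicated descriptor does not inherit close-on-exec.
 */
int open_at_fd(const char *path, int flags, int target)
{
	int fd = open(path, flags);

	if (fd < 0)
		return -1;

	/* No gap: open() already returned the wanted number. */
	if (fd == target)
		return fd;

	/* Gap in the fd table: move the descriptor to its final slot. */
	if (dup2(fd, target) < 0) {
		close(fd);
		return -1;
	}
	close(fd);
	return target;
}
```

A restart loop would call this once per saved descriptor, lowest number
first, e.g. open_at_fd("/dev/null", O_RDONLY, 40).
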

Eric

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 0/5] Resend -v2 - Use procfs to change a syscall behavior
       [not found] ` <20080708112422.164370000-6ktuUTfB/bM@public.gmane.org>
  2008-07-09 22:10   ` [Devel] [RFC PATCH 0/5] Resend -v2 - Use procfs to change a syscall behavior Alexey Dobriyan
@ 2008-07-10  0:36   ` Eric W. Biederman
       [not found]     ` <m1lk0aefs1.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
  1 sibling, 1 reply; 32+ messages in thread
From: Eric W. Biederman @ 2008-07-10  0:36 UTC (permalink / raw)
  To: Nadia.Derbey-6ktuUTfB/bM
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Nadia.Derbey-6ktuUTfB/bM@public.gmane.org writes:

> This patchset is a part of an effort to change some syscalls behavior for
> checkpoint restart.

Thanks for doing this.

Unfortunately this makes a very good case for why we don't want to go down
this route: adding magic parameters to syscalls that are only useful
in one very specific restart case.

We need good clean interfaces with well defined semantics.

Something as narrowly focused as this is not really useful, and it takes
a lot of code to do something very few people will want to actively
do.

> The syntax is:
> # echo "LONG1 XX" > /proc/self/task/<my_tid>/next_syscall_data
>      next object to be created will have an id set to XX

Which is horrible in another way because it is hugely race prone.
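
For readers following along, the two-step shape under discussion can be
sketched in userspace as below. This is only an illustration: the proc file
and the "LONG1 XX" syntax come from the cover letter and exist only on a
kernel carrying this RFC patchset, and both function names are hypothetical.
The value is armed by a write and stays pending until a later syscall in the
same thread consumes it, which is one plausible reading of the race concern.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: build the per-thread proc path from the cover letter. */
int next_syscall_data_path(char *buf, size_t len, long tid)
{
	int n = snprintf(buf, len, "/proc/self/task/%ld/next_syscall_data", tid);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper, step 1 of 2: request that the next id-creating
 * syscall of the calling thread use 'id'.  Step 2 is the syscall itself
 * (fork(), msgget(), open(), ...), issued separately by the caller.
 *
 * Returns 0 on success, -1 on error.  On a kernel without the RFC
 * patches the file does not exist, so this always returns -1.
 */
int set_next_id(long id)
{
	char path[64];
	FILE *f;
	int ok;

	if (next_syscall_data_path(path, sizeof(path), syscall(SYS_gettid)) < 0)
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	ok = fprintf(f, "LONG1 %ld", id) > 0;
	if (fclose(f) != 0)	/* the buffered write reaches the kernel here */
		ok = 0;
	return ok ? 0 : -1;
}
```
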

Eric

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Devel] [RFC PATCH 0/5] Resend -v2 - Use procfs to change a syscall behavior
       [not found]     ` <20080709221028.GA4926-QDJVlCTZ4KWTKS93B3g+7KFoa47nwP16@public.gmane.org>
@ 2008-07-10  0:43       ` Eric W. Biederman
       [not found]         ` <m1tzeyd0x3.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
  2008-07-10 16:01       ` Dave Hansen
  1 sibling, 1 reply; 32+ messages in thread
From: Eric W. Biederman @ 2008-07-10  0:43 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Nadia.Derbey-6ktuUTfB/bM

Alexey Dobriyan <adobriyan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes:

> On Tue, Jul 08, 2008 at 01:24:22PM +0200, Nadia.Derbey-6ktuUTfB/bM@public.gmane.org wrote:
>> # echo "LONG1 XX" > /proc/self/task/<my_tid>/next_syscall_data
>
> Same stuff.
>
> There is struct task_struct::did_exec , what about it?
>
> Also, the patches are about de-serializing; what does serializing from
> userspace look like?
> You froze a group of processes, then what?
>
> How, for example, do you dump all VMAs correctly?
> [prepares counter-example]

Alexey, userspace vs. a kernel space implementation is the wrong argument.

It is clearly established that the current user space interfaces are
insufficient to do the job.  So we need to implement something in the kernel.

Further, I have heard of no one suggesting running a single kernel on multiple
machines.  Therefore no one seems to be doing this entirely in the kernel,
and so we need a user space component.

So the question should not be user space vs. kernel space but can we build clean
interfaces for checkpoint/restart?  What will those interfaces be?

That said, I think it is good that we are seeing more people play with this, as
that should mean that our pool of people for doing code review on the implementation
should be reasonable.

Eric

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Devel] [RFC PATCH 0/5] Resend -v2 - Use procfs to change a syscall behavior
       [not found]         ` <m1tzeyd0x3.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
@ 2008-07-10  1:39           ` Alexey Dobriyan
       [not found]             ` <20080710013915.GB8327-QDJVlCTZ4KWTKS93B3g+7KFoa47nwP16@public.gmane.org>
  2008-07-17 22:42           ` Oren Laadan
  1 sibling, 1 reply; 32+ messages in thread
From: Alexey Dobriyan @ 2008-07-10  1:39 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Nadia.Derbey-6ktuUTfB/bM

On Wed, Jul 09, 2008 at 05:43:04PM -0700, Eric W. Biederman wrote:
> Alexey Dobriyan <adobriyan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes:
> 
> > On Tue, Jul 08, 2008 at 01:24:22PM +0200, Nadia.Derbey-6ktuUTfB/bM@public.gmane.org wrote:
> >> # echo "LONG1 XX" > /proc/self/task/<my_tid>/next_syscall_data
> >
> > Same stuff.
> >
> > There is struct task_struct::did_exec , what about it?
> >
> > Also, the patches are about de-serializing; what does serializing from
> > userspace look like?
> > You froze a group of processes, then what?
> >
> > How, for example, do you dump all VMAs correctly?
> > [prepares counter-example]
> 
> Alexey, userspace vs. a kernel space implementation is the wrong argument.
> 
> It is clearly established that the current user space interfaces are
> insufficient to do the job.  So we need to implement something in the kernel.
> 
> Further, I have heard of no one suggesting running a single kernel on multiple
> machines.  Therefore no one seems to be doing this entirely in the kernel,
> and so we need a user space component.
> 
> So the question should not be user space vs. kernel space but can we build clean
> interfaces for checkpoint/restart?

> What will those interfaces be?

In case of ->did_exec the only clean interface I see is:

	tsk->did_exec = !!tsk_img->did_exec;

It would be pretty silly to wrap this one line in a system call (two
actually -- one in, one out), since you're going to restore some more
fields of such variety anyway (like ->pdeath_signal).

Given the diversity of kernel internal data structures and all sorts of
links between them, the only system call suitable is ioctl(2), not all
this zoo of system calls proposed. They are all extendable and without
rules, but ioctl(2) is also without rules.


This is all said on the assumption that serializing kernel-internal data for
checkpoint/restart to userspace is acceptable for mainline.
I don't think it is.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [Devel] [RFC PATCH 0/5] Resend -v2 - Use procfs to change a syscall behavior
       [not found]             ` <20080710013915.GB8327-QDJVlCTZ4KWTKS93B3g+7KFoa47nwP16@public.gmane.org>
@ 2008-07-10  2:14               ` Eric W. Biederman
  2008-07-15 18:18               ` Eric W. Biederman
  1 sibling, 0 replies; 32+ messages in thread
From: Eric W. Biederman @ 2008-07-10  2:14 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Nadia.Derbey-6ktuUTfB/bM

Alexey Dobriyan <adobriyan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes:

> In case of ->did_exec the only clean interface I see is:
>
> 	tsk->did_exec = !!tsk_img->did_exec;
>
> It would be pretty silly to wrap this one line in a system call (two
> actually -- one in, one out), since you're going to restore some more
> fields of such variety anyway (like ->pdeath_signal).

There I agree the granularity seems small enough to be a major pain
for the implementation.

> Given the diversity of kernel internal data structures and all sorts of
> links between them, the only system call suitable is ioctl(2), not all
> this zoo of system calls proposed. They are all extendable and without
> rules, but ioctl(2) is also without rules.

At least for processes my gut reaction is to look at binary formats
and coredumps.  Something with at least that large of a granularity seems
to make most sense.

> This is all said on the assumption that serializing kernel-internal data for
> checkpoint/restart to userspace is acceptable for mainline.
> I don't think it is.

I don't believe that serializing kernel-internal data is acceptable
for mainline.  I believe that serializing user-visible data is acceptable.
Note: user-visible data does not mean user-manipulatable data.

On a socket you may not save the skbs but you can save the pending packets
for example.  Assuming the transition cost is not too high.

Eric

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC PATCH 5/5] use next syscall data to predefine the file descriptor value
       [not found]           ` <487445E4.6060107-8fk3Idey6ehBDgjK7y7TUQ@public.gmane.org>
@ 2008-07-10  6:12             ` Nadia Derbey
       [not found]               ` <4875A849.1030206-6ktuUTfB/bM@public.gmane.org>
  0 siblings, 1 reply; 32+ messages in thread
From: Nadia Derbey @ 2008-07-10  6:12 UTC (permalink / raw)
  To: kathys
  Cc: Kathy Staples,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

kathys wrote:
> Hi Nadia,
> 
> I am trying, with great difficulty, to incorporate these patches into the
> existing lxc-tree on 2.6.26-rc8-mm1-lxc1; they conflict with a
> number of other patches from checkpoint/.

Kathy,

Is it the same problem as the one we have solved by private e-mail?

Regards,
Nadia

> Serge has asked me to include
> them in the next lxc release, so I need to know how to make them fit.
>
> I will put out 2.6.26-rc8-mm1-lxc1 without your patches because it's
> taking me too long. I will endeavor to include them in
> 2.6.26-rc8-mm1-lxc2, so could you have a look at them against the
> next release of lxc, which I hope to get out by tomorrow (Thursday)
> afternoon?
> 
> Thanks,
> 
> Kathy
> 
> Serge E. Hallyn wrote:
> 
>> Quoting Nadia.Derbey-6ktuUTfB/bM@public.gmane.org (Nadia.Derbey-6ktuUTfB/bM@public.gmane.org):
>>  
>>
>>> [PATCH 05/05]
>>>
>>> This patch uses the value written into the next_syscall_data proc file
>>> as a target file descriptor for the next file to be opened.
>>>
>>> This makes it easy to restart a process with the same fds as the ones 
>>> it was
>>> using during the checkpoint phase, instead of 1. opening the file, 2. 
>>> dup2'ing
>>> the open file descriptor.
>>>
>>> The following syscalls are impacted if next_syscall_data is set:
>>> . open()
>>> . openat()
>>>     
>>
>>
>> Oh, neat, I somehow missed the fact that you had this in your previous
>> posting  :)
>>
>>  
>>
>>> Signed-off-by: Nadia Derbey <Nadia.Derbey-6ktuUTfB/bM@public.gmane.org>
>>>     
>>
>>
>> It'd be nice if get_predefined_fd_flags() could share a helper
>> with get_unused_fd_flags() (in particular because the "/* Sanity check */"
>> at the end sits inside an '#if 1', which sounds like it may one day be
>> removed), but I'm not sure offhand of the best way to do that.  So for now
>>
>> Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
>>
>> Thanks, Nadia.
>>
>> Kathy, I'd love to see a -lxc release with this patchset so we can test
>> it with cryo.
>>
>> Suka, the open with specified id here might help you simplify your pipe
>> c/r patches for cryo?
>>
>> -serge
>>
>>  
>>
>>> ---
>>>  fs/open.c |   62 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>>>  1 file changed, 61 insertions(+), 1 deletion(-)
>>>
>>> Index: linux-2.6.26-rc8-mm1/fs/open.c
>>> ===================================================================
>>> --- linux-2.6.26-rc8-mm1.orig/fs/open.c    2008-07-08 12:12:34.000000000 +0200
>>> +++ linux-2.6.26-rc8-mm1/fs/open.c    2008-07-08 13:23:03.000000000 +0200
>>> @@ -974,6 +974,59 @@ struct file *dentry_open(struct dentry *
>>>  EXPORT_SYMBOL(dentry_open);
>>>
>>>  /*
>>> + * Marks a given file descriptor entry as busy (should not be busy when this
>>> + * routine is called).
>>> + *
>>> + * files->next_fd is not updated: this lets the potentially created hole be
>>> + * filled up on next calls to get_unused_fd_flags.
>>> + *
>>> + * Returns the specified fd if successful, -errno else.
>>> + */
>>> +static int get_predefined_fd_flags(int fd, int flags)
>>> +{
>>> +    struct files_struct *files = current->files;
>>> +    int error;
>>> +    struct fdtable *fdt;
>>> +
>>> +    error = -EINVAL;
>>> +    if (fd < 0)
>>> +        goto out;
>>> +
>>> +    error = -EMFILE;
>>> +    if (fd >= current->signal->rlim[RLIMIT_NOFILE].rlim_cur)
>>> +        goto out;
>>> +
>>> +    spin_lock(&files->file_lock);
>>> +    fdt = files_fdtable(files);
>>> +
>>> +    error = expand_files(files, fd);
>>> +    if (error < 0)
>>> +        goto out_unlock;
>>> +
>>> +    error = -EBUSY;
>>> +    if (FD_ISSET(fd, fdt->open_fds))
>>> +        goto out_unlock;
>>> +
>>> +    FD_SET(fd, fdt->open_fds);
>>> +    if (flags & O_CLOEXEC)
>>> +        FD_SET(fd, fdt->close_on_exec);
>>> +    else
>>> +        FD_CLR(fd, fdt->close_on_exec);
>>> +
>>> +    /* Sanity check */
>>> +    if (fdt->fd[fd] != NULL) {
>>> +        printk(KERN_WARNING "get_unused_fd: slot %d not NULL!\n", fd);
>>> +        fdt->fd[fd] = NULL;
>>> +    }
>>> +
>>> +    error = fd;
>>> +out_unlock:
>>> +    spin_unlock(&files->file_lock);
>>> +out:
>>> +    return error;
>>> +}
>>> +
>>> +/*
>>>   * Find an empty file descriptor entry, and mark it busy.
>>>   */
>>>  int get_unused_fd_flags(int flags)
>>> @@ -1088,7 +1141,14 @@ long do_sys_open(int dfd, const char __u
>>>      int fd = PTR_ERR(tmp);
>>>
>>>      if (!IS_ERR(tmp)) {
>>> -        fd = get_unused_fd_flags(flags);
>>> +        if (unlikely(next_data_set(current))) {
>>> +            int next_fd = get_next_data(current);
>>> +
>>> +            fd = get_predefined_fd_flags(next_fd, flags);
>>> +            reset_next_syscall_data(current);
>>> +        } else
>>> +            fd = get_unused_fd_flags(flags);
>>> +
>>>          if (fd >= 0) {
>>>              struct file *f = do_filp_open(dfd, tmp, flags, mode);
>>>              if (IS_ERR(f)) {
>>>
>>> -- 
>>>     
>>
>>
>>   
> 
> 
> 
> 


-- 
===============================================================
Name.......... Nadia DERBEY
Organization.. BULL/DT/OSwR&D/Linux
---------------------------------------------------------------
Email......... mailto:Nadia.Derbey-6ktuUTfB/bM@public.gmane.org
Address....... BULL, B.P. 208, 38432 Echirolles Cedex, France
Tel........... (33) 76 29 77 62 [Internal Bull: (229) 77 62]
Telex,Fax..... 980648 F - (33) 76 29 76 00
Internal Bull. Mail: FREC-B1208
===============================================================


* Re: [RFC PATCH 5/5] use next syscall data to predefine the file descriptor value
       [not found]       ` <m1tzeyefz9.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
@ 2008-07-10  6:25         ` Nadia Derbey
  0 siblings, 0 replies; 32+ messages in thread
From: Nadia Derbey @ 2008-07-10  6:25 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Eric W. Biederman wrote:
> Nadia.Derbey-6ktuUTfB/bM@public.gmane.org writes:
> 
> 
>>[PATCH 05/05]
>>
>>This patch uses the value written into the next_syscall_data proc file
>>as a target file descriptor for the next file to be opened.
>>
>>This makes it easy to restart a process with the same fds as the ones it was
>>using during the checkpoint phase, instead of 1. opening the file, 2. dup2'ing
>>the open file descriptor.
> 
> 
> As it happens the behavior of open is deterministic.  So if you open
> the files in the right order you should not need this.  dup2 is only needed
> if there is a gap in the fds used.
> 

This covers the case where you're checkpointing a process that has
1. opened, say 3 files (fds x, x+1, and x+2)
2. closed fd x+1
--> checkpoint occurs at that point.

During restart, you'll have to only recreate fds x and x+2.

But I'm realizing that this might be what you're calling a gap in the 
fds ;-)
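
To make the scenario concrete, here is a minimal userspace sketch of
what a restart has to do today without the patch: open the file, then
dup2() it onto the wanted descriptor. It relies only on open() returning
the lowest free descriptor. The helper names open_on_fd() and
replay_fd_gap() are made up for illustration and are not part of the
patchset.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of the patch): open `path` and make sure
 * the result lives on descriptor `target`, using the classic open + dup2.
 * Unlike the patch, dup2() silently replaces `target` if it is already
 * open instead of failing with -EBUSY.
 */
static int open_on_fd(const char *path, int flags, int target)
{
	int fd = open(path, flags);

	if (fd < 0)
		return -1;
	if (fd == target)
		return fd;		/* open() already picked the right slot */
	if (dup2(fd, target) < 0) {
		int saved = errno;

		close(fd);
		errno = saved;
		return -1;
	}
	close(fd);			/* drop the temporary descriptor */
	return target;
}

/*
 * Replays the scenario: returns 0 if, after the simulated restart, the
 * two surviving descriptors are open and the hole is still a hole.
 */
static int replay_fd_gap(void)
{
	int x, hole, y, ok;

	/* "Checkpointed" process: three opens, then close the middle one. */
	x = open("/dev/null", O_RDONLY);
	hole = open("/dev/null", O_RDONLY);
	y = open("/dev/null", O_RDONLY);
	if (x < 0 || hole < 0 || y < 0)
		return -1;
	/* open() hands out the lowest free fd, so the order is fixed. */
	assert(x < hole && hole < y);
	close(hole);

	/* Forget everything, as a freshly restarted process would. */
	close(x);
	close(y);

	/* "Restart": only the two surviving descriptors are recreated. */
	if (open_on_fd("/dev/null", O_RDONLY, x) != x)
		return -1;
	if (open_on_fd("/dev/null", O_RDONLY, y) != y)
		return -1;

	ok = fcntl(x, F_GETFD) != -1 &&
	     fcntl(y, F_GETFD) != -1 &&
	     fcntl(hole, F_GETFD) == -1;	/* gap preserved */

	close(x);
	close(y);
	return ok ? 0 : -1;
}
```

The second open_on_fd() call is where the gap shows up: open() lands on
the hole, so the descriptor has to be moved with dup2(). With the patch,
that open would be directed to the target fd in one step.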

Regards,
Nadia


* Re: [RFC PATCH 3/5] use next syscall data to predefine process ids
       [not found]       ` <m1hcayfusi.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
@ 2008-07-10  8:32         ` Nadia Derbey
       [not found]           ` <4875C932.2020503-6ktuUTfB/bM@public.gmane.org>
  0 siblings, 1 reply; 32+ messages in thread
From: Nadia Derbey @ 2008-07-10  8:32 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Eric W. Biederman wrote:
> Nadia.Derbey-6ktuUTfB/bM@public.gmane.org writes:
> 
> 
>>[PATCH 03/05]
>>
>>This patch uses the value written into the next_syscall_data proc file
>>as a target upid nr for the next process to be created.
>>The following syscalls have a new behavior if next_syscall_data is set:
>>. fork()
>>. vfork()
>>. clone()
>>
>>In the current version, if the process belongs to nested namespaces, only
>>the upper namespace level upid nr is allowed to be predefined, since there
>>is not yet a way to take a snapshot of upid nrs at all namespaces levels.
>>
>>But this can easily be extended in the future.
> 
> 
> This patch is unnecessary and a mess.  The existing limits on the pid range should
> be enough.  We may need to export it via /proc/sys.
> 

Eric,

If I understood correctly, you are suggesting setting min = max = 
target_pid using /proc/sys, i.e. for the whole system. Don't you think 
this might be dangerous? Allocating pids will fail for any other running 
process for the entire period during which /proc/sys is set like that.
I really think this is a feature that should be confined to a process.

Regards,
Nadia


* Re: [RFC PATCH 3/5] use next syscall data to predefine process ids
       [not found]           ` <4875C932.2020503-6ktuUTfB/bM@public.gmane.org>
@ 2008-07-10  9:36             ` Eric W. Biederman
  0 siblings, 0 replies; 32+ messages in thread
From: Eric W. Biederman @ 2008-07-10  9:36 UTC (permalink / raw)
  To: Nadia Derbey; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Nadia Derbey <Nadia.Derbey-6ktuUTfB/bM@public.gmane.org> writes:

> If I understood correctly, you are suggesting setting min = max =
> target_pid using /proc/sys, i.e. for the whole system. Don't you think this
> might be dangerous? Allocating pids will fail for any other running process
> for the entire period during which /proc/sys is set like that.
> I really think this is a feature that should be confined to a process.

Well for a pid namespace, so that is more confined.
Grr.  We still need to move /proc/sys into /proc/<pid>/sys so it is
clear that sysctls are per namespace.

You are right that doing it that way has downsides.  In particular
it is hard to parallelize the restoration of a pid namespace.

However the interface does exist, and it didn't look like you were
reusing that code in your allocator.

It is my firm suspicion that restoring a process one syscall
at a time is too fine a granularity.  Certainly for the VM
of a process it is.

So here is my suggestion for now.  Take whatever approach you
are doing and make it work for you.  Go as far as you can
go and see what the pitfalls are.  Then on the 22nd we can
all get in a room and discuss things, and if we are lucky
agree on a path forward.

Eric


* Re: [RFC PATCH 0/5] Resend -v2 - Use procfs to change a syscall behavior
       [not found]     ` <m1lk0aefs1.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
@ 2008-07-10  9:54       ` Nadia Derbey
  0 siblings, 0 replies; 32+ messages in thread
From: Nadia Derbey @ 2008-07-10  9:54 UTC (permalink / raw)
  To: Eric W. Biederman; +Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Eric W. Biederman wrote:
> Nadia.Derbey-6ktuUTfB/bM@public.gmane.org writes:
> 
> 
>>This patchset is a part of an effort to change some syscalls behavior for
>>checkpoint restart.
> 
> 
> Thanks for doing this.
> 
> Unfortunately this makes a very good case for why we don't want to go down
> this route: adding magic parameters to syscalls that are only useful
> in one very specific restart case.
> 
> We need good clean interfaces with well defined semantics.
> 
> Something as narrowly focused as this is not really useful, and it takes
> a lot of code to do something very few people will want to actively
> do.

All this seems reasonable.
Ok, so since we are taking the "new syscalls" direction, I'll try to 
make a list of the potentially duplicated syscalls.

Regards,
Nadia

> 
> 
>>The syntax is:
>># echo "LONG1 XX" > /proc/self/task/<my_tid>/next_syscall_data
>>     next object to be created will have an id set to XX
> 
> 
> Which is horrible in another way because it is hugely race prone.
> 
> Eric
> 
> 
> 
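
For readers who have not seen patch 1/5, the syntax quoted above can be
sketched from C roughly as follows. This is only an illustration of the
proposed RFC interface: the next_syscall_data file exists only on a
kernel carrying this patchset, the function names are made up here, and
writing the id in decimal is an assumption (the thread only shows "XX").

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Build the per-task proc path and the "LONG1 <id>" payload quoted above.
 * Assumption: the id is written in decimal.
 * Returns 0 on success, -1 if either buffer is too small.
 */
static int format_next_syscall_data(char *path, size_t path_len,
				    char *msg, size_t msg_len,
				    pid_t tid, long id)
{
	int n;

	assert(path != NULL && msg != NULL);

	n = snprintf(path, path_len, "/proc/self/task/%ld/next_syscall_data",
		     (long)tid);
	if (n < 0 || (size_t)n >= path_len)
		return -1;

	n = snprintf(msg, msg_len, "LONG1 %ld", id);
	if (n < 0 || (size_t)n >= msg_len)
		return -1;

	return 0;
}

/*
 * Ask for `id` to be used by the next id-allocating syscall of task `tid`.
 * On a kernel without the RFC patchset the file does not exist, so open()
 * fails and -1 is returned.
 */
static int set_next_syscall_data(pid_t tid, long id)
{
	char path[64], msg[32];
	ssize_t written;
	int fd;

	if (format_next_syscall_data(path, sizeof(path), msg, sizeof(msg),
				     tid, id) < 0)
		return -1;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;

	written = write(fd, msg, strlen(msg));
	close(fd);
	return written == (ssize_t)strlen(msg) ? 0 : -1;
}
```

The value is stored by one call and consumed by a later, separate
syscall, so any id-allocating syscall the task happens to make in
between would pick it up instead. That two-step window is presumably
what the "race prone" remark refers to.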


* Re: [Devel] [RFC PATCH 0/5] Resend -v2 - Use procfs to change a syscall behavior
       [not found]     ` <20080709221028.GA4926-QDJVlCTZ4KWTKS93B3g+7KFoa47nwP16@public.gmane.org>
  2008-07-10  0:43       ` Eric W. Biederman
@ 2008-07-10 16:01       ` Dave Hansen
  1 sibling, 0 replies; 32+ messages in thread
From: Dave Hansen @ 2008-07-10 16:01 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Nadia.Derbey-6ktuUTfB/bM

On Thu, 2008-07-10 at 02:10 +0400, Alexey Dobriyan wrote:
> How, for example, dump all VMAs correctly?
> [prepares counter-example]

Are there some particular pitfalls that you'd like to share?  I'd love
to hear some of the issues that you've run into with Virtuozzo as its
implementation was created.

-- Dave


* Re: [RFC PATCH 5/5] use next syscall data to predefine the file descriptor value
       [not found]               ` <4875A849.1030206-6ktuUTfB/bM@public.gmane.org>
@ 2008-07-14  4:58                 ` kathys
  0 siblings, 0 replies; 32+ messages in thread
From: kathys @ 2008-07-14  4:58 UTC (permalink / raw)
  To: Nadia Derbey
  Cc: Kathy Staples,
	containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

Nadia Derbey wrote:
> kathys wrote:
>> Hi Nadia,
>>
>> I am trying with great difficulty to incorporate these patches into 
>> the existing lxc-tree on 2.6.26-rc8-mm1-lxc1, they are conflicting 
>> with a number of other patches from checkpoint/.
>
> Kathy,
>
> Is it the same problem as the one we have solved by private e-mail?
>
> Regards,
> Nadia
Hi Nadia, thanks. I think the confusion was that I was working my way 
through and sent a number of emails in the thread telling you what I 
was going to do.  So yes, this is the same issue.  Thank you for the 
information.  I will reapply the patches and remove the old ones.

Thanks,

Kathy
>
>> Serge has asked me to include them in the next lxc release so I need 
>> to know how to make them fit.
>>
>> I will put out 2.6.26-rc8-mm1-lxc1 without your patches because it's 
>> taking me too long. I will endeavor to include them in 
>> 2.6.26-rc8-mm1-lxc2, so it would help if you could have a look at them 
>> against the next release of lxc, which I hope to get out by tomorrow 
>> (Thursday) afternoon.
>>
>> Thanks,
>>
>> Kathy
>>
>> Serge E. Hallyn wrote:
>>
>>> Quoting Nadia.Derbey-6ktuUTfB/bM@public.gmane.org (Nadia.Derbey-6ktuUTfB/bM@public.gmane.org):
>>>  
>>>
>>>> [PATCH 05/05]
>>>>
>>>> This patch uses the value written into the next_syscall_data proc file
>>>> as a target file descriptor for the next file to be opened.
>>>>
>>>> This makes it easy to restart a process with the same fds as the 
>>>> ones it was
>>>> using during the checkpoint phase, instead of 1. opening the file, 
>>>> 2. dup2'ing
>>>> the open file descriptor.
>>>>
>>>> The following syscalls are impacted if next_syscall_data is set:
>>>> . open()
>>>> . openat()
>>>>     
>>>
>>>
>>> Oh, neat, I somehow missed the fact that you had this in your previous
>>> posting  :)
>>>
>>>  
>>>
>>>> Signed-off-by: Nadia Derbey <Nadia.Derbey-6ktuUTfB/bM@public.gmane.org>
>>>>     
>>>
>>>
>>> It'd be nice if the get_predefined_fd_flags() could share a helper
>>> with get_unused_fd_flags() (in particular because the "/* sanity check */"
>>> at the end is between a '#if 1' which sounds like it may one day be
>>> removed), but I'm not sure offhand the best way to do that.  So for now
>>>
>>> Acked-by: Serge Hallyn <serue-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
>>>
>>> Thanks, Nadia.
>>>
>>> Kathy, I'd love to see a -lxc release with this patchset so we can test
>>> it with cryo.
>>>
>>> Suka, the open with specified id here might help you simplify your pipe
>>> c/r patches for cryo?
>>>
>>> -serge
>>>
>>>  
>>>
>>>> ---
>>>>  fs/open.c |   62 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>>>>  1 file changed, 61 insertions(+), 1 deletion(-)
>>>>
>>>> Index: linux-2.6.26-rc8-mm1/fs/open.c
>>>> ===================================================================
>>>> --- linux-2.6.26-rc8-mm1.orig/fs/open.c    2008-07-08 12:12:34.000000000 +0200
>>>> +++ linux-2.6.26-rc8-mm1/fs/open.c    2008-07-08 13:23:03.000000000 +0200
>>>> @@ -974,6 +974,59 @@ struct file *dentry_open(struct dentry *
>>>>  EXPORT_SYMBOL(dentry_open);
>>>>
>>>>  /*
>>>> + * Marks a given file descriptor entry as busy (should not be busy when this
>>>> + * routine is called).
>>>> + *
>>>> + * files->next_fd is not updated: this lets the potentially created hole be
>>>> + * filled up on next calls to get_unused_fd_flags.
>>>> + *
>>>> + * Returns the specified fd if successful, -errno else.
>>>> + */
>>>> +static int get_predefined_fd_flags(int fd, int flags)
>>>> +{
>>>> +    struct files_struct *files = current->files;
>>>> +    int error;
>>>> +    struct fdtable *fdt;
>>>> +
>>>> +    error = -EINVAL;
>>>> +    if (fd < 0)
>>>> +        goto out;
>>>> +
>>>> +    error = -EMFILE;
>>>> +    if (fd >= current->signal->rlim[RLIMIT_NOFILE].rlim_cur)
>>>> +        goto out;
>>>> +
>>>> +    spin_lock(&files->file_lock);
>>>> +    fdt = files_fdtable(files);
>>>> +
>>>> +    error = expand_files(files, fd);
>>>> +    if (error < 0)
>>>> +        goto out_unlock;
>>>> +
>>>> +    error = -EBUSY;
>>>> +    if (FD_ISSET(fd, fdt->open_fds))
>>>> +        goto out_unlock;
>>>> +
>>>> +    FD_SET(fd, fdt->open_fds);
>>>> +    if (flags & O_CLOEXEC)
>>>> +        FD_SET(fd, fdt->close_on_exec);
>>>> +    else
>>>> +        FD_CLR(fd, fdt->close_on_exec);
>>>> +
>>>> +    /* Sanity check */
>>>> +    if (fdt->fd[fd] != NULL) {
>>>> +        printk(KERN_WARNING "get_unused_fd: slot %d not NULL!\n", fd);
>>>> +        fdt->fd[fd] = NULL;
>>>> +    }
>>>> +
>>>> +    error = fd;
>>>> +out_unlock:
>>>> +    spin_unlock(&files->file_lock);
>>>> +out:
>>>> +    return error;
>>>> +}
>>>> +
>>>> +/*
>>>>   * Find an empty file descriptor entry, and mark it busy.
>>>>   */
>>>>  int get_unused_fd_flags(int flags)
>>>> @@ -1088,7 +1141,14 @@ long do_sys_open(int dfd, const char __u
>>>>      int fd = PTR_ERR(tmp);
>>>>
>>>>      if (!IS_ERR(tmp)) {
>>>> -        fd = get_unused_fd_flags(flags);
>>>> +        if (unlikely(next_data_set(current))) {
>>>> +            int next_fd = get_next_data(current);
>>>> +
>>>> +            fd = get_predefined_fd_flags(next_fd, flags);
>>>> +            reset_next_syscall_data(current);
>>>> +        } else
>>>> +            fd = get_unused_fd_flags(flags);
>>>> +
>>>>          if (fd >= 0) {
>>>>              struct file *f = do_filp_open(dfd, tmp, flags, mode);
>>>>              if (IS_ERR(f)) {
>>>>
>>>> -- 
>>>>     
>>>
>>>
>>>   
>>
>>
>>
>>
>
>


* Re: [Devel] [RFC PATCH 0/5] Resend -v2 - Use procfs to change a syscall behavior
       [not found]             ` <20080710013915.GB8327-QDJVlCTZ4KWTKS93B3g+7KFoa47nwP16@public.gmane.org>
  2008-07-10  2:14               ` Eric W. Biederman
@ 2008-07-15 18:18               ` Eric W. Biederman
  1 sibling, 0 replies; 32+ messages in thread
From: Eric W. Biederman @ 2008-07-15 18:18 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Nadia.Derbey-6ktuUTfB/bM

Alexey Dobriyan <adobriyan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes:

> This is all said in assumption that serializing kernel-internal data for
> checkpoint/restart to userspace is acceptable for mainline.
> I don't think it is.

Just a quick comment here.  We mentioned at the last kernel summit that
checkpoint/restart is where we were going, and no one was opposed.

So while I expect technical objections if we are not careful, I believe
a well chosen checkpoint/restart framework has every chance of being
merged into mainline.

Eric


* Re: [Devel] [RFC PATCH 0/5] Resend -v2 - Use procfs to change a syscall behavior
       [not found]         ` <m1tzeyd0x3.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
  2008-07-10  1:39           ` Alexey Dobriyan
@ 2008-07-17 22:42           ` Oren Laadan
       [not found]             ` <487FCAF0.70607-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  1 sibling, 1 reply; 32+ messages in thread
From: Oren Laadan @ 2008-07-17 22:42 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Nadia.Derbey-6ktuUTfB/bM

Eric W. Biederman wrote:
> Alexey Dobriyan <adobriyan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes:

I seem to not have received any of Alexey's emails... ?

> 
>> On Tue, Jul 08, 2008 at 01:24:22PM +0200, Nadia.Derbey-6ktuUTfB/bM@public.gmane.org wrote:
>>> # echo "LONG1 XX" > /proc/self/task/<my_tid>/next_syscall_data
>> Same stuff.
>>
>> There is struct task_struct::did_exec , what about it?
>>
>> Also, the patches are about de-serializing; what does serializing from
>> userspace look like?
>> You froze a group of processes, then what?
>>
>> How, for example, dump all VMAs correctly?
>> [prepares counter-example]
> 
> Alexey, a userspace vs. a kernel space implementation is the wrong argument.
> 
> It is clearly established that the current user space interfaces are
> insufficient to do the job.  So we need to implement something in the kernel.
> 
> Further I have heard of no one suggesting running a single kernel on multiple
> machines.  Therefore no one seems to be doing this entirely in the kernel
> and so we need a user space component.

I'm not sure I understand this argument ?

In a kernel implementation, the component will merely open a file descriptor
(to which the data will be streamed), freeze the container and invoke a
system call. In a userland implementation, the component will do most of
the work by continuously probing the kernel for information about the
processes that are being checkpointed.

So, of course we need a "component" - but what does that component do ?

> So the question should not be user space vs. kernel space but can we build clean
> interfaces for checkpoint/restart?  What will those interfaces be?

My question is: why build a set of interfaces to export this and that from
the kernel to user space?  If a kernel implementation (with minimal user
space support) is chosen, then information extraction (and restoration) is
straightforward and we don't get ourselves tied until the end of time to
an API exported to userland.

The output of the module will be a binary (like a core dump) that can be
used by the same module to restart. User utilities will be available to
inspect the contents (for whatever reason - like a debugger can inspect a
core dump), and moreover to convert between old and new formats when moving
from older to newer kernels.

By doing so, we avoid many API issues - design, complexity, contents, and
the amount of interfaces to be added.

By doing so, we also gain much in terms of atomicity, possibility to add
optimizations and improve performance, as well as add features as we wish,
without the burden of commitments to userspace.

I think the kernel space vs. user space must be the first issue on our
table to solve, as it has a wide impact on the rest of the work.

Oren.

> 
> Although I think it is good that we are seeing more people play with this as
> that should mean that our pool of people for doing code review on the implementation
> should be reasonable.
> 
> Eric


* Re: [Devel] [RFC PATCH 0/5] Resend -v2 - Use procfs to change a syscall behavior
       [not found]             ` <487FCAF0.70607-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
@ 2008-07-18  1:09               ` Matt Helsley
       [not found]                 ` <1216343365.4844.308.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
  2008-07-18  2:40               ` Eric W. Biederman
  1 sibling, 1 reply; 32+ messages in thread
From: Matt Helsley @ 2008-07-18  1:09 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Nadia.Derbey-6ktuUTfB/bM, Eric W. Biederman, Alexey Dobriyan


On Thu, 2008-07-17 at 18:42 -0400, Oren Laadan wrote:
> 
> Eric W. Biederman wrote:
> > Alexey Dobriyan <adobriyan-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> writes:
> 
> I seem to not have received any of Alexey's emails... ?
> 
> > 
> >> On Tue, Jul 08, 2008 at 01:24:22PM +0200, Nadia.Derbey-6ktuUTfB/bM@public.gmane.org wrote:
> >>> # echo "LONG1 XX" > /proc/self/task/<my_tid>/next_syscall_data
> >> Same stuff.
> >>
> >> There is struct task_struct::did_exec , what about it?
> >>
> >> Also, the patches are about de-serializing; what does serializing from
> >> userspace look like?
> >> You froze a group of processes, then what?
> >>
> >> How, for example, dump all VMAs correctly?
> >> [prepares counter-example]
> > 
> > Alexey, a userspace vs. a kernel space implementation is the wrong argument.
> > 
> > It is clearly established that the current user space interfaces are
> > insufficient to do the job.  So we need to implement something in the kernel.
> > 
> > Further I have heard of no one suggesting running a single kernel on multiple
> > machines.  Therefore no one seems to be doing this entirely in the kernel
> > and so we need a user space component.
> 
> I'm not sure I understand this argument ?
> 
> In a kernel implementation, the component will merely open a file descriptor
> (to which the data will be streamed), freeze the container and invoke a
> system call. In a userland implementation, the component will do most of
> the work by continuously probing the kernel for information about the
> processes that are being checkpointed.
> 
> So, of course we need a "component" - but what does that component do ?
> 
> > So the question should not be user space vs. kernel space but can we build clean
> > interfaces for checkpoint/restart?  What will those interfaces be?
> 
> My question is: why build a set of interfaces to export this and that from
> the kernel to user space?  If a kernel implementation (with minimal user
> space support) is chosen, then information extraction (and restoration) is
> straightforward and we don't get ourselves tied until the end of time to
> an API exported to userland.

	That still seems like an API exported to userland. It just combines the
data into one block rather than distributing it amongst a bunch of
pseudo-filesystems. Does this form of API really free us from always
supporting it in the future?

> The output of the module will be a binary (like a core dump) that can be
> used by the same module to restart. User utilities will be available to
> inspect the contents (for whatever reason - like a debugger can inspect a
> core dump), and moreover to convert between old and new formats when moving
> from older to newer kernels.
>
> By doing so, we avoid many API issues - design, complexity, contents, and
> the amount of interfaces to be added.

	Userspace is expected to inspect or convert the binary data. How does
that truly avoid many of the API issues mentioned above? If it's really
supposed to be a minimal API then the binary should be considered opaque
and userspace tools which inspect or convert these binaries should be
considered unreliable hacks at best. Otherwise it seems to me that it
has most of the familiar problems associated with a kernel/userspace API
-- including an obligation to support it.

Cheers,
	-Matt Helsley


* Re: [Devel] [RFC PATCH 0/5] Resend -v2 - Use procfs to change a syscall behavior
       [not found]             ` <487FCAF0.70607-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
  2008-07-18  1:09               ` Matt Helsley
@ 2008-07-18  2:40               ` Eric W. Biederman
  1 sibling, 0 replies; 32+ messages in thread
From: Eric W. Biederman @ 2008-07-18  2:40 UTC (permalink / raw)
  To: Oren Laadan
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Nadia.Derbey-6ktuUTfB/bM

Oren Laadan <orenl-eQaUEPhvms7ENvBUuze7eA@public.gmane.org> writes:

> I think the kernel space vs. user space must be the first issue on our
> table to solve, as it has a wide impact on the rest of the work.

We first need to talk about what kinds of problems we are trying to
solve.  If we don't agree what the problem is I expect we will have a
hard time agreeing on a solution.

For example we are using namespaces now instead of the potentially
simpler isolation mechanism of Vserver because checkpoint/restart
could not be done with the Vserver approach.

The use case that I expect we all have in common is migrating
an isolated container from one machine to another, transparently
to applications, except those that directly access the hardware,
at which point we can treat it as a hotplug event from the
perspective of userspace.

There are several other interesting use cases that I think we
should solve if possible.
- Live/Incremental migration.  
- Remote fork.  Which can be seen as an extreme case of migrating
  only a partial container.
- A checkpoint that can be restarted multiple times and work properly.
  Which means you need to include the state of the filesystem.
- A distributed checkpoint of multiple containers at the same time.

Given how brutally hard and inefficient it is to restore a checkpoint
using the existing system calls, even with namespaces in the kernel, we
can pretty much rule that implementation out, as it does not match
our efficiency criteria and likely isn't especially maintainable either.

On the maintenance side we can generally rule out an out-of-tree module,
as that does not make it visible to people changing a subsystem that the
checkpoint/restart code needs to change as well.

I believe the live migration will have the most stringent performance
requirements and at the same time be one of the most useful features,
as it immediately improves maintenance of clusters.

In the extreme case of a distributed checkpoint the kernel simply does
not have enough state so we need user space code coordinating all of
the pieces.

For a multi-start checkpoint I expect userspace will be coordinating
filesystem snapshots and checkpoints.

Eric


* Re: [Devel] [RFC PATCH 0/5] Resend -v2 - Use procfs to change a syscall behavior
       [not found]                 ` <1216343365.4844.308.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
@ 2008-07-18  2:49                   ` Eric W. Biederman
  0 siblings, 0 replies; 32+ messages in thread
From: Eric W. Biederman @ 2008-07-18  2:49 UTC (permalink / raw)
  To: Matt Helsley
  Cc: containers-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Alexey Dobriyan, Nadia.Derbey-6ktuUTfB/bM

Matt Helsley <matthltc-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org> writes:

> On Thu, 2008-07-17 at 18:42 -0400, Oren Laadan wrote:
>> 
>> My question is: why build a set of interfaces to export this and that from
>> the kernel to user space?  If a kernel implementation (with minimal user
>> space support) is chosen, then information extraction (and restoration) is
>> straightforward and we don't get ourselves tied until the end of time to
>> an API exported to userland.
>
> 	That still seems like an API exported to userland. It just combines the
> data into one block rather than distributing it amongst a bunch of
> pseudo-filesystems. Does this form of API really free us from always
> supporting it in the future?

A larger granularity reduces the support burden.  You don't wind up
introducing a bunch of little system calls that you only use for
restore.  You introduce one that does exactly what you need it to do.
Because you know it is only used in checkpoint/restart conditions you
can make assumptions about the users and have more freedom.

Yes it would still be a user/kernel interface.

If we abstract it the way binformats are abstracted, we
may eventually be able to stop including an old format that no
one uses anymore.

>
> 	Userspace is expected to inspect or convert the binary data. How does
> that truly avoid many of the API issues mentioned above? If it's really
> supposed to be a minimal API then the binary should be considered opaque
> and userspace tools which inspect or convert these binaries should be
> considered unreliable hacks at best. Otherwise it seems to me that it
> has most of the familiar problems associated with a kernel/userspace API
> -- including an obligation to support it.

The best precedent we have for something like this today is the core
dump.  That is a single process and does not do well at tying multiple
processes together.  Even though you can inspect a core dump there is
still a lot of freedom in the implementation that we would not have
in a more general API.

As for userspace converting old data to new data, I'm not sold on the
idea yet.  It is a good tool to plan on, but I'm not yet convinced
that it is necessary, at least when moving from older to newer kernels.
I expect newer kernels to have state that the older kernels don't know
how to handle, so we would at least need to strip that out.

Eric


end of thread

Thread overview: 32+ messages
2008-07-08 11:24 [RFC PATCH 0/5] Resend -v2 - Use procfs to change a syscall behavior Nadia.Derbey-6ktuUTfB/bM
2008-07-08 11:24 ` [RFC PATCH 1/5] adds the procfs facilities Nadia.Derbey-6ktuUTfB/bM
     [not found]   ` <20080708112457.994105000-6ktuUTfB/bM@public.gmane.org>
2008-07-08 19:32     ` Serge E. Hallyn
2008-07-08 11:24 ` [RFC PATCH 2/5] use next syscall data to predefine ipc objects ids Nadia.Derbey-6ktuUTfB/bM
     [not found]   ` <20080708112458.416998000-6ktuUTfB/bM@public.gmane.org>
2008-07-08 19:38     ` Serge E. Hallyn
2008-07-08 11:24 ` [RFC PATCH 3/5] use next syscall data to predefine process ids Nadia.Derbey-6ktuUTfB/bM
     [not found]   ` <20080708112458.946320000-6ktuUTfB/bM@public.gmane.org>
2008-07-08 19:49     ` Serge E. Hallyn
2008-07-10  0:27     ` Eric W. Biederman
     [not found]       ` <m1hcayfusi.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
2008-07-10  8:32         ` Nadia Derbey
     [not found]           ` <4875C932.2020503-6ktuUTfB/bM@public.gmane.org>
2008-07-10  9:36             ` Eric W. Biederman
2008-07-08 11:24 ` [RFC PATCH 4/5] use next syscall data to change the behavior of IPC_SET Nadia.Derbey-6ktuUTfB/bM
     [not found]   ` <20080708112459.231249000-6ktuUTfB/bM@public.gmane.org>
2008-07-08 19:56     ` Serge E. Hallyn
2008-07-08 11:24 ` [RFC PATCH 5/5] use next syscall data to predefine the file descriptor value Nadia.Derbey-6ktuUTfB/bM
     [not found]   ` <20080708112459.632357000-6ktuUTfB/bM@public.gmane.org>
2008-07-08 20:14     ` Serge E. Hallyn
     [not found]       ` <20080708201452.GE22904-r/Jw6+rmf7HQT0dZR+AlfA@public.gmane.org>
2008-07-09  5:00         ` kathys
     [not found]           ` <487445E4.6060107-8fk3Idey6ehBDgjK7y7TUQ@public.gmane.org>
2008-07-10  6:12             ` Nadia Derbey
     [not found]               ` <4875A849.1030206-6ktuUTfB/bM@public.gmane.org>
2008-07-14  4:58                 ` kathys
2008-07-10  0:32     ` Eric W. Biederman
     [not found]       ` <m1tzeyefz9.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
2008-07-10  6:25         ` Nadia Derbey
     [not found] ` <20080708112422.164370000-6ktuUTfB/bM@public.gmane.org>
2008-07-09 22:10   ` [Devel] [RFC PATCH 0/5] Resend -v2 - Use procfs to change a syscall behavior Alexey Dobriyan
     [not found]     ` <20080709221028.GA4926-QDJVlCTZ4KWTKS93B3g+7KFoa47nwP16@public.gmane.org>
2008-07-10  0:43       ` Eric W. Biederman
     [not found]         ` <m1tzeyd0x3.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
2008-07-10  1:39           ` Alexey Dobriyan
     [not found]             ` <20080710013915.GB8327-QDJVlCTZ4KWTKS93B3g+7KFoa47nwP16@public.gmane.org>
2008-07-10  2:14               ` Eric W. Biederman
2008-07-15 18:18               ` Eric W. Biederman
2008-07-17 22:42           ` Oren Laadan
     [not found]             ` <487FCAF0.70607-eQaUEPhvms7ENvBUuze7eA@public.gmane.org>
2008-07-18  1:09               ` Matt Helsley
     [not found]                 ` <1216343365.4844.308.camel-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>
2008-07-18  2:49                   ` Eric W. Biederman
2008-07-18  2:40               ` Eric W. Biederman
2008-07-10 16:01       ` Dave Hansen
2008-07-10  0:36   ` Eric W. Biederman
     [not found]     ` <m1lk0aefs1.fsf-B27657KtZYmhTnVgQlOflh2eb7JE58TQ@public.gmane.org>
2008-07-10  9:54       ` Nadia Derbey
  -- strict thread matches above, loose matches on Subject: below --
2008-07-03 14:40 [RFC PATCH 0/5] Resend " Nadia.Derbey-6ktuUTfB/bM
2008-07-03 14:40 ` [RFC PATCH 5/5] use next syscall data to predefine the file descriptor value Nadia.Derbey-6ktuUTfB/bM
