* [PATCH v8 1/2] fork: extend clone3() to support setting a PID
@ 2019-11-13 8:03 Adrian Reber
2019-11-13 8:03 ` [PATCH v8 2/2] selftests: add tests for clone3() Adrian Reber
` (2 more replies)
0 siblings, 3 replies; 5+ messages in thread
From: Adrian Reber @ 2019-11-13 8:03 UTC (permalink / raw)
To: Christian Brauner, Eric Biederman, Pavel Emelyanov, Jann Horn,
Oleg Nesterov, Dmitry Safonov, Rasmus Villemoes
Cc: linux-kernel, Andrei Vagin, Mike Rapoport, Radostin Stoyanov,
Adrian Reber
The main motivation to add set_tid to clone3() is CRIU.
To restore a process with the same PID/TID CRIU currently uses
/proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
ns_last_pid and then (quickly) does a clone(). This works most of the
time, but it is racy. It is also slow as it requires multiple syscalls.
Extending clone3() to support *set_tid makes it possible restore a
process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
race free (as long as the desired PID/TID is available).
This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
on clone3() with *set_tid as they are currently in place for ns_last_pid.
The original version of this change was using a single value for
set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
decided to change set_tid to an array to enable setting the PID of a
process in multiple PID namespaces at the same time. If a process is
created in a PID namespace it is possible to influence the PID inside
and outside of the PID namespace. Details also in the corresponding
selftest.
To create a process with the following PIDs:
PID NS level Requested PID
0 (host) 31496
1 42
2 1
For that example the two newly introduced parameters to struct
clone_args (set_tid and set_tid_size) would need to be:
set_tid[0] = 1;
set_tid[1] = 42;
set_tid[2] = 31496;
set_tid_size = 3;
If only the PIDs of the two innermost nested PID namespaces should be
defined it would look like this:
set_tid[0] = 1;
set_tid[1] = 42;
set_tid_size = 2;
The PID of the newly created process would then be the next available
free PID in the PID namespace level 0 (host) and 42 in the PID namespace
at level 1 and the PID of the process in the innermost PID namespace
would be 1.
The set_tid array is used to specify the PID of a process starting
from the innermost nested PID namespaces up to set_tid_size PID namespaces.
set_tid_size cannot be larger then the current PID namespace level.
Signed-off-by: Adrian Reber <areber@redhat.com>
---
v2:
- Removed (size < sizeof(struct clone_args)) as discussed with
Christian and Dmitry
- Added comment to ((set_tid != 1) && idr_get_cursor() <= 1) (Oleg)
- Use idr_alloc() instead of idr_alloc_cyclic() (Oleg)
v3:
- Return EEXIST if PID is already in use (Christian)
- Drop CLONE_SET_TID (Christian and Oleg)
- Use idr_is_empty() instead of idr_get_cursor() (Oleg)
- Handle different `struct clone_args` sizes (Dmitry)
v4:
- Rework struct size check with defines (Christian)
- Reduce number of set_tid checks (Oleg)
- Less parentheses and more robust code (Oleg)
- Do ns_capable() on correct user_ns (Oleg, Christian)
v5:
- make set_tid checks earlier in alloc_pid() (Christian)
- remove unnecessary comment and struct size check (Christian)
v6:
- remove CLONE_SET_TID from description (Christian)
- add clone3() tests for different clone_args sizes (Christian)
- move more set_tid checks to alloc_pid() (Oleg)
- make all set_tid checks lockless (Oleg)
v7:
- changed set_tid to be an array to set the PID of a process
in multiple nested PID namespaces at the same time as discussed
at LPC 2019 (container MC)
v8:
- skip unnecessary memset() (Rasmus)
- replace set_tid copy loop with a single copy (Christian)
- more parameter error checking (Christian)
- cache set_tid in alloc_pid() (Oleg)
- move code in "else" branch (Oleg)
---
include/linux/pid.h | 3 +-
include/linux/pid_namespace.h | 2 ++
include/linux/sched/task.h | 3 ++
include/uapi/linux/sched.h | 2 ++
kernel/fork.c | 24 ++++++++++++-
kernel/pid.c | 64 +++++++++++++++++++++++++++--------
kernel/pid_namespace.c | 2 --
7 files changed, 82 insertions(+), 18 deletions(-)
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 9645b1194c98..034b7df25888 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -120,7 +120,8 @@ extern struct pid *find_vpid(int nr);
extern struct pid *find_get_pid(int nr);
extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
-extern struct pid *alloc_pid(struct pid_namespace *ns);
+extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
+ size_t set_tid_size);
extern void free_pid(struct pid *pid);
extern void disable_pid_allocation(struct pid_namespace *ns);
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 49538b172483..2ed6af88794b 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -12,6 +12,8 @@
#include <linux/ns_common.h>
#include <linux/idr.h>
+/* MAX_PID_NS_LEVEL is needed for limiting size of 'struct pid' */
+#define MAX_PID_NS_LEVEL 32
struct fs_pin;
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 4b1c3b664f51..f1879884238e 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -26,6 +26,9 @@ struct kernel_clone_args {
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
+ pid_t *set_tid;
+ /* Number of elements in *set_tid */
+ size_t set_tid_size;
};
/*
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index 25b4fa00bad1..13f74c40a629 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -72,6 +72,8 @@ struct clone_args {
__aligned_u64 stack;
__aligned_u64 stack_size;
__aligned_u64 tls;
+ __aligned_u64 set_tid;
+ __aligned_u64 set_tid_size;
};
#endif
diff --git a/kernel/fork.c b/kernel/fork.c
index 55af6931c6ec..6f9e6443e12c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2026,7 +2026,8 @@ static __latent_entropy struct task_struct *copy_process(
stackleak_task_init(p);
if (pid != &init_struct_pid) {
- pid = alloc_pid(p->nsproxy->pid_ns_for_children);
+ pid = alloc_pid(p->nsproxy->pid_ns_for_children, args->set_tid,
+ args->set_tid_size);
if (IS_ERR(pid)) {
retval = PTR_ERR(pid);
goto bad_fork_cleanup_thread;
@@ -2529,6 +2530,7 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
{
int err;
struct clone_args args;
+ pid_t *kset_tid = kargs->set_tid;
if (unlikely(usize > PAGE_SIZE))
return -E2BIG;
@@ -2539,6 +2541,15 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
if (err)
return err;
+ if (unlikely(args.set_tid_size > MAX_PID_NS_LEVEL))
+ return -EINVAL;
+
+ if (unlikely(!args.set_tid && args.set_tid_size > 0))
+ return -EINVAL;
+
+ if (unlikely(args.set_tid && args.set_tid_size == 0))
+ return -EINVAL;
+
/*
* Verify that higher 32bits of exit_signal are unset and that
* it is a valid signal
@@ -2556,8 +2567,16 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
.stack = args.stack,
.stack_size = args.stack_size,
.tls = args.tls,
+ .set_tid_size = args.set_tid_size,
};
+ if (args.set_tid &&
+ copy_from_user(kset_tid, u64_to_user_ptr(args.set_tid),
+ (kargs->set_tid_size * sizeof(pid_t))))
+ return -EFAULT;
+
+ kargs->set_tid = kset_tid;
+
return 0;
}
@@ -2631,6 +2650,9 @@ SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
int err;
struct kernel_clone_args kargs;
+ pid_t set_tid[MAX_PID_NS_LEVEL];
+
+ kargs.set_tid = set_tid;
err = copy_clone_args_from_user(&kargs, uargs, size);
if (err)
diff --git a/kernel/pid.c b/kernel/pid.c
index 0a9f2e437217..17280da48224 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -157,7 +157,8 @@ void free_pid(struct pid *pid)
call_rcu(&pid->rcu, delayed_put_pid);
}
-struct pid *alloc_pid(struct pid_namespace *ns)
+struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
+ size_t set_tid_size)
{
struct pid *pid;
enum pid_type type;
@@ -166,6 +167,17 @@ struct pid *alloc_pid(struct pid_namespace *ns)
struct upid *upid;
int retval = -ENOMEM;
+ /*
+ * set_tid_size contains the size of the set_tid array. Starting at
+ * the most nested currently active PID namespace it tells alloc_pid()
+ * which PID to set for a process in that most nested PID namespace
+ * up to set_tid_size PID namespaces. It does not have to set the PID
+ * for a process in all nested PID namespaces but set_tid_size must
+ * never be greater than the current ns->level + 1.
+ */
+ if (set_tid_size > ns->level + 1)
+ return ERR_PTR(-EINVAL);
+
pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
if (!pid)
return ERR_PTR(retval);
@@ -175,23 +187,47 @@ struct pid *alloc_pid(struct pid_namespace *ns)
for (i = ns->level; i >= 0; i--) {
int pid_min = 1;
+ int tid = 0;
+
+ if (set_tid_size) {
+ tid = set_tid[ns->level - i];
+ if (tid < 1 || tid >= pid_max)
+ return ERR_PTR(-EINVAL);
+ /* Also fail if a PID != 1 is requested and no PID 1 exists */
+ if (tid != 1 && !tmp->child_reaper)
+ return ERR_PTR(-EINVAL);
+ if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN))
+ return ERR_PTR(-EPERM);
+ }
idr_preload(GFP_KERNEL);
spin_lock_irq(&pidmap_lock);
- /*
- * init really needs pid 1, but after reaching the maximum
- * wrap back to RESERVED_PIDS
- */
- if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
- pid_min = RESERVED_PIDS;
-
- /*
- * Store a null pointer so find_pid_ns does not find
- * a partially initialized PID (see below).
- */
- nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
- pid_max, GFP_ATOMIC);
+ if (tid) {
+ nr = idr_alloc(&tmp->idr, NULL, tid,
+ tid + 1, GFP_ATOMIC);
+ /*
+ * If ENOSPC is returned it means that the PID is
+ * alreay in use. Return EEXIST in that case.
+ */
+ if (nr == -ENOSPC)
+ nr = -EEXIST;
+ set_tid_size--;
+ } else {
+ /*
+ * init really needs pid 1, but after reaching the maximum
+ * wrap back to RESERVED_PIDS
+ */
+ if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
+ pid_min = RESERVED_PIDS;
+
+ /*
+ * Store a null pointer so find_pid_ns does not find
+ * a partially initialized PID (see below).
+ */
+ nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
+ pid_max, GFP_ATOMIC);
+ }
spin_unlock_irq(&pidmap_lock);
idr_preload_end();
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index a6a79f85c81a..d40017e79ebe 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -26,8 +26,6 @@
static DEFINE_MUTEX(pid_caches_mutex);
static struct kmem_cache *pid_ns_cachep;
-/* MAX_PID_NS_LEVEL is needed for limiting size of 'struct pid' */
-#define MAX_PID_NS_LEVEL 32
/* Write once array, filled from the beginning. */
static struct kmem_cache *pid_cache[MAX_PID_NS_LEVEL];
--
2.23.0
^ permalink raw reply related [flat|nested] 5+ messages in thread* [PATCH v8 2/2] selftests: add tests for clone3() 2019-11-13 8:03 [PATCH v8 1/2] fork: extend clone3() to support setting a PID Adrian Reber @ 2019-11-13 8:03 ` Adrian Reber 2019-11-13 12:35 ` Christian Brauner 2019-11-13 11:50 ` [PATCH v8 1/2] fork: extend clone3() to support setting a PID Christian Brauner 2019-11-13 13:42 ` Oleg Nesterov 2 siblings, 1 reply; 5+ messages in thread From: Adrian Reber @ 2019-11-13 8:03 UTC (permalink / raw) To: Christian Brauner, Eric Biederman, Pavel Emelyanov, Jann Horn, Oleg Nesterov, Dmitry Safonov, Rasmus Villemoes Cc: linux-kernel, Andrei Vagin, Mike Rapoport, Radostin Stoyanov, Adrian Reber This tests clone3() with *set_tid to see if all desired PIDs are working as expected. The tests are trying multiple invalid input parameters as well as creating processes while specifying a certain PID in multiple PID namespaces at the same time. Signed-off-by: Adrian Reber <areber@redhat.com> --- tools/testing/selftests/clone3/.gitignore | 1 + tools/testing/selftests/clone3/Makefile | 2 +- .../testing/selftests/clone3/clone3_set_tid.c | 345 ++++++++++++++++++ 3 files changed, 347 insertions(+), 1 deletion(-) create mode 100644 tools/testing/selftests/clone3/clone3_set_tid.c diff --git a/tools/testing/selftests/clone3/.gitignore b/tools/testing/selftests/clone3/.gitignore index 85d9d3ba2524..d56c3c49d869 100644 --- a/tools/testing/selftests/clone3/.gitignore +++ b/tools/testing/selftests/clone3/.gitignore @@ -1 +1,2 @@ clone3 +clone3_set_tid diff --git a/tools/testing/selftests/clone3/Makefile b/tools/testing/selftests/clone3/Makefile index ea922c014ae4..2d292545ca8e 100644 --- a/tools/testing/selftests/clone3/Makefile +++ b/tools/testing/selftests/clone3/Makefile @@ -2,6 +2,6 @@ CFLAGS += -I../../../../usr/include/ -TEST_GEN_PROGS := clone3 +TEST_GEN_PROGS := clone3 clone3_set_tid include ../lib.mk diff --git a/tools/testing/selftests/clone3/clone3_set_tid.c b/tools/testing/selftests/clone3/clone3_set_tid.c new file mode 100644 index 000000000000..9a234fd2031e --- /dev/null +++ b/tools/testing/selftests/clone3/clone3_set_tid.c @@ -0,0 +1,345 @@ +// SPDX-License-Identifier: GPL-2.0 + +/* + * Based on Christian Brauner's clone3() example. + * These tests are assuming to be running in the host's + * PID namespace. + */ + +#define _GNU_SOURCE +#include <errno.h> +#include <linux/types.h> +#include <linux/sched.h> +#include <stdio.h> +#include <stdlib.h> +#include <sys/syscall.h> +#include <sys/types.h> +#include <sys/un.h> +#include <sys/wait.h> +#include <unistd.h> +#include <sched.h> + +#include "../kselftest.h" + +#ifndef MAX_PID_NS_LEVEL +#define MAX_PID_NS_LEVEL 32 +#endif + +static int pipe_1[2]; +static int pipe_2[2]; + +static pid_t raw_clone(struct clone_args *args) +{ + return syscall(__NR_clone3, args, sizeof(struct clone_args)); +} + +static int call_clone3_set_tid(pid_t *set_tid, + size_t set_tid_size, + int flags, + int expected_pid, + int wait_for_it) +{ + int status; + int ret = 0; + pid_t pid = -1; + struct clone_args args = {0}; + + args.flags = flags; + args.exit_signal = SIGCHLD; + args.set_tid = (__u64)set_tid; + args.set_tid_size = set_tid_size; + + pid = raw_clone(&args); + if (pid < 0) { + ksft_print_msg("%s - Failed to create new process\n", + strerror(errno)); + return -errno; + } + + if (pid == 0) { + char tmp = 0; + ksft_print_msg("I am the child, my PID is %d (expected %d)\n", + getpid(), set_tid[0]); + if (wait_for_it) { + ksft_print_msg("[%d] Child is ready and waiting\n", getpid()); + /* Signal the parent that the child is ready */ + close(pipe_1[0]); + write(pipe_1[1], &tmp, 1); + close(pipe_1[1]); + close(pipe_2[1]); + read(pipe_2[0], &tmp, 1); + close(pipe_2[0]); + } + + if (set_tid[0] != getpid()) + _exit(EXIT_FAILURE); + _exit(EXIT_SUCCESS); + } + + if (expected_pid == 0 || expected_pid == pid) + ksft_print_msg("I am the parent (%d). My child's pid is %d\n", + getpid(), pid); + else { + ksft_print_msg( + "Expected child pid %d does not match actual pid %d\n", + expected_pid, pid); + ret = -1; + } + + if (wait(&status) < 0) { + ksft_print_msg("Child returned %s\n", strerror(errno)); + return -errno; + } + if (WEXITSTATUS(status)) + return WEXITSTATUS(status); + + return ret; +} + +static void test_clone3_set_tid(pid_t *set_tid, + size_t set_tid_size, + int flags, + int expected, + int expected_pid, + int wait_for_it) +{ + int ret; + + ksft_print_msg( + "[%d] Trying clone3() with CLONE_SET_TID to %d and 0x%x\n", + getpid(), set_tid[0], flags); + ret = call_clone3_set_tid(set_tid, set_tid_size, flags, expected_pid, + wait_for_it); + ksft_print_msg( + "[%d] clone3() with CLONE_SET_TID %d says :%d - expected %d\n", + getpid(), set_tid[0], ret, expected); + if (ret != expected) + ksft_test_result_fail( + "[%d] Result (%d) is different than expected (%d)\n", + getpid(), ret, expected); + else + ksft_test_result_pass("[%d] Result (%d) matches expectation (%d)\n", + getpid(), ret, expected); +} +int main(int argc, char *argv[]) +{ + FILE *f; + char buf; + pid_t pid; + pid_t ns1; + pid_t ns2; + pid_t ns3; + int status; + char *proc; + int ret = -1; + pid_t ns_pid; + int pid_max = 0; + uid_t uid = getuid(); + char line[1024] = {0}; + pid_t set_tid[MAX_PID_NS_LEVEL * 2]; + pid_t set_tid_small[1]; + + if (pipe(pipe_1) == -1 || pipe(pipe_2)) + ksft_exit_fail_msg("pipe() failed\n"); + + ksft_print_header(); + ksft_set_plan(27); + + f = fopen("/proc/sys/kernel/pid_max", "r"); + if (f == NULL) + ksft_exit_fail_msg( + "%s - Could not open /proc/sys/kernel/pid_max\n", + strerror(errno)); + fscanf(f, "%d", &pid_max); + fclose(f); + ksft_print_msg("/proc/sys/kernel/pid_max %d\n", pid_max); + + /* Try invalid settings */ + memset(&set_tid, 0, sizeof(set_tid)); + test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL + 1, 0, -E2BIG, 0, 0); + test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 2, 0, -E2BIG, 0, 0); + test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 2 + 1, 0, -E2BIG, 0, 0); + test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 42, 0, -E2BIG, 0, 0); + + /* small set_tid array, but maximum set_tid_size */ + /* Find the current active PID */ + pid = fork(); + if (pid == 0) { + ksft_print_msg("Child has PID %d\n", getpid()); + _exit(EXIT_SUCCESS); + } + (void)wait(NULL); + /* After the child has finished, its PID should be free. */ + set_tid_small[0] = pid; + /* + * There is a chance that this can return -EFAULT as the actual + * set_tid array has only one entry, but we are telling the kernel + * that it has the size MAX_PID_NS_LEVEL. This could lead to a + * situation where copy_from_user() fails. So far it always + * succeeds and copies random data (whatever is after set_tid_small). + */ + test_clone3_set_tid(set_tid_small, MAX_PID_NS_LEVEL, 0, -EINVAL, 0, 0); + + /* + * This can actually work if this test running in a MAX_PID_NS_LEVEL - 1 + * nested PID namespace. + */ + test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL - 1, 0, -EINVAL, 0, 0); + + memset(&set_tid, 0xff, sizeof(set_tid)); + test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL + 1, 0, -E2BIG, 0, 0); + test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 2, 0, -E2BIG, 0, 0); + test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 2 + 1, 0, -E2BIG, 0, 0); + test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 42, 0, -E2BIG, 0, 0); + /* + * This can actually work if this test running in a MAX_PID_NS_LEVEL - 1 + * nested PID namespace. + */ + test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL - 1, 0, -EINVAL, 0, 0); + + memset(&set_tid, 0, sizeof(set_tid)); + /* Try with an invalid PID */ + set_tid[0] = 0; + test_clone3_set_tid(set_tid, 1, 0, -EINVAL, 0, 0); + set_tid[0] = -1; + test_clone3_set_tid(set_tid, 1, 0, -EINVAL, 0, 0); + /* Claim that the set_tid array actually contains 2 elements. */ + test_clone3_set_tid(set_tid, 2, 0, -EINVAL, 0, 0); + /* Try it in a new PID namespace */ + if (uid == 0) + test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EINVAL, 0, 0); + else + test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EPERM, 0, 0); + + /* + * Try with a valid PID (1) but as non-root. This should fail + * with -EPERM if running in the initial user namespace. + * As root it should tell us -EEXIST. + */ + set_tid[0] = 1; + if (uid == 0) + test_clone3_set_tid(set_tid, 1, 0, -EEXIST, 0, 0); + else + test_clone3_set_tid(set_tid, 1, 0, -EPERM, 0, 0); + + /* Try it in a new PID namespace */ + if (uid == 0) + test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, 0, 0, 0); + else + test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EPERM, 0, 0); + + /* pid_max should fail everywhere */ + set_tid[0] = pid_max; + test_clone3_set_tid(set_tid, 1, 0, -EINVAL, 0, 0); + if (uid == 0) + test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EINVAL, 0, 0); + else + test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EPERM, 0, 0); + + if (uid != 0) { + /* + * All remaining tests require root. Tell the framework + * that all those tests are skipped as non-root. + */ + ksft_cnt.ksft_xskip += ksft_plan - ksft_test_num(); + goto out; + } + + /* Find the current active PID */ + pid = fork(); + if (pid == 0) { + ksft_print_msg("Child has PID %d\n", getpid()); + usleep(500); + _exit(EXIT_SUCCESS); + } + (void)wait(NULL); + /* After the child has finished, its PID should be free. */ + set_tid[0] = pid; + test_clone3_set_tid(set_tid, 1, 0, 0, 0, 0); + /* This should fail as there is no PID 1 in that namespace */ + test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EINVAL, 0, 0); + set_tid[0] = 1; + set_tid[1] = pid; + test_clone3_set_tid(set_tid, 2, CLONE_NEWPID, 0, pid, 0); + + ksft_print_msg("unshare PID namespace\n"); + unshare(CLONE_NEWPID); + set_tid[0] = pid; + /* This should fail as there is no PID 1 in that namespace */ + test_clone3_set_tid(set_tid, 1, 0, -EINVAL, 0, 0); + + /* Let's create a PID 1 */ + ns_pid = fork(); + if (ns_pid == 0) { + ksft_print_msg("Child in PID namespace has PID %d\n", getpid()); + set_tid[0] = 2; + test_clone3_set_tid(set_tid, 1, 0, 0, 2, 0); + set_tid[0] = 1; + set_tid[1] = 42; + set_tid[2] = pid; + /* + * This should fail as there are not enough active PID + * namespaces. Again assuming this is running in the host's + * PID namespace. Not yet nested. + */ + test_clone3_set_tid(set_tid, 4, CLONE_NEWPID, -EINVAL, 0, 0); + /* + * This should work and from the parent we should see + * something like 'NSpid: pid 42 1'. + */ + test_clone3_set_tid(set_tid, 3, CLONE_NEWPID, 0, 42, 1); + _exit(ksft_cnt.ksft_pass); + } + + close(pipe_1[1]); + close(pipe_2[0]); + while (read(pipe_1[0], &buf, 1) > 0) { + ksft_print_msg("[%d] Child is ready and waiting\n", getpid()); + break; + } + + asprintf(&proc, "/proc/%d/status", pid); + f = fopen(proc, "r"); + if (f == NULL) + ksft_exit_fail_msg( + "%s - Could not open %s\n", + strerror(errno), proc); + while (fgets(line, 1024, f)) { + if (strstr(line, "NSpid")) { + /* Verify that all generated PIDs are as expected. */ + sscanf(line, "NSpid:\t%d\t%d\t%d", &ns3, &ns2, &ns1); + break; + } + } + fclose(f); + free(proc); + close(pipe_2[0]); + /* Tell the clone3()'d child to finish. */ + write(pipe_2[1], &buf, 1); + close(pipe_2[1]); + + if (wait(&status) < 0) { + ksft_print_msg("Child returned %s\n", strerror(errno)); + ret = -errno; + goto out; + } + if (WEXITSTATUS(status)) + /* + * Update the number of total tests with the tests from the + * child processes. + */ + ksft_cnt.ksft_pass = WEXITSTATUS(status); + + if (ns3 == pid && ns2 == 42 && ns1 == 1) + ksft_test_result_pass( + "PIDs in all namespaces as expected (%d,%d,%d)\n", + ns3, ns2, ns1); + else + ksft_test_result_fail( + "PIDs in all namespaces not as expected (%d,%d,%d)\n", + ns3, ns2, ns1); +out: + ret = 0; + + return !ret ? ksft_exit_pass() : ksft_exit_fail(); +} -- 2.23.0 ^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [PATCH v8 2/2] selftests: add tests for clone3() 2019-11-13 8:03 ` [PATCH v8 2/2] selftests: add tests for clone3() Adrian Reber @ 2019-11-13 12:35 ` Christian Brauner 0 siblings, 0 replies; 5+ messages in thread From: Christian Brauner @ 2019-11-13 12:35 UTC (permalink / raw) To: Adrian Reber Cc: Eric Biederman, Pavel Emelyanov, Jann Horn, Oleg Nesterov, Dmitry Safonov, Rasmus Villemoes, linux-kernel, Andrei Vagin, Mike Rapoport, Radostin Stoyanov On Wed, Nov 13, 2019 at 09:03:01AM +0100, Adrian Reber wrote: > This tests clone3() with *set_tid to see if all desired PIDs are working > as expected. The tests are trying multiple invalid input parameters as > well as creating processes while specifying a certain PID in multiple > PID namespaces at the same time. > > Signed-off-by: Adrian Reber <areber@redhat.com> > --- > tools/testing/selftests/clone3/.gitignore | 1 + > tools/testing/selftests/clone3/Makefile | 2 +- > .../testing/selftests/clone3/clone3_set_tid.c | 345 ++++++++++++++++++ > 3 files changed, 347 insertions(+), 1 deletion(-) > create mode 100644 tools/testing/selftests/clone3/clone3_set_tid.c > > diff --git a/tools/testing/selftests/clone3/.gitignore b/tools/testing/selftests/clone3/.gitignore > index 85d9d3ba2524..d56c3c49d869 100644 > --- a/tools/testing/selftests/clone3/.gitignore > +++ b/tools/testing/selftests/clone3/.gitignore > @@ -1 +1,2 @@ > clone3 > +clone3_set_tid > diff --git a/tools/testing/selftests/clone3/Makefile b/tools/testing/selftests/clone3/Makefile > index ea922c014ae4..2d292545ca8e 100644 > --- a/tools/testing/selftests/clone3/Makefile > +++ b/tools/testing/selftests/clone3/Makefile > @@ -2,6 +2,6 @@ > > CFLAGS += -I../../../../usr/include/ > > -TEST_GEN_PROGS := clone3 > +TEST_GEN_PROGS := clone3 clone3_set_tid Another example, where --base would help me out. :) (Since I can already see that this is missing the clone3_clear_sighand test that's scheduled for 5.5. :)) > > include ../lib.mk > diff --git a/tools/testing/selftests/clone3/clone3_set_tid.c b/tools/testing/selftests/clone3/clone3_set_tid.c > new file mode 100644 > index 000000000000..9a234fd2031e > --- /dev/null > +++ b/tools/testing/selftests/clone3/clone3_set_tid.c > @@ -0,0 +1,345 @@ > +// SPDX-License-Identifier: GPL-2.0 > + > +/* > + * Based on Christian Brauner's clone3() example. > + * These tests are assuming to be running in the host's > + * PID namespace. > + */ > + > +#define _GNU_SOURCE > +#include <errno.h> > +#include <linux/types.h> > +#include <linux/sched.h> > +#include <stdio.h> > +#include <stdlib.h> > +#include <sys/syscall.h> > +#include <sys/types.h> > +#include <sys/un.h> > +#include <sys/wait.h> > +#include <unistd.h> > +#include <sched.h> > + > +#include "../kselftest.h" > + > +#ifndef MAX_PID_NS_LEVEL > +#define MAX_PID_NS_LEVEL 32 > +#endif > + > +static int pipe_1[2]; > +static int pipe_2[2]; > + > +static pid_t raw_clone(struct clone_args *args) > +{ > + return syscall(__NR_clone3, args, sizeof(struct clone_args)); > +} So that function is now present in the clone3.c test you've added and also in the clone3_clear_sighand.c under sys_clone3(). If you take [1] as your base tree you could add patches that consolidate this definition into a header so we avoid the pointless code duplication. (Again, if you pass --base when doing --format-patch this will be transparent to others.) [1]: https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=pidfd > + > +static int call_clone3_set_tid(pid_t *set_tid, > + size_t set_tid_size, > + int flags, > + int expected_pid, > + int wait_for_it) Nit: Why is that an int and not a bool? :) > +{ > + int status; > + int ret = 0; > + pid_t pid = -1; > + struct clone_args args = {0}; > + > + args.flags = flags; > + args.exit_signal = SIGCHLD; > + args.set_tid = (__u64)set_tid; > + args.set_tid_size = set_tid_size; Nit: You can save the {0} by just doing: struct clone_args args = { .flags = flags, .exit_signal = SIGCHLD, .set_tid = (__u64)set_tid, .set_tid_size = set_tid_size, }; Also, I'd prefer if the cast would be done "properly" via #define ptr_to_u64(ptr) ((__u64)((uintptr_t)(ptr))) struct clone_args args = { .flags = flags, .exit_signal = SIGCHLD, .set_tid = ptr_to_u64(set_tid), .set_tid_size = set_tid_size, }; > + > + pid = raw_clone(&args); > + if (pid < 0) { > + ksft_print_msg("%s - Failed to create new process\n", > + strerror(errno)); > + return -errno; > + } > + > + if (pid == 0) { > + char tmp = 0; > + ksft_print_msg("I am the child, my PID is %d (expected %d)\n", > + getpid(), set_tid[0]); > + if (wait_for_it) { > + ksft_print_msg("[%d] Child is ready and waiting\n", getpid()); > + /* Signal the parent that the child is ready */ > + close(pipe_1[0]); > + write(pipe_1[1], &tmp, 1); > + close(pipe_1[1]); > + close(pipe_2[1]); > + read(pipe_2[0], &tmp, 1); > + close(pipe_2[0]); > + } > + > + if (set_tid[0] != getpid()) > + _exit(EXIT_FAILURE); > + _exit(EXIT_SUCCESS); > + } > + > + if (expected_pid == 0 || expected_pid == pid) > + ksft_print_msg("I am the parent (%d). My child's pid is %d\n", > + getpid(), pid); > + else { If one branch needs {} all branches should have {} according to kernel coding style. > + ksft_print_msg( > + "Expected child pid %d does not match actual pid %d\n", > + expected_pid, pid); > + ret = -1; > + } > + > + if (wait(&status) < 0) { > + ksft_print_msg("Child returned %s\n", strerror(errno)); > + return -errno; > + } > + if (WEXITSTATUS(status)) > + return WEXITSTATUS(status); You should probably rather check: if (!WIFEXITED(status)) return -1; return WEXITSTATUS(status)); Since you can also get killed by a signal. > + > + return ret; > +} > + > +static void test_clone3_set_tid(pid_t *set_tid, > + size_t set_tid_size, > + int flags, > + int expected, > + int expected_pid, > + int wait_for_it) > +{ > + int ret; > + > + ksft_print_msg( > + "[%d] Trying clone3() with CLONE_SET_TID to %d and 0x%x\n", > + getpid(), set_tid[0], flags); > + ret = call_clone3_set_tid(set_tid, set_tid_size, flags, expected_pid, > + wait_for_it); > + ksft_print_msg( > + "[%d] clone3() with CLONE_SET_TID %d says :%d - expected %d\n", > + getpid(), set_tid[0], ret, expected); > + if (ret != expected) > + ksft_test_result_fail( > + "[%d] Result (%d) is different than expected (%d)\n", > + getpid(), ret, expected); > + else > + ksft_test_result_pass("[%d] Result (%d) matches expectation (%d)\n", > + getpid(), ret, expected); > +} > +int main(int argc, char *argv[]) > +{ > + FILE *f; > + char buf; > + pid_t pid; > + pid_t ns1; > + pid_t ns2; > + pid_t ns3; Nit: pid_t pid, ns1, ns2, ns3, ns_pid; > + int status; > + char *proc; > + int ret = -1; > + pid_t ns_pid; > + int pid_max = 0; > + uid_t uid = getuid(); > + char line[1024] = {0}; > + pid_t set_tid[MAX_PID_NS_LEVEL * 2]; > + pid_t set_tid_small[1]; > + > + if (pipe(pipe_1) == -1 || pipe(pipe_2)) > + ksft_exit_fail_msg("pipe() failed\n"); Nit: The error checking here is inconsistent. Either do: if (pipe(pipe_1) || pipe(pipe_2)) ksft_exit_fail_msg("pipe() failed\n"); or if (pipe(pipe_1) < 0 || pipe(pipe_2) < 0) ksft_exit_fail_msg("pipe() failed\n"); or == -1 but not mixed, please. :) > + > + ksft_print_header(); > + ksft_set_plan(27); Now that's ambitious testing. I like it. :) > + > + f = fopen("/proc/sys/kernel/pid_max", "r"); > + if (f == NULL) > + ksft_exit_fail_msg( > + "%s - Could not open /proc/sys/kernel/pid_max\n", > + strerror(errno)); > + fscanf(f, "%d", &pid_max); > + fclose(f); > + ksft_print_msg("/proc/sys/kernel/pid_max %d\n", pid_max); > + > + /* Try invalid settings */ > + memset(&set_tid, 0, sizeof(set_tid)); > + test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL + 1, 0, -E2BIG, 0, 0); > + test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 2, 0, -E2BIG, 0, 0); > + test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 2 + 1, 0, -E2BIG, 0, 0); > + test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 42, 0, -E2BIG, 0, 0); > + > + /* small set_tid array, but maximum set_tid_size */ > + /* Find the current active PID */ > + pid = fork(); > + if (pid == 0) { > + ksft_print_msg("Child has PID %d\n", getpid()); > + _exit(EXIT_SUCCESS); > + } > + (void)wait(NULL); > + /* After the child has finished, its PID should be free. */ > + set_tid_small[0] = pid; > + /* > + * There is a chance that this can return -EFAULT as the actual > + * set_tid array has only one entry, but we are telling the kernel > + * that it has the size MAX_PID_NS_LEVEL. This could lead to a > + * situation where copy_from_user() fails. So far it always > + * succeeds and copies random data (whatever is after set_tid_small). > + */ > + test_clone3_set_tid(set_tid_small, MAX_PID_NS_LEVEL, 0, -EINVAL, 0, 0); Hm, I'm not sure that test makes a lot of sense. > + > + /* > + * This can actually work if this test running in a MAX_PID_NS_LEVEL - 1 > + * nested PID namespace. > + */ > + test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL - 1, 0, -EINVAL, 0, 0); > + > + memset(&set_tid, 0xff, sizeof(set_tid)); > + test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL + 1, 0, -E2BIG, 0, 0); > + test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 2, 0, -E2BIG, 0, 0); > + test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 2 + 1, 0, -E2BIG, 0, 0); > + test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL * 42, 0, -E2BIG, 0, 0); > + /* > + * This can actually work if this test running in a MAX_PID_NS_LEVEL - 1 > + * nested PID namespace. > + */ > + test_clone3_set_tid(set_tid, MAX_PID_NS_LEVEL - 1, 0, -EINVAL, 0, 0); > + > + memset(&set_tid, 0, sizeof(set_tid)); > + /* Try with an invalid PID */ > + set_tid[0] = 0; > + test_clone3_set_tid(set_tid, 1, 0, -EINVAL, 0, 0); > + set_tid[0] = -1; > + test_clone3_set_tid(set_tid, 1, 0, -EINVAL, 0, 0); > + /* Claim that the set_tid array actually contains 2 elements. */ > + test_clone3_set_tid(set_tid, 2, 0, -EINVAL, 0, 0); > + /* Try it in a new PID namespace */ > + if (uid == 0) > + test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EINVAL, 0, 0); > + else > + test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EPERM, 0, 0); [1]: I mean, you could really just do if (uid == 0) test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EINVAL, 0, 0); the else branch doesn't really test anything meaningful other than you can't create new pid namespaces as an unprivileged user without also creating a new user namespace. > + > + /* > + * Try with a valid PID (1) but as non-root. This should fail > + * with -EPERM if running in the initial user namespace. > + * As root it should tell us -EEXIST. > + */ > + set_tid[0] = 1; > + if (uid == 0) > + test_clone3_set_tid(set_tid, 1, 0, -EEXIST, 0, 0); > + else > + test_clone3_set_tid(set_tid, 1, 0, -EPERM, 0, 0); See [1]. > + > + /* Try it in a new PID namespace */ > + if (uid == 0) > + test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, 0, 0, 0); > + else > + test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EPERM, 0, 0); See [1]. > + > + /* pid_max should fail everywhere */ > + set_tid[0] = pid_max; > + test_clone3_set_tid(set_tid, 1, 0, -EINVAL, 0, 0); > + if (uid == 0) > + test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EINVAL, 0, 0); > + else > + test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EPERM, 0, 0); See [1]. > + > + if (uid != 0) { > + /* > + * All remaining tests require root. Tell the framework > + * that all those tests are skipped as non-root. > + */ > + ksft_cnt.ksft_xskip += ksft_plan - ksft_test_num(); > + goto out; > + } > + > + /* Find the current active PID */ > + pid = fork(); > + if (pid == 0) { > + ksft_print_msg("Child has PID %d\n", getpid()); > + usleep(500); Why the usleep(500)? > + _exit(EXIT_SUCCESS); > + } > + (void)wait(NULL); \n > + /* After the child has finished, its PID should be free. */ > + set_tid[0] = pid; > + test_clone3_set_tid(set_tid, 1, 0, 0, 0, 0); \n > + /* This should fail as there is no PID 1 in that namespace */ > + test_clone3_set_tid(set_tid, 1, CLONE_NEWPID, -EINVAL, 0, 0); Nit: please add \n between a two tests everywhere, especially if they have comments on top of it. This is a little hard to read. :) > + set_tid[0] = 1; > + set_tid[1] = pid; > + test_clone3_set_tid(set_tid, 2, CLONE_NEWPID, 0, pid, 0); This is missing a comment. > + > + ksft_print_msg("unshare PID namespace\n"); > + unshare(CLONE_NEWPID); Why no error checking here when the next line assumes that there is no PID 1 in that namespace? > + set_tid[0] = pid; > + /* This should fail as there is no PID 1 in that namespace */ > + test_clone3_set_tid(set_tid, 1, 0, -EINVAL, 0, 0); > + > + /* Let's create a PID 1 */ > + ns_pid = fork(); > + if (ns_pid == 0) { > + ksft_print_msg("Child in PID namespace has PID %d\n", getpid()); > + set_tid[0] = 2; > + test_clone3_set_tid(set_tid, 1, 0, 0, 2, 0); > + set_tid[0] = 1; > + set_tid[1] = 42; > + set_tid[2] = pid; > + /* > + * This should fail as there are not enough active PID > + * namespaces. Again assuming this is running in the host's > + * PID namespace. Not yet nested. > + */ > + test_clone3_set_tid(set_tid, 4, CLONE_NEWPID, -EINVAL, 0, 0); > + /* > + * This should work and from the parent we should see > + * something like 'NSpid: pid 42 1'. [2]: You could verify this... In tools/testing/selftests/pidfd/pidfd_fdinfo_test.c in my pidfd tree is code that does this. You could just copy it. Look for verify_fdinfo(). Though, I'll it for you to judge whether this is too much hazzle. > + */ > + test_clone3_set_tid(set_tid, 3, CLONE_NEWPID, 0, 42, 1); > + _exit(ksft_cnt.ksft_pass); Again, the missing \n makes the different tests hard to follow. > + } > + > + close(pipe_1[1]); > + close(pipe_2[0]); > + while (read(pipe_1[0], &buf, 1) > 0) { > + ksft_print_msg("[%d] Child is ready and waiting\n", getpid()); > + break; > + } > + > + asprintf(&proc, "/proc/%d/status", pid); char proc_path[100]; snprintf(proc_path, sizeof(proc_path), "/proc/%d/status", pid); should do just fine and doesn't require you to free memory. > + f = fopen(proc, "r"); > + if (f == NULL) > + ksft_exit_fail_msg( > + "%s - Could not open %s\n", > + strerror(errno), proc); > + while (fgets(line, 1024, f)) { Ugh, fgets... :) I'd feel much happier with getline(). > + if (strstr(line, "NSpid")) { > + /* Verify that all generated PIDs are as expected. */ > + sscanf(line, "NSpid:\t%d\t%d\t%d", &ns3, &ns2, &ns1); > + break; Oh this is the verification code for fdinfo, I mentioned in [2]. Again, maybe you want to copy verify_fdinfo() and move this into a helper. > + } > + } > + fclose(f); > + free(proc); > + close(pipe_2[0]); \n > + /* Tell the clone3()'d child to finish. */ > + write(pipe_2[1], &buf, 1); > + close(pipe_2[1]); > + > + if (wait(&status) < 0) { > + ksft_print_msg("Child returned %s\n", strerror(errno)); > + ret = -errno; > + goto out; > + } > + if (WEXITSTATUS(status)) Nit: This should probably also verify if (WIFEXITED(status)). ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v8 1/2] fork: extend clone3() to support setting a PID 2019-11-13 8:03 [PATCH v8 1/2] fork: extend clone3() to support setting a PID Adrian Reber 2019-11-13 8:03 ` [PATCH v8 2/2] selftests: add tests for clone3() Adrian Reber @ 2019-11-13 11:50 ` Christian Brauner 2019-11-13 13:42 ` Oleg Nesterov 2 siblings, 0 replies; 5+ messages in thread From: Christian Brauner @ 2019-11-13 11:50 UTC (permalink / raw) To: Adrian Reber Cc: Eric Biederman, Pavel Emelyanov, Jann Horn, Oleg Nesterov, Dmitry Safonov, Rasmus Villemoes, linux-kernel, Andrei Vagin, Mike Rapoport, Radostin Stoyanov On Wed, Nov 13, 2019 at 09:03:00AM +0100, Adrian Reber wrote: > The main motivation to add set_tid to clone3() is CRIU. > > To restore a process with the same PID/TID CRIU currently uses > /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to > ns_last_pid and then (quickly) does a clone(). This works most of the > time, but it is racy. It is also slow as it requires multiple syscalls. > > Extending clone3() to support *set_tid makes it possible restore a > process using CRIU without accessing /proc/sys/kernel/ns_last_pid and > race free (as long as the desired PID/TID is available). > > This clone3() extension places the same restrictions (CAP_SYS_ADMIN) > on clone3() with *set_tid as they are currently in place for ns_last_pid. > > The original version of this change was using a single value for > set_tid. At the 2019 LPC, after presenting set_tid, it was, however, > decided to change set_tid to an array to enable setting the PID of a > process in multiple PID namespaces at the same time. If a process is > created in a PID namespace it is possible to influence the PID inside > and outside of the PID namespace. Details also in the corresponding > selftest. > > To create a process with the following PIDs: > > PID NS level Requested PID > 0 (host) 31496 > 1 42 > 2 1 > > For that example the two newly introduced parameters to struct > clone_args (set_tid and set_tid_size) would need to be: > > set_tid[0] = 1; > set_tid[1] = 42; > set_tid[2] = 31496; > set_tid_size = 3; > > If only the PIDs of the two innermost nested PID namespaces should be > defined it would look like this: > > set_tid[0] = 1; > set_tid[1] = 42; > set_tid_size = 2; > > The PID of the newly created process would then be the next available > free PID in the PID namespace level 0 (host) and 42 in the PID namespace > at level 1 and the PID of the process in the innermost PID namespace > would be 1. > > The set_tid array is used to specify the PID of a process starting > from the innermost nested PID namespaces up to set_tid_size PID namespaces. > > set_tid_size cannot be larger then the current PID namespace level. > > Signed-off-by: Adrian Reber <areber@redhat.com> Adrian, when you resend, can you please add --base=<commit> with the base commit this series applies to? This makes my life easier when applying this series. Other from missing kernel-doc (see below) I don't have any further complaints atm. > --- > v2: > - Removed (size < sizeof(struct clone_args)) as discussed with > Christian and Dmitry > - Added comment to ((set_tid != 1) && idr_get_cursor() <= 1) (Oleg) > - Use idr_alloc() instead of idr_alloc_cyclic() (Oleg) > > v3: > - Return EEXIST if PID is already in use (Christian) > - Drop CLONE_SET_TID (Christian and Oleg) > - Use idr_is_empty() instead of idr_get_cursor() (Oleg) > - Handle different `struct clone_args` sizes (Dmitry) > > v4: > - Rework struct size check with defines (Christian) > - Reduce number of set_tid checks (Oleg) > - Less parentheses and more robust code (Oleg) > - Do ns_capable() on correct user_ns (Oleg, Christian) > > v5: > - make set_tid checks earlier in alloc_pid() (Christian) > - remove unnecessary comment and struct size check (Christian) > > v6: > - remove CLONE_SET_TID from description (Christian) > - add clone3() tests for different clone_args sizes (Christian) > - move more set_tid checks to alloc_pid() (Oleg) > - make all set_tid checks lockless (Oleg) > > v7: > - changed set_tid to be an array to set the PID of a process > in multiple nested PID namespaces at the same time as discussed > at LPC 2019 (container MC) > > v8: > - skip unnecessary memset() (Rasmus) > - replace set_tid copy loop with a single copy (Christian) > - more parameter error checking (Christian) > - cache set_tid in alloc_pid() (Oleg) > - move code in "else" branch (Oleg) > --- > include/linux/pid.h | 3 +- > include/linux/pid_namespace.h | 2 ++ > include/linux/sched/task.h | 3 ++ > include/uapi/linux/sched.h | 2 ++ > kernel/fork.c | 24 ++++++++++++- > kernel/pid.c | 64 +++++++++++++++++++++++++++-------- > kernel/pid_namespace.c | 2 -- > 7 files changed, 82 insertions(+), 18 deletions(-) > > diff --git a/include/linux/pid.h b/include/linux/pid.h > index 9645b1194c98..034b7df25888 100644 > --- a/include/linux/pid.h > +++ b/include/linux/pid.h > @@ -120,7 +120,8 @@ extern struct pid *find_vpid(int nr); > extern struct pid *find_get_pid(int nr); > extern struct pid *find_ge_pid(int nr, struct pid_namespace *); > > -extern struct pid *alloc_pid(struct pid_namespace *ns); > +extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid, > + size_t set_tid_size); > extern void free_pid(struct pid *pid); > extern void disable_pid_allocation(struct pid_namespace *ns); > > diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h > index 49538b172483..2ed6af88794b 100644 > --- a/include/linux/pid_namespace.h > +++ b/include/linux/pid_namespace.h > @@ -12,6 +12,8 @@ > #include <linux/ns_common.h> > #include <linux/idr.h> > > +/* MAX_PID_NS_LEVEL is needed for limiting size of 'struct pid' */ > +#define MAX_PID_NS_LEVEL 32 > > struct fs_pin; > > diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h > index 4b1c3b664f51..f1879884238e 100644 > --- a/include/linux/sched/task.h > +++ b/include/linux/sched/task.h > @@ -26,6 +26,9 @@ struct kernel_clone_args { > unsigned long stack; > unsigned long stack_size; > unsigned long tls; > + pid_t *set_tid; > + /* Number of elements in *set_tid */ > + size_t set_tid_size; > }; > > /* > diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h > index 25b4fa00bad1..13f74c40a629 100644 > --- a/include/uapi/linux/sched.h > +++ b/include/uapi/linux/sched.h > @@ -72,6 +72,8 @@ struct clone_args { > __aligned_u64 stack; > __aligned_u64 stack_size; > __aligned_u64 tls; > + __aligned_u64 set_tid; > + __aligned_u64 set_tid_size; Please add kernel-doc comments for these two new fields to the top of the like we did for all the other fields. Christian ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v8 1/2] fork: extend clone3() to support setting a PID 2019-11-13 8:03 [PATCH v8 1/2] fork: extend clone3() to support setting a PID Adrian Reber 2019-11-13 8:03 ` [PATCH v8 2/2] selftests: add tests for clone3() Adrian Reber 2019-11-13 11:50 ` [PATCH v8 1/2] fork: extend clone3() to support setting a PID Christian Brauner @ 2019-11-13 13:42 ` Oleg Nesterov 2 siblings, 0 replies; 5+ messages in thread From: Oleg Nesterov @ 2019-11-13 13:42 UTC (permalink / raw) To: Adrian Reber Cc: Christian Brauner, Eric Biederman, Pavel Emelyanov, Jann Horn, Dmitry Safonov, Rasmus Villemoes, linux-kernel, Andrei Vagin, Mike Rapoport, Radostin Stoyanov On 11/13, Adrian Reber wrote: > > @@ -175,23 +187,47 @@ struct pid *alloc_pid(struct pid_namespace *ns) > > for (i = ns->level; i >= 0; i--) { > int pid_min = 1; Well, this is really minor but again, pid_min is only used in the "else" branch below, you can move this declaration there. > + > + if (set_tid_size) { > + tid = set_tid[ns->level - i]; > + if (tid < 1 || tid >= pid_max) > + return ERR_PTR(-EINVAL); > + /* Also fail if a PID != 1 is requested and no PID 1 exists */ > + if (tid != 1 && !tmp->child_reaper) > + return ERR_PTR(-EINVAL); > + if (!ns_capable(tmp->user_ns, CAP_SYS_ADMIN)) > + return ERR_PTR(-EPERM); > + } > > idr_preload(GFP_KERNEL); > spin_lock_irq(&pidmap_lock); > > - /* > - * init really needs pid 1, but after reaching the maximum > - * wrap back to RESERVED_PIDS > - */ > - if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS) > - pid_min = RESERVED_PIDS; > - > - /* > - * Store a null pointer so find_pid_ns does not find > - * a partially initialized PID (see below). > - */ > - nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min, > - pid_max, GFP_ATOMIC); > + if (tid) { > + nr = idr_alloc(&tmp->idr, NULL, tid, > + tid + 1, GFP_ATOMIC); > + /* > + * If ENOSPC is returned it means that the PID is > + * alreay in use. Return EEXIST in that case. > + */ > + if (nr == -ENOSPC) > + nr = -EEXIST; > + set_tid_size--; ^^^^^^^^^^^^^^^ May I ask you to move this decrement into the "if (set_tid_size)" block above? Oleg. ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2019-11-13 13:42 UTC | newest] Thread overview: 5+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2019-11-13 8:03 [PATCH v8 1/2] fork: extend clone3() to support setting a PID Adrian Reber 2019-11-13 8:03 ` [PATCH v8 2/2] selftests: add tests for clone3() Adrian Reber 2019-11-13 12:35 ` Christian Brauner 2019-11-13 11:50 ` [PATCH v8 1/2] fork: extend clone3() to support setting a PID Christian Brauner 2019-11-13 13:42 ` Oleg Nesterov
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox