From: Szabolcs Nagy <nsz@port70.net>
To: Christian Brauner <christian@brauner.io>
Cc: viro@zeniv.linux.org.uk, linux-kernel@vger.kernel.org,
torvalds@linux-foundation.org, jannh@google.com,
fweimer@redhat.com, oleg@redhat.com, arnd@arndb.de,
dhowells@redhat.com, Pavel Emelyanov <xemul@virtuozzo.com>,
Andrew Morton <akpm@linux-foundation.org>,
Adrian Reber <adrian@lisas.de>, Andrei Vagin <avagin@gmail.com>,
linux-api@vger.kernel.org
Subject: Re: [PATCH v1 1/2] fork: add clone3
Date: Thu, 30 May 2019 15:20:12 +0200 [thread overview]
Message-ID: <20190530132012.GS16415@port70.net> (raw)
In-Reply-To: <20190529152237.10719-1-christian@brauner.io>
* Christian Brauner <christian@brauner.io> [2019-05-29 17:22:36 +0200]:
> This adds the clone3 system call.
>
> As mentioned several times already (cf. [7], [8]) here's the promised
> patchset for clone3().
>
> We recently merged the CLONE_PIDFD patchset (cf. [1]). It took the last
> free flag from clone().
>
> Independent of the CLONE_PIDFD patchset a time namespace has been discussed
> at Linux Plumber Conference last year and has been sent out and reviewed
> (cf. [5]). It is expected that it will go upstream in the not too distant
> future. However, it relies on the addition of the CLONE_NEWTIME flag to
> clone(). The only other good candidate - CLONE_DETACHED - is currently not
> recyclable as we have identified at least two large or widely used
> codebases that currently pass this flag (cf. [2], [3], and [4]). Given that
> CLONE_PIDFD grabbed the last clone() flag the time namespace is effectively
> blocked. clone3() has the advantage that it will unblock this patchset
> again.
>
> The idea is to keep clone3() very simple and close to the original clone(),
> specifically, to keep on supporting old clone()-based workloads.
> We know there have been various creative proposals how a new process
> creation syscall or even api is supposed to look like. Some people even
> going so far as to argue that the traditional fork()+exec() split should be
> abandoned in favor of an in-kernel version of spawn(). Independent of
> whether or not we personally think spawn() is a good idea this patchset has
> and does not want to have anything to do with this.
> One stance we take is that there's no real good alternative to
> clone()+exec() and we need and want to support this model going forward;
> independent of spawn().
> The following requirements guided clone3():
> - bump the number of available flags
> - move arguments that are currently passed as separate arguments
> in clone() into a dedicated struct clone_args
> - choose a struct layout that is easy to handle on 32 and on 64 bit
> - choose a struct layout that is extensible
> - give new flags that currently need to abuse another flag's dedicated
> return argument in clone() their own dedicated return argument
> (e.g. CLONE_PIDFD)
> - use a separate kernel internal struct kernel_clone_args that is
> properly typed according to current kernel conventions in fork.c and is
> different from the uapi struct clone_args
> - port _do_fork() to use kernel_clone_args so that all process creation
> syscalls such as fork(), vfork(), clone(), and clone3() behave identical
> (Arnd suggested, that we can probably also port do_fork() itself in a
> separate patchset.)
> - ease of transition for userspace from clone() to clone3()
> This very much means that we do *not* remove functionality that userspace
> currently relies on as the latter is a good way of creating a syscall
> that won't be adopted.
> - do not try to be clever or complex: keep clone3() as dumb as possible
>
> In accordance with Linus suggestions, clone3() has the following signature:
>
> /* uapi */
> struct clone_args {
> __aligned_u64 flags;
> __aligned_u64 pidfd;
> __aligned_u64 parent_tidptr;
> __aligned_u64 child_tidptr;
> __aligned_u64 stack;
> __aligned_u64 stack_size;
> __aligned_u64 tls;
> };
is this new linux syscall api style to pass pointers as u64?
i think it will look a bit ugly in userspace where cast
to u64 would signextend pointers on most 32bit targets, so
user code would have to do something like
arg.ptr = (uint64_t)(uintptr_t)ptr;
such ugliness can be hidden by the libc with a different
struct definition, but it will require bigendian and alignment
hackery (or translation in libc, but that does not really work
when user calls raw syscall).
>
> /* kernel internal */
> struct kernel_clone_args {
> u64 flags;
> int __user *pidfd;
> int __user *parent_tidptr;
> int __user *child_tidptr;
> unsigned long stack;
> unsigned long stack_size;
> unsigned long tls;
> };
>
> long sys_clone3(struct clone_args __user *uargs, size_t size)
>
> clone3() cleanly supports all of the supported flags from clone() and thus
> all legacy workloads.
> The advantage of sticking close to the old clone() is the low cost for
> userspace to switch to this new api. Quite a lot of userspace apis (e.g.
> pthreads) are based on the clone() syscall. With the new clone3() syscall
> supporting all of the old workloads and opening up the ability to add new
> features should make switching to it for userspace more appealing. In
> essence, glibc can just write a simple wrapper to switch from clone() to
> clone3().
>
> There has been some interest in this patchset already. We have received a
> patch from the CRIU corner for clone3() that would set the PID/TID of a
> restored process without /proc/sys/kernel/ns_last_pid to eliminate a race.
>
> /* References */
> [1]: b3e5838252665ee4cfa76b82bdf1198dca81e5be
> [2]: https://dxr.mozilla.org/mozilla-central/source/security/sandbox/linux/SandboxFilter.cpp#343
> [3]: https://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_create.c#n233
> [4]: https://sources.debian.org/src/blcr/0.8.5-2.3/cr_module/cr_dump_self.c/?hl=740#L740
> [5]: https://lore.kernel.org/lkml/20190425161416.26600-1-dima@arista.com/
> [6]: https://lore.kernel.org/lkml/20190425161416.26600-2-dima@arista.com/
> [7]: https://lore.kernel.org/lkml/CAHrFyr5HxpGXA2YrKza-oB-GGwJCqwPfyhD-Y5wbktWZdt0sGQ@mail.gmail.com/
> [8]: https://lore.kernel.org/lkml/20190524102756.qjsjxukuq2f4t6bo@brauner.io/
>
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Christian Brauner <christian@brauner.io>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Pavel Emelyanov <xemul@virtuozzo.com>
> Cc: Jann Horn <jannh@google.com>
> Cc: David Howells <dhowells@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Adrian Reber <adrian@lisas.de>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Andrei Vagin <avagin@gmail.com>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Florian Weimer <fweimer@redhat.com>
> Cc: linux-api@vger.kernel.org
> --
> v1:
> - Linus Torvalds <torvalds@linux-foundation.org>:
> - redesign based on Linus proposal
> - switch from arg-based to revision-based naming scheme: s/clone6/clone3/
> - Arnd Bergmann <arnd@arndb.de>:
> - use a single copy_from_user() instead of multiple get_user() calls
> since the latter have a constant overhead on some architectures
> - a range of other tweaks and suggestions
> ---
> arch/x86/ia32/sys_ia32.c | 11 ++-
> include/linux/sched/task.h | 13 ++-
> include/linux/syscalls.h | 6 ++
> include/uapi/linux/sched.h | 16 ++++
> kernel/fork.c | 176 ++++++++++++++++++++++++++++---------
> 5 files changed, 177 insertions(+), 45 deletions(-)
...
next prev parent reply other threads:[~2019-05-30 13:20 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-05-29 15:22 [PATCH v1 1/2] fork: add clone3 Christian Brauner
2019-05-29 15:22 ` [PATCH v1 2/2] arch: wire-up clone3() syscall on x86 Christian Brauner
2019-05-29 15:42 ` [PATCH v1 1/2] fork: add clone3 Yann Droneaud
2019-05-31 22:08 ` Christian Brauner
2019-06-01 8:49 ` Yann Droneaud
2019-05-29 22:24 ` Andrei Vagin
2019-05-31 20:38 ` Linus Torvalds
2019-05-31 22:13 ` Christian Brauner
2019-05-31 22:08 ` Christian Brauner
2019-05-30 13:20 ` Szabolcs Nagy [this message]
2019-05-31 8:14 ` Arnd Bergmann
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190530132012.GS16415@port70.net \
--to=nsz@port70.net \
--cc=adrian@lisas.de \
--cc=akpm@linux-foundation.org \
--cc=arnd@arndb.de \
--cc=avagin@gmail.com \
--cc=christian@brauner.io \
--cc=dhowells@redhat.com \
--cc=fweimer@redhat.com \
--cc=jannh@google.com \
--cc=linux-api@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=oleg@redhat.com \
--cc=torvalds@linux-foundation.org \
--cc=viro@zeniv.linux.org.uk \
--cc=xemul@virtuozzo.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.