Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events
From: kbuild test robot @ 2019-05-26 20:24 UTC (permalink / raw)
  To: Renzo Davoli
  Cc: kbuild-all, Alexander Viro, Davide Libenzi, linux-fsdevel,
	linux-kernel, linux-api
In-Reply-To: <20190526142521.GA21842@cs.unibo.it>

[-- Attachment #1: Type: text/plain, Size: 3479 bytes --]

Hi Renzo,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linus/master]
[also build test ERROR on v5.2-rc1 next-20190524]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Renzo-Davoli/eventfd-new-tag-EFD_VPOLL-generate-epoll-events/20190527-023620
config: i386-randconfig-x011-201921 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

If you fix the issue, kindly add following tag
Reported-by: kbuild test robot <lkp@intel.com>

All error/warnings (new ones prefixed by >>):

   In file included from fs/eventfd.c:23:0:
   fs/eventfd.c: In function 'eventfd_vpoll_write':
>> include/linux/eventfd.h:31:34: warning: left shift count >= width of type [-Wshift-count-overflow]
    #define EFD_VPOLL_ADDEVENTS (1UL << 32)
                                     ^
>> fs/eventfd.c:360:7: note: in expansion of macro 'EFD_VPOLL_ADDEVENTS'
     case EFD_VPOLL_ADDEVENTS:
          ^~~~~~~~~~~~~~~~~~~
   include/linux/eventfd.h:32:34: warning: left shift count >= width of type [-Wshift-count-overflow]
    #define EFD_VPOLL_DELEVENTS (2UL << 32)
                                     ^
>> fs/eventfd.c:363:7: note: in expansion of macro 'EFD_VPOLL_DELEVENTS'
     case EFD_VPOLL_DELEVENTS:
          ^~~~~~~~~~~~~~~~~~~
>> fs/eventfd.c:363:2: error: duplicate case value
     case EFD_VPOLL_DELEVENTS:
     ^~~~
   fs/eventfd.c:360:2: note: previously used here
     case EFD_VPOLL_ADDEVENTS:
     ^~~~
   In file included from fs/eventfd.c:23:0:
   include/linux/eventfd.h:33:34: warning: left shift count >= width of type [-Wshift-count-overflow]
    #define EFD_VPOLL_MODEVENTS (3UL << 32)
                                     ^
>> fs/eventfd.c:366:7: note: in expansion of macro 'EFD_VPOLL_MODEVENTS'
     case EFD_VPOLL_MODEVENTS:
          ^~~~~~~~~~~~~~~~~~~
   fs/eventfd.c:366:2: error: duplicate case value
     case EFD_VPOLL_MODEVENTS:
     ^~~~
   fs/eventfd.c:360:2: note: previously used here
     case EFD_VPOLL_ADDEVENTS:
     ^~~~

vim +363 fs/eventfd.c

   342	
   343	static ssize_t eventfd_vpoll_write(struct file *file, const char __user *buf,
   344			size_t count, loff_t *ppos)
   345	{
   346		struct eventfd_ctx *ctx = file->private_data;
   347		ssize_t res;
   348		__u64 ucnt;
   349		__u32 events;
   350	
   351		if (count < sizeof(ucnt))
   352			return -EINVAL;
   353		if (copy_from_user(&ucnt, buf, sizeof(ucnt)))
   354			return -EFAULT;
   355		spin_lock_irq(&ctx->wqh.lock);
   356	
   357		events = ucnt & EPOLLALLMASK;
   358		res = sizeof(ucnt);
   359		switch (ucnt & ~((__u64)EPOLLALLMASK)) {
 > 360		case EFD_VPOLL_ADDEVENTS:
   361			ctx->count |= events;
   362			break;
 > 363		case EFD_VPOLL_DELEVENTS:
   364			ctx->count &= ~(events);
   365			break;
 > 366		case EFD_VPOLL_MODEVENTS:
   367			ctx->count = (ctx->count & ~EPOLLALLMASK) | events;
   368			break;
   369		default:
   370			res = -EINVAL;
   371		}
   372	
   373		/* wake up waiting threads */
   374		if (res >= 0 && waitqueue_active(&ctx->wqh))
   375			wake_up_locked_poll(&ctx->wqh, res);
   376	
   377		spin_unlock_irq(&ctx->wqh.lock);
   378	
   379		return res;
   380	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 36461 bytes --]

^ permalink raw reply

* Re: [PATCH v2 1/2] open: add close_range()
From: Szabolcs Nagy @ 2019-05-26 20:20 UTC (permalink / raw)
  To: Christian Brauner
  Cc: viro, linux-kernel, linux-fsdevel, linux-api, torvalds, fweimer,
	jannh, oleg, tglx, arnd, shuah, dhowells, tkjos, ldv, miklos,
	linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k, linux-mips,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, linux-arch, linux-kselftest, x86
In-Reply-To: <20190523154747.15162-2-christian@brauner.io>

* Christian Brauner <christian@brauner.io> [2019-05-23 17:47:46 +0200]:
> This adds the close_range() syscall. It allows to efficiently close a range
> of file descriptors up to all file descriptors of a calling task.
> 
> The syscall came up in a recent discussion around the new mount API and
> making new file descriptor types cloexec by default. During this
> discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
> syscall in this manner has been requested by various people over time.
> 
> First, it helps to close all file descriptors of an exec()ing task. This
> can be done safely via (quoting Al's example from [1] verbatim):
> 
>         /* that exec is sensitive */
>         unshare(CLONE_FILES);
>         /* we don't want anything past stderr here */
>         close_range(3, ~0U);
>         execve(....);

this does not work in a hosted c implementation unless the libc
guarantees not to use libc internal fds (e.g. in execve).
(the libc cannot easily abstract fds, so the syscall abi layer
fd semantics is necessarily visible to user code.)

i think this is a new constraint for userspace runtimes.
(not entirely unreasonable though)

> The code snippet above is one way of working around the problem that file
> descriptors are not cloexec by default. This is aggravated by the fact that
> we can't just switch them over without massively regressing userspace. For
> a whole class of programs having an in-kernel method of closing all file
> descriptors is very helpful (e.g. demons, service managers, programming
> language standard libraries, container managers etc.).

was cloexec_range(a,b) considered?

> (Please note, unshare(CLONE_FILES) should only be needed if the calling
>  task is multi-threaded and shares the file descriptor table with another
>  thread in which case two threads could race with one thread allocating
>  file descriptors and the other one closing them via close_range(). For the
>  general case close_range() before the execve() is sufficient.)

assuming there is no unblocked signal handler that may open fds.

a syscall that tramples on fds not owned by the caller is ugly
(not generally safe to use and may break things if it gets used),
i don't have a better solution for fd leaks or missing cloexec,
but i think it needs more analysis how it can be used.

^ permalink raw reply

* Re: [PATCH 1/2] fork: add clone6
From: Linus Torvalds @ 2019-05-26 16:50 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Al Viro, Linux List Kernel Mailing, Jann Horn, Florian Weimer,
	Oleg Nesterov, Arnd Bergmann, David Howells, Pavel Emelyanov,
	Andrew Morton, Adrian Reber, Andrei Vagin, Linux API
In-Reply-To: <20190526102612.6970-1-christian@brauner.io>

On Sun, May 26, 2019 at 3:27 AM Christian Brauner <christian@brauner.io> wrote:
>
> This adds the clone6 system call.

No, this is not acceptable.

> +       struct clone6_args args = {

First of all, we don't pass in "clone6_args" to the actual implementation.

Passing in lots of args as a structure is fine. But it damn well isn't
a "clone6" structure. It's just a "clone_args" inside the kernel, and
there should be a separate clone_uapi_args for the user/kernel
interface.

The user interface (in the uapi file) may be called "clone6_args", but
that is *not* then what we pass around inside the kernel.

But even for the uapi version, the "clone6" name doesn't make any
sense as a name to begin with. It's not the sixth revision of clone,
and that clone6 structure isn't even "six arguments", because it has
lots of other arguments (with the extra flags registers you did, but
I'll get to that later).

Finally, the actual definition of that thing is wrong for a uapi interface too:

   struct clone6_args {
          __s32 pidfd;
          __aligned_u64 parent_tidptr;
          __aligned_u64 child_tidptr;
          ...

because now it has a hole in that structure definition because of
alignment issues. So make "pidfd" be 64-bit too. You clearly tried to
make this be compat-aware etc, but we really don't want to have parts
of user structures with random padding that we then don't check.

So the *user* visible structure should be full of those __aligned_u64
to make sure everything is aligned and doesn't care about compat
models.

But the *kernel* structure that we use should be nice and tailored for
kernel use, and use proper kernel types.

And related to that disgusting thing:

> -extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *, unsigned long);
> +extern long _do_fork(u64 clone_flags, struct clone6_args *uargs);

This is the correct thing to do (except for the "clone6" name), but:

>  static __latent_entropy struct task_struct *copy_process(
> -                                       unsigned long clone_flags,
> -                                       unsigned long stack_start,
> -                                       unsigned long stack_size,
> -                                       int __user *parent_tidptr,
> -                                       int __user *child_tidptr,
> +                                       u64 clone_flags,
>                                         struct pid *pid,
>                                         int trace,
> -                                       unsigned long tls,
> -                                       int node)
> +                                       int node,
> +                                       struct clone6_args *args,
> +                                       bool is_clone6)

But this is absolutely wrong.

That "bool is_clone6" is garbage.

The in-kernel "struict clone_args" should just be everything that
clone needs to know.

So the in-kernel "struct clone_args" should never ever need some
"is_clone6" boolean to specify how you then treat the arguments.

Stuff like this:

> +       int __user *child_tidptr = u64_to_user_ptr(args->child_tidptr);

should have been done in the stubs that set up the "struct clone_args"
thing, not in copy_process().

So all those

        if (is_clone6) ...

things need to go away, and it just needs to be the caller (who knows
what kind of clone call it is) setting up the clone_args properly,
depending on the *different* user interfaces.

And no, we don't do crazy stuff like this either:

+SYSCALL_DEFINE6(clone6, struct clone6_args __user*, uargs,
+                       unsigned int, flags1,
+                       unsigned int, flags2,
+                       unsigned int, flags3,
+                       unsigned int, flags4,
+                       unsigned int, flags5)

where this is wrong because it randomly just decides that everything
is flags (BS), and then doubles down on stupidity by making them
"unsigned int", so that the tests of the flags actually don't test the
upper bits anyway.

Why would you reserve 5 words of flags for future use when you have a
whole structure in user space that you _didn't_ reserve anything in?

So remove all those random flags, put ONE of them in the structure you
already have (as a "__aligned_u64", so that you actually get 64 bits,
not the 32 in "unsigned int"), and then perhaps you can add *one*
other register argument, which is the *size* of the structure that
user space is passing in.

That way, if we ever expand things, we'll just add new flags to the
end of the in-memory structure we have, but old binaries that don't
know about the size will continue to pass in the old size, and we'll
be all good.

So I hate the whole patch with a passion. It does absolutely
everything wrong from an interface standpoint.

I don't hate the notion of just adding flags. But do it cleanly, not
with random "is_clone6" bits.

And no, the structure we pass in from user space must *NOT* be the
same structure that we just copy blindly around as a kernel structure.
We've done that mistake before, and it is *always* a mistake. Think
"struct stat" and friends. No, we have a "struct kstat" for the kernel
version, and then we convert at the system call boundary. Let's not
repeat traditional mistakes.

And part of the conversion is exactly that

  Make everybody use the same in-kernel interface, so that the
lower-down routines don't have those kinds of "if it's this system
call, do this, otherwise, do that"

kind of horrible nasty mis-designs.

So to summarize:

 (a) make a separate kernel-only "clone_args" structure that is
unified and works for every single version of clone/fork/vfork, and
that has pointers and types in the kernel native format.

     So this will have things like "int __user *parent_tidptr" etc,
and something like *one* u64 flag field.

 (b) have the new system call have a nice compat-safe but
_independent_ format ("struct clone_uapi_struct")

 (c) you can now make the new system-call interface be something like

int clone_ext(struct clone_uapi_struct __user *uargs, size_t size);
{
        struct clone_uapi_struct kargs;
        struct clone_args args;

        // The API is defined as a stucture of 64-bit words
        if (size & 7)
                return -EINVAL;
        if (!access_ok(uargs, size))
                return -EFAULT;

        // If the user passes in new flags we don't
        // know about (because it was compiled against
        // a newer kernel than what is runn ing), make
        // sure they are zero
        if (size > sizeof(kargs)) {
                size_t n = (size - sizeof(kargs))>>3;
                u64 __user *ptr = (u64 __user *)(uargs+1)

                if (n > 10)
                        return -EINVAL;
                do {
                        u64 val;
                        if (get_user(val, ptr++))
                                return -EFAULT;
                        if (val)
                                return -EINVAL;
                } while (--n);

                // Ok, everything else was zero, we use
                // the part we know about
                size = sizeof(kargs);
        }
        memset(&kargs, 0, sizeof(kargs));
        memcpy_from_user(&kargs, uargs, size);

        .. now convert 'kargs' to 'args' ..

See? The above may be a bit *unnecessarily* anal about the whole
checking etc, but it's actually fairly simple in the end. And it means
that we have that "convert to internal format" in just one place - the
place where it makes sense. And it's a lot cleaner interface to user
mode, and allows for easy expansion later.

NOTE! By all means make "clone_ext()" take some of the other arguments
as part of the argument registers, not everything has to be part of
"struct clone_uapi_struct". But none of this "flag1..5" stuff. That's
what a struct is for.

Maybe the basic ":u64 clone_flags" could be the first argument (but
64-bit arguments can be a bit nasty for compat layers, so it's
probably not even worth it - once you have to copy an argument
structure from user space, you might as well put everything there).

                Linus

^ permalink raw reply

* [PATCH 1/1] eventfd new tag EFD_VPOLL: generate epoll events
From: Renzo Davoli @ 2019-05-26 14:25 UTC (permalink / raw)
  To: Alexander Viro, Davide Libenzi, linux-fsdevel, linux-kernel; +Cc: linux-api

This patch implements an extension of eventfd to define file descriptors 
whose I/O events can be generated at user level. These file descriptors
trigger notifications for [p]select/[p]poll/epoll.

This feature is useful for user-level implementations of network stacks
or virtual device drivers as libraries.

Development and porting of code often requires to find the way to wait for I/O
events both coming from file descriptors and generated by user-level code (e.g.
user-implemented net stacks or drivers).  While it is possible to provide a
partial support (e.g. using pipes or socketpairs), a clean and complete
solution is still missing (as far as I have seen); e.g. I have not seen any
clean way to generate EPOLLPRI, EPOLLERR, etc.

This proposal is based on a new tag for eventfd2(2): EFD_VPOLL.

This statement:
	fd = eventfd(EPOLLOUT, EFD_VPOLL | EFD_CLOEXEC);
creates a file descriptor for I/O event generation. In this case EPOLLOUT is
initially true.

Likewise all the other eventfs services, read(2) and write(2) use a 8-byte 
integer argument.

read(2) returns the current state of the pending events.

The argument of write(2) is an or-composition of a control command
(EFD_VPOLL_ADDEVENTS, EFD_VPOLL_DELEVENTS or EFD_VPOLL_MODEVENTS) and the
bitmap of events to be added, deleted to the current set of pending events.
EFD_VPOLL_MODEVENTS completely redefines the set of pending events.

e.g.:
	uint64_t request = EFD_VPOLL_ADDEVENTS | EPOLLIN | EPOLLPRI;
	write(fd, &request, sizeof(request);
adds EPOLLIN and EPOLLPRI to the set of pending events.

These are examples of messages asking for a feature like EFD_VPOLL:
https://stackoverflow.com/questions/909189/simulating-file-descriptor-in-user-space
https://stackoverflow.com/questions/1648147/running-a-simple-tcp-server-with-poll-how-do-i-trigger-events-artificially
... and I need it to write networking and device modules for vuos:
https://github.com/virtualsquare/vuos
(it is the new codebase of ViewOS, see www.virtualsquare.org).

EXAMPLE:
The following program creates an eventfd/EFD_VPOLL file descriptor and then forks
a child process.  While the parent waits for events using epoll_wait the child
generates a sequence of events. When the parent receives an event (or a set of events)
it prints it and disarm it.
The following shell session shows a sample run of the program:
	timeout...
	timeout...
	GOT event 1
	timeout...
	GOT event 1
	timeout...
	GOT event 3
	timeout...
	GOT event 2
	timeout...
	GOT event 4
	timeout...
	GOT event 10

Program source:
#include <sys/eventfd.h>
#include <sys/epoll.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>             /* Definition of uint64_t */

#ifndef EFD_VPOLL
#define EFD_VPOLL (1 << 1)
#define EFD_VPOLL_ADDEVENTS (1UL << 32)
#define EFD_VPOLL_DELEVENTS (2UL << 32)
#define EFD_VPOLL_MODEVENTS (3UL << 32)
#endif

#define handle_error(msg) \
	do { perror(msg); exit(EXIT_FAILURE); } while (0)

static void vpoll_ctl(int fd, uint64_t request) {
	ssize_t s;
	s = write(fd, &request, sizeof(request));
	if (s != sizeof(uint64_t))
		handle_error("write");
}

int
main(int argc, char *argv[])
{
	int efd, epollfd; 
	struct epoll_event ev;
	ev.events = EPOLLIN | EPOLLRDHUP | EPOLLERR | EPOLLOUT | EPOLLHUP | EPOLLPRI;
	ev.data.u64 = 0;

	efd = eventfd(0, EFD_VPOLL | EFD_CLOEXEC);
	if (efd == -1)
		handle_error("eventfd");
	epollfd = epoll_create1(EPOLL_CLOEXEC);
	if (efd == -1)
		handle_error("epoll_create1");
	if (epoll_ctl(epollfd, EPOLL_CTL_ADD, efd, &ev) == -1) 
		handle_error("epoll_ctl");

	switch (fork()) {
		case 0:
			sleep(3);
			vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLIN);
			sleep(2);
			vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLIN);
			sleep(2);
			vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLIN | EPOLLPRI);
			sleep(2);
			vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLPRI);
			sleep(2);
			vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLOUT);
			sleep(2);
			vpoll_ctl(efd, EFD_VPOLL_ADDEVENTS | EPOLLHUP);
			exit(EXIT_SUCCESS);
		default:
			while (1) {
				int nfds;
				nfds = epoll_wait(epollfd, &ev, 1, 1000);
				if (nfds < 0)
					handle_error("epoll_wait");
				else if (nfds == 0)
					printf("timeout...\n");
				else {
					printf("GOT event %x\n", ev.events);
					vpoll_ctl(efd, EFD_VPOLL_DELEVENTS | ev.events);
					if (ev.events & EPOLLHUP)
						break;
				}
			}
		case -1:
			handle_error("fork");
	}
	close(epollfd);
	close(efd);
	return 0;
}

Signed-off-by: Renzo Davoli <renzo@cs.unibo.it>
---
 fs/eventfd.c                   | 115 +++++++++++++++++++++++++++++++--
 include/linux/eventfd.h        |   7 +-
 include/uapi/linux/eventpoll.h |   2 +
 3 files changed, 116 insertions(+), 8 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 8aa0ea8c55e8..f83b7d02307e 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -3,6 +3,7 @@
  *  fs/eventfd.c
  *
  *  Copyright (C) 2007  Davide Libenzi <davidel@xmailserver.org>
+ *  EFD_VPOLL support: 2019 Renzo Davoli <renzo@cs.unibo.it>
  *
  */
 
@@ -30,12 +31,24 @@ struct eventfd_ctx {
 	struct kref kref;
 	wait_queue_head_t wqh;
 	/*
-	 * Every time that a write(2) is performed on an eventfd, the
-	 * value of the __u64 being written is added to "count" and a
-	 * wakeup is performed on "wqh". A read(2) will return the "count"
-	 * value to userspace, and will reset "count" to zero. The kernel
-	 * side eventfd_signal() also, adds to the "count" counter and
-	 * issue a wakeup.
+	 * If the EFD_VPOLL flag was NOT set at eventfd creation:
+	 *   Every time that a write(2) is performed on an eventfd, the
+	 *   value of the __u64 being written is added to "count" and a
+	 *   wakeup is performed on "wqh". A read(2) will return the "count"
+	 *   value to userspace, and will reset "count" to zero (or decrement
+	 *   "count" by 1 if the flag EFD_SEMAPHORE has been set). The kernel
+	 *   side eventfd_signal() also, adds to the "count" counter and
+	 *   issue a wakeup.
+	 *
+	 * If the EFD_VPOLL flag was set at eventfd creation:
+	 *   count is the set of pending EPOLL events.
+	 *   read(2) returns the current value of count.
+	 *   The argument of write(2) is an 8-byte integer:
+	 *   it is an or-composition of a control command (EFD_VPOLL_ADDEVENTS,
+	 *   EFD_VPOLL_DELEVENTS or EFD_VPOLL_MODEVENTS) and the bitmap of
+	 *   events to be added, deleted to the current set of pending events.
+	 *   (i.e. which bits of "count" must be set or reset).
+	 *   EFD_VPOLL_MODEVENTS redefines the set of pending events.
 	 */
 	__u64 count;
 	unsigned int flags;
@@ -295,6 +308,78 @@ static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t c
 	return res;
 }
 
+static __poll_t eventfd_vpoll_poll(struct file *file, poll_table *wait)
+{
+	struct eventfd_ctx *ctx = file->private_data;
+	__poll_t events = 0;
+	u64 count;
+
+	poll_wait(file, &ctx->wqh, wait);
+
+	count = READ_ONCE(ctx->count);
+
+	events = (count & EPOLLALLMASK);
+
+	return events;
+}
+
+static ssize_t eventfd_vpoll_read(struct file *file, char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	struct eventfd_ctx *ctx = file->private_data;
+	ssize_t res;
+	__u64 ucnt = 0;
+
+	if (count < sizeof(ucnt))
+		return -EINVAL;
+	res = sizeof(ucnt);
+	ucnt = READ_ONCE(ctx->count);
+	if (put_user(ucnt, (__u64 __user *)buf))
+		return -EFAULT;
+
+	return res;
+}
+
+static ssize_t eventfd_vpoll_write(struct file *file, const char __user *buf,
+		size_t count, loff_t *ppos)
+{
+	struct eventfd_ctx *ctx = file->private_data;
+	ssize_t res;
+	__u64 ucnt;
+	__u32 events;
+
+	if (count < sizeof(ucnt))
+		return -EINVAL;
+	if (copy_from_user(&ucnt, buf, sizeof(ucnt)))
+		return -EFAULT;
+	spin_lock_irq(&ctx->wqh.lock);
+
+	events = ucnt & EPOLLALLMASK;
+	res = sizeof(ucnt);
+	switch (ucnt & ~((__u64)EPOLLALLMASK)) {
+	case EFD_VPOLL_ADDEVENTS:
+		ctx->count |= events;
+		break;
+	case EFD_VPOLL_DELEVENTS:
+		ctx->count &= ~(events);
+		break;
+	case EFD_VPOLL_MODEVENTS:
+		ctx->count = (ctx->count & ~EPOLLALLMASK) | events;
+		break;
+	default:
+		res = -EINVAL;
+	}
+
+	/* wake up waiting threads */
+	if (res >= 0 && waitqueue_active(&ctx->wqh))
+		wake_up_locked_poll(&ctx->wqh, res);
+
+	spin_unlock_irq(&ctx->wqh.lock);
+
+	return res;
+
+}
+
 #ifdef CONFIG_PROC_FS
 static void eventfd_show_fdinfo(struct seq_file *m, struct file *f)
 {
@@ -319,6 +404,17 @@ static const struct file_operations eventfd_fops = {
 	.llseek		= noop_llseek,
 };
 
+static const struct file_operations eventfd_vpoll_fops = {
+#ifdef CONFIG_PROC_FS
+	.show_fdinfo	= eventfd_show_fdinfo,
+#endif
+	.release	= eventfd_release,
+	.poll		= eventfd_vpoll_poll,
+	.read		= eventfd_vpoll_read,
+	.write		= eventfd_vpoll_write,
+	.llseek		= noop_llseek,
+};
+
 /**
  * eventfd_fget - Acquire a reference of an eventfd file descriptor.
  * @fd: [in] Eventfd file descriptor.
@@ -391,6 +487,7 @@ EXPORT_SYMBOL_GPL(eventfd_ctx_fileget);
 static int do_eventfd(unsigned int count, int flags)
 {
 	struct eventfd_ctx *ctx;
+	const struct file_operations *fops = &eventfd_fops;
 	int fd;
 
 	/* Check the EFD_* constants for consistency.  */
@@ -410,7 +507,11 @@ static int do_eventfd(unsigned int count, int flags)
 	ctx->flags = flags;
 	ctx->id = ida_simple_get(&eventfd_ida, 0, 0, GFP_KERNEL);
 
-	fd = anon_inode_getfd("[eventfd]", &eventfd_fops, ctx,
+	if (flags & EFD_VPOLL) {
+		fops = &eventfd_vpoll_fops;
+		ctx->count &= EPOLLALLMASK;
+	}
+	fd = anon_inode_getfd("[eventfd]", fops, ctx,
 			      O_RDWR | (flags & EFD_SHARED_FCNTL_FLAGS));
 	if (fd < 0)
 		eventfd_free_ctx(ctx);
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index ffcc7724ca21..63258cf29344 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -21,11 +21,16 @@
  * shared O_* flags.
  */
 #define EFD_SEMAPHORE (1 << 0)
+#define EFD_VPOLL (1 << 1)
 #define EFD_CLOEXEC O_CLOEXEC
 #define EFD_NONBLOCK O_NONBLOCK
 
 #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
-#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
+#define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE | EFD_VPOLL)
+
+#define EFD_VPOLL_ADDEVENTS (1UL << 32)
+#define EFD_VPOLL_DELEVENTS (2UL << 32)
+#define EFD_VPOLL_MODEVENTS (3UL << 32)
 
 struct eventfd_ctx;
 struct file;
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index 8a3432d0f0dc..814de6d869c7 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -41,6 +41,8 @@
 #define EPOLLMSG	(__force __poll_t)0x00000400
 #define EPOLLRDHUP	(__force __poll_t)0x00002000
 
+#define EPOLLALLMASK	((__force __poll_t)0x0fffffff)
+
 /* Set exclusive wakeup mode for the target file descriptor */
 #define EPOLLEXCLUSIVE	((__force __poll_t)(1U << 28))
 
-- 
2.20.1

^ permalink raw reply related

* [PATCH 2/2] arch: wire-up clone6() syscall on x86
From: Christian Brauner @ 2019-05-26 10:26 UTC (permalink / raw)
  To: viro, linux-kernel, torvalds, jannh
  Cc: fweimer, oleg, arnd, dhowells, Christian Brauner, Andrew Morton,
	Adrian Reber, linux-api, linux-arch, x86
In-Reply-To: <20190526102612.6970-1-christian@brauner.io>

Wire up the clone6() call on x86.

This patch only wires up clone6() on x86. Some of the arches look like they
need special assembly massaging and it is probably smarter if the
appropriate arch maintainers would do the actual wiring.

Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Adrian Reber <adrian@lisas.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: x86@kernel.org
---
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 include/uapi/asm-generic/unistd.h      | 4 +++-
 3 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ad968b7bac72..dffcd57990b3 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -438,3 +438,4 @@
 431	i386	fsconfig		sys_fsconfig			__ia32_sys_fsconfig
 432	i386	fsmount			sys_fsmount			__ia32_sys_fsmount
 433	i386	fspick			sys_fspick			__ia32_sys_fspick
+436	i386	clone6			sys_clone6			__ia32_sys_clone6
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index b4e6f9e6204a..73bf4cc099a2 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
 431	common	fsconfig		__x64_sys_fsconfig
 432	common	fsmount			__x64_sys_fsmount
 433	common	fspick			__x64_sys_fspick
+436	common	clone6			__x64_sys_clone6/ptregs
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a87904daf103..500bdb4c5e36 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -844,9 +844,11 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
 __SYSCALL(__NR_fsmount, sys_fsmount)
 #define __NR_fspick 433
 __SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_clone6 436
+__SYSCALL(__NR_clone6, sys_clone6)
 
 #undef __NR_syscalls
-#define __NR_syscalls 434
+#define __NR_syscalls 437
 
 /*
  * 32 bit systems traditionally used different
-- 
2.21.0

^ permalink raw reply related

* [PATCH 1/2] fork: add clone6
From: Christian Brauner @ 2019-05-26 10:26 UTC (permalink / raw)
  To: viro, linux-kernel, torvalds, jannh
  Cc: fweimer, oleg, arnd, dhowells, Christian Brauner, Pavel Emelyanov,
	Andrew Morton, Adrian Reber, Andrei Vagin, linux-api

This adds the clone6 system call.

As mentioned several times already (cf. [7], [8]) here's the promised
patchset for clone6().

We recently merged the CLONE_PIDFD patchset (cf. [1]). It took the last
free flag from clone().

Independent of the CLONE_PIDFD patchset a time namespace has been discussed
at Linux Plumber Conference last year and has been sent out and reviewed
(cf. [5]). It is expected that it will go upstream in the not too distant
future. However, it relies on the addition of the CLONE_NEWTIME flag to
clone(). The only other good candidate - CLONE_DETACHED - is currently not
recycable as we have identified at least two large or widely used codebases
that currently pass this flag (cf. [2], [3], and [4]). Given that we
grabbed the last clone() flag we effectively blocked the time namespace
patchset. It just seems right that we unblock it again.

The idea is to keep clone6() very simple and close to the original clone(),
specifically, to keep on supporting old clone()-based workloads.
We know there have been various creative proposals how a new process
creation syscall or even api is supposed to look like. Some people even
going so far as to argue that the traditional fork()+exec() split should be
abandoned in favor of an in-kernel version of spawn(). Independent of
whether or not we personally think spawn() is a good idea this patchset has
and does not want to have anything to do with this.
One stance we take is that there's no real good alternative to
clone()+exec() and we need and want to support this model going forward
independent of spawn().
The following requirements guided us for clone6():
- bump the number of available flags as much as possible while ensuring
  that all flag arguments are passed in registers so they remain easily
  accessible for seccomp.
- move non-flag arguments that are currently passed as separate arguments
  in clone() into a dedicated struct
  - choose a struct layout that is easy to handle on 32 and on 64 bit
  - choose a struct layout that is extensible
  - give new flags that currently need to abuse another flag's dedicated
    return argument in clone() their own dedicated return argument
    (e.g. CLONE_PIDFD)
- ease of transition for userspace from clone() to clone6()
- do not try to be clever or complex: keep clone6() as dumb as possible

What we came up with is clone6() which has the following signature:

struct clone6_args {
        __s32 pidfd;
        __aligned_u64 parent_tidptr;
        __aligned_u64 child_tidptr;
        __aligned_u64 stack;
        __aligned_u64 stack_size;
        __aligned_u64 tls;
};

long sys_clone6(struct clone6_args __user *uargs,
                unsigned int flags1,
                unsigned int flags2,
                unsigned int flags3,
                unsigned int flags4,
                unsigned int flags5);

clone6() cleanly supports all of the supported flags from clone() in
flags1, i.e. flags1 is full and all legacy workloads are supported with
clone6().
With clone6() we have 160 flag values in total which - even for a feature
heavy syscall like clone - should hold quite a while. If they are really
all taken at some point we can simply bite the bullet and start adding
additional flag arguments into struct clone6 itself.

Another advantage of sticking close to the old clone() is the low cost for
userspace to switch to this new api. Quite a lot of userspace apis (e.g.
pthreads) are based on the clone() syscall. With the new clone6() syscall
supporting all of the old workloads and opening up the ability to add new
features should make switching to it for userspace more appealing. In
essence, glibc can just write a simple wrapper to switch from clone() to
clone6().

There has been some interest in this patchset already. We have received a
patch from the CRIU corner for clone6() that would set the PID/TID of a
restored process without /proc/sys/kernel/ns_last_pid to eliminate a race.

/* References */
[1]: b3e5838252665ee4cfa76b82bdf1198dca81e5be
[2]: https://dxr.mozilla.org/mozilla-central/source/security/sandbox/linux/SandboxFilter.cpp#343
[3]: https://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_create.c#n233
[4]: https://sources.debian.org/src/blcr/0.8.5-2.3/cr_module/cr_dump_self.c/?hl=740#L740
[5]: https://lore.kernel.org/lkml/20190425161416.26600-1-dima@arista.com/
[6]: https://lore.kernel.org/lkml/20190425161416.26600-2-dima@arista.com/
[7]: https://lore.kernel.org/lkml/CAHrFyr5HxpGXA2YrKza-oB-GGwJCqwPfyhD-Y5wbktWZdt0sGQ@mail.gmail.com/
[8]: https://lore.kernel.org/lkml/20190524102756.qjsjxukuq2f4t6bo@brauner.io/

Co-developed-by: Jann Horn <jannh@google.com>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Adrian Reber <adrian@lisas.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
---
 arch/x86/ia32/sys_ia32.c   |   9 ++-
 include/linux/sched/task.h |   2 +-
 include/linux/syscalls.h   |   9 +++
 include/uapi/linux/sched.h |  14 ++++
 kernel/fork.c              | 161 ++++++++++++++++++++++++++++---------
 5 files changed, 152 insertions(+), 43 deletions(-)

diff --git a/arch/x86/ia32/sys_ia32.c b/arch/x86/ia32/sys_ia32.c
index a43212036257..55a8c550ba74 100644
--- a/arch/x86/ia32/sys_ia32.c
+++ b/arch/x86/ia32/sys_ia32.c
@@ -237,6 +237,11 @@ COMPAT_SYSCALL_DEFINE5(x86_clone, unsigned long, clone_flags,
 		       unsigned long, newsp, int __user *, parent_tidptr,
 		       unsigned long, tls_val, int __user *, child_tidptr)
 {
-	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr,
-			tls_val);
+	struct clone6_args args = {
+		.stack = newsp,
+		.parent_tidptr = (uintptr_t)parent_tidptr,
+		.tls = tls_val,
+		.child_tidptr = (uintptr_t)child_tidptr
+	};
+	return _do_fork(clone_flags, &args);
 }
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index f1227f2c38a4..06e7c0df6ab0 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -73,7 +73,7 @@ extern void do_group_exit(int);
 extern void exit_files(struct task_struct *);
 extern void exit_itimers(struct signal_struct *);
 
-extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *, unsigned long);
+extern long _do_fork(u64 clone_flags, struct clone6_args *uargs);
 extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
 struct task_struct *fork_idle(int);
 struct mm_struct *copy_init_mm(void);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe1be5b..235df5c5e711 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -70,6 +70,7 @@ struct sigaltstack;
 struct rseq;
 union bpf_attr;
 struct io_uring_params;
+struct clone6_args;
 
 #include <linux/types.h>
 #include <linux/aio_abi.h>
@@ -852,6 +853,14 @@ asmlinkage long sys_clone(unsigned long, unsigned long, int __user *,
 	       int __user *, unsigned long);
 #endif
 #endif
+
+#ifdef __ARCH_WANT_SYS_CLONE
+asmlinkage long sys_clone6(struct clone6_args __user *uargs,
+			   unsigned int flags1, unsigned int flags2,
+			   unsigned int flags3, unsigned int flags4,
+			   unsigned int flags5);
+#endif
+
 asmlinkage long sys_execve(const char __user *filename,
 		const char __user *const __user *argv,
 		const char __user *const __user *envp);
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index ed4ee170bee2..b8d2809c5bc6 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -2,6 +2,8 @@
 #ifndef _UAPI_LINUX_SCHED_H
 #define _UAPI_LINUX_SCHED_H
 
+#include <linux/types.h>
+
 /*
  * cloning flags:
  */
@@ -31,6 +33,18 @@
 #define CLONE_NEWNET		0x40000000	/* New network namespace */
 #define CLONE_IO		0x80000000	/* Clone io context */
 
+/*
+ * Arguments for the clone6 syscall
+ */
+struct clone6_args {
+	__s32 pidfd;
+	__aligned_u64 parent_tidptr;
+	__aligned_u64 child_tidptr;
+	__aligned_u64 stack;
+	__aligned_u64 stack_size;
+	__aligned_u64 tls;
+};
+
 /*
  * Scheduling policies
  */
diff --git a/kernel/fork.c b/kernel/fork.c
index b4cba953040a..ffd5471c64af 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1760,19 +1760,21 @@ static __always_inline void delayed_free_task(struct task_struct *tsk)
  * flags). The actual kick-off is left to the caller.
  */
 static __latent_entropy struct task_struct *copy_process(
-					unsigned long clone_flags,
-					unsigned long stack_start,
-					unsigned long stack_size,
-					int __user *parent_tidptr,
-					int __user *child_tidptr,
+					u64 clone_flags,
 					struct pid *pid,
 					int trace,
-					unsigned long tls,
-					int node)
+					int node,
+					struct clone6_args *args,
+					bool is_clone6)
 {
 	int pidfd = -1, retval;
 	struct task_struct *p;
 	struct multiprocess_signals delayed;
+	int __user *child_tidptr = u64_to_user_ptr(args->child_tidptr);
+	int __user *parent_tidptr = NULL;
+	unsigned long tls = args->tls;
+	unsigned long stack_start = args->stack;
+	unsigned long stack_size = args->stack_size;
 
 	/*
 	 * Don't allow sharing the root directory with processes in a different
@@ -1824,25 +1826,32 @@ static __latent_entropy struct task_struct *copy_process(
 		int reserved;
 
 		/*
-		 * - CLONE_PARENT_SETTID is useless for pidfds and also
-		 *   parent_tidptr is used to return pidfds.
 		 * - CLONE_DETACHED is blocked so that we can potentially
 		 *   reuse it later for CLONE_PIDFD.
 		 * - CLONE_THREAD is blocked until someone really needs it.
 		 */
-		if (clone_flags &
-		    (CLONE_DETACHED | CLONE_PARENT_SETTID | CLONE_THREAD))
+		if (clone_flags & (CLONE_DETACHED | CLONE_THREAD))
 			return ERR_PTR(-EINVAL);
 
-		/*
-		 * Verify that parent_tidptr is sane so we can potentially
-		 * reuse it later.
-		 */
-		if (get_user(reserved, parent_tidptr))
-			return ERR_PTR(-EFAULT);
+		if (!is_clone6) {
+			/*
+			 * For non-clone6() versions parent_tidptr is used to
+			 * return pidfds.
+			 */
+			if (clone_flags & CLONE_PARENT_SETTID)
+				return ERR_PTR(-EINVAL);
 
-		if (reserved != 0)
-			return ERR_PTR(-EINVAL);
+			/*
+			 * Verify that parent_tidptr is sane so we can
+			 * potentially reuse it later.
+			 */
+			parent_tidptr = u64_to_user_ptr(args->parent_tidptr);
+			if (get_user(reserved, parent_tidptr))
+				return ERR_PTR(-EFAULT);
+
+			if (reserved != 0)
+				return ERR_PTR(-EINVAL);
+		}
 	}
 
 	/*
@@ -2062,9 +2071,14 @@ static __latent_entropy struct task_struct *copy_process(
 			goto bad_fork_free_pid;
 
 		pidfd = retval;
-		retval = put_user(pidfd, parent_tidptr);
-		if (retval)
-			goto bad_fork_put_pidfd;
+		if (!is_clone6) {
+			/* store pidfd in parent_tidptr for legacy clone */
+			retval = put_user(pidfd, parent_tidptr);
+			if (retval)
+				goto bad_fork_put_pidfd;
+		} else {
+			args->pidfd = pidfd;
+		}
 	}
 
 #ifdef CONFIG_BLOCK
@@ -2313,8 +2327,10 @@ static inline void init_idle_pids(struct task_struct *idle)
 struct task_struct *fork_idle(int cpu)
 {
 	struct task_struct *task;
-	task = copy_process(CLONE_VM, 0, 0, NULL, NULL, &init_struct_pid, 0, 0,
-			    cpu_to_node(cpu));
+	struct clone6_args args = { 0 };
+
+	task = copy_process(CLONE_VM, &init_struct_pid, 0,
+			    cpu_to_node(cpu), &args, false);
 	if (!IS_ERR(task)) {
 		init_idle_pids(task);
 		init_idle(task, cpu);
@@ -2334,18 +2350,15 @@ struct mm_struct *copy_init_mm(void)
  * It copies the process, and if successful kick-starts
  * it and waits for it to finish using the VM if required.
  */
-long _do_fork(unsigned long clone_flags,
-	      unsigned long stack_start,
-	      unsigned long stack_size,
-	      int __user *parent_tidptr,
-	      int __user *child_tidptr,
-	      unsigned long tls)
+static long _do_clone_common(u64 clone_flags, struct clone6_args *args,
+			     bool is_clone6)
 {
 	struct completion vfork;
 	struct pid *pid;
 	struct task_struct *p;
 	int trace = 0;
 	long nr;
+	int __user *parent_tidptr = u64_to_user_ptr(args->parent_tidptr);
 
 	/*
 	 * Determine whether and which event to report to ptracer.  When
@@ -2365,8 +2378,8 @@ long _do_fork(unsigned long clone_flags,
 			trace = 0;
 	}
 
-	p = copy_process(clone_flags, stack_start, stack_size, parent_tidptr,
-			 child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
+	p = copy_process(clone_flags, NULL, trace, NUMA_NO_NODE, args,
+			 is_clone6);
 	add_latent_entropy();
 
 	if (IS_ERR(p))
@@ -2405,6 +2418,11 @@ long _do_fork(unsigned long clone_flags,
 	return nr;
 }
 
+long _do_fork(u64 clone_flags, struct clone6_args *args)
+{
+	return _do_clone_common(clone_flags, args, false);
+}
+
 #ifndef CONFIG_HAVE_COPY_THREAD_TLS
 /* For compatibility with architectures that call do_fork directly rather than
  * using the syscall entry points below. */
@@ -2414,8 +2432,14 @@ long do_fork(unsigned long clone_flags,
 	      int __user *parent_tidptr,
 	      int __user *child_tidptr)
 {
-	return _do_fork(clone_flags, stack_start, stack_size,
-			parent_tidptr, child_tidptr, 0);
+	struct clone6_args args = {
+		.stack = stack_start,
+		.stack_size = stack_size,
+		.parent_tidptr = (uintptr_t)parent_tidptr,
+		.child_tidptr = (uintptr_t)child_tidptr
+	};
+
+	return _do_fork(clone_flags, &args);
 }
 #endif
 
@@ -2424,15 +2448,20 @@ long do_fork(unsigned long clone_flags,
  */
 pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
 {
-	return _do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
-		(unsigned long)arg, NULL, NULL, 0);
+	struct clone6_args args = {
+		.stack = (unsigned long)fn,
+		.stack_size = (unsigned long)arg,
+	};
+	return _do_fork(flags|CLONE_VM|CLONE_UNTRACED, &args);
 }
 
 #ifdef __ARCH_WANT_SYS_FORK
 SYSCALL_DEFINE0(fork)
 {
 #ifdef CONFIG_MMU
-	return _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0);
+	struct clone6_args args = { 0 };
+
+	return _do_fork(SIGCHLD, &args);
 #else
 	/* can not support in nommu mode */
 	return -EINVAL;
@@ -2443,8 +2472,9 @@ SYSCALL_DEFINE0(fork)
 #ifdef __ARCH_WANT_SYS_VFORK
 SYSCALL_DEFINE0(vfork)
 {
-	return _do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, 0,
-			0, NULL, NULL, 0);
+	struct clone6_args args = { 0 };
+
+	return _do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, &args);
 }
 #endif
 
@@ -2472,7 +2502,58 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
 		 unsigned long, tls)
 #endif
 {
-	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls);
+	struct clone6_args args = {
+		.stack = newsp,
+		.parent_tidptr = (uintptr_t)parent_tidptr,
+		.tls = tls,
+		.child_tidptr = (uintptr_t)child_tidptr
+	};
+	return _do_fork(clone_flags, &args);
+}
+
+SYSCALL_DEFINE6(clone6, struct clone6_args __user*, uargs,
+			unsigned int, flags1,
+			unsigned int, flags2,
+			unsigned int, flags3,
+			unsigned int, flags4,
+			unsigned int, flags5)
+{
+	struct clone6_args args = { 0 };
+	u64 flags = flags1;
+	int result;
+
+	if (flags2 || flags3 || flags4 || flags5)
+		return -EINVAL;
+
+	/*
+	 * flags1 is full so we only need to verify that CLONE_DETACHED
+	 * is not passed since we can't use it.
+	 */
+	if (flags & CLONE_DETACHED)
+		return -EINVAL;
+
+	result = get_user(args.stack, &uargs->stack);
+#if defined(CONFIG_CLONE_BACKWARDS3)
+	if (!result)
+		result = get_user(args.stack_size, &uargs->stack_size);
+#endif
+	if (!result && (flags & (CLONE_CHILD_CLEARTID | CLONE_CHILD_SETTID)))
+		result = get_user(args.child_tidptr, &uargs->child_tidptr);
+	if (!result && (flags & CLONE_PARENT_SETTID))
+		result = get_user(args.parent_tidptr, &uargs->parent_tidptr);
+	if (!result && (flags & CLONE_SETTLS))
+		result = get_user(args.tls, &uargs->tls);
+	if (result)
+		return -EFAULT;
+
+	result = _do_clone_common(flags, &args, true);
+
+	if (flags & CLONE_PIDFD) {
+		if (put_user(args.pidfd, &uargs->pidfd))
+			return -EFAULT;
+	}
+
+	return result;
 }
 #endif
 
-- 
2.21.0

^ permalink raw reply related

* Re: [PATCH v7 5/5] namei: resolveat(2) syscall
From: Aleksa Sarai @ 2019-05-26  5:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Christian Brauner, Eric Biederman, Andy Lutomirski,
	Andrew Morton, Alexei Starovoitov, Kees Cook, Jann Horn,
	Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
	Aleksa Sarai, Linux Containers, linux-fsdevel, Linux API
In-Reply-To: <CAHk-=wiKFi5wi33AmJ4XJmzQaCMHa21-Z-GD_OKPNz=js7R7ig@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 6565 bytes --]

On 2019-05-25, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> In fact, I think resolveat() as a model is fundamentally wrong for yet
> another reason: O_CREAT. If you want to _create_ a new file, and you
> want to still have the path resolution modifiers in place, the
> resolveat() model is broken, because it only gives you path resolution
> for the lookup, and then when you do openat(O_CREAT) for the final
> component, you now don't have any way to limit that last component.
> 
> Sure,  you can probably effectively hack around it with resolveat() on
> everything but the last component, and then
> openat(O_CREAT|O_EXCL|O_NOFOLLOW) on the last component, because the
> O_EXCL|O_NOFOLLOW should mean that it won't do any of the unsafe
> things. And then (if you didn't actually want the O_EXCL), you handle
> the race between "somebody else got there first" by re-trying etc. So
> I suspect the O_CREAT thing could be worked around with extra
> complexity, but it's an example of how the O_PATH model really screws
> you over.
> 
> End result: I really think resolveat() is broken. It absolutely
> *needs* to be a full-fledged "openat()" system call, just with added
> path resolution flags.

That's a very good point. I was starting to work on O_CREAT via
resolveat(2) and it definitely was much harder than most people would be
happy to deal with -- I'm not even sure that I handled all the cases.

I'll go for an openat2(2) then. Thinking about it some more -- since
it's a new syscall, I could actually implement the O_PATH link-mode as
being just the regular mode argument (since openat(2) ignores the mode
for non-O_CREAT anyway). The openat(2) wrapper might be more than
one-line as a result but it should avoid polluting the resolution flags
with mode flags (since openat(O_PATH) needs to have a g+rwx mode for
backwards-compatibility).

> >   openat2(rootfd, "path/to/file", O_PATH, RESOLVE_IN_ROOT, &statbuf);
> 
> Note that for O_CREAT, it either needs the 'mode' parameter too, or
> the statbuf would need to be an in-out thing. I think the in-out model
> might be nice (the varargs model with a conditional 'mode' parameter
> is horrid, I think), but at some point it's just bike-shedding.
> 
> Also, I'm not absolutely married to the statbuf, but I do think it
> might be a useful extension. A *lot* of users need the size of the
> file for subsequent mmap() calls (or for buffer management for
> read/write interface) or for things like just headers (ie
> "Content-length:" in html etc).
> 
> I'm not sure people actually want a full 'struct stat', but people
> historically also do st_ino/st_dev for indexing into existing
> user-space caches (or to check permissions like that it's on the right
> filesystem, but the resolve flags kind of make that less of an issue).
> And st_mode to verify that it's a proper regular file etc etc.
> 
> You can always just leave it as NULL if you don't care, there's almost
> no downside to adding it, and I do think that people who want a "walk
> pathname carefully" model almost always would want to also check the
> end result anyway.

Yeah, I agree -- most folks would want to double-check what they've
opened or O_PATH'd. Though I'm still not clear what is the best way of
doing the "stat" argument is -- especially given how much fun
architecture-specific shenanigans are in fs/stat.c (should I only use
cp_new_stat64 or have a separate 64-bit syscall).

> > Is there a large amount of overhead or downside to the current
> > open->fstat way of doing things, or is this more of a "if we're going to
> > add more ways of opening we might as well add more useful things"?
> 
> Right now, system calls are sadly very expensive on a lot of hardware.
> We used to be very proud of the fact that Linux system calls were
> fast, but with meltdown and retpoline etc, we're back to "system calls
> can be several thousand cycles each, just in overhead, on commonly
> available hardware".
> 
> Is "several thousand cycles" fatal? Not necessarily. But I think that
> if we do add a new system call, particularly a fancy one for special
> security-conscious models, we should look at what people need and use,
> and want. And performance should always be a concern.
> 
> I realize that people who come at this from primarily just a security
> issue background may think that security is the primary goal. But no.
> Security always needs to realize that the _primary_ goal is to have
> people _use_ it. Without users, security is entirely pointless. And
> the users part is partly performance, but mostly "it's convenient".

Yup, I agree. I was hoping to shunt most of the convenience to userspace
to avoid ruffling feathers in VFS-land, but I'm more than happy to make
the kernel ABI more convenient.

> The whole "this is Linux-specific" is a big inconvenience point

I hope that other OSes will take our lead and have a similar interface
so that the particular inconvenience can go away eventually (this was
one of the arguments for resolveat(2) -- it's a clear "this is a new
idea" interface rather than mixing it with other O_* flags).

> Talking about securely opening things - another flag that we may want
> to avoid issues is a "don't open device nodes" flag. Sure, O_NONBLOCK
> plus checking the st_mode of the result is usually sufficient, but
> it's actually fairly easy to get that wrong. Things like /dev/tty and
> /dev/zero often need to be available for various reasons, and have
> been used to screw careless "open and read" users up that missed a
> check.

Sure, this could be added -- though I'm sure folks would have
disagreements over whether it should be a resolution flag on an open
flag.

> I also do wonder that if the only actual user-facing interface for the
> resolution flags is a new system call, should we not make the
> *default* value be "don't open anything odd at all".
>
> So instead of saying RESOLVE_XDEV for "don't cross mount points",
> maybe the flags should be the other way around, and say "yes, allow
> mount point crossings", and "yes, explicitly allow device node
> opening", and "yes, allow DOTDOT" etc.

This seems like a reasonable default, except for the cases of
RESOLVE_BENEATH and RESOLVE_IN_ROOT (that would make using AT_FDCWD more
cumbersome than necessary). But other than that, I'd be happy to invert
all the other flags.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v7 5/5] namei: resolveat(2) syscall
From: Linus Torvalds @ 2019-05-25 16:54 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Christian Brauner, Eric Biederman, Andy Lutomirski,
	Andrew Morton, Alexei Starovoitov, Kees Cook, Jann Horn,
	Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
	Aleksa Sarai, Linux Containers, linux-fsdevel, Linux API
In-Reply-To: <20190525070307.bxbvjh2254sx2z6g@yavin>

On Sat, May 25, 2019 at 12:03 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> You might not have seen the v8 of this set I sent a few days ago[1]. The
> new set includes an example of a feature that is possible with
> resolveat(2) but not with the current openat(O_PATH) interface.

It's the "forced O_PATH" model that makes resolveat() basically
entirely pointless to me.

You can do almost nothing with an O_PATH file descriptor. Yes, it's a
really cool feature, and it's great for what it is, but it's less than
0.001% of all opens people have.

Even among security-conscious users, it's pointless. Yes, an O_PATH
file descriptor is somewhat more "secure", but it's secure because
it's mostly USELESS, and has to be converted into something else to be
used.

So say you are something like a static web server that actually wants
to use the "don't traverse '..', don't follow symlinks" features etc.
What you want to do with the end result is read() it or mmap it, or
sendpage or whatever.

This is why I think resolveat() is entirely pointless. Even with
O_EMPTYPATH it's pointless - because you shouldn't *need* that extra
"ok, now get me the actual useful fd" phase.

In fact, I think resolveat() as a model is fundamentally wrong for yet
another reason: O_CREAT. If you want to _create_ a new file, and you
want to still have the path resolution modifiers in place, the
resolveat() model is broken, because it only gives you path resolution
for the lookup, and then when you do openat(O_CREAT) for the final
component, you now don't have any way to limit that last component.

Sure,  you can probably effectively hack around it with resolveat() on
everything but the last component, and then
openat(O_CREAT|O_EXCL|O_NOFOLLOW) on the last component, because the
O_EXCL|O_NOFOLLOW should mean that it won't do any of the unsafe
things. And then (if you didn't actually want the O_EXCL), you handle
the race between "somebody else got there first" by re-trying etc. So
I suspect the O_CREAT thing could be worked around with extra
complexity, but it's an example of how the O_PATH model really screws
you over.

End result: I really think resolveat() is broken. It absolutely
*needs* to be a full-fledged "openat()" system call, just with added
path resolution flags.

>   openat2(rootfd, "path/to/file", O_PATH, RESOLVE_IN_ROOT, &statbuf);

Note that for O_CREAT, it either needs the 'mode' parameter too, or
the statbuf would need to be an in-out thing. I think the in-out model
might be nice (the varargs model with a conditional 'mode' parameter
is horrid, I think), but at some point it's just bike-shedding.

Also, I'm not absolutely married to the statbuf, but I do think it
might be a useful extension. A *lot* of users need the size of the
file for subsequent mmap() calls (or for buffer management for
read/write interface) or for things like just headers (ie
"Content-length:" in html etc).

I'm not sure people actually want a full 'struct stat', but people
historically also do st_ino/st_dev for indexing into existing
user-space caches (or to check permissions like that it's on the right
filesystem, but the resolve flags kind of make that less of an issue).
And st_mode to verify that it's a proper regular file etc etc.

You can always just leave it as NULL if you don't care, there's almost
no downside to adding it, and I do think that people who want a "walk
pathname carefully" model almost always would want to also check the
end result anyway.

Again, I'm thinking of the most obvious use cases where you want these
kinds of special pathname traversals: file servers, static web content
serving etc. They *all* want the file size and type when they open a
file.

> Is there a large amount of overhead or downside to the current
> open->fstat way of doing things, or is this more of a "if we're going to
> add more ways of opening we might as well add more useful things"?

Right now, system calls are sadly very expensive on a lot of hardware.
We used to be very proud of the fact that Linux system calls were
fast, but with meltdown and retpoline etc, we're back to "system calls
can be several thousand cycles each, just in overhead, on commonly
available hardware".

Is "several thousand cycles" fatal? Not necessarily. But I think that
if we do add a new system call, particularly a fancy one for special
security-conscious models, we should look at what people need and use,
and want. And performance should always be a concern.

I realize that people who come at this from primarily just a security
issue background may think that security is the primary goal. But no.
Security always needs to realize that the _primary_ goal is to have
people _use_ it. Without users, security is entirely pointless. And
the users part is partly performance, but mostly "it's convenient".

The whole "this is Linux-specific" is a big inconvenience point, but
aside from that, let's make any new interface as welcoming and simple
and useful as possible. Not a "you have to do extra work" interface.
Quite the reverse. Make it something that makes people go "ahh, yes,
this actually means I don't have to do anything extra, because it
already does everything I want for opening and checking a pathname".

So the way to sell the path lookup improvements should not be "look,
here's a secure way to look up paths". No. That's entirely missing the
point.

No, the way to do path lookup improvements is to say "look, here's a
_convenient_ way to look up paths, and btw, it's also easy to make
secure".

Talking about securely opening things - another flag that we may want
to avoid issues is a "don't open device nodes" flag. Sure, O_NONBLOCK
plus checking the st_mode of the result is usually sufficient, but
it's actually fairly easy to get that wrong. Things like /dev/tty and
/dev/zero often need to be available for various reasons, and have
been used to screw careless "open and read" users up that missed a
check.

I also do wonder that if the only actual user-facing interface for the
resolution flags is a new system call, should we not make the
*default* value be "don't open anything odd at all".

So instead of saying RESOLVE_XDEV for "don't cross mount points",
maybe the flags should be the other way around, and say "yes, allow
mount point crossings", and "yes, explicitly allow device node
opening", and "yes, allow DOTDOT" etc.

               Linus

^ permalink raw reply

* Re: [PATCH v7 5/5] namei: resolveat(2) syscall
From: Aleksa Sarai @ 2019-05-25  7:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Christian Brauner, Eric Biederman, Andy Lutomirski,
	Andrew Morton, Alexei Starovoitov, Kees Cook, Jann Horn,
	Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
	Aleksa Sarai, Linux Containers, linux-fsdevel, Linux API
In-Reply-To: <CAHk-=whbFMg4+HuWOBuHpvDNiAyowX2HUowv3+pt8vPWk5W-YQ@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 6554 bytes --]

On 2019-05-24, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Tue, May 7, 2019 at 9:44 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
> >
> > The most obvious syscall to add support for the new LOOKUP_* scoping
> > flags would be openat(2) (along with the required execveat(2) change
> > included in this series). However, there are a few reasons to not do
> > this:
> 
> So honestly, this last patch is what turns me off the whole thing.
> 
> It goes from a nice new feature ("you can use O_NOSYMLINKS to disallow
> symlink traversal") to a special-case joke that isn't worth it any
> more. You get a useless path descrptor back from s special hacky
> system call, you don't actually get the useful data that you probably
> *want* the open to get you.
> 
> Sure, you could eventually then use a *second* system call (openat
> with O_EMPTYPATH) to actually get something you can *use*, but at this
> point you've just wasted everybodys time and effort with a pointless
> second system call.
> 
> So I really don't see the point of this whole thing. Why even bother.
> Nobody sane will ever use that odd two-systemcall model, and even if
> they did, it would be slower and inconvenient.
> 
> The whole and only point of this seems to be the two lines that say
> 
>        if (flags & ~VALID_RESOLVE_FLAGS)
>               return -EINVAL;
> 
> but that adds absolutely zero value to anything.  The argument is that
> "we can't add it to existing flags, because old kernels won't honor
> it", but that's a completely BS argument, since the user has to have a
> fallback anyway for the old kernel case - so we literally could much
> more conveniently just expose it as a prctl() or something to _ask_
> the kernel what flags it honors.
> 
> So to me, this whole argument means that "Oh, we'll make it really
> inconvenient to actually use this".
> 
> If we want to introduce a new system call that allows cool new
> features, it should have *more* powerful semantics than the existing
> ones, not be clearly weaker and less useful.

You might not have seen the v8 of this set I sent a few days ago[1]. The
new set includes an example of a feature that is possible with
resolveat(2) but not with the current openat(O_PATH) interface. The
feature is that you can set RESOLVE_UPGRADE_NO{READ,WRITE} which then
blocks the re-opening of the file descriptor with those MAY_* modes.
(Though of course you might be against the entire idea of this feature
which allows for restricting the opening of magic-links.)

This can't be done with openat(2) without adding even more flags such as
O_PATH_UPGRADE_NOWRITE -- because O_RDONLY = 0, which means you can't
distinguish the "don't allow read or write" case (we could define 0x3
for that, but that feels a tad ugly). Not to mention that broken
userspace programs might already be setting O_PATH|O_RDWR.

So, while making it easier for userspace to be sure these flags are
working is one benefit, it's not the only reason. And outside arguments
for future features, several folks (some on-list, some on LWN) argued
that adding more "open" flags which aren't clearly related to the mode
the file is opened with makes not-much-more sense than a separate
syscall for it. Another (weaker) argument is that O_PATH should've been
separate from the beginning because of how unlike an ordinary fd it is.

Funnily enough, v8 does contain O_EMPTYPATH. However, this is just an
example of another /proc-less interface and we need it for O_PATH
descriptors even if you can do an full open in one shot with restricted
path resolution. Having an O_PATH can be useful on its own (LXC takes an
O_PATH of /dev/pts/ptmx inside the container and then re-opens it each
time a new console is required to avoid touching paths inside the
container).

But it would be neat to have a way for userspace to easily check what
flags the kernel honours, regardless of this patchset.

> So how about making the new system call be something that is a
> *superset* of "openat()" so that people can use that, and then if it
> fails, just fall back to openat(). But if it succeeds, it just
> succeeds, and you don't need to then do other system calls to actually
> make it useful.
> 
> Make the new system call something people *want* to use because it's
> useful, not a crippled useless thing that has some special case use
> for some limited thing and just wastes system call space.

At the moment, I'm working on implementing userspace library wrappers
which use resolveat(2) for safe handling of an untrusted rootfs. I would
expect that most users of resolveat(2) would be using a library to
handle it -- because to do an "mkdir -p" you need to do a fair bit of
work for it to be safe unless we add LOOKUP_* flags to mkdirat(2) and
every other syscall. This is true whether or not openat(2) provides this
feature or if it's a separate syscall.

> Example *useful* system call attributes:
> 
>  - make it like openat(), but have another argument with the "limit flags"

Sure, this would also work. I didn't know if anyone was open to the idea
of openat2(2). There is a follow-up question of how RESOLVE_UPGRADE_NO*
flags would be handled (they aren't obviously "lookup" flags so we'd
need to add more openat(2) flags to accommodate them) but I'm sure that
can be ironed out once you've taken a look at that patchset.

>  - maybe return more status of the resulting file. People very
> commonly do "open->fstat" just to get the size for mmap or to check
> some other detail of the file before use.

So something like

  resolveat(rootfd, "path/to/file", RESOLVE_IN_ROOT, &statbuf);

or

  openat2(rootfd, "path/to/file", O_PATH, RESOLVE_IN_ROOT, &statbuf);

? Is there a large amount of overhead or downside to the current
open->fstat way of doing things, or is this more of a "if we're going to
add more ways of opening we might as well add more useful things"?

> In other words, make the new system call *useful*. Not some castrated
> "not useful on its own" thing.
>
> So I still support the whole "let's make it easy to limit path lookup
> in sane ways", but this model of then limiting using the result sanely
> just makes me a sad panda.

I am glad that you agree with the general thrust, and it's just the
interface that is the hang-up.

[1]: https://marc.info/?l=linux-fsdevel&m=155835923516235&w=2

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v7 5/5] namei: resolveat(2) syscall
From: Linus Torvalds @ 2019-05-24 22:59 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Christian Brauner, Eric Biederman, Andy Lutomirski,
	Andrew Morton, Alexei Starovoitov, Kees Cook, Jann Horn,
	Tycho Andersen, David Drysdale, Chanho Min, Oleg Nesterov,
	Aleksa Sarai, Linux Containers, linux-fsdevel, Linux API
In-Reply-To: <20190507164317.13562-6-cyphar@cyphar.com>

On Tue, May 7, 2019 at 9:44 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> The most obvious syscall to add support for the new LOOKUP_* scoping
> flags would be openat(2) (along with the required execveat(2) change
> included in this series). However, there are a few reasons to not do
> this:

So honestly, this last patch is what turns me off the whole thing.

It goes from a nice new feature ("you can use O_NOSYMLINKS to disallow
symlink traversal") to a special-case joke that isn't worth it any
more. You get a useless path descrptor back from s special hacky
system call, you don't actually get the useful data that you probably
*want* the open to get you.

Sure, you could eventually then use a *second* system call (openat
with O_EMPTYPATH) to actually get something you can *use*, but at this
point you've just wasted everybodys time and effort with a pointless
second system call.

So I really don't see the point of this whole thing. Why even bother.
Nobody sane will ever use that odd two-systemcall model, and even if
they did, it would be slower and inconvenient.

The whole and only point of this seems to be the two lines that say

       if (flags & ~VALID_RESOLVE_FLAGS)
              return -EINVAL;

but that adds absolutely zero value to anything.  The argument is that
"we can't add it to existing flags, because old kernels won't honor
it", but that's a completely BS argument, since the user has to have a
fallback anyway for the old kernel case - so we literally could much
more conveniently just expose it as a prctl() or something to _ask_
the kernel what flags it honors.

So to me, this whole argument means that "Oh, we'll make it really
inconvenient to actually use this".

If we want to introduce a new system call that allows cool new
features, it should have *more* powerful semantics than the existing
ones, not be clearly weaker and less useful.

So how about making the new system call be something that is a
*superset* of "openat()" so that people can use that, and then if it
fails, just fall back to openat(). But if it succeeds, it just
succeeds, and you don't need to then do other system calls to actually
make it useful.

Make the new system call something people *want* to use because it's
useful, not a crippled useless thing that has some special case use
for some limited thing and just wastes system call space.

Example *useful* system call attributes:

 - make it like openat(), but have another argument with the "limit flags"

 - maybe return more status of the resulting file. People very
commonly do "open->fstat" just to get the size for mmap or to check
some other detail of the file before use.

In other words, make the new system call *useful*. Not some castrated
"not useful on its own" thing.

So I still support the whole "let's make it easy to limit path lookup
in sane ways", but this model of then limiting using the result sanely
just makes me a sad panda.

                     Linus

^ permalink raw reply

* Re: [PATCH] mm: mlockall error for flag MCL_ONFAULT
From: Daniel Jordan @ 2019-05-24 21:43 UTC (permalink / raw)
  To: Potyra, Stefan
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Jordan, Tobias,
	akpm, vbabka, mhocko, kirill.shutemov, linux-api
In-Reply-To: <20190522112329.GA25483@er01809n.ebgroup.elektrobit.com>

[ Adding linux-api and some of the people who were involved in the
MCL_ONFAULT/mlock2/etc discussions.  Author of the Fixes patch appears to
have moved on. ]

On Wed, May 22, 2019 at 11:23:37AM +0000, Potyra, Stefan wrote:
> If mlockall() is called with only MCL_ONFAULT as flag,
> it removes any previously applied lockings and does
> nothing else.

The change looks reasonable.  Hard to imagine any application relies on it, and
they really shouldn't be if they are.  Debian codesearch turned up only a few
cases where stress-ng was doing this for unknown reasons[1] and this change
isn't gonna break those.  In this case I think changing the syscall's behavior
is justified.  

> This behavior is counter-intuitive and doesn't match the
> Linux man page.

I'd quote it for the changelog:

  For mlockall():

  EINVAL Unknown  flags were specified or MCL_ONFAULT was specified with‐
         out either MCL_FUTURE or MCL_CURRENT.

With that you can add

Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>

[1] https://sources.debian.org/src/stress-ng/0.09.50-1/stress-mlock.c/?hl=203#L203

^ permalink raw reply

* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
From: Kirill Tkhai @ 2019-05-24 14:00 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	alexander.h.duyck, ira.weiny, andreyknvl, arunks, vbabka, cl,
	riel, keescook, hannes, npiggin, mathieu.desnoyers, shakeelb,
	guro, aarcange, hughd, jglisse, mgorman, daniel.m.jordan, jannh,
	kilobyte, linux-api, linux-kernel, linux-mm
In-Reply-To: <20190524115239.ugxv766doolc6nsc@box>

On 24.05.2019 14:52, Kirill A. Shutemov wrote:
> On Fri, May 24, 2019 at 01:45:50PM +0300, Kirill Tkhai wrote:
>> On 22.05.2019 18:22, Kirill A. Shutemov wrote:
>>> On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote:
>>>> This patchset adds a new syscall, which makes possible
>>>> to clone a VMA from a process to current process.
>>>> The syscall supplements the functionality provided
>>>> by process_vm_writev() and process_vm_readv() syscalls,
>>>> and it may be useful in many situation.
>>>
>>> Kirill, could you explain how the change affects rmap and how it is safe.
>>>
>>> My concern is that the patchset allows to map the same page multiple times
>>> within one process or even map page allocated by child to the parrent.
>>>
>>> It was not allowed before.
>>>
>>> In the best case it makes reasoning about rmap substantially more difficult.
>>>
>>> But I'm worry it will introduce hard-to-debug bugs, like described in
>>> https://lwn.net/Articles/383162/.
>>
>> Andy suggested to unmap PTEs from source page table, and this make the single
>> page never be mapped in the same process twice. This is OK for my use case,
>> and here we will just do a small step "allow to inherit VMA by a child process",
>> which we didn't have before this. If someone still needs to continue the work
>> to allow the same page be mapped twice in a single process in the future, this
>> person will have a supported basis we do in this small step. I believe, someone
>> like debugger may want to have this to make a fast snapshot of a process private
>> memory (when the task is stopped for a small time to get its memory). But for
>> me remapping is enough at the moment.
>>
>> What do you think about this?
> 
> I don't think that unmapping alone will do. Consider the following
> scenario:
> 
> 1. Task A creates and populates the mapping.
> 2. Task A forks. We have now Task B mapping the same pages, but
> write-protected.
> 3. Task B calls process_vm_mmap() and passes the mapping to the parent.
> 
> After this Task A will have the same anon pages mapped twice.

Ah, sure.

> One possible way out would be to force CoW on all pages in the mapping,
> before passing the mapping to the new process.

This will pop all swapped pages up, which is the thing the patchset aims
to prevent.

Hm, what about allow remapping only VMA, which anon_vma::rb_root contain
only chain and which vma->anon_vma_chain contains single entry? This is
a vma, which were faulted, but its mm never were duplicated (or which
forks already died).

Thanks,
Kirill

^ permalink raw reply

* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
From: Kirill A. Shutemov @ 2019-05-24 11:52 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	alexander.h.duyck, ira.weiny, andreyknvl, arunks, vbabka, cl,
	riel, keescook, hannes, npiggin, mathieu.desnoyers, shakeelb,
	guro, aarcange, hughd, jglisse, mgorman, daniel.m.jordan, jannh,
	kilobyte, linux-api, linux-kernel, linux-mm
In-Reply-To: <358bb95e-0dca-6a82-db39-83c0cf09a06c@virtuozzo.com>

On Fri, May 24, 2019 at 01:45:50PM +0300, Kirill Tkhai wrote:
> On 22.05.2019 18:22, Kirill A. Shutemov wrote:
> > On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote:
> >> This patchset adds a new syscall, which makes possible
> >> to clone a VMA from a process to current process.
> >> The syscall supplements the functionality provided
> >> by process_vm_writev() and process_vm_readv() syscalls,
> >> and it may be useful in many situation.
> > 
> > Kirill, could you explain how the change affects rmap and how it is safe.
> > 
> > My concern is that the patchset allows to map the same page multiple times
> > within one process or even map page allocated by child to the parrent.
> > 
> > It was not allowed before.
> > 
> > In the best case it makes reasoning about rmap substantially more difficult.
> > 
> > But I'm worry it will introduce hard-to-debug bugs, like described in
> > https://lwn.net/Articles/383162/.
> 
> Andy suggested to unmap PTEs from source page table, and this make the single
> page never be mapped in the same process twice. This is OK for my use case,
> and here we will just do a small step "allow to inherit VMA by a child process",
> which we didn't have before this. If someone still needs to continue the work
> to allow the same page be mapped twice in a single process in the future, this
> person will have a supported basis we do in this small step. I believe, someone
> like debugger may want to have this to make a fast snapshot of a process private
> memory (when the task is stopped for a small time to get its memory). But for
> me remapping is enough at the moment.
> 
> What do you think about this?

I don't think that unmapping alone will do. Consider the following
scenario:

1. Task A creates and populates the mapping.
2. Task A forks. We have now Task B mapping the same pages, but
write-protected.
3. Task B calls process_vm_mmap() and passes the mapping to the parent.

After this Task A will have the same anon pages mapped twice.

One possible way out would be to force CoW on all pages in the mapping,
before passing the mapping to the new process.

Thanks,
Kirill.

^ permalink raw reply

* [PATCH v3 3/3] tests: add close_range() tests
From: Christian Brauner @ 2019-05-24 11:10 UTC (permalink / raw)
  To: viro, linux-kernel, linux-fsdevel, torvalds, fweimer
  Cc: jannh, oleg, tglx, arnd, shuah, dhowells, tkjos, ldv, miklos,
	Christian Brauner, linux-api, linux-kselftest
In-Reply-To: <20190524111047.6892-1-christian@brauner.io>

This adds basic tests for the new close_range() syscall.
- test that no invalid flags can be passed
- test that a range of file descriptors is correctly closed
- test that a range of file descriptors is correctly closed if there there
  are already closed file descriptors in the range
- test that max_fd is correctly capped to the current fdtable maximum

Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dmitry V. Levin <ldv@altlinux.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: linux-api@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
---
v1: unchanged
v2:
- Christian Brauner <christian@brauner.io>:
  - verify that close_range() correctly closes a single file descriptor
v3:
- Christian Brauner <christian@brauner.io>:
  - add missing Cc for Shuah
  - add missing Cc for linux-kselftest
---
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/core/.gitignore       |   1 +
 tools/testing/selftests/core/Makefile         |   6 +
 .../testing/selftests/core/close_range_test.c | 142 ++++++++++++++++++
 4 files changed, 150 insertions(+)
 create mode 100644 tools/testing/selftests/core/.gitignore
 create mode 100644 tools/testing/selftests/core/Makefile
 create mode 100644 tools/testing/selftests/core/close_range_test.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 9781ca79794a..06e57fabbff9 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -4,6 +4,7 @@ TARGETS += bpf
 TARGETS += breakpoints
 TARGETS += capabilities
 TARGETS += cgroup
+TARGETS += core
 TARGETS += cpufreq
 TARGETS += cpu-hotplug
 TARGETS += drivers/dma-buf
diff --git a/tools/testing/selftests/core/.gitignore b/tools/testing/selftests/core/.gitignore
new file mode 100644
index 000000000000..6e6712ce5817
--- /dev/null
+++ b/tools/testing/selftests/core/.gitignore
@@ -0,0 +1 @@
+close_range_test
diff --git a/tools/testing/selftests/core/Makefile b/tools/testing/selftests/core/Makefile
new file mode 100644
index 000000000000..de3ae68aa345
--- /dev/null
+++ b/tools/testing/selftests/core/Makefile
@@ -0,0 +1,6 @@
+CFLAGS += -g -I../../../../usr/include/ -I../../../../include
+
+TEST_GEN_PROGS := close_range_test
+
+include ../lib.mk
+
diff --git a/tools/testing/selftests/core/close_range_test.c b/tools/testing/selftests/core/close_range_test.c
new file mode 100644
index 000000000000..d6e6079d3d53
--- /dev/null
+++ b/tools/testing/selftests/core/close_range_test.c
@@ -0,0 +1,142 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <linux/kernel.h>
+#include <limits.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <syscall.h>
+#include <unistd.h>
+
+#include "../kselftest.h"
+
+static inline int sys_close_range(unsigned int fd, unsigned int max_fd,
+				  unsigned int flags)
+{
+	return syscall(__NR_close_range, fd, max_fd, flags);
+}
+
+#ifndef ARRAY_SIZE
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
+#endif
+
+int main(int argc, char **argv)
+{
+	const char *test_name = "close_range";
+	int i, ret;
+	int open_fds[101];
+	int fd_max, fd_mid, fd_min;
+
+	ksft_set_plan(9);
+
+	for (i = 0; i < ARRAY_SIZE(open_fds); i++) {
+		int fd;
+
+		fd = open("/dev/null", O_RDONLY | O_CLOEXEC);
+		if (fd < 0) {
+			if (errno == ENOENT)
+				ksft_exit_skip(
+					"%s test: skipping test since /dev/null does not exist\n",
+					test_name);
+
+			ksft_exit_fail_msg(
+				"%s test: %s - failed to open /dev/null\n",
+				strerror(errno), test_name);
+		}
+
+		open_fds[i] = fd;
+	}
+
+	fd_min = open_fds[0];
+	fd_max = open_fds[99];
+
+	ret = sys_close_range(fd_min, fd_max, 1);
+	if (!ret)
+		ksft_exit_fail_msg(
+			"%s test: managed to pass invalid flag value\n",
+			test_name);
+	ksft_test_result_pass("do not allow invalid flag values for close_range()\n");
+
+	fd_mid = open_fds[50];
+	ret = sys_close_range(fd_min, fd_mid, 0);
+	if (ret < 0)
+		ksft_exit_fail_msg(
+			"%s test: Failed to close range of file descriptors from %d to %d\n",
+			test_name, fd_min, fd_mid);
+	ksft_test_result_pass("close_range() from %d to %d\n", fd_min, fd_mid);
+
+	for (i = 0; i <= 50; i++) {
+		ret = fcntl(open_fds[i], F_GETFL);
+		if (ret >= 0)
+			ksft_exit_fail_msg(
+				"%s test: Failed to close range of file descriptors from %d to %d\n",
+				test_name, fd_min, fd_mid);
+	}
+	ksft_test_result_pass("fcntl() verify closed range from %d to %d\n", fd_min, fd_mid);
+
+	/* create a couple of gaps */
+	close(57);
+	close(78);
+	close(81);
+	close(82);
+	close(84);
+	close(90);
+
+	fd_mid = open_fds[51];
+	/* Choose slightly lower limit and leave some fds for a later test */
+	fd_max = open_fds[92];
+	ret = sys_close_range(fd_mid, fd_max, 0);
+	if (ret < 0)
+		ksft_exit_fail_msg(
+			"%s test: Failed to close range of file descriptors from 51 to 100\n",
+			test_name);
+	ksft_test_result_pass("close_range() from %d to %d\n", fd_mid, fd_max);
+
+	for (i = 51; i <= 92; i++) {
+		ret = fcntl(open_fds[i], F_GETFL);
+		if (ret >= 0)
+			ksft_exit_fail_msg(
+				"%s test: Failed to close range of file descriptors from 51 to 100\n",
+				test_name);
+	}
+	ksft_test_result_pass("fcntl() verify closed range from %d to %d\n", fd_mid, fd_max);
+
+	fd_mid = open_fds[93];
+	fd_max = open_fds[99];
+	/* test that the kernel caps and still closes all fds */
+	ret = sys_close_range(fd_mid, UINT_MAX, 0);
+	if (ret < 0)
+		ksft_exit_fail_msg(
+			"%s test: Failed to close range of file descriptors from 51 to 100\n",
+			test_name);
+	ksft_test_result_pass("close_range() from %d to %d\n", fd_mid, fd_max);
+
+	for (i = 93; i < 100; i++) {
+		ret = fcntl(open_fds[i], F_GETFL);
+		if (ret >= 0)
+			ksft_exit_fail_msg(
+				"%s test: Failed to close range of file descriptors from 51 to 100\n",
+				test_name);
+	}
+	ksft_test_result_pass("fcntl() verify closed range from %d to %d\n", fd_mid, fd_max);
+
+	ret = sys_close_range(open_fds[100], open_fds[100], 0);
+	if (ret < 0)
+		ksft_exit_fail_msg(
+			"%s test: Failed to close single file descriptor\n",
+			test_name);
+	ksft_test_result_pass("close_range() closed single file descriptor\n");
+
+	ret = fcntl(open_fds[100], F_GETFL);
+	if (ret >= 0)
+		ksft_exit_fail_msg(
+			"%s test: Failed to close single file descriptor\n",
+			test_name);
+	ksft_test_result_pass("fcntl() verify closed single file descriptor\n");
+
+	return ksft_exit_pass();
+}
-- 
2.21.0

^ permalink raw reply related

* [PATCH v3 2/3] arch: wire-up close_range()
From: Christian Brauner @ 2019-05-24 11:10 UTC (permalink / raw)
  To: viro, linux-kernel, linux-fsdevel, torvalds, fweimer
  Cc: jannh, oleg, tglx, arnd, shuah, dhowells, tkjos, ldv, miklos,
	Christian Brauner, linux-api, linux-alpha, linux-arm-kernel,
	linux-ia64, linux-m68k, linux-mips, linux-parisc, linuxppc-dev,
	linux-s390, linux-sh, sparclinux, linux-xtensa, linux-arch, x86
In-Reply-To: <20190524111047.6892-1-christian@brauner.io>

This wires up the close_range() syscall into all arches at once.

Suggested-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Christian Brauner <christian@brauner.io>
Reviewed-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dmitry V. Levin <ldv@altlinux.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
Cc: linux-alpha@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-m68k@lists.linux-m68k.org
Cc: linux-mips@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Cc: linux-xtensa@linux-xtensa.org
Cc: linux-arch@vger.kernel.org
Cc: x86@kernel.org
---
v1:
v2:
v3: added
- Arnd Bergmann <arnd@arndb.de>:
  - split into two patches:
    1. add close_range()
    2. add syscall to all arches at once
  - bump __NR_compat_syscalls in arch/arm64/include/asm/unistd.h
---
 arch/alpha/kernel/syscalls/syscall.tbl      | 1 +
 arch/arm/tools/syscall.tbl                  | 1 +
 arch/arm64/include/asm/unistd.h             | 2 +-
 arch/arm64/include/asm/unistd32.h           | 2 ++
 arch/ia64/kernel/syscalls/syscall.tbl       | 1 +
 arch/m68k/kernel/syscalls/syscall.tbl       | 1 +
 arch/microblaze/kernel/syscalls/syscall.tbl | 1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   | 1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   | 1 +
 arch/parisc/kernel/syscalls/syscall.tbl     | 1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    | 1 +
 arch/s390/kernel/syscalls/syscall.tbl       | 1 +
 arch/sh/kernel/syscalls/syscall.tbl         | 1 +
 arch/sparc/kernel/syscalls/syscall.tbl      | 1 +
 arch/x86/entry/syscalls/syscall_32.tbl      | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl      | 1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     | 1 +
 include/uapi/asm-generic/unistd.h           | 4 +++-
 19 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 9e7704e44f6d..b55d93af8096 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -473,3 +473,4 @@
 541	common	fsconfig			sys_fsconfig
 542	common	fsmount				sys_fsmount
 543	common	fspick				sys_fspick
+545	common	close_range			sys_close_range
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index aaf479a9e92d..0125c97c75dd 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -447,3 +447,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/arch/arm64/include/asm/unistd.h b/arch/arm64/include/asm/unistd.h
index 70e6882853c0..d04eb26cfaeb 100644
--- a/arch/arm64/include/asm/unistd.h
+++ b/arch/arm64/include/asm/unistd.h
@@ -44,7 +44,7 @@
 #define __ARM_NR_compat_set_tls		(__ARM_NR_COMPAT_BASE + 5)
 #define __ARM_NR_COMPAT_END		(__ARM_NR_COMPAT_BASE + 0x800)
 
-#define __NR_compat_syscalls		434
+#define __NR_compat_syscalls		436
 #endif
 
 #define __ARCH_WANT_SYS_CLONE
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index c39e90600bb3..9a3270d29b42 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -886,6 +886,8 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
 __SYSCALL(__NR_fsmount, sys_fsmount)
 #define __NR_fspick 433
 __SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_close_range 435
+__SYSCALL(__NR_close_range, sys_close_range)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index e01df3f2f80d..1a90b464e96f 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -354,3 +354,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 7e3d0734b2f3..2dee2050f9ef 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -433,3 +433,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 26339e417695..923ef69e5a76 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -439,3 +439,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 0e2dd68ade57..967ed9de51cd 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -372,3 +372,4 @@
 431	n32	fsconfig			sys_fsconfig
 432	n32	fsmount				sys_fsmount
 433	n32	fspick				sys_fspick
+435	n32	close_range			sys_close_range
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 5eebfa0d155c..71de731102b1 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -348,3 +348,4 @@
 431	n64	fsconfig			sys_fsconfig
 432	n64	fsmount				sys_fsmount
 433	n64	fspick				sys_fspick
+435	n64	close_range			sys_close_range
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 3cc1374e02d0..5a325ab29f88 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -421,3 +421,4 @@
 431	o32	fsconfig			sys_fsconfig
 432	o32	fsmount				sys_fsmount
 433	o32	fspick				sys_fspick
+435	o32	close_range			sys_close_range
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index c9e377d59232..dcc0a0879139 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -430,3 +430,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 103655d84b4b..ba2c1f078cbd 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -515,3 +515,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index e822b2964a83..d7c9043d2902 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -436,3 +436,4 @@
 431  common	fsconfig		sys_fsconfig			sys_fsconfig
 432  common	fsmount			sys_fsmount			sys_fsmount
 433  common	fspick			sys_fspick			sys_fspick
+435  common	close_range		sys_close_range			sys_close_range
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 016a727d4357..9b5e6bf0ce32 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -436,3 +436,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index e047480b1605..8c674a1e0072 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -479,3 +479,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ad968b7bac72..7f7a89a96707 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -438,3 +438,4 @@
 431	i386	fsconfig		sys_fsconfig			__ia32_sys_fsconfig
 432	i386	fsmount			sys_fsmount			__ia32_sys_fsmount
 433	i386	fspick			sys_fspick			__ia32_sys_fspick
+435	i386	close_range		sys_close_range			__ia32_sys_close_range
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index b4e6f9e6204a..0f7d47ae921c 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
 431	common	fsconfig		__x64_sys_fsconfig
 432	common	fsmount			__x64_sys_fsmount
 433	common	fspick			__x64_sys_fspick
+435	common	close_range		__x64_sys_close_range
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 5fa0ee1c8e00..b489532265d0 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -404,3 +404,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a87904daf103..3f36c8745d24 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -844,9 +844,11 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
 __SYSCALL(__NR_fsmount, sys_fsmount)
 #define __NR_fspick 433
 __SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_close_range 435
+__SYSCALL(__NR_close_range, sys_close_range)
 
 #undef __NR_syscalls
-#define __NR_syscalls 434
+#define __NR_syscalls 436
 
 /*
  * 32 bit systems traditionally used different
-- 
2.21.0

^ permalink raw reply related

* [PATCH v3 1/3] open: add close_range()
From: Christian Brauner @ 2019-05-24 11:10 UTC (permalink / raw)
  To: viro, linux-kernel, linux-fsdevel, torvalds, fweimer
  Cc: jannh, oleg, tglx, arnd, shuah, dhowells, tkjos, ldv, miklos,
	Christian Brauner, linux-api
In-Reply-To: <20190524111047.6892-1-christian@brauner.io>

This adds the close_range() syscall. It allows to efficiently close a range
of file descriptors up to all file descriptors of a calling task.

The syscall came up in a recent discussion around the new mount API and
making new file descriptor types cloexec by default. During this
discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
syscall in this manner has been requested by various people over time.

First, it helps to close all file descriptors of an exec()ing task. This
can be done safely via (quoting Al's example from [1] verbatim):

        /* that exec is sensitive */
        unshare(CLONE_FILES);
        /* we don't want anything past stderr here */
        close_range(3, ~0U);
        execve(....);

The code snippet above is one way of working around the problem that file
descriptors are not cloexec by default. This is aggravated by the fact that
we can't just switch them over without massively regressing userspace. For
a whole class of programs having an in-kernel method of closing all file
descriptors is very helpful (e.g. demons, service managers, programming
language standard libraries, container managers etc.).
(Please note, unshare(CLONE_FILES) should only be needed if the calling
 task is multi-threaded and shares the file descriptor table with another
 thread in which case two threads could race with one thread allocating
 file descriptors and the other one closing them via close_range(). For the
 general case close_range() before the execve() is sufficient.)

Second, it allows userspace to avoid implementing closing all file
descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
file descriptor. From looking at various large(ish) userspace code bases
this or similar patterns are very common in:
- service managers (cf. [4])
- libcs (cf. [6])
- container runtimes (cf. [5])
- programming language runtimes/standard libraries
  - Python (cf. [2])
  - Rust (cf. [7], [8])
As Dmitry pointed out there's even a long-standing glibc bug about missing
kernel support for this task (cf. [3]).
In addition, the syscall will also work for tasks that do not have procfs
mounted and on kernels that do not have procfs support compiled in. In such
situations the only way to make sure that all file descriptors are closed
is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
OPEN_MAX trickery (cf. comment [8] on Rust).

The performance is striking. For good measure, comparing the following
simple close_all_fds() userspace implementation that is essentially just
glibc's version in [6]:

static int close_all_fds(void)
{
        int dir_fd;
        DIR *dir;
        struct dirent *direntp;

        dir = opendir("/proc/self/fd");
        if (!dir)
                return -1;
        dir_fd = dirfd(dir);
        while ((direntp = readdir(dir))) {
                int fd;
                if (strcmp(direntp->d_name, ".") == 0)
                        continue;
                if (strcmp(direntp->d_name, "..") == 0)
                        continue;
                fd = atoi(direntp->d_name);
                if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
                        continue;
                close(fd);
        }
        closedir(dir);
        return 0;
}

to close_range() yields:
1. closing 4 open files:
   - close_all_fds(): ~280 us
   - close_range():    ~24 us

2. closing 1000 open files:
   - close_all_fds(): ~5000 us
   - close_range():   ~800 us

close_range() is designed to allow for some flexibility. Specifically, it
does not simply always close all open file descriptors of a task. Instead,
callers can specify an upper bound.
This is e.g. useful for scenarios where specific file descriptors are
created with well-known numbers that are supposed to be excluded from
getting closed.
For extra paranoia close_range() comes with a flags argument. This can e.g.
be used to implement extension. Once can imagine userspace wanting to stop
at the first error instead of ignoring errors under certain circumstances.
There might be other valid ideas in the future. In any case, a flag
argument doesn't hurt and keeps us on the safe side.

>From an implementation side this is kept rather dumb. It saw some input
from David and Jann but all nonsense is obviously my own!
- Errors to close file descriptors are currently ignored. (Could be changed
  by setting a flag in the future if needed.)
- __close_range() is a rather simplistic wrapper around __close_fd().
  My reasoning behind this is based on the nature of how __close_fd() needs
  to release an fd. But maybe I misunderstood specifics:
  We take the files_lock and rcu-dereference the fdtable of the calling
  task, we find the entry in the fdtable, get the file and need to release
  files_lock before calling filp_close().
  In the meantime the fdtable might have been altered so we can't just
  retake the spinlock and keep the old rcu-reference of the fdtable
  around. Instead we need to grab a fresh reference to the fdtable.
  If my reasoning is correct then there's really no point in fancyfying
  __close_range(): We just need to rcu-dereference the fdtable of the
  calling task once to cap the max_fd value correctly and then go on
  calling __close_fd() in a loop.

/* References */
[1]: https://lore.kernel.org/lkml/20190516165021.GD17978@ZenIV.linux.org.uk/
[2]: https://github.com/python/cpython/blob/9e4f2f3a6b8ee995c365e86d976937c141d867f8/Modules/_posixsubprocess.c#L220
[3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7
[4]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/fd-util.c#L217
[5]: https://github.com/lxc/lxc/blob/ddf4b77e11a4d08f09b7b9cd13e593f8c047edc5/src/lxc/start.c#L236
[6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17
     Note that this is an internal implementation that is not exported.
     Currently, libc seems to not provide an exported version of this
     because of missing kernel support to do this.
[7]: https://github.com/rust-lang/rust/issues/12148
[8]: https://github.com/rust-lang/rust/blob/5f47c0613ed4eb46fca3633c1297364c09e5e451/src/libstd/sys/unix/process2.rs#L303-L308
     Rust's solution is slightly different but is equally unperformant.
     Rust calls getdtablesize() which is a glibc library function that
     simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then
     goes on to call close() on each fd. That's obviously overkill for most
     tasks. Rarely, tasks - especially non-demons - hit RLIMIT_NOFILE or
     OPEN_MAX.
     Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set
     to 1024. Even in this case, there's a very high chance that in the
     common case Rust is calling the close() syscall 1021 times pointlessly
     if the task just has 0, 1, and 2 open.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dmitry V. Levin <ldv@altlinux.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
---
v1:
- Linus Torvalds <torvalds@linux-foundation.org>:
  - add cond_resched() to yield cpu when closing a lot of file descriptors
- Al Viro <viro@zeniv.linux.org.uk>:
  - add cond_resched() to yield cpu when closing a lot of file descriptors
v2: unchanged
v3:
- Oleg Nesterov <oleg@redhat.com>:
  - fix braino: s/max()/min()/
---
 fs/file.c                | 62 ++++++++++++++++++++++++++++++++++------
 fs/open.c                | 20 +++++++++++++
 include/linux/fdtable.h  |  2 ++
 include/linux/syscalls.h |  2 ++
 4 files changed, 78 insertions(+), 8 deletions(-)

diff --git a/fs/file.c b/fs/file.c
index 3da91a112bab..e896d87f4431 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -10,6 +10,7 @@
 #include <linux/syscalls.h>
 #include <linux/export.h>
 #include <linux/fs.h>
+#include <linux/kernel.h>
 #include <linux/mm.h>
 #include <linux/sched/signal.h>
 #include <linux/slab.h>
@@ -615,12 +616,9 @@ void fd_install(unsigned int fd, struct file *file)
 
 EXPORT_SYMBOL(fd_install);
 
-/*
- * The same warnings as for __alloc_fd()/__fd_install() apply here...
- */
-int __close_fd(struct files_struct *files, unsigned fd)
+static struct file *pick_file(struct files_struct *files, unsigned fd)
 {
-	struct file *file;
+	struct file *file = NULL;
 	struct fdtable *fdt;
 
 	spin_lock(&files->file_lock);
@@ -632,15 +630,63 @@ int __close_fd(struct files_struct *files, unsigned fd)
 		goto out_unlock;
 	rcu_assign_pointer(fdt->fd[fd], NULL);
 	__put_unused_fd(files, fd);
-	spin_unlock(&files->file_lock);
-	return filp_close(file, files);
 
 out_unlock:
 	spin_unlock(&files->file_lock);
-	return -EBADF;
+	return file;
+}
+
+/*
+ * The same warnings as for __alloc_fd()/__fd_install() apply here...
+ */
+int __close_fd(struct files_struct *files, unsigned fd)
+{
+	struct file *file;
+
+	file = pick_file(files, fd);
+	if (!file)
+		return -EBADF;
+
+	return filp_close(file, files);
 }
 EXPORT_SYMBOL(__close_fd); /* for ksys_close() */
 
+/**
+ * __close_range() - Close all file descriptors in a given range.
+ *
+ * @fd:     starting file descriptor to close
+ * @max_fd: last file descriptor to close
+ *
+ * This closes a range of file descriptors. All file descriptors
+ * from @fd up to and including @max_fd are closed.
+ */
+int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
+{
+	unsigned int cur_max;
+
+	if (fd > max_fd)
+		return -EINVAL;
+
+	rcu_read_lock();
+	cur_max = files_fdtable(files)->max_fds;
+	rcu_read_unlock();
+
+	/* cap to last valid index into fdtable */
+	max_fd = min(max_fd, (cur_max - 1));
+	while (fd <= max_fd) {
+		struct file *file;
+
+		file = pick_file(files, fd++);
+		if (!file)
+			continue;
+
+		filp_close(file, files);
+		cond_resched();
+	}
+
+	return 0;
+}
+
 /*
  * variant of __close_fd that gets a ref on the file for later fput
  */
diff --git a/fs/open.c b/fs/open.c
index 9c7d724a6f67..c7baaee7aa47 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1174,6 +1174,26 @@ SYSCALL_DEFINE1(close, unsigned int, fd)
 	return retval;
 }
 
+/**
+ * close_range() - Close all file descriptors in a given range.
+ *
+ * @fd:     starting file descriptor to close
+ * @max_fd: last file descriptor to close
+ * @flags:  reserved for future extensions
+ *
+ * This closes a range of file descriptors. All file descriptors
+ * from @fd up to and including @max_fd are closed.
+ * Currently, errors to close a given file descriptor are ignored.
+ */
+SYSCALL_DEFINE3(close_range, unsigned int, fd, unsigned int, max_fd,
+		unsigned int, flags)
+{
+	if (flags)
+		return -EINVAL;
+
+	return __close_range(current->files, fd, max_fd);
+}
+
 /*
  * This routine simulates a hangup on the tty, to arrange that users
  * are given clean terminals at login time.
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index f07c55ea0c22..fcd07181a365 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -121,6 +121,8 @@ extern void __fd_install(struct files_struct *files,
 		      unsigned int fd, struct file *file);
 extern int __close_fd(struct files_struct *files,
 		      unsigned int fd);
+extern int __close_range(struct files_struct *files, unsigned int fd,
+			 unsigned int max_fd);
 extern int __close_fd_get_file(unsigned int fd, struct file **res);
 
 extern struct kmem_cache *files_cachep;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe1be5b..c0189e223255 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -441,6 +441,8 @@ asmlinkage long sys_fchown(unsigned int fd, uid_t user, gid_t group);
 asmlinkage long sys_openat(int dfd, const char __user *filename, int flags,
 			   umode_t mode);
 asmlinkage long sys_close(unsigned int fd);
+asmlinkage long sys_close_range(unsigned int fd, unsigned int max_fd,
+				unsigned int flags);
 asmlinkage long sys_vhangup(void);
 
 /* fs/pipe.c */
-- 
2.21.0

^ permalink raw reply related

* Re: [PATCH v3 1/2] pid: add pidfd_open()
From: Christian Brauner @ 2019-05-24 10:54 UTC (permalink / raw)
  To: jannh, oleg, viro, torvalds, arnd
  Cc: linux-ia64, linux-sh, linux-mips, dhowells, joel, linux-kselftest,
	sparclinux, elena.reshetova, linux-arch, linux-s390, dancol,
	kernel-team, serge, linux-xtensa, keescook, linux-m68k, luto,
	tglx, surenb, linux-arm-kernel, linux-parisc, linux-api, cyphar,
	luto, ebiederm, linux-alpha, akpm, linuxppc-dev
In-Reply-To: <20190521143220.crb2zyvdov3fl4g7@brauner.io>

On Tue, May 21, 2019 at 04:32:20PM +0200, Christian Brauner wrote:
> On Mon, May 20, 2019 at 05:56:29PM +0200, Christian Brauner wrote:
> > This adds the pidfd_open() syscall. It allows a caller to retrieve pollable
> > pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a
> > process that is created via traditional fork()/clone() calls that is only
> > referenced by a PID:
> > 
> > int pidfd = pidfd_open(1234, 0);
> > ret = pidfd_send_signal(pidfd, SIGSTOP, NULL, 0);
> > 
> > With the introduction of pidfds through CLONE_PIDFD it is possible to
> > created pidfds at process creation time.
> > However, a lot of processes get created with traditional PID-based calls
> > such as fork() or clone() (without CLONE_PIDFD). For these processes a
> > caller can currently not create a pollable pidfd. This is a problem for
> > Android's low memory killer (LMK) and service managers such as systemd.
> > Both are examples of tools that want to make use of pidfds to get reliable
> > notification of process exit for non-parents (pidfd polling) and race-free
> > signal sending (pidfd_send_signal()). They intend to switch to this API for
> > process supervision/management as soon as possible. Having no way to get
> > pollable pidfds from PID-only processes is one of the biggest blockers for
> > them in adopting this api. With pidfd_open() making it possible to retrieve
> > pidfds for PID-based processes we enable them to adopt this api.
> > 
> > In line with Arnd's recent changes to consolidate syscall numbers across
> > architectures, I have added the pidfd_open() syscall to all architectures
> > at the same time.
> > 
> > Signed-off-by: Christian Brauner <christian@brauner.io>
> > Reviewed-by: Oleg Nesterov <oleg@redhat.com>
> 
> This now also carries a Reviewed-by from David.
> 
> > Acked-by: Arnd Bergmann <arnd@arndb.de>
> > Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> > Cc: Kees Cook <keescook@chromium.org>
> > Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
> > Cc: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Jann Horn <jannh@google.com>
> > Cc: David Howells <dhowells@redhat.com>
> > Cc: Andy Lutomirsky <luto@kernel.org>
> > Cc: Andrew Morton <akpm@linux-foundation.org>
> > Cc: Aleksa Sarai <cyphar@cyphar.com>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Al Viro <viro@zeniv.linux.org.uk>
> > Cc: linux-api@vger.kernel.org
> 
> I've moved pidfd_open() into my for-next branch together with Joel's
> pidfd polling changes. Everything is based on v5.2-rc1.
> 
> The chosen syscall number for now is 434. David is going to send out
> another pile of mount api related syscalls. I'll coordinate with him
> accordingly prior to the 5.3 merge window.

After talking to Arnd, I split the syscall addition and the per-arch
wiring-up of pidfd_open() into two patches. There are no functional
changes and everything is still sitting in for-next.

Thanks!
Christian

^ permalink raw reply

* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
From: Kirill Tkhai @ 2019-05-24 10:45 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	alexander.h.duyck, ira.weiny, andreyknvl, arunks, vbabka, cl,
	riel, keescook, hannes, npiggin, mathieu.desnoyers, shakeelb,
	guro, aarcange, hughd, jglisse, mgorman, daniel.m.jordan, jannh,
	kilobyte, linux-api, linux-kernel, linux-mm
In-Reply-To: <20190522152254.5cyxhjizuwuojlix@box>

On 22.05.2019 18:22, Kirill A. Shutemov wrote:
> On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote:
>> This patchset adds a new syscall, which makes possible
>> to clone a VMA from a process to current process.
>> The syscall supplements the functionality provided
>> by process_vm_writev() and process_vm_readv() syscalls,
>> and it may be useful in many situation.
> 
> Kirill, could you explain how the change affects rmap and how it is safe.
> 
> My concern is that the patchset allows to map the same page multiple times
> within one process or even map page allocated by child to the parrent.
> 
> It was not allowed before.
> 
> In the best case it makes reasoning about rmap substantially more difficult.
> 
> But I'm worry it will introduce hard-to-debug bugs, like described in
> https://lwn.net/Articles/383162/.

Andy suggested to unmap PTEs from source page table, and this make the single
page never be mapped in the same process twice. This is OK for my use case,
and here we will just do a small step "allow to inherit VMA by a child process",
which we didn't have before this. If someone still needs to continue the work
to allow the same page be mapped twice in a single process in the future, this
person will have a supported basis we do in this small step. I believe, someone
like debugger may want to have this to make a fast snapshot of a process private
memory (when the task is stopped for a small time to get its memory). But for
me remapping is enough at the moment.

What do you think about this?

[...]

Kirill

^ permalink raw reply

* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
From: Kirill Tkhai @ 2019-05-24 10:36 UTC (permalink / raw)
  To: Andy Lutomirski, Kirill A. Shutemov
  Cc: Andrew Morton, Dan Williams, Michal Hocko, Keith Busch,
	alexander.h.duyck, Weiny Ira, Andrey Konovalov, arunks,
	Vlastimil Babka, Christoph Lameter, Rik van Riel, Kees Cook,
	Johannes Weiner, Nicholas Piggin, Mathieu Desnoyers, Shakeel Butt,
	Roman Gushchin, Andrea Arcangeli, Hugh Dickins, Jerome Glisse,
	Mel Gorman
In-Reply-To: <CALCETrWzuH3=Uh91UeGwpCj28kjQ82Lj2OTuXm7_3d871PyZSA@mail.gmail.com>

On 23.05.2019 19:19, Andy Lutomirski wrote:
> On Tue, May 21, 2019 at 10:44 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>
>> On 21.05.2019 19:43, Andy Lutomirski wrote:
>>> On Tue, May 21, 2019 at 8:52 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>>
>>>> On 21.05.2019 17:43, Andy Lutomirski wrote:
> 
>>> Do you mean that the code you sent rejects this case?  If so, please
>>> document it.  In any case, I looked at the code, and it seems to be
>>> trying to handle MAP_SHARED and MAP_ANONYMOUS.  I don't see where it
>>> would reject copying a vDSO.
>>
>> I prohibit all the VMAs, which contain on of flags: VM_HUGETLB|VM_DONTEXPAND|VM_PFNMAP|VM_IO.
>> I'll check carefully, whether it's enough for vDSO.
> 
> I think you could make the new syscall a lot more comprehensible bg
> restricting it to just MAP_ANONYMOUS, by making it unmap the source,
> or possibly both.  If the new syscall unmaps the source (in order so
> that the source is gone before the newly mapped pages become
> accessible), then you avoid issues in which you need to define
> sensible semantics for what happens if both copies are accessed
> simultaneously.

In case of we unmap source, this does not introduce a new principal
behavior with the same page mapped twice in a single process like
Kirill pointed. This sounds as a good idea and this covers my
application area.

The only new principal thing is a child process will be able to inherit
a parent's VMA, which is not possible now. But it looks like we never
depend on processes relationship in the mapping code, and process
reparenting already gives many combinations, so the new change should
not affect much on this.

Kirill

^ permalink raw reply

* Re: [PATCH v1 1/2] open: add close_range()
From: Christian Brauner @ 2019-05-24 10:31 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: viro, linux-kernel, linux-fsdevel, linux-api, torvalds, fweimer,
	jannh, oleg, tglx, arnd, shuah, dhowells, tkjos, ldv, miklos,
	linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k, linux-mips,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, linux-arch, linux-kselftest, x86
In-Reply-To: <67e4458a-9cc4-d1aa-608c-73ebe9e2f7a3@yandex-team.ru>

On Thu, May 23, 2019 at 07:22:17PM +0300, Konstantin Khlebnikov wrote:
> On 22.05.2019 18:52, Christian Brauner wrote:> This adds the close_range() syscall. It allows to efficiently close a range
> > of file descriptors up to all file descriptors of a calling task.
> >
> > The syscall came up in a recent discussion around the new mount API and
> > making new file descriptor types cloexec by default. During this
> > discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
> > syscall in this manner has been requested by various people over time.
> >
> > First, it helps to close all file descriptors of an exec()ing task. This
> > can be done safely via (quoting Al's example from [1] verbatim):
> >
> >          /* that exec is sensitive */
> >          unshare(CLONE_FILES);
> >          /* we don't want anything past stderr here */
> >          close_range(3, ~0U);
> >          execve(....);
> >
> > The code snippet above is one way of working around the problem that file
> > descriptors are not cloexec by default. This is aggravated by the fact that
> > we can't just switch them over without massively regressing userspace. For
> > a whole class of programs having an in-kernel method of closing all file
> > descriptors is very helpful (e.g. demons, service managers, programming
> > language standard libraries, container managers etc.).
> > (Please note, unshare(CLONE_FILES) should only be needed if the calling
> >   task is multi-threaded and shares the file descriptor table with another
> >   thread in which case two threads could race with one thread allocating
> >   file descriptors and the other one closing them via close_range(). For the
> >   general case close_range() before the execve() is sufficient.)
> >
> > Second, it allows userspace to avoid implementing closing all file
> > descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
> > file descriptor. From looking at various large(ish) userspace code bases
> > this or similar patterns are very common in:
> > - service managers (cf. [4])
> > - libcs (cf. [6])
> > - container runtimes (cf. [5])
> > - programming language runtimes/standard libraries
> >    - Python (cf. [2])
> >    - Rust (cf. [7], [8])
> > As Dmitry pointed out there's even a long-standing glibc bug about missing
> > kernel support for this task (cf. [3]).
> > In addition, the syscall will also work for tasks that do not have procfs
> > mounted and on kernels that do not have procfs support compiled in. In such
> > situations the only way to make sure that all file descriptors are closed
> > is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
> > OPEN_MAX trickery (cf. comment [8] on Rust).
> >
> > The performance is striking. For good measure, comparing the following
> > simple close_all_fds() userspace implementation that is essentially just
> > glibc's version in [6]:
> >
> > static int close_all_fds(void)
> > {
> >          int dir_fd;
> >          DIR *dir;
> >          struct dirent *direntp;
> >
> >          dir = opendir("/proc/self/fd");
> >          if (!dir)
> >                  return -1;
> >          dir_fd = dirfd(dir);
> >          while ((direntp = readdir(dir))) {
> >                  int fd;
> >                  if (strcmp(direntp->d_name, ".") == 0)
> >                          continue;
> >                  if (strcmp(direntp->d_name, "..") == 0)
> >                          continue;
> >                  fd = atoi(direntp->d_name);
> >                  if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
> >                          continue;
> >                  close(fd);
> >          }
> >          closedir(dir);
> >          return 0;
> > }
> >
> > to close_range() yields:
> > 1. closing 4 open files:
> >     - close_all_fds(): ~280 us
> >     - close_range():    ~24 us
> >
> > 2. closing 1000 open files:
> >     - close_all_fds(): ~5000 us
> >     - close_range():   ~800 us
> >
> > close_range() is designed to allow for some flexibility. Specifically, it
> > does not simply always close all open file descriptors of a task. Instead,
> > callers can specify an upper bound.
> > This is e.g. useful for scenarios where specific file descriptors are
> > created with well-known numbers that are supposed to be excluded from
> > getting closed.
> > For extra paranoia close_range() comes with a flags argument. This can e.g.
> > be used to implement extension. Once can imagine userspace wanting to stop
> > at the first error instead of ignoring errors under certain circumstances.
> 
> > There might be other valid ideas in the future. In any case, a flag
> > argument doesn't hurt and keeps us on the safe side.
> 
> Here is another strange but real-live scenario: crash handler for dumping core.
> 
> If applications has network connections it would be better to close them all,
> otherwise clients will wait until end of dumping process or timeout.
> Also closing normal files might be a good idea for releasing locks.
> 
> But simple closing might race with other threads - closed fd will be reused
> while some code still thinks it refers to original file.
> 
> Our solution closes files without freeing fd: it opens /dev/null and
> replaces all opened descriptors using dup2.
> 
> So, special flag for close_range() could close files without clearing bitmap.
> Effect should be the same - fd wouldn't be reused.
> 
> Actually two flags for two phases: closing files and releasing fd.

Konstantin, I'm sorry, I totally missed that part of your mail
yesterday.
Without speaking to the feasibility of this it's at least a good
illustration that people really do have the possible need for a flag
argument.

Thanks!
Christian

^ permalink raw reply

* Re: [PATCH v1 1/2] open: add close_range()
From: Christian Brauner @ 2019-05-24  9:57 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Konstantin Khlebnikov, Al Viro, Linux Kernel Mailing List,
	Linux FS-devel Mailing List, Linux API, Linus Torvalds,
	Florian Weimer, Jann Horn, Oleg Nesterov, Thomas Gleixner,
	Shuah Khan, David Howells, Todd Kjos, Dmitry V. Levin,
	Miklos Szeredi, alpha, Linux ARM, linux-ia64, linux-m68k
In-Reply-To: <CAK8P3a26uvqmExJZsezhB+cp2ADM0Ai9jVUKWOFM6kg848bCKg@mail.gmail.com>

On Fri, May 24, 2019 at 09:43:53AM +0200, Arnd Bergmann wrote:
> On Thu, May 23, 2019 at 6:33 PM Christian Brauner <christian@brauner.io> wrote:
> > On Thu, May 23, 2019 at 07:22:17PM +0300, Konstantin Khlebnikov wrote:
> > > On 22.05.2019 18:52, Christian Brauner wrote:> This adds the close_range() syscall. It allows to efficiently close a range
> > > >   22 files changed, 100 insertions(+), 9 deletions(-)
> > > >
> > >
> > > It would be better to split arch/ wiring into separate patch for better readability.
> >
> > Ok. You mean only do x86 - seems to be the standard - and then move the
> > others into a separate patch? Doesn't seem worth to have a patch
> > per-arch, I'd think.
> 
> I think I would prefer the first patch to just add the call without wiring it up
> anywhere, and a second patch do add it on all architectures including x86.

I've split this into two patches and also bumped arm64
__NR_compat_syscalls that I've missed before as you mentioned!

Thanks!
Christian

^ permalink raw reply

* Re: [PATCH v1 1/2] open: add close_range()
From: Arnd Bergmann @ 2019-05-24  7:43 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Konstantin Khlebnikov, Al Viro, Linux Kernel Mailing List,
	Linux FS-devel Mailing List, Linux API, Linus Torvalds,
	Florian Weimer, Jann Horn, Oleg Nesterov, Thomas Gleixner,
	Shuah Khan, David Howells, Todd Kjos, Dmitry V. Levin,
	Miklos Szeredi, alpha, Linux ARM, linux-ia64, linux-m68k
In-Reply-To: <20190523163345.q5ynd2ytk7nxcvqf@brauner.io>

On Thu, May 23, 2019 at 6:33 PM Christian Brauner <christian@brauner.io> wrote:
> On Thu, May 23, 2019 at 07:22:17PM +0300, Konstantin Khlebnikov wrote:
> > On 22.05.2019 18:52, Christian Brauner wrote:> This adds the close_range() syscall. It allows to efficiently close a range
> > >   22 files changed, 100 insertions(+), 9 deletions(-)
> > >
> >
> > It would be better to split arch/ wiring into separate patch for better readability.
>
> Ok. You mean only do x86 - seems to be the standard - and then move the
> others into a separate patch? Doesn't seem worth to have a patch
> per-arch, I'd think.

I think I would prefer the first patch to just add the call without wiring it up
anywhere, and a second patch do add it on all architectures including x86.

     Arnd

^ permalink raw reply

* Re: [PATCH v18 1/3] proc: add /proc/<pid>/arch_status
From: Andrew Morton @ 2019-05-24  3:18 UTC (permalink / raw)
  To: Aubrey Li
  Cc: tglx, mingo, peterz, hpa, ak, tim.c.chen, dave.hansen, arjan,
	adobriyan, aubrey.li, linux-api, linux-kernel, Andy Lutomirski
In-Reply-To: <20190425143219.102258-1-aubrey.li@linux.intel.com>

On Thu, 25 Apr 2019 22:32:17 +0800 Aubrey Li <aubrey.li@linux.intel.com> wrote:

> The architecture specific information of the running processes
> could be useful to the userland. Add /proc/<pid>/arch_status
> interface support to examine process architecture specific
> information externally.

I'll give this an

Acked-by: Andrew Morton <akpm@linux-foundation.org>

from a procfs POV and shall let the x86 maintainers worry about it.

I must say I'm a bit surprised that we don't already provide some form
of per-process CPU-specific info anywhere in procfs.  Something to
piggy-back this onto.  But I can't find such a thing.

I assume we've already discussed why this is a new procfs file rather
than merely a new line in /proc/<pid>/status.  If so, please add the
reasoning to the changelog.  If not, please discuss now ;)

^ permalink raw reply

* Re: [PATCH RFC v8 01/10] namei: obey trailing magic-link DAC permissions
From: Aleksa Sarai @ 2019-05-24  3:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Shuah Khan, Shuah Khan, Christian Brauner,
	Eric Biederman, Andrew Morton, Alexei Starovoitov, Kees Cook,
	Jann Horn, Tycho Andersen, David Drysdale, Chanho Min,
	Oleg Nesterov, Aleksa Sarai, Linus Torvalds, Linux Containers
In-Reply-To: <20190523020009.mi25uziu2b3whf4l@yavin>

[-- Attachment #1: Type: text/plain, Size: 2195 bytes --]

On 2019-05-23, Aleksa Sarai <cyphar@cyphar.com> wrote:
> On 2019-05-22, Andy Lutomirski <luto@kernel.org> wrote:
> > What are actual examples of uses for this exception?  Breaking
> > selftests is not, in and of itself, a huge problem.
> 
> Not as far as I know. All of the re-opening users I know of do re-opens
> of O_PATH or are re-opening with the same (or fewer) privileges. I also
> ran this for a few days on my laptop without this exception, and didn't
> have any visible issues.

I have modified the patch to WARN_ON(may_open_magiclink() == -EACCES).

So far (in the past day on my openSUSE machines) I have only seen two
programs which have hit this case: kbd[1]'s "loadkeys" and "kbd_mode"
binaries. In addition to there not being any user-visible errors -- they
actually handle permission errors gracefully!

  static int
  open_a_console(const char *fnam)
  {
  	int fd;

  	/*
  	 * For ioctl purposes we only need some fd and permissions
  	 * do not matter. But setfont:activatemap() does a write.
  	 */
  	fd = open(fnam, O_RDWR);
  	if (fd < 0)
  		fd = open(fnam, O_WRONLY);
  	if (fd < 0)
  		fd = open(fnam, O_RDONLY);
  	if (fd < 0)
  		return -1;
  	return fd;
  }

The above gets called with "/proc/self/fd/0" as an argument (as well as
other console candidates like "/dev/console"). And setfont:activatemap()
actually does handle read-only fds:

  static void
  send_escseq(int fd, const char *seq, int n)
  {
  	if (write(fd, seq, n) != n) /* maybe fd is read-only */
  		printf("%s", seq);
  }

  void activatemap(int fd)
  {
  	send_escseq(fd, "\033(K", 3);
  }

So, thus far, not only have I not seen anything go wrong -- the only
program which actually hits this case handles the error gracefully.
Obviously we got lucky here, but the lack of any users of this
mis-feature leads me to have some hope that we can block it without
anyone noticing.

But I emphatically do not want to break userspace here (except for
attackers, obviously).

[1]: http://git.altlinux.org/people/legion/packages/kbd.git

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* [PATCH 5.1 119/122] y2038: Make CONFIG_64BIT_TIME unconditional
From: Greg Kroah-Hartman @ 2019-05-23 19:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: Greg Kroah-Hartman, stable, Arnd Bergmann, Thomas Gleixner,
	Joseph Myers, libc-alpha, linux-api, Deepa Dinamani,
	Lukasz Majewski, Stepan Golosunov
In-Reply-To: <20190523181705.091418060@linuxfoundation.org>

From: Arnd Bergmann <arnd@arndb.de>

commit f3d964673b2f1c5d5c68c77273efcf7103eed03b upstream.

As Stepan Golosunov points out, there is a small mistake in the
get_timespec64() function in the kernel. It was originally added under the
assumption that CONFIG_64BIT_TIME would get enabled on all 32-bit and
64-bit architectures, but when the conversion was done, it was only turned
on for 32-bit ones.

The effect is that the get_timespec64() function never clears the upper
half of the tv_nsec field for 32-bit tasks in compat mode. Clearing this is
required for POSIX compliant behavior of functions that pass a 'timespec'
structure with a 64-bit tv_sec and a 32-bit tv_nsec, plus uninitialized
padding.

The easiest fix for linux-5.1 is to just make the Kconfig symbol
unconditional, as it was originally intended. As a follow-up, the #ifdef
CONFIG_64BIT_TIME can be removed completely..

Note: for native 32-bit mode, no change is needed, this works as
designed and user space should never need to clear the upper 32
bits of the tv_nsec field, in or out of the kernel.

Fixes: 00bf25d693e7 ("y2038: use time32 syscall names on 32-bit")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Joseph Myers <joseph@codesourcery.com>
Cc: libc-alpha@sourceware.org
Cc: linux-api@vger.kernel.org
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Lukasz Majewski <lukma@denx.de>
Cc: Stepan Golosunov <stepan@golosunov.pp.ru>
Link: https://lore.kernel.org/lkml/20190422090710.bmxdhhankurhafxq@sghpc.golosunov.pp.ru/
Link: https://lkml.kernel.org/r/20190429131951.471701-1-arnd@arndb.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 arch/Kconfig |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -764,7 +764,7 @@ config COMPAT_OLD_SIGACTION
 	bool

 config 64BIT_TIME
-	def_bool ARCH_HAS_64BIT_TIME
+	def_bool y
 	help
 	  This should be selected by all architectures that need to support
 	  new system calls with a 64-bit time_t. This is relevant on all 32-bit

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox