Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* [PATCH v1 1/2] open: add close_range()
From: Christian Brauner @ 2019-05-22 15:52 UTC (permalink / raw)
  To: viro, linux-kernel, linux-fsdevel, linux-api, torvalds, fweimer
  Cc: jannh, oleg, tglx, arnd, shuah, dhowells, tkjos, ldv, miklos,
	linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k, linux-mips,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, linux-arch, linux-kselftest, x86, Christian Brauner

This adds the close_range() syscall. It allows to efficiently close a range
of file descriptors up to all file descriptors of a calling task.

The syscall came up in a recent discussion around the new mount API and
making new file descriptor types cloexec by default. During this
discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
syscall in this manner has been requested by various people over time.

First, it helps to close all file descriptors of an exec()ing task. This
can be done safely via (quoting Al's example from [1] verbatim):

        /* that exec is sensitive */
        unshare(CLONE_FILES);
        /* we don't want anything past stderr here */
        close_range(3, ~0U);
        execve(....);

The code snippet above is one way of working around the problem that file
descriptors are not cloexec by default. This is aggravated by the fact that
we can't just switch them over without massively regressing userspace. For
a whole class of programs having an in-kernel method of closing all file
descriptors is very helpful (e.g. demons, service managers, programming
language standard libraries, container managers etc.).
(Please note, unshare(CLONE_FILES) should only be needed if the calling
 task is multi-threaded and shares the file descriptor table with another
 thread in which case two threads could race with one thread allocating
 file descriptors and the other one closing them via close_range(). For the
 general case close_range() before the execve() is sufficient.)

Second, it allows userspace to avoid implementing closing all file
descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
file descriptor. From looking at various large(ish) userspace code bases
this or similar patterns are very common in:
- service managers (cf. [4])
- libcs (cf. [6])
- container runtimes (cf. [5])
- programming language runtimes/standard libraries
  - Python (cf. [2])
  - Rust (cf. [7], [8])
As Dmitry pointed out there's even a long-standing glibc bug about missing
kernel support for this task (cf. [3]).
In addition, the syscall will also work for tasks that do not have procfs
mounted and on kernels that do not have procfs support compiled in. In such
situations the only way to make sure that all file descriptors are closed
is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
OPEN_MAX trickery (cf. comment [8] on Rust).

The performance is striking. For good measure, comparing the following
simple close_all_fds() userspace implementation that is essentially just
glibc's version in [6]:

static int close_all_fds(void)
{
        int dir_fd;
        DIR *dir;
        struct dirent *direntp;

        dir = opendir("/proc/self/fd");
        if (!dir)
                return -1;
        dir_fd = dirfd(dir);
        while ((direntp = readdir(dir))) {
                int fd;
                if (strcmp(direntp->d_name, ".") == 0)
                        continue;
                if (strcmp(direntp->d_name, "..") == 0)
                        continue;
                fd = atoi(direntp->d_name);
                if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
                        continue;
                close(fd);
        }
        closedir(dir);
        return 0;
}

to close_range() yields:
1. closing 4 open files:
   - close_all_fds(): ~280 us
   - close_range():    ~24 us

2. closing 1000 open files:
   - close_all_fds(): ~5000 us
   - close_range():   ~800 us

close_range() is designed to allow for some flexibility. Specifically, it
does not simply always close all open file descriptors of a task. Instead,
callers can specify an upper bound.
This is e.g. useful for scenarios where specific file descriptors are
created with well-known numbers that are supposed to be excluded from
getting closed.
For extra paranoia close_range() comes with a flags argument. This can e.g.
be used to implement extension. Once can imagine userspace wanting to stop
at the first error instead of ignoring errors under certain circumstances.
There might be other valid ideas in the future. In any case, a flag
argument doesn't hurt and keeps us on the safe side.

>From an implementation side this is kept rather dumb. It saw some input
from David and Jann but all nonsense is obviously my own!
- Errors to close file descriptors are currently ignored. (Could be changed
  by setting a flag in the future if needed.)
- __close_range() is a rather simplistic wrapper around __close_fd().
  My reasoning behind this is based on the nature of how __close_fd() needs
  to release an fd. But maybe I misunderstood specifics:
  We take the files_lock and rcu-dereference the fdtable of the calling
  task, we find the entry in the fdtable, get the file and need to release
  files_lock before calling filp_close().
  In the meantime the fdtable might have been altered so we can't just
  retake the spinlock and keep the old rcu-reference of the fdtable
  around. Instead we need to grab a fresh reference to the fdtable.
  If my reasoning is correct then there's really no point in fancyfying
  __close_range(): We just need to rcu-dereference the fdtable of the
  calling task once to cap the max_fd value correctly and then go on
  calling __close_fd() in a loop.

/* References */
[1]: https://lore.kernel.org/lkml/20190516165021.GD17978@ZenIV.linux.org.uk/
[2]: https://github.com/python/cpython/blob/9e4f2f3a6b8ee995c365e86d976937c141d867f8/Modules/_posixsubprocess.c#L220
[3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7
[4]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/fd-util.c#L217
[5]: https://github.com/lxc/lxc/blob/ddf4b77e11a4d08f09b7b9cd13e593f8c047edc5/src/lxc/start.c#L236
[6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17
     Note that this is an internal implementation that is not exported.
     Currently, libc seems to not provide an exported version of this
     because of missing kernel support to do this.
[7]: https://github.com/rust-lang/rust/issues/12148
[8]: https://github.com/rust-lang/rust/blob/5f47c0613ed4eb46fca3633c1297364c09e5e451/src/libstd/sys/unix/process2.rs#L303-L308
     Rust's solution is slightly different but is equally unperformant.
     Rust calls getdtablesize() which is a glibc library function that
     simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then
     goes on to call close() on each fd. That's obviously overkill for most
     tasks. Rarely, tasks - especially non-demons - hit RLIMIT_NOFILE or
     OPEN_MAX.
     Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set
     to 1024. Even in this case, there's a very high chance that in the
     common case Rust is calling the close() syscall 1021 times pointlessly
     if the task just has 0, 1, and 2 open.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dmitry V. Levin <ldv@altlinux.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
---
v1:
- Linus Torvalds <torvalds@linux-foundation.org>:
  - add cond_resched() to yield cpu when closing a lot of file descriptors
- Al Viro <viro@zeniv.linux.org.uk>:
  - add cond_resched() to yield cpu when closing a lot of file descriptors
---
 arch/alpha/kernel/syscalls/syscall.tbl      |  1 +
 arch/arm/tools/syscall.tbl                  |  1 +
 arch/arm64/include/asm/unistd32.h           |  2 +
 arch/ia64/kernel/syscalls/syscall.tbl       |  1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |  1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |  1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |  1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |  1 +
 arch/s390/kernel/syscalls/syscall.tbl       |  1 +
 arch/sh/kernel/syscalls/syscall.tbl         |  1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |  1 +
 fs/file.c                                   | 63 ++++++++++++++++++---
 fs/open.c                                   | 20 +++++++
 include/linux/fdtable.h                     |  2 +
 include/linux/syscalls.h                    |  2 +
 include/uapi/asm-generic/unistd.h           |  4 +-
 22 files changed, 100 insertions(+), 9 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 9e7704e44f6d..b55d93af8096 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -473,3 +473,4 @@
 541	common	fsconfig			sys_fsconfig
 542	common	fsmount				sys_fsmount
 543	common	fspick				sys_fspick
+545	common	close_range			sys_close_range
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index aaf479a9e92d..0125c97c75dd 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -447,3 +447,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index c39e90600bb3..9a3270d29b42 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -886,6 +886,8 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
 __SYSCALL(__NR_fsmount, sys_fsmount)
 #define __NR_fspick 433
 __SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_close_range 435
+__SYSCALL(__NR_close_range, sys_close_range)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index e01df3f2f80d..1a90b464e96f 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -354,3 +354,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 7e3d0734b2f3..2dee2050f9ef 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -433,3 +433,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 26339e417695..923ef69e5a76 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -439,3 +439,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 0e2dd68ade57..967ed9de51cd 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -372,3 +372,4 @@
 431	n32	fsconfig			sys_fsconfig
 432	n32	fsmount				sys_fsmount
 433	n32	fspick				sys_fspick
+435	n32	close_range			sys_close_range
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 5eebfa0d155c..71de731102b1 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -348,3 +348,4 @@
 431	n64	fsconfig			sys_fsconfig
 432	n64	fsmount				sys_fsmount
 433	n64	fspick				sys_fspick
+435	n64	close_range			sys_close_range
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 3cc1374e02d0..5a325ab29f88 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -421,3 +421,4 @@
 431	o32	fsconfig			sys_fsconfig
 432	o32	fsmount				sys_fsmount
 433	o32	fspick				sys_fspick
+435	o32	close_range			sys_close_range
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index c9e377d59232..dcc0a0879139 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -430,3 +430,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 103655d84b4b..ba2c1f078cbd 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -515,3 +515,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index e822b2964a83..d7c9043d2902 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -436,3 +436,4 @@
 431  common	fsconfig		sys_fsconfig			sys_fsconfig
 432  common	fsmount			sys_fsmount			sys_fsmount
 433  common	fspick			sys_fspick			sys_fspick
+435  common	close_range		sys_close_range			sys_close_range
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 016a727d4357..9b5e6bf0ce32 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -436,3 +436,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index e047480b1605..8c674a1e0072 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -479,3 +479,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ad968b7bac72..7f7a89a96707 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -438,3 +438,4 @@
 431	i386	fsconfig		sys_fsconfig			__ia32_sys_fsconfig
 432	i386	fsmount			sys_fsmount			__ia32_sys_fsmount
 433	i386	fspick			sys_fspick			__ia32_sys_fspick
+435	i386	close_range		sys_close_range			__ia32_sys_close_range
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index b4e6f9e6204a..0f7d47ae921c 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
 431	common	fsconfig		__x64_sys_fsconfig
 432	common	fsmount			__x64_sys_fsmount
 433	common	fspick			__x64_sys_fspick
+435	common	close_range		__x64_sys_close_range
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 5fa0ee1c8e00..b489532265d0 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -404,3 +404,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/fs/file.c b/fs/file.c
index 3da91a112bab..54945efa046e 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -615,12 +615,9 @@ void fd_install(unsigned int fd, struct file *file)
 
 EXPORT_SYMBOL(fd_install);
 
-/*
- * The same warnings as for __alloc_fd()/__fd_install() apply here...
- */
-int __close_fd(struct files_struct *files, unsigned fd)
+static struct file *pick_file(struct files_struct *files, unsigned fd)
 {
-	struct file *file;
+	struct file *file = NULL;
 	struct fdtable *fdt;
 
 	spin_lock(&files->file_lock);
@@ -632,15 +629,65 @@ int __close_fd(struct files_struct *files, unsigned fd)
 		goto out_unlock;
 	rcu_assign_pointer(fdt->fd[fd], NULL);
 	__put_unused_fd(files, fd);
-	spin_unlock(&files->file_lock);
-	return filp_close(file, files);
 
 out_unlock:
 	spin_unlock(&files->file_lock);
-	return -EBADF;
+	return file;
+}
+
+/*
+ * The same warnings as for __alloc_fd()/__fd_install() apply here...
+ */
+int __close_fd(struct files_struct *files, unsigned fd)
+{
+	struct file *file;
+
+	file = pick_file(files, fd);
+	if (!file)
+		return -EBADF;
+
+	return filp_close(file, files);
 }
 EXPORT_SYMBOL(__close_fd); /* for ksys_close() */
 
+/**
+ * __close_range() - Close all file descriptors in a given range.
+ *
+ * @fd:     starting file descriptor to close
+ * @max_fd: last file descriptor to close
+ *
+ * This closes a range of file descriptors. All file descriptors
+ * from @fd up to and including @max_fd are closed.
+ */
+int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
+{
+	unsigned int cur_max;
+
+	if (fd > max_fd)
+		return -EINVAL;
+
+	rcu_read_lock();
+	cur_max = files_fdtable(files)->max_fds;
+	rcu_read_unlock();
+
+	/* cap to last valid index into fdtable */
+	if (max_fd >= cur_max)
+		max_fd = cur_max - 1;
+
+	while (fd <= max_fd) {
+		struct file *file;
+
+		file = pick_file(files, fd++);
+		if (!file)
+			continue;
+
+		filp_close(file, files);
+		cond_resched();
+	}
+
+	return 0;
+}
+
 /*
  * variant of __close_fd that gets a ref on the file for later fput
  */
diff --git a/fs/open.c b/fs/open.c
index 9c7d724a6f67..c7baaee7aa47 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1174,6 +1174,26 @@ SYSCALL_DEFINE1(close, unsigned int, fd)
 	return retval;
 }
 
+/**
+ * close_range() - Close all file descriptors in a given range.
+ *
+ * @fd:     starting file descriptor to close
+ * @max_fd: last file descriptor to close
+ * @flags:  reserved for future extensions
+ *
+ * This closes a range of file descriptors. All file descriptors
+ * from @fd up to and including @max_fd are closed.
+ * Currently, errors to close a given file descriptor are ignored.
+ */
+SYSCALL_DEFINE3(close_range, unsigned int, fd, unsigned int, max_fd,
+		unsigned int, flags)
+{
+	if (flags)
+		return -EINVAL;
+
+	return __close_range(current->files, fd, max_fd);
+}
+
 /*
  * This routine simulates a hangup on the tty, to arrange that users
  * are given clean terminals at login time.
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index f07c55ea0c22..fcd07181a365 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -121,6 +121,8 @@ extern void __fd_install(struct files_struct *files,
 		      unsigned int fd, struct file *file);
 extern int __close_fd(struct files_struct *files,
 		      unsigned int fd);
+extern int __close_range(struct files_struct *files, unsigned int fd,
+			 unsigned int max_fd);
 extern int __close_fd_get_file(unsigned int fd, struct file **res);
 
 extern struct kmem_cache *files_cachep;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe1be5b..c0189e223255 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -441,6 +441,8 @@ asmlinkage long sys_fchown(unsigned int fd, uid_t user, gid_t group);
 asmlinkage long sys_openat(int dfd, const char __user *filename, int flags,
 			   umode_t mode);
 asmlinkage long sys_close(unsigned int fd);
+asmlinkage long sys_close_range(unsigned int fd, unsigned int max_fd,
+				unsigned int flags);
 asmlinkage long sys_vhangup(void);
 
 /* fs/pipe.c */
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a87904daf103..3f36c8745d24 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -844,9 +844,11 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
 __SYSCALL(__NR_fsmount, sys_fsmount)
 #define __NR_fspick 433
 __SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_close_range 435
+__SYSCALL(__NR_close_range, sys_close_range)
 
 #undef __NR_syscalls
-#define __NR_syscalls 434
+#define __NR_syscalls 436
 
 /*
  * 32 bit systems traditionally used different
-- 
2.21.0

^ permalink raw reply related

* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
From: Kirill A. Shutemov @ 2019-05-22 15:22 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	alexander.h.duyck, ira.weiny, andreyknvl, arunks, vbabka, cl,
	riel, keescook, hannes, npiggin, mathieu.desnoyers, shakeelb,
	guro, aarcange, hughd, jglisse, mgorman, daniel.m.jordan, jannh,
	kilobyte, linux-api, linux-kernel, linux-mm
In-Reply-To: <155836064844.2441.10911127801797083064.stgit@localhost.localdomain>

On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote:
> This patchset adds a new syscall, which makes possible
> to clone a VMA from a process to current process.
> The syscall supplements the functionality provided
> by process_vm_writev() and process_vm_readv() syscalls,
> and it may be useful in many situation.

Kirill, could you explain how the change affects rmap and how it is safe.

My concern is that the patchset allows to map the same page multiple times
within one process or even map page allocated by child to the parrent.

It was not allowed before.

In the best case it makes reasoning about rmap substantially more difficult.

But I'm worry it will introduce hard-to-debug bugs, like described in
https://lwn.net/Articles/383162/.

Note, that is some cases we care about rmap walk order (see for instance
mremap() case). I'm not convinced that the feature will not break
something in the area.

-- 
 Kirill A. Shutemov

^ permalink raw reply

* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
From: Kirill Tkhai @ 2019-05-22 10:03 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andy Lutomirski, Andrew Morton, Dan Williams, Michal Hocko,
	Keith Busch, Kirill A. Shutemov, Alexander Duyck, Weiny Ira,
	Andrey Konovalov, arunks, Vlastimil Babka, Christoph Lameter,
	Rik van Riel, Kees Cook, Johannes Weiner, Nicholas Piggin,
	Mathieu Desnoyers, Shakeel Butt, Roman Gushchin, Andrea Arcangeli
In-Reply-To: <CAG48ez31Kxukg7y4PU-+3RjsYZxEHfjvs2q0EFqxDM2KDcLUoA@mail.gmail.com>

On 21.05.2019 20:28, Jann Horn wrote:
> On Tue, May 21, 2019 at 7:04 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>> On 21.05.2019 19:20, Jann Horn wrote:
>>> On Tue, May 21, 2019 at 5:52 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>> On 21.05.2019 17:43, Andy Lutomirski wrote:
>>>>> On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>>>> New syscall, which allows to clone a remote process VMA
>>>>>> into local process VM. The remote process's page table
>>>>>> entries related to the VMA are cloned into local process's
>>>>>> page table (in any desired address, which makes this different
>>>>>> from that happens during fork()). Huge pages are handled
>>>>>> appropriately.
>>> [...]
>>>>>> There are several problems with process_vm_writev() in this example:
>>>>>>
>>>>>> 1)it causes pagefault on remote process memory, and it forces
>>>>>>   allocation of a new page (if was not preallocated);
>>>>>
>>>>> I don't see how your new syscall helps.  You're writing to remote
>>>>> memory.  If that memory wasn't allocated, it's going to get allocated
>>>>> regardless of whether you use a write-like interface or an mmap-like
>>>>> interface.
>>>>
>>>> No, the talk is not about just another interface for copying memory.
>>>> The talk is about borrowing of remote task's VMA and corresponding
>>>> page table's content. Syscall allows to copy part of page table
>>>> with preallocated pages from remote to local process. See here:
>>>>
>>>> [task1]                                                        [task2]
>>>>
>>>> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>>>            MAP_PRIVATE|MAP_ANONYMOUS, ...);
>>>>
>>>> <task1 populates buf>
>>>>
>>>>                                                                buf = process_vm_mmap(pid_of_task1, addr, n * PAGE_SIZE, ...);
>>>> munmap(buf);
>>>>
>>>>
>>>> process_vm_mmap() copies PTEs related to memory of buf in task1 to task2
>>>> just like in the way we do during fork syscall.
>>>>
>>>> There is no copying of buf memory content, unless COW happens. This is
>>>> the principal difference to process_vm_writev(), which just allocates
>>>> pages in remote VM.
>>>>
>>>>> Keep in mind that, on x86, just the hardware part of a
>>>>> page fault is very slow -- populating the memory with a syscall
>>>>> instead of a fault may well be faster.
>>>>
>>>> It is not as slow, as disk IO has. Just compare, what happens in case of anonymous
>>>> pages related to buf of task1 are swapped:
>>>>
>>>> 1)process_vm_writev() reads them back into memory;
>>>>
>>>> 2)process_vm_mmap() just copies swap PTEs from task1 page table
>>>>   to task2 page table.
>>>>
>>>> Also, for faster page faults one may use huge pages for the mappings.
>>>> But really, it's funny to think about page faults, when there are
>>>> disk IO problems I shown.
>>> [...]
>>>>> That only doubles the amount of memory if you let n
>>>>> scale linearly with p, which seems unlikely.
>>>>>
>>>>>>
>>>>>> 3)received data has no a chance to be properly swapped for
>>>>>>   a long time.
>>>>>
>>>>> ...
>>>>>
>>>>>> a)kernel moves @buf pages into swap right after recv();
>>>>>> b)process_vm_writev() reads the data back from swap to pages;
>>>>>
>>>>> If you're under that much memory pressure and thrashing that badly,
>>>>> your performance is going to be awful no matter what you're doing.  If
>>>>> you indeed observe this behavior under normal loads, then this seems
>>>>> like a VM issue that should be addressed in its own right.
>>>>
>>>> I don't think so. Imagine: a container migrates from one node to another.
>>>> The nodes are the same, say, every of them has 4GB of RAM.
>>>>
>>>> Before the migration, the container's tasks used 4GB of RAM and 8GB of swap.
>>>> After the page server on the second node received the pages, we want these
>>>> pages become swapped as soon as possible, and we don't want to read them from
>>>> swap to pass a read consumer.
>>>
>>> But you don't have to copy that memory into the container's tasks all
>>> at once, right? Can't you, every time you've received a few dozen
>>> kilobytes of data or whatever, shove them into the target task? That
>>> way you don't have problems with swap because the time before the data
>>> has arrived in its final VMA is tiny.
>>
>> We try to maintain online migration with as small downtime as possible,
>> and the container on source node is completely stopped at the very end.
>> Memory of container tasks is copied in background without container
>> completely stop, and _PAGE_SOFT_DIRTY is used to track dirty pages.
>>
>> Container may create any new processes during the migration, and these
>> processes may contain any memory mappings.
>>
>> Imagine the situation. We migrate a big web server with a lot of processes,
>> and some of children processes have the same COW mapping as parent has.
>> In case of all memory dump is available at the moment of the grand parent
>> web server process creation, we populate the mapping in parent, and all
>> the children may inherit the mapping in case of they want after fork.
>> COW works here. But in case of some processes are created before all memory
>> is available on destination node, we can't do such the COW inheritance.
>> This will be the reason, the memory consumed by container grows many
>> times after the migration. So, the only solution is to create process
>> tree after memory is available and all mappings are known.
> 
> But if one of the processes modifies the memory after you've started
> migrating it to the new machine, that memory can't be CoW anymore
> anyway, right? So it should work if you first do a first pass of
> copying the memory and creating the process hierarchy, and then copy
> more recent changes into the individual processes, breaking the CoW
> for those pages, right?

Not so. We have to have all processes killed on source node, before
creation the same process tree on destination machine. The process
tree should be completely analyzed before a try of its recreation
from ground. The analysis allows to choose the strategy and the sequence
of each process creation and inheritance of entities like namespaces,
mm, fdtables, etc. It's impossible to restore a process tree in case
of you already have started to create it, but haven't stopped processes
on source node. Also, we can restore only subset of process trees,
but you never know what will happen with a live process tree in further.
So, source process tree must be freezed before restore.

A restore of arbitrary process tree in laws of all linux limitations
on sequence of action to recreate an entity of a process makes this
nontrivial mathematical problem, which has no a solution at the moment,
and it's unknown whether it has a solution.

So, at the moment of restore starts, we know all about all tasks and
their relation ships. Yes, we start to copy memory from source to destination,
when container on source is alive. But we track this memory with _PAGE_SOFT_DIRTY
flag, and dirtied pages are repopulated with new content. Please, see the comment
about hellish below.

>> It's on of the examples. But believe me, there are a lot of another reasons,
>> why process tree should be created only after all process tree is freezed,
>> and no new tasks on source are possible. PGID and SSID inheritance, for
>> example. All of this requires special order of tasks creation. In case of
>> you try to restore process tree with correct namespaces and especial in
>> case of many user namespaces in a container, you will just see like a hell
>> will open before your eyes, and we never can think about this.
> 
> Could you elaborate on why that is so hellish?

Because you never know the way, the system came into the state you're seeing
at the moment. Even in a simple process chain like:

          task1
            |
          task2
            |
          task3
          /   \
      task4  task5

any of these processes may change its namespace, pgid, ssid, unshare mm,
files, become a child subreaper, die and make its children reparented,
do all of these before or after parent did one of its actions. All of
these actions are not independent between each other, and some actions
prohibit another actions. And you can restore the process tree only
in case you repeat all of the sequence in the same order it was made
by the container.

It's impossible to say "task2, set your session to 5!", because the only
way to set ssid is call setsid() syscall, which may be made only once in
a process life. But setsid itself implies limitations on further setpgid()
syscall (see the code, if interesting). And this limitation does not mean
we always should call setpgid() before it, no, these are no such the simple
rules. The same and moreover with inheritance of namespaces, when some of
task's children may inherit old task's namespaces, some may inherit current,
some will inherit further. A child will be able to assign a namespace, say,
net ns, only in case of its userns allows to do this. This implies another
limitation on process creation order, while this does not mean this gives
a stable rule of choosing an order you create process tree. No, the only
rule is "in the moment of time, one of tasks should made one of the above
actions".

This is a mathematical problem of ordering finite number of actors with
finite number of rules and relationship between them, and the rules do not
imply unambiguous replay from the ground by the end state.

We behave well in limited and likely cases of process tree configurations,
and this is enough for most cases. But common problem is very difficult,
and currently it's not proven it even has a solution as a finite set of
rules you may apply to restore any process tree.

Kirill

^ permalink raw reply

* Re: [PATCH 1/2] open: add close_range()
From: Christian Brauner @ 2019-05-22  8:12 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: David Howells, Al Viro, Linux List Kernel Mailing, linux-fsdevel,
	Linux API, Jann Horn, Florian Weimer, Oleg Nesterov,
	Thomas Gleixner, Arnd Bergmann, Shuah Khan, Todd Kjos,
	Dmitry V. Levin, Miklos Szeredi, alpha, Linux ARM, linux-ia64,
	linux-m68k, linux-mips, Parisc List
In-Reply-To: <CAHk-=wgtHm4t71oKbykE=awiVv2H2wCy8yH0L_FsyhHQ5OSO+Q@mail.gmail.com>

On Tue, May 21, 2019 at 10:23 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, May 21, 2019 at 9:41 AM Christian Brauner <christian@brauner.io> wrote:
> >
> > Yeah, you mentioned this before. I do like being able to specify an
> > upper bound to have the ability to place fds strategically after said
> > upper bound.
>
> I suspect that's the case.
>
> And if somebody really wants to just close everything and uses a large
> upper bound, we can - if we really want to - just compare the upper
> bound to the file table size, and do an optimized case for that. We do
> that upper bound comparison anyway to limit the size of the walk, so
> *if* it's a big deal, that case could then do the whole "shrink
> fdtable" case too.

Makes sense.

>
> But I don't believe it's worth optimizing for unless somebody really
> has a load where that is shown to be a big deal.   Just do the silly
> and simple loop, and add a cond_resched() in the loop, like
> close_files() does for the "we have a _lot_ of files open" case.

Ok. I will resend a v1 later with the cond_resched() logic you and Al
suggested added.

Thanks!
Christian

^ permalink raw reply

* Re: [RFC 0/7] introduce memory hinting API for external process
From: Brian Geffon @ 2019-05-22  4:23 UTC (permalink / raw)
  To: Shakeel Butt
  Cc: Anshuman Khandual, Tim Murray, Minchan Kim, Andrew Morton, LKML,
	linux-mm, Michal Hocko, Johannes Weiner, Joel Fernandes,
	Suren Baghdasaryan, Daniel Colascione, Sonny Rao, linux-api
In-Reply-To: <CALvZod6ioRxSi7tHB-uSTxN1-hsxD+8O3mfFAjaqdsimjUVmcw@mail.gmail.com>

To expand on the ChromeOS use case we're in a very similar situation
to Android. For example, the Chrome browser uses a separate process
for each individual tab (with some exceptions) and over time many tabs
remain open in a back-grounded or idle state. Given that we have a lot
of information about the weight of a tab, when it was last active,
etc, we can benefit tremendously from per-process reclaim. We're
working on getting real world numbers but all of our initial testing
shows very promising results.


On Tue, May 21, 2019 at 5:57 AM Shakeel Butt <shakeelb@google.com> wrote:
>
> On Mon, May 20, 2019 at 7:55 PM Anshuman Khandual
> <anshuman.khandual@arm.com> wrote:
> >
> >
> >
> > On 05/20/2019 10:29 PM, Tim Murray wrote:
> > > On Sun, May 19, 2019 at 11:37 PM Anshuman Khandual
> > > <anshuman.khandual@arm.com> wrote:
> > >>
> > >> Or Is the objective here is reduce the number of processes which get killed by
> > >> lmkd by triggering swapping for the unused memory (user hinted) sooner so that
> > >> they dont get picked by lmkd. Under utilization for zram hardware is a concern
> > >> here as well ?
> > >
> > > The objective is to avoid some instances of memory pressure by
> > > proactively swapping pages that userspace knows to be cold before
> > > those pages reach the end of the LRUs, which in turn can prevent some
> > > apps from being killed by lmk/lmkd. As soon as Android userspace knows
> > > that an application is not being used and is only resident to improve
> > > performance if the user returns to that app, we can kick off
> > > process_madvise on that process's pages (or some portion of those
> > > pages) in a power-efficient way to reduce memory pressure long before
> > > the system hits the free page watermark. This allows the system more
> > > time to put pages into zram versus waiting for the watermark to
> > > trigger kswapd, which decreases the likelihood that later memory
> > > allocations will cause enough pressure to trigger a kill of one of
> > > these apps.
> >
> > So this opens up bit of LRU management to user space hints. Also because the app
> > in itself wont know about the memory situation of the entire system, new system
> > call needs to be called from an external process.
> >
> > >
> > >> Swapping out memory into zram wont increase the latency for a hot start ? Or
> > >> is it because as it will prevent a fresh cold start which anyway will be slower
> > >> than a slow hot start. Just being curious.
> > >
> > > First, not all swapped pages will be reloaded immediately once an app
> > > is resumed. We've found that an app's working set post-process_madvise
> > > is significantly smaller than what an app allocates when it first
> > > launches (see the delta between pswpin and pswpout in Minchan's
> > > results). Presumably because of this, faulting to fetch from zram does
> >
> > pswpin      417613    1392647     975034     233.00
> > pswpout    1274224    2661731    1387507     108.00
> >
> > IIUC the swap-in ratio is way higher in comparison to that of swap out. Is that
> > always the case ? Or it tend to swap out from an active area of the working set
> > which faulted back again.
> >
> > > not seem to introduce a noticeable hot start penalty, not does it
> > > cause an increase in performance problems later in the app's
> > > lifecycle. I've measured with and without process_madvise, and the
> > > differences are within our noise bounds. Second, because we're not
> >
> > That is assuming that post process_madvise() working set for the application is
> > always smaller. There is another challenge. The external process should ideally
> > have the knowledge of active areas of the working set for an application in
> > question for it to invoke process_madvise() correctly to prevent such scenarios.
> >
> > > preemptively evicting file pages and only making them more likely to
> > > be evicted when there's already memory pressure, we avoid the case
> > > where we process_madvise an app then immediately return to the app and
> > > reload all file pages in the working set even though there was no
> > > intervening memory pressure. Our initial version of this work evicted
> >
> > That would be the worst case scenario which should be avoided. Memory pressure
> > must be a parameter before actually doing the swap out. But pages if know to be
> > inactive/cold can be marked high priority to be swapped out.
> >
> > > file pages preemptively and did cause a noticeable slowdown (~15%) for
> > > that case; this patch set avoids that slowdown. Finally, the benefit
> > > from avoiding cold starts is huge. The performance improvement from
> > > having a hot start instead of a cold start ranges from 3x for very
> > > small apps to 50x+ for larger apps like high-fidelity games.
> >
> > Is there any other real world scenario apart from this app based ecosystem where
> > user hinted LRU management might be helpful ? Just being curious. Thanks for the
> > detailed explanation. I will continue looking into this series.
>
> Chrome OS is another real world use-case for this user hinted LRU
> management approach by proactively reclaiming reclaim from tabs not
> accessed by the user for some time.

^ permalink raw reply

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
From: Minchan Kim @ 2019-05-22  1:50 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api
In-Reply-To: <20190521153310.GA3218@cmpxchg.org>

On Tue, May 21, 2019 at 11:33:10AM -0400, Johannes Weiner wrote:
> On Tue, May 21, 2019 at 11:55:33AM +0900, Minchan Kim wrote:
> > On Mon, May 20, 2019 at 11:28:01AM +0200, Michal Hocko wrote:
> > > [cc linux-api]
> > > 
> > > On Mon 20-05-19 12:52:54, Minchan Kim wrote:
> > > > System could have much faster swap device like zRAM. In that case, swapping
> > > > is extremely cheaper than file-IO on the low-end storage.
> > > > In this configuration, userspace could handle different strategy for each
> > > > kinds of vma. IOW, they want to reclaim anonymous pages by MADV_COLD
> > > > while it keeps file-backed pages in inactive LRU by MADV_COOL because
> > > > file IO is more expensive in this case so want to keep them in memory
> > > > until memory pressure happens.
> > > > 
> > > > To support such strategy easier, this patch introduces
> > > > MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options in madvise(2) like
> > > > that /proc/<pid>/clear_refs already has supported same filters.
> > > > They are filters could be Ored with other existing hints using top two bits
> > > > of (int behavior).
> > > 
> > > madvise operates on top of ranges and it is quite trivial to do the
> > > filtering from the userspace so why do we need any additional filtering?
> > > 
> > > > Once either of them is set, the hint could affect only the interested vma
> > > > either anonymous or file-backed.
> > > > 
> > > > With that, user could call a process_madvise syscall simply with a entire
> > > > range(0x0 - 0xFFFFFFFFFFFFFFFF) but either of MADV_ANONYMOUS_FILTER and
> > > > MADV_FILE_FILTER so there is no need to call the syscall range by range.
> > > 
> > > OK, so here is the reason you want that. The immediate question is why
> > > cannot the monitor do the filtering from the userspace. Slightly more
> > > work, all right, but less of an API to expose and that itself is a
> > > strong argument against.
> > 
> > What I should do if we don't have such filter option is to enumerate all of
> > vma via /proc/<pid>/maps and then parse every ranges and inode from string,
> > which would be painful for 2000+ vmas.
> 
> Just out of curiosity, how do you get to 2000+ distinct memory regions
> in the address space of a mobile app? I'm assuming these aren't files,
> but rather anon objects with poor grouping. Is that from guard pages
> between individual heap allocations or something?

Android uses preload library model to speed up app launch so it loads
all of library in advance on zygote and forks new app based on it.

^ permalink raw reply

* Re: [PATCH 1/2] open: add close_range()
From: Linus Torvalds @ 2019-05-21 20:23 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-ia64, Linux-sh list, Oleg Nesterov, David Howells,
	open list:KERNEL SELFTEST FRAMEWORK, sparclinux, Shuah Khan,
	linux-arch, linux-s390, Miklos Szeredi, the arch/x86 maintainers,
	linux-mips, linux-xtensa, tkjos, Arnd Bergmann, Jann Horn,
	linux-m68k, Al Viro, Thomas Gleixner, Dmitry V. Levin, Linux ARM,
	Florian Weimer, linux-parisc, Linux API
In-Reply-To: <20190521164141.rbehqnghiej3gfua@brauner.io>

On Tue, May 21, 2019 at 9:41 AM Christian Brauner <christian@brauner.io> wrote:
>
> Yeah, you mentioned this before. I do like being able to specify an
> upper bound to have the ability to place fds strategically after said
> upper bound.

I suspect that's the case.

And if somebody really wants to just close everything and uses a large
upper bound, we can - if we really want to - just compare the upper
bound to the file table size, and do an optimized case for that. We do
that upper bound comparison anyway to limit the size of the walk, so
*if* it's a big deal, that case could then do the whole "shrink
fdtable" case too.

But I don't believe it's worth optimizing for unless somebody really
has a load where that is shown to be a big deal.   Just do the silly
and simple loop, and add a cond_resched() in the loop, like
close_files() does for the "we have a _lot_ of files open" case.

                   Linus

^ permalink raw reply

* Re: [PATCH 1/2] open: add close_range()
From: Matthew Wilcox @ 2019-05-21 19:59 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-ia64, linux-sh, oleg, David Howells, linux-kselftest,
	sparclinux, shuah, linux-arch, linux-s390, miklos, x86, torvalds,
	Christian Brauner, linux-mips, linux-xtensa, tkjos, arnd, jannh,
	linux-m68k, tglx, ldv, linux-arm-kernel, fweimer, linux-parisc,
	linux-api, linux-kernel, linux-alpha, linux-fsdevel, linuxppc-dev
In-Reply-To: <20190521192009.GK17978@ZenIV.linux.org.uk>

On Tue, May 21, 2019 at 08:20:09PM +0100, Al Viro wrote:
> On Tue, May 21, 2019 at 05:30:27PM +0100, David Howells wrote:
> 
> > If we can live with close_from(int first) rather than close_range(), then this
> > can perhaps be done a lot more efficiently by:
> > 
> > 	new = alloc_fdtable(first);
> > 	spin_lock(&files->file_lock);
> > 	old = files_fdtable(files);
> > 	copy_fds(new, old, 0, first - 1);
> > 	rcu_assign_pointer(files->fdt, new);
> > 	spin_unlock(&files->file_lock);
> > 	clear_fds(old, 0, first - 1);
> > 	close_fdt_from(old, first);
> > 	kfree_rcu(old);
> 
> I really hate to think how that would interact with POSIX locks...

POSIX locks store current->files in fl_owner; David's resizing the
underlying files->fdt, just like growing from 64 to 256 fds.

^ permalink raw reply

* Re: [PATCH 1/2] open: add close_range()
From: Al Viro @ 2019-05-21 19:20 UTC (permalink / raw)
  To: David Howells
  Cc: Christian Brauner, linux-kernel, linux-fsdevel, linux-api, jannh,
	fweimer, oleg, tglx, torvalds, arnd, shuah, tkjos, ldv, miklos,
	linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k, linux-mips,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, linux-arch, linux-kselftest, x86
In-Reply-To: <28114.1558456227@warthog.procyon.org.uk>

On Tue, May 21, 2019 at 05:30:27PM +0100, David Howells wrote:

> If we can live with close_from(int first) rather than close_range(), then this
> can perhaps be done a lot more efficiently by:
> 
> 	new = alloc_fdtable(first);
> 	spin_lock(&files->file_lock);
> 	old = files_fdtable(files);
> 	copy_fds(new, old, 0, first - 1);
> 	rcu_assign_pointer(files->fdt, new);
> 	spin_unlock(&files->file_lock);
> 	clear_fds(old, 0, first - 1);
> 	close_fdt_from(old, first);
> 	kfree_rcu(old);

I really hate to think how that would interact with POSIX locks...

^ permalink raw reply

* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
From: Kirill Tkhai @ 2019-05-21 17:44 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, Dan Williams, Michal Hocko, Keith Busch,
	Kirill A. Shutemov, alexander.h.duyck, Weiny Ira,
	Andrey Konovalov, arunks, Vlastimil Babka, Christoph Lameter,
	Rik van Riel, Kees Cook, Johannes Weiner, Nicholas Piggin,
	Mathieu Desnoyers, Shakeel Butt, Roman Gushchin, Andrea Arcangeli,
	Hugh Dickins, Jerome
In-Reply-To: <CALCETrUMDTGRtLFocw6vnN___7rkb6r82ULehs0=yQO5PZL8MA@mail.gmail.com>

On 21.05.2019 19:43, Andy Lutomirski wrote:
> On Tue, May 21, 2019 at 8:52 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>
>> On 21.05.2019 17:43, Andy Lutomirski wrote:
>>> On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>>
>>>
>>>> [Summary]
>>>>
>>>> New syscall, which allows to clone a remote process VMA
>>>> into local process VM. The remote process's page table
>>>> entries related to the VMA are cloned into local process's
>>>> page table (in any desired address, which makes this different
>>>> from that happens during fork()). Huge pages are handled
>>>> appropriately.
>>>>
>>>> This allows to improve performance in significant way like
>>>> it's shows in the example below.
>>>>
>>>> [Description]
>>>>
>>>> This patchset adds a new syscall, which makes possible
>>>> to clone a VMA from a process to current process.
>>>> The syscall supplements the functionality provided
>>>> by process_vm_writev() and process_vm_readv() syscalls,
>>>> and it may be useful in many situation.
>>>>
>>>> For example, it allows to make a zero copy of data,
>>>> when process_vm_writev() was previously used:
>>>>
>>>>         struct iovec local_iov, remote_iov;
>>>>         void *buf;
>>>>
>>>>         buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>>>                    MAP_PRIVATE|MAP_ANONYMOUS, ...);
>>>>         recv(sock, buf, n * PAGE_SIZE, 0);
>>>>
>>>>         local_iov->iov_base = buf;
>>>>         local_iov->iov_len = n * PAGE_SIZE;
>>>>         remove_iov = ...;
>>>>
>>>>         process_vm_writev(pid, &local_iov, 1, &remote_iov, 1 0);
>>>>         munmap(buf, n * PAGE_SIZE);
>>>>
>>>>         (Note, that above completely ignores error handling)
>>>>
>>>> There are several problems with process_vm_writev() in this example:
>>>>
>>>> 1)it causes pagefault on remote process memory, and it forces
>>>>   allocation of a new page (if was not preallocated);
>>>
>>> I don't see how your new syscall helps.  You're writing to remote
>>> memory.  If that memory wasn't allocated, it's going to get allocated
>>> regardless of whether you use a write-like interface or an mmap-like
>>> interface.
>>
>> No, the talk is not about just another interface for copying memory.
>> The talk is about borrowing of remote task's VMA and corresponding
>> page table's content. Syscall allows to copy part of page table
>> with preallocated pages from remote to local process. See here:
>>
>> [task1]                                                        [task2]
>>
>> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>            MAP_PRIVATE|MAP_ANONYMOUS, ...);
>>
>> <task1 populates buf>
>>
>>                                                                buf = process_vm_mmap(pid_of_task1, addr, n * PAGE_SIZE, ...);
>> munmap(buf);
>>
>>
>> process_vm_mmap() copies PTEs related to memory of buf in task1 to task2
>> just like in the way we do during fork syscall.
> 
> If I understand this correctly, your intended use is to have one task
> allocate memory and fill it, have the other task clone the VMA, and
> have the first task free the VMA?  If so, that wasn't at all obvious
> from your original email.

Yes, exactly this. Sorry for confusing in initial description, it's not intentionally.

> Why don't you use splice() instead?

I just don't see a possibility of anonymous memory may be moved from
one process to another via splice(). Maybe you may explain your idea
more detailed?

> splice() the data to the remote
> task and have the remove task read() it?  All these VMA games will
> result in a lot of flushes, which is bad for performance.  Or,
> depending on your exact constraints, you could map a memfd in both
> tasks instead, which has the same flushing issues but at least has a
> sensible API.

memfd() is file-backed mapping, and it is not suitable for that.
In case of a process had anonymous mapping before the migration,
it wants the mapping remains the same after the migration. So,
if we use memfd(), we have to copy the memory from memfd mapping
to its real anonymous mapping target, which has the same problems
as process_vm_writev().

>>
>> There is no copying of buf memory content, unless COW happens. This is
>> the principal difference to process_vm_writev(), which just allocates
>> pages in remote VM.
>>
>>> Keep in mind that, on x86, just the hardware part of a
>>> page fault is very slow -- populating the memory with a syscall
>>> instead of a fault may well be faster.
>>
>> It is not as slow, as disk IO has. Just compare, what happens in case of anonymous
>> pages related to buf of task1 are swapped:
>>
>> 1)process_vm_writev() reads them back into memory;
>>
>> 2)process_vm_mmap() just copies swap PTEs from task1 page table
>>   to task2 page table.
>>
>> Also, for faster page faults one may use huge pages for the mappings.
>> But really, it's funny to think about page faults, when there are
>> disk IO problems I shown.
> 
> What are you doing that is causing *disk* IO in any of this?  I
> suspect your real problem is that you are using far too large of a
> buffer. See below.

Imagine, we are migrating a container, which consists of 9 GB of pages,
and we have 8GB RAM on destination node. Before the migration, we had
some of pages in RAM and some of pages in swap.

Source node sends pages to destination node. And there are limitations,
which do not allow to start creation of process tree on the destination
node, before all memory is received.

Pages are received by some page server task on destination. After all pages
are received, we create process tree and populate container tasks mappings.

When we're populating tasks mapping, we have to copy memory from page server
to a target task. In case of the pages were swapped from page server's
address space, we have to read synchronously them from swap. This introduces
big latency, and big IO I talked.

> 
>>
>>>>
>>>> 2)amount of memory for this example is doubled in a moment --
>>>>   n pages in current and n pages in remote tasks are occupied
>>>>   at the same time;
>>>
>>> This seems disingenuous.  If you're writing p pages total in chunks of
>>> n pages, you will use a total of p pages if you use mmap and p+n if
>>> you use write.
>>
>> I didn't understand this sentence because of many ifs, sorry. Could you
>> please explain your thought once again?
> 
> You seem to have a function that tries to populate p pages of memory
> with data received from a socket.  It looks like you're doing
> something like this:
> 
> void copy_p_pages(size_t p)
> {
>   size_t n = some_value(p);
>   char *buf = malloc(n * PAGE_SIZE);
>   for (int i = 0; i < p; i += n*PAGE_SIZE) {
>     read(fd, buf, n*PAGE_SIZE);  /* check return value, etc */
>     process_vm_writev(write n*PAGE_SIZE bytes to remote process);
>   }
>   free(buf);
> }
> 
> If you have a *constant* n (i.e. some_value(p) is just a number like
> 16)), then you aren't doubling memory usage.  If you have
> some_value(p) return p, then you are indeed doubling memory usage.  So
> don't do that!
> If buf is getting swapped out, you are very likely doing something
> wrong.  If you're using a 100MB buffer or a 10GB, then I'm not
> surprised you have problems.  Try something reasonable like 128kB. For
> extra fun, you could mlock() that buf, but if you're thrashing on
> access to a 128kB working set, you will probably also get your *code*
> swapped out, in which case you pretty much lose.

The thing is we can't use small buffer. We have to receive all the restored
tasks pages on the destination node, before we start the process tree
creation like I wrote above. All the anonymous memory is mapped into
page server's MM, so it becomes swapped before container's process
tree starts to create.
 
>>> For example, if the remote VMA is MAP_ANONYMOUS, do you get
>>> a CoW copy of it? I assume you don't since the whole point is to
>>> write to remote memory
>>
>> But, no, there *is* COW semantic. We do not copy memory. We copy
>> page table content. This is just the same we have on fork(), when
>> children duplicates parent's VMA and related page table subset,
>> and parent's PTEs lose _PAGE_RW flag.
> 
> Then you need to document this very carefully, because other people
> will use your syscall in different ways than you use it.

Ok, I'll do.

> And, if you are doing CoW like this, then your syscall is basically
> only useful for your really weird use case in which you're using it to
> import an already-populated VMA.  Maybe this is a reasonable feature
> to add to the kernel, but it needs a benchmark against a reasonable
> alternative.

Do you mean comparison with process_vm_writev/readv() or something like
this?

>>
>> There is all copy_page_range() code reused for that. Please, see [3/7]
>> for the details.
> 
> You can't as users of a syscall to read the nitty gritty mm code to
> figure out what the syscall does from a user's perspective.

Yeah, sure :)
 
>>> But there are plenty of other questions.
>>> What happens if the remote VMA is a gate area or other special mapping
>>> (vDSO, vvar area, etc)?  What if the remote memory comes from a driver
>>> that wasn't expecting the mapping to get magically copied to a
>>> different process?
>>
>> In case of someone wants to duplicate such the mappings, we may consider
>> that, and extend the interface in the future for VMA types, which are
>> safe for that.
> 
> Do you mean that the code you sent rejects this case?  If so, please
> document it.  In any case, I looked at the code, and it seems to be
> trying to handle MAP_SHARED and MAP_ANONYMOUS.  I don't see where it
> would reject copying a vDSO.

I prohibit all the VMAs, which contain on of flags: VM_HUGETLB|VM_DONTEXPAND|VM_PFNMAP|VM_IO.
I'll check carefully, whether it's enough for vDSO.

Thanks,
Kirill

^ permalink raw reply

* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
From: Jann Horn @ 2019-05-21 17:28 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Andy Lutomirski, Andrew Morton, Dan Williams, Michal Hocko,
	Keith Busch, Kirill A. Shutemov, Alexander Duyck, Weiny Ira,
	Andrey Konovalov, arunks, Vlastimil Babka, Christoph Lameter,
	Rik van Riel, Kees Cook, Johannes Weiner, Nicholas Piggin,
	Mathieu Desnoyers, Shakeel Butt, Roman Gushchin, Andrea Arcangeli
In-Reply-To: <069c90d6-924b-fa97-90d7-7d74f8785d9b@virtuozzo.com>

On Tue, May 21, 2019 at 7:04 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> On 21.05.2019 19:20, Jann Horn wrote:
> > On Tue, May 21, 2019 at 5:52 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> >> On 21.05.2019 17:43, Andy Lutomirski wrote:
> >>> On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> >>>> New syscall, which allows to clone a remote process VMA
> >>>> into local process VM. The remote process's page table
> >>>> entries related to the VMA are cloned into local process's
> >>>> page table (in any desired address, which makes this different
> >>>> from that happens during fork()). Huge pages are handled
> >>>> appropriately.
> > [...]
> >>>> There are several problems with process_vm_writev() in this example:
> >>>>
> >>>> 1)it causes pagefault on remote process memory, and it forces
> >>>>   allocation of a new page (if was not preallocated);
> >>>
> >>> I don't see how your new syscall helps.  You're writing to remote
> >>> memory.  If that memory wasn't allocated, it's going to get allocated
> >>> regardless of whether you use a write-like interface or an mmap-like
> >>> interface.
> >>
> >> No, the talk is not about just another interface for copying memory.
> >> The talk is about borrowing of remote task's VMA and corresponding
> >> page table's content. Syscall allows to copy part of page table
> >> with preallocated pages from remote to local process. See here:
> >>
> >> [task1]                                                        [task2]
> >>
> >> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
> >>            MAP_PRIVATE|MAP_ANONYMOUS, ...);
> >>
> >> <task1 populates buf>
> >>
> >>                                                                buf = process_vm_mmap(pid_of_task1, addr, n * PAGE_SIZE, ...);
> >> munmap(buf);
> >>
> >>
> >> process_vm_mmap() copies PTEs related to memory of buf in task1 to task2
> >> just like in the way we do during fork syscall.
> >>
> >> There is no copying of buf memory content, unless COW happens. This is
> >> the principal difference to process_vm_writev(), which just allocates
> >> pages in remote VM.
> >>
> >>> Keep in mind that, on x86, just the hardware part of a
> >>> page fault is very slow -- populating the memory with a syscall
> >>> instead of a fault may well be faster.
> >>
> >> It is not as slow, as disk IO has. Just compare, what happens in case of anonymous
> >> pages related to buf of task1 are swapped:
> >>
> >> 1)process_vm_writev() reads them back into memory;
> >>
> >> 2)process_vm_mmap() just copies swap PTEs from task1 page table
> >>   to task2 page table.
> >>
> >> Also, for faster page faults one may use huge pages for the mappings.
> >> But really, it's funny to think about page faults, when there are
> >> disk IO problems I shown.
> > [...]
> >>> That only doubles the amount of memory if you let n
> >>> scale linearly with p, which seems unlikely.
> >>>
> >>>>
> >>>> 3)received data has no a chance to be properly swapped for
> >>>>   a long time.
> >>>
> >>> ...
> >>>
> >>>> a)kernel moves @buf pages into swap right after recv();
> >>>> b)process_vm_writev() reads the data back from swap to pages;
> >>>
> >>> If you're under that much memory pressure and thrashing that badly,
> >>> your performance is going to be awful no matter what you're doing.  If
> >>> you indeed observe this behavior under normal loads, then this seems
> >>> like a VM issue that should be addressed in its own right.
> >>
> >> I don't think so. Imagine: a container migrates from one node to another.
> >> The nodes are the same, say, every of them has 4GB of RAM.
> >>
> >> Before the migration, the container's tasks used 4GB of RAM and 8GB of swap.
> >> After the page server on the second node received the pages, we want these
> >> pages become swapped as soon as possible, and we don't want to read them from
> >> swap to pass a read consumer.
> >
> > But you don't have to copy that memory into the container's tasks all
> > at once, right? Can't you, every time you've received a few dozen
> > kilobytes of data or whatever, shove them into the target task? That
> > way you don't have problems with swap because the time before the data
> > has arrived in its final VMA is tiny.
>
> We try to maintain online migration with as small downtime as possible,
> and the container on source node is completely stopped at the very end.
> Memory of container tasks is copied in background without container
> completely stop, and _PAGE_SOFT_DIRTY is used to track dirty pages.
>
> Container may create any new processes during the migration, and these
> processes may contain any memory mappings.
>
> Imagine the situation. We migrate a big web server with a lot of processes,
> and some of children processes have the same COW mapping as parent has.
> In case of all memory dump is available at the moment of the grand parent
> web server process creation, we populate the mapping in parent, and all
> the children may inherit the mapping in case of they want after fork.
> COW works here. But in case of some processes are created before all memory
> is available on destination node, we can't do such the COW inheritance.
> This will be the reason, the memory consumed by container grows many
> times after the migration. So, the only solution is to create process
> tree after memory is available and all mappings are known.

But if one of the processes modifies the memory after you've started
migrating it to the new machine, that memory can't be CoW anymore
anyway, right? So it should work if you first do a first pass of
copying the memory and creating the process hierarchy, and then copy
more recent changes into the individual processes, breaking the CoW
for those pages, right?

> It's on of the examples. But believe me, there are a lot of another reasons,
> why process tree should be created only after all process tree is freezed,
> and no new tasks on source are possible. PGID and SSID inheritance, for
> example. All of this requires special order of tasks creation. In case of
> you try to restore process tree with correct namespaces and especial in
> case of many user namespaces in a container, you will just see like a hell
> will open before your eyes, and we never can think about this.

Could you elaborate on why that is so hellish?


> So, no, we can't create any task before the whole process tree is knows.
> Believe me, the reason is heavy and serious.
>
> Kirill
>

^ permalink raw reply

* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
From: Kirill Tkhai @ 2019-05-21 17:03 UTC (permalink / raw)
  To: Jann Horn
  Cc: Andy Lutomirski, Andrew Morton, Dan Williams, Michal Hocko,
	Keith Busch, Kirill A. Shutemov, Alexander Duyck, Weiny Ira,
	Andrey Konovalov, arunks, Vlastimil Babka, Christoph Lameter,
	Rik van Riel, Kees Cook, Johannes Weiner, Nicholas Piggin,
	Mathieu Desnoyers, Shakeel Butt, Roman Gushchin, Andrea Arcangeli
In-Reply-To: <CAG48ez2BcVCwYGmAo4MwZ2crZ9f7=qKrORcN=fYz=K5xP2xfgQ@mail.gmail.com>

On 21.05.2019 19:20, Jann Horn wrote:
> On Tue, May 21, 2019 at 5:52 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>> On 21.05.2019 17:43, Andy Lutomirski wrote:
>>> On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>> New syscall, which allows to clone a remote process VMA
>>>> into local process VM. The remote process's page table
>>>> entries related to the VMA are cloned into local process's
>>>> page table (in any desired address, which makes this different
>>>> from that happens during fork()). Huge pages are handled
>>>> appropriately.
> [...]
>>>> There are several problems with process_vm_writev() in this example:
>>>>
>>>> 1)it causes pagefault on remote process memory, and it forces
>>>>   allocation of a new page (if was not preallocated);
>>>
>>> I don't see how your new syscall helps.  You're writing to remote
>>> memory.  If that memory wasn't allocated, it's going to get allocated
>>> regardless of whether you use a write-like interface or an mmap-like
>>> interface.
>>
>> No, the talk is not about just another interface for copying memory.
>> The talk is about borrowing of remote task's VMA and corresponding
>> page table's content. Syscall allows to copy part of page table
>> with preallocated pages from remote to local process. See here:
>>
>> [task1]                                                        [task2]
>>
>> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>            MAP_PRIVATE|MAP_ANONYMOUS, ...);
>>
>> <task1 populates buf>
>>
>>                                                                buf = process_vm_mmap(pid_of_task1, addr, n * PAGE_SIZE, ...);
>> munmap(buf);
>>
>>
>> process_vm_mmap() copies PTEs related to memory of buf in task1 to task2
>> just like in the way we do during fork syscall.
>>
>> There is no copying of buf memory content, unless COW happens. This is
>> the principal difference to process_vm_writev(), which just allocates
>> pages in remote VM.
>>
>>> Keep in mind that, on x86, just the hardware part of a
>>> page fault is very slow -- populating the memory with a syscall
>>> instead of a fault may well be faster.
>>
>> It is not as slow, as disk IO has. Just compare, what happens in case of anonymous
>> pages related to buf of task1 are swapped:
>>
>> 1)process_vm_writev() reads them back into memory;
>>
>> 2)process_vm_mmap() just copies swap PTEs from task1 page table
>>   to task2 page table.
>>
>> Also, for faster page faults one may use huge pages for the mappings.
>> But really, it's funny to think about page faults, when there are
>> disk IO problems I shown.
> [...]
>>> That only doubles the amount of memory if you let n
>>> scale linearly with p, which seems unlikely.
>>>
>>>>
>>>> 3)received data has no a chance to be properly swapped for
>>>>   a long time.
>>>
>>> ...
>>>
>>>> a)kernel moves @buf pages into swap right after recv();
>>>> b)process_vm_writev() reads the data back from swap to pages;
>>>
>>> If you're under that much memory pressure and thrashing that badly,
>>> your performance is going to be awful no matter what you're doing.  If
>>> you indeed observe this behavior under normal loads, then this seems
>>> like a VM issue that should be addressed in its own right.
>>
>> I don't think so. Imagine: a container migrates from one node to another.
>> The nodes are the same, say, every of them has 4GB of RAM.
>>
>> Before the migration, the container's tasks used 4GB of RAM and 8GB of swap.
>> After the page server on the second node received the pages, we want these
>> pages become swapped as soon as possible, and we don't want to read them from
>> swap to pass a read consumer.
> 
> But you don't have to copy that memory into the container's tasks all
> at once, right? Can't you, every time you've received a few dozen
> kilobytes of data or whatever, shove them into the target task? That
> way you don't have problems with swap because the time before the data
> has arrived in its final VMA is tiny.

We try to maintain online migration with as small downtime as possible,
and the container on source node is completely stopped at the very end.
Memory of container tasks is copied in background without container
completely stop, and _PAGE_SOFT_DIRTY is used to track dirty pages.

Container may create any new processes during the migration, and these
processes may contain any memory mappings.

Imagine the situation. We migrate a big web server with a lot of processes,
and some of children processes have the same COW mapping as parent has.
In case of all memory dump is available at the moment of the grand parent
web server process creation, we populate the mapping in parent, and all
the children may inherit the mapping in case of they want after fork.
COW works here. But in case of some processes are created before all memory
is available on destination node, we can't do such the COW inheritance.
This will be the reason, the memory consumed by container grows many
times after the migration. So, the only solution is to create process
tree after memory is available and all mappings are known.

It's on of the examples. But believe me, there are a lot of another reasons,
why process tree should be created only after all process tree is freezed,
and no new tasks on source are possible. PGID and SSID inheritance, for
example. All of this requires special order of tasks creation. In case of
you try to restore process tree with correct namespaces and especial in
case of many user namespaces in a container, you will just see like a hell
will open before your eyes, and we never can think about this.

So, no, we can't create any task before the whole process tree is knows.
Believe me, the reason is heavy and serious.

Kirill

^ permalink raw reply

* Re: [PATCH 1/2] open: add close_range()
From: Christian Brauner @ 2019-05-21 16:53 UTC (permalink / raw)
  To: Al Viro
  Cc: linux-kernel, linux-fsdevel, linux-api, jannh, fweimer, oleg,
	tglx, torvalds, arnd, shuah, dhowells, tkjos, ldv, miklos,
	linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k, linux-mips,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, linux-arch, linux-kselftest, x86
In-Reply-To: <20190521150006.GJ17978@ZenIV.linux.org.uk>

On Tue, May 21, 2019 at 04:00:06PM +0100, Al Viro wrote:
> On Tue, May 21, 2019 at 01:34:47PM +0200, Christian Brauner wrote:
> 
> > This adds the close_range() syscall. It allows to efficiently close a range
> > of file descriptors up to all file descriptors of a calling task.
> > 
> > The syscall came up in a recent discussion around the new mount API and
> > making new file descriptor types cloexec by default. During this
> > discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
> > syscall in this manner has been requested by various people over time.
> > 
> > First, it helps to close all file descriptors of an exec()ing task. This
> > can be done safely via (quoting Al's example from [1] verbatim):
> > 
> >         /* that exec is sensitive */
> >         unshare(CLONE_FILES);
> >         /* we don't want anything past stderr here */
> >         close_range(3, ~0U);
> >         execve(....);
> > 
> > The code snippet above is one way of working around the problem that file
> > descriptors are not cloexec by default. This is aggravated by the fact that
> > we can't just switch them over without massively regressing userspace. For
> > a whole class of programs having an in-kernel method of closing all file
> > descriptors is very helpful (e.g. demons, service managers, programming
> > language standard libraries, container managers etc.).
> > (Please note, unshare(CLONE_FILES) should only be needed if the calling
> >  task is multi-threaded and shares the file descriptor table with another
> >  thread in which case two threads could race with one thread allocating
> >  file descriptors and the other one closing them via close_range(). For the
> >  general case close_range() before the execve() is sufficient.)
> > 
> > Second, it allows userspace to avoid implementing closing all file
> > descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
> > file descriptor. From looking at various large(ish) userspace code bases
> > this or similar patterns are very common in:
> > - service managers (cf. [4])
> > - libcs (cf. [6])
> > - container runtimes (cf. [5])
> > - programming language runtimes/standard libraries
> >   - Python (cf. [2])
> >   - Rust (cf. [7], [8])
> > As Dmitry pointed out there's even a long-standing glibc bug about missing
> > kernel support for this task (cf. [3]).
> > In addition, the syscall will also work for tasks that do not have procfs
> > mounted and on kernels that do not have procfs support compiled in. In such
> > situations the only way to make sure that all file descriptors are closed
> > is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
> > OPEN_MAX trickery (cf. comment [8] on Rust).
> > 
> > The performance is striking. For good measure, comparing the following
> > simple close_all_fds() userspace implementation that is essentially just
> > glibc's version in [6]:
> > 
> > static int close_all_fds(void)
> > {
> >         DIR *dir;
> >         struct dirent *direntp;
> > 
> >         dir = opendir("/proc/self/fd");
> >         if (!dir)
> >                 return -1;
> > 
> >         while ((direntp = readdir(dir))) {
> >                 int fd;
> >                 if (strcmp(direntp->d_name, ".") == 0)
> >                         continue;
> >                 if (strcmp(direntp->d_name, "..") == 0)
> >                         continue;
> >                 fd = atoi(direntp->d_name);
> >                 if (fd == 0 || fd == 1 || fd == 2)
> >                         continue;
> >                 close(fd);
> >         }
> > 
> >         closedir(dir); /* cannot fail */
> >         return 0;
> > }
> > 
> > to close_range() yields:
> > 1. closing 4 open files:
> >    - close_all_fds(): ~280 us
> >    - close_range():    ~24 us
> > 
> > 2. closing 1000 open files:
> >    - close_all_fds(): ~5000 us
> >    - close_range():   ~800 us
> > 
> > close_range() is designed to allow for some flexibility. Specifically, it
> > does not simply always close all open file descriptors of a task. Instead,
> > callers can specify an upper bound.
> > This is e.g. useful for scenarios where specific file descriptors are
> > created with well-known numbers that are supposed to be excluded from
> > getting closed.
> > For extra paranoia close_range() comes with a flags argument. This can e.g.
> > be used to implement extension. Once can imagine userspace wanting to stop
> > at the first error instead of ignoring errors under certain circumstances.
> > There might be other valid ideas in the future. In any case, a flag
> > argument doesn't hurt and keeps us on the safe side.
> > 
> > >From an implementation side this is kept rather dumb. It saw some input
> > from David and Jann but all nonsense is obviously my own!
> > - Errors to close file descriptors are currently ignored. (Could be changed
> >   by setting a flag in the future if needed.)
> > - __close_range() is a rather simplistic wrapper around __close_fd().
> >   My reasoning behind this is based on the nature of how __close_fd() needs
> >   to release an fd. But maybe I misunderstood specifics:
> >   We take the files_lock and rcu-dereference the fdtable of the calling
> >   task, we find the entry in the fdtable, get the file and need to release
> >   files_lock before calling filp_close().
> >   In the meantime the fdtable might have been altered so we can't just
> >   retake the spinlock and keep the old rcu-reference of the fdtable
> >   around. Instead we need to grab a fresh reference to the fdtable.
> >   If my reasoning is correct then there's really no point in fancyfying
> >   __close_range(): We just need to rcu-dereference the fdtable of the
> >   calling task once to cap the max_fd value correctly and then go on
> >   calling __close_fd() in a loop.
> 
> > +/**
> > + * __close_range() - Close all file descriptors in a given range.
> > + *
> > + * @fd:     starting file descriptor to close
> > + * @max_fd: last file descriptor to close
> > + *
> > + * This closes a range of file descriptors. All file descriptors
> > + * from @fd up to and including @max_fd are closed.
> > + */
> > +int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
> > +{
> > +	unsigned int cur_max;
> > +
> > +	if (fd > max_fd)
> > +		return -EINVAL;
> > +
> > +	rcu_read_lock();
> > +	cur_max = files_fdtable(files)->max_fds;
> > +	rcu_read_unlock();
> > +
> > +	/* cap to last valid index into fdtable */
> > +	if (max_fd >= cur_max)
> > +		max_fd = cur_max - 1;
> > +
> > +	while (fd <= max_fd)
> > +		__close_fd(files, fd++);
> > +
> > +	return 0;
> > +}
> 
> Umm...  That's going to be very painful if you dup2() something to MAX_INT and
> then run that; roughly 2G iterations of bouncing ->file_lock up and down,
> without anything that would yield CPU in process.
> 
> If anything, I would suggest something like
> 
> 	fd = *start_fd;
> 	grab the lock
>         fdt = files_fdtable(files);
> more:
> 	look for the next eviction candidate in ->open_fds, starting at fd
> 	if there's none up to max_fd
> 		drop the lock
> 		return NULL
> 	*start_fd = fd + 1;
> 	if the fscker is really opened and not just reserved
> 		rcu_assign_pointer(fdt->fd[fd], NULL);
> 		__put_unused_fd(files, fd);
> 		drop the lock
> 		return the file we'd got
> 	if (unlikely(need_resched()))
> 		drop lock
> 		cond_resched();
> 		grab lock
> 		fdt = files_fdtable(files);
> 	goto more;
> 
> with the main loop being basically
> 	while ((file = pick_next(files, &start_fd, max_fd)) != NULL)
> 		filp_close(file, files);

That's obviously much more clever than what I had.
I honestly have never thought about using open_fds before this. Seemed
extremely localized to file.c
Thanks for the pointers!

Christian

^ permalink raw reply

* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
From: Andy Lutomirski @ 2019-05-21 16:43 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Andy Lutomirski, Andrew Morton, Dan Williams, Michal Hocko,
	Keith Busch, Kirill A. Shutemov, alexander.h.duyck, Weiny Ira,
	Andrey Konovalov, arunks, Vlastimil Babka, Christoph Lameter,
	Rik van Riel, Kees Cook, Johannes Weiner, Nicholas Piggin,
	Mathieu Desnoyers, Shakeel Butt, Roman Gushchin, Andrea Arcangeli,
	Hugh
In-Reply-To: <9638a51c-4295-924f-1852-1783c7f3e82d@virtuozzo.com>

On Tue, May 21, 2019 at 8:52 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>
> On 21.05.2019 17:43, Andy Lutomirski wrote:
> > On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> >>
> >
> >> [Summary]
> >>
> >> New syscall, which allows to clone a remote process VMA
> >> into local process VM. The remote process's page table
> >> entries related to the VMA are cloned into local process's
> >> page table (in any desired address, which makes this different
> >> from that happens during fork()). Huge pages are handled
> >> appropriately.
> >>
> >> This allows to improve performance in significant way like
> >> it's shows in the example below.
> >>
> >> [Description]
> >>
> >> This patchset adds a new syscall, which makes possible
> >> to clone a VMA from a process to current process.
> >> The syscall supplements the functionality provided
> >> by process_vm_writev() and process_vm_readv() syscalls,
> >> and it may be useful in many situation.
> >>
> >> For example, it allows to make a zero copy of data,
> >> when process_vm_writev() was previously used:
> >>
> >>         struct iovec local_iov, remote_iov;
> >>         void *buf;
> >>
> >>         buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
> >>                    MAP_PRIVATE|MAP_ANONYMOUS, ...);
> >>         recv(sock, buf, n * PAGE_SIZE, 0);
> >>
> >>         local_iov->iov_base = buf;
> >>         local_iov->iov_len = n * PAGE_SIZE;
> >>         remove_iov = ...;
> >>
> >>         process_vm_writev(pid, &local_iov, 1, &remote_iov, 1 0);
> >>         munmap(buf, n * PAGE_SIZE);
> >>
> >>         (Note, that above completely ignores error handling)
> >>
> >> There are several problems with process_vm_writev() in this example:
> >>
> >> 1)it causes pagefault on remote process memory, and it forces
> >>   allocation of a new page (if was not preallocated);
> >
> > I don't see how your new syscall helps.  You're writing to remote
> > memory.  If that memory wasn't allocated, it's going to get allocated
> > regardless of whether you use a write-like interface or an mmap-like
> > interface.
>
> No, the talk is not about just another interface for copying memory.
> The talk is about borrowing of remote task's VMA and corresponding
> page table's content. Syscall allows to copy part of page table
> with preallocated pages from remote to local process. See here:
>
> [task1]                                                        [task2]
>
> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>            MAP_PRIVATE|MAP_ANONYMOUS, ...);
>
> <task1 populates buf>
>
>                                                                buf = process_vm_mmap(pid_of_task1, addr, n * PAGE_SIZE, ...);
> munmap(buf);
>
>
> process_vm_mmap() copies PTEs related to memory of buf in task1 to task2
> just like in the way we do during fork syscall.

If I understand this correctly, your intended use is to have one task
allocate memory and fill it, have the other task clone the VMA, and
have the first task free the VMA?  If so, that wasn't at all obvious
from your original email.

Why don't you use splice() instead?  splice() the data to the remote
task and have the remove task read() it?  All these VMA games will
result in a lot of flushes, which is bad for performance.  Or,
depending on your exact constraints, you could map a memfd in both
tasks instead, which has the same flushing issues but at least has a
sensible API.

>
> There is no copying of buf memory content, unless COW happens. This is
> the principal difference to process_vm_writev(), which just allocates
> pages in remote VM.
>
> > Keep in mind that, on x86, just the hardware part of a
> > page fault is very slow -- populating the memory with a syscall
> > instead of a fault may well be faster.
>
> It is not as slow, as disk IO has. Just compare, what happens in case of anonymous
> pages related to buf of task1 are swapped:
>
> 1)process_vm_writev() reads them back into memory;
>
> 2)process_vm_mmap() just copies swap PTEs from task1 page table
>   to task2 page table.
>
> Also, for faster page faults one may use huge pages for the mappings.
> But really, it's funny to think about page faults, when there are
> disk IO problems I shown.

What are you doing that is causing *disk* IO in any of this?  I
suspect your real problem is that you are using far too large of a
buffer. See below.

>
> >>
> >> 2)amount of memory for this example is doubled in a moment --
> >>   n pages in current and n pages in remote tasks are occupied
> >>   at the same time;
> >
> > This seems disingenuous.  If you're writing p pages total in chunks of
> > n pages, you will use a total of p pages if you use mmap and p+n if
> > you use write.
>
> I didn't understand this sentence because of many ifs, sorry. Could you
> please explain your thought once again?

You seem to have a function that tries to populate p pages of memory
with data received from a socket.  It looks like you're doing
something like this:

void copy_p_pages(size_t p)
{
  size_t n = some_value(p);
  char *buf = malloc(n * PAGE_SIZE);
  for (int i = 0; i < p; i += n*PAGE_SIZE) {
    read(fd, buf, n*PAGE_SIZE);  /* check return value, etc */
    process_vm_writev(write n*PAGE_SIZE bytes to remote process);
  }
  free(buf);
}

If you have a *constant* n (i.e. some_value(p) is just a number like
16)), then you aren't doubling memory usage.  If you have
some_value(p) return p, then you are indeed doubling memory usage.  So
don't do that!
If buf is getting swapped out, you are very likely doing something
wrong.  If you're using a 100MB buffer or a 10GB, then I'm not
surprised you have problems.  Try something reasonable like 128kB. For
extra fun, you could mlock() that buf, but if you're thrashing on
access to a 128kB working set, you will probably also get your *code*
swapped out, in which case you pretty much lose.


> > For example, if the remote VMA is MAP_ANONYMOUS, do you get
> > a CoW copy of it? I assume you don't since the whole point is to
> > write to remote memory
>
> But, no, there *is* COW semantic. We do not copy memory. We copy
> page table content. This is just the same we have on fork(), when
> children duplicates parent's VMA and related page table subset,
> and parent's PTEs lose _PAGE_RW flag.

Then you need to document this very carefully, because other people
will use your syscall in different ways than you use it.

And, if you are doing CoW like this, then your syscall is basically
only useful for your really weird use case in which you're using it to
import an already-populated VMA.  Maybe this is a reasonable feature
to add to the kernel, but it needs a benchmark against a reasonable
alternative.

>
> There is all copy_page_range() code reused for that. Please, see [3/7]
> for the details.

You can't as users of a syscall to read the nitty gritty mm code to
figure out what the syscall does from a user's perspective.

> > But there are plenty of other questions.
> > What happens if the remote VMA is a gate area or other special mapping
> > (vDSO, vvar area, etc)?  What if the remote memory comes from a driver
> > that wasn't expecting the mapping to get magically copied to a
> > different process?
>
> In case of someone wants to duplicate such the mappings, we may consider
> that, and extend the interface in the future for VMA types, which are
> safe for that.

Do you mean that the code you sent rejects this case?  If so, please
document it.  In any case, I looked at the code, and it seems to be
trying to handle MAP_SHARED and MAP_ANONYMOUS.  I don't see where it
would reject copying a vDSO.

^ permalink raw reply

* Re: [PATCH 1/2] open: add close_range()
From: Christian Brauner @ 2019-05-21 16:41 UTC (permalink / raw)
  To: David Howells
  Cc: Al Viro, linux-kernel, linux-fsdevel, linux-api, jannh, fweimer,
	oleg, tglx, torvalds, arnd, shuah, tkjos, ldv, miklos,
	linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k, linux-mips,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, linux-arch, linux-kselftest, x86
In-Reply-To: <28114.1558456227@warthog.procyon.org.uk>

On Tue, May 21, 2019 at 05:30:27PM +0100, David Howells wrote:
> Al Viro <viro@zeniv.linux.org.uk> wrote:
> 
> > Umm...  That's going to be very painful if you dup2() something to MAX_INT and
> > then run that; roughly 2G iterations of bouncing ->file_lock up and down,
> > without anything that would yield CPU in process.
> > 
> > If anything, I would suggest something like
> > 
> > 	fd = *start_fd;
> > 	grab the lock
> >         fdt = files_fdtable(files);
> > more:
> > 	look for the next eviction candidate in ->open_fds, starting at fd
> > 	if there's none up to max_fd
> > 		drop the lock
> > 		return NULL
> > 	*start_fd = fd + 1;
> > 	if the fscker is really opened and not just reserved
> > 		rcu_assign_pointer(fdt->fd[fd], NULL);
> > 		__put_unused_fd(files, fd);
> > 		drop the lock
> > 		return the file we'd got
> > 	if (unlikely(need_resched()))
> > 		drop lock
> > 		cond_resched();
> > 		grab lock
> > 		fdt = files_fdtable(files);
> > 	goto more;
> > 
> > with the main loop being basically
> > 	while ((file = pick_next(files, &start_fd, max_fd)) != NULL)
> > 		filp_close(file, files);
> 
> If we can live with close_from(int first) rather than close_range(), then this
> can perhaps be done a lot more efficiently by:

Yeah, you mentioned this before. I do like being able to specify an
upper bound to have the ability to place fds strategically after said
upper bound.
I have used this quite a few times where I know that given task may have
inherited up to m fds and I want to inherit a specific pipe who's fd I
know. Then I'd dup2(pipe_fd, <upper_bound + 1>) and then close all
other fds. Is that too much of a corner case?

Christian

^ permalink raw reply

* Re: [PATCH 1/2] open: add close_range()
From: David Howells @ 2019-05-21 16:30 UTC (permalink / raw)
  To: Al Viro
  Cc: dhowells, Christian Brauner, linux-kernel, linux-fsdevel,
	linux-api, jannh, fweimer, oleg, tglx, torvalds, arnd, shuah,
	tkjos, ldv, miklos, linux-alpha, linux-arm-kernel, linux-ia64,
	linux-m68k, linux-mips, linux-parisc, linuxppc-dev, linux-s390,
	linux-sh, sparclinux, linux-xtensa, linux-arch, linux-kselftest,
	x86
In-Reply-To: <20190521150006.GJ17978@ZenIV.linux.org.uk>

Al Viro <viro@zeniv.linux.org.uk> wrote:

> Umm...  That's going to be very painful if you dup2() something to MAX_INT and
> then run that; roughly 2G iterations of bouncing ->file_lock up and down,
> without anything that would yield CPU in process.
> 
> If anything, I would suggest something like
> 
> 	fd = *start_fd;
> 	grab the lock
>         fdt = files_fdtable(files);
> more:
> 	look for the next eviction candidate in ->open_fds, starting at fd
> 	if there's none up to max_fd
> 		drop the lock
> 		return NULL
> 	*start_fd = fd + 1;
> 	if the fscker is really opened and not just reserved
> 		rcu_assign_pointer(fdt->fd[fd], NULL);
> 		__put_unused_fd(files, fd);
> 		drop the lock
> 		return the file we'd got
> 	if (unlikely(need_resched()))
> 		drop lock
> 		cond_resched();
> 		grab lock
> 		fdt = files_fdtable(files);
> 	goto more;
> 
> with the main loop being basically
> 	while ((file = pick_next(files, &start_fd, max_fd)) != NULL)
> 		filp_close(file, files);

If we can live with close_from(int first) rather than close_range(), then this
can perhaps be done a lot more efficiently by:

	new = alloc_fdtable(first);
	spin_lock(&files->file_lock);
	old = files_fdtable(files);
	copy_fds(new, old, 0, first - 1);
	rcu_assign_pointer(files->fdt, new);
	spin_unlock(&files->file_lock);
	clear_fds(old, 0, first - 1);
	close_fdt_from(old, first);
	kfree_rcu(old);

David

^ permalink raw reply

* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
From: Jann Horn @ 2019-05-21 16:20 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Andy Lutomirski, Andrew Morton, Dan Williams, Michal Hocko,
	Keith Busch, Kirill A. Shutemov, Alexander Duyck, Weiny Ira,
	Andrey Konovalov, arunks, Vlastimil Babka, Christoph Lameter,
	Rik van Riel, Kees Cook, Johannes Weiner, Nicholas Piggin,
	Mathieu Desnoyers, Shakeel Butt, Roman Gushchin, Andrea Arcangeli
In-Reply-To: <9638a51c-4295-924f-1852-1783c7f3e82d@virtuozzo.com>

On Tue, May 21, 2019 at 5:52 PM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> On 21.05.2019 17:43, Andy Lutomirski wrote:
> > On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
> >> New syscall, which allows to clone a remote process VMA
> >> into local process VM. The remote process's page table
> >> entries related to the VMA are cloned into local process's
> >> page table (in any desired address, which makes this different
> >> from that happens during fork()). Huge pages are handled
> >> appropriately.
[...]
> >> There are several problems with process_vm_writev() in this example:
> >>
> >> 1)it causes pagefault on remote process memory, and it forces
> >>   allocation of a new page (if was not preallocated);
> >
> > I don't see how your new syscall helps.  You're writing to remote
> > memory.  If that memory wasn't allocated, it's going to get allocated
> > regardless of whether you use a write-like interface or an mmap-like
> > interface.
>
> No, the talk is not about just another interface for copying memory.
> The talk is about borrowing of remote task's VMA and corresponding
> page table's content. Syscall allows to copy part of page table
> with preallocated pages from remote to local process. See here:
>
> [task1]                                                        [task2]
>
> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>            MAP_PRIVATE|MAP_ANONYMOUS, ...);
>
> <task1 populates buf>
>
>                                                                buf = process_vm_mmap(pid_of_task1, addr, n * PAGE_SIZE, ...);
> munmap(buf);
>
>
> process_vm_mmap() copies PTEs related to memory of buf in task1 to task2
> just like in the way we do during fork syscall.
>
> There is no copying of buf memory content, unless COW happens. This is
> the principal difference to process_vm_writev(), which just allocates
> pages in remote VM.
>
> > Keep in mind that, on x86, just the hardware part of a
> > page fault is very slow -- populating the memory with a syscall
> > instead of a fault may well be faster.
>
> It is not as slow, as disk IO has. Just compare, what happens in case of anonymous
> pages related to buf of task1 are swapped:
>
> 1)process_vm_writev() reads them back into memory;
>
> 2)process_vm_mmap() just copies swap PTEs from task1 page table
>   to task2 page table.
>
> Also, for faster page faults one may use huge pages for the mappings.
> But really, it's funny to think about page faults, when there are
> disk IO problems I shown.
[...]
> > That only doubles the amount of memory if you let n
> > scale linearly with p, which seems unlikely.
> >
> >>
> >> 3)received data has no a chance to be properly swapped for
> >>   a long time.
> >
> > ...
> >
> >> a)kernel moves @buf pages into swap right after recv();
> >> b)process_vm_writev() reads the data back from swap to pages;
> >
> > If you're under that much memory pressure and thrashing that badly,
> > your performance is going to be awful no matter what you're doing.  If
> > you indeed observe this behavior under normal loads, then this seems
> > like a VM issue that should be addressed in its own right.
>
> I don't think so. Imagine: a container migrates from one node to another.
> The nodes are the same, say, every of them has 4GB of RAM.
>
> Before the migration, the container's tasks used 4GB of RAM and 8GB of swap.
> After the page server on the second node received the pages, we want these
> pages become swapped as soon as possible, and we don't want to read them from
> swap to pass a read consumer.

But you don't have to copy that memory into the container's tasks all
at once, right? Can't you, every time you've received a few dozen
kilobytes of data or whatever, shove them into the target task? That
way you don't have problems with swap because the time before the data
has arrived in its final VMA is tiny.

^ permalink raw reply

* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
From: Kirill Tkhai @ 2019-05-21 15:59 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, Dan Williams, Michal Hocko, Keith Busch,
	Kirill A. Shutemov, alexander.h.duyck, Weiny Ira,
	Andrey Konovalov, arunks, Vlastimil Babka, Christoph Lameter,
	Rik van Riel, Kees Cook, Johannes Weiner, Nicholas Piggin,
	Mathieu Desnoyers, Shakeel Butt, Roman Gushchin, Andrea Arcangeli,
	Hugh Dickins, Jerome
In-Reply-To: <9638a51c-4295-924f-1852-1783c7f3e82d@virtuozzo.com>

On 21.05.2019 18:52, Kirill Tkhai wrote:
> On 21.05.2019 17:43, Andy Lutomirski wrote:
>> On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>>
>>
>>> [Summary]
>>>
>>> New syscall, which allows to clone a remote process VMA
>>> into local process VM. The remote process's page table
>>> entries related to the VMA are cloned into local process's
>>> page table (in any desired address, which makes this different
>>> from that happens during fork()). Huge pages are handled
>>> appropriately.
>>>
>>> This allows to improve performance in significant way like
>>> it's shows in the example below.
>>>
>>> [Description]
>>>
>>> This patchset adds a new syscall, which makes possible
>>> to clone a VMA from a process to current process.
>>> The syscall supplements the functionality provided
>>> by process_vm_writev() and process_vm_readv() syscalls,
>>> and it may be useful in many situation.
>>>
>>> For example, it allows to make a zero copy of data,
>>> when process_vm_writev() was previously used:
>>>
>>>         struct iovec local_iov, remote_iov;
>>>         void *buf;
>>>
>>>         buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>>                    MAP_PRIVATE|MAP_ANONYMOUS, ...);
>>>         recv(sock, buf, n * PAGE_SIZE, 0);
>>>
>>>         local_iov->iov_base = buf;
>>>         local_iov->iov_len = n * PAGE_SIZE;
>>>         remove_iov = ...;
>>>
>>>         process_vm_writev(pid, &local_iov, 1, &remote_iov, 1 0);
>>>         munmap(buf, n * PAGE_SIZE);
>>>
>>>         (Note, that above completely ignores error handling)
>>>
>>> There are several problems with process_vm_writev() in this example:
>>>
>>> 1)it causes pagefault on remote process memory, and it forces
>>>   allocation of a new page (if was not preallocated);
>>
>> I don't see how your new syscall helps.  You're writing to remote
>> memory.  If that memory wasn't allocated, it's going to get allocated
>> regardless of whether you use a write-like interface or an mmap-like
>> interface.
> 
> No, the talk is not about just another interface for copying memory.
> The talk is about borrowing of remote task's VMA and corresponding
> page table's content. Syscall allows to copy part of page table
> with preallocated pages from remote to local process. See here:
> 
> [task1]                                                        [task2]
> 
> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>            MAP_PRIVATE|MAP_ANONYMOUS, ...);
> 
> <task1 populates buf>
> 
>                                                                buf = process_vm_mmap(pid_of_task1, addr, n * PAGE_SIZE, ...);
> munmap(buf);
> 
> 
> process_vm_mmap() copies PTEs related to memory of buf in task1 to task2
> just like in the way we do during fork syscall.
> 
> There is no copying of buf memory content, unless COW happens. This is
> the principal difference to process_vm_writev(), which just allocates
> pages in remote VM.
> 
>> Keep in mind that, on x86, just the hardware part of a
>> page fault is very slow -- populating the memory with a syscall
>> instead of a fault may well be faster.
> 
> It is not as slow, as disk IO has. Just compare, what happens in case of anonymous
> pages related to buf of task1 are swapped:
> 
> 1)process_vm_writev() reads them back into memory;
> 
> 2)process_vm_mmap() just copies swap PTEs from task1 page table
>   to task2 page table.
> 
> Also, for faster page faults one may use huge pages for the mappings.
> But really, it's funny to think about page faults, when there are
> disk IO problems I shown.
> 
>>>
>>> 2)amount of memory for this example is doubled in a moment --
>>>   n pages in current and n pages in remote tasks are occupied
>>>   at the same time;
>>
>> This seems disingenuous.  If you're writing p pages total in chunks of
>> n pages, you will use a total of p pages if you use mmap and p+n if
>> you use write.
> 
> I didn't understand this sentence because of many ifs, sorry. Could you
> please explain your thought once again?
> 
>> That only doubles the amount of memory if you let n
>> scale linearly with p, which seems unlikely.
>>
>>>
>>> 3)received data has no a chance to be properly swapped for
>>>   a long time.
>>
>> ...
>>
>>> a)kernel moves @buf pages into swap right after recv();
>>> b)process_vm_writev() reads the data back from swap to pages;
>>
>> If you're under that much memory pressure and thrashing that badly,
>> your performance is going to be awful no matter what you're doing.  If
>> you indeed observe this behavior under normal loads, then this seems
>> like a VM issue that should be addressed in its own right.
> 
> I don't think so. Imagine: a container migrates from one node to another.
> The nodes are the same, say, every of them has 4GB of RAM.
> 
> Before the migration, the container's tasks used 4GB of RAM and 8GB of swap.
> After the page server on the second node received the pages, we want these
> pages become swapped as soon as possible, and we don't want to read them from
> swap to pass a read consumer.

Should be "to pass a *real* consumer".

> 
> The page server is task1 in the example. The real consumer is task2.
> 
> This is a rather normal load, I think.
> 
>>>         buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>>                    MAP_PRIVATE|MAP_ANONYMOUS, ...);
>>>         recv(sock, buf, n * PAGE_SIZE, 0);
>>>
>>> [Task 2]
>>>         buf2 = process_vm_mmap(pid_of_task1, buf, n * PAGE_SIZE, NULL, 0);
>>>
>>> This creates a copy of VMA related to buf from task1 in task2's VM.
>>> Task1's page table entries are copied into corresponding page table
>>> entries of VM of task2.
>>
>> You need to fully explain a whole bunch of details that you're
>> ignored.
> 
> Yeah, it's not a problem :) I'm ready to explain and describe everything,
> what may cause a question. Just ask ;) 
> 
>> For example, if the remote VMA is MAP_ANONYMOUS, do you get
>> a CoW copy of it? I assume you don't since the whole point is to
>> write to remote memory
> 
> But, no, there *is* COW semantic. We do not copy memory. We copy
> page table content. This is just the same we have on fork(), when
> children duplicates parent's VMA and related page table subset,
> and parent's PTEs lose _PAGE_RW flag.
> 
> There is all copy_page_range() code reused for that. Please, see [3/7]
> for the details.
> 
> I'm going to get special performance using THP, when number of entries
> to copy is smaller than in case of PTE.
> 
> Copy several of PMD from one task page table to another's is much much much faster,
> than process_vm_write() copies pages (even not mention about its reading from swap).
> 
>> ,but it's at the very least quite unusual in
>> Linux to have two different anonymous VMAs such that writing one of
>> them changes the other one.
> Writing to a new VMA does not affect old VMA. Old VMA is just used to
> get vma->anon_vma and vma->vm_file from there. Two VMAs remain independent
> each other.
> 
>> But there are plenty of other questions.
>> What happens if the remote VMA is a gate area or other special mapping
>> (vDSO, vvar area, etc)?  What if the remote memory comes from a driver
>> that wasn't expecting the mapping to get magically copied to a
>> different process?
> 
> In case of someone wants to duplicate such the mappings, we may consider
> that, and extend the interface in the future for VMA types, which are
> safe for that.
> 
> But now the logic is very overprotective, and all the unusual mappings
> like you mentioned (also AIO, etc) is prohibited. Please, see [7/7]
> for the details.
> 
>> This new API seems quite dangerous and complex to me, and I don't
>> think the value has been adequately demonstrated.
> 
> I don't think it's dangerous and complex, because of I haven't introduced
> any principal VMA conceptions different to what we have now. We just
> borrow vma->anon_vma and vma->vm_file from remote process to local
> like we did on fork() (borrowing of vma->anon_vma means not blindly
> copying, but ordinary anon_vma_fork()).
> 
> Maybe I had to focus the description more on copying of PTE/PMD
> instead of vma duplication. So, it's unexpected for me, that people
> think about simple memory copying after reading the example I gave.
> But I gave more explanation here, so I hope the situation became
> clearer for a reader. Anyway, if you have any questions, please
> ask me.
> 
> Thanks,
> Kirill
> 

^ permalink raw reply

* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
From: Kirill Tkhai @ 2019-05-21 15:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andrew Morton, Dan Williams, Michal Hocko, Keith Busch,
	Kirill A. Shutemov, alexander.h.duyck, Weiny Ira,
	Andrey Konovalov, arunks, Vlastimil Babka, Christoph Lameter,
	Rik van Riel, Kees Cook, Johannes Weiner, Nicholas Piggin,
	Mathieu Desnoyers, Shakeel Butt, Roman Gushchin, Andrea Arcangeli,
	Hugh Dickins, Jerome
In-Reply-To: <CALCETrU221N6uPmdaj4bRDDsf+Oc5tEfPERuyV24wsYKHn+spA@mail.gmail.com>

On 21.05.2019 17:43, Andy Lutomirski wrote:
> On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>>
> 
>> [Summary]
>>
>> New syscall, which allows to clone a remote process VMA
>> into local process VM. The remote process's page table
>> entries related to the VMA are cloned into local process's
>> page table (in any desired address, which makes this different
>> from that happens during fork()). Huge pages are handled
>> appropriately.
>>
>> This allows to improve performance in significant way like
>> it's shows in the example below.
>>
>> [Description]
>>
>> This patchset adds a new syscall, which makes possible
>> to clone a VMA from a process to current process.
>> The syscall supplements the functionality provided
>> by process_vm_writev() and process_vm_readv() syscalls,
>> and it may be useful in many situation.
>>
>> For example, it allows to make a zero copy of data,
>> when process_vm_writev() was previously used:
>>
>>         struct iovec local_iov, remote_iov;
>>         void *buf;
>>
>>         buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>                    MAP_PRIVATE|MAP_ANONYMOUS, ...);
>>         recv(sock, buf, n * PAGE_SIZE, 0);
>>
>>         local_iov->iov_base = buf;
>>         local_iov->iov_len = n * PAGE_SIZE;
>>         remove_iov = ...;
>>
>>         process_vm_writev(pid, &local_iov, 1, &remote_iov, 1 0);
>>         munmap(buf, n * PAGE_SIZE);
>>
>>         (Note, that above completely ignores error handling)
>>
>> There are several problems with process_vm_writev() in this example:
>>
>> 1)it causes pagefault on remote process memory, and it forces
>>   allocation of a new page (if was not preallocated);
> 
> I don't see how your new syscall helps.  You're writing to remote
> memory.  If that memory wasn't allocated, it's going to get allocated
> regardless of whether you use a write-like interface or an mmap-like
> interface.

No, the talk is not about just another interface for copying memory.
The talk is about borrowing of remote task's VMA and corresponding
page table's content. Syscall allows to copy part of page table
with preallocated pages from remote to local process. See here:

[task1]                                                        [task2]

buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
           MAP_PRIVATE|MAP_ANONYMOUS, ...);

<task1 populates buf>

                                                               buf = process_vm_mmap(pid_of_task1, addr, n * PAGE_SIZE, ...);
munmap(buf);

process_vm_mmap() copies PTEs related to memory of buf in task1 to task2
just like in the way we do during fork syscall.

There is no copying of buf memory content, unless COW happens. This is
the principal difference to process_vm_writev(), which just allocates
pages in remote VM.

> Keep in mind that, on x86, just the hardware part of a
> page fault is very slow -- populating the memory with a syscall
> instead of a fault may well be faster.

It is not as slow, as disk IO has. Just compare, what happens in case of anonymous
pages related to buf of task1 are swapped:

1)process_vm_writev() reads them back into memory;

2)process_vm_mmap() just copies swap PTEs from task1 page table
  to task2 page table.

Also, for faster page faults one may use huge pages for the mappings.
But really, it's funny to think about page faults, when there are
disk IO problems I shown.

>>
>> 2)amount of memory for this example is doubled in a moment --
>>   n pages in current and n pages in remote tasks are occupied
>>   at the same time;
> 
> This seems disingenuous.  If you're writing p pages total in chunks of
> n pages, you will use a total of p pages if you use mmap and p+n if
> you use write.

I didn't understand this sentence because of many ifs, sorry. Could you
please explain your thought once again?

> That only doubles the amount of memory if you let n
> scale linearly with p, which seems unlikely.
>
>>
>> 3)received data has no a chance to be properly swapped for
>>   a long time.
> 
> ...
> 
>> a)kernel moves @buf pages into swap right after recv();
>> b)process_vm_writev() reads the data back from swap to pages;
> 
> If you're under that much memory pressure and thrashing that badly,
> your performance is going to be awful no matter what you're doing.  If
> you indeed observe this behavior under normal loads, then this seems
> like a VM issue that should be addressed in its own right.

I don't think so. Imagine: a container migrates from one node to another.
The nodes are the same, say, every of them has 4GB of RAM.

Before the migration, the container's tasks used 4GB of RAM and 8GB of swap.
After the page server on the second node received the pages, we want these
pages become swapped as soon as possible, and we don't want to read them from
swap to pass a read consumer.

The page server is task1 in the example. The real consumer is task2.

This is a rather normal load, I think.

>>         buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>                    MAP_PRIVATE|MAP_ANONYMOUS, ...);
>>         recv(sock, buf, n * PAGE_SIZE, 0);
>>
>> [Task 2]
>>         buf2 = process_vm_mmap(pid_of_task1, buf, n * PAGE_SIZE, NULL, 0);
>>
>> This creates a copy of VMA related to buf from task1 in task2's VM.
>> Task1's page table entries are copied into corresponding page table
>> entries of VM of task2.
> 
> You need to fully explain a whole bunch of details that you're
> ignored.

Yeah, it's not a problem :) I'm ready to explain and describe everything,
what may cause a question. Just ask ;) 

> For example, if the remote VMA is MAP_ANONYMOUS, do you get
> a CoW copy of it? I assume you don't since the whole point is to
> write to remote memory

But, no, there *is* COW semantic. We do not copy memory. We copy
page table content. This is just the same we have on fork(), when
children duplicates parent's VMA and related page table subset,
and parent's PTEs lose _PAGE_RW flag.

There is all copy_page_range() code reused for that. Please, see [3/7]
for the details.

I'm going to get special performance using THP, when number of entries
to copy is smaller than in case of PTE.

Copy several of PMD from one task page table to another's is much much much faster,
than process_vm_write() copies pages (even not mention about its reading from swap).

>,but it's at the very least quite unusual in
> Linux to have two different anonymous VMAs such that writing one of
> them changes the other one.
Writing to a new VMA does not affect old VMA. Old VMA is just used to
get vma->anon_vma and vma->vm_file from there. Two VMAs remain independent
each other.

> But there are plenty of other questions.
> What happens if the remote VMA is a gate area or other special mapping
> (vDSO, vvar area, etc)?  What if the remote memory comes from a driver
> that wasn't expecting the mapping to get magically copied to a
> different process?

In case of someone wants to duplicate such the mappings, we may consider
that, and extend the interface in the future for VMA types, which are
safe for that.

But now the logic is very overprotective, and all the unusual mappings
like you mentioned (also AIO, etc) is prohibited. Please, see [7/7]
for the details.

> This new API seems quite dangerous and complex to me, and I don't
> think the value has been adequately demonstrated.

I don't think it's dangerous and complex, because of I haven't introduced
any principal VMA conceptions different to what we have now. We just
borrow vma->anon_vma and vma->vm_file from remote process to local
like we did on fork() (borrowing of vma->anon_vma means not blindly
copying, but ordinary anon_vma_fork()).

Maybe I had to focus the description more on copying of PTE/PMD
instead of vma duplication. So, it's unexpected for me, that people
think about simple memory copying after reading the example I gave.
But I gave more explanation here, so I hope the situation became
clearer for a reader. Anyway, if you have any questions, please
ask me.

Thanks,
Kirill

^ permalink raw reply

* Re: [RFC 7/7] mm: madvise support MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER
From: Johannes Weiner @ 2019-05-21 15:33 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Michal Hocko, Andrew Morton, LKML, linux-mm, Tim Murray,
	Joel Fernandes, Suren Baghdasaryan, Daniel Colascione,
	Shakeel Butt, Sonny Rao, Brian Geffon, linux-api
In-Reply-To: <20190521025533.GH10039@google.com>

On Tue, May 21, 2019 at 11:55:33AM +0900, Minchan Kim wrote:
> On Mon, May 20, 2019 at 11:28:01AM +0200, Michal Hocko wrote:
> > [cc linux-api]
> > 
> > On Mon 20-05-19 12:52:54, Minchan Kim wrote:
> > > System could have much faster swap device like zRAM. In that case, swapping
> > > is extremely cheaper than file-IO on the low-end storage.
> > > In this configuration, userspace could handle different strategy for each
> > > kinds of vma. IOW, they want to reclaim anonymous pages by MADV_COLD
> > > while it keeps file-backed pages in inactive LRU by MADV_COOL because
> > > file IO is more expensive in this case so want to keep them in memory
> > > until memory pressure happens.
> > > 
> > > To support such strategy easier, this patch introduces
> > > MADV_ANONYMOUS_FILTER and MADV_FILE_FILTER options in madvise(2) like
> > > that /proc/<pid>/clear_refs already has supported same filters.
> > > They are filters could be Ored with other existing hints using top two bits
> > > of (int behavior).
> > 
> > madvise operates on top of ranges and it is quite trivial to do the
> > filtering from the userspace so why do we need any additional filtering?
> > 
> > > Once either of them is set, the hint could affect only the interested vma
> > > either anonymous or file-backed.
> > > 
> > > With that, user could call a process_madvise syscall simply with a entire
> > > range(0x0 - 0xFFFFFFFFFFFFFFFF) but either of MADV_ANONYMOUS_FILTER and
> > > MADV_FILE_FILTER so there is no need to call the syscall range by range.
> > 
> > OK, so here is the reason you want that. The immediate question is why
> > cannot the monitor do the filtering from the userspace. Slightly more
> > work, all right, but less of an API to expose and that itself is a
> > strong argument against.
> 
> What I should do if we don't have such filter option is to enumerate all of
> vma via /proc/<pid>/maps and then parse every ranges and inode from string,
> which would be painful for 2000+ vmas.

Just out of curiosity, how do you get to 2000+ distinct memory regions
in the address space of a mobile app? I'm assuming these aren't files,
but rather anon objects with poor grouping. Is that from guard pages
between individual heap allocations or something?

^ permalink raw reply

* Re: [PATCH 1/2] open: add close_range()
From: Al Viro @ 2019-05-21 15:00 UTC (permalink / raw)
  To: Christian Brauner
  Cc: linux-kernel, linux-fsdevel, linux-api, jannh, fweimer, oleg,
	tglx, torvalds, arnd, shuah, dhowells, tkjos, ldv, miklos,
	linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k, linux-mips,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, linux-arch, linux-kselftest, x86
In-Reply-To: <20190521113448.20654-1-christian@brauner.io>

On Tue, May 21, 2019 at 01:34:47PM +0200, Christian Brauner wrote:

> This adds the close_range() syscall. It allows to efficiently close a range
> of file descriptors up to all file descriptors of a calling task.
> 
> The syscall came up in a recent discussion around the new mount API and
> making new file descriptor types cloexec by default. During this
> discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
> syscall in this manner has been requested by various people over time.
> 
> First, it helps to close all file descriptors of an exec()ing task. This
> can be done safely via (quoting Al's example from [1] verbatim):
> 
>         /* that exec is sensitive */
>         unshare(CLONE_FILES);
>         /* we don't want anything past stderr here */
>         close_range(3, ~0U);
>         execve(....);
> 
> The code snippet above is one way of working around the problem that file
> descriptors are not cloexec by default. This is aggravated by the fact that
> we can't just switch them over without massively regressing userspace. For
> a whole class of programs having an in-kernel method of closing all file
> descriptors is very helpful (e.g. demons, service managers, programming
> language standard libraries, container managers etc.).
> (Please note, unshare(CLONE_FILES) should only be needed if the calling
>  task is multi-threaded and shares the file descriptor table with another
>  thread in which case two threads could race with one thread allocating
>  file descriptors and the other one closing them via close_range(). For the
>  general case close_range() before the execve() is sufficient.)
> 
> Second, it allows userspace to avoid implementing closing all file
> descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
> file descriptor. From looking at various large(ish) userspace code bases
> this or similar patterns are very common in:
> - service managers (cf. [4])
> - libcs (cf. [6])
> - container runtimes (cf. [5])
> - programming language runtimes/standard libraries
>   - Python (cf. [2])
>   - Rust (cf. [7], [8])
> As Dmitry pointed out there's even a long-standing glibc bug about missing
> kernel support for this task (cf. [3]).
> In addition, the syscall will also work for tasks that do not have procfs
> mounted and on kernels that do not have procfs support compiled in. In such
> situations the only way to make sure that all file descriptors are closed
> is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
> OPEN_MAX trickery (cf. comment [8] on Rust).
> 
> The performance is striking. For good measure, comparing the following
> simple close_all_fds() userspace implementation that is essentially just
> glibc's version in [6]:
> 
> static int close_all_fds(void)
> {
>         DIR *dir;
>         struct dirent *direntp;
> 
>         dir = opendir("/proc/self/fd");
>         if (!dir)
>                 return -1;
> 
>         while ((direntp = readdir(dir))) {
>                 int fd;
>                 if (strcmp(direntp->d_name, ".") == 0)
>                         continue;
>                 if (strcmp(direntp->d_name, "..") == 0)
>                         continue;
>                 fd = atoi(direntp->d_name);
>                 if (fd == 0 || fd == 1 || fd == 2)
>                         continue;
>                 close(fd);
>         }
> 
>         closedir(dir); /* cannot fail */
>         return 0;
> }
> 
> to close_range() yields:
> 1. closing 4 open files:
>    - close_all_fds(): ~280 us
>    - close_range():    ~24 us
> 
> 2. closing 1000 open files:
>    - close_all_fds(): ~5000 us
>    - close_range():   ~800 us
> 
> close_range() is designed to allow for some flexibility. Specifically, it
> does not simply always close all open file descriptors of a task. Instead,
> callers can specify an upper bound.
> This is e.g. useful for scenarios where specific file descriptors are
> created with well-known numbers that are supposed to be excluded from
> getting closed.
> For extra paranoia close_range() comes with a flags argument. This can e.g.
> be used to implement extension. Once can imagine userspace wanting to stop
> at the first error instead of ignoring errors under certain circumstances.
> There might be other valid ideas in the future. In any case, a flag
> argument doesn't hurt and keeps us on the safe side.
> 
> >From an implementation side this is kept rather dumb. It saw some input
> from David and Jann but all nonsense is obviously my own!
> - Errors to close file descriptors are currently ignored. (Could be changed
>   by setting a flag in the future if needed.)
> - __close_range() is a rather simplistic wrapper around __close_fd().
>   My reasoning behind this is based on the nature of how __close_fd() needs
>   to release an fd. But maybe I misunderstood specifics:
>   We take the files_lock and rcu-dereference the fdtable of the calling
>   task, we find the entry in the fdtable, get the file and need to release
>   files_lock before calling filp_close().
>   In the meantime the fdtable might have been altered so we can't just
>   retake the spinlock and keep the old rcu-reference of the fdtable
>   around. Instead we need to grab a fresh reference to the fdtable.
>   If my reasoning is correct then there's really no point in fancyfying
>   __close_range(): We just need to rcu-dereference the fdtable of the
>   calling task once to cap the max_fd value correctly and then go on
>   calling __close_fd() in a loop.

> +/**
> + * __close_range() - Close all file descriptors in a given range.
> + *
> + * @fd:     starting file descriptor to close
> + * @max_fd: last file descriptor to close
> + *
> + * This closes a range of file descriptors. All file descriptors
> + * from @fd up to and including @max_fd are closed.
> + */
> +int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
> +{
> +	unsigned int cur_max;
> +
> +	if (fd > max_fd)
> +		return -EINVAL;
> +
> +	rcu_read_lock();
> +	cur_max = files_fdtable(files)->max_fds;
> +	rcu_read_unlock();
> +
> +	/* cap to last valid index into fdtable */
> +	if (max_fd >= cur_max)
> +		max_fd = cur_max - 1;
> +
> +	while (fd <= max_fd)
> +		__close_fd(files, fd++);
> +
> +	return 0;
> +}

Umm...  That's going to be very painful if you dup2() something to MAX_INT and
then run that; roughly 2G iterations of bouncing ->file_lock up and down,
without anything that would yield CPU in process.

If anything, I would suggest something like

	fd = *start_fd;
	grab the lock
        fdt = files_fdtable(files);
more:
	look for the next eviction candidate in ->open_fds, starting at fd
	if there's none up to max_fd
		drop the lock
		return NULL
	*start_fd = fd + 1;
	if the fscker is really opened and not just reserved
		rcu_assign_pointer(fdt->fd[fd], NULL);
		__put_unused_fd(files, fd);
		drop the lock
		return the file we'd got
	if (unlikely(need_resched()))
		drop lock
		cond_resched();
		grab lock
		fdt = files_fdtable(files);
	goto more;

with the main loop being basically
	while ((file = pick_next(files, &start_fd, max_fd)) != NULL)
		filp_close(file, files);

^ permalink raw reply

* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
From: Andy Lutomirski @ 2019-05-21 14:43 UTC (permalink / raw)
  To: Kirill Tkhai
  Cc: Andrew Morton, Dan Williams, Michal Hocko, Keith Busch,
	Kirill A. Shutemov, alexander.h.duyck, Weiny Ira,
	Andrey Konovalov, arunks, Vlastimil Babka, Christoph Lameter,
	Rik van Riel, Kees Cook, Johannes Weiner, Nicholas Piggin,
	Mathieu Desnoyers, Shakeel Butt, Roman Gushchin, Andrea Arcangeli,
	Hugh Dickins, Jerome
In-Reply-To: <155836064844.2441.10911127801797083064.stgit@localhost.localdomain>

On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@virtuozzo.com> wrote:
>

> [Summary]
>
> New syscall, which allows to clone a remote process VMA
> into local process VM. The remote process's page table
> entries related to the VMA are cloned into local process's
> page table (in any desired address, which makes this different
> from that happens during fork()). Huge pages are handled
> appropriately.
>
> This allows to improve performance in significant way like
> it's shows in the example below.
>
> [Description]
>
> This patchset adds a new syscall, which makes possible
> to clone a VMA from a process to current process.
> The syscall supplements the functionality provided
> by process_vm_writev() and process_vm_readv() syscalls,
> and it may be useful in many situation.
>
> For example, it allows to make a zero copy of data,
> when process_vm_writev() was previously used:
>
>         struct iovec local_iov, remote_iov;
>         void *buf;
>
>         buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>                    MAP_PRIVATE|MAP_ANONYMOUS, ...);
>         recv(sock, buf, n * PAGE_SIZE, 0);
>
>         local_iov->iov_base = buf;
>         local_iov->iov_len = n * PAGE_SIZE;
>         remove_iov = ...;
>
>         process_vm_writev(pid, &local_iov, 1, &remote_iov, 1 0);
>         munmap(buf, n * PAGE_SIZE);
>
>         (Note, that above completely ignores error handling)
>
> There are several problems with process_vm_writev() in this example:
>
> 1)it causes pagefault on remote process memory, and it forces
>   allocation of a new page (if was not preallocated);

I don't see how your new syscall helps.  You're writing to remote
memory.  If that memory wasn't allocated, it's going to get allocated
regardless of whether you use a write-like interface or an mmap-like
interface.  Keep in mind that, on x86, just the hardware part of a
page fault is very slow -- populating the memory with a syscall
instead of a fault may well be faster.

>
> 2)amount of memory for this example is doubled in a moment --
>   n pages in current and n pages in remote tasks are occupied
>   at the same time;

This seems disingenuous.  If you're writing p pages total in chunks of
n pages, you will use a total of p pages if you use mmap and p+n if
you use write.  That only doubles the amount of memory if you let n
scale linearly with p, which seems unlikely.

>
> 3)received data has no a chance to be properly swapped for
>   a long time.

...

> a)kernel moves @buf pages into swap right after recv();
> b)process_vm_writev() reads the data back from swap to pages;

If you're under that much memory pressure and thrashing that badly,
your performance is going to be awful no matter what you're doing.  If
you indeed observe this behavior under normal loads, then this seems
like a VM issue that should be addressed in its own right.

>         buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>                    MAP_PRIVATE|MAP_ANONYMOUS, ...);
>         recv(sock, buf, n * PAGE_SIZE, 0);
>
> [Task 2]
>         buf2 = process_vm_mmap(pid_of_task1, buf, n * PAGE_SIZE, NULL, 0);
>
> This creates a copy of VMA related to buf from task1 in task2's VM.
> Task1's page table entries are copied into corresponding page table
> entries of VM of task2.

You need to fully explain a whole bunch of details that you're
ignored.  For example, if the remote VMA is MAP_ANONYMOUS, do you get
a CoW copy of it?  I assume you don't since the whole point is to
write to remote memory, but it's at the very least quite unusual in
Linux to have two different anonymous VMAs such that writing one of
them changes the other one.  But there are plenty of other questions.
What happens if the remote VMA is a gate area or other special mapping
(vDSO, vvar area, etc)?  What if the remote memory comes from a driver
that wasn't expecting the mapping to get magically copied to a
different process?

This new API seems quite dangerous and complex to me, and I don't
think the value has been adequately demonstrated.

^ permalink raw reply

* Re: [PATCH v3 1/2] pid: add pidfd_open()
From: Christian Brauner @ 2019-05-21 14:32 UTC (permalink / raw)
  To: jannh, oleg, viro, torvalds, linux-kernel, arnd
  Cc: akpm, cyphar, dhowells, ebiederm, elena.reshetova, keescook, luto,
	luto, tglx, linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k,
	linux-mips, linux-parisc, linuxppc-dev, linux-s390, linux-sh,
	sparclinux, linux-xtensa, linux-api, linux-arch, linux-kselftest,
	joel, dancol, serge, surenb, kernel-team
In-Reply-To: <20190520155630.21684-1-christian@brauner.io>

On Mon, May 20, 2019 at 05:56:29PM +0200, Christian Brauner wrote:
> This adds the pidfd_open() syscall. It allows a caller to retrieve pollable
> pidfds for a process which did not get created via CLONE_PIDFD, i.e. for a
> process that is created via traditional fork()/clone() calls that is only
> referenced by a PID:
> 
> int pidfd = pidfd_open(1234, 0);
> ret = pidfd_send_signal(pidfd, SIGSTOP, NULL, 0);
> 
> With the introduction of pidfds through CLONE_PIDFD it is possible to
> created pidfds at process creation time.
> However, a lot of processes get created with traditional PID-based calls
> such as fork() or clone() (without CLONE_PIDFD). For these processes a
> caller can currently not create a pollable pidfd. This is a problem for
> Android's low memory killer (LMK) and service managers such as systemd.
> Both are examples of tools that want to make use of pidfds to get reliable
> notification of process exit for non-parents (pidfd polling) and race-free
> signal sending (pidfd_send_signal()). They intend to switch to this API for
> process supervision/management as soon as possible. Having no way to get
> pollable pidfds from PID-only processes is one of the biggest blockers for
> them in adopting this api. With pidfd_open() making it possible to retrieve
> pidfds for PID-based processes we enable them to adopt this api.
> 
> In line with Arnd's recent changes to consolidate syscall numbers across
> architectures, I have added the pidfd_open() syscall to all architectures
> at the same time.
> 
> Signed-off-by: Christian Brauner <christian@brauner.io>
> Reviewed-by: Oleg Nesterov <oleg@redhat.com>

This now also carries a Reviewed-by from David.

> Acked-by: Arnd Bergmann <arnd@arndb.de>
> Cc: "Eric W. Biederman" <ebiederm@xmission.com>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: Jann Horn <jannh@google.com>
> Cc: David Howells <dhowells@redhat.com>
> Cc: Andy Lutomirsky <luto@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Aleksa Sarai <cyphar@cyphar.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: linux-api@vger.kernel.org

I've moved pidfd_open() into my for-next branch together with Joel's
pidfd polling changes. Everything is based on v5.2-rc1.

The chosen syscall number for now is 434. David is going to send out
another pile of mount api related syscalls. I'll coordinate with him
accordingly prior to the 5.3 merge window.

Thanks!
Christian

^ permalink raw reply

* Re: [PATCH 1/2] open: add close_range()
From: Rasmus Villemoes @ 2019-05-21 13:39 UTC (permalink / raw)
  To: Christian Brauner, linux-kernel, linux-fsdevel, linux-api
In-Reply-To: <20190521113448.20654-1-christian@brauner.io>

On 21/05/2019 13.34, Christian Brauner wrote:

> The performance is striking. For good measure, comparing the following
> simple close_all_fds() userspace implementation that is essentially just
> glibc's version in [6]:
> 
> static int close_all_fds(void)
> {
>         DIR *dir;
>         struct dirent *direntp;
> 
>         dir = opendir("/proc/self/fd");
>         if (!dir)
>                 return -1;
> 
>         while ((direntp = readdir(dir))) {
>                 int fd;
>                 if (strcmp(direntp->d_name, ".") == 0)
>                         continue;
>                 if (strcmp(direntp->d_name, "..") == 0)
>                         continue;
>                 fd = atoi(direntp->d_name);
>                 if (fd == 0 || fd == 1 || fd == 2)
>                         continue;
>                 close(fd);
>         }
> 
>         closedir(dir); /* cannot fail */
>         return 0;
> }

Before anybody copy-pastes this, please note that it lacks a check for
fd == dirfd(dir). If all of /proc/self/fd is returned in the first
getdents() syscall one won't notice, but...

Rasmus

^ permalink raw reply

* Re: [PATCH 1/2] open: add close_range()
From: Christian Brauner @ 2019-05-21 13:23 UTC (permalink / raw)
  To: Florian Weimer
  Cc: linux-ia64, linux-sh, linux-kernel, dhowells, linux-kselftest,
	sparclinux, shuah, linux-arch, linux-s390, miklos, x86, torvalds,
	linux-mips, linux-xtensa, tkjos, arnd, jannh, linux-m68k, viro,
	tglx, ldv, linux-arm-kernel, linux-parisc, linux-api, oleg,
	linux-alpha, linux-fsdevel, linuxppc-dev
In-Reply-To: <87h89o9cng.fsf@oldenburg2.str.redhat.com>

On Tue, May 21, 2019 at 03:10:11PM +0200, Florian Weimer wrote:
> * Christian Brauner:
> 
> >> Solaris has an fdwalk function:
> >> 
> >>   <https://docs.oracle.com/cd/E88353_01/html/E37843/closefrom-3c.html>
> >> 
> >> So a different way to implement this would expose a nextfd system call
> >
> > Meh. If nextfd() then I would like it to be able to:
> > - get the nextfd(fd) >= fd
> > - get highest open fd e.g. nextfd(-1)
> 
> The highest open descriptor isn't istering for fdwalk because nextfd
> would just fail.

Sure. I was thinking about other usecases. For example, sometimes in
userspace you want to do the following:
save_fd = dup(fd, <well-known-number-at-the-end-of-the-range);
close_range(3, (save_fd - 1));

Which brings me to another point. So even if we don't do close_range() I
would like libc to maybe give us something like close_range() for such
scenarios.

> 
> > But then I wonder if nextfd() needs to be a syscall and isn't just
> > either:
> > fcntl(fd, F_GET_NEXT)?
> > or
> > prctl(PR_GET_NEXT)?
> 
> I think the fcntl route is a bit iffy because you might need it to get
> the *first* valid descriptor.
> 
> >> to userspace, so that we can use that to implement both fdwalk and
> >> closefrom.  But maybe fdwalk is just too obscure, given the existence of
> >> /proc.
> >
> > Yeah we probably don't need fdwalk.
> 
> Agreed.  Just wanted to bring it up for completeness.  I certainly don't
> want to derail the implementation of close_range.
> 
> Thanks,
> Florian

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox