Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping
From: Kirill Tkhai @ 2019-05-23 16:11 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: akpm, dan.j.williams, mhocko, keith.busch, kirill.shutemov,
	alexander.h.duyck, ira.weiny, andreyknvl, arunks, vbabka, cl,
	riel, keescook, hannes, npiggin, mathieu.desnoyers, shakeelb,
	guro, aarcange, hughd, jglisse, mgorman, daniel.m.jordan, jannh,
	kilobyte, linux-api, linux-kernel, linux-mm
In-Reply-To: <20190522152254.5cyxhjizuwuojlix@box>

On 22.05.2019 18:22, Kirill A. Shutemov wrote:
> On Mon, May 20, 2019 at 05:00:01PM +0300, Kirill Tkhai wrote:
>> This patchset adds a new syscall, which makes possible
>> to clone a VMA from a process to current process.
>> The syscall supplements the functionality provided
>> by process_vm_writev() and process_vm_readv() syscalls,
>> and it may be useful in many situation.
> 
> Kirill, could you explain how the change affects rmap and how it is safe.
> 
> My concern is that the patchset allows to map the same page multiple times
> within one process or even map page allocated by child to the parrent.
> 
> It was not allowed before.
>
> In the best case it makes reasoning about rmap substantially more difficult.

I don't think here is big impact from process relationships, because of
as it existed before, the main rule of VMA chaining is that VMA is younger
or older each other. For example, reusing of anon_vma in anon_vma_clone()
may be done either children or siblings. Also, it is possible reparenting
after some of processes dies; or splitting two branches of processes having
the same grand parent into two chains after the grand parent dies, so it looks
there should be many combinations already available.

Mapping of the same page multiple times is a different thing, and it was never
allowed for rmap.

Could you please say more specifically what looks suspicious for you and I'll
try to answer then? Otherwise, it's possible to write explanations as big as
a dissertation and to miss all answers to that is interested for you :)

> 
> But I'm worry it will introduce hard-to-debug bugs, like described in
> https://lwn.net/Articles/383162/.

I read the article, but there are a lot of messages in thread, I'm not sure,
that found the actual fix there. But it looks like one of the fixes may be
be usage of anon_vma->root in __page_set_anon_rmap().

> Note, that is some cases we care about rmap walk order (see for instance
> mremap() case). I'm not convinced that the feature will not break
> something in the area.

Yeah, thanks for pointing, I'll check this.

Kirill

^ permalink raw reply

* [PATCH v2 2/2] tests: add close_range() tests
From: Christian Brauner @ 2019-05-23 15:47 UTC (permalink / raw)
  To: viro, linux-kernel, linux-fsdevel, linux-api, torvalds, fweimer
  Cc: jannh, oleg, tglx, arnd, shuah, dhowells, tkjos, ldv, miklos,
	linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k, linux-mips,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, linux-arch, linux-kselftest, x86, Christian Brauner
In-Reply-To: <20190523154747.15162-1-christian@brauner.io>

This adds basic tests for the new close_range() syscall.
- test that no invalid flags can be passed
- test that a range of file descriptors is correctly closed
- test that a range of file descriptors is correctly closed if there there
  are already closed file descriptors in the range
- test that max_fd is correctly capped to the current fdtable maximum

Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dmitry V. Levin <ldv@altlinux.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
---
v1: unchanged
v2:
- Christian Brauner <christian@brauner.io>:
  - verify that close_range() correctly closes a single file descriptor
---
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/core/.gitignore       |   1 +
 tools/testing/selftests/core/Makefile         |   6 +
 .../testing/selftests/core/close_range_test.c | 142 ++++++++++++++++++
 4 files changed, 150 insertions(+)
 create mode 100644 tools/testing/selftests/core/.gitignore
 create mode 100644 tools/testing/selftests/core/Makefile
 create mode 100644 tools/testing/selftests/core/close_range_test.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 9781ca79794a..06e57fabbff9 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -4,6 +4,7 @@ TARGETS += bpf
 TARGETS += breakpoints
 TARGETS += capabilities
 TARGETS += cgroup
+TARGETS += core
 TARGETS += cpufreq
 TARGETS += cpu-hotplug
 TARGETS += drivers/dma-buf
diff --git a/tools/testing/selftests/core/.gitignore b/tools/testing/selftests/core/.gitignore
new file mode 100644
index 000000000000..6e6712ce5817
--- /dev/null
+++ b/tools/testing/selftests/core/.gitignore
@@ -0,0 +1 @@
+close_range_test
diff --git a/tools/testing/selftests/core/Makefile b/tools/testing/selftests/core/Makefile
new file mode 100644
index 000000000000..de3ae68aa345
--- /dev/null
+++ b/tools/testing/selftests/core/Makefile
@@ -0,0 +1,6 @@
+CFLAGS += -g -I../../../../usr/include/ -I../../../../include
+
+TEST_GEN_PROGS := close_range_test
+
+include ../lib.mk
+
diff --git a/tools/testing/selftests/core/close_range_test.c b/tools/testing/selftests/core/close_range_test.c
new file mode 100644
index 000000000000..d6e6079d3d53
--- /dev/null
+++ b/tools/testing/selftests/core/close_range_test.c
@@ -0,0 +1,142 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <linux/kernel.h>
+#include <limits.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <syscall.h>
+#include <unistd.h>
+
+#include "../kselftest.h"
+
+static inline int sys_close_range(unsigned int fd, unsigned int max_fd,
+				  unsigned int flags)
+{
+	return syscall(__NR_close_range, fd, max_fd, flags);
+}
+
+#ifndef ARRAY_SIZE
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
+#endif
+
+int main(int argc, char **argv)
+{
+	const char *test_name = "close_range";
+	int i, ret;
+	int open_fds[101];
+	int fd_max, fd_mid, fd_min;
+
+	ksft_set_plan(9);
+
+	for (i = 0; i < ARRAY_SIZE(open_fds); i++) {
+		int fd;
+
+		fd = open("/dev/null", O_RDONLY | O_CLOEXEC);
+		if (fd < 0) {
+			if (errno == ENOENT)
+				ksft_exit_skip(
+					"%s test: skipping test since /dev/null does not exist\n",
+					test_name);
+
+			ksft_exit_fail_msg(
+				"%s test: %s - failed to open /dev/null\n",
+				strerror(errno), test_name);
+		}
+
+		open_fds[i] = fd;
+	}
+
+	fd_min = open_fds[0];
+	fd_max = open_fds[99];
+
+	ret = sys_close_range(fd_min, fd_max, 1);
+	if (!ret)
+		ksft_exit_fail_msg(
+			"%s test: managed to pass invalid flag value\n",
+			test_name);
+	ksft_test_result_pass("do not allow invalid flag values for close_range()\n");
+
+	fd_mid = open_fds[50];
+	ret = sys_close_range(fd_min, fd_mid, 0);
+	if (ret < 0)
+		ksft_exit_fail_msg(
+			"%s test: Failed to close range of file descriptors from %d to %d\n",
+			test_name, fd_min, fd_mid);
+	ksft_test_result_pass("close_range() from %d to %d\n", fd_min, fd_mid);
+
+	for (i = 0; i <= 50; i++) {
+		ret = fcntl(open_fds[i], F_GETFL);
+		if (ret >= 0)
+			ksft_exit_fail_msg(
+				"%s test: Failed to close range of file descriptors from %d to %d\n",
+				test_name, fd_min, fd_mid);
+	}
+	ksft_test_result_pass("fcntl() verify closed range from %d to %d\n", fd_min, fd_mid);
+
+	/* create a couple of gaps */
+	close(57);
+	close(78);
+	close(81);
+	close(82);
+	close(84);
+	close(90);
+
+	fd_mid = open_fds[51];
+	/* Choose slightly lower limit and leave some fds for a later test */
+	fd_max = open_fds[92];
+	ret = sys_close_range(fd_mid, fd_max, 0);
+	if (ret < 0)
+		ksft_exit_fail_msg(
+			"%s test: Failed to close range of file descriptors from 51 to 100\n",
+			test_name);
+	ksft_test_result_pass("close_range() from %d to %d\n", fd_mid, fd_max);
+
+	for (i = 51; i <= 92; i++) {
+		ret = fcntl(open_fds[i], F_GETFL);
+		if (ret >= 0)
+			ksft_exit_fail_msg(
+				"%s test: Failed to close range of file descriptors from 51 to 100\n",
+				test_name);
+	}
+	ksft_test_result_pass("fcntl() verify closed range from %d to %d\n", fd_mid, fd_max);
+
+	fd_mid = open_fds[93];
+	fd_max = open_fds[99];
+	/* test that the kernel caps and still closes all fds */
+	ret = sys_close_range(fd_mid, UINT_MAX, 0);
+	if (ret < 0)
+		ksft_exit_fail_msg(
+			"%s test: Failed to close range of file descriptors from 51 to 100\n",
+			test_name);
+	ksft_test_result_pass("close_range() from %d to %d\n", fd_mid, fd_max);
+
+	for (i = 93; i < 100; i++) {
+		ret = fcntl(open_fds[i], F_GETFL);
+		if (ret >= 0)
+			ksft_exit_fail_msg(
+				"%s test: Failed to close range of file descriptors from 51 to 100\n",
+				test_name);
+	}
+	ksft_test_result_pass("fcntl() verify closed range from %d to %d\n", fd_mid, fd_max);
+
+	ret = sys_close_range(open_fds[100], open_fds[100], 0);
+	if (ret < 0)
+		ksft_exit_fail_msg(
+			"%s test: Failed to close single file descriptor\n",
+			test_name);
+	ksft_test_result_pass("close_range() closed single file descriptor\n");
+
+	ret = fcntl(open_fds[100], F_GETFL);
+	if (ret >= 0)
+		ksft_exit_fail_msg(
+			"%s test: Failed to close single file descriptor\n",
+			test_name);
+	ksft_test_result_pass("fcntl() verify closed single file descriptor\n");
+
+	return ksft_exit_pass();
+}
-- 
2.21.0

^ permalink raw reply related

* [PATCH v2 1/2] open: add close_range()
From: Christian Brauner @ 2019-05-23 15:47 UTC (permalink / raw)
  To: viro, linux-kernel, linux-fsdevel, linux-api, torvalds, fweimer
  Cc: jannh, oleg, tglx, arnd, shuah, dhowells, tkjos, ldv, miklos,
	linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k, linux-mips,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, linux-arch, linux-kselftest, x86, Christian Brauner
In-Reply-To: <20190523154747.15162-1-christian@brauner.io>

This adds the close_range() syscall. It allows to efficiently close a range
of file descriptors up to all file descriptors of a calling task.

The syscall came up in a recent discussion around the new mount API and
making new file descriptor types cloexec by default. During this
discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
syscall in this manner has been requested by various people over time.

First, it helps to close all file descriptors of an exec()ing task. This
can be done safely via (quoting Al's example from [1] verbatim):

        /* that exec is sensitive */
        unshare(CLONE_FILES);
        /* we don't want anything past stderr here */
        close_range(3, ~0U);
        execve(....);

The code snippet above is one way of working around the problem that file
descriptors are not cloexec by default. This is aggravated by the fact that
we can't just switch them over without massively regressing userspace. For
a whole class of programs having an in-kernel method of closing all file
descriptors is very helpful (e.g. demons, service managers, programming
language standard libraries, container managers etc.).
(Please note, unshare(CLONE_FILES) should only be needed if the calling
 task is multi-threaded and shares the file descriptor table with another
 thread in which case two threads could race with one thread allocating
 file descriptors and the other one closing them via close_range(). For the
 general case close_range() before the execve() is sufficient.)

Second, it allows userspace to avoid implementing closing all file
descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
file descriptor. From looking at various large(ish) userspace code bases
this or similar patterns are very common in:
- service managers (cf. [4])
- libcs (cf. [6])
- container runtimes (cf. [5])
- programming language runtimes/standard libraries
  - Python (cf. [2])
  - Rust (cf. [7], [8])
As Dmitry pointed out there's even a long-standing glibc bug about missing
kernel support for this task (cf. [3]).
In addition, the syscall will also work for tasks that do not have procfs
mounted and on kernels that do not have procfs support compiled in. In such
situations the only way to make sure that all file descriptors are closed
is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
OPEN_MAX trickery (cf. comment [8] on Rust).

The performance is striking. For good measure, comparing the following
simple close_all_fds() userspace implementation that is essentially just
glibc's version in [6]:

static int close_all_fds(void)
{
        int dir_fd;
        DIR *dir;
        struct dirent *direntp;

        dir = opendir("/proc/self/fd");
        if (!dir)
                return -1;
        dir_fd = dirfd(dir);
        while ((direntp = readdir(dir))) {
                int fd;
                if (strcmp(direntp->d_name, ".") == 0)
                        continue;
                if (strcmp(direntp->d_name, "..") == 0)
                        continue;
                fd = atoi(direntp->d_name);
                if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
                        continue;
                close(fd);
        }
        closedir(dir);
        return 0;
}

to close_range() yields:
1. closing 4 open files:
   - close_all_fds(): ~280 us
   - close_range():    ~24 us

2. closing 1000 open files:
   - close_all_fds(): ~5000 us
   - close_range():   ~800 us

close_range() is designed to allow for some flexibility. Specifically, it
does not simply always close all open file descriptors of a task. Instead,
callers can specify an upper bound.
This is e.g. useful for scenarios where specific file descriptors are
created with well-known numbers that are supposed to be excluded from
getting closed.
For extra paranoia close_range() comes with a flags argument. This can e.g.
be used to implement extension. Once can imagine userspace wanting to stop
at the first error instead of ignoring errors under certain circumstances.
There might be other valid ideas in the future. In any case, a flag
argument doesn't hurt and keeps us on the safe side.

>From an implementation side this is kept rather dumb. It saw some input
from David and Jann but all nonsense is obviously my own!
- Errors to close file descriptors are currently ignored. (Could be changed
  by setting a flag in the future if needed.)
- __close_range() is a rather simplistic wrapper around __close_fd().
  My reasoning behind this is based on the nature of how __close_fd() needs
  to release an fd. But maybe I misunderstood specifics:
  We take the files_lock and rcu-dereference the fdtable of the calling
  task, we find the entry in the fdtable, get the file and need to release
  files_lock before calling filp_close().
  In the meantime the fdtable might have been altered so we can't just
  retake the spinlock and keep the old rcu-reference of the fdtable
  around. Instead we need to grab a fresh reference to the fdtable.
  If my reasoning is correct then there's really no point in fancyfying
  __close_range(): We just need to rcu-dereference the fdtable of the
  calling task once to cap the max_fd value correctly and then go on
  calling __close_fd() in a loop.

/* References */
[1]: https://lore.kernel.org/lkml/20190516165021.GD17978@ZenIV.linux.org.uk/
[2]: https://github.com/python/cpython/blob/9e4f2f3a6b8ee995c365e86d976937c141d867f8/Modules/_posixsubprocess.c#L220
[3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7
[4]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/fd-util.c#L217
[5]: https://github.com/lxc/lxc/blob/ddf4b77e11a4d08f09b7b9cd13e593f8c047edc5/src/lxc/start.c#L236
[6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17
     Note that this is an internal implementation that is not exported.
     Currently, libc seems to not provide an exported version of this
     because of missing kernel support to do this.
[7]: https://github.com/rust-lang/rust/issues/12148
[8]: https://github.com/rust-lang/rust/blob/5f47c0613ed4eb46fca3633c1297364c09e5e451/src/libstd/sys/unix/process2.rs#L303-L308
     Rust's solution is slightly different but is equally unperformant.
     Rust calls getdtablesize() which is a glibc library function that
     simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then
     goes on to call close() on each fd. That's obviously overkill for most
     tasks. Rarely, tasks - especially non-demons - hit RLIMIT_NOFILE or
     OPEN_MAX.
     Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set
     to 1024. Even in this case, there's a very high chance that in the
     common case Rust is calling the close() syscall 1021 times pointlessly
     if the task just has 0, 1, and 2 open.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dmitry V. Levin <ldv@altlinux.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
---
v1:
- Linus Torvalds <torvalds@linux-foundation.org>:
  - add cond_resched() to yield cpu when closing a lot of file descriptors
- Al Viro <viro@zeniv.linux.org.uk>:
  - add cond_resched() to yield cpu when closing a lot of file descriptors
v2: unchanged
---
 arch/alpha/kernel/syscalls/syscall.tbl      |  1 +
 arch/arm/tools/syscall.tbl                  |  1 +
 arch/arm64/include/asm/unistd32.h           |  2 +
 arch/ia64/kernel/syscalls/syscall.tbl       |  1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |  1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |  1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |  1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |  1 +
 arch/s390/kernel/syscalls/syscall.tbl       |  1 +
 arch/sh/kernel/syscalls/syscall.tbl         |  1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |  1 +
 fs/file.c                                   | 62 ++++++++++++++++++---
 fs/open.c                                   | 20 +++++++
 include/linux/fdtable.h                     |  2 +
 include/linux/syscalls.h                    |  2 +
 include/uapi/asm-generic/unistd.h           |  4 +-
 22 files changed, 99 insertions(+), 9 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 9e7704e44f6d..b55d93af8096 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -473,3 +473,4 @@
 541	common	fsconfig			sys_fsconfig
 542	common	fsmount				sys_fsmount
 543	common	fspick				sys_fspick
+545	common	close_range			sys_close_range
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index aaf479a9e92d..0125c97c75dd 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -447,3 +447,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index c39e90600bb3..9a3270d29b42 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -886,6 +886,8 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
 __SYSCALL(__NR_fsmount, sys_fsmount)
 #define __NR_fspick 433
 __SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_close_range 435
+__SYSCALL(__NR_close_range, sys_close_range)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index e01df3f2f80d..1a90b464e96f 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -354,3 +354,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 7e3d0734b2f3..2dee2050f9ef 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -433,3 +433,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 26339e417695..923ef69e5a76 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -439,3 +439,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 0e2dd68ade57..967ed9de51cd 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -372,3 +372,4 @@
 431	n32	fsconfig			sys_fsconfig
 432	n32	fsmount				sys_fsmount
 433	n32	fspick				sys_fspick
+435	n32	close_range			sys_close_range
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 5eebfa0d155c..71de731102b1 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -348,3 +348,4 @@
 431	n64	fsconfig			sys_fsconfig
 432	n64	fsmount				sys_fsmount
 433	n64	fspick				sys_fspick
+435	n64	close_range			sys_close_range
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 3cc1374e02d0..5a325ab29f88 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -421,3 +421,4 @@
 431	o32	fsconfig			sys_fsconfig
 432	o32	fsmount				sys_fsmount
 433	o32	fspick				sys_fspick
+435	o32	close_range			sys_close_range
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index c9e377d59232..dcc0a0879139 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -430,3 +430,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 103655d84b4b..ba2c1f078cbd 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -515,3 +515,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index e822b2964a83..d7c9043d2902 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -436,3 +436,4 @@
 431  common	fsconfig		sys_fsconfig			sys_fsconfig
 432  common	fsmount			sys_fsmount			sys_fsmount
 433  common	fspick			sys_fspick			sys_fspick
+435  common	close_range		sys_close_range			sys_close_range
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 016a727d4357..9b5e6bf0ce32 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -436,3 +436,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index e047480b1605..8c674a1e0072 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -479,3 +479,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ad968b7bac72..7f7a89a96707 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -438,3 +438,4 @@
 431	i386	fsconfig		sys_fsconfig			__ia32_sys_fsconfig
 432	i386	fsmount			sys_fsmount			__ia32_sys_fsmount
 433	i386	fspick			sys_fspick			__ia32_sys_fspick
+435	i386	close_range		sys_close_range			__ia32_sys_close_range
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index b4e6f9e6204a..0f7d47ae921c 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
 431	common	fsconfig		__x64_sys_fsconfig
 432	common	fsmount			__x64_sys_fsmount
 433	common	fspick			__x64_sys_fspick
+435	common	close_range		__x64_sys_close_range
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 5fa0ee1c8e00..b489532265d0 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -404,3 +404,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/fs/file.c b/fs/file.c
index 3da91a112bab..53430d68bbd7 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -10,6 +10,7 @@
 #include <linux/syscalls.h>
 #include <linux/export.h>
 #include <linux/fs.h>
+#include <linux/kernel.h>
 #include <linux/mm.h>
 #include <linux/sched/signal.h>
 #include <linux/slab.h>
@@ -615,12 +616,9 @@ void fd_install(unsigned int fd, struct file *file)
 
 EXPORT_SYMBOL(fd_install);
 
-/*
- * The same warnings as for __alloc_fd()/__fd_install() apply here...
- */
-int __close_fd(struct files_struct *files, unsigned fd)
+static struct file *pick_file(struct files_struct *files, unsigned fd)
 {
-	struct file *file;
+	struct file *file = NULL;
 	struct fdtable *fdt;
 
 	spin_lock(&files->file_lock);
@@ -632,15 +630,63 @@ int __close_fd(struct files_struct *files, unsigned fd)
 		goto out_unlock;
 	rcu_assign_pointer(fdt->fd[fd], NULL);
 	__put_unused_fd(files, fd);
-	spin_unlock(&files->file_lock);
-	return filp_close(file, files);
 
 out_unlock:
 	spin_unlock(&files->file_lock);
-	return -EBADF;
+	return file;
+}
+
+/*
+ * The same warnings as for __alloc_fd()/__fd_install() apply here...
+ */
+int __close_fd(struct files_struct *files, unsigned fd)
+{
+	struct file *file;
+
+	file = pick_file(files, fd);
+	if (!file)
+		return -EBADF;
+
+	return filp_close(file, files);
 }
 EXPORT_SYMBOL(__close_fd); /* for ksys_close() */
 
+/**
+ * __close_range() - Close all file descriptors in a given range.
+ *
+ * @fd:     starting file descriptor to close
+ * @max_fd: last file descriptor to close
+ *
+ * This closes a range of file descriptors. All file descriptors
+ * from @fd up to and including @max_fd are closed.
+ */
+int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
+{
+	unsigned int cur_max;
+
+	if (fd > max_fd)
+		return -EINVAL;
+
+	rcu_read_lock();
+	cur_max = files_fdtable(files)->max_fds;
+	rcu_read_unlock();
+
+	/* cap to last valid index into fdtable */
+	max_fd = max(max_fd, (cur_max - 1));
+	while (fd <= max_fd) {
+		struct file *file;
+
+		file = pick_file(files, fd++);
+		if (!file)
+			continue;
+
+		filp_close(file, files);
+		cond_resched();
+	}
+
+	return 0;
+}
+
 /*
  * variant of __close_fd that gets a ref on the file for later fput
  */
diff --git a/fs/open.c b/fs/open.c
index 9c7d724a6f67..c7baaee7aa47 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1174,6 +1174,26 @@ SYSCALL_DEFINE1(close, unsigned int, fd)
 	return retval;
 }
 
+/**
+ * close_range() - Close all file descriptors in a given range.
+ *
+ * @fd:     starting file descriptor to close
+ * @max_fd: last file descriptor to close
+ * @flags:  reserved for future extensions
+ *
+ * This closes a range of file descriptors. All file descriptors
+ * from @fd up to and including @max_fd are closed.
+ * Currently, errors to close a given file descriptor are ignored.
+ */
+SYSCALL_DEFINE3(close_range, unsigned int, fd, unsigned int, max_fd,
+		unsigned int, flags)
+{
+	if (flags)
+		return -EINVAL;
+
+	return __close_range(current->files, fd, max_fd);
+}
+
 /*
  * This routine simulates a hangup on the tty, to arrange that users
  * are given clean terminals at login time.
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index f07c55ea0c22..fcd07181a365 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -121,6 +121,8 @@ extern void __fd_install(struct files_struct *files,
 		      unsigned int fd, struct file *file);
 extern int __close_fd(struct files_struct *files,
 		      unsigned int fd);
+extern int __close_range(struct files_struct *files, unsigned int fd,
+			 unsigned int max_fd);
 extern int __close_fd_get_file(unsigned int fd, struct file **res);
 
 extern struct kmem_cache *files_cachep;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe1be5b..c0189e223255 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -441,6 +441,8 @@ asmlinkage long sys_fchown(unsigned int fd, uid_t user, gid_t group);
 asmlinkage long sys_openat(int dfd, const char __user *filename, int flags,
 			   umode_t mode);
 asmlinkage long sys_close(unsigned int fd);
+asmlinkage long sys_close_range(unsigned int fd, unsigned int max_fd,
+				unsigned int flags);
 asmlinkage long sys_vhangup(void);
 
 /* fs/pipe.c */
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a87904daf103..3f36c8745d24 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -844,9 +844,11 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
 __SYSCALL(__NR_fsmount, sys_fsmount)
 #define __NR_fspick 433
 __SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_close_range 435
+__SYSCALL(__NR_close_range, sys_close_range)
 
 #undef __NR_syscalls
-#define __NR_syscalls 434
+#define __NR_syscalls 436
 
 /*
  * 32 bit systems traditionally used different
-- 
2.21.0

^ permalink raw reply related

* [PATCH v2 0/2] close_range()
From: Christian Brauner @ 2019-05-23 15:47 UTC (permalink / raw)
  To: viro, linux-kernel, linux-fsdevel, linux-api, torvalds, fweimer
  Cc: jannh, oleg, tglx, arnd, shuah, dhowells, tkjos, ldv, miklos,
	linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k, linux-mips,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, linux-arch, linux-kselftest, x86, Christian Brauner

Hey,

This is v2 of this patchset.

In accordance with some comments There's a cond_resched() added to the
close loop similar to what is done for close_files().
A common helper pick_file() for __close_fd() and __close_range() has
been split out. This allows to only make a cond_resched() call when
filp_close() has been called similar to what is done in close_files().
Maybe that's not worth it. Jann mentioned that cond_resched() looks
rather cheap.
So it maybe that we could simply do:

while (fd <= max_fd) {
       __close(files, fd++);
       cond_resched();
}

I also added a missing test for close_range(fd, fd, 0).

Thanks!
Christian

Christian Brauner (2):
  open: add close_range()
  tests: add close_range() tests

 arch/alpha/kernel/syscalls/syscall.tbl        |   1 +
 arch/arm/tools/syscall.tbl                    |   1 +
 arch/arm64/include/asm/unistd32.h             |   2 +
 arch/ia64/kernel/syscalls/syscall.tbl         |   1 +
 arch/m68k/kernel/syscalls/syscall.tbl         |   1 +
 arch/microblaze/kernel/syscalls/syscall.tbl   |   1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl     |   1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl     |   1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl     |   1 +
 arch/parisc/kernel/syscalls/syscall.tbl       |   1 +
 arch/powerpc/kernel/syscalls/syscall.tbl      |   1 +
 arch/s390/kernel/syscalls/syscall.tbl         |   1 +
 arch/sh/kernel/syscalls/syscall.tbl           |   1 +
 arch/sparc/kernel/syscalls/syscall.tbl        |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl        |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl        |   1 +
 arch/xtensa/kernel/syscalls/syscall.tbl       |   1 +
 fs/file.c                                     |  62 +++++++-
 fs/open.c                                     |  20 +++
 include/linux/fdtable.h                       |   2 +
 include/linux/syscalls.h                      |   2 +
 include/uapi/asm-generic/unistd.h             |   4 +-
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/core/.gitignore       |   1 +
 tools/testing/selftests/core/Makefile         |   6 +
 .../testing/selftests/core/close_range_test.c | 142 ++++++++++++++++++
 26 files changed, 249 insertions(+), 9 deletions(-)
 create mode 100644 tools/testing/selftests/core/.gitignore
 create mode 100644 tools/testing/selftests/core/Makefile
 create mode 100644 tools/testing/selftests/core/close_range_test.c

-- 
2.21.0

^ permalink raw reply

* Re: [PATCH v1 1/2] open: add close_range()
From: Christian Brauner @ 2019-05-23 14:37 UTC (permalink / raw)
  To: Jann Horn
  Cc: Oleg Nesterov, Al Viro, kernel list, linux-fsdevel, Linux API,
	Linus Torvalds, Florian Weimer, Thomas Gleixner, Arnd Bergmann,
	Shuah Khan, David Howells, Todd Kjos, Dmitry V. Levin,
	Miklos Szeredi, linux-alpha, linux-arm-kernel, linux-ia64,
	linux-m68k, linux-mips, linux-parisc, linuxppc-dev, linux-s390
In-Reply-To: <CAG48ez0Uq2GQnQsuPkNrDdJVku_6GPeZ_5F_-5J3iy2CULr0_Q@mail.gmail.com>

On Thu, May 23, 2019 at 04:32:14PM +0200, Jann Horn wrote:
> On Thu, May 23, 2019 at 1:51 PM Christian Brauner <christian@brauner.io> wrote:
> [...]
> > I kept it dumb and was about to reply that your solution introduces more
> > code when it seemed we wanted to keep this very simple for now.
> > But then I saw that find_next_opened_fd() already exists as
> > find_next_fd(). So it's actually not bad compared to what I sent in v1.
> > So - with some small tweaks (need to test it and all now) - how do we
> > feel about?:
> [...]
> > static int __close_next_open_fd(struct files_struct *files, unsigned *curfd, unsigned maxfd)
> > {
> >         struct file *file = NULL;
> >         unsigned fd;
> >         struct fdtable *fdt;
> >
> >         spin_lock(&files->file_lock);
> >         fdt = files_fdtable(files);
> >         fd = find_next_fd(fdt, *curfd);
> 
> find_next_fd() finds free fds, not used ones.
> 
> >         if (fd >= fdt->max_fds || fd > maxfd)
> >                 goto out_unlock;
> >
> >         file = fdt->fd[fd];
> >         rcu_assign_pointer(fdt->fd[fd], NULL);
> >         __put_unused_fd(files, fd);
> 
> You can't do __put_unused_fd() if the old pointer in fdt->fd[fd] was
> NULL - because that means that the fd has been reserved by another
> thread that is about to put a file pointer in there, and if you
> release the fd here, that messes up the refcounting (or hits the
> BUG_ON() in __fd_install()).
> 
> > out_unlock:
> >         spin_unlock(&files->file_lock);
> >
> >         if (!file)
> >                 return -EBADF;
> >
> >         *curfd = fd;
> >         filp_close(file, files);
> >         return 0;
> > }
> >
> > int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
> > {
> >         if (fd > max_fd)
> >                 return -EINVAL;
> >
> >         while (fd <= max_fd) {
> 
> Note that with a pattern like this, you have to be careful about what
> happens if someone gives you max_fd==0xffffffff - then this condition
> is always true and the loop can not terminate this way.
> 
> >                 if (__close_next_fd(files, &fd, maxfd))
> >                         break;
> 
> (obviously it can still terminate this way)

Yup, this was only a quick draft.
I think the dumb simple thing that I did before was the best way to do
it for now.
I first thought that the find_next_open_fd() function already exists but
when I went to write a POC for testing realized it doesn't anyway.

^ permalink raw reply

* Re: [PATCH v1 1/2] open: add close_range()
From: Jann Horn @ 2019-05-23 14:32 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Oleg Nesterov, Al Viro, kernel list, linux-fsdevel, Linux API,
	Linus Torvalds, Florian Weimer, Thomas Gleixner, Arnd Bergmann,
	Shuah Khan, David Howells, Todd Kjos, Dmitry V. Levin,
	Miklos Szeredi, linux-alpha, linux-arm-kernel, linux-ia64,
	linux-m68k, linux-mips, linux-parisc, linuxppc-dev, linux-s390
In-Reply-To: <20190523115118.pmscbd6kaqy37dym@brauner.io>

On Thu, May 23, 2019 at 1:51 PM Christian Brauner <christian@brauner.io> wrote:
[...]
> I kept it dumb and was about to reply that your solution introduces more
> code when it seemed we wanted to keep this very simple for now.
> But then I saw that find_next_opened_fd() already exists as
> find_next_fd(). So it's actually not bad compared to what I sent in v1.
> So - with some small tweaks (need to test it and all now) - how do we
> feel about?:
[...]
> static int __close_next_open_fd(struct files_struct *files, unsigned *curfd, unsigned maxfd)
> {
>         struct file *file = NULL;
>         unsigned fd;
>         struct fdtable *fdt;
>
>         spin_lock(&files->file_lock);
>         fdt = files_fdtable(files);
>         fd = find_next_fd(fdt, *curfd);

find_next_fd() finds free fds, not used ones.

>         if (fd >= fdt->max_fds || fd > maxfd)
>                 goto out_unlock;
>
>         file = fdt->fd[fd];
>         rcu_assign_pointer(fdt->fd[fd], NULL);
>         __put_unused_fd(files, fd);

You can't do __put_unused_fd() if the old pointer in fdt->fd[fd] was
NULL - because that means that the fd has been reserved by another
thread that is about to put a file pointer in there, and if you
release the fd here, that messes up the refcounting (or hits the
BUG_ON() in __fd_install()).

> out_unlock:
>         spin_unlock(&files->file_lock);
>
>         if (!file)
>                 return -EBADF;
>
>         *curfd = fd;
>         filp_close(file, files);
>         return 0;
> }
>
> int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
> {
>         if (fd > max_fd)
>                 return -EINVAL;
>
>         while (fd <= max_fd) {

Note that with a pattern like this, you have to be careful about what
happens if someone gives you max_fd==0xffffffff - then this condition
is always true and the loop can not terminate this way.

>                 if (__close_next_fd(files, &fd, maxfd))
>                         break;

(obviously it can still terminate this way)

>                 cond_resched();
>                 fd++;
>         }
>
>         return 0;
> }

^ permalink raw reply

* Re: [PATCH v1 1/2] open: add close_range()
From: Christian Brauner @ 2019-05-23 14:28 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: viro, linux-kernel, linux-fsdevel, linux-api, torvalds, fweimer,
	jannh, tglx, arnd, shuah, dhowells, tkjos, ldv, miklos,
	linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k, linux-mips,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, linux-arch, linux-kselftest, x86
In-Reply-To: <20190523141447.34s3kc3fuwmoeq7n@brauner.io>

On Thu, May 23, 2019 at 04:14:47PM +0200, Christian Brauner wrote:
> On Thu, May 23, 2019 at 01:51:18PM +0200, Christian Brauner wrote:
> > On Wed, May 22, 2019 at 06:57:37PM +0200, Oleg Nesterov wrote:
> > > On 05/22, Christian Brauner wrote:
> > > >
> > > > +static struct file *pick_file(struct files_struct *files, unsigned fd)
> > > >  {
> > > > -	struct file *file;
> > > > +	struct file *file = NULL;
> > > >  	struct fdtable *fdt;
> > > >  
> > > >  	spin_lock(&files->file_lock);
> > > > @@ -632,15 +629,65 @@ int __close_fd(struct files_struct *files, unsigned fd)
> > > >  		goto out_unlock;
> > > >  	rcu_assign_pointer(fdt->fd[fd], NULL);
> > > >  	__put_unused_fd(files, fd);
> > > > -	spin_unlock(&files->file_lock);
> > > > -	return filp_close(file, files);
> > > >  
> > > >  out_unlock:
> > > >  	spin_unlock(&files->file_lock);
> > > > -	return -EBADF;
> > > > +	return file;
> > > 
> > > ...
> > > 
> > > > +int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
> > > > +{
> > > > +	unsigned int cur_max;
> > > > +
> > > > +	if (fd > max_fd)
> > > > +		return -EINVAL;
> > > > +
> > > > +	rcu_read_lock();
> > > > +	cur_max = files_fdtable(files)->max_fds;
> > > > +	rcu_read_unlock();
> > > > +
> > > > +	/* cap to last valid index into fdtable */
> > > > +	if (max_fd >= cur_max)
> > > > +		max_fd = cur_max - 1;
> > > > +
> > > > +	while (fd <= max_fd) {
> > > > +		struct file *file;
> > > > +
> > > > +		file = pick_file(files, fd++);
> > > 
> > > Well, how about something like
> > > 
> > > 	static unsigned int find_next_opened_fd(struct fdtable *fdt, unsigned start)
> > > 	{
> > > 		unsigned int maxfd = fdt->max_fds;
> > > 		unsigned int maxbit = maxfd / BITS_PER_LONG;
> > > 		unsigned int bitbit = start / BITS_PER_LONG;
> > > 
> > > 		bitbit = find_next_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
> > > 		if (bitbit > maxfd)
> > > 			return maxfd;
> > > 		if (bitbit > start)
> > > 			start = bitbit;
> > > 		return find_next_bit(fdt->open_fds, maxfd, start);
> > > 	}
> > 
> > > 
> > > 	unsigned close_next_fd(struct files_struct *files, unsigned start, unsigned maxfd)
> > > 	{
> > > 		unsigned fd;
> > > 		struct file *file;
> > > 		struct fdtable *fdt;
> > > 	
> > > 		spin_lock(&files->file_lock);
> > > 		fdt = files_fdtable(files);
> > > 		fd = find_next_opened_fd(fdt, start);
> > > 		if (fd >= fdt->max_fds || fd > maxfd) {
> > > 			fd = -1;
> > > 			goto out;
> > > 		}
> > > 
> > > 		file = fdt->fd[fd];
> > > 		rcu_assign_pointer(fdt->fd[fd], NULL);
> > > 		__put_unused_fd(files, fd);
> > > 	out:
> > > 		spin_unlock(&files->file_lock);
> > > 
> > > 		if (fd == -1u)
> > > 			return fd;
> > > 
> > > 		filp_close(file, files);
> > > 		return fd + 1;
> > > 	}
> > 
> > Thanks, Oleg!
> > 
> > I kept it dumb and was about to reply that your solution introduces more
> > code when it seemed we wanted to keep this very simple for now.
> > But then I saw that find_next_opened_fd() already exists as
> > find_next_fd(). So it's actually not bad compared to what I sent in v1.
> > So - with some small tweaks (need to test it and all now) - how do we
> > feel about?:
> 
> That's obviously not correct atm but I'll send out a tweaked version in
> a bit.

So given that we would really need another find_next_open_fd() I think
sticking to the simple cond_resched() version I sent before is better
for now until we see real-world performance issues.
I was however missing a test for close_range(fd, fd, 0) anyway so I'll
need to send a v2 with this test added.

Christian

^ permalink raw reply

* Re: [PATCH v1 1/2] open: add close_range()
From: Christian Brauner @ 2019-05-23 14:14 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: viro, linux-kernel, linux-fsdevel, linux-api, torvalds, fweimer,
	jannh, tglx, arnd, shuah, dhowells, tkjos, ldv, miklos,
	linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k, linux-mips,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, linux-arch, linux-kselftest, x86
In-Reply-To: <20190523115118.pmscbd6kaqy37dym@brauner.io>

On Thu, May 23, 2019 at 01:51:18PM +0200, Christian Brauner wrote:
> On Wed, May 22, 2019 at 06:57:37PM +0200, Oleg Nesterov wrote:
> > On 05/22, Christian Brauner wrote:
> > >
> > > +static struct file *pick_file(struct files_struct *files, unsigned fd)
> > >  {
> > > -	struct file *file;
> > > +	struct file *file = NULL;
> > >  	struct fdtable *fdt;
> > >  
> > >  	spin_lock(&files->file_lock);
> > > @@ -632,15 +629,65 @@ int __close_fd(struct files_struct *files, unsigned fd)
> > >  		goto out_unlock;
> > >  	rcu_assign_pointer(fdt->fd[fd], NULL);
> > >  	__put_unused_fd(files, fd);
> > > -	spin_unlock(&files->file_lock);
> > > -	return filp_close(file, files);
> > >  
> > >  out_unlock:
> > >  	spin_unlock(&files->file_lock);
> > > -	return -EBADF;
> > > +	return file;
> > 
> > ...
> > 
> > > +int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
> > > +{
> > > +	unsigned int cur_max;
> > > +
> > > +	if (fd > max_fd)
> > > +		return -EINVAL;
> > > +
> > > +	rcu_read_lock();
> > > +	cur_max = files_fdtable(files)->max_fds;
> > > +	rcu_read_unlock();
> > > +
> > > +	/* cap to last valid index into fdtable */
> > > +	if (max_fd >= cur_max)
> > > +		max_fd = cur_max - 1;
> > > +
> > > +	while (fd <= max_fd) {
> > > +		struct file *file;
> > > +
> > > +		file = pick_file(files, fd++);
> > 
> > Well, how about something like
> > 
> > 	static unsigned int find_next_opened_fd(struct fdtable *fdt, unsigned start)
> > 	{
> > 		unsigned int maxfd = fdt->max_fds;
> > 		unsigned int maxbit = maxfd / BITS_PER_LONG;
> > 		unsigned int bitbit = start / BITS_PER_LONG;
> > 
> > 		bitbit = find_next_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
> > 		if (bitbit > maxfd)
> > 			return maxfd;
> > 		if (bitbit > start)
> > 			start = bitbit;
> > 		return find_next_bit(fdt->open_fds, maxfd, start);
> > 	}
> 
> > 
> > 	unsigned close_next_fd(struct files_struct *files, unsigned start, unsigned maxfd)
> > 	{
> > 		unsigned fd;
> > 		struct file *file;
> > 		struct fdtable *fdt;
> > 	
> > 		spin_lock(&files->file_lock);
> > 		fdt = files_fdtable(files);
> > 		fd = find_next_opened_fd(fdt, start);
> > 		if (fd >= fdt->max_fds || fd > maxfd) {
> > 			fd = -1;
> > 			goto out;
> > 		}
> > 
> > 		file = fdt->fd[fd];
> > 		rcu_assign_pointer(fdt->fd[fd], NULL);
> > 		__put_unused_fd(files, fd);
> > 	out:
> > 		spin_unlock(&files->file_lock);
> > 
> > 		if (fd == -1u)
> > 			return fd;
> > 
> > 		filp_close(file, files);
> > 		return fd + 1;
> > 	}
> 
> Thanks, Oleg!
> 
> I kept it dumb and was about to reply that your solution introduces more
> code when it seemed we wanted to keep this very simple for now.
> But then I saw that find_next_opened_fd() already exists as
> find_next_fd(). So it's actually not bad compared to what I sent in v1.
> So - with some small tweaks (need to test it and all now) - how do we
> feel about?:

That's obviously not correct atm but I'll send out a tweaked version in
a bit.

Christian

^ permalink raw reply

* [USER][PATCH] cpio: add option to add file metadata in copy-out mode
From: Roberto Sassu @ 2019-05-23 13:38 UTC (permalink / raw)
  To: viro
  Cc: linux-security-module, linux-integrity, initramfs, linux-api,
	linux-fsdevel, linux-kernel, bug-cpio, zohar, silviu.vlasceanu,
	dmitry.kasatkin, takondra, kamensky, hpa, arnd, rob,
	james.w.mcmechan, niveditas98, Roberto Sassu

This patch adds the -e <type> option to include file metadata in the image.
At the moment, only the xattr type is supported.

If the xattr type is selected, the patch includes an additional file for
each file passed to stdin, with special name 'METADATA!!!'. The additional
file might contain multiple metadata records. The format of each record is:

<metadata len (ASCII, 8 chars)><version><type><metadata>

The format of metadata for the xattr type is:

<xattr name>\0<xattr value>

Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
---
 doc/cpio.texi   |   3 ++
 src/copyout.c   | 136 ++++++++++++++++++++++++++++++++++++++++++++++--
 src/dstring.c   |  26 +++++++--
 src/dstring.h   |   1 +
 src/extern.h    |   2 +
 src/global.c    |   2 +
 src/initramfs.h |  21 ++++++++
 src/main.c      |  22 ++++++++
 8 files changed, 206 insertions(+), 7 deletions(-)
 create mode 100644 src/initramfs.h

diff --git a/doc/cpio.texi b/doc/cpio.texi
index e667b48..d7b311f 100644
--- a/doc/cpio.texi
+++ b/doc/cpio.texi
@@ -275,6 +275,9 @@ Set the I/O block size to the given @var{number} of bytes.
 @item -D @var{dir}
 @itemx --directory=@var{dir}
 Change to directory @var{dir}
+@item -e @var{type}
+@itemx --file-metadata=@var{type}
+Include in the image file metadata with the specified type.
 @item --force-local
 Treat the archive file as local, even if its name contains colons.
 @item -F [[@var{user}@@]@var{host}:]@var{archive-file}
diff --git a/src/copyout.c b/src/copyout.c
index 7532dac..f0e512a 100644
--- a/src/copyout.c
+++ b/src/copyout.c
@@ -22,6 +22,7 @@
 #include <stdio.h>
 #include <sys/types.h>
 #include <sys/stat.h>
+#include <sys/xattr.h>
 #include "filetypes.h"
 #include "cpiohdr.h"
 #include "dstring.h"
@@ -578,6 +579,92 @@ assign_string (char **pvar, char *value)
   *pvar = p;
 }
 
+static int
+write_xattrs (int metadata_fd, char *path)
+{
+  struct metadata_hdr hdr = { .c_version = 1, .c_type = TYPE_XATTR };
+  char str[sizeof(hdr.c_size) + 1];
+  char *xattr_list, *list_ptr, *xattr_value;
+  ssize_t list_len, name_len, value_len, len;
+  int ret = -EINVAL;
+
+  if (metadata_fd < 0)
+    return 0;
+
+  list_len = llistxattr(path, NULL, 0);
+  if (list_len <= 0)
+    return -ENOENT;
+
+  list_ptr = xattr_list = malloc(list_len);
+  if (!list_ptr) {
+    error (0, 0, _("out of memory"));
+    return ret;
+  }
+
+  len = llistxattr(path, xattr_list, list_len);
+  if (len != list_len)
+    goto out;
+
+  if (ftruncate(metadata_fd, 0))
+    goto out;
+
+  lseek(metadata_fd, 0, SEEK_SET);
+
+  while (list_ptr < xattr_list + list_len) {
+    name_len = strlen(list_ptr);
+
+    value_len = lgetxattr(path, list_ptr, NULL, 0);
+    if (value_len < 0) {
+      error (0, 0, _("cannot get xattrs"));
+      break;
+    }
+
+    if (value_len) {
+      xattr_value = malloc(value_len);
+      if (!xattr_value) {
+	error (0, 0, _("out of memory"));
+	break;
+      }
+    } else {
+      xattr_value = NULL;
+    }
+
+    len = lgetxattr(path, list_ptr, xattr_value, value_len);
+    if (len != value_len)
+      break;
+
+    snprintf(str, sizeof(str), "%.8lx",
+	     sizeof(hdr) + name_len + 1 + value_len);
+
+    memcpy(hdr.c_size, str, sizeof(hdr.c_size));
+
+    if (write(metadata_fd, &hdr, sizeof(hdr)) != sizeof(hdr))
+      break;
+
+    if (write(metadata_fd, list_ptr, name_len + 1) != name_len + 1)
+      break;
+
+    if (write(metadata_fd, xattr_value, value_len) != value_len)
+      break;
+
+    if (fsync(metadata_fd))
+      break;
+
+    list_ptr += name_len + 1;
+    free(xattr_value);
+    xattr_value = NULL;
+  }
+
+  free(xattr_value);
+out:
+  free(xattr_list);
+
+  if (list_ptr != xattr_list + list_len)
+    return ret;
+
+  return 0;
+}
+
 /* Read a list of file names from the standard input
    and write a cpio collection on the standard output.
    The format of the header depends on the compatibility (-c) flag.  */
@@ -591,6 +678,8 @@ process_copy_out ()
   int in_file_des;		/* Source file descriptor.  */
   int out_file_des;		/* Output file descriptor.  */
   char *orig_file_name = NULL;
+  char template[] = "/tmp/cpio-metadata-XXXXXX";
+  int ret, metadata_fd, metadata = 0, old_metadata;
 
   /* Initialize the copy out.  */
   ds_init (&input_name, 128);
@@ -623,9 +712,36 @@ process_copy_out ()
       prepare_append (out_file_des);
     }
 
+  /* Create a temporary file to store file metadata */
+  if (metadata_type != TYPE_NONE) {
+    metadata_fd = mkstemp(template);
+    if (metadata_fd < 0) {
+      error (0, 0, _("cannot create temporary file"));
+      return;
+    }
+  }
+
   /* Copy files with names read from stdin.  */
-  while (ds_fgetstr (stdin, &input_name, name_end) != NULL)
+  while ((metadata_type != TYPE_NONE && metadata) ||
+	 ds_fgetstr (stdin, &input_name, name_end) != NULL)
     {
+      old_metadata = metadata;
+
+      if (metadata) {
+	metadata = 0;
+
+        if (metadata_type != TYPE_XATTR) {
+	  error (0, 0, _("metadata type not supported"));
+	  continue;
+	}
+
+	ret = write_xattrs(metadata_fd, orig_file_name);
+	if (ret < 0)
+	  continue;
+
+	ds_sgetstr (template, &input_name, name_end);
+      }
+
       /* Check for blank line.  */
       if (input_name.ds_string[0] == 0)
 	{
@@ -655,8 +771,15 @@ process_copy_out ()
 		    }
 		}
 	    }
-	  
-	  assign_string (&orig_file_name, input_name.ds_string);
+
+	  if (old_metadata) {
+	    assign_string (&orig_file_name, template);
+	    ds_sgetstr (METADATA_FILENAME, &input_name, name_end);
+	    file_hdr.c_mode |= 0x10000;
+	  } else {
+	    assign_string (&orig_file_name, input_name.ds_string);
+	  }
+
 	  cpio_safer_name_suffix (input_name.ds_string, false,
 				  !no_abs_paths_flag, true);
 #ifndef HPUX_CDF
@@ -844,6 +967,8 @@ process_copy_out ()
 	    fprintf (stderr, "%s\n", orig_file_name);
 	  if (dot_flag)
 	    fputc ('.', stderr);
+	  if (metadata_type != TYPE_NONE && !old_metadata)
+	    metadata = 1;
 	}
     }
 
@@ -882,6 +1007,11 @@ process_copy_out ()
 	       ngettext ("%lu block\n", "%lu blocks\n",
 			 (unsigned long) blocks), (unsigned long) blocks);
     }
+
+  if (metadata_type != TYPE_NONE) {
+    close(metadata_fd);
+    unlink(template);
+  }
 }
 
 
diff --git a/src/dstring.c b/src/dstring.c
index ddad4c8..fe3cfaf 100644
--- a/src/dstring.c
+++ b/src/dstring.c
@@ -60,8 +60,8 @@ ds_resize (dynamic_string *string, int size)
    Return NULL if end of file is detected.  Otherwise,
    Return a pointer to the null-terminated string in S.  */
 
-char *
-ds_fgetstr (FILE *f, dynamic_string *s, char eos)
+static char *
+ds_fgetstr_common (FILE *f, char *input_string, dynamic_string *s, char eos)
 {
   int insize;			/* Amount needed for line.  */
   int strsize;			/* Amount allocated for S.  */
@@ -72,7 +72,10 @@ ds_fgetstr (FILE *f, dynamic_string *s, char eos)
   strsize = s->ds_length;
 
   /* Read the input string.  */
-  next_ch = getc (f);
+  if (input_string)
+    next_ch = *input_string++;
+  else
+    next_ch = getc (f);
   while (next_ch != eos && next_ch != EOF)
     {
       if (insize >= strsize - 1)
@@ -81,7 +84,10 @@ ds_fgetstr (FILE *f, dynamic_string *s, char eos)
 	  strsize = s->ds_length;
 	}
       s->ds_string[insize++] = next_ch;
-      next_ch = getc (f);
+      if (input_string)
+	next_ch = *input_string++;
+      else
+	next_ch = getc (f);
     }
   s->ds_string[insize++] = '\0';
 
@@ -91,6 +97,12 @@ ds_fgetstr (FILE *f, dynamic_string *s, char eos)
     return s->ds_string;
 }
 
+char *
+ds_fgetstr (FILE *f, dynamic_string *s, char eos)
+{
+  return ds_fgetstr_common (f, NULL, s, eos);
+}
+
 char *
 ds_fgets (FILE *f, dynamic_string *s)
 {
@@ -102,3 +114,9 @@ ds_fgetname (FILE *f, dynamic_string *s)
 {
   return ds_fgetstr (f, s, '\0');
 }
+
+char *
+ds_sgetstr (char *input_string, dynamic_string *s, char eos)
+{
+  return ds_fgetstr_common (NULL, input_string, s, eos);
+}
diff --git a/src/dstring.h b/src/dstring.h
index b5135fe..f5f95ec 100644
--- a/src/dstring.h
+++ b/src/dstring.h
@@ -49,3 +49,4 @@ void ds_resize (dynamic_string *string, int size);
 char *ds_fgetname (FILE *f, dynamic_string *s);
 char *ds_fgets (FILE *f, dynamic_string *s);
 char *ds_fgetstr (FILE *f, dynamic_string *s, char eos);
+char *ds_sgetstr (char *input_string, dynamic_string *s, char eos);
diff --git a/src/extern.h b/src/extern.h
index 6fa2089..4c34404 100644
--- a/src/extern.h
+++ b/src/extern.h
@@ -19,6 +19,7 @@
 
 #include "paxlib.h"
 #include "quotearg.h"
+#include "initramfs.h"
 #include "quote.h"
 
 enum archive_format
@@ -99,6 +100,7 @@ extern char output_is_seekable;
 extern int (*xstat) ();
 extern void (*copy_function) ();
 extern char *change_directory_option;
+extern enum metadata_types metadata_type;
 \f
 
 /* copyin.c */
diff --git a/src/global.c b/src/global.c
index fb3abe9..0c40be0 100644
--- a/src/global.c
+++ b/src/global.c
@@ -199,3 +199,5 @@ char *change_directory_option;
 int renumber_inodes_option;
 int ignore_devno_option;
 
+/* include file metadata into the image */
+enum metadata_types metadata_type = TYPE_NONE;
diff --git a/src/initramfs.h b/src/initramfs.h
new file mode 100644
index 0000000..d13fc39
--- /dev/null
+++ b/src/initramfs.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0+ */
+/*
+ * include/linux/initramfs.h
+ *
+ * Include file for file metadata in the initial ram disk.
+ */
+#ifndef _LINUX_INITRAMFS_H
+#define _LINUX_INITRAMFS_H
+
+#define METADATA_FILENAME "METADATA!!!"
+
+enum metadata_types { TYPE_NONE, TYPE_XATTR, TYPE__LAST };
+
+struct metadata_hdr {
+	char c_size[8];     /* total size including c_size field */
+	char c_version;     /* header version */
+	char c_type;        /* metadata type */
+	char c_metadata[];  /* metadata */
+} __attribute__((packed));
+
+#endif /*_LINUX_INITRAMFS_H*/
diff --git a/src/main.c b/src/main.c
index c68aba9..af1fa52 100644
--- a/src/main.c
+++ b/src/main.c
@@ -200,6 +200,8 @@ static struct argp_option options[] = {
   {"device-independent", DEVICE_INDEPENDENT_OPTION, NULL, 0,
    N_("Create device-independent (reproducible) archives") },
   {"reproducible", 0, NULL, OPTION_ALIAS },
+  {"file-metadata", 'e', N_("TYPE"), 0,
+   N_("Include file metadata"), GRID+1 },
 #undef GRID
   
   /* ********** */
@@ -293,6 +295,22 @@ warn_control (char *arg)
   return 1;
 }
 
+static enum metadata_types
+parse_metadata_type(char *arg)
+{
+  static char *metadata_type_str[TYPE__LAST] = {
+    [TYPE_NONE] = "none",
+    [TYPE_XATTR] = "xattr",
+  };
+  int i;
+
+  for (i = 0; i < TYPE__LAST; i++)
+    if (!strcmp (metadata_type_str[i], arg))
+      return i;
+
+  return TYPE_NONE;
+}
+
 static error_t
 parse_opt (int key, char *arg, struct argp_state *state)
 {
@@ -354,6 +372,10 @@ parse_opt (int key, char *arg, struct argp_state *state)
       copy_matching_files = false;
       break;
 
+    case 'e':		/* Metadata type.  */
+      metadata_type = parse_metadata_type(arg);
+      break;
+
     case 'E':		/* Pattern file name.  */
       pattern_file_name = arg;
       break;
-- 
2.17.1

^ permalink raw reply related

* [PATCH v4 3/3] gen_init_cpio: add support for file metadata
From: Roberto Sassu @ 2019-05-23 12:18 UTC (permalink / raw)
  To: viro
  Cc: linux-security-module, linux-integrity, initramfs, linux-api,
	linux-fsdevel, linux-kernel, bug-cpio, zohar, silviu.vlasceanu,
	dmitry.kasatkin, takondra, kamensky, hpa, arnd, rob,
	james.w.mcmechan, niveditas98, Roberto Sassu
In-Reply-To: <20190523121803.21638-1-roberto.sassu@huawei.com>

This patch adds support for file metadata (only TYPE_XATTR metadata type).
gen_init_cpio has been modified to read xattrs from files that will be
added to the image and to include file metadata as separate files with the
special name 'METADATA!!!'.

This behavior can be selected by setting the desired file metadata type as
value for CONFIG_INITRAMFS_FILE_METADATA.

Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
---
 usr/Kconfig               |   8 +++
 usr/Makefile              |   4 +-
 usr/gen_init_cpio.c       | 137 ++++++++++++++++++++++++++++++++++++--
 usr/gen_initramfs_list.sh |  10 ++-
 4 files changed, 150 insertions(+), 9 deletions(-)

diff --git a/usr/Kconfig b/usr/Kconfig
index 43658b8a975e..8d9f54a16440 100644
--- a/usr/Kconfig
+++ b/usr/Kconfig
@@ -233,3 +233,11 @@ config INITRAMFS_COMPRESSION
 	default ".lzma" if RD_LZMA
 	default ".bz2"  if RD_BZIP2
 	default ""
+
+config INITRAMFS_FILE_METADATA
+	string "File metadata type"
+	default ""
+	help
+	  Specify xattr to include xattrs in the image.
+
+	  If you are not sure, leave it blank.
diff --git a/usr/Makefile b/usr/Makefile
index 4a70ae43c9cb..7d5eb3c7b713 100644
--- a/usr/Makefile
+++ b/usr/Makefile
@@ -29,7 +29,9 @@ ramfs-input := $(if $(filter-out "",$(CONFIG_INITRAMFS_SOURCE)), \
 			$(shell echo $(CONFIG_INITRAMFS_SOURCE)),-d)
 ramfs-args  := \
         $(if $(CONFIG_INITRAMFS_ROOT_UID), -u $(CONFIG_INITRAMFS_ROOT_UID)) \
-        $(if $(CONFIG_INITRAMFS_ROOT_GID), -g $(CONFIG_INITRAMFS_ROOT_GID))
+        $(if $(CONFIG_INITRAMFS_ROOT_GID), -g $(CONFIG_INITRAMFS_ROOT_GID)) \
+        $(if $(filter-out "",$(CONFIG_INITRAMFS_FILE_METADATA)), \
+         -e $(CONFIG_INITRAMFS_FILE_METADATA))
 
 # $(datafile_d_y) is used to identify all files included
 # in initramfs and to detect if any files are added/removed.
diff --git a/usr/gen_init_cpio.c b/usr/gen_init_cpio.c
index 03b21189d58b..e93cb1093e77 100644
--- a/usr/gen_init_cpio.c
+++ b/usr/gen_init_cpio.c
@@ -3,6 +3,7 @@
 #include <stdlib.h>
 #include <sys/types.h>
 #include <sys/stat.h>
+#include <sys/xattr.h>
 #include <string.h>
 #include <unistd.h>
 #include <time.h>
@@ -10,6 +11,7 @@
 #include <errno.h>
 #include <ctype.h>
 #include <limits.h>
+#include "../include/linux/initramfs.h"
 
 /*
  * Original work by Jeff Garzik
@@ -24,6 +26,115 @@
 static unsigned int offset;
 static unsigned int ino = 721;
 static time_t default_mtime;
+static char metadata_path[] = "/tmp/cpio-metadata-XXXXXX";
+static int metadata_fd = -1;
+
+static enum metadata_types parse_metadata_type(char *arg)
+{
+	static char *metadata_type_str[TYPE__LAST] = {
+		[TYPE_NONE] = "none",
+		[TYPE_XATTR] = "xattr",
+	};
+	int i;
+
+	for (i = 0; i < TYPE__LAST; i++)
+		if (!strcmp(metadata_type_str[i], arg))
+			return i;
+
+	return TYPE_NONE;
+}
+
+static int cpio_mkfile(const char *name, const char *location,
+		       unsigned int mode, uid_t uid, gid_t gid,
+		       unsigned int nlinks);
+
+static int write_xattrs(const char *path)
+{
+	struct metadata_hdr hdr = { .c_version = 1, .c_type = TYPE_XATTR };
+	char str[sizeof(hdr.c_size) + 1];
+	char *xattr_list, *list_ptr, *xattr_value;
+	ssize_t list_len, name_len, value_len, len;
+	int ret = -EINVAL;
+
+	if (metadata_fd < 0)
+		return 0;
+
+	if (path == metadata_path)
+		return 0;
+
+	list_len = listxattr(path, NULL, 0);
+	if (list_len <= 0)
+		return 0;
+
+	list_ptr = xattr_list = malloc(list_len);
+	if (!list_ptr) {
+		fprintf(stderr, "out of memory\n");
+		return ret;
+	}
+
+	len = listxattr(path, xattr_list, list_len);
+	if (len != list_len)
+		goto out;
+
+	if (ftruncate(metadata_fd, 0))
+		goto out;
+
+	lseek(metadata_fd, 0, SEEK_SET);
+
+	while (list_ptr < xattr_list + list_len) {
+		name_len = strlen(list_ptr);
+
+		value_len = getxattr(path, list_ptr, NULL, 0);
+		if (value_len < 0) {
+			fprintf(stderr, "cannot get xattrs\n");
+			break;
+		}
+
+		if (value_len) {
+			xattr_value = malloc(value_len);
+			if (!xattr_value) {
+				fprintf(stderr, "out of memory\n");
+				break;
+			}
+		} else {
+			xattr_value = NULL;
+		}
+
+		len = getxattr(path, list_ptr, xattr_value, value_len);
+		if (len != value_len)
+			break;
+
+		snprintf(str, sizeof(str), "%.8lx",
+			 sizeof(hdr) + name_len + 1 + value_len);
+
+		memcpy(hdr.c_size, str, sizeof(hdr.c_size));
+
+		if (write(metadata_fd, &hdr, sizeof(hdr)) != sizeof(hdr))
+			break;
+
+		if (write(metadata_fd, list_ptr, name_len + 1) != name_len + 1)
+			break;
+
+		if (write(metadata_fd, xattr_value, value_len) != value_len)
+			break;
+
+		if (fsync(metadata_fd))
+			break;
+
+		list_ptr += name_len + 1;
+		free(xattr_value);
+		xattr_value = NULL;
+	}
+
+	free(xattr_value);
+out:
+	free(xattr_list);
+
+	if (list_ptr != xattr_list + list_len)
+		return ret;
+
+	return cpio_mkfile(METADATA_FILENAME, metadata_path, S_IFREG, 0, 0, 1);
+}
 
 struct file_handler {
 	const char *type;
@@ -128,7 +239,7 @@ static int cpio_mkslink(const char *name, const char *target,
 	push_pad();
 	push_string(target);
 	push_pad();
-	return 0;
+	return write_xattrs(name);
 }
 
 static int cpio_mkslink_line(const char *line)
@@ -174,7 +285,7 @@ static int cpio_mkgeneric(const char *name, unsigned int mode,
 		0);			/* chksum */
 	push_hdr(s);
 	push_rest(name);
-	return 0;
+	return write_xattrs(name);
 }
 
 enum generic_types {
@@ -268,7 +379,7 @@ static int cpio_mknod(const char *name, unsigned int mode,
 		0);			/* chksum */
 	push_hdr(s);
 	push_rest(name);
-	return 0;
+	return write_xattrs(name);
 }
 
 static int cpio_mknod_line(const char *line)
@@ -372,8 +483,7 @@ static int cpio_mkfile(const char *name, const char *location,
 		name += namesize;
 	}
 	ino++;
-	rc = 0;
-	
+	rc = write_xattrs(location);
 error:
 	if (filebuf) free(filebuf);
 	if (file >= 0) close(file);
@@ -526,10 +636,11 @@ int main (int argc, char *argv[])
 	int ec = 0;
 	int line_nr = 0;
 	const char *filename;
+	enum metadata_types metadata_type = TYPE_NONE;
 
 	default_mtime = time(NULL);
 	while (1) {
-		int opt = getopt(argc, argv, "t:h");
+		int opt = getopt(argc, argv, "t:e:h");
 		char *invalid;
 
 		if (opt == -1)
@@ -544,6 +655,9 @@ int main (int argc, char *argv[])
 				exit(1);
 			}
 			break;
+		case 'e':
+			metadata_type = parse_metadata_type(optarg);
+			break;
 		case 'h':
 		case '?':
 			usage(argv[0]);
@@ -565,6 +679,14 @@ int main (int argc, char *argv[])
 		exit(1);
 	}
 
+	if (metadata_type != TYPE_NONE) {
+		metadata_fd = mkstemp(metadata_path);
+		if (metadata_fd < 0) {
+			fprintf(stderr, "cannot create temporary file\n");
+			exit(1);
+		}
+	}
+
 	while (fgets(line, LINE_SIZE, cpio_list)) {
 		int type_idx;
 		size_t slen = strlen(line);
@@ -620,5 +742,8 @@ int main (int argc, char *argv[])
 	if (ec == 0)
 		cpio_trailer();
 
+	if (metadata_type != TYPE_NONE)
+		close(metadata_fd);
+
 	exit(ec);
 }
diff --git a/usr/gen_initramfs_list.sh b/usr/gen_initramfs_list.sh
index 0aad760fcd8c..0907a4043da9 100755
--- a/usr/gen_initramfs_list.sh
+++ b/usr/gen_initramfs_list.sh
@@ -15,7 +15,7 @@ set -e
 usage() {
 cat << EOF
 Usage:
-$0 [-o <file>] [-u <uid>] [-g <gid>] {-d | <cpio_source>} ...
+$0 [-o <file>] [-u <uid>] [-g <gid>] {-d | <cpio_source>} [-e <type>] ...
 	-o <file>      Create compressed initramfs file named <file> using
 		       gen_init_cpio and compressor depending on the extension
 	-u <uid>       User ID to map to user ID 0 (root).
@@ -28,6 +28,7 @@ $0 [-o <file>] [-u <uid>] [-g <gid>] {-d | <cpio_source>} ...
 		       If <cpio_source> is a .cpio file it will be used
 		       as direct input to initramfs.
 	-d             Output the default cpio list.
+	-e <type>      File metadata type to include in the cpio archive.
 
 All options except -o and -l may be repeated and are interpreted
 sequentially and immediately.  -u and -g states are preserved across
@@ -283,6 +284,10 @@ while [ $# -gt 0 ]; do
 			default_list="$arg"
 			${dep_list}default_initramfs
 			;;
+		"-e")   # file metadata type
+			metadata_arg="-e $1"
+			shift
+			;;
 		"-h")
 			usage
 			exit 0
@@ -312,7 +317,8 @@ if [ ! -z ${output_file} ]; then
 			fi
 		fi
 		cpio_tfile="$(mktemp ${TMPDIR:-/tmp}/cpiofile.XXXXXX)"
-		usr/gen_init_cpio $timestamp ${cpio_list} > ${cpio_tfile}
+		usr/gen_init_cpio $metadata_arg $timestamp \
+			${cpio_list} > ${cpio_tfile}
 	else
 		cpio_tfile=${cpio_file}
 	fi
-- 
2.17.1

^ permalink raw reply related

* [PATCH v4 2/3] initramfs: read metadata from special file METADATA!!!
From: Roberto Sassu @ 2019-05-23 12:18 UTC (permalink / raw)
  To: viro
  Cc: linux-security-module, linux-integrity, initramfs, linux-api,
	linux-fsdevel, linux-kernel, bug-cpio, zohar, silviu.vlasceanu,
	dmitry.kasatkin, takondra, kamensky, hpa, arnd, rob,
	james.w.mcmechan, niveditas98, Roberto Sassu
In-Reply-To: <20190523121803.21638-1-roberto.sassu@huawei.com>

Instead of changing the CPIO format, metadata are parsed from regular files
with special name 'METADATA!!!'. This file immediately follows the file
metadata are added to.

This patch checks if the file being extracted has the special name and, if
yes, creates a buffer with the content of that file and calls
do_parse_metadata() to parse metadata from the buffer.

Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
Reported-by: kbuild test robot <lkp@intel.com>
---
 init/initramfs.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 48 insertions(+), 1 deletion(-)

diff --git a/init/initramfs.c b/init/initramfs.c
index 5de396a6aac0..862c03123de8 100644
--- a/init/initramfs.c
+++ b/init/initramfs.c
@@ -222,6 +222,7 @@ static void __init read_into(char *buf, unsigned size, enum state next)
 }
 
 static __initdata char *header_buf, *symlink_buf, *name_buf, *metadata_buf;
+static __initdata char *metadata_buf_ptr, *previous_name_buf;
 
 static int __init do_start(void)
 {
@@ -400,6 +401,7 @@ static int __init __maybe_unused do_parse_metadata(char *pathname)
 }
 
 static __initdata int wfd;
+static __initdata int metadata;
 
 static int __init do_name(void)
 {
@@ -408,6 +410,10 @@ static int __init do_name(void)
 	if (strcmp(collected, "TRAILER!!!") == 0) {
 		free_hash();
 		return 0;
+	} else if (strcmp(collected, METADATA_FILENAME) == 0) {
+		metadata = 1;
+	} else {
+		memcpy(previous_name_buf, collected, strlen(collected) + 1);
 	}
 	clean_path(collected, mode);
 	if (S_ISREG(mode)) {
@@ -444,11 +450,47 @@ static int __init do_name(void)
 	return 0;
 }
 
+static int __init do_process_metadata(char *buf, int len, bool last)
+{
+	int ret = 0;
+
+	if (!metadata_buf) {
+		metadata_buf_ptr = metadata_buf = kmalloc(body_len, GFP_KERNEL);
+		if (!metadata_buf_ptr) {
+			ret = -ENOMEM;
+			goto out;
+		}
+
+		metadata_len = body_len;
+	}
+
+	if (metadata_buf_ptr + len > metadata_buf + metadata_len) {
+		ret = -EINVAL;
+		goto out;
+	}
+
+	memcpy(metadata_buf_ptr, buf, len);
+	metadata_buf_ptr += len;
+
+	if (last)
+		do_parse_metadata(previous_name_buf);
+out:
+	if (ret < 0 || last) {
+		kfree(metadata_buf);
+		metadata_buf = NULL;
+		metadata = 0;
+	}
+
+	return ret;
+}
+
 static int __init do_copy(void)
 {
 	if (byte_count >= body_len) {
 		if (xwrite(wfd, victim, body_len) != body_len)
 			error("write error");
+		if (metadata)
+			do_process_metadata(victim, body_len, true);
 		ksys_close(wfd);
 		do_utime(vcollected, mtime);
 		kfree(vcollected);
@@ -458,6 +500,8 @@ static int __init do_copy(void)
 	} else {
 		if (xwrite(wfd, victim, byte_count) != byte_count)
 			error("write error");
+		if (metadata)
+			do_process_metadata(victim, byte_count, false);
 		body_len -= byte_count;
 		eat(byte_count);
 		return 1;
@@ -467,6 +511,7 @@ static int __init do_copy(void)
 static int __init do_symlink(void)
 {
 	collected[N_ALIGN(name_len) + body_len] = '\0';
+	memcpy(previous_name_buf, collected, strlen(collected) + 1);
 	clean_path(collected, 0);
 	ksys_symlink(collected + N_ALIGN(name_len), collected);
 	ksys_lchown(collected, uid, gid);
@@ -534,8 +579,9 @@ static char * __init unpack_to_rootfs(char *buf, unsigned long len)
 	header_buf = kmalloc(110, GFP_KERNEL);
 	symlink_buf = kmalloc(PATH_MAX + N_ALIGN(PATH_MAX) + 1, GFP_KERNEL);
 	name_buf = kmalloc(N_ALIGN(PATH_MAX), GFP_KERNEL);
+	previous_name_buf = kmalloc(N_ALIGN(PATH_MAX), GFP_KERNEL);
 
-	if (!header_buf || !symlink_buf || !name_buf)
+	if (!header_buf || !symlink_buf || !name_buf || !previous_name_buf)
 		panic("can't allocate buffers");
 
 	state = Start;
@@ -580,6 +626,7 @@ static char * __init unpack_to_rootfs(char *buf, unsigned long len)
 		len -= my_inptr;
 	}
 	dir_utime();
+	kfree(previous_name_buf);
 	kfree(name_buf);
 	kfree(symlink_buf);
 	kfree(header_buf);
-- 
2.17.1

^ permalink raw reply related

* [PATCH v4 1/3] initramfs: add file metadata
From: Roberto Sassu @ 2019-05-23 12:18 UTC (permalink / raw)
  To: viro
  Cc: linux-security-module, linux-integrity, initramfs, linux-api,
	linux-fsdevel, linux-kernel, bug-cpio, zohar, silviu.vlasceanu,
	dmitry.kasatkin, takondra, kamensky, hpa, arnd, rob,
	james.w.mcmechan, niveditas98, Roberto Sassu
In-Reply-To: <20190523121803.21638-1-roberto.sassu@huawei.com>

From: Mimi Zohar <zohar@linux.vnet.ibm.com>

This patch adds metadata to a file from a supplied buffer. The buffer might
contains multiple metadata records. The format of each record is:

<metadata len (ASCII, 8 chars)><version><type><metadata>

For now, only the TYPE_XATTR metadata type is supported. The specific
format of this metadata type is:

<xattr #N name>\0<xattr #N value>

[kamensky: fixed restoring of xattrs for symbolic links by using
           sys_lsetxattr() instead of sys_setxattr()]

[sassu: removed state management, kept only do_setxattrs(), added support
        for generic file metadata, replaced sys_lsetxattr() with
        vfs_setxattr(), added check for entry_size, added check for
        hdr->c_size, replaced strlen() with strnlen(); moved do_setxattrs()
        before do_name()]

Signed-off-by: Mimi Zohar <zohar@linux.vnet.ibm.com>
Signed-off-by: Victor Kamensky <kamensky@cisco.com>
Signed-off-by: Taras Kondratiuk <takondra@cisco.com>
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
---
 include/linux/initramfs.h | 21 ++++++++++
 init/initramfs.c          | 88 ++++++++++++++++++++++++++++++++++++++-
 2 files changed, 107 insertions(+), 2 deletions(-)
 create mode 100644 include/linux/initramfs.h

diff --git a/include/linux/initramfs.h b/include/linux/initramfs.h
new file mode 100644
index 000000000000..2f8cee441236
--- /dev/null
+++ b/include/linux/initramfs.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * include/linux/initramfs.h
+ *
+ * Include file for file metadata in the initial ram disk.
+ */
+#ifndef _LINUX_INITRAMFS_H
+#define _LINUX_INITRAMFS_H
+
+#define METADATA_FILENAME "METADATA!!!"
+
+enum metadata_types { TYPE_NONE, TYPE_XATTR, TYPE__LAST };
+
+struct metadata_hdr {
+	char c_size[8];     /* total size including c_size field */
+	char c_version;     /* header version */
+	char c_type;        /* metadata type */
+	char c_metadata[];  /* metadata */
+} __packed;
+
+#endif /*LINUX_INITRAMFS_H*/
diff --git a/init/initramfs.c b/init/initramfs.c
index 178130fd61c2..5de396a6aac0 100644
--- a/init/initramfs.c
+++ b/init/initramfs.c
@@ -10,6 +10,9 @@
 #include <linux/syscalls.h>
 #include <linux/utime.h>
 #include <linux/file.h>
+#include <linux/namei.h>
+#include <linux/xattr.h>
+#include <linux/initramfs.h>
 
 static ssize_t __init xwrite(int fd, const char *p, size_t count)
 {
@@ -146,7 +149,7 @@ static __initdata time64_t mtime;
 
 static __initdata unsigned long ino, major, minor, nlink;
 static __initdata umode_t mode;
-static __initdata unsigned long body_len, name_len;
+static __initdata unsigned long body_len, name_len, metadata_len;
 static __initdata uid_t uid;
 static __initdata gid_t gid;
 static __initdata unsigned rdev;
@@ -218,7 +221,7 @@ static void __init read_into(char *buf, unsigned size, enum state next)
 	}
 }
 
-static __initdata char *header_buf, *symlink_buf, *name_buf;
+static __initdata char *header_buf, *symlink_buf, *name_buf, *metadata_buf;
 
 static int __init do_start(void)
 {
@@ -315,6 +318,87 @@ static int __init maybe_link(void)
 	return 0;
 }
 
+static int __init do_setxattrs(char *pathname, char *buf, size_t size)
+{
+	struct path path;
+	char *xattr_name, *xattr_value;
+	size_t xattr_name_size, xattr_value_size;
+	int ret;
+
+	xattr_name = buf;
+	xattr_name_size = strnlen(xattr_name, size);
+	if (xattr_name_size == size) {
+		error("malformed xattrs");
+		return -EINVAL;
+	}
+
+	xattr_value = xattr_name + xattr_name_size + 1;
+	xattr_value_size = buf + size - xattr_value;
+
+	ret = kern_path(pathname, 0, &path);
+	if (!ret) {
+		ret = vfs_setxattr(path.dentry, xattr_name, xattr_value,
+				   xattr_value_size, 0);
+
+		path_put(&path);
+	}
+
+	pr_debug("%s: %s size: %lu val: %s (ret: %d)\n", pathname,
+		 xattr_name, xattr_value_size, xattr_value, ret);
+
+	return ret;
+}
+
+static int __init __maybe_unused do_parse_metadata(char *pathname)
+{
+	char *buf = metadata_buf;
+	char *bufend = metadata_buf + metadata_len;
+	struct metadata_hdr *hdr;
+	char str[sizeof(hdr->c_size) + 1];
+	size_t entry_size;
+
+	if (!metadata_len)
+		return 0;
+
+	str[sizeof(hdr->c_size)] = 0;
+
+	while (buf < bufend) {
+		int ret;
+
+		if (buf + sizeof(*hdr) > bufend) {
+			error("malformed metadata");
+			break;
+		}
+
+		hdr = (struct metadata_hdr *)buf;
+		if (hdr->c_version != 1) {
+			pr_debug("Unsupported header version\n");
+			break;
+		}
+
+		memcpy(str, hdr->c_size, sizeof(hdr->c_size));
+		ret = kstrtoul(str, 16, &entry_size);
+		if (ret || buf + entry_size > bufend || !entry_size) {
+			error("malformed xattrs");
+			break;
+		}
+
+		switch (hdr->c_type) {
+		case TYPE_XATTR:
+			do_setxattrs(pathname, buf + sizeof(*hdr),
+				     entry_size - sizeof(*hdr));
+			break;
+		default:
+			pr_debug("Unsupported metadata type\n");
+			break;
+		}
+
+		buf += entry_size;
+	}
+
+	return 0;
+}
+
 static __initdata int wfd;
 
 static int __init do_name(void)
-- 
2.17.1

^ permalink raw reply related

* [PATCH v4 0/3] initramfs: add support for xattrs in the initial ram disk
From: Roberto Sassu @ 2019-05-23 12:18 UTC (permalink / raw)
  To: viro
  Cc: linux-security-module, linux-integrity, initramfs, linux-api,
	linux-fsdevel, linux-kernel, bug-cpio, zohar, silviu.vlasceanu,
	dmitry.kasatkin, takondra, kamensky, hpa, arnd, rob,
	james.w.mcmechan, niveditas98, Roberto Sassu

This patch set aims at solving the following use case: appraise files from
the initial ram disk. To do that, IMA checks the signature/hash from the
security.ima xattr. Unfortunately, this use case cannot be implemented
currently, as the CPIO format does not support xattrs.

This proposal consists in including file metadata as additional files named
METADATA!!!, for each file added to the ram disk. The CPIO parser in the
kernel recognizes these special files from the file name, and calls the
appropriate parser to add metadata to the previously extracted file. It has
been proposed to use bit 17:16 of the file mode as a way to recognize files
with metadata, but both the kernel and the cpio tool declare the file mode
as unsigned short.

The difference from v2, v3 (https://lkml.org/lkml/2019/5/9/230,
https://lkml.org/lkml/2019/5/17/466) is that file metadata are stored in
separate files instead of a single file. Given that files with metadata
must immediately follow the files metadata will be added to, image
generators have to be modified in this version.

The difference from v1 (https://lkml.org/lkml/2018/11/22/1182) is that
all files have the same name. The file metadata are added to is always the
previous one, and the image generator in user space will make sure that
files are in the correct sequence.

The difference with another proposal
(https://lore.kernel.org/patchwork/cover/888071/) is that xattrs can be
included in an image without changing the image format. Files with metadata
will appear as regular files. It will be task of the parser in the kernel
to process them.

This patch set extends the format of data defined in patch 9/15 of the last
proposal. It adds header version and type, so that new formats can be
defined and arbitrary metadata types can be processed.

The changes introduced by this patch set don't cause any compatibility
issue: kernels without the metadata parser simply extract the special files
and don't process metadata; kernels with the metadata parser don't process
metadata if the special files are not included in the image.

>From the kernel space perspective, backporting this functionality to older
kernels should be very easy. It is sufficient to add two calls to the new
function do_process_metadata() in do_copy(), and to check the file name in
do_name(). From the user space perspective, unlike the previous version of
the patch set, it is required to modify the image generators in order to
include metadata as separate files.

Changelog

v3:
- include file metadata as separate files named METADATA!!!
- add the possibility to include in the ram disk arbitrary metadata types

v2:
- replace ksys_lsetxattr() with kern_path() and vfs_setxattr()
  (suggested by Jann Horn)
- replace ksys_open()/ksys_read()/ksys_close() with
  filp_open()/kernel_read()/fput()
  (suggested by Jann Horn)
- use path variable instead of name_buf in do_readxattrs()
- set last byte of str to 0 in do_readxattrs()
- call do_readxattrs() in do_name() before replacing an existing
  .xattr-list
- pass pathname to do_setxattrs()

v1:
- move xattr unmarshaling to CPIO parser


Mimi Zohar (1):
  initramfs: add file metadata

Roberto Sassu (2):
  initramfs: read metadata from special file METADATA!!!
  gen_init_cpio: add support for file metadata

 include/linux/initramfs.h |  21 ++++++
 init/initramfs.c          | 137 +++++++++++++++++++++++++++++++++++++-
 usr/Kconfig               |   8 +++
 usr/Makefile              |   4 +-
 usr/gen_init_cpio.c       | 137 ++++++++++++++++++++++++++++++++++++--
 usr/gen_initramfs_list.sh |  10 ++-
 6 files changed, 305 insertions(+), 12 deletions(-)
 create mode 100644 include/linux/initramfs.h

-- 
2.17.1

^ permalink raw reply

* Re: [PATCH v1 1/2] open: add close_range()
From: Christian Brauner @ 2019-05-23 11:51 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: viro, linux-kernel, linux-fsdevel, linux-api, torvalds, fweimer,
	jannh, tglx, arnd, shuah, dhowells, tkjos, ldv, miklos,
	linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k, linux-mips,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, linux-arch, linux-kselftest, x86
In-Reply-To: <20190522165737.GC4915@redhat.com>

On Wed, May 22, 2019 at 06:57:37PM +0200, Oleg Nesterov wrote:
> On 05/22, Christian Brauner wrote:
> >
> > +static struct file *pick_file(struct files_struct *files, unsigned fd)
> >  {
> > -	struct file *file;
> > +	struct file *file = NULL;
> >  	struct fdtable *fdt;
> >  
> >  	spin_lock(&files->file_lock);
> > @@ -632,15 +629,65 @@ int __close_fd(struct files_struct *files, unsigned fd)
> >  		goto out_unlock;
> >  	rcu_assign_pointer(fdt->fd[fd], NULL);
> >  	__put_unused_fd(files, fd);
> > -	spin_unlock(&files->file_lock);
> > -	return filp_close(file, files);
> >  
> >  out_unlock:
> >  	spin_unlock(&files->file_lock);
> > -	return -EBADF;
> > +	return file;
> 
> ...
> 
> > +int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
> > +{
> > +	unsigned int cur_max;
> > +
> > +	if (fd > max_fd)
> > +		return -EINVAL;
> > +
> > +	rcu_read_lock();
> > +	cur_max = files_fdtable(files)->max_fds;
> > +	rcu_read_unlock();
> > +
> > +	/* cap to last valid index into fdtable */
> > +	if (max_fd >= cur_max)
> > +		max_fd = cur_max - 1;
> > +
> > +	while (fd <= max_fd) {
> > +		struct file *file;
> > +
> > +		file = pick_file(files, fd++);
> 
> Well, how about something like
> 
> 	static unsigned int find_next_opened_fd(struct fdtable *fdt, unsigned start)
> 	{
> 		unsigned int maxfd = fdt->max_fds;
> 		unsigned int maxbit = maxfd / BITS_PER_LONG;
> 		unsigned int bitbit = start / BITS_PER_LONG;
> 
> 		bitbit = find_next_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
> 		if (bitbit > maxfd)
> 			return maxfd;
> 		if (bitbit > start)
> 			start = bitbit;
> 		return find_next_bit(fdt->open_fds, maxfd, start);
> 	}

> 
> 	unsigned close_next_fd(struct files_struct *files, unsigned start, unsigned maxfd)
> 	{
> 		unsigned fd;
> 		struct file *file;
> 		struct fdtable *fdt;
> 	
> 		spin_lock(&files->file_lock);
> 		fdt = files_fdtable(files);
> 		fd = find_next_opened_fd(fdt, start);
> 		if (fd >= fdt->max_fds || fd > maxfd) {
> 			fd = -1;
> 			goto out;
> 		}
> 
> 		file = fdt->fd[fd];
> 		rcu_assign_pointer(fdt->fd[fd], NULL);
> 		__put_unused_fd(files, fd);
> 	out:
> 		spin_unlock(&files->file_lock);
> 
> 		if (fd == -1u)
> 			return fd;
> 
> 		filp_close(file, files);
> 		return fd + 1;
> 	}

Thanks, Oleg!

I kept it dumb and was about to reply that your solution introduces more
code when it seemed we wanted to keep this very simple for now.
But then I saw that find_next_opened_fd() already exists as
find_next_fd(). So it's actually not bad compared to what I sent in v1.
So - with some small tweaks (need to test it and all now) - how do we
feel about?:

/**
 * __close_next_open_fd() - Close the nearest open fd.
 *
 * @curfd: lowest file descriptor to consider
 * @maxfd: highest file descriptor to consider
 *
 * This function will close the nearest open fd, i.e. it will either
 * close @curfd if it is open or the closest open file descriptor
 * c greater than @curfd that
 * is smaller or equal to maxfd.
 * If the function found a file descriptor to close it will return 0 and
 * place the file descriptor it closed in @curfd. If it did not find a
 * file descriptor to close it will return -EBADF.
 */
static int __close_next_open_fd(struct files_struct *files, unsigned *curfd, unsigned maxfd)
{
        struct file *file = NULL;
	unsigned fd;
	struct fdtable *fdt;

	spin_lock(&files->file_lock);
	fdt = files_fdtable(files);
	fd = find_next_fd(fdt, *curfd);
	if (fd >= fdt->max_fds || fd > maxfd)
		goto out_unlock;

	file = fdt->fd[fd];
	rcu_assign_pointer(fdt->fd[fd], NULL);
	__put_unused_fd(files, fd);

out_unlock:
	spin_unlock(&files->file_lock);

	if (!file)
		return -EBADF;

	*curfd = fd;
	filp_close(file, files);
	return 0;
}

int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
{
	if (fd > max_fd)
		return -EINVAL;

	while (fd <= max_fd) {
		if (__close_next_fd(files, &fd, maxfd))
			break;

		cond_resched();
		fd++;
	}

	return 0;
}

SYSCALL_DEFINE3(close_range, unsigned int, fd, unsigned int, max_fd,
		unsigned int, flags)
{
	if (flags)
		return -EINVAL;

	return __close_range(current->files, fd, max_fd);
}

^ permalink raw reply

* Re: [PATCH RFC v8 01/10] namei: obey trailing magic-link DAC permissions
From: Aleksa Sarai @ 2019-05-23  2:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Shuah Khan, Shuah Khan, Christian Brauner,
	Eric Biederman, Andrew Morton, Alexei Starovoitov, Kees Cook,
	Jann Horn, Tycho Andersen, David Drysdale, Chanho Min,
	Oleg Nesterov, Aleksa Sarai, Linus Torvalds, Linux Containers
In-Reply-To: <CALCETrVCwe49q5mu=f6jTYNSgosQSjjY5chukMPo6eZtQGqo5g@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 4955 bytes --]

On 2019-05-22, Andy Lutomirski <luto@kernel.org> wrote:
> On Mon, May 20, 2019 at 6:34 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > One final exception is given, which is that non-O_PATH file descriptors
> > are given re-open rights equivalent to the permissions available at
> > open-time. This allows for O_RDONLY file descriptors to be re-opened
> > O_RDWR as long as the user had MAY_WRITE access at the time of opening
> > the O_RDONLY descriptor. This is necessary to avoid breaking userspace
> > (some of the kernel's own selftests depended on this "feature").
> 
> Can you clarify this exception a bit?  I'd like to make sure it's not
> such a huge exception that it invalidates the whole point of the
> patch.

Sure. This exception applies to regular file opens, and the idea is that
the user had permissions to open the file O_RDWR originally (even if
they opened it O_RDONLY) so re-opening it O_RDWR later should not be an
issue (they could've just opened it O_RDWR in the first place). These
permissions still get masked by nd->opath_mask, so opening a magic-link
or including an O_PATH doesn't increase the permission set.

This does mean that an O_RDONLY open (if the user could've done an
O_RDWR and still done the open) results in an FMODE_PATH_WRITE. To be
honest, I'm still on the fence whether this is a great idea or not (and
I'd prefer to not include it). Though, I don't think it invalidates the
patch though, since the attack scenario of a read-only file being
re-opened later as read-write is still blocked.

The main reason for including it is the concern that there is some
program from 1993 running in a basement somewhere that depends on this
that we don't know about. Though, as a counter-example, I have run this
patchset (without this exception) on my laptop for a few days without
any visible issues.

> If you open a file for execute, by actually exec()ing it or by using
> something like the proposed O_MAYEXEC, and you have inode_permission
> to write, do you still end up with FMODE_PATH_WRITE? The code looks
> like it does, and this seems like it might be a mistake.

I'm not sure about the execve(2) example -- after all, you don't get an
fd from execve(2) and /proc/self/exe still has a mode a+rx.

I'm also not sure what the semantics of a hypothetical O_MAYEXEC would
be -- but we'd probably want to discuss re-opening semantics when it
gets included. I would argue that since O_MAYEXEC would likely be merged
after this, no userspace code would depend on this mis-feature and we
could decide to not include FMODE_EXECv2 in the handling of additional
permissions.

As an aside, I did originally implement RESOLVE_UPGRADE_NOEXEC (and the
corresponding FMODE_PATH_EXEC handling). It worked for the most part,
though execveat(AT_EMPTY_PATH) would need some additional changes to do
the may_open_magiclink() checks and I decided against including it here
until we had an O_MAYEXEC.

> Is there any way for user code to read out these new file mode bits?

There is, but it's not exactly trivial. You could check the mode of
/proc/self/fd and then compare it to the acc_mode of the "flags"
/proc/self/fdinfo. The bits present in /proc/self/fd but not in acc_mode
are the FMODE_PATH_* bits.

However, this is quite an ugly way of doing it. I see two options to
make it easier:

 1. We can add additional information to fdinfo so it includes that
    FMODE_PATH_* bits to indicate how the fd can be upgraded.

 2. Previously, only the u bits of the fd mode were used to represent the
    open flags. We could add the FMODE_PATH_* permissions to the g bits
    and change how the permission check in trailing_symlink() operates.

    The really neat thing here is that we could then know for sure which
    fmode bits are set during name lookup of a magic-link rather than
    assuming they're all FMODE_PATH_* bits.

    In addition, userspace that depends on checking the u bits of an fd
    mode would continue to work (though I'm not aware of any userspace
    code that does depend on this).

Option 2 seems nicer to me in some respects, but it has the additional
cost of making the permission check less obvious (it's no longer an
"inode_permission" and is instead something different with a weird new
set of semantics). Then again, the modes of magic-links weren't obeyed
in the first place so I'd argue these semantics are entirely up for us
to decide.

> What are actual examples of uses for this exception?  Breaking
> selftests is not, in and of itself, a huge problem.

Not as far as I know. All of the re-opening users I know of do re-opens
of O_PATH or are re-opening with the same (or fewer) privileges. I also
ran this for a few days on my laptop without this exception, and didn't
have any visible issues.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v3 2/2] initramfs: introduce do_readxattrs()
From: Taras Kondratiuk @ 2019-05-22 20:21 UTC (permalink / raw)
  To: Arvind Sankar, Rob Landley, Roberto Sassu, hpa
  Cc: viro, linux-security-module, linux-integrity, initramfs,
	linux-api, linux-fsdevel, linux-kernel, zohar, silviu.vlasceanu,
	dmitry.kasatkin, kamensky, arnd, james.w.mcmechan, niveditas98
In-Reply-To: <3839583c-5466-6573-3048-0da7e6778c88@landley.net>

Quoting Rob Landley (2019-05-22 12:26:43)
> 
> 
> On 5/22/19 11:17 AM, hpa@zytor.com wrote:
> > On May 20, 2019 2:39:46 AM PDT, Roberto Sassu <roberto.sassu@huawei.com> wrote:
> >> On 5/18/2019 12:17 AM, Arvind Sankar wrote:
> >>> On Fri, May 17, 2019 at 02:47:31PM -0700, H. Peter Anvin wrote:
> >>>> On 5/17/19 2:02 PM, Arvind Sankar wrote:
> >>>>> On Fri, May 17, 2019 at 01:18:11PM -0700, hpa@zytor.com wrote:
> >>>>>>
> >>>>>> Ok... I just realized this does not work for a modular initramfs,
> >> composed at load time from multiple files, which is a very real
> >> problem. Should be easy enough to deal with: instead of one large file,
> >> use one companion file per source file, perhaps something like
> >> filename..xattrs (suggesting double dots to make it less likely to
> >> conflict with a "real" file.) No leading dot, as it makes it more
> >> likely that archivers will sort them before the file proper.
> >>>>> This version of the patch was changed from the previous one exactly
> >> to deal with this case --
> >>>>> it allows for the bootloader to load multiple initramfs archives,
> >> each
> >>>>> with its own .xattr-list file, and to have that work properly.
> >>>>> Could you elaborate on the issue that you see?
> >>>>>
> >>>>
> >>>> Well, for one thing, how do you define "cpio archive", each with its
> >> own
> >>>> .xattr-list file? Second, that would seem to depend on the ordering,
> >> no,
> >>>> in which case you depend critically on .xattr-list file following
> >> the
> >>>> files, which most archivers won't do.
> >>>>
> >>>> Either way it seems cleaner to have this per file; especially if/as
> >> it
> >>>> can be done without actually mucking up the format.
> >>>>
> >>>> I need to run, but I'll post a more detailed explanation of what I
> >> did
> >>>> in a little bit.
> >>>>
> >>>>    -hpa
> >>>>
> >>> Not sure what you mean by how do I define it? Each cpio archive will
> >>> contain its own .xattr-list file with signatures for the files within
> >>> it, that was the idea.
> >>>
> >>> You need to review the code more closely I think -- it does not
> >> depend
> >>> on the .xattr-list file following the files to which it applies.
> >>>
> >>> The code first extracts .xattr-list as though it was a regular file.
> >> If
> >>> a later dupe shows up (presumably from a second archive, although the
> >>> patch will actually allow a second one in the same archive), it will
> >>> then process the existing .xattr-list file and apply the attributes
> >>> listed within it. It then will proceed to read the second one and
> >>> overwrite the first one with it (this is the normal behaviour in the
> >>> kernel cpio parser). At the end once all the archives have been
> >>> extracted, if there is an .xattr-list file in the rootfs it will be
> >>> parsed (it would've been the last one encountered, which hasn't been
> >>> parsed yet, just extracted).
> >>>
> >>> Regarding the idea to use the high 16 bits of the mode field in
> >>> the header that's another possibility. It would just require
> >> additional
> >>> support in the program that actually creates the archive though,
> >> which
> >>> the current patch doesn't.
> >>
> >> Yes, for adding signatures for a subset of files, no changes to the ram
> >> disk generator are necessary. Everything is done by a custom module. To
> >> support a generic use case, it would be necessary to modify the
> >> generator to execute getfattr and the awk script after files have been
> >> placed in the temporary directory.
> >>
> >> If I understood the new proposal correctly, it would be task for cpio
> >> to
> >> read file metadata after the content and create a new record for each
> >> file with mode 0x18000, type of metadata encoded in the file name and
> >> metadata as file content. I don't know how easy it would be to modify
> >> cpio. Probably the amount of changes would be reasonable.
> 
> I could make toybox cpio do it in a weekend, and could probably throw a patch at
> usr/gen_init_cpio.c while I'm at it. I prototyped something like that a couple
> years ago, it's not hard.
> 
> The real question is scripts/gen_initramfs_list.sh and the text format it
> produces. We can currently generate cpio files with different ownership and
> permissions than the host system can represent (when not building as root, on a
> filesystem that may not support xattrs or would get unhappy about conflicting
> selinux annotations). We work around it by having the metadata represented
> textually in the initramfs_list file gen_initramfs_list.sh produces and
> gen_init_cpio.c consumes.
> 
> xattrs are a terrible idea the Macintosh invented so Finder could remember where
> you moved a file's icon in its folder without having to modify the file, and
> then things like OS/2 copied it and Windows picked it up from there and went "Of
> course, this is a security mechanism!" and... sigh.
> 
> This is "data that is not data", it's metadata of unbounded size. It seems like
> it should go in gen_initramfs_list.sh but as what, keyword=value pairs that
> might have embedded newlines in them? A base64 encoding? Something else?

I the previous try to add xattrs to cpio I've used hex encoding in
gen_initramfs_list.sh:
https://lkml.org/lkml/2018/1/24/851 - gen_init_cpio: set extended attributes for newcx format
https://lkml.org/lkml/2018/1/24/852 - gen_initramfs_list.sh: add -x option to enable newcx format

^ permalink raw reply

* Re: [PATCH v3 2/2] initramfs: introduce do_readxattrs()
From: Rob Landley @ 2019-05-22 19:26 UTC (permalink / raw)
  To: hpa, Roberto Sassu, Arvind Sankar
  Cc: viro, linux-security-module, linux-integrity, initramfs,
	linux-api, linux-fsdevel, linux-kernel, zohar, silviu.vlasceanu,
	dmitry.kasatkin, takondra, kamensky, arnd, james.w.mcmechan,
	niveditas98
In-Reply-To: <9C5B9F98-2067-43D3-B149-57613F38DCD4@zytor.com>



On 5/22/19 11:17 AM, hpa@zytor.com wrote:
> On May 20, 2019 2:39:46 AM PDT, Roberto Sassu <roberto.sassu@huawei.com> wrote:
>> On 5/18/2019 12:17 AM, Arvind Sankar wrote:
>>> On Fri, May 17, 2019 at 02:47:31PM -0700, H. Peter Anvin wrote:
>>>> On 5/17/19 2:02 PM, Arvind Sankar wrote:
>>>>> On Fri, May 17, 2019 at 01:18:11PM -0700, hpa@zytor.com wrote:
>>>>>>
>>>>>> Ok... I just realized this does not work for a modular initramfs,
>> composed at load time from multiple files, which is a very real
>> problem. Should be easy enough to deal with: instead of one large file,
>> use one companion file per source file, perhaps something like
>> filename..xattrs (suggesting double dots to make it less likely to
>> conflict with a "real" file.) No leading dot, as it makes it more
>> likely that archivers will sort them before the file proper.
>>>>> This version of the patch was changed from the previous one exactly
>> to deal with this case --
>>>>> it allows for the bootloader to load multiple initramfs archives,
>> each
>>>>> with its own .xattr-list file, and to have that work properly.
>>>>> Could you elaborate on the issue that you see?
>>>>>
>>>>
>>>> Well, for one thing, how do you define "cpio archive", each with its
>> own
>>>> .xattr-list file? Second, that would seem to depend on the ordering,
>> no,
>>>> in which case you depend critically on .xattr-list file following
>> the
>>>> files, which most archivers won't do.
>>>>
>>>> Either way it seems cleaner to have this per file; especially if/as
>> it
>>>> can be done without actually mucking up the format.
>>>>
>>>> I need to run, but I'll post a more detailed explanation of what I
>> did
>>>> in a little bit.
>>>>
>>>> 	-hpa
>>>>
>>> Not sure what you mean by how do I define it? Each cpio archive will
>>> contain its own .xattr-list file with signatures for the files within
>>> it, that was the idea.
>>>
>>> You need to review the code more closely I think -- it does not
>> depend
>>> on the .xattr-list file following the files to which it applies.
>>>
>>> The code first extracts .xattr-list as though it was a regular file.
>> If
>>> a later dupe shows up (presumably from a second archive, although the
>>> patch will actually allow a second one in the same archive), it will
>>> then process the existing .xattr-list file and apply the attributes
>>> listed within it. It then will proceed to read the second one and
>>> overwrite the first one with it (this is the normal behaviour in the
>>> kernel cpio parser). At the end once all the archives have been
>>> extracted, if there is an .xattr-list file in the rootfs it will be
>>> parsed (it would've been the last one encountered, which hasn't been
>>> parsed yet, just extracted).
>>>
>>> Regarding the idea to use the high 16 bits of the mode field in
>>> the header that's another possibility. It would just require
>> additional
>>> support in the program that actually creates the archive though,
>> which
>>> the current patch doesn't.
>>
>> Yes, for adding signatures for a subset of files, no changes to the ram
>> disk generator are necessary. Everything is done by a custom module. To
>> support a generic use case, it would be necessary to modify the
>> generator to execute getfattr and the awk script after files have been
>> placed in the temporary directory.
>>
>> If I understood the new proposal correctly, it would be task for cpio
>> to
>> read file metadata after the content and create a new record for each
>> file with mode 0x18000, type of metadata encoded in the file name and
>> metadata as file content. I don't know how easy it would be to modify
>> cpio. Probably the amount of changes would be reasonable.

I could make toybox cpio do it in a weekend, and could probably throw a patch at
usr/gen_init_cpio.c while I'm at it. I prototyped something like that a couple
years ago, it's not hard.

The real question is scripts/gen_initramfs_list.sh and the text format it
produces. We can currently generate cpio files with different ownership and
permissions than the host system can represent (when not building as root, on a
filesystem that may not support xattrs or would get unhappy about conflicting
selinux annotations). We work around it by having the metadata represented
textually in the initramfs_list file gen_initramfs_list.sh produces and
gen_init_cpio.c consumes.

xattrs are a terrible idea the Macintosh invented so Finder could remember where
you moved a file's icon in its folder without having to modify the file, and
then things like OS/2 copied it and Windows picked it up from there and went "Of
course, this is a security mechanism!" and... sigh.

This is "data that is not data", it's metadata of unbounded size. It seems like
it should go in gen_initramfs_list.sh but as what, keyword=value pairs that
might have embedded newlines in them? A base64 encoding? Something else?

>> The kernel will behave in a similar way. It will call do_readxattrs()
>> in
>> do_copy() for each file. Since the only difference between the current
>> and the new proposal would be two additional calls to do_readxattrs()
>> in
>> do_name() and unpack_to_rootfs(), maybe we could support both.
>>
>> Roberto
> 
> The nice thing with explicit metadata is that it doesn't have to contain the filename per se, and each file is self-contained. There is a reason why each cpio header starts with the magic number: each cpio record is formally independent and can be processed in isolation.  The TRAILER!!! thing is a huge wart in the format, although in practice TRAILER!!! always has a mode of 0 and so can be distinguished from an actual file.

Not adding the requirement that the cpio.gz must be generated as root from a
filesystem with the same users and selinux rules as the target system would be nice.

> The use of mode 0x18000 for metadata allows for optional backwards compatibility for extraction; for encoding this can be handled with very simple postprocessing.

The representation within the cpio file was never a huge deal to me. 0x18000
sounds fine for that.

> So my suggestion would be to have mode 0x18000 indicate extended file metadata, with the filename of the form:
> 
> optional_filename!XXXXX!
> 
> ... where XXXXX indicates the type of metadata (e.g. !XATTR!). The optional_filename prefix allows an unaware decoder to extract to a well-defined name; simple postprocessing would be able to either remove (for size) or add (for compatibility) this prefix. It would be an error for this prefix, if present, to not match the name of the previous file.

I'd suggest METADATA!!! to look like TRAILER!!!. (METADATA!!!XXXXX! if you
really think a keyword=value pair store is _not_ universal and we're going to
invent entire new _categories_ of this side channel nonsense.)

And extracting conflicting filenames is presumably already covered, it either
replaces or the new one fails to create the file and the extractor moves on.
(You need a working error recovery path that skips the right amount of data so
you can handle the next file properly, but you should have that anyway.)

Rob

^ permalink raw reply

* Re: [PATCH v4] net: Add UNIX_DIAG_UID to Netlink UNIX socket diagnostics.
From: David Miller @ 2019-05-22 17:36 UTC (permalink / raw)
  To: felipe; +Cc: viro, linux-kernel, netdev, linux-api
In-Reply-To: <20190521004351.23706-1-felipe@felipegasper.com>

From: Felipe Gasper <felipe@felipegasper.com>
Date: Mon, 20 May 2019 19:43:51 -0500

> This adds the ability for Netlink to report a socket's UID along with the
> other UNIX diagnostic information that is already available. This will
> allow diagnostic tools greater insight into which users control which
> socket.
> 
> To test this, do the following as a non-root user:
> 
>     unshare -U -r bash
>     nc -l -U user.socket.$$ &
> 
> .. and verify from within that same session that Netlink UNIX socket
> diagnostics report the socket's UID as 0. Also verify that Netlink UNIX
> socket diagnostics report the socket's UID as the user's UID from an
> unprivileged process in a different session. Verify the same from
> a root process.
> 
> Signed-off-by: Felipe Gasper <felipe@felipegasper.com>

Applied to net-next, thanks.

^ permalink raw reply

* Re: [PATCH v3 2/2] initramfs: introduce do_readxattrs()
From: Roberto Sassu @ 2019-05-22 17:22 UTC (permalink / raw)
  To: hpa, Arvind Sankar
  Cc: viro, linux-security-module, linux-integrity, initramfs,
	linux-api, linux-fsdevel, linux-kernel, zohar, silviu.vlasceanu,
	dmitry.kasatkin, takondra, kamensky, arnd, rob, james.w.mcmechan,
	niveditas98
In-Reply-To: <9C5B9F98-2067-43D3-B149-57613F38DCD4@zytor.com>

On 5/22/2019 6:17 PM, hpa@zytor.com wrote:
> On May 20, 2019 2:39:46 AM PDT, Roberto Sassu <roberto.sassu@huawei.com> wrote:
>> On 5/18/2019 12:17 AM, Arvind Sankar wrote:
>>> On Fri, May 17, 2019 at 02:47:31PM -0700, H. Peter Anvin wrote:
>>>> On 5/17/19 2:02 PM, Arvind Sankar wrote:
>>>>> On Fri, May 17, 2019 at 01:18:11PM -0700, hpa@zytor.com wrote:
>>>>>>
>>>>>> Ok... I just realized this does not work for a modular initramfs,
>> composed at load time from multiple files, which is a very real
>> problem. Should be easy enough to deal with: instead of one large file,
>> use one companion file per source file, perhaps something like
>> filename..xattrs (suggesting double dots to make it less likely to
>> conflict with a "real" file.) No leading dot, as it makes it more
>> likely that archivers will sort them before the file proper.
>>>>> This version of the patch was changed from the previous one exactly
>> to deal with this case --
>>>>> it allows for the bootloader to load multiple initramfs archives,
>> each
>>>>> with its own .xattr-list file, and to have that work properly.
>>>>> Could you elaborate on the issue that you see?
>>>>>
>>>>
>>>> Well, for one thing, how do you define "cpio archive", each with its
>> own
>>>> .xattr-list file? Second, that would seem to depend on the ordering,
>> no,
>>>> in which case you depend critically on .xattr-list file following
>> the
>>>> files, which most archivers won't do.
>>>>
>>>> Either way it seems cleaner to have this per file; especially if/as
>> it
>>>> can be done without actually mucking up the format.
>>>>
>>>> I need to run, but I'll post a more detailed explanation of what I
>> did
>>>> in a little bit.
>>>>
>>>> 	-hpa
>>>>
>>> Not sure what you mean by how do I define it? Each cpio archive will
>>> contain its own .xattr-list file with signatures for the files within
>>> it, that was the idea.
>>>
>>> You need to review the code more closely I think -- it does not
>> depend
>>> on the .xattr-list file following the files to which it applies.
>>>
>>> The code first extracts .xattr-list as though it was a regular file.
>> If
>>> a later dupe shows up (presumably from a second archive, although the
>>> patch will actually allow a second one in the same archive), it will
>>> then process the existing .xattr-list file and apply the attributes
>>> listed within it. It then will proceed to read the second one and
>>> overwrite the first one with it (this is the normal behaviour in the
>>> kernel cpio parser). At the end once all the archives have been
>>> extracted, if there is an .xattr-list file in the rootfs it will be
>>> parsed (it would've been the last one encountered, which hasn't been
>>> parsed yet, just extracted).
>>>
>>> Regarding the idea to use the high 16 bits of the mode field in
>>> the header that's another possibility. It would just require
>> additional
>>> support in the program that actually creates the archive though,
>> which
>>> the current patch doesn't.
>>
>> Yes, for adding signatures for a subset of files, no changes to the ram
>> disk generator are necessary. Everything is done by a custom module. To
>> support a generic use case, it would be necessary to modify the
>> generator to execute getfattr and the awk script after files have been
>> placed in the temporary directory.
>>
>> If I understood the new proposal correctly, it would be task for cpio
>> to
>> read file metadata after the content and create a new record for each
>> file with mode 0x18000, type of metadata encoded in the file name and
>> metadata as file content. I don't know how easy it would be to modify
>> cpio. Probably the amount of changes would be reasonable.
>>
>> The kernel will behave in a similar way. It will call do_readxattrs()
>> in
>> do_copy() for each file. Since the only difference between the current
>> and the new proposal would be two additional calls to do_readxattrs()
>> in
>> do_name() and unpack_to_rootfs(), maybe we could support both.
>>
>> Roberto
> 
> The nice thing with explicit metadata is that it doesn't have to contain the filename per se, and each file is self-contained. There is a reason why each cpio header starts with the magic number: each cpio record is formally independent and can be processed in isolation.  The TRAILER!!! thing is a huge wart in the format, although in practice TRAILER!!! always has a mode of 0 and so can be distinguished from an actual file.
> 
> The use of mode 0x18000 for metadata allows for optional backwards compatibility for extraction; for encoding this can be handled with very simple postprocessing.
> 
> So my suggestion would be to have mode 0x18000 indicate extended file metadata, with the filename of the form:
> 
> optional_filename!XXXXX!
> 
> ... where XXXXX indicates the type of metadata (e.g. !XATTR!). The optional_filename prefix allows an unaware decoder to extract to a well-defined name; simple postprocessing would be able to either remove (for size) or add (for compatibility) this prefix. It would be an error for this prefix, if present, to not match the name of the previous file.

Actually, I defined '..metadata..' as special name to indicate that the
file contains metadata. Then, the content of the file is a set of:

struct metadata_hdr {
         char c_size[8];     /* total size including c_size field */
         char c_version;     /* header version */
         char c_type;        /* metadata type */
         char c_metadata[];  /* metadata */
} __packed;

init/initramfs.c now has a specific parser for c_type. Currently, I
implemented a parser for xattrs, which expects data in the format:

<xattr #N name>\0<xattr #N value>

I checked if it is possible to use bit 17:16 to identify files with
metadata, but both the cpio and the kernel use unsigned short.

I already modified gen_init_cpio and cpio. I modify at run-time the list
of files to be included in the image by adding a temporary file, that
each time is set with the xattrs of the previously processed file.

The output of cpio -t looks like:

--
.
..metadata..
bin
..metadata..
dev
..metadata..
dev/console
..metadata..
--

Would it be ok? If you prefer that I add the format to the file name or
you/anyone has a comment about this proposal, please let me know so that
I make the changes before sending a new version of the patch set.

Thanks

Roberto

-- 
HUAWEI TECHNOLOGIES Duesseldorf GmbH, HRB 56063
Managing Director: Bo PENG, Jian LI, Yanli SHI

^ permalink raw reply

* Re: [PATCH RFC v8 01/10] namei: obey trailing magic-link DAC permissions
From: Andy Lutomirski @ 2019-05-22 17:01 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Al Viro, Jeff Layton, J. Bruce Fields, Arnd Bergmann,
	David Howells, Shuah Khan, Shuah Khan, Andy Lutomirski,
	Christian Brauner, Eric Biederman, Andrew Morton,
	Alexei Starovoitov, Kees Cook, Jann Horn, Tycho Andersen,
	David Drysdale, Chanho Min, Oleg Nesterov, Aleksa Sarai,
	Linus Torvalds, Linux Containers <contai>
In-Reply-To: <20190520133305.11925-2-cyphar@cyphar.com>

On Mon, May 20, 2019 at 6:34 AM Aleksa Sarai <cyphar@cyphar.com> wrote:
> One final exception is given, which is that non-O_PATH file descriptors
> are given re-open rights equivalent to the permissions available at
> open-time. This allows for O_RDONLY file descriptors to be re-opened
> O_RDWR as long as the user had MAY_WRITE access at the time of opening
> the O_RDONLY descriptor. This is necessary to avoid breaking userspace
> (some of the kernel's own selftests depended on this "feature").

Can you clarify this exception a bit?  I'd like to make sure it's not
such a huge exception that it invalidates the whole point of the
patch.  If you open a file for execute, by actually exec()ing it or by
using something like the proposed O_MAYEXEC, and you have
inode_permission to write, do you still end up with FMODE_PATH_WRITE?
The code looks like it does, and this seems like it might be a
mistake.  Is there any way for user code to read out these new file
mode bits?

What are actual examples of uses for this exception?  Breaking
selftests is not, in and of itself, a huge problem.

^ permalink raw reply

* Re: [PATCH v1 1/2] open: add close_range()
From: Oleg Nesterov @ 2019-05-22 16:57 UTC (permalink / raw)
  To: Christian Brauner
  Cc: viro, linux-kernel, linux-fsdevel, linux-api, torvalds, fweimer,
	jannh, tglx, arnd, shuah, dhowells, tkjos, ldv, miklos,
	linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k, linux-mips,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, linux-arch, linux-kselftest, x86
In-Reply-To: <20190522155259.11174-1-christian@brauner.io>

On 05/22, Christian Brauner wrote:
>
> +static struct file *pick_file(struct files_struct *files, unsigned fd)
>  {
> -	struct file *file;
> +	struct file *file = NULL;
>  	struct fdtable *fdt;
>  
>  	spin_lock(&files->file_lock);
> @@ -632,15 +629,65 @@ int __close_fd(struct files_struct *files, unsigned fd)
>  		goto out_unlock;
>  	rcu_assign_pointer(fdt->fd[fd], NULL);
>  	__put_unused_fd(files, fd);
> -	spin_unlock(&files->file_lock);
> -	return filp_close(file, files);
>  
>  out_unlock:
>  	spin_unlock(&files->file_lock);
> -	return -EBADF;
> +	return file;

...

> +int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
> +{
> +	unsigned int cur_max;
> +
> +	if (fd > max_fd)
> +		return -EINVAL;
> +
> +	rcu_read_lock();
> +	cur_max = files_fdtable(files)->max_fds;
> +	rcu_read_unlock();
> +
> +	/* cap to last valid index into fdtable */
> +	if (max_fd >= cur_max)
> +		max_fd = cur_max - 1;
> +
> +	while (fd <= max_fd) {
> +		struct file *file;
> +
> +		file = pick_file(files, fd++);

Well, how about something like

	static unsigned int find_next_opened_fd(struct fdtable *fdt, unsigned start)
	{
		unsigned int maxfd = fdt->max_fds;
		unsigned int maxbit = maxfd / BITS_PER_LONG;
		unsigned int bitbit = start / BITS_PER_LONG;

		bitbit = find_next_bit(fdt->full_fds_bits, maxbit, bitbit) * BITS_PER_LONG;
		if (bitbit > maxfd)
			return maxfd;
		if (bitbit > start)
			start = bitbit;
		return find_next_bit(fdt->open_fds, maxfd, start);
	}

	unsigned close_next_fd(struct files_struct *files, unsigned start, unsigned maxfd)
	{
		unsigned fd;
		struct file *file;
		struct fdtable *fdt;
	
		spin_lock(&files->file_lock);
		fdt = files_fdtable(files);
		fd = find_next_opened_fd(fdt, start);
		if (fd >= fdt->max_fds || fd > maxfd) {
			fd = -1;
			goto out;
		}

		file = fdt->fd[fd];
		rcu_assign_pointer(fdt->fd[fd], NULL);
		__put_unused_fd(files, fd);
	out:
		spin_unlock(&files->file_lock);

		if (fd == -1u)
			return fd;

		filp_close(file, files);
		return fd + 1;
	}

?

Then close_range() can do

	while (fd < max_fd)
		fd = close_next_fd(fd, maxfd);

Oleg.

^ permalink raw reply

* Re: [PATCH v3 2/2] initramfs: introduce do_readxattrs()
From: hpa @ 2019-05-22 16:18 UTC (permalink / raw)
  To: Rob Landley, Roberto Sassu, viro
  Cc: linux-security-module, linux-integrity, initramfs, linux-api,
	linux-fsdevel, linux-kernel, zohar, silviu.vlasceanu,
	dmitry.kasatkin, takondra, kamensky, arnd, james.w.mcmechan,
	niveditas98
In-Reply-To: <a0afd58f-c682-66b5-7478-c405a179d72a@landley.net>

On May 17, 2019 7:16:04 PM PDT, Rob Landley <rob@landley.net> wrote:
>On 5/17/19 4:41 PM, H. Peter Anvin wrote:
>> On 5/17/19 1:18 PM, hpa@zytor.com wrote:
>>>
>>> Ok... I just realized this does not work for a modular initramfs,
>composed at load time from multiple files, which is a very real
>problem. Should be easy enough to deal with: instead of one large file,
>use one companion file per source file, perhaps something like
>filename..xattrs (suggesting double dots to make it less likely to
>conflict with a "real" file.) No leading dot, as it makes it more
>likely that archivers will sort them before the file proper.
>>>
>>> A side benefit is that the format can be simpler as there is no need
>to encode the filename.
>>>
>>> A technically cleaner solution still, but which would need archiver
>modifications, would be to encode the xattrs as an optionally nameless
>file (just an empty string) with a new file mode value, immediately
>following the original file. The advantage there is that the archiver
>itself could support xattrs and other extended metadata (which has been
>requested elsewhere); the disadvantage obviously is that that it
>requires new support in the archiver. However, at least it ought to be
>simpler since it is still a higher protocol level than the cpio archive
>itself.
>>>
>>> There's already one special case in cpio, which is the
>"!!!TRAILER!!!" filename; although I don't think it is part of the
>formal spec, to the extent there is one, I would expect that in
>practice it is always encoded with a mode of 0, which incidentally
>could be used to unbreak the case where such a filename actually
>exists. So one way to support such extended metadata would be to set
>mode to 0 and use the filename to encode the type of metadata. I wonder
>how existing GNU or BSD cpio (the BSD one is better maintained these
>days) would deal with reading such a file; it would at least not be a
>regression if it just read it still, possibly with warnings. It could
>also be possible to use bits 17:16 in the mode, which are traditionally
>always zero (mode_t being 16 bits), but I believe are present in most
>or all of the cpio formats for historical reasons. It might be accepted
>better by existing implementations to use one of these high bits
>combined with S_IFREG, I dont know.
>>
>> 
>> Correction: it's just !!!TRAILER!!!.
>
>We documented it as "TRAILER!!!" without leading !!!, and that its
>purpose is to
>flush hardlinks:
>
>https://www.kernel.org/doc/Documentation/early-userspace/buffer-format.txt
>
>That's what toybox cpio has been producing. Kernel consumes it just
>fine. Just
>checked busybox cpio and that's what they're producing as well...
>
>Rob

Yes, TRAILER!!! is correct. Somehow I managed to get it wrong twice.
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply

* Re: [PATCH v3 2/2] initramfs: introduce do_readxattrs()
From: hpa @ 2019-05-22 16:17 UTC (permalink / raw)
  To: Roberto Sassu, Arvind Sankar
  Cc: viro, linux-security-module, linux-integrity, initramfs,
	linux-api, linux-fsdevel, linux-kernel, zohar, silviu.vlasceanu,
	dmitry.kasatkin, takondra, kamensky, arnd, rob, james.w.mcmechan,
	niveditas98
In-Reply-To: <7bdca169-7a01-8c55-40e4-a832e876a0e5@huawei.com>

On May 20, 2019 2:39:46 AM PDT, Roberto Sassu <roberto.sassu@huawei.com> wrote:
>On 5/18/2019 12:17 AM, Arvind Sankar wrote:
>> On Fri, May 17, 2019 at 02:47:31PM -0700, H. Peter Anvin wrote:
>>> On 5/17/19 2:02 PM, Arvind Sankar wrote:
>>>> On Fri, May 17, 2019 at 01:18:11PM -0700, hpa@zytor.com wrote:
>>>>>
>>>>> Ok... I just realized this does not work for a modular initramfs,
>composed at load time from multiple files, which is a very real
>problem. Should be easy enough to deal with: instead of one large file,
>use one companion file per source file, perhaps something like
>filename..xattrs (suggesting double dots to make it less likely to
>conflict with a "real" file.) No leading dot, as it makes it more
>likely that archivers will sort them before the file proper.
>>>> This version of the patch was changed from the previous one exactly
>to deal with this case --
>>>> it allows for the bootloader to load multiple initramfs archives,
>each
>>>> with its own .xattr-list file, and to have that work properly.
>>>> Could you elaborate on the issue that you see?
>>>>
>>>
>>> Well, for one thing, how do you define "cpio archive", each with its
>own
>>> .xattr-list file? Second, that would seem to depend on the ordering,
>no,
>>> in which case you depend critically on .xattr-list file following
>the
>>> files, which most archivers won't do.
>>>
>>> Either way it seems cleaner to have this per file; especially if/as
>it
>>> can be done without actually mucking up the format.
>>>
>>> I need to run, but I'll post a more detailed explanation of what I
>did
>>> in a little bit.
>>>
>>> 	-hpa
>>>
>> Not sure what you mean by how do I define it? Each cpio archive will
>> contain its own .xattr-list file with signatures for the files within
>> it, that was the idea.
>> 
>> You need to review the code more closely I think -- it does not
>depend
>> on the .xattr-list file following the files to which it applies.
>> 
>> The code first extracts .xattr-list as though it was a regular file.
>If
>> a later dupe shows up (presumably from a second archive, although the
>> patch will actually allow a second one in the same archive), it will
>> then process the existing .xattr-list file and apply the attributes
>> listed within it. It then will proceed to read the second one and
>> overwrite the first one with it (this is the normal behaviour in the
>> kernel cpio parser). At the end once all the archives have been
>> extracted, if there is an .xattr-list file in the rootfs it will be
>> parsed (it would've been the last one encountered, which hasn't been
>> parsed yet, just extracted).
>> 
>> Regarding the idea to use the high 16 bits of the mode field in
>> the header that's another possibility. It would just require
>additional
>> support in the program that actually creates the archive though,
>which
>> the current patch doesn't.
>
>Yes, for adding signatures for a subset of files, no changes to the ram
>disk generator are necessary. Everything is done by a custom module. To
>support a generic use case, it would be necessary to modify the
>generator to execute getfattr and the awk script after files have been
>placed in the temporary directory.
>
>If I understood the new proposal correctly, it would be task for cpio
>to
>read file metadata after the content and create a new record for each
>file with mode 0x18000, type of metadata encoded in the file name and
>metadata as file content. I don't know how easy it would be to modify
>cpio. Probably the amount of changes would be reasonable.
>
>The kernel will behave in a similar way. It will call do_readxattrs()
>in
>do_copy() for each file. Since the only difference between the current
>and the new proposal would be two additional calls to do_readxattrs()
>in
>do_name() and unpack_to_rootfs(), maybe we could support both.
>
>Roberto

The nice thing with explicit metadata is that it doesn't have to contain the filename per se, and each file is self-contained. There is a reason why each cpio header starts with the magic number: each cpio record is formally independent and can be processed in isolation.  The TRAILER!!! thing is a huge wart in the format, although in practice TRAILER!!! always has a mode of 0 and so can be distinguished from an actual file.

The use of mode 0x18000 for metadata allows for optional backwards compatibility for extraction; for encoding this can be handled with very simple postprocessing.

So my suggestion would be to have mode 0x18000 indicate extended file metadata, with the filename of the form:

optional_filename!XXXXX!

... where XXXXX indicates the type of metadata (e.g. !XATTR!). The optional_filename prefix allows an unaware decoder to extract to a well-defined name; simple postprocessing would be able to either remove (for size) or add (for compatibility) this prefix. It would be an error for this prefix, if present, to not match the name of the previous file.

I do agree that the delayed processing of an .xattr-list as you describe ought to work even with a modular initramfs.


-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.

^ permalink raw reply

* [PATCH v1 2/2] tests: add close_range() tests
From: Christian Brauner @ 2019-05-22 15:52 UTC (permalink / raw)
  To: viro, linux-kernel, linux-fsdevel, linux-api, torvalds, fweimer
  Cc: jannh, oleg, tglx, arnd, shuah, dhowells, tkjos, ldv, miklos,
	linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k, linux-mips,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, linux-arch, linux-kselftest, x86, Christian Brauner
In-Reply-To: <20190522155259.11174-1-christian@brauner.io>

This adds basic tests for the new close_range() syscall.
- test that no invalid flags can be passed
- test that a range of file descriptors is correctly closed
- test that a range of file descriptors is correctly closed if there there
  are already closed file descriptors in the range
- test that max_fd is correctly capped to the current fdtable maximum

Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dmitry V. Levin <ldv@altlinux.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
---
v1: unchanged
---
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/core/.gitignore       |   1 +
 tools/testing/selftests/core/Makefile         |   6 +
 .../testing/selftests/core/close_range_test.c | 128 ++++++++++++++++++
 4 files changed, 136 insertions(+)
 create mode 100644 tools/testing/selftests/core/.gitignore
 create mode 100644 tools/testing/selftests/core/Makefile
 create mode 100644 tools/testing/selftests/core/close_range_test.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 9781ca79794a..06e57fabbff9 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -4,6 +4,7 @@ TARGETS += bpf
 TARGETS += breakpoints
 TARGETS += capabilities
 TARGETS += cgroup
+TARGETS += core
 TARGETS += cpufreq
 TARGETS += cpu-hotplug
 TARGETS += drivers/dma-buf
diff --git a/tools/testing/selftests/core/.gitignore b/tools/testing/selftests/core/.gitignore
new file mode 100644
index 000000000000..6e6712ce5817
--- /dev/null
+++ b/tools/testing/selftests/core/.gitignore
@@ -0,0 +1 @@
+close_range_test
diff --git a/tools/testing/selftests/core/Makefile b/tools/testing/selftests/core/Makefile
new file mode 100644
index 000000000000..de3ae68aa345
--- /dev/null
+++ b/tools/testing/selftests/core/Makefile
@@ -0,0 +1,6 @@
+CFLAGS += -g -I../../../../usr/include/ -I../../../../include
+
+TEST_GEN_PROGS := close_range_test
+
+include ../lib.mk
+
diff --git a/tools/testing/selftests/core/close_range_test.c b/tools/testing/selftests/core/close_range_test.c
new file mode 100644
index 000000000000..ab10cd205ab9
--- /dev/null
+++ b/tools/testing/selftests/core/close_range_test.c
@@ -0,0 +1,128 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <linux/kernel.h>
+#include <limits.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <syscall.h>
+#include <unistd.h>
+
+#include "../kselftest.h"
+
+static inline int sys_close_range(unsigned int fd, unsigned int max_fd,
+				  unsigned int flags)
+{
+	return syscall(__NR_close_range, fd, max_fd, flags);
+}
+
+#ifndef ARRAY_SIZE
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
+#endif
+
+int main(int argc, char **argv)
+{
+	const char *test_name = "close_range";
+	int i, ret;
+	int open_fds[100];
+	int fd_max, fd_mid, fd_min;
+
+	ksft_set_plan(7);
+
+	for (i = 0; i < ARRAY_SIZE(open_fds); i++) {
+		int fd;
+
+		fd = open("/dev/null", O_RDONLY | O_CLOEXEC);
+		if (fd < 0) {
+			if (errno == ENOENT)
+				ksft_exit_skip(
+					"%s test: skipping test since /dev/null does not exist\n",
+					test_name);
+
+			ksft_exit_fail_msg(
+				"%s test: %s - failed to open /dev/null\n",
+				strerror(errno), test_name);
+		}
+
+		open_fds[i] = fd;
+	}
+
+	fd_min = open_fds[0];
+	fd_max = open_fds[99];
+
+	ret = sys_close_range(fd_min, fd_max, 1);
+	if (!ret)
+		ksft_exit_fail_msg(
+			"%s test: managed to pass invalid flag value\n",
+			test_name);
+	ksft_test_result_pass("do not allow invalid flag values for close_range()\n");
+
+	fd_mid = open_fds[50];
+	ret = sys_close_range(fd_min, fd_mid, 0);
+	if (ret < 0)
+		ksft_exit_fail_msg(
+			"%s test: Failed to close range of file descriptors from 4 to 50\n",
+			test_name);
+	ksft_test_result_pass("close_range() from %d to %d\n", fd_min, fd_mid);
+
+	for (i = 0; i <= 50; i++) {
+		ret = fcntl(open_fds[i], F_GETFL);
+		if (ret >= 0)
+			ksft_exit_fail_msg(
+				"%s test: Failed to close range of file descriptors from 4 to 50\n",
+				test_name);
+	}
+	ksft_test_result_pass("fcntl() verify closed range from %d to %d\n", fd_min, fd_mid);
+
+	/* create a couple of gaps */
+	close(57);
+	close(78);
+	close(81);
+	close(82);
+	close(84);
+	close(90);
+
+	fd_mid = open_fds[51];
+	/* Choose slightly lower limit and leave some fds for a later test */
+	fd_max = open_fds[92];
+	ret = sys_close_range(fd_mid, fd_max, 0);
+	if (ret < 0)
+		ksft_exit_fail_msg(
+			"%s test: Failed to close range of file descriptors from 51 to 100\n",
+			test_name);
+	ksft_test_result_pass("close_range() from %d to %d\n", fd_mid, fd_max);
+
+	for (i = 51; i <= 92; i++) {
+		ret = fcntl(open_fds[i], F_GETFL);
+		if (ret >= 0)
+			ksft_exit_fail_msg(
+				"%s test: Failed to close range of file descriptors from 51 to 100\n",
+				test_name);
+	}
+	ksft_test_result_pass("fcntl() verify closed range from %d to %d\n", fd_mid, fd_max);
+
+	fd_mid = open_fds[93];
+	fd_max = open_fds[99];
+	/* test that the kernel caps and still closes all fds */
+	ret = sys_close_range(fd_mid, UINT_MAX, 0);
+	if (ret < 0)
+		ksft_exit_fail_msg(
+			"%s test: Failed to close range of file descriptors from 51 to 100\n",
+			test_name);
+	ksft_test_result_pass("close_range() from %d to %d\n", fd_mid, fd_max);
+
+	for (i = 93; i < 100; i++) {
+		ret = fcntl(open_fds[i], F_GETFL);
+		if (ret >= 0)
+			ksft_exit_fail_msg(
+				"%s test: Failed to close range of file descriptors from 51 to 100\n",
+				test_name);
+	}
+	ksft_test_result_pass("fcntl() verify closed range from %d to %d\n", fd_mid, fd_max);
+
+	return ksft_exit_pass();
+}
-- 
2.21.0

^ permalink raw reply related

* [PATCH v1 1/2] open: add close_range()
From: Christian Brauner @ 2019-05-22 15:52 UTC (permalink / raw)
  To: viro, linux-kernel, linux-fsdevel, linux-api, torvalds, fweimer
  Cc: jannh, oleg, tglx, arnd, shuah, dhowells, tkjos, ldv, miklos,
	linux-alpha, linux-arm-kernel, linux-ia64, linux-m68k, linux-mips,
	linux-parisc, linuxppc-dev, linux-s390, linux-sh, sparclinux,
	linux-xtensa, linux-arch, linux-kselftest, x86, Christian Brauner

This adds the close_range() syscall. It allows to efficiently close a range
of file descriptors up to all file descriptors of a calling task.

The syscall came up in a recent discussion around the new mount API and
making new file descriptor types cloexec by default. During this
discussion, Al suggested the close_range() syscall (cf. [1]). Note, a
syscall in this manner has been requested by various people over time.

First, it helps to close all file descriptors of an exec()ing task. This
can be done safely via (quoting Al's example from [1] verbatim):

        /* that exec is sensitive */
        unshare(CLONE_FILES);
        /* we don't want anything past stderr here */
        close_range(3, ~0U);
        execve(....);

The code snippet above is one way of working around the problem that file
descriptors are not cloexec by default. This is aggravated by the fact that
we can't just switch them over without massively regressing userspace. For
a whole class of programs having an in-kernel method of closing all file
descriptors is very helpful (e.g. demons, service managers, programming
language standard libraries, container managers etc.).
(Please note, unshare(CLONE_FILES) should only be needed if the calling
 task is multi-threaded and shares the file descriptor table with another
 thread in which case two threads could race with one thread allocating
 file descriptors and the other one closing them via close_range(). For the
 general case close_range() before the execve() is sufficient.)

Second, it allows userspace to avoid implementing closing all file
descriptors by parsing through /proc/<pid>/fd/* and calling close() on each
file descriptor. From looking at various large(ish) userspace code bases
this or similar patterns are very common in:
- service managers (cf. [4])
- libcs (cf. [6])
- container runtimes (cf. [5])
- programming language runtimes/standard libraries
  - Python (cf. [2])
  - Rust (cf. [7], [8])
As Dmitry pointed out there's even a long-standing glibc bug about missing
kernel support for this task (cf. [3]).
In addition, the syscall will also work for tasks that do not have procfs
mounted and on kernels that do not have procfs support compiled in. In such
situations the only way to make sure that all file descriptors are closed
is to call close() on each file descriptor up to UINT_MAX or RLIMIT_NOFILE,
OPEN_MAX trickery (cf. comment [8] on Rust).

The performance is striking. For good measure, comparing the following
simple close_all_fds() userspace implementation that is essentially just
glibc's version in [6]:

static int close_all_fds(void)
{
        int dir_fd;
        DIR *dir;
        struct dirent *direntp;

        dir = opendir("/proc/self/fd");
        if (!dir)
                return -1;
        dir_fd = dirfd(dir);
        while ((direntp = readdir(dir))) {
                int fd;
                if (strcmp(direntp->d_name, ".") == 0)
                        continue;
                if (strcmp(direntp->d_name, "..") == 0)
                        continue;
                fd = atoi(direntp->d_name);
                if (fd == dir_fd || fd == 0 || fd == 1 || fd == 2)
                        continue;
                close(fd);
        }
        closedir(dir);
        return 0;
}

to close_range() yields:
1. closing 4 open files:
   - close_all_fds(): ~280 us
   - close_range():    ~24 us

2. closing 1000 open files:
   - close_all_fds(): ~5000 us
   - close_range():   ~800 us

close_range() is designed to allow for some flexibility. Specifically, it
does not simply always close all open file descriptors of a task. Instead,
callers can specify an upper bound.
This is e.g. useful for scenarios where specific file descriptors are
created with well-known numbers that are supposed to be excluded from
getting closed.
For extra paranoia close_range() comes with a flags argument. This can e.g.
be used to implement extension. Once can imagine userspace wanting to stop
at the first error instead of ignoring errors under certain circumstances.
There might be other valid ideas in the future. In any case, a flag
argument doesn't hurt and keeps us on the safe side.

>From an implementation side this is kept rather dumb. It saw some input
from David and Jann but all nonsense is obviously my own!
- Errors to close file descriptors are currently ignored. (Could be changed
  by setting a flag in the future if needed.)
- __close_range() is a rather simplistic wrapper around __close_fd().
  My reasoning behind this is based on the nature of how __close_fd() needs
  to release an fd. But maybe I misunderstood specifics:
  We take the files_lock and rcu-dereference the fdtable of the calling
  task, we find the entry in the fdtable, get the file and need to release
  files_lock before calling filp_close().
  In the meantime the fdtable might have been altered so we can't just
  retake the spinlock and keep the old rcu-reference of the fdtable
  around. Instead we need to grab a fresh reference to the fdtable.
  If my reasoning is correct then there's really no point in fancyfying
  __close_range(): We just need to rcu-dereference the fdtable of the
  calling task once to cap the max_fd value correctly and then go on
  calling __close_fd() in a loop.

/* References */
[1]: https://lore.kernel.org/lkml/20190516165021.GD17978@ZenIV.linux.org.uk/
[2]: https://github.com/python/cpython/blob/9e4f2f3a6b8ee995c365e86d976937c141d867f8/Modules/_posixsubprocess.c#L220
[3]: https://sourceware.org/bugzilla/show_bug.cgi?id=10353#c7
[4]: https://github.com/systemd/systemd/blob/5238e9575906297608ff802a27e2ff9effa3b338/src/basic/fd-util.c#L217
[5]: https://github.com/lxc/lxc/blob/ddf4b77e11a4d08f09b7b9cd13e593f8c047edc5/src/lxc/start.c#L236
[6]: https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/unix/sysv/linux/grantpt.c;h=2030e07fa6e652aac32c775b8c6e005844c3c4eb;hb=HEAD#l17
     Note that this is an internal implementation that is not exported.
     Currently, libc seems to not provide an exported version of this
     because of missing kernel support to do this.
[7]: https://github.com/rust-lang/rust/issues/12148
[8]: https://github.com/rust-lang/rust/blob/5f47c0613ed4eb46fca3633c1297364c09e5e451/src/libstd/sys/unix/process2.rs#L303-L308
     Rust's solution is slightly different but is equally unperformant.
     Rust calls getdtablesize() which is a glibc library function that
     simply returns the current RLIMIT_NOFILE or OPEN_MAX values. Rust then
     goes on to call close() on each fd. That's obviously overkill for most
     tasks. Rarely, tasks - especially non-demons - hit RLIMIT_NOFILE or
     OPEN_MAX.
     Let's be nice and assume an unprivileged user with RLIMIT_NOFILE set
     to 1024. Even in this case, there's a very high chance that in the
     common case Rust is calling the close() syscall 1021 times pointlessly
     if the task just has 0, 1, and 2 open.

Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jann Horn <jannh@google.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Dmitry V. Levin <ldv@altlinux.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Florian Weimer <fweimer@redhat.com>
Cc: linux-api@vger.kernel.org
---
v1:
- Linus Torvalds <torvalds@linux-foundation.org>:
  - add cond_resched() to yield cpu when closing a lot of file descriptors
- Al Viro <viro@zeniv.linux.org.uk>:
  - add cond_resched() to yield cpu when closing a lot of file descriptors
---
 arch/alpha/kernel/syscalls/syscall.tbl      |  1 +
 arch/arm/tools/syscall.tbl                  |  1 +
 arch/arm64/include/asm/unistd32.h           |  2 +
 arch/ia64/kernel/syscalls/syscall.tbl       |  1 +
 arch/m68k/kernel/syscalls/syscall.tbl       |  1 +
 arch/microblaze/kernel/syscalls/syscall.tbl |  1 +
 arch/mips/kernel/syscalls/syscall_n32.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_n64.tbl   |  1 +
 arch/mips/kernel/syscalls/syscall_o32.tbl   |  1 +
 arch/parisc/kernel/syscalls/syscall.tbl     |  1 +
 arch/powerpc/kernel/syscalls/syscall.tbl    |  1 +
 arch/s390/kernel/syscalls/syscall.tbl       |  1 +
 arch/sh/kernel/syscalls/syscall.tbl         |  1 +
 arch/sparc/kernel/syscalls/syscall.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_32.tbl      |  1 +
 arch/x86/entry/syscalls/syscall_64.tbl      |  1 +
 arch/xtensa/kernel/syscalls/syscall.tbl     |  1 +
 fs/file.c                                   | 63 ++++++++++++++++++---
 fs/open.c                                   | 20 +++++++
 include/linux/fdtable.h                     |  2 +
 include/linux/syscalls.h                    |  2 +
 include/uapi/asm-generic/unistd.h           |  4 +-
 22 files changed, 100 insertions(+), 9 deletions(-)

diff --git a/arch/alpha/kernel/syscalls/syscall.tbl b/arch/alpha/kernel/syscalls/syscall.tbl
index 9e7704e44f6d..b55d93af8096 100644
--- a/arch/alpha/kernel/syscalls/syscall.tbl
+++ b/arch/alpha/kernel/syscalls/syscall.tbl
@@ -473,3 +473,4 @@
 541	common	fsconfig			sys_fsconfig
 542	common	fsmount				sys_fsmount
 543	common	fspick				sys_fspick
+545	common	close_range			sys_close_range
diff --git a/arch/arm/tools/syscall.tbl b/arch/arm/tools/syscall.tbl
index aaf479a9e92d..0125c97c75dd 100644
--- a/arch/arm/tools/syscall.tbl
+++ b/arch/arm/tools/syscall.tbl
@@ -447,3 +447,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/arch/arm64/include/asm/unistd32.h b/arch/arm64/include/asm/unistd32.h
index c39e90600bb3..9a3270d29b42 100644
--- a/arch/arm64/include/asm/unistd32.h
+++ b/arch/arm64/include/asm/unistd32.h
@@ -886,6 +886,8 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
 __SYSCALL(__NR_fsmount, sys_fsmount)
 #define __NR_fspick 433
 __SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_close_range 435
+__SYSCALL(__NR_close_range, sys_close_range)
 
 /*
  * Please add new compat syscalls above this comment and update
diff --git a/arch/ia64/kernel/syscalls/syscall.tbl b/arch/ia64/kernel/syscalls/syscall.tbl
index e01df3f2f80d..1a90b464e96f 100644
--- a/arch/ia64/kernel/syscalls/syscall.tbl
+++ b/arch/ia64/kernel/syscalls/syscall.tbl
@@ -354,3 +354,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/arch/m68k/kernel/syscalls/syscall.tbl b/arch/m68k/kernel/syscalls/syscall.tbl
index 7e3d0734b2f3..2dee2050f9ef 100644
--- a/arch/m68k/kernel/syscalls/syscall.tbl
+++ b/arch/m68k/kernel/syscalls/syscall.tbl
@@ -433,3 +433,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/arch/microblaze/kernel/syscalls/syscall.tbl b/arch/microblaze/kernel/syscalls/syscall.tbl
index 26339e417695..923ef69e5a76 100644
--- a/arch/microblaze/kernel/syscalls/syscall.tbl
+++ b/arch/microblaze/kernel/syscalls/syscall.tbl
@@ -439,3 +439,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/arch/mips/kernel/syscalls/syscall_n32.tbl b/arch/mips/kernel/syscalls/syscall_n32.tbl
index 0e2dd68ade57..967ed9de51cd 100644
--- a/arch/mips/kernel/syscalls/syscall_n32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n32.tbl
@@ -372,3 +372,4 @@
 431	n32	fsconfig			sys_fsconfig
 432	n32	fsmount				sys_fsmount
 433	n32	fspick				sys_fspick
+435	n32	close_range			sys_close_range
diff --git a/arch/mips/kernel/syscalls/syscall_n64.tbl b/arch/mips/kernel/syscalls/syscall_n64.tbl
index 5eebfa0d155c..71de731102b1 100644
--- a/arch/mips/kernel/syscalls/syscall_n64.tbl
+++ b/arch/mips/kernel/syscalls/syscall_n64.tbl
@@ -348,3 +348,4 @@
 431	n64	fsconfig			sys_fsconfig
 432	n64	fsmount				sys_fsmount
 433	n64	fspick				sys_fspick
+435	n64	close_range			sys_close_range
diff --git a/arch/mips/kernel/syscalls/syscall_o32.tbl b/arch/mips/kernel/syscalls/syscall_o32.tbl
index 3cc1374e02d0..5a325ab29f88 100644
--- a/arch/mips/kernel/syscalls/syscall_o32.tbl
+++ b/arch/mips/kernel/syscalls/syscall_o32.tbl
@@ -421,3 +421,4 @@
 431	o32	fsconfig			sys_fsconfig
 432	o32	fsmount				sys_fsmount
 433	o32	fspick				sys_fspick
+435	o32	close_range			sys_close_range
diff --git a/arch/parisc/kernel/syscalls/syscall.tbl b/arch/parisc/kernel/syscalls/syscall.tbl
index c9e377d59232..dcc0a0879139 100644
--- a/arch/parisc/kernel/syscalls/syscall.tbl
+++ b/arch/parisc/kernel/syscalls/syscall.tbl
@@ -430,3 +430,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/arch/powerpc/kernel/syscalls/syscall.tbl b/arch/powerpc/kernel/syscalls/syscall.tbl
index 103655d84b4b..ba2c1f078cbd 100644
--- a/arch/powerpc/kernel/syscalls/syscall.tbl
+++ b/arch/powerpc/kernel/syscalls/syscall.tbl
@@ -515,3 +515,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/arch/s390/kernel/syscalls/syscall.tbl b/arch/s390/kernel/syscalls/syscall.tbl
index e822b2964a83..d7c9043d2902 100644
--- a/arch/s390/kernel/syscalls/syscall.tbl
+++ b/arch/s390/kernel/syscalls/syscall.tbl
@@ -436,3 +436,4 @@
 431  common	fsconfig		sys_fsconfig			sys_fsconfig
 432  common	fsmount			sys_fsmount			sys_fsmount
 433  common	fspick			sys_fspick			sys_fspick
+435  common	close_range		sys_close_range			sys_close_range
diff --git a/arch/sh/kernel/syscalls/syscall.tbl b/arch/sh/kernel/syscalls/syscall.tbl
index 016a727d4357..9b5e6bf0ce32 100644
--- a/arch/sh/kernel/syscalls/syscall.tbl
+++ b/arch/sh/kernel/syscalls/syscall.tbl
@@ -436,3 +436,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/arch/sparc/kernel/syscalls/syscall.tbl b/arch/sparc/kernel/syscalls/syscall.tbl
index e047480b1605..8c674a1e0072 100644
--- a/arch/sparc/kernel/syscalls/syscall.tbl
+++ b/arch/sparc/kernel/syscalls/syscall.tbl
@@ -479,3 +479,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ad968b7bac72..7f7a89a96707 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -438,3 +438,4 @@
 431	i386	fsconfig		sys_fsconfig			__ia32_sys_fsconfig
 432	i386	fsmount			sys_fsmount			__ia32_sys_fsmount
 433	i386	fspick			sys_fspick			__ia32_sys_fspick
+435	i386	close_range		sys_close_range			__ia32_sys_close_range
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index b4e6f9e6204a..0f7d47ae921c 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -355,6 +355,7 @@
 431	common	fsconfig		__x64_sys_fsconfig
 432	common	fsmount			__x64_sys_fsmount
 433	common	fspick			__x64_sys_fspick
+435	common	close_range		__x64_sys_close_range
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/arch/xtensa/kernel/syscalls/syscall.tbl b/arch/xtensa/kernel/syscalls/syscall.tbl
index 5fa0ee1c8e00..b489532265d0 100644
--- a/arch/xtensa/kernel/syscalls/syscall.tbl
+++ b/arch/xtensa/kernel/syscalls/syscall.tbl
@@ -404,3 +404,4 @@
 431	common	fsconfig			sys_fsconfig
 432	common	fsmount				sys_fsmount
 433	common	fspick				sys_fspick
+435	common	close_range			sys_close_range
diff --git a/fs/file.c b/fs/file.c
index 3da91a112bab..54945efa046e 100644
--- a/fs/file.c
+++ b/fs/file.c
@@ -615,12 +615,9 @@ void fd_install(unsigned int fd, struct file *file)
 
 EXPORT_SYMBOL(fd_install);
 
-/*
- * The same warnings as for __alloc_fd()/__fd_install() apply here...
- */
-int __close_fd(struct files_struct *files, unsigned fd)
+static struct file *pick_file(struct files_struct *files, unsigned fd)
 {
-	struct file *file;
+	struct file *file = NULL;
 	struct fdtable *fdt;
 
 	spin_lock(&files->file_lock);
@@ -632,15 +629,65 @@ int __close_fd(struct files_struct *files, unsigned fd)
 		goto out_unlock;
 	rcu_assign_pointer(fdt->fd[fd], NULL);
 	__put_unused_fd(files, fd);
-	spin_unlock(&files->file_lock);
-	return filp_close(file, files);
 
 out_unlock:
 	spin_unlock(&files->file_lock);
-	return -EBADF;
+	return file;
+}
+
+/*
+ * The same warnings as for __alloc_fd()/__fd_install() apply here...
+ */
+int __close_fd(struct files_struct *files, unsigned fd)
+{
+	struct file *file;
+
+	file = pick_file(files, fd);
+	if (!file)
+		return -EBADF;
+
+	return filp_close(file, files);
 }
 EXPORT_SYMBOL(__close_fd); /* for ksys_close() */
 
+/**
+ * __close_range() - Close all file descriptors in a given range.
+ *
+ * @fd:     starting file descriptor to close
+ * @max_fd: last file descriptor to close
+ *
+ * This closes a range of file descriptors. All file descriptors
+ * from @fd up to and including @max_fd are closed.
+ */
+int __close_range(struct files_struct *files, unsigned fd, unsigned max_fd)
+{
+	unsigned int cur_max;
+
+	if (fd > max_fd)
+		return -EINVAL;
+
+	rcu_read_lock();
+	cur_max = files_fdtable(files)->max_fds;
+	rcu_read_unlock();
+
+	/* cap to last valid index into fdtable */
+	if (max_fd >= cur_max)
+		max_fd = cur_max - 1;
+
+	while (fd <= max_fd) {
+		struct file *file;
+
+		file = pick_file(files, fd++);
+		if (!file)
+			continue;
+
+		filp_close(file, files);
+		cond_resched();
+	}
+
+	return 0;
+}
+
 /*
  * variant of __close_fd that gets a ref on the file for later fput
  */
diff --git a/fs/open.c b/fs/open.c
index 9c7d724a6f67..c7baaee7aa47 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1174,6 +1174,26 @@ SYSCALL_DEFINE1(close, unsigned int, fd)
 	return retval;
 }
 
+/**
+ * close_range() - Close all file descriptors in a given range.
+ *
+ * @fd:     starting file descriptor to close
+ * @max_fd: last file descriptor to close
+ * @flags:  reserved for future extensions
+ *
+ * This closes a range of file descriptors. All file descriptors
+ * from @fd up to and including @max_fd are closed.
+ * Currently, errors to close a given file descriptor are ignored.
+ */
+SYSCALL_DEFINE3(close_range, unsigned int, fd, unsigned int, max_fd,
+		unsigned int, flags)
+{
+	if (flags)
+		return -EINVAL;
+
+	return __close_range(current->files, fd, max_fd);
+}
+
 /*
  * This routine simulates a hangup on the tty, to arrange that users
  * are given clean terminals at login time.
diff --git a/include/linux/fdtable.h b/include/linux/fdtable.h
index f07c55ea0c22..fcd07181a365 100644
--- a/include/linux/fdtable.h
+++ b/include/linux/fdtable.h
@@ -121,6 +121,8 @@ extern void __fd_install(struct files_struct *files,
 		      unsigned int fd, struct file *file);
 extern int __close_fd(struct files_struct *files,
 		      unsigned int fd);
+extern int __close_range(struct files_struct *files, unsigned int fd,
+			 unsigned int max_fd);
 extern int __close_fd_get_file(unsigned int fd, struct file **res);
 
 extern struct kmem_cache *files_cachep;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe1be5b..c0189e223255 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -441,6 +441,8 @@ asmlinkage long sys_fchown(unsigned int fd, uid_t user, gid_t group);
 asmlinkage long sys_openat(int dfd, const char __user *filename, int flags,
 			   umode_t mode);
 asmlinkage long sys_close(unsigned int fd);
+asmlinkage long sys_close_range(unsigned int fd, unsigned int max_fd,
+				unsigned int flags);
 asmlinkage long sys_vhangup(void);
 
 /* fs/pipe.c */
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a87904daf103..3f36c8745d24 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -844,9 +844,11 @@ __SYSCALL(__NR_fsconfig, sys_fsconfig)
 __SYSCALL(__NR_fsmount, sys_fsmount)
 #define __NR_fspick 433
 __SYSCALL(__NR_fspick, sys_fspick)
+#define __NR_close_range 435
+__SYSCALL(__NR_close_range, sys_close_range)
 
 #undef __NR_syscalls
-#define __NR_syscalls 434
+#define __NR_syscalls 436
 
 /*
  * 32 bit systems traditionally used different
-- 
2.21.0

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox