Linux userland API discussions
* [PATCH v4 09/11] kernel/api: add runtime verification selftest
From: Sasha Levin @ 2026-05-29 23:33 UTC
  To: linux-api, linux-kernel
  Cc: linux-doc, linux-fsdevel, linux-kbuild, linux-kselftest,
	workflows, tools, x86, Thomas Gleixner, Paul E . McKenney,
	Greg Kroah-Hartman, Jonathan Corbet, Dmitry Vyukov, Randy Dunlap,
	Cyril Hrubis, Kees Cook, Jake Edge, David Laight,
	Gabriele Paoloni, Mauro Carvalho Chehab, Christian Brauner,
	Alexander Viro, Andrew Morton, Masahiro Yamada, Shuah Khan,
	Arnd Bergmann, Nathan Chancellor, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers
In-Reply-To: <20260529233311.1901670-1-sashal@kernel.org>

Add a selftest for CONFIG_KAPI_RUNTIME_CHECKS that exercises
sys_open/sys_read/sys_write/sys_close through raw syscall() and
verifies KAPI pre-validation catches invalid parameters while
allowing valid operations through.

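Test cases (TAP output, 29 in total):
  1-4:   Valid open/read/write/close succeed
  5-8:   Invalid flags, mode bits, NULL path and flag bit 30 rejected
         with EINVAL (EFAULT also accepted for the NULL path)
  9-20:  Boundary conditions and error paths
  21-27: Pipe, fd lifecycle and large-count behavior
  28:    Normal open/read/write/close path is unaffected by KAPI
  29:    dmesg contains expected KAPI warning strings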

Signed-off-by: Sasha Levin <sashal@kernel.org>
---
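Illustration only, not part of the patch: a minimal sketch of the check
pattern the selftest relies on. A raw syscall() returns -1 and sets errno
on failure, and each test accepts a small set of errno values. The names
raw_open(), failed_with(), null_path_is_rejected() and
double_close_is_ebadf() are hypothetical. The sketch always goes through
SYS_openat with AT_FDCWD, which is the fallback the patch uses on
architectures without __NR_open.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: raw openat(AT_FDCWD, ...), no libc open() wrapper. */
long raw_open(const char *path, int flags, int mode)
{
	return syscall(SYS_openat, AT_FDCWD, path, flags, mode);
}

/* Hypothetical helper: 1 if ret is -1 and errno is one of two values. */
int failed_with(long ret, int err_a, int err_b)
{
	return ret == -1 && (errno == err_a || errno == err_b);
}

/*
 * NULL path: assumed to yield EINVAL when KAPI pre-validation rejects it,
 * and EFAULT when the regular path-copy code catches it first.
 */
int null_path_is_rejected(void)
{
	errno = 0;
	return failed_with(raw_open(NULL, O_RDONLY, 0), EINVAL, EFAULT);
}

/* Closing the same fd twice: the second close must fail with EBADF. */
int double_close_is_ebadf(void)
{
	long fd = raw_open("/dev/null", O_RDONLY, 0);

	if (fd < 0)
		return 0;
	syscall(SYS_close, (int)fd);
	errno = 0;
	return failed_with(syscall(SYS_close, (int)fd), EBADF, EBADF);
}
```

Both probes are expected to return 1 with or without
CONFIG_KAPI_RUNTIME_CHECKS, which mirrors how test 7 accepts either errno.
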
 MAINTAINERS                                   |    1 +
 tools/testing/selftests/Makefile              |    1 +
 tools/testing/selftests/kapi/Makefile         |    7 +
 tools/testing/selftests/kapi/kapi_test_util.h |   33 +
 tools/testing/selftests/kapi/test_kapi.c      | 1096 +++++++++++++++++
 5 files changed, 1138 insertions(+)
 create mode 100644 tools/testing/selftests/kapi/Makefile
 create mode 100644 tools/testing/selftests/kapi/kapi_test_util.h
 create mode 100644 tools/testing/selftests/kapi/test_kapi.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 0d14205077908..ddfd9cad98916 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13826,6 +13826,7 @@ F:	include/linux/kernel_api_spec.h
 F:	kernel/api/
 F:	tools/kapi/
 F:	tools/lib/python/kdoc/kdoc_apispec.py
+F:	tools/testing/selftests/kapi/
 
 KERNEL AUTOMOUNTER
 M:	Ian Kent <raven@themaw.net>
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 450f13ba4cca9..7881bec5aafe1 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -48,6 +48,7 @@ TARGETS += intel_pstate
 TARGETS += iommu
 TARGETS += ipc
 TARGETS += ir
+TARGETS += kapi
 TARGETS += kcmp
 TARGETS += kexec
 TARGETS += kselftest_harness
diff --git a/tools/testing/selftests/kapi/Makefile b/tools/testing/selftests/kapi/Makefile
new file mode 100644
index 0000000000000..32a750901b111
--- /dev/null
+++ b/tools/testing/selftests/kapi/Makefile
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: GPL-2.0
+
+TEST_GEN_PROGS := test_kapi
+
+CFLAGS += -static -Wall -Wextra -Werror -O2 $(KHDR_INCLUDES)
+
+include ../lib.mk
diff --git a/tools/testing/selftests/kapi/kapi_test_util.h b/tools/testing/selftests/kapi/kapi_test_util.h
new file mode 100644
index 0000000000000..e097c370542ad
--- /dev/null
+++ b/tools/testing/selftests/kapi/kapi_test_util.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2026 Sasha Levin <sashal@kernel.org>
+ *
+ * Compatibility helpers for KAPI selftests.
+ *
+ * __NR_open is not defined on aarch64 and riscv64 (only __NR_openat exists).
+ * Provide a wrapper that uses __NR_openat with AT_FDCWD to achieve the same
+ * behavior as __NR_open on architectures that lack it.
+ */
+#ifndef KAPI_TEST_UTIL_H
+#define KAPI_TEST_UTIL_H
+
+#include <fcntl.h>
+#include <sys/syscall.h>
+
+#ifndef __NR_open
+/*
+ * On architectures without __NR_open (e.g., aarch64, riscv64),
+ * use openat(AT_FDCWD, ...) which is equivalent.
+ */
+static inline long kapi_sys_open(const char *pathname, int flags, int mode)
+{
+	return syscall(__NR_openat, AT_FDCWD, pathname, flags, mode);
+}
+#else
+static inline long kapi_sys_open(const char *pathname, int flags, int mode)
+{
+	return syscall(__NR_open, pathname, flags, mode);
+}
+#endif
+
+#endif /* KAPI_TEST_UTIL_H */
diff --git a/tools/testing/selftests/kapi/test_kapi.c b/tools/testing/selftests/kapi/test_kapi.c
new file mode 100644
index 0000000000000..a6b7576f95c3e
--- /dev/null
+++ b/tools/testing/selftests/kapi/test_kapi.c
@@ -0,0 +1,1096 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2026 Sasha Levin <sashal@kernel.org>
+ *
+ * Userspace selftest for KAPI runtime verification of syscall parameters.
+ *
+ * Exercises sys_open, sys_read, sys_write, and sys_close through raw
+ * syscall() to ensure KAPI pre-validation wrappers interact correctly
+ * with normal kernel error handling.
+ *
+ * Requires CONFIG_KAPI_RUNTIME_CHECKS=y for full coverage; many tests
+ * also pass without it.
+ *
+ * TAP output format.
+ */
+
+#define _GNU_SOURCE
+#include <stdbool.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <errno.h>
+#include <signal.h>
+#include <sys/syscall.h>
+#include <sys/stat.h>
+#include <linux/limits.h>
+#include "../kselftest.h"
+#include "kapi_test_util.h"
+
+#define NUM_TESTS 29
+
+/*
+ * Set from the SIGPIPE handler. `volatile sig_atomic_t` is the POSIX-
+ * mandated type for flags touched by async-signal-safe handlers;
+ * checkpatch's generic "volatile considered harmful" warning targets
+ * kernel code and does not apply here.
+ */
+static volatile sig_atomic_t got_sigpipe;
+
+/*
+ * The tap_* helpers are thin wrappers around ksft_test_result_* so the
+ * call sites in this file stay short while all output goes through the
+ * shared kselftest harness.
+ */
+static void tap_ok(const char *desc)
+{
+	ksft_test_result_pass("%s\n", desc);
+}
+
+static void tap_fail(const char *desc, const char *reason)
+{
+	ksft_test_result_fail("%s: %s\n", desc, reason);
+}
+
+static void tap_skip(const char *desc, const char *reason)
+{
+	ksft_test_result_skip("%s: %s\n", desc, reason);
+}
+
+/*
+ * Return true when the kernel provides the kapi runtime-check surface.
+ * Tests that rely on KAPI rejecting bad parameters pre-call should be
+ * skipped on kernels without it, not reported as failures.
+ */
+static bool kapi_runtime_checks_active(void)
+{
+	struct stat st;
+
+	return stat("/sys/kernel/debug/kapi", &st) == 0 && S_ISDIR(st.st_mode);
+}
+
+static void sigpipe_handler(int sig)
+{
+	(void)sig;
+	got_sigpipe = 1;
+}
+
+/* ---- Valid operation tests ---- */
+
+/*
+ * Test 1: open a readable file
+ * Returns fd on success.
+ */
+static int test_open_valid(void)
+{
+	errno = 0;
+	long fd = kapi_sys_open("/etc/hostname", O_RDONLY, 0);
+
+	if (fd >= 0) {
+		tap_ok("open valid file");
+	} else {
+		/* /etc/hostname might not exist; try /etc/passwd */
+		errno = 0;
+		fd = kapi_sys_open("/etc/passwd", O_RDONLY, 0);
+		if (fd >= 0)
+			tap_ok("open valid file (fallback /etc/passwd)");
+		else
+			tap_fail("open valid file", strerror(errno));
+	}
+	return (int)fd;
+}
+
+/*
+ * Test 2: read from fd
+ */
+static void test_read_valid(int fd)
+{
+	char buf[256];
+
+	errno = 0;
+	long ret = syscall(__NR_read, fd, buf, sizeof(buf));
+
+	if (ret > 0)
+		tap_ok("read from valid fd");
+	else if (ret == 0)
+		tap_ok("read from valid fd (EOF)");
+	else
+		tap_fail("read from valid fd", strerror(errno));
+}
+
+/*
+ * Test 3: write to /dev/null
+ */
+static void test_write_valid(void)
+{
+	errno = 0;
+	long devnull = kapi_sys_open("/dev/null", O_WRONLY, 0);
+
+	if (devnull < 0) {
+		tap_fail("write to /dev/null (open failed)", strerror(errno));
+		return;
+	}
+
+	errno = 0;
+	long ret = syscall(__NR_write, (int)devnull, "hello", 5);
+
+	if (ret == 5)
+		tap_ok("write to /dev/null");
+	else
+		tap_fail("write to /dev/null",
+			 ret < 0 ? strerror(errno) : "short write");
+
+	syscall(__NR_close, (int)devnull);
+}
+
+/*
+ * Test 4: close fd
+ */
+static void test_close_valid(int fd)
+{
+	errno = 0;
+	long ret = syscall(__NR_close, fd);
+
+	if (ret == 0)
+		tap_ok("close valid fd");
+	else
+		tap_fail("close valid fd", strerror(errno));
+}
+
+/* ---- KAPI parameter rejection tests ---- */
+
+/*
+ * Test 5: open with invalid flag bits
+ * 0x10000000 is outside the valid O_* mask, KAPI should reject.
+ */
+static void test_open_invalid_flags(void)
+{
+	long ret;
+
+	if (!kapi_runtime_checks_active()) {
+		tap_skip("open with invalid flags",
+			 "CONFIG_KAPI_RUNTIME_CHECKS not enabled");
+		return;
+	}
+
+	errno = 0;
+	/*
+	 * Use /dev/null (always present on any sane rootfs) so KAPI's flag
+	 * validation is reached before a path-lookup ENOENT can mask it.
+	 * 0x10000000 is outside the valid O_* mask.
+	 */
+	ret = kapi_sys_open("/dev/null", 0x10000000, 0);
+
+	if (ret == -1 && errno == EINVAL) {
+		tap_ok("open with invalid flags returns EINVAL");
+	} else if (ret >= 0) {
+		tap_fail("open with invalid flags", "expected EINVAL, got success");
+		syscall(__NR_close, (int)ret);
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected EINVAL, got %s",
+			 strerror(errno));
+		tap_fail("open with invalid flags", msg);
+	}
+}
+
+/*
+ * Test 6: open with invalid mode bits
+ * 0xFFFF has bits outside S_IALLUGO (07777), KAPI should reject.
+ */
+static void test_open_invalid_mode(void)
+{
+	long ret;
+
+	if (!kapi_runtime_checks_active()) {
+		tap_skip("open with invalid mode",
+			 "CONFIG_KAPI_RUNTIME_CHECKS not enabled");
+		return;
+	}
+
+	errno = 0;
+	ret = kapi_sys_open("/tmp/kapi_test_mode",
+			    O_CREAT | O_WRONLY | O_EXCL, 0xFFFF);
+
+	if (ret == -1 && errno == EINVAL) {
+		tap_ok("open with invalid mode returns EINVAL");
+	} else if (ret >= 0) {
+		tap_fail("open with invalid mode", "expected EINVAL, got success");
+		syscall(__NR_close, (int)ret);
+		unlink("/tmp/kapi_test_mode");
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected EINVAL, got %s",
+			 strerror(errno));
+		tap_fail("open with invalid mode", msg);
+	}
+}
+
+/*
+ * Test 7: open with NULL path
+ * KAPI USER_PATH constraint should reject NULL.
+ */
+static void test_open_null_path(void)
+{
+	errno = 0;
+	long ret = kapi_sys_open(NULL, O_RDONLY, 0);
+
+	if (ret == -1 && errno == EINVAL) {
+		tap_ok("open with NULL path returns EINVAL");
+	} else if (ret == -1 && errno == EFAULT) {
+		/* Kernel may catch this as EFAULT before KAPI */
+		tap_ok("open with NULL path returns EFAULT (acceptable)");
+	} else if (ret >= 0) {
+		tap_fail("open with NULL path", "expected error, got success");
+		syscall(__NR_close, (int)ret);
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "got %s", strerror(errno));
+		tap_fail("open with NULL path", msg);
+	}
+}
+
+/*
+ * Test 8: open with flag bit 30 set (0x40000000)
+ * This bit is outside the valid O_* mask, KAPI should reject with EINVAL.
+ */
+static void test_open_flag_bit30(void)
+{
+	long ret;
+
+	if (!kapi_runtime_checks_active()) {
+		tap_skip("open with flag bit 30 (0x40000000) returns EINVAL",
+			 "CONFIG_KAPI_RUNTIME_CHECKS not enabled");
+		return;
+	}
+
+	errno = 0;
+	ret = kapi_sys_open("/dev/null", 0x40000000, 0);
+
+	if (ret == -1 && errno == EINVAL) {
+		tap_ok("open with flag bit 30 (0x40000000) returns EINVAL");
+	} else if (ret >= 0) {
+		tap_fail("open with flag bit 30 (0x40000000) returns EINVAL",
+			 "expected EINVAL, got success");
+		syscall(__NR_close, (int)ret);
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected EINVAL, got %s",
+			 strerror(errno));
+		tap_fail("open with flag bit 30 (0x40000000) returns EINVAL",
+			 msg);
+	}
+}
+
+/* ---- Boundary condition and error path tests ---- */
+
+/*
+ * Test 9: read with fd=-1 should return an error.
+ * With CONFIG_KAPI_RUNTIME_CHECKS=y, KAPI validates the fd first and
+ * rejects negative fds (other than AT_FDCWD) with EINVAL.  Without
+ * KAPI, the kernel returns EBADF.  Accept either.
+ */
+static void test_read_bad_fd(void)
+{
+	char buf[16];
+
+	errno = 0;
+	long ret = syscall(__NR_read, -1, buf, sizeof(buf));
+
+	if (ret == -1 && (errno == EBADF || errno == EINVAL)) {
+		tap_ok("read with fd=-1 returns error");
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected EBADF/EINVAL, got %s",
+			 ret >= 0 ? "success" : strerror(errno));
+		tap_fail("read with fd=-1 returns error", msg);
+	}
+}
+
+/*
+ * Test 10: read with count=0 should return 0
+ */
+static void test_read_zero_count(void)
+{
+	char buf[1];
+	long fd;
+
+	errno = 0;
+	fd = kapi_sys_open("/dev/null", O_RDONLY, 0);
+	if (fd < 0) {
+		tap_fail("read with count=0 returns 0",
+			 "cannot open /dev/null");
+		return;
+	}
+
+	errno = 0;
+	long ret = syscall(__NR_read, (int)fd, buf, 0);
+
+	if (ret == 0) {
+		tap_ok("read with count=0 returns 0");
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected 0, got %ld (errno=%s)",
+			 ret, strerror(errno));
+		tap_fail("read with count=0 returns 0", msg);
+	}
+
+	syscall(__NR_close, (int)fd);
+}
+
+/*
+ * Test 11: write with count=0 should return 0
+ */
+static void test_write_zero_count(void)
+{
+	long fd;
+
+	errno = 0;
+	fd = kapi_sys_open("/dev/null", O_WRONLY, 0);
+	if (fd < 0) {
+		tap_fail("write with count=0 returns 0",
+			 "cannot open /dev/null");
+		return;
+	}
+
+	errno = 0;
+	long ret = syscall(__NR_write, (int)fd, "x", 0);
+
+	if (ret == 0) {
+		tap_ok("write with count=0 returns 0");
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected 0, got %ld (errno=%s)",
+			 ret, strerror(errno));
+		tap_fail("write with count=0 returns 0", msg);
+	}
+
+	syscall(__NR_close, (int)fd);
+}
+
+/*
+ * Test 12: open with a path longer than PATH_MAX should fail
+ * Expect ENAMETOOLONG or EINVAL.
+ */
+static void test_open_long_path(void)
+{
+	char *longpath;
+	size_t len = PATH_MAX + 256;
+
+	longpath = malloc(len);
+	if (!longpath) {
+		tap_fail("open with path > PATH_MAX", "malloc failed");
+		return;
+	}
+
+	memset(longpath, 'A', len - 1);
+	longpath[0] = '/';
+	longpath[len - 1] = '\0';
+
+	errno = 0;
+	long ret = kapi_sys_open(longpath, O_RDONLY, 0);
+
+	if (ret == -1 && (errno == ENAMETOOLONG || errno == EINVAL)) {
+		tap_ok("open with path > PATH_MAX returns ENAMETOOLONG/EINVAL");
+	} else if (ret >= 0) {
+		tap_fail("open with path > PATH_MAX",
+			 "expected error, got success");
+		syscall(__NR_close, (int)ret);
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg),
+			 "expected ENAMETOOLONG/EINVAL, got %s",
+			 strerror(errno));
+		tap_fail("open with path > PATH_MAX", msg);
+	}
+
+	free(longpath);
+}
+
+/*
+ * Test 13: read with unmapped user pointer should return EFAULT or EINVAL.
+ * Use a pipe with data so the kernel actually tries to copy to the buffer.
+ */
+static void test_read_unmapped_buf(void)
+{
+	int pipefd[2];
+
+	if (pipe(pipefd) < 0) {
+		tap_fail("read with unmapped buffer returns EFAULT/EINVAL",
+			 "pipe() failed");
+		return;
+	}
+
+	/* Queue data for read to copy; syscall() is not warn_unused_result */
+	syscall(__NR_write, pipefd[1], "hello", 5);
+
+	errno = 0;
+	long ret = syscall(__NR_read, pipefd[0], (void *)0xDEAD0000, 16);
+
+	if (ret == -1 && (errno == EFAULT || errno == EINVAL)) {
+		tap_ok("read with unmapped buffer returns EFAULT/EINVAL");
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg),
+			 "expected EFAULT/EINVAL, got %s",
+			 ret >= 0 ? "success" : strerror(errno));
+		tap_fail("read with unmapped buffer returns EFAULT/EINVAL",
+			 msg);
+	}
+
+	close(pipefd[0]);
+	close(pipefd[1]);
+}
+
+/*
+ * Test 14: write with unmapped user pointer should return EFAULT or EINVAL.
+ * Use a pipe so the kernel actually tries to copy from the buffer.
+ */
+static void test_write_unmapped_buf(void)
+{
+	int pipefd[2];
+
+	if (pipe(pipefd) < 0) {
+		tap_fail("write with unmapped buffer returns EFAULT/EINVAL",
+			 "pipe() failed");
+		return;
+	}
+
+	errno = 0;
+	long ret = syscall(__NR_write, pipefd[1], (void *)0xDEAD0000, 16);
+
+	if (ret == -1 && (errno == EFAULT || errno == EINVAL)) {
+		tap_ok("write with unmapped buffer returns EFAULT/EINVAL");
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg),
+			 "expected EFAULT/EINVAL, got %s",
+			 ret >= 0 ? "success" : strerror(errno));
+		tap_fail("write with unmapped buffer returns EFAULT/EINVAL",
+			 msg);
+	}
+
+	close(pipefd[0]);
+	close(pipefd[1]);
+}
+
+/*
+ * Test 15: close an already-closed fd should return EBADF
+ */
+static void test_close_already_closed(void)
+{
+	long fd;
+
+	errno = 0;
+	fd = kapi_sys_open("/dev/null", O_RDONLY, 0);
+	if (fd < 0) {
+		tap_fail("close already-closed fd returns EBADF",
+			 "cannot open /dev/null");
+		return;
+	}
+
+	/* Close it once - should succeed */
+	syscall(__NR_close, (int)fd);
+
+	/* Close it again - should fail with EBADF */
+	errno = 0;
+	long ret = syscall(__NR_close, (int)fd);
+
+	if (ret == -1 && errno == EBADF) {
+		tap_ok("close already-closed fd returns EBADF");
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected EBADF, got %s",
+			 ret == 0 ? "success" : strerror(errno));
+		tap_fail("close already-closed fd returns EBADF", msg);
+	}
+}
+
+/*
+ * Test 16: open /dev/null with O_RDONLY|O_CLOEXEC should succeed
+ */
+static void test_open_valid_cloexec(void)
+{
+	errno = 0;
+	long fd = kapi_sys_open("/dev/null", O_RDONLY | O_CLOEXEC, 0);
+
+	if (fd >= 0) {
+		tap_ok("open /dev/null with O_RDONLY|O_CLOEXEC succeeds");
+		syscall(__NR_close, (int)fd);
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected success, got %s",
+			 strerror(errno));
+		tap_fail("open /dev/null with O_RDONLY|O_CLOEXEC succeeds",
+			 msg);
+	}
+}
+
+/*
+ * Test 17: write 0 bytes to /dev/null should return 0
+ */
+static void test_write_zero_devnull(void)
+{
+	long fd;
+
+	errno = 0;
+	fd = kapi_sys_open("/dev/null", O_WRONLY, 0);
+	if (fd < 0) {
+		tap_fail("write 0 bytes to /dev/null returns 0",
+			 "cannot open /dev/null");
+		return;
+	}
+
+	errno = 0;
+	long ret = syscall(__NR_write, (int)fd, "", 0);
+
+	if (ret == 0) {
+		tap_ok("write 0 bytes to /dev/null returns 0");
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected 0, got %ld (errno=%s)",
+			 ret, strerror(errno));
+		tap_fail("write 0 bytes to /dev/null returns 0", msg);
+	}
+
+	syscall(__NR_close, (int)fd);
+}
+
+/*
+ * Test 18: read from a write-only fd should return EBADF
+ */
+static void test_read_writeonly_fd(void)
+{
+	long fd;
+
+	errno = 0;
+	fd = kapi_sys_open("/dev/null", O_WRONLY, 0);
+	if (fd < 0) {
+		tap_fail("read from write-only fd returns EBADF",
+			 "cannot open /dev/null");
+		return;
+	}
+
+	char buf[16];
+
+	errno = 0;
+	long ret = syscall(__NR_read, (int)fd, buf, sizeof(buf));
+
+	if (ret == -1 && errno == EBADF) {
+		tap_ok("read from write-only fd returns EBADF");
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected EBADF, got %s",
+			 ret >= 0 ? "success" : strerror(errno));
+		tap_fail("read from write-only fd returns EBADF", msg);
+	}
+
+	syscall(__NR_close, (int)fd);
+}
+
+/*
+ * Test 19: write to a read-only fd should return EBADF
+ */
+static void test_write_readonly_fd(void)
+{
+	long fd;
+
+	errno = 0;
+	fd = kapi_sys_open("/dev/null", O_RDONLY, 0);
+	if (fd < 0) {
+		tap_fail("write to read-only fd returns EBADF",
+			 "cannot open /dev/null");
+		return;
+	}
+
+	errno = 0;
+	long ret = syscall(__NR_write, (int)fd, "hello", 5);
+
+	if (ret == -1 && errno == EBADF) {
+		tap_ok("write to read-only fd returns EBADF");
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected EBADF, got %s",
+			 ret >= 0 ? "success" : strerror(errno));
+		tap_fail("write to read-only fd returns EBADF", msg);
+	}
+
+	syscall(__NR_close, (int)fd);
+}
+
+/*
+ * Test 20: close fd 9999 (likely invalid) should return EBADF
+ */
+static void test_close_fd_9999(void)
+{
+	errno = 0;
+	long ret = syscall(__NR_close, 9999);
+
+	if (ret == -1 && errno == EBADF) {
+		tap_ok("close fd 9999 returns EBADF");
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected EBADF, got %s",
+			 ret == 0 ? "success" : strerror(errno));
+		tap_fail("close fd 9999 returns EBADF", msg);
+	}
+}
+
+/*
+ * Test 21: read from pipe after write end is closed returns 0 (EOF)
+ */
+static void test_read_closed_pipe(void)
+{
+	int pipefd[2];
+
+	if (pipe(pipefd) < 0) {
+		tap_fail("read from closed pipe returns 0 (EOF)",
+			 "pipe() failed");
+		return;
+	}
+
+	/* Close write end */
+	close(pipefd[1]);
+
+	char buf[16];
+
+	errno = 0;
+	long ret = syscall(__NR_read, pipefd[0], buf, sizeof(buf));
+
+	if (ret == 0) {
+		tap_ok("read from closed pipe returns 0 (EOF)");
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected 0, got %ld (errno=%s)",
+			 ret, ret < 0 ? strerror(errno) : "n/a");
+		tap_fail("read from closed pipe returns 0 (EOF)", msg);
+	}
+
+	close(pipefd[0]);
+}
+
+/*
+ * Test 22: write to pipe after read end is closed returns EPIPE + SIGPIPE
+ */
+static void test_write_closed_pipe(void)
+{
+	int pipefd[2];
+	struct sigaction sa, old_sa;
+
+	if (pipe(pipefd) < 0) {
+		tap_fail("write to closed pipe returns EPIPE + SIGPIPE",
+			 "pipe() failed");
+		return;
+	}
+
+	/* Install SIGPIPE handler */
+	memset(&sa, 0, sizeof(sa));
+	sa.sa_handler = sigpipe_handler;
+	sigemptyset(&sa.sa_mask);
+	sigaction(SIGPIPE, &sa, &old_sa);
+
+	got_sigpipe = 0;
+
+	/* Close read end */
+	close(pipefd[0]);
+
+	errno = 0;
+	long ret = syscall(__NR_write, pipefd[1], "hello", 5);
+
+	if (ret == -1 && errno == EPIPE && got_sigpipe) {
+		tap_ok("write to closed pipe returns EPIPE + SIGPIPE");
+	} else if (ret == -1 && errno == EPIPE) {
+		tap_ok("write to closed pipe returns EPIPE (SIGPIPE not caught)");
+	} else {
+		char msg[128];
+
+		snprintf(msg, sizeof(msg),
+			 "expected EPIPE, got %s (sigpipe=%d)",
+			 ret >= 0 ? "success" : strerror(errno),
+			 (int)got_sigpipe);
+		tap_fail("write to closed pipe returns EPIPE + SIGPIPE", msg);
+	}
+
+	/* Restore SIGPIPE handler */
+	sigaction(SIGPIPE, &old_sa, NULL);
+	close(pipefd[1]);
+}
+
+/*
+ * Test 23: open with O_DIRECTORY on a non-directory returns ENOTDIR
+ */
+static void test_open_directory_on_file(void)
+{
+	errno = 0;
+	long ret = kapi_sys_open("/dev/null", O_RDONLY | O_DIRECTORY, 0);
+
+	if (ret == -1 && errno == ENOTDIR) {
+		tap_ok("open O_DIRECTORY on non-directory returns ENOTDIR");
+	} else if (ret >= 0) {
+		tap_fail("open O_DIRECTORY on non-directory",
+			 "expected ENOTDIR, got success");
+		syscall(__NR_close, (int)ret);
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected ENOTDIR, got %s",
+			 strerror(errno));
+		tap_fail("open O_DIRECTORY on non-directory", msg);
+	}
+}
+
+/*
+ * Test 24: open nonexistent file without O_CREAT returns ENOENT
+ */
+static void test_open_nonexistent(void)
+{
+	errno = 0;
+	long ret = kapi_sys_open("/tmp/kapi_nonexistent_file_12345",
+				 O_RDONLY, 0);
+
+	if (ret == -1 && errno == ENOENT) {
+		tap_ok("open nonexistent file without O_CREAT returns ENOENT");
+	} else if (ret >= 0) {
+		tap_fail("open nonexistent file",
+			 "expected ENOENT, got success (file exists?)");
+		syscall(__NR_close, (int)ret);
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected ENOENT, got %s",
+			 strerror(errno));
+		tap_fail("open nonexistent file", msg);
+	}
+}
+
+/*
+ * Test 25: close stdin (fd 0) should succeed
+ * We dup it first so we can restore it.
+ */
+static void test_close_stdin(void)
+{
+	int saved_stdin = dup(0);
+
+	if (saved_stdin < 0) {
+		tap_fail("close stdin succeeds", "cannot dup stdin");
+		return;
+	}
+
+	errno = 0;
+	long ret = syscall(__NR_close, 0);
+
+	if (ret == 0) {
+		tap_ok("close stdin (fd 0) succeeds");
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected success, got %s",
+			 strerror(errno));
+		tap_fail("close stdin (fd 0) succeeds", msg);
+	}
+
+	/* Restore stdin */
+	dup2(saved_stdin, 0);
+	close(saved_stdin);
+}
+
+/*
+ * Test 26: read after close returns EBADF
+ */
+static void test_read_after_close(void)
+{
+	long fd;
+
+	errno = 0;
+	fd = kapi_sys_open("/dev/null", O_RDONLY, 0);
+	if (fd < 0) {
+		tap_fail("read after close returns EBADF",
+			 "cannot open /dev/null");
+		return;
+	}
+
+	syscall(__NR_close, (int)fd);
+
+	char buf[16];
+
+	errno = 0;
+	long ret = syscall(__NR_read, (int)fd, buf, sizeof(buf));
+
+	if (ret == -1 && errno == EBADF) {
+		tap_ok("read after close returns EBADF");
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected EBADF, got %s",
+			 ret >= 0 ? "success" : strerror(errno));
+		tap_fail("read after close returns EBADF", msg);
+	}
+}
+
+/*
+ * Test 27: write with large count
+ * Without KAPI: the kernel clamps count to MAX_RW_COUNT and succeeds.
+ * With KAPI: KAPI validates the buffer against the count and may
+ * return EFAULT/EINVAL since the buffer is smaller than count.
+ * Accept either success or EFAULT/EINVAL.
+ */
+static void test_write_large_count(void)
+{
+	long fd;
+	char buf[64] = "test data";
+
+	errno = 0;
+	fd = kapi_sys_open("/dev/null", O_WRONLY, 0);
+	if (fd < 0) {
+		tap_fail("write with large count handled correctly",
+			 "cannot open /dev/null");
+		return;
+	}
+
+	errno = 0;
+	long ret = syscall(__NR_write, (int)fd, buf, (size_t)0x7ffff000UL);
+
+	if (ret > 0) {
+		tap_ok("write with large count succeeds (clamped, no KAPI)");
+	} else if (ret == -1 && (errno == EFAULT || errno == EINVAL)) {
+		tap_ok("write with large count returns EFAULT/EINVAL (KAPI validates buffer)");
+	} else {
+		char msg[64];
+
+		snprintf(msg, sizeof(msg), "expected success or EFAULT/EINVAL, got %s",
+			 ret == 0 ? "zero" : strerror(errno));
+		tap_fail("write with large count handled correctly", msg);
+	}
+
+	syscall(__NR_close, (int)fd);
+}
+
+/* ---- Integration tests ---- */
+
+/*
+ * Test 28: full normal syscall path - open, read, write, close
+ * Verify KAPI does not interfere with normal operations.
+ */
+static void test_normal_path(void)
+{
+	long rd_fd, wr_fd;
+	char buf[128];
+	int ok = 1;
+	char reason[128] = "";
+
+	/* Open a readable file */
+	errno = 0;
+	rd_fd = kapi_sys_open("/etc/hostname", O_RDONLY, 0);
+	if (rd_fd < 0) {
+		errno = 0;
+		rd_fd = kapi_sys_open("/etc/passwd", O_RDONLY, 0);
+	}
+	if (rd_fd < 0) {
+		snprintf(reason, sizeof(reason), "open readable file: %s",
+			 strerror(errno));
+		ok = 0;
+	}
+
+	/* Read from it */
+	if (ok) {
+		errno = 0;
+		long n = syscall(__NR_read, (int)rd_fd, buf, sizeof(buf));
+
+		if (n < 0) {
+			snprintf(reason, sizeof(reason), "read: %s",
+				 strerror(errno));
+			ok = 0;
+		}
+	}
+
+	/* Open /dev/null for writing */
+	wr_fd = -1;
+	if (ok) {
+		errno = 0;
+		wr_fd = kapi_sys_open("/dev/null", O_WRONLY, 0);
+		if (wr_fd < 0) {
+			snprintf(reason, sizeof(reason),
+				 "open /dev/null: %s", strerror(errno));
+			ok = 0;
+		}
+	}
+
+	/* Write to /dev/null */
+	if (ok) {
+		errno = 0;
+		long n = syscall(__NR_write, (int)wr_fd, "test", 4);
+
+		if (n != 4) {
+			snprintf(reason, sizeof(reason), "write: %s",
+				 n < 0 ? strerror(errno) : "short write");
+			ok = 0;
+		}
+	}
+
+	/* Close both fds */
+	if (rd_fd >= 0) {
+		errno = 0;
+		if (syscall(__NR_close, (int)rd_fd) != 0 && ok) {
+			snprintf(reason, sizeof(reason), "close read fd: %s",
+				 strerror(errno));
+			ok = 0;
+		}
+	}
+
+	if (wr_fd >= 0) {
+		errno = 0;
+		if (syscall(__NR_close, (int)wr_fd) != 0 && ok) {
+			snprintf(reason, sizeof(reason), "close write fd: %s",
+				 strerror(errno));
+			ok = 0;
+		}
+	}
+
+	if (ok)
+		tap_ok("normal syscall path (open/read/write/close) works");
+	else
+		tap_fail("normal syscall path (open/read/write/close) works",
+			 reason);
+}
+
+/*
+ * Test 29: verify dmesg contains KAPI warnings for the invalid tests
+ */
+static void test_dmesg_warnings(void)
+{
+	int kmsg_fd = open("/dev/kmsg", O_RDONLY | O_NONBLOCK);
+
+	if (kmsg_fd < 0) {
+		tap_skip("dmesg contains expected KAPI warnings",
+			 "cannot open /dev/kmsg");
+		return;
+	}
+
+	/*
+	 * Seek with SEEK_DATA: on /dev/kmsg this selects the first record
+	 * after the last SYSLOG_ACTION_CLEAR (e.g. 'dmesg -c'), or the oldest
+	 * record if the log was never cleared. Older kernels (or
+	 * CONFIG_PRINTK=n builds) may reject the seek with -EINVAL; we then
+	 * can't reliably audit past warnings, so skip rather than fail.
+	 */
+	if (lseek(kmsg_fd, 0, SEEK_DATA) == (off_t)-1) {
+		tap_skip("dmesg contains expected KAPI warnings",
+			 "lseek(SEEK_DATA) not supported on /dev/kmsg");
+		close(kmsg_fd);
+		return;
+	}
+
+	char line[4096];
+	int found_invalid_bits = 0;
+	int found_null = 0;
+	ssize_t n;
+
+	for (;;) {
+		n = read(kmsg_fd, line, sizeof(line) - 1);
+		if (n > 0) {
+			line[n] = '\0';
+			if (strstr(line, "contains invalid bits"))
+				found_invalid_bits++;
+			if (strstr(line, "NULL") && strstr(line, "not allowed"))
+				found_null++;
+		} else if (n == -1 && errno == EPIPE) {
+			/* Ring buffer wrapped, continue reading */
+			continue;
+		} else {
+			/* EAGAIN (no more messages) or other error */
+			break;
+		}
+	}
+
+	close(kmsg_fd);
+
+	if (found_invalid_bits >= 2 && found_null >= 1) {
+		tap_ok("dmesg contains expected KAPI warnings");
+	} else if (found_invalid_bits >= 1 || found_null >= 1) {
+		char msg[128];
+
+		snprintf(msg, sizeof(msg),
+			 "partial: invalid_bits=%d null=%d",
+			 found_invalid_bits, found_null);
+		tap_ok(msg);
+	} else {
+		tap_fail("dmesg KAPI warnings",
+			 "no KAPI warnings found in dmesg");
+	}
+}
+
+int main(void)
+{
+	ksft_print_header();
+	ksft_set_plan(NUM_TESTS);
+
+	/* Valid operations (1-4) */
+	int fd = test_open_valid();
+
+	if (fd >= 0)
+		test_read_valid(fd);
+	else
+		tap_fail("read from valid fd", "no fd from open");
+
+	test_write_valid();
+
+	if (fd >= 0)
+		test_close_valid(fd);
+	else
+		tap_fail("close valid fd", "no fd from open");
+
+	/* KAPI parameter rejection (5-8) */
+	test_open_invalid_flags();
+	test_open_invalid_mode();
+	test_open_null_path();
+	test_open_flag_bit30();
+
+	/* Boundary conditions and error paths (9-20) */
+	test_read_bad_fd();
+	test_read_zero_count();
+	test_write_zero_count();
+	test_open_long_path();
+	test_read_unmapped_buf();
+	test_write_unmapped_buf();
+	test_close_already_closed();
+	test_open_valid_cloexec();
+	test_write_zero_devnull();
+	test_read_writeonly_fd();
+	test_write_readonly_fd();
+	test_close_fd_9999();
+
+	/* Pipe and lifecycle tests (21-27) */
+	test_read_closed_pipe();
+	test_write_closed_pipe();
+	test_open_directory_on_file();
+	test_open_nonexistent();
+	test_close_stdin();
+	test_read_after_close();
+	test_write_large_count();
+
+	/* Integration (28-29) */
+	test_normal_path();
+	test_dmesg_warnings();
+
+	ksft_finished();
+	return 0;
+}
-- 
2.53.0



* [PATCH v4 10/11] kernel/api: add API specification for sys_madvise
From: Sasha Levin @ 2026-05-29 23:33 UTC
  To: linux-api, linux-kernel
  Cc: linux-doc, linux-fsdevel, linux-kbuild, linux-kselftest,
	workflows, tools, x86, Thomas Gleixner, Paul E . McKenney,
	Greg Kroah-Hartman, Jonathan Corbet, Dmitry Vyukov, Randy Dunlap,
	Cyril Hrubis, Kees Cook, Jake Edge, David Laight,
	Gabriele Paoloni, Mauro Carvalho Chehab, Christian Brauner,
	Alexander Viro, Andrew Morton, Masahiro Yamada, Shuah Khan,
	Arnd Bergmann, Nathan Chancellor, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers
In-Reply-To: <20260529233311.1901670-1-sashal@kernel.org>

Add KAPI-annotated kerneldoc for the sys_madvise system call in
mm/madvise.c.

The specification documents parameter constraints (start, len_in,
behavior), per-behavior error conditions, lock acquisition (mmap_lock
read and write modes plus the per-VMA fast path, mmu_gather and
mmu_notifier brackets), signal handling, side effects, capability
requirements (CAP_SYS_ADMIN for MADV_HWPOISON and MADV_SOFT_OFFLINE),
mseal interaction, and the heterogeneous skip semantics across the
hint, immediate-action and destructive groups.

Signed-off-by: Sasha Levin <sashal@kernel.org>
---
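Illustration only, not part of the patch: a minimal userspace sketch of
two behaviors the specification covers. The first is a parameter
constraint: a start address that is not page-aligned fails with EINVAL.
The second is one destructive behavior: MADV_DONTNEED on a private
anonymous mapping discards the contents, so the next access sees
zero-filled pages. This assumes the glibc madvise() wrapper. The helper
names map_one_page(), dontneed_zeroes_page() and
unaligned_start_is_einval() are hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map one private anonymous page, NULL on failure. */
unsigned char *map_one_page(size_t *len)
{
	void *p;

	*len = (size_t)sysconf(_SC_PAGESIZE);
	p = mmap(NULL, *len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	return p == MAP_FAILED ? NULL : p;
}

/* Returns 1 if MADV_DONTNEED succeeds and the page then reads as zeroes. */
int dontneed_zeroes_page(void)
{
	size_t len;
	unsigned char *p = map_one_page(&len);
	int ok;

	if (!p)
		return 0;
	memset(p, 0xAB, len);
	ok = madvise(p, len, MADV_DONTNEED) == 0 &&
	     p[0] == 0 && p[len - 1] == 0;
	munmap(p, len);
	return ok;
}

/* Returns 1 if a start that is not page-aligned fails with EINVAL. */
int unaligned_start_is_einval(void)
{
	size_t len;
	unsigned char *p = map_one_page(&len);
	int ok;

	if (!p)
		return 0;
	errno = 0;
	ok = madvise(p + 1, len - 1, MADV_NORMAL) == -1 && errno == EINVAL;
	munmap(p, len);
	return ok;
}
```

MADV_NORMAL is used for the alignment probe because it is a pure hint, so
the only reason for the call to fail is the misaligned start.
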
 mm/madvise.c | 575 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 575 insertions(+)

diff --git a/mm/madvise.c b/mm/madvise.c
index dbb69400786d1..ed0a046e9e25b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -2032,6 +2032,581 @@ int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in, int beh
 	return error;
 }
 
+/**
+ * sys_madvise - Give advice about use of memory
+ * @start: Starting virtual address of the range to advise on
+ * @len_in: Length of the range in bytes
+ * @behavior: Advice (a MADV_* constant) the kernel should apply to the range
+ *
+ * long-desc: Provides the kernel with advice or directions about the address
+ *   range starting at start and extending for len_in bytes. The advice is
+ *   selected by behavior, which is one of the MADV_* constants defined in
+ *   <sys/mman.h>. The semantics fall into three groups. The hint group
+ *   primarily updates VMA flags (MADV_NORMAL, MADV_RANDOM, MADV_SEQUENTIAL,
+ *   MADV_DONTFORK, MADV_DOFORK, MADV_DONTDUMP, MADV_DODUMP, MADV_WIPEONFORK,
+ *   MADV_KEEPONFORK, MADV_MERGEABLE, MADV_UNMERGEABLE, MADV_HUGEPAGE,
+ *   MADV_NOHUGEPAGE), and is itself heterogeneous: the fork-copy gates
+ *   (MADV_DONTFORK / MADV_DOFORK) and the KSM scan gates
+ *   (MADV_MERGEABLE / MADV_UNMERGEABLE) are strictly honored by their
+ *   consumers; MADV_HUGEPAGE / MADV_NOHUGEPAGE express THP eligibility
+ *   advice rather than allocation guarantees (MADV_NOHUGEPAGE blocks the
+ *   normal fault-time, MADV_COLLAPSE and khugepaged paths; MADV_HUGEPAGE
+ *   widens eligibility and increases defrag aggressiveness but does not
+ *   force allocation, which still depends on the global
+ *   transparent_hugepage= mode, VMA suitability, and allocation success);
+ *   MADV_WIPEONFORK / MADV_KEEPONFORK do not wipe at fork time but cause
+ *   the child's first access to fault in zero-filled pages;
+ *   MADV_DONTDUMP / MADV_DODUMP normally control coredump inclusion but
+ *   can be overridden by always_dump_vma() for gate, vm_ops-named or
+ *   arch-named VMAs; and MADV_NORMAL / MADV_RANDOM / MADV_SEQUENTIAL are
+ *   genuinely heuristic read-ahead hints. The non-destructive
+ *   immediate-action group performs work synchronously while
+ *   preserving page contents (MADV_WILLNEED, MADV_COLD,
+ *   MADV_PAGEOUT, MADV_POPULATE_READ, MADV_POPULATE_WRITE, MADV_COLLAPSE,
+ *   MADV_GUARD_REMOVE). The destructive group discards, replaces or
+ *   invalidates page contents (MADV_DONTNEED, MADV_DONTNEED_LOCKED,
+ *   MADV_FREE, MADV_REMOVE, MADV_GUARD_INSTALL, MADV_HWPOISON,
+ *   MADV_SOFT_OFFLINE). MADV_GUARD_INSTALL belongs to the destructive group
+ *   because it zaps any existing pages in the range before installing PTE
+ *   guard markers.
+ *
+ *   start must be page-aligned; len_in is rounded up to the next page
+ *   boundary internally. Once those validation checks pass, a zero-length
+ *   range succeeds without performing work. The kernel rejects ranges that
+ *   wrap (start + PAGE_ALIGN(len_in) < start) and ranges where len_in is
+ *   non-zero but rounds up to zero. Address tagging bits are stripped
+ *   from start before VMA lookup for every behavior except MADV_HWPOISON
+ *   and MADV_SOFT_OFFLINE, which receive the raw start value because they
+ *   bypass the VMA walk entirely.
+ *
+ *   The kernel return value reports whether any error condition was
+ *   encountered, not whether the requested work was performed. The
+ *   relationship between the return code and the work done varies by
+ *   handler:
+ *
+ *     - Hint behaviors update VMA flags. The flags fall into five
+ *       sub-classes by how their consumers honor them:
+ *       (a) Hard gates -- MADV_DONTFORK / MADV_DOFORK strictly gate VMA
+ *       copy in dup_mmap(); MADV_MERGEABLE / MADV_UNMERGEABLE strictly
+ *       gate whether KSM will scan the VMA at all. These take effect
+ *       immediately and cannot be overridden by other policy.
+ *       (b) THP eligibility advice -- MADV_NOHUGEPAGE blocks the normal
+ *       fault-time, MADV_COLLAPSE and khugepaged THP paths for the VMA
+ *       (driver-internal PMD insertion via insert_pmd() is the only
+ *       documented bypass). MADV_HUGEPAGE only widens THP eligibility
+ *       under the kernel's "madvise" / "except-advised" policy and
+ *       increases defrag aggressiveness; it does not force allocation,
+ *       which still depends on the global transparent_hugepage= mode
+ *       (always / madvise / never), VMA suitability, defrag GFP policy,
+ *       and allocation or memcg-charge success.
+ *       (c) Fault-on-access -- MADV_WIPEONFORK / MADV_KEEPONFORK do not
+ *       wipe pages at fork time; instead the child VMA's pages are not
+ *       copied and the child sees zero-filled pages only when it first
+ *       reads or writes them.
+ *       (d) Mostly-strict with override -- MADV_DONTDUMP / MADV_DODUMP
+ *       control coredump inclusion via VM_DONTDUMP, but always_dump_vma()
+ *       can still include gate, vm_ops-named or arch-named VMAs in the
+ *       core regardless.
+ *       (e) Heuristic -- MADV_NORMAL / MADV_RANDOM / MADV_SEQUENTIAL set
+ *       VM_RAND_READ / VM_SEQ_READ as read-ahead hints that the read-ahead
+ *       code weighs against other policy and may diverge from at runtime.
+ *       In all five sub-classes the requested flag bits on the VMA are set;
+ *       what differs is the strength of the resulting downstream effect.
+ *
+ *     - Walk-and-skip handlers (MADV_COLD, MADV_PAGEOUT, MADV_FREE,
+ *       MADV_GUARD_REMOVE) traverse the range and silently skip pages or
+ *       PMDs that fail per-page preconditions (absent, special, device,
+ *       shared, non-LRU, unsplittable, locked, etc.), returning 0 even
+ *       when most or all pages were skipped.
+ *
+ *     - Bulk-backend handlers delegate the requested range to a single
+ *       backend call: MADV_DONTNEED and MADV_DONTNEED_LOCKED to
+ *       zap_page_range_single_batched(), MADV_REMOVE to vfs_fallocate(),
+ *       MADV_WILLNEED on regular files to vfs_fadvise(). The backend's
+ *       return is propagated for MADV_REMOVE and discarded for
+ *       MADV_WILLNEED; DAX files short-circuit MADV_WILLNEED entirely.
+ *
+ *     - Stop-on-error handlers (MADV_POPULATE_READ, MADV_POPULATE_WRITE,
+ *       MADV_SOFT_OFFLINE) walk the range but surface the first per-page
+ *       failure as an errno (-EHWPOISON, -EFAULT, -ENOMEM, ...) rather
+ *       than skipping silently.
+ *
+ *     - Hybrid handlers combine modes: MADV_WILLNEED walks for anonymous
+ *       and shmem ranges but bulk-calls vfs_fadvise() for regular files;
+ *       MADV_COLLAPSE walks PMD-by-PMD and tracks the last scan failure
+ *       so transient skips coexist with terminal errors;
+ *       MADV_GUARD_INSTALL walks to install markers and re-walks after
+ *       zap_page_range_single() to clear pre-existing pages, retrying up
+ *       to MAX_MADVISE_GUARD_RETRIES; MADV_HWPOISON walks pages but folds
+ *       memory_failure()'s -EOPNOTSUPP back to 0.
+ *
+ *   Applications that need to know whether a specific page was acted on
+ *   must verify the result through other means (e.g. /proc/[pid]/smaps,
+ *   page faults, read-after-write).
+ *
+ *   On success, madvise() returns 0; unlike read(2) and write(2) it has no
+ *   notion of partial completion at the syscall boundary. When the range
+ *   spans multiple VMAs, the kernel applies the advice to each in turn; an
+ *   unmapped gap inside the range causes the call to return -ENOMEM after
+ *   processing the mapped portions, rather than aborting at the gap.
+ *
+ *   POSIX defines posix_madvise(3) for a portable subset (POSIX_MADV_NORMAL,
+ *   _RANDOM, _SEQUENTIAL, _WILLNEED, _DONTNEED). Linux MADV_DONTNEED is
+ *   destructive: it discards the contents of the affected anonymous pages and
+ *   subsequent reads return zero. POSIX permits but does not require
+ *   destruction, so portable code that needs the POSIX semantics should use
+ *   posix_madvise(3) instead.
+ *
+ * contexts: process, sleepable
+ *
+ * param: start
+ *   type: uint, input
+ *   constraint-type: page_aligned
+ *   cdesc: Starting virtual address of the range. Must be aligned to
+ *     PAGE_SIZE. An unaligned start always returns -EINVAL, even when
+ *     len_in is zero. Address tag bits, where supported by the architecture,
+ *     are cleared via untagged_addr() before the range is interpreted, with
+ *     the exception of MADV_HWPOISON and MADV_SOFT_OFFLINE, which receive
+ *     the raw start value because they bypass the VMA walk.
+ *
+ * param: len_in
+ *   type: uint, input
+ *   constraint-type: range(0, SIZE_MAX)
+ *   cdesc: Length of the range in bytes. Internally rounded up to a multiple
+ *     of PAGE_SIZE. A len_in of 0 is accepted and the call is a no-op that
+ *     returns 0. A non-zero len_in that rounds up to 0 (i.e. wraps around)
+ *     returns -EINVAL, as does a range whose end (start + PAGE_ALIGN(len_in))
+ *     would wrap below start.
+ *
+ * param: behavior
+ *   type: int, input
+ *   cdesc: One of the MADV_* constants from <sys/mman.h>. See the long
+ *     description above for the full list and the three semantic groups
+ *     (hint, immediate-action, destructive). Behaviors gated by Kconfig
+ *     (KSM, transparent hugepage, memory failure) return -EINVAL when the
+ *     underlying support is disabled. A few architectures (notably alpha)
+ *     renumber values; portable code should always use the symbolic names.
+ *
+ * return:
+ *   type: int
+ *   check-type: exact
+ *   success: 0
+ *   desc: On success, returns 0. On error, returns a negative error code.
+ *     There is no partial-success indication; either the entire processed
+ *     range succeeded, or an error is returned and an unspecified part of
+ *     the range may have been advised.
+ *
+ * error: EINVAL, Invalid argument
+ *   desc: Returned for invalid input (unrecognised MADV_*, Kconfig-gated
+ *     behavior, unaligned start, range wrap, non-zero len_in rounding to
+ *     zero) and for per-behavior VMA-filter violations. The constraint:
+ *     blocks cover FREE, WIPEONFORK, REMOVE, COLD and PAGEOUT; inline
+ *     filters also reject DOFORK on VM_SPECIAL, KEEPONFORK on VM_DROPPABLE,
+ *     DODUMP on non-hugetlb VM_SPECIAL/VM_DROPPABLE, GUARD_* on VM_SPECIAL
+ *     or VM_HUGETLB, and GUARD_INSTALL on VM_LOCKED. Also returned by
+ *     faultin_page_range() and madvise_collapse_errno().
+ *
+ * error: ENOMEM, Cannot allocate memory
+ *   desc: Some part of the requested range falls in a gap between mapped
+ *     VMAs; the kernel still applies the behavior to the mapped subranges
+ *     and only returns -ENOMEM after the walk completes. MADV_POPULATE_*
+ *     also returns -ENOMEM when the region has no VMA or when
+ *     faultin_page_range() exhausts memory. MADV_COLLAPSE returns -ENOMEM
+ *     when its struct collapse_control cannot be allocated up front, and
+ *     when madvise_collapse_errno() maps SCAN_ALLOC_HUGE_PAGE_FAIL (no
+ *     hugepage available) to -ENOMEM.
+ *
+ * error: EAGAIN, Resource temporarily unavailable
+ *   desc: For the VMA-flag-mutating behaviors, an internal -ENOMEM from VMA
+ *     splitting is translated to -EAGAIN before being returned to userspace,
+ *     advising the caller that a transient kernel resource shortage
+ *     prevented the update. Also returned by MADV_COLLAPSE via
+ *     madvise_collapse_errno() for transient scan failures (folio lock
+ *     contention, LRU isolation failure, dirty/writeback) where retrying
+ *     the call may succeed.
+ *
+ * error: EIO, Input/output error
+ *   desc: For MADV_REMOVE, an I/O error from the underlying filesystem's
+ *     FALLOC_FL_PUNCH_HOLE handler is propagated back as -EIO. MADV_WILLNEED
+ *     and MADV_PAGEOUT do not surface filesystem or device I/O errors:
+ *     vfs_fadvise() returns are discarded by madvise_willneed() and the
+ *     pageout walk is invoked through a void helper, so transient I/O
+ *     failures during read-ahead or page-out are silently dropped.
+ *
+ * error: EBADF, Bad file descriptor
+ *   desc: Returned by MADV_WILLNEED when applied to a non-file-backed VMA
+ *     and the kernel was built without CONFIG_SWAP, so there is neither a
+ *     file to read-ahead from nor a swap device to fault from.
+ *
+ * error: EACCES, Permission denied
+ *   desc: Returned by MADV_REMOVE when the target VMA is not a writable
+ *     shared mapping (vma_is_shared_maywrite() is false). Punching a hole in
+ *     a private or read-only shared mapping is not permitted; the operation
+ *     would either be invisible to other mappers or violate file permissions.
+ *
+ * error: EPERM, Operation not permitted
+ *   desc: Returned in two situations. First, MADV_HWPOISON and
+ *     MADV_SOFT_OFFLINE require CAP_SYS_ADMIN; the inject-error handler
+ *     refuses non-privileged callers. Second, on 64-bit kernels, a discard
+ *     operation (MADV_FREE, MADV_DONTNEED, MADV_DONTNEED_LOCKED, MADV_REMOVE,
+ *     MADV_DONTFORK, MADV_WIPEONFORK, MADV_GUARD_INSTALL) is refused on a
+ *     read-only anonymous VMA that has been sealed with mseal(2), to prevent
+ *     bypassing the seal by discarding mapped data.
+ *
+ * error: EINTR, Interrupted system call
+ *   desc: Returned when a fatal signal is delivered while the call is
+ *     waiting to acquire the mmap write lock for a VMA-flag-mutating
+ *     behavior (mmap_write_lock_killable() returns -EINTR), or when
+ *     MADV_POPULATE_READ/MADV_POPULATE_WRITE is interrupted while faulting
+ *     in pages (faultin_page_range() returns -EINTR). The single-shot
+ *     madvise() syscall is not automatically restarted by the signal
+ *     framework on this path; the caller must reissue the request if
+ *     desired.
+ *
+ * error: EHWPOISON, Memory page has hardware error
+ *   desc: MADV_POPULATE_READ or MADV_POPULATE_WRITE encountered a page that
+ *     has been marked as containing a hardware-detected memory error and
+ *     could not be faulted in.
+ *
+ * error: EFAULT, Bad address
+ *   desc: MADV_POPULATE_READ or MADV_POPULATE_WRITE attempted to fault in a
+ *     page whose mapping raised VM_FAULT_SIGBUS or VM_FAULT_SIGSEGV (for
+ *     example, a file-backed page beyond the end of the file).
+ *
+ * error: EBUSY, Device or resource busy
+ *   desc: Returned by MADV_COLLAPSE via madvise_collapse_errno() in two
+ *     specific scan-failure modes: SCAN_CGROUP_CHARGE_FAIL (the new
+ *     hugepage cannot be charged to the memory cgroup) and
+ *     SCAN_EXCEED_NONE_PTE (too many absent PTEs in the candidate range
+ *     for a synchronous collapse). Other transient collapse failures are
+ *     reported as -EAGAIN; non-transient ones as -EINVAL.
+ *
+ * lock: mm->mmap_lock (read mode)
+ *   type: rwlock
+ *   acquired: yes
+ *   released: yes
+ *   desc: Held on entry to the VMA walk for MADV_REMOVE, MADV_WILLNEED,
+ *     MADV_COLD, MADV_PAGEOUT and MADV_COLLAPSE, and as the fallback when
+ *     the per-VMA fast path declines. Several handlers drop and reacquire
+ *     this lock mid-operation: MADV_WILLNEED on a file-backed VMA and
+ *     MADV_REMOVE around their vfs_fadvise() / vfs_fallocate() callouts;
+ *     MADV_COLLAPSE on file-backed ranges around the page migration
+ *     pipeline; MADV_POPULATE_* (dispatched directly to madvise_populate())
+ *     around each faultin_page_range() call, which may itself drop the
+ *     lock internally before returning.
+ *
+ * lock: mm->mmap_lock (write mode; killable)
+ *   type: rwlock
+ *   acquired: yes
+ *   released: yes
+ *   desc: Acquired in killable write mode for behaviors that modify
+ *     vma->vm_flags or split/merge VMAs (MADV_NORMAL, MADV_RANDOM,
+ *     MADV_SEQUENTIAL, MADV_DONTFORK, MADV_DOFORK, MADV_DONTDUMP, MADV_DODUMP,
+ *     MADV_WIPEONFORK, MADV_KEEPONFORK, MADV_MERGEABLE, MADV_UNMERGEABLE,
+ *     MADV_HUGEPAGE, MADV_NOHUGEPAGE). If the acquisition is killed by a
+ *     fatal signal, the syscall returns -EINTR before any VMA is touched.
+ *
+ * lock: per-VMA read lock (vma->vm_lock)
+ *   type: custom
+ *   acquired: yes
+ *   released: yes
+ *   desc: Tried first for MADV_DONTNEED, MADV_DONTNEED_LOCKED, MADV_FREE,
+ *     MADV_GUARD_INSTALL and MADV_GUARD_REMOVE via lock_vma_under_rcu(). The
+ *     per-VMA path is taken only when the requested range fits within a
+ *     single VMA, the target mm is the caller's mm, the VMA is not armed
+ *     with userfaultfd, and (for behaviors that establish page tables) an
+ *     anon_vma is already attached. Otherwise the code falls back to the
+ *     mmap read lock above.
+ *
+ * lock: mmu_gather TLB batch
+ *   type: custom
+ *   acquired: yes
+ *   released: yes
+ *   desc: For MADV_DONTNEED, MADV_DONTNEED_LOCKED and MADV_FREE the syscall
+ *     wraps the per-VMA work in tlb_gather_mmu() / tlb_finish_mmu() so PTE
+ *     clearing and TLB invalidation are batched. MADV_COLD and MADV_PAGEOUT
+ *     build a short-lived gather inside the handler. MADV_GUARD_INSTALL
+ *     builds a transient gather via zap_page_range_single() each time the
+ *     retry loop has to clear pre-existing pages; if the range is already
+ *     empty no gather is built. MADV_GUARD_REMOVE never zaps and never
+ *     gathers.
+ *
+ * lock: mmu_notifier invalidate range
+ *   type: custom
+ *   acquired: yes
+ *   released: yes
+ *   desc: All zap-based paths -- MADV_DONTNEED, MADV_DONTNEED_LOCKED, the
+ *     zap branch of MADV_GUARD_INSTALL via zap_page_range_single(), and
+ *     MADV_FREE's own walk -- bracket their work with
+ *     mmu_notifier_invalidate_range_start()/_end() so secondary MMUs (KVM,
+ *     IOMMUv2, etc.) observe the page clearing.
+ *
+ * signal: Any fatal signal
+ *   direction: receive
+ *   action: return
+ *   condition: Acquiring the mmap write lock or faulting in pages for
+ *     MADV_POPULATE_*
+ *   desc: A pending fatal signal aborts mmap_write_lock_killable() (used by
+ *     the VMA-flag-mutating behaviors) and faultin_page_range() (used by
+ *     MADV_POPULATE_READ and MADV_POPULATE_WRITE), in both cases surfacing as
+ *     -EINTR to userspace. The single-shot madvise() syscall does not request
+ *     transparent restart on these paths; the caller is expected to reissue
+ *     the call if appropriate.
+ *   errno: -EINTR
+ *   timing: during
+ *   restartable: no
+ *
+ * side-effect: modify_state
+ *   target: vma->vm_flags
+ *   condition: Hint-group behaviors (MADV_NORMAL, MADV_RANDOM, MADV_SEQUENTIAL,
+ *     MADV_DONTFORK, MADV_DOFORK, MADV_DONTDUMP, MADV_DODUMP, MADV_WIPEONFORK,
+ *     MADV_KEEPONFORK, MADV_MERGEABLE, MADV_UNMERGEABLE, MADV_HUGEPAGE,
+ *     MADV_NOHUGEPAGE)
+ *   desc: Sets or clears VM_RAND_READ, VM_SEQ_READ, VM_DONTCOPY, VM_DONTDUMP,
+ *     VM_WIPEONFORK, VM_MERGEABLE or VM_HUGEPAGE on the affected VMAs and may
+ *     split or merge VMAs to apply the change to a sub-range. The change is
+ *     reversible by issuing madvise() with the inverse advice (e.g.
+ *     MADV_DOFORK undoes MADV_DONTFORK), with the caveat that the inverse
+ *     call's per-VMA filter still applies: MADV_DOFORK rejects VM_SPECIAL,
+ *     MADV_DODUMP rejects non-hugetlb VM_SPECIAL or VM_DROPPABLE, and
+ *     MADV_KEEPONFORK rejects VM_DROPPABLE; for those classes of VMA the
+ *     inverse cannot complete.
+ *   reversible: yes
+ *
+ * side-effect: free_memory | modify_state | irreversible
+ *   target: page tables and resident pages within the range
+ *   condition: MADV_DONTNEED, MADV_DONTNEED_LOCKED, MADV_FREE
+ *   desc: MADV_DONTNEED zaps PTEs, releasing the underlying pages or swap
+ *     slots so the next access faults in zero-filled anonymous pages or
+ *     re-reads the file. MADV_DONTNEED_LOCKED is identical but tolerates
+ *     VM_LOCKED. MADV_FREE marks anonymous pages lazy-freeable: clean pages
+ *     may be reclaimed under memory pressure, while writes before
+ *     reclamation cancel the lazy-free. Discarded data cannot be recovered.
+ *   reversible: no
+ *
+ * side-effect: filesystem | irreversible
+ *   target: backing file (FALLOC_FL_PUNCH_HOLE)
+ *   condition: MADV_REMOVE
+ *   desc: Calls vfs_fallocate(FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE) on
+ *     the backing file, deallocating the corresponding file blocks. The hole
+ *     is visible to all mappers of the file and to read(2)/write(2)
+ *     callers; subsequent reads return zero. Filesystem freeze protection,
+ *     i_rwsem and any quota/space accounting are taken by the underlying
+ *     fallocate path.
+ *   reversible: no
+ *
+ * side-effect: modify_state | schedule
+ *   target: LRU lists and page reclaim
+ *   condition: MADV_COLD, MADV_PAGEOUT
+ *   desc: MADV_COLD deactivates the affected pages, moving them to the
+ *     inactive LRU and clearing PG_referenced/PG_young so they are reclaimed
+ *     sooner under pressure. MADV_PAGEOUT additionally calls reclaim_pages()
+ *     to write dirty pages out and drop clean ones synchronously. Page data
+ *     is preserved (rereads will fault in the same content), but the I/O and
+ *     LRU bookkeeping cannot be undone.
+ *   reversible: no
+ *
+ * side-effect: modify_state
+ *   target: page tables (faultin)
+ *   condition: MADV_POPULATE_READ, MADV_POPULATE_WRITE
+ *   desc: Walks the requested range with faultin_page_range(), populating
+ *     PTEs by triggering read or write faults so subsequent accesses do not
+ *     fault. Equivalent to touching every page in the range while suppressing
+ *     SIGBUS/SIGSEGV through the syscall return value. Allocations made by
+ *     faultin are not undone on partial failure.
+ *   reversible: no
+ *
+ * side-effect: modify_state | schedule
+ *   target: transparent hugepage layout
+ *   condition: MADV_COLLAPSE
+ *   desc: Synchronously coalesces base pages in the range into a PMD-sized
+ *     transparent hugepage when the mapping permits. Performs the same page
+ *     migration and zeroing that khugepaged would do asynchronously; the
+ *     range's data is preserved across the collapse.
+ *   reversible: no
+ *
+ * side-effect: free_memory | modify_state | irreversible
+ *   target: PTE marker (PTE_MARKER_GUARD)
+ *   condition: MADV_GUARD_INSTALL, MADV_GUARD_REMOVE
+ *   desc: MADV_GUARD_INSTALL installs PTE_MARKER_GUARD entries that cause
+ *     subsequent accesses to deliver SIGSEGV without consuming physical
+ *     memory; existing pages already mapped in the range are zapped via
+ *     zap_page_range_single() before the markers are installed, so any
+ *     prior contents are lost. MADV_GUARD_REMOVE clears the markers but
+ *     does not (and cannot) restore zapped data.
+ *   reversible: no
+ *
+ * side-effect: hardware | irreversible
+ *   target: physical page (memory_failure / soft_offline_page)
+ *   condition: MADV_HWPOISON, MADV_SOFT_OFFLINE
+ *   desc: MADV_HWPOISON marks the affected pages as containing an
+ *     unrecoverable hardware error using the same machine-check path that
+ *     real ECC failures take; MADV_SOFT_OFFLINE migrates the contents off
+ *     the affected pages and removes them from the buddy allocator. Both
+ *     paths affect physical memory bookkeeping kernel-wide and cannot be
+ *     undone without a reboot. Intended for testing the memory-failure
+ *     pipeline; restricted to CAP_SYS_ADMIN.
+ *   reversible: no
+ *
+ * side-effect: modify_state
+ *   target: KSM merge state (vm_flags & VM_MERGEABLE)
+ *   condition: MADV_MERGEABLE, MADV_UNMERGEABLE
+ *   desc: Toggles the VMA's eligibility for the kernel same-page merger.
+ *     Enabling merging may later cause identical anonymous pages to be
+ *     replaced by shared, write-protected copies; disabling merging tears
+ *     any existing merges down lazily. The flag toggle itself is reversible
+ *     by issuing the inverse advice.
+ *   reversible: yes
+ *
+ * side-effect: modify_state
+ *   target: userfaultfd event queue
+ *   condition: MADV_DONTNEED, MADV_DONTNEED_LOCKED, MADV_FREE, MADV_REMOVE
+ *     on a userfaultfd-armed VMA
+ *   desc: Generates a UFFD_EVENT_REMOVE notification covering the discarded
+ *     range so userfaultfd monitors observing the mapping see the
+ *     invalidation. The event is queued before the discard takes effect; the
+ *     monitor cannot veto it.
+ *   reversible: no
+ *
+ * capability: CAP_SYS_ADMIN
+ *   type: perform_operation
+ *   allows: Inject memory errors via MADV_HWPOISON or MADV_SOFT_OFFLINE
+ *   without: Both behaviors return -EPERM
+ *   condition: Checked at entry to madvise_inject_error() before any pages
+ *     are looked up
+ *
+ * constraint: Page-aligned start
+ *   desc: start must lie on a page boundary; otherwise the call returns
+ *     -EINVAL before any VMA is consulted.
+ *   expr: (start & (PAGE_SIZE - 1)) == 0
+ *
+ * constraint: Length rounded up to PAGE_SIZE
+ *   desc: The effective range length is PAGE_ALIGN(len_in). A non-zero len_in
+ *     that overflows during rounding, or a (start, end) range that wraps,
+ *     is rejected with -EINVAL.
+ *   expr: end = start + PAGE_ALIGN(len_in); end >= start
+ *
+ * constraint: Behavior must be supported
+ *   desc: behavior must be one of the MADV_* values listed under the
+ *     behavior parameter. Behaviors gated by Kconfig (KSM, THP, memory
+ *     failure) are rejected with -EINVAL when the corresponding option is
+ *     disabled in the running kernel.
+ *
+ * constraint: mseal-protected discards
+ *   desc: On 64-bit kernels, a discard operation (FREE, DONTNEED,
+ *     DONTNEED_LOCKED, REMOVE, DONTFORK, WIPEONFORK, GUARD_INSTALL) against
+ *     a sealed anonymous VMA is rejected unless the mapping is currently
+ *     writable -- both VM_WRITE in vm_flags and arch_vma_access_permitted()
+ *     allowing write -- so that mseal(2) cannot be bypassed by instructing
+ *     the kernel to throw the data away. File-backed sealed VMAs and
+ *     writable sealed VMAs are not subject to this restriction.
+ *   expr: !is_discard(behavior) || !vma_is_sealed(vma) ||
+ *     !vma_is_anonymous(vma) || ((vma->vm_flags & VM_WRITE) &&
+ *     arch_vma_access_permitted(vma, true, false, false))
+ *
+ * constraint: MADV_FREE requires anonymous mappings
+ *   desc: MADV_FREE is defined only over anonymous mappings; the handler
+ *     rejects file-backed VMAs with -EINVAL.
+ *   expr: vma_is_anonymous(vma)
+ *
+ * constraint: MADV_WIPEONFORK requires private anonymous mappings
+ *   desc: MADV_WIPEONFORK rejects file-backed mappings and shared anonymous
+ *     mappings; only MAP_PRIVATE anonymous VMAs accept it. Both rejections
+ *     surface as -EINVAL.
+ *   expr: !vma->vm_file && !(vma->vm_flags & VM_SHARED)
+ *
+ * constraint: MADV_REMOVE requires a writable shared file mapping
+ *   desc: MADV_REMOVE rejects VM_LOCKED VMAs, VMAs without an associated
+ *     file/mapping/host inode, and non-shared-writable mappings. The first
+ *     two cases return -EINVAL; a private or read-only shared mapping
+ *     returns -EACCES.
+ *   expr: !(vma->vm_flags & VM_LOCKED) && vma->vm_file &&
+ *     vma->vm_file->f_mapping && vma->vm_file->f_mapping->host &&
+ *     vma_is_shared_maywrite(vma)
+ *
+ * constraint: MADV_COLD / MADV_PAGEOUT VMA filter
+ *   desc: Both behaviors require LRU-managed pages; they reject VMAs that
+ *     are mlocked, raw-PFN or hugetlb.
+ *   expr: !(vma->vm_flags & (VM_LOCKED | VM_PFNMAP | VM_HUGETLB))
+ *
+ * examples: madvise(p, len, MADV_SEQUENTIAL);  // set VM_SEQ_READ on the VMA
+ *   madvise(p, len, MADV_POPULATE_WRITE);  // prefault writable PTEs
+ *   madvise(p, len, MADV_DONTNEED);        // discard anonymous pages
+ *   madvise(p, len, MADV_GUARD_INSTALL);   // install SIGSEGV guard pages
+ *
+ * notes: madvise(2) reports only whether an error condition was
+ *   encountered, not whether the requested work was performed. The hint
+ *   group sets VMA flags whose downstream strictness varies:
+ *   MADV_DONTFORK / MADV_DOFORK and MADV_MERGEABLE / MADV_UNMERGEABLE are
+ *   hard gates honored by fork-copy and KSM scanning respectively;
+ *   MADV_NOHUGEPAGE is a hard gate against the normal user-visible THP
+ *   paths but MADV_HUGEPAGE is eligibility/advice that does not force
+ *   THP installation -- the global transparent_hugepage= mode, VMA
+ *   suitability and allocation success still apply; MADV_WIPEONFORK /
+ *   MADV_KEEPONFORK take effect at the child's first page access
+ *   (zero-on-fault), not at fork time; MADV_DONTDUMP / MADV_DODUMP gate
+ *   coredump inclusion but can be overridden by always_dump_vma();
+ *   MADV_NORMAL / MADV_RANDOM / MADV_SEQUENTIAL are heuristic read-ahead
+ *   hints that the read-ahead code may weigh against other policy. The
+ *   non-hint behaviors are not uniform: walk-and-skip handlers (COLD,
+ *   PAGEOUT, FREE, GUARD_REMOVE) silently skip pages that fail per-page
+ *   preconditions;
+ *   bulk-backend handlers (DONTNEED, DONTNEED_LOCKED, REMOVE, and
+ *   WILLNEED on regular files) delegate the range to a single backend
+ *   call whose return is propagated for REMOVE and discarded for
+ *   WILLNEED; stop-on-error handlers (POPULATE_READ, POPULATE_WRITE,
+ *   SOFT_OFFLINE) surface the first per-page failure rather than
+ *   skipping; and hybrid handlers (WILLNEED for anon/shmem, COLLAPSE,
+ *   GUARD_INSTALL, HWPOISON) mix walking with bulk backends, retry/zap
+ *   loops or selective error suppression. A successful return therefore
+ *   guarantees only that no error was raised in the handler that ran,
+ *   not that every page was processed.
+ *
+ *   Behavior introduction history (mainline): MADV_FREE in 4.5,
+ *   MADV_WIPEONFORK / MADV_KEEPONFORK in 4.14, MADV_COLD / MADV_PAGEOUT in
+ *   5.4, MADV_POPULATE_READ / MADV_POPULATE_WRITE in 5.14,
+ *   MADV_DONTNEED_LOCKED in 5.18, MADV_COLLAPSE in 6.1, MADV_GUARD_INSTALL /
+ *   MADV_GUARD_REMOVE in 6.13. Code that wants to remain portable to older
+ *   kernels must handle -EINVAL gracefully and fall back.
+ *
+ *   process_madvise(2) extends the same set of advices to another process
+ *   identified by a pidfd. When the target mm is the caller's own (the
+ *   pidfd refers to the caller), any locally-supported MADV_* value is
+ *   accepted. When the target is a different mm, the behavior must be in
+ *   the non-destructive remote subset (MADV_COLD, MADV_PAGEOUT,
+ *   MADV_WILLNEED, MADV_COLLAPSE) or the call returns -EINVAL, and the
+ *   caller must hold CAP_SYS_NICE.
+ *
+ *   The discard subset (MADV_FREE, MADV_DONTNEED, MADV_DONTNEED_LOCKED,
+ *   MADV_REMOVE, MADV_DONTFORK, MADV_WIPEONFORK, MADV_GUARD_INSTALL) is
+ *   refused on non-writable anonymous VMAs sealed with mseal(2) on 64-bit
+ *   kernels. mseal(2) does not provide an unseal operation, so applications
+ *   that need to retain the ability to discard such pages must keep the
+ *   mapping writable (and arch-accessible for write) or refrain from
+ *   sealing it.
+ *
+ *   On a userfaultfd-armed VMA, all four destructive discards (DONTNEED,
+ *   DONTNEED_LOCKED, FREE, REMOVE) emit a UFFD_EVENT_REMOVE event; the
+ *   per-VMA fast path is bypassed and the syscall falls back to the
+ *   heavier mmap_read_lock so the userfaultfd monitor is notified before
+ *   the discard takes effect.
+ *
+ *   MADV_GUARD_INSTALL retries up to MAX_MADVISE_GUARD_RETRIES (3) times
+ *   when it loses races with concurrent faulting or khugepaged. If those
+ *   retries are exhausted the handler returns -ERESTARTNOINTR via
+ *   restart_syscall(); the kernel's syscall return path treats this as a
+ *   transparent restart of madvise() and re-enters the call with the same
+ *   arguments. The transparent restart is unconditional and is not driven
+ *   by signal delivery, so the caller never observes an errno from this
+ *   path and the call appears to make eventual forward progress.
+ *   anon_vma_prepare() failures inside MADV_GUARD_INSTALL bypass the
+ *   ENOMEM-to-EAGAIN translation that applies to the VMA-flag-mutating
+ *   behaviors and surface as -ENOMEM directly.
+ *
+ *   Architecture note: alpha defines MADV_DONTNEED as 6 (not 4) and reserves
+ *   MADV_SPACEAVAIL=5; portable code must use the symbolic names from
+ *   <sys/mman.h>.
+ */
 SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
 {
 	return do_madvise(current->mm, start, len_in, behavior);
-- 
2.53.0



* [PATCH v4 11/11] kernel/api: add syscall enter/exit tracepoints
From: Sasha Levin @ 2026-05-29 23:33 UTC (permalink / raw)
  To: linux-api, linux-kernel
  Cc: linux-doc, linux-fsdevel, linux-kbuild, linux-kselftest,
	workflows, tools, x86, Thomas Gleixner, Paul E . McKenney,
	Greg Kroah-Hartman, Jonathan Corbet, Dmitry Vyukov, Randy Dunlap,
	Cyril Hrubis, Kees Cook, Jake Edge, David Laight,
	Gabriele Paoloni, Mauro Carvalho Chehab, Christian Brauner,
	Alexander Viro, Andrew Morton, Masahiro Yamada, Shuah Khan,
	Arnd Bergmann, Nathan Chancellor, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers
In-Reply-To: <20260529233311.1901670-1-sashal@kernel.org>

Add two tracepoints to the CONFIG_KAPI_RUNTIME_CHECKS syscall validation
path so the framework's behavior can be observed without the noise and
loss of pr_warn_ratelimited():

  kapi_syscall_enter - the spec name, the raw argument values, and a
                       rendered "name=value" list of the specified
                       parameters (pointer-like values in hex, integers
                       and file descriptors in decimal)
  kapi_syscall_exit  - the spec name, the return value, and whether it
                       matched the specification (spec_match)

Both fire only for syscalls that have a KAPI specification and live
inside the existing CONFIG_KAPI_RUNTIME_CHECKS region, so they exist
exactly when the runtime checks do; they compile to no-ops without
CONFIG_TRACEPOINTS and stay dormant until enabled. The parameter list
is rendered only when the enter tracepoint is enabled.

kapi_syscall_exit is also emitted on the parameter-validation rejection
path -- where the validator returns -EINVAL and the real handler is
skipped -- with spec_match=0, so every kapi_syscall_enter has a matching
exit.

Signed-off-by: Sasha Levin <sashal@kernel.org>
---
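For readers who want to see the "name=value" rendering described above
in isolation, here is a minimal userspace sketch. It is an assumption-laden
stand-in, not the kernel code: the demo_* names and types are hypothetical,
and demo_scnprintf() only imitates the "return the number of characters
actually written" behavior that the kernel's scnprintf() provides.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the spec's parameter types (not kernel enums). */
enum demo_type { DEMO_TYPE_INT, DEMO_TYPE_FD, DEMO_TYPE_PTR, DEMO_TYPE_UINT };

struct demo_param {
	const char *name;	/* NULL means unnamed, rendered as "arg" */
	enum demo_type type;
};

/* Like snprintf(), but returns the number of characters actually stored. */
static int demo_scnprintf(char *buf, size_t size, const char *fmt, ...)
{
	va_list ap;
	int n;

	if (size == 0)
		return 0;
	va_start(ap, fmt);
	n = vsnprintf(buf, size, fmt, ap);
	va_end(ap);
	if (n < 0)
		return 0;
	return (size_t)n >= size ? (int)(size - 1) : n;
}

/*
 * Render up to six arguments as "name=value, ...": ints and fds in
 * decimal, everything else in hex.  Output is truncated, never overrun.
 */
static void demo_format_params(const struct demo_param *params, int param_count,
			       const long long *args, int nargs,
			       char *buf, size_t size)
{
	int i, used = 0;

	if (size == 0)
		return;
	buf[0] = '\0';
	for (i = 0; args && i < nargs && i < 6; i++) {
		const char *name = "arg";
		int dec = 0;

		if (i < param_count) {
			if (params[i].name)
				name = params[i].name;
			dec = params[i].type == DEMO_TYPE_INT ||
			      params[i].type == DEMO_TYPE_FD;
		}

		used += demo_scnprintf(buf + used, size - used, "%s%s=",
				       i ? ", " : "", name);
		if (dec)
			used += demo_scnprintf(buf + used, size - used, "%lld",
					       args[i]);
		else
			used += demo_scnprintf(buf + used, size - used, "0x%llx",
					       (unsigned long long)args[i]);
	}
}
```

Calling demo_format_params() with a three-entry table (fd as DEMO_TYPE_FD,
buf as DEMO_TYPE_PTR, count as DEMO_TYPE_UINT) and the raw values
3, 0x7ffd46780b58, 0x340 produces the same parameter list as the sys_read
line in the documentation example.
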
 Documentation/dev-tools/kernel-api-spec.rst | 29 ++++++++
 MAINTAINERS                                 |  1 +
 include/trace/events/kapi.h                 | 74 ++++++++++++++++++++
 kernel/api/kernel_api_spec.c                | 77 ++++++++++++++++++---
 4 files changed, 173 insertions(+), 8 deletions(-)
 create mode 100644 include/trace/events/kapi.h

diff --git a/Documentation/dev-tools/kernel-api-spec.rst b/Documentation/dev-tools/kernel-api-spec.rst
index 26598a98c0f69..561e7bff58379 100644
--- a/Documentation/dev-tools/kernel-api-spec.rst
+++ b/Documentation/dev-tools/kernel-api-spec.rst
@@ -285,6 +285,35 @@ custom validation functions via the ``validate`` field in the constraint spec:
     .type = KAPI_CONSTRAINT_CUSTOM,
     .validate = validate_buffer_size,
 
+Tracepoints
+-----------
+
+When ``CONFIG_KAPI_RUNTIME_CHECKS`` is enabled, the syscall validation path emits
+two ftrace tracepoints (in the ``kapi`` trace system) for every syscall that has a
+specification:
+
+- ``kapi_syscall_enter`` -- fired before parameter validation, recording the spec
+  name, the raw syscall argument values, and -- when the spec provides parameter
+  metadata -- a rendered ``name=value`` list: pointer-like values are shown in hex,
+  integers and file descriptors in decimal, and an unnamed parameter as ``arg``.
+- ``kapi_syscall_exit`` -- fired after the handler returns, or in place of the
+  handler when parameter validation rejects the call (the handler is skipped and
+  ``-EINVAL`` is returned). Records the spec name, the return value, and
+  ``spec_match``: 0 when the call did not conform to the spec -- the parameters were
+  rejected, or the return value was not one the spec allows -- and 1 otherwise.
+
+Unlike the ``pr_warn_ratelimited`` violation reports, the tracepoints capture every
+spec'd call rather than only violations, are lossless under load, and can be filtered
+with the usual ftrace facilities. They require ``CONFIG_TRACEPOINTS`` and stay dormant
+until enabled::
+
+    # echo 1 > /sys/kernel/tracing/events/kapi/enable
+    # cat /sys/kernel/tracing/trace
+     ...  kapi_syscall_enter: sys_read(fd=3, buf=0x7ffd46780b58, count=0x340)
+     ...  kapi_syscall_exit: sys_read = 832 spec_match=1
+     ...  kapi_syscall_enter: sys_open(filename=0x480300, flags=268435456, mode=0x0)
+     ...  kapi_syscall_exit: sys_open = -22 spec_match=0
+
 DebugFS Interface
 =================
 
diff --git a/MAINTAINERS b/MAINTAINERS
index ddfd9cad98916..48def631ad823 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13823,6 +13823,7 @@ L:	linux-api@vger.kernel.org
 S:	Maintained
 F:	Documentation/dev-tools/kernel-api-spec.rst
 F:	include/linux/kernel_api_spec.h
+F:	include/trace/events/kapi.h
 F:	kernel/api/
 F:	tools/kapi/
 F:	tools/lib/python/kdoc/kdoc_apispec.py
diff --git a/include/trace/events/kapi.h b/include/trace/events/kapi.h
new file mode 100644
index 0000000000000..47828f3338828
--- /dev/null
+++ b/include/trace/events/kapi.h
@@ -0,0 +1,74 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM kapi
+
+#if !defined(_TRACE_KAPI_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_KAPI_H
+
+#include <linux/tracepoint.h>
+
+/* Max length of the rendered "name=value, ..." parameter list. */
+#define KAPI_TP_PARAMS_LEN 256
+
+/*
+ * Emitted from the CONFIG_KAPI_RUNTIME_CHECKS syscall validation path for
+ * syscalls that have a KAPI specification: kapi_syscall_enter fires before
+ * parameter validation, kapi_syscall_exit after the handler returns.
+ * @name is the spec name, e.g. "sys_open".
+ *
+ * kapi_syscall_enter carries both the raw argument values (args[]) and, when
+ * the spec provides parameter metadata, a rendered "name=value" list (params,
+ * built by the caller): pointer-like values in hex, integers and fds in decimal.
+ */
+TRACE_EVENT(kapi_syscall_enter,
+
+	TP_PROTO(const char *name, int nargs, const s64 *args, const char *params),
+
+	TP_ARGS(name, nargs, args, params),
+
+	TP_STRUCT__entry(
+		__string(	name,	name	)
+		__field(	int,	nargs	)
+		__array(	u64,	args,	6	)
+		__string(	params,	params	)
+	),
+
+	TP_fast_assign(
+		__assign_str(name);
+		__entry->nargs = nargs;
+		memset(__entry->args, 0, sizeof(__entry->args));
+		if (args && nargs > 0)
+			memcpy(__entry->args, args,
+			       min_t(int, nargs, 6) * sizeof(__entry->args[0]));
+		__assign_str(params);
+	),
+
+	TP_printk("%s(%s)", __get_str(name), __get_str(params))
+);
+
+TRACE_EVENT(kapi_syscall_exit,
+
+	TP_PROTO(const char *name, long ret, bool spec_match),
+
+	TP_ARGS(name, ret, spec_match),
+
+	TP_STRUCT__entry(
+		__string(	name,		name		)
+		__field(	long,		ret		)
+		__field(	bool,		spec_match	)
+	),
+
+	TP_fast_assign(
+		__assign_str(name);
+		__entry->ret = ret;
+		__entry->spec_match = spec_match;
+	),
+
+	TP_printk("%s = %ld spec_match=%d",
+		  __get_str(name), __entry->ret, __entry->spec_match)
+);
+
+#endif /* _TRACE_KAPI_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/kernel/api/kernel_api_spec.c b/kernel/api/kernel_api_spec.c
index 1a9041a7f21a4..2aa8c04a5851e 100644
--- a/kernel/api/kernel_api_spec.c
+++ b/kernel/api/kernel_api_spec.c
@@ -659,6 +659,45 @@ EXPORT_SYMBOL_GPL(kapi_print_spec);
 
 #ifdef CONFIG_KAPI_RUNTIME_CHECKS
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/kapi.h>
+
+/*
+ * Render a syscall's parameters as a "name=value, ..." string for the
+ * kapi_syscall_enter tracepoint.  Names come from the spec; pointer-like
+ * values are shown in hex, integers and file descriptors in decimal.
+ */
+static void kapi_trace_format_params(const struct kernel_api_spec *spec,
+				     const s64 *args, int nargs,
+				     char *buf, size_t size)
+{
+	int i, used = 0;
+
+	buf[0] = '\0';
+	/* Bound by the caller-supplied arg count; the spec arity may differ. */
+	for (i = 0; args && i < nargs && i < 6; i++) {
+		const char *name = "arg";
+		bool dec = false;
+
+		if (i < spec->param_count) {
+			const struct kapi_param_spec *ps = &spec->params[i];
+
+			if (ps->name)
+				name = ps->name;
+			dec = ps->type == KAPI_TYPE_INT || ps->type == KAPI_TYPE_FD;
+		}
+
+		used += scnprintf(buf + used, size - used, "%s%s=",
+				  i ? ", " : "", name);
+		if (dec)
+			used += scnprintf(buf + used, size - used, "%lld",
+					  (long long)args[i]);
+		else
+			used += scnprintf(buf + used, size - used, "0x%llx",
+					  (unsigned long long)args[i]);
+	}
+}
+
 /**
  * kapi_validate_fd - Validate that a file descriptor value is in valid range
  * @fd: File descriptor to validate
@@ -1154,16 +1193,24 @@ EXPORT_SYMBOL_GPL(kapi_validate_syscall_param);
 int kapi_validate_syscall_params(const struct kernel_api_spec *spec,
 				 const s64 *params, int param_count)
 {
-	int i;
+	int i, ret = 0;
 
 	if (!spec || !params)
 		return 0;
 
+	if (trace_kapi_syscall_enter_enabled()) {
+		char pbuf[KAPI_TP_PARAMS_LEN];
+
+		kapi_trace_format_params(spec, params, param_count, pbuf, sizeof(pbuf));
+		trace_kapi_syscall_enter(spec->name, param_count, params, pbuf);
+	}
+
 	/* Validate that we have the expected number of parameters */
 	if (param_count != spec->param_count) {
 		pr_warn_ratelimited("API %s: parameter count mismatch (expected %u, got %d)\n",
 			spec->name, spec->param_count, param_count);
-		return -EINVAL;
+		ret = -EINVAL;
+		goto out;
 	}
 
 	/* Validate each parameter with context */
@@ -1173,12 +1220,22 @@ int kapi_validate_syscall_params(const struct kernel_api_spec *spec,
 		if (!kapi_validate_param_with_context(param_spec, params[i], params, param_count)) {
 			if (strncmp(spec->name, "sys_", 4) == 0) {
 				/* For syscalls, we can return EINVAL to userspace */
-				return -EINVAL;
+				ret = -EINVAL;
+				goto out;
 			}
 		}
 	}
 
-	return 0;
+out:
+	/*
+	 * Emit the exit event on the rejection path too (the wrapper
+	 * short-circuits the handler on a non-zero return), so every
+	 * kapi_syscall_enter has a matching kapi_syscall_exit.
+	 */
+	if (ret)
+		trace_kapi_syscall_exit(spec->name, ret, false);
+
+	return ret;
 }
 EXPORT_SYMBOL_GPL(kapi_validate_syscall_params);
 
@@ -1301,14 +1358,18 @@ EXPORT_SYMBOL_GPL(kapi_validate_return_value);
  */
 int kapi_validate_syscall_return(const struct kernel_api_spec *spec, s64 retval)
 {
+	bool valid = true;
+
 	if (!spec)
 		return 0;
 
-	/* Skip return validation if return spec was not defined */
-	if (spec->return_magic != KAPI_MAGIC_RETURN)
-		return 0;
+	/* Validate against the return spec when one was defined */
+	if (spec->return_magic == KAPI_MAGIC_RETURN)
+		valid = kapi_validate_return_value(spec, retval);
+
+	trace_kapi_syscall_exit(spec->name, retval, valid);
 
-	if (!kapi_validate_return_value(spec, retval)) {
+	if (!valid) {
 		/* Log the violation but don't change the return value */
 		pr_warn_ratelimited("KAPI: Syscall %s returned unspecified value %lld\n",
 				    spec->name, retval);
-- 
2.53.0



* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Jaegeuk Kim @ 2026-05-31  0:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Bart Van Assche, Theodore Tso, linux-api, linux-kernel,
	Matthew Wilcox, linux-f2fs-devel, linux-mm, Akilesh Kailash,
	linux-fsdevel, Christian Brauner
In-Reply-To: <ahkl52N3RDcusCNd@infradead.org>

On 05/28, Christoph Hellwig wrote:
> On Wed, May 27, 2026 at 03:59:35PM +0000, Jaegeuk Kim wrote:
> > F2FS merges bios before submit_bio, regardless of small or large folios,
> > since the block addresses are consecutive. So, I think IO subsystem was
> > working in full speed.
> 
> As does every other remotely modern file system.  But that merging is
> surprisingly expensive, which is why using folios gets really major
> performance improvements.
> 
> For one doing these checks to merge touch quite a few cache lines.
> Second, devices are often a lot more efficient if they see fewer SGL
> entries.  I.e. having a 1MB bio a single SGL tends to work better than
> having 256 of them.
> The same is true in the kernel code itself, both in the submission path
> (dma mapping and co), and even more so in the page cache handling
> both before submitting and in the completion path.
> 
> See Bart's patch about how long the walk of the bio_vecs in the f2fs
> completion path can take.  We had similar issues in XFS even in the
> workqueue completion path due to lack of rescheduling, and these simply
> go away when you do the folio manipulation in larger chunks (LAZY_PREEMPT
> would avoid the need to explicit rescheduling these days, but that just
> papers over the symptoms in this case).
> 

I see. That's also super helpful. Let me kick off the large folio support asap.
Thanks.


* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Jaegeuk Kim @ 2026-05-31  0:35 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Theodore Tso, linux-api, linux-kernel, linux-f2fs-devel,
	Christoph Hellwig, linux-mm, linux-fsdevel, Akilesh Kailash,
	Christian Brauner
In-Reply-To: <ahiZRpE593n4blxn@casper.infradead.org>

On 05/28, Matthew Wilcox wrote:
> On Tue, May 26, 2026 at 01:10:55AM +0000, Jaegeuk Kim wrote:
> > Background
> > ----------
> > The primary use case is accelerating AI model loading, which demands
> > exceptionally high sequential read speeds. In our benchmarks on embedded
> > systems:
> >  - Using high-order page allocations allows the system to saturate the
> >    Universal Flash Storage (UFS) bandwidth, reaching 4 GB/s even at
> >    medium-to-low CPU frequencies.
> >  - In contrast, standard small folios cap performance at 2 GB/s.
> > 
> > The performance doubling stems directly from reducing CPU cycle overhead during
> > memory allocation.
> 
> When you say "AI model loading", are you mmap()ing the file of weights,
> or are you calling read() to load the file into anonymous memory?
> 
> This matters because for the first operation, you need to allocate folios
> of PMD size in order to make best use of TLB entries.  For the second
> operation, it's more important to iterate through the file quickly,
> freeing folios behind you after you access them so they're available
> for the next batch.

We deal with multiple options, though. What I'm looking at is mostly
preloading models with mmap(MAP_POPULATE), which takes the readahead path and
bumps up the order by 2. Previously I also looked at fadvise(WILLNEED), but gave
up due to the broken interface. OTOH, we use RWF_DONTCACHE for the read() case,
but I don't think it's ideal for the best loading performance.



* [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Askar Safin @ 2026-05-31  1:01 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, Linus Torvalds,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, David Hildenbrand, Pedro Falcato, Miklos Szeredi,
	patches

This patchset is for VFS.

We have recently seen a lot of vulnerabilities in splice/vmsplice.

vmsplice has also been a source of vulnerabilities in the past:
CVE-2020-29374 (see https://lwn.net/Articles/849638/ ).

vmsplice is problematic for other reasons as well. Here is what other
developers say:

Linus Torvalds in 2023:
> So I'd personally be perfectly ok with just making vmsplice() be
> exactly the same as write, and turn all of vmsplice() into just "it's
> a read() if the pipe is open for read, and a write if it's open for
> writing".
https://lore.kernel.org/all/CAHk-=wgG_2cmHgZwKjydi7=iimyHyN8aessnbM9XQ9ufbaUz9g@mail.gmail.com/

Christoph Hellwig in May 2026:
> vmsplice is the worst, as it is one of the few remaining places that
> can incorrectly dirty file backed pages without telling the file system
> and cause the other problems fixed by a FOLL_PIN conversion, but it is
> the only one where we do not have any idea yet how we could convert it
> to FOLL_PIN due to the unbounded pin time.
https://lore.kernel.org/all/agwFlBKvKytjURDO@infradead.org/

See recent discussion here:
https://lore.kernel.org/all/20260516182126.530498-1-pfalcato@suse.de/T/#u

For all these reasons I propose to make vmsplice a simple wrapper for
preadv2/pwritev2.

vmsplice(fd, vec, vlen, vmsplice_flags) will be equivalent to
preadv2(fd, vec, vlen, -1, rw_flags) if the pipe is readable and to
pwritev2(fd, vec, vlen, -1, rw_flags) if the pipe is writable.

SPLICE_F_NONBLOCK is translated to RWF_NOWAIT, all other SPLICE_F_*
flags are ignored.

There is a small change to handling of NONBLOCK-related flags,
see commit messages for details.
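
As a rough illustration of the mapping above, here is a minimal userspace
sketch. It is not the kernel code: the demo_* names are hypothetical, and
the flag values are copied from include/linux/splice.h and the RWF_* uapi
definitions under local names so that the snippet builds without kernel
headers.

```c
#include <stdbool.h>

/* Local copies of the SPLICE_F_* and RWF_NOWAIT values (assumed unchanged). */
#define DEMO_SPLICE_F_MOVE	0x01
#define DEMO_SPLICE_F_NONBLOCK	0x02
#define DEMO_SPLICE_F_MORE	0x04
#define DEMO_SPLICE_F_GIFT	0x08
#define DEMO_SPLICE_F_ALL \
	(DEMO_SPLICE_F_MOVE | DEMO_SPLICE_F_NONBLOCK | \
	 DEMO_SPLICE_F_MORE | DEMO_SPLICE_F_GIFT)
#define DEMO_RWF_NOWAIT		0x08

enum demo_op { DEMO_OP_EINVAL, DEMO_OP_PREADV2, DEMO_OP_PWRITEV2 };

/* Which call a given vmsplice() invocation would be equivalent to. */
struct demo_call {
	enum demo_op op;
	long pos;	/* always -1: no explicit file offset */
	int rw_flags;	/* RWF_NOWAIT or 0 */
};

static struct demo_call demo_vmsplice_equiv(bool pipe_writable,
					    unsigned int flags)
{
	struct demo_call c = { DEMO_OP_EINVAL, -1, 0 };

	/* Unknown flag bits are still rejected. */
	if (flags & ~DEMO_SPLICE_F_ALL)
		return c;

	/* Direction follows the pipe end, not the caller's flags. */
	c.op = pipe_writable ? DEMO_OP_PWRITEV2 : DEMO_OP_PREADV2;

	/* Only SPLICE_F_NONBLOCK is translated; MOVE, MORE and GIFT are ignored. */
	if (flags & DEMO_SPLICE_F_NONBLOCK)
		c.rw_flags = DEMO_RWF_NOWAIT;

	return c;
}
```

For example, demo_vmsplice_equiv(true, DEMO_SPLICE_F_GIFT) describes a
plain pwritev2() with rw_flags == 0, because SPLICE_F_GIFT no longer has
any effect.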

I tested this patchset in QEMU.

This patchset was written by me, not by LLMs.

Askar Safin (3):
  tee: fs/splice.c: remove unused parameter "flags" from "link_pipe"
  vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
  splice: remove PIPE_BUF_FLAG_GIFT

 fs/fuse/dev.c             |   1 -
 fs/read_write.c           |  23 +++++
 fs/splice.c               | 202 +-------------------------------------
 include/linux/pipe_fs_i.h |   1 -
 include/linux/skbuff.h    |   4 +-
 include/linux/splice.h    |   2 +-
 include/linux/syscalls.h  |   4 +-
 7 files changed, 33 insertions(+), 204 deletions(-)


base-commit: e7ae89a0c97ce2b68b0983cd01eda67cf373517d (7.1-rc5)
-- 
2.47.3



* [PATCH 1/3] tee: fs/splice.c: remove unused parameter "flags" from "link_pipe"
From: Askar Safin @ 2026-05-31  1:01 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, Linus Torvalds,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, David Hildenbrand, Pedro Falcato, Miklos Szeredi,
	patches
In-Reply-To: <20260531010107.1953702-1-safinaskar@gmail.com>

Remove unused parameter "flags" from "link_pipe".

Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 fs/splice.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index 9d8f63e2fd1a..59adbc2fa4d6 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1849,7 +1849,7 @@ static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
  */
 static ssize_t link_pipe(struct pipe_inode_info *ipipe,
 			 struct pipe_inode_info *opipe,
-			 size_t len, unsigned int flags)
+			 size_t len)
 {
 	struct pipe_buffer *ibuf, *obuf;
 	unsigned int i_head, o_head;
@@ -1962,7 +1962,7 @@ ssize_t do_tee(struct file *in, struct file *out, size_t len,
 		if (!ret) {
 			ret = opipe_prep(opipe, flags);
 			if (!ret)
-				ret = link_pipe(ipipe, opipe, len, flags);
+				ret = link_pipe(ipipe, opipe, len);
 		}
 	}
 
-- 
2.47.3



* [PATCH 2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Askar Safin @ 2026-05-31  1:01 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, Linus Torvalds,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, David Hildenbrand, Pedro Falcato, Miklos Szeredi,
	patches
In-Reply-To: <20260531010107.1953702-1-safinaskar@gmail.com>

vmsplice on a writable pipe is now equivalent to pwritev2.
vmsplice on a readable pipe was already nearly equivalent to
preadv2, but this patch makes that explicit, i.e. it is now
obvious from the code that vmsplice is equivalent to preadv2/pwritev2.

Also I moved vmsplice to fs/read_write.c, because now it arguably
belongs there.

Note that SPLICE_F_NONBLOCK behavior has changed slightly: previously
vmsplice ignored whether the pipe was opened with O_NONBLOCK, and the
mode of operation depended only on whether SPLICE_F_NONBLOCK was passed.
Now the operation is non-blocking if O_NONBLOCK was passed when
opening *or* SPLICE_F_NONBLOCK was passed to vmsplice. The previous
behavior was arguably buggy, and the new behavior is arguably better.

Now SPLICE_F_GIFT is always ignored by all 3 syscalls: splice, tee
and vmsplice.

Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
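The NONBLOCK change described in the commit message can be summarized
with two tiny predicates. This is only a sketch of the stated rule, with
hypothetical demo_* names; it does not model the pipe code itself.

```c
#include <stdbool.h>

/* Old rule, per the commit message: only SPLICE_F_NONBLOCK mattered. */
static bool demo_old_nonblocking(bool o_nonblock, bool splice_f_nonblock)
{
	(void)o_nonblock;	/* the O_NONBLOCK state of the pipe was ignored */
	return splice_f_nonblock;
}

/* New rule: either O_NONBLOCK on the pipe or SPLICE_F_NONBLOCK suffices. */
static bool demo_new_nonblocking(bool o_nonblock, bool splice_f_nonblock)
{
	return o_nonblock || splice_f_nonblock;
}

/* True when a caller would observe different blocking behavior. */
static bool demo_behavior_differs(bool o_nonblock, bool splice_f_nonblock)
{
	return demo_old_nonblocking(o_nonblock, splice_f_nonblock) !=
	       demo_new_nonblocking(o_nonblock, splice_f_nonblock);
}
```

Under these assumptions the only combination that changes is a pipe
opened with O_NONBLOCK and a vmsplice() call without SPLICE_F_NONBLOCK:
it used to block and would now be non-blocking.
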
 fs/read_write.c          |  23 +++++
 fs/splice.c              | 192 +--------------------------------------
 include/linux/skbuff.h   |   4 +-
 include/linux/splice.h   |   2 +-
 include/linux/syscalls.h |   4 +-
 5 files changed, 29 insertions(+), 196 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 50bff7edc91f..1e5444f4dab3 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1213,6 +1213,29 @@ SYSCALL_DEFINE6(pwritev2, unsigned long, fd, const struct iovec __user *, vec,
 	return do_pwritev(fd, vec, vlen, pos, flags);
 }
 
+/*
+ * Legacy preadv2/pwritev2 wrapper.
+ */
+SYSCALL_DEFINE4(vmsplice, unsigned long, fd, const struct iovec __user *, vec,
+		unsigned long, vlen, unsigned int, flags)
+{
+	if (unlikely(flags & ~SPLICE_F_ALL))
+		return -EINVAL;
+
+	CLASS(fd, f)(fd);
+	if (fd_empty(f))
+		return -EBADF;
+
+	/* We do do_writev/do_readv, so it is okay to pass "false" here */
+	if (!get_pipe_info(fd_file(f), /* for_splice = */ false))
+		return -EBADF;
+
+	if (fd_file(f)->f_mode & FMODE_WRITE)
+		return do_writev(fd, vec, vlen, (flags & SPLICE_F_NONBLOCK) ? RWF_NOWAIT : 0);
+	else
+		return do_readv(fd, vec, vlen, (flags & SPLICE_F_NONBLOCK) ? RWF_NOWAIT : 0);
+}
+
 /*
  * Various compat syscalls.  Note that they all pretend to take a native
  * iovec - import_iovec will properly treat those as compat_iovecs based on
diff --git a/fs/splice.c b/fs/splice.c
index 59adbc2fa4d6..b1a4e3713bd6 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -159,22 +159,6 @@ const struct pipe_buf_operations page_cache_pipe_buf_ops = {
 	.get		= generic_pipe_buf_get,
 };
 
-static bool user_page_pipe_buf_try_steal(struct pipe_inode_info *pipe,
-		struct pipe_buffer *buf)
-{
-	if (!(buf->flags & PIPE_BUF_FLAG_GIFT))
-		return false;
-
-	buf->flags |= PIPE_BUF_FLAG_LRU;
-	return generic_pipe_buf_try_steal(pipe, buf);
-}
-
-static const struct pipe_buf_operations user_page_pipe_buf_ops = {
-	.release	= page_cache_pipe_buf_release,
-	.try_steal	= user_page_pipe_buf_try_steal,
-	.get		= generic_pipe_buf_get,
-};
-
 static void wakeup_pipe_readers(struct pipe_inode_info *pipe)
 {
 	smp_mb();
@@ -589,8 +573,7 @@ static void splice_from_pipe_end(struct pipe_inode_info *pipe, struct splice_des
  * Description:
  *    This function does little more than loop over the pipe and call
  *    @actor to do the actual moving of a single struct pipe_buffer to
- *    the desired destination. See pipe_to_file, pipe_to_sendmsg, or
- *    pipe_to_user.
+ *    the desired destination. See pipe_to_file or pipe_to_sendmsg.
  *
  */
 ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, struct splice_desc *sd,
@@ -1440,179 +1423,6 @@ static ssize_t __do_splice(struct file *in, loff_t __user *off_in,
 	return ret;
 }
 
-static ssize_t iter_to_pipe(struct iov_iter *from,
-			    struct pipe_inode_info *pipe,
-			    unsigned int flags)
-{
-	struct pipe_buffer buf = {
-		.ops = &user_page_pipe_buf_ops,
-		.flags = flags
-	};
-	size_t total = 0;
-	ssize_t ret = 0;
-
-	while (iov_iter_count(from)) {
-		struct page *pages[16];
-		ssize_t left;
-		size_t start;
-		int i, n;
-
-		left = iov_iter_get_pages2(from, pages, ~0UL, 16, &start);
-		if (left <= 0) {
-			ret = left;
-			break;
-		}
-
-		n = DIV_ROUND_UP(left + start, PAGE_SIZE);
-		for (i = 0; i < n; i++) {
-			int size = umin(left, PAGE_SIZE - start);
-
-			buf.page = pages[i];
-			buf.offset = start;
-			buf.len = size;
-			ret = add_to_pipe(pipe, &buf);
-			if (unlikely(ret < 0)) {
-				iov_iter_revert(from, left);
-				// this one got dropped by add_to_pipe()
-				while (++i < n)
-					put_page(pages[i]);
-				goto out;
-			}
-			total += ret;
-			left -= size;
-			start = 0;
-		}
-	}
-out:
-	return total ? total : ret;
-}
-
-static int pipe_to_user(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
-			struct splice_desc *sd)
-{
-	int n = copy_page_to_iter(buf->page, buf->offset, sd->len, sd->u.data);
-	return n == sd->len ? n : -EFAULT;
-}
-
-/*
- * For lack of a better implementation, implement vmsplice() to userspace
- * as a simple copy of the pipe's pages to the user iov.
- */
-static ssize_t vmsplice_to_user(struct file *file, struct iov_iter *iter,
-				unsigned int flags)
-{
-	struct pipe_inode_info *pipe = get_pipe_info(file, true);
-	struct splice_desc sd = {
-		.total_len = iov_iter_count(iter),
-		.flags = flags,
-		.u.data = iter
-	};
-	ssize_t ret = 0;
-
-	if (!pipe)
-		return -EBADF;
-
-	pipe_clear_nowait(file);
-
-	if (sd.total_len) {
-		pipe_lock(pipe);
-		ret = __splice_from_pipe(pipe, &sd, pipe_to_user);
-		pipe_unlock(pipe);
-	}
-
-	if (ret > 0)
-		fsnotify_access(file);
-
-	return ret;
-}
-
-/*
- * vmsplice splices a user address range into a pipe. It can be thought of
- * as splice-from-memory, where the regular splice is splice-from-file (or
- * to file). In both cases the output is a pipe, naturally.
- */
-static ssize_t vmsplice_to_pipe(struct file *file, struct iov_iter *iter,
-				unsigned int flags)
-{
-	struct pipe_inode_info *pipe;
-	ssize_t ret = 0;
-	unsigned buf_flag = 0;
-
-	if (flags & SPLICE_F_GIFT)
-		buf_flag = PIPE_BUF_FLAG_GIFT;
-
-	pipe = get_pipe_info(file, true);
-	if (!pipe)
-		return -EBADF;
-
-	pipe_clear_nowait(file);
-
-	pipe_lock(pipe);
-	ret = wait_for_space(pipe, flags);
-	if (!ret)
-		ret = iter_to_pipe(iter, pipe, buf_flag);
-	pipe_unlock(pipe);
-	if (ret > 0) {
-		wakeup_pipe_readers(pipe);
-		fsnotify_modify(file);
-	}
-	return ret;
-}
-
-/*
- * Note that vmsplice only really supports true splicing _from_ user memory
- * to a pipe, not the other way around. Splicing from user memory is a simple
- * operation that can be supported without any funky alignment restrictions
- * or nasty vm tricks. We simply map in the user memory and fill them into
- * a pipe. The reverse isn't quite as easy, though. There are two possible
- * solutions for that:
- *
- *	- memcpy() the data internally, at which point we might as well just
- *	  do a regular read() on the buffer anyway.
- *	- Lots of nasty vm tricks, that are neither fast nor flexible (it
- *	  has restriction limitations on both ends of the pipe).
- *
- * Currently we punt and implement it as a normal copy, see pipe_to_user().
- *
- */
-SYSCALL_DEFINE4(vmsplice, int, fd, const struct iovec __user *, uiov,
-		unsigned long, nr_segs, unsigned int, flags)
-{
-	struct iovec iovstack[UIO_FASTIOV];
-	struct iovec *iov = iovstack;
-	struct iov_iter iter;
-	ssize_t error;
-	int type;
-
-	if (unlikely(flags & ~SPLICE_F_ALL))
-		return -EINVAL;
-
-	CLASS(fd, f)(fd);
-	if (fd_empty(f))
-		return -EBADF;
-	if (fd_file(f)->f_mode & FMODE_WRITE)
-		type = ITER_SOURCE;
-	else if (fd_file(f)->f_mode & FMODE_READ)
-		type = ITER_DEST;
-	else
-		return -EBADF;
-
-	error = import_iovec(type, uiov, nr_segs,
-			     ARRAY_SIZE(iovstack), &iov, &iter);
-	if (error < 0)
-		return error;
-
-	if (!iov_iter_count(&iter))
-		error = 0;
-	else if (type == ITER_SOURCE)
-		error = vmsplice_to_pipe(fd_file(f), &iter, flags);
-	else
-		error = vmsplice_to_user(fd_file(f), &iter, flags);
-
-	kfree(iov);
-	return error;
-}
-
 SYSCALL_DEFINE6(splice, int, fd_in, loff_t __user *, off_in,
 		int, fd_out, loff_t __user *, off_out,
 		size_t, len, unsigned int, flags)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 2bcf78a4de7b..2961fee3e5cc 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -505,7 +505,7 @@ enum {
 	SKBFL_ZEROCOPY_ENABLE = BIT(0),
 
 	/* This indicates at least one fragment might be overwritten
-	 * (as in vmsplice(), sendfile() ...)
+	 * (as in sendfile(), ...)
 	 * If we need to compute a TX checksum, we'll need to copy
 	 * all frags to avoid possible bad checksum
 	 */
@@ -4017,7 +4017,7 @@ static inline int skb_linearize(struct sk_buff *skb)
  * @skb: buffer to test
  *
  * Return: true if the skb has at least one frag that might be modified
- * by an external entity (as in vmsplice()/sendfile())
+ * by an external entity (as in sendfile())
  */
 static inline bool skb_has_shared_frag(const struct sk_buff *skb)
 {
diff --git a/include/linux/splice.h b/include/linux/splice.h
index 9dec4861d09f..fb4f035aae83 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -19,7 +19,7 @@
 				 /* we may still block on the fd we splice */
 				 /* from/to, of course */
 #define SPLICE_F_MORE	(0x04)	/* expect more data */
-#define SPLICE_F_GIFT	(0x08)	/* pages passed in are a gift */
+#define SPLICE_F_GIFT	(0x08)	/* ignored */
 
 #define SPLICE_F_ALL (SPLICE_F_MOVE|SPLICE_F_NONBLOCK|SPLICE_F_MORE|SPLICE_F_GIFT)
 
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index f5639d5ac331..a86a88207956 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -514,8 +514,8 @@ asmlinkage long sys_ppoll_time32(struct pollfd __user *, unsigned int,
 			  struct old_timespec32 __user *, const sigset_t __user *,
 			  size_t);
 asmlinkage long sys_signalfd4(int ufd, sigset_t __user *user_mask, size_t sizemask, int flags);
-asmlinkage long sys_vmsplice(int fd, const struct iovec __user *iov,
-			     unsigned long nr_segs, unsigned int flags);
+asmlinkage long sys_vmsplice(unsigned long fd, const struct iovec __user *vec,
+			     unsigned long vlen, unsigned int flags);
 asmlinkage long sys_splice(int fd_in, loff_t __user *off_in,
 			   int fd_out, loff_t __user *off_out,
 			   size_t len, unsigned int flags);
-- 
2.47.3



* [PATCH 3/3] splice: remove PIPE_BUF_FLAG_GIFT
From: Askar Safin @ 2026-05-31  1:01 UTC (permalink / raw)
  To: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara
  Cc: linux-kernel, linux-mm, linux-api, netdev, Linus Torvalds,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, David Hildenbrand, Pedro Falcato, Miklos Szeredi,
	patches
In-Reply-To: <20260531010107.1953702-1-safinaskar@gmail.com>

It is unused now.

Signed-off-by: Askar Safin <safinaskar@gmail.com>
---
 fs/fuse/dev.c             | 1 -
 fs/splice.c               | 6 ++----
 include/linux/pipe_fs_i.h | 1 -
 3 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 5dda7080f4a9..fb8fe0c96692 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -2352,7 +2352,6 @@ static ssize_t fuse_dev_splice_write(struct pipe_inode_info *pipe,
 				goto out_free;
 
 			*obuf = *ibuf;
-			obuf->flags &= ~PIPE_BUF_FLAG_GIFT;
 			obuf->len = rem;
 			ibuf->offset += obuf->len;
 			ibuf->len -= obuf->len;
diff --git a/fs/splice.c b/fs/splice.c
index b1a4e3713bd6..6ddf7dd72f7b 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -1622,10 +1622,9 @@ static int splice_pipe_to_pipe(struct pipe_inode_info *ipipe,
 			*obuf = *ibuf;
 
 			/*
-			 * Don't inherit the gift and merge flags, we need to
+			 * Don't inherit the merge flag, we need to
 			 * prevent multiple steals of this page.
 			 */
-			obuf->flags &= ~PIPE_BUF_FLAG_GIFT;
 			obuf->flags &= ~PIPE_BUF_FLAG_CAN_MERGE;
 
 			obuf->len = len;
@@ -1711,10 +1710,9 @@ static ssize_t link_pipe(struct pipe_inode_info *ipipe,
 		*obuf = *ibuf;
 
 		/*
-		 * Don't inherit the gift and merge flag, we need to prevent
+		 * Don't inherit the merge flag, we need to prevent
 		 * multiple steals of this page.
 		 */
-		obuf->flags &= ~PIPE_BUF_FLAG_GIFT;
 		obuf->flags &= ~PIPE_BUF_FLAG_CAN_MERGE;
 
 		if (obuf->len > len)
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 7f6a92ac9704..a1eeed800669 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -6,7 +6,6 @@
 
 #define PIPE_BUF_FLAG_LRU	0x01	/* page is on the LRU */
 #define PIPE_BUF_FLAG_ATOMIC	0x02	/* was atomically mapped */
-#define PIPE_BUF_FLAG_GIFT	0x04	/* page is a gift */
 #define PIPE_BUF_FLAG_PACKET	0x08	/* read() as a packet */
 #define PIPE_BUF_FLAG_CAN_MERGE	0x10	/* can merge buffers */
 #define PIPE_BUF_FLAG_WHOLE	0x20	/* read() must return entire buffer or error */
-- 
2.47.3


^ permalink raw reply related

* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Barry Song @ 2026-05-31  5:28 UTC (permalink / raw)
  To: Jaegeuk Kim
  Cc: Christoph Hellwig, Bart Van Assche, Theodore Tso, linux-api,
	linux-kernel, Matthew Wilcox, linux-f2fs-devel, linux-mm,
	Akilesh Kailash, linux-fsdevel, Christian Brauner
In-Reply-To: <aht812OhSPFqIBPK@google.com>

On Sun, May 31, 2026 at 8:12 AM Jaegeuk Kim <jaegeuk@kernel.org> wrote:
>
> On 05/28, Christoph Hellwig wrote:
> > On Wed, May 27, 2026 at 03:59:35PM +0000, Jaegeuk Kim wrote:
> > > F2FS merges bios before submit_bio, regardless of small or large folios,
> > > since the block addresses are consecutive. So, I think IO subsystem was
> > > working in full speed.
> >
> > As does every other remotely modern file system.  But that merging is
> > surprisingly expensive, which is why using folios gets really major
> > performance improvements.
> >
> > For one, doing these checks to merge touches quite a few cache lines.
> > Second, devices are often a lot more efficient if they see fewer SGL
> > entries.  I.e. having a 1MB bio as a single SGL entry tends to work
> > better than having 256 of them.
> > The same is true in the kernel code itself, both in the submission path
> > (dma mapping and co), and even more so in the page cache handling
> > both before submitting and in the completion path.
> >
> > See Bart's patch about how long the walk of the bio_vecs in the f2fs
> > completion path can take.  We had similar issues in XFS even in the
> > workqueue completion path due to lack of rescheduling, and these simply
> > go away when you do the folio manipulation in larger chunks (LAZY_PREEMPT
> > would avoid the need for explicit rescheduling these days, but that just
> > papers over the symptoms in this case).
> >
>
> I see. That's also super helpful. Let me kick off the large folio support asap.
> Thanks.

Hi Jaegeuk,

Nanzhe has put significant effort into this work at Xiaomi over
the past several months. Large folios can now be supported on
non-immutable files.

He has conducted extensive testing on the Pixel 6 and fixed a
number of hangs discovered during development. He is still
benchmarking performance, but the implementation appears to be
reasonably stable at this point. We can run Android Monkey for
many hours without observing any hangs.

If you would like to see an RFC, I can ask Nanzhe to send one
as soon as possible after some cleanup and polishing.

Best regards,
Barry

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Pedro Falcato @ 2026-05-31  8:54 UTC (permalink / raw)
  To: Askar Safin
  Cc: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara,
	linux-kernel, linux-mm, linux-api, netdev, Linus Torvalds,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, David Hildenbrand, Miklos Szeredi, patches
In-Reply-To: <20260531010107.1953702-1-safinaskar@gmail.com>

On Sun, May 31, 2026 at 01:01:04AM +0000, Askar Safin wrote:
> This patchset is for VFS.
> 
> Recently we got a lot of vulnerabilities in splice/vmsplice.
> 
> Also vmsplice already was source of vulnerabilities in the past:
> CVE-2020-29374 (see https://lwn.net/Articles/849638/ ).
> 
> Also vmsplice is problematic for other reasons. Here is what other
> developers say:
> 
> Linus Torvalds in 2023:
> > So I'd personally be perfectly ok with just making vmsplice() be
> > exactly the same as write, and turn all of vmsplice() into just "it's
> > a read() if the pipe is open for read, and a write if it's open for
> > writing".
> https://lore.kernel.org/all/CAHk-=wgG_2cmHgZwKjydi7=iimyHyN8aessnbM9XQ9ufbaUz9g@mail.gmail.com/
> 
> Christoph Hellwig in May 2026:
> > vmsplice is the worst, as it is one of the few remaining places that
> > can incorrectly dirty file backed pages without telling the file system
> > and cause the other problems fixed by a FOLL_PIN conversion, but it is
> > the only one where we do not have any idea yet how we could convert it
> > to FOLL_PIN due to the unbounded pin time.
> https://lore.kernel.org/all/agwFlBKvKytjURDO@infradead.org/
> 
> See recent discussion here:
> https://lore.kernel.org/all/20260516182126.530498-1-pfalcato@suse.de/T/#u

So, you took an ongoing discussion with an ongoing RFC patchset, and you
decided to reimplement part of the idea on your own, as a concurrent patchset.

Riiiiiight.... I don't think I have to NAK this, do I?

> 
> For all these reasons I propose to make vmsplice a simple wrapper for
> preadv2/pwritev2.
> 
> vmsplice(fd, vec, vlen, vmsplice_flags) will
> be equivalent to preadv2(fd, vec, vlen, -1, rw_flags) if you have
> readable pipe and to pwritev2(fd, vec, vlen, -1, rw_flags) if you have
> writable pipe.

This does not work. https://codesearch.debian.net/search?q=vmsplice%28&literal=1
There are users.

-- 
Pedro

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: David Hildenbrand (Arm) @ 2026-05-31 19:01 UTC (permalink / raw)
  To: Pedro Falcato, Askar Safin
  Cc: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara,
	linux-kernel, linux-mm, linux-api, netdev, Linus Torvalds,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, Miklos Szeredi, patches
In-Reply-To: <ahv16ogY8Zx3Rtox@pedro-suse.lan>

On 5/31/26 10:54, Pedro Falcato wrote:
> On Sun, May 31, 2026 at 01:01:04AM +0000, Askar Safin wrote:
>> This patchset is for VFS.
>>
>> Recently we got a lot of vulnerabilities in splice/vmsplice.
>>
>> Also vmsplice already was source of vulnerabilities in the past:
>> CVE-2020-29374 (see https://lwn.net/Articles/849638/ ).
>>
>> Also vmsplice is problematic for other reasons. Here is what other
>> developers say:
>>
>> Linus Torvalds in 2023:
>>> So I'd personally be perfectly ok with just making vmsplice() be
>>> exactly the same as write, and turn all of vmsplice() into just "it's
>>> a read() if the pipe is open for read, and a write if it's open for
>>> writing".
>> https://lore.kernel.org/all/CAHk-=wgG_2cmHgZwKjydi7=iimyHyN8aessnbM9XQ9ufbaUz9g@mail.gmail.com/
>>
>> Christoph Hellwig in May 2026:
>>> vmsplice is the worst, as it is one of the few remaining places that
>>> can incorrectly dirty file backed pages without telling the file system
>>> and cause the other problems fixed by a FOLL_PIN conversion, but it is
>>> the only one where we do not have any idea yet how we could convert it
>>> to FOLL_PIN due to the unbounded pin time.
>> https://lore.kernel.org/all/agwFlBKvKytjURDO@infradead.org/
>>
>> See recent discussion here:
>> https://lore.kernel.org/all/20260516182126.530498-1-pfalcato@suse.de/T/#u
> 
> So, you took an ongoing discussion with an ongoing RFC patchset, and you
> decided to reimplement part of the idea on your own, as a concurrent patchset.
> 
> Riiiiiight.... I don't think I have to NAK this, do I?

Jup. I'll just ignore this patch set here.

-- 
Cheers,

David

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Askar Safin @ 2026-05-31 21:21 UTC (permalink / raw)
  To: Pedro Falcato
  Cc: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara,
	linux-kernel, linux-mm, linux-api, netdev, Linus Torvalds,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, David Hildenbrand, Miklos Szeredi, patches
In-Reply-To: <ahv16ogY8Zx3Rtox@pedro-suse.lan>

On Sun, May 31, 2026 at 11:54 AM Pedro Falcato <pfalcato@suse.de> wrote:
> So, you took an ongoing discussion with an ongoing RFC patchset, and you
> decided to reimplement part of the idea on your own, as a concurrent patchset.

Yes. But I propose an alternative solution to this problem.

Brauner said in discussion for your patchset:
"So I'm not very likely to pick this up as is".
So, I decided to submit another solution.

Pedro, I'm not trying to insult you.

Other kernel developers will decide which of these two solutions they like more.

Many people in discussion of your patchset said how they
dislike splice/vmsplice, and especially vmsplice.
Hellwig said "vmsplice is the worst".
Brauner, Hellwig, Horn said that they dislike vmsplice.
They said that vmsplice in its current form should not
be used, and that it is broken.

Despite all these problems nobody managed to fix
vmsplice in all these years.
So I propose just to effectively remove it.

You may think that I just saw a recent discussion and decided
to jump in. No. splice/vmsplice has been a topic of interest of
mine for many years. You can verify this by searching
"f:Askar splice" on lore.kernel.org . I simply decided that, given
the recent vulnerabilities, now is the perfect time to solve
all these vmsplice problems once and for all.

I explained my position here:
https://lore.kernel.org/all/20260523204100.553125-1-safinaskar@gmail.com/ .
Nobody answered, so I just posted this patchset.

If my patchset is applied, then I will try to deal
with splice-pagecache-to-pipe somehow,
probably by removing it, too. :) I decided first
to deal with vmsplice, because it seems to be
the easier problem.

> > vmsplice(fd, vec, vlen, vmsplice_flags) will
> > be equivalent to preadv2(fd, vec, vlen, -1, rw_flags) if you have
> > readable pipe and to pwritev2(fd, vec, vlen, -1, rw_flags) if you have
> > writable pipe.
>
> This does not work. https://codesearch.debian.net/search?q=vmsplice%28&literal=1
> There are users.

Yes, there are. But my solution is compatible: vmsplice is simply a
performance optimization. vmsplice will work just as before, only slower.
And, most importantly, the vmsplice design problems will be gone
(nobody managed to fix them anyway in all these years).
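
The write-direction equivalence described above can be illustrated from
userspace. The following is a minimal sketch, not part of the patchset: it
pushes the same iovec into a pipe once with vmsplice(fd, vec, vlen, 0) and
once with pwritev2(fd, vec, vlen, -1, 0), and checks that a reader gets the
same bytes either way. The helper name roundtrip() and the test strings are
made up for the example, and the flags are left at 0 because the mapping from
vmsplice flags to rw_flags is defined by the patches, not here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: write two iovec segments into a fresh pipe,
 * either with vmsplice() or with pwritev2() at offset -1, then read
 * them back. Returns 0 if the reader saw exactly the written bytes,
 * -1 otherwise.
 */
static int roundtrip(int use_vmsplice)
{
	char a[] = "hello, ";
	char b[] = "pipe";
	struct iovec vec[2] = {
		{ .iov_base = a, .iov_len = sizeof(a) - 1 },
		{ .iov_base = b, .iov_len = sizeof(b) - 1 },
	};
	const char want[] = "hello, pipe";
	char got[32];
	int fds[2];
	ssize_t n;
	int ret = -1;

	if (pipe(fds) < 0)
		return -1;

	if (use_vmsplice)
		n = vmsplice(fds[1], vec, 2, 0);
	else
		/* offset -1 means "use the current position" */
		n = pwritev2(fds[1], vec, 2, -1, 0);
	if (n != (ssize_t)(sizeof(want) - 1))
		goto out;

	n = read(fds[0], got, sizeof(got));
	if (n == (ssize_t)(sizeof(want) - 1) && memcmp(got, want, n) == 0)
		ret = 0;
out:
	close(fds[0]);
	close(fds[1]);
	return ret;
}
```

The offset of -1 is what makes pwritev2() usable on a pipe at all: the call
then behaves like writev() on the descriptor instead of a positioned write.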

-- 
Askar Safin

^ permalink raw reply

* Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Jaegeuk Kim @ 2026-06-01  1:52 UTC (permalink / raw)
  To: Barry Song
  Cc: Theodore Tso, Bart Van Assche, linux-api, linux-kernel,
	Matthew Wilcox, linux-f2fs-devel, Christoph Hellwig, linux-mm,
	linux-fsdevel, Akilesh Kailash, Christian Brauner
In-Reply-To: <CAGsJ_4yJihngSY0GNcc+MwPHJjpF1qCnS8-UE1GwYoNDtEm9mQ@mail.gmail.com>

On 05/31, Barry Song via Linux-f2fs-devel wrote:
> On Sun, May 31, 2026 at 8:12 AM Jaegeuk Kim <jaegeuk@kernel.org> wrote:
> >
> > On 05/28, Christoph Hellwig wrote:
> > > On Wed, May 27, 2026 at 03:59:35PM +0000, Jaegeuk Kim wrote:
> > > > F2FS merges bios before submit_bio, regardless of small or large folios,
> > > > since the block addresses are consecutive. So, I think IO subsystem was
> > > > working in full speed.
> > >
> > > As does every other remotely modern file system.  But that merging is
> > > surprisingly expensive, which is why using folios gets really major
> > > performance improvements.
> > >
> > > > For one, doing these checks to merge touches quite a few cache lines.
> > > > Second, devices are often a lot more efficient if they see fewer SGL
> > > > entries.  I.e. having a 1MB bio as a single SGL entry tends to work
> > > > better than having 256 of them.
> > > The same is true in the kernel code itself, both in the submission path
> > > (dma mapping and co), and even more so in the page cache handling
> > > both before submitting and in the completion path.
> > >
> > > See Bart's patch about how long the walk of the bio_vecs in the f2fs
> > > completion path can take.  We had similar issues in XFS even in the
> > > workqueue completion path due to lack of rescheduling, and these simply
> > > > go away when you do the folio manipulation in larger chunks (LAZY_PREEMPT
> > > > would avoid the need for explicit rescheduling these days, but that just
> > > > papers over the symptoms in this case).
> > >
> >
> > I see. That's also super helpful. Let me kick off the large folio support asap.
> > Thanks.
> 
> Hi Jaegeuk,

Hi Barry,

> 
> Nanzhe has put significant effort into this work at Xiaomi over
> the past several months. Large folios can now be supported on
> non-immutable files.
> 
> He has conducted extensive testing on the Pixel 6 and fixed a
> number of hangs discovered during development. He is still
> benchmarking performance, but the implementation appears to be
> reasonably stable at this point. We can run Android Monkey for
> many hours without observing any hangs.
> 
> If you would like to see an RFC, I can ask Nanzhe to send one
> as soon as possible after some cleanup and polishing.

Yeah, I was about to reach out to you. Let's do some discussion
offline.

Thanks,

> 
> Best regards,
> Barry

^ permalink raw reply

* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: Li Chen @ 2026-06-01  2:47 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Kees Cook, Alexander Viro, linux-fsdevel, linux-api, linux-kernel,
	linux-mm, linux-arch, linux-doc, linux-kselftest, x86,
	Arnd Bergmann, Andy Lutomirski, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Jan Kara,
	Jonathan Corbet, Shuah Khan
In-Reply-To: <20260528-madig-fachrichtung-fehlinformation-61117ba640da@brauner>

Hi Christian,

Thanks a lot for your great review!

 ---- On Thu, 28 May 2026 19:02:53 +0800  Christian Brauner <brauner@kernel.org> wrote --- 
 > On Thu, May 28, 2026 at 05:52:21PM +0800, Li Chen wrote:
 > > Hi,
 > > 
 > > This is an early RFC for an idea that is probably still rough in both the
 > > UAPI and implementation details. Sorry for the rough edges; I am sending
 > > it now to check whether this direction is worth pursuing and to get
 > > feedback on the kernel/userspace boundary.
 > 
 > The idea of having a builder api for exec isn't all that crazy. But it
 > should simply be built on top of pidfds and thus pidfs itself instead.
 > It has all the basic infrastructure in place already.

Yes, that makes a lot more sense. I was staring too hard at the "hot
executable" part and made the cache/template the API, which was probably
the wrong thing to expose. Sorry about that.

 > Any implementation
 > should also allow userspace to implement posix_spawn() on top of it.

That's so cool, and this is a really useful point. I had not thought about this as
something that could sit under posix_spawn(), but that makes the target
much clearer. It should be a generic exec/spawn builder first, and the
agent use case should just be one user of it.

 > fd = pidfd_open(0, PIDFD_EMPTY /* or better name */)
 > 
 > pidfd_config(fd, ...) // modeled similar to fsconfig()

Reusing pidfd_open() with an empty target is nice because it keeps the API close
to pidfds, but I wonder if a separate entry point such as
pidfd_spawn_open() or pidfd_create() would make the "new process
builder" case a bit more explicit? Either way, the configuration side
being fsconfig-like makes sense to me.

Thanks again for pointing me in this direction. It helps a lot.

Regards,
Li


^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Andy Lutomirski @ 2026-06-01  3:11 UTC (permalink / raw)
  To: Askar Safin
  Cc: linux-fsdevel, Christian Brauner, Alexander Viro, Jan Kara,
	linux-kernel, linux-mm, linux-api, netdev, Linus Torvalds,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, David Hildenbrand, Pedro Falcato, Miklos Szeredi,
	patches
In-Reply-To: <20260531010107.1953702-1-safinaskar@gmail.com>

On Sat, May 30, 2026 at 6:03 PM Askar Safin <safinaskar@gmail.com> wrote:
>
> See recent discussion here:
> https://lore.kernel.org/all/20260516182126.530498-1-pfalcato@suse.de/T/#u
>
> For all these reasons I propose to make vmsplice a simple wrapper for
> preadv2/pwritev2.
>

I have no comment on the code or the history.  But I'm 100% in favor
of the solution.  vmsplice is a crappy API, and would be incredibly
complex to get the implementation right,  and it should be removed.
But it has users, and the approach of just mapping them straight to
pread/pwrite makes perfect sense.

(If anyone wants to contemplate how bad the API is, contemplate gift
mode.  Or contemplate that, if you want correct results, you need to
avoid modifying the memory until the recipient is done reading or you
need to avoid reading the memory until the writer is done writing, and
vmsplice *does not tell you when it's done*.  And there isn't even a
caller specification of whether they want to read or write.  It's ...
crap.)
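
The "does not tell you when it's done" problem can be made concrete with a
small userspace experiment. This is only a sketch under assumptions: the
function name reader_sees_late_write() is invented here, and the outcome
depends on how the running kernel implements vmsplice. With an
implementation that references the caller's pages the reader is expected to
see the bytes stored after vmsplice() returned; with a copying
implementation it would see the original bytes.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical experiment: vmsplice() one page of 'A' bytes into a
 * pipe, overwrite the page with 'B' after the call has returned, and
 * only then read from the pipe.
 * Returns 1 if the reader saw the late 'B' bytes, 0 if it saw the
 * original 'A' bytes, -1 on error.
 */
static int reader_sees_late_write(void)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	struct iovec vec;
	void *mem = NULL;
	char *buf, *out = NULL;
	int fds[2];
	int ret = -1;

	if (pagesz <= 0 || pipe(fds) < 0)
		return -1;
	if (posix_memalign(&mem, (size_t)pagesz, (size_t)pagesz) != 0)
		goto out;
	buf = mem;
	out = malloc((size_t)pagesz);
	if (!out)
		goto out;

	memset(buf, 'A', (size_t)pagesz);
	vec.iov_base = buf;
	vec.iov_len = (size_t)pagesz;
	if (vmsplice(fds[1], &vec, 1, 0) != (ssize_t)pagesz)
		goto out;

	/* vmsplice() has returned, but nothing says the page is free to reuse */
	memset(buf, 'B', (size_t)pagesz);

	if (read(fds[0], out, (size_t)pagesz) != (ssize_t)pagesz)
		goto out;
	ret = (out[0] == 'B');
out:
	free(out);
	free(mem);
	close(fds[0]);
	close(fds[1]);
	return ret;
}
```

Either result is legal for the caller to observe, which is the point: the
API gives the writer no completion signal to wait for before touching the
buffer again.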

--Andy

^ permalink raw reply

* Re: [RFC PATCH v1 00/13] exec: add spawn templates for repeated executable startup
From: Li Chen @ 2026-06-01 15:11 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Christian Brauner, Kees Cook, Alexander Viro, linux-fsdevel,
	linux-api, linux-kernel, linux-mm, linux-arch, linux-doc,
	linux-kselftest, x86, Arnd Bergmann, Andy Lutomirski,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Jan Kara, Jonathan Corbet, Shuah Khan
In-Reply-To: <vealb52tv5suireenkke4lul2l3wbnaul2rp3ea545ly5wa5ty@yk3aksvp7skt>

Hi Mateusz,

 ---- On Thu, 28 May 2026 20:55:32 +0800  Mateusz Guzik <mjguzik@gmail.com> wrote --- 
 > On Thu, May 28, 2026 at 05:52:21PM +0800, Li Chen wrote:
 > > This RFC adds spawn_template, a userspace-controlled exec acceleration
 > > mechanism for runtimes that repeatedly start the same executable with
 > > different argv, envp, and per-spawn file descriptor setup.
 > > 
 > > The main target is agent runtimes. Modern coding agents repeatedly start
 > > short-lived helper tools such as rg, git, sed, awk, python, node, and
 > > shell wrappers while they inspect and edit a workspace. Those runtimes
 > > already know which tools are hot, and they are also the right place to
 > > decide policy. The kernel does not choose names such as rg, git, or sed.
 > > Userspace opts in by creating a template fd for one executable, then uses
 > > that fd for later spawns. Launchers, shells, and build systems have a
 > > similar repeated-startup shape and could use the same primitive, but the
 > > agent runtime case is the main motivation for this RFC.
 > > 
 > [..]
 > > A typical agent runtime would keep one template per hot executable and
 > > still build argv, envp, cwd, and pipe wiring for each tool call:
 > > 
 > >     rg_tmpl = spawn_template_create("/usr/bin/rg");
 > > 
 > >     for each search request:
 > >         out_r, out_w = pipe_cloexec();
 > >         err_r, err_w = pipe_cloexec();
 > >         actions = [
 > >             FCHDIR(worktree_fd),
 > >             DUP2(out_w, STDOUT_FILENO),
 > >             DUP2(err_w, STDERR_FILENO),
 > >         ];
 > >         child = spawn_template_spawn(rg_tmpl, rg_argv, envp, actions);
 > >         close(out_w);
 > >         close(err_w);
 > >         read out_r and err_r;
 > >         waitid(P_PIDFD, child.pidfd, ...);
 > > 
 > > 
 > [..]
 > > The cached state is intentionally small. The template fd keeps the opened
 > > main executable file, an optional absolute path string, the creator
 > > credential pointer, and the deny-write state. The executable identity key
 > > records device, inode, size, mode, owner, ctime, and mtime, and is
 > > rechecked before cached metadata is used. The ELF cache keeps only the
 > > main executable's ELF header, program header table, and program header
 > > count.
 > > 
 > >     cached in this RFC          not cached in this RFC
 > >     ------------------          ----------------------
 > >     opened main executable      PT_INTERP metadata
 > >     executable identity key     shared-library graph
 > >     main ELF header             VMA layout metadata
 > >     main ELF program headers    cross-process metadata sharing
 > >     creator cred pointer
 > >     deny-write state
 > > 
 > > This RFC does not cache ELF interpreter metadata, shared-library
 > > dependency state, or derived mapping-layout state. Shared-library
 > > resolution is dynamic linker policy and depends on LD_LIBRARY_PATH,
 > > RPATH, RUNPATH, /etc/ld.so.cache, mount namespaces, and secure-exec
 > > state. It also does not share cached executable metadata between template
 > > fds created by different processes. Each template owns its small cached
 > > metadata object in this RFC.
 > > 
 > > Performance
 > > ===========
 > > 
 > [..]
 > > Workload     Calls  subprocess  spawn_template  time_s       Delta
 > > (workers)    calls  calls/s     calls/s         seconds
 > > 1x16         6144      411.04          420.32   14.95/14.62  +2.26%
 > > 2x8          6144      666.78          690.08    9.21/8.90   +3.49%
 > > 4x4          6144      955.61         1003.25    6.43/6.12   +4.99%
 > > 8x2          6144     1048.25         1069.18    5.86/5.75   +2.00%
 > > 
 > 
 > This problem is dear to my heart and I have been pondering it on and off
 > for some time now. The entire fork + exec idiom is terrible and needs to
 > be retired.
 > 
 > Is this vibe-coded? I asked claude for in-kernel posix_spawn for kicks
 > some time ago and it generated remarkably similar code. But that's a
 > tangent.

Partly, yes. The original idea came from using agents myself and noticing
that they spend a lot of time starting short-lived tools such as rg, sed,
git, bash, and python. I was wondering whether repeated tool calls could
be made cheaper.

After that I used an LLM to bounce around the smallest kernel prototype
for the idea. I did the review, patch splitting, testing, benchmarking, and
leak checking, and threw away some cache code that was not actually useful.

 > I'm rather confused by the angle in the patchset. Most of this shaves
 > off a tiny amount of work, while retaining the primary avoidable reason
 > for bad performance: the very fact that fork is part of the picture,
 > especially the part mucking with mm. Creating a pristine process is the
 > way to go.
 > 
 > Additionally there is a known problem where transiently copied file
 > descriptors on fork + exec cause a headache in multithreaded programs
 > doing something like this in parallel. I only did cursory reading, it
 > seems your patchset keeps the same problem in place.
 > 
 > There are numerous impactful ways to speed up execs both in terms of
 > single-threaded cost and their multicore scalability, most of which
 > would be immediately usable by all programs without an opt-in. imo these
 > need to be exhausted before something like a "template" can be
 > considered.
 > 
 > Per the above, the primary win would stem from *NOT* messing with mm.
 > 
 > As in, whatever the interface, it needs to create an "empty" target
 > process (for lack of a better term).
 > 
 > In terms of userspace-visible APIs, a clean solution escapes me.
 > 
 > Some time ago I proposed returning a handle which is populated over time
 > by the parent-to-be. One of the problems with it I failed to consider at
 > the time is NUMA locality -- what if the process to be created is going
 > to run on another domain? For example, opening and installing a file for
 > its later use will result in avoidable loss of locality for some of the
 > in-kernel data. That's on top of the fd vs fork problem.
 > 
 > From perf standpoint, the final goal of whatever mechanism should be a
 > state where the target process avoided copying any state it did not need
 > to and which allocated any memory it needed from local NUMA node
 > (whatever it may happen to be). Of course if no affinity is assigned it
 > may happen to move again and lose such locality, nothing can be done
 > about that. But pretend the process is to run in a specific node the
 > parent is NOT running in.
 > 
 > So I think the pragmatic way forward is to implement something close to
 > posix_spawn in the kernel. It may make sense for the thing to take the
 > PATH argument for repeated exec attempts. I understand this is of no use
 > in your particular case, but it very much IS of use for most of the
 > real-world. The initial implementation might even start with doing vfork
 > just to get it off the ground.
 > 
 > The next step would be to extend the interface with means to AVOID
 > copying any file descriptors. There could be a dedicated file action
 > which tells the kernel to avoid such copies or something like a
 > close_range file action (or close_from) -- with a range like <0, INT_MAX>
 > you know no fds are copied.
 > 
 > For the NUMA angle to be sorted out, any file action which opens a file
 > or dups from the parent needs to execute in the child. And frankly
 > something would be needed to ask the scheduler where does it think the
 > child is going to run, so that the task_struct itself can also be
 > allocated with the right backing.
 > 
 > I have not looked into what's needed to create a new process and NOT
 > mess with mm, but I don't think there are unsolvable problems there, at
 > worst some churn.
 > 
 > There are of course other parameters which need to be sorted out, that's
 > covered by the posix_spawn thing.
 > 
 > This e-mail is long enough, so I'm not going to go into issues
 > concerning exec itself right now.
 > 
 > tl;dr I would suggest redoing the patchset as posix_spawn and then doing
 > the actual optimization of not cloning mm itself.
 > 

Thanks a lot for writing this up. I clearly had too narrow a view of the
problem. I was mostly thinking about repeated executable startup, but your
reply and Christian's and Andy's made me see that the more useful target is probably
a pidfd/pidfs-backed process builder which can sit under posix_spawn, and
then grow into something that avoids the fork-shaped mm and fd costs. I
learned a lot from this thread.
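
As a reference point for what such a builder would have to cover, here is a
minimal sketch of the same "set up stdout, then start the tool" shape using
only today's posix_spawn() interface. It is not the proposed pidfd-based API
and says nothing about how that API would look; the helper name spawn_echo()
and the choice of echo as the child are assumptions made for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <spawn.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/*
 * Hypothetical helper: run "echo <word>" with stdout redirected into
 * a pipe and copy the child's output into out (NUL-terminated).
 * Returns the child's exit status, or -1 on error.
 */
static int spawn_echo(const char *word, char *out, size_t outlen)
{
	posix_spawn_file_actions_t fa;
	char *argv[] = { "echo", (char *)word, NULL };
	size_t used = 0;
	int fds[2], status, err;
	ssize_t n;
	pid_t pid;

	if (outlen == 0 || pipe(fds) < 0)
		return -1;
	if (posix_spawn_file_actions_init(&fa) != 0) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}

	/* Applied in the child: stdout becomes the pipe, then drop both pipe fds */
	err = posix_spawn_file_actions_adddup2(&fa, fds[1], STDOUT_FILENO);
	if (!err)
		err = posix_spawn_file_actions_addclose(&fa, fds[0]);
	if (!err)
		err = posix_spawn_file_actions_addclose(&fa, fds[1]);
	if (!err)
		err = posix_spawnp(&pid, "echo", &fa, NULL, argv, environ);
	posix_spawn_file_actions_destroy(&fa);
	close(fds[1]);
	if (err) {
		close(fds[0]);
		return -1;
	}

	/* Read until EOF or until the caller's buffer is full */
	while (used < outlen - 1 &&
	       (n = read(fds[0], out + used, outlen - 1 - used)) > 0)
		used += (size_t)n;
	out[used] = '\0';
	close(fds[0]);

	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The file actions are the part that maps onto the per-spawn setup discussed
in this thread: they are recorded in the parent and carried out on the
child's side before the new image starts.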

At a high level, Windows CreateProcess/NtCreateUserProcess also looks
closer to this direction than fork+exec: create the target process
directly, pass explicit startup attributes and handle inheritance state,
and avoid starting from a copy of the parent address space. That seems
to be the same basic advantage here: build the child closer to its final
shape instead of copying parent state and then throwing much of it away.

I will study the process creation, exec, pidfd/pidfs, and posix_spawn
code more carefully, then try the direction you suggested
and benchmark the mm/fd costs.

Regards,
Li


^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Matthew Wilcox @ 2026-06-01 15:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Askar Safin, linux-fsdevel, Christian Brauner, Alexander Viro,
	Jan Kara, linux-kernel, linux-mm, linux-api, netdev,
	Linus Torvalds, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, David Hildenbrand, Pedro Falcato, Miklos Szeredi,
	patches
In-Reply-To: <CALCETrW__=8mSusayfXG7UFCfue5BGbx+vqESj1d9wqOfX4s8w@mail.gmail.com>

On Sun, May 31, 2026 at 08:11:34PM -0700, Andy Lutomirski wrote:
> On Sat, May 30, 2026 at 6:03 PM Askar Safin <safinaskar@gmail.com> wrote:
> >
> > See recent discussion here:
> > https://lore.kernel.org/all/20260516182126.530498-1-pfalcato@suse.de/T/#u
> >
> > For all these reasons I propose to make vmsplice a simple wrapper for
> > preadv2/pwritev2.
> >
> 
> I have no comment on the code or the history.  But I'm 100% in favor
> of the solution.  vmsplice is a crappy API, and would be incredibly
> complex to get the implementation right,  and it should be removed.
> But it has users, and the approach of just mapping them straight to
> pread/pwrite makes perfect sense.

I agree with Andy.  I think it was appropriate to send this series, since
(as far as I can tell) it's a completely different approach from the others
taken.  I'm not really qualified to judge whether the implementation is
good (it's a bit outside my competency as a reviewer), but the described
approach is more convincing to me than the other approaches.

Can we review this series properly?

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Linus Torvalds @ 2026-06-01 15:50 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andy Lutomirski, Askar Safin, linux-fsdevel, Christian Brauner,
	Alexander Viro, Jan Kara, linux-kernel, linux-mm, linux-api,
	netdev, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, David Hildenbrand, Pedro Falcato, Miklos Szeredi,
	patches
In-Reply-To: <ah2nBAdsE5vVJ2PL@casper.infradead.org>

On Mon, 1 Jun 2026 at 08:36, Matthew Wilcox <willy@infradead.org> wrote:
>
> Can we review this series properly?

Well, since it pretty much is what I suggested a few years ago, I
certainly won't NAK it.

And the patches looked very straightforward to me. Just the final
diffstat is worth quoting again because that certainly doesn't look
problematic:

  7 files changed, 33 insertions(+), 204 deletions(-)

and it removes that GIFT flag that was truly disgusting.

So I'm certainly ok with it from a "looking at the patch" standpoint.
I didn't _test_ it. I don't have any workload that might remotely
care.

I did a quick scan on debian code search for vmsplice, and after ten
pages of entries that weren't actually *using* it but had lists of
system calls, I grew bored. So there are likely users, but I don't
know what they are and how much they care. It *might* be a big
performance issue somewhere. Unlikely, but...

         Linus

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Christian Brauner @ 2026-06-01 16:16 UTC (permalink / raw)
  To: Askar Safin
  Cc: Pedro Falcato, linux-fsdevel, Alexander Viro, Jan Kara,
	linux-kernel, linux-mm, linux-api, netdev, Linus Torvalds,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, David Hildenbrand, Miklos Szeredi, patches
In-Reply-To: <CAPnZJGD5QR5jrNMAem7FjMRTtw+Ue+jm7Dc2Kp8Lcjj+9TDw_Q@mail.gmail.com>

On Mon, Jun 01, 2026 at 12:21:06AM +0300, Askar Safin wrote:
> On Sun, May 31, 2026 at 11:54 AM Pedro Falcato <pfalcato@suse.de> wrote:
> > So, you took an ongoing discussion with an ongoing RFC patchset, and you
> > decided to reimplement part of the idea on your own, as a concurrent patchset.
> 
> Yes. But I propose an alternative solution to this problem.

So I think this is a case where no explicit rules have been broken. But
if you know that someone has been posting patches and is working on a
problem, just racing them to get your own stuff merged is very likely to
unnecessarily ruffle feathers. So sync with the person next time.

The discussion wasn't at an impasse and Pedro is expected to follow up.
It's not very nice to just have someone else's work be for naught.

> Brauner said in discussion for your patchset:
> "So I'm not very likely to pick this up as is".
> So, I decided to submit another solution.

This lacks quite some context... I said "in its current form" and then a
long discussion ensued.

> If my patchset is applied, then I will try to deal
> with splice-pagecache-to-pipe somehow,
> probably by removing it, too. :) I decided first

So ok, but this is literally what Pedro is working on. This just wastes
people's time.

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Christian Brauner @ 2026-06-01 16:17 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Andy Lutomirski, Askar Safin, linux-fsdevel,
	Alexander Viro, Jan Kara, linux-kernel, linux-mm, linux-api,
	netdev, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, David Hildenbrand, Pedro Falcato, Miklos Szeredi,
	patches
In-Reply-To: <CAHk-=wiFuud0Nn3B9YpTWyQja08TeXVk2AB-aAkmVXyigOagbQ@mail.gmail.com>

On Mon, Jun 01, 2026 at 08:50:00AM -0700, Linus Torvalds wrote:
> On Mon, 1 Jun 2026 at 08:36, Matthew Wilcox <willy@infradead.org> wrote:
> >
> > Can we review this series properly?
> 
> Well, since it pretty much is what I suggested a few years ago, I
> certainly won't NAK it.
> 
> And the patches looked very straightforward to me. Just the final
> diffstat is worth quoting again because that certainly doesn't look
> problematic:
> 
>   7 files changed, 33 insertions(+), 204 deletions(-)
> 
> and it removes that GIFT flag that was truly disgusting.
> 
> So I'm certainly ok with it from a "looking at the patch" standpoint.
> I didn't _test_ it. I don't have any workload that might remotely
> care.
> 
> I did a quick scan on debian code search for vmsplice, and after ten
> pages of entries that weren't actually *using* it but had lists of
> system calls, I grew bored. So there are likely users, but I don't
> know what they are and how much they care. It *might* be a big
> performance issue somewhere. Unlikely, but...

As usual I would argue to accept it and revert in case we get actual
regression reports...

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Linus Torvalds @ 2026-06-01 16:22 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Matthew Wilcox, Andy Lutomirski, Askar Safin, linux-fsdevel,
	Alexander Viro, Jan Kara, linux-kernel, linux-mm, linux-api,
	netdev, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, David Hildenbrand, Pedro Falcato, Miklos Szeredi,
	patches
In-Reply-To: <20260601-aufweichen-dissens-ausrechnen-0d9b84728113@brauner>

On Mon, 1 Jun 2026 at 09:17, Christian Brauner <brauner@kernel.org> wrote:
>
> As usual I would argue to accept it and revert in case we get actual
> regression reports...

Yes, likely the only way we'd ever find out ..

          Linus

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Christian Brauner @ 2026-06-01 16:23 UTC (permalink / raw)
  To: Askar Safin
  Cc: Christian Brauner, linux-kernel, linux-mm, linux-api, netdev,
	Linus Torvalds, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, David Hildenbrand, Pedro Falcato,
	Miklos Szeredi, patches, linux-fsdevel, Alexander Viro, Jan Kara
In-Reply-To: <20260531010107.1953702-1-safinaskar@gmail.com>

On Sun, 31 May 2026 01:01:04 +0000, Askar Safin wrote:
> This patchset is for VFS.
> 
> Recently we got a lot of vulnerabilities in splice/vmsplice.
> 
> Also vmsplice already was source of vulnerabilities in the past:
> CVE-2020-29374 (see https://lwn.net/Articles/849638/ ).
> 
> [...]

Applied to the vfs-7.2.vmsplice branch of the vfs/vfs.git tree.
Patches in the vfs-7.2.vmsplice branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-7.2.vmsplice

[1/3] tee: fs/splice.c: remove unused parameter "flags" from "link_pipe"
      https://git.kernel.org/vfs/vfs/c/a9f7db50ed2f
[2/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
      https://git.kernel.org/vfs/vfs/c/e2c0b2368081
[3/3] splice: remove PIPE_BUF_FLAG_GIFT
      https://git.kernel.org/vfs/vfs/c/7d75aa8edfce

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Linus Torvalds @ 2026-06-01 17:17 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Askar Safin, linux-kernel, linux-mm, linux-api, netdev,
	Matthew Wilcox, Jens Axboe, Christoph Hellwig, David Howells,
	Andrew Morton, David Hildenbrand, Pedro Falcato, Miklos Szeredi,
	patches, linux-fsdevel, Alexander Viro, Jan Kara
In-Reply-To: <20260601-enthusiasmus-canceln-anlehnen-0e62317a9784@brauner>

On Mon, 1 Jun 2026 at 09:42, Christian Brauner <brauner@kernel.org> wrote:
>
> Applied to the vfs-7.2.vmsplice branch of the vfs/vfs.git tree.

Btw, if people want to work further on this - assuming we don't get
any huge screams of pain from having effectively gotten rid of
vmsplice() - I don't think it would hurt to look at limiting the
"regular" splice() too.

We already have the code to just turn it into a pure copy on the
"splice to pipe" case: copy_splice_read(). In many ways it would be
*lovely* to just always force that path.

We already do that explicitly for DAX and O_DIRECT, but we made a lot
of special files do it implicitly too, so quite a lot of the splice
reading cases already use that "just read() into a kernel space
buffer" model for splicing.
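
As a rough userspace illustration of why forcing the copy path looks
plausible: apart from performance and page-sharing side effects, a caller
that splices a regular file into a pipe only sees the bytes that come out
of the pipe, not whether the kernel linked page-cache pages or copied
them. A minimal sketch, assuming /tmp is writable; the helper names
splice_file_via_pipe() and make_temp_file() are made up for this example:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Splice up to len bytes from the start of a regular file into a
 * freshly created pipe, then read them back out of the pipe into out.
 * Returns the number of bytes recovered, 0 on EOF, or -1 on error.
 * (Hypothetical helper, not a kernel or libc interface.)
 */
static ssize_t splice_file_via_pipe(int file_fd, char *out, size_t len)
{
	int p[2];
	loff_t off = 0;		/* explicit offset, file position is untouched */
	ssize_t queued, got;

	if (pipe(p) < 0)
		return -1;

	/* file -> pipe: this is the "splice to pipe" direction */
	queued = splice(file_fd, &off, p[1], NULL, len, 0);

	/* pipe -> user buffer: an ordinary read() of what was queued */
	got = queued > 0 ? read(p[0], out, (size_t)queued) : queued;

	close(p[0]);
	close(p[1]);
	return got;
}

/* Hypothetical helper: create an unlinked temp file containing msg. */
static int make_temp_file(const char *msg)
{
	char path[] = "/tmp/splice-demo-XXXXXX";
	size_t n = strlen(msg);
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);		/* file disappears once fd is closed */
	if (write(fd, msg, n) != (ssize_t)n) {
		close(fd);
		return -1;
	}
	return fd;
}
```

Calling splice_file_via_pipe(make_temp_file("hello splice"), buf, 12)
should hand back the same 12 bytes either way; the sketch says nothing
about how the kernel moved them.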

It would be interesting to hear who would even notice if we just
always used that copy case, and made "f_op->splice_read" never trigger
at all.

And it turns out that the only thing that ever uses
"f_op->splice_write" is splice_to_socket. Which was actually the
problematic buggy case.

Everybody else pretty much seems to just use iter_file_splice_write(),
which does the "emulate it with just a write from kernel buffers".

So *if* we get rid of f_op->splice_read, we do leave the case that
really caused problems, but nobody will ever care. Because once splice
only deals with private buffers that can't be shared with anything
else, a f_op->splice_write() that gets things wrong is pretty much a
non-event.

(We'd have to look at 'tee()' too: I don't think anybody really uses
it, but it does do the "no copy linking" by just incrementing
refcounts on the pipe buffers. So to really protect against
splice_write users messing up, that should do copies too, but as long
as it's all "private ephemeral buffers" that get their refcounts
updated, I don't think anybody *really* cares)
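
For readers who have not used it, a minimal userspace sketch of what
tee() provides: data queued in one pipe is duplicated into a second pipe
without being consumed from the first. The wrapper name tee_pipe() is
made up for this example, and whether the kernel links the buffers by
refcount or copies them is not visible in the result:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/*
 * Duplicate up to len bytes queued in the pipe behind src_rfd into the
 * pipe behind dst_wfd. The data stays readable from src_rfd afterwards.
 * SPLICE_F_NONBLOCK makes the call fail with EAGAIN instead of blocking
 * when the source pipe is empty.
 * Returns the number of bytes duplicated, or -1 on error.
 * (Hypothetical wrapper, not a kernel or libc interface.)
 */
static ssize_t tee_pipe(int src_rfd, int dst_wfd, size_t len)
{
	return tee(src_rfd, dst_wfd, len, SPLICE_F_NONBLOCK);
}
```

After write(a[1], "abc", 3) and tee_pipe(a[0], b[1], 3), reading from
either a[0] or b[0] should return the same three bytes.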

TLDR: maybe we could get rid of "f_op->splice_read". *That* would be
a big simplification.

                Linus

^ permalink raw reply

* Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2
From: Al Viro @ 2026-06-01 17:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christian Brauner, Askar Safin, linux-kernel, linux-mm, linux-api,
	netdev, Matthew Wilcox, Jens Axboe, Christoph Hellwig,
	David Howells, Andrew Morton, David Hildenbrand, Pedro Falcato,
	Miklos Szeredi, patches, linux-fsdevel, Jan Kara, Steven Rostedt
In-Reply-To: <CAHk-=wifX_rrDjRGnDnOqE-usptAukuXKrmuPuVDP5bOCBWzGQ@mail.gmail.com>

On Mon, Jun 01, 2026 at 10:17:23AM -0700, Linus Torvalds wrote:

> TLDR: maybe we could get rid of "f_op->splice_read". *That* would be
> a big simplification.

FUSE might be interesting - fuse_dev_splice_read() and its ilk.
Communications between the kernel and fuse server at least used to
seriously want that, so that would be one place to look for unhappy
userland...

The splice-related logic in fs/fuse/dev.c is interesting; another place
like this is kernel/trace/, but I'm less familiar with that one.

rostedt Cc'd (miklos already had been)

^ permalink raw reply

