* [RFC PATCH 1/2] mm: `memfd_tripwire` proof-of-concept
2026-06-03 12:55 [RFC PATCH 0/2] mm: memfd with write notifications Mattias Nissler
@ 2026-06-03 12:55 ` Mattias Nissler
2026-06-03 12:55 ` [RFC PATCH 2/2] selftests: `memfd_tripwire` selftest Mattias Nissler
2026-06-11 1:36 ` [RFC PATCH 0/2] mm: memfd with write notifications Baolin Wang
2 siblings, 0 replies; 7+ messages in thread
From: Mattias Nissler @ 2026-06-03 12:55 UTC
To: linux-mm; +Cc: Hugh Dickins, Baolin Wang, mnissler, mattias.nissler
`memfd_tripwire()` creates a file descriptor referring to a memory
region that generates poll notifications when it is written to. This works
by installing read-only mappings. Write accesses then trigger a fault
and invoke the `page_mkwrite()` handler, which queues a POLLIN event and
allows the PTE to be made writable. After observing the POLLIN, the
consumer is expected to invoke the `MEMFD_TRIPWIRE_ACK` ioctl, which
write-protects the mappings again to re-arm the mechanism. It also
guarantees that previous writes are visible to the consumer.
Signed-off-by: Mattias Nissler <mnissler@meta.com>
Assisted-by: Claude:claude-opus-4-6
---
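Not part of the patch: a rough user-space model of the state machine
described in the commit message above, which may help when reasoning
about the semantics. It assumes a fixed four-page region and runs
single-threaded, so it says nothing about the folio-lock ordering in the
real code. `struct tripwire_model` and the `model_*()` helpers are made-up
names for illustration and do not exist in the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGES 4	/* hypothetical region size, in pages */

/* Toy stand-in for the per-file state plus per-page write permission. */
struct tripwire_model {
	atomic_int dirty;		/* mirrors tripwire_info.dirty */
	bool writable[MODEL_PAGES];	/* models the PTE write bit per page */
	unsigned int wakeups;		/* POLLIN wakeups issued so far */
};

/* A store to @page: write-protected pages take the "fault" path first. */
void model_write(struct tripwire_model *m, size_t page)
{
	int expected = 0;

	assert(page < MODEL_PAGES);
	if (m->writable[page])
		return;		/* no fault, hence no notification */

	/* Like page_mkwrite(): only the 0 -> 1 transition wakes pollers. */
	if (atomic_compare_exchange_strong(&m->dirty, &expected, 1))
		m->wakeups++;
	m->writable[page] = true;
}

/* What poll() would report: true while a notification is pending. */
bool model_pollin(struct tripwire_model *m)
{
	return atomic_load(&m->dirty) != 0;
}

/* Like MEMFD_TRIPWIRE_ACK: clear the flag, then write-protect every page. */
void model_ack(struct tripwire_model *m)
{
	atomic_store(&m->dirty, 0);
	for (size_t i = 0; i < MODEL_PAGES; i++)
		m->writable[i] = false;
}
```

In this model, repeated stores between two ACKs produce a single wakeup,
and the first store after an ACK produces a new one. That is the intended
reading of "write-protects the mappings again to re-arm the mechanism".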
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
include/linux/syscalls.h | 1 +
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/memfd_tripwire.h | 19 ++
kernel/sys_ni.c | 3 +
mm/Kconfig | 9 +
mm/Makefile | 1 +
mm/memfd_tripwire.c | 246 +++++++++++++++++++++++++
scripts/syscall.tbl | 1 +
9 files changed, 285 insertions(+), 1 deletion(-)
create mode 100644 include/uapi/linux/memfd_tripwire.h
create mode 100644 mm/memfd_tripwire.c
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 524155d655da1..d93ace9385fd0 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -396,6 +396,7 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common memfd_tripwire sys_memfd_tripwire
#
# Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 4fb7291f54b62..52ea6d808f96d 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -940,6 +940,7 @@ asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
asmlinkage long sys_getrandom(char __user *buf, size_t count,
unsigned int flags);
asmlinkage long sys_memfd_create(const char __user *uname_ptr, unsigned int flags);
+asmlinkage long sys_memfd_tripwire(unsigned int flags);
asmlinkage long sys_bpf(int cmd, union bpf_attr __user *attr, unsigned int size);
asmlinkage long sys_execveat(int dfd, const char __user *filename,
const char __user *const __user *argv,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a627acc8fb5fe..38825b1c59cef 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -863,8 +863,11 @@ __SYSCALL(__NR_listns, sys_listns)
#define __NR_rseq_slice_yield 471
__SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
+#define __NR_memfd_tripwire 472
+__SYSCALL(__NR_memfd_tripwire, sys_memfd_tripwire)
+
#undef __NR_syscalls
-#define __NR_syscalls 472
+#define __NR_syscalls 473
/*
* 32 bit systems traditionally used different
diff --git a/include/uapi/linux/memfd_tripwire.h b/include/uapi/linux/memfd_tripwire.h
new file mode 100644
index 0000000000000..478599ba2f813
--- /dev/null
+++ b/include/uapi/linux/memfd_tripwire.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_MEMFD_TRIPWIRE_H
+#define _UAPI_LINUX_MEMFD_TRIPWIRE_H
+
+#include <linux/ioctl.h>
+
+#define MEMFD_TRIPWIRE_IOC 0xDA
+
+/*
+ * MEMFD_TRIPWIRE_ACK serves two purposes: First, it re-arms the mechanism to
+ * make sure future write activity triggers a POLLIN notification. Second, it
+ * makes sure that all writes up to the ACK are visible to the calling process.
+ * It is also guaranteed that no writes can sneak through unnoticed, i.e. after
+ * ACK concurrent writes have either taken place and are visible to the
+ * consumer, or will generate subsequent POLLIN events.
+ */
+#define MEMFD_TRIPWIRE_ACK _IO(MEMFD_TRIPWIRE_IOC, 0x00)
+
+#endif /* _UAPI_LINUX_MEMFD_TRIPWIRE_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index add3032da16f5..6bcf8658b4ff5 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -270,6 +270,9 @@ COND_SYSCALL(pkey_free);
/* memfd_secret */
COND_SYSCALL(memfd_secret);
+/* memfd_tripwire */
+COND_SYSCALL(memfd_tripwire);
+
/*
* Architecture specific weak syscall entries.
*/
diff --git a/mm/Kconfig b/mm/Kconfig
index e8bf1e9e6ad90..c7c1c2e69d37b 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1323,6 +1323,15 @@ config SECRETMEM
memory areas visible only in the context of the owning process and
not mapped to other processes and other kernel page tables.
+config MEMFD_TRIPWIRE
+ bool "Enable memfd_tripwire() system call"
+ depends on MMU
+ help
+ Enable the memfd_tripwire() system call, which creates an anonymous
+ shared memory region with built-in write notification support. The
+ returned file descriptor supports poll() for detecting writes to the
+ mapped memory and an ioctl for re-arming notifications.
+
config ANON_VMA_NAME
bool "Anonymous VMA name support"
depends on PROC_FS && ADVISE_SYSCALLS && MMU
diff --git a/mm/Makefile b/mm/Makefile
index 8ad2ab08244eb..658443dbcb28d 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -141,6 +141,7 @@ obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
obj-$(CONFIG_ZONE_DEVICE) += memremap.o
obj-$(CONFIG_HMM_MIRROR) += hmm.o
obj-$(CONFIG_MEMFD_CREATE) += memfd.o
+obj-$(CONFIG_MEMFD_TRIPWIRE) += memfd_tripwire.o
obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
obj-$(CONFIG_PTDUMP) += ptdump.o
obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
diff --git a/mm/memfd_tripwire.c b/mm/memfd_tripwire.c
new file mode 100644
index 0000000000000..52a27507b2d39
--- /dev/null
+++ b/mm/memfd_tripwire.c
@@ -0,0 +1,246 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * memfd_tripwire - Memory file descriptor with write notification support.
+ *
+ * Creates an anonymous memory-backed file descriptor that supports polling for
+ * detecting writes to its mappings.
+ *
+ * Theory of operation:
+ * 1. memfd_tripwire() to create a new instance
+ * 2. Pass the fd / mapping to the (possibly out-of-process) producer
+ * 3a. The producer uses the memory region as normal memory. Writes stick, but
+ * generate a notification as a side effect.
+ * 3b. The consumer polls the file descriptor:
+ * a. POLLIN is observed when the memory region gets written
+ * b. Call ioctl(MEMFD_TRIPWIRE_ACK) to restore write protection
+ * c. Inspect memory contents and react as appropriate
+ */
+
+#include <linux/anon_inodes.h>
+#include <linux/file.h>
+#include <linux/folio_batch.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/mount.h>
+#include <linux/pagemap.h>
+#include <linux/poll.h>
+#include <linux/pseudo_fs.h>
+#include <linux/rmap.h>
+#include <linux/slab.h>
+#include <linux/syscalls.h>
+
+#include <uapi/linux/magic.h>
+#include <uapi/linux/memfd_tripwire.h>
+
+struct tripwire_info {
+ atomic_t dirty;
+ wait_queue_head_t wqh;
+};
+
+static struct tripwire_info *to_tripwire_info(struct file *file)
+{
+ return file->private_data;
+}
+
+static vm_fault_t tripwire_fault(struct vm_fault *vmf)
+{
+ struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+ struct inode *inode = file_inode(vmf->vma->vm_file);
+ pgoff_t offset = vmf->pgoff;
+ gfp_t gfp = vmf->gfp_mask;
+ struct folio *folio;
+ int err;
+
+ if (((loff_t)offset << PAGE_SHIFT) >= i_size_read(inode))
+ return vmf_error(-EINVAL);
+
+retry:
+ folio = filemap_lock_folio(mapping, offset);
+ if (IS_ERR(folio)) {
+ folio = folio_alloc(gfp | __GFP_ZERO, 0);
+ if (!folio)
+ return VM_FAULT_OOM;
+
+ __folio_mark_uptodate(folio);
+ err = filemap_add_folio(mapping, folio, offset, gfp);
+ if (unlikely(err)) {
+ folio_put(folio);
+ if (err == -EEXIST)
+ goto retry;
+ return vmf_error(err);
+ }
+ }
+
+ vmf->page = folio_file_page(folio, offset);
+ return VM_FAULT_LOCKED;
+}
+
+static vm_fault_t tripwire_page_mkwrite(struct vm_fault *vmf)
+{
+ struct tripwire_info *info = to_tripwire_info(vmf->vma->vm_file);
+ struct folio *folio = page_folio(vmf->page);
+
+ /*
+ * Note that this might be racing with a concurrent ACK. We need to
+ * guarantee that the dirty flag and page protection state remain
+ * consistent and that no notifications are lost.
+ *
+ * The actual update to mark the PTE writable only happens after this
+ * function completes, so the window between here and PTE update must
+ * be protected against concurrent modifications by the ACK code path.
+ * Taking the folio lock before notifying consumers conveniently
+ * makes sure that ACKs can only complete after our PTE updates go
+ * through, preventing a situation where an interleaved ACK clears the
+ * dirty flag but we're still going ahead to mark the PTE writable.
+ */
+ folio_lock(folio);
+
+ if (atomic_cmpxchg(&info->dirty, 0, 1) == 0)
+ wake_up_poll(&info->wqh, EPOLLIN);
+
+ if (folio->mapping != vmf->vma->vm_file->f_mapping) {
+ folio_unlock(folio);
+ return VM_FAULT_NOPAGE;
+ }
+
+ return VM_FAULT_LOCKED;
+}
+
+static const struct vm_operations_struct tripwire_vm_ops = {
+ .fault = tripwire_fault,
+ .page_mkwrite = tripwire_page_mkwrite,
+};
+
+static int tripwire_mmap_prepare(struct vm_area_desc *desc)
+{
+ file_accessed(desc->file);
+ desc->vm_ops = &tripwire_vm_ops;
+ return 0;
+}
+
+static __poll_t tripwire_poll(struct file *file, poll_table *wait)
+{
+ struct tripwire_info *info = to_tripwire_info(file);
+ __poll_t mask = 0;
+
+ poll_wait(file, &info->wqh, wait);
+
+ if (atomic_read(&info->dirty))
+ mask |= EPOLLIN | EPOLLRDNORM;
+
+ return mask;
+}
+
+static void tripwire_ack(struct file *file)
+{
+ struct tripwire_info *info = to_tripwire_info(file);
+ struct address_space *mapping = file->f_mapping;
+ struct folio_batch fbatch;
+ pgoff_t index = 0;
+ int i;
+
+ /*
+ * Note that this flag update is not protected by the folio locks taken
+ * below. Hence a concurrent writer might sneak in and switch back to
+ * dirty state. That's OK though, since it merely results in a spurious
+ * notification.
+ */
+ atomic_set(&info->dirty, 0);
+
+ folio_batch_init(&fbatch);
+ while (filemap_get_folios(mapping, &index, ~0UL, &fbatch)) {
+ for (i = 0; i < folio_batch_count(&fbatch); i++) {
+ folio_lock(fbatch.folios[i]);
+ folio_mkclean(fbatch.folios[i]);
+ folio_unlock(fbatch.folios[i]);
+ }
+ folio_batch_release(&fbatch);
+ cond_resched();
+ }
+}
+
+static long tripwire_ioctl(struct file *file, unsigned int cmd,
+ unsigned long arg)
+{
+ switch (cmd) {
+ case MEMFD_TRIPWIRE_ACK:
+ if (arg != 0)
+ return -EINVAL;
+ tripwire_ack(file);
+ return 0;
+ default:
+ return -ENOTTY;
+ }
+}
+
+static int tripwire_release(struct inode *inode, struct file *file)
+{
+ kfree(to_tripwire_info(file));
+ return 0;
+}
+
+static const struct file_operations tripwire_fops = {
+ .release = tripwire_release,
+ .mmap_prepare = tripwire_mmap_prepare,
+ .poll = tripwire_poll,
+ .unlocked_ioctl = tripwire_ioctl,
+};
+
+static const struct address_space_operations tripwire_aops = {
+ .dirty_folio = noop_dirty_folio,
+};
+
+static int tripwire_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
+ struct iattr *iattr)
+{
+ struct inode *inode = d_inode(dentry);
+ unsigned int ia_valid = iattr->ia_valid;
+ int ret;
+
+ filemap_invalidate_lock(inode->i_mapping);
+
+ /* Allowing size to change only once avoids the need to unmap here. */
+ if ((ia_valid & ATTR_SIZE) && inode->i_size)
+ ret = -EINVAL;
+ else
+ ret = simple_setattr(idmap, dentry, iattr);
+
+ filemap_invalidate_unlock(inode->i_mapping);
+
+ return ret;
+}
+
+static const struct inode_operations tripwire_iops = {
+ .setattr = tripwire_setattr,
+};
+
+SYSCALL_DEFINE1(memfd_tripwire, unsigned int, flags)
+{
+ struct tripwire_info *info;
+ struct inode *inode;
+ struct file *file;
+
+ if (flags != 0)
+ return -EINVAL;
+
+ info = kzalloc_obj(struct tripwire_info);
+ if (!info)
+ return -ENOMEM;
+
+ init_waitqueue_head(&info->wqh);
+
+ file = anon_inode_create_getfile("memfd_tripwire", &tripwire_fops, info,
+ O_RDWR, NULL);
+ if (IS_ERR(file)) {
+ kfree(info);
+ return PTR_ERR(file);
+ }
+
+ inode = file_inode(file);
+ inode->i_op = &tripwire_iops;
+ inode->i_mapping->a_ops = &tripwire_aops;
+ inode->i_mode |= S_IFREG;
+ inode->i_size = 0;
+
+ return FD_ADD(O_CLOEXEC, file);
+}
diff --git a/scripts/syscall.tbl b/scripts/syscall.tbl
index 7a42b32b65776..a2bd8cd809ee8 100644
--- a/scripts/syscall.tbl
+++ b/scripts/syscall.tbl
@@ -412,3 +412,4 @@
469 common file_setattr sys_file_setattr
470 common listns sys_listns
471 common rseq_slice_yield sys_rseq_slice_yield
+472 common memfd_tripwire sys_memfd_tripwire
--
2.52.0
* [RFC PATCH 2/2] selftests: `memfd_tripwire` selftest
2026-06-03 12:55 [RFC PATCH 0/2] mm: memfd with write notifications Mattias Nissler
2026-06-03 12:55 ` [RFC PATCH 1/2] mm: `memfd_tripwire` proof-of-concept Mattias Nissler
@ 2026-06-03 12:55 ` Mattias Nissler
2026-06-11 1:36 ` [RFC PATCH 0/2] mm: memfd with write notifications Baolin Wang
2 siblings, 0 replies; 7+ messages in thread
From: Mattias Nissler @ 2026-06-03 12:55 UTC
To: linux-mm; +Cc: Hugh Dickins, Baolin Wang, mnissler, mattias.nissler
Add test cases for `memfd_tripwire`, exercising basic semantics as well
as verifying behavior under stress.
Signed-off-by: Mattias Nissler <mnissler@meta.com>
Assisted-by: Claude:claude-opus-4-6
---
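Not part of the patch: the consumer-side pattern these tests exercise
(poll, ACK, then inspect) can be sketched as a small helper. This is only
an illustration under stated assumptions: `tripwire_consume()`, its
`inspect` callback and `count_updates()` are hypothetical names, and the
ioctl number is copied from the uapi header in patch 1/2 because no
released kernel headers define it.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <sys/ioctl.h>

/* Copied from include/uapi/linux/memfd_tripwire.h in patch 1/2. */
#ifndef MEMFD_TRIPWIRE_ACK
#define MEMFD_TRIPWIRE_ACK _IO(0xDA, 0x00)
#endif

/* Example inspect callback: just counts how often it was invoked. */
void count_updates(void *cookie)
{
	unsigned int *count = cookie;

	(*count)++;
}

/*
 * Wait up to @timeout_ms for a write notification on @fd, re-arm, then
 * hand control to @inspect. Returns 1 if a notification was handled,
 * 0 on timeout and -1 on error.
 */
int tripwire_consume(int fd, int timeout_ms,
		     void (*inspect)(void *), void *cookie)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret;

	assert(inspect != NULL);

	ret = poll(&pfd, 1, timeout_ms);
	if (ret <= 0)
		return ret;	/* 0: timeout, -1: poll() failed */
	if (!(pfd.revents & POLLIN))
		return -1;	/* POLLERR, POLLNVAL and the like */

	/*
	 * ACK before looking at the memory: anything written up to this
	 * point is then visible, and any later write raises POLLIN again.
	 */
	if (ioctl(fd, MEMFD_TRIPWIRE_ACK, 0) < 0)
		return -1;

	inspect(cookie);
	return 1;
}
```

A caller would typically invoke this in a loop with the fd returned by
`memfd_tripwire()`. The ACK is deliberately issued before `inspect()`,
mirroring the order used by the visibility stress test.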
tools/testing/selftests/mm/Makefile | 1 +
tools/testing/selftests/mm/memfd_tripwire.c | 695 ++++++++++++++++++++
2 files changed, 696 insertions(+)
create mode 100644 tools/testing/selftests/mm/memfd_tripwire.c
diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index cd24596cdd27e..f87839733df5a 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -106,6 +106,7 @@ TEST_GEN_FILES += guard-regions
TEST_GEN_FILES += merge
TEST_GEN_FILES += rmap
TEST_GEN_FILES += folio_split_race_test
+TEST_GEN_FILES += memfd_tripwire
ifneq ($(ARCH),arm64)
TEST_GEN_FILES += soft-dirty
diff --git a/tools/testing/selftests/mm/memfd_tripwire.c b/tools/testing/selftests/mm/memfd_tripwire.c
new file mode 100644
index 0000000000000..7229b387ffcf4
--- /dev/null
+++ b/tools/testing/selftests/mm/memfd_tripwire.c
@@ -0,0 +1,695 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * memfd_tripwire selftest
+ *
+ * Tests for the memfd_tripwire() syscall which creates a shared memory region
+ * with write notification support via poll() and ioctl-based re-arming.
+ */
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <poll.h>
+#include <pthread.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdatomic.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/syscall.h>
+#include <sys/wait.h>
+#include <time.h>
+#include <unistd.h>
+
+#include <linux/memfd_tripwire.h>
+
+#include "kselftest.h"
+
+#ifdef __NR_memfd_tripwire
+
+static int sys_memfd_tripwire(unsigned int flags)
+{
+ return syscall(__NR_memfd_tripwire, flags);
+}
+
+/*
+ * Poll the fd for POLLIN with the given timeout in milliseconds.
+ * Returns 1 if POLLIN is set, 0 on timeout, -1 on error.
+ */
+static int poll_check(int fd, int timeout_ms)
+{
+ struct pollfd pfd = { .fd = fd, .events = POLLIN };
+ int ret;
+
+ ret = poll(&pfd, 1, timeout_ms);
+ if (ret < 0)
+ return -1;
+ if (ret == 0)
+ return 0;
+ return (pfd.revents & POLLIN) ? 1 : 0;
+}
+
+static int tripwire_ack(int fd)
+{
+ return ioctl(fd, MEMFD_TRIPWIRE_ACK, 0);
+}
+
+static long page_size;
+
+static int setup_tripwire(size_t size, int *fd, char **mem)
+{
+ *fd = sys_memfd_tripwire(0);
+ if (*fd < 0) {
+ ksft_test_result_fail("memfd_tripwire: %m\n");
+ return -1;
+ }
+
+ if (ftruncate(*fd, size) < 0) {
+ ksft_test_result_fail("ftruncate: %m\n");
+ goto out;
+ }
+
+ *mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, *fd, 0);
+ if (*mem == MAP_FAILED) {
+ *mem = NULL;
+ ksft_test_result_fail("mmap: %m\n");
+ goto out;
+ }
+
+ return 0;
+
+out:
+ close(*fd);
+ return -1;
+}
+
+static double time_elapsed(struct timespec *start)
+{
+ struct timespec now;
+
+ clock_gettime(CLOCK_MONOTONIC, &now);
+ return (now.tv_sec - start->tv_sec) +
+ (now.tv_nsec - start->tv_nsec) / 1e9;
+}
+
+/*
+ * Test 1: Sequential single-process semantics.
+ *
+ * Verifies the complete life cycle:
+ * create -> poll (clean) -> write -> poll (dirty) -> ACK -> poll (clean)
+ * -> write again -> poll (dirty)
+ */
+static void test_sequential(void)
+{
+ int fd;
+ char *mem;
+
+ if (setup_tripwire(page_size, &fd, &mem) != 0)
+ return;
+
+ /* Freshly created: should not have POLLIN */
+ if (poll_check(fd, 0) != 0) {
+ ksft_test_result_fail("POLLIN set on clean fd\n");
+ goto out;
+ }
+
+ /* Write to mapped memory */
+ mem[0] = 'A';
+
+ /* Should now report POLLIN */
+ if (poll_check(fd, 100) != 1) {
+ ksft_test_result_fail("POLLIN not set after write\n");
+ goto out;
+ }
+
+ /* Polling again without ACK should still show POLLIN */
+ if (poll_check(fd, 0) != 1) {
+ ksft_test_result_fail("POLLIN lost without ACK\n");
+ goto out;
+ }
+
+ /* ACK re-arms the tripwire */
+ if (tripwire_ack(fd) < 0) {
+ ksft_test_result_fail("ioctl ACK: %m\n");
+ goto out;
+ }
+
+ /* After ACK: should be clean again */
+ if (poll_check(fd, 0) != 0) {
+ ksft_test_result_fail("POLLIN set after ACK\n");
+ goto out;
+ }
+
+ /* Write again: must re-trigger */
+ mem[0] = 'B';
+
+ if (poll_check(fd, 100) != 1) {
+ ksft_test_result_fail("POLLIN not set after second write\n");
+ goto out;
+ }
+
+ ksft_test_result_pass("sequential\n");
+out:
+ munmap((void *)mem, page_size);
+ close(fd);
+}
+
+/*
+ * Test 2: Multi-page, writes to different pages trigger after re-arm.
+ */
+static void test_multi_page(void)
+{
+ int fd;
+ char *mem;
+ size_t size = page_size * 4;
+
+ if (setup_tripwire(size, &fd, &mem) != 0)
+ return;
+
+ /* Write page 0, observe, ACK */
+ mem[0] = 'A';
+ if (poll_check(fd, 100) != 1) {
+ ksft_test_result_fail("POLLIN not set after page 0 write\n");
+ goto out;
+ }
+ tripwire_ack(fd);
+
+ /* Write page 2: A different page must also trigger */
+ mem[page_size * 2] = 'B';
+ if (poll_check(fd, 100) != 1) {
+ ksft_test_result_fail("POLLIN not set after page 2 write\n");
+ goto out;
+ }
+ tripwire_ack(fd);
+
+ /* Write page 3 */
+ mem[page_size * 3] = 'C';
+ if (poll_check(fd, 100) != 1) {
+ ksft_test_result_fail("POLLIN not set after page 3 write\n");
+ goto out;
+ }
+
+ ksft_test_result_pass("multi_page\n");
+out:
+ munmap((void *)mem, size);
+ close(fd);
+}
+
+/*
+ * Test 3: Cross-process producer/consumer.
+ *
+ * Parent is the consumer (polls for events), child is the producer (writes
+ * to the shared mapping). Verifies that writes in the child process are
+ * visible to the parent via poll().
+ */
+static void test_cross_process(void)
+{
+ int fd, status;
+ char *mem;
+ pid_t pid;
+
+ if (setup_tripwire(page_size, &fd, &mem) != 0)
+ return;
+
+ pid = fork();
+ if (pid < 0) {
+ ksft_test_result_fail("fork: %m\n");
+ goto out;
+ }
+
+ if (pid == 0) {
+ /*
+ * Child: wait a bit then write. The parent should be blocked
+ * in poll() and wake up when we write.
+ */
+ usleep(50000);
+ mem[0] = 'P';
+ _exit(0);
+ }
+
+ /* Parent: wait for the child's write */
+ if (poll_check(fd, 2000) != 1) {
+ ksft_test_result_fail("POLLIN not set after child write\n");
+ kill(pid, SIGKILL);
+ waitpid(pid, &status, 0);
+ goto out;
+ }
+
+ waitpid(pid, &status, 0);
+
+ if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) {
+ ksft_test_result_fail("child exited abnormally\n");
+ goto out;
+ }
+
+ /* Verify data written by child */
+ if (mem[0] != 'P') {
+ ksft_test_result_fail("data mismatch: got 0x%02x\n", mem[0]);
+ goto out;
+ }
+
+ ksft_test_result_pass("cross_process\n");
+out:
+ munmap((void *)mem, page_size);
+ close(fd);
+}
+
+/*
+ * Test 4: Multi-round cross-process producer/consumer.
+ *
+ * Uses a pipe to synchronize rounds between parent (consumer) and child
+ * (producer). Each round: child writes, parent detects via poll, parent
+ * ACKs, repeat. Verifies that the full cycle works reliably across
+ * process boundaries.
+ */
+static void test_cross_process_multi_round(void)
+{
+ int fd, status;
+ int pipe_to_child[2], pipe_to_parent[2];
+ char *mem;
+ pid_t pid;
+ int rounds = 1000;
+ char sync;
+
+ if (setup_tripwire(page_size, &fd, &mem) != 0)
+ return;
+
+ if (pipe(pipe_to_child) < 0 || pipe(pipe_to_parent) < 0) {
+ ksft_test_result_fail("pipe: %m\n");
+ goto out;
+ }
+
+ pid = fork();
+ if (pid < 0) {
+ ksft_test_result_fail("fork: %m\n");
+ goto out;
+ }
+
+ if (pid == 0) {
+ /* Child: producer */
+ close(pipe_to_child[1]);
+ close(pipe_to_parent[0]);
+
+ for (int i = 0; i < rounds; i++) {
+ /* Wait for parent to signal us to write */
+ if (read(pipe_to_child[0], &sync, 1) != 1)
+ _exit(1);
+ mem[0] = 'A' + i;
+ /* Tell parent we wrote */
+ if (write(pipe_to_parent[1], &sync, 1) != 1)
+ _exit(1);
+ }
+ _exit(0);
+ }
+
+ /* Parent: consumer */
+ close(pipe_to_child[0]);
+ close(pipe_to_parent[1]);
+
+ for (int i = 0; i < rounds; i++) {
+ /* Tell child to write */
+ sync = 'G';
+ if (write(pipe_to_child[1], &sync, 1) != 1) {
+ ksft_test_result_fail("write to pipe: %m\n");
+ goto reap;
+ }
+
+ /* Wait for child to confirm write */
+ if (read(pipe_to_parent[0], &sync, 1) != 1) {
+ ksft_test_result_fail("read from pipe: %m\n");
+ goto reap;
+ }
+
+ /* Detect the write */
+ if (poll_check(fd, 1000) != 1) {
+ ksft_test_result_fail("round %d: POLLIN not set\n", i);
+ goto reap;
+ }
+
+ /* Verify data */
+ if (mem[0] != (char)('A' + i)) {
+ ksft_test_result_fail("round %d: data 0x%02x\n", i,
+ mem[0]);
+ goto reap;
+ }
+
+ /* Re-arm for next round */
+ if (tripwire_ack(fd) < 0) {
+ ksft_test_result_fail("round %d: ACK: %m\n", i);
+ goto reap;
+ }
+
+ /* Confirm clean after ACK */
+ if (poll_check(fd, 0) != 0) {
+ ksft_test_result_fail("round %d: POLLIN after ACK\n",
+ i);
+ goto reap;
+ }
+ }
+
+ close(pipe_to_child[1]);
+ close(pipe_to_parent[0]);
+
+ waitpid(pid, &status, 0);
+ if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) {
+ ksft_test_result_fail("child exited abnormally\n");
+ goto out;
+ }
+
+ ksft_test_result_pass("cross_process_multi_round\n");
+ munmap((void *)mem, page_size);
+ close(fd);
+ return;
+
+reap:
+ close(pipe_to_child[1]);
+ close(pipe_to_parent[0]);
+ kill(pid, SIGKILL);
+ waitpid(pid, &status, 0);
+out:
+ munmap((void *)mem, page_size);
+ close(fd);
+}
+
+/*
+ * Test 5: Producer/consumer race stress test.
+ *
+ * A producer thread writes continuously while the consumer thread polls and
+ * ACKs. The invariant is: we must never "miss" an event. Specifically, after
+ * the consumer ACKs and the producer writes at least once more, the consumer
+ * must eventually see POLLIN again.
+ *
+ * We use an atomic generation counter to track this. The producer increments
+ * the generation and writes to the mapping. The consumer records the
+ * generation at ACK time and verifies that when it next sees POLLIN, the
+ * generation has advanced.
+ */
+static atomic_int stress_gen;
+static atomic_bool stress_stop;
+
+static void *stress_producer(void *arg)
+{
+ char *mem = arg;
+
+ while (!atomic_load(&stress_stop)) {
+ atomic_fetch_add(&stress_gen, 1);
+ mem[0]++;
+ /* Intentionally no explicit delay to maximize contention */
+ }
+
+ return NULL;
+}
+
+static void test_stress_no_lost_events(void)
+{
+ int fd;
+ char *mem;
+ pthread_t producer;
+ int ack_count = 0;
+ int gen_at_ack, cur_gen;
+ struct timespec start;
+
+ if (setup_tripwire(page_size, &fd, &mem) != 0)
+ return;
+
+ atomic_store(&stress_gen, 0);
+ atomic_store(&stress_stop, false);
+
+ if (pthread_create(&producer, NULL, stress_producer, (void *)mem)) {
+ ksft_test_result_fail("pthread_create failed\n");
+ goto out;
+ }
+
+ /*
+ * Consumer loop: run for ~2 seconds. In each iteration:
+ * 1. Wait for POLLIN (the producer is writing continuously)
+ * 2. Record the current generation
+ * 3. ACK (re-arm)
+ * 4. The producer is still writing, so POLLIN must come back
+ */
+ clock_gettime(CLOCK_MONOTONIC, &start);
+ while (time_elapsed(&start) <= 2.0) {
+ /*
+ * Wait for a write notification. The producer is looping, so
+ * the timeout should be plenty.
+ */
+ if (poll_check(fd, 1000) != 1) {
+ ksft_test_result_fail(
+ "POLLIN not set (ack_count=%d, gen=%d)\n",
+ ack_count, atomic_load(&stress_gen));
+ atomic_store(&stress_stop, true);
+ pthread_join(producer, NULL);
+ goto out;
+ }
+
+ /* Record generation and ACK */
+ gen_at_ack = atomic_load(&stress_gen);
+ if (tripwire_ack(fd) < 0) {
+ ksft_test_result_fail("ACK failed: %m\n");
+ atomic_store(&stress_stop, true);
+ pthread_join(producer, NULL);
+ goto out;
+ }
+ ack_count++;
+
+ /*
+ * The producer is running concurrently. After ACK, it will
+ * write again and increment the generation. We don't need to
+ * check anything here because the next poll() iteration will
+ * verify that the write was detected.
+ *
+ * Spin briefly to let the producer get ahead, exercising the
+ * race between ACK (write-protect) and producer (write).
+ */
+ cur_gen = atomic_load(&stress_gen);
+ while (cur_gen == gen_at_ack) {
+ sched_yield();
+ cur_gen = atomic_load(&stress_gen);
+ }
+ }
+
+ atomic_store(&stress_stop, true);
+ pthread_join(producer, NULL);
+
+ if (ack_count < 10) {
+ ksft_test_result_fail("too few ACK cycles: %d\n", ack_count);
+ goto out;
+ }
+
+ ksft_print_msg("stress: %d ACK cycles, final gen=%d\n", ack_count,
+ atomic_load(&stress_gen));
+ ksft_test_result_pass("stress_no_lost_events\n");
+out:
+ munmap((void *)mem, page_size);
+ close(fd);
+}
+
+/*
+ * Test 6: Randomized multi-thread visibility stress test.
+ *
+ * N_VIS_WRITERS writer threads and N_VIS_READERS reader threads operate
+ * concurrently on a shared memfd_tripwire mapping for VIS_DURATION_S seconds.
+ * Each writer owns a uint32_t slot on its own page and writes monotonically
+ * increasing values, interspersed with random sleeps. Each reader randomly
+ * polls or sleeps; on POLLIN it snapshots committed values, ACKs, then reads
+ * the mapped memory and validates visibility.
+ *
+ * The invariant under test: after a reader calls ACK, all writes that were
+ * committed (writer_committed[i] updated with store-release) before the
+ * reader's snapshot (loaded with load-acquire) must be visible when reading
+ * the mapped memory. The ACK's TLB-flush IPI provides the cross-CPU memory
+ * barrier that makes this guarantee hold.
+ */
+
+#define N_VIS_WRITERS 4
+#define N_VIS_READERS 2
+#define VIS_DURATION_S 2
+#define VIS_WRITER_SLEEP_PCT 20
+#define VIS_READER_SLEEP_PCT 30
+#define VIS_MAX_WRITER_SLEEP_US 1000
+#define VIS_MAX_READER_SLEEP_US 3000
+
+struct vis_ctx {
+ int fd;
+ char *mem;
+ atomic_uint writer_committed[N_VIS_WRITERS];
+ atomic_bool stop;
+ atomic_bool failed;
+ char failure_msg[256];
+};
+
+struct vis_thread_arg {
+ struct vis_ctx *ctx;
+ int id;
+};
+
+static void *vis_writer_fn(void *arg)
+{
+ struct vis_thread_arg *ta = arg;
+ struct vis_ctx *ctx = ta->ctx;
+ int id = ta->id;
+ uint32_t *slot = (uint32_t *)(ctx->mem + id * page_size);
+ uint32_t value = 0;
+ unsigned int seed = (unsigned int)(id * 7919 + 1);
+
+ while (!atomic_load_explicit(&ctx->stop, memory_order_relaxed)) {
+ if (rand_r(&seed) % 100 < VIS_WRITER_SLEEP_PCT) {
+ usleep(rand_r(&seed) % VIS_MAX_WRITER_SLEEP_US);
+ continue;
+ }
+
+ value++;
+ *slot = value;
+ atomic_store_explicit(&ctx->writer_committed[id], value,
+ memory_order_release);
+ }
+
+ return NULL;
+}
+
+static void *vis_reader_fn(void *arg)
+{
+ struct vis_thread_arg *ta = arg;
+ struct vis_ctx *ctx = ta->ctx;
+ uint32_t snapshot[N_VIS_WRITERS];
+ unsigned int seed = (unsigned int)(ta->id * 6971 + 42);
+ int i;
+
+ while (!atomic_load_explicit(&ctx->stop, memory_order_relaxed)) {
+ if (atomic_load_explicit(&ctx->failed, memory_order_relaxed))
+ break;
+
+ if (rand_r(&seed) % 100 < VIS_READER_SLEEP_PCT) {
+ usleep(rand_r(&seed) % VIS_MAX_READER_SLEEP_US);
+ continue;
+ }
+
+ if (poll_check(ctx->fd, 100) != 1)
+ continue;
+
+ for (i = 0; i < N_VIS_WRITERS; i++)
+ snapshot[i] =
+ atomic_load_explicit(&ctx->writer_committed[i],
+ memory_order_acquire);
+
+ tripwire_ack(ctx->fd);
+
+ for (i = 0; i < N_VIS_WRITERS; i++) {
+ uint32_t *slot = (uint32_t *)(ctx->mem + i * page_size);
+ uint32_t observed = *slot;
+
+ if (observed < snapshot[i]) {
+ snprintf(
+ ctx->failure_msg,
+ sizeof(ctx->failure_msg),
+ "writer %d: observed %u < committed %u",
+ i, observed, snapshot[i]);
+ atomic_store(&ctx->failed, true);
+ atomic_store(&ctx->stop, true);
+ return NULL;
+ }
+ }
+ }
+
+ return NULL;
+}
+
+static void test_stress_visibility(void)
+{
+ struct vis_ctx ctx;
+ struct vis_thread_arg wargs[N_VIS_WRITERS];
+ struct vis_thread_arg rargs[N_VIS_READERS];
+ pthread_t writers[N_VIS_WRITERS];
+ pthread_t readers[N_VIS_READERS];
+ size_t map_size = N_VIS_WRITERS * page_size;
+ struct timespec start;
+ int i, nw = 0, nr = 0;
+
+ if (setup_tripwire(map_size, &ctx.fd, &ctx.mem) != 0)
+ return;
+
+ atomic_store(&ctx.stop, false);
+ atomic_store(&ctx.failed, false);
+ ctx.failure_msg[0] = '\0';
+ for (i = 0; i < N_VIS_WRITERS; i++)
+ atomic_store(&ctx.writer_committed[i], 0);
+
+ for (i = 0; i < N_VIS_WRITERS; i++) {
+ wargs[i].ctx = &ctx;
+ wargs[i].id = i;
+ if (pthread_create(&writers[i], NULL, vis_writer_fn, &wargs[i]))
+ goto stop;
+ nw++;
+ }
+
+ for (i = 0; i < N_VIS_READERS; i++) {
+ rargs[i].ctx = &ctx;
+ rargs[i].id = i;
+ if (pthread_create(&readers[i], NULL, vis_reader_fn, &rargs[i]))
+ goto stop;
+ nr++;
+ }
+
+ clock_gettime(CLOCK_MONOTONIC, &start);
+ do {
+ usleep(100000);
+ } while (time_elapsed(&start) <= VIS_DURATION_S &&
+ !atomic_load(&ctx.failed));
+
+stop:
+ atomic_store(&ctx.stop, true);
+ for (i = 0; i < nr; i++)
+ pthread_join(readers[i], NULL);
+ for (i = 0; i < nw; i++)
+ pthread_join(writers[i], NULL);
+
+ if (atomic_load(&ctx.failed)) {
+ ksft_test_result_fail("visibility: %s\n", ctx.failure_msg);
+ } else {
+ for (i = 0; i < N_VIS_WRITERS; i++) {
+ ksft_print_msg("visibility: writer %d committed %u\n",
+ i,
+ atomic_load(&ctx.writer_committed[i]));
+ }
+ ksft_test_result_pass("stress_visibility\n");
+ }
+
+ munmap((void *)ctx.mem, map_size);
+ close(ctx.fd);
+}
+
+#define NUM_TESTS 6
+
+int main(void)
+{
+ page_size = sysconf(_SC_PAGE_SIZE);
+ if (page_size <= 0)
+ ksft_exit_fail_msg("Failed to get page size: %m\n");
+
+ ksft_print_header();
+ ksft_set_plan(NUM_TESTS);
+
+ test_sequential();
+ test_multi_page();
+ test_cross_process();
+ test_cross_process_multi_round();
+ test_stress_no_lost_events();
+ test_stress_visibility();
+
+ ksft_finished();
+ return 0;
+}
+
+#else /* __NR_memfd_tripwire */
+
+int main(int argc, char *argv[])
+{
+ printf("skip: skipping memfd_tripwire test (missing __NR_memfd_tripwire)\n");
+ return KSFT_SKIP;
+}
+
+#endif /* __NR_memfd_tripwire */
--
2.52.0
* Re: [RFC PATCH 0/2] mm: memfd with write notifications
2026-06-03 12:55 [RFC PATCH 0/2] mm: memfd with write notifications Mattias Nissler
2026-06-03 12:55 ` [RFC PATCH 1/2] mm: `memfd_tripwire` proof-of-concept Mattias Nissler
2026-06-03 12:55 ` [RFC PATCH 2/2] selftests: `memfd_tripwire` selftest Mattias Nissler
@ 2026-06-11 1:36 ` Baolin Wang
2026-06-11 12:40 ` Mattias Nissler
[not found] ` <40381f8a-47e3-4f97-a9ad-f6f868fe0392@kernel.org>
2 siblings, 2 replies; 7+ messages in thread
From: Baolin Wang @ 2026-06-11 1:36 UTC
To: Mattias Nissler, linux-mm
Cc: Hugh Dickins, mattias.nissler,
david@kernel.org >> David Hildenbrand (Red Hat),
Lorenzo Stoakes (Oracle), Liam R. Howlett, Vlastimil Babka,
Mike Rapoport, Suren Baghdasaryan, Michal Hocko
Add more MM core maintainers.
On 6/3/26 8:55 PM, Mattias Nissler wrote:
> I want to propose a kernel facility to have user space create memory
> regions that can generate notifications on write access. This is useful
> as a cross-process communication mechanism, where a producer writes to
> the memory, and a consumer polls for new data to be available.
>
> I'm including a minimalistic proof-of-concept implementation meant as a
> vehicle to demonstrate the idea and clarify semantics. It works by
> mapping the memory region read-only, so write accesses will generate
> page faults. The `page_mkwrite` handler can thus trigger notifications.
> It also allows the mappings to become writable temporarily until the
> mechanism gets rearmed.
>
> Intended usage looks as follows (cf. selftest code included):
> 1. Call `memfd_tripwire` to create an instance, `ftruncate` to
> configure its size.
> 2. Pass the file descriptor for `mmap()`ing to the producer and
> consumer (potentially across process boundaries).
> 3a. The producer writes to the memory region whenever there is new
> data to publish to the consumer.
> 3b. The consumer runs a poll loop:
> * Wait for `POLLIN` event.
> * `ioctl(MEMFD_TRIPWIRE_ACK)` to (1) re-arm for subsequent
> notifications and (2) make sure prior writes are visible.
> * Examine the memory region to collect and process the data
> provided by the producer.
>
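To make the consumer side of step 3b concrete, here is a rough sketch of
one iteration of the poll loop. The argument list of
`ioctl(MEMFD_TRIPWIRE_ACK)` is not shown in this mail, so the ACK is
passed in as a callback. `wait_and_ack`, `ack_fn` and `ack_pipe_byte`
are hypothetical names; the pipe-based ACK is only a stand-in so that
the loop can be tried on a kernel without the patch.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical: stands in for ioctl(fd, MEMFD_TRIPWIRE_ACK, ...). */
typedef int (*ack_fn)(int fd);

/*
 * Wait for one POLLIN notification and acknowledge it.
 * Returns 1 if notified and ACKed, 0 on timeout, -1 on error.
 * The caller inspects the memory region after a return of 1.
 */
static int wait_and_ack(int fd, ack_fn ack, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret;

	assert(ack);
	do {
		ret = poll(&pfd, 1, timeout_ms);
	} while (ret < 0 && errno == EINTR);

	if (ret < 0)
		return -1;
	if (ret == 0)
		return 0;		/* timed out, nothing to ACK */
	if (!(pfd.revents & POLLIN))
		return -1;		/* POLLERR, POLLHUP or similar */

	/* ACK before reading the data, as the proposed ordering requires. */
	return ack(fd) == 0 ? 1 : -1;
}

/* Stand-in ACK for a pipe: consume one byte to clear readiness. */
static int ack_pipe_byte(int fd)
{
	char c;

	return read(fd, &c, 1) == 1 ? 0 : -1;
}
```

With the proposed interface, the callback would wrap the ACK ioctl and
the caller would scan the mapped region after each return value of 1.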
> The intention is to guarantee that no change to the memory region can
> slip through undetected by the consumer. The ACK semantics are
> instrumental in achieving that. When the ACK returns, we assure the
> consumer that (possibly concurrent) writes to the region are either
> visible, or if not they will trigger a subsequent `POLLIN` event. Note
> that there is no guarantee that each write will generate an individual
> event. Neither are any details on the triggering write provided
> (address, value). The consumer is expected to inspect the memory to find
> out what has changed. The details of this depend on the communication
> protocol that the producer and consumer follow.
>
> A direct consequence of the above semantics is that for a write to be
> detectable by the consumer, the write must change data in the memory
> region. Re-writing an already-present value would generally not be
> detectable. Sequences where a location is written briefly with a changed
> value and then restored to the previous value can't be reliably detected
> by the consumer either. I'm calling this out as a noteworthy restriction
> that will prevent certain usage patterns.
>
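One way a protocol could live with this restriction is to bump a
sequence counter on every publish, so that each publish changes memory
even when the payload repeats. The sketch below is an assumption about
how that might look, not something taken from the patch. The `mailbox`
layout and both helpers are hypothetical, and it assumes a single
producer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout of the shared region; not part of the patch. */
struct mailbox {
	_Atomic uint32_t seq;	/* bumped on every publish */
	_Atomic uint32_t value;	/* payload, may legitimately repeat */
};

/* Producer (single writer assumed): store payload, then publish it. */
static void mailbox_publish(struct mailbox *mb, uint32_t value)
{
	uint32_t next;

	assert(mb);
	next = atomic_load_explicit(&mb->seq, memory_order_relaxed) + 1;
	atomic_store_explicit(&mb->value, value, memory_order_relaxed);
	/* Release: the payload store is visible before the new seq. */
	atomic_store_explicit(&mb->seq, next, memory_order_release);
}

/*
 * Consumer: returns how many publishes happened since *last_seen.
 * If any did, copies the latest payload and updates *last_seen.
 * A return of 0 means the wakeup carried no new data.
 */
static uint32_t mailbox_poll(struct mailbox *mb, uint32_t *last_seen,
			     uint32_t *value)
{
	uint32_t seq, missed;

	assert(mb && last_seen && value);
	seq = atomic_load_explicit(&mb->seq, memory_order_acquire);
	missed = seq - *last_seen;	/* wraps correctly on overflow */
	if (missed) {
		*value = atomic_load_explicit(&mb->value,
					      memory_order_relaxed);
		*last_seen = seq;
	}
	return missed;
}
```

A return of 0 from `mailbox_poll` also covers the case where the
consumer is woken before the producer's write has landed: it simply
finds nothing new and goes back to waiting.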
> I also want to call attention to an inherent race condition. Page faults
> signal that a write is being attempted, but obviously they fire before
> the actual write happens. Thus, write notifications are generated before
> the write, and race with the write actually taking place. If the
> consumer wins the race and gets their ACK through before the write
> manages to land, the consumer could see unchanged memory, and the
> producer would fault again. This is perfectly compatible with the
> desired semantics, but creates the risk of a livelock situation. In
> practice, it is hopefully exceedingly unlikely for the consumer to win
> the race consistently so that this can be ignored.
>
> Switching to motivation: Producer / consumer communication can be
> implemented in many different ways. Pairing shared memory with a
> notification mechanism such as `eventfd` gets the job done nicely, but
> requires the producer to operate an `eventfd`. This can be undesirable
> or impractical for cases where the producer is designed to talk to a
> hardware interface in which a register write also conveys a "doorbell"
> to trigger hardware processing. The proposed mechanism allows software
> implementations of the consumer side that are functionally equivalent
> to the popular ring buffer + doorbell consumer implementation in
> hardware. This is particularly useful in contexts where hardware is
> simulated, for example in VFIO-user server implementations. The tripwire
> mechanism isn't restricted to that use case though and is generic enough
> to be useful for other purposes as well.
From your earlier requirements description, my first thought was also
the shmem + eventfd approach. It seems straightforward and avoids
introducing a heavy new system call to reimplement functionality that
already exists.
Moreover, I haven’t fully understood what hardware constraints you’re
referring to here.
Anyway, let’s also see what others think.
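For reference, the shmem + eventfd baseline mentioned above can be
sketched roughly as follows. This is only an illustration using existing
interfaces (`memfd_create()`, `eventfd()`, `mmap()`). The `channel`
struct and its helpers are hypothetical names, and error paths skip
cleanup for brevity. Note that the producer has to ring the doorbell
explicitly, which is the step the tripwire proposal wants to avoid.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/eventfd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper type; only the libc calls are real interfaces. */
struct channel {
	int memfd;	/* shared memory holding the data */
	int efd;	/* explicit doorbell */
	char *mem;
	size_t size;
};

static int channel_open(struct channel *ch, size_t size)
{
	ch->size = size;
	ch->memfd = memfd_create("chan", MFD_CLOEXEC);
	if (ch->memfd < 0)
		return -1;
	if (ftruncate(ch->memfd, (off_t)size) < 0)
		return -1;
	ch->mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
		       ch->memfd, 0);
	if (ch->mem == MAP_FAILED)
		return -1;
	ch->efd = eventfd(0, EFD_CLOEXEC);
	return ch->efd < 0 ? -1 : 0;
}

/* Producer: store the data, then ring the doorbell explicitly. */
static int channel_publish(struct channel *ch, const char *msg)
{
	uint64_t one = 1;
	size_t len = strlen(msg) + 1;

	assert(len <= ch->size);
	memcpy(ch->mem, msg, len);
	return write(ch->efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ?
		0 : -1;
}

/*
 * Consumer: block until the doorbell has been rung. Returns the number
 * of rings coalesced into this wakeup, or 0 on error. The read() also
 * resets the eventfd counter, so it doubles as the acknowledgement.
 */
static uint64_t channel_wait(struct channel *ch)
{
	uint64_t count = 0;

	if (read(ch->efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return 0;
	return count;
}
```

Both file descriptors can be passed to another process over a Unix
socket, in which case producer and consumer each `mmap()` the memfd
themselves.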
> In terms of related technology, there is some overlap with both
> `userfaultfd` and KVM's `ioeventfd`. `userfaultfd` has a write protect
> mode that will generate notifications upon write access. This can be
> used to construct a similar notification mechanism to the one proposed.
> However, the producer will be blocked until the consumer resolves the
> fault and provides a writable page. The consumer will then have to give
> the producer time to carry out the write before re-arming write
> protection. Furthermore, `userfaultfd` is scoped by design to a process
> / `mm`, whereas the proposed tripwire mechanism makes the fault handling
> and notification mechanism a property of the memory region represented
> by the file descriptor. The latter is simpler to integrate with existing
> IPC protocols that already exchange file descriptors (such as
> VFIO-user), avoiding the need for additional setup code in the producer
> to instantiate a `userfaultfd` and pass it to the consumer. Also, the
> tripwire mechanism doesn't make an attempt at providing a generic fault
> handling framework, which sidesteps the access control complications of
> `userfaultfd` when it comes to handling faults generated in kernel
> context. In contrast to `ioeventfd`, `memfd_tripwire` does provide
> regular memory semantics, whereas `ioeventfd` discards written values.
>
> There are also a number of design choices / alternatives that warrant
> consideration. For simplicity, the proof-of-concept implementation
> generates `POLLIN` events on the file descriptor representing the memory
> region. That's somewhat unconventional given that `POLLIN` is originally
> meant to indicate a file descriptor's readiness to be read, which is
> always the case for memory-backed files. An alternative design might
> associate a separate `eventfd` at setup time to deliver events to. If we
> were to deliver events via an `eventfd`, we could possibly also fold the
> ACK operation into the `read()` that clears the `eventfd`, which might
> be considered a cleaner API.
>
> Finally, I acknowledge that the proof-of-concept implementation has a
> number of gaps that would need to be filled in for a production
> implementation. This includes read and write `file_operations`, support
> for sealing and THP, etc. These features are already implemented in
> `shmem`, so a production version would likely make most sense as a new
> `shmem` feature. I wanted to get high-level feedback on the concept
> before starting work on that though.
>
> Mattias Nissler (2):
> mm: `memfd_tripwire` proof-of-concept
> selftests: `memfd_tripwire` selftest
>
> arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> include/linux/syscalls.h | 1 +
> include/uapi/asm-generic/unistd.h | 5 +-
> include/uapi/linux/memfd_tripwire.h | 19 +
> kernel/sys_ni.c | 3 +
> mm/Kconfig | 9 +
> mm/Makefile | 1 +
> mm/memfd_tripwire.c | 246 +++++++
> scripts/syscall.tbl | 1 +
> tools/testing/selftests/mm/Makefile | 1 +
> tools/testing/selftests/mm/memfd_tripwire.c | 695 ++++++++++++++++++++
> 11 files changed, 981 insertions(+), 1 deletion(-)
> create mode 100644 include/uapi/linux/memfd_tripwire.h
> create mode 100644 mm/memfd_tripwire.c
> create mode 100644 tools/testing/selftests/mm/memfd_tripwire.c
>