[RFC PATCH 0/2] mm: memfd with write notifications

* [RFC PATCH 0/2] mm: memfd with write notifications
@ 2026-06-03 12:55 Mattias Nissler
  2026-06-03 12:55 ` [RFC PATCH 1/2] mm: `memfd_tripwire` proof-of-concept Mattias Nissler
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Mattias Nissler @ 2026-06-03 12:55 UTC
  To: linux-mm; +Cc: Hugh Dickins, Baolin Wang, mnissler, mattias.nissler

I want to propose a kernel facility that lets user space create memory
regions that generate notifications on write access. This is useful
as a cross-process communication mechanism, where a producer writes to
the memory, and a consumer polls for new data to become available.

I'm including a minimalistic proof-of-concept implementation meant as a
vehicle to demonstrate the idea and clarify semantics. It works by
mapping the memory region read-only, so write accesses will generate
page faults. The `page_mkwrite` handler can thus trigger notifications.
It also allows the mappings to become writable temporarily until the
mechanism gets rearmed.

Intended usage looks as follows (cf. selftest code included):
  1.  Call `memfd_tripwire` to create an instance, `ftruncate` to
      configure its size.
  2.  Pass the file descriptor for `mmap()`ing to the producer and
      consumer (potentially across process boundaries).
  3a. The producer writes to the memory region whenever there is new
      data to publish to the consumer.
  3b. The consumer runs a poll loop:
        * Wait for `POLLIN` event.
        * `ioctl(MEMFD_TRIPWIRE_ACK)` to (1) re-arm for subsequent
          notifications and (2) make sure prior writes are visible.
        * Examine the memory region to collect and process the data
          provided by the producer.
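
As a rough illustration of step 3b, a single consumer iteration could
look like the sketch below. The helper name `tripwire_wait_and_ack` is
made up for this example, and the fallback `MEMFD_TRIPWIRE_ACK`
definition merely mirrors the value from the proposed uapi header, so
none of this works on a kernel without the patch applied.

```c
#include <errno.h>
#include <poll.h>
#include <sys/ioctl.h>

/* Assumption: mirrors the proposed include/uapi/linux/memfd_tripwire.h. */
#ifndef MEMFD_TRIPWIRE_ACK
#define MEMFD_TRIPWIRE_ACK _IO(0xDA, 0x00)
#endif

/*
 * Hypothetical helper: wait for a write notification, then re-arm.
 * Returns 1 if a notification was seen and acknowledged, 0 on timeout,
 * -1 on error (errno is set).
 */
static int tripwire_wait_and_ack(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int ret;

	ret = poll(&pfd, 1, timeout_ms);
	if (ret <= 0)
		return ret;

	if (!(pfd.revents & POLLIN)) {
		errno = EIO;
		return -1;
	}

	/* ACK first, so writes landing from here on raise a fresh POLLIN. */
	if (ioctl(fd, MEMFD_TRIPWIRE_ACK, 0) < 0)
		return -1;

	return 1;
}
```

A consumer would call this in a loop and inspect the memory region
each time it returns 1.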

The intention is to guarantee that no change to the memory region can
slip through undetected by the consumer. The ACK semantics are
instrumental in achieving that. When the ACK returns, we assure the
consumer that (possibly concurrent) writes to the region are either
visible, or if not they will trigger a subsequent `POLLIN` event. Note
that there is no guarantee that each write will generate an individual
event. Neither are any details on the triggering write provided
(address, value). The consumer is expected to inspect the memory to find
out what has changed. The details of this depend on the communication
protocol producer and consumer are following.

A direct consequence of the above semantics is that for a write to be
detectable by the consumer, the write must change data in the memory
region. Re-writing an already-present value would generally not be
detectable. Sequences where a location is written briefly with a changed
value and then restored to the previous value can't be reliably detected
by the consumer either. I'm calling this out as a noteworthy restriction
that will prevent certain usage patterns.
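
One way for a protocol to live with this restriction is to have every
publish advance a monotonically increasing counter, so that each
publish changes memory even when the payload repeats. The sketch below
is only an illustration under that assumption: `struct ring`,
`ring_publish` and `ring_drain` are hypothetical names, it covers a
single producer and a single consumer, and it omits handling of the
producer overrunning the consumer.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SLOTS 8u	/* hypothetical size, must be a power of two */

/* Hypothetical layout placed at the start of the shared region. */
struct ring {
	_Atomic uint32_t head;	/* total entries published, never reset */
	uint32_t slots[RING_SLOTS];
};

/* Producer: store the payload, then publish it by advancing head. */
static void ring_publish(struct ring *r, uint32_t value)
{
	uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);

	r->slots[head % RING_SLOTS] = value;
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
}

/*
 * Consumer: meant to be called after each ACK. Copies the entries
 * published since *tail into out[] (at most max of them), advances
 * *tail accordingly and returns the number of entries copied.
 */
static size_t ring_drain(struct ring *r, uint32_t *tail, uint32_t *out,
			 size_t max)
{
	uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
	size_t n = 0;

	while (*tail != head && n < max)
		out[n++] = r->slots[(*tail)++ % RING_SLOTS];

	return n;
}
```

Publishing the same payload twice is still observable here, because
`head` differs after each publish.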

I also want to call attention to an inherent race condition. Page faults
signal that a write is being attempted, but obviously they fire before
the actual write happens. Thus, write notifications are generated before
the write, and race with the write actually taking place. If the
consumer wins the race and gets their ACK through before the write
manages to land, the consumer could see unchanged memory, and the
producer would fault again. This is perfectly compatible with the
desired semantics, but creates the risk of a livelock situation. In
practice, it is hopefully exceedingly unlikely for the consumer to win
the race consistently so that this can be ignored.

Switching to motivation: Producer / consumer communication can be
implemented in many different ways. Pairing shared memory with a
notification mechanism such as `eventfd` gets the job done nicely, but
requires the producer to operate an `eventfd`. This can be undesirable
or impractical for cases where the producer is designed to talk to a
hardware interface in which a register write also conveys a "doorbell"
to trigger hardware processing. The proposed mechanism allows software
implementations of the consumer side that are functionally equivalent
to the popular ring buffer + doorbell consumer implementation in
hardware. This is particularly useful in contexts where hardware is
simulated, for example in VFIO-user server implementations. The tripwire
mechanism isn't restricted to that use case though and is generic enough
to be useful for other purposes as well.
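
For comparison, the conventional shared memory + `eventfd` pairing
mentioned above can be sketched roughly as follows. The function names
`doorbell_publish` and `doorbell_consume` are invented for this
example; the point is merely that the producer has to issue an
explicit `write()` on the `eventfd` after updating memory, which is the
step a producer written against a hardware doorbell register would not
perform.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Producer: update shared memory, then explicitly ring the doorbell. */
static int doorbell_publish(_Atomic uint32_t *shared, uint32_t value,
			    int efd)
{
	uint64_t one = 1;

	atomic_store_explicit(shared, value, memory_order_release);
	if (write(efd, &one, sizeof(one)) != (ssize_t)sizeof(one))
		return -1;

	return 0;
}

/* Consumer: wait for the doorbell, clear it, then read shared memory. */
static int doorbell_consume(_Atomic uint32_t *shared, uint32_t *value,
			    int efd)
{
	uint64_t count;

	/* Blocks until the eventfd counter is non-zero, then resets it. */
	if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return -1;

	*value = atomic_load_explicit(shared, memory_order_acquire);
	return 0;
}
```

Both sides would share a descriptor obtained from, say,
`eventfd(0, EFD_CLOEXEC)` in addition to the memory mapping.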

In terms of related technology, there is some overlap with both
`userfaultfd` and KVM's `ioeventfd`. `userfaultfd` has a write protect
mode that will generate notifications upon write access. This can be
used to construct a similar notification mechanism to the one proposed.
However, the producer will be blocked until the consumer resolves the
fault and provides a writable page. The consumer will then have to give
the producer time to carry out the write before re-arming write
protection. Furthermore, `userfaultfd` is scoped by design to a process
/ `mm`, whereas the proposed tripwire mechanism makes the fault handling
and notification mechanism a property of the memory region represented
by the file descriptor. The latter is simpler to integrate with existing
IPC protocols that already exchange file descriptors (such as
VFIO-user), avoiding the need for additional setup code in the producer
to instantiate a `userfaultfd` and pass it to the consumer. Also, the
tripwire mechanism doesn't make an attempt at providing a generic fault
handling framework, which sidesteps the access control complications of
`userfaultfd` when it comes to handling faults generated in kernel
context. In contrast to `ioeventfd`, `memfd_tripwire` does provide
regular memory semantics, whereas `ioeventfd` discards written values.

There are also a number of design choices / alternatives that warrant
consideration. For simplicity, the proof-of-concept implementation
generates `POLLIN` events on the file descriptor representing the memory
region. That's somewhat unconventional given that `POLLIN` is originally
meant to indicate a file descriptor's readiness to be read, which is
always the case for memory-backed files. An alternative design might
associate a separate `eventfd` at setup time to deliver events to. If we
were to deliver events via an `eventfd`, we could possibly also fold the
ACK operation into the `read()` that clears the `eventfd`, which might
be considered a cleaner API.

Finally, I acknowledge that the proof-of-concept implementation has a
number of gaps that would need to be filled in for a production
implementation. This includes read and write `file_operations`, support
for sealing and THP, etc. These features are already implemented in
`shmem`, so a production version would likely make most sense as a new
`shmem` feature. I wanted to get high-level feedback on the concept
before starting work on that though.

Mattias Nissler (2):
  mm: `memfd_tripwire` proof-of-concept
  selftests: `memfd_tripwire` selftest

 arch/x86/entry/syscalls/syscall_64.tbl      |   1 +
 include/linux/syscalls.h                    |   1 +
 include/uapi/asm-generic/unistd.h           |   5 +-
 include/uapi/linux/memfd_tripwire.h         |  19 +
 kernel/sys_ni.c                             |   3 +
 mm/Kconfig                                  |   9 +
 mm/Makefile                                 |   1 +
 mm/memfd_tripwire.c                         | 246 +++++++
 scripts/syscall.tbl                         |   1 +
 tools/testing/selftests/mm/Makefile         |   1 +
 tools/testing/selftests/mm/memfd_tripwire.c | 695 ++++++++++++++++++++
 11 files changed, 981 insertions(+), 1 deletion(-)
 create mode 100644 include/uapi/linux/memfd_tripwire.h
 create mode 100644 mm/memfd_tripwire.c
 create mode 100644 tools/testing/selftests/mm/memfd_tripwire.c

-- 
2.52.0

* [RFC PATCH 1/2] mm: `memfd_tripwire` proof-of-concept
  2026-06-03 12:55 [RFC PATCH 0/2] mm: memfd with write notifications Mattias Nissler
@ 2026-06-03 12:55 ` Mattias Nissler
  2026-06-03 12:55 ` [RFC PATCH 2/2] selftests: `memfd_tripwire` selftest Mattias Nissler
  2026-06-11  1:36 ` [RFC PATCH 0/2] mm: memfd with write notifications Baolin Wang
  2 siblings, 0 replies; 7+ messages in thread
From: Mattias Nissler @ 2026-06-03 12:55 UTC
  To: linux-mm; +Cc: Hugh Dickins, Baolin Wang, mnissler, mattias.nissler

`memfd_tripwire()` creates a file descriptor referring to a memory
region that generates poll notifications when it is written. This works
by installing read-only mappings. Write accesses then trigger a fault
and invoke the `page_mkwrite()` handler, which queues a POLLIN event and
allows the PTE to be made writable. After observing the POLLIN, the
consumer is expected to invoke the `MEMFD_TRIPWIRE_ACK` ioctl, which
write-protects the mappings again to re-arm the mechanism. It also
guarantees that previous writes are visible to the consumer.

Signed-off-by: Mattias Nissler <mnissler@meta.com>
Assisted-by: Claude:claude-opus-4-6
---
 arch/x86/entry/syscalls/syscall_64.tbl |   1 +
 include/linux/syscalls.h               |   1 +
 include/uapi/asm-generic/unistd.h      |   5 +-
 include/uapi/linux/memfd_tripwire.h    |  19 ++
 kernel/sys_ni.c                        |   3 +
 mm/Kconfig                             |   9 +
 mm/Makefile                            |   1 +
 mm/memfd_tripwire.c                    | 246 +++++++++++++++++++++++++
 scripts/syscall.tbl                    |   1 +
 9 files changed, 285 insertions(+), 1 deletion(-)
 create mode 100644 include/uapi/linux/memfd_tripwire.h
 create mode 100644 mm/memfd_tripwire.c

diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 524155d655da1..d93ace9385fd0 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -396,6 +396,7 @@
 469	common	file_setattr		sys_file_setattr
 470	common	listns			sys_listns
 471	common	rseq_slice_yield	sys_rseq_slice_yield
+472	common	memfd_tripwire		sys_memfd_tripwire
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 4fb7291f54b62..52ea6d808f96d 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -940,6 +940,7 @@ asmlinkage long sys_seccomp(unsigned int op, unsigned int flags,
 asmlinkage long sys_getrandom(char __user *buf, size_t count,
 			      unsigned int flags);
 asmlinkage long sys_memfd_create(const char __user *uname_ptr, unsigned int flags);
+asmlinkage long sys_memfd_tripwire(unsigned int flags);
 asmlinkage long sys_bpf(int cmd, union bpf_attr __user *attr, unsigned int size);
 asmlinkage long sys_execveat(int dfd, const char __user *filename,
 			const char __user *const __user *argv,
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index a627acc8fb5fe..38825b1c59cef 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -863,8 +863,11 @@ __SYSCALL(__NR_listns, sys_listns)
 #define __NR_rseq_slice_yield 471
 __SYSCALL(__NR_rseq_slice_yield, sys_rseq_slice_yield)
 
+#define __NR_memfd_tripwire 472
+__SYSCALL(__NR_memfd_tripwire, sys_memfd_tripwire)
+
 #undef __NR_syscalls
-#define __NR_syscalls 472
+#define __NR_syscalls 473
 
 /*
  * 32 bit systems traditionally used different
diff --git a/include/uapi/linux/memfd_tripwire.h b/include/uapi/linux/memfd_tripwire.h
new file mode 100644
index 0000000000000..478599ba2f813
--- /dev/null
+++ b/include/uapi/linux/memfd_tripwire.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_MEMFD_TRIPWIRE_H
+#define _UAPI_LINUX_MEMFD_TRIPWIRE_H
+
+#include <linux/ioctl.h>
+
+#define MEMFD_TRIPWIRE_IOC		0xDA
+
+/*
+ * MEMFD_TRIPWIRE_ACK serves two purposes: First, it re-arms the mechanism to
+ * make sure future write activity triggers a POLLIN notification. Second, it
+ * makes sure that all writes up to the ACK are visible to the calling process.
+ * It is also guaranteed that no writes can sneak through unnoticed, i.e. after
+ * ACK concurrent writes have either taken place and are visible to the
+ * consumer, or will generate subsequent POLLIN events.
+ */
+#define MEMFD_TRIPWIRE_ACK		_IO(MEMFD_TRIPWIRE_IOC, 0x00)
+
+#endif /* _UAPI_LINUX_MEMFD_TRIPWIRE_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index add3032da16f5..6bcf8658b4ff5 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -270,6 +270,9 @@ COND_SYSCALL(pkey_free);
 /* memfd_secret */
 COND_SYSCALL(memfd_secret);
 
+/* memfd_tripwire */
+COND_SYSCALL(memfd_tripwire);
+
 /*
  * Architecture specific weak syscall entries.
  */
diff --git a/mm/Kconfig b/mm/Kconfig
index e8bf1e9e6ad90..c7c1c2e69d37b 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1323,6 +1323,15 @@ config SECRETMEM
 	  memory areas visible only in the context of the owning process and
 	  not mapped to other processes and other kernel page tables.
 
+config MEMFD_TRIPWIRE
+	bool "Enable memfd_tripwire() system call"
+	depends on MMU
+	help
+	  Enable the memfd_tripwire() system call, which creates an anonymous
+	  shared memory region with built-in write notification support. The
+	  returned file descriptor supports poll() for detecting writes to the
+	  mapped memory and an ioctl for re-arming notifications.
+
 config ANON_VMA_NAME
 	bool "Anonymous VMA name support"
 	depends on PROC_FS && ADVISE_SYSCALLS && MMU
diff --git a/mm/Makefile b/mm/Makefile
index 8ad2ab08244eb..658443dbcb28d 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -141,6 +141,7 @@ obj-$(CONFIG_PERCPU_STATS) += percpu-stats.o
 obj-$(CONFIG_ZONE_DEVICE) += memremap.o
 obj-$(CONFIG_HMM_MIRROR) += hmm.o
 obj-$(CONFIG_MEMFD_CREATE) += memfd.o
+obj-$(CONFIG_MEMFD_TRIPWIRE) += memfd_tripwire.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
diff --git a/mm/memfd_tripwire.c b/mm/memfd_tripwire.c
new file mode 100644
index 0000000000000..52a27507b2d39
--- /dev/null
+++ b/mm/memfd_tripwire.c
@@ -0,0 +1,246 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * memfd_tripwire - Memory file descriptor with write notification support.
+ *
+ * Creates an anonymous memory-backed file descriptor that supports polling for
+ * detecting writes to its mappings.
+ *
+ * Theory of operation:
+ *  1.  memfd_tripwire() to create a new instance
+ *  2.  Pass the fd / mapping to the (possibly out-of-process) producer
+ *  3a. The producer uses the memory region as normal memory. Writes stick, but
+ *      generate a notification as a side effect.
+ *  3b. The consumer polls the file descriptor:
+ *       a. POLLIN is observed when the memory region gets written
+ *       b. Call ioctl(MEMFD_TRIPWIRE_ACK) to restore write protection
+ *       c. Inspect memory contents and react as appropriate
+ */
+
+#include <linux/anon_inodes.h>
+#include <linux/file.h>
+#include <linux/folio_batch.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/mount.h>
+#include <linux/pagemap.h>
+#include <linux/poll.h>
+#include <linux/pseudo_fs.h>
+#include <linux/rmap.h>
+#include <linux/slab.h>
+#include <linux/syscalls.h>
+
+#include <uapi/linux/magic.h>
+#include <uapi/linux/memfd_tripwire.h>
+
+struct tripwire_info {
+	atomic_t dirty;
+	wait_queue_head_t wqh;
+};
+
+static struct tripwire_info *to_tripwire_info(struct file *file)
+{
+	return file->private_data;
+}
+
+static vm_fault_t tripwire_fault(struct vm_fault *vmf)
+{
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	pgoff_t offset = vmf->pgoff;
+	gfp_t gfp = vmf->gfp_mask;
+	struct folio *folio;
+	int err;
+
+	if (((loff_t)offset << PAGE_SHIFT) >= i_size_read(inode))
+		return vmf_error(-EINVAL);
+
+retry:
+	folio = filemap_lock_folio(mapping, offset);
+	if (IS_ERR(folio)) {
+		folio = folio_alloc(gfp | __GFP_ZERO, 0);
+		if (!folio)
+			return VM_FAULT_OOM;
+
+		__folio_mark_uptodate(folio);
+		err = filemap_add_folio(mapping, folio, offset, gfp);
+		if (unlikely(err)) {
+			folio_put(folio);
+			if (err == -EEXIST)
+				goto retry;
+			return vmf_error(err);
+		}
+	}
+
+	vmf->page = folio_file_page(folio, offset);
+	return VM_FAULT_LOCKED;
+}
+
+static vm_fault_t tripwire_page_mkwrite(struct vm_fault *vmf)
+{
+	struct tripwire_info *info = to_tripwire_info(vmf->vma->vm_file);
+	struct folio *folio = page_folio(vmf->page);
+
+	/*
+	 * Note that this might be racing with a concurrent ACK. We need to
+	 * guarantee that the dirty flag and page protection state remain
+	 * consistent and that no notifications are lost.
+	 *
+	 * The actual update to mark the PTE writable only happens after this
+	 * function completes, so the window between here and PTE update must
+	 * be protected against concurrent modifications by the ACK code path.
+	 * Taking the folio lock before notifying consumers conveniently
+	 * makes sure that ACKs can only complete after our PTE updates go
+	 * through, preventing a situation where an interleaved ACK clears the
+	 * dirty flag but we're still going ahead to mark the PTE writable.
+	 */
+	folio_lock(folio);
+
+	if (atomic_cmpxchg(&info->dirty, 0, 1) == 0)
+		wake_up_poll(&info->wqh, EPOLLIN);
+
+	if (folio->mapping != vmf->vma->vm_file->f_mapping) {
+		folio_unlock(folio);
+		return VM_FAULT_NOPAGE;
+	}
+
+	return VM_FAULT_LOCKED;
+}
+
+static const struct vm_operations_struct tripwire_vm_ops = {
+	.fault = tripwire_fault,
+	.page_mkwrite = tripwire_page_mkwrite,
+};
+
+static int tripwire_mmap_prepare(struct vm_area_desc *desc)
+{
+	file_accessed(desc->file);
+	desc->vm_ops = &tripwire_vm_ops;
+	return 0;
+}
+
+static __poll_t tripwire_poll(struct file *file, poll_table *wait)
+{
+	struct tripwire_info *info = to_tripwire_info(file);
+	__poll_t mask = 0;
+
+	poll_wait(file, &info->wqh, wait);
+
+	if (atomic_read(&info->dirty))
+		mask |= EPOLLIN | EPOLLRDNORM;
+
+	return mask;
+}
+
+static void tripwire_ack(struct file *file)
+{
+	struct tripwire_info *info = to_tripwire_info(file);
+	struct address_space *mapping = file->f_mapping;
+	struct folio_batch fbatch;
+	pgoff_t index = 0;
+	int i;
+
+	/*
+	 * Note that this flag update is not protected by the folio locks taken
+	 * below. Hence a concurrent writer might sneak in and switch back to
+	 * dirty state. That's OK though, since it merely results in a spurious
+	 * notification.
+	 */
+	atomic_set(&info->dirty, 0);
+
+	folio_batch_init(&fbatch);
+	while (filemap_get_folios(mapping, &index, ~0UL, &fbatch)) {
+		for (i = 0; i < folio_batch_count(&fbatch); i++) {
+			folio_lock(fbatch.folios[i]);
+			folio_mkclean(fbatch.folios[i]);
+			folio_unlock(fbatch.folios[i]);
+		}
+		folio_batch_release(&fbatch);
+		cond_resched();
+	}
+}
+
+static long tripwire_ioctl(struct file *file, unsigned int cmd,
+			   unsigned long arg)
+{
+	switch (cmd) {
+	case MEMFD_TRIPWIRE_ACK:
+		if (arg != 0)
+			return -EINVAL;
+		tripwire_ack(file);
+		return 0;
+	default:
+		return -ENOTTY;
+	}
+}
+
+static int tripwire_release(struct inode *inode, struct file *file)
+{
+	kfree(to_tripwire_info(file));
+	return 0;
+}
+
+static const struct file_operations tripwire_fops = {
+	.release = tripwire_release,
+	.mmap_prepare = tripwire_mmap_prepare,
+	.poll = tripwire_poll,
+	.unlocked_ioctl = tripwire_ioctl,
+};
+
+static const struct address_space_operations tripwire_aops = {
+	.dirty_folio = noop_dirty_folio,
+};
+
+static int tripwire_setattr(struct mnt_idmap *idmap, struct dentry *dentry,
+			    struct iattr *iattr)
+{
+	struct inode *inode = d_inode(dentry);
+	unsigned int ia_valid = iattr->ia_valid;
+	int ret;
+
+	filemap_invalidate_lock(inode->i_mapping);
+
+	/* Allowing size to change only once avoids the need to unmap here. */
+	if ((ia_valid & ATTR_SIZE) && inode->i_size)
+		ret = -EINVAL;
+	else
+		ret = simple_setattr(idmap, dentry, iattr);
+
+	filemap_invalidate_unlock(inode->i_mapping);
+
+	return ret;
+}
+
+static const struct inode_operations tripwire_iops = {
+	.setattr = tripwire_setattr,
+};
+
+SYSCALL_DEFINE1(memfd_tripwire, unsigned int, flags)
+{
+	struct tripwire_info *info;
+	struct inode *inode;
+	struct file *file;
+
+	if (flags != 0)
+		return -EINVAL;
+
+	info = kzalloc_obj(struct tripwire_info);
+	if (!info)
+		return -ENOMEM;
+
+	init_waitqueue_head(&info->wqh);
+
+	file = anon_inode_create_getfile("memfd_tripwire", &tripwire_fops, info,
+					 O_RDWR, NULL);
+	if (IS_ERR(file)) {
+		kfree(info);
+		return PTR_ERR(file);
+	}
+
+	inode = file_inode(file);
+	inode->i_op = &tripwire_iops;
+	inode->i_mapping->a_ops = &tripwire_aops;
+	inode->i_mode |= S_IFREG;
+	inode->i_size = 0;
+
+	return FD_ADD(O_CLOEXEC, file);
+}
diff --git a/scripts/syscall.tbl b/scripts/syscall.tbl
index 7a42b32b65776..a2bd8cd809ee8 100644
--- a/scripts/syscall.tbl
+++ b/scripts/syscall.tbl
@@ -412,3 +412,4 @@
 469	common	file_setattr			sys_file_setattr
 470	common	listns				sys_listns
 471	common	rseq_slice_yield		sys_rseq_slice_yield
+472	common	memfd_tripwire			sys_memfd_tripwire
-- 
2.52.0

* [RFC PATCH 2/2] selftests: `memfd_tripwire` selftest
  2026-06-03 12:55 [RFC PATCH 0/2] mm: memfd with write notifications Mattias Nissler
  2026-06-03 12:55 ` [RFC PATCH 1/2] mm: `memfd_tripwire` proof-of-concept Mattias Nissler
@ 2026-06-03 12:55 ` Mattias Nissler
  2026-06-11  1:36 ` [RFC PATCH 0/2] mm: memfd with write notifications Baolin Wang
  2 siblings, 0 replies; 7+ messages in thread
From: Mattias Nissler @ 2026-06-03 12:55 UTC
  To: linux-mm; +Cc: Hugh Dickins, Baolin Wang, mnissler, mattias.nissler

Add test cases for `memfd_tripwire`, exercising basic semantics as well
as verifying behavior under stress.

Signed-off-by: Mattias Nissler <mnissler@meta.com>
Assisted-by: Claude:claude-opus-4-6
---
 tools/testing/selftests/mm/Makefile         |   1 +
 tools/testing/selftests/mm/memfd_tripwire.c | 695 ++++++++++++++++++++
 2 files changed, 696 insertions(+)
 create mode 100644 tools/testing/selftests/mm/memfd_tripwire.c

diff --git a/tools/testing/selftests/mm/Makefile b/tools/testing/selftests/mm/Makefile
index cd24596cdd27e..f87839733df5a 100644
--- a/tools/testing/selftests/mm/Makefile
+++ b/tools/testing/selftests/mm/Makefile
@@ -106,6 +106,7 @@ TEST_GEN_FILES += guard-regions
 TEST_GEN_FILES += merge
 TEST_GEN_FILES += rmap
 TEST_GEN_FILES += folio_split_race_test
+TEST_GEN_FILES += memfd_tripwire
 
 ifneq ($(ARCH),arm64)
 TEST_GEN_FILES += soft-dirty
diff --git a/tools/testing/selftests/mm/memfd_tripwire.c b/tools/testing/selftests/mm/memfd_tripwire.c
new file mode 100644
index 0000000000000..7229b387ffcf4
--- /dev/null
+++ b/tools/testing/selftests/mm/memfd_tripwire.c
@@ -0,0 +1,695 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * memfd_tripwire selftest
+ *
+ * Tests for the memfd_tripwire() syscall which creates a shared memory region
+ * with write notification support via poll() and ioctl-based re-arming.
+ */
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <poll.h>
+#include <pthread.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdatomic.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/syscall.h>
+#include <sys/wait.h>
+#include <time.h>
+#include <unistd.h>
+
+#include <linux/memfd_tripwire.h>
+
+#include "kselftest.h"
+
+#ifdef __NR_memfd_tripwire
+
+static int sys_memfd_tripwire(unsigned int flags)
+{
+	return syscall(__NR_memfd_tripwire, flags);
+}
+
+/*
+ * Poll the fd for POLLIN with the given timeout in milliseconds.
+ * Returns 1 if POLLIN is set, 0 on timeout, -1 on error.
+ */
+static int poll_check(int fd, int timeout_ms)
+{
+	struct pollfd pfd = { .fd = fd, .events = POLLIN };
+	int ret;
+
+	ret = poll(&pfd, 1, timeout_ms);
+	if (ret < 0)
+		return -1;
+	if (ret == 0)
+		return 0;
+	return (pfd.revents & POLLIN) ? 1 : 0;
+}
+
+static int tripwire_ack(int fd)
+{
+	return ioctl(fd, MEMFD_TRIPWIRE_ACK, 0);
+}
+
+static long page_size;
+
+static int setup_tripwire(size_t size, int *fd, char **mem)
+{
+	*fd = sys_memfd_tripwire(0);
+	if (*fd < 0) {
+		ksft_test_result_fail("memfd_tripwire: %m\n");
+		return -1;
+	}
+
+	if (ftruncate(*fd, size) < 0) {
+		ksft_test_result_fail("ftruncate: %m\n");
+		goto out;
+	}
+
+	*mem = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, *fd, 0);
+	if (*mem == MAP_FAILED) {
+		*mem = NULL;
+		ksft_test_result_fail("mmap: %m\n");
+		goto out;
+	}
+
+	return 0;
+
+out:
+	close(*fd);
+	return -1;
+}
+
+static double time_elapsed(struct timespec *start)
+{
+	struct timespec now;
+
+	clock_gettime(CLOCK_MONOTONIC, &now);
+	return (now.tv_sec - start->tv_sec) +
+	       (now.tv_nsec - start->tv_nsec) / 1e9;
+}
+
+/*
+ * Test 1: Sequential single-process semantics.
+ *
+ * Verifies the complete life cycle:
+ *   create -> poll (clean) -> write -> poll (dirty) -> ACK -> poll (clean)
+ *          -> write again -> poll (dirty)
+ */
+static void test_sequential(void)
+{
+	int fd;
+	char *mem;
+
+	if (setup_tripwire(page_size, &fd, &mem) != 0)
+		return;
+
+	/* Freshly created: should not have POLLIN */
+	if (poll_check(fd, 0) != 0) {
+		ksft_test_result_fail("POLLIN set on clean fd\n");
+		goto out;
+	}
+
+	/* Write to mapped memory */
+	mem[0] = 'A';
+
+	/* Should now report POLLIN */
+	if (poll_check(fd, 100) != 1) {
+		ksft_test_result_fail("POLLIN not set after write\n");
+		goto out;
+	}
+
+	/* Polling again without ACK should still show POLLIN */
+	if (poll_check(fd, 0) != 1) {
+		ksft_test_result_fail("POLLIN lost without ACK\n");
+		goto out;
+	}
+
+	/* ACK re-arms the tripwire */
+	if (tripwire_ack(fd) < 0) {
+		ksft_test_result_fail("ioctl ACK: %m\n");
+		goto out;
+	}
+
+	/* After ACK: should be clean again */
+	if (poll_check(fd, 0) != 0) {
+		ksft_test_result_fail("POLLIN set after ACK\n");
+		goto out;
+	}
+
+	/* Write again: must re-trigger */
+	mem[0] = 'B';
+
+	if (poll_check(fd, 100) != 1) {
+		ksft_test_result_fail("POLLIN not set after second write\n");
+		goto out;
+	}
+
+	ksft_test_result_pass("sequential\n");
+out:
+	munmap((void *)mem, page_size);
+	close(fd);
+}
+
+/*
+ * Test 2: Multi-page, writes to different pages trigger after re-arm.
+ */
+static void test_multi_page(void)
+{
+	int fd;
+	char *mem;
+	size_t size = page_size * 4;
+
+	if (setup_tripwire(size, &fd, &mem) != 0)
+		return;
+
+	/* Write page 0, observe, ACK */
+	mem[0] = 'A';
+	if (poll_check(fd, 100) != 1) {
+		ksft_test_result_fail("POLLIN not set after page 0 write\n");
+		goto out;
+	}
+	tripwire_ack(fd);
+
+	/* Write page 2: A different page must also trigger */
+	mem[page_size * 2] = 'B';
+	if (poll_check(fd, 100) != 1) {
+		ksft_test_result_fail("POLLIN not set after page 2 write\n");
+		goto out;
+	}
+	tripwire_ack(fd);
+
+	/* Write page 3 */
+	mem[page_size * 3] = 'C';
+	if (poll_check(fd, 100) != 1) {
+		ksft_test_result_fail("POLLIN not set after page 3 write\n");
+		goto out;
+	}
+
+	ksft_test_result_pass("multi_page\n");
+out:
+	munmap((void *)mem, size);
+	close(fd);
+}
+
+/*
+ * Test 3: Cross-process producer/consumer.
+ *
+ * Parent is the consumer (polls for events), child is the producer (writes
+ * to the shared mapping). Verifies that writes in the child process are
+ * visible to the parent via poll().
+ */
+static void test_cross_process(void)
+{
+	int fd, status;
+	char *mem;
+	pid_t pid;
+
+	if (setup_tripwire(page_size, &fd, &mem) != 0)
+		return;
+
+	pid = fork();
+	if (pid < 0) {
+		ksft_test_result_fail("fork: %m\n");
+		goto out;
+	}
+
+	if (pid == 0) {
+		/*
+		 * Child: wait a bit then write. The parent should be blocked
+		 * in poll() and wake up when we write.
+		 */
+		usleep(50000);
+		mem[0] = 'P';
+		_exit(0);
+	}
+
+	/* Parent: wait for the child's write */
+	if (poll_check(fd, 2000) != 1) {
+		ksft_test_result_fail("POLLIN not set after child write\n");
+		kill(pid, SIGKILL);
+		waitpid(pid, &status, 0);
+		goto out;
+	}
+
+	waitpid(pid, &status, 0);
+
+	if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) {
+		ksft_test_result_fail("child exited abnormally\n");
+		goto out;
+	}
+
+	/* Verify data written by child */
+	if (mem[0] != 'P') {
+		ksft_test_result_fail("data mismatch: got 0x%02x\n", mem[0]);
+		goto out;
+	}
+
+	ksft_test_result_pass("cross_process\n");
+out:
+	munmap((void *)mem, page_size);
+	close(fd);
+}
+
+/*
+ * Test 4: Multi-round cross-process producer/consumer.
+ *
+ * Uses a pipe to synchronize rounds between parent (consumer) and child
+ * (producer). Each round: child writes, parent detects via poll, parent
+ * ACKs, repeat. Verifies that the full cycle works reliably across
+ * process boundaries.
+ */
+static void test_cross_process_multi_round(void)
+{
+	int fd, status;
+	int pipe_to_child[2], pipe_to_parent[2];
+	char *mem;
+	pid_t pid;
+	int rounds = 1000;
+	char sync;
+
+	if (setup_tripwire(page_size, &fd, &mem) != 0)
+		return;
+
+	if (pipe(pipe_to_child) < 0 || pipe(pipe_to_parent) < 0) {
+		ksft_test_result_fail("pipe: %m\n");
+		goto out;
+	}
+
+	pid = fork();
+	if (pid < 0) {
+		ksft_test_result_fail("fork: %m\n");
+		goto out;
+	}
+
+	if (pid == 0) {
+		/* Child: producer */
+		close(pipe_to_child[1]);
+		close(pipe_to_parent[0]);
+
+		for (int i = 0; i < rounds; i++) {
+			/* Wait for parent to signal us to write */
+			if (read(pipe_to_child[0], &sync, 1) != 1)
+				_exit(1);
+			mem[0] = 'A' + i;
+			/* Tell parent we wrote */
+			if (write(pipe_to_parent[1], &sync, 1) != 1)
+				_exit(1);
+		}
+		_exit(0);
+	}
+
+	/* Parent: consumer */
+	close(pipe_to_child[0]);
+	close(pipe_to_parent[1]);
+
+	for (int i = 0; i < rounds; i++) {
+		/* Tell child to write */
+		sync = 'G';
+		if (write(pipe_to_child[1], &sync, 1) != 1) {
+			ksft_test_result_fail("write to pipe: %m\n");
+			goto reap;
+		}
+
+		/* Wait for child to confirm write */
+		if (read(pipe_to_parent[0], &sync, 1) != 1) {
+			ksft_test_result_fail("read from pipe: %m\n");
+			goto reap;
+		}
+
+		/* Detect the write */
+		if (poll_check(fd, 1000) != 1) {
+			ksft_test_result_fail("round %d: POLLIN not set\n", i);
+			goto reap;
+		}
+
+		/* Verify data */
+		if (mem[0] != (char)('A' + i)) {
+			ksft_test_result_fail("round %d: data 0x%02x\n", i,
+					      mem[0]);
+			goto reap;
+		}
+
+		/* Re-arm for next round */
+		if (tripwire_ack(fd) < 0) {
+			ksft_test_result_fail("round %d: ACK: %m\n", i);
+			goto reap;
+		}
+
+		/* Confirm clean after ACK */
+		if (poll_check(fd, 0) != 0) {
+			ksft_test_result_fail("round %d: POLLIN after ACK\n",
+					      i);
+			goto reap;
+		}
+	}
+
+	close(pipe_to_child[1]);
+	close(pipe_to_parent[0]);
+
+	waitpid(pid, &status, 0);
+	if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) {
+		ksft_test_result_fail("child exited abnormally\n");
+		goto out;
+	}
+
+	ksft_test_result_pass("cross_process_multi_round\n");
+	munmap((void *)mem, page_size);
+	close(fd);
+	return;
+
+reap:
+	close(pipe_to_child[1]);
+	close(pipe_to_parent[0]);
+	kill(pid, SIGKILL);
+	waitpid(pid, &status, 0);
+out:
+	munmap((void *)mem, page_size);
+	close(fd);
+}
+
+/*
+ * Test 5: Producer/consumer race stress test.
+ *
+ * A producer thread writes continuously while the consumer thread polls and
+ * ACKs. The invariant is: we must never "miss" an event. Specifically, after
+ * the consumer ACKs and the producer writes at least once more, the consumer
+ * must eventually see POLLIN again.
+ *
+ * We use an atomic generation counter to track this. The producer increments
+ * the generation and writes to the mapping. The consumer records the
+ * generation at ACK time and verifies that when it next sees POLLIN, the
+ * generation has advanced.
+ */
+static atomic_int stress_gen;
+static atomic_bool stress_stop;
+
+static void *stress_producer(void *arg)
+{
+	char *mem = arg;
+
+	while (!atomic_load(&stress_stop)) {
+		atomic_fetch_add(&stress_gen, 1);
+		mem[0]++;
+		/* Intentionally no explicit delay to maximize contention */
+	}
+
+	return NULL;
+}
+
+static void test_stress_no_lost_events(void)
+{
+	int fd;
+	char *mem;
+	pthread_t producer;
+	int ack_count = 0;
+	int gen_at_ack, cur_gen;
+	struct timespec start;
+
+	if (setup_tripwire(page_size, &fd, &mem) != 0)
+		return;
+
+	atomic_store(&stress_gen, 0);
+	atomic_store(&stress_stop, false);
+
+	if (pthread_create(&producer, NULL, stress_producer, (void *)mem)) {
+		ksft_test_result_fail("pthread_create: %m\n");
+		goto out;
+	}
+
+	/*
+	 * Consumer loop: run for ~2 seconds. In each iteration:
+	 *   1. Wait for POLLIN (the producer is writing continuously)
+	 *   2. Record the current generation
+	 *   3. ACK (re-arm)
+	 *   4. The producer is still writing, so POLLIN must come back
+	 */
+	clock_gettime(CLOCK_MONOTONIC, &start);
+	while (time_elapsed(&start) <= 2.0) {
+		/*
+		 * Wait for a write notification. The producer is looping, so
+		 * the timeout should be plenty.
+		 */
+		if (poll_check(fd, 1000) != 1) {
+			ksft_test_result_fail(
+				"POLLIN not set (ack_count=%d, gen=%d)\n",
+				ack_count, atomic_load(&stress_gen));
+			atomic_store(&stress_stop, true);
+			pthread_join(producer, NULL);
+			goto out;
+		}
+
+		/* Record generation and ACK */
+		gen_at_ack = atomic_load(&stress_gen);
+		if (tripwire_ack(fd) < 0) {
+			ksft_test_result_fail("ACK failed: %m\n");
+			atomic_store(&stress_stop, true);
+			pthread_join(producer, NULL);
+			goto out;
+		}
+		ack_count++;
+
+		/*
+		 * The producer is running concurrently. After ACK, it will
+		 * write again and increment the generation. We don't need to
+		 * check anything here because the next poll() iteration will
+		 * verify that the write was detected.
+		 *
+		 * Spin briefly to let the producer get ahead, exercising the
+		 * race between ACK (write-protect) and producer (write).
+		 */
+		cur_gen = atomic_load(&stress_gen);
+		while (cur_gen == gen_at_ack) {
+			sched_yield();
+			cur_gen = atomic_load(&stress_gen);
+		}
+	}
+
+	atomic_store(&stress_stop, true);
+	pthread_join(producer, NULL);
+
+	if (ack_count < 10) {
+		ksft_test_result_fail("too few ACK cycles: %d\n", ack_count);
+		goto out;
+	}
+
+	ksft_print_msg("stress: %d ACK cycles, final gen=%d\n", ack_count,
+		       atomic_load(&stress_gen));
+	ksft_test_result_pass("stress_no_lost_events\n");
+out:
+	munmap((void *)mem, page_size);
+	close(fd);
+}
+
+/*
+ * Test 6: Randomized multi-thread visibility stress test.
+ *
+ * N_VIS_WRITERS writer threads and N_VIS_READERS reader threads operate
+ * concurrently on a shared memfd_tripwire mapping for VIS_DURATION_S seconds.
+ * Each writer owns a uint32_t slot on its own page and writes monotonically
+ * increasing values, interspersed with random sleeps. Each reader randomly
+ * polls or sleeps; on POLLIN it snapshots committed values, ACKs, then reads
+ * the mapped memory and validates visibility.
+ *
+ * The invariant under test: after a reader calls ACK, all writes that were
+ * committed (writer_committed[i] updated with store-release) before the
+ * reader's snapshot (loaded with load-acquire) must be visible when reading
+ * the mapped memory. The ACK's TLB-flush IPI provides the cross-CPU memory
+ * barrier that makes this guarantee hold.
+ */
+
+#define N_VIS_WRITERS 4
+#define N_VIS_READERS 2
+#define VIS_DURATION_S 2
+#define VIS_WRITER_SLEEP_PCT 20
+#define VIS_READER_SLEEP_PCT 30
+#define VIS_MAX_WRITER_SLEEP_US 1000
+#define VIS_MAX_READER_SLEEP_US 3000
+
+struct vis_ctx {
+	int fd;
+	char *mem;
+	atomic_uint writer_committed[N_VIS_WRITERS];
+	atomic_bool stop;
+	atomic_bool failed;
+	char failure_msg[256];
+};
+
+struct vis_thread_arg {
+	struct vis_ctx *ctx;
+	int id;
+};
+
+static void *vis_writer_fn(void *arg)
+{
+	struct vis_thread_arg *ta = arg;
+	struct vis_ctx *ctx = ta->ctx;
+	int id = ta->id;
+	uint32_t *slot = (uint32_t *)(ctx->mem + id * page_size);
+	uint32_t value = 0;
+	unsigned int seed = (unsigned int)(id * 7919 + 1);
+
+	while (!atomic_load_explicit(&ctx->stop, memory_order_relaxed)) {
+		if (rand_r(&seed) % 100 < VIS_WRITER_SLEEP_PCT) {
+			usleep(rand_r(&seed) % VIS_MAX_WRITER_SLEEP_US);
+			continue;
+		}
+
+		value++;
+		*slot = value;
+		atomic_store_explicit(&ctx->writer_committed[id], value,
+				      memory_order_release);
+	}
+
+	return NULL;
+}
+
+static void *vis_reader_fn(void *arg)
+{
+	struct vis_thread_arg *ta = arg;
+	struct vis_ctx *ctx = ta->ctx;
+	uint32_t snapshot[N_VIS_WRITERS];
+	unsigned int seed = (unsigned int)(ta->id * 6971 + 42);
+	int i;
+
+	while (!atomic_load_explicit(&ctx->stop, memory_order_relaxed)) {
+		if (atomic_load_explicit(&ctx->failed, memory_order_relaxed))
+			break;
+
+		if (rand_r(&seed) % 100 < VIS_READER_SLEEP_PCT) {
+			usleep(rand_r(&seed) % VIS_MAX_READER_SLEEP_US);
+			continue;
+		}
+
+		if (poll_check(ctx->fd, 100) != 1)
+			continue;
+
+		for (i = 0; i < N_VIS_WRITERS; i++)
+			snapshot[i] =
+				atomic_load_explicit(&ctx->writer_committed[i],
+						     memory_order_acquire);
+
+		tripwire_ack(ctx->fd);
+
+		for (i = 0; i < N_VIS_WRITERS; i++) {
+			uint32_t *slot = (uint32_t *)(ctx->mem + i * page_size);
+			uint32_t observed = *slot;
+
+			if (observed < snapshot[i]) {
+				snprintf(
+					ctx->failure_msg,
+					sizeof(ctx->failure_msg),
+					"writer %d: observed %u < committed %u",
+					i, observed, snapshot[i]);
+				atomic_store(&ctx->failed, true);
+				atomic_store(&ctx->stop, true);
+				return NULL;
+			}
+		}
+	}
+
+	return NULL;
+}
+
+static void test_stress_visibility(void)
+{
+	struct vis_ctx ctx;
+	struct vis_thread_arg wargs[N_VIS_WRITERS];
+	struct vis_thread_arg rargs[N_VIS_READERS];
+	pthread_t writers[N_VIS_WRITERS];
+	pthread_t readers[N_VIS_READERS];
+	size_t map_size = N_VIS_WRITERS * page_size;
+	struct timespec start;
+	int i, nw = 0, nr = 0;
+
+	if (setup_tripwire(map_size, &ctx.fd, &ctx.mem) != 0)
+		return;
+
+	atomic_store(&ctx.stop, false);
+	atomic_store(&ctx.failed, false);
+	ctx.failure_msg[0] = '\0';
+	for (i = 0; i < N_VIS_WRITERS; i++)
+		atomic_store(&ctx.writer_committed[i], 0);
+
+	for (i = 0; i < N_VIS_WRITERS; i++) {
+		wargs[i].ctx = &ctx;
+		wargs[i].id = i;
+		if (pthread_create(&writers[i], NULL, vis_writer_fn, &wargs[i]))
+			goto stop;
+		nw++;
+	}
+
+	for (i = 0; i < N_VIS_READERS; i++) {
+		rargs[i].ctx = &ctx;
+		rargs[i].id = i;
+		if (pthread_create(&readers[i], NULL, vis_reader_fn, &rargs[i]))
+			goto stop;
+		nr++;
+	}
+
+	clock_gettime(CLOCK_MONOTONIC, &start);
+	do {
+		usleep(100000);
+	} while (time_elapsed(&start) <= VIS_DURATION_S &&
+		 !atomic_load(&ctx.failed));
+
+stop:
+	atomic_store(&ctx.stop, true);
+	for (i = 0; i < nr; i++)
+		pthread_join(readers[i], NULL);
+	for (i = 0; i < nw; i++)
+		pthread_join(writers[i], NULL);
+
+	if (atomic_load(&ctx.failed)) {
+		ksft_test_result_fail("visibility: %s\n", ctx.failure_msg);
+	} else {
+		for (i = 0; i < N_VIS_WRITERS; i++) {
+			ksft_print_msg("visibility: writer %d committed %u\n",
+				       i,
+				       atomic_load(&ctx.writer_committed[i]));
+		}
+		ksft_test_result_pass("stress_visibility\n");
+	}
+
+	munmap((void *)ctx.mem, map_size);
+	close(ctx.fd);
+}
+
+#define NUM_TESTS 6
+
+int main(void)
+{
+	page_size = sysconf(_SC_PAGE_SIZE);
+	if (!page_size)
+		ksft_exit_fail_msg("Failed to get page size %m\n");
+
+	ksft_print_header();
+	ksft_set_plan(NUM_TESTS);
+
+	test_sequential();
+	test_multi_page();
+	test_cross_process();
+	test_cross_process_multi_round();
+	test_stress_no_lost_events();
+	test_stress_visibility();
+
+	ksft_finished();
+	return 0;
+}
+
+#else /* __NR_memfd_tripwire */
+
+int main(int argc, char *argv[])
+{
+	printf("skip: skipping memfd_tripwire test (missing __NR_memfd_tripwire)\n");
+	return KSFT_SKIP;
+}
+
+#endif /* __NR_memfd_tripwire */
-- 
2.52.0




* Re: [RFC PATCH 0/2] mm: memfd with write notifications
  2026-06-03 12:55 [RFC PATCH 0/2] mm: memfd with write notifications Mattias Nissler
  2026-06-03 12:55 ` [RFC PATCH 1/2] mm: `memfd_tripwire` proof-of-concept Mattias Nissler
  2026-06-03 12:55 ` [RFC PATCH 2/2] selftests: `memfd_tripwire` selftest Mattias Nissler
@ 2026-06-11  1:36 ` Baolin Wang
  2026-06-11 12:40   ` Mattias Nissler
       [not found]   ` <40381f8a-47e3-4f97-a9ad-f6f868fe0392@kernel.org>
  2 siblings, 2 replies; 7+ messages in thread
From: Baolin Wang @ 2026-06-11  1:36 UTC (permalink / raw)
  To: Mattias Nissler, linux-mm
  Cc: Hugh Dickins, mattias.nissler,
	david@kernel.org >> David Hildenbrand (Red Hat),
	Lorenzo Stoakes (Oracle), Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko

Add more MM core maintainers.

On 6/3/26 8:55 PM, Mattias Nissler wrote:
> I want to propose a kernel facility to have user space create memory
> regions that can generate notifications on write access. This is useful
> as a cross-process communication mechanism, where a producer writes to
> the memory, and a consumer polls for new data to be available.
> 
> I'm including a minimalistic proof-of-concept implementation meant as a
> vehicle to demonstrate the idea and clarify semantics. It works by
> mapping the memory region read-only, so write accesses will generate
> page faults. The `page_mkwrite` handler can thus trigger notifications.
> It also allows the mappings to become writable temporarily until the
> mechanism gets rearmed.
> 
> Intended usage looks as follows (cf. selftest code included):
>    1.  Call `memfd_tripwire` to create an instance, `ftruncate` to
>        configure its size.
>    2.  Pass the file descriptor for `mmap()`ing to the producer and
>        consumer (potentially across process boundaries).
>    3a. The producer writes to the memory region whenever there is new
>        data to publish to the consumer.
>    3b. The consumer runs a poll loop:
>          * Wait for `POLLIN` event.
>          * `ioctl(MEMFD_TRIPWIRE_ACK)` to (1) re-arm for subsequent
>            notifications and (2) make sure prior writes are visible.
>          * Examine the memory region to collect and process the data
>            provided by the producer.
> 
> The intention is to guarantee that no change to the memory region can
> slip through undetected by the consumer. The ACK semantics are
> instrumental in achieving that. When the ACK returns, we assure the
> consumer that (possibly concurrent) writes to the region are either
> visible, or if not they will trigger a subsequent `POLLIN` event. Note
> that there is no guarantee that each write will generate an individual
> event. Neither are any details on the triggering write provided
> (address, value). The consumer is expected to inspect the memory to find
> out what has changed. The details of this depend on the communication
> protocol producer and consumer are following.
> 
> A direct consequence of the above semantics is that for a write to be
> detectable by the consumer, the write must change data in the memory
> region. Re-writing an already-present value would generally not be
> detectable. Sequences where a location is written briefly with a changed
> value and then restored to the previous value can't be reliably detected
> by the consumer either. I'm calling this out as a noteworthy restriction
> that will prevent certain usage patterns.
> 
> I also want to call attention to an inherent race condition. Page faults
> signal that a write is being attempted, but obviously they fire before
> the actual write happens. Thus, write notifications are generated before
> the write, and race with the write actually taking place. If the
> consumer wins the race and gets their ACK through before the write
> manages to land, the consumer could see unchanged memory, and the
> producer would fault again. This is perfectly compatible with the
> desired semantics, but creates the risk of a lifelock situation. In
> practice, it is hopefully exceedingly unlikely for the consumer to win
> the race consistently so that this can be ignored.
> 
> Switching to motivation: Producer / consumer communication can be
> implemented in many different ways. Pairing shared memory with a
> notification mechanism such as `eventfd` gets the job done nicely, but
> requires the producer to operate an `eventfd`. This can be undesirable
> or impractical for cases where the producer is designed to talk to a
> hardware interface in which a register write also conveys a "doorbell"
> to trigger hardware processing. The proposed mechanism allows software
> implementations of the consumer side that behave functionally equivalent
> to the popular ring buffer + doorbell consumer implementation in
> hardware. This is particularly useful in contexts where hardware is
> simulated, for example in VFIO-user server implementations. The tripwire
> mechanism isn't restricted to that use case though and is generic enough
> to be useful for other purposes as well.

From your earlier requirements description, my first thought was also 
the shmem + eventfd approach. It seems straightforward and avoids 
introducing a heavy new system call to reimplement functionality that 
already exists.

Moreover, I haven’t fully understood what hardware constraints you’re 
referring to here.

Anyway, let’s also see what others think.

> In terms of related technology, there is some overlap with both
> `userfaultfd` and KVM's `ioeventfd`. `userfaultfd` has a write protect
> mode that will generate notifications upon write access. This can be
> used to construct a similar notification mechanism to the one proposed.
> However, the producer will be blocked until the consumer resolves the
> fault and provides a writable page. The consumer will then have to give
> the producer time to carry out the write before re-arming write
> protection. Furthermore, `userfaultfd` is scoped by design to a process
> / `mm`, whereas the proposed tripwire mechanism makes the fault handling
> and notification mechanism a property of the memory region represented
> by the file descriptor. The latter is simpler to integrate with existing
> IPC protocols that already exchange file descriptors (such as
> VFIO-user), avoiding the need for additional setup code in the producer
> to instantiate a `userfaultfd` and pass it to the consumer. Also, the
> tripwire mechanism doesn't make an attempt at providing a generic fault
> handling framework, which sidesteps the access control complications of
> `userfaultfd` when it comes to handling faults generated in kernel
> context. In contrast to `ioeventfd`, `memfd_tripwire` does provide
> regular memory semantics, whereas `ioeventfd` discards written values.
> 
> There are also a number of design choices / alternatives that warrant
> consideration. For simplicity, the proof-of-concept implementation
> generates `POLLIN` events on the file descriptor representing the memory
> region. That's somewhat unconventional given that `POLLIN` is originally
> meant to indicate a file descriptor's readiness to be read, which is
> always the case for memory-backed files. An alternative design might
> associate a separate `eventfd` at setup time to deliver events to. If we
> were to deliver events via an `eventfd`, we could possibly also fold the
> ACK operation into the `read()` that clears the `eventfd`, which might
> be considered a cleaner API.
> 
> Finally, I acknowledge that the proof-of-concept implementation has a
> number of gaps that would need to be filled in for a production
> implementation. This includes read and write `file_operations`, support
> for sealing and THP, etc. These features are already implemented in
> `shmem`, so a production version would likely make most sense as a new
> `shmem` feature. I wanted to get high-level feedback on the concept
> before starting work on that though.
> 
> Mattias Nissler (2):
>    mm: `memfd_tripwire` proof-of-concept
>    selftests: `memfd_tripwire` selftest
> 
>   arch/x86/entry/syscalls/syscall_64.tbl      |   1 +
>   include/linux/syscalls.h                    |   1 +
>   include/uapi/asm-generic/unistd.h           |   5 +-
>   include/uapi/linux/memfd_tripwire.h         |  19 +
>   kernel/sys_ni.c                             |   3 +
>   mm/Kconfig                                  |   9 +
>   mm/Makefile                                 |   1 +
>   mm/memfd_tripwire.c                         | 246 +++++++
>   scripts/syscall.tbl                         |   1 +
>   tools/testing/selftests/mm/Makefile         |   1 +
>   tools/testing/selftests/mm/memfd_tripwire.c | 695 ++++++++++++++++++++
>   11 files changed, 981 insertions(+), 1 deletion(-)
>   create mode 100644 include/uapi/linux/memfd_tripwire.h
>   create mode 100644 mm/memfd_tripwire.c
>   create mode 100644 tools/testing/selftests/mm/memfd_tripwire.c
> 




* Re: [RFC PATCH 0/2] mm: memfd with write notifications
  2026-06-11  1:36 ` [RFC PATCH 0/2] mm: memfd with write notifications Baolin Wang
@ 2026-06-11 12:40   ` Mattias Nissler
       [not found]   ` <40381f8a-47e3-4f97-a9ad-f6f868fe0392@kernel.org>
  1 sibling, 0 replies; 7+ messages in thread
From: Mattias Nissler @ 2026-06-11 12:40 UTC (permalink / raw)
  To: Baolin Wang
  Cc: Mattias Nissler, linux-mm, Hugh Dickins,
	david@kernel.org >> David Hildenbrand (Red Hat),
	Lorenzo Stoakes (Oracle), Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko

On Thu, Jun 11, 2026 at 3:36 AM Baolin Wang
<baolin.wang@linux.alibaba.com> wrote:
>
> Add more MM core maintainers.
>
> On 6/3/26 8:55 PM, Mattias Nissler wrote:
> > I want to propose a kernel facility to have user space create memory
> > regions that can generate notifications on write access. This is useful
> > as a cross-process communication mechanism, where a producer writes to
> > the memory, and a consumer polls for new data to be available.
> >
> > I'm including a minimalistic proof-of-concept implementation meant as a
> > vehicle to demonstrate the idea and clarify semantics. It works by
> > mapping the memory region read-only, so write accesses will generate
> > page faults. The `page_mkwrite` handler can thus trigger notifications.
> > It also allows the mappings to become writable temporarily until the
> > mechanism gets rearmed.
> >
> > Intended usage looks as follows (cf. selftest code included):
> >    1.  Call `memfd_tripwire` to create an instance, `ftruncate` to
> >        configure its size.
> >    2.  Pass the file descriptor for `mmap()`ing to the producer and
> >        consumer (potentially across process boundaries).
> >    3a. The producer writes to the memory region whenever there is new
> >        data to publish to the consumer.
> >    3b. The consumer runs a poll loop:
> >          * Wait for `POLLIN` event.
> >          * `ioctl(MEMFD_TRIPWIRE_ACK)` to (1) re-arm for subsequent
> >            notifications and (2) make sure prior writes are visible.
> >          * Examine the memory region to collect and process the data
> >            provided by the producer.
> >
> > The intention is to guarantee that no change to the memory region can
> > slip through undetected by the consumer. The ACK semantics are
> > instrumental in achieving that. When the ACK returns, we assure the
> > consumer that (possibly concurrent) writes to the region are either
> > visible, or if not they will trigger a subsequent `POLLIN` event. Note
> > that there is no guarantee that each write will generate an individual
> > event. Neither are any details on the triggering write provided
> > (address, value). The consumer is expected to inspect the memory to find
> > out what has changed. The details of this depend on the communication
> > protocol producer and consumer are following.
> >
> > A direct consequence of the above semantics is that for a write to be
> > detectable by the consumer, the write must change data in the memory
> > region. Re-writing an already-present value would generally not be
> > detectable. Sequences where a location is written briefly with a changed
> > value and then restored to the previous value can't be reliably detected
> > by the consumer either. I'm calling this out as a noteworthy restriction
> > that will prevent certain usage patterns.
> >
> > I also want to call attention to an inherent race condition. Page faults
> > signal that a write is being attempted, but obviously they fire before
> > the actual write happens. Thus, write notifications are generated before
> > the write, and race with the write actually taking place. If the
> > consumer wins the race and gets their ACK through before the write
> > manages to land, the consumer could see unchanged memory, and the
> > producer would fault again. This is perfectly compatible with the
> > desired semantics, but creates the risk of a lifelock situation. In
> > practice, it is hopefully exceedingly unlikely for the consumer to win
> > the race consistently so that this can be ignored.
> >
> > Switching to motivation: Producer / consumer communication can be
> > implemented in many different ways. Pairing shared memory with a
> > notification mechanism such as `eventfd` gets the job done nicely, but
> > requires the producer to operate an `eventfd`. This can be undesirable
> > or impractical for cases where the producer is designed to talk to a
> > hardware interface in which a register write also conveys a "doorbell"
> > to trigger hardware processing. The proposed mechanism allows software
> > implementations of the consumer side that behave functionally equivalent
> > to the popular ring buffer + doorbell consumer implementation in
> > hardware. This is particularly useful in contexts where hardware is
> > simulated, for example in VFIO-user server implementations. The tripwire
> > mechanism isn't restricted to that use case though and is generic enough
> > to be useful for other purposes as well.
>
> From your earlier requirements description, my first thought was also
> the shmem + eventfd approach. It seems straightforward and avoids
> introducing a heavy new system call to reimplement functionality that
> already exists.
>
> Moreover, I haven’t fully understood what hardware constraints you’re
> referring to here.

To elaborate on this: If you are free to define/adjust the protocol
between producer and consumer, and it's not too inconvenient for
producer/consumer to exchange the eventfd, then I don't see a good
reason why you wouldn't do exactly that. However, VFIO-user models
out-of-process PCI devices, and the common case is that the consumer
exposes a simulated BAR which contains doorbell registers. The
producer side is usually driver code written to work against that
hardware interface. The mechanism I'm proposing lets the consumer
simulate doorbell registers that behave equivalently to actual hardware.
I will note that if the producer runs in a virtual machine, there's
the alternative option to have the VMM trap the write and then notify
the consumer, but that doesn't work if the producer is a plain user
space process.

Also, to make you aware, I've left Meta in the meantime (voluntarily!
:-D), and it wasn't clear whether anyone would pick this up. Thus, I
suggest we determine whether the proposal adds enough value to stand
on its own. If there is interest, I'd be willing to spend some of my
own time iterating on this. If there isn't, then I'm happy to put this
on ice for now.

>
>
> Anyway, let’s also see what others think.



* Re: [RFC PATCH 0/2] mm: memfd with write notifications
       [not found]       ` <ee858321-7407-423a-adca-caab5ad9e2b8@kernel.org>
@ 2026-06-16 11:32         ` Mattias Nissler
  2026-06-17  9:14           ` David Hildenbrand (Arm)
  0 siblings, 1 reply; 7+ messages in thread
From: Mattias Nissler @ 2026-06-16 11:32 UTC (permalink / raw)
  To: David Hildenbrand (Arm)
  Cc: Baolin Wang, Mattias Nissler, linux-mm, Hugh Dickins,
	Lorenzo Stoakes (Oracle), Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko

On Tue, Jun 16, 2026 at 10:20 AM David Hildenbrand (Arm)
<david@kernel.org> wrote:
>
> On 6/16/26 10:02, Mattias Nissler wrote:
> > On Mon, Jun 15, 2026 at 7:43 PM David Hildenbrand (Arm)
> > <david@kernel.org> wrote:
> >>
> >> On 6/11/26 03:36, Baolin Wang wrote:
> >>> Add more MM core maintainers.
> >>>
> >>
> >> FWIW, I was raising a couple of times that a userfaultfd-like mechanism on files
> >> would likely be helpful. I think we also discussed that in the context of
> >> guest_memfd.
> >>
> >> For example, for VM live-migration in QEMU with vhost-user processes, we have to
> >> play some weird games to protect the shared memory in each and every process,
> >> such that we get proper missing-faults.
> >>
> >> Background snapshots that uses uffd-wp in QEMU is not supported on shared
> >> memory, because the communication with the other processes would be a nightmare.
> >
> > Thanks for bringing up that context. So this is mostly dirty page
> > tracking? I reckon you'll need a way to learn which pages have been
> > written, rather than just a notification if any page in the region
> > gets hit?
>
> Yes, for QEMU to use it as replacement for uffd targeted at multi-process setups
> (i.e., vhost-user), I think you'd need something similar like uffd, but on the
> file level, and less uffd-like :)

This was probably a bit tongue-in-cheek, but if I may ask, what
aspects of uffd are problematic? Obviously it being process-scoped is
a major mismatch for your use case, but are there any other challenges
you have in mind?

>
> But in general, protecting/unprotecting file ranges (read-only, read-write),
> notifications on access, notifications when filling holes etc.

Do you need notifications to be synchronous (stopping the faulting
process) or is asynchronous (firing a notification and
write-unprotecting automatically) sufficient?

>
> uffd gives you that of course, but at the cost of the kernel
> > having to perform more bookkeeping. Trying to wrap my head around what
> > a good trade-off for complexity / user space API could be.
>
> Yes. For a simple doorbell mechanism (IIUC your proposal correctly), it feels
> rather odd to embed it like that in memfd.

So what you have in mind would operate at the file level, but work for
any kind of file?

Btw. thanks for bringing a different perspective to this conversation
to help explore the design space, this is exactly what I was hoping
for.

>
> I might be missing something important, though.
>
> --
> Cheers,
>
> David



* Re: [RFC PATCH 0/2] mm: memfd with write notifications
  2026-06-16 11:32         ` Mattias Nissler
@ 2026-06-17  9:14           ` David Hildenbrand (Arm)
  0 siblings, 0 replies; 7+ messages in thread
From: David Hildenbrand (Arm) @ 2026-06-17  9:14 UTC (permalink / raw)
  To: Mattias Nissler
  Cc: Baolin Wang, Mattias Nissler, linux-mm, Hugh Dickins,
	Lorenzo Stoakes (Oracle), Liam R. Howlett, Vlastimil Babka,
	Mike Rapoport, Suren Baghdasaryan, Michal Hocko

On 6/16/26 13:32, Mattias Nissler wrote:
> On Tue, Jun 16, 2026 at 10:20 AM David Hildenbrand (Arm)
> <david@kernel.org> wrote:
>>
>> On 6/16/26 10:02, Mattias Nissler wrote:
>>> On Mon, Jun 15, 2026 at 7:43 PM David Hildenbrand (Arm)
>>> <david@kernel.org> wrote:
>>>
>>> Thanks for bringing up that context. So this is mostly dirty page
>>> tracking? I reckon you'll need a way to learn which pages have been
>>> written, rather than just a notification if any page in the region
>>> gets hit?
>>
>> Yes, for QEMU to use it as replacement for uffd targeted at multi-process setups
>> (i.e., vhost-user), I think you'd need something similar like uffd, but on the
>> file level, and less uffd-like :)
> 
> This was probably a bit tongue-in-cheek, but if I may ask, what
> aspects of uffd are problematic?

Heh, there are several.

Let's ignore the implementation issues that people keep complaining about.
Two (related) problems I am aware of:

1) User-space handling

Right now you always need a user-space handler that polls for events to handle
them. For each event, you have to context-switch to user space, sometimes a
couple of times. Scalability problems (many threads faulting at the same time,
in-kernel locking) were raised in the past.

2) Blocking nature

If you look into the details of the history of UFFD_USER_MODE_ONLY +
/proc/sys/vm/unprivileged_userfaultfd, the problem is that userfaultfd can block
at various places, some of them possibly being able to hurt the kernel.

We inherently rely on user space to make progress.

Using BPF[1] could avoid both problems in some scenarios, but there are
certainly use cases where you would still have to block.

[1] https://dl.acm.org/doi/10.1145/3672197.3673432

> Obviously it being process-scoped is
> a major mismatch for your use case, but are there any other challenges
> you have in mind?

Right, that's a conceptual thing: userfaultfd protects VMA ranges, not file
ranges. To emulate protecting file ranges, you have to protect all mmap's in all
involved processes.

Obviously, things like read() or write() instead of mmap() cannot be handled by
userfaultfd.
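
To illustrate the scoping problem, here is a minimal sketch that uses plain
mprotect() as a stand-in for any VMA-scoped protection (it is not userfaultfd
itself, and the function name `protection_is_per_mapping` and the memfd name
are made up for the example). It protects one mapping of a memfd and then
modifies the file through a second mapping and through pwrite():

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if writes through a second mapping and through pwrite() reached
 * the file although the first mapping was read-only, 0 if they did not,
 * and -1 on setup failure.
 */
int protection_is_per_mapping(void)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	char *a = MAP_FAILED, *b = MAP_FAILED;
	int ret = -1;
	int fd = memfd_create("vma-scope-demo", 0);

	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)page) < 0)
		goto out;

	/* Two independent shared mappings of the same file range. */
	a = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	b = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (a == MAP_FAILED || b == MAP_FAILED)
		goto out;

	/* "Protect" the range, but only in mapping a. */
	if (mprotect(a, page, PROT_READ) < 0)
		goto out;

	b[0] = 'x';				/* write via the other mapping */
	if (pwrite(fd, "y", 1, 1) != 1)		/* write via the fd, no mapping */
		goto out;

	/* Both modifications are visible through the read-only mapping. */
	ret = (a[0] == 'x' && a[1] == 'y');
out:
	if (a != MAP_FAILED)
		munmap(a, page);
	if (b != MAP_FAILED)
		munmap(b, page);
	close(fd);
	return ret;
}
```

On a typical Linux system I would expect this to return 1: the protection
never saw either write, which is why every mapping in every involved process
has to be covered to emulate file-range protection.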

> 
>>
>> But in general, protecting/unprotecting file ranges (read-only, read-write),
>> notifications on access, notifications when filling holes etc.
> 
> Do you need notifications to be synchronous (stopping the faulting
> process) or is asynchronous (firing a notification and
> write-unprotecting automatically) sufficient?

Most use cases I am aware of need to be synchronous. Only some could be relaxed
to asynchronous handling.

With postcopy live-migration, you really have to place the page with the right
content before the faulting thread can continue running. You cannot just place
zero-filled pages. Similarly with CRIU.

With VM background snapshots, you really have to save away page content before
un-protecting and modifying the page.
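
As a rough user-space analogy for why the snapshot case must be synchronous,
the sketch below uses mprotect() plus a SIGSEGV handler instead of
userfaultfd. All names (`snapshot_demo`, `on_write_fault`, ...) are made up
for the example. It assumes a single thread, anonymous memory, and that
calling mprotect() from the handler is acceptable, which works on Linux in
practice but is not guaranteed by POSIX:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *region;	/* write-protected memory being snapshotted */
static char *snapshot;	/* receives the pre-write content of each page */
static size_t region_size, page_size;
static volatile sig_atomic_t faults_handled;

/* Runs in the faulting thread, before the faulting store can complete. */
static void on_write_fault(int sig, siginfo_t *info, void *ctx)
{
	uintptr_t addr = (uintptr_t)info->si_addr;
	uintptr_t base = (uintptr_t)region;
	size_t off;

	(void)sig;
	(void)ctx;
	if (addr < base || addr >= base + region_size) {
		signal(SIGSEGV, SIG_DFL);	/* not ours: crash on retry */
		return;
	}
	off = (addr - base) & ~(page_size - 1);
	memcpy(snapshot + off, region + off, page_size);	/* save first */
	mprotect(region + off, page_size, PROT_READ | PROT_WRITE);
	faults_handled++;
}

/* Returns 1 if the old page content was saved before the write, -1 on error. */
int snapshot_demo(void)
{
	struct sigaction sa, old;
	volatile char *w;
	int ok;

	faults_handled = 0;
	page_size = (size_t)sysconf(_SC_PAGESIZE);
	region_size = 2 * page_size;
	region = mmap(NULL, region_size, PROT_READ | PROT_WRITE,
		      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	snapshot = mmap(NULL, region_size, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (region == MAP_FAILED || snapshot == MAP_FAILED)
		return -1;
	memset(region, 'A', region_size);

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = on_write_fault;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGSEGV, &sa, &old) < 0)
		return -1;

	if (mprotect(region, region_size, PROT_READ) < 0)	/* arm */
		return -1;
	w = region;
	w[page_size + 5] = 'B';	/* faults: handler saves page 1, unprotects */
	w[page_size + 6] = 'C';	/* same page is writable now: no fault */
	sigaction(SIGSEGV, &old, NULL);

	/* Page 1 was saved with its old content; page 0 was never touched. */
	ok = faults_handled == 1 && snapshot[page_size + 5] == 'A' &&
	     region[page_size + 5] == 'B' && snapshot[0] == 0;
	munmap(region, region_size);
	munmap(snapshot, region_size);
	return ok;
}
```

The important part is the ordering in the handler: the copy happens while the
writer is still stopped. With an asynchronous notification the page would
already be writable again, and the old content could be gone by the time the
snapshotting side looks at it.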

For electric fences[2], it might be sufficient to just detect "wrong page
accessed" asynchronously. But it doesn't really work on files.

Protecting unplugged virtio-mem memory in VMs from re-access (sparse memory
regions where some parts should not be accessed by the VM) could likely get away
with asynchronous handling, but similar to electric fences, synchronous events
might be better for debugging what actually did something wrong.

I assume garbage collection similarly requires synchronous notifications (but I
would suspect that this is usually anonymous memory).

There are some upcoming use cases around working-set tracking [3]. IIRC,
asynchronous handling is usually fine there (or no notification is needed at all
and the accessed state is inspected instead).

[2] https://gitlab.com/efency/efency
[3] https://lore.kernel.org/r/20260529172716.357179-1-kas@kernel.org

> 
>>
>>> uffd gives you that of course, but at the cost of the kernel
>>> having to perform more bookkeeping. Trying to wrap my head around what
>>> a good trade-off for complexity / user space API could be.
>>
>> Yes. For a simple doorbell mechanism (IIUC your proposal correctly), it feels
>> rather odd to embed it like that in memfd.
> 
> So what you have in mind would operate at the file level, but work for
> any kind of file?

VMs with file-backed memory usually rely on shmem/hugetlb/memfd (+guest_memfd in
the near future). Other file systems are uncommon.

For CRIU, I could imagine that other file systems could be reasonable, but I
don't know enough about how they handle files.

> 
> Btw. thanks for bringing a different perspective to this conversation
> to help explore the design space, this is exactly what I was hoping
> for.

Sure! I guess my main point is: most use cases I am aware of would need
synchronous handling (and have ways to fix it up, like userfaultfd). For a
simple doorbell, this might not really be what you want.

-- 
Cheers,

David


Thread overview: 7+ messages
2026-06-03 12:55 [RFC PATCH 0/2] mm: memfd with write notifications Mattias Nissler
2026-06-03 12:55 ` [RFC PATCH 1/2] mm: `memfd_tripwire` proof-of-concept Mattias Nissler
2026-06-03 12:55 ` [RFC PATCH 2/2] selftests: `memfd_tripwire` selftest Mattias Nissler
2026-06-11  1:36 ` [RFC PATCH 0/2] mm: memfd with write notifications Baolin Wang
2026-06-11 12:40   ` Mattias Nissler
     [not found]   ` <40381f8a-47e3-4f97-a9ad-f6f868fe0392@kernel.org>
     [not found]     ` <CAERLvmQyOAvCN971uUx1PDqTXExOv-BHbNgo-oByaHavUmLgfw@mail.gmail.com>
     [not found]       ` <ee858321-7407-423a-adca-caab5ad9e2b8@kernel.org>
2026-06-16 11:32         ` Mattias Nissler
2026-06-17  9:14           ` David Hildenbrand (Arm)
