* [RFC PATCH 0/2] mm: memfd with write notifications
@ 2026-06-03 12:55 Mattias Nissler
From: Mattias Nissler @ 2026-06-03 12:55 UTC
  To: linux-mm; +Cc: Hugh Dickins, Baolin Wang, mnissler, mattias.nissler

I want to propose a kernel facility that lets user space create memory
regions that generate notifications on write access. This is useful as
a cross-process communication mechanism, where a producer writes to the
memory and a consumer polls for new data to become available.

I'm including a minimalistic proof-of-concept implementation meant as a
vehicle to demonstrate the idea and clarify semantics. It works by
mapping the memory region read-only, so write accesses will generate
page faults. The `page_mkwrite` handler can thus trigger notifications.
It also allows the mappings to become writable temporarily until the
mechanism gets rearmed.
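
The fault-driven behaviour can be imitated in user space to get a feel
for the semantics. What follows is a rough analog, assuming Linux, in
which a `SIGSEGV` handler stands in for `page_mkwrite`. The `tw_*`
names are hypothetical and none of this is the patch's code. Unlike the
proposal it only works within a single process, and it unprotects the
whole region at once, which may not match the PoC's granularity.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical names, for illustration only. */
static unsigned char *tw_base;          /* start of the watched region */
static size_t tw_len;                   /* region length in bytes */
static volatile sig_atomic_t tw_events; /* notifications raised so far */

/* Fault handler: plays the role page_mkwrite plays in the kernel PoC. */
static void tw_on_fault(int sig, siginfo_t *info, void *uctx)
{
	uintptr_t addr = (uintptr_t)info->si_addr;

	(void)sig;
	(void)uctx;
	if (addr < (uintptr_t)tw_base || addr >= (uintptr_t)tw_base + tw_len) {
		/* Unrelated fault: restore the default action and re-fault. */
		signal(SIGSEGV, SIG_DFL);
		return;
	}
	/* The notification is raised before the write has landed. */
	tw_events++;
	/*
	 * mprotect() is not on the POSIX async-signal-safe list; calling
	 * it from a handler is a common Linux idiom, used here for brevity.
	 */
	mprotect(tw_base, tw_len, PROT_READ | PROT_WRITE);
	/* Returning restarts the faulting store, which now succeeds. */
}

/* Re-arm: later writes fault (and notify) again. */
static int tw_rearm(void)
{
	return mprotect(tw_base, tw_len, PROT_READ);
}

/* Map a read-only region and install the fault handler. */
static int tw_setup(size_t len)
{
	struct sigaction sa = { 0 };
	void *p;

	assert(len > 0);
	p = mmap(NULL, len, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;
	tw_base = p;
	tw_len = len;
	tw_events = 0;
	sa.sa_sigaction = tw_on_fault;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGSEGV, &sa, NULL);
}
```

After `tw_setup()`, the first store into the region bumps `tw_events`;
further stores stay silent until `tw_rearm()` is called, mirroring the
"writable temporarily until rearmed" behaviour described above.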

Intended usage looks as follows (cf. selftest code included):
  1.  Call `memfd_tripwire` to create an instance, `ftruncate` to
      configure its size.
  2.  Pass the file descriptor for `mmap()`ing to the producer and
      consumer (potentially across process boundaries).
  3a. The producer writes to the memory region whenever there is new
      data to publish to the consumer.
  3b. The consumer runs a poll loop:
        * Wait for `POLLIN` event.
        * `ioctl(MEMFD_TRIPWIRE_ACK)` to (1) re-arm for subsequent
          notifications and (2) make sure prior writes are visible.
        * Examine the memory region to collect and process the data
          provided by the producer.
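
One iteration of step 3b can be sketched as below. This is a minimal
sketch, not the selftest code: `tw_consume_once`, the callback types
and the `demo_*` helpers are hypothetical names. The ACK is passed in
as a callback because the ioctl request number lives in the patch's
UAPI header; on a patched kernel the callback would presumably wrap the
`MEMFD_TRIPWIRE_ACK` ioctl. The demo helpers use a pipe as a stand-in
event source so the loop can be exercised on an unpatched kernel.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical callback types, not part of the proposed UAPI. */
typedef int (*tw_ack_fn)(int fd);
typedef void (*tw_scan_fn)(const volatile void *region, size_t len,
			   void *ctx);

/*
 * Wait for POLLIN, ACK, then examine the region, in that order.
 * Returns 1 after a processed event, 0 on timeout, -1 on error.
 */
static int tw_consume_once(int fd, int timeout_ms, tw_ack_fn ack,
			   tw_scan_fn scan, const volatile void *region,
			   size_t len, void *ctx)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int n;

	assert(ack && scan);
	do {
		n = poll(&pfd, 1, timeout_ms);
	} while (n < 0 && errno == EINTR);
	if (n <= 0)
		return n;
	if (!(pfd.revents & POLLIN)) {
		errno = EIO;
		return -1;
	}
	/* ACK before scanning, so a write that races with the scan
	 * triggers a fresh event instead of being missed. */
	if (ack(fd) < 0)
		return -1;
	scan(region, len, ctx);
	return 1;
}

/* Stand-in ACK for the demo: consume one byte from a pipe. */
static int demo_ack(int fd)
{
	char byte;

	return read(fd, &byte, 1) == 1 ? 0 : -1;
}

/* Stand-in scan for the demo: count invocations in *ctx. */
static void demo_scan(const volatile void *region, size_t len, void *ctx)
{
	(void)region;
	(void)len;
	++*(int *)ctx;
}
```

A real consumer would call `tw_consume_once()` in a loop with a scan
callback that implements whatever protocol the producer follows.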

The intention is to guarantee that no change to the memory region can
slip through undetected by the consumer. The ACK semantics are
instrumental in achieving that. When the ACK returns, we assure the
consumer that (possibly concurrent) writes to the region are either
visible, or if not they will trigger a subsequent `POLLIN` event. Note
that there is no guarantee that each write will generate an individual
event. Nor are any details of the triggering write (address, value)
provided. The consumer is expected to inspect the memory to find
out what has changed. The details of this depend on the communication
protocol producer and consumer are following.

A direct consequence of the above semantics is that for a write to be
detectable by the consumer, the write must change data in the memory
region. Re-writing an already-present value would generally not be
detectable. Sequences where a location is written briefly with a changed
value and then restored to the previous value can't be reliably detected
by the consumer either. I'm calling this out as a noteworthy restriction
that will prevent certain usage patterns.
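
The restriction can be illustrated with a consumer that detects changes
by comparing the region against a private shadow copy. This is only one
possible detection strategy, chosen for illustration; `region_changed`
is a hypothetical helper, and the sketch ignores concurrent writers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Compare the shared region with the consumer's shadow copy. If they
 * differ, refresh the shadow and report a change.
 */
static bool region_changed(unsigned char *shadow,
			   const unsigned char *region, size_t len)
{
	assert(shadow && region);
	if (memcmp(shadow, region, len) == 0)
		return false;
	memcpy(shadow, region, len);
	return true;
}
```

With such a consumer, storing a value that is already present, or
storing a new value and restoring the old one before the consumer
looks, both leave `region_changed()` returning false.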

I also want to call attention to an inherent race condition. Page faults
signal that a write is being attempted, but obviously they fire before
the actual write happens. Thus, write notifications are generated before
the write, and race with the write actually taking place. If the
consumer wins the race and gets their ACK through before the write
manages to land, the consumer could see unchanged memory, and the
producer would fault again. This is perfectly compatible with the
desired semantics, but creates the risk of a livelock. In practice, the
consumer is hopefully exceedingly unlikely to win the race
consistently, so this can be ignored.

Switching to motivation: Producer / consumer communication can be
implemented in many different ways. Pairing shared memory with a
notification mechanism such as `eventfd` gets the job done nicely, but
requires the producer to operate an `eventfd`. This can be undesirable
or impractical for cases where the producer is designed to talk to a
hardware interface in which a register write also conveys a "doorbell"
to trigger hardware processing. The proposed mechanism allows software
implementations of the consumer side that are functionally equivalent
to the popular ring buffer + doorbell consumer implementation in
hardware. This is particularly useful in contexts where hardware is
simulated, for example in VFIO-user server implementations. The tripwire
mechanism isn't restricted to that use case though and is generic enough
to be useful for other purposes as well.
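
For comparison, the conventional shared memory + `eventfd` pairing
looks roughly as follows. This is a minimal sketch with hypothetical
names (`struct doorbell_chan`, `chan_publish`, `chan_wait`); it shows
the extra step the producer has to perform, which a producer written
against a hardware register interface would not do.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical channel layout, for illustration only. */
struct doorbell_chan {
	volatile uint32_t *reg; /* word inside the shared mapping */
	int efd;                /* eventfd acting as the doorbell */
};

/* Producer: the store alone is silent, so the eventfd must be kicked. */
static int chan_publish(struct doorbell_chan *c, uint32_t value)
{
	assert(c && c->reg);
	*c->reg = value;
	return eventfd_write(c->efd, 1);
}

/*
 * Consumer: block until the doorbell rings, then read the register.
 * Returns the number of doorbells coalesced into this wakeup, or -1.
 */
static int64_t chan_wait(struct doorbell_chan *c, uint32_t *value)
{
	eventfd_t kicks;

	assert(c && c->reg && value);
	if (eventfd_read(c->efd, &kicks) < 0)
		return -1;
	*value = *c->reg;
	return (int64_t)kicks;
}
```

Note that two publishes before one wait are reported as a single
wakeup with a count of 2, and only the latest value is observed.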

In terms of related technology, there is some overlap with both
`userfaultfd` and KVM's `ioeventfd`. `userfaultfd` has a write protect
mode that will generate notifications upon write access. This can be
used to construct a similar notification mechanism to the one proposed.
However, the producer will be blocked until the consumer resolves the
fault and provides a writable page. The consumer will then have to give
the producer time to carry out the write before re-arming write
protection. Furthermore, `userfaultfd` is scoped by design to a process
/ `mm`, whereas the proposed tripwire mechanism makes the fault handling
and notification mechanism a property of the memory region represented
by the file descriptor. The latter is simpler to integrate with existing
IPC protocols that already exchange file descriptors (such as
VFIO-user), avoiding the need for additional setup code in the producer
to instantiate a `userfaultfd` and pass it to the consumer. Also, the
tripwire mechanism doesn't make an attempt at providing a generic fault
handling framework, which sidesteps the access control complications of
`userfaultfd` when it comes to handling faults generated in kernel
context. In contrast to `ioeventfd`, `memfd_tripwire` does provide
regular memory semantics, whereas `ioeventfd` discards written values.

There are also a number of design choices / alternatives that warrant
consideration. For simplicity, the proof-of-concept implementation
generates `POLLIN` events on the file descriptor representing the memory
region. That's somewhat unconventional given that `POLLIN` is originally
meant to indicate a file descriptor's readiness to be read, which is
always the case for memory-backed files. An alternative design might
associate a separate `eventfd` at setup time to deliver events to. If we
were to deliver events via an `eventfd`, we could possibly also fold the
ACK operation into the `read()` that clears the `eventfd`, which might
be considered a cleaner API.

Finally, I acknowledge that the proof-of-concept implementation has a
number of gaps that would need to be filled in for a production
implementation. This includes read and write `file_operations`, support
for sealing and THP, etc. These features are already implemented in
`shmem`, so a production version would likely make most sense as a new
`shmem` feature. I wanted to get high-level feedback on the concept
before starting work on that though.

Mattias Nissler (2):
  mm: `memfd_tripwire` proof-of-concept
  selftests: `memfd_tripwire` selftest

 arch/x86/entry/syscalls/syscall_64.tbl      |   1 +
 include/linux/syscalls.h                    |   1 +
 include/uapi/asm-generic/unistd.h           |   5 +-
 include/uapi/linux/memfd_tripwire.h         |  19 +
 kernel/sys_ni.c                             |   3 +
 mm/Kconfig                                  |   9 +
 mm/Makefile                                 |   1 +
 mm/memfd_tripwire.c                         | 246 +++++++
 scripts/syscall.tbl                         |   1 +
 tools/testing/selftests/mm/Makefile         |   1 +
 tools/testing/selftests/mm/memfd_tripwire.c | 695 ++++++++++++++++++++
 11 files changed, 981 insertions(+), 1 deletion(-)
 create mode 100644 include/uapi/linux/memfd_tripwire.h
 create mode 100644 mm/memfd_tripwire.c
 create mode 100644 tools/testing/selftests/mm/memfd_tripwire.c

-- 
2.52.0



Thread overview: 7+ messages
2026-06-03 12:55 [RFC PATCH 0/2] mm: memfd with write notifications Mattias Nissler
2026-06-03 12:55 ` [RFC PATCH 1/2] mm: `memfd_tripwire` proof-of-concept Mattias Nissler
2026-06-03 12:55 ` [RFC PATCH 2/2] selftests: `memfd_tripwire` selftest Mattias Nissler
2026-06-11  1:36 ` [RFC PATCH 0/2] mm: memfd with write notifications Baolin Wang
2026-06-11 12:40   ` Mattias Nissler
     [not found]   ` <40381f8a-47e3-4f97-a9ad-f6f868fe0392@kernel.org>
     [not found]     ` <CAERLvmQyOAvCN971uUx1PDqTXExOv-BHbNgo-oByaHavUmLgfw@mail.gmail.com>
     [not found]       ` <ee858321-7407-423a-adca-caab5ad9e2b8@kernel.org>
2026-06-16 11:32         ` Mattias Nissler
2026-06-17  9:14           ` David Hildenbrand (Arm)
