public inbox for rust-for-linux@vger.kernel.org
 help / color / mirror / Atom feed
* [RFC 00/16] bus1: Capability-based IPC for Linux
@ 2026-03-31 19:02 David Rheinsberg
  2026-03-31 19:02 ` [RFC 01/16] rust/sync: add LockedBy::access_mut_unchecked() David Rheinsberg
                   ` (16 more replies)
  0 siblings, 17 replies; 33+ messages in thread
From: David Rheinsberg @ 2026-03-31 19:02 UTC (permalink / raw)
  To: rust-for-linux; +Cc: teg, Miguel Ojeda, David Rheinsberg

Hi

A decade has gone by, but the ideas stayed the same. After roughly
10 years we finally got around to writing a new version of bus1, a
capability-based IPC system for Linux. The core idea is still the same
as 10 years ago, and you can find a good summary by Neil Brown on
LWN [1]. In the meantime, we have successfully introduced many of the
concepts of bus1 to Linux distributions (mostly as part of
`dbus-broker` [2]) and are more convinced than ever that we should move
forward with bus1. For this initial submission, however, I want to
focus on the Rust integration.

(Patch #7 contains a man-page with a full introduction of all concepts
 of bus1. I will refrain from copying it to this cover-letter.)

The biggest change is that we stripped everything down to the basics
and reimplemented the module in Rust. It is a delight not having to
worry about refcount ownership and object lifetimes, but this comes at
the cost of a C<->Rust bridge that poses some challenges. For now, we
settled on boxing everything that is exposed to C and calling bindgen
manually, but we would love to improve several aspects of that in the
future.

The bus1 UAPI has been stripped of all convenience helpers, fast paths,
and more complex API extensions (e.g., promise handles, oneshot
handles, unmanaged IDs, fd-passing, file-system handles,
streaming / flow-control, etc.). This does not mean that we do not want
them back, but they were not strictly necessary and can be replaced
with round trips or other workarounds. We are now left with the core of
bus1, which we believe is much easier to understand and follow.

I am sending this to rust-for-linux only, for now, because I want to
get feedback on the Rust integration first. You are very welcome to
comment on other parts, but those might change heavily depending on the
feedback on the Rust utilities. I will send the follow-ups to LKML
proper and then shift focus to the UAPI and bus management.

The series can be split into the following parts:

#1 - 4: Small extension to `./rust/kernel/`, which we make use of later
        on. These are small in size and should be easy to discuss
        independently.
#5 - 7: Add basic module scaffolding, UAPI definitions, and a man-page
        explaining the user-space API and intended behavior.
#8 -13: Rust utilities that will be used by the bus1 module, but are not
        necessarily specific to bus1. A lot of this is about intrusive
        data structures.
#14-16: Implement resource accounting and the bus1 core in Rust,
        exposing a C API. Implement a character-device based UAPI in C.

I will gladly split the series in the future, if desired. But given that
r4l generally does not merge unused code, I can carry the utilities just
fine.

If you prefer browsing this online, you can find the series on
codeberg [3] and github [4].
You can also find me on the r4l zulip (@dvdhrm).

Let me know what you think!

Thanks
David

[1] https://lwn.net/Articles/697191/
[2] https://github.com/bus1/dbus-broker
[3] https://codeberg.org/bus1/linux/src/branch/pr/20260331
[4] https://github.com/bus1/linux/tree/pr/20260331

David Rheinsberg (16):
  rust/sync: add LockedBy::access_mut_unchecked()
  rust/sync: add Arc::drop_unless_unique()
  rust/alloc: add Vec::into_boxed_slice()
  rust/error: add EXFULL, EBADRQC, EDQUOT, ENOTRECOVERABLE
  bus1: add module scaffolding
  bus1: add the user-space API
  bus1: add man-page
  bus1/util: add basic utilities
  bus1/util: add field projections
  bus1/util: add IntoDeref/FromDeref
  bus1/util: add intrusive data-type helpers
  bus1/util: add intrusive single linked lists
  bus1/util: add intrusive rb-tree
  bus1/acct: add resource accounting
  bus1: introduce peers, handles, and nodes
  bus1: implement the uapi

 Documentation/bus1/bus1.7.rst |  319 ++++++
 include/uapi/linux/bus1.h     |   82 ++
 init/Kconfig                  |   12 +
 ipc/Makefile                  |    2 +-
 ipc/bus1/Makefile             |   29 +
 ipc/bus1/acct.rs              | 1792 +++++++++++++++++++++++++++++++++
 ipc/bus1/bus.rs               | 1510 +++++++++++++++++++++++++++
 ipc/bus1/cdev.c               | 1326 ++++++++++++++++++++++++
 ipc/bus1/cdev.h               |   35 +
 ipc/bus1/lib.h                |  202 ++++
 ipc/bus1/lib.rs               |   22 +
 ipc/bus1/main.c               |   41 +
 ipc/bus1/util/convert.rs      |  259 +++++
 ipc/bus1/util/field.rs        |  359 +++++++
 ipc/bus1/util/intrusive.rs    |  397 ++++++++
 ipc/bus1/util/lll.rs          |  378 +++++++
 ipc/bus1/util/mod.rs          |   94 ++
 ipc/bus1/util/rb.rs           | 1324 ++++++++++++++++++++++++
 ipc/bus1/util/slist.rs        |  677 +++++++++++++
 rust/kernel/alloc/kvec.rs     |   67 ++
 rust/kernel/error.rs          |    4 +
 rust/kernel/sync/arc.rs       |   21 +
 rust/kernel/sync/locked_by.rs |   30 +
 rust/kernel/sync/refcount.rs  |   16 +
 24 files changed, 8997 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/bus1/bus1.7.rst
 create mode 100644 include/uapi/linux/bus1.h
 create mode 100644 ipc/bus1/Makefile
 create mode 100644 ipc/bus1/acct.rs
 create mode 100644 ipc/bus1/bus.rs
 create mode 100644 ipc/bus1/cdev.c
 create mode 100644 ipc/bus1/cdev.h
 create mode 100644 ipc/bus1/lib.h
 create mode 100644 ipc/bus1/lib.rs
 create mode 100644 ipc/bus1/main.c
 create mode 100644 ipc/bus1/util/convert.rs
 create mode 100644 ipc/bus1/util/field.rs
 create mode 100644 ipc/bus1/util/intrusive.rs
 create mode 100644 ipc/bus1/util/lll.rs
 create mode 100644 ipc/bus1/util/mod.rs
 create mode 100644 ipc/bus1/util/rb.rs
 create mode 100644 ipc/bus1/util/slist.rs

-- 
2.53.0


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [RFC 01/16] rust/sync: add LockedBy::access_mut_unchecked()
  2026-03-31 19:02 [RFC 00/16] bus1: Capability-based IPC for Linux David Rheinsberg
@ 2026-03-31 19:02 ` David Rheinsberg
  2026-03-31 19:29   ` Miguel Ojeda
  2026-03-31 19:02 ` [RFC 02/16] rust/sync: add Arc::drop_unless_unique() David Rheinsberg
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 33+ messages in thread
From: David Rheinsberg @ 2026-03-31 19:02 UTC (permalink / raw)
  To: rust-for-linux; +Cc: teg, Miguel Ojeda, David Rheinsberg

Add a new accessor to `LockedBy` which allows getting mutable access
without mutably borrowing the owning object for the lifetime of the
returned reference.

This is particularly useful when multiple objects stored under
different instances of `LockedBy`, but protected by the same lock, must
be locked at the same time. In those cases, the caller needs to retain
access to the mutable reference, so it can continue calling
`access_mut()` on the other instances.

It is now up to the caller to ensure that the provided reference is
held long enough, and that a single `LockedBy` instance is not accessed
multiple times.
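
To illustrate the borrow-checker problem this solves, here is a minimal
userspace model of `LockedBy` (hypothetical and heavily simplified: no
owner-identity check, no real lock integration, plain `std`), showing
two instances accessed under the same owner:

```rust
use std::cell::UnsafeCell;

// Minimal userspace model of `LockedBy<T>`; the owner type `U` stands in
// for the data behind the shared lock.
struct LockedBy<T> {
    data: UnsafeCell<T>,
}

impl<T> LockedBy<T> {
    fn new(data: T) -> Self {
        Self { data: UnsafeCell::new(data) }
    }

    // `access_mut()` ties the owner borrow to the returned lifetime, so a
    // second `access_mut()` on another instance cannot coexist with it.
    #[allow(dead_code)]
    fn access_mut<'a, U>(&'a self, _owner: &'a mut U) -> &'a mut T {
        unsafe { &mut *self.data.get() }
    }

    // The unchecked variant borrows the owner only for the duration of the
    // call; the caller promises the owner stays locked for 'a and that each
    // instance is accessed at most once per lifetime.
    unsafe fn access_mut_unchecked<'a, U>(&'a self, _owner: &mut U) -> &'a mut T {
        unsafe { &mut *self.data.get() }
    }
}

fn main() {
    let mut owner = 0u8; // model of the lock-protected owner data
    let a = LockedBy::new(1u32);
    let b = LockedBy::new(2u32);

    // With `access_mut()` alone, the first borrow would pin `owner` and the
    // second call could not be made. SAFETY (by analogy): `owner` stays
    // "locked" for both borrows, and `a` and `b` are each accessed once.
    let ra = unsafe { a.access_mut_unchecked(&mut owner) };
    let rb = unsafe { b.access_mut_unchecked(&mut owner) };
    *ra += *rb;
    assert_eq!(*ra, 3);
}
```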

Signed-off-by: David Rheinsberg <david@readahead.eu>
---
 rust/kernel/sync/locked_by.rs | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/rust/kernel/sync/locked_by.rs b/rust/kernel/sync/locked_by.rs
index 61f100a45b35..8aca521279b6 100644
--- a/rust/kernel/sync/locked_by.rs
+++ b/rust/kernel/sync/locked_by.rs
@@ -166,4 +166,34 @@ pub fn access_mut<'a>(&'a self, owner: &'a mut U) -> &'a mut T {
         // SAFETY: `owner` is evidence that there is only one reference to the owner.
         unsafe { &mut *self.data.get() }
     }
+
+    /// Returns a mutable reference to the protected data when the caller
+    /// provides evidence (via a mutable reference) that the owner is locked.
+    ///
+    /// Unlike [`Self::access_mut()`] this does not require the mutable
+    /// reference to be borrowed for the requested lifetime, and thus multiple
+    /// different [`LockedBy`] objects can be acquired simultaneously.
+    ///
+    /// # Panics
+    ///
+    /// Panics if `owner` is different from the data protected by the lock used
+    /// in [`new`](LockedBy::new).
+    ///
+    /// # Safety
+    ///
+    /// The caller must hold `owner` for `'a` and must not call into this
+    /// function more than once under the same lifetime `'a`.
+    #[allow(clippy::mut_from_ref)]
+    pub unsafe fn access_mut_unchecked<'a>(&'a self, owner: &mut U) -> &'a mut T {
+        build_assert!(
+            size_of::<U>() > 0,
+            "`U` cannot be a ZST because `owner` wouldn't be unique"
+        );
+        if !ptr::eq(owner, self.owner) {
+            panic!("mismatched owners");
+        }
+
+        // SAFETY: `owner` is evidence that there is only one reference to the owner.
+        unsafe { &mut *self.data.get() }
+    }
 }
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC 02/16] rust/sync: add Arc::drop_unless_unique()
  2026-03-31 19:02 [RFC 00/16] bus1: Capability-based IPC for Linux David Rheinsberg
  2026-03-31 19:02 ` [RFC 01/16] rust/sync: add LockedBy::access_mut_unchecked() David Rheinsberg
@ 2026-03-31 19:02 ` David Rheinsberg
  2026-03-31 19:02 ` [RFC 03/16] rust/alloc: add Vec::into_boxed_slice() David Rheinsberg
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: David Rheinsberg @ 2026-03-31 19:02 UTC (permalink / raw)
  To: rust-for-linux; +Cc: teg, Miguel Ojeda, David Rheinsberg

Introduce `Arc::drop_unless_unique()`, a Rust version of
`refcount_dec_not_one()`.

It is very common to store an object with `struct kref` (or
`refcount_t`) in a locked lookup-tree, but without the tree actually
owning a reference. Instead, whenever the object is destroyed, it
should be removed from the lookup tree.

The problem is, this requires taking the lock whenever releasing a
reference. Otherwise, a racing lookup might see a ref-count of 0. But
this is needlessly expensive.

One solution provided by `struct kref` is `kref_put_mutex()`. This
drops a reference, unless it is the last one. In that case, it takes a
mutex and tries again. If it is no longer the last reference, it simply
unlocks and returns. But if it did drop the last reference, it calls
the destructor with the lock held.

This entire technique is built around `refcount_dec_not_one()`. Expose
exactly this feature in Rust, so it can be used to get the same effect
as `kref_put_mutex()`.

Signed-off-by: David Rheinsberg <david@readahead.eu>
---
 rust/kernel/sync/arc.rs      | 21 +++++++++++++++++++++
 rust/kernel/sync/refcount.rs | 16 ++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/rust/kernel/sync/arc.rs b/rust/kernel/sync/arc.rs
index 921e19333b89..ead021a0dd5d 100644
--- a/rust/kernel/sync/arc.rs
+++ b/rust/kernel/sync/arc.rs
@@ -368,6 +368,27 @@ pub fn into_unique_or_drop(this: Self) -> Option<Pin<UniqueArc<T>>> {
             None
         }
     }
+
+    /// Drop this [`Arc`] unless it is the last reference.
+    ///
+    /// When this is the last reference to the `Arc`, it is returned as `Some`
+    /// unmodified. Otherwise, the `Arc` is dropped and [`None`] is returned.
+    ///
+    /// This function will never release the last reference to the object, and
+    /// as such never call its destructor.
+    pub fn drop_unless_unique(this: Self) -> Option<Self> {
+        let this = ManuallyDrop::new(this);
+
+        // SAFETY: We own a refcount, so the pointer is still valid.
+        if unsafe { this.ptr.as_ref() }.refcount.dec_not_one() {
+            // A single ref-count was dropped, but it was not the last. We
+            // manually drop `this` without invoking `Drop`.
+            None
+        } else {
+            // This was a no-op, return the reference to the caller.
+            Some(ManuallyDrop::into_inner(this))
+        }
+    }
 }
 
 // SAFETY: The pointer returned by `into_foreign` was originally allocated as an
diff --git a/rust/kernel/sync/refcount.rs b/rust/kernel/sync/refcount.rs
index 6c7ae8b05a0b..2a65b3e3f961 100644
--- a/rust/kernel/sync/refcount.rs
+++ b/rust/kernel/sync/refcount.rs
@@ -105,6 +105,22 @@ pub fn dec_and_test(&self) -> bool {
         // SAFETY: `self.as_ptr()` is valid.
         unsafe { bindings::refcount_dec_and_test(self.as_ptr()) }
     }
+
+    /// Decrement a refcount if it is not 1.
+    ///
+    /// It will `WARN` on underflow and fail to decrement when saturated.
+    ///
+    /// Provides release memory ordering when succeeding in decrementing the
+    /// refcount.
+    ///
+    /// Returns true if the decrement operation succeeded, false if the
+    /// decrement operation was skipped as the refcount is 1.
+    #[inline]
+    #[must_use = "refcount release is conditional and must be checked"]
+    pub fn dec_not_one(&self) -> bool {
+        // SAFETY: `self.as_ptr()` is valid.
+        unsafe { bindings::refcount_dec_not_one(self.as_ptr()) }
+    }
 }
 
 // SAFETY: `refcount_t` is thread-safe.
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC 03/16] rust/alloc: add Vec::into_boxed_slice()
  2026-03-31 19:02 [RFC 00/16] bus1: Capability-based IPC for Linux David Rheinsberg
  2026-03-31 19:02 ` [RFC 01/16] rust/sync: add LockedBy::access_mut_unchecked() David Rheinsberg
  2026-03-31 19:02 ` [RFC 02/16] rust/sync: add Arc::drop_unless_unique() David Rheinsberg
@ 2026-03-31 19:02 ` David Rheinsberg
  2026-03-31 19:28   ` Miguel Ojeda
                     ` (2 more replies)
  2026-03-31 19:02 ` [RFC 04/16] rust/error: add EXFULL, EBADRQC, EDQUOT, ENOTRECOVERABLE David Rheinsberg
                   ` (13 subsequent siblings)
  16 siblings, 3 replies; 33+ messages in thread
From: David Rheinsberg @ 2026-03-31 19:02 UTC (permalink / raw)
  To: rust-for-linux; +Cc: teg, Miguel Ojeda, David Rheinsberg

Add `Vec::into_boxed_slice()` similar to
`std::vec::Vec::into_boxed_slice()` [1].

There is currently no way to easily consume the allocation of a vector.
However, it is very convenient to use `Vec` to initialize a dynamically
sized array and then "seal" it, so it can be passed along as a Box:

    fn create_from(src: &[T]) -> Result<KBox<[U]>, AllocError> {
        let mut v = Vec::with_capacity(src.len(), GFP_KERNEL)?;

        for i in src {
            v.push(foo(i)?, GFP_KERNEL)?;
        }

        v.into_boxed_slice()
    }

A valid alternative is to use `Box::new_uninit()` rather than
`Vec::with_capacity()`, and eventually convert the box via
`Box::assume_init()`. This works but needlessly requires unsafe code,
awkward drop handling, etc. Using `Vec` is the much simpler solution.
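
For comparison, the `std` counterpart shows the same shrink-and-seal
behavior, minus the fallibility and allocation flags of the kernel
variant proposed here:

```rust
// std's Vec::into_boxed_slice() shrinks the allocation to the length and
// hands the buffer over as a boxed slice; the kernel version is the same
// idea, except it is fallible and allocator-aware.
fn main() {
    let mut v: Vec<u16> = Vec::with_capacity(8);
    v.extend(0..4u16);

    let s: Box<[u16]> = v.into_boxed_slice();

    // Only the initialized elements remain visible through the slice.
    assert_eq!(s.len(), 4);
    assert_eq!(&s[..], &[0, 1, 2, 3][..]);
}
```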

[1] https://doc.rust-lang.org/std/vec/struct.Vec.html#method.into_boxed_slice

Signed-off-by: David Rheinsberg <david@readahead.eu>
---
 rust/kernel/alloc/kvec.rs | 67 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 67 insertions(+)

diff --git a/rust/kernel/alloc/kvec.rs b/rust/kernel/alloc/kvec.rs
index ac8d6f763ae8..b8b0fa1a7505 100644
--- a/rust/kernel/alloc/kvec.rs
+++ b/rust/kernel/alloc/kvec.rs
@@ -733,6 +733,73 @@ pub fn retain(&mut self, mut f: impl FnMut(&mut T) -> bool) {
         }
         self.truncate(num_kept);
     }
+
+    fn shrink_to_fit(&mut self) -> Result<(), AllocError>  {
+        if Self::is_zst() {
+            // ZSTs always use maximum capacity.
+            return Ok(());
+        }
+
+        let layout = ArrayLayout::new(self.len()).map_err(|_| AllocError)?;
+
+        // SAFETY:
+        // - `ptr` is valid because it's either `None` or comes from a previous
+        //   call to `A::realloc`.
+        // - `self.layout` matches the `ArrayLayout` of the preceding
+        //   allocation.
+        let ptr = unsafe {
+            A::realloc(
+                Some(self.ptr.cast()),
+                layout.into(),
+                self.layout.into(),
+                crate::alloc::flags::GFP_NOWAIT,
+                NumaNode::NO_NODE,
+            )?
+        };
+
+        // INVARIANT:
+        // - `layout` is some `ArrayLayout::<T>`,
+        // - `ptr` has been created by `A::realloc` from `layout`.
+        self.ptr = ptr.cast();
+        self.layout = layout;
+        Ok(())
+    }
+
+    /// Converts the vector into [`Box<[T], A>`].
+    ///
+    /// Excess capacity is retained in the allocation, but lost until the box
+    /// is dropped.
+    ///
+    /// This function is fallible, because kernel allocators do not guarantee
+    /// that shrinking reallocations are infallible, yet the Rust abstractions
+    /// strictly require that layouts are correct. Hence, the caller must be
+    /// ready to deal with reallocation failures.
+    ///
+    /// # Examples
+    ///
+    /// ```
+    /// let mut v = KVec::<u16>::with_capacity(4, GFP_KERNEL)?;
+    /// for i in 0..4 {
+    ///     v.push(i, GFP_KERNEL)?;
+    /// }
+    /// let s: KBox<[u16]> = v.into_boxed_slice()?;
+    /// assert_eq!(s.len(), 4);
+    /// # Ok::<(), kernel::alloc::AllocError>(())
+    /// ```
+    pub fn into_boxed_slice(mut self) -> Result<Box<[T], A>, AllocError> {
+        self.shrink_to_fit()?;
+        let (buf, len, _cap) = self.into_raw_parts();
+        let slice = ptr::slice_from_raw_parts_mut(buf, len);
+
+        // SAFETY:
+        // - `slice` has been allocated with `A`
+        // - `slice` is suitably aligned
+        // - `slice` has an exact length of `len`
+        // - all elements within `slice` are initialized values of `T`
+        // - `len` does not exceed `isize::MAX`
+        // - `slice` was allocated for `Layout::for_value::<[T]>()`
+        Ok(unsafe { Box::from_raw(slice) })
+    }
 }
 
 impl<T: Clone, A: Allocator> Vec<T, A> {
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC 04/16] rust/error: add EXFULL, EBADRQC, EDQUOT, ENOTRECOVERABLE
  2026-03-31 19:02 [RFC 00/16] bus1: Capability-based IPC for Linux David Rheinsberg
                   ` (2 preceding siblings ...)
  2026-03-31 19:02 ` [RFC 03/16] rust/alloc: add Vec::into_boxed_slice() David Rheinsberg
@ 2026-03-31 19:02 ` David Rheinsberg
  2026-03-31 19:02 ` [RFC 05/16] bus1: add module scaffolding David Rheinsberg
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: David Rheinsberg @ 2026-03-31 19:02 UTC (permalink / raw)
  To: rust-for-linux; +Cc: teg, Miguel Ojeda, David Rheinsberg

Import these definitions from their C-equivalents.

Signed-off-by: David Rheinsberg <david@readahead.eu>
---
 rust/kernel/error.rs | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/rust/kernel/error.rs b/rust/kernel/error.rs
index 258b12afdcba..1cce6e9dbf22 100644
--- a/rust/kernel/error.rs
+++ b/rust/kernel/error.rs
@@ -66,8 +66,12 @@ macro_rules! declare_err {
     declare_err!(EPIPE, "Broken pipe.");
     declare_err!(EDOM, "Math argument out of domain of func.");
     declare_err!(ERANGE, "Math result not representable.");
+    declare_err!(EXFULL, "Exchange full.");
+    declare_err!(EBADRQC, "Invalid request code.");
     declare_err!(EOVERFLOW, "Value too large for defined data type.");
     declare_err!(ETIMEDOUT, "Connection timed out.");
+    declare_err!(EDQUOT, "Quota exceeded.");
+    declare_err!(ENOTRECOVERABLE, "State not recoverable.");
     declare_err!(ERESTARTSYS, "Restart the system call.");
     declare_err!(ERESTARTNOINTR, "System call was interrupted by a signal and will be restarted.");
     declare_err!(ERESTARTNOHAND, "Restart if no handler.");
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC 05/16] bus1: add module scaffolding
  2026-03-31 19:02 [RFC 00/16] bus1: Capability-based IPC for Linux David Rheinsberg
                   ` (3 preceding siblings ...)
  2026-03-31 19:02 ` [RFC 04/16] rust/error: add EXFULL, EBADRQC, EDQUOT, ENOTRECOVERABLE David Rheinsberg
@ 2026-03-31 19:02 ` David Rheinsberg
  2026-03-31 19:02 ` [RFC 06/16] bus1: add the user-space API David Rheinsberg
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: David Rheinsberg @ 2026-03-31 19:02 UTC (permalink / raw)
  To: rust-for-linux; +Cc: teg, Miguel Ojeda, David Rheinsberg

Add ./ipc/bus1/ as the new home of the bus1 module. `uapi/linux/bus1.h`
will provide the user-space API, but is left blank for now.

The module setup and the user-space API are written in C, with the
intent to expose a C API that other parts of the kernel can use in the
future, if desired.

The internals of the bus1 module are written in Rust. The exposed C API
is defined in `lib.h` and is only used inside of the bus1 module for
now. `bindgen` is used to generate Rust symbols for `lib.h` (and
`uapi/linux/bus1.h`), so we can avoid duplicating the symbols of the C
API in the Rust code. This way, we can rely on the C header and the
Rust implementation of the C API staying in sync.

Signed-off-by: David Rheinsberg <david@readahead.eu>
---
 include/uapi/linux/bus1.h |  5 +++++
 init/Kconfig              | 12 ++++++++++++
 ipc/Makefile              |  2 +-
 ipc/bus1/Makefile         | 28 ++++++++++++++++++++++++++++
 ipc/bus1/lib.h            | 18 ++++++++++++++++++
 ipc/bus1/lib.rs           | 18 ++++++++++++++++++
 ipc/bus1/main.c           | 19 +++++++++++++++++++
 7 files changed, 101 insertions(+), 1 deletion(-)
 create mode 100644 include/uapi/linux/bus1.h
 create mode 100644 ipc/bus1/Makefile
 create mode 100644 ipc/bus1/lib.h
 create mode 100644 ipc/bus1/lib.rs
 create mode 100644 ipc/bus1/main.c

diff --git a/include/uapi/linux/bus1.h b/include/uapi/linux/bus1.h
new file mode 100644
index 000000000000..4297e7a00ab9
--- /dev/null
+++ b/include/uapi/linux/bus1.h
@@ -0,0 +1,5 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _UAPI_LINUX_BUS1_H
+#define _UAPI_LINUX_BUS1_H
+
+#endif /* _UAPI_LINUX_BUS1_H */
diff --git a/init/Kconfig b/init/Kconfig
index 444ce811ea67..20e8a577d9a7 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -502,6 +502,18 @@ config POSIX_MQUEUE_SYSCTL
 	depends on SYSCTL
 	default y
 
+config BUS1
+	tristate "Bus1 Capability-based IPC"
+	depends on RUST
+	help
+	  The Bus1 linux subsystem provides capability-based inter-process
+	  communication. It allows services and operating system tasks to
+	  exchange messages, send notifications, and transfer data and
+	  resources, using an object-capability system for implicit access
+	  control. It is highly scalable, yet guarantees causal ordering across
+	  all its operations, including the builtin object lifetime
+	  notifications.
+
 config WATCH_QUEUE
 	bool "General notification queue"
 	default n
diff --git a/ipc/Makefile b/ipc/Makefile
index c2558c430f51..6fa8c3b1c47c 100644
--- a/ipc/Makefile
+++ b/ipc/Makefile
@@ -9,4 +9,4 @@ obj-$(CONFIG_SYSVIPC_SYSCTL) += ipc_sysctl.o
 obj-$(CONFIG_POSIX_MQUEUE) += mqueue.o msgutil.o
 obj-$(CONFIG_IPC_NS) += namespace.o
 obj-$(CONFIG_POSIX_MQUEUE_SYSCTL) += mq_sysctl.o
-
+obj-$(CONFIG_BUS1) += bus1/
diff --git a/ipc/bus1/Makefile b/ipc/bus1/Makefile
new file mode 100644
index 000000000000..1f2fbbe8603f
--- /dev/null
+++ b/ipc/bus1/Makefile
@@ -0,0 +1,28 @@
+# SPDX-License-Identifier: GPL-2.0
+
+quiet_cmd_bindgen = BINDGEN $@
+      cmd_bindgen = \
+	$(BINDGEN) $< -o $@ \
+		$(addprefix --allowlist-file=,$^) \
+		--ctypes-prefix kernel::ffi \
+		--no-debug '.*' \
+		--no-layout-tests \
+		--rust-target 1.68 \
+		--use-core \
+		-- \
+		$(c_flags) \
+		-D__BINDGEN__ \
+		-DMODULE \
+		-fno-builtin
+
+$(obj)/capi.rs: $(src)/lib.h $(srctree)/include/uapi/linux/bus1.h
+	$(call cmd,bindgen)
+
+$(obj)/lib.o: $(obj)/capi.rs
+$(obj)/lib.o: export BUS1_CAPI_PATH=$(abspath $(obj)/capi.rs)
+
+bus1-y :=		\
+	lib.o			\
+	main.o
+
+obj-$(CONFIG_BUS1) += bus1.o
diff --git a/ipc/bus1/lib.h b/ipc/bus1/lib.h
new file mode 100644
index 000000000000..e84c47f97031
--- /dev/null
+++ b/ipc/bus1/lib.h
@@ -0,0 +1,18 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef __B1_LIB_H
+#define __B1_LIB_H
+
+/**
+ * DOC: C API of the Bus1 Rust Module
+ *
+ * This header exposes the C API of the Bus1 rust module. It provides the
+ * necessary hooks for the C module code to call into the rust code.
+ */
+
+#include <linux/cleanup.h>
+#include <linux/err.h>
+#include <linux/errno.h>
+#include <linux/types.h>
+#include <uapi/linux/bus1.h>
+
+#endif /* __B1_LIB_H */
diff --git a/ipc/bus1/lib.rs b/ipc/bus1/lib.rs
new file mode 100644
index 000000000000..7c3364651638
--- /dev/null
+++ b/ipc/bus1/lib.rs
@@ -0,0 +1,18 @@
+// SPDX-License-Identifier: GPL-2.0
+//! # Kernel Bus1 Crate
+//!
+//! This is the in-kernel implementation of the Bus1 communication system in
+//! rust. Any user-space API is outside the scope of this module.
+
+#[allow(
+    dead_code,
+    missing_docs,
+    non_camel_case_types,
+    non_snake_case,
+    non_upper_case_globals,
+)]
+pub mod capi {
+    include!(env!("BUS1_CAPI_PATH"));
+}
+
+const __LOG_PREFIX: &[u8] = b"bus1\0";
diff --git a/ipc/bus1/main.c b/ipc/bus1/main.c
new file mode 100644
index 000000000000..bd6399b2ce3a
--- /dev/null
+++ b/ipc/bus1/main.c
@@ -0,0 +1,19 @@
+// SPDX-License-Identifier: GPL-2.0
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/init.h>
+#include <linux/module.h>
+#include "lib.h"
+
+static int __init b1_main_init(void)
+{
+	return 0;
+}
+
+static void __exit b1_main_deinit(void)
+{
+}
+
+module_init(b1_main_init);
+module_exit(b1_main_deinit);
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("Capability-based IPC");
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC 06/16] bus1: add the user-space API
  2026-03-31 19:02 [RFC 00/16] bus1: Capability-based IPC for Linux David Rheinsberg
                   ` (4 preceding siblings ...)
  2026-03-31 19:02 ` [RFC 05/16] bus1: add module scaffolding David Rheinsberg
@ 2026-03-31 19:02 ` David Rheinsberg
  2026-03-31 19:02 ` [RFC 07/16] bus1: add man-page David Rheinsberg
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: David Rheinsberg @ 2026-03-31 19:02 UTC (permalink / raw)
  To: rust-for-linux; +Cc: teg, Miguel Ojeda, David Rheinsberg

Add the user-space API of bus1. Everything is contained in linux/bus1.h
for now, and exposed via a character device. This allows reloading the
module at runtime during development.

In the future, a syscall-based API, as previously requested, can be
added as well. The ioctls on the character device are designed with
this in mind and can easily be translated into syscalls, as already
documented in the man-page.

The API has no inline documentation. Instead, documentation is provided
in ./Documentation/bus1/bus1.7.rst.

Signed-off-by: David Rheinsberg <david@readahead.eu>
---
 include/uapi/linux/bus1.h | 77 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 77 insertions(+)

diff --git a/include/uapi/linux/bus1.h b/include/uapi/linux/bus1.h
index 4297e7a00ab9..b4317675475e 100644
--- a/include/uapi/linux/bus1.h
+++ b/include/uapi/linux/bus1.h
@@ -2,4 +2,81 @@
 #ifndef _UAPI_LINUX_BUS1_H
 #define _UAPI_LINUX_BUS1_H
 
+#include <linux/ioctl.h>
+#include <linux/types.h>
+
+#define BUS1_IOCTL_MAGIC			(0x97)
+#define BUS1_INVALID				((__u64)-1)
+#define BUS1_MANAGED				((__u64)0x1)
+
+#define BUS1_FROM_PTR(_v) ((__u64)(void *)(_v))
+#define BUS1_TO_PTR(_v) ((void *)(__u64)(_v))
+
+#define BUS1_TRANSFER_FLAG_CREATE		(((__u64)1) << 0)
+
+struct bus1_transfer {
+	__u64 flags;
+	__u64 id;
+} __attribute__((__aligned__(8)));
+
+struct bus1_metadata {
+	__u64 flags;
+	__u64 id;
+	__u64 account;
+} __attribute__((__aligned__(8)));
+
+enum bus1_message_type: __u64 {
+	BUS1_MESSAGE_TYPE_USER			= 0,
+	BUS1_MESSAGE_TYPE_NODE_RELEASE		= 1,
+	BUS1_MESSAGE_TYPE_HANDLE_RELEASE	= 2,
+	_BUS1_MESSAGE_TYPE_N,
+};
+
+struct bus1_message {
+	__u64 flags;
+	__u64 type;
+	__u64 n_transfers;
+	__u64 ptr_transfers;
+	__u64 n_data;
+	__u64 n_data_vecs;
+	__u64 ptr_data_vecs;
+} __attribute__((__aligned__(8)));
+
+struct bus1_cmd_transfer {
+	__u64 flags;
+	__u64 to;
+	__u64 n_transfers;
+	__u64 ptr_src;
+	__u64 ptr_dst;
+} __attribute__((__aligned__(8)));
+
+struct bus1_cmd_release {
+	__u64 flags;
+	__u64 n_ids;
+	__u64 ptr_ids;
+} __attribute__((__aligned__(8)));
+
+struct bus1_cmd_send {
+	__u64 flags;
+	__u64 n_destinations;
+	__u64 ptr_destinations;
+	__u64 ptr_errors;
+	__u64 ptr_message;
+} __attribute__((__aligned__(8)));
+
+struct bus1_cmd_recv {
+	__u64 flags;
+	__u64 ptr_metadata;
+	__u64 ptr_message;
+} __attribute__((__aligned__(8)));
+
+#define BUS1_CMD_TRANSFER \
+	(_IOWR(BUS1_IOCTL_MAGIC, 0x00, struct bus1_cmd_transfer))
+#define BUS1_CMD_RELEASE \
+	(_IOWR(BUS1_IOCTL_MAGIC, 0x01, struct bus1_cmd_release))
+#define BUS1_CMD_SEND \
+	(_IOWR(BUS1_IOCTL_MAGIC, 0x02, struct bus1_cmd_send))
+#define BUS1_CMD_RECV \
+	(_IOWR(BUS1_IOCTL_MAGIC, 0x03, struct bus1_cmd_recv))
+
 #endif /* _UAPI_LINUX_BUS1_H */
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC 07/16] bus1: add man-page
  2026-03-31 19:02 [RFC 00/16] bus1: Capability-based IPC for Linux David Rheinsberg
                   ` (5 preceding siblings ...)
  2026-03-31 19:02 ` [RFC 06/16] bus1: add the user-space API David Rheinsberg
@ 2026-03-31 19:02 ` David Rheinsberg
  2026-04-01 16:30   ` Jonathan Corbet
  2026-04-04 15:30   ` Thomas Meyer
  2026-03-31 19:03 ` [RFC 08/16] bus1/util: add basic utilities David Rheinsberg
                   ` (9 subsequent siblings)
  16 siblings, 2 replies; 33+ messages in thread
From: David Rheinsberg @ 2026-03-31 19:02 UTC (permalink / raw)
  To: rust-for-linux; +Cc: teg, Miguel Ojeda, David Rheinsberg

Create an overview man-page `bus1(7)` describing the overall design of
bus1 as well as its individual commands.

The man-page can be compiled and read via:

    rst2man Documentation/bus1/bus1.7.rst bus1.7
    man ./bus1.7

Signed-off-by: David Rheinsberg <david@readahead.eu>
---
 Documentation/bus1/bus1.7.rst | 319 ++++++++++++++++++++++++++++++++++
 1 file changed, 319 insertions(+)
 create mode 100644 Documentation/bus1/bus1.7.rst

diff --git a/Documentation/bus1/bus1.7.rst b/Documentation/bus1/bus1.7.rst
new file mode 100644
index 000000000000..0e2f26fee3e2
--- /dev/null
+++ b/Documentation/bus1/bus1.7.rst
@@ -0,0 +1,319 @@
+====
+bus1
+====
+
+----------------------------------------------
+Capability-based IPC for Linux
+----------------------------------------------
+
+:Manual section: 7
+:Manual group: Miscellaneous
+
+SYNOPSIS
+========
+
+| ``#include <linux/bus1.h>``
+
+DESCRIPTION
+===========
+
+The bus1 API provides capability-based inter-process communication. Its core
+primitive is a multi-producer/single-consumer unidirectional channel that can
+transmit arbitrary user messages. The receiving end of the channel is called
+a **node**, while the sending end is called a **handle**.
+
+A handle always refers to exactly one node, but there can be many handles
+referring to the same node, and those handles can be held by independent
+owners. A message is sent via a handle, and is thus transmitted to the node
+that handle is linked to. Hence, a handle to a node is required to transmit
+messages to that node.
+
+A sender can attach copies of any handle they hold to a message, and thus
+transfer them alongside the message. The copied handles refer to the same node
+as their respective original handle.
+
+All nodes and handles have an owning **peer**. A peer is a purely local
+concept. The owning peer of a node or handle never affects the externally
+visible behavior of them. However, all nodes and handles of a single peer share
+a message queue.
+
+When the last handle to a node is released, the owning peer of the node
+receives a notification. Similarly, if a node is released, the owning peers of
+all handles referring to that node receive a notification. All notifications
+are ordered causally with any other ongoing communication.
+
+Communication on the bus happens via transactions. A transaction is an atomic
+transmission of messages, which can include release notifications. All message
+types can be part of a transaction, and thus can happen atomically with any
+other kind of message. A transaction with only a single message or notification
+is called a unicast. Any other transaction is called a multicast.
+
+Transactions are causally ordered. That is, if any transaction is a reaction to
+any previous transaction, all messages of the reaction transaction will be
+received by any peer after the messages that were part of the original
+transaction. This is even guaranteed if the causal relationship exists only via
+a side-channel outside the scope of bus1. However, messages without causal
+relationship have no stable order. This is especially noticeable with
+multicasts, where receivers might see independent multicasts in a different
+order.
+
+Operations
+----------
+
+The user-space API of bus1 has not been finalized. This section describes the
+available operations as system calls, as they would likely be exposed by any
+user-space library. However, for development reasons, all operations are
+currently performed via ioctls on a character device.
+
+Peer Creation
+^^^^^^^^^^^^^
+
+| ``int bus1_peer_new();``
+
+Peers are independent entities that can be created at will. They are accessed
+via file-descriptors, with each peer having its own file description. Multiple
+file-descriptors can refer to the same peer, yet currently every operation
+locks the peer, and thus all operations on a peer are serialized.
+
+Once the last file-descriptor referring to a peer is closed, the peer is
+released. Any resources of that peer are released, and any ongoing transactions
+targeting the peer will discard their messages.
+
+File descriptions pin the credentials of the calling process. A peer will use
+those pinned credentials for resource accounting. Otherwise, no ambient
+resources are used by bus1.
+
+Transfer Command
+^^^^^^^^^^^^^^^^
+
+| ``#define BUS1_TRANSFER_FLAG_CREATE 0x1``
+|
+| ``struct bus1_transfer {``
+|         ``uint64_t flags;``
+|         ``uint64_t id;``
+| ``};``
+|
+| ``int bus1_cmd_transfer(``
+|         ``uint64_t flags,``
+|         ``int from,``
+|         ``int to,``
+|         ``size_t n,``
+|         ``struct bus1_transfer *src,``
+|         ``struct bus1_transfer *dst``
+| ``);``
+
+A transfer command can be used for two different operations. First, it can be
+used to create nodes and handles on a peer. Second, it can be used to transfer
+a handle from one peer to another, while holding file-descriptors to both
+peers.
+
+The command takes ``flags``, which is currently unused and must be 0. ``from``
+and ``to`` are file-descriptors referring to the involved peers. ``from`` must
+be provided, while ``to`` can be ``-1``, in which case it refers to the
+same peer as ``from``.
+
+``n`` defines the number of transfer operations that are performed atomically.
+``src`` and ``dst`` must refer to arrays with ``n`` elements. ``dst`` can be
+uninitialized, and will be filled in by the kernel. ``src`` must be initialized
+by the caller. ``src[i].flags`` must be 0 or ``BUS1_TRANSFER_FLAG_CREATE``.
+If ``BUS1_TRANSFER_FLAG_CREATE`` is cleared, ``src[i].id`` must refer to the
+ID of a handle in ``from``. If it is set, ``src[i].id`` must be set to
+``BUS1_INVALID``; in this case a new node is created, and its ID
+is returned in ``src[i].id`` with ``src[i].flags`` cleared to 0.
+
+In any case, a new handle in ``to`` is created for every provided transfer. Its
+ID is returned in ``dst[i].id`` and ``dst[i].flags`` is set to 0.
+
+Note that both arrays ``src`` and ``dst`` can be partially modified by the
+kernel even if the operation fails, including failures with an error other
+than ``EFAULT``.
+
+Release Command
+^^^^^^^^^^^^^^^
+
+| ``int bus1_cmd_release(``
+|         ``int peerfd,``
+|         ``size_t n_ids,``
+|         ``uint64_t *ids``
+| ``);``
+
+A release command takes a peer file-descriptor as ``peerfd`` and an array of
+node and handle IDs as ``ids`` with ``n_ids`` number of elements. All these
+nodes and handles will be released in a single atomic transaction.
+
+The command does not fail, except if invalid arguments are provided.
+
+No subsequent operation on this peer will refer to the IDs once this call
+returns. Furthermore, those IDs will never be reused.
+
+Send Command
+^^^^^^^^^^^^
+
+| ``enum bus1_message_type: uint64_t {``
+|         ``BUS1_MESSAGE_TYPE_USER = 0,``
+|         ``BUS1_MESSAGE_TYPE_NODE_RELEASE = 1,``
+|         ``BUS1_MESSAGE_TYPE_HANDLE_RELEASE = 2,``
+|         ``_BUS1_MESSAGE_TYPE_N,``
+| ``};``
+|
+| ``struct bus1_message {``
+|         ``uint64_t flags;``
+|         ``uint64_t type;``
+|         ``uint64_t n_transfers;   // size_t n_transfers``
+|         ``uint64_t ptr_transfers; // struct bus1_transfer *transfers;``
+|         ``uint64_t n_data;        // size_t n_data;``
+|         ``uint64_t n_data_vecs;   // size_t n_data_vecs;``
+|         ``uint64_t ptr_data_vecs; // struct iovec *data_vecs;``
+| ``};``
+|
+| ``int bus1_cmd_send(``
+|         ``int peerfd,``
+|         ``size_t n_destinations,``
+|         ``uint64_t *destinations,``
+|         ``int32_t *errors,``
+|         ``struct bus1_message *message``
+| ``);``
+
+The send command takes a peer file-descriptor as ``peerfd``, the message to
+send as ``message``, and an array of destination handles as ``destinations``
+(with ``n_destinations`` number of elements).
+
+Additionally, ``errors`` is used to return the individual error code for each
+destination. This is only done if the send command returns success. Since
+partial failure is currently not exposed, ``errors[i]`` is always set to 0
+on success.
+
+All destination IDs must refer to a valid handle of the calling peer.
+``EBADRQC`` is returned if an ID does not refer to a handle. Currently, only
+a single message can be provided with a single send command, and this message
+is transmitted to all destinations in a single atomic transaction.
+
+The message to be transmitted is provided as ``message``. This structure
+describes the payload of the message. ``message.flags`` must be 0.
+``message.type`` must be ``BUS1_MESSAGE_TYPE_USER``. ``message.n_transfers``
+and ``message.ptr_transfers`` refer to an array of ``struct bus1_transfer``
+and describe handles to be transferred with the message. The transfers are
+used the same way as in ``bus1_cmd_transfer(2)``, but note that
+``BUS1_TRANSFER_FLAG_CREATE`` is currently not refused.
+
+``message.n_data_vecs`` and ``message.ptr_data_vecs`` provide the iovecs with
+the data to be transmitted with the message. Only the first ``message.n_data``
+bytes of the iovecs are considered part of the message. Any trailing bytes
+are ignored. The data is copied into kernel buffers and the iovecs are no
+longer accessed once the command returns.
+
+Recv Command
+^^^^^^^^^^^^
+
+| ``struct bus1_metadata {``
+|         ``uint64_t flags;``
+|         ``uint64_t id;``
+|         ``uint64_t account;``
+| ``};``
+|
+| ``int bus1_cmd_recv(``
+|         ``int peerfd,``
+|         ``struct bus1_metadata *metadata,``
+|         ``struct bus1_message *message``
+| ``);``
+
+The recv command takes a peer file-descriptor as ``peerfd`` and fetches the
+next message from its queue. If no message is queued, ``EAGAIN`` is returned.
+
+The message is returned in ``message``. The caller must set ``message.flags``
+to 0 and ``message.type`` to ``BUS1_INVALID``. ``message.n_transfers`` and
+``message.ptr_transfers`` refer to an array of ``struct bus1_transfer``
+structures used to return the transferred handles of the next message. Upon
+return, ``message.n_transfers`` is updated to the number of handles actually
+transferred, and each entry of the transfer array is updated as described in
+``bus1_cmd_transfer(2)``.
+
+``message.n_data``, ``message.n_data_vecs``, and ``message.ptr_data_vecs``
+must be initialized by the caller and provide the space to store the data of
+the next message. The iovecs are never modified by the operation.
+
+If the message would exceed ``message.n_transfers`` or ``message.n_data``,
+``EMSGSIZE`` is returned and the fields are updated accordingly.
+
+Upon success, ``message`` is updated with data of the received message, with
+transferred handles and data written to the transfer array and iovecs.
+
+``metadata`` is updated to contain more data about the message.
+``metadata.flags`` is unused and set to 0. ``metadata.id`` contains the ID
+of the node the message was received on (or the ID of the handle in case of
+``BUS1_MESSAGE_TYPE_NODE_RELEASE``). ``metadata.account`` contains the ID
+of the resource context of the sender.
+
+Errors
+------
+
+All operations follow a strict error reporting model. If an operation has a
+documented error case, then this will be indicated to user-space with a
+negative return value (or ``errno``, respectively). Whenever an error occurs,
+the operation will have been cancelled entirely and have no observable effect
+on the bus. User-space can safely assume the system to be in the same state as
+if the operation had not been invoked, unless explicitly documented otherwise.
+
+One major exception is ``EFAULT``. The ``EFAULT`` error code is returned
+whenever user-space supplies malformed pointers to the kernel, and the kernel
+is unable to fetch information from, or return information to, user-space.
+This indicates a misbehaving client, and usually there is no way to recover
+from this, unless user-space intentionally triggered this behavior. User-space
+should treat ``EFAULT`` as an assertion failure and not try to recover. If the
+bus1 API is used in a correct manner, ``EFAULT`` will never be returned by any
+operation.
+
+Resource Accounting
+-------------------
+
+Every peer has an associated resource context used to account claimed
+resources. This resource context is determined at the time the peer is created
+and it will never change over its lifetime. The default, and currently only,
+accounting model is based on UNIX ``UIDs``. That is, each peer is assigned
+the resource context of the effective ``UID`` of the process that creates it.
+From then on, any resource consumption of the peer is accounted on this
+resource context, and thus shared with all other peers of the same ``UID``.
+
+All allocations have upper limits which cannot be exceeded. An operation will
+return ``EDQUOT`` if the quota limits prevent an operation from being
+performed. User-space is expected to treat this as an administration or
+configuration error, since there is generally no meaningful way to recover.
+Applications should expect to be spawned with suitable resource limits
+pre-configured. However, this is not enforced and user-space is free to react
+to ``EDQUOT`` as it wishes.
+
+Unlike all other bus properties, resource accounting is not part of the bus
+atomicity and ordering guarantees, nor does it implement strict rollback. This
+means that if an operation allocates multiple resources, the resource counters
+are updated before the operation takes effect on the bus. Hence, the resource
+counter modifications are visible to the system before the operation itself is.
+Furthermore, while any failing operation will correctly revert any temporary
+resource allocations, the allocations will have been visible to the system
+for the time of this (failed) operation. Therefore, even a failed operation
+can have (temporary) visible side-effects. But similar to the atomicity
+guarantees, these do not affect any other bus properties, but only the resource
+accounting.
+
+However, note that monitoring of bus accounting is not considered a
+programmatic interface, nor are any explicit accounting APIs exposed. Thus, the
+only visible effect of resource accounting is getting ``EDQUOT`` if a counter
+is exceeded.
+
+In addition to standard resource accounting, a peer can also allocate remote
+resources. This happens whenever a transaction transmits resources from
+a sender to a receiver. All such transactions are always accounted on the
+receiver at the time of *send*. To prevent senders from exhausting resources
+of a receiver, a peer only ever gets access to a subset of the resources of any
+other resource-context that does not match its own.
+
+The exact quotas are calculated at runtime and dynamically adapt to the
+number of different users that currently participate. The ideal is a fair
+linear distribution of the available resources, and the algorithm guarantees
+a quasi-linear distribution. Yet, the details are implementation specific and
+can change over time.
+
+Additionally, a second layer of resource accounting separates peers of the
+same resource context. This is done to prevent a malfunctioning peer from
+exhausting all resources of its resource context, and thus affecting other
+peers with the same resource context. This uses a much less strict quota
+system, since it does not span security domains.
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC 08/16] bus1/util: add basic utilities
  2026-03-31 19:02 [RFC 00/16] bus1: Capability-based IPC for Linux David Rheinsberg
                   ` (6 preceding siblings ...)
  2026-03-31 19:02 ` [RFC 07/16] bus1: add man-page David Rheinsberg
@ 2026-03-31 19:03 ` David Rheinsberg
  2026-03-31 19:35   ` Miguel Ojeda
  2026-03-31 19:03 ` [RFC 09/16] bus1/util: add field projections David Rheinsberg
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 33+ messages in thread
From: David Rheinsberg @ 2026-03-31 19:03 UTC (permalink / raw)
  To: rust-for-linux; +Cc: teg, Miguel Ojeda, David Rheinsberg

Import some basic utility helpers. They come with documentation and
should be self-explanatory.

Some helpers will become obsolete, once the MSRV is bumped. This is
noted in the documentation.

Signed-off-by: David Rheinsberg <david@readahead.eu>
---
 ipc/bus1/lib.rs      |  2 +
 ipc/bus1/util/mod.rs | 87 ++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 89 insertions(+)
 create mode 100644 ipc/bus1/util/mod.rs

diff --git a/ipc/bus1/lib.rs b/ipc/bus1/lib.rs
index 7c3364651638..a7e7a99086c2 100644
--- a/ipc/bus1/lib.rs
+++ b/ipc/bus1/lib.rs
@@ -4,6 +4,8 @@
 //! This is the in-kernel implementation of the Bus1 communication system in
 //! rust. Any user-space API is outside the scope of this module.
 
+pub mod util;
+
 #[allow(
     dead_code,
     missing_docs,
diff --git a/ipc/bus1/util/mod.rs b/ipc/bus1/util/mod.rs
new file mode 100644
index 000000000000..4dd08c04eec6
--- /dev/null
+++ b/ipc/bus1/util/mod.rs
@@ -0,0 +1,87 @@
+// SPDX-License-Identifier: GPL-2.0
+//! # Utility library
+//!
+//! This module provides utilities that can be used independently of the core
+//! module.
+
+use kernel::prelude::*;
+use kernel::sync::{Arc, ArcBorrow};
+
+/// Convert an Arc to its pinned version.
+///
+/// All [`Arc`] instances are unconditionally pinned. It is always safe to
+/// convert from their unpinned variant to their pinned variant.
+///
+/// Most kernel APIs just use a plain [`Arc`], even if they rely on pinning.
+/// If another API needs a `Pin<T>`, this converter can provide it for
+/// [`Arc`], even though kernel APIs themselves rarely use `Pin<Arc<T>>`.
+pub fn arc_pin<T>(v: Arc<T>) -> Pin<Arc<T>> {
+    // SAFETY: `Arc<T>` guarantees its target is pinned.
+    unsafe { Pin::new_unchecked(v) }
+}
+
+/// Convert an Arc to its unpinned version.
+///
+/// All [`Arc`] instances are unconditionally pinned. It is always safe to
+/// convert from their pinned variant to their unpinned variant.
+///
+/// Most kernel APIs just use a plain [`Arc`], even if they rely on pinning.
+/// This converter allows getting an [`Arc`] if some other API returned a
+/// generic `Pin<T>` with `T = Arc<U>`.
+pub fn arc_unpin<T>(v: Pin<Arc<T>>) -> Arc<T> {
+    // SAFETY: `Arc<T>` guarantees its target is pinned, even if not wrapped.
+    unsafe { Pin::into_inner_unchecked(v) }
+}
+
+/// Convert an ArcBorrow to its pinned version.
+///
+/// All [`Arc`] instances are unconditionally pinned. It is always safe to
+/// convert from their unpinned variant to their pinned variant.
+pub fn arc_borrow_pin<T>(v: ArcBorrow<'_, T>) -> Pin<ArcBorrow<'_, T>> {
+    // SAFETY: `Arc<T>` guarantees its target is pinned.
+    unsafe { Pin::new_unchecked(v) }
+}
+
+/// Create a [`NonNull`] from a reference.
+///
+/// This is a backport of [`core::ptr::NonNull::from_ref()`].
+pub fn nonnull_from_ref<T: ?Sized>(v: &T) -> core::ptr::NonNull<T> {
+    // SAFETY: A reference cannot be NULL.
+    unsafe { core::ptr::NonNull::new_unchecked(core::ptr::from_ref(v).cast_mut()) }
+}
+
+/// Create a [`NonNull`] from a mutable reference.
+///
+/// This is a backport of [`core::ptr::NonNull::from_mut()`].
+pub fn nonnull_from_mut<T: ?Sized>(v: &mut T) -> core::ptr::NonNull<T> {
+    // SAFETY: A reference cannot be NULL.
+    unsafe { core::ptr::NonNull::new_unchecked(v) }
+}
+
+/// Return the memory address part of a pointer without exposing provenance.
+///
+/// This returns the same value as an `as usize` cast. However, this function
+/// is meant to not expose provenance, and rather behave like
+/// `<*mut T>::addr()`. Unfortunately, the latter requires an MSRV of 1.84,
+/// which is not yet available upstream. Until then, this serves as a
+/// replacement.
+pub fn ptr_addr<T: ?Sized>(v: *const T) -> usize {
+    // Simply expose the provenance for now. A transmute would avoid the
+    // exposure, but is not a stable API.
+    v.cast::<()>() as usize
+}
+
+/// Compare two pointers.
+///
+/// This is equivalent to `<*const T as Ord>::cmp()`. Unlike the trait-based
+/// solution, this has fixed pointer types and thus can be called with
+/// references, which are then coerced to pointers.
+///
+/// This serves the same purpose as `core::ptr::eq()`, but for `Ord` rather
+/// than `Eq`.
+pub fn ptr_cmp<T: ?Sized>(a: *const T, b: *const T) -> core::cmp::Ordering {
+    // Even though `PartialOrd for *mut T` documents that it uses
+    // `<*mut T>::addr()` for comparisons, clippy still warns about it. Cast
+    // to `()` to ensure metadata is ignored.
+    a.cast::<()>().cmp(&b.cast::<()>())
+}
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC 09/16] bus1/util: add field projections
  2026-03-31 19:02 [RFC 00/16] bus1: Capability-based IPC for Linux David Rheinsberg
                   ` (7 preceding siblings ...)
  2026-03-31 19:03 ` [RFC 08/16] bus1/util: add basic utilities David Rheinsberg
@ 2026-03-31 19:03 ` David Rheinsberg
  2026-03-31 19:38   ` Miguel Ojeda
  2026-03-31 19:03 ` [RFC 10/16] bus1/util: add IntoDeref/FromDeref David Rheinsberg
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 33+ messages in thread
From: David Rheinsberg @ 2026-03-31 19:03 UTC (permalink / raw)
  To: rust-for-linux; +Cc: teg, Miguel Ojeda, David Rheinsberg

Introduce a utility module that provides field-projections for stable
Rust. The module is designed very similarly to the official Rust field
projections (which are still unstable), but without any requirement for
compiler support.

The module explicitly uses names similar to the ones from the official
field projections, and is certainly meant to be replaced once those
become stable or are otherwise introduced into the kernel.

However, until then, this module is small and simple enough to allow
very convenient intrusive collections, and thus is included here.

Signed-off-by: David Rheinsberg <david@readahead.eu>
---
 ipc/bus1/util/field.rs | 359 +++++++++++++++++++++++++++++++++++++++++
 ipc/bus1/util/mod.rs   |   2 +
 2 files changed, 361 insertions(+)
 create mode 100644 ipc/bus1/util/field.rs

diff --git a/ipc/bus1/util/field.rs b/ipc/bus1/util/field.rs
new file mode 100644
index 000000000000..e0226726278c
--- /dev/null
+++ b/ipc/bus1/util/field.rs
@@ -0,0 +1,359 @@
+//! # Field Projections
+//!
+//! This module allows generalizing over the fields of a structure. At its
+//! core is the [`Field`] trait, allowing limited type reflection.
+//!
+//! This trait is just enough to get intrusive collections working. For
+//! generalized versions of this, see
+//! [field projections](https://github.com/rust-lang/rust/issues/145383).
+
+use core::ptr::NonNull;
+use kernel::prelude::*;
+
+/// Authoritative information about a field of another type.
+///
+/// This trait asserts that [`Self::Base`] has a member field of type
+/// [`Self::Type`] at byte offset [`Self::OFFSET`]. This information is
+/// authoritative. As such, implementing this trait on *any* type must be
+/// subject to this condition.
+///
+/// All subtypes of an implementation always carry the same trait
+/// implementation. That is, an implementing type cannot be coerced into
+/// another type with a deviating implementation.
+///
+/// Commonly, this trait is automatically implemented on Field Representing
+/// Types (FRTs) by the compiler, or manually via [`impl_field`].
+///
+/// # Unsized Types
+///
+/// All involved types currently must be `Sized`. In particular, the
+/// implementing type `Self`, `Field::Base`, and `Field::Type` must be `Sized`.
+///
+/// The trait could allow unsized types, but the helpers that convert from base
+/// to field pointer cannot calculate pointer metadata without external
+/// input. Unless there is a solid design to pass around metadata, this is left
+/// for a future extension.
+///
+/// ## Safety
+///
+/// Implementing this trait is only safe if, for a given valid value of type
+/// [`Self::Base`], there exists a valid value of type [`Self::Type`] at byte
+/// offset `OFFSET`, and this value is represented by a direct member field
+/// on [`Self::Base`].
+///
+/// Any subtypes of the implementing type must have an equal trait
+/// implementation. Type coercion must never lead to a deviating trait
+/// implementation.
+pub unsafe trait Field: Send + Sync + Copy {
+    /// Base containing type this field exists in.
+    type Base;
+
+    /// Type of the field.
+    type Type;
+
+    /// Offset of this field in bytes relative to the start of the base type.
+    const OFFSET: usize;
+}
+
+/// Authoritative information about a structurally pinned field.
+///
+/// This trait is an extension of [`Field`] and guarantees that the field is
+/// [structurally pinned](https://rust.docs.kernel.org/core/pin/index.html#projections-and-structural-pinning).
+///
+/// ## Safety
+///
+/// The implementation must guarantee that the field is structurally pinned.
+pub unsafe trait PinField: Field {
+}
+
+/// Reflection metadata about a field of a base type.
+///
+/// This type is used as implementing type for generated [`Field`]
+/// implementations. It is a 1-ZST and used only to represent reflection
+/// metadata about a field of a type.
+///
+/// See [`impl_field`] for its main user.
+///
+/// This type is invariant over all its type parameters.
+///
+/// ## Limitations
+///
+/// If multiple zero-sized member fields share the same offset, only a single
+/// one can be represented with this type. The compiler generated alternative
+/// in the standard library can circumvent this limitation. Without compiler
+/// support, auto-generation of such types requires other external enumerations
+/// that make usage needlessly complex. Hence, this uses the field offset
+/// as distinguisher, and thus limits the implementation.
+#[repr(C, packed)]
+pub struct FieldRepr<Base: ?Sized, Type: ?Sized, const OFFSET: usize> {
+    _base: [*mut Base; 0],
+    _type: [*mut Type; 0],
+    _offset: [(); OFFSET],
+}
+
+// SAFETY: `FieldRepr` doesn't contain any values. No subtypes exist.
+unsafe impl<Base: ?Sized, Type: ?Sized, const OFFSET: usize> Send
+for FieldRepr<Base, Type, OFFSET> {
+}
+
+// SAFETY: `FieldRepr` doesn't contain any values. No subtypes exist.
+unsafe impl<Base: ?Sized, Type: ?Sized, const OFFSET: usize> Sync
+for FieldRepr<Base, Type, OFFSET> {
+}
+
+impl<Base: ?Sized, Type: ?Sized, const OFFSET: usize> Clone
+for FieldRepr<Base, Type, OFFSET> {
+    fn clone(&self) -> Self {
+        *self
+    }
+}
+
+impl<Base: ?Sized, Type: ?Sized, const OFFSET: usize> Copy
+for FieldRepr<Base, Type, OFFSET> {
+}
+
+/// Turn a base pointer into a member field pointer.
+///
+/// This is equivalent to taking a raw pointer to a member field
+/// `&raw mut (*v).field`. Note that the base is not dereferenced for this
+/// operation.
+///
+/// ## Safety
+///
+/// The pointer `v` must point to an allocation of `Frt::Base`, but that value
+/// can be uninitialized.
+pub unsafe fn field_of_ptr<Frt: Field>(v: *mut Frt::Base) -> *mut Frt::Type {
+    // SAFETY: Validity of the allocation behind `v` is delegated to the
+    //     caller. The offset calculation is guaranteed by the `Field` trait.
+    unsafe { v.byte_offset(Frt::OFFSET as isize).cast() }
+}
+
+/// Turn a base pointer into a member field pointer.
+///
+/// Works like [`field_of_ptr()`] but on [`NonNull`].
+///
+/// ## Safety
+///
+/// The pointer `v` must point to an allocation of `Frt::Base`, but that value
+/// can be uninitialized.
+pub unsafe fn field_of_nn<Frt: Field>(v: NonNull<Frt::Base>) -> NonNull<Frt::Type> {
+    // SAFETY: Validity of the allocation behind `v` is delegated to the
+    //     caller. The offset calculation is guaranteed by the `Field` trait.
+    unsafe { v.byte_offset(Frt::OFFSET as isize).cast() }
+}
+
+/// Turn a field pointer into a base pointer.
+///
+/// This is the inverse of [`field_of_ptr()`]. It recreates the base pointer
+/// from the member field pointer.
+///
+/// ## Miri Stacked & Tree Borrows
+///
+/// If you require compatibility with Stacked Borrows as used in Miri, you must
+/// ensure that the field pointer was created from a reference to the base,
+/// rather than from a reference to the field. In other words, make sure that
+/// you use [`field_of_ptr()`] and then retain that raw field pointer until you
+/// need it for [`base_of_ptr()`]. Otherwise, your code will likely not be
+/// compatible with Stacked Borrows.
+///
+/// If you only require compatibility with Tree Borrows, this is not an issue.
+///
+/// ## Safety
+///
+/// The pointer `v` must point into an allocation of `Frt::Base` at the offset
+/// of the member field described by `Field`, but the value can be uninitialized.
+pub unsafe fn base_of_ptr<Frt: Field>(v: *mut Frt::Type) -> *mut Frt::Base {
+    // SAFETY: Validity of the allocation behind `v` is delegated to the
+    //     caller. The offset calculation is guaranteed by the `Field` trait.
+    unsafe { v.byte_offset(-(Frt::OFFSET as isize)).cast() }
+}
+
+/// Turn a field pointer into a base pointer.
+///
+/// Works like [`base_of_ptr()`] but on [`NonNull`].
+///
+/// ## Safety
+///
+/// The pointer `v` must point into an allocation of `Frt::Base` at the offset
+/// of the member field described by `Field`, but the value can be uninitialized.
+pub unsafe fn base_of_nn<Frt: Field>(v: NonNull<Frt::Type>) -> NonNull<Frt::Base> {
+    // SAFETY: Validity of the allocation behind `v` is delegated to the
+    //     caller. The offset calculation is guaranteed by the `Field` trait.
+    unsafe { v.byte_offset(-(Frt::OFFSET as isize)).cast() }
+}
+
+#[doc(hidden)]
+#[macro_export]
+macro_rules! util_field_frt {
+    ($base:ty, $field:ident, $type:ty $(,)?) => {
+        $crate::util::field::FieldRepr<
+            $base,
+            $type,
+            { ::core::mem::offset_of!($base, $field) },
+        >
+    }
+}
+
+#[doc(hidden)]
+#[macro_export]
+macro_rules! util_field_impl_field {
+    ($base:ty, $field:ident, $type:ty $(,)?) => {
+        // SAFETY: `FieldRepr` exposes no variance. `$field` is verified to be
+        //     a member of `$base` via `offset_of!()`, and correctness of its
+        //     type is verified apart from coercions (which we accept).
+        unsafe impl $crate::util::field::Field
+        for $crate::util::field::FieldRepr<
+            $base,
+            $type,
+            { ::core::mem::offset_of!($base, $field) },
+        > {
+            type Base = $base;
+            type Type = $type;
+            const OFFSET: usize = const {
+                // Verify the type of the member field.
+                let mut v = ::core::mem::MaybeUninit::<Self::Base>::uninit();
+                let v_ptr = core::ptr::from_mut(&mut v).cast::<Self::Base>();
+                // SAFETY: `v` is a valid allocation, a field access is safe.
+                let _: *mut Self::Type = unsafe {
+                    &raw mut ((*v_ptr).$field)
+                };
+                ::core::mem::offset_of!(Self::Base, $field)
+            };
+        }
+    }
+}
+
+#[doc(hidden)]
+#[macro_export]
+macro_rules! util_field_impl_pin_field {
+    ($base:ty, $field:ident, $type:ty $(,)?) => {
+        $crate::util::field::impl_field!($base, $field, $type);
+        // SAFETY: Structural pinning of `$field` is guaranteed by the caller.
+        unsafe impl $crate::util::field::PinField
+        for $crate::util::field::FieldRepr<
+            $base,
+            $type,
+            { ::core::mem::offset_of!($base, $field) },
+        > {
+        }
+    }
+}
+
+#[doc(hidden)]
+#[macro_export]
+macro_rules! util_field_field_of {
+    ($base:ty, $field:ident $(,)?) => {
+        $crate::util::field::frt!{$base, $field, _}
+    }
+}
+
+#[doc(hidden)]
+#[macro_export]
+macro_rules! util_field_typed_field_of {
+    ($base:ty, $field:ident, $type:ty $(,)?) => {
+        $crate::util::field::frt!{$base, $field, $type}
+    }
+}
+
+/// Resolve to the field-representing-type (FRT).
+///
+/// This takes as arguments:
+/// - $base:ty
+/// - $field:ident
+/// - $type:ty
+///
+/// This resolves to `FieldRepr<$base, $type, ...>` with the last generic
+/// parameter set to the offset of `$field` in `$base`.
+#[doc(inline)]
+pub use util_field_frt as frt;
+
+/// Implement [`Field`] for a specific member field.
+///
+/// This takes as arguments:
+/// - $base:ty
+/// - $field:ident
+/// - $type:ty
+///
+/// This implements [`Field`] on [`FieldRepr`] with the given base type, member
+/// field name, and member field type.
+///
+/// ## Safety
+///
+/// The caller must guarantee that `$type` matches the type of the member 
+/// field `$field`. This is verified by this macro, except for possible
+/// coercions.
+#[doc(inline)]
+pub use util_field_impl_field as impl_field;
+
+/// Implement [`PinField`] for a structurally pinned member field.
+///
+/// This works like [`impl_field!`] but implements [`PinField`] on top
+/// of [`Field`].
+///
+/// ## Safety
+///
+/// The safety requirements of [`impl_field!`] apply. On top, the caller
+/// must guarantee the field in question is structurally pinned.
+#[doc(inline)]
+pub use util_field_impl_pin_field as impl_pin_field;
+
+/// Resolve to the [`FieldRepr`] of a specific member field.
+///
+/// This takes as arguments:
+/// - $base:ty
+/// - $field:ident
+///
+/// This resolves to a specific type of [`FieldRepr`] for the specified member
+/// field. This lets the compiler auto-derive the type of the field. In
+/// situations where an auto-derive is not allowed (e.g., function signatures)
+/// use [`typed_field_of!`].
+#[doc(inline)]
+pub use util_field_field_of as field_of;
+
+/// Resolve to the typed [`FieldRepr`] of a specific member field.
+///
+/// This takes as arguments:
+/// - $base:ty
+/// - $field:ident
+/// - $type:ty
+///
+/// This resolves to a specific type of [`FieldRepr`] for the specified member
+/// field.
+#[doc(inline)]
+pub use util_field_typed_field_of as typed_field_of;
+
+#[allow(clippy::undocumented_unsafe_blocks)]
+#[kunit_tests(bus1_util_field)]
+mod test {
+    use super::*;
+
+    #[derive(Clone, Copy, Debug, PartialEq)]
+    #[repr(C, align(4))]
+    struct Test {
+        a: u16,
+        b: u8,
+        c: u32,
+    }
+
+    impl_field!(Test, a, u16);
+    impl_field!(Test, b, u8);
+    impl_pin_field!(Test, c, u32);
+
+    // Basic functionality tests for `Field` and its utilities.
+    #[test]
+    fn field_basics() {
+        assert_eq!(core::mem::size_of::<Test>(), 8);
+
+        let mut o = Test { a: 14, b: 11, c: 1444 };
+        let o_p = &raw mut o;
+
+        let f_p = unsafe { field_of_ptr::<field_of!(Test, b)>(o_p) };
+        let f_r = unsafe { &*f_p };
+        let b_p = unsafe { base_of_ptr::<field_of!(Test, b)>(f_p) };
+        let b_r = unsafe { &*b_p };
+
+        assert!(core::ptr::eq(o_p, b_p));
+        assert_eq!(*f_r, 11);
+        assert_eq!(b_r.b, 11);
+    }
+}
diff --git a/ipc/bus1/util/mod.rs b/ipc/bus1/util/mod.rs
index 4dd08c04eec6..ad1ceef35f3d 100644
--- a/ipc/bus1/util/mod.rs
+++ b/ipc/bus1/util/mod.rs
@@ -7,6 +7,8 @@
 use kernel::prelude::*;
 use kernel::sync::{Arc, ArcBorrow};
 
+pub mod field;
+
 /// Convert an Arc to its pinned version.
 ///
 /// All [`Arc`] instances are unconditionally pinned. It is always safe to
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC 10/16] bus1/util: add IntoDeref/FromDeref
  2026-03-31 19:02 [RFC 00/16] bus1: Capability-based IPC for Linux David Rheinsberg
                   ` (8 preceding siblings ...)
  2026-03-31 19:03 ` [RFC 09/16] bus1/util: add field projections David Rheinsberg
@ 2026-03-31 19:03 ` David Rheinsberg
  2026-03-31 19:44   ` Miguel Ojeda
  2026-03-31 19:03 ` [RFC 11/16] bus1/util: add intrusive data-type helpers David Rheinsberg
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 33+ messages in thread
From: David Rheinsberg @ 2026-03-31 19:03 UTC (permalink / raw)
  To: rust-for-linux; +Cc: teg, Miguel Ojeda, David Rheinsberg

Introduce two new utility traits: IntoDeref and FromDeref.

The traits are an abstraction for `Box::into_raw()` and
`Box::from_raw()`, as well as their equivalents in `Arc`. At the same
time, the traits can be implemented for plain references as no-ops.

The traits will be used by intrusive collections to generalize over the
data-type stored in a collection, without moving the actual data into
the collection.
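
The intended shape of the two traits can be sketched in plain userspace
Rust. The sketch below uses std's `Box` instead of the kernel types and
keeps only the core methods; it illustrates the pattern, not the patch's
exact API.

```rust
use std::ptr::NonNull;

// Illustrative userspace sketch of the two traits (the in-kernel version
// targets `kernel::alloc::Box`, `kernel::sync::Arc`, and pinned variants).
unsafe trait IntoDeref: Sized + std::ops::Deref {
    fn into_deref(v: Self) -> NonNull<Self::Target>;
}

unsafe trait FromDeref: IntoDeref {
    unsafe fn from_deref(v: NonNull<Self::Target>) -> Self;
}

// No-op implementation for plain shared references.
unsafe impl<T: ?Sized> IntoDeref for &T {
    fn into_deref(v: Self) -> NonNull<T> {
        NonNull::from(v)
    }
}

unsafe impl<T: ?Sized> FromDeref for &T {
    unsafe fn from_deref(v: NonNull<T>) -> Self {
        // SAFETY: Caller guarantees `v` came from `into_deref()` on a `&T`.
        unsafe { v.as_ref() }
    }
}

// Ownership-transferring implementation for `Box`.
unsafe impl<T: ?Sized> IntoDeref for Box<T> {
    fn into_deref(v: Self) -> NonNull<T> {
        // SAFETY: `Box::into_raw()` never returns NULL.
        unsafe { NonNull::new_unchecked(Box::into_raw(v)) }
    }
}

unsafe impl<T: ?Sized> FromDeref for Box<T> {
    unsafe fn from_deref(v: NonNull<T>) -> Self {
        // SAFETY: Caller guarantees `v` came from `into_deref()` on a `Box`.
        unsafe { Box::from_raw(v.as_ptr()) }
    }
}

fn main() {
    // Round-trip a `Box` through a raw pointer without leaking it.
    let b: Box<u64> = Box::new(71);
    let p: NonNull<u64> = IntoDeref::into_deref(b);
    let b: Box<u64> = unsafe { FromDeref::from_deref(p) };
    assert_eq!(*b, 71);
    println!("ok");
}
```

An intrusive collection can then be generic over any `FromDeref` type and
store only the raw pointer, restoring the original reference on removal.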

Signed-off-by: David Rheinsberg <david@readahead.eu>
---
 ipc/bus1/util/convert.rs | 259 +++++++++++++++++++++++++++++++++++++++
 ipc/bus1/util/mod.rs     |   1 +
 2 files changed, 260 insertions(+)
 create mode 100644 ipc/bus1/util/convert.rs

diff --git a/ipc/bus1/util/convert.rs b/ipc/bus1/util/convert.rs
new file mode 100644
index 000000000000..2b918f3b7db2
--- /dev/null
+++ b/ipc/bus1/util/convert.rs
@@ -0,0 +1,259 @@
+//! # Utilities for conversions between types
+//!
+//! This module contains utilities to help dealing with conversions between
+//! types.
+
+use core::ptr::NonNull;
+use kernel::prelude::*;
+use crate::util;
+
+/// Convert a value into a raw pointer to its dereferenced value.
+///
+/// This trait extends [`core::ops::Deref`] and allows converting a value
+/// into a raw pointer to its dereferenced value. While
+/// [`core::ops::Deref`] retains the original value and merely borrows to
+/// dereference it, [`IntoDeref`] actually converts the value into a raw
+/// pointer to the dereferenced value without retaining the original.
+///
+/// Note that in many cases this will leak the original value if no extra
+/// steps are taken. Usually, you want to restore the original value to ensure
+/// the correct drop handlers are run (see [`FromDeref`]).
+///
+/// This trait is a generic version of
+/// [`Box::into_raw()`](kernel::alloc::Box::into_raw),
+/// [`Arc::into_raw()`](kernel::sync::Arc::into_raw), and more.
+///
+/// ## Mutability
+///
+/// [`IntoDeref`] serves for immutable and mutable conversions. The returned
+/// pointer does not reflect whether mutable access is granted. That is, the
+/// trait can be used for owned values like [`Box`](kernel::alloc::Box)
+/// where mutable access is granted, and shared values like
+/// [`Arc`](kernel::sync::Arc) where no mutable access is granted.
+///
+/// If [`DerefMut`](core::ops::DerefMut) is implemented as well, then the
+/// value is available mutably. If it is not, mutability is left undefined.
+///
+/// ## Safety
+///
+/// The implementations of [`Deref`](core::ops::Deref) and [`IntoDeref`] must
+/// be compatible. That is, `deref()` must return the same pointer as
+/// `into_deref()`. If [`DerefMut`](core::ops::DerefMut) is implemented,
+/// `deref_mut()` must also return the same pointer.
+///
+/// Furthermore, for types that provide [pinned](core::pin) variants,
+/// [`IntoDeref`] is part of the safety requirements of
+/// [`core::pin::Pin::new_unchecked()`] just like
+/// [`Deref`](core::ops::Deref) is.
+///
+/// An implementation must uphold the documented guarantees of the individual
+/// methods.
+pub unsafe trait IntoDeref: Sized + core::ops::Deref {
+    /// Convert a value into a raw pointer to its dereferenced value.
+    ///
+    /// This consumes a dereferenceable value and yields a raw pointer
+    /// to the dereferenced value.
+    ///
+    /// The returned pointer is guaranteed to be convertible to a shared
+    /// reference for any caller-chosen lifetime `'a` where `Self: 'a`.
+    ///
+    /// If [`DerefMut`](core::ops::DerefMut) is implemented for [`Self`],
+    /// then the pointer is guaranteed to be convertible to a mutable
+    /// reference for any caller-chosen lifetime `'a` where `Self: 'a`.
+    ///
+    /// Conversion to a reference is subject to exclusivity guarantees as
+    /// required by `&` and `&mut`.
+    fn into_deref(v: Self) -> NonNull<Self::Target>;
+
+    /// Convert a pinned value into a raw pointer to its dereferenced value.
+    ///
+    /// This is the pinned equivalent of [`Self::into_deref()`]. The pointer
+    /// is only convertible to a pinned reference.
+    fn pin_into_deref(v: Pin<Self>) -> NonNull<Self::Target> {
+        // SAFETY: Pinned types must ensure they uphold pinning guarantees
+        //     just like `Deref` does (see trait requirements).
+        Self::into_deref(unsafe { Pin::into_inner_unchecked(v) })
+    }
+}
+
+/// Convert a dereferenced value back to its original value.
+///
+/// This trait provides the inverse operation of [`IntoDeref`]. It takes a
+/// raw pointer to a dereferenced value and restores the original pointer.
+/// This operation is unsafe and requires the caller to guarantee that the
+/// pointer was acquired via [`IntoDeref`] or similar means.
+///
+/// This trait is a generic version of
+/// [`Box::from_raw()`](kernel::alloc::Box::from_raw),
+/// [`Arc::from_raw()`](kernel::sync::Arc::from_raw), and more.
+///
+/// ## Safety
+///
+/// An implementation must uphold the documented guarantees of the individual
+/// methods.
+pub unsafe trait FromDeref: IntoDeref {
+    /// Convert a dereferenced value back to its original value.
+    ///
+    /// This returns the value that was originally passed to
+    /// [`IntoDeref::into_deref()`].
+    ///
+    /// ## Safety
+    ///
+    /// The wrapped pointer must have been acquired via [`IntoDeref`] or a
+    /// matching equivalent.
+    ///
+    /// The caller must guarantee that they do not make use of any retained
+    /// copies of the wrapped pointer.
+    ///
+    /// It is always safe to call this on values obtained via [`IntoDeref`], as
+    /// long as the raw pointer is no longer used afterwards.
+    unsafe fn from_deref(v: NonNull<Self::Target>) -> Self;
+
+    /// Convert a dereferenced value back to its original pinned value.
+    ///
+    /// This is the pinned equivalent of [`Self::from_deref()`].
+    ///
+    /// ## Safety
+    ///
+    /// The caller must guarantee that the original value was a pinned pointer.
+    /// Furthermore, all requirements of [`Self::from_deref()`] apply.
+    unsafe fn pin_from_deref(v: NonNull<Self::Target>) -> Pin<Self> {
+        // SAFETY: Pinned types must ensure they uphold pinning guarantees
+        //     just like `Deref` does (see trait requirements of `IntoDeref`).
+        //     Also, the caller must ensure the original value was pinned.
+        unsafe { Pin::new_unchecked(Self::from_deref(v)) }
+    }
+}
+
+mod impls {
+    use super::*;
+    use kernel::alloc::{Allocator, Box};
+    use kernel::sync::Arc;
+
+    // SAFETY: Coherent with `Deref` and pinning. Upholds method guarantees.
+    unsafe impl<T: ?Sized> IntoDeref for &T {
+        fn into_deref(v: Self) -> NonNull<Self::Target> {
+            util::nonnull_from_ref(v)
+        }
+    }
+
+    // SAFETY: Upholds method guarantees.
+    unsafe impl<T: ?Sized> FromDeref for &T {
+        unsafe fn from_deref(v: NonNull<Self::Target>) -> Self {
+            // SAFETY: Caller guarantees `v` is a `&T`.
+            unsafe { v.as_ref() }
+        }
+    }
+
+    // SAFETY: Coherent with `Deref` and pinning. Upholds method guarantees.
+    unsafe impl<T: ?Sized> IntoDeref for &mut T {
+        fn into_deref(v: Self) -> NonNull<Self::Target> {
+            util::nonnull_from_mut(v)
+        }
+    }
+
+    // SAFETY: Upholds method guarantees.
+    unsafe impl<T: ?Sized> FromDeref for &mut T {
+        unsafe fn from_deref(mut v: NonNull<Self::Target>) -> Self {
+            // SAFETY: Caller guarantees `v` is a `&mut T`.
+            unsafe { v.as_mut() }
+        }
+    }
+
+    // SAFETY: Coherent with `Deref` and pinning. Upholds method guarantees.
+    unsafe impl<T: ?Sized, A: Allocator> IntoDeref for Box<T, A> {
+        fn into_deref(v: Self) -> NonNull<Self::Target> {
+            // SAFETY: `Box::into_raw()` never returns NULL.
+            unsafe { NonNull::new_unchecked(Box::into_raw(v)) }
+        }
+    }
+
+    // SAFETY: Upholds method guarantees.
+    unsafe impl<T: ?Sized, A: Allocator> FromDeref for Box<T, A> {
+        unsafe fn from_deref(v: NonNull<Self::Target>) -> Self {
+            // SAFETY: Caller guarantees `v` is from `IntoDeref`.
+            unsafe { Box::from_raw(v.as_ptr()) }
+        }
+    }
+
+    // SAFETY: Coherent with `Deref` and pinning. Upholds method guarantees.
+    unsafe impl<T: ?Sized> IntoDeref for Arc<T> {
+        fn into_deref(v: Self) -> NonNull<Self::Target> {
+            // SAFETY: `Arc::into_raw()` never returns NULL.
+            unsafe { NonNull::new_unchecked(Arc::into_raw(v).cast_mut()) }
+        }
+    }
+
+    // SAFETY: Upholds method guarantees.
+    unsafe impl<T: ?Sized> FromDeref for Arc<T> {
+        unsafe fn from_deref(v: NonNull<Self::Target>) -> Self {
+            // SAFETY: Caller guarantees `v` is from `IntoDeref`.
+            unsafe { Arc::from_raw(v.as_ptr()) }
+        }
+    }
+}
+
+#[allow(clippy::undocumented_unsafe_blocks)]
+#[kunit_tests(bus1_util_convert)]
+mod test {
+    use super::*;
+    use kernel::alloc::KBox;
+    use kernel::sync::Arc;
+
+    #[test]
+    fn into_from_deref() {
+        let mut v: u64 = 71;
+
+        {
+            let p: *const u64 = &raw const v;
+            let f: &u64 = &v;
+
+            let d: NonNull<u64> = IntoDeref::into_deref(f);
+            assert_eq!(71, unsafe { *d.as_ref() });
+            assert!(core::ptr::eq(p, d.as_ptr()));
+
+            let r: &u64 = unsafe { FromDeref::from_deref(d) };
+            assert_eq!(71, *r);
+            assert!(core::ptr::eq(p, r));
+        }
+
+        {
+            let p: *mut u64 = &raw mut v;
+            let f: &mut u64 = &mut v;
+
+            let d: NonNull<u64> = IntoDeref::into_deref(f);
+            assert_eq!(71, unsafe { *d.as_ref() });
+            assert!(core::ptr::eq(p, d.as_ptr()));
+
+            let r: &mut u64 = unsafe { FromDeref::from_deref(d) };
+            assert_eq!(71, *r);
+            assert!(core::ptr::eq(p, r));
+        }
+
+        {
+            let f: KBox<u64> = KBox::new(v, GFP_KERNEL).unwrap();
+            let p: *const u64 = &raw const *f;
+
+            let d: NonNull<u64> = IntoDeref::into_deref(f);
+            assert_eq!(71, unsafe { *d.as_ref() });
+            assert!(core::ptr::eq(p, d.as_ptr()));
+
+            let r: KBox<u64> = unsafe { FromDeref::from_deref(d) };
+            assert_eq!(71, *r);
+            assert!(core::ptr::eq(p, &raw const *r));
+        }
+
+        {
+            let f: Arc<u64> = Arc::new(v, GFP_KERNEL).unwrap();
+            let p: *const u64 = &raw const *f;
+
+            let d: NonNull<u64> = IntoDeref::into_deref(f);
+            assert_eq!(71, unsafe { *d.as_ref() });
+            assert!(core::ptr::eq(p, d.as_ptr()));
+
+            let r: Arc<u64> = unsafe { FromDeref::from_deref(d) };
+            assert_eq!(71, *r);
+            assert!(core::ptr::eq(p, &raw const *r));
+        }
+    }
+}
diff --git a/ipc/bus1/util/mod.rs b/ipc/bus1/util/mod.rs
index ad1ceef35f3d..b8922cfb74cc 100644
--- a/ipc/bus1/util/mod.rs
+++ b/ipc/bus1/util/mod.rs
@@ -7,6 +7,7 @@
 use kernel::prelude::*;
 use kernel::sync::{Arc, ArcBorrow};
 
+pub mod convert;
 pub mod field;
 
 /// Convert an Arc to its pinned version.
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC 11/16] bus1/util: add intrusive data-type helpers
  2026-03-31 19:02 [RFC 00/16] bus1: Capability-based IPC for Linux David Rheinsberg
                   ` (9 preceding siblings ...)
  2026-03-31 19:03 ` [RFC 10/16] bus1/util: add IntoDeref/FromDeref David Rheinsberg
@ 2026-03-31 19:03 ` David Rheinsberg
  2026-03-31 19:03 ` [RFC 12/16] bus1/util: add intrusive single linked lists David Rheinsberg
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: David Rheinsberg @ 2026-03-31 19:03 UTC (permalink / raw)
  To: rust-for-linux; +Cc: teg, Miguel Ojeda, David Rheinsberg

Add `util::intrusive` as a helper module for the different intrusive
data-types that will be added later on. At its core is the `Link`
trait. It describes the relationship between a value in a collection,
and the intrusive metadata on that type required by the collection.

`Link` is auto-derived for types that use `IntoDeref` and `FromDeref`,
as well as field projections. This means that, for most use cases, the
`Link` trait does not need to be implemented at all.
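
The acquire/release round-trip that the auto-derived `Link` performs can
be sketched in userspace Rust: convert the owning reference into a raw
pointer, project to the embedded metadata field, and invert both steps on
release. All names below (`Entry`, `md`, `acquire`, `release`) are
hypothetical; the real implementation goes through `IntoDeref`/`FromDeref`
and the field-projection helpers.

```rust
use std::mem::offset_of;
use std::ptr::NonNull;

// Hypothetical element type: `md` is the intrusive metadata field.
struct Node { _prev: *mut Node }
struct Entry { payload: u8, md: Node }

// Turn an owned `Box<Entry>` into a pointer to its embedded node,
// mirroring what the auto-derived `Link::acquire()` does.
fn acquire(e: Box<Entry>) -> NonNull<Node> {
    let base = Box::into_raw(e);
    // SAFETY: `base` comes from `Box::into_raw()` and is non-null, so a
    // projection to a member field cannot be null either.
    unsafe { NonNull::new_unchecked(std::ptr::addr_of_mut!((*base).md)) }
}

// Recover the owning `Box<Entry>` from the node pointer, mirroring
// `Link::release()` (base-of projection plus `FromDeref`).
unsafe fn release(n: NonNull<Node>) -> Box<Entry> {
    // Walk back from the field to the containing struct.
    let base = n.as_ptr().cast::<u8>().wrapping_sub(offset_of!(Entry, md));
    // SAFETY: Caller guarantees `n` came from `acquire()`.
    unsafe { Box::from_raw(base.cast::<Entry>()) }
}

fn main() {
    let e = Box::new(Entry { payload: 7, md: Node { _prev: std::ptr::null_mut() } });
    let n = acquire(e);
    // A collection would store `n`; here we immediately release it.
    let e = unsafe { release(n) };
    assert_eq!(e.payload, 7);
    println!("ok");
}
```

A collection thus only ever stores `NonNull<Node>` values, while the type
system records which reference type owns each linked element.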

Signed-off-by: David Rheinsberg <david@readahead.eu>
---
 ipc/bus1/util/intrusive.rs | 397 +++++++++++++++++++++++++++++++++++++
 ipc/bus1/util/mod.rs       |   1 +
 2 files changed, 398 insertions(+)
 create mode 100644 ipc/bus1/util/intrusive.rs

diff --git a/ipc/bus1/util/intrusive.rs b/ipc/bus1/util/intrusive.rs
new file mode 100644
index 000000000000..d6d4656a26d7
--- /dev/null
+++ b/ipc/bus1/util/intrusive.rs
@@ -0,0 +1,397 @@
+//! # Utilities for Intrusive Data Structures
+//!
+//! Intrusive data structures store metadata of elements they manage in the
+//! element itself, rather than using support data structures. This requires
+//! users to embed the respective metadata types in their types, and annotate
+//! the data structures with sufficient information about this embedding.
+//!
+//! Intrusive data structures encode connections between elements and
+//! containing collections in the type system. Furthermore, they can often
+//! reduce or fully eliminate dynamic allocations, as well as reduce the
+//! number of pointer chases necessary to traverse a collection.
+//! On the flip side, intrusive data structures often reduce cache locality and
+//! can thus be more expensive to traverse.
+//!
+//! In general, performance and allocation pressure highly depend on the
+//! implemented algorithm, rather than on whether a data structure is
+//! intrusive. But if dynamic collections free of allocations are needed,
+//! intrusive data structures are usually the only option.
+//!
+//! The biggest advantage of intrusive data structures, though, is their use
+//! of the type system to encode the possible relationships of different
+//! data types. An instance of a given data type can only be linked into a
+//! statically known number of intrusive collections at a time. The fact that
+//! metadata is directly embedded in a type can be used to deduce how a type
+//! can be hooked up into other collections. Furthermore, such relationships
+//! can be queried at runtime in constant time.
+//!
+//! ## Examples
+//!
+//! The following pseudo code shows how intrusive collections behave, based on
+//! a fictional intrusive collection called `col`. Most collections that use
+//! the utilities of this module behave in a very similar manner.
+//!
+//! ```rust,ignore
+//! // Import the fictional collection `col`.
+//! use some::intrusive::collection::col;
+//!
+//! // Data type that will be stored in the collection. The payload describes
+//! // the user controlled data that can be put into the entry. `md` is the
+//! // mandatory metadata used to store it as a node in the collection.
+//! struct Entry {
+//!     payload: u8,
+//!     md: col::Node,
+//! }
+//!
+//! // Encode that `Entry` has a node as member field called `md`. The unstable
+//! // field-projections feature of Rust would make this obsolete.
+//! util::field::impl_pin_field!(Entry, md, col::Node);
+//!
+//! // Create 3 entries, and then pin them on the stack. As an alternative, the
+//! // entries could also be dynamically allocated via `Box`, `Arc`, etc..
+//! let e0_o = Entry { payload: 0, ..Default::default() };
+//! let e1_o = Entry { payload: 1, ..Default::default() };
+//! let e2_o = Entry { payload: 2, ..Default::default() };
+//! let e0 = core::pin::pin!(e0_o);
+//! let e1 = core::pin::pin!(e1_o);
+//! let e2 = core::pin::pin!(e2_o);
+//!
+//! // Create a fictional collection called `Map`, which stores elements of
+//! // type `Entry` using the `md` member field.
+//! let map = col::Map::<col::node_of!(&Entry, md)>::new();
+//!
+//! // Nodes can be queried for their state at any time.
+//! assert!(!e0.md.is_linked());
+//! // Collections can often be queried for their relationship to a node. This
+//! // can usually be performed in O(1), but depends on the implementation.
+//! assert!(!map.contains(&e0));
+//!
+//! // Collections take ownership of a reference to an element, rather than
+//! // moving the element into the collection. In this case, a shared reference
+//! // is used. Dynamic allocations would move a `Box`, `Arc`, etc. into the
+//! // collection.
+//! map.push(&e0);
+//!
+//! // Since the used reference type implements `Clone`, the caller retains a
+//! // reference and can use it to query the collection for it.
+//! assert!(e0.md.is_linked());
+//! assert!(map.contains(&e0));
+//!
+//! // Many elements can be pushed into a collection, but a single element
+//! // cannot be in multiple collections that use the same member field node.
+//! // Assuming `push()` panics on failure, we use `try_push()` to verify that
+//! // `e0` was already pushed before.
+//! map.push(&e1);
+//! map.push(&e2);
+//! assert!(map.try_push(&e0).is_err());
+//!
+//! // Collections usually provide cursors that allow traversals that can
+//! // optionally modify the collection. In this case it is used to drop all
+//! // elements with an even payload number.
+//! let mut cursor = map.first_mut();
+//! while let Some(v) = cursor.get() {
+//!     if (v.payload % 2) == 0 {
+//!         cursor.move_next_unlink();
+//!     } else {
+//!         cursor.move_next();
+//!     }
+//! }
+//! assert!(!map.contains(&e0));
+//! assert!(map.contains(&e1));
+//! assert!(!map.contains(&e2));
+//!
+//! // Since collections own references to their elements, they can be dropped
+//! // and will automatically drop all references to contained elements.
+//! // Here, this means the elements are no longer linked anywhere and we can
+//! // get mutable access to them again (verified by using `Pin::set()`).
+//! drop(map);
+//! e0.set(Default::default());
+//! e1.set(Default::default());
+//! e2.set(Default::default());
+//! ```
+
+use core::ptr::NonNull;
+use kernel::prelude::*;
+use crate::util::{self, field};
+
+/// Link metadata for intrusive data structures.
+///
+/// This trait represents the link between nodes in an intrusive data
+/// structure. It defines how to operate with the data-type that is stored
+/// in an intrusive data structure, and how to acquire and release the
+/// metadata stored in it.
+///
+/// An intrusive data-structure is usually generic over [`Link<Node>`], where
+/// `Node` is the type of intrusive metadata used with that data structure. A
+/// user then needs to provide an implementation of `Link<Node>` for the data
+/// type they want to use. The trait defines how to acquire and release
+/// values of this data type, and how to locate the metadata of type `Node`
+/// within that type.
+///
+/// The trait is usually not directly implemented on any of the involved types.
+/// Instead, a representing meta-type is used, similar to
+/// field-representing-types of [`field projections`](crate::util::field). This
+/// module provides such a link-representing-type as [`LinkRepr`].
+///
+/// # Default Implementation via [`Deref`](core::ops::Deref)
+///
+/// If a type implements [`IntoDeref`](util::convert::IntoDeref), a
+/// default implementation of [`Link`] is provided via `Deref::Target` as
+/// `Link<Ref = T, Target = T::Target>`. The implementation is provided via
+/// a field-representing-type on [`LinkRepr`], and can be accessed via
+/// [`link_of!()`].
+///
+/// # Safety
+///
+/// An implementation must uphold the documented guarantees of the individual
+/// methods.
+pub unsafe trait Link<Node: ?Sized> {
+    /// Reference type that is used with an intrusive collection.
+    type Ref;
+    /// Target type that represents the borrowed version of the reference type.
+    type Target: ?Sized;
+
+    /// Acquire a reference target pointer from a reference.
+    ///
+    /// This will turn the reference into a reference target pointer and node
+    /// pointer. It is up to the caller to ensure calling [`Self::release()`]
+    /// when releasing the pointer. Otherwise, the original reference will be
+    /// leaked.
+    ///
+    /// The returned pointer is guaranteed to be convertible to a reference for
+    /// any caller-chosen lifetime `'a` where `Self::Ref: 'a`.
+    ///
+    /// Conversion to a reference is subject to exclusivity guarantees of `&`
+    /// and `&mut` (i.e., only shared refs, or no shared ref but exactly one
+    /// mutable ref).
+    ///
+    /// If, and only if, [`LinkMut`] is implemented on `Self`, mutable
+    /// references are allowed.
+    ///
+    /// Repeated calls to `acquire()` (with intermittent calls to `release()`)
+    /// produce the same pointer.
+    fn acquire(v: Pin<Self::Ref>) -> NonNull<Node>;
+
+    /// Release a reference target pointer to get back the original reference.
+    ///
+    /// # Safety
+    ///
+    /// The reference target pointer must have been acquired via
+    /// [`Self::acquire()`], and the caller must cease further use of the
+    /// pointer.
+    unsafe fn release(v: NonNull<Node>) -> Pin<Self::Ref>;
+
+    /// Project a reference target to its node.
+    ///
+    /// This allows access to a `Node` from its owning reference target,
+    /// without having access to any owning reference (`Self::Ref`).
+    ///
+    /// The returned pointer is guaranteed to be convertible to a shared
+    /// reference for any caller-chosen lifetime `'a` where `'_: 'a`.
+    ///
+    /// Repeated calls will return the same pointer, and are guaranteed to be
+    /// the inverse operation of [`Self::borrow()`].
+    // XXX: We need to investigate whether it is valid to store pointers from
+    // this in a collection itself. A collection must already own pointers to
+    // the same element, but likely not derived from a matching reference.
+    // Hence, under Stacked Borrows it will carry a different tag and thus can
+    // invalidate other tags if interior mutability is involved (needs to be
+    // verified).
+    fn project(v: &Self::Target) -> NonNull<Node>;
+
+    /// Borrow the reference target temporarily.
+    ///
+    /// # Safety
+    ///
+    /// `v` must have been from [`Self::acquire()`] and no mutable reference
+    /// can co-exist.
+    unsafe fn borrow<'a>(v: NonNull<Node>) -> Pin<&'a Self::Target>
+    where
+        Self::Ref: 'a;
+
+    /// Clone a borrow of the reference target.
+    ///
+    /// This creates a clone of the original reference type from a borrowed
+    /// node. This is only available, if the reference type implements
+    /// [`Clone`](core::clone::Clone).
+    ///
+    /// # Safety
+    ///
+    /// `v` must have been from [`Self::acquire()`] and no other reference
+    /// can co-exist.
+    unsafe fn borrow_clone(v: NonNull<Node>) -> Pin<Self::Ref>
+    where
+        Self::Ref: Clone,
+    {
+        // Create a clone by temporarily releasing and re-acquiring the
+        // original reference. Ensure a panic in the `clone()`-impl does not
+        // drop the original reference, just to be safe and avoid cascading
+        // failures.
+        let t = core::mem::ManuallyDrop::new(unsafe { Self::release(v) });
+        let r = (*t).clone();
+        let _ = Self::acquire(core::mem::ManuallyDrop::into_inner(t));
+        r
+    }
+}
+
+/// Mutability Extensions to [`Link`].
+///
+/// This trait extends [`Link`] by declaring the reference type to grant
+/// mutable access to the stored data.
+///
+/// # Safety
+///
+/// An implementation must uphold the documented guarantees of the individual
+/// methods.
+pub unsafe trait LinkMut<Node: ?Sized>: Link<Node> {
+    /// Mutably borrow the reference target temporarily.
+    ///
+    /// # Safety
+    ///
+    /// `v` must have been from [`Self::acquire()`] and no other reference
+    /// can co-exist.
+    unsafe fn borrow_mut<'a>(v: NonNull<Node>) -> Pin<&'a mut Self::Target>
+    where
+        Self::Ref: 'a;
+}
+
+/// Representation of link metadata for intrusive data structures.
+///
+/// This type is used as implementing type for [`Link`] if field projections
+/// are used to access metadata. It is a 1-ZST and only used to represent the
+/// link between nodes of an intrusive data structure.
+///
+/// This type takes the reference type of the data structure as first argument
+/// (called `Ref`), and the field-representing-type of the reference target as
+/// second argument (called `Frt`). The type is usually accessed via
+/// [`link_of!()`].
+///
+/// This type is invariant over all its type parameters.
+#[repr(C, packed)]
+pub struct LinkRepr<Ref: ?Sized, Frt> {
+    _ref: [*mut Ref; 0],
+    _frt: [*mut Frt; 0],
+}
+
+// SAFETY: Upholds method guarantees.
+unsafe impl<Ref, Frt> Link<Frt::Type> for LinkRepr<Ref, Frt>
+where
+    Ref: util::convert::FromDeref,
+    Frt: field::PinField<Base = Ref::Target>,
+{
+    type Ref = Ref;
+    type Target = Frt::Base;
+
+    fn acquire(v: Pin<Self::Ref>) -> NonNull<Frt::Type> {
+        let target = Ref::pin_into_deref(v);
+        // SAFETY: `target` was just acquired from `pin_into_deref()`, which
+        //     guarantees that the result is convertible to a reference. A
+        //     field projection thus cannot be `NULL`.
+        unsafe {
+            NonNull::new_unchecked(
+                field::field_of_ptr::<Frt>(target.as_ptr())
+            )
+        }
+    }
+
+    unsafe fn release(v: NonNull<Frt::Type>) -> Pin<Self::Ref> {
+        // SAFETY: Caller guarantees that `v` was from `acquire()`, and
+        //     thus points into an allocation of `Frt::Type` within a
+        //     `Self::Target`. Hence, `base_of_ptr()` is safe to call and
+        //     cannot return `NULL`.
+        let target = unsafe {
+            NonNull::new_unchecked(field::base_of_ptr::<Frt>(v.as_ptr()))
+        };
+        // SAFETY: Caller guarantees that `v` was from `acquire()`, and thus
+        //     from `pin_into_deref()`. They also guarantee to cease using `v`.
+        unsafe { util::convert::FromDeref::pin_from_deref(target) }
+    }
+
+    fn project(v: &Self::Target) -> NonNull<Frt::Type> {
+        // SAFETY: `v` is a valid reference, so it must point to a valid
+        //     allocation.
+        unsafe {
+            NonNull::new_unchecked(
+                field::field_of_ptr::<Frt>(core::ptr::from_ref(v).cast_mut())
+            )
+        }
+    }
+
+    unsafe fn borrow<'a>(v: NonNull<Frt::Type>) -> Pin<&'a Self::Target>
+    where
+        Self::Ref: 'a,
+    {
+        // SAFETY: Caller guarantees that `v` was from `acquire()`, and
+        //     thus points into an allocation of `Frt::Type` within a
+        //     `Self::Target`. Hence, `base_of_ptr()` is safe to call and
+        //     is pinned.
+        //     Caller also guarantees that no mutable reference exists.
+        unsafe { Pin::new_unchecked(&*field::base_of_ptr::<Frt>(v.as_ptr())) }
+    }
+}
+
+// SAFETY: Upholds method guarantees.
+unsafe impl<Ref, Frt> LinkMut<Frt::Type> for LinkRepr<Ref, Frt>
+where
+    Ref: util::convert::FromDeref,
+    Frt: field::PinField<Base = Ref::Target>,
+{
+    unsafe fn borrow_mut<'a>(v: NonNull<Frt::Type>) -> Pin<&'a mut Self::Target>
+    where
+        Self::Ref: 'a,
+    {
+        // SAFETY: Caller guarantees that `v` was from `acquire()`, and
+        //     thus points into an allocation of `Frt::Type` within a
+        //     `Self::Target`. Hence, `base_of_ptr()` is safe to call and
+        //     is pinned.
+        //     Caller also guarantees that no other reference exists.
+        unsafe { Pin::new_unchecked(&mut *field::base_of_ptr::<Frt>(v.as_ptr())) }
+    }
+}
+
+#[doc(hidden)]
+#[macro_export]
+macro_rules! util_intrusive_lrt {
+    ($ref:ty, $deref:ty, $field:ident, $node:ty $(,)?) => {
+        $crate::util::intrusive::LinkRepr<$ref, $crate::util::field::typed_field_of!{$deref, $field, $node}>
+    }
+}
+
+#[doc(hidden)]
+#[macro_export]
+macro_rules! util_intrusive_link_of {
+    ($ref:ty, $field:ident, $node:ty $(,)?) => {
+        $crate::util::intrusive::lrt!{
+            $ref,
+            <$ref as core::ops::Deref>::Target,
+            $field,
+            $node,
+        }
+    }
+}
+
+/// Resolve to the link-representing-type (LRT).
+///
+/// This takes as arguments:
+/// - $ref:ty
+/// - $deref:ty
+/// - $field:ident
+/// - $node:ty
+///
+/// And resolves to `LinkRepr<$ref, typed_field_of!($deref, $field, $node)>`
+/// using a field-representing-type.
+#[doc(inline)]
+pub use crate::util_intrusive_lrt as lrt;
+
+/// Resolve to [`LinkRepr`] of a reference type.
+///
+/// This takes as arguments:
+/// - $ref:ty
+/// - $field:ident
+/// - $node:ty
+///
+/// It resolves to a type implementing [`Link<$node>`], using
+/// [`Deref`](core::ops::Deref) on `$ref` as reference target and its
+/// field-representing-type with `$field` as member name.
+#[doc(inline)]
+pub use crate::util_intrusive_link_of as link_of;
diff --git a/ipc/bus1/util/mod.rs b/ipc/bus1/util/mod.rs
index b8922cfb74cc..bcd6eedff85a 100644
--- a/ipc/bus1/util/mod.rs
+++ b/ipc/bus1/util/mod.rs
@@ -9,6 +9,7 @@
 
 pub mod convert;
 pub mod field;
+pub mod intrusive;
 
 /// Convert an Arc to its pinned version.
 ///
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC 12/16] bus1/util: add intrusive single linked lists
  2026-03-31 19:02 [RFC 00/16] bus1: Capability-based IPC for Linux David Rheinsberg
                   ` (10 preceding siblings ...)
  2026-03-31 19:03 ` [RFC 11/16] bus1/util: add intrusive data-type helpers David Rheinsberg
@ 2026-03-31 19:03 ` David Rheinsberg
  2026-03-31 19:03 ` [RFC 13/16] bus1/util: add intrusive rb-tree David Rheinsberg
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: David Rheinsberg @ 2026-03-31 19:03 UTC (permalink / raw)
  To: rust-for-linux; +Cc: teg, Miguel Ojeda, David Rheinsberg

Introduce two new utility modules: slist and lll

`lll` is a lockless-single-linked-list similar to `linux/llist.h`. It
will be used by the bus1 module as a message queue. `lll` uses the
intrusive helpers from `util::intrusive` and exposes a generic
(non-lockless) single linked list as `slist`.

Since clearing a lockless single linked list returns ownership of all
entries at once, `slist` is provided as a generic single linked list to
represent that returned batch. However, `slist` is useful on its own,
even without `lll` in mind.

`lll` follows a standard lockless-linked-list design, but has one
non-standard feature: It can be sealed. Sealing a lockless list clears
it and prevents any further entry from being linked, thus disabling the
list entirely, until it is deallocated. This is required by `bus1` to
ensure circular references are broken once a peer is deallocated.
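
The lockless push/seal protocol can be sketched in plain userspace Rust
(a simplified, non-intrusive stand-in built on `std` atomics;
`LockFreeStack` and its methods are illustrative names, not the module's
API):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Sentinels mirroring the module's design: a reserved "empty" marker and
// a null "sealed" marker, neither of which can be a real heap address.
const END: usize = usize::MAX;
const SEAL: usize = 0;

struct Node {
    val: u32,
    next: usize, // address of the next node, END at the tail
}

struct LockFreeStack {
    first: AtomicUsize, // address of the first node, or END / SEAL
}

impl LockFreeStack {
    fn new() -> Self {
        Self { first: AtomicUsize::new(END) }
    }

    /// Push at the front; fails once the list is sealed.
    fn try_push(&self, val: u32) -> Result<(), u32> {
        let mut node = Box::new(Node { val, next: END });
        let mut first = self.first.load(Ordering::Relaxed);
        loop {
            if first == SEAL {
                return Err(val);
            }
            node.next = first;
            let addr = Box::into_raw(node) as usize;
            // Release: order all writes to the node before publication.
            match self.first.compare_exchange(
                first, addr, Ordering::Release, Ordering::Relaxed,
            ) {
                Ok(_) => return Ok(()),
                Err(seen) => {
                    // Reclaim the box we just leaked; nobody else has
                    // seen the address, so this is the sole owner.
                    node = unsafe { Box::from_raw(addr as *mut Node) };
                    first = seen;
                }
            }
        }
    }

    /// Atomically take all entries (LIFO order) and seal the list.
    fn seal(&self) -> Vec<u32> {
        // Acquire: see all writes made before each node was pushed.
        let mut cur = self.first.swap(SEAL, Ordering::Acquire);
        let mut out = Vec::new();
        while cur != END && cur != SEAL {
            // Clearing `first` transferred ownership of the whole chain.
            let node = unsafe { Box::from_raw(cur as *mut Node) };
            out.push(node.val);
            cur = node.next;
        }
        out
    }
}
```

Pushing `1, 2, 3` and then sealing yields `[3, 2, 1]`, and any later push
fails with the value handed back to the caller.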

Signed-off-by: David Rheinsberg <david@readahead.eu>
---
 ipc/bus1/util/lll.rs   | 378 +++++++++++++++++++++++
 ipc/bus1/util/mod.rs   |   2 +
 ipc/bus1/util/slist.rs | 677 +++++++++++++++++++++++++++++++++++++++++
 3 files changed, 1057 insertions(+)
 create mode 100644 ipc/bus1/util/lll.rs
 create mode 100644 ipc/bus1/util/slist.rs

diff --git a/ipc/bus1/util/lll.rs b/ipc/bus1/util/lll.rs
new file mode 100644
index 000000000000..cddbe6ead4a1
--- /dev/null
+++ b/ipc/bus1/util/lll.rs
@@ -0,0 +1,378 @@
+// SPDX-License-Identifier: GPL-2.0
+//! # Intrusive Lockless Linked Lists
+//!
+//! This module implements an intrusive lockless linked list. It is similar
+//! to `linux/llist.h`, but implemented in pure Rust.
+//!
+//! This follows the intrusive design described in
+//! [`intrusive`](crate::util::intrusive). However, it only offers a very
+//! limited API surface. For a general purpose single linked list API,
+//! use [`util::slist`].
+//!
+//! The core entrypoint is [`List`], which maintains a single pointer to the
+//! first entry in a single linked list. Elements are stored via their
+//! [`Node`] metadata field, which again is just a single pointer to the
+//! respective next element in the list.
+//!
+//! More generally, [`List`] can be seen as a multi-producer/multi-consumer
+//! channel, similar to (but very much reduced in scope)
+//! `std::sync::mpsc` in the Rust standard library.
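+//!
+//! A typical handover looks like this (illustrative sketch; `Entry` and its
+//! `node` field are not part of this module):
+//!
+//! ```ignore
+//! let list: List<node_of!(&Entry, node)> = List::new();
+//! list.link_front(e0.as_ref());           // producer side, lockless push
+//! let mut batch = list.clear();           // consumer side, LIFO order
+//! while let Some(e) = batch.unlink_front() { /* process `e` */ }
+//! ```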
+
+use kernel::prelude::*;
+use kernel::sync::atomic;
+
+use crate::util;
+
+/// Intrusive lockless single linked list to store elements.
+///
+/// A [`List`] effectively provides two operations:
+/// 1) Push a new element to the front of the list.
+/// 2) Remove all elements from the list and return them.
+///
+/// Both operations can be performed without any locks but only via hardware
+/// atomic operations.
+///
+/// This list is mainly used for multi-producer / single-or-multi-consumer
+/// (mpsc/mpmc) channels. That is, it serves as a handover point for items
+/// from producers to one or more consumers. The list does not provide any
+/// iterators, cursors, or other utilities to modify or inspect a list. If
+/// those are needed, proper locked lists are the better option.
+///
+/// Elements stored in a [`List`] are owned by that list, but are not moved,
+/// nor allocated by the list. Instead, the list takes ownership of a pointer
+/// to the element (either via smart pointers, or via references that have a
+/// lifetime that exceeds the lifetime of the list). Once an element is
+/// removed, ownership is transferred back to the caller.
+///
+/// [`List`] uses the same node type as [`util::slist`], and entries can thus
+/// move from one to the other.
+pub struct List<Lrt>
+where
+    Lrt: util::intrusive::Link<Node>,
+{
+    // Pointer to the first node in the list. Set to `END` if the list is
+    // empty, `NULL` if it was sealed and can no longer be pushed to. All other
+    // pointers always represent a pinned owned reference to an entry, gained
+    // via `Ref::pin_into_deref()`.
+    first: atomic::Atomic<usize>,
+    // Different lists can store entries of type `Lrt::Ref` via different
+    // nodes. By pinning the link-representing-type, it is always clear
+    // which node a list is using.
+    _lrt: core::marker::PhantomData<Lrt>,
+}
+
+/// Metadata required for elements of a [`List`].
+///
+/// This is an alias for [`util::slist::Node`]. That is, this uses the same
+/// node type as the general purpose single-linked list provided by
+/// [`util::slist`].
+pub type Node = util::slist::Node;
+
+#[doc(hidden)]
+#[macro_export]
+macro_rules! util_lll_node_of {
+    ($ref:ty, $field:ident $(,)?) => {
+        $crate::util::intrusive::link_of!{$ref, $field, $crate::util::lll::Node}
+    }
+}
+
+/// Alias of [`link_of!()`](util::intrusive::link_of) for [`Node`] members.
+#[doc(inline)]
+pub use util_lll_node_of as node_of;
+
+// Marks a sealed list. This differs from `slist::END` in that no more
+// entries can be pushed to a sealed list. Otherwise, it is treated like an
+// empty list.
+pub(crate) const SEAL: usize = 0;
+
+impl<Lrt> List<Lrt>
+where
+    Lrt: util::intrusive::Link<Node>,
+{
+    /// Create a new empty list.
+    ///
+    /// The new list has no entries linked and is completely independent of
+    /// other lists.
+    pub const fn new() -> Self {
+        Self {
+            first: atomic::Atomic::new(util::slist::END),
+            _lrt: core::marker::PhantomData,
+        }
+    }
+
+    /// Check whether the list is empty.
+    ///
+    /// This returns `true` if no entries are linked, `false` if at least one
+    /// entry is linked.
+    ///
+    /// Note that the list does not maintain a counter of how many elements are
+    /// linked.
+    pub fn is_empty(&self) -> bool {
+        match self.first.load(atomic::Relaxed) {
+            SEAL | util::slist::END => true,
+            _ => false,
+        }
+    }
+
+    /// Check whether the list is sealed.
+    ///
+    /// This returns `true` if the list is sealed. A sealed list is always
+    /// empty and cannot be modified anymore (nor can the seal be removed).
+    pub fn is_sealed(&self) -> bool {
+        self.first.load(atomic::Relaxed) == SEAL
+    }
+
+    /// Link a node at the front of a list.
+    ///
+    /// On success, `Ok` is returned and the node is linked at the front of the
+    /// list, with ownership transferred to the list.
+    ///
+    /// If the node is already on another list, this will return `Err` and
+    /// return ownership of the entry to the caller.
+    ///
+    /// On success, this ensures a release memory barrier before linking the
+    /// node into the list, matching the acquire memory barrier in
+    /// [`List::clear()`].
+    pub fn try_link_front(
+        &self,
+        ent: Pin<Lrt::Ref>,
+    ) -> Result<(), Pin<Lrt::Ref>> {
+        let mut first = self.first.load(atomic::Relaxed);
+        if first == SEAL {
+            // Sealed lists cannot be linked to.
+            return Err(ent);
+        }
+
+        let ent_node = Lrt::acquire(ent);
+        // SAFETY: `ent_node` is convertible to a shared reference as long as
+        //     we do not call `Lrt::release()`.
+        let ent_node_r = unsafe { ent_node.as_ref() };
+
+        let Ok(_) = ent_node_r.next.cmpxchg(0, first, atomic::Relaxed) else {
+            // `ent_node_r` becomes invalid once `ent_node` is released.
+            #[expect(dropping_references)]
+            drop(ent_node_r);
+            // `ent` is already linked into another list, return ownership to
+            // the caller wrapped in an `Err`.
+            //
+            // SAFETY: `ent_node` was just acquired from `pin_into_deref()`
+            //     and is no longer used afterwards.
+            return Err(unsafe { Lrt::release(ent_node) });
+        };
+
+        // Expose provenance, until `Atomic<*mut T>` is here.
+        let ent_node_addr = ent_node.as_ptr() as usize;
+
+        // Try updating the list-front until it succeeds.
+        loop {
+            // Use release barrier, since we want all operations on the node
+            // to be ordered before the node is pushed to the list. The
+            // matching acquire barrier is in `Self::clear()`.
+            match self.first.cmpxchg(
+                first,
+                ent_node_addr,
+                atomic::Release,
+            ) {
+                Ok(_) => break Ok(()),
+                Err(v) => {
+                    // If the list is sealed, no more entries can be linked.
+                    if v == SEAL {
+                        // SAFETY: `ent_node` was just acquired from
+                        // `pin_into_deref()` and is no longer used afterwards.
+                        break Err(unsafe { Lrt::release(ent_node) });
+                    }
+                    first = v;
+                    ent_node_r.next.store(first, atomic::Relaxed);
+                },
+            }
+        }
+    }
+
+    /// Clear the entire list and return the entries to the caller.
+    ///
+    /// This will atomically remove all entries from the list, and return those
+    /// entries as a general purpose single linked list to the caller.
+    ///
+    /// Note that [`List`] only supports adding entries at the front. Hence,
+    /// the returned list will be in LIFO (last-in-first-out) order.
+    ///
+    /// This ensures an acquire memory barrier matching the release memory
+    /// barrier in [`List::try_link_front()`].
+    pub fn clear(&self) -> util::slist::List<Lrt> {
+        let mut first = self.first.load(atomic::Relaxed);
+        loop {
+            if first == SEAL {
+                break util::slist::List::new();
+            }
+            // Use acquire barrier to ensure writes to the nodes are
+            // visible, if done before they were linked. The matching
+            // release barrier is in `Self::try_link_front()`.
+            match self.first.cmpxchg(
+                first,
+                util::slist::END,
+                atomic::Acquire,
+            ) {
+                Ok(v) => {
+                    // SAFETY: By clearing `self.first` we acquire the list.
+                    // Since it uses the same nodes as `slist`, we can create
+                    // one from it.
+                    break unsafe { util::slist::List::with(v) };
+                },
+                Err(v) => first = v,
+            }
+        }
+    }
+
+    /// Seal the entire list and return all entries to the caller.
+    ///
+    /// This will atomically remove all entries from the list and seal it, so
+    /// any new attempt to link more entries will fail.
+    ///
+    /// A sealed list will remain sealed and cannot be unsealed. This also
+    /// implies that the list will remain empty.
+    ///
+    /// If the list is already sealed, this is a no-op and will return an empty
+    /// list.
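+    ///
+    /// For example (illustrative sketch; `list` and `ent` are assumed to be
+    /// set up as in this module's tests):
+    ///
+    /// ```ignore
+    /// let batch = list.seal();                     // drain and seal
+    /// assert!(list.is_sealed() && list.is_empty());
+    /// assert!(list.try_link_front(ent).is_err());  // no further links
+    /// ```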
+    pub fn seal(&self) -> util::slist::List<Lrt> {
+        let v = self.first.xchg(SEAL, atomic::Acquire);
+        if v == SEAL {
+            util::slist::List::new()
+        } else {
+            // SAFETY: By clearing `self.first` we acquire the list. Since it
+            // uses the same nodes as `slist`, we can create one from it.
+            unsafe { util::slist::List::with(v) }
+        }
+    }
+}
+
+// Convenience helpers
+impl<Lrt> List<Lrt>
+where
+    Lrt: util::intrusive::Link<Node>,
+{
+    /// Link a node at the front of a list.
+    ///
+    /// Works like [`List::try_link_front()`] but warns on error and leaks the
+    /// entry.
+    pub fn link_front(&self, ent: Pin<Lrt::Ref>) {
+        self.try_link_front(ent).unwrap_or_else(|v| {
+            // Warn if the entry is already used elsewhere, and then leak the
+            // reference to avoid cascading failures.
+            kernel::warn_on!(true);
+            core::mem::forget(v);
+        })
+    }
+}
+
+// SAFETY: `List` can be sent across CPUs, as long as the data it contains can
+//     also be sent across. `List` never cares about the CPU it is called on.
+unsafe impl<Lrt> Send for List<Lrt>
+where
+    Lrt: util::intrusive::Link<Node>,
+    Lrt::Ref: Send,
+{
+}
+
+// SAFETY: `List` is meant to be shared across CPUs and safely handles parallel
+//     accesses through atomics. It never hands out references to stored
+//     elements, so it is `Sync` as long as the stored data is `Send`.
+unsafe impl<Lrt> Sync for List<Lrt>
+where
+    Lrt: util::intrusive::Link<Node>,
+    Lrt::Ref: Send,
+{
+}
+
+impl<Lrt> core::default::Default for List<Lrt>
+where
+    Lrt: util::intrusive::Link<Node>,
+{
+    /// Return a new empty list.
+    fn default() -> Self {
+        Self::new()
+    }
+}
+
+impl<Lrt> core::ops::Drop for List<Lrt>
+where
+    Lrt: util::intrusive::Link<Node>,
+{
+    /// Clear a list before dropping it.
+    ///
+    /// This drops all elements in the list via [`List::clear()`], before
+    /// dropping the list. This ensures that the elements of a list are
+    /// not leaked.
+    fn drop(&mut self) {
+        self.clear();
+    }
+}
+
+#[kunit_tests(bus1_util_lll)]
+mod test {
+    use super::*;
+
+    #[derive(Default)]
+    struct Entry {
+        key: u8,
+        node: Node,
+    }
+
+    util::field::impl_pin_field!(Entry, node, Node);
+
+    #[test]
+    fn test_basic() {
+        let e0 = core::pin::pin!(Entry { key: 0, ..Default::default() });
+        let e1 = core::pin::pin!(Entry { key: 1, ..Default::default() });
+
+        let list: List<node_of!(&Entry, node)> = List::new();
+
+        assert!(list.is_empty());
+        assert!(!list.is_sealed());
+        assert!(!e0.node.is_linked());
+        assert!(!e1.node.is_linked());
+
+        list.link_front(e0.as_ref());
+        list.link_front(e1.as_ref());
+
+        assert!(!list.is_empty());
+        assert!(!list.is_sealed());
+        assert!(e0.node.is_linked());
+        assert!(e1.node.is_linked());
+
+        assert!(list.try_link_front(e0.as_ref()).is_err());
+        assert!(list.try_link_front(e1.as_ref()).is_err());
+
+        let mut r = list.clear();
+        assert_eq!(r.unlink_front().unwrap().key, 1);
+        assert_eq!(r.unlink_front().unwrap().key, 0);
+        assert!(r.unlink_front().is_none());
+
+        assert!(list.is_empty());
+        assert!(!list.is_sealed());
+        assert!(!e0.node.is_linked());
+        assert!(!e1.node.is_linked());
+
+        list.link_front(e0.as_ref());
+        assert!(!list.is_empty());
+        assert!(!list.is_sealed());
+        assert!(e0.node.is_linked());
+        assert!(!e1.node.is_linked());
+
+        assert!(!list.is_sealed());
+        let mut r = list.seal();
+        assert!(list.is_empty());
+        assert!(list.is_sealed());
+        assert!(e0.node.is_linked());
+        assert!(!e1.node.is_linked());
+        assert!(list.try_link_front(e0.as_ref()).is_err());
+        assert!(list.try_link_front(e1.as_ref()).is_err());
+
+        assert_eq!(r.unlink_front().unwrap().key, 0);
+        assert!(r.unlink_front().is_none());
+
+        assert!(list.is_empty());
+        assert!(list.is_sealed());
+        assert!(!e0.node.is_linked());
+        assert!(!e1.node.is_linked());
+    }
+}
diff --git a/ipc/bus1/util/mod.rs b/ipc/bus1/util/mod.rs
index bcd6eedff85a..4639e40382c4 100644
--- a/ipc/bus1/util/mod.rs
+++ b/ipc/bus1/util/mod.rs
@@ -10,6 +10,8 @@
 pub mod convert;
 pub mod field;
 pub mod intrusive;
+pub mod lll;
+pub mod slist;
 
 /// Convert an Arc to its pinned version.
 ///
diff --git a/ipc/bus1/util/slist.rs b/ipc/bus1/util/slist.rs
new file mode 100644
index 000000000000..e6ca6b078fb8
--- /dev/null
+++ b/ipc/bus1/util/slist.rs
@@ -0,0 +1,677 @@
+// SPDX-License-Identifier: GPL-2.0
+//! # Intrusive Single-Linked Lists
+//!
+//! This module implements an intrusive single linked list. It follows the
+//! intrusive design described in [`intrusive`](crate::util::intrusive).
+//!
+//! [`List`] represents a single linked list and maintains a pointer to the
+//! first element in a list. It is an owning list, which takes ownership of a
+//! reference to each element stored in the list. Elements must embed a
+//! [`Node`], which is used by the list to store metadata. Nodes effectively
+//! store just a pointer to the next node in the list.
+//!
+//! Since elements of a single linked list do not have a pointer to their
+//! previous element, they generally cannot be unlinked ad-hoc. Instead, they
+//! can only be unlinked during iteration or if they are the first element.
+//! Therefore, this implementation does not provide any way to test list
+//! association in O(1). It is possible to check whether an element is linked
+//! or not, but you cannot check whether it is linked into a specific list.
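+//!
+//! Removing a specific element therefore requires a cursor walk
+//! (illustrative sketch; `Entry` and its `key` field are not part of this
+//! module):
+//!
+//! ```ignore
+//! let mut cursor = list.cursor_mut();
+//! while let Some(e) = cursor.get() {
+//!     if e.key == needle {
+//!         cursor.unlink(); // entry ownership returns to us
+//!         break;
+//!     }
+//!     cursor.move_next();
+//! }
+//! ```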
+
+// XXX: Since `kernel::atomic::Atomic<*mut T>` is not yet stabilized, this
+//     implementation uses `Atomic<usize>` instead, exposing provenance. This
+//     will change once atomic pointers are stabilized.
+
+use core::ptr::NonNull;
+use kernel::prelude::*;
+use kernel::sync::atomic;
+
+use crate::util;
+
+/// Intrusive single linked list to store elements.
+///
+/// A [`List`] is a single-linked list, where each element only knows its
+/// following element. The list maintains a pointer to the first element only.
+///
+/// Elements stored in a [`List`] are owned by that list, but are not moved,
+/// nor allocated by the list. Instead, the list takes ownership of a pointer
+/// to the element (either via smart pointers, or via references that have a
+/// lifetime that exceeds the lifetime of the list). Once an element is
+/// removed, ownership is transferred back to the caller.
+///
+/// [`List`] is intrusive in nature, meaning it relies on metadata on each
+/// element to manage list internals. This metadata must be a member field of
+/// type [`Node`]. The exact member field that is used for a list is provided
+/// via a generic field representing type (`FRT`). All elements stored in a
+/// single list will use the same member field. But a single element can have
+/// multiple different member fields of type [`Node`], and thus be linked into
+/// multiple different lists simultaneously.
+pub struct List<Lrt>
+where
+    Lrt: util::intrusive::Link<Node>,
+{
+    // Pointer to the first node in a single-linked list. This is never `NULL`,
+    // but can be `END`, in which case it marks the end of the list. This
+    // pointer always represents a pinned owned reference to the entry, gained
+    // via `Ref::pin_into_deref()`.
+    // Only uses atomics to allow `CursorMut` to treat it as `Node.next`.
+    first: atomic::Atomic<usize>,
+    // Different lists can store entries of type `Link::Ref` via different
+    // nodes. By pinning the link-representing-type, it is always clear which
+    // node a list is using.
+    _lrt: core::marker::PhantomData<Lrt>,
+}
+
+/// Metadata required for elements of a [`List`].
+///
+/// Every element that is stored in a [`List`] must have a member field of
+/// type [`Node`]. [`List`] is generic over the member field used, so a
+/// single element can have multiple nodes to be stored in multiple different
+/// lists. All elements stored in the same list will use the same member field
+/// for that list, though.
+pub struct Node {
+    // A pointer to the next node. This is `NULL` if the node is unlinked,
+    // `END` if it is the last element in the list. This pointer always
+    // represents a pinned owned reference to the entry, gained via
+    // `Ref::pin_into_deref()`.
+    // This is an atomic to allow acquiring unused nodes. Once acquired, a list
+    // can use non-atomic reads. Writes must be atomic still, to prevent
+    // temporary releases.
+    pub(crate) next: atomic::Atomic<usize>,
+    // List nodes store pointers to other nodes, so nodes must always be pinned
+    // when linked.
+    _pin: core::marker::PhantomPinned,
+}
+
+/// Mutable cursor to move over the elements of a [`List`].
+///
+/// Mutable cursors mutably borrow a list and then allow moving over the list
+/// and accessing the elements. Unlike immutable cursors, mutable cursors allow
+/// linking new elements, and unlinking existing elements.
+///
+/// Single linked lists can only be iterated in one direction, so this cursor
+/// behaves very similar to standard iterators.
+pub struct CursorMut<'list, Lrt>
+where
+    Lrt: util::intrusive::Link<Node>,
+{
+    // Current position of the cursor. Always refers to the pointer to an
+    // element, rather than the element itself, to allow removal of the
+    // element.
+    pos: NonNull<atomic::Atomic<usize>>,
+    // Cursors borrow their list mutably, yet that borrow is never used,
+    // so it is provided as phantom data.
+    _list: core::marker::PhantomData<&'list mut List<Lrt>>,
+}
+
+#[doc(hidden)]
+#[macro_export]
+macro_rules! util_slist_node_of {
+    ($ref:ty, $field:ident $(,)?) => {
+        $crate::util::intrusive::link_of!{$ref, $field, $crate::util::slist::Node}
+    }
+}
+
+/// Alias of [`link_of!()`](util::intrusive::link_of) for [`Node`] members.
+#[doc(inline)]
+pub use util_slist_node_of as node_of;
+
+// Marks the end of a list, to be able to distinguish unlinked nodes from tail
+// nodes. Since the first page of the address space is reserved, this cannot
+// match real nodes.
+pub(crate) const END: usize = core::mem::align_of::<Node>();
+
+impl<Lrt> List<Lrt>
+where
+    Lrt: util::intrusive::Link<Node>,
+{
+    /// Create a new list with the given value for `first`.
+    ///
+    /// # Safety
+    ///
+    /// `first` must either be [`END`] or point to the first node of a list,
+    /// with ownership transferred to the new list.
+    pub(crate) const unsafe fn with(first: usize) -> Self {
+        Self {
+            first: atomic::Atomic::new(first),
+            _lrt: core::marker::PhantomData,
+        }
+    }
+
+    /// Create a new empty list.
+    ///
+    /// The new list has no entries linked and is completely independent of
+    /// other lists.
+    pub const fn new() -> Self {
+        // SAFETY: `END` is trivially allowed as argument.
+        unsafe { Self::with(END) }
+    }
+
+    /// Check whether the list is empty.
+    ///
+    /// This returns `true` if no entries are linked, `false` if at least one
+    /// entry is linked.
+    ///
+    /// Note that the list does not maintain a counter of how many elements are
+    /// linked.
+    pub fn is_empty(&self) -> bool {
+        self.first.load(atomic::Relaxed) == END
+    }
+
+    /// Return a mutable cursor for this list, starting at the front.
+    ///
+    /// Create a new mutable cursor for the list, which initially points at the
+    /// first element. The cursor mutably borrows the list for its entire
+    /// lifetime.
+    pub fn cursor_mut(&mut self) -> CursorMut<'_, Lrt> {
+        CursorMut {
+            pos: util::nonnull_from_ref(&self.first),
+            _list: core::marker::PhantomData,
+        }
+    }
+
+    /// Link a node at the front of the list.
+    ///
+    /// On success, `Ok` is returned and the node is linked at the front of the
+    /// list, with ownership transferred to the list.
+    ///
+    /// If the node is already on another list, this will return `Err` and
+    /// return ownership of the entry to the caller.
+    pub fn try_link_front(
+        &mut self,
+        ent: Pin<Lrt::Ref>,
+    ) -> Result<Pin<&Lrt::Target>, Pin<Lrt::Ref>> {
+        self.cursor_mut().try_link_consume(ent)
+    }
+
+    /// Unlink the first element of the list.
+    ///
+    /// If the list is empty, this will return `None`. Otherwise, the first
+    /// entry is removed from the list and ownership is transferred to the
+    /// caller.
+    pub fn unlink_front(&mut self) -> Option<Pin<Lrt::Ref>> {
+        self.cursor_mut().unlink()
+    }
+
+    /// Clear the list and move ownership of all entries into a closure.
+    ///
+    /// This will invoke `clear_fn` once for each entry in the list. The entry
+    /// is removed and ownership is transferred into the closure.
+    ///
+    /// Entries are removed sequentially starting from the front of the list.
+    pub fn clear_with<ClearFn>(
+        &mut self,
+        mut clear_fn: ClearFn,
+    )
+    where
+        ClearFn: FnMut(Pin<Lrt::Ref>),
+    {
+        while let Some(v) = self.unlink_front() {
+            clear_fn(v);
+        }
+    }
+}
+
+// Convenience helpers
+impl<Lrt> List<Lrt>
+where
+    Lrt: util::intrusive::Link<Node>,
+{
+    /// Link a node at the front of the list.
+    ///
+    /// Works like [`List::try_link_front()`] but panics on error.
+    pub fn link_front(&mut self, ent: Pin<Lrt::Ref>) -> Pin<&Lrt::Target> {
+        self.try_link_front(ent).unwrap_or_else(|_| {
+            panic!("attempting to link a foreign node");
+        })
+    }
+
+    /// Clear the list and drop all elements.
+    ///
+    /// Works like [`List::clear_with()`] but uses `core::mem::drop` as
+    /// closure.
+    pub fn clear(&mut self) {
+        self.clear_with(|_| {})
+    }
+}
+
+// SAFETY: Lists have no interior mutability, nor do they otherwise care for
+//     their calling CPU. They can be freely sent across CPUs, only limited by
+//     the stored type.
+unsafe impl<Lrt> Send for List<Lrt>
+where
+    Lrt: util::intrusive::Link<Node>,
+    Lrt::Ref: Send,
+{
+}
+
+// SAFETY: Lists have no interior mutability, nor do they otherwise care for
+//     their calling CPU. They can be shared across CPUs, only limited by
+//     the stored type.
+unsafe impl<Lrt> Sync for List<Lrt>
+where
+    Lrt: util::intrusive::Link<Node>,
+    Lrt::Ref: Sync,
+{
+}
+
+impl<Lrt> core::default::Default for List<Lrt>
+where
+    Lrt: util::intrusive::Link<Node>,
+{
+    /// Return a new empty list.
+    fn default() -> Self {
+        Self::new()
+    }
+}
+
+impl<Lrt> core::ops::Drop for List<Lrt>
+where
+    Lrt: util::intrusive::Link<Node>,
+{
+    /// Clear a list before dropping it.
+    fn drop(&mut self) {
+        self.clear();
+    }
+}
+
+impl Node {
+    /// Create a new unlinked node.
+    ///
+    /// The new node is marked as unlinked and not associated with any other
+    /// node or list.
+    ///
+    /// Note that nodes must be pinned to be linked into a list. Therefore,
+    /// the result must either be pinned in place or moved into a pinned
+    /// structure to make proper use of it.
+    pub const fn new() -> Self {
+        Self {
+            next: atomic::Atomic::new(0),
+            _pin: core::marker::PhantomPinned,
+        }
+    }
+
+    /// Check whether this node is linked into a list.
+    ///
+    /// This returns `true` if this node is linked into any list. It returns
+    /// `false` if the node is currently unlinked.
+    ///
+    /// Note that a node can be linked into a list at any time. That is, the
+    /// returned value may already be stale by the time the caller observes
+    /// it, unless the caller otherwise ensures exclusive access to the node.
+    /// Furthermore, no memory barriers are guaranteed by this call, so data
+    /// dependence must be considered separately.
+    pub fn is_linked(&self) -> bool {
+        self.next.load(atomic::Relaxed) != 0
+    }
+}
+
+// SAFETY: Nodes are only ever modified through their owning list, or through
+//     atomics. Hence, they can be freely sent across CPUs.
+unsafe impl Send for Node {
+}
+
+// SAFETY: Shared references to a node always use atomics for any data access.
+//     They can be freely shared across CPUs.
+unsafe impl Sync for Node {
+}
+
+impl core::clone::Clone for Node {
+    /// Returns a clean and unlinked node.
+    ///
+    /// Cloning a node always yields a new node that is unlinked and in no way
+    /// tied to the original node.
+    fn clone(&self) -> Self {
+        Self::new()
+    }
+}
+
+impl core::default::Default for Node {
+    /// Create a new unlinked node.
+    fn default() -> Self {
+        Self::new()
+    }
+}
+
+impl core::ops::Drop for Node {
+    /// Drop a node and verify it is unlinked.
+    ///
+    /// No special cleanup is required when dropping nodes. However, linked
+    /// nodes are owned by their respective list. So if a linked node is
+    /// dropped, someone screwed up and this will warn loudly.
+    fn drop(&mut self) {
+        kernel::warn_on!(self.is_linked());
+    }
+}
+
+impl<'list, Lrt> CursorMut<'list, Lrt>
+where
+    Lrt: util::intrusive::Link<Node>,
+{
+    fn as_pos(&self) -> &atomic::Atomic<usize> {
+        // SAFETY: `Self.pos` is always convertible to a reference.
+        unsafe { self.pos.as_ref() }
+    }
+
+    fn get_ptr(&self) -> Option<NonNull<Node>> {
+        let pos = self.as_pos().load(atomic::Relaxed);
+        if pos == END {
+            None
+        } else {
+            // Recreate the pointer with the exposed provenance.
+            let ptr = pos as *mut Node;
+            // SAFETY: NULL is never stored in a list, but only used to denote
+            //     unlinked nodes. Hence, `node` cannot be NULL.
+            Some(unsafe { NonNull::new_unchecked(ptr) })
+        }
+    }
+
+    fn link(
+        &mut self,
+        ent: Pin<Lrt::Ref>,
+    ) -> Result<NonNull<Node>, Pin<Lrt::Ref>> {
+        let pos = self.as_pos();
+        let ent_node = Lrt::acquire(ent);
+
+        // Nothing is dependent on the value of `Node.next`, except for the
+        // fact whether this operation succeeded, so perform a relaxed cmpxchg.
+        // Any data dependence behind `Ref` is ordered on `self` (and `ent`).
+        //
+        // SAFETY: We hold `ent_node` exclusively here, so it is convertible to
+        //     a shared reference for that long.
+        if let Ok(_) = unsafe { ent_node.as_ref() }.next.cmpxchg(
+            0,
+            pos.load(atomic::Relaxed),
+            atomic::Relaxed,
+        ) {
+            // Expose provenance, until `Atomic<*mut T>` is here.
+            let ent_node_addr = ent_node.as_ptr() as usize;
+            // All nodes are owned by the list, so no ordering needed. Atomics
+            // are only used to prevent temporary releases of the nodes.
+            pos.store(ent_node_addr, atomic::Relaxed);
+            Ok(ent_node)
+        } else {
+            // `ent` is already linked into another list, return ownership to
+            // the caller wrapped in an `Err`.
+            //
+            // SAFETY: `ent_node` was just acquired from `acquire()` and is
+            //     no longer used afterwards.
+            Err(unsafe { Lrt::release(ent_node) })
+        }
+    }
+
+    /// Get a reference to the element the cursor points to.
+    ///
+    /// If the cursor points past the last element, `None` is returned.
+    /// Otherwise, a reference to the element is returned.
+    pub fn get(&self) -> Option<Pin<&Lrt::Target>> {
+        let ent_node = self.get_ptr()?;
+        // SAFETY: `ent_node` was taken from the list, and a list ensures
+        //     those were acquired via `Lrt::acquire()` and thus valid
+        //     until released. Since the cursor is immutably borrowed, the
+        //     entry is valid for that lifetime.
+        Some(unsafe { Lrt::borrow(ent_node) })
+    }
+
+    /// Get a clone of the reference to the element the cursor points to.
+    ///
+    /// If the cursor points past the last element, `None` is returned.
+    /// Otherwise, a clone of the reference to the element is returned.
+    pub fn get_clone(&self) -> Option<Pin<Lrt::Ref>>
+    where
+        Lrt::Ref: Clone,
+    {
+        let ent_node = self.get_ptr()?;
+        // SAFETY: `ent_node` was taken from the list, and a list ensures
+        //     those were acquired via `Lrt::acquire()` and thus valid
+        //     until released. Since the cursor is immutably borrowed, the
+        //     entry is valid for that lifetime.
+        Some(unsafe { Lrt::borrow_clone(ent_node) })
+    }
+
+    /// Get a mutable reference to the element the cursor points to.
+    ///
+    /// If the cursor points past the last element, `None` is returned.
+    /// Otherwise, a mutable reference to the element is returned.
+    pub fn get_mut(
+        &mut self,
+    ) -> Option<Pin<&mut <Lrt as util::intrusive::Link<Node>>::Target>>
+    where
+        Lrt: util::intrusive::LinkMut<Node>,
+    {
+        let ent_node = self.get_ptr()?;
+        // SAFETY: `ent_node` was taken from the list, and a list ensures
+        //     those were acquired via `Lrt::acquire()` and thus valid
+        //     until released. Since the cursor is mutably borrowed, the
+        //     entry is valid for that lifetime.
+        //     Since the list owns an `Lrt::Ref` and is borrowed mutably, we
+        //     can get a mutable reference to the reference target.
+        Some(unsafe { Lrt::borrow_mut(ent_node) })
+    }
+
+    /// Move the cursor to the next element.
+    ///
+    /// If the cursor already points past the last element, this is a no-op.
+    /// Otherwise, the cursor is moved to the next element.
+    pub fn move_next(&mut self) {
+        if let Some(ent_node) = self.get_ptr() {
+            // SAFETY: `ent_node` was taken from the list, and a list ensures
+            //     those were acquired via `Lrt::acquire()` and thus valid
+            //     until released. Since cursors always move forward, the entry
+            //     is valid until destruction of the cursor.
+            self.pos = unsafe {
+                NonNull::new_unchecked(
+                    (&raw const (*ent_node.as_ptr()).next).cast_mut(),
+                )
+            };
+        }
+    }
+
+    /// Link a node at the cursor position.
+    ///
+    /// On success, `Ok` is returned and the node is linked at the cursor
+    /// position (i.e., the cursor points to the node), with ownership
+    /// transferred to the list.
+    ///
+    /// If the node is already on another list, this will return `Err` and
+    /// return ownership of the entry to the caller. The cursor and list remain
+    /// unmodified.
+    pub fn try_link(
+        &mut self,
+        ent: Pin<Lrt::Ref>,
+    ) -> Result<Pin<&Lrt::Target>, Pin<Lrt::Ref>> {
+        self.link(ent).map(|v| {
+            // SAFETY: The entry is convertible to a pinned shared reference
+            //     for as long as we do not call `Ref::release()`. By holding
+            //     `self`, we prevent `Cursor` from doing so.
+            unsafe { Lrt::borrow(v) }
+        })
+    }
+
+    /// Unlink the current element without moving the cursor.
+    ///
+    /// If the cursor points past the last element, this is a no-op and returns
+    /// `None`. Otherwise, the element is unlinked and ownership
+    /// transferred to the caller. The cursor now points to the following
+    /// element.
+    pub fn unlink(&mut self) -> Option<Pin<Lrt::Ref>> {
+        let ent_node = self.get_ptr()?;
+
+        // Borrow `ent_node` as reference. Update the list position to skip
+        // the node and then update the node to be marked as unlinked.
+        {
+            // SAFETY: `ent_node` was taken from the list, and a list ensures
+            //     those were acquired via `Lrt::acquire()` and thus valid
+            //     until released. A temporary conversion to reference is thus
+            //     safe.
+            let ent_node_r = unsafe { ent_node.as_ref() };
+
+            // Unlink the node from the list. The load could be non-atomic,
+            // since the node is owned by the list. The store must be atomic,
+            // to ensure no temporary releases of the node.
+            // No data dependence, since everything is still list-owned and
+            // ordered through `self._list`.
+            self.as_pos().store(
+                ent_node_r.next.load(atomic::Relaxed),
+                atomic::Relaxed,
+            );
+
+            // Release the node. No ordering required, since any data
+            // dependence is either ordered on `self` or up to the caller.
+            ent_node_r.next.store(0, atomic::Relaxed);
+        }
+
+        // SAFETY: `ent_node` was taken from the list, and thus guaranteed
+        //     to be acquired via `Lrt::acquire()`. Since the previous entry
+        //     was updated to point to the next, the pointer is no longer
+        //     stored in the list.
+        Some(unsafe { Lrt::release(ent_node) })
+    }
+}
+
+// Convenience helpers
+impl<'list, Lrt> CursorMut<'list, Lrt>
+where
+    Lrt: util::intrusive::Link<Node>,
+{
+    /// Consume the cursor and return the element it pointed to.
+    ///
+    /// Works like [`Self::get()`], but consumes the cursor and can thus
+    /// return a borrow for `'list`.
+    pub fn get_consume(self) -> Option<Pin<&'list Lrt::Target>> {
+        let ent_node = self.get_ptr()?;
+        drop(self);
+        // SAFETY: `ent_node` was taken from the list, and a list ensures
+        //     those were acquired via `Lrt::acquire()` and thus valid
+        //     until released. Since the cursor was consumed, the entry is
+        //     valid for `'list`.
+        Some(unsafe { Lrt::borrow(ent_node) })
+    }
+
+    /// Get a reference to the element the cursor points to and move forward.
+    ///
+    /// Works like [`Self::get()`] but calls [`Self::move_next()`] afterwards,
+    /// thus returning a longer borrow.
+    pub fn get_and_move_next(&mut self) -> Option<Pin<&'list Lrt::Target>> {
+        let ent_node = self.get_ptr()?;
+        self.move_next();
+        // SAFETY: `ent_node` was taken from the list, and a list ensures
+        //     those were acquired via `Lrt::acquire()` and thus valid
+        //     until released. Since the cursor was moved forward and can never
+        //     move backwards, this node is valid for `'list`.
+        Some(unsafe { Lrt::borrow(ent_node) })
+    }
+
+    /// Link a node at the cursor position, consuming the cursor.
+    ///
+    /// Works like [`Self::try_link()`] but consumes the cursor, thus returning
+    /// a longer borrow.
+    pub fn try_link_consume(
+        mut self,
+        ent: Pin<Lrt::Ref>,
+    ) -> Result<Pin<&'list Lrt::Target>, Pin<Lrt::Ref>> {
+        match self.link(ent) {
+            Ok(v) => {
+                drop(self);
+                // SAFETY: The entry is convertible to a pinned shared
+                //     reference for as long as we do not call
+                //     `Ref::release()`. By dropping `self`, no-one can modify
+                //     the tree for as long as `'list`.
+                Ok(unsafe { Lrt::borrow(v) })
+            },
+            Err(v) => Err(v),
+        }
+    }
+}
+
+#[kunit_tests(bus1_util_slist)]
+mod test {
+    use super::*;
+
+    #[derive(Default)]
+    struct Entry {
+        key: u8,
+        node: Node,
+    }
+
+    util::field::impl_pin_field!(Entry, node, Node);
+
+    // Create a list that stores shared references. This allows access to
+    // the elements even if stored in the list. Once the list is dropped,
+    // mutable access to the elements is possible again.
+    #[test]
+    fn shared_refs() {
+        let mut e0 = core::pin::pin!(Entry { key: 0, ..Default::default() });
+        let mut e1 = core::pin::pin!(Entry { key: 1, ..Default::default() });
+
+        let mut list: List<node_of!(&Entry, node)> = List::new();
+
+        assert!(list.is_empty());
+        assert!(!e0.node.is_linked());
+        assert!(!e1.node.is_linked());
+
+        list.link_front(e0.as_ref());
+        list.link_front(e1.as_ref());
+
+        assert!(!list.is_empty());
+        assert!(e0.node.is_linked());
+        assert!(e1.node.is_linked());
+
+        assert!(list.try_link_front(e0.as_ref()).is_err());
+        assert!(list.try_link_front(e1.as_ref()).is_err());
+
+        let mut c = list.cursor_mut();
+        assert_eq!(c.get().unwrap().key, 1);
+        assert_eq!(c.get_and_move_next().unwrap().key, 1);
+        assert_eq!(c.get().unwrap().key, 0);
+        assert_eq!(c.get_and_move_next().unwrap().key, 0);
+        assert!(c.get().is_none());
+
+        assert_eq!(list.unlink_front().unwrap().key, 1);
+        assert_eq!(list.unlink_front().unwrap().key, 0);
+
+        assert!(list.unlink_front().is_none());
+        assert!(list.is_empty());
+
+        drop(list);
+        assert!(!e0.as_mut().node.is_linked());
+        assert!(!e1.as_mut().node.is_linked());
+    }
+
+    // Create a `List` that stores mutable references. This prevents any use of
+    // the entries while linked. But once the list is dropped, they can be used
+    // again.
+    #[test]
+    fn mutable_refs() {
+        let mut e0 = core::pin::pin!(Entry { key: 0, ..Default::default() });
+        let mut e1 = core::pin::pin!(Entry { key: 1, ..Default::default() });
+
+        let mut list: List<node_of!(&mut Entry, node)> = List::new();
+
+        assert!(list.is_empty());
+        assert!(!e0.node.is_linked());
+        assert!(!e1.node.is_linked());
+
+        list.link_front(e0.as_mut());
+        list.link_front(e1.as_mut());
+
+        assert!(!list.is_empty());
+
+        let mut c = list.cursor_mut();
+        assert_eq!(c.get_mut().unwrap().key, 1);
+        c.move_next();
+        assert_eq!(c.get_mut().unwrap().key, 0);
+        c.move_next();
+        assert!(c.get().is_none());
+
+        let v = list.unlink_front().unwrap();
+        assert_eq!(v.key, 1);
+        let v = list.unlink_front().unwrap();
+        assert_eq!(v.key, 0);
+
+        assert!(list.unlink_front().is_none());
+        assert!(list.is_empty());
+
+        drop(list);
+        assert!(!e0.node.is_linked());
+        assert!(!e1.node.is_linked());
+    }
+}
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC 13/16] bus1/util: add intrusive rb-tree
  2026-03-31 19:02 [RFC 00/16] bus1: Capability-based IPC for Linux David Rheinsberg
                   ` (11 preceding siblings ...)
  2026-03-31 19:03 ` [RFC 12/16] bus1/util: add intrusive single linked lists David Rheinsberg
@ 2026-03-31 19:03 ` David Rheinsberg
  2026-03-31 19:43   ` Miguel Ojeda
  2026-03-31 19:03 ` [RFC 14/16] bus1/acct: add resouce accounting David Rheinsberg
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 33+ messages in thread
From: David Rheinsberg @ 2026-03-31 19:03 UTC (permalink / raw)
  To: rust-for-linux; +Cc: teg, Miguel Ojeda, David Rheinsberg

Add `util::rb`, an intrusive RB-Tree using `util::intrusive` for the
API, and `linux/rbtree.h` for the implementation.

The API is designed for very easy use, without requiring any unsafe code
from a user. It tracks ownership via a simple atomic, and can thus
assert collection association in O(1) in a completely safe manner.

Unlike the owning version of RB-Trees in `kernel::rbtree`, the intrusive
version clearly documents the node<->collection relationship in
data-structures, avoids double pointer-chases in traversals, and can be
used by bus1 to queue release/destruction notifications without fallible
allocations.

Signed-off-by: David Rheinsberg <david@readahead.eu>
---
 ipc/bus1/util/mod.rs |    1 +
 ipc/bus1/util/rb.rs  | 1324 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 1325 insertions(+)
 create mode 100644 ipc/bus1/util/rb.rs

diff --git a/ipc/bus1/util/mod.rs b/ipc/bus1/util/mod.rs
index 4639e40382c4..833f86d8ccbe 100644
--- a/ipc/bus1/util/mod.rs
+++ b/ipc/bus1/util/mod.rs
@@ -11,6 +11,7 @@
 pub mod field;
 pub mod intrusive;
 pub mod lll;
+pub mod rb;
 pub mod slist;
 
 /// Convert an Arc to its pinned version.
diff --git a/ipc/bus1/util/rb.rs b/ipc/bus1/util/rb.rs
new file mode 100644
index 000000000000..52cd7bf714bb
--- /dev/null
+++ b/ipc/bus1/util/rb.rs
@@ -0,0 +1,1324 @@
+// SPDX-License-Identifier: GPL-2.0
+//! # Intrusive Red-Black Trees
+//!
+//! This module implements an intrusive Red-Black Tree. Internally, it uses the
+//! common infrastructure provided by the C implementation `lib/rbtree.c` and
+//! is designed to work in a very similar manner.
+//!
+//! The entire API is meant to be completely safe to use. However, in the C API
+//! you cannot assert whether a node is attached to a specific instance of a
+//! tree. This makes ad-hoc operations unsafe or expensive, since object
+//! ownership has to be verified. Therefore, this implementation extends all
+//! nodes with a tag that asserts ownership and thus circumvents this
+//! restriction. The cost is an additional pointer-sized field in each node.
+//!
+//! The API is designed to be very similar to the API of common Rust
+//! collections that own their entries (e.g., `alloc::collections::BTreeMap`).
+//! That is, [`Tree`] is the entry-point of every rb-tree operation, and it
+//! owns all entries stored in this tree. However, unlike the standard Rust
+//! collections, [`Tree`] only stores smart-pointers, and never allocates or
+//! moves entries. Instead, it relies on
+//! [`IntoDeref`](crate::util::convert::IntoDeref) to convert smart pointers
+//! into raw pointers. Furthermore, it uses an intrusive design, so it relies
+//! on metadata on the nodes to link/unlink. It uses field projections to be
+//! generic over where this metadata is stored (see
+//! [`Field`](crate::util::field::Field) for details).
+
+use core::ptr::NonNull;
+use kernel::prelude::*;
+use kernel::sync::atomic;
+
+use crate::util;
+
+/// Red-Black Tree that stores and manages elements.
+///
+/// A [`Tree`] can be used to link and unlink elements, and thus transfer
+/// ownership of an element into a tree. Those elements can then be searched
+/// for, or can be iterated, similar to other standard Rust collections.
+///
+/// Elements stored in a [`Tree`] are owned by that tree, but are not moved,
+/// nor allocated by the tree. Instead, the tree takes ownership of a pointer
+/// to the element (either via smart pointers, or via references that have a
+/// lifetime that exceeds the lifetime of the tree). Once an element is
+/// removed, ownership is transferred back to the caller.
+///
+/// Trees are intrusive in nature, meaning they rely on metadata on each
+/// element to manage tree internals. This metadata must be a member field of
+/// type [`Node`]. The exact member field that is used for a tree is provided
+/// via a generic link type (`Lrt`). All elements stored in a
+/// single tree will use the same member field. But a single element can have
+/// multiple different member fields of type [`Node`], and thus be linked into
+/// multiple different trees simultaneously.
+pub struct Tree<Lrt>
+where
+    Lrt: util::intrusive::Link<Node>,
+{
+    // Rb-tree metadata for the entire tree. In its most basic form, this is
+    // just a pointer to the root node (but caching variants are an option to
+    // improve lookup of the first and last entry).
+    root: kernel::bindings::rb_root,
+    // We need a unique identifier to store as `owner` marker in nodes. The
+    // address of the owning tree is a reasonable default that can be acquired
+    // without external interference. Downside is that the tree needs to be
+    // pinned. Since pinning is ubiquitous in kernel APIs, this seems like an
+    // acceptable price to pay.
+    _pin: core::marker::PhantomPinned,
+    // Different trees can store entries of type `Ref` via different nodes. By
+    // pinning the field-representing-type, it is always clear which node a
+    // tree is using.
+    _lrt: core::marker::PhantomData<Lrt>,
+}
+
+/// Red-Black Tree metadata required on each element.
+///
+/// Every element that is stored in a [`Tree`] must have a member field of type
+/// [`Node`]. A [`Tree`] is generic over the member field used, so a single
+/// element can have multiple nodes to be stored in multiple different trees.
+/// All elements stored in the same tree will use the same member field for
+/// that tree, though.
+pub struct Node {
+    // Since this is an owning intrusive collection, we carefully ensure to
+    // never create multiple references to a single node. With a non-owning
+    // intrusive collection, we would need an `UnsafePinned` here, to ensure
+    // references retained by the caller do not alias with references created
+    // during tree introspection or manipulation. Yet, for owning collections
+    // this is unnecessary.
+    // We still use `Opaque` over `UnsafeCell` here, since interior mutability
+    // is required, and we really don't want to rely too much on the
+    // implementation details of `rbtree.c`. And we get `UnsafePinned` that way
+    // as well, so future non-owning extensions to this API would need no
+    // adjustments.
+    bindings: kernel::types::Opaque<kernel::bindings::rb_node>,
+    // The owner field is a tag that uniquely identifies the tree that owns the
+    // entry. Anything could be used as tag, but we decided on the tree address
+    // as it is trivially unique for each pinned tree.
+    // A value of 0 means the entry is unlinked. Any other value marks the
+    // entry as owned by the tree with the given tag. The value can only be set
+    // or cleared when holding a mutable reference to the owning tree. Acquire
+    // and Release semantics are guaranteed by any link/unlink functions, so
+    // entries can be moved from one tree to another, even across tasks.
+    owner: atomic::Atomic<usize>,
+    // RB-Tree node bindings store pointers to other nodes, so nodes must
+    // always be pinned when linked.
+    _pin: core::marker::PhantomPinned,
+}
+
+enum CursorPos {
+    Empty,
+    Front(NonNull<Node>),
+    At(NonNull<Node>),
+    Back(NonNull<Node>),
+}
+
+/// Immutable cursor over entries of an RB-Tree.
+///
+/// This cursor either points at an empty tree, directly at a node in a
+/// non-empty tree, before the first node, or after the last node. The cursor
+/// can be moved back and forth.
+pub struct Cursor<'tree, Lrt>
+where
+    Lrt: util::intrusive::Link<Node>,
+{
+    _tree: &'tree Tree<Lrt>,
+    pos: CursorPos,
+}
+
+/// Mutable cursor over entries of an RB-Tree.
+///
+/// This cursor either points at an empty tree, directly at a node in a
+/// non-empty tree, before the first node, or after the last node. The cursor
+/// can be moved back and forth, and elements can be inserted and removed at
+/// will.
+pub struct CursorMut<'tree, Lrt>
+where
+    Lrt: util::intrusive::Link<Node>,
+{
+    tree: Pin<&'tree mut Tree<Lrt>>,
+    pos: CursorPos,
+}
+
+/// Mutable slot in an RB-Tree.
+///
+/// This slot points at a location in an RB-Tree, which can either be an
+/// existing element, or an empty slot. It is usually obtained by searching
+/// a tree for a specific key. If an element matching the key is found, the
+/// slot of that element is returned. Otherwise, an empty slot suitable for
+/// insertion of an element with such a key is returned.
+///
+/// Slots mutably borrow the tree they reference, and as such allow insertion
+/// of new elements into the tree.
+pub struct Slot<'tree, Lrt>
+where
+    Lrt: util::intrusive::Link<Node>,
+{
+    tree: Pin<&'tree mut Tree<Lrt>>,
+    anchor: *mut kernel::bindings::rb_node,
+    slot: *mut *mut kernel::bindings::rb_node,
+}
+
+#[doc(hidden)]
+#[macro_export]
+macro_rules! util_rb_node_of {
+    ($ref:ty, $field:ident $(,)?) => {
+        $crate::util::intrusive::link_of!{$ref, $field, $crate::util::rb::Node}
+    }
+}
+
+/// Alias of [`link_of!()`](util::intrusive::link_of) for [`Node`] members.
+#[doc(inline)]
+pub use util_rb_node_of as node_of;
+
+impl<Lrt> Tree<Lrt>
+where
+    Lrt: util::intrusive::Link<Node>,
+{
+    /// Create a new empty tree.
+    ///
+    /// Note that empty trees do not have to be pinned, but to insert elements
+    /// into a tree, it must be pinned. Therefore, the returned tree must be
+    /// either pinned in place, or moved into a pinned structure to be used for
+    /// insertions.
+    pub fn new() -> Self {
+        Self {
+            root: kernel::bindings::rb_root {
+                rb_node: core::ptr::null_mut(),
+            },
+            _pin: core::marker::PhantomPinned,
+            _lrt: core::marker::PhantomData,
+        }
+    }
+
+    fn panic_acquire(_v: Pin<Lrt::Ref>) -> ! {
+        core::panic!("attempting to link a foreign node");
+    }
+
+    fn panic_claim(_v: Pin<&Lrt::Target>) -> ! {
+        core::panic!("attempting to claim a foreign node");
+    }
+
+    // Return the memory address of this tree as an integer. This is used to
+    // tag nodes belonging to this tree.
+    //
+    // Note that the address of a tree is only stable if the tree is pinned.
+    // Since all modifications to a tree take a `Pin<&mut Self>`, this is
+    // given. We avoid taking a pinned tree here, to allow read-only queries to
+    // work with `&self` for simpler APIs. It is up to the caller to ensure
+    // that the result is used coherently.
+    //
+    // Also note that the drop handler of a tree ensures all entries are
+    // unlinked before a tree is unpinned. Therefore, owner tags are usually
+    // never used longer than a tree is pinned.
+    fn as_owner(&self) -> usize {
+        util::ptr_addr(core::ptr::from_ref(self))
+    }
+
+    fn root_mut(self: Pin<&mut Self>) -> &mut kernel::bindings::rb_root {
+        // SAFETY: `Self.root` is not structurally pinned.
+        unsafe { &mut Pin::into_inner_unchecked(self).root }
+    }
+
+    /// Check whether the tree is empty.
+    ///
+    /// This returns `true` if no entries are linked, `false` if at least one
+    /// entry is linked.
+    ///
+    /// Note that the tree does not maintain a counter of how many elements are
+    /// linked.
+    pub fn is_empty(&self) -> bool {
+        self.root.rb_node.is_null()
+    }
+
+    /// Check whether this tree contains a given node.
+    ///
+    /// This returns `true` if the node is linked into this tree. It returns
+    /// `false` if the node is currently unlinked or linked into another tree.
+    ///
+    /// This operation is performed in O(1), regardless of the number of
+    /// elements in the tree.
+    ///
+    /// While holding a reference to [`Tree`], no node can be linked to, or
+    /// unlinked from, the tree. Hence, unlike [`Node::is_linked()`] the return
+    /// value of this method is stable for at least the lifetime of `self`.
+    pub fn contains(
+        &self,
+        ent_target: &Lrt::Target,
+    ) -> bool {
+        let ent_node = Lrt::project(ent_target);
+        // SAFETY: `ent_node` is convertible to a reference.
+        let v = unsafe {
+            (*Node::owner(ent_node)).load(atomic::Relaxed)
+        };
+        v == self.as_owner()
+    }
+
+    /// Create an immutable cursor at the first element.
+    ///
+    /// The new cursor will point to the first element. If the tree is empty,
+    /// the cursor points to no element at all.
+    pub fn cursor_first(&self) -> Cursor<'_, Lrt> {
+        // SAFETY: `rb_first()` requires a pointer to a valid root, pointing to
+        // valid nodes. This invariant is maintained by `Tree`.
+        let pos = if let Some(v) = NonNull::new(unsafe {
+            kernel::bindings::rb_first(&self.root)
+        }) {
+            // SAFETY: `v` points to a valid entry in the tree, thus it is
+            // embedded in a `Node`.
+            CursorPos::At(unsafe { Node::from_rb(v) })
+        } else {
+            CursorPos::Empty
+        };
+
+        Cursor {
+            _tree: self,
+            pos,
+        }
+    }
+
+    /// Create an immutable cursor at the last element.
+    ///
+    /// The new cursor will point to the last element. If the tree is empty,
+    /// the cursor points to no element at all.
+    pub fn cursor_last(&self) -> Cursor<'_, Lrt> {
+        // SAFETY: `rb_last()` requires a pointer to a valid root, pointing to
+        // valid nodes. This invariant is maintained by `Tree`.
+        let pos = if let Some(v) = NonNull::new(unsafe {
+            kernel::bindings::rb_last(&self.root)
+        }) {
+            // SAFETY: `v` points to a valid entry in the tree, thus it is
+            // embedded in a `Node`.
+            CursorPos::At(unsafe { Node::from_rb(v) })
+        } else {
+            CursorPos::Empty
+        };
+
+        Cursor {
+            _tree: self,
+            pos,
+        }
+    }
+
+    /// Create a mutable cursor at the first element.
+    ///
+    /// The new cursor will point to the first element. If the tree is empty,
+    /// the cursor points to no element at all.
+    pub fn cursor_mut_first(
+        mut self: Pin<&mut Self>,
+    ) -> CursorMut<'_, Lrt> {
+        // SAFETY: `rb_first()` requires a pointer to a valid root, pointing to
+        // valid nodes. This invariant is maintained by `Tree`.
+        let pos = if let Some(v) = NonNull::new(unsafe {
+            kernel::bindings::rb_first(
+                self.as_mut().root_mut(),
+            )
+        }) {
+            // SAFETY: `v` points to a valid entry in the tree, thus it is
+            // embedded in a `Node`.
+            CursorPos::At(unsafe { Node::from_rb(v) })
+        } else {
+            CursorPos::Empty
+        };
+
+        CursorMut {
+            tree: self,
+            pos,
+        }
+    }
+
+    /// Create a mutable cursor at the last element.
+    ///
+    /// The new cursor will point to the last element. If the tree is empty,
+    /// the cursor points to no element at all.
+    pub fn cursor_mut_last(
+        mut self: Pin<&mut Self>,
+    ) -> CursorMut<'_, Lrt> {
+        // SAFETY: `rb_last()` requires a pointer to a valid root, pointing to
+        // valid nodes. This invariant is maintained by `Tree`.
+        let pos = if let Some(v) = NonNull::new(unsafe {
+            kernel::bindings::rb_last(
+                self.as_mut().root_mut(),
+            )
+        }) {
+            // SAFETY: `v` points to a valid entry in the tree, thus it is
+            // embedded in a `Node`.
+            CursorPos::At(unsafe { Node::from_rb(v) })
+        } else {
+            CursorPos::Empty
+        };
+
+        CursorMut {
+            tree: self,
+            pos,
+        }
+    }
+
+    /// Try creating a mutable cursor at an explicit element.
+    ///
+    /// This tries to create a [`CursorMut`] at the given element. If the
+    /// element is not linked in this tree, this will return `None` instead.
+    pub fn try_cursor_mut_at(
+        mut self: Pin<&mut Self>,
+        ent_target: Pin<&Lrt::Target>,
+    ) -> Option<CursorMut<'_, Lrt>> {
+        let ent_node = Lrt::project(&ent_target);
+        // SAFETY: `ent_node` points to a valid allocation.
+        let v = unsafe {
+            (*Node::owner(ent_node)).load(atomic::Relaxed)
+        };
+        if v == self.as_mut().as_owner() {
+            Some(CursorMut {
+                tree: self,
+                pos: CursorPos::At(ent_node),
+            })
+        } else {
+            None
+        }
+    }
+
+    /// Find a slot in the tree.
+    ///
+    /// Search through the tree with the given comparison function, looking for
+    /// a specific slot. Regardless whether the slot is occupied or not, this
+    /// will return a `Slot` object.
+    ///
+    /// This will perform a search through the binary tree from root to leaf,
+    /// using `cmp_fn` on each node. `cmp_fn` can be chosen freely, but should
+    /// preferably implement a partial order to ensure a coherent tree order.
+    pub fn find_slot_by<CmpFn>(
+        mut self: Pin<&mut Self>,
+        mut cmp_fn: CmpFn,
+    ) -> Slot<'_, Lrt>
+    where
+        CmpFn: FnMut(Pin<&Lrt::Target>) -> core::cmp::Ordering,
+    {
+        let mut anchor: *mut kernel::bindings::rb_node;
+        let mut slot: &mut *mut kernel::bindings::rb_node;
+
+        anchor = core::ptr::null_mut();
+        slot = &mut self.as_mut().root_mut().rb_node;
+
+        while let Some(mut ent_rb) = NonNull::new(*slot) {
+            // SAFETY: All rb-entries in a tree always refer to a valid
+            //     rb-entry within a valid node.
+            let ent_node = unsafe { Node::from_rb(ent_rb) };
+            // SAFETY: All nodes in a tree always refer to a valid node
+            //     within a reference target.
+            let ent_target = unsafe { Lrt::borrow(ent_node) };
+
+            slot = match cmp_fn(ent_target) {
+                core::cmp::Ordering::Less => {
+                    // SAFETY: `ent_rb` points to a valid node and no other
+                    //     references to it exist.
+                    unsafe { &mut ent_rb.as_mut().rb_left }
+                },
+                core::cmp::Ordering::Greater => {
+                    // SAFETY: `ent_rb` points to a valid node and no other
+                    //     references to it exist.
+                    unsafe { &mut ent_rb.as_mut().rb_right }
+                },
+                core::cmp::Ordering::Equal => break,
+            };
+            anchor = ent_rb.as_ptr();
+        }
+
+        Slot {
+            anchor,
+            slot,
+            tree: self,
+        }
+    }
+
+    /// Remove all entries from a tree.
+    ///
+    /// Clear the entire tree and pass ownership of each entry by invoking
+    /// `clear_fn`.
+    ///
+    /// This will iterate the tree in postorder, without rebalancing. Hence,
+    /// this is significantly faster than clearing a tree via [`CursorMut`].
+    pub fn clear_with<ClearFn>(
+        mut self: Pin<&mut Self>,
+        mut clear_fn: ClearFn,
+    )
+    where
+        ClearFn: FnMut(Pin<Lrt::Ref>),
+    {
+        let mut anchor: *mut kernel::bindings::rb_node;
+
+        // SAFETY: `rb_first_postorder()` requires a pointer to a valid root,
+        //     pointing to valid nodes. This invariant is maintained by `Tree`.
+        anchor = unsafe {
+            kernel::bindings::rb_first_postorder(
+                self.as_mut().root_mut(),
+            )
+        };
+
+        // Clear the tree, so it is considered empty. Since nodes do not
+        // contain pointers to the root, this cannot affect the postorder
+        // traversal below.
+        self.as_mut().root_mut().rb_node = core::ptr::null_mut();
+
+        while let Some(ent_rb) = NonNull::new(anchor) {
+            // SAFETY: Same as for `rb_first_postorder()` above, but only cares
+            //     for elements following it, not any elements preceding it.
+            //     Since we call it before calling `clear_fn`, it is safe.
+            anchor = unsafe {
+                kernel::bindings::rb_next_postorder(ent_rb.as_ptr())
+            };
+
+            // SAFETY: `ent_rb` is a valid rb-entry in a valid node.
+            let ent_node = unsafe { Node::from_rb(ent_rb) };
+            // SAFETY: `ent_node` is a valid node.
+            unsafe { (*Node::owner(ent_node)).store(0, atomic::Release) };
+            // SAFETY: `ent_node` is a valid node in a reference target.
+            let ent = unsafe { Lrt::release(ent_node) };
+            clear_fn(ent);
+        }
+    }
+}
+
+// Convenience helpers
+impl<Lrt> Tree<Lrt>
+where
+    Lrt: util::intrusive::Link<Node>,
+{
+    /// Create a mutable cursor at an explicit element.
+    ///
+    /// Works like [`Tree::try_cursor_mut_at()`] but panics if the element is
+    /// not linked into this tree.
+    pub fn cursor_mut_at(
+        self: Pin<&mut Self>,
+        ent_target: Pin<&Lrt::Target>,
+    ) -> CursorMut<'_, Lrt> {
+        self.try_cursor_mut_at(ent_target).unwrap_or_else(
+            || Self::panic_claim(ent_target),
+        )
+    }
+
+    /// Try linking a new entry in this tree.
+    ///
+    /// This combines [`Tree::find_slot_by()`] with [`Slot::try_link()`].
+    ///
+    /// On success, the ownership of the new entry is transferred to the tree
+    /// and the dereferenced entry is returned as a borrow wrapped in `Ok`. If
+    /// there either already is a matching entry linked into the tree, or if
+    /// `ent` is already linked into any tree, this function will fail and
+    /// return ownership of the new entry to the caller wrapped in `Err`.
+    ///
+    /// If this function fails, the slot is immediately dropped. If a retry
+    /// attempt is desired, the slot should be retrieved via `find_slot_by()`
+    /// instead. This avoids repeated traversals.
+    ///
+    /// ## Comparator
+    ///
+    /// The comparator `cmp_fn` gets the dereferenced to-be-linked entry `ent`
+    /// as first argument and an existing entry in the tree as second argument.
+    /// It shall return an order describing the relationship of the first
+    /// argument compared to the second.
+    pub fn try_link_by<CmpFn>(
+        self: Pin<&mut Self>,
+        ent: Pin<Lrt::Ref>,
+        mut cmp_fn: CmpFn,
+    ) -> Result<Pin<&'_ Lrt::Target>, Pin<Lrt::Ref>>
+    where
+        CmpFn: FnMut(&Pin<Lrt::Ref>, Pin<&Lrt::Target>) -> core::cmp::Ordering,
+    {
+        self.find_slot_by(
+            |other| cmp_fn(&ent, other),
+        ).try_link(ent)
+    }
+
+    /// Try linking a new entry in this tree, after all duplicates.
+    ///
+    /// Works like [`Tree::try_link_by()`] but will link the entry even if there
+    /// are duplicates with the same key. The entry will be linked after all
+    /// duplicates.
+    pub fn try_link_last_by<CmpFn>(
+        self: Pin<&mut Self>,
+        ent: Pin<Lrt::Ref>,
+        mut cmp_fn: CmpFn,
+    ) -> Result<Pin<&'_ Lrt::Target>, Pin<Lrt::Ref>>
+    where
+        CmpFn: FnMut(&Pin<Lrt::Ref>, Pin<&Lrt::Target>) -> core::cmp::Ordering,
+    {
+        self.find_slot_by(|other| match cmp_fn(&ent, other) {
+            core::cmp::Ordering::Equal => core::cmp::Ordering::Greater,
+            v => v,
+        }).try_link(ent)
+    }
+
+    /// Try unlinking a specific element from the tree.
+    ///
+    /// This chains [`Tree::try_cursor_mut_at()`] and [`Slot::try_unlink()`].
+    ///
+    /// This returns the element if it was unlinked from the tree, or `None` if
+    /// the element was not linked into this tree.
+    pub fn try_unlink(
+        self: Pin<&mut Self>,
+        ent_target: Pin<&Lrt::Target>,
+    ) -> Option<Pin<Lrt::Ref>> {
+        self.try_cursor_mut_at(ent_target).and_then(|v| {
+            v.try_unlink()
+        })
+    }
+
+    /// Unlink a specific element from the tree.
+    ///
+    /// Works like [`Tree::try_unlink()`] but panics if the entry was not
+    /// linked into this tree.
+    pub fn unlink(
+        self: Pin<&mut Self>,
+        ent_target: Pin<&Lrt::Target>,
+    ) -> Pin<Lrt::Ref> {
+        self.try_unlink(ent_target).unwrap_or_else(
+            || core::panic!("attempting to unlink foreign entry"),
+        )
+    }
+
+    /// Remove all entries from a tree.
+    ///
+    /// Works like [`Tree::clear_with()`] but uses [`core::mem::drop()`] as
+    /// callback.
+    pub fn clear(self: Pin<&mut Self>) {
+        self.clear_with(|_| {});
+    }
+}
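The `Equal`-to-`Greater` remap in `try_link_last_by()` above is what pushes a new entry past all duplicates. The same trick can be shown on a plain slice with a standalone binary search — an illustrative sketch only, not the kernel tree API. Note the comparator direction is flipped here (`binary_search_by` compares the existing element to the new key, while the tree's `cmp_fn` compares the new entry to the existing one), so the equivalent remap is `Equal` to `Less`:

```rust
/// Return the index after all entries equal to `new` in a sorted slice.
///
/// Mapping `Equal` to `Less` makes every duplicate compare as "before" the
/// new key, so the search always fails and its `Err` index is the slot
/// directly after the last duplicate.
fn insertion_point_after_duplicates(keys: &[u8], new: u8) -> usize {
    keys.binary_search_by(|existing| match existing.cmp(&new) {
        core::cmp::Ordering::Equal => core::cmp::Ordering::Less,
        v => v,
    })
    .unwrap_err()
}
```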
+
+// SAFETY: Trees have no interior mutability, nor do they otherwise care for
+//     their calling CPU. They can be freely sent across CPUs, only limited by
+//     the stored type.
+unsafe impl<Lrt> Send for Tree<Lrt>
+where
+    Lrt: util::intrusive::Link<Node>,
+    Lrt::Ref: Send,
+{
+}
+
+// SAFETY: Trees have no interior mutability, nor do they otherwise care for
+//     their calling CPU. They can be shared across CPUs, only limited by
+//     the stored type.
+unsafe impl<Lrt> Sync for Tree<Lrt>
+where
+    Lrt: util::intrusive::Link<Node>,
+    Lrt::Ref: Sync,
+{
+}
+
+impl<Lrt> core::default::Default for Tree<Lrt>
+where
+    Lrt: util::intrusive::Link<Node>,
+{
+    /// Return a new empty tree.
+    ///
+    /// The new tree has no entries linked and is completely independent of
+    /// other trees.
+    fn default() -> Self {
+        Self::new()
+    }
+}
+
+impl<Lrt> core::ops::Drop for Tree<Lrt>
+where
+    Lrt: util::intrusive::Link<Node>,
+{
+    /// Clear a tree before dropping it.
+    ///
+    /// This drops all elements in the tree via [`Tree::clear()`], before
+    /// dropping the tree. This ensures that the elements of a tree are
+    /// not leaked.
+    fn drop(&mut self) {
+        // SAFETY: We treat `self` as pinned unconditionally.
+        let this = unsafe { Pin::new_unchecked(self) };
+        this.clear();
+    }
+}
+
+impl Node {
+    /// Create a new unlinked node.
+    ///
+    /// Note that unlinked nodes do not need to be pinned. However, to link a
+    /// node into a tree, it must be pinned. Therefore, you need to pin the
+    /// returned value in place, or move it into a pinned structure, to make
+    /// use of it.
+    pub fn new() -> Self {
+        Self {
+            owner: atomic::Atomic::new(0),
+            bindings: kernel::types::Opaque::new(
+                kernel::bindings::rb_node {
+                    __rb_parent_color: 0,
+                    rb_right: core::ptr::null_mut(),
+                    rb_left: core::ptr::null_mut(),
+                },
+            ),
+            _pin: core::marker::PhantomPinned,
+        }
+    }
+
+    /// Return a pointer to the atomic owner tag of a node.
+    ///
+    /// The owner tag of a node can be accessed at any time, as long as the
+    /// allocation of the node is not deallocated. That is, the owner tag can
+    /// even be accessed while another party holds a mutable reference to the
+    /// node. The transparent wrapper around `UnsafeCell` inside the atomic
+    /// guarantees that such accesses are safe.
+    ///
+    /// ## Safety
+    ///
+    /// The node pointer must refer to a valid and initialized allocation of a
+    /// node.
+    unsafe fn owner(node: NonNull<Self>) -> *mut atomic::Atomic<usize> {
+        // SAFETY: Delegated to caller.
+        unsafe { &raw mut (*node.as_ptr()).owner }
+    }
+
+    /// Get a node pointer from an rb-entry pointer.
+    ///
+    /// ## Safety
+    ///
+    /// The rb-entry must refer to a valid allocation inside of a node. The
+    /// allocation does not have to be initialized.
+    unsafe fn from_rb(rb: NonNull<kernel::bindings::rb_node>) -> NonNull<Self> {
+        // SAFETY: Delegated to caller.
+        unsafe {
+            NonNull::new_unchecked(
+                kernel::container_of!(
+                    kernel::types::Opaque::cast_from(rb.as_ptr()),
+                    Self,
+                    bindings
+                ).cast_mut(),
+            )
+        }
+    }
+
+    /// Check whether this node is linked into a tree.
+    ///
+    /// This returns `true` if this node is linked into any tree. It returns
+    /// `false` if the node is currently unlinked.
+    ///
+    /// Note that a node can be linked into a tree at any time. That is,
+    /// validity of the returned boolean can change spuriously, unless the
+    /// caller otherwise ensures exclusive access to the node.
+    pub fn is_linked(&self) -> bool {
+        // SAFETY: `self` trivially points to a valid allocation of a node.
+        let v = unsafe {
+            (*Self::owner(util::nonnull_from_ref(self))).load(atomic::Relaxed)
+        };
+        v != 0
+    }
+}
+
+// SAFETY: Nodes are only ever modified through their owning tree, or through
+//     atomics. Hence, they can be freely sent across CPUs.
+unsafe impl Send for Node {
+}
+
+// SAFETY: Shared references to a node always use atomics for any data access.
+//     They can be freely shared across CPUs.
+unsafe impl Sync for Node {
+}
+
+impl core::clone::Clone for Node {
+    /// Returns a clean and unlinked node.
+    ///
+    /// Cloning a node always yields a new node that is unlinked and in no way
+    /// tied to the original node.
+    fn clone(&self) -> Self {
+        Self::new()
+    }
+}
+
+impl core::default::Default for Node {
+    /// Return a clean and unlinked node.
+    ///
+    /// The default state for nodes is an unlinked state. Such nodes are in no
+    /// way tied to a tree or any other node.
+    fn default() -> Self {
+        Self::new()
+    }
+}
+
+impl core::ops::Drop for Node {
+    /// Drop a node and verify it is unlinked.
+    ///
+    /// No special cleanup is required when dropping nodes. However, linked
+    /// nodes are owned by their respective tree and as such must never be
+    /// dropped. In an owning tree, this cannot happen, but in non-owning trees
+    /// it is the responsibility of the caller to ensure nodes are unlinked
+    /// before they are dropped.
+    ///
+    /// Since this is an owning tree implementation, this drop handler is a
+    /// safety net to ensure a correct implementation.
+    ///
+    /// ## Background
+    ///
+    /// The drop handler could attempt to disassociate the node. However, this
+    /// only works if node and tree are owned by the same thread. Since
+    /// [`Node`] was designed with [`Send`], it can be dropped by another
+    /// thread (possibly in parallel with a drop of the tree). Any attempt to
+    /// unlink would thus race.
+    ///
+    /// In case of non-owning trees, neither tree nor node can ensure the other
+    /// is valid for even the shortest interval, and thus cannot attempt any
+    /// unlink operation. Instead, validity of nodes is an invariant that must
+    /// be upheld by the user, and is protected by this drop implementation. In
+    /// case of an owning tree, nodes are always valid while linked, and thus
+    /// this drop implementation will hopefully be a no-op.
+    fn drop(&mut self) {
+        // SAFETY: The allocation behind `self` is valid.
+        let owner = unsafe {
+            (*Node::owner(util::nonnull_from_mut(self))).load(atomic::Relaxed)
+        };
+        if owner != 0 {
+            core::panic!(
+                "attempting drop of a claimed node: {:?}",
+                core::ptr::from_ref(&*self),
+            );
+        }
+    }
+}
+
+impl<'tree, Lrt> Cursor<'tree, Lrt>
+where
+    Lrt: util::intrusive::Link<Node>,
+{
+    /// Return the reference target of the current element.
+    ///
+    /// Returns `None` if the tree is empty, or if the cursor points before the
+    /// first or after the last element.
+    pub fn get(&self) -> Option<Pin<&Lrt::Target>> {
+        if let CursorPos::At(ent_node) = self.pos {
+            // SAFETY: `ent_node` points to a valid entry in the tree.
+            Some(unsafe { Lrt::borrow(ent_node) })
+        } else {
+            None
+        }
+    }
+
+    /// Return a clone of a reference to the current element.
+    ///
+    /// Returns `None` if the tree is empty, or if the cursor points before the
+    /// first or after the last element.
+    pub fn get_clone(&self) -> Option<Pin<Lrt::Ref>>
+    where
+        Lrt::Ref: Clone,
+    {
+        if let CursorPos::At(ent_node) = self.pos {
+            // SAFETY: `ent_node` points to a valid entry in the tree.
+            Some(unsafe { Lrt::borrow_clone(ent_node) })
+        } else {
+            None
+        }
+    }
+
+    /// Move to the next entry, if any.
+    ///
+    /// Move the cursor to point to the next entry. If the tree is empty, or if
+    /// the cursor points to the last element, this is a no-op.
+    pub fn move_next(&mut self) {
+        match self.pos {
+            CursorPos::Empty => {},
+            CursorPos::Front(ent_node) => {
+                self.pos = CursorPos::At(ent_node);
+            },
+            CursorPos::Back(_ent_node) => {},
+            CursorPos::At(ent_node) => {
+                // SAFETY: `ent_node` points to a valid entry in the tree.
+                if let Some(v) = NonNull::new(unsafe {
+                    kernel::bindings::rb_next(
+                        ent_node.as_ref().bindings.get(),
+                    )
+                }) {
+                    // SAFETY: `v` points to a valid entry in the tree, thus
+                    // it is embedded in a `Node`.
+                    self.pos = CursorPos::At(unsafe { Node::from_rb(v) });
+                } else {
+                    self.pos = CursorPos::Back(ent_node);
+                }
+            },
+        }
+    }
+
+    /// Move to the previous entry, if any.
+    ///
+    /// Move the cursor to point to the previous entry. If the tree is empty,
+    /// or if the cursor points to the first element, this is a no-op.
+    pub fn move_prev(&mut self) {
+        match self.pos {
+            CursorPos::Empty => {},
+            CursorPos::Front(_ent_node) => {},
+            CursorPos::Back(ent_node) => {
+                self.pos = CursorPos::At(ent_node);
+            },
+            CursorPos::At(ent_node) => {
+                // SAFETY: `ent_node` points to a valid entry in the tree.
+                if let Some(v) = NonNull::new(unsafe {
+                    kernel::bindings::rb_prev(
+                        ent_node.as_ref().bindings.get(),
+                    )
+                }) {
+                    // SAFETY: `v` points to a valid entry in the tree, thus
+                    // it is embedded in a `Node`.
+                    self.pos = CursorPos::At(unsafe { Node::from_rb(v) });
+                } else {
+                    self.pos = CursorPos::Front(ent_node);
+                }
+            },
+        }
+    }
+}
+
+impl<'tree, Lrt> CursorMut<'tree, Lrt>
+where
+    Lrt: util::intrusive::Link<Node>,
+{
+    /// Unlink a specific node from the tree.
+    ///
+    /// ## Safety
+    ///
+    /// `ent_node` must point to a valid entry in `tree`.
+    unsafe fn unlink_at(
+        mut tree: Pin<&mut Tree<Lrt>>,
+        ent_node: NonNull<Node>,
+    ) -> Pin<Lrt::Ref> {
+        // SAFETY: `rb_erase` only reshuffles a tree. So it is enough to
+        //     ensure it is passed a valid root with only valid nodes. This
+        //     invariant is always maintained by `Tree`.
+        unsafe {
+            kernel::bindings::rb_erase(
+                ent_node.as_ref().bindings.get(),
+                tree.as_mut().root_mut(),
+            )
+        };
+
+        // SAFETY: `ent_node` refers to a valid entry.
+        unsafe {
+            (*Node::owner(ent_node)).store(0, atomic::Release);
+        }
+
+        // SAFETY: `ent_node` was removed from the tree, as such it is
+        //     guaranteed to not be used any further.
+        unsafe { Lrt::release(ent_node) }
+    }
+
+    /// Return the reference target of the current element.
+    ///
+    /// Returns `None` if the tree is empty, or if the cursor points before the
+    /// first or after the last element.
+    pub fn get(&self) -> Option<Pin<&Lrt::Target>> {
+        if let CursorPos::At(ent_node) = self.pos {
+            // SAFETY: `ent_node` points to a valid entry in the tree.
+            Some(unsafe { Lrt::borrow(ent_node) })
+        } else {
+            None
+        }
+    }
+
+    /// Return the mutable reference target of the current element.
+    ///
+    /// Returns `None` if the tree is empty, or if the cursor points before the
+    /// first or after the last element.
+    pub fn get_mut(&mut self) -> Option<Pin<&mut Lrt::Target>>
+    where
+        Lrt: util::intrusive::LinkMut<Node>,
+    {
+        if let CursorPos::At(ent_node) = self.pos {
+            // SAFETY: `ent_node` points to a valid entry in the tree and the
+            // cursor is mutably borrowed for the same lifetime.
+            Some(unsafe { Lrt::borrow_mut(ent_node) })
+        } else {
+            None
+        }
+    }
+
+    /// Return a clone of a reference to the current element.
+    ///
+    /// Returns `None` if the tree is empty, or if the cursor points before the
+    /// first or after the last element.
+    pub fn get_clone(&self) -> Option<Pin<Lrt::Ref>>
+    where
+        Lrt::Ref: Clone,
+    {
+        if let CursorPos::At(ent_node) = self.pos {
+            // SAFETY: `ent_node` points to a valid entry in the tree.
+            Some(unsafe { Lrt::borrow_clone(ent_node) })
+        } else {
+            None
+        }
+    }
+
+    /// Move to the next entry, if any.
+    ///
+    /// Move the cursor to point to the next entry. If the tree is empty, or if
+    /// the cursor points to the last element, this is a no-op.
+    pub fn move_next(&mut self) {
+        match self.pos {
+            CursorPos::Empty => {},
+            CursorPos::Front(ent_node) => {
+                self.pos = CursorPos::At(ent_node);
+            },
+            CursorPos::Back(_ent_node) => {},
+            CursorPos::At(ent_node) => {
+                // SAFETY: `ent_node` points to a valid entry in the tree.
+                if let Some(v) = NonNull::new(unsafe {
+                    kernel::bindings::rb_next(
+                        ent_node.as_ref().bindings.get(),
+                    )
+                }) {
+                    // SAFETY: `v` points to a valid entry in the tree, thus
+                    // it is embedded in a `Node`.
+                    self.pos = CursorPos::At(unsafe { Node::from_rb(v) });
+                } else {
+                    self.pos = CursorPos::Back(ent_node);
+                }
+            },
+        }
+    }
+
+    /// Move to the previous entry, if any.
+    ///
+    /// Move the cursor to point to the previous entry. If the tree is empty,
+    /// or if the cursor points to the first element, this is a no-op.
+    pub fn move_prev(&mut self) {
+        match self.pos {
+            CursorPos::Empty => {},
+            CursorPos::Front(_ent_node) => {},
+            CursorPos::Back(ent_node) => {
+                self.pos = CursorPos::At(ent_node);
+            },
+            CursorPos::At(ent_node) => {
+                // SAFETY: `ent_node` points to a valid entry in the tree.
+                if let Some(v) = NonNull::new(unsafe {
+                    kernel::bindings::rb_prev(
+                        ent_node.as_ref().bindings.get(),
+                    )
+                }) {
+                    // SAFETY: `v` points to a valid entry in the tree, thus
+                    // it is embedded in a `Node`.
+                    self.pos = CursorPos::At(unsafe { Node::from_rb(v) });
+                } else {
+                    self.pos = CursorPos::Front(ent_node);
+                }
+            },
+        }
+    }
+
+    /// Move to the next entry, trying to unlink the current entry first.
+    ///
+    /// If the cursor points to an element, the element is unlinked and
+    /// returned. Otherwise, `None` is returned. In all cases, the cursor is
+    /// moved to point to the next element.
+    pub fn try_unlink_and_move_next(&mut self) -> Option<Pin<Lrt::Ref>> {
+        if let CursorPos::At(ent_node) = self.pos {
+            // SAFETY: `ent_node` points to a valid entry in the tree.
+            self.pos = if let Some(v) = NonNull::new(unsafe {
+                kernel::bindings::rb_next(
+                    ent_node.as_ref().bindings.get(),
+                )
+            }) {
+                // SAFETY: `v` points to a valid entry in the tree, thus
+                // it is embedded in a `Node`.
+                CursorPos::At(unsafe { Node::from_rb(v) })
+            } else if let Some(v) = NonNull::new(unsafe {
+                kernel::bindings::rb_prev(
+                    ent_node.as_ref().bindings.get(),
+                )
+            }) {
+                // SAFETY: `v` points to a valid entry in the tree, thus
+                // it is embedded in a `Node`.
+                CursorPos::Back(unsafe { Node::from_rb(v) })
+            } else {
+                CursorPos::Empty
+            };
+
+            // SAFETY: `ent_node` is a valid entry in `tree` and no longer
+            // referenced by the cursor.
+            Some(unsafe { Self::unlink_at(self.tree.as_mut(), ent_node) })
+        } else {
+            self.move_next();
+            None
+        }
+    }
+
+    /// Move to the previous entry, trying to unlink the current entry first.
+    ///
+    /// If the cursor points to an element, the element is unlinked and
+    /// returned. Otherwise, `None` is returned. In all cases, the cursor is
+    /// moved to point to the previous element.
+    pub fn try_unlink_and_move_prev(&mut self) -> Option<Pin<Lrt::Ref>> {
+        if let CursorPos::At(ent_node) = self.pos {
+            // SAFETY: `ent_node` points to a valid entry in the tree.
+            self.pos = if let Some(v) = NonNull::new(unsafe {
+                kernel::bindings::rb_prev(
+                    ent_node.as_ref().bindings.get(),
+                )
+            }) {
+                // SAFETY: `v` points to a valid entry in the tree, thus
+                // it is embedded in a `Node`.
+                CursorPos::At(unsafe { Node::from_rb(v) })
+            } else if let Some(v) = NonNull::new(unsafe {
+                kernel::bindings::rb_next(
+                    ent_node.as_ref().bindings.get(),
+                )
+            }) {
+                // SAFETY: `v` points to a valid entry in the tree, thus
+                // it is embedded in a `Node`.
+                CursorPos::Front(unsafe { Node::from_rb(v) })
+            } else {
+                CursorPos::Empty
+            };
+
+            // SAFETY: `ent_node` is a valid entry in `tree` and no longer
+            // referenced by the cursor.
+            Some(unsafe { Self::unlink_at(self.tree.as_mut(), ent_node) })
+        } else {
+            self.move_prev();
+            None
+        }
+    }
+
+    /// Try unlinking the current entry from the tree, consuming the cursor.
+    ///
+    /// This will unlink the current entry under the cursor from the tree. If
+    /// the cursor refers to an empty tree, this is a no-op and `None` is
+    /// returned. Otherwise, the removed entry is returned.
+    ///
+    /// This consumes the cursor. Use [`Self::try_unlink_and_move_next()`]
+    /// etc. to unlink entries while retaining the cursor.
+    pub fn try_unlink(mut self) -> Option<Pin<Lrt::Ref>> {
+        let CursorPos::At(ent_node) = self.pos else {
+            return None;
+        };
+
+        // SAFETY: `ent_node` is a valid entry in `tree`.
+        Some(unsafe { Self::unlink_at(self.tree.as_mut(), ent_node) })
+    }
+}
+
+// Convenience helpers
+impl<'tree, Lrt> CursorMut<'tree, Lrt>
+where
+    Lrt: util::intrusive::Link<Node>,
+{
+    /// Move to the next entry, unlinking the current entry first.
+    ///
+    /// Works like [`Self::try_unlink_and_move_next()`] but panics if the tree
+    /// is empty.
+    pub fn unlink_and_move_next(&mut self) -> Pin<Lrt::Ref> {
+        self.try_unlink_and_move_next().unwrap_or_else(
+            || core::panic!("attempting to unlink from an empty tree"),
+        )
+    }
+
+    /// Move to the previous entry, unlinking the current entry first.
+    ///
+    /// Works like [`Self::try_unlink_and_move_prev()`] but panics if the tree
+    /// is empty.
+    pub fn unlink_and_move_prev(&mut self) -> Pin<Lrt::Ref> {
+        self.try_unlink_and_move_prev().unwrap_or_else(
+            || core::panic!("attempting to unlink from an empty tree"),
+        )
+    }
+
+    /// Unlink the current entry from the tree, consuming the cursor.
+    ///
+    /// Works like [`Self::try_unlink()`] but panics if the tree is empty.
+    pub fn unlink(self) -> Pin<Lrt::Ref> {
+        self.try_unlink().unwrap_or_else(
+            || core::panic!("attempting to unlink from an empty tree"),
+        )
+    }
+}
+
+impl<'tree, Lrt> Slot<'tree, Lrt>
+where
+    Lrt: util::intrusive::Link<Node>,
+{
+    /// Check whether the slot is available for insertion.
+    ///
+    /// If the slot was already occupied by another entry of the tree, or if
+    /// the slot was already consumed for an insertion, this will return
+    /// `false`. Otherwise, this will return `true`.
+    ///
+    /// If this returns `true`, a subsequent call to [`Self::try_link()`] is
+    /// guaranteed to succeed for any unclaimed entry.
+    pub fn available(&self) -> bool {
+        if let Some(slot) = NonNull::new(self.slot) {
+            // SAFETY: If non-null, `slot` is a valid reference to some slot in
+            //     this tree. If the slot is non-empty, this slot is already in
+            //     use and not available for insertion.
+            unsafe { slot.as_ref().is_null() }
+        } else {
+            // The slot was already used for insertion and is now invalid.
+            false
+        }
+    }
+
+    fn entry_ptr(&self) -> Option<NonNull<Node>> {
+        let slot = NonNull::new(self.slot)?;
+
+        // SAFETY: `self.slot` is a valid tree entry.
+        let ent_rb = NonNull::new(unsafe { *slot.as_ref() })?;
+
+        // SAFETY: `ent_rb` refers to a valid rb-entry in the tree, and is thus
+        //     embedded in a valid node.
+        Some(unsafe { Node::from_rb(ent_rb) })
+    }
+
+    /// Get a reference to the entry in this slot.
+    ///
+    /// If the slot is occupied, this will return a reference to the entry in
+    /// this slot. Otherwise, this will return `None`.
+    pub fn entry(&self) -> Option<Pin<&Lrt::Target>> {
+        match self.entry_ptr() {
+            None => None,
+            // SAFETY: `entry_ptr()` only returns valid entries.
+            Some(v) => Some(unsafe { Lrt::borrow(v) }),
+        }
+    }
+
+    /// Get a copy of the reference to the entry in this slot.
+    ///
+    /// If the slot is occupied, this will return a copy of the reference to
+    /// the entry in this slot. Otherwise, this will return `None`.
+    ///
+    /// This is only available if the reference type implements
+    /// [`core::clone::Clone`].
+    pub fn entry_clone(&self) -> Option<Pin<Lrt::Ref>>
+    where
+        Lrt::Ref: Clone,
+    {
+        self.entry_ptr().map(|v| {
+            // SAFETY: `v` is a valid tree entry, and as such is from
+            //     `acquire()`.
+            unsafe { Lrt::borrow_clone(v) }
+        })
+    }
+
+    /// Try linking a new entry in this slot.
+    ///
+    /// If the slot is occupied, was already used for linking, or if the passed
+    /// entry is already linked into a tree, this will return `Err`, moving
+    /// ownership of the new entry back to the caller.
+    ///
+    /// Otherwise, this will move ownership of the entry to the tree and link
+    /// the entry in this slot. A reference to the target is returned via `Ok`.
+    ///
+    /// Note that a slot can only be used once to link an entry. Once linked
+    /// successfully, the slot must be considered unavailable and should be
+    /// dropped. Neither [`Slot::entry()`], nor any other operations will be
+    /// available on a used slot. The only reason this does not consume the
+    /// slot is to allow re-use of the slot in case the link failed.
+    pub fn try_link(
+        &mut self,
+        ent: Pin<Lrt::Ref>,
+    ) -> Result<Pin<&'tree Lrt::Target>, Pin<Lrt::Ref>> {
+        // If the entry was already occupied, or already used for insertion, it
+        // is no longer available. Refuse to attempt the link.
+        if !self.available() {
+            return Err(ent);
+        }
+
+        // Acquire the entry and get access to the node. This can be converted
+        // to a reference as long as we do not release it.
+        let ent_node = Lrt::acquire(ent);
+
+        let owner = self.tree.as_mut().as_owner();
+        // SAFETY: `ent_node` is a valid node.
+        let Ok(_) = (unsafe {
+            (*Node::owner(ent_node)).cmpxchg(0, owner, atomic::Acquire)
+        }) else {
+            // If the cmpxchg fails, the entry is already claimed (either by
+            // this tree or another tree). Refuse to use this entry, but return
+            // it fully to the caller so it can be reused.
+            //
+            // SAFETY: The pointer was just obtained from `acquire()`.
+            return Err(unsafe { Lrt::release(ent_node) });
+        };
+
+        // The entry was successfully claimed. Let `rb_link_node()` and
+        // `rb_insert_color()` do their work. Then clear `self.{anchor,slot}`
+        // as they are no longer valid for insertion.
+        // Note that preferably the function would consume `self`, but that
+        // would prevent the caller from re-using the slot on insertion
+        // failure.
+        //
+        // SAFETY: As long as `self.slot` is non-null it points to a valid slot
+        //     for insertion with `self.anchor` as the chosen `parent` value.
+        //     This was checked via `self.available()` just now.
+        //     `ent_node` was just claimed and as such is uniquely owned by
+        //     this tree now. Since `self.tree` has a mutable reference, we can
+        //     freely link it into the tree.
+        unsafe {
+            kernel::bindings::rb_link_node(
+                ent_node.as_ref().bindings.get(),
+                self.anchor,
+                self.slot,
+            );
+            kernel::bindings::rb_insert_color(
+                ent_node.as_ref().bindings.get(),
+                self.tree.as_mut().root_mut(),
+            );
+        }
+
+        self.anchor = core::ptr::null_mut();
+        self.slot = core::ptr::null_mut();
+
+        // SAFETY: The pointer was just obtained from `acquire()`.
+        Ok(unsafe { Lrt::borrow(ent_node) })
+    }
+}
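The cmpxchg in `try_link()` and the clearing store in `unlink_at()` form a small claim protocol on the node's owner tag. A standalone sketch with plain `std` atomics (illustrative only, not the kernel's atomic wrappers) shows why a node can never end up linked into two trees at once — only one compare-exchange from 0 can succeed:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Claim a node for a tree, identified by a non-zero tag.
///
/// Only one claimant can win the compare-exchange from 0; every other
/// attempt fails until the tag is cleared again.
fn try_claim(owner: &AtomicUsize, tree_tag: usize) -> bool {
    owner
        .compare_exchange(0, tree_tag, Ordering::Acquire, Ordering::Relaxed)
        .is_ok()
}

/// Release a claimed node, mirroring the store in `unlink_at()`.
fn release(owner: &AtomicUsize) {
    owner.store(0, Ordering::Release);
}
```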
+
+// Convenience helpers
+impl<'tree, Lrt> Slot<'tree, Lrt>
+where
+    Lrt: util::intrusive::Link<Node>,
+{
+    /// Link a new entry into this slot.
+    ///
+    /// Works like [`Self::try_link()`] but panics if the entry cannot be
+    /// linked.
+    pub fn link(mut self, ent: Pin<Lrt::Ref>) -> Pin<&'tree Lrt::Target> {
+        if !self.available() {
+            core::panic!("attempting to link on a used slot");
+        }
+        match self.try_link(ent) {
+            Ok(v) => v,
+            Err(v) => Tree::<Lrt>::panic_acquire(v),
+        }
+    }
+}
+
+#[kunit_tests(bus1_util_rb)]
+mod test {
+    use super::*;
+
+    #[derive(Default)]
+    struct Entry {
+        key: u8,
+        rb: Node,
+    }
+
+    util::field::impl_pin_field!(Entry, rb, Node);
+
+    #[test]
+    fn test_basic() {
+        let e0 = core::pin::pin!(Entry { key: 0, ..Default::default() });
+        let e1 = core::pin::pin!(Entry { key: 1, ..Default::default() });
+
+        let tree_o: Tree<node_of!(&Entry, rb)> = Tree::new();
+        let mut tree: Pin<&mut Tree<_>> = core::pin::pin!(tree_o);
+
+        assert!(tree.as_mut().is_empty());
+        tree.as_mut().find_slot_by(|other| e0.key.cmp(&other.key))
+            .link(e0.into_ref());
+        assert!(!tree.as_mut().is_empty());
+        assert!(
+            !tree.as_mut().find_slot_by(|other| 0.cmp(&other.key))
+                .available()
+        );
+        tree.as_mut().find_slot_by(|other| e1.key.cmp(&other.key))
+            .link(e1.into_ref());
+        assert!(!tree.as_mut().is_empty());
+
+        tree.as_mut().clear();
+        assert!(tree.as_mut().is_empty());
+    }
+}
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC 14/16] bus1/acct: add resource accounting
  2026-03-31 19:02 [RFC 00/16] bus1: Capability-based IPC for Linux David Rheinsberg
                   ` (12 preceding siblings ...)
  2026-03-31 19:03 ` [RFC 13/16] bus1/util: add intrusive rb-tree David Rheinsberg
@ 2026-03-31 19:03 ` David Rheinsberg
  2026-03-31 19:03 ` [RFC 15/16] bus1: introduce peers, handles, and nodes David Rheinsberg
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 33+ messages in thread
From: David Rheinsberg @ 2026-03-31 19:03 UTC (permalink / raw)
  To: rust-for-linux; +Cc: teg, Miguel Ojeda, David Rheinsberg

Add the `acct` module, including a C API, which implements the resource
accounting scheme of bus1. The module contains documentation on its
purpose and design.

The module uses the same design as was proposed before. In the meantime,
the same concept has been implemented and deployed in dbus-broker, and
backs the resource accounting of the standard dbus implementation in
most distributions out there.

The 2-tiered design is a relatively new introduction to fix some of the
shortcomings that have been identified over the years. It has also been
introduced to dbus-broker.

Signed-off-by: David Rheinsberg <david@readahead.eu>
---
 ipc/bus1/acct.rs | 1792 ++++++++++++++++++++++++++++++++++++++++++++++
 ipc/bus1/lib.h   |   66 ++
 ipc/bus1/lib.rs  |    1 +
 3 files changed, 1859 insertions(+)
 create mode 100644 ipc/bus1/acct.rs

diff --git a/ipc/bus1/acct.rs b/ipc/bus1/acct.rs
new file mode 100644
index 000000000000..512b97b256c3
--- /dev/null
+++ b/ipc/bus1/acct.rs
@@ -0,0 +1,1792 @@
+//! # Resource Accounting
+//!
+//! To safely communicate across user boundaries, bus1 needs to apply quotas to
+//! pinned resources, so no user can exploit all resources of another. Every
+//! bus1 operation must be performed on behalf of an actor, and every actor
+//! operates on behalf of a user. Whenever an operation claims resources of a
+//! foreign user, the accounting system will ensure that a given quota is not
+//! exceeded.
+//!
+//! All actors that operate on behalf of the same user will share the same
+//! limits, but a second-layer quota ensures that the individual actors are
+//! still semi-protected from each other (to protect against misbehaving
+//! actors in the same security domain).
+//!
+//! Actors that operate on behalf of different users (and thus have different
+//! security domains) have their interactions limited by a dynamic quota
+//! system. This ensures operations that cross security domains cannot exhaust
+//! the resources of another user.
+//!
+//! The accounting system ensures that dynamic quotas are applied and thus
+//! grants all users a fair share of the available resources. The exact
+//! algorithm is not part of the API guarantee, and thus a programmatic test
+//! of the resource limits is not reliable or predictable. Instead, resource
+//! limits are applied similarly to mandatory access control: they ensure safe
+//! operation, and should only ever affect unexpected or malicious operations.
+//!
+//! ## Atomicity
+//!
+//! The observable effect of resource accounting is not atomic. That is, two
+//! parallel resource charges might both fail, even though either one on its
+//! own might have passed. Moreover, an operation that eventually fails might
+//! still temporarily claim (partial) resources.
+//!
+//! Given that the resource accounting provides no programmatic access, this
+//! should not pose any issues. If underlying accounting techniques or public
+//! APIs change, atomicity can be introduced if desired.
+
+use core::mem::ManuallyDrop;
+
+use kernel::prelude::*;
+use kernel::alloc::AllocError;
+use kernel::sync::{Arc, ArcBorrow, LockedBy};
+
+use crate::capi;
+use crate::util::{self, rb};
+
+/// Representation of user IDs in the accounting system. This is guaranteed to
+/// be a primitive unsigned integer big enough to hold Linux UIDs.
+pub type Id = capi::b1_acct_id_t;
+
+/// Representation of resource counters in the accounting system. This is
+/// guaranteed to be a primitive unsigned integer and big enough to hold
+/// values of type `usize`.
+pub type Value = capi::b1_acct_value_t;
+
+/// Number of resource slots in the accounting system. That is, it defines how
+/// many different (and independent) resource types are in use and tracked by
+/// the accounting system.
+pub const N_SLOTS: usize = capi::_B1_ACCT_SLOT_N;
+
+/// Errors that can be raised by a charging operation.
+#[derive(Clone, Copy, Debug, Eq, PartialEq)]
+pub enum ChargeError {
+    /// Could not allocate required state-tracking.
+    Alloc(AllocError),
+    /// Charge would exceed the quota of the claiming user.
+    UserQuota,
+    /// Charge would exceed the quota of the claiming actor.
+    ActorQuota,
+}
+
+/// An object to track how many resources a stage has available and claimed.
+///
+/// A [`Claim`] object is always attached to an entity that claims resources
+/// (regardless of whether it is a tail-node or an intermediate that re-shares
+/// the resources). It tracks how many resources this entity currently has
+/// claimed from its upper node, and how many it has available to its lower
+/// nodes. The amount it granted to its lower nodes is the difference between
+/// the two.
+///
+/// Tail nodes should never have resources available. They should always
+/// request exactly the amount they claim, and they should always release all
+/// resources back to the upper node if no longer needed.
+///
+/// Intermediate nodes will likely request more resources than they grant.
+/// This allows guarantees to other sub-nodes about future resource claims,
+/// without being required to request more from the upper nodes (which can
+/// possibly fail).
+///
+/// The root node cannot release resources to upper nodes (there are no upper
+/// nodes), and thus the available resources of the root node show exactly
+/// the resources that have not yet been claimed by the system.
+struct Claim {
+    available: [Value; N_SLOTS],
+    claimed: [Value; N_SLOTS],
+}
+
+/// The root context of an independent accounting system.
+///
+/// All objects of the module are eventually tied to one [`Acct`] object.
+/// Different such objects are fully independent.
+///
+/// The root context is used to gain access to [`User`] and [`Actor`] objects,
+/// and provides the initial resource constraints of the system.
+#[pin_data]
+pub struct Acct {
+    #[pin]
+    inner: kernel::sync::Mutex<AcctInner>,
+}
+
+struct AcctInner {
+    users: rb::Tree<rb::node_of!(Arc<User>, acct_rb)>,
+    users_len: usize,
+    maxima: [Value; N_SLOTS],
+}
+
+/// An actor is an entity of the accounting system that can lay claims on
+/// resources.
+///
+/// Actors always operate on behalf of a user. Actors can either claim
+/// resources of their own user, or of other users. In case of the latter,
+/// the claim is subject to a quota.
+pub struct Actor {
+    user: UserRef,
+}
+
+/// Reference to a user in an accounting system.
+///
+/// A resource system is partitioned into independent users, which operate
+/// fully independently, and each have an assigned resource limit.
+///
+/// This type acts like an `Arc<User>`.
+///
+/// An [`Actor`] can then claim resources of a user up to the limit of the
+/// user. If the actor is tied to the user it claims from, then it can claim up
+/// to the resource limit. If the actor is tied to another user, then its
+/// resource claims are subject to a quota.
+#[derive(Clone)]
+pub struct UserRef {
+    arc: ManuallyDrop<Arc<User>>,
+}
+
+#[pin_data]
+struct User {
+    acct: Arc<Acct>,
+    acct_rb: rb::Node,
+    id: Id,
+    #[pin]
+    inner: kernel::sync::Mutex<UserInner>,
+}
+
+util::field::impl_pin_field!(User, acct_rb, rb::Node);
+
+struct UserInner {
+    quotas: rb::Tree<rb::node_of!(Arc<Quota>, user_rb)>,
+    quotas_len: usize,
+    maxima: [Value; N_SLOTS],
+    claim: Claim,
+}
+
+#[derive(Clone)]
+struct QuotaRef {
+    arc: ManuallyDrop<Arc<Quota>>,
+}
+
+struct Quota {
+    user: UserRef,
+    user_rb: rb::Node,
+    id: Id,
+    inner: LockedBy<QuotaInner, UserInner>,
+}
+
+// SAFETY: `user_rb` is structurally pinned and of type `Node`.
+util::field::impl_pin_field!(Quota, user_rb, rb::Node);
+
+struct QuotaInner {
+    traces: rb::Tree<rb::node_of!(Arc<Trace>, quota_rb)>,
+    traces_len: usize,
+    claim: Claim,
+}
+
+#[derive(Clone)]
+struct TraceRef {
+    arc: ManuallyDrop<Arc<Trace>>,
+}
+
+struct Trace {
+    quota: QuotaRef,
+    quota_rb: rb::Node,
+    actor: Arc<Actor>,
+    inner: LockedBy<TraceInner, UserInner>,
+}
+
+// SAFETY: `quota_rb` is structurally pinned and of type `Node`.
+util::field::impl_pin_field!(Trace, quota_rb, rb::Node);
+
+struct TraceInner {
+    claim: Claim,
+}
+
+/// An object to represent an active resource claim charged on the system.
+#[repr(transparent)]
+pub struct Charge {
+    inner: capi::b1_acct_charge,
+}
+
+// Helper to convert from `usize` to `Value`.
+fn value_from_usize(v: usize) -> Value {
+    build_assert!(
+        size_of::<Value>() >= size_of::<usize>(),
+        "bit-size of values in the accounting system must be at least that of `usize`"
+    );
+    v as Value
+}
+
+// Calculate the ceiled base-2 logarithm, rounding up in case of missing
+// precision.
+//
+// If `v` is 0, the function will still produce a result, but does not give
+// guarantees on its value (currently, it will produce the maximum logarithm
+// representable).
+fn log2_ceil(v: Value) -> Value {
+    let length: u32 = Value::BITS;
+
+    // This calculates the ceiled logarithm. What we do is count the leading
+    // zeros of the value in question and then subtract that from its
+    // bit-length. By subtracting one from the value first we make sure the
+    // result is ceiled.
+    //
+    // Hint: To calculate the floored value, you would do the subtraction of 1
+    //       from the final value, rather than from the source value:
+    //
+    //           `length - v.leading_zeros() - 1`
+    let log2: u32 = length - (v - 1).leading_zeros();
+
+    // Convert the value back into the source type. Since the base-2 logarithm
+    // only reduced in value, this cannot fail, even if the backing type is
+    // changed.
+    log2.into()
+}
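As a minimal standalone sketch of the computation above (with `u64` standing in for `Value`, which is an assumption; the `v = 0` underflow edge case is not handled):

```rust
// Minimal sketch of the ceiled base-2 logarithm, with `u64` standing in
// for `Value`. Unlike the kernel code, `v = 0` is not tolerated here.
fn log2_ceil(v: u64) -> u64 {
    // Subtracting 1 from `v` first makes the result round up.
    u64::from(u64::BITS - (v - 1).leading_zeros())
}

fn main() {
    assert_eq!(log2_ceil(1), 0); // 2^0 = 1
    assert_eq!(log2_ceil(8), 3); // exact power of two
    assert_eq!(log2_ceil(9), 4); // rounded up, since 2^3 < 9 <= 2^4
}
```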
+
+// A resource allocator that provides exponential allocation guarantees. It
+// ensures resource reserves are freely accessible, at the expense of granting
+// only very limited guarantees to each entity.
+//
+// This allocator grants access to half of the remaining resources for every
+// new allocation. As such, this allocator guarantees that each independent
+// entity gets access to 1 over `2^n` of the total resources.
+fn allocator_exponential(_users: Value) -> Option<Value> {
+    Some(2)
+}
+
+// A resource allocator that provides polynomial allocation guarantees. It
+// ensures resource reserves are easily accessible, at the expense of granting
+// only limited guarantees to each entity.
+//
+// This allocator grants access to 1 over `n + 1` of the remaining resources
+// for every new allocation (with `n` being the number of active entities). As
+// such, this allocator guarantees that each independent entity gets access to
+// 1 over `n^2` of the total resources.
+#[allow(unused)]
+fn allocator_polynomial(users: Value) -> Option<Value> {
+    users.checked_add(1)
+}
+
+// A resource allocator that provides quasilinear allocation guarantees. It
+// ensures strong guarantees to each entity, at the expense of heavily
+// restricting resource reserves.
+//
+// This allocator grants access to 1 over `(n+1) log(n+1) + n+1` of the
+// remaining resources for every new allocation (with `n` being the number of
+// active entities). As such, this allocator guarantees that each independent
+// entity gets access to 1 over `n log(n)^2` of the total resources.
+fn allocator_quasilinear(users: Value) -> Option<Value> {
+    let users1 = users.checked_add(1)?;
+    let log_mul = log2_ceil(users1).checked_mul(users1)?;
+    log_mul.checked_add(users1)
+}
+
+// Calculate the minimum reserve size required for an allocation request to
+// pass the quota.
+//
+// Whenever an allocation request is checked against the quota, this function
+// can be used to calculate how many resources must still be available in the
+// reserve to allow the request. If this minimum reserve size matches the size
+// of the request, the allocation would be allowed to consume all remaining
+// resources. In most cases, this function returns a bigger minimum reserve
+// size, to ensure future requests can be served as well.
+//
+// This function implements an algorithm to allow an unknown set of users to
+// fairly share a fixed pool of resources. It considers the changing parameters
+// and adjusts its quota check accordingly, ensuring that a growing or
+// shrinking set of users get arbitrated access to the available resources
+// in a fair manner.
+//
+// The quota-check is applied whenever a user requests resources from the
+// shared resource pool. The following information is involved in checking a
+// quota:
+//
+// * `remaining`: Amount of resources that are available for allocation
+// * `n_users`: Number of users that are tracked, including the claimant user
+// * `share`: Amount of resources the claimant user has already allocated
+// * `charge`: Amount of resources that are requested by the claimant user
+//
+// ## Algorithm
+//
+// Ideally, every user on the system would get `1 / n_users` of the available
+// resources. This would be a fair system, where everyone gets the same share.
+// Unfortunately, we do not know the number of users that will participate
+// upfront. Hence, we use an algorithm that guarantees something that comes
+// close to this ideal.
+//
+// An allocation that was granted is never revoked, nor do we reclaim any
+// resources that are actively used. Hence, the only place the algorithm is
+// applied is when an allocation is requested. There, we look at the amount
+// of resources that are still available, and then decide whether the user
+// is allowed their request. We consider every allocation of a user as a
+// `re-allocation` of their current share. That is, we pretend they release
+// their currently held resources, then request a new allocation that is the
+// size of the original request plus their previous share. This `re-allocation`
+// then has to satisfy the following inequality:
+//
+// ```txt
+//                       remaining + share
+//     charge + share <= ~~~~~~~~~~~~~~~~~
+//                           A(n_users)
+// ```
+//
+// In other words, of the remaining resources, a user gets a fraction that
+// only depends on the number of users that are active. Now depending on which
+// function `A()` is chosen, a user can request more or less resources.
+// However, selection of `A()` also affects the overall guarantees that the
+// algorithm will provide.
+//
+// For example, consider `A(n): 2`. That is, an allocator that always grants
+// half of the remaining resources to a user:
+//
+// ```txt
+//                       remaining + share
+//     charge + share <= ~~~~~~~~~~~~~~~~~
+//                               2
+// ```
+//
+// This will ensure that resources are easily available and little reserves
+// are kept. However, for a single user, this allocation scheme can only
+// guarantee that each user gets `1 / 2^n` of the available resources. So
+// while it ensures that resources are easily available and not held back,
+// it also prevents any meaningful guarantees for a single user, and as such
+// denials of service can ensue.
+//
+// If, on the other hand, you pick `A(n): n + 1`, then only a share
+// proportional to the number of currently active users is granted:
+//
+// ```txt
+//                       remaining + share
+//     charge + share <= ~~~~~~~~~~~~~~~~~
+//                          n_users + 1
+// ```
+//
+// (Note that `n_users` is not the total number of users that will eventually
+// participate in the system, but merely the number at the given moment.)
+//
+// With this allocator, much bigger reserves are kept as the number of users
+// rises. Ultimately, this will guarantee that each user gets `1 / n^2` of
+// the total resources. This is already much better than the exponential
+// backoff of the previous allocator.
+//
+// Now, lastly, we consider `A(n): n log(n) + n`, or more precisely
+// `A(n): (n+1) log(n+1) + n+1`. With this allocator, resources are kept
+// even tighter, but ultimately we get a quasilinear guarantee for each user
+// with `1 / (n * log(n)^2)`. This is already pretty close to the ideal of
+// `1 / n`.
+//
+// ## Hierarchy
+//
+// The algorithm can be applied to a hierarchical setup by simply chaining
+// quota checks. For instance, one could first check whether a user can
+// allocate resources from a global resource pool, and then check whether a
+// claimant can allocate resources on that user. Depending on what guarantees
+// are desired on each level, different allocators can be chosen. However,
+// any hierarchical setup will also significantly reduce the guarantees, as
+// each level operates only on the guarantees of the previous level.
+fn quota_reserve<F>(
+    allocator_fn: F,
+    n_users: Value,
+    share: Value,
+    charge: Value,
+) -> Option<Value>
+where
+    F: Fn(Value) -> Option<Value>,
+{
+    // For a quota check, we have to calculate:
+    //
+    //                       remaining + share
+    //     charge + share <= ~~~~~~~~~~~~~~~~~
+    //                           A(n_users)
+    //
+    // But to avoid the division, we instead calculate:
+    //
+    //     (charge + share) * A(n_users) - share <= remaining
+    //
+    // The inequality itself has to be checked by the caller. This function
+    // merely computes the left half of the inequality and returns it.
+    //
+    // None of these partial calculations exceed the actual limit by a factor
+    // of 2, and as such we expect all calculations to be possible within the
+    // limits of the integer type. Any overflow will thus result in a quota
+    // rejection.
+
+    let allocator = allocator_fn(n_users)?;
+    let charge_share = charge.checked_add(share)?;
+    let limit = charge_share.checked_mul(allocator)?;
+    let minimum = limit.checked_sub(share)?;
+
+    Some(minimum)
+}
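As a standalone illustration of the overflow-checked rearrangement above, here is a sketch with `u64` in place of `Value` and the exponential allocator `A(n) = 2` hard-coded (both assumptions for the example), computing the left-hand side of the inequality:

```rust
// Sketch of the reserve computation `(charge + share) * A(n) - share`,
// with the exponential allocator `A(n) = 2`. Any overflow yields `None`,
// which the caller treats as a quota rejection.
fn reserve_exponential(share: u64, charge: u64) -> Option<u64> {
    charge.checked_add(share)?.checked_mul(2)?.checked_sub(share)
}

fn main() {
    // With no prior share, a charge of 10 needs at least 20 remaining:
    // 10 <= (remaining + 0) / 2  <=>  remaining >= 20.
    assert_eq!(reserve_exponential(0, 10), Some(20));
    // With a prior share of 10, a further charge of 10 needs 30 remaining:
    // 20 <= (remaining + 10) / 2  <=>  remaining >= 30.
    assert_eq!(reserve_exponential(10, 10), Some(30));
    // Overflow in the intermediate product rejects the request.
    assert_eq!(reserve_exponential(0, u64::MAX), None);
}
```

The caller then compares the returned minimum against `remaining`, which is exactly the check `charge_claims()` performs against the `available` counters.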
+
+/// Find an entry in a lookup tree, or insert a new one if not found.
+///
+/// This will use `cmp_fn` to look for an entry in `tree`. If found, the entry
+/// is returned. Otherwise, `new_fn` is used to create a new entry and store
+/// it in `tree` at the same position.
+///
+/// To ensure tree order, the caller should make sure the newly created entry
+/// is consistent with the comparator `cmp_fn`.
+///
+/// ## Safety
+///
+/// When inserting a new entry, this splits a single reference count across 2
+/// `Arc`s. One that is returned to the caller, and one that is stored in the
+/// lookup tree. The caller must use `drop_or_unlink()` when dropping
+/// the last reference, or otherwise merge those `Arc`s back together.
+unsafe fn find_or_insert<T, Lrt, CmpFn, NewFn>(
+    tree: Pin<&mut rb::Tree<Lrt>>,
+    tree_len: &mut usize,
+    cmp_fn: CmpFn,
+    mut new_fn: NewFn,
+) -> Result<Arc<T>, AllocError>
+where
+    Lrt: util::intrusive::Link<rb::Node, Ref = Arc<T>, Target = T>,
+    CmpFn: FnMut(Pin<&T>) -> core::cmp::Ordering,
+    NewFn: FnMut() -> Result<Arc<T>, AllocError>,
+{
+    let slot = tree.find_slot_by(cmp_fn);
+    if let Some(v) = slot.entry_clone() {
+        Ok(util::arc_unpin(v))
+    } else {
+        let raw = Arc::into_raw(new_fn()?);
+        // SAFETY: The single reference-count is split here. `drop_or_unlink()`
+        //     must be used by the caller to ensure it is merged when dropped.
+        let (new, link) = unsafe {
+            (Arc::from_raw(raw), Arc::from_raw(raw))
+        };
+        slot.link(util::arc_pin(link));
+        *tree_len += 1;
+        Ok(new)
+    }
+}
+
+/// Unlink an `Arc` from a tree, if it is the last.
+///
+/// This tries to drop an `Arc`, but only if it is not the last `Arc`. If
+/// the drop goes through, `None` is returned. If not, the `Arc` is unlinked
+/// from `tree` and returned to the caller.
+///
+/// The `Arc` stored in the tree is leaked via `Arc::into_raw()`. To prevent
+/// this leak, the caller should have stored a shared reference in the tree
+/// in the first place.
+fn drop_or_unlink<T, Lrt>(
+    tree: Pin<&mut rb::Tree<Lrt>>,
+    tree_len: &mut usize,
+    ent: Arc<T>,
+) -> Option<Arc<T>>
+where
+    Lrt: util::intrusive::Link<rb::Node, Ref = Arc<T>, Target = T>,
+{
+    let ent = util::arc_pin(Arc::drop_unless_unique(ent)?);
+
+    if let Some(v) = tree.try_unlink(ent.as_ref()) {
+        let _v = Arc::into_raw(util::arc_unpin(v));
+        *tree_len -= 1;
+    }
+
+    Some(util::arc_unpin(ent))
+}
+
+impl core::convert::From<AllocError> for ChargeError {
+    /// Create a charge error from an allocation error.
+    ///
+    /// This will wrap the allocation error as a [`ChargeError::Alloc`].
+    fn from(v: AllocError) -> Self {
+        Self::Alloc(v)
+    }
+}
+
+impl Claim {
+    fn with(available: &[Value; N_SLOTS]) -> Self {
+        Claim {
+            available: *available,
+            claimed: *available,
+        }
+    }
+
+    fn new() -> Self {
+        Self::with(&[0; N_SLOTS])
+    }
+}
+
+impl Acct {
+    /// Create a new accounting system.
+    ///
+    /// All accounting systems are fully independent of each other, and this
+    /// object serves as root context for a given accounting system.
+    pub fn new(
+        maxima: &[Value; N_SLOTS],
+    ) -> Result<Arc<Self>, AllocError> {
+        match Arc::pin_init(
+            pin_init!(Self {
+                inner <- kernel::sync::new_mutex!(
+                    AcctInner {
+                        users: Default::default(),
+                        users_len: 0,
+                        maxima: *maxima,
+                    },
+                ),
+            }),
+            GFP_KERNEL,
+        ) {
+            Ok(v) => Ok(v),
+            Err(_) => Err(AllocError),
+        }
+    }
+
+    /// Turn the reference into a raw pointer.
+    ///
+    /// This will leak the reference and any pinned resources, unless the
+    /// original object is recreated via `Self::from_raw()`.
+    fn into_raw(this: Arc<Self>) -> *mut capi::b1_acct {
+        Arc::into_raw(this).cast_mut().cast()
+    }
+
+    /// Recreate the reference from its raw pointer.
+    ///
+    /// ## Safety
+    ///
+    /// The caller must guarantee this pointer was acquired via
+    /// `Self::into_raw()`, and they must refrain from using the pointer any
+    /// further.
+    unsafe fn from_raw(this: *mut capi::b1_acct) -> Arc<Self> {
+        // SAFETY: Delegated to caller.
+        unsafe { Arc::from_raw(this.cast::<Self>()) }
+    }
+
+    /// Get a user object for a given user ID.
+    ///
+    /// Query the accounting system for the user object of the given user ID.
+    /// This will always return an existing user object, if there is one.
+    /// Otherwise, it will create a new one.
+    ///
+    /// It is never possible to have two [`User`] objects with the same user ID
+    /// but assigned to the same [`Acct`] object. This lookup function ensures
+    /// that user objects are always shared.
+    pub fn get_user(
+        self: ArcBorrow<'_, Acct>,
+        id: Id,
+    ) -> Result<UserRef, AllocError> {
+        let mut acct_guard = self.inner.lock();
+        let (users, users_len, maxima) = acct_guard.as_mut().unfold_mut();
+
+        // SAFETY: The new `Arc` is immediately wrapped in `UserRef`, which
+        //     ensures to merge back the split `Arc` of `find_or_insert()`
+        //     on drop.
+        let user = unsafe {
+            find_or_insert(
+                users,
+                users_len,
+                |ent| id.cmp(&ent.id),
+                || User::new(self, id, maxima),
+            )?
+        };
+
+        Ok(UserRef::new(user))
+    }
+}
+
+impl AcctInner {
+    #[allow(clippy::type_complexity)]
+    fn unfold_mut(
+        self: Pin<&mut Self>,
+    ) -> (
+        Pin<&mut rb::Tree<rb::node_of!(Arc<User>, acct_rb)>>,
+        &mut usize,
+        &mut [Value; N_SLOTS],
+    ) {
+        // SAFETY: Only `AcctInner.users` is structurally pinned.
+        unsafe {
+            let inner = Pin::into_inner_unchecked(self);
+            (
+                Pin::new_unchecked(&mut inner.users),
+                &mut inner.users_len,
+                &mut inner.maxima,
+            )
+        }
+    }
+}
+
+impl Actor {
+    /// Create a new actor assigned to the given user.
+    ///
+    /// Any amount of actors can be created for a given user, and they can be
+    /// shared between different operations, if desired. Actors always operate
+    /// on behalf of the user they were created for. As such, actors of
+    /// different users never share any quota. Additionally, multiple actors
+    /// assigned to the same user will be protected via a second-level quota.
+    ///
+    /// Long story short: Resource use of each actor can be safely traced and
+    /// is tracked by the resource accounting.
+    pub fn with(
+        user: UserRef,
+    ) -> Result<Arc<Self>, AllocError> {
+        Arc::new(
+            Self {
+                user,
+            },
+            GFP_KERNEL,
+        )
+    }
+
+    /// Turn the reference into a raw pointer.
+    ///
+    /// This will leak the reference and any pinned resources, unless the
+    /// original object is recreated via `Self::from_raw()`.
+    fn into_raw(this: Arc<Self>) -> *mut capi::b1_acct_actor {
+        Arc::into_raw(this).cast_mut().cast()
+    }
+
+    /// Recreate the reference from its raw pointer.
+    ///
+    /// ## Safety
+    ///
+    /// The caller must guarantee this pointer was acquired via
+    /// `Self::into_raw()`, and they must refrain from using the pointer any
+    /// further.
+    unsafe fn from_raw(this: *mut capi::b1_acct_actor) -> Arc<Self> {
+        // SAFETY: Delegated to caller.
+        unsafe { Arc::from_raw(this.cast::<Self>()) }
+    }
+
+    /// Borrow a raw reference.
+    ///
+    /// ## Safety
+    ///
+    /// The caller must guarantee this pointer was acquired via
+    /// `Self::into_raw()`, and they must refrain from releasing it via
+    /// `Self::from_raw()` for `'a`.
+    pub(crate) unsafe fn borrow_raw<'a>(
+        this: *mut capi::b1_acct_actor,
+    ) -> ArcBorrow<'a, Self> {
+        // SAFETY: Caller guarantees `this` is from `Self::into_raw()` and
+        // will not be released for `'a`.
+        unsafe { ArcBorrow::from_raw(this.cast::<Self>()) }
+    }
+
+    // Return the memory address of the actor as integer.
+    //
+    // The same value can be obtained via `(&raw const *actor).addr()`. Note
+    // that this address is stable for the lifetime of an `Arc<Actor>`, since
+    // kernel `Arc` is always pinned.
+    fn addr(self: ArcBorrow<'_, Self>) -> usize {
+        // In the kernel, `Arc` is always pinned, so the address can be used as
+        // stable indicator for this actor.
+        util::ptr_addr(core::ptr::from_ref(&*self))
+    }
+
+    /// Charge resources on the user of this actor with the given claimant
+    /// actor.
+    ///
+    /// See [`User::charge()`] for details on the charge operation.
+    pub fn charge(
+        self: ArcBorrow<'_, Self>,
+        claimant: ArcBorrow<'_, Self>,
+        amount: &[Value; N_SLOTS],
+    ) -> Result<Charge, ChargeError> {
+        self.user.charge(
+            claimant,
+            amount,
+        )
+    }
+}
+
+impl UserRef {
+    fn new(arc: Arc<User>) -> Self {
+        Self {
+            arc: ManuallyDrop::new(arc),
+        }
+    }
+
+    /// Turn the reference into a raw pointer.
+    ///
+    /// This will leak the reference and any pinned resources, unless the
+    /// original object is recreated via `Self::from_raw()`.
+    fn into_raw(mut this: Self) -> *mut capi::b1_acct_user {
+        // SAFETY: The drop-handler is skipped if we leak the value, which
+        //     we do here. We also do not expose access to the refcount, so
+        //     either the value is leaked, or recreated via `Self::from_raw()`.
+        let arc = unsafe { ManuallyDrop::take(&mut this.arc) };
+        core::mem::forget(this);
+        Arc::into_raw(arc).cast_mut().cast()
+    }
+
+    /// Recreate the reference from its raw pointer.
+    ///
+    /// ## Safety
+    ///
+    /// The caller must guarantee this pointer was acquired via
+    /// `Self::into_raw()`, and they must refrain from using the pointer any
+    /// further.
+    unsafe fn from_raw(this: *mut capi::b1_acct_user) -> Self {
+        // SAFETY: Delegated to caller.
+        Self::new(unsafe { Arc::from_raw(this.cast()) })
+    }
+
+    fn charge_claims(
+        user_inner: Pin<&mut UserInner>,
+        quota_inner: Pin<&mut QuotaInner>,
+        trace_inner: Pin<&mut TraceInner>,
+        amount: &[Value; N_SLOTS],
+        cross_user: bool,
+    ) -> Result<(), ChargeError> {
+        //
+        // Fetch all relevant information needed for the quota operation.
+        // This simplifies the accessors and avoids dealing with pinning in
+        // the quota calculations.
+        //
+
+        let n_users = value_from_usize(user_inner.as_ref().quotas_len);
+        let n_actors = value_from_usize(quota_inner.as_ref().traces_len);
+        // SAFETY: `UserInner.claim` is not structurally pinned.
+        let claim_user = unsafe { &mut Pin::into_inner_unchecked(user_inner).claim };
+        // SAFETY: `QuotaInner.claim` is not structurally pinned.
+        let claim_quota = unsafe { &mut Pin::into_inner_unchecked(quota_inner).claim };
+        // SAFETY: `TraceInner.claim` is not structurally pinned.
+        let claim_trace = unsafe { &mut Pin::into_inner_unchecked(trace_inner).claim };
+
+        // Remember the charge amount for each level / claim-object.
+        let mut reqs = [[0; 3]; N_SLOTS];
+
+        // First check each slot independently, but apply nothing.
+        for (slot, req) in reqs.iter_mut().enumerate() {
+            let mut minimum;
+
+            req[0] = amount[slot];
+
+            // Direct allocation
+            //
+            // We start the allocation request on the trace object, and
+            // work our way upwards for as long as the request was not
+            // fulfilled.
+            //
+            // A trace object is a leaf node in the allocation tree, and as
+            // such cannot have any overallocated resources. Verify this!
+            // Then delegate the request to the next level.
+
+            assert_eq!(claim_trace.available[slot], 0);
+            req[1] = req[0];
+
+            // Trace allocation
+            //
+            // Since the trace object cannot serve the request, we delegate
+            // the allocation and request more resources for this trace
+            // object from the user. This uses a very lenient allocator,
+            // since no user boundaries are crossed, but we merely share
+            // resources across actors of the same user.
+            //
+            // We first calculate how big the reserve of the user has to be
+            // to serve the request, and then check whether the user
+            // already has enough resources available. If not, we delegate
+            // the allocation request to the next level.
+            //
+            // If trace boundaries are not crossed, we grant full access to
+            // all resources.
+
+            minimum = quota_reserve(
+                allocator_exponential,
+                n_actors,
+                claim_trace.claimed[slot],
+                req[1],
+            ).ok_or(ChargeError::ActorQuota)?;
+
+            if claim_quota.available[slot] >= minimum {
+                continue;
+            }
+
+            req[2] = minimum - claim_quota.available[slot];
+
+            // User allocation
+            //
+            // The reserve of the user was not big enough to serve the
+            // request, so we have to request more resources for this user.
+            // If this crosses user-boundaries, we have to ensure that a
+            // strong allocator is used. But if this does not cross
+            // user-boundaries, we grant full access to all resources of
+            // the user.
+
+            minimum = if cross_user {
+                quota_reserve(
+                    allocator_quasilinear,
+                    n_users,
+                    claim_quota.claimed[slot],
+                    req[2],
+                ).ok_or(ChargeError::UserQuota)?
+            } else {
+                req[2]
+            };
+
+            if claim_user.available[slot] >= minimum {
+                continue;
+            }
+
+            // Root allocation
+            //
+            // The resources of the user were exhausted. We do not support
+            // further propagation, but only provide per-user limits.
+            // Hence, we have to fail the request.
+
+            return Err(ChargeError::UserQuota);
+        }
+
+        // With all quotas checked, apply the charge to each slot.
+        for (slot, req) in reqs.iter().enumerate() {
+            claim_user.available[slot] -= req[2];
+            claim_quota.claimed[slot] += req[2];
+            claim_quota.available[slot] += req[2];
+
+            claim_quota.available[slot] -= req[1];
+            claim_trace.claimed[slot] += req[1];
+            claim_trace.available[slot] += req[1];
+
+            claim_trace.available[slot] -= req[0];
+        }
+
+        Ok(())
+    }
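The ledger update at the end of `charge_claims()` can be illustrated with a toy, single-slot sketch (plain `u64` fields, no locking, pinning, or quota checks; names here are illustrative, not the kernel types):

```rust
// Toy single-slot version of the three-level ledger update: `req[2]` flows
// from the user reserve into the quota, `req[1]` from the quota into the
// trace, and `req[0]` is finally consumed by the request itself.
struct Claim {
    available: u64,
    claimed: u64,
}

fn apply(user: &mut Claim, quota: &mut Claim, trace: &mut Claim, req: [u64; 3]) {
    user.available -= req[2];
    quota.claimed += req[2];
    quota.available += req[2];

    quota.available -= req[1];
    trace.claimed += req[1];
    trace.available += req[1];

    trace.available -= req[0];
}

fn main() {
    let mut user = Claim { available: 100, claimed: 100 };
    let mut quota = Claim { available: 0, claimed: 0 };
    let mut trace = Claim { available: 0, claimed: 0 };

    // A request of 10 that had to be propagated all the way up to the user.
    apply(&mut user, &mut quota, &mut trace, [10, 10, 10]);

    assert_eq!(user.available, 90);
    assert_eq!(quota.claimed, 10);
    assert_eq!(trace.claimed, 10);
    // The tail node keeps nothing in reserve.
    assert_eq!(trace.available, 0);
}
```

This mirrors the invariant stated for `Claim`: tail nodes (traces) end up with zero `available`, while intermediate nodes may retain a reserve when `req[2] > req[1]`.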
+
+    /// Charge resources on this user with the given actor.
+    pub fn charge(
+        &self,
+        claimant: ArcBorrow<'_, Actor>,
+        amount: &[Value; N_SLOTS],
+    ) -> Result<Charge, ChargeError> {
+        // First we lock the relevant user. This lock is used for all quota
+        // and trace lookups underneath a single user.
+        //
+        // Then try to resolve the quota and trace objects, and pin their inner
+        // objects. We then pass those pinned objects to `charge_claims()`,
+        // which performs the actual quota calculations.
+        //
+        // This might allocate a new quota or trace object. In case the
+        // quota-check fails, those can trigger their drop handlers. Ensure
+        // that both `quota` and `trace` are declared before `user_guard`, to
+        // prevent them from being dropped with the lock held.
+        let quota: QuotaRef;
+        let trace: TraceRef;
+        let mut user_guard = self.arc.inner.lock();
+        let mut user_inner = user_guard.as_mut();
+
+        // SAFETY: `user_inner` and `self` refer to the same object.
+        quota = unsafe {
+            user_inner.as_mut().get_quota(self, claimant.user.arc.id)?
+        };
+        // SAFETY: `Quota.inner` is structurally pinned and `user_inner` is
+        //     held sufficiently long. `access_mut_unchecked()` is only called
+        //     once on this object.
+        let mut quota_inner = unsafe {
+            Pin::new_unchecked(
+                quota.arc.inner.access_mut_unchecked(
+                    Pin::into_inner_unchecked(user_inner.as_mut()),
+                )
+            )
+        };
+        // SAFETY: `quota_inner` and `quota` refer to the same object.
+        trace = unsafe {
+            quota_inner.as_mut().get_trace(&quota, claimant)?
+        };
+        // SAFETY: `Trace.inner` is structurally pinned and `user_inner` is
+        //     held sufficiently long. `access_mut_unchecked()` is only called
+        //     once on this object.
+        let trace_inner = unsafe {
+            Pin::new_unchecked(
+                trace.arc.inner.access_mut_unchecked(
+                    Pin::into_inner_unchecked(user_inner.as_mut()),
+                )
+            )
+        };
+
+        Self::charge_claims(
+            user_inner,
+            quota_inner,
+            trace_inner,
+            amount,
+            quota.cross_user(),
+        )?;
+
+        Ok(Charge::with(trace, amount))
+    }
+}
+
+impl core::ops::Drop for UserRef {
+    fn drop(&mut self) {
+        // SAFETY: The value is always valid, and only cleared here.
+        let this = unsafe { ManuallyDrop::take(&mut self.arc) };
+
+        // Try dropping the reference. But if this is the last reference,
+        // take the lock first and retry. If it is still the last, unlink
+        // from the lookup tree and then drop it.
+        if let Some(this) = Arc::drop_unless_unique(this) {
+            let acct = this.acct.clone();
+            let mut acct_guard = acct.inner.lock();
+            let (users, users_len, _) = acct_guard.as_mut().unfold_mut();
+            let this = drop_or_unlink(users, users_len, this);
+            drop(acct_guard);
+            drop(this); // Drop outside of the lock to reduce contention.
+        }
+    }
+}
+
+impl User {
+    fn new(
+        acct: ArcBorrow<'_, Acct>,
+        id: Id,
+        maxima: &[Value; N_SLOTS],
+    ) -> Result<Arc<Self>, AllocError> {
+        match Arc::pin_init(
+            pin_init!(Self {
+                acct: acct.into(),
+                acct_rb: Default::default(),
+                id: id,
+                inner <- kernel::sync::new_mutex!(
+                    UserInner {
+                        quotas: Default::default(),
+                        quotas_len: 0,
+                        maxima: *maxima,
+                        claim: Claim::with(maxima),
+                    },
+                ),
+            }),
+            GFP_KERNEL,
+        ) {
+            Ok(v) => Ok(v),
+            Err(_) => Err(AllocError),
+        }
+    }
+}
+
+impl UserInner {
+    #[allow(clippy::type_complexity)]
+    fn unfold_mut(
+        self: Pin<&mut Self>,
+    ) -> (
+        Pin<&mut rb::Tree<rb::node_of!(Arc<Quota>, user_rb)>>,
+        &mut usize,
+        &mut [Value; N_SLOTS],
+        &mut Claim,
+    ) {
+        // SAFETY: Only `UserInner.quotas` is structurally pinned.
+        unsafe {
+            let inner = Pin::into_inner_unchecked(self);
+            (
+                Pin::new_unchecked(&mut inner.quotas),
+                &mut inner.quotas_len,
+                &mut inner.maxima,
+                &mut inner.claim,
+            )
+        }
+    }
+
+    /// Find a quota object, or create a new one.
+    ///
+    /// ## Safety
+    ///
+    /// `self_ref` and `self` must refer to the same `User` object.
+    unsafe fn get_quota(
+        self: Pin<&mut Self>,
+        self_ref: &UserRef,
+        id: Id,
+    ) -> Result<QuotaRef, AllocError> {
+        let (quotas, quotas_len, _, _) = self.unfold_mut();
+
+        // SAFETY: The new `Arc` is immediately wrapped in `QuotaRef`, which
+        //     ensures the split `Arc` of `find_or_insert()` is merged back
+        //     on drop.
+        //     Caller guarantees `self_ref` refers to the correct user.
+        let quota = unsafe {
+            find_or_insert(
+                quotas,
+                quotas_len,
+                |ent| id.cmp(&ent.id),
+                || Quota::new(self_ref.clone(), id),
+            )?
+        };
+
+        Ok(QuotaRef::new(quota))
+    }
+}
+
+impl QuotaRef {
+    fn new(arc: Arc<Quota>) -> Self {
+        Self {
+            arc: ManuallyDrop::new(arc),
+        }
+    }
+
+    fn cross_user(&self) -> bool {
+        self.arc.id != self.arc.user.arc.id
+    }
+}
+
+impl core::ops::Drop for QuotaRef {
+    fn drop(&mut self) {
+        // SAFETY: The value is always valid, and only cleared here.
+        let this = unsafe { ManuallyDrop::take(&mut self.arc) };
+
+        // Try dropping the reference. But if this is the last reference,
+        // take the lock first and retry. If it is still the last, unlink
+        // from the lookup tree and then drop it.
+        if let Some(this) = Arc::drop_unless_unique(this) {
+            let user = this.user.clone();
+            let mut user_guard = user.arc.inner.lock();
+            let (quotas, quotas_len, _, _) = user_guard.as_mut().unfold_mut();
+            let this = drop_or_unlink(quotas, quotas_len, this);
+            drop(user_guard);
+            drop(this); // Drop outside of the lock to reduce contention.
+        }
+    }
+}
+
+impl Quota {
+    fn new(
+        user: UserRef,
+        id: Id,
+    ) -> Result<Arc<Self>, AllocError> {
+        let inner = LockedBy::new(
+            &user.arc.inner,
+            QuotaInner {
+                traces: Default::default(),
+                traces_len: 0,
+                claim: Claim::new(),
+            },
+        );
+        Arc::new(
+            Self {
+                user,
+                user_rb: Default::default(),
+                id,
+                inner,
+            },
+            GFP_KERNEL,
+        )
+    }
+}
+
+impl QuotaInner {
+    /// Find a trace object, or create a new one.
+    ///
+    /// ## Safety
+    ///
+    /// `self_ref` and `self` must refer to the same `Quota` object.
+    unsafe fn get_trace(
+        self: Pin<&mut Self>,
+        self_ref: &QuotaRef,
+        actor: ArcBorrow<'_, Actor>,
+    ) -> Result<TraceRef, AllocError> {
+        let actor_addr = actor.addr();
+
+        // SAFETY: `Quota::inner` and `QuotaInner::traces` are
+        //     structurally pinned.
+        let (traces, traces_len) = unsafe {
+            let inner = Pin::into_inner_unchecked(self);
+            (
+                Pin::new_unchecked(&mut inner.traces),
+                &mut inner.traces_len,
+            )
+        };
+
+        // SAFETY: The new `Arc` is immediately wrapped in `TraceRef`, which
+        //     ensures the split `Arc` of `find_or_insert()` is merged back
+        //     on drop.
+        //     Caller guarantees `self_ref` refers to the correct quota.
+        let trace = unsafe {
+            find_or_insert(
+                traces,
+                traces_len,
+                |ent| actor_addr.cmp(&ent.actor.as_arc_borrow().addr()),
+                || Trace::new(self_ref.clone(), actor.into()),
+            )?
+        };
+
+        Ok(TraceRef::new(trace))
+    }
+}
+
+impl TraceRef {
+    fn new(arc: Arc<Trace>) -> Self {
+        Self {
+            arc: ManuallyDrop::new(arc),
+        }
+    }
+
+    /// Turn the reference into a raw pointer.
+    ///
+    /// This will leak the reference and any pinned resources, unless the
+    /// original object is recreated via `Self::from_raw()`.
+    fn into_raw(mut this: Self) -> *mut capi::b1_acct_trace {
+        // SAFETY: The drop-handler can be skipped if we leak the value, which
+        //     we do here. We also do not expose access to the refcount, so
+        //     either the value is leaked, or recreated via `Self::from_raw()`.
+        let arc = unsafe { ManuallyDrop::take(&mut this.arc) };
+        core::mem::forget(this);
+        Arc::into_raw(arc).cast_mut().cast()
+    }
+
+    /// Recreate the reference from its raw pointer.
+    ///
+    /// ## Safety
+    ///
+    /// The caller must guarantee this pointer was acquired via
+    /// `Self::into_raw()`, and they must refrain from using the pointer any
+    /// further.
+    unsafe fn from_raw(trace: *mut capi::b1_acct_trace) -> Self {
+        // SAFETY: Delegated to caller.
+        Self::new(unsafe { Arc::from_raw(trace.cast()) })
+    }
+
+    fn discharge(&self, amount: &[Value; N_SLOTS]) {
+        //
+        // First we lock the owning user. The entire chain from `self` to
+        // `self.arc.quota.arc.user` is protected by the user lock.
+        //
+        // Put the remaining code into its own block to ensure any result of
+        // `LockedBy::access_mut_unchecked()` is released before `user_guard`.
+        //
+
+        let mut user_guard = self.arc.quota.arc.user.arc.inner.lock();
+        let mut user_inner = user_guard.as_mut();
+
+        {
+            //
+            // With the user locked, resolve and pin the inner quota and
+            // trace objects of this charge.
+            //
+
+            // SAFETY: `Quota.inner` is structurally pinned.
+            let quota_inner = unsafe {
+                Pin::new_unchecked(
+                    self.arc.quota.arc.inner.access_mut_unchecked(
+                        Pin::into_inner_unchecked(user_inner.as_mut()),
+                    )
+                )
+            };
+            // SAFETY: `Trace.inner` is structurally pinned.
+            let trace_inner = unsafe {
+                Pin::new_unchecked(
+                    self.arc.inner.access_mut_unchecked(
+                        Pin::into_inner_unchecked(user_inner.as_mut()),
+                    )
+                )
+            };
+
+            //
+            // Fetch all relevant information needed for the quota operation.
+            // This simplifies the accessors and avoids dealing with pinning
+            // in the quota calculations.
+            //
+
+            let n_actors = value_from_usize(quota_inner.as_ref().traces_len);
+            // SAFETY: `UserInner.claim` is not structurally pinned.
+            let claim_user = unsafe { &mut Pin::into_inner_unchecked(user_inner).claim };
+            // SAFETY: `QuotaInner.claim` is not structurally pinned.
+            let claim_quota = unsafe { &mut Pin::into_inner_unchecked(quota_inner).claim };
+            // SAFETY: `TraceInner.claim` is not structurally pinned.
+            let claim_trace = unsafe { &mut Pin::into_inner_unchecked(trace_inner).claim };
+
+            //
+            // Everything is prepared. Discharge each slot individually. This
+            // will drop the specified amount from the charge slot of the actor
+            // and then propagate resources back through the trace and quota to
+            // the user.
+            //
+
+            for (slot, amount_slot) in amount.iter().enumerate() {
+                let mut n;
+
+                // Release the claimed amount on the trace-object and make it
+                // available as cached resources. We do this for completeness
+                // reasons, but we immediately propagate the resources in the
+                // next step.
+
+                n = *amount_slot;
+
+                claim_trace.available[slot] += n;
+
+                // Release all unused resources on the trace object and grant
+                // them back to the quota. We do not want to cache any
+                // resources on the trace object yet, but always ensure
+                // everything is properly returned to the lower levels.
+
+                n = claim_trace.available[slot];
+
+                claim_trace.available[slot] -= n;
+                claim_trace.claimed[slot] -= n;
+                claim_quota.available[slot] += n;
+
+                // We now want to release unused resources on the quota object
+                // back to the user. We cannot release all unused resources,
+                // since other charges might require reserves. Hence, we
+                // re-calculate how big the reserve shall be and then shrink
+                // down to this size.
+                //
+                // To figure out how big the reserve needs to be, we calculate
+                // the required reserve if a new user came along and requested
+                // the average size of all allocated resources (i.e., the ideal
+                // 1/n partition). We then use this as minimum reserve. This
+                // also ensures that if the average is 0, the minimum reserve
+                // will also be 0.
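+                //
+                // For example (hypothetical numbers): with claimed = 600,
+                // available = 200, and n_actors = 4, the ideal 1/n partition
+                // is (600 - 200) / 4 = 100, so the reserve is recomputed as
+                // if a new actor requested 100, and anything above that
+                // reserve is released back to the user.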
+
+                let claimed = claim_quota.claimed[slot];
+                let available = claim_quota.available[slot];
+
+                let average = claimed
+                    .checked_sub(available)
+                    .unwrap()
+                    .checked_div(n_actors)
+                    .unwrap();
+
+                let minimum = quota_reserve(
+                    allocator_exponential,
+                    n_actors,
+                    average,
+                    0,
+                ).unwrap_or(Value::MAX);
+
+                if minimum < claim_quota.available[slot] {
+                    n = claim_quota.available[slot] - minimum;
+                    claim_quota.available[slot] -= n;
+                    claim_quota.claimed[slot] -= n;
+                    claim_user.available[slot] += n;
+                }
+            }
+        }
+    }
+}
+
+impl core::ops::Drop for TraceRef {
+    fn drop(&mut self) {
+        // SAFETY: The value is always valid, and only cleared here.
+        let this = unsafe { ManuallyDrop::take(&mut self.arc) };
+
+        // Try dropping the reference. But if this is the last reference,
+        // take the lock first and retry. If it is still the last, unlink
+        // from the lookup tree and then drop it.
+        if let Some(this) = Arc::drop_unless_unique(this) {
+            let quota = this.quota.clone();
+            let mut user_guard = quota.arc.user.arc.inner.lock();
+            // SAFETY: `Quota::inner` and `QuotaInner::traces` are
+            //     structurally pinned.
+            let (traces, traces_len) = unsafe {
+                let inner = quota.arc.inner.access_mut(
+                    Pin::into_inner_unchecked(user_guard.as_mut())
+                );
+                (
+                    Pin::new_unchecked(&mut inner.traces),
+                    &mut inner.traces_len,
+                )
+            };
+            let this = drop_or_unlink(traces, traces_len, this);
+            drop(user_guard);
+            drop(this); // Drop outside of the lock to reduce contention.
+        }
+    }
+}
+
+impl Trace {
+    fn new(
+        quota: QuotaRef,
+        actor: Arc<Actor>,
+    ) -> Result<Arc<Self>, AllocError> {
+        let inner = LockedBy::new(
+            &quota.arc.user.arc.inner,
+            TraceInner {
+                claim: Claim::new(),
+            },
+        );
+        Arc::new(
+            Self {
+                quota,
+                quota_rb: Default::default(),
+                actor,
+                inner,
+            },
+            GFP_KERNEL,
+        )
+    }
+}
+
+impl Charge {
+    fn with(
+        trace: TraceRef,
+        amount: &[Value; N_SLOTS],
+    ) -> Self {
+        Self {
+            inner: capi::b1_acct_charge {
+                trace: TraceRef::into_raw(trace).cast(),
+                amount: *amount,
+            },
+        }
+    }
+
+    /// Wrap the C API charge representation as a `Charge`.
+    ///
+    /// `Charge` is a transparent wrapper around `capi::b1_acct_charge`. This
+    /// method allows interpreting any raw C API charges as a `Charge` object.
+    ///
+/// This does not move the data, nor does it pin the contents. The caller may
+/// swap the value out through the mutable reference, if desired.
+    ///
+    /// ## Safety
+    ///
+    /// `capi` must be convertible to a mutable reference for the lifetime `'a`
+    /// (which can be freely chosen by the caller). `capi.trace` must either be
+    /// `NULL` or a valid value acquired via [`TraceRef::into_raw()`].
+    ///
+    /// `capi.amount` should reflect the actual charge values, otherwise a
+    /// discharge might panic due to integer overflows (but it does not violate
+    /// safety requirements).
+    pub unsafe fn from_capi<'a>(
+        capi: *mut capi::b1_acct_charge,
+    ) -> &'a mut Self {
+        // SAFETY: Delegated to caller.
+        unsafe {
+            &mut *capi.cast::<Self>()
+        }
+    }
+
+    /// Discharge the claimed resources.
+    pub fn discharge(&mut self) {
+        let trace_ptr = core::mem::take(&mut self.inner.trace);
+        if let Some(trace_nn) = core::ptr::NonNull::new(trace_ptr) {
+            // SAFETY: `self.inner.trace` was acquired via
+            //     `TraceRef::into_raw()`. Re-use is prevented by always using
+            //     `core::mem::take()`.
+            let trace = unsafe { TraceRef::from_raw(trace_nn.as_ptr().cast()) };
+            trace.discharge(&self.inner.amount);
+        }
+    }
+}
+
+impl core::ops::Drop for Charge {
+    fn drop(&mut self) {
+        self.discharge();
+    }
+}
+
+/// Create a new accounting system.
+///
+/// This is the C API for [`Arc::new()`]. It returns an error pointer for
+/// `ENOMEM`, or a valid reference to the newly created [`Acct`] object.
+///
+/// ## Safety
+///
+/// `maxima` must be convertible to a shared reference.
+#[export_name = "b1_acct_new"]
+pub unsafe extern "C" fn acct_new(
+    maxima: *const [Value; N_SLOTS],
+) -> *mut capi::b1_acct {
+    // SAFETY: Delegated to caller.
+    match Acct::new(unsafe { &*maxima }) {
+        Ok(v) => Acct::into_raw(v),
+        Err(AllocError) => ENOMEM.to_ptr(),
+    }
+}
+
+/// Create a new reference to an accounting system.
+///
+/// This increases the reference count of the accounting system by one. If
+/// `NULL` is passed, this is a no-op.
+///
+/// This always returns the same pointer that was passed.
+///
+/// ## Safety
+///
+/// If non-NULL, `this` must refer to a valid accounting system and the caller
+/// must hold a reference to it.
+#[export_name = "b1_acct_ref"]
+pub unsafe extern "C" fn acct_ref(
+    this: *mut capi::b1_acct,
+) -> *mut capi::b1_acct {
+    if let Some(this_nn) = core::ptr::NonNull::new(this) {
+        // Ensure `this_ref` is not dropped on panic.
+        let this_ref = ManuallyDrop::new(
+            // SAFETY: Delegated to caller.
+            unsafe { Acct::from_raw(this_nn.as_ptr()) }
+        );
+        let r = (*this_ref).clone();
+        let _ = Acct::into_raw(ManuallyDrop::into_inner(this_ref));
+        Acct::into_raw(r)
+    } else {
+        this
+    }
+}
+
+/// Drop a reference to an accounting system.
+///
+/// This decreases the reference count of the accounting system by one. If
+/// `NULL` is passed, this is a no-op. If this drops the last reference, the
+/// entire accounting system is deallocated.
+///
+/// Note that actors and users also own a reference to the accounting system.
+///
+/// This always returns `NULL`.
+///
+/// ## Safety
+///
+/// If non-NULL, `this` must refer to a valid accounting system and the caller
+/// must hold a reference to it.
+#[export_name = "b1_acct_unref"]
+pub unsafe extern "C" fn acct_unref(
+    this: *mut capi::b1_acct,
+) -> *mut capi::b1_acct {
+    if let Some(this_nn) = core::ptr::NonNull::new(this) {
+        // SAFETY: Delegated to caller.
+        let _ = unsafe { Acct::from_raw(this_nn.as_ptr()) };
+    }
+    core::ptr::null_mut()
+}
+
+/// Create a new actor for a given user.
+///
+/// This always creates a new actor, which will act on behalf of the given
+/// user.
+///
+/// This can return `ENOMEM` as an error pointer on failure.
+///
+/// ## Safety
+///
+/// `user` must refer to a valid user and the caller must hold a reference
+/// to it.
+#[export_name = "b1_acct_actor_new"]
+pub unsafe extern "C" fn actor_new(
+    user: *mut capi::b1_acct_user,
+) -> *mut capi::b1_acct_actor {
+    let user_nn = core::ptr::NonNull::new(user).unwrap();
+    // Ensure `user_ref` is not dropped on panic.
+    let user_ref = ManuallyDrop::new(
+        // SAFETY: Delegated to caller.
+        unsafe { UserRef::from_raw(user_nn.as_ptr()) }
+    );
+    let r = match Actor::with((*user_ref).clone()) {
+        Ok(v) => Actor::into_raw(v),
+        Err(AllocError) => ENOMEM.to_ptr(),
+    };
+    let _ = UserRef::into_raw(ManuallyDrop::into_inner(user_ref));
+    r
+}
+
+/// Create a new reference to an actor.
+///
+/// This increases the reference count of the actor by one. If `NULL` is
+/// passed, this is a no-op.
+///
+/// This always returns the same pointer that was passed.
+///
+/// ## Safety
+///
+/// If non-NULL, `this` must refer to a valid actor and the caller must hold a
+/// reference to it.
+#[export_name = "b1_acct_actor_ref"]
+pub unsafe extern "C" fn actor_ref(
+    this: *mut capi::b1_acct_actor,
+) -> *mut capi::b1_acct_actor {
+    if let Some(this_nn) = core::ptr::NonNull::new(this) {
+        // Ensure `this_arc` is not dropped on panic.
+        let this_ref = ManuallyDrop::new(
+            // SAFETY: Delegated to caller.
+            unsafe { Actor::from_raw(this_nn.as_ptr()) }
+        );
+        let r = (*this_ref).clone();
+        let _ = Actor::into_raw(ManuallyDrop::into_inner(this_ref));
+        Actor::into_raw(r)
+    } else {
+        this
+    }
+}
+
+/// Drop a reference to an actor.
+///
+/// This decreases the reference count of the actor by one. If `NULL` is
+/// passed, this is a no-op. If this drops the last reference, the
+/// entire actor is deallocated.
+///
+/// This always returns `NULL`.
+///
+/// ## Safety
+///
+/// If non-NULL, `this` must refer to a valid actor and the caller must hold a
+/// reference to it.
+#[export_name = "b1_acct_actor_unref"]
+pub unsafe extern "C" fn actor_unref(
+    this: *mut capi::b1_acct_actor,
+) -> *mut capi::b1_acct_actor {
+    if let Some(this_nn) = core::ptr::NonNull::new(this) {
+        // SAFETY: Delegated to caller.
+        let _ = unsafe { Actor::from_raw(this_nn.as_ptr()) };
+    }
+    core::ptr::null_mut()
+}
+
+/// Charge an actor.
+///
+/// Charge the actor `this` for the resource amount given in `amount`. The
+/// claimant actor is given as `claimant`. The charge is recorded in `charge`.
+///
+/// If `charge` is already used, this will return `EINVAL`. If the user quota
+/// is exceeded, this will return `EDQUOT`. If the actor quota is exceeded,
+/// this will return `EXFULL`.
+///
+/// ## Safety
+///
+/// `this` must refer to a valid actor and the caller must hold a reference
+/// to it.
+///
+/// `claimant` must refer to a valid actor and the caller must hold a reference
+/// to it.
+///
+/// `charge` must refer to an initialized and valid charge object.
+///
+/// `amount` must refer to a valid array of charge values.
+#[export_name = "b1_acct_actor_charge"]
+pub unsafe extern "C" fn actor_charge(
+    this: *mut capi::b1_acct_actor,
+    charge: *mut capi::b1_acct_charge,
+    claimant: *mut capi::b1_acct_actor,
+    amount: *const [Value; N_SLOTS],
+) -> c_int {
+    let this_nn = core::ptr::NonNull::new(this).unwrap();
+    let charge_nn = core::ptr::NonNull::new(charge).unwrap();
+    let claimant_nn = core::ptr::NonNull::new(claimant).unwrap();
+    let amount_nn = core::ptr::NonNull::new(amount.cast_mut()).unwrap();
+
+    // Ensure `this_ref` is not dropped on panic.
+    let this_ref = ManuallyDrop::new(
+        // SAFETY: Delegated to caller.
+        unsafe { Actor::from_raw(this_nn.as_ptr()) }
+    );
+    // Ensure `claimant_ref` is not dropped on panic.
+    let claimant_ref = ManuallyDrop::new(
+        // SAFETY: Delegated to caller.
+        unsafe { Actor::from_raw(claimant_nn.as_ptr()) }
+    );
+    // SAFETY: Delegated to caller.
+    let charge_ref = unsafe { Charge::from_capi(charge_nn.as_ptr()) };
+    // SAFETY: Delegated to caller.
+    let amount_ref = unsafe { amount_nn.as_ref() };
+
+    let r = if !charge_ref.inner.trace.is_null() {
+        EINVAL.to_errno()
+    } else {
+        match this_ref.as_arc_borrow().charge(
+            claimant_ref.as_arc_borrow(),
+            amount_ref,
+        ) {
+            Ok(mut v) => {
+                core::mem::swap(&mut v, charge_ref);
+                0
+            },
+            Err(ChargeError::Alloc(AllocError)) => {
+                ENOMEM.to_errno()
+            },
+            Err(ChargeError::UserQuota) => {
+                EDQUOT.to_errno()
+            },
+            Err(ChargeError::ActorQuota) => {
+                EXFULL.to_errno()
+            },
+        }
+    };
+
+    let _ = Actor::into_raw(ManuallyDrop::into_inner(claimant_ref));
+    let _ = Actor::into_raw(ManuallyDrop::into_inner(this_ref));
+    r
+}
+
+/// Get a user object from an accounting system.
+///
+/// This either creates a new user object, or returns the existing user object
+/// for the given ID in this accounting system.
+///
+/// This can return `ENOMEM` as an error pointer on failure.
+///
+/// ## Safety
+///
+/// `this` must refer to a valid accounting system and the caller must hold
+/// a reference to it.
+#[export_name = "b1_acct_get_user"]
+pub unsafe extern "C" fn acct_get_user(
+    this: *mut capi::b1_acct,
+    id: Id,
+) -> *mut capi::b1_acct_user {
+    let this_nn = core::ptr::NonNull::new(this).unwrap();
+    // Ensure `this_ref` is not dropped on panic.
+    let this_ref = ManuallyDrop::new(
+        // SAFETY: Delegated to caller.
+        unsafe { Acct::from_raw(this_nn.as_ptr()) }
+    );
+    let r = match this_ref.as_arc_borrow().get_user(id) {
+        Ok(v) => UserRef::into_raw(v),
+        Err(AllocError) => ENOMEM.to_ptr(),
+    };
+    let _ = Acct::into_raw(ManuallyDrop::into_inner(this_ref));
+    r
+}
+
+/// Create a new reference to a user.
+///
+/// This increases the reference count of the user by one. If `NULL` is
+/// passed, this is a no-op.
+///
+/// This always returns the same pointer that was passed.
+///
+/// ## Safety
+///
+/// If non-NULL, `this` must refer to a valid user and the caller must hold a
+/// reference to it.
+#[export_name = "b1_acct_user_ref"]
+pub unsafe extern "C" fn user_ref(
+    this: *mut capi::b1_acct_user,
+) -> *mut capi::b1_acct_user {
+    if let Some(this_nn) = core::ptr::NonNull::new(this) {
+        // Ensure `this_ref` is not dropped on panic.
+        let this_ref = ManuallyDrop::new(
+            // SAFETY: Delegated to caller.
+            unsafe { UserRef::from_raw(this_nn.as_ptr()) }
+        );
+        let r = (*this_ref).clone();
+        let _ = UserRef::into_raw(ManuallyDrop::into_inner(this_ref));
+        UserRef::into_raw(r)
+    } else {
+        this
+    }
+}
+
+/// Drop a reference to a user.
+///
+/// This decreases the reference count of the user by one. If `NULL` is
+/// passed, this is a no-op. If this drops the last reference, the
+/// entire user is deallocated.
+///
+/// This always returns `NULL`.
+///
+/// ## Safety
+///
+/// If non-NULL, `this` must refer to a valid user and the caller must hold a
+/// reference to it.
+#[export_name = "b1_acct_user_unref"]
+pub unsafe extern "C" fn user_unref(
+    this: *mut capi::b1_acct_user,
+) -> *mut capi::b1_acct_user {
+    if let Some(this_nn) = core::ptr::NonNull::new(this) {
+        // SAFETY: Delegated to caller.
+        let _ = unsafe { UserRef::from_raw(this_nn.as_ptr()) };
+    }
+    core::ptr::null_mut()
+}
+
+/// Initialize a charge object.
+///
+/// This initializes the possibly uninitialized charge object. Any previous
+/// value is leaked and replaced.
+///
+/// On return, the charge object will be cleared and contain no charge. Any
+/// discharge operation will thus be a no-op, unless the object is charged
+/// in between.
+///
+/// ## Safety
+///
+/// `this` must refer to a charge object, but can be uninitialized.
+#[export_name = "b1_acct_charge_init"]
+pub unsafe extern "C" fn charge_init(
+    this: *mut capi::b1_acct_charge,
+) {
+    let this_nn = core::ptr::NonNull::new(this).unwrap();
+    // SAFETY: Delegated to caller.
+    unsafe {
+        this_nn.write(capi::b1_acct_charge {
+            trace: core::ptr::null_mut(),
+            amount: [0; _],
+        });
+    }
+}
+
+/// Deinitialize a charge object.
+///
+/// This deinitializes a charge object. Any charges are discharged and the
+/// object is put into a freshly initialized state, ready to be re-used or
+/// dropped.
+///
+/// ## Safety
+///
+/// `this` must refer to a valid charge object.
+#[export_name = "b1_acct_charge_deinit"]
+pub unsafe extern "C" fn charge_deinit(
+    this: *mut capi::b1_acct_charge,
+) {
+    let this_nn = core::ptr::NonNull::new(this).unwrap();
+    // SAFETY: Delegated to caller.
+    unsafe {
+        Charge::from_capi(this_nn.as_ptr()).discharge();
+    }
+}
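+
+// Typical C-side lifecycle of a charge (illustrative sketch; slot values are
+// hypothetical):
+//
+//     struct b1_acct_charge charge;
+//     b1_acct_value_t amount[_B1_ACCT_SLOT_N] = { 1, 128 };
+//     int r;
+//
+//     b1_acct_charge_init(&charge);
+//     r = b1_acct_actor_charge(actor, &charge, claimant, &amount);
+//     if (r)
+//         return r;
+//     /* ... use the charged resources ... */
+//     b1_acct_charge_deinit(&charge);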
+
+#[kunit_tests(bus1_acct)]
+mod test {
+    use super::*;
+
+    #[test]
+    fn basic() -> Result<(), AllocError> {
+        let acct = Acct::new(&[1024; N_SLOTS])?;
+        let u0 = acct.as_arc_borrow().get_user(0)?;
+        let u1 = acct.as_arc_borrow().get_user(1)?;
+        let u2 = acct.as_arc_borrow().get_user(2)?;
+        let u0a0 = Actor::with(u0.clone())?;
+        let u0a1 = Actor::with(u0.clone())?;
+        let u1a0 = Actor::with(u1.clone())?;
+        let u1a1 = Actor::with(u1.clone())?;
+        let u2a0 = Actor::with(u2.clone())?;
+        let u2a1 = Actor::with(u2.clone())?;
+        let u0a0b = u0a0.as_arc_borrow();
+        let u0a1b = u0a1.as_arc_borrow();
+        let u1a0b = u1a0.as_arc_borrow();
+        let u1a1b = u1a1.as_arc_borrow();
+        let u2a0b = u2a0.as_arc_borrow();
+        let u2a1b = u2a1.as_arc_borrow();
+
+        // Perform a self-charge and verify that half of the resources is the
+        // maximum that can be requested. Also verify that this grants full
+        // access to all user resources (i.e., requests of other users always
+        // fail).
+        // Then verify that releasing a chunk allows reclaiming the resources.
+        {
+            let _c0 = u0a0b.charge(u0a0b, &[256; N_SLOTS]).unwrap();
+            let c1 = u0a0b.charge(u0a0b, &[256; N_SLOTS]).unwrap();
+            assert!(u0a0b.charge(u0a0b, &[1; N_SLOTS]).is_err());
+            assert!(u0a0b.charge(u1a0b, &[1; N_SLOTS]).is_err());
+
+            drop(c1);
+            let _c2 = u0a0b.charge(u0a0b, &[256; N_SLOTS]).unwrap();
+        }
+
+        // Perform a self-charge with two actors, and verify the first gets
+        // half, and the second gets half of that. Also ensure foreign actors
+        // cannot claim anything, since this series requires the full resource
+        // set of the user.
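+        // (The limits form a halve-the-remainder series: with a maximum of
+        // 1024, successive self-charges are capped at 512, 256, 128, ...)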
+        {
+            let _c0 = u0a0b.charge(u0a0b, &[512; N_SLOTS]).unwrap();
+            assert!(u0a0b.charge(u0a0b, &[1; N_SLOTS]).is_err());
+            assert!(u0a0b.charge(u1a0b, &[1; N_SLOTS]).is_err());
+
+            let _c1 = u0a0b.charge(u0a1b, &[256; N_SLOTS]).unwrap();
+            assert!(u0a0b.charge(u0a0b, &[1; N_SLOTS]).is_err());
+            assert!(u0a0b.charge(u0a1b, &[1; N_SLOTS]).is_err());
+        }
+
+        // Perform a foreign-charge with two actors of the same UID, and verify
+        // that they share the foreign charge (i.e., the second charge is half
+        // of the first charge, if both charge to their limit).
+        // Then perform a foreign-charge of another two actors of yet another
+        // UID. Again, charge to their limit (which must be non-zero, since the
+        // quota separates them from the previous charges) and verify it is
+        // the expected limit.
+        {
+            let _c0 = u0a0b.charge(u1a0b, &[128; N_SLOTS]).unwrap();
+            assert!(u0a0b.charge(u1a0b, &[1; N_SLOTS]).is_err());
+            let _c1 = u0a0b.charge(u1a1b, &[64; N_SLOTS]).unwrap();
+            assert!(u0a0b.charge(u1a0b, &[1; N_SLOTS]).is_err());
+            let _c2 = u0a0b.charge(u2a0b, &[42; N_SLOTS]).unwrap();
+            assert!(u0a0b.charge(u2a0b, &[1; N_SLOTS]).is_err());
+            let _c3 = u0a0b.charge(u2a1b, &[21; N_SLOTS]).unwrap();
+            assert!(u0a0b.charge(u2a1b, &[1; N_SLOTS]).is_err());
+        }
+        Ok(())
+    }
+}
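The limits asserted in the self-charge test above follow a successive-halving quota. A minimal userspace model of that rule (the helper name is illustrative, and the in-kernel accounting additionally partitions foreign charges per UID, which this sketch does not capture): an actor's cumulative charge may not exceed half of whatever the other actors have left free.

```rust
/// Simplified model of the self-charge quota exercised in the test above:
/// an actor may hold at most half of the resources not charged by others.
/// (Hypothetical helper; the real accounting also separates per-UID quotas
/// for foreign charges.)
fn can_charge(total: u64, own: u64, others: u64, amount: u64) -> bool {
    own + amount <= (total - others) / 2
}
```

With `total = 1024`, the first actor tops out at 512 and a second actor of the same user at 256, matching the `charge()` limits asserted in the test.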
diff --git a/ipc/bus1/lib.h b/ipc/bus1/lib.h
index e84c47f97031..808f5da94919 100644
--- a/ipc/bus1/lib.h
+++ b/ipc/bus1/lib.h
@@ -15,4 +15,70 @@
 #include <linux/types.h>
 #include <uapi/linux/bus1.h>
 
+typedef __u32 b1_acct_id_t;
+typedef __u64 b1_acct_value_t;
+
+struct b1_acct;
+struct b1_acct_actor;
+struct b1_acct_charge;
+struct b1_acct_trace;
+struct b1_acct_user;
+
+/* accounting */
+
+enum: size_t {
+	B1_ACCT_SLOT_OBJECTS,
+	B1_ACCT_SLOT_BYTES,
+	_B1_ACCT_SLOT_N,
+};
+
+struct b1_acct_charge {
+	struct b1_acct_trace *trace;
+	b1_acct_value_t amount[_B1_ACCT_SLOT_N];
+};
+
+#define B1_ACCT_CHARGE_INIT() ((struct b1_acct_charge){})
+
+struct b1_acct *b1_acct_new(const b1_acct_value_t (*maxima)[_B1_ACCT_SLOT_N]);
+struct b1_acct *b1_acct_ref(struct b1_acct *acct);
+struct b1_acct *b1_acct_unref(struct b1_acct *acct);
+
+struct b1_acct_actor *b1_acct_actor_new(struct b1_acct_user *user);
+struct b1_acct_actor *b1_acct_actor_ref(struct b1_acct_actor *actor);
+struct b1_acct_actor *b1_acct_actor_unref(struct b1_acct_actor *actor);
+
+int b1_acct_actor_charge(
+	struct b1_acct_actor *actor,
+	struct b1_acct_charge *charge,
+	const b1_acct_value_t (*amount)[_B1_ACCT_SLOT_N]
+);
+
+struct b1_acct_user *b1_acct_get_user(struct b1_acct *acct, b1_acct_id_t id);
+struct b1_acct_user *b1_acct_user_ref(struct b1_acct_user *user);
+struct b1_acct_user *b1_acct_user_unref(struct b1_acct_user *user);
+
+void b1_acct_charge_init(struct b1_acct_charge *charge);
+void b1_acct_charge_deinit(struct b1_acct_charge *charge);
+
+DEFINE_FREE(
+	b1_acct_unref,
+	struct b1_acct *,
+	if (!IS_ERR_OR_NULL(_T))
+		b1_acct_unref(_T);
+)
+
+DEFINE_FREE(
+	b1_acct_actor_unref,
+	struct b1_acct_actor *,
+	if (!IS_ERR_OR_NULL(_T))
+		b1_acct_actor_unref(_T);
+)
+
+DEFINE_FREE(
+	b1_acct_user_unref,
+	struct b1_acct_user *,
+	if (!IS_ERR_OR_NULL(_T))
+		b1_acct_user_unref(_T);
+)
+
 #endif /* __B1_LIB_H */
diff --git a/ipc/bus1/lib.rs b/ipc/bus1/lib.rs
index a7e7a99086c2..05f21601f569 100644
--- a/ipc/bus1/lib.rs
+++ b/ipc/bus1/lib.rs
@@ -4,6 +4,7 @@
 //! This is the in-kernel implementation of the Bus1 communication system in
 //! rust. Any user-space API is outside the scope of this module.
 
+pub mod acct;
 pub mod util;
 
 #[allow(
-- 
2.53.0




* [RFC 15/16] bus1: introduce peers, handles, and nodes
  2026-03-31 19:02 [RFC 00/16] bus1: Capability-based IPC for Linux David Rheinsberg
                   ` (13 preceding siblings ...)
  2026-03-31 19:03 ` [RFC 14/16] bus1/acct: add resource accounting David Rheinsberg
@ 2026-03-31 19:03 ` David Rheinsberg
  2026-03-31 19:03 ` [RFC 16/16] bus1: implement the uapi David Rheinsberg
  2026-03-31 19:46 ` [RFC 00/16] bus1: Capability-based IPC for Linux Miguel Ojeda
  16 siblings, 0 replies; 33+ messages in thread
From: David Rheinsberg @ 2026-03-31 19:03 UTC (permalink / raw)
  To: rust-for-linux; +Cc: teg, Miguel Ojeda, David Rheinsberg

Add the main bus management, introducing peers, handles, and nodes. This
implements the core of bus1 and exposes it via a C API to the other
parts of the kernel. For now, this C API is limited to other code within
the same module, but could theoretically be exposed to the entire
kernel.

Signed-off-by: David Rheinsberg <david@readahead.eu>
---
 ipc/bus1/bus.rs | 1510 +++++++++++++++++++++++++++++++++++++++++++++++
 ipc/bus1/lib.h  |  118 ++++
 ipc/bus1/lib.rs |    1 +
 3 files changed, 1629 insertions(+)
 create mode 100644 ipc/bus1/bus.rs

diff --git a/ipc/bus1/bus.rs b/ipc/bus1/bus.rs
new file mode 100644
index 000000000000..70a6a4f35d96
--- /dev/null
+++ b/ipc/bus1/bus.rs
@@ -0,0 +1,1510 @@
+//! Bus Management
+//!
+//! This module implements the core components of the bus. It provides peers,
+//! nodes, handles, as well as message handling and atomic operations.
+
+use core::ptr::NonNull;
+use kernel::prelude::*;
+use kernel::alloc::AllocError;
+use kernel::sync::{Arc, ArcBorrow, atomic};
+use crate::{acct, capi, util::{self, field, lll, rb, slist}};
+
+#[derive(Clone, Copy, Debug, Eq, PartialEq)]
+enum MessageError {
+    /// Could not allocate required state-tracking.
+    Alloc(AllocError),
+    /// The target handle or a transfer handle is not owned by the operator.
+    HandleForeign,
+}
+
+#[pin_data]
+struct Peer {
+    actor: Arc<acct::Actor>,
+    waitq: *mut kernel::bindings::wait_queue_head,
+    queue: lll::List<TxNodeRef>,
+    queue_committed: atomic::Atomic<usize>,
+    shutdown: Arc<Tx>,
+    #[pin]
+    inner: kernel::sync::Mutex<PeerLocked>,
+}
+
+struct PeerLocked {
+    queue_ready: slist::List<TxNodeRef>,
+    queue_busy: slist::List<TxNodeRef>,
+}
+
+#[pin_data]
+struct Node {
+    owner: Arc<Peer>,
+    userdata: atomic::Atomic<usize>,
+    op_rb: rb::Node,
+    #[pin]
+    inner: kernel::sync::Mutex<NodeLocked>,
+}
+
+util::field::impl_pin_field!(Node, op_rb, rb::Node);
+
+struct NodeLocked {
+    handles: rb::Tree<rb::node_of!(Arc<Handle>, node_rb)>,
+}
+
+struct Handle {
+    node: Arc<Node>,
+    owner: Arc<Peer>,
+    userdata: atomic::Atomic<usize>,
+    node_rb: rb::Node,
+    op_rb: rb::Node,
+    release_node: TxNode,
+    release_handle: TxNode,
+}
+
+util::field::impl_pin_field!(Handle, node_rb, rb::Node);
+util::field::impl_pin_field!(Handle, op_rb, rb::Node);
+util::field::impl_pin_field!(Handle, release_node, TxNode);
+util::field::impl_pin_field!(Handle, release_handle, TxNode);
+
+struct Message {
+    via: Arc<Handle>,
+    transfers: KBox<[*mut capi::b1_handle]>,
+    shared: Arc<MessageShared>,
+    op_rb: rb::Node,
+    tx_node: TxNode,
+}
+
+util::field::impl_pin_field!(Message, op_rb, rb::Node);
+util::field::impl_pin_field!(Message, tx_node, TxNode);
+
+struct MessageShared {
+    data: KVBox<[u8]>,
+}
+
+struct Op {
+    operator: Arc<Peer>,
+    tx: Arc<Tx>,
+    messages: rb::Tree<rb::node_of!(Arc<Message>, op_rb)>,
+    nodes: rb::Tree<rb::node_of!(Arc<Node>, op_rb)>,
+    handles: rb::Tree<rb::node_of!(Arc<Handle>, op_rb)>,
+}
+
+struct Tx {
+    committed: atomic::Atomic<bool>,
+}
+
+struct TxNode {
+    kind: TxNodeKind,
+    peer_link: lll::Node,
+    // XXX: Switch to atomic pointers once available.
+    tx: atomic::Atomic<usize>,
+}
+
+util::field::impl_pin_field!(TxNode, peer_link, lll::Node);
+
+#[derive(Clone, Copy, Debug, Eq, PartialEq)]
+enum TxNodeKind {
+    User,
+    ReleaseNode,
+    ReleaseHandle,
+}
+
+#[derive(Clone)]
+enum TxNodeRef {
+    User(Arc<Message>),
+    ReleaseNode(Arc<Handle>),
+    ReleaseHandle(Arc<Handle>),
+}
+
+impl core::convert::From<AllocError> for MessageError {
+    fn from(v: AllocError) -> Self {
+        Self::Alloc(v)
+    }
+}
+
+impl Peer {
+    fn new(
+        actor: Arc<acct::Actor>,
+        waitq: *mut kernel::bindings::wait_queue_head,
+    ) -> Result<Arc<Self>, AllocError> {
+        let tx = Tx::new()?;
+        match Arc::pin_init(
+            pin_init!(Self {
+                actor,
+                waitq,
+                queue: lll::List::new(),
+                queue_committed: atomic::Atomic::new(0),
+                shutdown: tx,
+                inner <- kernel::sync::new_mutex!(
+                    PeerLocked {
+                        queue_ready: slist::List::new(),
+                        queue_busy: slist::List::new(),
+                    },
+                ),
+            }),
+            GFP_KERNEL,
+        ) {
+            Ok(v) => Ok(v),
+            Err(_) => Err(AllocError),
+        }
+    }
+
+    /// Turn the reference into a raw pointer.
+    ///
+    /// This will leak the reference and any pinned resources, unless the
+    /// original object is recreated via `Self::from_raw()`.
+    fn into_raw(this: Arc<Self>) -> *mut capi::b1_peer {
+        Arc::into_raw(this).cast_mut().cast()
+    }
+
+    /// Recreate the reference from its raw pointer.
+    ///
+    /// ## Safety
+    ///
+    /// The caller must guarantee this pointer was acquired via
+    /// `Self::into_raw()`, and they must refrain from using the pointer any
+    /// further.
+    unsafe fn from_raw(this: *mut capi::b1_peer) -> Arc<Self> {
+        // SAFETY: Caller guarantees `this` is from `Self::into_raw()`.
+        unsafe { Arc::from_raw(this.cast::<Self>()) }
+    }
+
+    /// Borrow a raw reference.
+    ///
+    /// ## Safety
+    ///
+    /// The caller must guarantee this pointer was acquired via
+    /// `Self::into_raw()`, and they must refrain from releasing it via
+    /// `Self::from_raw()` for `'a`.
+    unsafe fn borrow_raw<'a>(
+        this: *mut capi::b1_peer,
+    ) -> ArcBorrow<'a, Self> {
+        // SAFETY: Caller guarantees `this` is from `Self::into_raw()` and
+        // will not be released for `'a`.
+        unsafe { ArcBorrow::from_raw(this.cast::<Self>()) }
+    }
+
+    fn owns_node(&self, node: &Node) -> bool {
+        core::ptr::eq(self, &*node.owner)
+    }
+
+    fn owns_handle(&self, handle: &Handle) -> bool {
+        core::ptr::eq(self, &*handle.owner)
+    }
+
+    fn wake(&self) {
+        // XXX: This needs to be synchronized through begin()/end() and
+        // protected via rcu. The waitq should be considered detached on
+        // `end()`, but accessible for at least an rcu grace period.
+        unsafe {
+            kernel::bindings::__wake_up(
+                self.waitq,
+                kernel::bindings::TASK_INTERRUPTIBLE,
+                1,
+                core::ptr::null_mut(),
+            );
+        }
+    }
+
+    fn begin(&self) {
+        // Called when the owner of the peer considers it set up. Given that
+        // peers are standalone, there is nothing to be done here. As long as
+        // the caller did not expose it, they still retain full control.
+    }
+
+    /// Shutdown operations on this peer.
+    ///
+    /// Generally, a peer reference can simply be dropped without any shutdown.
+    /// As long as all nodes are released, the peer can no longer be reached.
+    /// However, due to the parallel nature of the bus, messages might be
+    /// about to be queued on this peer even though the involved nodes have
+    /// already been released. This can leave circular references behind if
+    /// those messages carry handles (as those handles carry a peer reference
+    /// themselves).
+    ///
+    /// To prevent this scenario, a shutdown will seal the queue of a peer and
+    /// ensure any ongoing transactions will discard the messages destined to
+    /// this peer (as well as any already queued messages).
+    ///
+    /// Usually, the peer should no longer be used for bus operations after a
+    /// shutdown, and any nodes should have been released before.
+    fn end(self: ArcBorrow<'_, Self>) {
+        let mut op = core::pin::pin!(Op::with(self.into(), self.shutdown.clone()));
+
+        // Properly release all handles that are currently queued on pending
+        // messages.
+        let mut q = self.queue.seal();
+        while let Some(v) = q.unlink_front() {
+            if let Some(m) = TxNodeRef::as_user(&v) {
+                for t in &*m.transfers {
+                    // SAFETY: `transfers` holds pointers acquired via
+                    // `Handle::into_raw()`, kept alive by the message.
+                    let b = unsafe { Handle::borrow_raw(*t) };
+                    op.as_mut().release_handle(b.into());
+                }
+            }
+        }
+
+        op.commit();
+    }
+
+    fn create_node(
+        self: ArcBorrow<'_, Self>,
+        other: ArcBorrow<'_, Self>,
+    ) -> Result<(Arc<Node>, Arc<Handle>), AllocError> {
+        let n = Node::new(self.into())?;
+        let h = Handle::new(n.clone(), other.into())?;
+        Ok((n, h))
+    }
+
+    fn create_handle(
+        self: ArcBorrow<'_, Self>,
+        from: ArcBorrow<'_, Handle>,
+    ) -> Result<Arc<Handle>, AllocError> {
+        Handle::new(from.node.clone(), self.into())
+    }
+
+    fn readable(
+        self: ArcBorrow<'_, Self>,
+    ) -> bool {
+        self.queue_committed.load(atomic::Relaxed) > 0
+    }
+
+    fn peek(
+        self: ArcBorrow<'_, Self>,
+        peek: &mut capi::b1_peer_peek,
+    ) -> bool {
+        if !self.readable() {
+            return false;
+        }
+
+        let mut peer_guard = self.inner.lock();
+        peer_guard.as_mut().prefetch(&self.queue);
+        let (ready, _) = peer_guard.as_mut().unfold_mut();
+
+        let Some(txref) = ready.cursor_mut().get_clone() else {
+            return false;
+        };
+
+        if let Some(m) = TxNodeRef::as_user(&txref) {
+            peek.type_ = capi::bus1_message_type_BUS1_MESSAGE_TYPE_USER;
+            peek.u.user.node = Arc::as_ptr(&m.via.node).cast_mut().cast();
+            peek.u.user.n_transfers = m.transfers.len() as u64;
+            peek.u.user.transfers = m.transfers.as_ptr().cast_mut();
+            peek.u.user.n_data = m.shared.data.len() as u64;
+            peek.u.user.data = m.shared.data.as_ptr().cast_mut().cast();
+            true
+        } else if let Some(h) = TxNodeRef::as_release_node(&txref) {
+            peek.type_ = capi::bus1_message_type_BUS1_MESSAGE_TYPE_NODE_RELEASE;
+            peek.u.node_release.handle = Arc::as_ptr(h).cast_mut().cast();
+            true
+        } else if let Some(h) = TxNodeRef::as_release_handle(&txref) {
+            peek.type_ = capi::bus1_message_type_BUS1_MESSAGE_TYPE_HANDLE_RELEASE;
+            peek.u.handle_release.node = Arc::as_ptr(&h.node).cast_mut().cast();
+            true
+        } else {
+            ready.unlink_front();
+            // Drop the guard before recursing, since the mutex is not
+            // reentrant and `peek()` acquires it again.
+            drop(peer_guard);
+            self.peek(peek)
+        }
+    }
+
+    fn pop(self: ArcBorrow<'_, Self>) {
+        let mut peer_guard = self.inner.lock();
+        let (ready, _) = peer_guard.as_mut().unfold_mut();
+        ready.unlink_front();
+    }
+}
+
+impl PeerLocked {
+    fn unfold_mut(
+        self: Pin<&mut Self>,
+    ) -> (
+        &mut slist::List<TxNodeRef>,
+        &mut slist::List<TxNodeRef>,
+    ) {
+        // SAFETY: Nothing is structurally pinned.
+        unsafe {
+            let inner = Pin::into_inner_unchecked(self);
+            (
+                &mut inner.queue_ready,
+                &mut inner.queue_busy,
+            )
+        }
+    }
+
+    fn prefetch(
+        self: Pin<&mut Self>,
+        queue: &lll::List<TxNodeRef>,
+    ) {
+        let (ready, busy) = self.unfold_mut();
+
+        if !ready.is_empty() {
+            return;
+        }
+
+        // Fetch the entire incoming queue and create an iterator for it. Note
+        // that new entries are added at the front, so this queue is in reverse
+        // order. But this is exactly what we need. Iterate the queue in this
+        // reverse order (so newest entries first). Any entry that is committed
+        // is then pushed to the ready list, uncommitted entries are left on
+        // the todo list. When done, the `ready` list contains the ready items
+        // in chronological order, so oldest message first, but `todo` has the
+        // remaining uncommitted items in the same inverse chronological order.
+        //
+        // This todo list is then saved in `busy` to be iterated on the next
+        // prefetch. However, first, any previous leftover busy list is
+        // appended to the end of the `todo` queue, with any committed entries
+        // moved to `ready`.
+
+        let mut todo = queue.clear();
+        let mut c = todo.cursor_mut();
+
+        while let Some(v) = c.get() {
+            if v.is_committed() {
+                if let Some(v) = c.unlink() {
+                    let _ = ready.try_link_front(v);
+                    continue;
+                }
+            }
+            c.move_next();
+        }
+
+        while let Some(v) = busy.unlink_front() {
+            if TxNodeRef::tx_node(&v).is_committed() {
+                let _ = ready.try_link_front(v);
+            } else {
+                let _ = c.try_link(v);
+                c.move_next();
+            }
+        }
+
+        core::mem::swap(busy, &mut todo);
+    }
+}
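The reordering that `prefetch()` performs can be sketched in userspace with plain `VecDeque`s standing in for the intrusive lists (a simplified model with illustrative names): `incoming` and `busy` are newest-first, `ready` ends up oldest-first, and uncommitted entries are carried over for the next pass.

```rust
use std::collections::VecDeque;

struct Entry {
    seq: u32,
    committed: bool,
}

// `incoming` and `busy` are newest-first; `ready` is oldest-first.
fn prefetch(
    incoming: &mut VecDeque<Entry>,
    ready: &mut VecDeque<Entry>,
    busy: &mut VecDeque<Entry>,
) {
    if !ready.is_empty() {
        return;
    }

    let mut todo = VecDeque::new();

    // Walk the freshly fetched queue newest-first: committed entries are
    // pushed to the front of `ready`, restoring chronological order;
    // uncommitted ones stay on `todo` in their newest-first order.
    while let Some(e) = incoming.pop_front() {
        if e.committed {
            ready.push_front(e);
        } else {
            todo.push_back(e);
        }
    }

    // Append the previous round's leftovers (older than anything in
    // `todo`), again sorting committed entries into the front of `ready`.
    while let Some(e) = busy.pop_front() {
        if e.committed {
            ready.push_front(e);
        } else {
            todo.push_back(e);
        }
    }

    std::mem::swap(busy, &mut todo);
}
```

Entries committed in an earlier round (the old `busy` list) land before newly fetched committed entries, so `ready` always delivers oldest-first.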
+
+impl Node {
+    fn new(
+        owner: Arc<Peer>,
+    ) -> Result<Arc<Self>, AllocError> {
+        match Arc::pin_init(
+            pin_init!(Self {
+                owner,
+                userdata: atomic::Atomic::new(0),
+                op_rb: rb::Node::new(),
+                inner <- kernel::sync::new_mutex!(
+                    NodeLocked {
+                        handles: rb::Tree::new(),
+                    },
+                ),
+            }),
+            GFP_KERNEL,
+        ) {
+            Ok(v) => Ok(v),
+            Err(_) => Err(AllocError),
+        }
+    }
+
+    /// Turn the reference into a raw pointer.
+    ///
+    /// This will leak the reference and any pinned resources, unless the
+    /// original object is recreated via `Self::from_raw()`.
+    fn into_raw(this: Arc<Self>) -> *mut capi::b1_node {
+        Arc::into_raw(this).cast_mut().cast()
+    }
+
+    /// Recreate the reference from its raw pointer.
+    ///
+    /// ## Safety
+    ///
+    /// The caller must guarantee this pointer was acquired via
+    /// `Self::into_raw()`, and they must refrain from using the pointer any
+    /// further.
+    unsafe fn from_raw(this: *mut capi::b1_node) -> Arc<Self> {
+        // SAFETY: Caller guarantees `this` is from `Self::into_raw()`.
+        unsafe { Arc::from_raw(this.cast::<Self>()) }
+    }
+
+    /// Borrow a raw reference.
+    ///
+    /// ## Safety
+    ///
+    /// The caller must guarantee this pointer was acquired via
+    /// `Self::into_raw()`, and they must refrain from releasing it via
+    /// `Self::from_raw()` for `'a`.
+    unsafe fn borrow_raw<'a>(
+        this: *mut capi::b1_node,
+    ) -> ArcBorrow<'a, Self> {
+        // SAFETY: Caller guarantees `this` is from `Self::into_raw()` and
+        // will not be released for `'a`.
+        unsafe { ArcBorrow::from_raw(this.cast::<Self>()) }
+    }
+
+    fn begin(&self) {
+        // Called by the node owner when the node is ready. Since nodes are
+        // completely independent, there is nothing to be done. The node will
+        // remain isolated for as long as the owner does not pass it along.
+    }
+
+    fn end(&self) {
+        // Called when a node has been released and its owner will refrain
+        // from using it anymore. Preferably, we would verify that the node
+        // has a transaction assigned (even if still pending), but so far
+        // nodes do not carry such information, only their linked handles.
+        // Hence, we perform no validation for now.
+    }
+}
+
+impl NodeLocked {
+    fn unfold_mut(
+        self: Pin<&mut Self>,
+    ) -> (
+        Pin<&mut rb::Tree<rb::node_of!(Arc<Handle>, node_rb)>>,
+    ) {
+        // SAFETY: Only `Self.handles` is structurally pinned.
+        unsafe {
+            let inner = Pin::into_inner_unchecked(self);
+            (
+                Pin::new_unchecked(&mut inner.handles),
+            )
+        }
+    }
+}
+
+impl Handle {
+    fn new(
+        node: Arc<Node>,
+        owner: Arc<Peer>,
+    ) -> Result<Arc<Handle>, AllocError> {
+        Arc::new(
+            Self {
+                node,
+                owner,
+                userdata: atomic::Atomic::new(0),
+                node_rb: rb::Node::new(),
+                op_rb: rb::Node::new(),
+                release_node: TxNode::new(TxNodeKind::ReleaseNode),
+                release_handle: TxNode::new(TxNodeKind::ReleaseHandle),
+            },
+            GFP_KERNEL,
+        )
+    }
+
+    /// Turn the reference into a raw pointer.
+    ///
+    /// This will leak the reference and any pinned resources, unless the
+    /// original object is recreated via `Self::from_raw()`.
+    fn into_raw(this: Arc<Self>) -> *mut capi::b1_handle {
+        Arc::into_raw(this).cast_mut().cast()
+    }
+
+    /// Recreate the reference from its raw pointer.
+    ///
+    /// ## Safety
+    ///
+    /// The caller must guarantee this pointer was acquired via
+    /// `Self::into_raw()`, and they must refrain from using the pointer any
+    /// further.
+    unsafe fn from_raw(this: *mut capi::b1_handle) -> Arc<Self> {
+        // SAFETY: Caller guarantees `this` is from `Self::into_raw()`.
+        unsafe { Arc::from_raw(this.cast::<Self>()) }
+    }
+
+    /// Borrow a raw reference.
+    ///
+    /// ## Safety
+    ///
+    /// The caller must guarantee this pointer was acquired via
+    /// `Self::into_raw()`, and they must refrain from releasing it via
+    /// `Self::from_raw()` for `'a`.
+    unsafe fn borrow_raw<'a>(
+        this: *mut capi::b1_handle,
+    ) -> ArcBorrow<'a, Self> {
+        // SAFETY: Caller guarantees `this` is from `Self::into_raw()` and
+        // will not be released for `'a`.
+        unsafe { ArcBorrow::from_raw(this.cast::<Self>()) }
+    }
+
+    fn link(
+        self: ArcBorrow<'_, Self>,
+    ) {
+        let mut node_guard = self.node.inner.lock();
+        let (handles,) = node_guard.as_mut().unfold_mut();
+
+        // It is safe to call this multiple times. If the entry is already
+        // linked, we simply skip this operation. And the slot cannot be
+        // occupied, since we never return `Equal`.
+        let _ = handles.try_link_by(
+            util::arc_pin(self.into()),
+            |v, other| {
+                match util::ptr_cmp(
+                    &*v.owner.as_arc_borrow(),
+                    &*other.owner.as_arc_borrow(),
+                ) {
+                    v @ core::cmp::Ordering::Less => v,
+                    _ => core::cmp::Ordering::Greater,
+                }
+            },
+        );
+    }
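The comparator in `link()` never returns `Equal`: entries with the same owner compare as `Greater`, so the tree behaves as a multiset and insertion can never collide with an occupied slot. The same trick over a sorted vector (illustrative helper, not part of the patch):

```rust
/// Insert `key` into a sorted vector using an ordering that never reports
/// equality: existing entries equal to `key` are treated as smaller, so
/// the new entry lands after any duplicates instead of replacing them.
fn multiset_insert(v: &mut Vec<u32>, key: u32) {
    let pos = v.partition_point(|&x| x <= key);
    v.insert(pos, key);
}
```

Duplicates accumulate in insertion order rather than being rejected, which is exactly what multiple handles per owner require.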
+
+    fn unlink(
+        self: ArcBorrow<'_, Self>,
+    ) {
+        let mut node_guard = self.node.inner.lock();
+        let (handles,) = node_guard.as_mut().unfold_mut();
+
+        handles.try_unlink(util::arc_borrow_pin(self).as_ref());
+    }
+
+    fn begin(self: ArcBorrow<'_, Self>) {
+        // This is called when the handle owner is ready to make use of the
+        // handle. Simply link it into its node, to ensure it will take part
+        // in the notification system.
+        self.link();
+    }
+
+    fn end(&self) {
+        // This is called when the handle owner released the handle and no
+        // longer manages it. We simply verify that it is unlinked, since
+        // a proper handle-release will have done that.
+        kernel::warn_on!(self.node_rb.is_linked());
+    }
+}
+
+impl Message {
+    fn with(
+        via: Arc<Handle>,
+        transfers: KBox<[*mut capi::b1_handle]>,
+        shared: Arc<MessageShared>,
+    ) -> Result<Arc<Self>, AllocError> {
+        Arc::new(
+            Self {
+                via,
+                transfers,
+                shared,
+                op_rb: rb::Node::new(),
+                tx_node: TxNode::new(TxNodeKind::User),
+            },
+            GFP_KERNEL,
+        )
+    }
+}
+
+impl Drop for Message {
+    fn drop(&mut self) {
+        // SAFETY: Each pointer in `transfers` was acquired via
+        // `Handle::into_raw()` and is reclaimed exactly once here.
+        for t in &*self.transfers {
+            let _ = unsafe { Handle::from_raw(*t) };
+        }
+    }
+}
+
+impl MessageShared {
+    fn with(
+        data: KVBox<[u8]>,
+    ) -> Result<Arc<Self>, AllocError> {
+        Arc::new(
+            Self {
+                data,
+            },
+            GFP_KERNEL,
+        )
+    }
+
+    /// Turn the reference into a raw pointer.
+    ///
+    /// This will leak the reference and any pinned resources, unless the
+    /// original object is recreated via `Self::from_raw()`.
+    fn into_raw(this: Arc<Self>) -> *mut capi::b1_message_shared {
+        Arc::into_raw(this).cast_mut().cast()
+    }
+
+    /// Recreate the reference from its raw pointer.
+    ///
+    /// ## Safety
+    ///
+    /// The caller must guarantee this pointer was acquired via
+    /// `Self::into_raw()`, and they must refrain from using the pointer any
+    /// further.
+    unsafe fn from_raw(this: *mut capi::b1_message_shared) -> Arc<Self> {
+        // SAFETY: Caller guarantees `this` is from `Self::into_raw()`.
+        unsafe { Arc::from_raw(this.cast::<Self>()) }
+    }
+
+    /// Borrow a raw reference.
+    ///
+    /// ## Safety
+    ///
+    /// The caller must guarantee this pointer was acquired via
+    /// `Self::into_raw()`, and they must refrain from releasing it via
+    /// `Self::from_raw()` for `'a`.
+    unsafe fn borrow_raw<'a>(
+        this: *mut capi::b1_message_shared,
+    ) -> ArcBorrow<'a, Self> {
+        // SAFETY: Caller guarantees `this` is from `Self::into_raw()` and
+        // will not be released for `'a`.
+        unsafe { ArcBorrow::from_raw(this.cast::<Self>()) }
+    }
+}
+
+impl Op {
+    fn with(operator: Arc<Peer>, tx: Arc<Tx>) -> Self {
+        Self {
+            operator,
+            tx,
+            messages: rb::Tree::new(),
+            nodes: rb::Tree::new(),
+            handles: rb::Tree::new(),
+        }
+    }
+
+    fn new(
+        operator: Arc<Peer>,
+    ) -> Result<Self, AllocError> {
+        Ok(Self::with(operator, Tx::new()?))
+    }
+
+    fn unfold_mut(
+        self: Pin<&mut Self>,
+    ) -> (
+        ArcBorrow<'_, Peer>,
+        ArcBorrow<'_, Tx>,
+        Pin<&mut rb::Tree<rb::node_of!(Arc<Message>, op_rb)>>,
+        Pin<&mut rb::Tree<rb::node_of!(Arc<Node>, op_rb)>>,
+        Pin<&mut rb::Tree<rb::node_of!(Arc<Handle>, op_rb)>>,
+    ) {
+        // SAFETY: The trees are structurally pinned.
+        unsafe {
+            let inner = Pin::into_inner_unchecked(self);
+            (
+                inner.operator.as_arc_borrow(),
+                inner.tx.as_arc_borrow(),
+                Pin::new_unchecked(&mut inner.messages),
+                Pin::new_unchecked(&mut inner.nodes),
+                Pin::new_unchecked(&mut inner.handles),
+            )
+        }
+    }
+
+    fn send_message(
+        self: Pin<&mut Self>,
+        to: ArcBorrow<'_, Handle>,
+        transfers: KBox<[*mut capi::b1_handle]>,
+        shared: Arc<MessageShared>,
+    ) -> Result<(), MessageError> {
+        if kernel::warn_on!(!self.operator.owns_handle(&to)) {
+            return Err(MessageError::HandleForeign);
+        }
+
+        let msg = Message::with(to.into(), transfers, shared)?;
+
+        let (_, _, messages, _, _) = self.unfold_mut();
+        let r = messages.try_link_last_by(
+            util::arc_pin(msg),
+            |v, other| {
+                util::ptr_cmp(&*v.via.node.owner, &*other.via.node.owner)
+            },
+        );
+        kernel::warn_on!(r.is_err());
+
+        Ok(())
+    }
+
+    fn release_node(
+        self: Pin<&mut Self>,
+        node: ArcBorrow<'_, Node>,
+    ) {
+        if kernel::warn_on!(!self.operator.owns_node(&node)) {
+            return;
+        }
+
+        let (_, _, _, nodes, _) = self.unfold_mut();
+        if !nodes.contains(&*node) {
+            let r = nodes.try_link_by(
+                util::arc_pin(node.into()),
+                |v, other| util::ptr_cmp(&**v, &*other),
+            );
+            kernel::warn_on!(r.is_err());
+        }
+    }
+
+    fn release_handle(
+        self: Pin<&mut Self>,
+        handle: ArcBorrow<'_, Handle>,
+    ) {
+        if kernel::warn_on!(!self.operator.owns_handle(&handle)) {
+            return;
+        }
+
+        let (_, _, _, _, handles) = self.unfold_mut();
+        if !handles.contains(&*handle) {
+            let r = handles.try_link_last_by(
+                util::arc_pin(handle.into()),
+                |v, other| {
+                    util::ptr_cmp(&*v.node.owner, &*other.node.owner)
+                },
+            );
+            kernel::warn_on!(r.is_err());
+        }
+    }
+
+    fn commit(self: Pin<&mut Self>) {
+        let (
+            _,
+            tx,
+            mut messages,
+            mut nodes,
+            mut handles,
+        ) = self.unfold_mut();
+
+        // Step #1
+        //
+        // Attach all nodes to the transaction and queue them on the respective
+        // receiving peer.
+
+        let mut c = messages.as_mut().cursor_mut_first();
+        while let Some(m) = c.get_clone() {
+            m.tx_node.set_tx(tx.into());
+
+            let to = m.via.node.owner.as_arc_borrow();
+            let txnode = TxNodeRef::new_user(util::arc_unpin(m.clone()));
+            if let Err(_) = to.queue.try_link_front(txnode) {
+                kernel::warn_on!(true);
+                c.try_unlink_and_move_next();
+                continue;
+            }
+
+            c.move_next();
+        }
+
+        let mut c = nodes.as_mut().cursor_mut_first();
+        while let Some(n) = c.get_clone() {
+            let mut node_guard = n.inner.lock();
+            let (handles,) = node_guard.as_mut().unfold_mut();
+
+            let mut c_inner = handles.cursor_mut_first();
+            while let Some(h) = c_inner.get_clone() {
+                h.release_node.set_tx(tx.into());
+
+                let to = h.owner.as_arc_borrow();
+                let txnode = TxNodeRef::new_release_node(util::arc_unpin(h.clone()));
+                if let Err(_) = to.queue.try_link_front(txnode) {
+                    kernel::warn_on!(true);
+                }
+
+                c_inner.move_next();
+            }
+
+            drop(node_guard);
+            c.move_next();
+        }
+
+        let mut c = handles.as_mut().cursor_mut_first();
+        while let Some(h) = c.get_clone().map(util::arc_unpin) {
+            h.release_handle.set_tx(tx.into());
+
+            let to = h.node.owner.as_arc_borrow();
+            let txnode = TxNodeRef::new_release_handle(h.clone());
+            if let Err(_) = to.queue.try_link_front(txnode) {
+                kernel::warn_on!(true);
+                c.try_unlink_and_move_next();
+                continue;
+            }
+
+            h.as_arc_borrow().unlink();
+            c.move_next();
+        }
+
+        // Step #2
+        //
+        // Mark the transaction as committed. From then on, peers might start
+        // dequeueing, but no other transaction can jump this one, anymore. The
+        // order is then settled.
+
+        tx.committed.store(true, atomic::Relaxed);
+
+        // Step #3
+        //
+        // With everything queued and committed, we iterate all nodes again and
+        // wake the remote peers.
+
+        messages.clear_with(|m| {
+            for t in &*m.transfers {
+                // SAFETY: `transfers` holds pointers acquired via
+                // `Handle::into_raw()`, kept alive by the message.
+                let b = unsafe { Handle::borrow_raw(*t) };
+                b.link();
+            }
+            m.via.node.owner.queue_committed.add(1, atomic::Relaxed);
+            m.via.node.owner.wake();
+        });
+
+        nodes.clear_with(|n| {
+            let mut node_guard = n.inner.lock();
+            let (handles,) = node_guard.as_mut().unfold_mut();
+
+            handles.clear_with(|h| {
+                h.owner.queue_committed.add(1, atomic::Relaxed);
+                h.owner.wake();
+            });
+        });
+
+        handles.clear_with(|h| {
+            h.node.owner.queue_committed.add(1, atomic::Relaxed);
+            h.node.owner.wake();
+        });
+    }
+}
+
+impl Tx {
+    fn new() -> Result<Arc<Self>, AllocError> {
+        Arc::new(
+            Self {
+                committed: atomic::Atomic::new(false),
+            },
+            GFP_KERNEL,
+        )
+    }
+}
+
+impl TxNode {
+    fn new(kind: TxNodeKind) -> Self {
+        Self {
+            kind,
+            peer_link: lll::Node::new(),
+            tx: atomic::Atomic::new(0),
+        }
+    }
+
+    fn tx(&self) -> Option<ArcBorrow<'_, Tx>> {
+        // Paired with `Self::set_tx()`. Ensures that previous writes to Tx
+        // are visible when loading it.
+        let tx_addr = self.tx.load(atomic::Acquire);
+        let tx_ptr = tx_addr as *mut Tx;
+        let Some(tx_nn) = NonNull::new(tx_ptr) else {
+            return None;
+        };
+        // SAFETY: If `self.tx` is non-NULL, it is a valid Arc and does not
+        // change. We borrow `self` for as long as needed, so it cannot vanish
+        // while we hand out the ArcBorrow.
+        Some(unsafe { ArcBorrow::from_raw(tx_nn.as_ptr()) })
+    }
+
+    fn set_tx(&self, tx: Arc<Tx>) {
+        let tx_addr = Arc::as_ptr(&tx) as usize;
+        // Paired with `Self::tx()`. Ensures previous writes to Tx are
+        // visible to anyone fetching the Tx.
+        if self.tx.cmpxchg(0, tx_addr, atomic::Release).is_ok() {
+            let _ = Arc::into_raw(tx);
+        } else {
+            kernel::warn_on!(true);
+        }
+    }
+
+    fn is_committed(&self) -> bool {
+        match self.tx() {
+            None => false,
+            Some(v) => v.committed.load(atomic::Relaxed),
+        }
+    }
+}
+
+impl core::ops::Drop for TxNode {
+    fn drop(&mut self) {
+        let tx_addr = self.tx.load(atomic::Acquire);
+        let tx_ptr = tx_addr as *mut Tx;
+        if let Some(tx_nn) = NonNull::new(tx_ptr) {
+            // SAFETY: `self.tx` is either NULL or from `Arc::into_raw()`.
+            let _ = unsafe { Arc::from_raw(tx_nn.as_ptr()) };
+        }
+    }
+}
+
+impl TxNodeRef {
+    fn new_user(v: Arc<Message>) -> Pin<Self> {
+        // SAFETY: `Arc` is always pinned.
+        unsafe { core::mem::transmute::<Self, Pin<Self>>(Self::User(v)) }
+    }
+
+    fn new_release_node(v: Arc<Handle>) -> Pin<Self> {
+        // SAFETY: `Arc` is always pinned.
+        unsafe { core::mem::transmute::<Self, Pin<Self>>(Self::ReleaseNode(v)) }
+    }
+
+    fn new_release_handle(v: Arc<Handle>) -> Pin<Self> {
+        // SAFETY: `Arc` is always pinned.
+        unsafe { core::mem::transmute::<Self, Pin<Self>>(Self::ReleaseHandle(v)) }
+    }
+
+    fn as_user(this: &Pin<Self>) -> Option<&Message> {
+        let inner = unsafe { core::mem::transmute::<&Pin<Self>, &Self>(this) };
+        if let Self::User(v) = inner {
+            Some(&v)
+        } else {
+            None
+        }
+    }
+
+    fn as_release_node(this: &Pin<Self>) -> Option<&Arc<Handle>> {
+        let inner = unsafe { core::mem::transmute::<&Pin<Self>, &Self>(this) };
+        if let Self::ReleaseNode(v) = inner {
+            Some(&v)
+        } else {
+            None
+        }
+    }
+
+    fn as_release_handle(this: &Pin<Self>) -> Option<&Arc<Handle>> {
+        let inner = unsafe { core::mem::transmute::<&Pin<Self>, &Self>(this) };
+        if let Self::ReleaseHandle(v) = inner {
+            Some(&v)
+        } else {
+            None
+        }
+    }
+
+    fn tx_node(this: &Pin<Self>) -> &TxNode {
+        let inner = unsafe { core::mem::transmute::<&Pin<Self>, &Self>(this) };
+        match inner {
+            Self::User(v) => &v.tx_node,
+            Self::ReleaseNode(v) => &v.release_node,
+            Self::ReleaseHandle(v) => &v.release_handle,
+        }
+    }
+}
+
+// The incoming queue on `Peer.queue` can take multiple different types as
+// nodes. They all use `TxNode` as metadata, but this is embedded in different
+// containing types. Thus, the standard `util::intrusive::Link` implementation
+// is not applicable. We define our own here, using the reference type
+// `TxNodeRef`.
+//
+// A `TxNodeRef` is an enum containing an `Arc<T>` to the respective containing
+// type of the underlying `TxNode`. When acquiring a reference, we validate
+// that the enum matches `TxNode.kind`. Given that the kind is a static field,
+// it cannot change. Therefore, when releasing (or borrowing) a node, we can
+// rely on `TxNode.kind` to know which containing type to convert back to.
+//
+// SAFETY: Upholds method guarantees.
+unsafe impl util::intrusive::Link<lll::Node> for TxNodeRef {
+    type Ref = Self;
+    type Target = TxNode;
+
+    fn acquire(v: Pin<Self::Ref>) -> NonNull<lll::Node> {
+        // SAFETY: `Pin` guarantees layout stability. We do not move out of `v`
+        // when accessing the inner pointer.
+        let v_inner = unsafe { core::mem::transmute::<Pin<Self::Ref>, Self::Ref>(v) };
+
+        match v_inner {
+            Self::User(v) => {
+                kernel::warn_on!(v.tx_node.kind != TxNodeKind::User);
+                let v_msg = Arc::into_raw(v);
+                // SAFETY: `Arc::into_raw()` guarantees that the allocation is
+                // valid and we can perform a field projection.
+                unsafe {
+                    NonNull::new_unchecked(
+                        (&raw const (*v_msg).tx_node.peer_link).cast_mut(),
+                    )
+                }
+            },
+            Self::ReleaseNode(v) => {
+                kernel::warn_on!(v.release_node.kind != TxNodeKind::ReleaseNode);
+                let v_handle = Arc::into_raw(v);
+                // SAFETY: `Arc::into_raw()` guarantees that the allocation is
+                // valid and we can perform a field projection.
+                unsafe {
+                    NonNull::new_unchecked(
+                        (&raw const (*v_handle).release_node.peer_link).cast_mut(),
+                    )
+                }
+            },
+            Self::ReleaseHandle(v) => {
+                kernel::warn_on!(v.release_handle.kind != TxNodeKind::ReleaseHandle);
+                let v_handle = Arc::into_raw(v);
+                // SAFETY: `Arc::into_raw()` guarantees that the allocation is
+                // valid and we can perform a field projection.
+                unsafe {
+                    NonNull::new_unchecked(
+                        (&raw const (*v_handle).release_handle.peer_link).cast_mut(),
+                    )
+                }
+            },
+        }
+    }
+
+    unsafe fn release(v: NonNull<lll::Node>) -> Pin<Self::Ref> {
+        // SAFETY: Caller guarantees `v` is from `acquire()`, thus must be
+        // embedded in a `TxNode` and convertible to a reference.
+        let v_txnode = unsafe {
+            field::base_of_nn::<field::field_of!(TxNode, peer_link)>(v)
+        };
+        // SAFETY: Caller guarantees `v` is from `acquire()` and thus
+        // convertible to a reference.
+        let kind = unsafe { v_txnode.as_ref().kind };
+
+        let r = match kind {
+            TxNodeKind::User => {
+                // SAFETY: Caller guarantees `v` is from `acquire()`, and thus
+                // `kind` has been verified and we can rely on it here.
+                let v_msg = unsafe {
+                    field::base_of_nn::<field::field_of!(Message, tx_node)>(v_txnode)
+                };
+                // SAFETY: Caller guarantees `v` is from `acquire()`, thus
+                // ultimately from `Arc::into_raw()`.
+                unsafe { Self::User(Arc::from_raw(v_msg.as_ptr())) }
+            },
+            TxNodeKind::ReleaseNode => {
+                // SAFETY: Caller guarantees `v` is from `acquire()`, and thus
+                // `kind` has been verified and we can rely on it here.
+                let v_handle = unsafe {
+                    field::base_of_nn::<field::field_of!(Handle, release_node)>(v_txnode)
+                };
+                // SAFETY: Caller guarantees `v` is from `acquire()`, thus
+                // ultimately from `Arc::into_raw()`.
+                unsafe { Self::ReleaseNode(Arc::from_raw(v_handle.as_ptr())) }
+            },
+            TxNodeKind::ReleaseHandle => {
+                // SAFETY: Caller guarantees `v` is from `acquire()`, and thus
+                // `kind` has been verified and we can rely on it here.
+                let v_handle = unsafe {
+                    field::base_of_nn::<field::field_of!(Handle, release_handle)>(v_txnode)
+                };
+                // SAFETY: Caller guarantees `v` is from `acquire()`, thus
+                // ultimately from `Arc::into_raw()`.
+                unsafe { Self::ReleaseHandle(Arc::from_raw(v_handle.as_ptr())) }
+            },
+        };
+
+        // SAFETY: `Pin` guarantees layout stability. Since `v` was from
+        // `acquire()`, it is pinned.
+        unsafe { core::mem::transmute::<Self::Ref, Pin<Self::Ref>>(r) }
+    }
+
+    fn project(v: &Self::Target) -> NonNull<lll::Node> {
+        NonNull::from_ref(&v.peer_link)
+    }
+
+    unsafe fn borrow<'a>(v: NonNull<lll::Node>) -> Pin<&'a Self::Target> {
+        // SAFETY: Caller guarantees `v` is from `acquire()`, thus must be
+        // embedded in a `TxNode`, pinned, and convertible to a reference.
+        unsafe {
+            Pin::new_unchecked(
+                field::base_of_nn::<
+                    field::field_of!(TxNode, peer_link),
+                >(v).as_ref(),
+            )
+        }
+    }
+}
+
+#[export_name = "b1_peer_new"]
+unsafe extern "C" fn peer_new(
+    actor: *mut capi::b1_acct_actor,
+    waitq: *mut kernel::bindings::wait_queue_head,
+) -> *mut capi::b1_peer {
+    // SAFETY: Caller guarantees `actor` is valid.
+    let actor = unsafe { acct::Actor::borrow_raw(actor) };
+    match Peer::new(actor.into(), waitq) {
+        Ok(v) => Peer::into_raw(v),
+        Err(AllocError) => ENOMEM.to_ptr(),
+    }
+}
+
+#[export_name = "b1_peer_ref"]
+unsafe extern "C" fn peer_ref(
+    this: *mut capi::b1_peer,
+) -> *mut capi::b1_peer {
+    if let Some(this_nn) = core::ptr::NonNull::new(this) {
+        // SAFETY: Caller guarantees `this` is valid.
+        let this_b = unsafe { Peer::borrow_raw(this_nn.as_ptr()) };
+        core::mem::forget(Into::<Arc<Peer>>::into(this_b));
+    }
+    this
+}
+
+#[export_name = "b1_peer_unref"]
+unsafe extern "C" fn peer_unref(
+    this: *mut capi::b1_peer,
+) -> *mut capi::b1_peer {
+    if let Some(this_nn) = core::ptr::NonNull::new(this) {
+        // SAFETY: Caller guarantees `this` is valid and no longer used.
+        let _ = unsafe { Peer::from_raw(this_nn.as_ptr()) };
+    }
+    core::ptr::null_mut()
+}
+
+#[export_name = "b1_peer_begin"]
+unsafe extern "C" fn peer_begin(
+    this: *mut capi::b1_peer,
+) {
+    // SAFETY: Caller guarantees `this` is valid.
+    let this_b = unsafe { Peer::borrow_raw(this) };
+    this_b.begin();
+}
+
+#[export_name = "b1_peer_end"]
+unsafe extern "C" fn peer_end(
+    this: *mut capi::b1_peer,
+) {
+    // SAFETY: Caller guarantees `this` is valid.
+    let this_b = unsafe { Peer::borrow_raw(this) };
+    this_b.end();
+}
+
+#[export_name = "b1_peer_new_node"]
+unsafe extern "C" fn peer_new_node(
+    this: *mut capi::b1_peer,
+    other: *mut capi::b1_peer,
+    handlep: *mut *mut capi::b1_handle,
+) -> *mut capi::b1_node {
+    // SAFETY: Caller guarantees `this` is valid.
+    let this_b = unsafe { Peer::borrow_raw(this) };
+    // SAFETY: Caller guarantees `other` is valid.
+    let other_b = unsafe { Peer::borrow_raw(other) };
+    // SAFETY: Caller guarantees `handlep` is valid.
+    let handlep_b = unsafe { &mut *handlep };
+
+    match this_b.create_node(other_b) {
+        Ok((n, h)) => {
+            *handlep_b = Handle::into_raw(h);
+            Node::into_raw(n)
+        },
+        Err(AllocError) => ENOMEM.to_ptr(),
+    }
+}
+
+#[export_name = "b1_peer_new_handle"]
+unsafe extern "C" fn peer_new_handle(
+    this: *mut capi::b1_peer,
+    from: *mut capi::b1_handle,
+) -> *mut capi::b1_handle {
+    // SAFETY: Caller guarantees `this` is valid.
+    let this_b = unsafe { Peer::borrow_raw(this) };
+    // SAFETY: Caller guarantees `from` is valid.
+    let from_b = unsafe { Handle::borrow_raw(from) };
+
+    match this_b.create_handle(from_b) {
+        Ok(v) => Handle::into_raw(v),
+        Err(AllocError) => ENOMEM.to_ptr(),
+    }
+}
+
+#[export_name = "b1_peer_readable"]
+unsafe extern "C" fn peer_readable(
+    this: *mut capi::b1_peer,
+) -> bool {
+    // SAFETY: Caller guarantees `this` is valid.
+    let this_b = unsafe { Peer::borrow_raw(this) };
+    this_b.readable()
+}
+
+#[export_name = "b1_peer_peek"]
+unsafe extern "C" fn peer_peek(
+    this: *mut capi::b1_peer,
+    peek: *mut capi::b1_peer_peek,
+) -> bool {
+    // SAFETY: Caller guarantees `this` is valid.
+    let this_b = unsafe { Peer::borrow_raw(this) };
+    // SAFETY: Caller guarantees `peek` is valid.
+    let peek_b = unsafe { &mut *peek };
+    this_b.peek(peek_b)
+}
+
+#[export_name = "b1_peer_pop"]
+unsafe extern "C" fn peer_pop(
+    this: *mut capi::b1_peer,
+) {
+    // SAFETY: Caller guarantees `this` is valid.
+    let this_b = unsafe { Peer::borrow_raw(this) };
+    this_b.pop();
+}
+
+#[export_name = "b1_node_ref"]
+unsafe extern "C" fn node_ref(
+    this: *mut capi::b1_node,
+) -> *mut capi::b1_node {
+    if let Some(this_nn) = core::ptr::NonNull::new(this) {
+        // SAFETY: Caller guarantees `this` is valid.
+        let this_b = unsafe { Node::borrow_raw(this_nn.as_ptr()) };
+        core::mem::forget(Into::<Arc<Node>>::into(this_b));
+    }
+    this
+}
+
+#[export_name = "b1_node_unref"]
+unsafe extern "C" fn node_unref(
+    this: *mut capi::b1_node,
+) -> *mut capi::b1_node {
+    if let Some(this_nn) = core::ptr::NonNull::new(this) {
+        // SAFETY: Caller guarantees `this` is valid and no longer used.
+        let _ = unsafe { Node::from_raw(this_nn.as_ptr()) };
+    }
+    core::ptr::null_mut()
+}
+
+#[export_name = "b1_node_get_userdata"]
+unsafe extern "C" fn node_get_userdata(
+    this: *mut capi::b1_node,
+) -> *mut kernel::ffi::c_void {
+    // SAFETY: Caller guarantees `this` is valid.
+    let this_b = unsafe { Node::borrow_raw(this) };
+    this_b.userdata.load(atomic::Relaxed) as *mut kernel::ffi::c_void
+}
+
+#[export_name = "b1_node_set_userdata"]
+unsafe extern "C" fn node_set_userdata(
+    this: *mut capi::b1_node,
+    userdata: *mut kernel::ffi::c_void,
+) {
+    // SAFETY: Caller guarantees `this` is valid.
+    let this_b = unsafe { Node::borrow_raw(this) };
+    this_b.userdata.store(userdata as usize, atomic::Relaxed);
+}
+
+#[export_name = "b1_node_begin"]
+unsafe extern "C" fn node_begin(
+    this: *mut capi::b1_node,
+) {
+    // SAFETY: Caller guarantees `this` is valid.
+    let this_b = unsafe { Node::borrow_raw(this) };
+    this_b.begin();
+}
+
+#[export_name = "b1_node_end"]
+unsafe extern "C" fn node_end(
+    this: *mut capi::b1_node,
+) {
+    // SAFETY: Caller guarantees `this` is valid.
+    let this_b = unsafe { Node::borrow_raw(this) };
+    this_b.end();
+}
+
+#[export_name = "b1_handle_ref"]
+unsafe extern "C" fn handle_ref(
+    this: *mut capi::b1_handle,
+) -> *mut capi::b1_handle {
+    if let Some(this_nn) = core::ptr::NonNull::new(this) {
+        // SAFETY: Caller guarantees `this` is valid.
+        let this_b = unsafe { Handle::borrow_raw(this_nn.as_ptr()) };
+        core::mem::forget(Into::<Arc<Handle>>::into(this_b));
+    }
+    this
+}
+
+#[export_name = "b1_handle_unref"]
+unsafe extern "C" fn handle_unref(
+    this: *mut capi::b1_handle,
+) -> *mut capi::b1_handle {
+    if let Some(this_nn) = core::ptr::NonNull::new(this) {
+        // SAFETY: Caller guarantees `this` is valid and no longer used.
+        let _ = unsafe { Handle::from_raw(this_nn.as_ptr()) };
+    }
+    core::ptr::null_mut()
+}
+
+#[export_name = "b1_handle_get_userdata"]
+unsafe extern "C" fn handle_get_userdata(
+    this: *mut capi::b1_handle,
+) -> *mut kernel::ffi::c_void {
+    // SAFETY: Caller guarantees `this` is valid.
+    let this_b = unsafe { Handle::borrow_raw(this) };
+    this_b.userdata.load(atomic::Relaxed) as *mut kernel::ffi::c_void
+}
+
+#[export_name = "b1_handle_set_userdata"]
+unsafe extern "C" fn handle_set_userdata(
+    this: *mut capi::b1_handle,
+    userdata: *mut kernel::ffi::c_void,
+) {
+    // SAFETY: Caller guarantees `this` is valid.
+    let this_b = unsafe { Handle::borrow_raw(this) };
+    this_b.userdata.store(userdata as usize, atomic::Relaxed);
+}
+
+#[export_name = "b1_handle_begin"]
+unsafe extern "C" fn handle_begin(
+    this: *mut capi::b1_handle,
+) {
+    // SAFETY: Caller guarantees `this` is valid.
+    let this_b = unsafe { Handle::borrow_raw(this) };
+    this_b.begin();
+}
+
+#[export_name = "b1_handle_end"]
+unsafe extern "C" fn handle_end(
+    this: *mut capi::b1_handle,
+) {
+    // SAFETY: Caller guarantees `this` is valid.
+    let this_b = unsafe { Handle::borrow_raw(this) };
+    this_b.end();
+}
+
+#[export_name = "b1_message_shared_new"]
+unsafe extern "C" fn message_shared_new(
+    n_data: u64,
+    data: *mut kernel::ffi::c_void,
+) -> *mut capi::b1_message_shared {
+    let data_v = core::ptr::slice_from_raw_parts_mut::<u8>(
+        data.cast(),
+        n_data as usize,
+    );
+
+    // SAFETY: Caller guarantees `data` is an owned KVBox of length `n_data`.
+    let data_kv = unsafe { KVBox::from_raw(data_v) };
+
+    match MessageShared::with(data_kv) {
+        Ok(v) => MessageShared::into_raw(v),
+        Err(AllocError) => ENOMEM.to_ptr(),
+    }
+}
+
+#[export_name = "b1_message_shared_ref"]
+unsafe extern "C" fn message_shared_ref(
+    this: *mut capi::b1_message_shared,
+) -> *mut capi::b1_message_shared {
+    if let Some(this_nn) = core::ptr::NonNull::new(this) {
+        // SAFETY: Caller guarantees `this` is valid.
+        let this_b = unsafe { MessageShared::borrow_raw(this_nn.as_ptr()) };
+        core::mem::forget(Into::<Arc<MessageShared>>::into(this_b));
+    }
+    this
+}
+
+#[export_name = "b1_message_shared_unref"]
+unsafe extern "C" fn message_shared_unref(
+    this: *mut capi::b1_message_shared,
+) -> *mut capi::b1_message_shared {
+    if let Some(this_nn) = core::ptr::NonNull::new(this) {
+        // SAFETY: Caller guarantees `this` is valid and no longer used.
+        let _ = unsafe { MessageShared::from_raw(this_nn.as_ptr()) };
+    }
+    core::ptr::null_mut()
+}
+
+#[export_name = "b1_op_new"]
+unsafe extern "C" fn op_new(
+    peer: *mut capi::b1_peer,
+) -> *mut capi::b1_op {
+    // SAFETY: Caller guarantees `peer` is valid.
+    let peer_b = unsafe { Peer::borrow_raw(peer) };
+
+    let op = match Op::new(peer_b.into()) {
+        Ok(v) => v,
+        Err(AllocError) => return ENOMEM.to_ptr(),
+    };
+
+    let op_box = match KBox::pin(op, GFP_KERNEL) {
+        Ok(v) => v,
+        Err(AllocError) => return ENOMEM.to_ptr(),
+    };
+
+    // SAFETY: `capi::b1_op` is treated as pinned.
+    let op_nopin = unsafe { Pin::into_inner_unchecked(op_box) };
+    KBox::into_raw(op_nopin).cast()
+}
+
+#[export_name = "b1_op_free"]
+unsafe extern "C" fn op_free(
+    this: *mut capi::b1_op,
+) -> *mut capi::b1_op {
+    if let Some(this_nn) = core::ptr::NonNull::new(this) {
+        // SAFETY: Caller guarantees `this` is valid, pinned and no longer used.
+        let _ = unsafe {
+            Pin::new_unchecked(
+                KBox::from_raw(this_nn.as_ptr().cast::<Op>())
+            )
+        };
+    }
+    core::ptr::null_mut()
+}
+
+#[export_name = "b1_op_send_message"]
+unsafe extern "C" fn op_send_message(
+    this: *mut capi::b1_op,
+    to: *mut capi::b1_handle,
+    n_transfers: u64,
+    transfers: *mut *mut capi::b1_handle,
+    shared: *mut capi::b1_message_shared,
+) -> kernel::ffi::c_int {
+    // SAFETY: Caller guarantees `this` is valid and pinned.
+    let this_b = unsafe { Pin::new_unchecked(&mut *this.cast::<Op>()) };
+    // SAFETY: Caller guarantees `to` is valid.
+    let to_b = unsafe { Handle::borrow_raw(to) };
+    // Caller guarantees `n_transfers` represents an allocation.
+    let n_transfers_sz = n_transfers as usize;
+    let transfers_p = core::ptr::slice_from_raw_parts_mut::<*mut capi::b1_handle>(
+        transfers.cast(),
+        n_transfers_sz,
+    );
+    // SAFETY: Caller guarantees `transfers` is a valid slice with `n_transfers`
+    // handle references.
+    let transfers_v = unsafe { &*transfers_p };
+    // SAFETY: Caller guarantees `shared` is valid.
+    let shared_b = unsafe { MessageShared::borrow_raw(shared) };
+
+    let Ok(mut xfers_vec) = KVec::with_capacity(n_transfers_sz, GFP_KERNEL) else {
+        return ENOMEM.to_errno();
+    };
+
+    for t in transfers_v {
+        // SAFETY: Caller guarantees `transfers` has valid handles.
+        let handle = unsafe { Handle::borrow_raw(*t) };
+        if kernel::warn_on!(!this_b.operator.owns_handle(&handle)) {
+            return ENOTRECOVERABLE.to_errno();
+        }
+
+        let Ok(new) = Handle::new(handle.node.clone(), to_b.node.owner.clone()) else {
+            return ENOMEM.to_errno();
+        };
+
+        let Ok(_) = xfers_vec.push(Handle::into_raw(new), GFP_KERNEL) else {
+            return ENOMEM.to_errno();
+        };
+    }
+
+    let Ok(xfers_box) = xfers_vec.into_boxed_slice() else {
+        return ENOMEM.to_errno();
+    };
+
+    match this_b.send_message(to_b, xfers_box, shared_b.into()) {
+        Ok(()) => 0,
+        Err(MessageError::Alloc(AllocError)) => ENOMEM.to_errno(),
+        Err(MessageError::HandleForeign) => ENOTRECOVERABLE.to_errno(),
+    }
+}
+
+#[export_name = "b1_op_release_node"]
+unsafe extern "C" fn op_release_node(
+    this: *mut capi::b1_op,
+    node: *mut capi::b1_node,
+) {
+    // SAFETY: Caller guarantees `this` is valid and pinned.
+    let this_b = unsafe { Pin::new_unchecked(&mut *this.cast::<Op>()) };
+    // SAFETY: Caller guarantees `node` is valid.
+    let node_b = unsafe { Node::borrow_raw(node) };
+
+    this_b.release_node(node_b);
+}
+
+#[export_name = "b1_op_release_handle"]
+unsafe extern "C" fn op_release_handle(
+    this: *mut capi::b1_op,
+    handle: *mut capi::b1_handle,
+) {
+    // SAFETY: Caller guarantees `this` is valid and pinned.
+    let this_b = unsafe { Pin::new_unchecked(&mut *this.cast::<Op>()) };
+    // SAFETY: Caller guarantees `handle` is valid.
+    let handle_b = unsafe { Handle::borrow_raw(handle) };
+
+    this_b.release_handle(handle_b);
+}
+
+#[export_name = "b1_op_commit"]
+unsafe extern "C" fn op_commit(
+    this: *mut capi::b1_op,
+) {
+    // SAFETY: Caller guarantees `this` is valid, pinned and no longer used.
+    let mut this_o = unsafe {
+        Pin::new_unchecked(KBox::from_raw(this.cast::<Op>()))
+    };
+    this_o.as_mut().commit();
+}
diff --git a/ipc/bus1/lib.h b/ipc/bus1/lib.h
index 808f5da94919..942a3397383f 100644
--- a/ipc/bus1/lib.h
+++ b/ipc/bus1/lib.h
@@ -23,6 +23,17 @@ struct b1_acct_actor;
 struct b1_acct_charge;
 struct b1_acct_trace;
 struct b1_acct_user;
+struct b1_handle;
+struct b1_message_shared;
+struct b1_node;
+struct b1_op;
+struct b1_peer;
+struct b1_peer_peek;
+struct b1_peer_peek_handle_release;
+struct b1_peer_peek_node_release;
+union  b1_peer_peek_union;
+struct b1_peer_peek_user;
+struct wait_queue_head;
 
 /* accounting */
 
@@ -81,4 +92,111 @@ DEFINE_FREE(
 		b1_acct_user_unref(_T);
 )
 
+/* peer */
+
+struct b1_peer_peek {
+	u64 type;
+	union b1_peer_peek_union {
+		struct b1_peer_peek_user {
+			struct b1_node *node;
+			u64 n_transfers;
+			struct b1_handle **transfers;
+			u64 n_data;
+			void *data;
+		} user;
+		struct b1_peer_peek_node_release {
+			struct b1_handle *handle;
+		} node_release;
+		struct b1_peer_peek_handle_release {
+			struct b1_node *node;
+		} handle_release;
+	} u;
+};
+
+struct b1_peer *b1_peer_new(struct b1_acct_actor *actor, struct wait_queue_head *waitq);
+struct b1_peer *b1_peer_ref(struct b1_peer *peer);
+struct b1_peer *b1_peer_unref(struct b1_peer *peer);
+
+void b1_peer_begin(struct b1_peer *peer);
+void b1_peer_end(struct b1_peer *peer);
+
+struct b1_node *b1_peer_new_node(struct b1_peer *peer, struct b1_peer *other, struct b1_handle **handlep);
+struct b1_handle *b1_peer_new_handle(struct b1_peer *peer, struct b1_handle *from);
+
+bool b1_peer_readable(struct b1_peer *peer);
+bool b1_peer_peek(struct b1_peer *peer, struct b1_peer_peek *peek);
+void b1_peer_pop(struct b1_peer *peer);
+
+DEFINE_FREE(
+	b1_peer_unref,
+	struct b1_peer *,
+	if (!IS_ERR_OR_NULL(_T))
+		b1_peer_unref(_T);
+)
+
+/* node */
+
+struct b1_node *b1_node_ref(struct b1_node *node);
+struct b1_node *b1_node_unref(struct b1_node *node);
+
+void *b1_node_get_userdata(struct b1_node *node);
+void b1_node_set_userdata(struct b1_node *node, void *userdata);
+void b1_node_begin(struct b1_node *node);
+void b1_node_end(struct b1_node *node);
+
+DEFINE_FREE(
+	b1_node_unref,
+	struct b1_node *,
+	if (!IS_ERR_OR_NULL(_T))
+		b1_node_unref(_T);
+)
+
+/* handle */
+
+struct b1_handle *b1_handle_ref(struct b1_handle *handle);
+struct b1_handle *b1_handle_unref(struct b1_handle *handle);
+
+void *b1_handle_get_userdata(struct b1_handle *handle);
+void b1_handle_set_userdata(struct b1_handle *handle, void *userdata);
+void b1_handle_begin(struct b1_handle *handle);
+void b1_handle_end(struct b1_handle *handle);
+
+DEFINE_FREE(
+	b1_handle_unref,
+	struct b1_handle *,
+	if (!IS_ERR_OR_NULL(_T))
+		b1_handle_unref(_T);
+)
+
+/* message_shared */
+
+struct b1_message_shared *b1_message_shared_new(u64 n_data, void *data);
+struct b1_message_shared *b1_message_shared_ref(struct b1_message_shared *shared);
+struct b1_message_shared *b1_message_shared_unref(struct b1_message_shared *shared);
+
+/* op */
+
+struct b1_op *b1_op_new(struct b1_peer *peer);
+struct b1_op *b1_op_free(struct b1_op *op);
+
+int b1_op_send_message(
+	struct b1_op *op,
+	struct b1_handle *to,
+	u64 n_transfers,
+	struct b1_handle **transfers,
+	struct b1_message_shared *shared
+);
+
+void b1_op_release_node(struct b1_op *op, struct b1_node *node);
+void b1_op_release_handle(struct b1_op *op, struct b1_handle *handle);
+
+void b1_op_commit(struct b1_op *op);
+
+DEFINE_FREE(
+	b1_op_free,
+	struct b1_op *,
+	if (!IS_ERR_OR_NULL(_T))
+		b1_op_free(_T);
+)
+
 #endif /* __B1_LIB_H */
diff --git a/ipc/bus1/lib.rs b/ipc/bus1/lib.rs
index 05f21601f569..34a157fe96c4 100644
--- a/ipc/bus1/lib.rs
+++ b/ipc/bus1/lib.rs
@@ -5,6 +5,7 @@
 //! rust. Any user-space API is outside the scope of this module.
 
 pub mod acct;
+pub mod bus;
 pub mod util;
 
 #[allow(
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [RFC 16/16] bus1: implement the uapi
  2026-03-31 19:02 [RFC 00/16] bus1: Capability-based IPC for Linux David Rheinsberg
                   ` (14 preceding siblings ...)
  2026-03-31 19:03 ` [RFC 15/16] bus1: introduce peers, handles, and nodes David Rheinsberg
@ 2026-03-31 19:03 ` David Rheinsberg
  2026-03-31 19:46 ` [RFC 00/16] bus1: Capability-based IPC for Linux Miguel Ojeda
  16 siblings, 0 replies; 33+ messages in thread
From: David Rheinsberg @ 2026-03-31 19:03 UTC (permalink / raw)
  To: rust-for-linux; +Cc: teg, Miguel Ojeda, David Rheinsberg

Implement the character-device based uapi, as defined by
uapi/linux/bus1.h. A single dynamic-minor device is used for now, and
the character device serves no purpose other than exposing the ioctls.

The bulk of this is transferring data from user-space into the kernel,
verifying it, calling the bus1 C-API, and then returning data to
user-space.
Signed-off-by: David Rheinsberg <david@readahead.eu>
---
 ipc/bus1/Makefile |    1 +
 ipc/bus1/cdev.c   | 1326 +++++++++++++++++++++++++++++++++++++++++++++
 ipc/bus1/cdev.h   |   35 ++
 ipc/bus1/main.c   |   22 +
 4 files changed, 1384 insertions(+)
 create mode 100644 ipc/bus1/cdev.c
 create mode 100644 ipc/bus1/cdev.h

diff --git a/ipc/bus1/Makefile b/ipc/bus1/Makefile
index 1f2fbbe8603f..601151e03937 100644
--- a/ipc/bus1/Makefile
+++ b/ipc/bus1/Makefile
@@ -22,6 +22,7 @@ $(obj)/lib.o: $(obj)/capi.rs
 $(obj)/lib.o: export BUS1_CAPI_PATH=$(abspath $(obj)/capi.rs)
 
 bus1-y :=		\
+	cdev.o			\
 	lib.o			\
 	main.o
 
diff --git a/ipc/bus1/cdev.c b/ipc/bus1/cdev.c
new file mode 100644
index 000000000000..876bfa549420
--- /dev/null
+++ b/ipc/bus1/cdev.c
@@ -0,0 +1,1326 @@
+// SPDX-License-Identifier: GPL-2.0
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/cleanup.h>
+#include <linux/container_of.h>
+#include <linux/cred.h>
+#include <linux/err.h>
+#include <linux/file.h>
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/miscdevice.h>
+#include <linux/mutex.h>
+#include <linux/poll.h>
+#include <linux/rbtree.h>
+#include <linux/slab.h>
+#include <linux/stat.h>
+#include <linux/uidgid.h>
+#include <linux/uio.h>
+#include <linux/wait.h>
+#include <uapi/linux/bus1.h>
+#include "cdev.h"
+#include "lib.h"
+
+enum b1_uobject_type: unsigned int {
+	_B1_UOBJECT_INVALID,
+	B1_UOBJECT_NODE,
+	B1_UOBJECT_HANDLE,
+};
+
+struct b1_uobject {
+	unsigned int type;
+	u64 id;
+
+	union {
+		struct b1_node *node;
+		struct b1_handle *handle;
+	};
+
+	struct rb_node upeer_rb;
+	struct list_head op_link;
+};
+
+struct b1_upeer {
+	struct b1_peer *peer;
+	wait_queue_head_t waitq;
+
+	struct mutex lock;
+	struct rb_root objects;
+	u64 id_allocator;
+};
+
+struct b1_cdev {
+	struct b1_acct *acct;
+	struct miscdevice misc;
+};
+
+/*
+ * Verify details of the bus1-cdev-api required by this implementation:
+ * - `BUS1_INVALID` must be of type `u64`.
+ * - `BUS1_INVALID` must be marked as `managed`, to guarantee it does not
+ *   conflict with the unmanaged namespace.
+ * - `BUS1_INVALID` must match `(u64)-1` to ensure the id-allocator will never
+ *   accidentally return it.
+ * - `BUS1_MANAGED` must be `(u64)1` to ensure it can be set/cleared via LSB
+ *   and does not conflict with 2-aligned user-space pointers.
+ */
+static_assert(sizeof(BUS1_INVALID) == sizeof(u64));
+static_assert(__alignof(BUS1_INVALID) == __alignof(u64));
+static_assert(!!(BUS1_INVALID & BUS1_MANAGED));
+static_assert(BUS1_INVALID == ~(u64)0);
+static_assert(BUS1_MANAGED == (u64)1);
+
+/*
+ * Lock two mutexes of the same class with a lock order determined by their
+ * memory address. If either mutex is NULL, it is not locked. If both refer to
+ * the same mutex, only one is locked.
+ */
+static void lock2(struct mutex *a, struct mutex *b)
+{
+	if (a < b) {
+		if (a)
+			mutex_lock(a);
+		if (b && b != a)
+			mutex_lock_nested(b, !!a);
+	} else {
+		if (b)
+			mutex_lock(b);
+		if (a && a != b)
+			mutex_lock_nested(a, !!b);
+	}
+}
+
+/* Inverse operation of `lock2()`. */
+static void unlock2(struct mutex *a, struct mutex *b)
+{
+	if (a)
+		mutex_unlock(a);
+	if (b && b != a)
+		mutex_unlock(b);
+}
+
+static struct b1_uobject *b1_uobject_new(u64 id)
+{
+	struct b1_uobject *uobject;
+
+	uobject = kzalloc_obj(struct b1_uobject);
+	if (!uobject)
+		return ERR_PTR(-ENOMEM);
+
+	uobject->id = id;
+	RB_CLEAR_NODE(&uobject->upeer_rb);
+	INIT_LIST_HEAD(&uobject->op_link);
+
+	return uobject;
+}
+
+static struct b1_uobject *
+b1_uobject_free(struct b1_uobject *uobject)
+{
+	if (uobject) {
+		WARN_ON(!list_empty(&uobject->op_link));
+		WARN_ON(!RB_EMPTY_NODE(&uobject->upeer_rb));
+
+		switch (uobject->type) {
+		case B1_UOBJECT_NODE:
+			if (uobject->node) {
+				b1_node_set_userdata(uobject->node, NULL);
+				uobject->node = b1_node_unref(uobject->node);
+			}
+			break;
+		case B1_UOBJECT_HANDLE:
+			if (uobject->handle) {
+				b1_handle_set_userdata(uobject->handle, NULL);
+				uobject->handle = b1_handle_unref(uobject->handle);
+			}
+			break;
+		}
+
+		kfree(uobject);
+	}
+
+	return NULL;
+}
+
+static bool
+b1_uobject_is_linked(struct b1_uobject *uobject)
+{
+	return !RB_EMPTY_NODE(&uobject->upeer_rb);
+}
+
+DEFINE_FREE(
+	b1_uobject_free,
+	struct b1_uobject *,
+	if (!IS_ERR_OR_NULL(_T))
+		b1_uobject_free(_T);
+)
+
+static bool
+b1_uobject_rb_less(struct rb_node *a, const struct rb_node *b)
+{
+	struct b1_uobject *a_node, *b_node;
+
+	a_node = container_of(a, struct b1_uobject, upeer_rb);
+	b_node = container_of(b, struct b1_uobject, upeer_rb);
+
+	return a_node->id < b_node->id;
+}
+
+static int
+b1_uobject_rb_cmp(const void *k, const struct rb_node *n)
+{
+	struct b1_uobject *node = container_of(n, struct b1_uobject, upeer_rb);
+	const u64 *key = k;
+
+	if (*key < node->id)
+		return -1;
+	else if (*key > node->id)
+		return 1;
+	else
+		return 0;
+}
+
+static struct b1_upeer *
+b1_upeer_free(struct b1_upeer *upeer)
+{
+	if (upeer) {
+		WARN_ON(!RB_EMPTY_ROOT(&upeer->objects));
+		mutex_destroy(&upeer->lock);
+		upeer->peer = b1_peer_unref(upeer->peer);
+		kfree(upeer);
+	}
+
+	return NULL;
+}
+
+DEFINE_FREE(
+	b1_upeer_free,
+	struct b1_upeer *,
+	if (!IS_ERR_OR_NULL(_T))
+		b1_upeer_free(_T);
+)
+
+static struct b1_upeer *
+b1_upeer_new(struct b1_acct_actor *actor)
+{
+	struct b1_upeer *upeer __free(b1_upeer_free) = NULL;
+	int r;
+
+	upeer = kzalloc_obj(struct b1_upeer);
+	if (!upeer)
+		return ERR_PTR(-ENOMEM);
+
+	init_waitqueue_head(&upeer->waitq);
+	mutex_init(&upeer->lock);
+	upeer->objects = RB_ROOT;
+
+	upeer->peer = b1_peer_new(actor, &upeer->waitq);
+	if (IS_ERR(upeer->peer)) {
+		r = PTR_ERR(upeer->peer);
+		upeer->peer = NULL;
+		return ERR_PTR(r);
+	}
+
+	return no_free_ptr(upeer);
+}
+
+static u64 b1_upeer_allocate_id(struct b1_upeer *upeer)
+{
+	/*
+	 * The namespace with the LSB unset is managed by user-space. For
+	 * kernel allocated IDs ("managed IDs"), we use a simple counter, but
+	 * shift to the left and set the LSB.
+	 */
+	lockdep_assert_held(&upeer->lock);
+	return (upeer->id_allocator++ << 1) | BUS1_MANAGED;
+}
+
+static void
+b1_upeer_link(struct b1_upeer *upeer, struct b1_uobject *uobject)
+{
+	lockdep_assert_held(&upeer->lock);
+
+	if (RB_EMPTY_NODE(&uobject->upeer_rb))
+		rb_add(&uobject->upeer_rb, &upeer->objects, b1_uobject_rb_less);
+}
+
+static void
+b1_upeer_unlink(struct b1_upeer *upeer, struct b1_uobject *uobject)
+{
+	lockdep_assert_held(&upeer->lock);
+
+	if (!RB_EMPTY_NODE(&uobject->upeer_rb)) {
+		rb_erase(&uobject->upeer_rb, &upeer->objects);
+		RB_CLEAR_NODE(&uobject->upeer_rb);
+	}
+}
+
+static struct b1_uobject *
+b1_upeer_find(struct b1_upeer *upeer, u64 id)
+{
+	struct rb_node *node;
+
+	lockdep_assert_held(&upeer->lock);
+
+	node = rb_find(&id, &upeer->objects, b1_uobject_rb_cmp);
+	if (node)
+		return container_of(node, struct b1_uobject, upeer_rb);
+
+	return NULL;
+}
+
+static struct b1_uobject *
+b1_upeer_find_handle(struct b1_upeer *upeer, u64 id)
+{
+	struct b1_uobject *uobject;
+
+	uobject = b1_upeer_find(upeer, id);
+	if (uobject && uobject->type != B1_UOBJECT_HANDLE)
+		uobject = NULL;
+
+	return uobject;
+}
+
+static struct b1_uobject *b1_upeer_new_node(struct b1_upeer *upeer)
+{
+	struct b1_uobject *unode __free(b1_uobject_free) = NULL;
+
+	lockdep_assert_held(&upeer->lock);
+
+	unode = b1_uobject_new(b1_upeer_allocate_id(upeer));
+	if (IS_ERR(unode))
+		return unode;
+
+	unode->type = B1_UOBJECT_NODE;
+
+	return no_free_ptr(unode);
+}
+
+static struct b1_uobject *b1_upeer_new_handle(struct b1_upeer *upeer)
+{
+	struct b1_uobject *uhandle __free(b1_uobject_free) = NULL;
+
+	lockdep_assert_held(&upeer->lock);
+
+	uhandle = b1_uobject_new(b1_upeer_allocate_id(upeer));
+	if (IS_ERR(uhandle))
+		return uhandle;
+
+	uhandle->type = B1_UOBJECT_HANDLE;
+
+	return no_free_ptr(uhandle);
+}
+
+static int b1_cdev_open(struct inode *inode, struct file *file)
+{
+	struct b1_cdev *cdev = container_of(file->private_data,
+					    struct b1_cdev, misc);
+	struct b1_acct_actor *actor __free(b1_acct_actor_unref) = NULL;
+	struct b1_acct_user *user __free(b1_acct_user_unref) = NULL;
+	struct b1_upeer *upeer __free(b1_upeer_free) = NULL;
+
+	user = b1_acct_get_user(cdev->acct, __kuid_val(file->f_cred->euid));
+	if (IS_ERR(user))
+		return PTR_ERR(user);
+
+	actor = b1_acct_actor_new(user);
+	if (IS_ERR(actor))
+		return PTR_ERR(actor);
+
+	upeer = b1_upeer_new(actor);
+	if (IS_ERR(upeer))
+		return PTR_ERR(upeer);
+
+	b1_peer_begin(upeer->peer);
+	file->private_data = no_free_ptr(upeer);
+	return 0;
+}
+
+static int b1_cdev_release(struct inode *inode, struct file *file)
+{
+	struct b1_upeer *upeer = file->private_data;
+	struct b1_op *op __free(b1_op_free) = NULL;
+	struct b1_uobject *uobject, *usafe;
+
+	op = b1_op_new(upeer->peer);
+	if (!WARN_ON(IS_ERR(op))) {
+		rbtree_postorder_for_each_entry_safe(uobject, usafe,
+						     &upeer->objects,
+						     upeer_rb) {
+			if (uobject->type == B1_UOBJECT_NODE) {
+				b1_op_release_node(op, uobject->node);
+				b1_node_end(uobject->node);
+			}
+		}
+		b1_op_commit(no_free_ptr(op));
+	}
+
+	b1_peer_end(upeer->peer);
+
+	op = b1_op_new(upeer->peer);
+	if (WARN_ON(IS_ERR(op)))
+		op = NULL;
+	rbtree_postorder_for_each_entry_safe(uobject, usafe, &upeer->objects,
+					     upeer_rb) {
+		RB_CLEAR_NODE(&uobject->upeer_rb);
+		if (op) {
+			if (uobject->type == B1_UOBJECT_HANDLE) {
+				b1_op_release_handle(op, uobject->handle);
+				b1_handle_end(uobject->handle);
+			}
+		}
+		b1_upeer_unlink(upeer, uobject);
+		b1_uobject_free(uobject);
+	}
+	upeer->objects = RB_ROOT;
+	if (op)
+		b1_op_commit(no_free_ptr(op));
+
+	file->private_data = b1_upeer_free(upeer);
+	return 0;
+}
+
+static __poll_t
+b1_cdev_poll(struct file *file, struct poll_table_struct *wait)
+{
+	struct b1_upeer *upeer = file->private_data;
+	__poll_t mask = 0;
+
+	poll_wait(file, &upeer->waitq, wait);
+
+	mask |= EPOLLOUT | EPOLLWRNORM;
+	if (b1_peer_readable(upeer->peer))
+		mask |= EPOLLIN | EPOLLRDNORM;
+
+	return mask;
+}
+
+static int
+b1_cdev_transfer_collect(
+	struct b1_upeer *from,
+	struct b1_upeer *to,
+	struct bus1_cmd_transfer *cmd,
+	struct list_head *unodes,
+	struct list_head *uhandles
+) {
+	struct bus1_transfer __user *u_src, *u_dst;
+	u64 i;
+
+	lockdep_assert_held(&from->lock);
+	lockdep_assert_held(&to->lock);
+
+	u_src = BUS1_TO_PTR(cmd->ptr_src);
+	u_dst = BUS1_TO_PTR(cmd->ptr_dst);
+	if (unlikely(cmd->ptr_src != BUS1_FROM_PTR(u_src) ||
+		     cmd->ptr_dst != BUS1_FROM_PTR(u_dst)))
+		return -EFAULT;
+
+	for (i = 0; i < cmd->n_transfers; ++i) {
+		struct b1_handle *handle __free(b1_handle_unref) = NULL;
+		struct b1_node *node __free(b1_node_unref) = NULL;
+		struct b1_uobject *unode, *uhandle;
+		struct bus1_transfer src, dst;
+		u64 off, ptr_src, ptr_dst;
+
+		BUILD_BUG_ON(sizeof(*u_src) != sizeof(src));
+		BUILD_BUG_ON(sizeof(*u_dst) != sizeof(dst));
+
+		if (check_mul_overflow(i, (u64)sizeof(src), &off) ||
+		    check_add_overflow(cmd->ptr_src, off, &ptr_src))
+			return -EFAULT;
+		if (check_add_overflow(cmd->ptr_dst, off, &ptr_dst))
+			return -EFAULT;
+
+		u_src = BUS1_TO_PTR(ptr_src);
+		u_dst = BUS1_TO_PTR(ptr_dst);
+		if (ptr_src != BUS1_FROM_PTR(u_src) ||
+		    ptr_dst != BUS1_FROM_PTR(u_dst) ||
+		    copy_from_user(&src, u_src, sizeof(src)))
+			return -EFAULT;
+		if (src.flags & ~BUS1_TRANSFER_FLAG_CREATE)
+			return -EINVAL;
+
+		if (src.flags & BUS1_TRANSFER_FLAG_CREATE) {
+			if (src.id != BUS1_INVALID)
+				return -EINVAL;
+
+			node = b1_peer_new_node(from->peer, to->peer, &handle);
+			if (IS_ERR(node))
+				return PTR_ERR(node);
+
+			unode = b1_upeer_new_node(from);
+			if (IS_ERR(unode))
+				return PTR_ERR(unode);
+
+			list_add_tail(&unode->op_link, unodes);
+			b1_node_set_userdata(node, unode);
+			unode->node = no_free_ptr(node);
+
+			src.flags = 0;
+			src.id = unode->id;
+			if (copy_to_user(u_src, &src, sizeof(src)))
+				return -EFAULT;
+		} else {
+			uhandle = b1_upeer_find_handle(from, src.id);
+			if (!uhandle)
+				return -EBADRQC;
+
+			handle = b1_peer_new_handle(to->peer, uhandle->handle);
+			if (IS_ERR(handle))
+				return PTR_ERR(handle);
+		}
+
+		uhandle = b1_upeer_new_handle(to);
+		if (IS_ERR(uhandle))
+			return PTR_ERR(uhandle);
+
+		list_add_tail(&uhandle->op_link, uhandles);
+		b1_handle_set_userdata(handle, uhandle);
+		uhandle->handle = no_free_ptr(handle);
+
+		dst.flags = 0;
+		dst.id = uhandle->id;
+		if (copy_to_user(u_dst, &dst, sizeof(dst)))
+			return -EFAULT;
+	}
+
+	return 0;
+}
+
+static void
+b1_cdev_transfer_commit(
+	struct b1_upeer *from,
+	struct b1_upeer *to,
+	struct bus1_cmd_transfer *cmd,
+	struct list_head *unodes,
+	struct list_head *uhandles
+) {
+	struct b1_uobject *unode, *uhandle;
+
+	lockdep_assert_held(&from->lock);
+	lockdep_assert_held(&to->lock);
+
+	while ((unode = list_first_entry_or_null(unodes, struct b1_uobject,
+						 op_link))) {
+		list_del_init(&unode->op_link);
+		b1_upeer_link(from, unode);
+		b1_node_begin(unode->node);
+	}
+
+	while ((uhandle = list_first_entry_or_null(uhandles, struct b1_uobject,
+						   op_link))) {
+		list_del_init(&uhandle->op_link);
+		b1_upeer_link(to, uhandle);
+		b1_handle_begin(uhandle->handle);
+	}
+}
+
+static int
+b1_cdev_ioctl_transfer(struct file *file, struct b1_upeer *upeer, unsigned long arg)
+{
+	struct bus1_cmd_transfer __user *u_cmd = (void __user *)arg;
+	struct list_head uhandles = LIST_HEAD_INIT(uhandles);
+	struct list_head unodes = LIST_HEAD_INIT(unodes);
+	struct b1_uobject *unode, *uhandle;
+	CLASS_INIT(fd, fd, EMPTY_FD);
+	struct bus1_cmd_transfer cmd;
+	struct b1_upeer *other;
+	int r, to;
+
+	BUILD_BUG_ON(_IOC_SIZE(BUS1_CMD_TRANSFER) != sizeof(cmd));
+	BUILD_BUG_ON(sizeof(*u_cmd) != sizeof(cmd));
+
+	if (copy_from_user(&cmd, u_cmd, sizeof(cmd)))
+		return -EFAULT;
+	if (cmd.flags != 0)
+		return -EINVAL;
+
+	if (cmd.to != BUS1_INVALID) {
+		to = (int)cmd.to;
+		if (cmd.to != (u64)to || to < 0)
+			return -EBADF;
+		fd = fdget(to);
+		if (fd_empty(fd))
+			return -EBADF;
+		if (fd_file(fd)->f_op != file->f_op)
+			return -EOPNOTSUPP;
+		other = fd_file(fd)->private_data;
+	} else {
+		other = upeer;
+	}
+
+	lock2(&upeer->lock, &other->lock);
+
+	r = b1_cdev_transfer_collect(upeer, other, &cmd, &unodes, &uhandles);
+	if (r >= 0)
+		b1_cdev_transfer_commit(upeer, other, &cmd, &unodes, &uhandles);
+
+	while ((unode = list_first_entry_or_null(&unodes,
+						 struct b1_uobject,
+						 op_link))) {
+		list_del_init(&unode->op_link);
+		b1_uobject_free(unode);
+	}
+	while ((uhandle = list_first_entry_or_null(&uhandles,
+						   struct b1_uobject,
+						   op_link))) {
+		list_del_init(&uhandle->op_link);
+		b1_uobject_free(uhandle);
+	}
+
+	unlock2(&upeer->lock, &other->lock);
+
+	return r;
+}
+
+static int
+b1_cdev_release_collect(
+	struct b1_upeer *upeer,
+	struct bus1_cmd_release *cmd,
+	struct list_head *uobjs
+) {
+	const u64 __user *u_ids;
+	u64 i;
+
+	lockdep_assert_held(&upeer->lock);
+
+	u_ids = BUS1_TO_PTR(cmd->ptr_ids);
+	if (unlikely(cmd->ptr_ids != BUS1_FROM_PTR(u_ids)))
+		return -EFAULT;
+
+	for (i = 0; i < cmd->n_ids; ++i) {
+		struct b1_uobject *uobject;
+		const u64 __user *u_id;
+		u64 off, ptr_id, id;
+
+		BUILD_BUG_ON(sizeof(*u_id) != sizeof(id));
+
+		if (check_mul_overflow(i, (u64)sizeof(id), &off) ||
+		    check_add_overflow(cmd->ptr_ids, off, &ptr_id))
+			return -EFAULT;
+
+		u_id = BUS1_TO_PTR(ptr_id);
+		if (ptr_id != BUS1_FROM_PTR(u_id) ||
+		    copy_from_user(&id, u_id, sizeof(id)))
+			return -EFAULT;
+
+		uobject = b1_upeer_find(upeer, id);
+		if (!uobject)
+			return -EBADRQC;
+
+		list_add_tail(&uobject->op_link, uobjs);
+	}
+
+	return 0;
+}
+
+static int
+b1_cdev_release_commit(
+	struct b1_upeer *upeer,
+	struct bus1_cmd_release *cmd,
+	struct list_head *uobjs
+) {
+	struct b1_op *op __free(b1_op_free) = NULL;
+	struct b1_uobject *uobject;
+
+	lockdep_assert_held(&upeer->lock);
+
+	op = b1_op_new(upeer->peer);
+	if (IS_ERR(op))
+		return PTR_ERR(op);
+
+	while ((uobject = list_first_entry_or_null(uobjs,
+						   struct b1_uobject,
+						   op_link))) {
+		list_del_init(&uobject->op_link);
+
+		if (uobject->type == B1_UOBJECT_NODE) {
+			b1_op_release_node(op, uobject->node);
+			b1_node_end(uobject->node);
+		} else if (uobject->type == B1_UOBJECT_HANDLE) {
+			b1_op_release_handle(op, uobject->handle);
+			b1_handle_end(uobject->handle);
+		}
+
+		b1_upeer_unlink(upeer, uobject);
+		b1_uobject_free(uobject);
+	}
+
+	b1_op_commit(no_free_ptr(op));
+	return 0;
+}
+
+static int
+b1_cdev_ioctl_release(struct b1_upeer *upeer, unsigned long arg)
+{
+	struct bus1_cmd_release __user *u_cmd = (void __user *)arg;
+	struct bus1_cmd_release cmd;
+	struct list_head uobjs = LIST_HEAD_INIT(uobjs);
+	struct b1_uobject *uobject;
+	int r;
+
+	BUILD_BUG_ON(_IOC_SIZE(BUS1_CMD_RELEASE) != sizeof(cmd));
+	BUILD_BUG_ON(sizeof(*u_cmd) != sizeof(cmd));
+
+	if (copy_from_user(&cmd, u_cmd, sizeof(cmd)))
+		return -EFAULT;
+	if (cmd.flags != 0)
+		return -EINVAL;
+
+	mutex_lock(&upeer->lock);
+	r = b1_cdev_release_collect(upeer, &cmd, &uobjs);
+	if (r >= 0)
+		r = b1_cdev_release_commit(upeer, &cmd, &uobjs);
+	while ((uobject = list_first_entry_or_null(&uobjs,
+						   struct b1_uobject,
+						   op_link)))
+		list_del_init(&uobject->op_link);
+	mutex_unlock(&upeer->lock);
+
+	return r;
+}
+
+struct b1_umessage {
+	struct b1_upeer *upeer;
+	u64 n_transfers;
+	struct b1_handle **transfers;
+	struct b1_message_shared *shared;
+};
+
+static struct b1_umessage *
+b1_umessage_free(struct b1_umessage *umessage)
+{
+	if (umessage) {
+		umessage->shared = b1_message_shared_unref(umessage->shared);
+		kfree(umessage->transfers);
+		kfree(umessage);
+	}
+
+	return NULL;
+}
+
+DEFINE_FREE(
+	b1_umessage_free,
+	struct b1_umessage *,
+	if (!IS_ERR_OR_NULL(_T))
+		b1_umessage_free(_T);
+)
+
+static struct b1_umessage *
+b1_umessage_new(struct b1_upeer *upeer, u64 n_transfers)
+{
+	struct b1_umessage *umessage __free(b1_umessage_free) = NULL;
+	size_t n_transfers_sz = n_transfers;
+
+	if ((u64)n_transfers_sz != n_transfers)
+		return ERR_PTR(-ENOMEM);
+
+	umessage = kzalloc_obj(struct b1_umessage);
+	if (!umessage)
+		return ERR_PTR(-ENOMEM);
+
+	umessage->upeer = upeer;
+
+	umessage->transfers = kzalloc_objs(struct b1_handle *, n_transfers_sz);
+	if (!umessage->transfers)
+		return ERR_PTR(-ENOMEM);
+
+	return no_free_ptr(umessage);
+}
+
+static struct b1_umessage *
+b1_cdev_send_import(
+	struct b1_upeer *upeer,
+	struct bus1_cmd_send *cmd
+) {
+	struct b1_umessage *umessage __free(b1_umessage_free) = NULL;
+	const struct bus1_transfer __user *u_transfers;
+	const struct bus1_message __user *u_message;
+	const struct iovec __user *u_data_vecs;
+	struct iovec iov_stack[UIO_FASTIOV];
+	struct iovec *iov_vec __free(kfree) = NULL;
+	void *data __free(kvfree) = NULL;
+	struct bus1_message message;
+	unsigned int u_n_data_vecs;
+	struct iov_iter iov_iter;
+	ssize_t n;
+	u64 i;
+	int r;
+
+	lockdep_assert_held(&upeer->lock);
+
+	BUILD_BUG_ON(sizeof(*u_message) != sizeof(message));
+
+	u_message = BUS1_TO_PTR(cmd->ptr_message);
+	if (unlikely(cmd->ptr_message != BUS1_FROM_PTR(u_message) ||
+		     copy_from_user(&message, u_message, sizeof(message))))
+		return ERR_PTR(-EFAULT);
+	if (message.flags != 0 || message.type != BUS1_MESSAGE_TYPE_USER)
+		return ERR_PTR(-EINVAL);
+
+	u_transfers = BUS1_TO_PTR(message.ptr_transfers);
+	u_n_data_vecs = message.n_data_vecs;
+	u_data_vecs = BUS1_TO_PTR(message.ptr_data_vecs);
+	if (unlikely(message.ptr_transfers != BUS1_FROM_PTR(u_transfers) ||
+		     message.n_data_vecs != (u64)u_n_data_vecs ||
+		     message.ptr_data_vecs != BUS1_FROM_PTR(u_data_vecs)))
+		return ERR_PTR(-EFAULT);
+	if (unlikely(message.n_data > MAX_RW_COUNT))
+		return ERR_PTR(-EMSGSIZE);
+
+	umessage = b1_umessage_new(upeer, message.n_transfers);
+	if (IS_ERR(umessage))
+		return ERR_CAST(umessage);
+
+	/* Import the message data. */
+
+	iov_vec = iov_stack;
+	n = import_iovec(
+		ITER_SOURCE,
+		u_data_vecs,
+		u_n_data_vecs,
+		ARRAY_SIZE(iov_stack),
+		&iov_vec,
+		&iov_iter
+	);
+	if (n < 0)
+		return ERR_PTR(n);
+	if (n < message.n_data)
+		return ERR_PTR(-EMSGSIZE);
+
+	data = kvmalloc(n, GFP_KERNEL);
+	if (!data)
+		return ERR_PTR(-ENOMEM);
+	if (!copy_from_iter_full(data, n, &iov_iter))
+		return ERR_PTR(-EFAULT);
+
+	umessage->shared = b1_message_shared_new(n, no_free_ptr(data));
+	if (IS_ERR(umessage->shared)) {
+		r = PTR_ERR(umessage->shared);
+		umessage->shared = NULL;
+		return ERR_PTR(r);
+	}
+
+	/* Import the handle transfers. */
+
+	for (i = 0; i < message.n_transfers; ++i) {
+		struct b1_uobject *uhandle;
+		struct bus1_transfer __user *u_transfer;
+		struct bus1_transfer transfer;
+		u64 off, ptr_transfer;
+
+		BUILD_BUG_ON(sizeof(*u_transfer) != sizeof(transfer));
+
+		if (check_mul_overflow(i, (u64)sizeof(transfer), &off) ||
+		    check_add_overflow(message.ptr_transfers, off, &ptr_transfer))
+			return ERR_PTR(-EFAULT);
+
+		u_transfer = BUS1_TO_PTR(ptr_transfer);
+		if (ptr_transfer != BUS1_FROM_PTR(u_transfer) ||
+		    copy_from_user(&transfer, u_transfer, sizeof(transfer)))
+			return ERR_PTR(-EFAULT);
+		if (transfer.flags != 0)
+			return ERR_PTR(-EINVAL);
+
+		uhandle = b1_upeer_find_handle(upeer, transfer.id);
+		if (!uhandle)
+			return ERR_PTR(-EBADRQC);
+
+		umessage->transfers[i] = uhandle->handle;
+		++umessage->n_transfers;
+	}
+
+	return no_free_ptr(umessage);
+}
+
+static int
+b1_cdev_send_commit(
+	struct b1_upeer *upeer,
+	struct bus1_cmd_send *cmd,
+	struct b1_umessage *umessage
+) {
+	struct b1_op *op __free(b1_op_free) = NULL;
+	const u64 __user *u_destinations;
+	u64 __user *u_errors;
+	u64 i;
+	int r;
+
+	lockdep_assert_held(&upeer->lock);
+
+	u_destinations = BUS1_TO_PTR(cmd->ptr_destinations);
+	u_errors = BUS1_TO_PTR(cmd->ptr_errors);
+	if (unlikely(cmd->ptr_destinations != BUS1_FROM_PTR(u_destinations) ||
+		     cmd->ptr_errors != BUS1_FROM_PTR(u_errors)))
+		return -EFAULT;
+
+	op = b1_op_new(upeer->peer);
+	if (IS_ERR(op))
+		return PTR_ERR(op);
+
+	for (i = 0; i < cmd->n_destinations; ++i) {
+		struct b1_uobject *uhandle;
+		__u64 __user *u_dst, *u_error;
+		u64 off, ptr_dst, ptr_error, dst, error;
+
+		BUILD_BUG_ON(sizeof(*u_dst) != sizeof(dst));
+		BUILD_BUG_ON(sizeof(*u_error) != sizeof(error));
+
+		if (check_mul_overflow(i, (u64)sizeof(dst), &off) ||
+		    check_add_overflow(cmd->ptr_destinations, off, &ptr_dst) ||
+		    check_add_overflow(cmd->ptr_errors, off, &ptr_error))
+			return -EFAULT;
+
+		u_dst = BUS1_TO_PTR(ptr_dst);
+		u_error = BUS1_TO_PTR(ptr_error);
+		if (ptr_dst != BUS1_FROM_PTR(u_dst) ||
+		    ptr_error != BUS1_FROM_PTR(u_error) ||
+		    get_user(dst, u_dst))
+			return -EFAULT;
+
+		uhandle = b1_upeer_find_handle(upeer, dst);
+		if (!uhandle)
+			return -EBADRQC;
+
+		r = b1_op_send_message(
+			op,
+			uhandle->handle,
+			umessage->n_transfers,
+			umessage->transfers,
+			umessage->shared
+		);
+		if (r < 0)
+			return r;
+
+		error = 0;
+		if (put_user(error, u_error))
+			return -EFAULT;
+	}
+
+	b1_op_commit(no_free_ptr(op));
+	return 0;
+}
+
+static int
+b1_cdev_ioctl_send(struct b1_upeer *upeer, unsigned long arg)
+{
+	struct bus1_cmd_send __user *u_cmd = (void __user *)arg;
+	struct bus1_cmd_send cmd;
+	struct b1_umessage *umessage;
+	int r;
+
+	BUILD_BUG_ON(_IOC_SIZE(BUS1_CMD_SEND) != sizeof(cmd));
+	BUILD_BUG_ON(sizeof(*u_cmd) != sizeof(cmd));
+
+	if (copy_from_user(&cmd, u_cmd, sizeof(cmd)))
+		return -EFAULT;
+	if (cmd.flags != 0)
+		return -EINVAL;
+
+	mutex_lock(&upeer->lock);
+	umessage = b1_cdev_send_import(upeer, &cmd);
+	if (IS_ERR(umessage))
+		r = PTR_ERR(umessage);
+	else
+		r = b1_cdev_send_commit(upeer, &cmd, umessage);
+	umessage = b1_umessage_free(umessage);
+	mutex_unlock(&upeer->lock);
+
+	return r;
+}
+
+static int
+b1_cdev_recv_export_transfers(
+	struct b1_upeer *upeer,
+	struct b1_peer_peek_user *peek,
+	struct bus1_message *message
+) {
+	struct bus1_transfer __user *u_transfers;
+	u64 i;
+
+	lockdep_assert_held(&upeer->lock);
+
+	u_transfers = BUS1_TO_PTR(message->ptr_transfers);
+	if (unlikely(message->ptr_transfers != BUS1_FROM_PTR(u_transfers)))
+		return -EFAULT;
+
+	for (i = 0; i < peek->n_transfers; ++i) {
+		struct b1_uobject *uhandle;
+		struct bus1_transfer __user *u_transfer;
+		struct bus1_transfer transfer;
+		u64 off, ptr_transfer;
+
+		BUILD_BUG_ON(sizeof(*u_transfer) != sizeof(transfer));
+
+		if (check_mul_overflow(i, (u64)sizeof(transfer), &off) ||
+		    check_add_overflow(message->ptr_transfers, off, &ptr_transfer))
+			return -EFAULT;
+
+		u_transfer = BUS1_TO_PTR(ptr_transfer);
+		if (ptr_transfer != BUS1_FROM_PTR(u_transfer))
+			return -EFAULT;
+
+		uhandle = b1_handle_get_userdata(peek->transfers[i]);
+		if (!uhandle) {
+			uhandle = b1_upeer_new_handle(upeer);
+			if (IS_ERR(uhandle))
+				return PTR_ERR(uhandle);
+
+			b1_handle_set_userdata(peek->transfers[i], uhandle);
+			uhandle->handle = b1_handle_ref(peek->transfers[i]);
+		}
+
+		transfer.flags = 0;
+		transfer.id = uhandle->id;
+		if (copy_to_user(u_transfer, &transfer, sizeof(transfer)))
+			return -EFAULT;
+	}
+
+	return 0;
+}
+
+static int
+b1_cdev_recv_export_data(
+	struct b1_upeer *upeer,
+	struct b1_peer_peek_user *peek,
+	struct bus1_message *message
+) {
+	const struct iovec __user *u_data_vecs;
+	struct iovec iov_stack[UIO_FASTIOV];
+	struct iovec *iov_vec __free(kfree) = NULL;
+	unsigned int u_n_data_vecs;
+	struct iov_iter iov_iter;
+	ssize_t n;
+
+	lockdep_assert_held(&upeer->lock);
+
+	u_n_data_vecs = message->n_data_vecs;
+	u_data_vecs = BUS1_TO_PTR(message->ptr_data_vecs);
+	if (unlikely(message->n_data_vecs != (u64)u_n_data_vecs ||
+		     message->ptr_data_vecs != BUS1_FROM_PTR(u_data_vecs)))
+		return -EFAULT;
+
+	iov_vec = iov_stack;
+	n = import_iovec(
+		ITER_DEST,
+		u_data_vecs,
+		u_n_data_vecs,
+		ARRAY_SIZE(iov_stack),
+		&iov_vec,
+		&iov_iter
+	);
+	if (n < 0)
+		return n;
+	if (n < message->n_data)
+		return -EMSGSIZE;
+	if (message->n_data < peek->n_data)
+		return -EMSGSIZE;
+	if (!copy_to_iter_full(peek->data, peek->n_data, &iov_iter))
+		return -EFAULT;
+
+	return 0;
+}
+
+static int
+b1_cdev_recv_user(
+	struct b1_upeer *upeer,
+	struct b1_peer_peek_user *peek,
+	struct bus1_metadata *metadata,
+	struct bus1_message *message
+) {
+	struct b1_uobject *unode, *uhandle;
+	u64 i;
+	int r;
+
+	lockdep_assert_held(&upeer->lock);
+
+	unode = b1_node_get_userdata(peek->node);
+	if (!unode) {
+		b1_peer_pop(upeer->peer);
+		return 0;
+	}
+
+	r = b1_cdev_recv_export_transfers(upeer, peek, message);
+	if (r >= 0)
+		r = b1_cdev_recv_export_data(upeer, peek, message);
+	for (i = 0; i < peek->n_transfers; ++i) {
+		uhandle = b1_handle_get_userdata(peek->transfers[i]);
+		if (!uhandle || b1_uobject_is_linked(uhandle))
+			continue;
+		if (r >= 0) {
+			b1_upeer_link(upeer, uhandle);
+			b1_handle_begin(uhandle->handle);
+		} else {
+			b1_handle_set_userdata(peek->transfers[i], NULL);
+			b1_uobject_free(uhandle);
+		}
+	}
+	if (r < 0)
+		return r;
+
+	*metadata = (struct bus1_metadata){
+		.flags = 0,
+		.id = unode->id,
+		.account = BUS1_INVALID,
+	};
+	*message = (struct bus1_message){
+		.flags = 0,
+		.type = BUS1_MESSAGE_TYPE_USER,
+		.n_transfers = peek->n_transfers,
+		.ptr_transfers = message->ptr_transfers,
+		.n_data = peek->n_data,
+		.n_data_vecs = message->n_data_vecs,
+		.ptr_data_vecs = message->ptr_data_vecs,
+	};
+
+	return 1;
+}
+
+static int
+b1_cdev_recv_node_release(
+	struct b1_upeer *upeer,
+	struct b1_handle *handle,
+	struct bus1_metadata *metadata,
+	struct bus1_message *message
+) {
+	struct b1_uobject *uhandle;
+
+	lockdep_assert_held(&upeer->lock);
+
+	uhandle = b1_handle_get_userdata(handle);
+	if (!uhandle) {
+		b1_peer_pop(upeer->peer);
+		return 0;
+	}
+
+	*metadata = (struct bus1_metadata){
+		.flags = 0,
+		.id = uhandle->id,
+		.account = BUS1_INVALID,
+	};
+	*message = (struct bus1_message){
+		.flags = 0,
+		.type = BUS1_MESSAGE_TYPE_NODE_RELEASE,
+		.n_transfers = 0,
+		.ptr_transfers = message->ptr_transfers,
+		.n_data = 0,
+		.n_data_vecs = message->n_data_vecs,
+		.ptr_data_vecs = message->ptr_data_vecs,
+	};
+
+	b1_handle_end(uhandle->handle);
+	b1_upeer_unlink(upeer, uhandle);
+	b1_uobject_free(uhandle);
+	return 1;
+}
+
+static int
+b1_cdev_recv_handle_release(
+	struct b1_upeer *upeer,
+	struct b1_node *node,
+	struct bus1_metadata *metadata,
+	struct bus1_message *message
+) {
+	struct b1_uobject *unode;
+
+	lockdep_assert_held(&upeer->lock);
+
+	unode = b1_node_get_userdata(node);
+	if (!unode) {
+		b1_peer_pop(upeer->peer);
+		return 0;
+	}
+
+	*metadata = (struct bus1_metadata){
+		.flags = 0,
+		.id = unode->id,
+		.account = BUS1_INVALID,
+	};
+	*message = (struct bus1_message){
+		.flags = 0,
+		.type = BUS1_MESSAGE_TYPE_HANDLE_RELEASE,
+		.n_transfers = 0,
+		.ptr_transfers = message->ptr_transfers,
+		.n_data = 0,
+		.n_data_vecs = message->n_data_vecs,
+		.ptr_data_vecs = message->ptr_data_vecs,
+	};
+
+	b1_node_end(unode->node);
+	b1_upeer_unlink(upeer, unode);
+	b1_uobject_free(unode);
+	return 1;
+}
+
+static int
+b1_cdev_recv_peek(
+	struct b1_upeer *upeer,
+	struct bus1_metadata *metadata,
+	struct bus1_message *message
+) {
+	struct b1_peer_peek peek;
+
+	lockdep_assert_held(&upeer->lock);
+
+	if (!b1_peer_peek(upeer->peer, &peek))
+		return -EAGAIN;
+
+	switch (peek.type) {
+	case BUS1_MESSAGE_TYPE_USER:
+		return b1_cdev_recv_user(
+			upeer,
+			&peek.u.user,
+			metadata,
+			message
+		);
+	case BUS1_MESSAGE_TYPE_NODE_RELEASE:
+		return b1_cdev_recv_node_release(
+			upeer,
+			peek.u.node_release.handle,
+			metadata,
+			message
+		);
+	case BUS1_MESSAGE_TYPE_HANDLE_RELEASE:
+		return b1_cdev_recv_handle_release(
+			upeer,
+			peek.u.handle_release.node,
+			metadata,
+			message
+		);
+	default:
+		WARN_ONCE(1, "invalid message type: %llu", peek.type);
+		b1_peer_pop(upeer->peer);
+		return -ENOTRECOVERABLE;
+	}
+}
+
+static int
+b1_cdev_ioctl_recv(struct b1_upeer *upeer, unsigned long arg)
+{
+	struct bus1_cmd_recv __user *u_cmd = (void __user *)arg;
+	struct bus1_metadata __user *u_metadata;
+	struct bus1_message __user *u_message;
+	struct bus1_metadata metadata;
+	struct bus1_message message;
+	struct bus1_cmd_recv cmd;
+	int r;
+
+	BUILD_BUG_ON(_IOC_SIZE(BUS1_CMD_RECV) != sizeof(cmd));
+	BUILD_BUG_ON(sizeof(*u_cmd) != sizeof(cmd));
+
+	if (copy_from_user(&cmd, u_cmd, sizeof(cmd)))
+		return -EFAULT;
+	if (cmd.flags != 0)
+		return -EINVAL;
+
+	u_metadata = BUS1_TO_PTR(cmd.ptr_metadata);
+	u_message = BUS1_TO_PTR(cmd.ptr_message);
+	if (unlikely(cmd.ptr_metadata != BUS1_FROM_PTR(u_metadata) ||
+		     cmd.ptr_message != BUS1_FROM_PTR(u_message)))
+		return -EFAULT;
+
+	if (copy_from_user(&message, u_message, sizeof(message)))
+		return -EFAULT;
+	if (message.flags != 0 || message.type != BUS1_INVALID)
+		return -EINVAL;
+
+	memset(&metadata, 0, sizeof(metadata));
+
+	mutex_lock(&upeer->lock);
+	do {
+		r = b1_cdev_recv_peek(upeer, &metadata, &message);
+	} while (!r);
+	if (r > 0) {
+		if (copy_to_user(u_metadata, &metadata, sizeof(metadata)) ||
+		    copy_to_user(u_message, &message, sizeof(message)))
+			r = -EFAULT;
+		else
+			b1_peer_pop(upeer->peer);
+	}
+	mutex_unlock(&upeer->lock);
+
+	return r;
+}
+
+static long
+b1_cdev_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	struct b1_upeer *upeer = file->private_data;
+	int r;
+
+	switch (cmd) {
+	case BUS1_CMD_TRANSFER:
+		r = b1_cdev_ioctl_transfer(file, upeer, arg);
+		break;
+	case BUS1_CMD_RELEASE:
+		r = b1_cdev_ioctl_release(upeer, arg);
+		break;
+	case BUS1_CMD_SEND:
+		r = b1_cdev_ioctl_send(upeer, arg);
+		break;
+	case BUS1_CMD_RECV:
+		r = b1_cdev_ioctl_recv(upeer, arg);
+		break;
+	default:
+		r = -ENOTTY;
+		break;
+	}
+
+	return r;
+}
+
+static const struct file_operations b1_cdev_fops = {
+	.owner			= THIS_MODULE,
+	.open			= b1_cdev_open,
+	.release		= b1_cdev_release,
+	.poll			= b1_cdev_poll,
+	.unlocked_ioctl		= b1_cdev_ioctl,
+	.compat_ioctl		= b1_cdev_ioctl,
+};
+
+/**
+ * b1_cdev_new() - initialize a new bus1 character device
+ * @acct:		accounting system to use for this character device
+ *
+ * This registers a new bus1 character device and returns it to the caller.
+ * Once the object is returned, it will be live and ready.
+ *
+ * Return: A pointer to the new device is returned, ERR_PTR on failure.
+ */
+struct b1_cdev *b1_cdev_new(struct b1_acct *acct)
+{
+	struct b1_cdev *cdev __free(b1_cdev_free) = NULL;
+	int r;
+
+	cdev = kzalloc_obj(struct b1_cdev);
+	if (!cdev)
+		return ERR_PTR(-ENOMEM);
+
+	cdev->acct = b1_acct_ref(acct);
+	cdev->misc = (struct miscdevice){
+		.fops = &b1_cdev_fops,
+		.minor = MISC_DYNAMIC_MINOR,
+		.name = KBUILD_MODNAME,
+		.mode = 0666,
+	};
+
+	r = misc_register(&cdev->misc);
+	if (r < 0) {
+		cdev->misc.fops = NULL;
+		return ERR_PTR(r);
+	}
+
+	return no_free_ptr(cdev);
+}
+
+/**
+ * b1_cdev_free() - destroy a bus1 character device
+ * @cdev:		character device to operate on, or NULL
+ *
+ * This unregisters and frees a previously registered bus1 character device.
+ *
+ * If you pass NULL, this is a no-op.
+ *
+ * Return: NULL is returned.
+ */
+struct b1_cdev *b1_cdev_free(struct b1_cdev *cdev)
+{
+	if (cdev) {
+		if (cdev->misc.fops)
+			misc_deregister(&cdev->misc);
+		cdev->acct = b1_acct_unref(cdev->acct);
+		kfree(cdev);
+	}
+
+	return NULL;
+}
diff --git a/ipc/bus1/cdev.h b/ipc/bus1/cdev.h
new file mode 100644
index 000000000000..b4da7c815a43
--- /dev/null
+++ b/ipc/bus1/cdev.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __B1_CDEV_H
+#define __B1_CDEV_H
+
+/**
+ * DOC: Character Device for Bus1
+ *
+ * This implements the character-device API for Bus1. It allows full access to
+ * the Bus1 communication system through a singleton character device. The
+ * character device is named after `KBUILD_MODNAME` and registered with a
+ * dynamic minor number. Thus, it can be loaded multiple times under different
+ * names, usually for testing.
+ *
+ * Every file description associated with the character device will represent a
+ * single Bus1 peer. IOCTLs on the character device expose the different Bus1
+ * operations in a direct mapping.
+ */
+
+#include <linux/cleanup.h>
+#include <linux/err.h>
+
+struct b1_acct;
+struct b1_cdev;
+
+struct b1_cdev *b1_cdev_new(struct b1_acct *acct);
+struct b1_cdev *b1_cdev_free(struct b1_cdev *cdev);
+
+DEFINE_FREE(
+	b1_cdev_free,
+	struct b1_cdev *,
+	if (!IS_ERR_OR_NULL(_T))
+		b1_cdev_free(_T);
+)
+
+#endif /* __B1_CDEV_H */
diff --git a/ipc/bus1/main.c b/ipc/bus1/main.c
index bd6399b2ce3a..55725bbbfcf4 100644
--- a/ipc/bus1/main.c
+++ b/ipc/bus1/main.c
@@ -1,16 +1,38 @@
 // SPDX-License-Identifier: GPL-2.0
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+#include <linux/cleanup.h>
+#include <linux/err.h>
 #include <linux/init.h>
 #include <linux/module.h>
+#include <linux/sizes.h>
+#include "cdev.h"
 #include "lib.h"
 
+static struct b1_cdev *b1_main_cdev;
+
 static int __init b1_main_init(void)
 {
+	struct b1_acct *acct __free(b1_acct_unref) = NULL;
+	const b1_acct_value_t maxima[] = {
+		[B1_ACCT_SLOT_OBJECTS] = SZ_1M,
+		[B1_ACCT_SLOT_BYTES] = SZ_1G,
+	};
+
+	acct = b1_acct_new(&maxima);
+	if (IS_ERR(acct))
+		return PTR_ERR(acct);
+
+	b1_main_cdev = b1_cdev_new(acct);
+	if (IS_ERR(b1_main_cdev))
+		return PTR_ERR(b1_main_cdev);
+
 	return 0;
 }
 
 static void __exit b1_main_deinit(void)
 {
+	if (!IS_ERR_OR_NULL(b1_main_cdev))
+		b1_main_cdev = b1_cdev_free(b1_main_cdev);
 }
 
 module_init(b1_main_init);
-- 
2.53.0


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [RFC 03/16] rust/alloc: add Vec::into_boxed_slice()
  2026-03-31 19:02 ` [RFC 03/16] rust/alloc: add Vec::into_boxed_slice() David Rheinsberg
@ 2026-03-31 19:28   ` Miguel Ojeda
  2026-03-31 21:10   ` Gary Guo
  2026-03-31 22:07   ` Danilo Krummrich
  2 siblings, 0 replies; 33+ messages in thread
From: Miguel Ojeda @ 2026-03-31 19:28 UTC (permalink / raw)
  To: David Rheinsberg, Danilo Krummrich
  Cc: rust-for-linux, teg, Miguel Ojeda, Boqun Feng, Gary Guo,
	Björn Roy Baron, Benno Lossin, Andreas Hindborg, Alice Ryhl,
	Trevor Gross, Lorenzo Stoakes, Vlastimil Babka, Liam R. Howlett,
	Uladzislau Rezki

On Tue, Mar 31, 2026 at 9:05 PM David Rheinsberg <david@readahead.eu> wrote:
>
>  rust/kernel/alloc/kvec.rs | 67 +++++++++++++++++++++++++++++++++++++++

Cc'ing "RUST" and "RUST [ALLOC]".

Cheers,
Miguel

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 01/16] rust/sync: add LockedBy::access_mut_unchecked()
  2026-03-31 19:02 ` [RFC 01/16] rust/sync: add LockedBy::access_mut_unchecked() David Rheinsberg
@ 2026-03-31 19:29   ` Miguel Ojeda
  0 siblings, 0 replies; 33+ messages in thread
From: Miguel Ojeda @ 2026-03-31 19:29 UTC (permalink / raw)
  To: David Rheinsberg, Boqun Feng
  Cc: rust-for-linux, teg, Miguel Ojeda, Peter Zijlstra, Ingo Molnar,
	Will Deacon, Waiman Long, linux-kernel

On Tue, Mar 31, 2026 at 9:05 PM David Rheinsberg <david@readahead.eu> wrote:
>
>  rust/kernel/sync/locked_by.rs | 30 ++++++++++++++++++++++++++++++

Cc'ing "RUST" and "LOCKING PRIMITIVES" (Boqun).

Cheers,
Miguel

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 08/16] bus1/util: add basic utilities
  2026-03-31 19:03 ` [RFC 08/16] bus1/util: add basic utilities David Rheinsberg
@ 2026-03-31 19:35   ` Miguel Ojeda
  2026-04-01 11:05     ` David Rheinsberg
  0 siblings, 1 reply; 33+ messages in thread
From: Miguel Ojeda @ 2026-03-31 19:35 UTC (permalink / raw)
  To: David Rheinsberg; +Cc: rust-for-linux, teg, Miguel Ojeda

On Tue, Mar 31, 2026 at 9:05 PM David Rheinsberg <david@readahead.eu> wrote:
>
> Some helpers will become obsolete, once the MSRV is bumped. This is
> noted in the documentation.

Yes, I am bumping it to Rust 1.85.0, so that should be fine.

However, `feature(exposed_provenance)` is there since 1.76, so you
could use that already.

Cheers,
Miguel

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 09/16] bus1/util: add field projections
  2026-03-31 19:03 ` [RFC 09/16] bus1/util: add field projections David Rheinsberg
@ 2026-03-31 19:38   ` Miguel Ojeda
  0 siblings, 0 replies; 33+ messages in thread
From: Miguel Ojeda @ 2026-03-31 19:38 UTC (permalink / raw)
  To: David Rheinsberg, Benno Lossin, Gary Guo, Xiang Fei Ding
  Cc: rust-for-linux, teg, Miguel Ojeda, Boqun Feng,
	Björn Roy Baron, Andreas Hindborg, Alice Ryhl, Trevor Gross,
	Danilo Krummrich

On Tue, Mar 31, 2026 at 9:06 PM David Rheinsberg <david@readahead.eu> wrote:
>
> Introduce a utility module that provides field-projections for stable
> Rust. The module is designed very similar to the official Rust field
> projections (which are still unstable), but without any requirement for
> compiler support.
>
> The module explicitly uses names similar to the ones from the official
> field projections, and is certainly meant to be replaced once those
> become stable or are otherwise introduced into the kernel.
>
> However, until then, this module is small and simple enough to allow
> very convenient intrusive collections, and thus is included here.

Cc'ing Benno, Gary, Ding, plus "RUST".

Cheers,
Miguel

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 13/16] bus1/util: add intrusive rb-tree
  2026-03-31 19:03 ` [RFC 13/16] bus1/util: add intrusive rb-tree David Rheinsberg
@ 2026-03-31 19:43   ` Miguel Ojeda
  0 siblings, 0 replies; 33+ messages in thread
From: Miguel Ojeda @ 2026-03-31 19:43 UTC (permalink / raw)
  To: David Rheinsberg, Alice Ryhl
  Cc: rust-for-linux, teg, Miguel Ojeda, Boqun Feng, Gary Guo,
	Björn Roy Baron, Benno Lossin, Andreas Hindborg,
	Trevor Gross, Danilo Krummrich

On Tue, Mar 31, 2026 at 9:06 PM David Rheinsberg <david@readahead.eu> wrote:
>
> Add `util::rb`, an intrusive RB-Tree using `util::intrusive` for the
> API, and `linux/rbtree.h` for the implementation.
>
> The API is designed for very easy use, without requiring any unsafe code
> from a user. It tracks ownership via a simple atomic, and can thus
> assert collection association in O(1) in a completely safe manner.
>
> Unlike the owning version of RB-Trees in `kernel::rbtree`, the intrusive
> version clearly documents the node<->collection relationship in
> data-structures, avoids double pointer-chases in traversals, and can be
> used by bus1 to queue release/destruction notifications without fallible
> allocations.
>
> Signed-off-by: David Rheinsberg <david@readahead.eu>

Cc'ing Alice, plus "RUST".

Cheers,
Miguel

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 10/16] bus1/util: add IntoDeref/FromDeref
  2026-03-31 19:03 ` [RFC 10/16] bus1/util: add IntoDeref/FromDeref David Rheinsberg
@ 2026-03-31 19:44   ` Miguel Ojeda
  0 siblings, 0 replies; 33+ messages in thread
From: Miguel Ojeda @ 2026-03-31 19:44 UTC (permalink / raw)
  To: David Rheinsberg
  Cc: rust-for-linux, teg, Miguel Ojeda, Boqun Feng, Gary Guo,
	Björn Roy Baron, Benno Lossin, Andreas Hindborg, Alice Ryhl,
	Trevor Gross, Danilo Krummrich

On Tue, Mar 31, 2026 at 9:06 PM David Rheinsberg <david@readahead.eu> wrote:
>
> Introduce two new utility traits: IntoDeref and FromDeref.
>
> The traits are an abstraction for `Box::into_raw()` and
> `Box::from_raw()`, as well as their equivalents in `Arc`. At the same
> time, the traits can be implemented for plain references as no-ops.
>
> The traits will be used by intrusive collections to generalize over the
> data-type stored in a collection, without moving the actual data into
> the collection.
>
> Signed-off-by: David Rheinsberg <david@readahead.eu>

Cc'ing "RUST".

Cheers,
Miguel

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 00/16] bus1: Capability-based IPC for Linux
  2026-03-31 19:02 [RFC 00/16] bus1: Capability-based IPC for Linux David Rheinsberg
                   ` (15 preceding siblings ...)
  2026-03-31 19:03 ` [RFC 16/16] bus1: implement the uapi David Rheinsberg
@ 2026-03-31 19:46 ` Miguel Ojeda
  16 siblings, 0 replies; 33+ messages in thread
From: Miguel Ojeda @ 2026-03-31 19:46 UTC (permalink / raw)
  To: David Rheinsberg
  Cc: rust-for-linux, teg, Miguel Ojeda, Boqun Feng, Gary Guo,
	Björn Roy Baron, Benno Lossin, Andreas Hindborg, Alice Ryhl,
	Trevor Gross, Danilo Krummrich

On Tue, Mar 31, 2026 at 9:05 PM David Rheinsberg <david@readahead.eu> wrote:
>
> `dbus-broker` [2]) and are convinced more than ever that we should move
> forward with bus1. But for this initial submission I want to put focus
> on the Rust integration.

In general, if it is a big feature that still needs to be agreed by
some kernel maintainers etc., then it may be best to discuss that
first.

In any case, I have Cc'd certain people in a few patches, but Cc'ing
"RUST" here in general too for the "utilities" etc.

I hope that helps.

Cheers,
Miguel

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 03/16] rust/alloc: add Vec::into_boxed_slice()
  2026-03-31 19:02 ` [RFC 03/16] rust/alloc: add Vec::into_boxed_slice() David Rheinsberg
  2026-03-31 19:28   ` Miguel Ojeda
@ 2026-03-31 21:10   ` Gary Guo
  2026-03-31 22:07   ` Danilo Krummrich
  2 siblings, 0 replies; 33+ messages in thread
From: Gary Guo @ 2026-03-31 21:10 UTC (permalink / raw)
  To: David Rheinsberg, rust-for-linux; +Cc: teg, Miguel Ojeda

On Tue Mar 31, 2026 at 8:02 PM BST, David Rheinsberg wrote:
> Add `Vec::into_boxed_slice()` similar to
> `std::vec::Vec::into_boxed_slice()` [1].
>
> There is currently no way to easily consume the allocation of a vector.
> However, it is very convenient to use `Vec` to initialize a dynamically
> sized array and then "seal" it, so it can be passed along as a Box:
>
>     fn create_from(src: &[T]) -> Result<KBox<[U]>, AllocError> {
>         let v = Vec::with_capacity(n, GFP_KERNEL)?;
>
>         for i in src {
>             v.push(foo(i)?, GFP_KERNEL)?;
>         }
>
>         Ok(v.into_boxed_slice())
>     }
>
> A valid alternative is to use `Box::new_uninit()` rather than
> `Vec::with_capacity()`, and eventually convert the box via
> `Box::assume_init()`. This works but needlessly requires unsafe code,
> awkward drop handling, etc. Using `Vec` is the much simpler solution.
>
> [1] https://doc.rust-lang.org/std/vec/struct.Vec.html#method.into_boxed_slice
>
> Signed-off-by: David Rheinsberg <david@readahead.eu>
> ---
>  rust/kernel/alloc/kvec.rs | 67 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 67 insertions(+)
>
> diff --git a/rust/kernel/alloc/kvec.rs b/rust/kernel/alloc/kvec.rs
> index ac8d6f763ae8..b8b0fa1a7505 100644
> --- a/rust/kernel/alloc/kvec.rs
> +++ b/rust/kernel/alloc/kvec.rs
> @@ -733,6 +733,73 @@ pub fn retain(&mut self, mut f: impl FnMut(&mut T) -> bool) {
>          }
>          self.truncate(num_kept);
>      }
> +
> +    fn shrink_to_fit(&mut self) -> Result<(), AllocError>  {
> +        if Self::is_zst() {
> +            // ZSTs always use maximum capacity.
> +            return Ok(());
> +        }
> +
> +        let layout = ArrayLayout::new(self.len()).map_err(|_| AllocError)?;
> +
> +        // SAFETY:
> +        // - `ptr` is valid because it's either `None` or comes from a previous
> +        //   call to `A::realloc`.
> +        // - `self.layout` matches the `ArrayLayout` of the preceding
> +        //   allocation.
> +        let ptr = unsafe {
> +            A::realloc(
> +                Some(self.ptr.cast()),
> +                layout.into(),
> +                self.layout.into(),
> +                crate::alloc::flags::GFP_NOWAIT,

The flag should be specified by the user.

Best,
Gary

> +                NumaNode::NO_NODE,
> +            )?
> +        };
> +
> +        // INVARIANT:
> +        // - `layout` is some `ArrayLayout::<T>`,
> +        // - `ptr` has been created by `A::realloc` from `layout`.
> +        self.ptr = ptr.cast();
> +        self.layout = layout;
> +        Ok(())
> +    }
> +
> +    /// Converts the vector into [`Box<[T], A>`].
> +    ///
> +    /// Excess capacity is retained in the allocation, but lost until the box
> +    /// is dropped.
> +    ///
> +    /// This function is fallible, because kernel allocators do not guarantee
> +    /// that shrinking reallocations are infallible, yet the Rust abstractions
> +    /// strictly require that layouts are correct. Hence, the caller must be
> +    /// ready to deal with reallocation failures.
> +    ///
> +    /// # Examples
> +    ///
> +    /// ```
> +    /// let mut v = KVec::<u16>::with_capacity(4, GFP_KERNEL)?;
> +    /// for i in 0..4 {
> +    ///     v.push(i, GFP_KERNEL);
> +    /// }
> +    /// let s: KBox<[u16]> = v.into_boxed_slice()?;
> +    /// assert_eq!(s.len(), 4);
> +    /// # Ok::<(), kernel::alloc::AllocError>(())
> +    /// ```
> +    pub fn into_boxed_slice(mut self) -> Result<Box<[T], A>, AllocError> {
> +        self.shrink_to_fit()?;
> +        let (buf, len, _cap) = self.into_raw_parts();
> +        let slice = ptr::slice_from_raw_parts_mut(buf, len);
> +
> +        // SAFETY:
> +        // - `slice` has been allocated with `A`
> +        // - `slice` is suitably aligned
> +        // - `slice` has an exact length of `len`
> +        // - all elements within `slice` are initialized values of `T`
> +        // - `len` does not exceed `isize::MAX`
> +        // - `slice` was allocated for `Layout::for_value::<[T]>()`
> +        Ok(unsafe { Box::from_raw(slice) })
> +    }
>  }
>  
>  impl<T: Clone, A: Allocator> Vec<T, A> {


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 03/16] rust/alloc: add Vec::into_boxed_slice()
  2026-03-31 19:02 ` [RFC 03/16] rust/alloc: add Vec::into_boxed_slice() David Rheinsberg
  2026-03-31 19:28   ` Miguel Ojeda
  2026-03-31 21:10   ` Gary Guo
@ 2026-03-31 22:07   ` Danilo Krummrich
  2026-04-01  9:28     ` David Rheinsberg
  2 siblings, 1 reply; 33+ messages in thread
From: Danilo Krummrich @ 2026-03-31 22:07 UTC (permalink / raw)
  To: David Rheinsberg; +Cc: rust-for-linux, teg, Miguel Ojeda

On Tue Mar 31, 2026 at 9:02 PM CEST, David Rheinsberg wrote:
> Add `Vec::into_boxed_slice()` similar to
> `std::vec::Vec::into_boxed_slice()` [1].
>
> There is currently no way to easily consume the allocation of a vector.
> However, it is very convenient to use `Vec` to initialize a dynamically
> sized array and then "seal" it, so it can be passed along as a Box:
>
>     fn create_from(src: &[T]) -> Result<KBox<[U]>, AllocError> {
>         let v = Vec::with_capacity(n, GFP_KERNEL)?;
>
>         for i in src {
>             v.push(foo(i)?, GFP_KERNEL)?;
>         }
>
>         Ok(v.into_boxed_slice())
>     }
>
> A valid alternative is to use `Box::new_uninit()` rather than
> `Vec::with_capacity()`, and eventually convert the box via
> `Box::assume_init()`. This works but needlessly requires unsafe code,
> awkward drop handling, etc. Using `Vec` is the much simpler solution.
>
> [1] https://doc.rust-lang.org/std/vec/struct.Vec.html#method.into_boxed_slice
>
> Signed-off-by: David Rheinsberg <david@readahead.eu>

Thanks for presenting a user! Please make sure to use scripts/get_maintainer.pl.

Also, this patch has already been posted in [1], so it should be mentioned that
this is a v2 and it should include a changelog.

> ---
>  rust/kernel/alloc/kvec.rs | 67 +++++++++++++++++++++++++++++++++++++++
>  1 file changed, 67 insertions(+)
>
> diff --git a/rust/kernel/alloc/kvec.rs b/rust/kernel/alloc/kvec.rs
> index ac8d6f763ae8..b8b0fa1a7505 100644
> --- a/rust/kernel/alloc/kvec.rs
> +++ b/rust/kernel/alloc/kvec.rs
> @@ -733,6 +733,73 @@ pub fn retain(&mut self, mut f: impl FnMut(&mut T) -> bool) {
>          }
>          self.truncate(num_kept);
>      }
> +
> +    fn shrink_to_fit(&mut self) -> Result<(), AllocError>  {
> +        if Self::is_zst() {
> +            // ZSTs always use maximum capacity.
> +            return Ok(());
> +        }
> +
> +        let layout = ArrayLayout::new(self.len()).map_err(|_| AllocError)?;
> +
> +        // SAFETY:
> +        // - `ptr` is valid because it's either `None` or comes from a previous
> +        //   call to `A::realloc`.
> +        // - `self.layout` matches the `ArrayLayout` of the preceding
> +        //   allocation.
> +        let ptr = unsafe {
> +            A::realloc(
> +                Some(self.ptr.cast()),
> +                layout.into(),
> +                self.layout.into(),
> +                crate::alloc::flags::GFP_NOWAIT,

Why? This should be specified by the caller. Besides, I don't see how this could
ever end up in memory reclaim in the first place.

> +                NumaNode::NO_NODE,
> +            )?
> +        };
> +
> +        // INVARIANT:
> +        // - `layout` is some `ArrayLayout::<T>`,
> +        // - `ptr` has been created by `A::realloc` from `layout`.
> +        self.ptr = ptr.cast();
> +        self.layout = layout;
> +        Ok(())
> +    }
> +
> +    /// Converts the vector into [`Box<[T], A>`].
> +    ///
> +    /// Excess capacity is retained in the allocation, but lost until the box
> +    /// is dropped.
> +    ///
> +    /// This function is fallible, because kernel allocators do not guarantee
> +    /// that shrinking reallocations are infallible, yet the Rust abstractions
> +    /// strictly require that layouts are correct. Hence, the caller must be
> +    /// ready to deal with reallocation failures.
> +    ///
> +    /// # Examples
> +    ///
> +    /// ```
> +    /// let mut v = KVec::<u16>::with_capacity(4, GFP_KERNEL)?;
> +    /// for i in 0..4 {
> +    ///     v.push(i, GFP_KERNEL);
> +    /// }
> +    /// let s: KBox<[u16]> = v.into_boxed_slice()?;
> +    /// assert_eq!(s.len(), 4);
> +    /// # Ok::<(), kernel::alloc::AllocError>(())
> +    /// ```
> +    pub fn into_boxed_slice(mut self) -> Result<Box<[T], A>, AllocError> {
> +        self.shrink_to_fit()?;

As mentioned in [1], I think into_boxed_slice() should call A::realloc()
directly; at least use a separate internal helper. shrink_to_fit() will
eventually be exposed to users and the actual semantics is yet to be defined.
I.e. it may have additional logic.

The requirement here is not to actually shrink the backing memory, but to
satisfy the safety requirement of A::free(). And the best way to ensure this is
to call A::realloc() with ArrayLayout::new(self.len()).

IOW, please don't call the above method shrink_to_fit(), but maybe
realloc_to_fit(). If shrink_to_fit() will just end up calling realloc_to_fit()
that's fine.

> +        let (buf, len, _cap) = self.into_raw_parts();
> +        let slice = ptr::slice_from_raw_parts_mut(buf, len);
> +
> +        // SAFETY:
> +        // - `slice` has been allocated with `A`
> +        // - `slice` is suitably aligned
> +        // - `slice` has an exact length of `len`
> +        // - all elements within `slice` are initialized values of `T`
> +        // - `len` does not exceed `isize::MAX`
> +        // - `slice` was allocated for `Layout::for_value::<[T]>()`

Thanks for adding this! Mind also sending a fix for Box::from_raw() which lacks
the safety requirement?

> +        Ok(unsafe { Box::from_raw(slice) })
> +    }
>  }

[1] https://lore.kernel.org/all/20260326095621.846840-1-david@readahead.eu/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 03/16] rust/alloc: add Vec::into_boxed_slice()
  2026-03-31 22:07   ` Danilo Krummrich
@ 2026-04-01  9:28     ` David Rheinsberg
  0 siblings, 0 replies; 33+ messages in thread
From: David Rheinsberg @ 2026-04-01  9:28 UTC (permalink / raw)
  To: Danilo Krummrich; +Cc: rust-for-linux, teg, Miguel Ojeda

Hi

On Wed, Apr 1, 2026, at 12:07 AM, Danilo Krummrich wrote:
> On Tue Mar 31, 2026 at 9:02 PM CEST, David Rheinsberg wrote:
>> diff --git a/rust/kernel/alloc/kvec.rs b/rust/kernel/alloc/kvec.rs
>> index ac8d6f763ae8..b8b0fa1a7505 100644
>> --- a/rust/kernel/alloc/kvec.rs
>> +++ b/rust/kernel/alloc/kvec.rs
>> @@ -733,6 +733,73 @@ pub fn retain(&mut self, mut f: impl FnMut(&mut T) -> bool) {
>>          }
>>          self.truncate(num_kept);
>>      }
>> +
>> +    fn shrink_to_fit(&mut self) -> Result<(), AllocError>  {
>> +        if Self::is_zst() {
>> +            // ZSTs always use maximum capacity.
>> +            return Ok(());
>> +        }
>> +
>> +        let layout = ArrayLayout::new(self.len()).map_err(|_| AllocError)?;
>> +
>> +        // SAFETY:
>> +        // - `ptr` is valid because it's either `None` or comes from a previous
>> +        //   call to `A::realloc`.
>> +        // - `self.layout` matches the `ArrayLayout` of the preceding
>> +        //   allocation.
>> +        let ptr = unsafe {
>> +            A::realloc(
>> +                Some(self.ptr.cast()),
>> +                layout.into(),
>> +                self.layout.into(),
>> +                crate::alloc::flags::GFP_NOWAIT,
>
> Why? This should be specified by the caller. Besides, I don't see how this could
> ever end up in memory reclaim in the first place.

For slub, this can end up in reclaim if alignment requirements change, or if a different NUMA node is requested (only with GFP_THISNODE). vmalloc refuses alignment changes, but has the same NUMA logic. Not sure whether we have any other allocators in Rust right now.

My idea was to have a fixed call to realloc() that is guaranteed not to fail with any known allocators, while still being callable from atomic context.

But I can also take the flags in `into_boxed_slice()`. Yet, with current allocators, they will not have any effect as long as the layout and NUMA node are not caller-controlled.
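
For comparison, the std method this models performs the shrink
unconditionally and infallibly (the std allocator guarantees shrinking
reallocations succeed), which is exactly the guarantee we cannot rely
on in the kernel. A quick userspace illustration:

```rust
fn main() {
    // Over-allocate, then only fill part of the capacity.
    let mut v: Vec<u16> = Vec::with_capacity(16);
    for i in 0..4 {
        v.push(i);
    }
    assert!(v.capacity() >= 16);

    // std shrinks the allocation to the length and hands over
    // ownership; no Result, no flags.
    let b: Box<[u16]> = v.into_boxed_slice();
    assert_eq!(b.len(), 4);
    assert_eq!(&*b, &[0u16, 1, 2, 3]);
}
```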

>> +                NumaNode::NO_NODE,
>> +            )?
>> +        };
>> +
>> +        // INVARIANT:
>> +        // - `layout` is some `ArrayLayout::<T>`,
>> +        // - `ptr` has been created by `A::realloc` from `layout`.
>> +        self.ptr = ptr.cast();
>> +        self.layout = layout;
>> +        Ok(())
>> +    }
>> +
>> +    /// Converts the vector into [`Box<[T], A>`].
>> +    ///
>> +    /// Excess capacity is retained in the allocation, but lost until the box
>> +    /// is dropped.
>> +    ///
>> +    /// This function is fallible, because kernel allocators do not guarantee
>> +    /// that shrinking reallocations are infallible, yet the Rust abstractions
>> +    /// strictly require that layouts are correct. Hence, the caller must be
>> +    /// ready to deal with reallocation failures.
>> +    ///
>> +    /// # Examples
>> +    ///
>> +    /// ```
>> +    /// let mut v = KVec::<u16>::with_capacity(4, GFP_KERNEL)?;
>> +    /// for i in 0..4 {
>> +    ///     v.push(i, GFP_KERNEL);
>> +    /// }
>> +    /// let s: KBox<[u16]> = v.into_boxed_slice()?;
>> +    /// assert_eq!(s.len(), 4);
>> +    /// # Ok::<(), kernel::alloc::AllocError>(())
>> +    /// ```
>> +    pub fn into_boxed_slice(mut self) -> Result<Box<[T], A>, AllocError> {
>> +        self.shrink_to_fit()?;
>
> As mentioned in [1], I think into_boxed_slice() should call A::realloc()
> directly; at least use a separate internal helper. shrink_to_fit() will
> eventually be exposed to users and the actual semantics is yet to be defined.
> I.e. it may have additional logic.
>
> The requirement here is not to actually shrink the backing memory, but to
> satisfy the safety requirement of A::free(). And the best way to ensure this is
> to call A::realloc() with ArrayLayout::new(self.len()).
>
> IOW, please don't call the above method shrink_to_fit(), but maybe
> realloc_to_fit(). If shrink_to_fit() will just end up calling realloc_to_fit()
> that's fine.

`realloc_to_fit()` is fine with me.

>> +        let (buf, len, _cap) = self.into_raw_parts();
>> +        let slice = ptr::slice_from_raw_parts_mut(buf, len);
>> +
>> +        // SAFETY:
>> +        // - `slice` has been allocated with `A`
>> +        // - `slice` is suitably aligned
>> +        // - `slice` has an exact length of `len`
>> +        // - all elements within `slice` are initialized values of `T`
>> +        // - `len` does not exceed `isize::MAX`
>> +        // - `slice` was allocated for `Layout::for_value::<[T]>()`
>
> Thanks for adding this! Mind also sending a fix for Box::from_raw() which lacks
> the safety requirement?

Sure.

Do you want me to carry the patch in this series, or should I resend it separately?

Thanks
David

>> +        Ok(unsafe { Box::from_raw(slice) })
>> +    }
>>  }
>
> [1] https://lore.kernel.org/all/20260326095621.846840-1-david@readahead.eu/

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 08/16] bus1/util: add basic utilities
  2026-03-31 19:35   ` Miguel Ojeda
@ 2026-04-01 11:05     ` David Rheinsberg
  2026-04-01 11:25       ` Miguel Ojeda
  0 siblings, 1 reply; 33+ messages in thread
From: David Rheinsberg @ 2026-04-01 11:05 UTC (permalink / raw)
  To: Miguel Ojeda; +Cc: rust-for-linux, teg, Miguel Ojeda

Hi

On Tue, Mar 31, 2026, at 9:35 PM, Miguel Ojeda wrote:
> On Tue, Mar 31, 2026 at 9:05 PM David Rheinsberg <david@readahead.eu> wrote:
>>
>> Some helpers will become obsolete, once the MSRV is bumped. This is
>> noted in the documentation.
>
> Yes, I am bumping it to Rust 1.85.0, so that should be fine.
>
> However, `feature(exposed_provenance)` is there since 1.76, so you
> could use that already.

You mean enabling the unstable feature in older Rust releases? But there is no guarantee that the methods exposed by that feature did not change while it was unstable, right? So I would have to verify that nothing changed over the lifetime of the unstable feature, at least for the parts I am using.
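
For reference, the pattern in question is the pointer<->integer
round-trip; the sketch below uses the plain casts that the
`exposed_provenance` API merely makes explicit (version details from
memory, so take them with a grain of salt):

```rust
fn main() {
    let x = 7u32;
    let p: *const u32 = &x;

    // Plain integer casts; feature(exposed_provenance) provides the
    // explicit spellings `p.expose_provenance()` and
    // `core::ptr::with_exposed_provenance::<u32>(addr)` for these.
    let addr = p as usize;
    let q = addr as *const u32;

    // SAFETY: `addr` was derived from a pointer to the live `x` above.
    unsafe { assert_eq!(*q, 7) };
}
```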

Anyway, if the bump to 1.85 is planned, I will gladly keep the backports for now.

Thanks
David

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 08/16] bus1/util: add basic utilities
  2026-04-01 11:05     ` David Rheinsberg
@ 2026-04-01 11:25       ` Miguel Ojeda
  0 siblings, 0 replies; 33+ messages in thread
From: Miguel Ojeda @ 2026-04-01 11:25 UTC (permalink / raw)
  To: David Rheinsberg; +Cc: rust-for-linux, teg, Miguel Ojeda

On Wed, Apr 1, 2026 at 1:05 PM David Rheinsberg <david@readahead.eu> wrote:
>
> You mean enabling the unstable feature in older rust releases? But there is no guarantee the methods exposed by that feature did not change while it was unstable, right? So I would have to verify that nothing changed over the lifetime of the unstable feature, at least for the parts I am using.

Yes, exactly. We do that for other unstable features.

To clarify, it is not like we add unstable features for no reason, but
if one does simplify things (and is already known to be stable etc.),
then it can be worth it.

Cheers,
Miguel

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 07/16] bus1: add man-page
  2026-03-31 19:02 ` [RFC 07/16] bus1: add man-page David Rheinsberg
@ 2026-04-01 16:30   ` Jonathan Corbet
  2026-04-01 18:01     ` David Rheinsberg
  2026-04-04 15:30   ` Thomas Meyer
  1 sibling, 1 reply; 33+ messages in thread
From: Jonathan Corbet @ 2026-04-01 16:30 UTC (permalink / raw)
  To: David Rheinsberg, rust-for-linux; +Cc: teg, Miguel Ojeda, David Rheinsberg

David Rheinsberg <david@readahead.eu> writes:

> Create an overview man-page `bus1(7)` describing the overall design of
> bus1 as well as its individual commands.
>
> The man-page can be compiled and read via:
>
>     rst2man Documentation/bus1/bus1.7.rst bus1.7
>     man ./bus1.7
>
> Signed-off-by: David Rheinsberg <david@readahead.eu>
> ---
>  Documentation/bus1/bus1.7.rst | 319 ++++++++++++++++++++++++++++++++++
>  1 file changed, 319 insertions(+)
>  create mode 100644 Documentation/bus1/bus1.7.rst

I'm really glad to see this documentation with the series!

That said, a couple of notes...

- Please do not create a new top-level directory under Documentation/
  for this.  It looks to me like it should be a part of the user-space
  API manual.

- You need to add your new RST file to the containing index.rst file for
  the docs system pick it up.  That, and some things in the file itself,
  suggest that you've not run the docs build on this file; that would be
  a good thing to do at some point.

As a nit, I would also take the ".7" out of the file name.

Thanks,

jon

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 07/16] bus1: add man-page
  2026-04-01 16:30   ` Jonathan Corbet
@ 2026-04-01 18:01     ` David Rheinsberg
  2026-04-01 18:06       ` David Rheinsberg
  0 siblings, 1 reply; 33+ messages in thread
From: David Rheinsberg @ 2026-04-01 18:01 UTC (permalink / raw)
  To: Jonathan Corbet, rust-for-linux; +Cc: teg, Miguel Ojeda

Hi Jonathan!

On Wed, Apr 1, 2026, at 6:30 PM, Jonathan Corbet wrote:
> David Rheinsberg <david@readahead.eu> writes:
>
>> Create an overview man-page `bus1(7)` describing the overall design of
>> bus1 as well as its individual commands.
>>
>> The man-page can be compiled and read via:
>>
>>     rst2man Documentation/bus1/bus1.7.rst bus1.7
>>     man ./bus1.7
>>
>> Signed-off-by: David Rheinsberg <david@readahead.eu>
>> ---
>>  Documentation/bus1/bus1.7.rst | 319 ++++++++++++++++++++++++++++++++++
>>  1 file changed, 319 insertions(+)
>>  create mode 100644 Documentation/bus1/bus1.7.rst
>
> I'm really glad to see this documentation with the series!
>
> That said, a couple of notes...
>
> - Please do not create a new top-level directory under Documentation/
>   for this.  It looks to me like it should be a part of the user-space
>   API manual.

My intention is to submit this man-page to linux-manpages. It is included here to make sure it evolves with the series, but ultimately would not live in-tree. I thought I mentioned this in the commit message, but.. apparently didn't. I am sorry!

> - You need to add your new RST file to the containing index.rst file for
>   the docs system pick it up.  That, and some things in the file itself,
>   suggest that you've not run the docs build on this file; that would be
>   a good thing to do at some point.

It builds fine with `rst2man` (as described in the commit message). It was not meant to be picked up by the kernel docs, so it is not linked in any index.rst.

Thanks
David

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 07/16] bus1: add man-page
  2026-04-01 18:01     ` David Rheinsberg
@ 2026-04-01 18:06       ` David Rheinsberg
  0 siblings, 0 replies; 33+ messages in thread
From: David Rheinsberg @ 2026-04-01 18:06 UTC (permalink / raw)
  To: Jonathan Corbet, rust-for-linux; +Cc: teg, Miguel Ojeda

On Wed, Apr 1, 2026, at 8:01 PM, David Rheinsberg wrote:
> Hi Jonathan!
>
> On Wed, Apr 1, 2026, at 6:30 PM, Jonathan Corbet wrote:
>> David Rheinsberg <david@readahead.eu> writes:
>>
>>> Create an overview man-page `bus1(7)` describing the overall design of
>>> bus1 as well as its individual commands.
>>>
>>> The man-page can be compiled and read via:
>>>
>>>     rst2man Documentation/bus1/bus1.7.rst bus1.7
>>>     man ./bus1.7
>>>
>>> Signed-off-by: David Rheinsberg <david@readahead.eu>
>>> ---
>>>  Documentation/bus1/bus1.7.rst | 319 ++++++++++++++++++++++++++++++++++
>>>  1 file changed, 319 insertions(+)
>>>  create mode 100644 Documentation/bus1/bus1.7.rst
>>
>> I'm really glad to see this documentation with the series!
>>
>> That said, a couple of notes...
>>
>> - Please do not create a new top-level directory under Documentation/
>>   for this.  It looks to me like it should be a part of the user-space
>>   API manual.
>
> My intention is to submit this man-page to linux-manpages. It is 
> included here to make sure it evolves with the series, but ultimately 
> would not live in-tree. I thought I mentioned this in the commit 
> message, but.. apparently didn't. I am sorry!

If you prefer, I can instead include it as rendered text in the cover-letter, as is common with new syscalls. But the man-page is rather long, so I kinda liked including it as a patch. I am fine either way.

Thanks
David

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC 07/16] bus1: add man-page
  2026-03-31 19:02 ` [RFC 07/16] bus1: add man-page David Rheinsberg
  2026-04-01 16:30   ` Jonathan Corbet
@ 2026-04-04 15:30   ` Thomas Meyer
  1 sibling, 0 replies; 33+ messages in thread
From: Thomas Meyer @ 2026-04-04 15:30 UTC (permalink / raw)
  To: David Rheinsberg; +Cc: rust-for-linux, teg, Miguel Ojeda

Am Tue, Mar 31, 2026 at 09:02:59PM +0200 schrieb David Rheinsberg:
> Create an overview man-page `bus1(7)` describing the overall design of
> bus1 as well as its individual commands.
> 
> The man-page can be compiled and read via:
> 
>     rst2man Documentation/bus1/bus1.7.rst bus1.7
>     man ./bus1.7
> 
> Signed-off-by: David Rheinsberg <david@readahead.eu>
> ---
>  Documentation/bus1/bus1.7.rst | 319 ++++++++++++++++++++++++++++++++++
>  1 file changed, 319 insertions(+)
>  create mode 100644 Documentation/bus1/bus1.7.rst
> 
> diff --git a/Documentation/bus1/bus1.7.rst b/Documentation/bus1/bus1.7.rst
> new file mode 100644
> index 000000000000..0e2f26fee3e2
> --- /dev/null
> +++ b/Documentation/bus1/bus1.7.rst
Hi,

here are some suggested typo/grammar and consistency fixes ("file-descriptors"
to "file descriptors"), and some minor content tweaks:

diff --git a/Documentation/bus1/bus1.7.rst b/Documentation/bus1/bus1.7.rst
index 0e2f26f..4a8726f 100644
--- a/Documentation/bus1/bus1.7.rst
+++ b/Documentation/bus1/bus1.7.rst
@@ -38,8 +38,9 @@ visible behavior of them. However, all nodes and handles of a single peer share
 a message queue.
 
 When the last handle to a node is released, the owning peer of the node
-receives a notification. Similarly, if a node is released, the owning peers of
-all handles referring to that node receive a notification. All notifications
+receives a ``BUS1_MESSAGE_TYPE_NODE_RELEASE`` notification. Similarly, if a node
+is released, the owning peers of all handles referring to that node receive a
+``BUS1_MESSAGE_TYPE_HANDLE_RELEASE`` notification. All notifications
 are ordered causally with any other ongoing communication.
 
 Communication on the bus happens via transactions. A transaction is an atomic
@@ -71,8 +72,8 @@ Peer Creation
 | ``int bus1_peer_new();``
 
 Peers are independent entities that can be created at will. They are accessed
-via file-descriptors, with each peer having its own file-description. Multiple
-file-descriptors can refer to the same peer, yet currently all operations will
+via file descriptors, with each peer having its own file description. Multiple
+file descriptors can refer to the same peer, yet currently all operations will
 lock a peer and thus serialize all operations on that peer.
 
 Once the last file-descriptor referring to a peer is closed, the peer is
@@ -108,7 +109,7 @@ a handle from one peer to another, while holding file-descriptors to both
 peers.
 
 The command takes ``flags``, which currently is unused and must be 0. ``from``
-and ``to`` are file-descriptors referring to the involved peers. ``from`` must
+and ``to`` are file descriptors referring to the involved peers. ``from`` must
 be provided, while ``to`` can be ``-1``, in which case it will refer to the
 same peer as ``from``.
 
@@ -118,7 +119,7 @@ uninitialized, and will be filled in by the kernel. ``src`` must be initialized
 by the caller. ``src[i].flags`` must be 0 or ``BUS1_TRANSFER_FLAG_CREATE``.
 ``src[i].id`` must refer to an ID of a handle in ``from``. If
 ``BUS1_TRANSFER_FLAG_CREATE`` is set, ``src[i].id`` must be set to
-``BUS1_INVALID``. In this case a new node is create and the ID of the node
+``BUS1_INVALID``. In this case a new node is created and the ID of the node
 is returned in ``src[i].id`` with ``src[i].flags`` cleared to 0.
 
 In any case, a new handle in ``to`` is created for every provided transfer. Its
@@ -137,7 +138,7 @@ Release Command
 |         ``uint64_t *ids``
 | ``);``
 
-A release command takes a peer file-descriptor as ``peerfd`` and an array of
+A release command takes a peer file descriptor as ``peerfd`` and an array of
 node and handle IDs as ``ids`` with ``n_ids`` number of elements. All these
 nodes and handles will be released in a single atomic transaction.
 
@@ -174,7 +175,7 @@ Send Command
 |         ``struct bus1_message *message``
 | ``);``
 
-The send command takes a peer file-descriptor as ``peerfd``, the message to
+The send command takes a peer file descriptor as ``peerfd``, the message to
 send as ``message``, and an array of destination handles as ``destinations``
 (with ``n_destinations`` number of elements).
 
@@ -184,7 +185,7 @@ currently partial failure is not exposed, ``errors[i]`` is currently always
 set to 0 on success.
 
 All destination IDs must refer to a valid handle of the calling peer.
-``EBADRQC`` is returned if an ID did not refer to an handle. Currently, only
+``EBADRQC`` is returned if an ID did not refer to a handle. Currently, only
 a single message can be provided with a single send command, and this message
 is transmitted to all destinations in a single atomic transaction.
 
@@ -217,14 +218,14 @@ Recv Command
 |         ``struct bus1_message *message``
 | ``);``
 
-The recv command takes a peer file-descriptor as ``peerfd`` and fetches the
+The recv command takes a peer file descriptor as ``peerfd`` and fetches the
 next message from its queue. If no message is queued ``EAGAIN`` is returned.
 
 The message is returned in ``message``. The caller must set ``message.flags``
 to 0 and ``message.type`` to ``BUS1_INVALID``. ``message.n_transfers`` and
 ``message.ptr_transfers`` refer to an array of ``struct bus1_transfer``
 structures used to return the transferred handles of the next message. Upon
-return, ``message.n_transfers`` is updated to the actually transferred number
+return, ``message.n_transfers`` is updated to the actual transferred number
 of handles, while ``message.transfers[i]`` is updated as described in
 ``bus1_cmd_transfer(2)``.
 
@@ -240,8 +241,7 @@ transferred handles and data written to the transfer array and iovecs.
 
 ``metadata`` is updated to contain more data about the message.
 ``metadata.flags`` is unused and set to 0. ``metadata.id`` contains the ID
-of the node the message was received on (or the ID of the handle in case of
-``BUS1_MESSAGE_TYPE_NODE_RELEASE``). ``metadata.account`` contains the ID
+of the node the message was received on, or the ID of the released handle in the case of ``BUS1_MESSAGE_TYPE_HANDLE_RELEASE``. ``metadata.account`` contains the ID
 of the resource context of the sender.
 
 Errors
@@ -250,7 +250,7 @@ Errors
 All operations follow a strict error reporting model. If an operation has a
 documented error case, then this will be indicated to user-space with a
 negative return value (or ``errno`` respectively). Whenever an error appears,
-the operation will have been cancelled entirely and have no observable affect
+the operation will have been cancelled entirely and have no observable effect
 on the bus. User space can safely assume the system to be in the same state as
 if the operation was not invoked, unless explicitly documented.
 
@@ -294,12 +294,16 @@ can have (temporary) visible side-effects. But similar to the atomicity
 guarantees, these do not affect any other bus properties, but only the resource
 accounting.
 
+While resource accounting side-effects are temporary and do not violate bus
+atomicity, they may be observable via system monitoring tools during the
+operation.
+
 However, note that monitoring of bus accounting is not considered a
 programmatic interface, nor are any explicit accounting APIs exposed. Thus, the
 only visible effect of resource accounting is getting ``EDQUOT`` if a counter
 is exceeded.
 
-Additionally to standard resource accounting, a peer can also allocate remote
+In addition to standard resource accounting, a peer can also allocate remote
 resources. This happens whenever a transaction transmits resources from
 a sender to a receiver. All such transactions are always accounted on the
 receiver at the time of *send*. To prevent senders from exhausting resources

regards
thomas

> @@ -0,0 +1,319 @@
> +====
> +bus1
> +====
> +
> +----------------------------------------------
> +Capability-based IPC for Linux
> +----------------------------------------------
> +
> +:Manual section: 7
> +:Manual group: Miscellaneous
> +
> +SYNOPSIS
> +========
> +
> +| ``#include <linux/bus1.h>``
> +
> +DESCRIPTION
> +-----------
> +
> +The bus1 API provides capability-based inter-process communication. Its core
> +primitive is a multi-producer/single-consumer unidirectional channel that can
> +transmit arbitrary user messages. The receiving end of the channel is called
> +a **node**, while the sending end is called a **handle**.
> +
> +A handle always refers to exactly one node, but there can be many handles
> +referring to the same node, and those handles can be held by independent
> +owners. Messages are sent via a handle, meaning it is transmitted to the node
> +the handle is linked to. A handle to a node is required to transmit a message
> +to that node.
> +
> +A sender can attach copies of any handle they hold to a message, and thus
> +transfer them alongside the message. The copied handles refer to the same node
> +as their respective original handle.
> +
> +All nodes and handles have an owning **peer**. A peer is a purely local
> +concept. The owning peer of a node or handle never affects the externally
> +visible behavior of them. However, all nodes and handles of a single peer share
> +a message queue.
> +
> +When the last handle to a node is released, the owning peer of the node
> +receives a notification. Similarly, if a node is released, the owning peers of
> +all handles referring to that node receive a notification. All notifications
> +are ordered causally with any other ongoing communication.
> +
> +Communication on the bus happens via transactions. A transaction is an atomic
> +transmission of messages, which can include release notifications. All message
> +types can be part of a transaction, and thus can happen atomically with any
> +other kind of message. A transaction with only a single message or notification
> +is called a unicast. Any other transaction is called a multicast.
> +
> +Transactions are causally ordered. That is, if any transaction is a reaction to
> +any previous transaction, all messages of the reaction transaction will be
> +received by any peer after the messages that were part of the original
> +transaction. This is even guaranteed if the causal relationship exists only via
> +a side-channel outside the scope of bus1. However, messages without causal
> +relationship have no stable order. This is especially noticeable with
> +multicasts, where receivers might see independent multicasts in a different
> +order.
> +
> +Operations
> +----------
> +
> +The user-space API of bus1 is not decided on. This section describes the
> +available operations as system calls, as they likely would be exposed by any
> +user-space library. However, for development reasons the actual user-space API
> +is currently performed via ioctls on a character device.
> +
> +Peer Creation
> +^^^^^^^^^^^^^
> +
> +| ``int bus1_peer_new();``
> +
> +Peers are independent entities that can be created at will. They are accessed
> +via file-descriptors, with each peer having its own file-description. Multiple
> +file-descriptors can refer to the same peer, yet currently all operations will
> +lock a peer and thus serialize all operations on that peer.
> +
> +Once the last file-descriptor referring to a peer is closed, the peer is
> +released. Any resources of that peer are released, and any ongoing transactions
> +targetting the peer will discard their messages.
> +
> +File descriptions pin the credentials of the calling process. A peer will use
> +those pinned credentials for resource accounting. Otherwise, no ambient
> +resources are used by bus1.
> +
> +Transfer Command
> +^^^^^^^^^^^^^^^^
> +
> +| ``#define BUS1_TRANSFER_FLAG_CREATE 0x1``
> +|
> +| ``struct bus1_transfer {``
> +|         ``uint64_t flags;``
> +|         ``uint64_t id;``
> +| ``};``
> +|
> +| ``int bus1_cmd_transfer(``
> +|         ``uint64_t flags,``
> +|         ``int from,``
> +|         ``int to,``
> +|         ``size_t n,``
> +|         ``struct bus1_transfer *src,``
> +|         ``struct bus1_transfer *dst``
> +| ``);``
> +
> +A transfer command can be used for two different operations. First, it can be
> +used to create nodes and handles on a peer. Second, it can be used to transfer
> +a handle from one peer to another, while holding file-descriptors to both
> +peers.
> +
> +The command takes ``flags``, which currently is unused and must be 0. ``from``
> +and ``to`` are file-descriptors referring to the involved peers. ``from`` must
> +be provided, while ``to`` can be ``-1``, in which case it will refer to the
> +same peer as ``from``.
> +
> +``n`` defines the number of transfer operations that are performed atomically.
> +``src`` and ``dst`` must refer to arrays with ``n`` elements. ``dst`` can be
> +uninitialized, and will be filled in by the kernel. ``src`` must be initialized
> +by the caller. ``src[i].flags`` must be 0 or ``BUS1_TRANSFER_FLAG_CREATE``.
> +``src[i].id`` must refer to an ID of a handle in ``from``. If
> +``BUS1_TRANSFER_FLAG_CREATE`` is set, ``src[i].id`` must be set to
> +``BUS1_INVALID``. In this case a new node is create and the ID of the node
> +is returned in ``src[i].id`` with ``src[i].flags`` cleared to 0.
> +
> +In any case, a new handle in ``to`` is created for every provided transfer. Its
> +ID is returned in ``dst[i].id`` and ``dst[i].flags`` is set to 0.
> +
> +Note that both arrays ``src`` and ``dst`` can be partially modified by the
> +kernel even if the operation fails (even if it fails with a different error
> +than ``EFAULT``).
> +
> +Release Command
> +^^^^^^^^^^^^^^^
> +
> +| ``int bus1_cmd_release(``
> +|         ``int peerfd,``
> +|         ``size_t n_ids,``
> +|         ``uint64_t *ids``
> +| ``);``
> +
> +A release command takes a peer file-descriptor as ``peerfd`` and an array of
> +node and handle IDs as ``ids`` with ``n_ids`` number of elements. All these
> +nodes and handles will be released in a single atomic transaction.
> +
> +The command does not fail, except if invalid arguments are provided.
> +
> +No subsequent operation on this peer will refer to the IDs once this call
> +returns. Furthermore, those IDs will never be reused.
> +
> +Send Command
> +^^^^^^^^^^^^
> +
> +| ``enum bus1_message_type: uint64_t {``
> +|         ``BUS1_MESSAGE_TYPE_USER = 0,``
> +|         ``BUS1_MESSAGE_TYPE_NODE_RELEASE = 1,``
> +|         ``BUS1_MESSAGE_TYPE_HANDLE_RELEASE = 2,``
> +|         ``_BUS1_MESSAGE_TYPE_N,``
> +| ``}``
> +|
> +| ``struct bus1_message {``
> +|         ``uint64_t flags;``
> +|         ``uint64_t type;``
> +|         ``uint64_t n_transfers;   // size_t n_transfers``
> +|         ``uint64_t ptr_transfers; // struct bus1_transfer *transfers;``
> +|         ``uint64_t n_data;        // size_t n_data;``
> +|         ``uint64_t n_data_vecs;   // size_t n_data_vecs;``
> +|         ``uint64_t ptr_data_vecs; // struct iovec *data_vecs;``
> +| ``};``
> +|
> +| ``int bus1_cmd_send(``
> +|         ``int peerfd,``
> +|         ``size_t n_destinations,``
> +|         ``uint64_t *destinations,``
> +|         ``int32_t *errors,``
> +|         ``struct bus1_message *message``
> +| ``);``
> +
> +The send command takes a peer file-descriptor as ``peerfd``, the message to
> +send as ``message``, and an array of destination handles as ``destinations``
> +(with ``n_destinations`` number of elements).
> +
> +Additionally, ``errors`` is used to return the individual error code for each
> +destination. This is only done if the send command returns success. Since
> +currently partial failure is not exposed, ``errors[i]`` is currently always
> +set to 0 on success.
> +
> +All destination IDs must refer to a valid handle of the calling peer.
> +``EBADRQC`` is returned if an ID did not refer to an handle. Currently, only
> +a single message can be provided with a single send command, and this message
> +is transmitted to all destinations in a single atomic transaction.
> +
> +The message to be transmitted is provided as ``message``. This structure
> +describes the payload of the message. ``message.flags`` must be 0.
> +``message.type`` must be ``BUS1_MESSAGE_TYPE_USER``. ``message.n_transfers``
> +and ``message.ptr_transfers`` refer to an array of ``struct bus1_transfer``
> +and describe handles to be transferred with the message. The transfers are
> +used the same as in ``bus1_cmd_transfer(2)``, but ``BUS1_TRANSFER_FLAG_CREATE``
> +is currently not refused.
> +
> +``message.n_data_vecs`` and ``message.ptr_data_vecs`` provide the iovecs with
> +the data to be transmitted with the message. Only the first ``message.n_data``
> +bytes of the iovecs are considered part of the message. Any trailing bytes
> +are ignored. The data is copied into kernel buffers and the iovecs are no
> +longer accessed once the command returns.
> +
> +Recv Command
> +^^^^^^^^^^^^
> +
> +| ``struct bus1_metadata {``
> +|         ``uint64_t flags;``
> +|         ``uint64_t id;``
> +|         ``uint64_t account;``
> +| ``};``
> +|
> +| ``int bus1_cmd_recv(``
> +|         ``int peerfd,``
> +|         ``struct bus1_metadata *metadata,``
> +|         ``struct bus1_message *message``
> +| ``);``
> +
> +The recv command takes a peer file-descriptor as ``peerfd`` and fetches the
> +next message from its queue. If no message is queued ``EAGAIN`` is returned.
> +
> +The message is returned in ``message``. The caller must set ``message.flags``
> +to 0 and ``message.type`` to ``BUS1_INVALID``. ``message.n_transfers`` and
> +``message.ptr_transfers`` refer to an array of ``struct bus1_transfer``
> +structures used to return the transferred handles of the next message. Upon
> +return, ``message.n_transfers`` is updated to the actually transferred number
> +of handles, while ``message.transfers[i]`` is updated as described in
> +``bus1_cmd_transfer(2)``.
> +
> +``message.n_data``, ``message.n_data_vecs``, and ``message.ptr_data_vecs``
> +must be initialized by the caller and provide the space to store the data of
> +the next message. The iovecs are never modified by the operation.
> +
> +If the message would exceed ``message.n_transfers`` or ``message.n_data``,
> +``EMSGSIZE`` is returned and the fields are updated accordingly.
> +
> +Upon success, ``message`` is updated with data of the received message, with
> +transferred handles and data written to the transfer array and iovecs.
> +
> +``metadata`` is updated to contain more data about the message.
> +``metadata.flags`` is unused and set to 0. ``metadata.id`` contains the ID
> +of the node the message was received on (or the ID of the handle in case of
> +``BUS1_MESSAGE_TYPE_NODE_RELEASE``). ``metadata.account`` contains the ID
> +of the resource context of the sender.
> +
> +Errors
> +------
> +
> +All operations follow a strict error reporting model. If an operation has a
> +documented error case, then this will be indicated to user-space with a
> +negative return value (or ``errno`` respectively). Whenever an error appears,
> +the operation will have been cancelled entirely and have no observable affect
> +on the bus. User space can safely assume the system to be in the same state as
> +if the operation was not invoked, unless explicitly documented.
> +
> +One major exception is ``EFAULT``. The ``EFAULT`` error code is returned
> +whenever user-space supplied malformed pointers to the kernel, and the kernel
> +was unable to fetch information from, or return information to, user-space.
> +This indicates a misbehaving client, and usually there is no way to recover
> +from this, unless user-space intentionally triggered this behavior. User-space
> +should treat ``EFAULT`` as an assertion failure and not try to recover. If the
> +bus1 API is used in a correct manner, ``EFAULT`` will never be returned by any
> +operation.
> +
> +Resource Accounting
> +-------------------
> +
> +Every peer has an associated resource context used to account claimed
> +resources. This resource context is determined at the time the peer is created
> +and it will never change over its lifetime. The default, and at this time only,
> +accounting model is based on UNIX ``UIDs``. That is, each peer gets assigned
> +the resource-context of the ``Effective UID`` of the process that creates it.
> +From then on any resource consumption of the peer is accounted on this
> +resource-context, and thus shared with all other peers of the same ``UID``.
> +
> +All allocations have upper limits which cannot be exceeded. An operation will
> +return ``EDQUOT`` if the quota limits prevent an operation from being
> +performed. User-space is expected to treat this as an administration or
> +configuration error, since there is generally no meaningful way to recover.
> +Applications should expect to be spawned with suitable resource limits
> +pre-configured. However, this is not enforced and user-space is free to react
> +to ``EDQUOT`` as it wishes.
> +
> +Unlike all other bus properties, resource accounting is not part of the bus
> +atomicity and ordering guarantees, nor does it implement strict rollback. This
> +means, if an operation allocates multiple resources, the resource counters are
> +updated before the operation will happen on the bus. Hence, the resource
> +counter modifications are visible to the system before the operation itself is.
> +Furthermore, while any failing operation will correctly revert any temporary
> +resource allocations, the allocations will have been visible to the system
> +for the time of this (failed) operation. Therefore, even a failed operation
> +can have (temporary) visible side-effects. But similar to the atomicity
> +guarantees, these do not affect any other bus properties, but only the resource
> +accounting.
> +
> +However, note that monitoring of bus accounting is not considered a
> +programmatic interface, nor are any explicit accounting APIs exposed. Thus, the
> +only visible effect of resource accounting is getting ``EDQUOT`` if a counter
> +is exceeded.
> +
> +Additionally to standard resource accounting, a peer can also allocate remote
> +resources. This happens whenever a transaction transmits resources from
> +a sender to a receiver. All such transactions are always accounted on the
> +receiver at the time of *send*. To prevent senders from exhausting resources
> +of a receiver, a peer only ever gets access to a subset of the resources of any
> +other resource-context that does not match its own.
> +
> +The exact quotas are
> +calculated at runtime and dynamically adapt to the number of different users
> +that currently partake. The ideal is a fair linear distribution of the
> +available resources, and the algorithm guarantees a quasi-linear distribution.
> +Yet, the details are implementation specific and can change over time.
> +
> +Additionally, a second layer resource accounting separates peers of the same
> +resource context. This is done to prevent malfunctioning peers from exceeding
> +all resources of their resource context, and thus affecting other peers with
> +the same resource context. This uses a much less strict quota system, since
> +it does not span security domains.
> -- 
> 2.53.0
> 


Thread overview: 33+ messages
2026-03-31 19:02 [RFC 00/16] bus1: Capability-based IPC for Linux David Rheinsberg
2026-03-31 19:02 ` [RFC 01/16] rust/sync: add LockedBy::access_mut_unchecked() David Rheinsberg
2026-03-31 19:29   ` Miguel Ojeda
2026-03-31 19:02 ` [RFC 02/16] rust/sync: add Arc::drop_unless_unique() David Rheinsberg
2026-03-31 19:02 ` [RFC 03/16] rust/alloc: add Vec::into_boxed_slice() David Rheinsberg
2026-03-31 19:28   ` Miguel Ojeda
2026-03-31 21:10   ` Gary Guo
2026-03-31 22:07   ` Danilo Krummrich
2026-04-01  9:28     ` David Rheinsberg
2026-03-31 19:02 ` [RFC 04/16] rust/error: add EXFULL, EBADRQC, EDQUOT, ENOTRECOVERABLE David Rheinsberg
2026-03-31 19:02 ` [RFC 05/16] bus1: add module scaffolding David Rheinsberg
2026-03-31 19:02 ` [RFC 06/16] bus1: add the user-space API David Rheinsberg
2026-03-31 19:02 ` [RFC 07/16] bus1: add man-page David Rheinsberg
2026-04-01 16:30   ` Jonathan Corbet
2026-04-01 18:01     ` David Rheinsberg
2026-04-01 18:06       ` David Rheinsberg
2026-04-04 15:30   ` Thomas Meyer
2026-03-31 19:03 ` [RFC 08/16] bus1/util: add basic utilities David Rheinsberg
2026-03-31 19:35   ` Miguel Ojeda
2026-04-01 11:05     ` David Rheinsberg
2026-04-01 11:25       ` Miguel Ojeda
2026-03-31 19:03 ` [RFC 09/16] bus1/util: add field projections David Rheinsberg
2026-03-31 19:38   ` Miguel Ojeda
2026-03-31 19:03 ` [RFC 10/16] bus1/util: add IntoDeref/FromDeref David Rheinsberg
2026-03-31 19:44   ` Miguel Ojeda
2026-03-31 19:03 ` [RFC 11/16] bus1/util: add intrusive data-type helpers David Rheinsberg
2026-03-31 19:03 ` [RFC 12/16] bus1/util: add intrusive single linked lists David Rheinsberg
2026-03-31 19:03 ` [RFC 13/16] bus1/util: add intrusive rb-tree David Rheinsberg
2026-03-31 19:43   ` Miguel Ojeda
2026-03-31 19:03 ` [RFC 14/16] bus1/acct: add resouce accounting David Rheinsberg
2026-03-31 19:03 ` [RFC 15/16] bus1: introduce peers, handles, and nodes David Rheinsberg
2026-03-31 19:03 ` [RFC 16/16] bus1: implement the uapi David Rheinsberg
2026-03-31 19:46 ` [RFC 00/16] bus1: Capability-based IPC for Linux Miguel Ojeda
