public inbox for rust-for-linux@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] rust: more memory barriers bindings
@ 2026-04-02 15:24 Gary Guo
  2026-04-02 15:24 ` [PATCH 1/3] rust: sync: add helpers for mb, dma_mb and friends Gary Guo
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Gary Guo @ 2026-04-02 15:24 UTC (permalink / raw)
  To: Miguel Ojeda, Boqun Feng, Gary Guo, Björn Roy Baron,
	Benno Lossin, Andreas Hindborg, Alice Ryhl, Trevor Gross,
	Danilo Krummrich
  Cc: Alan Stern, Andrea Parri, Will Deacon, Peter Zijlstra,
	Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
	Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Joel Fernandes,
	rust-for-linux, nouveau, linux-kernel, linux-arch, lkmm

From: Gary Guo <gary@garyguo.net>

This expands the existing SMP barriers to also cover mandatory barriers
and DMA barriers.

The API looks like
`mb(Ordering)`/`smp_mb(Ordering)`/`dma_mb(Ordering)`, where `Ordering`
is one of `Full`, `Read` or `Write`, plus `Acquire` and `Release`.

The `Acquire` and `Release` barriers are mapped to `Full` barriers for
now; they only serve to document what ordering is needed, without any
codegen benefit, but they could later be improved to produce better
codegen than full barriers. More on them in the commit message of
patch 2.

A user of the newly introduced API is included in patch 3, which fixes a
concurrency bug that exists in the Nova code today due to missing barriers.

To: Miguel Ojeda <ojeda@kernel.org>
To: Boqun Feng <boqun@kernel.org>
To: Danilo Krummrich <dakr@kernel.org>
To: Alexandre Courbot <acourbot@nvidia.com>
Cc: Björn Roy Baron <bjorn3_gh@protonmail.com>
Cc: Benno Lossin <lossin@kernel.org>
Cc: Andreas Hindborg <a.hindborg@kernel.org>
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Trevor Gross <tmgross@umich.edu>
Cc: Alan Stern <stern@rowland.harvard.edu>
Cc: Andrea Parri <parri.andrea@gmail.com>
Cc: Will Deacon <will@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Jade Alglave <j.alglave@ucl.ac.uk>
Cc: Luc Maranget <luc.maranget@inria.fr>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Akira Yokosawa <akiyks@gmail.com>
Cc: Daniel Lustig <dlustig@nvidia.com>
Cc: Joel Fernandes <joelagnelf@nvidia.com>
Cc: rust-for-linux@vger.kernel.org
Cc: nouveau@lists.freedesktop.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: lkmm@lists.linux.dev

Gary Guo (3):
  rust: sync: add helpers for mb, dma_mb and friends
  rust: sync: generic memory barriers
  gpu: nova-core: fix wrong use of barriers in GSP code

 drivers/gpu/nova-core/gsp/cmdq.rs   |  19 +++
 drivers/gpu/nova-core/gsp/fw.rs     |  12 --
 rust/helpers/barrier.c              |  30 +++++
 rust/kernel/sync/atomic/ordering.rs |   2 +-
 rust/kernel/sync/barrier.rs         | 194 ++++++++++++++++++++++++----
 5 files changed, 217 insertions(+), 40 deletions(-)


base-commit: 36ece9697e89016181e5ae87510e40fb31d86f2b
-- 
2.51.2


^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/3] rust: sync: add helpers for mb, dma_mb and friends
  2026-04-02 15:24 [PATCH 0/3] rust: more memory barriers bindings Gary Guo
@ 2026-04-02 15:24 ` Gary Guo
  2026-04-02 15:24 ` [PATCH 2/3] rust: sync: generic memory barriers Gary Guo
  2026-04-02 15:24 ` [PATCH 3/3] gpu: nova-core: fix wrong use of barriers in GSP code Gary Guo
  2 siblings, 0 replies; 11+ messages in thread
From: Gary Guo @ 2026-04-02 15:24 UTC (permalink / raw)
  To: Miguel Ojeda, Boqun Feng, Gary Guo, Björn Roy Baron,
	Benno Lossin, Andreas Hindborg, Alice Ryhl, Trevor Gross,
	Danilo Krummrich
  Cc: Alan Stern, Andrea Parri, Will Deacon, Peter Zijlstra,
	Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
	Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Joel Fernandes,
	rust-for-linux, nouveau, linux-kernel, linux-arch, lkmm

From: Gary Guo <gary@garyguo.net>

They supplement the existing smp_mb(), smp_rmb() and smp_wmb() helpers.

Signed-off-by: Gary Guo <gary@garyguo.net>
---
 rust/helpers/barrier.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/rust/helpers/barrier.c b/rust/helpers/barrier.c
index fed8853745c8..dbc7a3017c78 100644
--- a/rust/helpers/barrier.c
+++ b/rust/helpers/barrier.c
@@ -2,6 +2,36 @@
 
 #include <asm/barrier.h>
 
+__rust_helper void rust_helper_mb(void)
+{
+	mb();
+}
+
+__rust_helper void rust_helper_rmb(void)
+{
+	rmb();
+}
+
+__rust_helper void rust_helper_wmb(void)
+{
+	wmb();
+}
+
+__rust_helper void rust_helper_dma_mb(void)
+{
+	dma_mb();
+}
+
+__rust_helper void rust_helper_dma_rmb(void)
+{
+	dma_rmb();
+}
+
+__rust_helper void rust_helper_dma_wmb(void)
+{
+	dma_wmb();
+}
+
 __rust_helper void rust_helper_smp_mb(void)
 {
 	smp_mb();
-- 
2.51.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/3] rust: sync: generic memory barriers
  2026-04-02 15:24 [PATCH 0/3] rust: more memory barriers bindings Gary Guo
  2026-04-02 15:24 ` [PATCH 1/3] rust: sync: add helpers for mb, dma_mb and friends Gary Guo
@ 2026-04-02 15:24 ` Gary Guo
  2026-04-02 21:49   ` Joel Fernandes
  2026-04-02 15:24 ` [PATCH 3/3] gpu: nova-core: fix wrong use of barriers in GSP code Gary Guo
  2 siblings, 1 reply; 11+ messages in thread
From: Gary Guo @ 2026-04-02 15:24 UTC (permalink / raw)
  To: Miguel Ojeda, Boqun Feng, Gary Guo, Björn Roy Baron,
	Benno Lossin, Andreas Hindborg, Alice Ryhl, Trevor Gross,
	Danilo Krummrich, Will Deacon, Peter Zijlstra, Mark Rutland
  Cc: Alan Stern, Andrea Parri, Nicholas Piggin, David Howells,
	Jade Alglave, Luc Maranget, Paul E. McKenney, Akira Yokosawa,
	Daniel Lustig, Joel Fernandes, rust-for-linux, nouveau,
	linux-kernel, linux-arch, lkmm

From: Gary Guo <gary@garyguo.net>

Implement a generic interface for memory barriers (full system/DMA/SMP).
The interface uses a parameter to force users to specify their intent with
barriers.

It provides `Read`, `Write` and `Full` orderings, which map to the existing
`rmb()`, `wmb()` and `mb()`, but also `Acquire` and `Release`, which are
documented to provide `LOAD->{LOAD,STORE}` and `{LOAD,STORE}->STORE`
ordering respectively, although for now they are still mapped to a full
`mb()`. In the future they could be mapped to a more efficient form
depending on the architecture. I included them because many users do not
need the STORE->LOAD ordering, and having them use `Acquire`/`Release`
makes it clearer which reordering is to be prevented.

Generics are used here instead of individual standalone functions to
reduce code duplication. For example, the `Acquire` -> `Full` upgrade here
is uniformly implemented for all three types. The `CONFIG_SMP` check in
`smp_mb` is uniformly implemented for all SMP barriers. This could extend
to the `virt_mb` family if it is introduced in the future.

Signed-off-by: Gary Guo <gary@garyguo.net>
---
 rust/kernel/sync/atomic/ordering.rs |   2 +-
 rust/kernel/sync/barrier.rs         | 194 ++++++++++++++++++++++++----
 2 files changed, 168 insertions(+), 28 deletions(-)

diff --git a/rust/kernel/sync/atomic/ordering.rs b/rust/kernel/sync/atomic/ordering.rs
index 3f103aa8db99..c4e732e7212f 100644
--- a/rust/kernel/sync/atomic/ordering.rs
+++ b/rust/kernel/sync/atomic/ordering.rs
@@ -15,7 +15,7 @@
 //!   - It provides ordering between the annotated operation and all the following memory accesses.
 //!   - It provides ordering between all the preceding memory accesses and all the following memory
 //!     accesses.
-//!   - All the orderings are the same strength as a full memory barrier (i.e. `smp_mb()`).
+//!   - All the orderings are the same strength as a full memory barrier (i.e. `smp_mb(Full)`).
 //! - [`Relaxed`] provides no ordering except the dependency orderings. Dependency orderings are
 //!   described in "DEPENDENCY RELATIONS" in [`LKMM`]'s [`explanation`].
 //!
diff --git a/rust/kernel/sync/barrier.rs b/rust/kernel/sync/barrier.rs
index 8f2d435fcd94..0331bb353a76 100644
--- a/rust/kernel/sync/barrier.rs
+++ b/rust/kernel/sync/barrier.rs
@@ -7,6 +7,23 @@
 //!
 //! [`LKMM`]: srctree/tools/memory-model/
 
+#![expect(private_bounds, reason = "sealed implementation")]
+
+pub use super::atomic::ordering::{
+    Acquire,
+    Full,
+    Release, //
+};
+
+/// The annotation type for read operations.
+pub struct Read;
+
+/// The annotation type for write operations.
+pub struct Write;
+
+struct Smp;
+struct Dma;
+
 /// A compiler barrier.
 ///
 /// A barrier that prevents compiler from reordering memory accesses across the barrier.
@@ -19,43 +36,166 @@ pub(crate) fn barrier() {
     unsafe { core::arch::asm!("") };
 }
 
-/// A full memory barrier.
-///
-/// A barrier that prevents compiler and CPU from reordering memory accesses across the barrier.
-#[inline(always)]
-pub fn smp_mb() {
-    if cfg!(CONFIG_SMP) {
-        // SAFETY: `smp_mb()` is safe to call.
-        unsafe { bindings::smp_mb() };
-    } else {
-        barrier();
+trait MemoryBarrier<Flavour = ()> {
+    fn run();
+}
+
+// Currently the kernel only supports `rmb`, `wmb` and full `mb`.
+// Upgrade `Acquire`/`Release` barriers to full barriers.
+
+impl<F> MemoryBarrier<F> for Acquire
+where
+    Full: MemoryBarrier<F>,
+{
+    #[inline]
+    fn run() {
+        Full::run();
     }
 }
 
-/// A write-write memory barrier.
-///
-/// A barrier that prevents compiler and CPU from reordering memory write accesses across the
-/// barrier.
-#[inline(always)]
-pub fn smp_wmb() {
-    if cfg!(CONFIG_SMP) {
+impl<F> MemoryBarrier<F> for Release
+where
+    Full: MemoryBarrier<F>,
+{
+    #[inline]
+    fn run() {
+        Full::run();
+    }
+}
+
+// Specific barrier implementations.
+
+impl MemoryBarrier for Read {
+    #[inline]
+    fn run() {
+        // SAFETY: `rmb()` is safe to call.
+        unsafe { bindings::rmb() };
+    }
+}
+
+impl MemoryBarrier for Write {
+    #[inline]
+    fn run() {
+        // SAFETY: `wmb()` is safe to call.
+        unsafe { bindings::wmb() };
+    }
+}
+
+impl MemoryBarrier for Full {
+    #[inline]
+    fn run() {
+        // SAFETY: `mb()` is safe to call.
+        unsafe { bindings::mb() };
+    }
+}
+
+impl MemoryBarrier<Dma> for Read {
+    #[inline]
+    fn run() {
+        // SAFETY: `dma_rmb()` is safe to call.
+        unsafe { bindings::dma_rmb() };
+    }
+}
+
+impl MemoryBarrier<Dma> for Write {
+    #[inline]
+    fn run() {
+        // SAFETY: `dma_wmb()` is safe to call.
+        unsafe { bindings::dma_wmb() };
+    }
+}
+
+impl MemoryBarrier<Dma> for Full {
+    #[inline]
+    fn run() {
+        // SAFETY: `dma_mb()` is safe to call.
+        unsafe { bindings::dma_mb() };
+    }
+}
+
+impl MemoryBarrier<Smp> for Read {
+    #[inline]
+    fn run() {
+        // SAFETY: `smp_rmb()` is safe to call.
+        unsafe { bindings::smp_rmb() };
+    }
+}
+
+impl MemoryBarrier<Smp> for Write {
+    #[inline]
+    fn run() {
         // SAFETY: `smp_wmb()` is safe to call.
         unsafe { bindings::smp_wmb() };
-    } else {
-        barrier();
     }
 }
 
-/// A read-read memory barrier.
+impl MemoryBarrier<Smp> for Full {
+    #[inline]
+    fn run() {
+        // SAFETY: `smp_mb()` is safe to call.
+        unsafe { bindings::smp_mb() };
+    }
+}
+
+/// Memory barrier.
 ///
-/// A barrier that prevents compiler and CPU from reordering memory read accesses across the
-/// barrier.
-#[inline(always)]
-pub fn smp_rmb() {
+/// A barrier that prevents compiler and CPU from reordering memory accesses across the barrier.
+///
+/// The specific forms of reordering can be specified using the parameter.
+/// - `mb(Read)` provides a read-read barrier.
+/// - `mb(Write)` provides a write-write barrier.
+/// - `mb(Full)` provides a full barrier.
+/// - `mb(Acquire)` prevents preceding reads from being ordered against succeeding memory
+///    operations.
+/// - `mb(Release)` prevents preceding memory operations from being ordered against succeeding
+///    writes.
+///
+/// # Examples
+///
+/// ```
+/// # use kernel::sync::barrier::*;
+/// mb(Read);
+/// mb(Write);
+/// mb(Acquire);
+/// mb(Release);
+/// mb(Full);
+/// ```
+#[inline]
+#[doc(alias = "rmb")]
+#[doc(alias = "wmb")]
+pub fn mb<T: MemoryBarrier>(_: T) {
+    T::run()
+}
+
+/// Memory barrier between CPUs.
+///
+/// A barrier that prevents compiler and CPU from reordering memory accesses across the barrier.
+/// Does not prevent reordering with respect to other bus-mastering devices.
+///
+/// Prefer using `Acquire` [loads](super::atomic::Atomic::load) to `Acquire` barriers, and `Release`
+/// [stores](super::atomic::Atomic::store) to `Release` barriers.
+///
+/// See [`mb`] for usage.
+#[inline]
+#[doc(alias = "smp_rmb")]
+#[doc(alias = "smp_wmb")]
+pub fn smp_mb<T: MemoryBarrier<Smp>>(_: T) {
     if cfg!(CONFIG_SMP) {
-        // SAFETY: `smp_rmb()` is safe to call.
-        unsafe { bindings::smp_rmb() };
+        T::run()
     } else {
-        barrier();
+        barrier()
     }
 }
+
+/// Memory barrier between local CPU and bus-mastering devices.
+///
+/// A barrier that prevents compiler and CPU from reordering memory accesses across the barrier.
+/// Does not prevent reordering with respect to other CPUs.
+///
+/// See [`mb`] for usage.
+#[inline]
+#[doc(alias = "dma_rmb")]
+#[doc(alias = "dma_wmb")]
+pub fn dma_mb<T: MemoryBarrier<Dma>>(_: T) {
+    T::run()
+}
-- 
2.51.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 3/3] gpu: nova-core: fix wrong use of barriers in GSP code
  2026-04-02 15:24 [PATCH 0/3] rust: more memory barriers bindings Gary Guo
  2026-04-02 15:24 ` [PATCH 1/3] rust: sync: add helpers for mb, dma_mb and friends Gary Guo
  2026-04-02 15:24 ` [PATCH 2/3] rust: sync: generic memory barriers Gary Guo
@ 2026-04-02 15:24 ` Gary Guo
  2026-04-02 21:56   ` Joel Fernandes
  2 siblings, 1 reply; 11+ messages in thread
From: Gary Guo @ 2026-04-02 15:24 UTC (permalink / raw)
  To: Miguel Ojeda, Boqun Feng, Gary Guo, Björn Roy Baron,
	Benno Lossin, Andreas Hindborg, Alice Ryhl, Trevor Gross,
	Danilo Krummrich, Alexandre Courbot, David Airlie, Simona Vetter
  Cc: Alan Stern, Andrea Parri, Will Deacon, Peter Zijlstra,
	Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
	Paul E. McKenney, Akira Yokosawa, Daniel Lustig, Joel Fernandes,
	rust-for-linux, nouveau, linux-kernel, linux-arch, lkmm,
	dri-devel

From: Gary Guo <gary@garyguo.net>

In the GSP->CPU messaging path, the current code misses a read barrier
before the data read. The barrier after the read is updated to a DMA
barrier (with release ordering desired) instead of the existing (Rust)
SeqCst SMP barrier; the barrier is also moved to the beginning of the
function, because it is needed to synchronize the data with the
ring-buffer pointer, while the RMW operation does not internally need a
barrier (nor does it have to be atomic, as the CPU pointers are updated
by the CPU only).

In the CPU->GSP messaging path, the current code misses a write barrier
after the data write and before updating the CPU write pointer. A barrier
is not needed before the data write due to a control dependency; this
fact is documented explicitly. It could be replaced with an acquire
barrier if needed.

Signed-off-by: Gary Guo <gary@garyguo.net>
---
 drivers/gpu/nova-core/gsp/cmdq.rs | 19 +++++++++++++++++++
 drivers/gpu/nova-core/gsp/fw.rs   | 12 ------------
 2 files changed, 19 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/nova-core/gsp/cmdq.rs b/drivers/gpu/nova-core/gsp/cmdq.rs
index 2224896ccc89..7e4315b13984 100644
--- a/drivers/gpu/nova-core/gsp/cmdq.rs
+++ b/drivers/gpu/nova-core/gsp/cmdq.rs
@@ -19,6 +19,12 @@
     prelude::*,
     sync::{
         aref::ARef,
+        barrier::{
+            dma_mb,
+            Read,
+            Release,
+            Write, //
+        },
         Mutex, //
     },
     time::Delta,
@@ -258,6 +264,9 @@ fn new(dev: &device::Device<device::Bound>) -> Result<Self> {
         let tx = self.cpu_write_ptr() as usize;
         let rx = self.gsp_read_ptr() as usize;
 
+        // ORDERING: The control dependency provides the necessary LOAD->STORE ordering.
+        // `dma_mb(Acquire)` may be used here if we don't want to rely on the control dependency.
+
         // SAFETY:
         // - We will only access the driver-owned part of the shared memory.
         // - Per the safety statement of the function, no concurrent access will be performed.
@@ -311,6 +320,9 @@ fn driver_write_area_size(&self) -> usize {
         let tx = self.gsp_write_ptr() as usize;
         let rx = self.cpu_read_ptr() as usize;
 
+        // ORDERING: Ensure data load is ordered after load of GSP write pointer.
+        dma_mb(Read);
+
         // SAFETY:
         // - We will only access the driver-owned part of the shared memory.
         // - Per the safety statement of the function, no concurrent access will be performed.
@@ -408,6 +420,10 @@ fn cpu_read_ptr(&self) -> u32 {
 
     // Informs the GSP that it can send `elem_count` new pages into the message queue.
     fn advance_cpu_read_ptr(&mut self, elem_count: u32) {
+        // ORDERING: Ensure preceding data reads are ordered before the
+        // read pointer update.
+        dma_mb(Release);
+
         super::fw::gsp_mem::advance_cpu_read_ptr(&self.0, elem_count)
     }
 
@@ -422,6 +438,9 @@ fn cpu_write_ptr(&self) -> u32 {
 
     // Informs the GSP that it can process `elem_count` new pages from the command queue.
     fn advance_cpu_write_ptr(&mut self, elem_count: u32) {
+        // ORDERING: Ensure all command data is visible before updating the ring-buffer pointer.
+        dma_mb(Write);
+
         super::fw::gsp_mem::advance_cpu_write_ptr(&self.0, elem_count)
     }
 }
diff --git a/drivers/gpu/nova-core/gsp/fw.rs b/drivers/gpu/nova-core/gsp/fw.rs
index 0c8a74f0e8ac..62c2cf1b030c 100644
--- a/drivers/gpu/nova-core/gsp/fw.rs
+++ b/drivers/gpu/nova-core/gsp/fw.rs
@@ -42,11 +42,6 @@
 
 // TODO: Replace with `IoView` projections once available.
 pub(super) mod gsp_mem {
-    use core::sync::atomic::{
-        fence,
-        Ordering, //
-    };
-
     use kernel::{
         dma::Coherent,
         dma_read,
@@ -72,10 +67,6 @@ pub(in crate::gsp) fn cpu_read_ptr(qs: &Coherent<GspMem>) -> u32 {
 
     pub(in crate::gsp) fn advance_cpu_read_ptr(qs: &Coherent<GspMem>, count: u32) {
         let rptr = cpu_read_ptr(qs).wrapping_add(count) % MSGQ_NUM_PAGES;
-
-        // Ensure read pointer is properly ordered.
-        fence(Ordering::SeqCst);
-
         dma_write!(qs, .cpuq.rx.0.readPtr, rptr);
     }
 
@@ -87,9 +78,6 @@ pub(in crate::gsp) fn advance_cpu_write_ptr(qs: &Coherent<GspMem>, count: u32) {
         let wptr = cpu_write_ptr(qs).wrapping_add(count) % MSGQ_NUM_PAGES;
 
         dma_write!(qs, .cpuq.tx.0.writePtr, wptr);
-
-        // Ensure all command data is visible before triggering the GSP read.
-        fence(Ordering::SeqCst);
     }
 }
 
-- 
2.51.2


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/3] rust: sync: generic memory barriers
  2026-04-02 15:24 ` [PATCH 2/3] rust: sync: generic memory barriers Gary Guo
@ 2026-04-02 21:49   ` Joel Fernandes
  2026-04-03  0:07     ` Gary Guo
  0 siblings, 1 reply; 11+ messages in thread
From: Joel Fernandes @ 2026-04-02 21:49 UTC (permalink / raw)
  To: Gary Guo, Miguel Ojeda, Boqun Feng, Björn Roy Baron,
	Benno Lossin, Andreas Hindborg, Alice Ryhl, Trevor Gross,
	Danilo Krummrich, Will Deacon, Peter Zijlstra, Mark Rutland
  Cc: Alan Stern, Andrea Parri, Nicholas Piggin, David Howells,
	Jade Alglave, Luc Maranget, Paul E. McKenney, Akira Yokosawa,
	Daniel Lustig, rust-for-linux, linux-kernel, linux-arch, lkmm,
	Alexandre Courbot, John Hubbard, Timur Tabi, Eliot Courtney,
	Alistair Popple

Hi Gary,

On 4/2/2026 11:24 AM, Gary Guo wrote:
> From: Gary Guo <gary@garyguo.net>
>
> Implement a generic interface for memory barriers (full system/DMA/SMP).
> The interface uses a parameter to force users to specify their intent with
> barriers.
>
> It provides `Read`, `Write` and `Full` orderings, which map to the existing
> `rmb()`, `wmb()` and `mb()`, but also `Acquire` and `Release`, which are
> documented to provide `LOAD->{LOAD,STORE}` and `{LOAD,STORE}->STORE`
> ordering respectively, although for now they are still mapped to a full
> `mb()`. In the future they could be mapped to a more efficient form
> depending on the architecture. I included them because many users do not
> need the STORE->LOAD ordering, and having them use `Acquire`/`Release`
> makes it clearer which reordering is to be prevented.
>
> Generics are used here instead of individual standalone functions to
> reduce code duplication. For example, the `Acquire` -> `Full` upgrade here
> is uniformly implemented for all three types. The `CONFIG_SMP` check in
> `smp_mb` is uniformly implemented for all SMP barriers. This could extend
> to the `virt_mb` family if it is introduced in the future.
>
> Signed-off-by: Gary Guo <gary@garyguo.net>
> ---
>  rust/kernel/sync/atomic/ordering.rs |   2 +-
>  rust/kernel/sync/barrier.rs         | 194 ++++++++++++++++++++++++----

IMO this patch should be split up into different patches for CPU vs IO, and
perhaps even more patches separating out different barrier types.

>  2 files changed, 168 insertions(+), 28 deletions(-)
>
> diff --git a/rust/kernel/sync/atomic/ordering.rs b/rust/kernel/sync/atomic/ordering.rs
> index 3f103aa8db99..c4e732e7212f 100644
> --- a/rust/kernel/sync/atomic/ordering.rs
> +++ b/rust/kernel/sync/atomic/ordering.rs
[...]> +// Currently the kernel only supports `rmb`, `wmb` and full `mb`.
> +impl MemoryBarrier<Smp> for Read {
> +    #[inline]
> +    fn run() {
> +        // SAFETY: `smp_rmb()` is safe to call.
> +        unsafe { bindings::smp_rmb() };
> +    }
> +}
> +
> +impl MemoryBarrier<Smp> for Write {
> +    #[inline]
> +    fn run() {
>          // SAFETY: `smp_wmb()` is safe to call.
>          unsafe { bindings::smp_wmb() };
> -    } else {
> -        barrier();
>      }
>  }
>
> -/// A read-read memory barrier.
> +impl MemoryBarrier<Smp> for Full {
> +    #[inline]
> +    fn run() {
> +        // SAFETY: `smp_mb()` is safe to call.
> +        unsafe { bindings::smp_mb() };
> +    }
> +}
> +
> +/// Memory barrier.
>  ///
> -/// A barrier that prevents compiler and CPU from reordering memory read accesses across the
> -/// barrier.
> -#[inline(always)]
> -pub fn smp_rmb() {
> +/// A barrier that prevents compiler and CPU from reordering memory accesses across the barrier.
> +///
> +/// The specific forms of reordering can be specified using the parameter.
> +/// - `mb(Read)` provides a read-read barrier.
> +/// - `mb(Write)` provides a write-write barrier.
> +/// - `mb(Full)` provides a full barrier.
> +/// - `mb(Acquire)` prevents preceding reads from being ordered against succeeding memory
> +///    operations.
> +/// - `mb(Release)` prevents preceding memory operations from being ordered against succeeding
> +///    writes.

I don't agree with this definition of Release. Release is always associated
with a specific store, likewise acquire with a load. The definition above,
'prevents preceding memory operations from being ordered against succeeding
writes', also doesn't make sense; that's not what Release semantics are.
Release orders memory operations against the specific memory operation
associated with the Release. Same for Acquire.

See also Documentation/memory-barriers.txt, where ACQUIRE and RELEASE are
defined as being tied to specific memory operations.

Or am I missing something subtle?

thanks,

-- 
Joel Fernandes

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] gpu: nova-core: fix wrong use of barriers in GSP code
  2026-04-02 15:24 ` [PATCH 3/3] gpu: nova-core: fix wrong use of barriers in GSP code Gary Guo
@ 2026-04-02 21:56   ` Joel Fernandes
  2026-04-02 21:59     ` Joel Fernandes
  2026-04-04 13:02     ` Gary Guo
  0 siblings, 2 replies; 11+ messages in thread
From: Joel Fernandes @ 2026-04-02 21:56 UTC (permalink / raw)
  To: Gary Guo, Miguel Ojeda, Boqun Feng, Björn Roy Baron,
	Benno Lossin, Andreas Hindborg, Alice Ryhl, Trevor Gross,
	Danilo Krummrich, Alexandre Courbot, David Airlie, Simona Vetter
  Cc: Alan Stern, Andrea Parri, Will Deacon, Peter Zijlstra,
	Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
	Paul E. McKenney, Akira Yokosawa, Daniel Lustig, rust-for-linux,
	nouveau, linux-kernel, linux-arch, lkmm, dri-devel

Hi Gary,

On 4/2/2026 11:24 AM, Gary Guo wrote:
> From: Gary Guo <gary@garyguo.net>
> 
> In the GSP->CPU messaging path, the current code misses a read barrier
> before the data read. The barrier after the read is updated to a DMA
> barrier (with release ordering desired) instead of the existing (Rust)
> SeqCst SMP barrier; the barrier is also moved to the beginning of the
> function, because it is needed to synchronize the data with the
> ring-buffer pointer, while the RMW operation does not internally need a
> barrier (nor does it have to be atomic, as the CPU pointers are updated
> by the CPU only).
> 
> In the CPU->GSP messaging path, the current code misses a write barrier
> after the data write and before updating the CPU write pointer. A barrier
> is not needed before the data write due to a control dependency; this
> fact is documented explicitly. It could be replaced with an acquire
> barrier if needed.
> 
> Signed-off-by: Gary Guo <gary@garyguo.net>
> ---
>  drivers/gpu/nova-core/gsp/cmdq.rs | 19 +++++++++++++++++++
>  drivers/gpu/nova-core/gsp/fw.rs   | 12 ------------
>  2 files changed, 19 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/gpu/nova-core/gsp/cmdq.rs b/drivers/gpu/nova-core/gsp/cmdq.rs
> index 2224896ccc89..7e4315b13984 100644
> --- a/drivers/gpu/nova-core/gsp/cmdq.rs
> +++ b/drivers/gpu/nova-core/gsp/cmdq.rs
> @@ -19,6 +19,12 @@
>      prelude::*,
>      sync::{
>          aref::ARef,
> +        barrier::{
> +            dma_mb,
> +            Read,
> +            Release,
> +            Write, //
> +        },
>          Mutex, //
>      },
>      time::Delta,
> @@ -258,6 +264,9 @@ fn new(dev: &device::Device<device::Bound>) -> Result<Self> {
>          let tx = self.cpu_write_ptr() as usize;
>          let rx = self.gsp_read_ptr() as usize;
>  
> +        // ORDERING: The control dependency provides the necessary LOAD->STORE ordering.
> +        // `dma_mb(Acquire)` may be used here if we don't want to rely on the control dependency.

Just checking: does a control dependency on the CPU side really apply to
ordering for IO (what the device perceives)? IOW, the loads and stores
might be ordered on the CPU side, but the device might see these
operations out of order. If that is the case, perhaps the control
dependency comment is misleading.


> +
>          // SAFETY:
>          // - We will only access the driver-owned part of the shared memory.
>          // - Per the safety statement of the function, no concurrent access will be performed.
> @@ -311,6 +320,9 @@ fn driver_write_area_size(&self) -> usize {
>          let tx = self.gsp_write_ptr() as usize;
>          let rx = self.cpu_read_ptr() as usize;
>  
> +        // ORDERING: Ensure data load is ordered after load of GSP write pointer.
> +        dma_mb(Read);
> +

I suggest taking it on a case-by-case basis and splitting the patch for
each case, for easier review. There are many patterns AFAICS: load-store,
store-store, etc.

I do acknowledge the issue you found here, though. Thanks,

--
Joel Fernandes



>          // SAFETY:
>          // - We will only access the driver-owned part of the shared memory.
>          // - Per the safety statement of the function, no concurrent access will be performed.
> @@ -408,6 +420,10 @@ fn cpu_read_ptr(&self) -> u32 {
>  
>      // Informs the GSP that it can send `elem_count` new pages into the message queue.
>      fn advance_cpu_read_ptr(&mut self, elem_count: u32) {
> +        // ORDERING: Ensure preceding data reads are ordered before the
> +        // read pointer update.
> +        dma_mb(Release);
> +
>          super::fw::gsp_mem::advance_cpu_read_ptr(&self.0, elem_count)
>      }
>  
> @@ -422,6 +438,9 @@ fn cpu_write_ptr(&self) -> u32 {
>  
>      // Informs the GSP that it can process `elem_count` new pages from the command queue.
>      fn advance_cpu_write_ptr(&mut self, elem_count: u32) {
> +        // ORDERING: Ensure all command data is visible before updating the ring-buffer pointer.
> +        dma_mb(Write);
> +
>          super::fw::gsp_mem::advance_cpu_write_ptr(&self.0, elem_count)
>      }
>  }
> diff --git a/drivers/gpu/nova-core/gsp/fw.rs b/drivers/gpu/nova-core/gsp/fw.rs
> index 0c8a74f0e8ac..62c2cf1b030c 100644
> --- a/drivers/gpu/nova-core/gsp/fw.rs
> +++ b/drivers/gpu/nova-core/gsp/fw.rs
> @@ -42,11 +42,6 @@
>  
>  // TODO: Replace with `IoView` projections once available.
>  pub(super) mod gsp_mem {
> -    use core::sync::atomic::{
> -        fence,
> -        Ordering, //
> -    };
> -
>      use kernel::{
>          dma::Coherent,
>          dma_read,
> @@ -72,10 +67,6 @@ pub(in crate::gsp) fn cpu_read_ptr(qs: &Coherent<GspMem>) -> u32 {
>  
>      pub(in crate::gsp) fn advance_cpu_read_ptr(qs: &Coherent<GspMem>, count: u32) {
>          let rptr = cpu_read_ptr(qs).wrapping_add(count) % MSGQ_NUM_PAGES;
> -
> -        // Ensure read pointer is properly ordered.
> -        fence(Ordering::SeqCst);
> -
>          dma_write!(qs, .cpuq.rx.0.readPtr, rptr);
>      }
>  
> @@ -87,9 +78,6 @@ pub(in crate::gsp) fn advance_cpu_write_ptr(qs: &Coherent<GspMem>, count: u32) {
>          let wptr = cpu_write_ptr(qs).wrapping_add(count) % MSGQ_NUM_PAGES;
>  
>          dma_write!(qs, .cpuq.tx.0.writePtr, wptr);
> -
> -        // Ensure all command data is visible before triggering the GSP read.
> -        fence(Ordering::SeqCst);
>      }
>  }
>  

-- 
Joel Fernandes


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] gpu: nova-core: fix wrong use of barriers in GSP code
  2026-04-02 21:56   ` Joel Fernandes
@ 2026-04-02 21:59     ` Joel Fernandes
  2026-04-04 13:02     ` Gary Guo
  1 sibling, 0 replies; 11+ messages in thread
From: Joel Fernandes @ 2026-04-02 21:59 UTC (permalink / raw)
  To: Gary Guo, Miguel Ojeda, Boqun Feng, Björn Roy Baron,
	Benno Lossin, Andreas Hindborg, Alice Ryhl, Trevor Gross,
	Danilo Krummrich, Alexandre Courbot, David Airlie, Simona Vetter
  Cc: Alan Stern, Andrea Parri, Will Deacon, Peter Zijlstra,
	Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
	Paul E. McKenney, Akira Yokosawa, Daniel Lustig, rust-for-linux,
	linux-kernel, linux-arch, lkmm, dri-devel, John Hubbard,
	Eliot Courtney, Alexandre Courbot, Timur Tabi, Alistair Popple

Btw, the nouveau@ list has issues at the moment, so I suggest CCing the
Nvidia nova folks directly. Adding some ++ and dropping nouveau. In case
my reply didn't make it, it is below.

On 4/2/2026 5:56 PM, Joel Fernandes wrote:
> Hi Gary,
> 
> On 4/2/2026 11:24 AM, Gary Guo wrote:
>> From: Gary Guo <gary@garyguo.net>
>>
>> In the GSP->CPU messaging path, the current code misses a read barrier
>> before the data read. The barrier after the read is updated to a DMA
>> barrier (with release ordering desired) instead of the existing (Rust)
>> SeqCst SMP barrier; the barrier is also moved to the beginning of the
>> function, because it is needed to synchronize the data with the
>> ring-buffer pointer, while the RMW operation does not internally need a
>> barrier (nor does it have to be atomic, as the CPU pointers are updated
>> by the CPU only).
>>
>> In the CPU->GSP messaging path, the current code misses a write barrier
>> after the data write and before updating the CPU write pointer. A barrier
>> is not needed before the data write due to a control dependency; this
>> fact is documented explicitly. It could be replaced with an acquire
>> barrier if needed.
>>
>> Signed-off-by: Gary Guo <gary@garyguo.net>
>> ---
>>  drivers/gpu/nova-core/gsp/cmdq.rs | 19 +++++++++++++++++++
>>  drivers/gpu/nova-core/gsp/fw.rs   | 12 ------------
>>  2 files changed, 19 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/gpu/nova-core/gsp/cmdq.rs b/drivers/gpu/nova-core/gsp/cmdq.rs
>> index 2224896ccc89..7e4315b13984 100644
>> --- a/drivers/gpu/nova-core/gsp/cmdq.rs
>> +++ b/drivers/gpu/nova-core/gsp/cmdq.rs
>> @@ -19,6 +19,12 @@
>>      prelude::*,
>>      sync::{
>>          aref::ARef,
>> +        barrier::{
>> +            dma_mb,
>> +            Read,
>> +            Release,
>> +            Write, //
>> +        },
>>          Mutex, //
>>      },
>>      time::Delta,
>> @@ -258,6 +264,9 @@ fn new(dev: &device::Device<device::Bound>) -> Result<Self> {
>>          let tx = self.cpu_write_ptr() as usize;
>>          let rx = self.gsp_read_ptr() as usize;
>>  
>> +        // ORDERING: control dependency provides necessary LOAD->STORE ordering.
>> +        // `dma_mb(Acquire)` may be used here if we don't want to rely on control dependency.
> 
> Just checking, does a control dependency on the CPU side really apply to
> ordering for IO (what the device perceives)? IOW, the loads and stores
> might be ordered on the CPU side, but the device might be seeing these
> operations out of order. If that is the case, perhaps the control
> dependency comment is misleading.
> 
> 
>> +
>>          // SAFETY:
>>          // - We will only access the driver-owned part of the shared memory.
>>          // - Per the safety statement of the function, no concurrent access will be performed.
>> @@ -311,6 +320,9 @@ fn driver_write_area_size(&self) -> usize {
>>          let tx = self.gsp_write_ptr() as usize;
>>          let rx = self.cpu_read_ptr() as usize;
>>  
>> +        // ORDERING: Ensure data load is ordered after load of GSP write pointer.
>> +        dma_mb(Read);
>> +
> 
> I suggest taking it on a case by case basis, and splitting the patch for each
> case, for easier review. There are many patterns AFAICS, load-store, store-store
> etc.
> 
> I do acknowledge the issue you find here though. thanks,
> 
> --
> Joel Fernandes
> 
> 
> 
>>          // SAFETY:
>>          // - We will only access the driver-owned part of the shared memory.
>>          // - Per the safety statement of the function, no concurrent access will be performed.
>> @@ -408,6 +420,10 @@ fn cpu_read_ptr(&self) -> u32 {
>>  
>>      // Informs the GSP that it can send `elem_count` new pages into the message queue.
>>      fn advance_cpu_read_ptr(&mut self, elem_count: u32) {
>> +        // ORDERING: Ensure all data reads complete before the read
>> +        // pointer update becomes visible to the GSP.
>> +        dma_mb(Release);
>> +
>>          super::fw::gsp_mem::advance_cpu_read_ptr(&self.0, elem_count)
>>      }
>>  
>> @@ -422,6 +438,9 @@ fn cpu_write_ptr(&self) -> u32 {
>>  
>>      // Informs the GSP that it can process `elem_count` new pages from the command queue.
>>      fn advance_cpu_write_ptr(&mut self, elem_count: u32) {
>> +        // ORDERING: Ensure all command data is visible before updating the ring buffer pointer.
>> +        dma_mb(Write);
>> +
>>          super::fw::gsp_mem::advance_cpu_write_ptr(&self.0, elem_count)
>>      }
>>  }
>> diff --git a/drivers/gpu/nova-core/gsp/fw.rs b/drivers/gpu/nova-core/gsp/fw.rs
>> index 0c8a74f0e8ac..62c2cf1b030c 100644
>> --- a/drivers/gpu/nova-core/gsp/fw.rs
>> +++ b/drivers/gpu/nova-core/gsp/fw.rs
>> @@ -42,11 +42,6 @@
>>  
>>  // TODO: Replace with `IoView` projections once available.
>>  pub(super) mod gsp_mem {
>> -    use core::sync::atomic::{
>> -        fence,
>> -        Ordering, //
>> -    };
>> -
>>      use kernel::{
>>          dma::Coherent,
>>          dma_read,
>> @@ -72,10 +67,6 @@ pub(in crate::gsp) fn cpu_read_ptr(qs: &Coherent<GspMem>) -> u32 {
>>  
>>      pub(in crate::gsp) fn advance_cpu_read_ptr(qs: &Coherent<GspMem>, count: u32) {
>>          let rptr = cpu_read_ptr(qs).wrapping_add(count) % MSGQ_NUM_PAGES;
>> -
>> -        // Ensure read pointer is properly ordered.
>> -        fence(Ordering::SeqCst);
>> -
>>          dma_write!(qs, .cpuq.rx.0.readPtr, rptr);
>>      }
>>  
>> @@ -87,9 +78,6 @@ pub(in crate::gsp) fn advance_cpu_write_ptr(qs: &Coherent<GspMem>, count: u32) {
>>          let wptr = cpu_write_ptr(qs).wrapping_add(count) % MSGQ_NUM_PAGES;
>>  
>>          dma_write!(qs, .cpuq.tx.0.writePtr, wptr);
>> -
>> -        // Ensure all command data is visible before triggering the GSP read.
>> -        fence(Ordering::SeqCst);
>>      }
>>  }
>>  
> 

-- 
Joel Fernandes


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/3] rust: sync: generic memory barriers
  2026-04-02 21:49   ` Joel Fernandes
@ 2026-04-03  0:07     ` Gary Guo
  2026-04-03 21:33       ` Joel Fernandes
  0 siblings, 1 reply; 11+ messages in thread
From: Gary Guo @ 2026-04-03  0:07 UTC (permalink / raw)
  To: Joel Fernandes, Gary Guo, Miguel Ojeda, Boqun Feng,
	Björn Roy Baron, Benno Lossin, Andreas Hindborg, Alice Ryhl,
	Trevor Gross, Danilo Krummrich, Will Deacon, Peter Zijlstra,
	Mark Rutland
  Cc: Alan Stern, Andrea Parri, Nicholas Piggin, David Howells,
	Jade Alglave, Luc Maranget, Paul E. McKenney, Akira Yokosawa,
	Daniel Lustig, rust-for-linux, linux-kernel, linux-arch, lkmm,
	Alexandre Courbot, John Hubbard, Timur Tabi, Eliot Courtney,
	Alistair Popple

On Thu Apr 2, 2026 at 10:49 PM BST, Joel Fernandes wrote:
> Hi Gary,
>
> On 4/2/2026 11:24 AM, Gary Guo wrote:
>> From: Gary Guo <gary@garyguo.net>
>>
>> Implement a generic interface for memory barriers (full system/DMA/SMP).
>> The interface uses a parameter to force the user to specify their intent
>> with barriers.
>>
>> It provides `Read`, `Write` and `Full` orderings, which map to the existing
>> `rmb()`, `wmb()` and `mb()`, but also `Acquire` and `Release`, which are
>> documented to have `LOAD->{LOAD,STORE}` ordering and `{LOAD,STORE}->STORE`
>> ordering respectively, although for now they are still mapped to a full
>> `mb()`. In the future they could be mapped to a more efficient form
>> depending on the architecture. I included them because many users do not
>> need the STORE->LOAD ordering, and having them use `Acquire`/`Release`
>> makes their intent clearer about which reordering is to be prevented.
>>
>> Generics are used here instead of individual standalone functions to
>> reduce code duplication. For example, the `Acquire` -> `Full` upgrade here
>> is uniformly implemented for all three barrier types, and the `CONFIG_SMP`
>> check in `smp_mb` is uniformly implemented for all SMP barriers. This
>> could extend to the `virt_mb` family if it is introduced in the future.
>>
>> Signed-off-by: Gary Guo <gary@garyguo.net>
>> ---
>>  rust/kernel/sync/atomic/ordering.rs |   2 +-
>>  rust/kernel/sync/barrier.rs         | 194 ++++++++++++++++++++++++----
>
> IMO this patch should be split up into different patches for CPU vs IO, and
> perhaps even more patches separating out different barrier types.

Given that the different barrier types are quite closely related, I don't want
to create a bunch of small patches adding one each. But splitting out the SMP
change and then adding the DMA/mandatory barriers as another patch does sound
reasonable.

>
>>  2 files changed, 168 insertions(+), 28 deletions(-)
>>
>> diff --git a/rust/kernel/sync/atomic/ordering.rs b/rust/kernel/sync/atomic/ordering.rs
>> index 3f103aa8db99..c4e732e7212f 100644
>> --- a/rust/kernel/sync/atomic/ordering.rs
>> +++ b/rust/kernel/sync/atomic/ordering.rs
> [...]
>> +// Currently the kernel only supports `rmb`, `wmb` and full `mb`.
>> +impl MemoryBarrier<Smp> for Read {
>> +    #[inline]
>> +    fn run() {
>> +        // SAFETY: `smp_rmb()` is safe to call.
>> +        unsafe { bindings::smp_rmb() };
>> +    }
>> +}
>> +
>> +impl MemoryBarrier<Smp> for Write {
>> +    #[inline]
>> +    fn run() {
>>          // SAFETY: `smp_wmb()` is safe to call.
>>          unsafe { bindings::smp_wmb() };
>> -    } else {
>> -        barrier();
>>      }
>>  }
>>
>> -/// A read-read memory barrier.
>> +impl MemoryBarrier<Smp> for Full {
>> +    #[inline]
>> +    fn run() {
>> +        // SAFETY: `smp_mb()` is safe to call.
>> +        unsafe { bindings::smp_mb() };
>> +    }
>> +}
>> +
>> +/// Memory barrier.
>>  ///
>> -/// A barrier that prevents compiler and CPU from reordering memory read accesses across the
>> -/// barrier.
>> -#[inline(always)]
>> -pub fn smp_rmb() {
>> +/// A barrier that prevents compiler and CPU from reordering memory accesses across the barrier.
>> +///
>> +/// The specific forms of reordering can be specified using the parameter.
>> +/// - `mb(Read)` provides a read-read barrier.
>> +/// - `mb(Write)` provides a write-write barrier.
>> +/// - `mb(Full)` provides a full barrier.
>> +/// - `mb(Acquire)` prevents preceding reads from being ordered against succeeding memory
>> +///    operations.
>> +/// - `mb(Release)` prevents preceding memory operations from being ordered against succeeding
>> +///    writes.
>
> I don't agree with this definition of Release. Release is always associated
> with a specific store, likewise acquire with a load. The definition above
> also doesn't make sense: 'prevents preceding memory operations from being
> ordered against succeeding writes' is not what Release semantics are.
> Release orders memory operations with respect to the specific memory
> operation associated with the Release. Same for Acquire.

I worded my commit message to say that these are about *intentions*, not
semantics. I do want to change the semantics too, but that's not fully ready
yet. Since it looks like you're interested, let me share my plans as well
(perhaps a bit prematurely).

>
> See also in Documentation/memory-barriers.txt, ACQUIRE and RELEASE are defined as being
> tied to specific memory operations.

That's what we have today, hence the implementation upgrades them to full memory
barriers. But ACQUIRE and RELEASE orderings don't *need* to be tied to
specific memory operations, and they can still make conceptual sense as barriers.

The C11 memory model defines Acquire and Release fences, and it looks to me
that it's relatively easy to add them to LKMM. I was playing with herd7 and I
think I've got it working; see the attached diff.

Another thing that I'd like to note is that on all architectures we have
today except ARM and PARISC, smp_load_acquire and smp_store_release are
actually implemented as READ_ONCE + ACQUIRE barrier and RELEASE barrier +
WRITE_ONCE. I'm planning to propose the C API and the corresponding memory
model change too, but I want to gather some more concrete numbers first (the
performance benefit of dma_mb_acquire/dma_mb_release compared to a full
dma_mb) before proposing it.

Note that marked dma_load_acquire/dma_store_release (and their mandatory
versions) don't make much sense, as AFAIK no architecture has instructions
for them, so you'd be implementing these as fence instructions anyway.

Best,
Gary

diff --git a/tools/memory-model/linux-kernel.bell b/tools/memory-model/linux-kernel.bell
index fe65998002b9..9b3322fc5b9c 100644
--- a/tools/memory-model/linux-kernel.bell
+++ b/tools/memory-model/linux-kernel.bell
@@ -25,6 +25,8 @@ instructions RMW[Accesses]
 enum Barriers = 'wmb (*smp_wmb*) ||
 		'rmb (*smp_rmb*) ||
 		'MB (*smp_mb*) ||
+		'mb-acquire (*smp_mb_acquire*) ||
+		'mb-release (*smp_mb_release*) ||
 		'barrier (*barrier*) ||
 		'rcu-lock (*rcu_read_lock*)  ||
 		'rcu-unlock (*rcu_read_unlock*) ||
diff --git a/tools/memory-model/linux-kernel.cat b/tools/memory-model/linux-kernel.cat
index d7e7bf13c831..d7be7c03bb25 100644
--- a/tools/memory-model/linux-kernel.cat
+++ b/tools/memory-model/linux-kernel.cat
@@ -33,6 +33,8 @@ let po-unlock-lock-po = po ; [UL] ; (po|rf) ; [LKR] ; po
 let R4rmb = R \ Noreturn	(* Reads for which rmb works *)
 let rmb = [R4rmb] ; fencerel(Rmb) ; [R4rmb]
 let wmb = [W] ; fencerel(Wmb) ; [W]
+let mb-acquire = [R4rmb] ; fencerel(Mb-acquire) ; [M]
+let mb-release = [M] ; fencerel(Mb-release) ; [W]
 let mb = ([M] ; fencerel(Mb) ; [M]) |
 	(*
 	 * full-barrier RMWs (successful cmpxchg(), xchg(), etc.) act as
@@ -64,9 +66,10 @@ let mb = ([M] ; fencerel(Mb) ; [M]) |
 let gp = po ; [Sync-rcu | Sync-srcu] ; po?
 let strong-fence = mb | gp
 
-let nonrw-fence = strong-fence | po-rel | acq-po
+let nonrw-fence = strong-fence | po-rel | acq-po | mb-acquire | mb-release
 let fence = nonrw-fence | wmb | rmb
-let barrier = fencerel(Barrier | Rmb | Wmb | Mb | Sync-rcu | Sync-srcu |
+let barrier = fencerel(Barrier | Rmb | Wmb | Mb | Mb-acquire | Mb-release |
+        Sync-rcu | Sync-srcu |
 		Before-atomic | After-atomic | Acquire | Release |
 		Rcu-lock | Rcu-unlock | Srcu-lock | Srcu-unlock) |
 	(po ; [Release]) | ([Acquire] ; po)
@@ -97,7 +100,7 @@ let ppo = to-r | to-w | (fence & int) | (po-unlock-lock-po & int)
 (* Propagation: Ordering from release operations and strong fences. *)
 let A-cumul(r) = (rfe ; [Marked])? ; r
 let rmw-sequence = (rf ; rmw)*
-let cumul-fence = [Marked] ; (A-cumul(strong-fence | po-rel) | wmb |
+let cumul-fence = [Marked] ; (A-cumul(strong-fence | po-rel | mb-release) | wmb |
 	po-unlock-lock-po) ; [Marked] ; rmw-sequence
 let prop = [Marked] ; (overwrite & ext)? ; cumul-fence* ;
 	[Marked] ; rfe? ; [Marked]
diff --git a/tools/memory-model/linux-kernel.def b/tools/memory-model/linux-kernel.def
index 49e402782e49..e32aea2c01a9 100644
--- a/tools/memory-model/linux-kernel.def
+++ b/tools/memory-model/linux-kernel.def
@@ -20,6 +20,8 @@ smp_store_mb(X,V) { __store{ONCE}(X,V); __fence{MB}; }
 smp_mb() { __fence{MB}; }
 smp_rmb() { __fence{rmb}; }
 smp_wmb() { __fence{wmb}; }
+smp_mb_acquire() { __fence{mb-acquire}; }
+smp_mb_release() { __fence{mb-release}; }
 smp_mb__before_atomic() { __fence{before-atomic}; }
 smp_mb__after_atomic() { __fence{after-atomic}; }
 smp_mb__after_spinlock() { __fence{after-spinlock}; }
diff --git a/tools/memory-model/litmus-tests/MP+fencembreleaseonceonce+fencembacquireonceonce.litmus b/tools/memory-model/litmus-tests/MP+fencembreleaseonceonce+fencembacquireonceonce.litmus
new file mode 100644
index 000000000000..d53b848c5687
--- /dev/null
+++ b/tools/memory-model/litmus-tests/MP+fencembreleaseonceonce+fencembacquireonceonce.litmus
@@ -0,0 +1,30 @@
+C MP+fencembreleaseonceonce+fencembacquireonceonce
+
+(*
+ * Result: Never
+ *
+ * This litmus test demonstrates that smp_mb_release() and
+ * smp_mb_acquire() provide sufficient ordering for the message-passing
+ * pattern.
+ *)
+
+{}
+
+P0(int *buf, int *flag) // Producer
+{
+	WRITE_ONCE(*buf, 1);
+	smp_mb_release();
+	WRITE_ONCE(*flag, 1);
+}
+
+P1(int *buf, int *flag) // Consumer
+{
+	int r0;
+	int r1;
+
+	r0 = READ_ONCE(*flag);
+	smp_mb_acquire();
+	r1 = READ_ONCE(*buf);
+}
+
+exists (1:r0=1 /\ 1:r1=0) (* Bad outcome. *)


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/3] rust: sync: generic memory barriers
  2026-04-03  0:07     ` Gary Guo
@ 2026-04-03 21:33       ` Joel Fernandes
  2026-04-04 12:43         ` Gary Guo
  0 siblings, 1 reply; 11+ messages in thread
From: Joel Fernandes @ 2026-04-03 21:33 UTC (permalink / raw)
  To: Gary Guo, Miguel Ojeda, Boqun Feng, Björn Roy Baron,
	Benno Lossin, Andreas Hindborg, Alice Ryhl, Trevor Gross,
	Danilo Krummrich, Will Deacon, Peter Zijlstra, Mark Rutland
  Cc: Alan Stern, Andrea Parri, Nicholas Piggin, David Howells,
	Jade Alglave, Luc Maranget, Paul E. McKenney, Akira Yokosawa,
	Daniel Lustig, rust-for-linux, linux-kernel, linux-arch, lkmm,
	Alexandre Courbot, John Hubbard, Timur Tabi, Eliot Courtney,
	Alistair Popple



On 4/2/2026 8:07 PM, Gary Guo wrote:
> On Thu Apr 2, 2026 at 10:49 PM BST, Joel Fernandes wrote:

[...]

>> See also in Documentation/memory-barriers.txt, ACQUIRE and RELEASE are defined as being
>> tied to specific memory operations.
> 
> That's what we have today, hence the implementation upgrades them to full memory
> barriers. But ACQUIRE and RELEASE orderings don't *need* to be tied to
> specific memory operations and they can still make conceptual sense as barriers.
> 
> C11 memory model defines Acquire and Release fences, and it looks to me it's
> relatively easy to add it to LKMM. I was playing with Herd7 and I think I've got
> it working, see the attached diff.
> 
> Another thing that I'd like to note is that in all architectures that we have
> today except ARM and PARISC, the smp_load_acquire and smp_store_release are
> actually implemented as READ_ONCE + ACQUIRE barrier and RELEASE barrier +
> WRITE_ONCE. I'm planning to propose C API and corresponding memory model change
> too, but I want to gather some more concrete numbers (the performance benefit of
> having dma_mb_acquire/dma_mb_release compared to full dma_mb) before proposing so.
> 
> Note that marked dma_load_acquire/dma_store_release (and their mandatory
> versions) don't make too much sense, as AFAIK no architectures have instructions
> for them so you're implementing these as fence instructions anyway.

Ah, so you're proposing new memory barrier types that don't exist anywhere
in the kernel yet. I thought you were just wrapping existing ones, so it got
a bit confusing.

My suggestion: fix the nova-core issue in patch 3 using dma_mb() directly,
without the new barrier types. Adding a new standalone release/acquire
fence semantic to the kernel is something that will need broad consensus
from the LKMM maintainers IMO, independent of this series. It's a bigger
conversation that could delay your nova bug fix.

Once smp_mb_release() / smp_mb_acquire() (or the DMA equivalents) land
properly, updating nova to use them will be trivial. But the correctness of
the fix shouldn't depend on semantics that don't exist in the kernel yet,
even if they currently degrade to a full barrier.

That said, I find the idea genuinely interesting, and thanks for keeping me
in the loop. Standalone acquire/release fences that are formally modeled in
LKMM and are A-cumulative are interesting, as you showed in the herd7 code.
You would want to update Documentation/memory-barriers.txt too.

thanks,

--
Joel Fernandes


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/3] rust: sync: generic memory barriers
  2026-04-03 21:33       ` Joel Fernandes
@ 2026-04-04 12:43         ` Gary Guo
  0 siblings, 0 replies; 11+ messages in thread
From: Gary Guo @ 2026-04-04 12:43 UTC (permalink / raw)
  To: Joel Fernandes, Gary Guo, Miguel Ojeda, Boqun Feng,
	Björn Roy Baron, Benno Lossin, Andreas Hindborg, Alice Ryhl,
	Trevor Gross, Danilo Krummrich, Will Deacon, Peter Zijlstra,
	Mark Rutland
  Cc: Alan Stern, Andrea Parri, Nicholas Piggin, David Howells,
	Jade Alglave, Luc Maranget, Paul E. McKenney, Akira Yokosawa,
	Daniel Lustig, rust-for-linux, linux-kernel, linux-arch, lkmm,
	Alexandre Courbot, John Hubbard, Timur Tabi, Eliot Courtney,
	Alistair Popple

On Fri Apr 3, 2026 at 10:33 PM BST, Joel Fernandes wrote:
>
>
> On 4/2/2026 8:07 PM, Gary Guo wrote:
>> On Thu Apr 2, 2026 at 10:49 PM BST, Joel Fernandes wrote:
>
> [...]
>
>>> See also in Documentation/memory-barriers.txt, ACQUIRE and RELEASE are defined as being
>>> tied to specific memory operations.
>> 
>> That's what we have today, hence the implementation upgrades them to full memory
>> barriers. But ACQUIRE and RELEASE orderings don't *need* to be tied to
>> specific memory operations and they can still make conceptual sense as barriers.
>> 
>> C11 memory model defines Acquire and Release fences, and it looks to me it's
>> relatively easy to add it to LKMM. I was playing with Herd7 and I think I've got
>> it working, see the attached diff.
>> 
>> Another thing that I'd like to note is that in all architectures that we have
>> today except ARM and PARISC, the smp_load_acquire and smp_store_release are
>> actually implemented as READ_ONCE + ACQUIRE barrier and RELEASE barrier +
>> WRITE_ONCE. I'm planning to propose C API and corresponding memory model change
>> too, but I want to gather some more concrete numbers (the performance benefit of
>> having dma_mb_acquire/dma_mb_release compared to full dma_mb) before proposing so.
>> 
>> Note that marked dma_load_acquire/dma_store_release (and their mandatory
>> versions) don't make too much sense, as AFAIK no architectures have instructions
>> for them so you're implementing these as fence instructions anyway.
>
> Ah, so you're proposing new memory barrier types that don't exist anywhere
> in the kernel yet. I thought you were just wrapping existing ones so it got
> a bit confusing.

Well, I am not proposing new memory barrier types in this change, which is why
I mentioned that these are currently annotations about intention and
documentation, not about their *semantics*. Perhaps I didn't phrase this
clearly enough?

How about changing the wording in the documentation like this:

    /// `Acquire` is a full memory barrier, but caller indicates that it only
    /// uses it to provide LOAD->{LOAD,STORE} ordering. 

or do you think even that is too much?

>
> My suggestion: fix the nova-core issue in patch 3 using dma_mb() directly,
> without the new barrier types. Adding a new standalone release/acquire
> fence semantic to the kernel is something that will need broad consensus
> from the LKMM maintainers IMO, independent of this series. It's a bigger
> conversation that could delay your nova bug fix.

Something like this?

    // TODO: Replace with `dma_mb(Release)` when it's available.
    dma_mb(Full);

Or not even mentioning that? I really like Acquire/Release because they serve
as documentation of why the barrier exists.

When I read a full barrier, I always need to work out whether the code really
relies on STORE->LOAD ordering or whether a weaker ordering could be used.
There's conventional wisdom in userspace C/Rust that when you see seq_cst,
99% of the time it's wrong.

>
> Once smp_mb_release() / smp_mb_acquire() (or the DMA equivalents) land
> properly, updating nova to use them will be trivial. But the correctness of
> the fix shouldn't depend on semantics that don't exist in the kernel yet,
> even if they currently degrade to a full barrier.
>
> That said, I find the idea genuinely interesting, and thanks for keeping me
> in the loop. Standalone acquire/release fences that are formally modeled in
> LKMM and A-cumulative is interesting as you showed in the herd7 code. You
> would want to update Documentation/memory-barriers.txt too.

Yeah, there's more work to do on changing the semantics, hence I want to limit
this series to not be about semantics.

Best,
Gary

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 3/3] gpu: nova-core: fix wrong use of barriers in GSP code
  2026-04-02 21:56   ` Joel Fernandes
  2026-04-02 21:59     ` Joel Fernandes
@ 2026-04-04 13:02     ` Gary Guo
  1 sibling, 0 replies; 11+ messages in thread
From: Gary Guo @ 2026-04-04 13:02 UTC (permalink / raw)
  To: Joel Fernandes, Gary Guo, Miguel Ojeda, Boqun Feng,
	Björn Roy Baron, Benno Lossin, Andreas Hindborg, Alice Ryhl,
	Trevor Gross, Danilo Krummrich, Alexandre Courbot, David Airlie,
	Simona Vetter
  Cc: Alan Stern, Andrea Parri, Will Deacon, Peter Zijlstra,
	Nicholas Piggin, David Howells, Jade Alglave, Luc Maranget,
	Paul E. McKenney, Akira Yokosawa, Daniel Lustig, rust-for-linux,
	nouveau, linux-kernel, linux-arch, lkmm, dri-devel

On Thu Apr 2, 2026 at 10:56 PM BST, Joel Fernandes wrote:
> Hi Gary,
>
> On 4/2/2026 11:24 AM, Gary Guo wrote:
>> From: Gary Guo <gary@garyguo.net>
>> 
>> Currently, in the GSP->CPU messaging path, the code misses a read barrier
>> before the data read. The barrier after the read is updated to a DMA
>> barrier (with release ordering desired), instead of the existing (Rust)
>> SeqCst SMP barrier; the location of the barrier is also moved to the
>> beginning of the function, because the barrier is needed to synchronize
>> between the data and the ring-buffer pointer; the RMW operation does not
>> internally need a barrier (nor does it have to be atomic, as CPU pointers
>> are updated by the CPU only).
>> 
>> In the CPU->GSP messaging path, the current code misses a write barrier
>> after the data write and before updating the CPU write pointer. A barrier
>> is not needed before the data write due to a control dependency; this fact
>> is documented explicitly. It could be replaced with an acquire barrier if
>> needed.
>> 
>> Signed-off-by: Gary Guo <gary@garyguo.net>
>> ---
>>  drivers/gpu/nova-core/gsp/cmdq.rs | 19 +++++++++++++++++++
>>  drivers/gpu/nova-core/gsp/fw.rs   | 12 ------------
>>  2 files changed, 19 insertions(+), 12 deletions(-)
>> 
>> diff --git a/drivers/gpu/nova-core/gsp/cmdq.rs b/drivers/gpu/nova-core/gsp/cmdq.rs
>> index 2224896ccc89..7e4315b13984 100644
>> --- a/drivers/gpu/nova-core/gsp/cmdq.rs
>> +++ b/drivers/gpu/nova-core/gsp/cmdq.rs
>> @@ -19,6 +19,12 @@
>>      prelude::*,
>>      sync::{
>>          aref::ARef,
>> +        barrier::{
>> +            dma_mb,
>> +            Read,
>> +            Release,
>> +            Write, //
>> +        },
>>          Mutex, //
>>      },
>>      time::Delta,
>> @@ -258,6 +264,9 @@ fn new(dev: &device::Device<device::Bound>) -> Result<Self> {
>>          let tx = self.cpu_write_ptr() as usize;
>>          let rx = self.gsp_read_ptr() as usize;
>>  
>> +        // ORDERING: control dependency provides necessary LOAD->STORE ordering.
>> +        // `dma_mb(Acquire)` may be used here if we don't want to rely on control dependency.
>
> Just checking, does a control dependency on the CPU side really apply to
> ordering for IO (what the device perceives)? IOW, the loads and stores
> might be ordered on the CPU side, but the device might be seeing these
> operations out of order. If that is the case, perhaps the control
> dependency comment is misleading.

Given that CPUs cannot speculate stores, I don't see how dependency ordering
can be broken even if the other side is a bus-mastering device.

For this to be broken, the device would have to see the dependency-ordered
STORE propagate to it before it performs its own STORE that propagates to the
CPU and triggers the CPU STORE in the first place. That would just break
causality.

I'd say that the control dependency is sufficient here. I'm not worried that
the compiler will break this particular control dependency, given that
`gsp_read_ptr == cpu_write_ptr` implies command allocation failure and this
condition cannot possibly be optimized out by the compiler.

Best,
Gary

>
>
>> +
>>          // SAFETY:
>>          // - We will only access the driver-owned part of the shared memory.
>>          // - Per the safety statement of the function, no concurrent access will be performed.
>> @@ -311,6 +320,9 @@ fn driver_write_area_size(&self) -> usize {
>>          let tx = self.gsp_write_ptr() as usize;
>>          let rx = self.cpu_read_ptr() as usize;
>>  
>> +        // ORDERING: Ensure data load is ordered after load of GSP write pointer.
>> +        dma_mb(Read);
>> +
>
> I suggest taking it on a case by case basis, and splitting the patch for each
> case, for easier review. There are many patterns AFAICS, load-store, store-store
> etc.
>
> I do acknowledge the issue you find here though. thanks,

This is pretty much just the common MP pattern, for which there's even an
example in Documentation/core-api/circular-buffers.rst (except we're using
DMA barriers here, as we're talking to a device rather than another SMP CPU),
so I think there's no real need to break this into small parts.

(That documentation makes use of smp_load_acquire/smp_store_release, which is
another reason I want to at least document the barriers as having
ACQUIRE/RELEASE ordering.)
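For illustration, that circular-buffer pattern can be sketched in userspace
Rust, with `Release`/`Acquire` fences standing in for `dma_mb(Write)` and
`dma_mb(Read)` (the `Ring` type and its methods are invented for this sketch
and are not the nova-core API):

```rust
use std::sync::atomic::{fence, AtomicU32, AtomicUsize, Ordering};

const SLOTS: usize = 4;

// A toy single-producer ring buffer following the MP pattern from
// Documentation/core-api/circular-buffers.rst.
struct Ring {
    buf: [AtomicU32; SLOTS],
    wptr: AtomicUsize, // producer-owned write pointer
}

impl Ring {
    fn new() -> Self {
        Ring {
            buf: std::array::from_fn(|_| AtomicU32::new(0)),
            wptr: AtomicUsize::new(0),
        }
    }

    // Producer side: write the data, then publish it by advancing the write
    // pointer. The release fence models dma_mb(Write).
    fn push(&self, v: u32) {
        let w = self.wptr.load(Ordering::Relaxed);
        self.buf[w % SLOTS].store(v, Ordering::Relaxed);
        fence(Ordering::Release); // STORE->STORE: data before pointer update
        self.wptr.store(w + 1, Ordering::Relaxed);
    }

    // Consumer side: read the write pointer, then the data. The acquire
    // fence models dma_mb(Read) (LOAD->LOAD ordering).
    fn pop(&self, rptr: usize) -> Option<u32> {
        if self.wptr.load(Ordering::Relaxed) == rptr {
            return None; // queue empty
        }
        fence(Ordering::Acquire); // pointer load before data load
        Some(self.buf[rptr % SLOTS].load(Ordering::Relaxed))
    }
}

fn main() {
    let r = Ring::new();
    assert_eq!(r.pop(0), None);
    r.push(7);
    assert_eq!(r.pop(0), Some(7));
}
```

The write path mirrors `advance_cpu_write_ptr` (barrier before the pointer
store) and the read path mirrors the check in the read side (barrier after
the pointer load).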

Perhaps I can move the removal of the redundant barrier after the write-pointer
update into a separate patch.

Best,
Gary

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2026-04-04 13:02 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-04-02 15:24 [PATCH 0/3] rust: more memory barriers bindings Gary Guo
2026-04-02 15:24 ` [PATCH 1/3] rust: sync: add helpers for mb, dma_mb and friends Gary Guo
2026-04-02 15:24 ` [PATCH 2/3] rust: sync: generic memory barriers Gary Guo
2026-04-02 21:49   ` Joel Fernandes
2026-04-03  0:07     ` Gary Guo
2026-04-03 21:33       ` Joel Fernandes
2026-04-04 12:43         ` Gary Guo
2026-04-02 15:24 ` [PATCH 3/3] gpu: nova-core: fix wrong use of barriers in GSP code Gary Guo
2026-04-02 21:56   ` Joel Fernandes
2026-04-02 21:59     ` Joel Fernandes
2026-04-04 13:02     ` Gary Guo

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox