[PATCH] gpu: nova-core: fb: make sure to unregister SysmemFlush on boot failure

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] gpu: nova-core: fb: make sure to unregister SysmemFlush on boot failure
@ 2026-04-09 12:15 Eliot Courtney
  2026-04-09 22:56 ` John Hubbard
  2026-04-10 15:57 ` Gary Guo
  0 siblings, 2 replies; 6+ messages in thread
From: Eliot Courtney @ 2026-04-09 12:15 UTC (permalink / raw)
  To: Danilo Krummrich, Alexandre Courbot, Alice Ryhl, David Airlie,
	Simona Vetter
  Cc: John Hubbard, Alistair Popple, Joel Fernandes, Timur Tabi,
	rust-for-linux, dri-devel, linux-kernel, Eliot Courtney

Current `Gpu::new` will not unregister SysmemFlush if something fails
after it is created, since it needs manual unregistering. Add a `Drop`
implementation which will clean it up in that case. Maintain the manual
unregister path because it can stay infallible, unlike the Drop path
which depends on revocable access. In the case that `Gpu::new` fails the
access is guaranteed to succeed, however.

Fixes: 6554ad65b589 ("gpu: nova-core: register sysmem flush page")
Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
---
 drivers/gpu/nova-core/fb.rs  | 29 ++++++++++++++++++++---------
 drivers/gpu/nova-core/gpu.rs |  7 ++++++-
 2 files changed, 26 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/nova-core/fb.rs b/drivers/gpu/nova-core/fb.rs
index bdd5eed760e1..edfbdc9a2512 100644
--- a/drivers/gpu/nova-core/fb.rs
+++ b/drivers/gpu/nova-core/fb.rs
@@ -7,6 +7,7 @@
 
 use kernel::{
     device,
+    devres::Devres,
     dma::CoherentHandle,
     fmt,
     io::Io,
@@ -16,7 +17,10 @@
         Alignment, //
     },
     sizes::*,
-    sync::aref::ARef, //
+    sync::{
+        aref::ARef,
+        Arc, //
+    },
 };
 
 use crate::{
@@ -46,12 +50,14 @@
 /// Because of this, the sysmem flush memory page must be registered as early as possible during
 /// driver initialization, and before any falcon is reset.
 ///
-/// Users are responsible for manually calling [`Self::unregister`] before dropping this object,
-/// otherwise the GPU might still use it even after it has been freed.
+/// Users should call [`Self::unregister`] before unloading to ensure unregistering is infallible.
+/// [`Drop`] performs a best-effort fallback using revocable BAR access.
 pub(crate) struct SysmemFlush {
     /// Chipset we are operating on.
     chipset: Chipset,
     device: ARef<device::Device>,
+    /// MMIO mapping of PCI BAR 0.
+    bar: Arc<Devres<Bar0>>,
     /// Keep the page alive as long as we need it.
     page: CoherentHandle,
 }
@@ -60,6 +66,7 @@ impl SysmemFlush {
     /// Allocate a memory page and register it as the sysmem flush page.
     pub(crate) fn register(
         dev: &device::Device<device::Bound>,
+        devres_bar: Arc<Devres<Bar0>>,
         bar: &Bar0,
         chipset: Chipset,
     ) -> Result<Self> {
@@ -70,18 +77,17 @@ pub(crate) fn register(
         Ok(Self {
             chipset,
             device: dev.into(),
+            bar: devres_bar,
             page,
         })
     }
 
     /// Unregister the managed sysmem flush page.
-    ///
-    /// In order to gracefully tear down the GPU, users must make sure to call this method before
-    /// dropping the object.
     pub(crate) fn unregister(&self, bar: &Bar0) {
         let hal = hal::fb_hal(self.chipset);
+        let registered_dma_handle = hal.read_sysmem_flush_page(bar);
 
-        if hal.read_sysmem_flush_page(bar) == self.page.dma_handle() {
+        if registered_dma_handle == self.page.dma_handle() {
             let _ = hal.write_sysmem_flush_page(bar, 0).inspect_err(|e| {
                 dev_warn!(
                     &self.device,
@@ -89,8 +95,7 @@ pub(crate) fn unregister(&self, bar: &Bar0) {
                     e
                 )
             });
-        } else {
-            // Another page has been registered after us for some reason - warn as this is a bug.
+        } else if registered_dma_handle != 0 {
             dev_warn!(
                 &self.device,
                 "attempt to unregister a sysmem flush page that is not active\n"
@@ -99,6 +104,12 @@ pub(crate) fn unregister(&self, bar: &Bar0) {
     }
 }
 
+impl Drop for SysmemFlush {
+    fn drop(&mut self) {
+        let _ = self.bar.try_access_with(|bar| self.unregister(bar));
+    }
+}
+
 pub(crate) struct FbRange(Range<u64>);
 
 impl FbRange {
diff --git a/drivers/gpu/nova-core/gpu.rs b/drivers/gpu/nova-core/gpu.rs
index 0f6fe9a1b955..5bad5a055b3b 100644
--- a/drivers/gpu/nova-core/gpu.rs
+++ b/drivers/gpu/nova-core/gpu.rs
@@ -257,7 +257,12 @@ pub(crate) fn new<'a>(
                     .inspect_err(|_| dev_err!(pdev, "GFW boot did not complete\n"))?;
             },
 
-            sysmem_flush: SysmemFlush::register(pdev.as_ref(), bar, spec.chipset)?,
+            sysmem_flush: SysmemFlush::register(
+                pdev.as_ref(),
+                devres_bar.clone(),
+                bar,
+                spec.chipset,
+            )?,
 
             gsp_falcon: Falcon::new(
                 pdev.as_ref(),

---
base-commit: a7a080bb4236ebe577b6776d940d1717912ff6dd
change-id: 20260409-fix-systemflush-de66dc90378a

Best regards,
--  
Eliot Courtney <ecourtney@nvidia.com>


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] gpu: nova-core: fb: make sure to unregister SysmemFlush on boot failure
  2026-04-09 12:15 [PATCH] gpu: nova-core: fb: make sure to unregister SysmemFlush on boot failure Eliot Courtney
@ 2026-04-09 22:56 ` John Hubbard
  2026-04-10 15:57 ` Gary Guo
  1 sibling, 0 replies; 6+ messages in thread
From: John Hubbard @ 2026-04-09 22:56 UTC (permalink / raw)
  To: Eliot Courtney, Danilo Krummrich, Alexandre Courbot, Alice Ryhl,
	David Airlie, Simona Vetter
  Cc: Alistair Popple, Joel Fernandes, Timur Tabi, rust-for-linux,
	dri-devel, linux-kernel

On 4/9/26 5:15 AM, Eliot Courtney wrote:
> Current `Gpu::new` will not unregister SysmemFlush if something fails
> after it is created, since it needs manual unregistering. Add a `Drop`
> implementation which will clean it up in that case. Maintain the manual
> unregister path because it can stay infallible, unlike the Drop path
> which depends on revocable access. In the case that `Gpu::new` fails the
> access is guaranteed to succeed, however.

Hi Eliot,

The code looks exactly correct to me. I just have some tiny commit
message suggestions:

1. The subject line could be tightened up slightly, to:

    gpu: nova-core: fb: unregister SysmemFlush on boot failure

2. And I'd like to rewrite the commit message body, to approximately this:

If Gpu::new fails after SysmemFlush::register succeeds, the registered
sysmem flush page is never unregistered because SysmemFlush has no Drop
impl and try_pin_init! only drops already-initialized fields on failure.

Add a Drop impl that unregisters through revocable BAR access, which
covers the init-failure path. The manual unregister in Gpu::unbind is
still needed because by the time Drop runs during normal teardown,
devres has already revoked the BAR.

With that,

Reviewed-by: John Hubbard <jhubbard@nvidia.com>

thanks,
-- 
John Hubbard

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] gpu: nova-core: fb: make sure to unregister SysmemFlush on boot failure
  2026-04-09 12:15 [PATCH] gpu: nova-core: fb: make sure to unregister SysmemFlush on boot failure Eliot Courtney
  2026-04-09 22:56 ` John Hubbard
@ 2026-04-10 15:57 ` Gary Guo
  2026-04-13 10:55   ` Danilo Krummrich
  1 sibling, 1 reply; 6+ messages in thread
From: Gary Guo @ 2026-04-10 15:57 UTC (permalink / raw)
  To: Eliot Courtney, Danilo Krummrich, Alexandre Courbot, Alice Ryhl,
	David Airlie, Simona Vetter
  Cc: John Hubbard, Alistair Popple, Joel Fernandes, Timur Tabi,
	rust-for-linux, dri-devel, linux-kernel

On Thu Apr 9, 2026 at 1:15 PM BST, Eliot Courtney wrote:
> Current `Gpu::new` will not unregister SysmemFlush if something fails
> after it is created, since it needs manual unregistering. Add a `Drop`
> implementation which will clean it up in that case. Maintain the manual
> unregister path because it can stay infallible, unlike the Drop path
> which depends on revocable access. In the case that `Gpu::new` fails the
> access is guaranteed to succeed, however.
>
> Fixes: 6554ad65b589 ("gpu: nova-core: register sysmem flush page")
> Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
> ---
>  drivers/gpu/nova-core/fb.rs  | 29 ++++++++++++++++++++---------
>  drivers/gpu/nova-core/gpu.rs |  7 ++++++-
>  2 files changed, 26 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/gpu/nova-core/fb.rs b/drivers/gpu/nova-core/fb.rs
> index bdd5eed760e1..edfbdc9a2512 100644
> --- a/drivers/gpu/nova-core/fb.rs
> +++ b/drivers/gpu/nova-core/fb.rs
> @@ -7,6 +7,7 @@
>  
>  use kernel::{
>      device,
> +    devres::Devres,
>      dma::CoherentHandle,
>      fmt,
>      io::Io,
> @@ -16,7 +17,10 @@
>          Alignment, //
>      },
>      sizes::*,
> -    sync::aref::ARef, //
> +    sync::{
> +        aref::ARef,
> +        Arc, //
> +    },
>  };
>  
>  use crate::{
> @@ -46,12 +50,14 @@
>  /// Because of this, the sysmem flush memory page must be registered as early as possible during
>  /// driver initialization, and before any falcon is reset.
>  ///
> -/// Users are responsible for manually calling [`Self::unregister`] before dropping this object,
> -/// otherwise the GPU might still use it even after it has been freed.
> +/// Users should call [`Self::unregister`] before unloading to ensure unregistering is infallible.
> +/// [`Drop`] performs a best-effort fallback using revocable BAR access.
>  pub(crate) struct SysmemFlush {
>      /// Chipset we are operating on.
>      chipset: Chipset,
>      device: ARef<device::Device>,
> +    /// MMIO mapping of PCI BAR 0.
> +    bar: Arc<Devres<Bar0>>,
>      /// Keep the page alive as long as we need it.
>      page: CoherentHandle,
>  }
> @@ -60,6 +66,7 @@ impl SysmemFlush {
>      /// Allocate a memory page and register it as the sysmem flush page.
>      pub(crate) fn register(
>          dev: &device::Device<device::Bound>,
> +        devres_bar: Arc<Devres<Bar0>>,
>          bar: &Bar0,
>          chipset: Chipset,
>      ) -> Result<Self> {
> @@ -70,18 +77,17 @@ pub(crate) fn register(
>          Ok(Self {
>              chipset,
>              device: dev.into(),
> +            bar: devres_bar,
>              page,
>          })
>      }
>  
>      /// Unregister the managed sysmem flush page.
> -    ///
> -    /// In order to gracefully tear down the GPU, users must make sure to call this method before
> -    /// dropping the object.
>      pub(crate) fn unregister(&self, bar: &Bar0) {
>          let hal = hal::fb_hal(self.chipset);
> +        let registered_dma_handle = hal.read_sysmem_flush_page(bar);
>  
> -        if hal.read_sysmem_flush_page(bar) == self.page.dma_handle() {
> +        if registered_dma_handle == self.page.dma_handle() {
>              let _ = hal.write_sysmem_flush_page(bar, 0).inspect_err(|e| {
>                  dev_warn!(
>                      &self.device,
> @@ -89,8 +95,7 @@ pub(crate) fn unregister(&self, bar: &Bar0) {
>                      e
>                  )
>              });
> -        } else {
> -            // Another page has been registered after us for some reason - warn as this is a bug.
> +        } else if registered_dma_handle != 0 {
>              dev_warn!(
>                  &self.device,
>                  "attempt to unregister a sysmem flush page that is not active\n"
> @@ -99,6 +104,12 @@ pub(crate) fn unregister(&self, bar: &Bar0) {
>      }
>  }
>  
> +impl Drop for SysmemFlush {
> +    fn drop(&mut self) {
> +        let _ = self.bar.try_access_with(|bar| self.unregister(bar));

I feel that this is the wrong solution to the problem.

The thing we want is to *ensure* that `SysmemFlush` Drop is called with device
still being bound.

It's not yet fully clear to me how we'd want to guarantee that, but one API that
might make sense is to create a DevRes API that allows you to reference an
existing `DevRes` and have driver-core making sure that the tear down happens in
reverse order. So inside the `Drop` the `bar` can still be unconditionally
access.

Best,
Gary

> +    }
> +}
> +
>  pub(crate) struct FbRange(Range<u64>);
>  
>  impl FbRange {
> diff --git a/drivers/gpu/nova-core/gpu.rs b/drivers/gpu/nova-core/gpu.rs
> index 0f6fe9a1b955..5bad5a055b3b 100644
> --- a/drivers/gpu/nova-core/gpu.rs
> +++ b/drivers/gpu/nova-core/gpu.rs
> @@ -257,7 +257,12 @@ pub(crate) fn new<'a>(
>                      .inspect_err(|_| dev_err!(pdev, "GFW boot did not complete\n"))?;
>              },
>  
> -            sysmem_flush: SysmemFlush::register(pdev.as_ref(), bar, spec.chipset)?,
> +            sysmem_flush: SysmemFlush::register(
> +                pdev.as_ref(),
> +                devres_bar.clone(),
> +                bar,
> +                spec.chipset,
> +            )?,
>  
>              gsp_falcon: Falcon::new(
>                  pdev.as_ref(),
>
> ---
> base-commit: a7a080bb4236ebe577b6776d940d1717912ff6dd
> change-id: 20260409-fix-systemflush-de66dc90378a
>
> Best regards,
> --  
> Eliot Courtney <ecourtney@nvidia.com>


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] gpu: nova-core: fb: make sure to unregister SysmemFlush on boot failure
  2026-04-10 15:57 ` Gary Guo
@ 2026-04-13 10:55   ` Danilo Krummrich
  2026-04-13 13:33     ` Eliot Courtney
  0 siblings, 1 reply; 6+ messages in thread
From: Danilo Krummrich @ 2026-04-13 10:55 UTC (permalink / raw)
  To: Gary Guo, Eliot Courtney
  Cc: Alexandre Courbot, Alice Ryhl, David Airlie, Simona Vetter,
	John Hubbard, Alistair Popple, Joel Fernandes, Timur Tabi,
	rust-for-linux, dri-devel, linux-kernel

On Fri Apr 10, 2026 at 5:57 PM CEST, Gary Guo wrote:
>> +impl Drop for SysmemFlush {
>> +    fn drop(&mut self) {
>> +        let _ = self.bar.try_access_with(|bar| self.unregister(bar));
>
> I feel that this is the wrong solution to the problem.

Yeah, it is pretty fragile as it relies on outer implementation details, such as
the fact that SysmemFlush is part of the device private data and hence
try_access_with() will fail when not dropped in an unwind path.

> The thing we want is to *ensure* that `SysmemFlush` Drop is called with device
> still being bound.
>
> It's not yet fully clear to me how we'd want to guarantee that, but one API that
> might make sense is to create a DevRes API that allows you to reference an
> existing `DevRes` and have driver-core making sure that the tear down happens in
> reverse order. So inside the `Drop` the `bar` can still be unconditionally
> access.

Yes, I have something like this on my list of things I want to look into for
while (my list entry calls it DevresChain).

I want to leverage the internal reference count of Devres for this, which is not
exactly straight forward, but should be possible.

I will prioritize this and have a look.

Thanks,
Danilo

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] gpu: nova-core: fb: make sure to unregister SysmemFlush on boot failure
  2026-04-13 10:55   ` Danilo Krummrich
@ 2026-04-13 13:33     ` Eliot Courtney
  2026-04-13 14:00       ` Danilo Krummrich
  0 siblings, 1 reply; 6+ messages in thread
From: Eliot Courtney @ 2026-04-13 13:33 UTC (permalink / raw)
  To: Danilo Krummrich, Gary Guo, Eliot Courtney
  Cc: Alexandre Courbot, Alice Ryhl, David Airlie, Simona Vetter,
	John Hubbard, Alistair Popple, Joel Fernandes, Timur Tabi,
	rust-for-linux, dri-devel, linux-kernel

On Mon Apr 13, 2026 at 7:55 PM JST, Danilo Krummrich wrote:
> On Fri Apr 10, 2026 at 5:57 PM CEST, Gary Guo wrote:
>>> +impl Drop for SysmemFlush {
>>> +    fn drop(&mut self) {
>>> +        let _ = self.bar.try_access_with(|bar| self.unregister(bar));
>>
>> I feel that this is the wrong solution to the problem.
>
> Yeah, it is pretty fragile as it relies on outer implementation details, such as
> the fact that SysmemFlush is part of the device private data and hence
> try_access_with() will fail when not dropped in an unwind path.
>
>> The thing we want is to *ensure* that `SysmemFlush` Drop is called with device
>> still being bound.
>>
>> It's not yet fully clear to me how we'd want to guarantee that, but one API that
>> might make sense is to create a DevRes API that allows you to reference an
>> existing `DevRes` and have driver-core making sure that the tear down happens in
>> reverse order. So inside the `Drop` the `bar` can still be unconditionally
>> access.
>
> Yes, I have something like this on my list of things I want to look into for
> while (my list entry calls it DevresChain).
>
> I want to leverage the internal reference count of Devres for this, which is not
> exactly straight forward, but should be possible.
>
> I will prioritize this and have a look.

Yeah, I agree that this patch is the wrong long term solution. If we
don't have the infra for the proper way to do it soon, it might be worth
taking it anyway since it adds minimal complexity and fixes a real
issue.

I had a brief look into the Devres chain stuff that exists on the C side
and it looks like it doesn't provide any actual guarantees about the
lifetime (it seems possible to delete from the middle of the chain, so
descendants can't assume ancestors exist in general, AFAICT), so the
the translation of that into lifetimes in rust might get interesting

thanks

>
> Thanks,
> Danilo


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] gpu: nova-core: fb: make sure to unregister SysmemFlush on boot failure
  2026-04-13 13:33     ` Eliot Courtney
@ 2026-04-13 14:00       ` Danilo Krummrich
  0 siblings, 0 replies; 6+ messages in thread
From: Danilo Krummrich @ 2026-04-13 14:00 UTC (permalink / raw)
  To: Eliot Courtney
  Cc: Gary Guo, Alexandre Courbot, Alice Ryhl, David Airlie,
	Simona Vetter, John Hubbard, Alistair Popple, Joel Fernandes,
	Timur Tabi, rust-for-linux, dri-devel, linux-kernel

On Mon Apr 13, 2026 at 3:33 PM CEST, Eliot Courtney wrote:
> Yeah, I agree that this patch is the wrong long term solution. If we
> don't have the infra for the proper way to do it soon, it might be worth
> taking it anyway since it adds minimal complexity and fixes a real
> issue.

Yeah, let's see how it goes the next weeks.

> I had a brief look into the Devres chain stuff that exists on the C side
> and it looks like it doesn't provide any actual guarantees about the
> lifetime (it seems possible to delete from the middle of the chain, so
> descendants can't assume ancestors exist in general, AFAICT), so the
> the translation of that into lifetimes in rust might get interesting

There is no such API that does what we want here on the C side. In C it simply
is the responsibility of drivers to take care of this. So, I'm not sure what you
are referring to.

In case you are talking about devres groups, they have a different purpose. They
are used in cases where you e.g. have some middle layer API between drivers and
a subsystem, so the middle layer can acquire resources on the driver's behalf
and manage them. If the middle layer fails on optional features, it can simply
release the group of resources acquired for the purpose of the optional feature.

I.e. devres groups are simply markers for a certain range within the list of
devres nodes of a specific device.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2026-04-13 14:00 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-09 12:15 [PATCH] gpu: nova-core: fb: make sure to unregister SysmemFlush on boot failure Eliot Courtney
2026-04-09 22:56 ` John Hubbard
2026-04-10 15:57 ` Gary Guo
2026-04-13 10:55   ` Danilo Krummrich
2026-04-13 13:33     ` Eliot Courtney
2026-04-13 14:00       ` Danilo Krummrich

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox