Re: [PATCH 02/13] gpu: nova-core: fsp: catch bogus queue pointer issues

NVIDIA GPU driver infrastructure
 help / color / mirror / Atom feed

From: Alistair Popple <apopple@nvidia.com>
To: Eliot Courtney <ecourtney@nvidia.com>
Cc: Gary Guo <gary@garyguo.net>, Danilo Krummrich <dakr@kernel.org>,
	 Alexandre Courbot <acourbot@nvidia.com>,
	Alice Ryhl <aliceryhl@google.com>,
	 David Airlie <airlied@gmail.com>,
	Simona Vetter <simona@ffwll.ch>,
	 Benno Lossin <lossin@kernel.org>,
	John Hubbard <jhubbard@nvidia.com>,
	 Timur Tabi <ttabi@nvidia.com>,
	nova-gpu@lists.linux.dev, dri-devel@lists.freedesktop.org,
	 linux-kernel@vger.kernel.org, rust-for-linux@vger.kernel.org,
	 dri-devel <dri-devel-bounces@lists.freedesktop.org>
Subject: Re: [PATCH 02/13] gpu: nova-core: fsp: catch bogus queue pointer issues
Date: Wed, 17 Jun 2026 14:31:40 +1000	[thread overview]
Message-ID: <ajIgJh5JgWyI-jsS@nvdebian.thelocal> (raw)
In-Reply-To: <DJB0OWXDA3SA.2C7C24RNJY2LQ@nvidia.com>

On 2026-06-17 at 13:51 +1000, Eliot Courtney <ecourtney@nvidia.com> wrote...
> On Tue Jun 16, 2026 at 7:57 PM JST, Gary Guo wrote:
> > On Tue Jun 16, 2026 at 8:57 AM BST, Alistair Popple wrote:
> >> On 2026-06-16 at 03:15 +1000, Gary Guo <gary@garyguo.net> wrote...
> >>> On Mon Jun 15, 2026 at 3:40 PM BST, Eliot Courtney wrote:
> >>> > Currently, `poll_msgq` will report a message of size 4 if the queue
> >>> > pointers are broken. It's easy to catch this if it occurs, so have
> >>> > `poll_msgq` return an error in this case.
> >>> >
> >>> > Signed-off-by: Eliot Courtney <ecourtney@nvidia.com>
> >>> > ---
> >>> >  drivers/gpu/nova-core/falcon/fsp.rs | 15 +++++++++------
> >>> >  1 file changed, 9 insertions(+), 6 deletions(-)
> >>> >
> >>> > diff --git a/drivers/gpu/nova-core/falcon/fsp.rs b/drivers/gpu/nova-core/falcon/fsp.rs
> >>> > index e7419a6e71e2..21eaa8e261ce 100644
> >>> > --- a/drivers/gpu/nova-core/falcon/fsp.rs
> >>> > +++ b/drivers/gpu/nova-core/falcon/fsp.rs
> >>> > @@ -107,19 +107,22 @@ fn read_emem(&mut self, bar: Bar0<'_>, data: &mut [u8]) -> Result {
> >>> >      /// Poll FSP for incoming data.
> >>> >      ///
> >>> >      /// Returns the size of available data in bytes, or 0 if no data is available.
> >>> > +    /// Returns an error if the queue pointers are bogus (`tail < head`).
> >>> >      ///
> >>> >      /// The FSP message queue is not circular. Pointers are reset to 0 after each
> >>> >      /// message exchange, so `tail >= head` is always true when data is present.
> >>> > -    fn poll_msgq(&self, bar: Bar0<'_>) -> u32 {
> >>> > +    fn poll_msgq(&self, bar: Bar0<'_>) -> Result<u32> {
> >>> >          let head = bar.read(regs::NV_PFSP_MSGQ_HEAD::at(0)).val();
> >>> >          let tail = bar.read(regs::NV_PFSP_MSGQ_TAIL::at(0)).val();
> >>> >  
> >>> >          if head == tail {
> >>> > -            return 0;
> >>> > +            Ok(0)
> >>> > +        } else {
> >>> > +            // TAIL points at the last DWORD written, so the size is `tail - head + 4`.
> >>> > +            tail.checked_sub(head)
> >>> > +                .and_then(|delta| delta.checked_add(4))
> >>> > +                .ok_or(EIO)
> >>> 
> >>> Whenever we fail with this, we should print a message (actually, the same thing
> >>> probably should be done for patch 1 as well).
> >>> 
> >>> A plain EIO is going be very difficult to troubleshoot if this is ever hit.
> >>
> >> I don't disagree with the sentiment - this is a problem through out the kernel
> >> and I have spent way too long tracing where exactly error codes have come from
> >> both in C and Rust.
> >>
> >> But it seems odd to worry about these particular instances - they _should_
> >> never happen or at least be extremely rare and very unlikely by an end-user.
> >
> > I think we should either consider it possible and add prints, or consider it a
> > bug (hardware or driver) and add a `WARN_ON`, or consider it impossible and not
> > add failure path at all.

I think we have had this discussion previously for Nova and we settled on
progagating errors even if they are "impossible" because programmer mistakes are
always possible, even when trying to reason about an error being "impossible" :-)
 
> I am new to this, but IIUC it's normal to guard against
> hardware/firmware bugs - currently, for this error case, we would try to
> create the `FspResponse` in `send_sync_fsp` and print an error "FSP
> response too small: 4" without this patch. With this patch, we would
> output "FSP response error: EIO\n". I think the latter is marginally
> easier to debug, but both you would find it pretty quickly looking at
> the code.

Right - there are plenty of other "impossible" error paths in Nova where we just
return an error without printing. I don't see anything special about this case
that warrants treating it differently.

If we do want to solve the problem of the cause of return codes being ambiguous
or difficult to debug then I think that should be done as a concerted effort
across Nova or the kernel. I don't there is much value in doing it ad-hoc in some
places and not others.

> I am not that opposed to adding a print, but it seems low value to me,
> since it's very unlikely this will occur and we have a print in the only
> calling function right now (`send_sync_fsp`).

I agree - I'm not totally opposed to a print being added but I think it's low
value especially given we don't consistently do that now.

> I think the semantics are clearer with this patch (no need to reason
> about what happens when the saturating sub + add are hit).

Yeah, patch looks good to me so:

Reviewed-By: Alistair Popple <apopple@nvidia.com>

> >
> >> More to the point though there are many other places in nova-core (and I'm
> >> sure other drivers) where this pattern of just returning a fairly generic
> >> error code exists. So it feels like it would be nicer to deal with this at some
> >> other layer, eg. some kind of debug option to tag error codes with location
> >> or something.
> >
> > This should happen whenever the information is lost. Creating a generic error
> > code would be such occasion. Handling it at upper layers is okay if the
> > information is kept for longer. For example:
> >
> >     enum NovaError {
> >         ...
> >     }
> >
> >     impl From<NovaError> for Error {
> >         fn from(err: NovaError) -> Self {
> >             // Print here
> >             EIO
> >         }
> >     }
> >
> > Would be fine to me because the information is emitted to user when it is lost
> > by the concrete error -> error code conversion.
> 
> It feels very odd to me to have a From implementation have side effects
> (printing) although I agree semantically with the "information is lost"
> idea.

It's an interesting idea - maybe not a print directly but some kind of debug
tracing infrastructure that would allow us to see where exactly an error is
first generated. It's actually something I've thought about before. But IMHO
that's a separate concern that could/should be dealt with independently of this
particular patch series.

 - Alistair

> >
> > Best,
> > Gary
> >
> >>
> >> So I'm not opposed to the comment, but maybe it would be better addressed as a
> >> separate question/patch series to figure out how to do this error reporting in a
> >> more generic or consistent way across all of Nova at least?
> >>
> >>  - Alistair
>

next prev parent reply	other threads:[~2026-06-17  4:31 UTC|newest]

Thread overview: 30+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-15 14:40 [PATCH 00/13] gpu: nova-core: blackwell follow-ups and fixes Eliot Courtney
2026-06-15 14:40 ` [PATCH 01/13] gpu: nova-core: fsp: limit FSP receive message allocation size Eliot Courtney
2026-06-15 17:11   ` Gary Guo
2026-06-16  7:33   ` Alistair Popple
2026-06-15 14:40 ` [PATCH 02/13] gpu: nova-core: fsp: catch bogus queue pointer issues Eliot Courtney
2026-06-15 17:15   ` Gary Guo
2026-06-16  7:57     ` Alistair Popple
2026-06-16 10:57       ` Gary Guo
2026-06-17  3:51         ` Eliot Courtney
2026-06-17  4:31           ` Alistair Popple [this message]
2026-06-15 14:40 ` [PATCH 03/13] gpu: nova-core: fsp: try to enforce exclusive access to FSP channel Eliot Courtney
2026-06-15 17:16   ` Gary Guo
2026-06-17  2:55     ` Eliot Courtney
2026-06-17  3:12       ` John Hubbard
2026-06-15 14:40 ` [PATCH 04/13] gpu: nova-core: falcon: gsp: move PRIV target mask constants Eliot Courtney
2026-06-15 17:17   ` Gary Guo
2026-06-16  8:02   ` Alistair Popple
2026-06-15 14:40 ` [PATCH 05/13] gpu: nova-core: gsp: keep FMC boot params DMA region alive during error Eliot Courtney
2026-06-15 17:23   ` Gary Guo
2026-06-17  5:27     ` Eliot Courtney
2026-06-15 14:40 ` [PATCH 06/13] gpu: nova-core: fsp: move FMC firmware loading into wait_secure_boot Eliot Courtney
2026-06-15 17:24   ` Gary Guo
2026-06-15 14:40 ` [PATCH 07/13] gpu: nova-core: gsp: ensure lifetime for FMC boot DMA allocations Eliot Courtney
2026-06-15 14:40 ` [PATCH 08/13] gpu: nova-core: gsp: ensure LibOS DMA allocation lives long enough Eliot Courtney
2026-06-17  6:11   ` Alistair Popple
2026-06-15 14:40 ` [PATCH 09/13] gpu: nova-core: wait for FSP boot earlier Eliot Courtney
2026-06-15 14:40 ` [PATCH 10/13] gpu: nova-core: split FbLayout into FSP and non-FSP versions Eliot Courtney
2026-06-15 14:40 ` [PATCH 11/13] gpu: nova-core: correct FRTS vidmem offset calculation Eliot Courtney
2026-06-15 14:40 ` [PATCH 12/13] gpu: nova-core: rename heap size field Eliot Courtney
2026-06-15 14:40 ` [PATCH 13/13] gpu: nova-core: return non-WPR heap size as u64 from HALs Eliot Courtney

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ajIgJh5JgWyI-jsS@nvdebian.thelocal \
    --to=apopple@nvidia.com \
    --cc=acourbot@nvidia.com \
    --cc=airlied@gmail.com \
    --cc=aliceryhl@google.com \
    --cc=dakr@kernel.org \
    --cc=dri-devel-bounces@lists.freedesktop.org \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=ecourtney@nvidia.com \
    --cc=gary@garyguo.net \
    --cc=jhubbard@nvidia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lossin@kernel.org \
    --cc=nova-gpu@lists.linux.dev \
    --cc=rust-for-linux@vger.kernel.org \
    --cc=simona@ffwll.ch \
    --cc=ttabi@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox