Re: [PATCH v2 1/5] drm: Add a firmware flash method to device wedged uevent

From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: "Christian König" <christian.koenig@amd.com>
Cc: "Riana Tauro" <riana.tauro@intel.com>,
	intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	anshuman.gupta@intel.com, lucas.demarchi@intel.com,
	aravind.iddamsetty@linux.intel.com, raag.jadav@intel.com,
	umesh.nerlige.ramappa@intel.com, frank.scarbrough@intel.com,
	"André Almeida" <andrealmeid@igalia.com>,
	"David Airlie" <airlied@gmail.com>
Subject: Re: [PATCH v2 1/5] drm: Add a firmware flash method to device wedged uevent
Date: Tue, 24 Jun 2025 17:36:29 -0400
Message-ID: <aFsaXXKZSBPrcYJb@intel.com>
In-Reply-To: <44eac6fd-df68-4ae1-8970-57a686f5782f@amd.com>

On Tue, Jun 24, 2025 at 04:23:31PM +0200, Christian König wrote:
> On 24.06.25 16:03, Riana Tauro wrote:
> > Hi Christian
> > 
> > On 6/24/2025 5:56 PM, Christian König wrote:
> >> On 23.06.25 12:01, Riana Tauro wrote:
> >>> A device is declared wedged when it is non-recoverable from
> >>> the driver context.
> >>
> >> Well, not quite.
> > 
> > I took this from the document below. Should it be changed?
> 
> The wedge event basically means that something unexpected happened during the lifetime of the device (crash, hang, whatever).
> 
> It can be that the device recovered on its own and nothing needs to be done (the none case in the documentation) and the event is just sent for telemetry collection.
> 
> But the usual case is to trigger a bus reset, rebind or even reboot.
> 
> > https://www.kernel.org/doc/html/v6.16-rc3/gpu/drm-uapi.html#device-wedging
> > 
> >>
> >>> Some firmware errors can also cause
> >>> the device to enter this state and the only method to recover
> >>> from this would be to do a firmware flash
> >>
> >> What? What exactly do you mean with firmware flash here?
> >>
> >> Usually that means updating the firmware, but I don't see how this will bring you out of a wedge state?
> > 
> > It means updating the firmware.
> > 
> > Series:  https://patchwork.freedesktop.org/series/149756/
> > 
> > In this xe kmd series, there are a few firmware errors that cause the card to be non-functional. The device is declared wedged and a firmware-flash action is sent.
> 
> Ok, so let me recap that just to make sure that I understood that correctly.
> 
> You find that the firmware flashed into the device is buggy and then raise a wedge event to automatically trigger a firmware update?
> 
> Why not fail to load the driver in the first place?

We already have that in place. If, during probe, the fw machinery underneath
identifies something so bad that the firmware needs to be flashed, we boot in
'survivability mode'. The device is not available for any GPU command
submission or memory management; only fw flashing is possible in that
mode.

This is on top of that, for when the fw machinery has hit a bad unrecoverable
error and decided that a fw update is needed.

> Or at least print a big warning into the system log?
> 
> I mean a firmware update is usually something which the system administrator triggers very explicitly because when it fails for some reason (e.g. unexpected reset, power outage or whatever) it can sometimes brick the HW.
> 
> I think it's rather brave to do this automatically. Are you sure we aren't talking past each other on the meaning of the wedge event?

The goal is not to do that automatically, but to raise the uevent to the admin
with enough information for them to decide on the right corrective
action.
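
As a rough illustration of "leave the decision to the admin", a userspace
consumer could map the WEDGED= value to an action and treat firmware-flash as
notify-only. This is only a sketch: the enum, the table and wedge_pick_action()
are hypothetical names made up for this example and are not part of the kernel
or of fwupd. Only the method strings come from the documentation hunk quoted
below.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical consumer-side decisions; not a kernel or fwupd API. */
enum wedge_action {
	WEDGE_ACT_IGNORE,       /* "unknown" or unrecognised: site policy */
	WEDGE_ACT_LOG,          /* "none": telemetry only */
	WEDGE_ACT_REBIND,       /* "rebind": unbind + bind driver */
	WEDGE_ACT_BUS_RESET,    /* "bus-reset": unbind + bus reset + bind */
	WEDGE_ACT_NOTIFY_ADMIN, /* "firmware-flash": never flash automatically */
};

struct method_map {
	const char *name;
	enum wedge_action action;
};

static const struct method_map methods[] = {
	{ "none",           WEDGE_ACT_LOG },
	{ "rebind",         WEDGE_ACT_REBIND },
	{ "bus-reset",      WEDGE_ACT_BUS_RESET },
	{ "firmware-flash", WEDGE_ACT_NOTIFY_ADMIN },
};

/* Look up one method token of length len (not NUL-terminated). */
static enum wedge_action action_for_method(const char *m, size_t len)
{
	for (size_t i = 0; i < sizeof(methods) / sizeof(methods[0]); i++) {
		if (strlen(methods[i].name) == len &&
		    memcmp(methods[i].name, m, len) == 0)
			return methods[i].action;
	}
	return WEDGE_ACT_IGNORE;
}

/*
 * Pick an action from the value part of WEDGED=<method1>[,..,<methodN>].
 * Methods are listed from fewer to more side effects, so the first
 * recognised one is taken.
 */
enum wedge_action wedge_pick_action(const char *value)
{
	assert(value != NULL);

	const char *p = value;
	while (*p) {
		size_t len = strcspn(p, ",");
		enum wedge_action act = action_for_method(p, len);

		if (act != WEDGE_ACT_IGNORE)
			return act;
		p += len;
		if (*p == ',')
			p++;
	}
	return WEDGE_ACT_IGNORE;
}
```

With this mapping, wedge_pick_action("firmware-flash") only yields a
notification request; whether and when to flash stays with the admin.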

Thanks,
Rodrigo.

> 
> Thanks,
> Christian.
> 
> > 
> > There is a corresponding fwupd PR in progress that uses this uevent to trigger a firmware flash
> > 
> > fwupd PR: https://github.com/fwupd/fwupd/pull/8944/
> > 
> > Thanks
> > Riana
> > 
> >>
> >> Where is the rest of the series?
> >>
> >> Regards,
> >> Christian.
> >>
> >>> v2: modify documentation (Raag, Rodrigo)
> >>>
> >>> Cc: André Almeida <andrealmeid@igalia.com>
> >>> Cc: Christian König <christian.koenig@amd.com>
> >>> Signed-off-by: Riana Tauro <riana.tauro@intel.com>
> >>> ---
> >>>   Documentation/gpu/drm-uapi.rst | 6 +++---
> >>>   drivers/gpu/drm/drm_drv.c      | 2 ++
> >>>   include/drm/drm_device.h       | 1 +
> >>>   3 files changed, 6 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/Documentation/gpu/drm-uapi.rst b/Documentation/gpu/drm-uapi.rst
> >>> index 263e5a97c080..cd2481458755 100644
> >>> --- a/Documentation/gpu/drm-uapi.rst
> >>> +++ b/Documentation/gpu/drm-uapi.rst
> >>> @@ -422,9 +422,8 @@ Current implementation defines three recovery methods, out of which, drivers
> >>>   can use any one, multiple or none. Method(s) of choice will be sent in the
> >>>   uevent environment as ``WEDGED=<method1>[,..,<methodN>]`` in order of less to
> >>>   more side-effects. If driver is unsure about recovery or method is unknown
> >>> -(like soft/hard system reboot, firmware flashing, physical device replacement
> >>> -or any other procedure which can't be attempted on the fly), ``WEDGED=unknown``
> >>> -will be sent instead.
> >>> +(like soft/hard system reboot, physical device replacement or any other procedure
> >>> +which can't be attempted on the fly), ``WEDGED=unknown`` will be sent instead.
> >>>
> >>>   Userspace consumers can parse this event and attempt recovery as per the
> >>>   following expectations.
> >>> @@ -435,6 +434,7 @@ following expectations.
> >>>       none            optional telemetry collection
> >>>       rebind          unbind + bind driver
> >>>       bus-reset       unbind + bus reset/re-enumeration + bind
> >>> +    firmware-flash  firmware flash
> >>>       unknown         consumer policy
> >>>       =============== ========================================
> >>>
> >>> diff --git a/drivers/gpu/drm/drm_drv.c b/drivers/gpu/drm/drm_drv.c
> >>> index 02556363e918..5f3bbe01c207 100644
> >>> --- a/drivers/gpu/drm/drm_drv.c
> >>> +++ b/drivers/gpu/drm/drm_drv.c
> >>> @@ -535,6 +535,8 @@ static const char *drm_get_wedge_recovery(unsigned int opt)
> >>>           return "rebind";
> >>>       case DRM_WEDGE_RECOVERY_BUS_RESET:
> >>>           return "bus-reset";
> >>> +    case DRM_WEDGE_RECOVERY_FW_FLASH:
> >>> +        return "firmware-flash";
> >>>       default:
> >>>           return NULL;
> >>>       }
> >>> diff --git a/include/drm/drm_device.h b/include/drm/drm_device.h
> >>> index 08b3b2467c4c..9d57c8882d93 100644
> >>> --- a/include/drm/drm_device.h
> >>> +++ b/include/drm/drm_device.h
> >>> @@ -30,6 +30,7 @@ struct pci_controller;
> >>>   #define DRM_WEDGE_RECOVERY_NONE        BIT(0)    /* optional telemetry collection */
> >>>   #define DRM_WEDGE_RECOVERY_REBIND    BIT(1)    /* unbind + bind driver */
> >>>   #define DRM_WEDGE_RECOVERY_BUS_RESET    BIT(2)    /* unbind + reset bus device + bind */
> >>> +#define DRM_WEDGE_RECOVERY_FW_FLASH    BIT(3)  /* firmware flash */
> >>>
> >>>   /**
> >>>    * struct drm_wedge_task_info - information about the guilty task of a wedge dev
> >>
> > 
> > 
>
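
For readers following the quoted patch, the way a method bitmask turns into
the WEDGED= string can be sketched in plain userspace C. This is not the
kernel code: the WEDGE_* macros, wedge_method_name() and wedge_format_event()
are stand-ins invented for this example, and falling back to "unknown" when no
known bit is set is an assumption based on the documentation text above.

```c
#include <assert.h>
#include <string.h>

/* Stand-ins mirroring the DRM_WEDGE_RECOVERY_* bits from the patch. */
#define WEDGE_RECOVERY_NONE      (1u << 0)
#define WEDGE_RECOVERY_REBIND    (1u << 1)
#define WEDGE_RECOVERY_BUS_RESET (1u << 2)
#define WEDGE_RECOVERY_FW_FLASH  (1u << 3)

/* Map a single method bit to its uevent string, NULL if unknown. */
static const char *wedge_method_name(unsigned int bit)
{
	switch (bit) {
	case WEDGE_RECOVERY_NONE:      return "none";
	case WEDGE_RECOVERY_REBIND:    return "rebind";
	case WEDGE_RECOVERY_BUS_RESET: return "bus-reset";
	case WEDGE_RECOVERY_FW_FLASH:  return "firmware-flash";
	default:                       return NULL;
	}
}

/* Append s to buf at *len; returns -1 if it does not fit. */
static int append(char *buf, size_t size, size_t *len, const char *s)
{
	size_t n = strlen(s);

	if (*len + n + 1 > size)
		return -1;
	memcpy(buf + *len, s, n + 1);
	*len += n;
	return 0;
}

/*
 * Build "WEDGED=<m1>[,<m2>...]" in bit order (lowest bit first), or
 * "WEDGED=unknown" if no known method bit is set.
 * Returns 0 on success, -1 if buf is too small.
 */
int wedge_format_event(unsigned int mask, char *buf, size_t size)
{
	assert(buf != NULL && size > 0);

	size_t len = 0;
	int count = 0;

	buf[0] = '\0';
	if (append(buf, size, &len, "WEDGED="))
		return -1;

	for (unsigned int bit = 1; bit <= WEDGE_RECOVERY_FW_FLASH; bit <<= 1) {
		const char *name = (mask & bit) ? wedge_method_name(bit) : NULL;

		if (!name)
			continue;
		if (count && append(buf, size, &len, ","))
			return -1;
		if (append(buf, size, &len, name))
			return -1;
		count++;
	}

	if (!count && append(buf, size, &len, "unknown"))
		return -1;
	return 0;
}
```

Walking the bits from lowest to highest keeps the methods in the documented
"less to more side-effects" order, with firmware-flash last.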

Thread overview: 13+ messages
     [not found] <20250623100109.1086459-1-riana.tauro@intel.com>
     [not found] ` <20250623100109.1086459-2-riana.tauro@intel.com>
     [not found]   ` <a2bfb8be-35bc-4db9-9352-02eab1ae0881@amd.com>
2025-06-24 14:03     ` [PATCH v2 1/5] drm: Add a firmware flash method to device wedged uevent Riana Tauro
2025-06-24 14:23       ` Christian König
2025-06-24 21:36         ` Rodrigo Vivi [this message]
2025-06-27 21:38           ` Rodrigo Vivi
2025-06-30  8:29             ` Christian König
2025-06-30 17:33               ` Rodrigo Vivi
2025-07-01 11:37                 ` Riana Tauro
2025-07-01 11:41                   ` Riana Tauro
2025-07-01 14:23                     ` Raag Jadav
2025-07-01 14:35                       ` Christian König
2025-07-01 16:02                         ` Raag Jadav
2025-07-01 16:44                           ` Riana Tauro
2025-07-01 17:15                             ` André Almeida
