From: "Christian König" <christian.koenig@amd.com>
To: Raag Jadav <raag.jadav@intel.com>, Riana Tauro <riana.tauro@intel.com>
Cc: "Rodrigo Vivi" <rodrigo.vivi@intel.com>,
intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
anshuman.gupta@intel.com, lucas.demarchi@intel.com,
aravind.iddamsetty@linux.intel.com,
umesh.nerlige.ramappa@intel.com, frank.scarbrough@intel.com,
"André Almeida" <andrealmeid@igalia.com>,
"David Airlie" <airlied@gmail.com>
Subject: Re: [PATCH v2 1/5] drm: Add a firmware flash method to device wedged uevent
Date: Tue, 1 Jul 2025 16:35:42 +0200 [thread overview]
Message-ID: <8f0c1489-df00-4d40-a51c-39dcfa185d3e@amd.com> (raw)
In-Reply-To: <aGPvbagza6HwF4kE@black.fi.intel.com>
On 01.07.25 16:23, Raag Jadav wrote:
> On Tue, Jul 01, 2025 at 05:11:24PM +0530, Riana Tauro wrote:
>> On 7/1/2025 5:07 PM, Riana Tauro wrote:
>>> On 6/30/2025 11:03 PM, Rodrigo Vivi wrote:
>>>> On Mon, Jun 30, 2025 at 10:29:10AM +0200, Christian König wrote:
>>>>> On 27.06.25 23:38, Rodrigo Vivi wrote:
>>>>>>>> Or at least print a big warning into the system log?
>>>>>>>>
>>>>>>>> I mean a firmware update is usually something which
>>>>>>>> the system administrator triggers very explicitly
>>>>>>>> because when it fails for some reason (e.g.
>>>>>>>> unexpected reset, power outage or whatever) it can
>>>>>>>> sometimes brick the HW.
>>>>>>>>
>>>>>>>> I think it's rather brave to do this automatically.
>>>>>>>> Are you sure we don't talk past each other on the
>>>>>>>> meaning of the wedge event?
>>>>>>>
>>>>>>> The goal is not to do that automatically, but raise the
>>>>>>> uevent to the admin
>>>>>>> with enough information that they can decide for the right correctable
>>>>>>> action.
>>>>>>
>>>>>> Christian, Andre, any concerns with this still?
>>>>>
>>>>> Well, that sounds not quite the correct use case for wedge events.
>>>>>
>>>>> See the wedge event is made for automation.
>>>>
>>>> I respectfully disagree with this statement.
>>>>
>>>> The wedged state in i915 and xe, then ported to drm, was never just about
>>>> automation. Of course, the unbind + flr + rebind is one that driver
>>>> cannot
>>>> do by itself, hence needs automation. But wedge cases were also very
>>>> useful
>>>> in other situations like keeping the device in the failure stage for
>>>> debuging
>>>> (without automation) or keeping other critical things up like
>>>> display with SW
>>>> rendering (again, nothing about automation).
Granted, automation is probably not the right term.
What I wanted to say is that the wedge event should not replace information in the system log.
>>>>
>>>>> For example to allow a process supervising containers get the
>>>>> device working again and re-start the container which used it or
>>>>> gather crash log etc .....
>>>>>
>>>>> When you want to notify the system administrator which manual
>>>>> intervention is necessary then I would just write that into the
>>>>> system log and raise a device event with WEDGED=unknown.
>>>>>
>>>>> What we could potentially do is to separate between
>>>>> WEDGED=unknown and WEDGED=manual, e.g. between driver has no
>>>>> idea what to do and driver printed useful info into the system
>>>>> log.
>>>>
>>>> Well, you are right here. Even our official documentation in drm-uapi.rst
>>>> already tells that firmware flashing should be a case for 'unknown'.
>>>
>>> I had added specific method since we know firmware flash will recover
>>> the error. Sure will change it.
>>>
>>> In the current code, there is no recovery method named "unknown" even
>>> though the document mentions it
>>>
>>> https://elixir.bootlin.com/linux/v6.16-rc4/source/drivers/gpu/drm/
>>> drm_drv.c#L534
>>>
>>> Since we are adding something new, can it be "manual" instead of unknown?
>>
>> Okay missed it. It's in the drm_dev_wedged_event function. Will use unknown
>>>
>>>> Let's go with that then. And use other hints like logs and sysfs so,
>>>> Admin
>>>> has a better information of what to do.
>>>>
>>>>> But creating an event with WEDGED=firmware-flash just sounds to
>>>>> specific, when we go down that route we might soon have
>>>>> WEDGE=change- bios-setting, WEDGE=....
>>>>
>>>> Well, I agree that we shouldn't explode the options exponentially here.
>>>>
>>>> Although I believe that firmware flashing should be a common case in many
>>>> case and could be a candidate for another indication.
>>>>
>>>> But let's move on with WEDGE='unknown' for this case.
>
> I understand that WEDGED=firmware-flash can't be handled in a generic way
> for all drivers but it is simply not as same as WEDGED=unknown since the
> driver knows something specific needs to be done here.
>
> I'm wondering if we could add a WEDGED=vendor-specific method for such
> cases?
Works for me as well.
My main concern was that we should not start to invent specific wedge events for all kind of different problems.
On the other hand what's the additional value to distinct between unknown and vendor-specific?
In other words even if the necessary handling is unknown to the wedge event, the administrator could and should still examine the logs to see what to do.
Regards,
Christian.
>
> Chris? André?
>
> Raag
next prev parent reply other threads:[~2025-07-01 14:35 UTC|newest]
Thread overview: 25+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-06-23 10:01 [PATCH v2 0/5] Handle Firmware reported Hardware Errors Riana Tauro
2025-06-23 9:42 ` ✗ CI.checkpatch: warning for Handle Firmware reported Hardware Errors (rev2) Patchwork
2025-06-23 9:44 ` ✓ CI.KUnit: success " Patchwork
2025-06-23 10:01 ` [PATCH v2 1/5] drm: Add a firmware flash method to device wedged uevent Riana Tauro
2025-06-24 12:26 ` Christian König
2025-06-24 14:03 ` Riana Tauro
2025-06-24 14:23 ` Christian König
2025-06-24 21:36 ` Rodrigo Vivi
2025-06-27 21:38 ` Rodrigo Vivi
2025-06-30 8:29 ` Christian König
2025-06-30 17:33 ` Rodrigo Vivi
2025-07-01 11:37 ` Riana Tauro
2025-07-01 11:41 ` Riana Tauro
2025-07-01 14:23 ` Raag Jadav
2025-07-01 14:35 ` Christian König [this message]
2025-07-01 16:02 ` Raag Jadav
2025-07-01 16:44 ` Riana Tauro
2025-07-01 17:15 ` André Almeida
2025-06-23 10:01 ` [PATCH v2 2/5] drm/xe: Add a helper function to set recovery method Riana Tauro
2025-06-23 10:01 ` [PATCH v2 3/5] drm/xe: Add support to handle hardware errors Riana Tauro
2025-06-23 10:01 ` [PATCH v2 4/5] drm/xe/xe_hw_error: Handle CSC Firmware reported Hardware errors Riana Tauro
2025-06-23 10:01 ` [PATCH v2 5/5] drm/xe/xe_hw_error: Add fault injection to trigger csc error handler Riana Tauro
2025-06-23 10:02 ` ✗ CI.checksparse: warning for Handle Firmware reported Hardware Errors (rev2) Patchwork
2025-06-23 11:11 ` ✓ Xe.CI.BAT: success " Patchwork
2025-06-23 14:11 ` ✗ Xe.CI.Full: failure " Patchwork
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=8f0c1489-df00-4d40-a51c-39dcfa185d3e@amd.com \
--to=christian.koenig@amd.com \
--cc=airlied@gmail.com \
--cc=andrealmeid@igalia.com \
--cc=anshuman.gupta@intel.com \
--cc=aravind.iddamsetty@linux.intel.com \
--cc=dri-devel@lists.freedesktop.org \
--cc=frank.scarbrough@intel.com \
--cc=intel-xe@lists.freedesktop.org \
--cc=lucas.demarchi@intel.com \
--cc=raag.jadav@intel.com \
--cc=riana.tauro@intel.com \
--cc=rodrigo.vivi@intel.com \
--cc=umesh.nerlige.ramappa@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox