Intel-XE Archive on lore.kernel.org
From: "Mallesh, Koujalagi" <mallesh.koujalagi@intel.com>
To: "Christian König" <christian.koenig@amd.com>,
	intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	rodrigo.vivi@intel.com
Cc: <andrealmeid@igalia.com>, <airlied@gmail.com>,
	<simona.vetter@ffwll.ch>, <mripard@kernel.org>,
	<anshuman.gupta@intel.com>, <badal.nilawar@intel.com>,
	<riana.tauro@intel.com>, <karthik.poosa@intel.com>,
	<sk.anirban@intel.com>, <raag.jadav@intel.com>
Subject: Re: [RFC PATCH 0/4] Add cold reset recovery method for critical errors
Date: Fri, 13 Feb 2026 16:09:38 +0530
Message-ID: <bf3ab4cf-d49f-4785-8df6-74f13436f854@intel.com>
In-Reply-To: <1db2ea6c-4a0b-4071-9918-5ba756d17a0c@amd.com>


Hi Christian,

On 11-02-2026 05:57 pm, Christian König wrote:
> On 2/11/26 12:59, Mallesh Koujalagi wrote:
>> This RFC patch series introduces a new DRM wedge recovery method
>> 'DRM_WEDGE_RECOVERY_COLD_RESET' for handling critical errors
>> that cannot be recovered through existing software-based mechanisms.
>>
>> Background
>> ----------
>> Current recovery methods (driver rebind, bus reset, FLR) are effective
>> for most error scenarios. However, certain critical errors
>> affect device-level persistent state that survives warm resets and
>> software recovery attempts. These errors require complete device power
>> cycling to restore functionality.
> I don't think that this is a sufficient justification for making those changes.
>
> Especially since the patch set doesn't seem to add any detection for those cases, but rather just exposes a debugfs file to trigger them.
>
> So what is the actual technical background? In other words when is that necessary?
>
> Regards,
> Christian.

Thanks for the feedback. Sorry, I forgot to include a reference to the
actual use case. This method is for handling errors from the power
management unit, which require a complete power cycle (cold reset) to
recover.

I'll add the actual implementation in the next revision. It will be an
extension of our WIP RAS infrastructure, which is being developed in
parallel (see https://patchwork.freedesktop.org/series/160482/).

The current RFC series is meant to gather community feedback on the
proposed recovery method as part of the wedging uAPI.

Thanks,
-/Mallesh


>> Proposed Solution
>> -----------------
>> This series adds DRM_WEDGE_RECOVERY_COLD_RESET (BIT(4)) as a new
>> recovery method to the DRM wedging framework. When this method is set,
>> it signals to userspace that only a complete device cold reset (power
>> cycle) can restore normal operation.
>>
>> Example uevent received:
>>    SUBSYSTEM=drm
>>    WEDGED=cold-reset
>>    DEVPATH=/devices/.../drm/card0
>>
>> Testing
>> -------
>> The debugfs interface allows testing the cold reset recovery path:
>>
>>    echo 1 > /sys/kernel/debug/dri/N/trigger_critical_error
>>
>> This triggers the critical error handler, wedges the device with
>> cold reset method, and sends the appropriate uevent to userspace.
>>
>> Cc: André Almeida <andrealmeid@igalia.com>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: David Airlie <airlied@gmail.com>
>> Cc: Simona Vetter <simona.vetter@ffwll.ch>
>> Cc: Maxime Ripard <mripard@kernel.org>
>>
>> Mallesh Koujalagi (4):
>>    drm: Add DRM_WEDGE_RECOVERY_COLD_RESET for critical error
>>    drm/doc: Document DRM_WEDGE_RECOVERY_COLD_RESET recovery method
>>    drm/xe: Add handler for critical errors which require cold-reset
>>    drm/xe/debugfs: Add interface to trigger critical error handler
>>
>>   Documentation/gpu/drm-uapi.rst   | 73 +++++++++++++++++++++++++++++++-
>>   drivers/gpu/drm/drm_drv.c        |  2 +
>>   drivers/gpu/drm/xe/xe_debugfs.c  | 38 +++++++++++++++++
>>   drivers/gpu/drm/xe/xe_hw_error.c | 28 ++++++++++++
>>   drivers/gpu/drm/xe/xe_hw_error.h |  1 +
>>   include/drm/drm_device.h         |  1 +
>>   6 files changed, 142 insertions(+), 1 deletion(-)
>>
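
For illustration only, a userspace consumer of the uevent quoted above
might check the WEDGED property roughly as follows. This is a minimal
sketch, not part of the series: the helper names wedged_value() and
wedged_has_method() are hypothetical, and it assumes the property value
is a comma-separated list of method names, with "cold-reset" being the
string shown in the example uevent.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return the value part of a "WEDGED=..." uevent
 * property, or NULL if the property is a different key.
 */
static const char *wedged_value(const char *prop)
{
	static const char key[] = "WEDGED=";

	if (strncmp(prop, key, sizeof(key) - 1) != 0)
		return NULL;
	return prop + sizeof(key) - 1;
}

/*
 * Hypothetical helper: return true if the comma-separated list in
 * `value` contains exactly the token `method` (assumed list format).
 */
static bool wedged_has_method(const char *value, const char *method)
{
	size_t mlen = strlen(method);

	while (*value) {
		/* Length of the current token, up to the next comma or end. */
		size_t len = strcspn(value, ",");

		if (len == mlen && strncmp(value, method, mlen) == 0)
			return true;
		value += len;
		if (*value == ',')
			value++;
	}
	return false;
}
```

A consumer would call wedged_value() on each property of a received
uevent and, when wedged_has_method(value, "cold-reset") is true, treat
the device as needing a power cycle rather than attempting a rebind or
bus reset.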

Thread overview: 13+ messages
2026-02-11 11:59 [RFC PATCH 0/4] Add cold reset recovery method for critical errors Mallesh Koujalagi
2026-02-11 11:59 ` [PATCH 1/4] drm: Add DRM_WEDGE_RECOVERY_COLD_RESET for critical error Mallesh Koujalagi
2026-02-11 11:59 ` [PATCH 2/4] drm/doc: Document DRM_WEDGE_RECOVERY_COLD_RESET recovery method Mallesh Koujalagi
2026-02-11 13:29   ` Jani Nikula
2026-02-12  7:54     ` Mallesh, Koujalagi
2026-02-11 11:59 ` [PATCH 3/4] drm/xe: Add handler for critical errors which require cold-reset Mallesh Koujalagi
2026-02-11 11:59 ` [PATCH 4/4] drm/xe/debugfs: Add interface to trigger critical error handler Mallesh Koujalagi
2026-02-11 12:27 ` [RFC PATCH 0/4] Add cold reset recovery method for critical errors Christian König
2026-02-13 10:39   ` Mallesh, Koujalagi [this message]
2026-02-11 15:02 ` ✓ CI.KUnit: success for " Patchwork
2026-02-11 15:23 ` ✗ CI.checksparse: warning " Patchwork
2026-02-11 16:16 ` ✗ Xe.CI.BAT: failure " Patchwork
2026-02-12 22:30 ` ✗ Xe.CI.FULL: " Patchwork