Re: [PATCH v2 5/5] drm/xe/debugfs: Add interface to trigger power management unit error handler

From: Raag Jadav <raag.jadav@intel.com>
To: "Mallesh, Koujalagi" <mallesh.koujalagi@intel.com>
Cc: "Tauro, Riana" <riana.tauro@intel.com>,
	andrealmeid@igalia.com, christian.koenig@amd.com,
	airlied@gmail.com, simona.vetter@ffwll.ch, mripard@kernel.org,
	anshuman.gupta@intel.com, badal.nilawar@intel.com,
	karthik.poosa@intel.com, sk.anirban@intel.com,
	intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	rodrigo.vivi@intel.com
Subject: Re: [PATCH v2 5/5] drm/xe/debugfs: Add interface to trigger power management unit error handler
Date: Thu, 2 Apr 2026 10:31:10 +0200
Message-ID: <ac4pTtjVZ7nxEsag@black.igk.intel.com>
In-Reply-To: <227a4bce-b3dd-4633-a2ae-8dceb82a6653@intel.com>

On Mon, Mar 30, 2026 at 07:10:33PM +0530, Mallesh, Koujalagi wrote:
> On 30-03-2026 10:25 am, Tauro, Riana wrote:
> > On 3/18/2026 12:10 PM, Mallesh Koujalagi wrote:
> > > Add a debugfs interface to manually trigger power management unit error
> > > handler for testing cold reset recovery paths. This is useful for
> > > validating the error recovery mechanism.
> > > 
> > > The new debugfs entry 'trigger_punit_error' is located at:
> > >    /sys/kernel/debug/dri/N/trigger_punit_error
> > > 
> > > Reading the file displays usage instructions. Writing '1' invokes
> > > xe_punit_error_handler(), which marks the device as wedged with
> > > DRM_WEDGE_RECOVERY_COLD_RESET method and sends a uevent to userspace
> > > indicating that a complete device power cycle is required for recovery.
> > > 
> > > Writing '0' or any other false value has no effect.
> > > 
> > > This interface is intended for development, testing, and validation
> > > of power management unit error recovery code.
> > 
> > Would fault injection be more appropriate here?
> 
> Here we need a deterministic way to invoke the punit error handler to test
> the cold-reset recovery flow end-to-end. Using the debugfs interface, we
> directly trigger the wedge/reset status via a debugfs write rather than
> using fault injection.

I think Riana's question was: since fault injection can provide wider
coverage of all the different kinds of error flows, would it make more
sense to reuse it for punit as well?

Raag

> > > Signed-off-by: Mallesh Koujalagi <mallesh.koujalagi@intel.com>
> > > ---
> > >   drivers/gpu/drm/xe/xe_debugfs.c | 38 +++++++++++++++++++++++++++++++++
> > >   1 file changed, 38 insertions(+)
> > > 
> > > diff --git a/drivers/gpu/drm/xe/xe_debugfs.c b/drivers/gpu/drm/xe/xe_debugfs.c
> > > index 844cfafe1ec7..390bbed9c1af 100644
> > > --- a/drivers/gpu/drm/xe/xe_debugfs.c
> > > +++ b/drivers/gpu/drm/xe/xe_debugfs.c
> > > @@ -18,6 +18,7 @@
> > >   #include "xe_gt_debugfs.h"
> > >   #include "xe_gt_printk.h"
> > >   #include "xe_guc_ads.h"
> > > +#include "xe_hw_error.h"
> > >   #include "xe_mmio.h"
> > >   #include "xe_pm.h"
> > >   #include "xe_psmi.h"
> > > @@ -509,6 +510,40 @@ static const struct file_operations disable_late_binding_fops = {
> > >       .write = disable_late_binding_set,
> > >   };
> > > 
> > > +static ssize_t trigger_punit_error_show(struct file *f, char __user *ubuf,
> > > +                    size_t size, loff_t *pos)
> > > +{
> > > +    const char *msg = "Write 1 to trigger power management unit error handler\n";
> > > +
> > > +    return simple_read_from_buffer(ubuf, size, pos, msg, strlen(msg));
> > > +}
> > > +
> > > +static ssize_t trigger_punit_error_set(struct file *f,
> > > +                       const char __user *ubuf,
> > > +                       size_t size, loff_t *pos)
> > > +{
> > > +    struct xe_device *xe = file_inode(f)->i_private;
> > > +    bool trigger;
> > > +    ssize_t ret;
> > > +
> > > +    ret = kstrtobool_from_user(ubuf, size, &trigger);
> > > +    if (ret)
> > > +        return ret;
> > > +
> > > +    if (trigger) {
> > > +        xe_punit_error_handler(xe);
> > > +        drm_info(&xe->drm, "PMU error handler triggered via debugfs\n");
> > > +    }
> > > +
> > > +    return size;
> > > +}
> > > +
> > > +static const struct file_operations trigger_punit_error_fops = {
> > > +    .owner = THIS_MODULE,
> > > +    .read = trigger_punit_error_show,
> > > +    .write = trigger_punit_error_set,
> > > +};
> > > +
> > >   void xe_debugfs_register(struct xe_device *xe)
> > >   {
> > >       struct ttm_device *bdev = &xe->ttm;
> > > @@ -550,6 +585,9 @@ void xe_debugfs_register(struct xe_device *xe)
> > >       debugfs_create_file("disable_late_binding", 0600, root, xe,
> > >                   &disable_late_binding_fops);
> > > 
> > > +    debugfs_create_file("trigger_punit_error", 0600, root, xe,
> > > +                &trigger_punit_error_fops);
> > > +
> > >       /*
> > >        * Don't expose page reclaim configuration file if not supported by the
> > >        * hardware initially.
