All of lore.kernel.org
 help / color / mirror / Atom feed
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: "Ghimiray, Himal Prasad" <himal.prasad.ghimiray@intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>,
	"Roper, Matthew D" <matthew.d.roper@intel.com>,
	"intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>
Subject: Re: [Intel-xe] [PATCH v5 1/2] drm/xe: Notify Userspace when gt reset fails
Date: Thu, 20 Jul 2023 12:32:39 -0400	[thread overview]
Message-ID: <ZLlhp1t9q9BeHRJ+@intel.com> (raw)
In-Reply-To: <MW4PR11MB7056C4F7A20F8995CC5D3207B339A@MW4PR11MB7056.namprd11.prod.outlook.com>

On Wed, Jul 19, 2023 at 01:51:37AM +0000, Ghimiray, Himal Prasad wrote:
> Hi Matt,
> 
> > -----Original Message-----
> > From: Roper, Matthew D <matthew.d.roper@intel.com>
> > Sent: 19 July 2023 05:23
> > To: Ghimiray, Himal Prasad <himal.prasad.ghimiray@intel.com>
> > Cc: intel-xe@lists.freedesktop.org; Vivi, Rodrigo <rodrigo.vivi@intel.com>
> > Subject: Re: [Intel-xe] [PATCH v5 1/2] drm/xe: Notify Userspace when gt reset
> > fails
> > 
> > On Tue, Jul 18, 2023 at 07:02:15PM +0530, Himal Prasad Ghimiray wrote:
> > > Send uevent in case of gt reset failure. This intimation can be used
> > > by userspace monitoring tool to do the device level reset/reboot when
> > > GT reset fails. udevadm can be used to monitor the uevents.
> > 
> > This seems a bit questionable.  In theory a GT reset shouldn't fail; if it does it
> > either means we've either got a major software bug or a really catastrophic
> > hardware failure.  Letting the kernel driver just blindly move forward after
> > reporting a uevent doesn't seem like a proper response to a huge problem
> > like that.  Relying on userspace to pick up the pieces seems insufficient.
> > 
> > Right now the return code from gt_reset() is just ignored and no meaningful
> > action is taken within the driver.  It seems like we should at least be putting
> > the device into some kind of 'unusable' state (similar to i915's 'wedged') so
> > that other parts of the driver know that everything is completely broken and
> > that they should stop trying to use the device (or some subset of the device).
> > We could also potentially attempt recovery ourselves by escalating to a
> > DriverFLR.
> 
> Agreed. These feature needs to be there in the driver.

indeed. one possibility is to remove the drm card but keep the pci up waiting
for the reset. Another is to 'wedge' like i915.

> 
> > 
> > I'm not sure why any userspace would ever care about the specific condition
> > of "gt reset failed" --- it seems like it would pretty much only care about the
> > device being in one of three states and wouldn't care about the driver-
> > internal details about how/why it's in one of
> > those:
> > 
> >  * Everything is good; can use device as normal
> >  * Device partially broken; it's no longer possible to use these
> >    specific GTs, engines, etc., but other units should still function as
> >    expected.
> >  * Absolutely everything is broken and nothing works.  You _might_ be
> >    able to unload the driver, issue a PCI FLR, and reload the driver,
> >    but otherwise all hope is lost and the device cannot be used anymore.
> > 
> > Even if we decide that some uevent is appropriate here, this is new uapi and
> > will need a real userspace consumer too.  The fact that we can watch any
> > uevents raised by the device with udevadm doesn't seem like it would be
> > sufficient for that purpose; that doesn't feel much different than claiming
> > 'cat' as the userspace consumer of a sysfs node or something.
> > What we really need for new uapi is a meaningful top-to-bottom solution
> > that gives us confidence that we have an appropriate and maintainable
> > interface defined.
> 
> https://spec.oneapi.io/level-zero/latest/sysman/api.html#_CPPv441ZES_EVENT_TYPE_FLAG_DEVICE_RESET_REQUIRED is the consumer for this uapi.
> The application can register for events ZES_EVENT_TYPE_FLAG_DEVICE_RESET_REQUIRED with Sysman and start listening to this event.
> Then Sysman would send ZES_EVENT_TYPE_FLAG_DEVICE_RESET_REQUIRED event to application, incase sysman recieves uevent "RESET_FAILED=1" or "RESET_REQUIRED=1" or if Sysman detects a repair have occurred.
> 
> Then after getting ZES_EVENT_TYPE_FLAG_DEVICE_RESET_REQUIRED event application could query zesDeviceGetState  to get the reason of RESET. And Sysman would tell the reason of reset as :
> ZES_RESET_REASON_FLAG_WEDGED: Sysman would try to create a context and if context creation fails due to EIO error, then device is determined to be wedged and reset reason is declared as WEDGED
> 
> And sysman triggers a SBR to reset the device.

yeap, this is already good by itself, even independent of that mode.

But we need to think about the possibility of removing the drm card above.
Maybe it would be better to add the uevent at the pci level rather then
the drm.
And maybe with a general device_status= variable.

+Joonas

> 
> BR
> Himal
> > 
> > 
> > Matt
> > 
> > >
> > > v2:
> > > -Add NULL check for xe_gt_hw_engine return(Aravind) -Arrange variables
> > > in Christmas tree order(Tejas) -Check GUC_GSC_OTHER_CLASS(Tejas)
> > >
> > > v3:
> > > - Rebase
> > > - Remove notification for engine reset failure (Rodrigo)
> > >
> > > v4
> > > - Rectify the comments in header file.
> > >
> > > Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
> > > Cc: Tejas Upadhyay <tejas.upadhyay@intel.com>
> > > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > > Reviewed-by: Badal Nilawar <badal.nilawar@intel.com>
> > > Signed-off-by: Himal Prasad Ghimiray<himal.prasad.ghimiray@intel.com>
> > > Signed-off-by: Himal Prasad Ghimiray <himal.prasad.ghimiray@intel.com>
> > > ---
> > >  drivers/gpu/drm/xe/xe_gt.c | 18 ++++++++++++++++++
> > > include/uapi/drm/xe_drm.h  |  8 ++++++++
> > >  2 files changed, 26 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> > > index a21d44bfe9e8..1db4d610f2fd 100644
> > > --- a/drivers/gpu/drm/xe/xe_gt.c
> > > +++ b/drivers/gpu/drm/xe/xe_gt.c
> > > @@ -8,6 +8,7 @@
> > >  #include <linux/minmax.h>
> > >
> > >  #include <drm/drm_managed.h>
> > > +#include <drm/xe_drm.h>
> > >
> > >  #include "regs/xe_gt_regs.h"
> > >  #include "xe_bb.h"
> > > @@ -500,6 +501,20 @@ static int do_gt_restart(struct xe_gt *gt)
> > >  	return 0;
> > >  }
> > >
> > > +static void xe_uevent_gt_reset_failure(struct xe_device *xe, u8 id) {
> > > +	char *reset_event[5];
> > > +
> > > +	reset_event[0] = XE_RESET_FAILED_UEVENT "=1";
> > > +	reset_event[1] = "RESET_ENABLED=1";
> > > +	reset_event[2] = "RESET_UNIT=gt";
> > > +	reset_event[3] = kasprintf(GFP_KERNEL, "RESET_ID=%d", id);
> > > +	reset_event[4] = NULL;
> > > +	kobject_uevent_env(&xe->drm.primary->kdev->kobj, KOBJ_CHANGE,
> > > +reset_event);
> > > +
> > > +	kfree(reset_event[3]);
> > > +}
> > > +
> > >  static int gt_reset(struct xe_gt *gt)  {
> > >  	int err;
> > > @@ -549,6 +564,9 @@ static int gt_reset(struct xe_gt *gt)
> > >  	xe_device_mem_access_put(gt_to_xe(gt));
> > >  	xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err));
> > >
> > > +	/* Notify userspace about gt reset failure */
> > > +	xe_uevent_gt_reset_failure(gt_to_xe(gt), gt->info.id);
> > > +
> > >  	return err;
> > >  }
> > >
> > > diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
> > > index 347351a8f618..c28bb54812e5 100644
> > > --- a/include/uapi/drm/xe_drm.h
> > > +++ b/include/uapi/drm/xe_drm.h
> > > @@ -16,6 +16,14 @@ extern "C" {
> > >   * subject to backwards-compatibility constraints.
> > >   */
> > >
> > > +/*
> > > + * Uevents generated by xe on it's device node.
> > > + *
> > > + * XE_RESET_FAILED_UEVENT - Event is generated when attempt to reset
> > > + * gt fails. The value supplied with the event is always 1.
> > > + */
> > > +#define XE_RESET_FAILED_UEVENT "RESET_FAILED"
> > > +
> > >  /**
> > >   * struct xe_user_extension - Base class for defining a chain of extensions
> > >   *
> > > --
> > > 2.25.1
> > >
> > 
> > --
> > Matt Roper
> > Graphics Software Engineer
> > Linux GPU Platform Enablement
> > Intel Corporation

  reply	other threads:[~2023-07-20 16:32 UTC|newest]

Thread overview: 13+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-18 13:32 [Intel-xe] [PATCH v5 0/2] Notify userspace about uevent failure Himal Prasad Ghimiray
2023-07-18 13:32 ` [Intel-xe] [PATCH v5 1/2] drm/xe: Notify Userspace when gt reset fails Himal Prasad Ghimiray
2023-07-18 15:01   ` Nilawar, Badal
2023-07-18 23:52   ` Roper, Matthew D
2023-07-19  1:51     ` Ghimiray, Himal Prasad
2023-07-20 16:32       ` Rodrigo Vivi [this message]
2023-07-24  8:35         ` Ghimiray, Himal Prasad
2023-07-18 13:32 ` [Intel-xe] [PATCH v5 2/2] drm/xe: Introduce fault injection for gt reset Himal Prasad Ghimiray
2023-07-20 16:28   ` Rodrigo Vivi
2023-07-24  8:38     ` Ghimiray, Himal Prasad
2023-07-18 15:52 ` [Intel-xe] ✓ CI.Patch_applied: success for Notify userspace about uevent failure Patchwork
2023-07-18 15:52 ` [Intel-xe] ✗ CI.checkpatch: warning " Patchwork
2023-07-18 15:52 ` [Intel-xe] ✗ CI.KUnit: failure " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZLlhp1t9q9BeHRJ+@intel.com \
    --to=rodrigo.vivi@intel.com \
    --cc=himal.prasad.ghimiray@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=joonas.lahtinen@linux.intel.com \
    --cc=matthew.d.roper@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.