From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: Aravind Iddamsetty <aravind.iddamsetty@linux.intel.com>
Cc: <intel-xe@lists.freedesktop.org>,
Lucas De Marchi <lucas.demarchi@intel.com>,
Anshuman Gupta <anshuman.gupta@intel.com>
Subject: Re: [PATCH 1/3] drm/xe: Introduce a simple busted state
Date: Fri, 15 Mar 2024 09:50:28 -0400 [thread overview]
Message-ID: <ZfRSJCq2ruBJ4YUM@intel.com> (raw)
In-Reply-To: <d26e6c71-81a3-4d43-9317-126ab5aeaca0@linux.intel.com>
On Fri, Mar 15, 2024 at 12:43:07PM +0530, Aravind Iddamsetty wrote:
>
> On 15/03/24 06:33, Rodrigo Vivi wrote:
> > Introduce a very simple 'busted' state where any attempt
> > to access the GPU is entirely blocked.
> >
> > On some critical cases, like on gt_reset failure, we need to
> > block any other attempt to use the GPU. Otherwise we are at
> > a risk of reaching cases that would force us to reboot the machine.
> >
> > So, when this cases are identified we corner and block any GPU
> > access. No IOCTL and not even another GT reset should be attempted.
> >
> > The 'busted' state in Xe is an end state with no way back.
> > Only a device "re-probe" (unbind + bind) can restore the GPU access.
> >
> > v2: - s/wedged/busted (Lucas)
> > - use unbind+bind instead of module reload (Lucas)
> > - added more info on unbind operations and instruction on bug report
> > - only print the message once.
> >
> > Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> > Cc: Anshuman Gupta <anshuman.gupta@intel.com>
> > Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_device.c | 6 ++++++
> > drivers/gpu/drm/xe/xe_device.h | 18 ++++++++++++++++++
> > drivers/gpu/drm/xe/xe_device_types.h | 3 +++
> > drivers/gpu/drm/xe/xe_gt.c | 4 ++++
> > drivers/gpu/drm/xe/xe_migrate.c | 6 ++++++
> > 5 files changed, 37 insertions(+)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
> > index b0bfe75eb59f..d02e59fb49eb 100644
> > --- a/drivers/gpu/drm/xe/xe_device.c
> > +++ b/drivers/gpu/drm/xe/xe_device.c
> > @@ -142,6 +142,9 @@ static long xe_drm_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
> > struct xe_device *xe = to_xe_device(file_priv->minor->dev);
> > long ret;
> >
> > + if (xe_device_busted(xe))
> > + return -ECANCELED;
> > +
> > ret = xe_pm_runtime_get_ioctl(xe);
> > if (ret >= 0)
> > ret = drm_ioctl(file, cmd, arg);
> > @@ -157,6 +160,9 @@ static long xe_drm_compat_ioctl(struct file *file, unsigned int cmd, unsigned lo
> > struct xe_device *xe = to_xe_device(file_priv->minor->dev);
> > long ret;
> >
> > + if (xe_device_busted(xe))
> > + return -ECANCELED;
> > +
> > ret = xe_pm_runtime_get_ioctl(xe);
> > if (ret >= 0)
> > ret = drm_compat_ioctl(file, cmd, arg);
> > diff --git a/drivers/gpu/drm/xe/xe_device.h b/drivers/gpu/drm/xe/xe_device.h
> > index 14be34d9f543..2c6d9b77821a 100644
> > --- a/drivers/gpu/drm/xe/xe_device.h
> > +++ b/drivers/gpu/drm/xe/xe_device.h
> > @@ -176,4 +176,22 @@ void xe_device_snapshot_print(struct xe_device *xe, struct drm_printer *p);
> > u64 xe_device_canonicalize_addr(struct xe_device *xe, u64 address);
> > u64 xe_device_uncanonicalize_addr(struct xe_device *xe, u64 address);
> >
> > +static inline bool xe_device_busted(struct xe_device *xe)
> > +{
> > + return atomic_read(&xe->busted);
> > +}
> > +
> > +static inline void xe_device_declare_busted(struct xe_device *xe)
> > +{
> > + if (!atomic_xchg(&xe->busted, 1))
> > + drm_err(&xe->drm,
> > + "CRITICAL: Xe has declared device %s as busted.\n"
> > + "IOCTLs and executions are blocked until device is probed again with unbind and bind operations:\n"
> > + "echo '%s' | sudo tee /sys/bus/pci/drivers/xe/unbind\n"
> > + "echo '%s' | sudo tee /sys/bus/pci/drivers/xe/bind\n"
> > + "Please file a _new_ bug report at https://gitlab.freedesktop.org/drm/xe/kernel/issues/new\n",
> > + dev_name(xe->drm.dev), dev_name(xe->drm.dev),
> > + dev_name(xe->drm.dev));
> I know we set needs_flr_on_fini when GT reset fails and i can see in xe_driver_flr that FLR can fail, in such a case
> do we need to do a bigger reset like warm reset(SBR)
yeap, I was always considering to add instructions for the real pci device FLR
in the middle. But I was afraid of this to become to verbose and decided to give
a chance for our driver-initiated FLR.
Perhaps we start with this and if we start seeing more cases of pci reset needed
then we add to the instructions here?
>
>
> Thanks,
> Aravind.
> > +}
> > +
> > #endif
> > diff --git a/drivers/gpu/drm/xe/xe_device_types.h b/drivers/gpu/drm/xe/xe_device_types.h
> > index 9785eef2e5a4..2633fdfc1a38 100644
> > --- a/drivers/gpu/drm/xe/xe_device_types.h
> > +++ b/drivers/gpu/drm/xe/xe_device_types.h
> > @@ -455,6 +455,9 @@ struct xe_device {
> > /** @needs_flr_on_fini: requests function-reset on fini */
> > bool needs_flr_on_fini;
> >
> > + /** @busted: Xe device faced a critical error and is now blocked. */
> > + atomic_t busted;
> > +
> > /* private: */
> >
> > #if IS_ENABLED(CONFIG_DRM_XE_DISPLAY)
> > diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> > index 85408e7a932b..2f29f7fa682b 100644
> > --- a/drivers/gpu/drm/xe/xe_gt.c
> > +++ b/drivers/gpu/drm/xe/xe_gt.c
> > @@ -633,6 +633,9 @@ static int gt_reset(struct xe_gt *gt)
> > {
> > int err;
> >
> > + if (xe_device_busted(gt_to_xe(gt)))
> > + return -ECANCELED;
> > +
> > /* We only support GT resets with GuC submission */
> > if (!xe_device_uc_enabled(gt_to_xe(gt)))
> > return -ENODEV;
> > @@ -686,6 +689,7 @@ static int gt_reset(struct xe_gt *gt)
> > xe_gt_err(gt, "reset failed (%pe)\n", ERR_PTR(err));
> >
> > gt_to_xe(gt)->needs_flr_on_fini = true;
> > + xe_device_declare_busted(gt_to_xe(gt));
> >
> > return err;
> > }
> > diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> > index ee1bb938c493..d7eb409e8415 100644
> > --- a/drivers/gpu/drm/xe/xe_migrate.c
> > +++ b/drivers/gpu/drm/xe/xe_migrate.c
> > @@ -713,6 +713,9 @@ struct dma_fence *xe_migrate_copy(struct xe_migrate *m,
> > xe_bo_needs_ccs_pages(src_bo) && xe_bo_needs_ccs_pages(dst_bo);
> > bool copy_system_ccs = copy_ccs && (!src_is_vram || !dst_is_vram);
> >
> > + if (xe_device_busted(xe))
> > + return ERR_PTR(-ECANCELED);
> > +
> > /* Copying CCS between two different BOs is not supported yet. */
> > if (XE_WARN_ON(copy_ccs && src_bo != dst_bo))
> > return ERR_PTR(-EINVAL);
> > @@ -986,6 +989,9 @@ struct dma_fence *xe_migrate_clear(struct xe_migrate *m,
> > int err;
> > int pass = 0;
> >
> > + if (xe_device_busted(xe))
> > + return ERR_PTR(-ECANCELED);
> > +
> > if (!clear_vram)
> > xe_res_first_sg(xe_bo_sg(bo), 0, bo->size, &src_it);
> > else
next prev parent reply other threads:[~2024-03-15 13:50 UTC|newest]
Thread overview: 23+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-03-15 1:03 [PATCH 1/3] drm/xe: Introduce a simple busted state Rodrigo Vivi
2024-03-15 1:03 ` [PATCH 2/3] drm/xe: declare busted upon GuC load failure Rodrigo Vivi
2024-03-15 11:16 ` Ghimiray, Himal Prasad
2024-03-15 13:54 ` Rodrigo Vivi
2024-03-15 14:23 ` Ghimiray, Himal Prasad
2024-03-15 1:03 ` [PATCH 3/3] drm/xe: Force busted state and block GT reset upon any GPU hang Rodrigo Vivi
2024-03-15 4:31 ` Ghimiray, Himal Prasad
2024-03-15 13:59 ` Rodrigo Vivi
2024-03-15 1:08 ` ✓ CI.Patch_applied: success for series starting with [1/3] drm/xe: Introduce a simple busted state Patchwork
2024-03-15 1:08 ` ✓ CI.checkpatch: " Patchwork
2024-03-15 1:09 ` ✓ CI.KUnit: " Patchwork
2024-03-15 1:20 ` ✓ CI.Build: " Patchwork
2024-03-15 1:22 ` ✓ CI.Hooks: " Patchwork
2024-03-15 1:23 ` [PATCH 1/3] " Dixit, Ashutosh
2024-03-15 3:08 ` Lucas De Marchi
2024-03-15 3:09 ` Rodrigo Vivi
2024-03-15 14:14 ` Tvrtko Ursulin
2024-03-18 23:23 ` Dixit, Ashutosh
2024-03-15 1:24 ` ✓ CI.checksparse: success for series starting with [1/3] " Patchwork
2024-03-15 1:48 ` ✓ CI.BAT: " Patchwork
2024-03-15 7:13 ` [PATCH 1/3] " Aravind Iddamsetty
2024-03-15 13:50 ` Rodrigo Vivi [this message]
2024-03-15 11:52 ` Ghimiray, Himal Prasad
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZfRSJCq2ruBJ4YUM@intel.com \
--to=rodrigo.vivi@intel.com \
--cc=anshuman.gupta@intel.com \
--cc=aravind.iddamsetty@linux.intel.com \
--cc=intel-xe@lists.freedesktop.org \
--cc=lucas.demarchi@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.