Re: [Intel-xe] [PATCH] drm/xe: implement driver initiated functional reset

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: Ofir Bitton <obitton@habana.ai>
Cc: "intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>,
	Andrzej Hajda <andrzej.hajda@intel.com>
Subject: Re: [Intel-xe] [PATCH] drm/xe: implement driver initiated functional reset
Date: Wed, 4 Oct 2023 13:19:23 -0400	[thread overview]
Message-ID: <ZR2em8wyd4iRxhnp@intel.com> (raw)
In-Reply-To: <c356834b-7312-9965-3fd2-7e08a494dab1@habana.ai>

On Mon, Oct 02, 2023 at 12:22:19PM +0000, Ofir Bitton wrote:
> On 02/10/2023 11:11, Andrzej Hajda wrote:
> > 
> > 
> > On 01.10.2023 10:50, Ofir Bitton wrote:
> >> On 27/09/2023 10:46, Andrzej Hajda wrote:
> >>> Driver initiated functional reset (FLR) is the highest level of reset
> >>> that we can trigger from within the driver. In contrast to PCI FLR it
> >>> doesn't require re-enumeration of PCI BAR. It can be useful in case
> >>> GT fails to reset. It is also the only way to trigger GSC reset from
> >>> the driver and can be used in future addition of GSC support.
> >>>
> >>> Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com>
> >>> ---
> >>> Hi all,
> >>>
> >>> This is initial attempt to implement Driver-FLR in xe. Implementation
> >>> is copied from i915. I've skipped comments, but I can re-add them if
> >>> requested.
> >>> In i915 Driver-FLR is performed on driver unload if GSC was used.
> >>> In xe GSC is not yet supported, but I have added trigger to perform FLR
> >>> in case GT fails to reset, I guess this could be proper use-case and
> >>> of course this is prerequisite to GSC support.
> >>>
> >>> Since this is my 1st patch to xe please be merciful in comments :)
> >>>
> >>> Regards
> >>> Andrzej
> >>> ---
> >>>    drivers/gpu/drm/xe/regs/xe_gt_regs.h |  6 +++++
> >>>    drivers/gpu/drm/xe/xe_gt.c           |  2 ++
> >>>    drivers/gpu/drm/xe/xe_gt_types.h     |  3 +++
> >>>    drivers/gpu/drm/xe/xe_mmio.c         | 38 
> >>> ++++++++++++++++++++++++++++
> >>>    4 files changed, 49 insertions(+)
> >>>
> >>> diff --git a/drivers/gpu/drm/xe/regs/xe_gt_regs.h 
> >>> b/drivers/gpu/drm/xe/regs/xe_gt_regs.h
> >>> index e13fbbdf6929ea..769a8fd005931a 100644
> >>> --- a/drivers/gpu/drm/xe/regs/xe_gt_regs.h
> >>> +++ b/drivers/gpu/drm/xe/regs/xe_gt_regs.h
> >>> @@ -359,6 +359,12 @@
> >>>    #define RCU_MODE                            XE_REG(0x14800, 
> >>> XE_REG_OPTION_MASKED)
> >>>    #define   RCU_MODE_CCS_ENABLE                       REG_BIT(0)
> >>>
> >>> +#define GU_CNTL                                      XE_REG(0x101010)
> >>> +#define   LMEM_INIT                          REG_BIT(7)
> >>> +#define   DRIVERFLR                          REG_BIT(31)
> >>> +#define GU_DEBUG                             XE_REG(0x101018)
> >>> +#define   DRIVERFLR_STATUS                   REG_BIT(31)
> >>> +
> >>>    #define FORCEWAKE_ACK_GT                    XE_REG(0x130044)
> >>>    #define   FORCEWAKE_KERNEL                  BIT(0)
> >>>    #define   FORCEWAKE_USER                    BIT(1)
> >>> diff --git a/drivers/gpu/drm/xe/xe_gt.c b/drivers/gpu/drm/xe/xe_gt.c
> >>> index 1aa44d4f9ac1c8..1e0b76000ed46b 100644
> >>> --- a/drivers/gpu/drm/xe/xe_gt.c
> >>> +++ b/drivers/gpu/drm/xe/xe_gt.c
> >>> @@ -604,6 +604,8 @@ static int gt_reset(struct xe_gt *gt)
> >>>        xe_uevent_gt_reset_failure(to_pci_dev(gt_to_xe(gt)->drm.dev),
> >>>                                   gt_to_tile(gt)->id, gt->info.id);
> >>>
> >>> +     gt->needs_flr_on_fini = true;
> >>> +
> >>>        return err;
> >>>    }
> >> I think this should be platform dependent and not set to true by default.
> > 
> > Why? Do we have xe supported hw without FLR?
> > This is kind-of bigger hammer solution - if GT reset fails, lets try FLR.
> 
> In future platforms this might not be straight forward, this is why I
> think it is preferable to be able to choose whether this feature is
> valid or not. We can always have bypass flags in innersource.

we could check if this is PVC case already and make the case for adding
a device flag already. Or maybe just a if () return; in the beggining
of the function when/where that is needed.

> 
> > 
> >>
> >>> diff --git a/drivers/gpu/drm/xe/xe_gt_types.h 
> >>> b/drivers/gpu/drm/xe/xe_gt_types.h
> >>> index d4310be3e1e7c8..dc194e9656abd5 100644
> >>> --- a/drivers/gpu/drm/xe/xe_gt_types.h
> >>> +++ b/drivers/gpu/drm/xe/xe_gt_types.h
> >>> @@ -347,6 +347,9 @@ struct xe_gt {
> >>>                /** @oob: bitmap with active OOB workaroudns */
> >>>                unsigned long *oob;
> >>>        } wa_active;
> >>> +
> >>> +     /** @needs_flr_on_fini: requests functional reset on fini */
> >>> +     bool needs_flr_on_fini;
> >>>    };
> >>>
> >>>    #endif
> >>> diff --git a/drivers/gpu/drm/xe/xe_mmio.c b/drivers/gpu/drm/xe/xe_mmio.c
> >>> index 3ccc0af4430be4..593ecc98cdfe8a 100644
> >>> --- a/drivers/gpu/drm/xe/xe_mmio.c
> >>> +++ b/drivers/gpu/drm/xe/xe_mmio.c
> >>> @@ -4,6 +4,7 @@
> >>>     */
> >>>
> >>>    #include <linux/minmax.h>
> >>> +#include <linux/units.h>
> >>>
> >>>    #include "xe_mmio.h"
> >>>
> >>> @@ -364,9 +365,46 @@ static void xe_mmio_probe_tiles(struct xe_device 
> >>> *xe)
> >>>        }
> >>>    }
> >>>
> >>> +static void driver_initiated_flr(struct xe_gt *gt)
> >>> +{
> >>> +     const unsigned int flr_timeout = 3 * MICRO; /* specs recommend 
> >>> a 3s wait */
> >>> +     struct xe_device *xe = gt->tile->xe;
> >>> +     int ret;
> >>> +
> >>> +     drm_dbg(&xe->drm, "Triggering Driver-FLR\n");
> >>> +
> >>> +     ret = xe_mmio_wait32(gt, GU_CNTL, DRIVERFLR, 0, flr_timeout, 
> >>> NULL, false);
> >>> +     if (ret) {
> >>> +             drm_err(&xe->drm, "Driver-FLR-prepare wait for ready 
> >>> failed! %d\n", ret);
> >>> +             return;
> >>> +     }
> >>> +     xe_mmio_write32(gt, GU_DEBUG, DRIVERFLR_STATUS);
> >>> +     xe_mmio_rmw32(gt, GU_CNTL, 0, DRIVERFLR);
> >>> +     ret = xe_mmio_wait32(gt, GU_CNTL, DRIVERFLR, 0, flr_timeout, 
> >>> NULL, false);
> >>> +     if (ret) {
> >>> +             drm_err(&xe->drm, "Driver-FLR-teardown wait completion 
> >>> failed! %d\n", ret);
> >>> +             return;
> >>> +     }
> >>> +     ret = xe_mmio_wait32(gt, GU_DEBUG, DRIVERFLR_STATUS, 
> >>> DRIVERFLR_STATUS,
> >>> +                          flr_timeout, NULL, false);
> >>> +     if (ret) {
> >>> +             drm_err(&xe->drm, "Driver-FLR-reinit wait completion 
> >>> failed! %d\n", ret);
> >>> +             return;
> >>> +     }
> >>> +
> >>> +     /* Clear sticky completion status */
> >>> +     xe_mmio_write32(gt, GU_DEBUG, DRIVERFLR_STATUS);
> >>> +}
> >>> +
> >>>    static void mmio_fini(struct drm_device *drm, void *arg)
> >>>    {
> >>>        struct xe_device *xe = arg;
> >>> +     struct xe_gt *gt;
> >>> +     u8 id;
> >>> +
> >>> +     for_each_gt(gt, xe, id)
> >>> +             if (gt->needs_flr_on_fini)
> >>> +                     driver_initiated_flr(gt);
> >>>
> >>>        pci_iounmap(to_pci_dev(xe->drm.dev), xe->mmio.regs);
> >>>        if (xe->mem.vram.mapping)
> >> Not sure mmio_fini is the best place to put this, as this is not mmio
> >> related.
> > 
> > 
> > The reason I have put it there, is that it should be the last action 
> > before unmapping mmio. What place would you suggest then?
> 
> maybe add something like:
> 'drmm_add_action_or_reset(&xe->drm, flr, xe);' under 'xe_device_probe'
> after 'mmio_init'.

yeap, I also had my doubts about the placement. Why not simply calling this
function or a work-queue calling this function from the place where the
gt_reset failed?

hmmm... and maybe not even a work-queue but a direct call is better, since
on that point we are surely serialized and no execution attempt happening
in parallel.

> 
> --
> Ofir
> 
> > 
> > Regards
> > Andrzej
> > 
> > 
> >>
> >> -- 
> >> Ofir
> >>
> >>
> > 
>

next prev parent reply	other threads:[~2023-10-04 17:19 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-09-27  7:46 [Intel-xe] [PATCH] drm/xe: implement driver initiated functional reset Andrzej Hajda
2023-09-27  8:51 ` [Intel-xe] ✓ CI.Patch_applied: success for " Patchwork
2023-09-27  8:51 ` [Intel-xe] ✓ CI.checkpatch: " Patchwork
2023-09-27  8:52 ` [Intel-xe] ✓ CI.KUnit: " Patchwork
2023-09-27  8:59 ` [Intel-xe] ✓ CI.Build: " Patchwork
2023-09-27  8:59 ` [Intel-xe] ✓ CI.Hooks: " Patchwork
2023-09-27  9:01 ` [Intel-xe] ✓ CI.checksparse: " Patchwork
2023-09-27  9:33 ` [Intel-xe] ✓ CI.BAT: " Patchwork
2023-10-01  8:50 ` [Intel-xe] [PATCH] " Ofir Bitton
2023-10-02  8:11   ` Andrzej Hajda
2023-10-02 12:22     ` Ofir Bitton
2023-10-04 17:19       ` Rodrigo Vivi [this message]
2023-10-05 13:50         ` Daniele Ceraolo Spurio
2023-10-03  8:14 ` Daniele Ceraolo Spurio
2023-10-03  9:22   ` Andrzej Hajda

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZR2em8wyd4iRxhnp@intel.com \
    --to=rodrigo.vivi@intel.com \
    --cc=andrzej.hajda@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=obitton@habana.ai \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.