Re: [PATCH 5/5] drm/xe: disable wa_15015404425 for PTL B0

Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed

From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: Matt Atwood <matthew.s.atwood@intel.com>
Cc: <matthew.d.roper@intel.com>, <intel-xe@lists.freedesktop.org>
Subject: Re: [PATCH 5/5] drm/xe: disable wa_15015404425 for PTL B0
Date: Tue, 24 Jun 2025 16:53:26 -0400	[thread overview]
Message-ID: <aFsQRs8tObO_3tK5@intel.com> (raw)
In-Reply-To: <aFr-19HqjMgXar6v@msatwood-mobl>

On Tue, Jun 24, 2025 at 12:39:03PM -0700, Matt Atwood wrote:
> On Tue, Jun 24, 2025 at 03:25:37PM -0400, Rodrigo Vivi wrote:
> > On Mon, Jun 23, 2025 at 04:31:30PM -0700, Matt Roper wrote:
> > > On Mon, Jun 23, 2025 at 05:12:05PM -0400, Rodrigo Vivi wrote:
> > > > On Fri, Jun 20, 2025 at 02:49:20PM -0700, Matt Atwood wrote:
> > > > > This workaround only applies to PTL Compute Die A0. However, this
> > > > > information cannot be determined until after the GT is brought up. This
> > > > > means that we will assume that it is required for the initial bring up of
> > > > > the gt. After GT init, the oob workarounds are enabled for the GT. Use
> > > > > this flag to then manually set the bit in the soc oob bit field to 0
> > > > > which will help performance after device bring up.
> > > > > 
> > > > > Signed-off-by: Matt Atwood <matthew.s.atwood@intel.com>
> > > > > ---
> > > > >  drivers/gpu/drm/xe/xe_pci.c        | 6 ++++++
> > > > >  drivers/gpu/drm/xe/xe_wa_oob.rules | 1 +
> > > > >  2 files changed, 7 insertions(+)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/xe/xe_pci.c b/drivers/gpu/drm/xe/xe_pci.c
> > > > > index ded0f3dc8d73..a624c3fb9498 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_pci.c
> > > > > +++ b/drivers/gpu/drm/xe/xe_pci.c
> > > > > @@ -34,6 +34,9 @@
> > > > >  #include "xe_tile.h"
> > > > >  #include "xe_wa.h"
> > > > >  
> > > > > +#include "generated/xe_wa_oob.h"
> > > > > +#include "generated/xe_soc_wa_oob.h"
> > > > > +
> > > > >  enum toggle_d3cold {
> > > > >  	D3COLD_DISABLE,
> > > > >  	D3COLD_ENABLE,
> > > > > @@ -890,6 +893,9 @@ static int xe_pci_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
> > > > >  	drm_dbg(&xe->drm, "d3cold: capable=%s\n",
> > > > >  		str_yes_no(xe->d3cold.capable));
> > > > >  
> > > > > +	if (XE_WA(xe->tiles->media_gt, 15015404425_disable))
> > > > > +		xe->oob[XE_SOC_WA_OOB_15015404425] = 0;
> > > > 
> > > > We are discussing this offline, but I need to make it very clear here
> > > > that we should not move forward with this as is.
> > > > 
> > > > Two unnaceptable points in here:
> > > > \
> > > > 1. _disable. We either enable or we don't. If we need to wait for the gmdid,
> > > > let it be and enable the workaround after that. GMDID should be one of the
> > > > first things and I confirmed this workaround is so rare that in an a0 situation
> > > > you could wait to only enable after you read the gmd-id and confirm the
> > > > media is A0.
> > > 
> > > This is exactly what we *can't* do for this workaround.  The requirement
> > > is that on any impacted platform, every single register read must be
> > > preceded by four extra dummy writes.  There's no such thing as "early
> > > enough to ignore" ---
> > 
> > What I got from the Architects on this is that it would be safe enough to
> > read the GMDID without this workaround since the condition for the issue
> > to manifest is not on a single read like this, although the protection is
> > global.
> > 
> > But okay, let's work with the assumption that it is better to protect
> > anyway.
> > 
> > > it's specifically noted that even pre-OS firmware
> > > and such needs to be careful about doing this as well.  So this causes a
> > > bit of a chicken-and-egg issue:  we cannot read the GMD_ID register
> > > without having the workaround active on impacted platforms, but we
> > > cannot figure out whether the platform is impacted until after we've
> > > read that register[*].  We also do a bunch of other register reads
> > > during early driver probe before the xe_gt is initialized enough to be
> > > able to service workaround lookup queries.  So there are a few options
> > > here:
> > > 
> > >  * Mark the platform as always being active on PTL, then come back and
> > >    disable it later if/when we confirm that we're on a stepping that
> > >    isn't impacted by the issue.  This is pretty simple conceptually, and
> > >    is quite likely something we'll need in the future for other
> > >    workarounds.  That's the approach MattA has taken here.
> > 
> > ack.
> > 
> > > 
> > >  * Add two separate workarounds in the driver: 15015404425_early and
> > >    15015404425.  15015404425_early is an SoC workaround that applies
> > >    unconditionally on PTL, and 15015404425 is a GT workaround that
> > >    applies only on specific media steppings.  Both do exactly the same
> > >    thing (4 dummy writes before any register read), but
> > >    15015404425_early is checked before all early MMIO accesses and
> > >    15015404425 is checked on all others.  The downside of this approach
> > >    is that we'd need to use a completely different set of MMIO
> > >    operations for early driver boot (i.e., no xe_mmio_read32 and such).
> > > 
> > >  * Ignore this workaround completely if we can confirm that it only
> > >    impacts pre-production steppings.  I don't think we have confirmation
> > >    yet that A-step hardware is preprod-only, so we can't take this easy
> > >    path yet.
> > > 
> > > > 
> > > > 2. Don't mix SoC with Media. If the Bug is SoC don't wait of the media stepping
> > > > and check directly for the SoC that needs this. So, don't create an infra that
> > > > already has an exception in it. And if possible, avoid the infra at all. This
> > > > might bring even more confusion to the w/a handling.
> > > > 
> > > > This w/a in specific here is soc, but it is getting mapped to our media-ip,
> > > > so let's use the media check...
> > > 
> > > I think more explanation needs to be added to the patches to clarify
> > > exactly what's going on since it's still a bit non-obvious.  The reality
> > > is that our modern platforms aren't actually "SOCs" at all anymore;
> > > they're technically MCP's --- Multi Chip Packages with logic (and
> > > sometimes hardware issues) spread across the various dies.  The hardware
> > > teams still refer certain things to be "SoC" logic for historical
> > > reasons, even though that logic is technically distributed across
> > > multiple chips these days.
> > 
> > Well, I still see the glue of all the chips as the SoC. It is easier to
> > understand.
> > 
> > > 
> > > We absolutely need some kind of "device" workaround framework that isn't
> > > tied to the GT and that can be used before GTs are even up; we have some
> > > non-GT workarounds today that we're not really handling properly, and we
> > > know there are more coming.  I think when this workaround first came up
> > > I suggested a "device workaround" infrastructure and using the same
> > > general XE_WA() calls, with a _Generic() implementation to lookup the
> > > status of a workaround in either the xe_device or the xe_gt depending on
> > > which is passed (with the xe_device workaround table being initialized
> > > much earlier in the probe sequence).
> > 
> > If we split in device_wa and gt_wa that would make much more sense with
> > our Xe design indeed.
> Is the ask here to change the name from XE_SOC_WA -> XE_DEVICE_WA?

yes, please.
Then as a follow-up we change the other one from xe_wa to xe_gt_wa...

> > 
> > > 
> > > Issues outside of the GT can live in different places:  the GCD die, the
> > > compute (aka CPU) die, the IO die, etc.  Each of these have their own
> > > steppings, and the MCP itself has a stepping too.  In some of these
> > > cases, the proper way to determine a stepping is to map PCI revid into a
> > > die stepping.  In other cases the proper approach is to "fingerprint" a
> > > die's stepping by inspecting some other IP that lives on the same die
> > > (similar to how we used to fingerprint the PCH back in the day by
> > > inspecting the ISA bus device).  For this workaround the stepping we
> > > care about is the compute (CPU) die stepping, and since standalone media
> > > happens to live on that same die, the bspec tells us how to figure out
> > > compute die stepping from media stepping (A-step compute die <=> A-step
> > > media IP in this case, although that isn't something that's guaranteed
> > > to always be true).
> > > 
> > > 
> > > [*] An alternative to fingerprinting compute die based on media stepping
> > >     would probably be to check the CPU stepping through whatever
> > >     mechanism is used to print the stepping in /proc/cpuinfo.  We'd have
> > >     to find separate documentation on how to map the numeric value there
> > >     into somthing like "B0;" I don't know off the top of my head where
> > >     that documentation would be but the core kernel guys could probably
> > >     point us in the right direction.
> > 
> > Yeap, in this case it would be the combination of the pci id + cpu-revid.
> > This makes much more sense for this w/a, but it is hard to generalize indeed.
> > 
> > But also, any device_wa is hard to generalize anyway, since we won't have
> > a device stepping or anything like that.
> > 
> > So, okay, 15015404425 in xe_device_wa enables it and
> > 15015404425_disable in xe_gt_wa disables it for MEDIA_STEP(B0, FOREVER)
> > 
> > > 
> > > 
> > > Matt
> > > 
> > > > 
> > > > > +
> > > > >  	return 0;
> > > > >  
> > > > >  err_driver_cleanup:
> > > > > diff --git a/drivers/gpu/drm/xe/xe_wa_oob.rules b/drivers/gpu/drm/xe/xe_wa_oob.rules
> > > > > index 8c2aa48cb33a..822cbff13819 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_wa_oob.rules
> > > > > +++ b/drivers/gpu/drm/xe/xe_wa_oob.rules
> > > > > @@ -71,3 +71,4 @@ no_media_l3	MEDIA_VERSION(3000)
> > > > >  # primary GT GMDID
> > > > >  14022085890	GRAPHICS_VERSION(2001)
> > > > >  16026007364 	MEDIA_VERSION(3000)
> > > > > +15015404425_disable	PLATFORM(PANTHERLAKE), MEDIA_STEP(B0, FOREVER)
> > > > > -- 
> > > > > 2.49.0
> > > > > 
> > > 
> > > -- 
> > > Matt Roper
> > > Graphics Software Engineer
> > > Linux GPU Platform Enablement
> > > Intel Corporation
> MattA

next prev parent reply	other threads:[~2025-06-24 20:53 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-06-20 21:49 [PATCH 0/5] drm/xe: Create and use SoC WA infrastructure Matt Atwood
2025-06-20 21:49 ` [PATCH 1/5] drm/xe: add xe_soc_wa infrastructure Matt Atwood
2025-06-20 21:49 ` [PATCH 2/5] drm/xe: Add infrastructure for SoC OOB workarounds Matt Atwood
2025-06-20 21:49 ` [PATCH 3/5] drm/xe: Move Wa_15015404425 to use the new EX_SOC_WA macro Matt Atwood
2025-06-20 21:49 ` [PATCH 4/5] drm/xe: extend Wa_15015404425 to apply to PTL Matt Atwood
2025-06-20 21:49 ` [PATCH 5/5] drm/xe: disable wa_15015404425 for PTL B0 Matt Atwood
2025-06-23 21:12   ` Rodrigo Vivi
2025-06-23 23:31     ` Matt Roper
2025-06-24 19:25       ` Rodrigo Vivi
2025-06-24 19:39         ` Matt Atwood
2025-06-24 20:53           ` Rodrigo Vivi [this message]
2025-06-20 21:55 ` ✗ CI.checkpatch: warning for drm/xe: Create and use SoC WA infrastructure Patchwork
2025-06-20 21:57 ` ✓ CI.KUnit: success " Patchwork
2025-06-20 22:58 ` ✗ Xe.CI.BAT: failure " Patchwork
2025-06-23 16:13   ` Matt Atwood
2025-06-21  6:05 ` ✗ Xe.CI.Full: " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aFsQRs8tObO_3tK5@intel.com \
    --to=rodrigo.vivi@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=matthew.d.roper@intel.com \
    --cc=matthew.s.atwood@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox