From: "Summers, Stuart" <stuart.summers@intel.com>
To: "intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>,
"K V P, Satyanarayana" <satyanarayana.k.v.p@intel.com>
Cc: "Brost, Matthew" <matthew.brost@intel.com>,
"Wajdeczko, Michal" <Michal.Wajdeczko@intel.com>,
"Auld, Matthew" <matthew.auld@intel.com>
Subject: Re: [PATCH] drm/xe: Improve VF provision stability with fault injection
Date: Fri, 11 Jul 2025 15:02:36 +0000 [thread overview]
Message-ID: <bee9b61fc61a2eabfa2baa3e182b2c6744355b2b.camel@intel.com> (raw)
In-Reply-To: <20250711115135.1113303-1-satyanarayana.k.v.p@intel.com>
On Fri, 2025-07-11 at 11:51 +0000, Satyanarayana K V P wrote:
> In unlikely event, due to PF malfunction or misconfiguration, VF may
> receive incomplete or invalid configuration and it must be prepared
> to handle such cases without causing a crash.
>
> When simulating errors with the kernel's fault injection framework,
> crashes
> were observed during device unbind. These crashes were primarily due
> to the
> use of the XE_BO_FLAG_GGTT_INVALIDATE flag when creating buffer
> objects.
>
> The GGTT is invalidated using CTB, which is allocated with the
> XE_BO_FLAG_GGTT_INVALIDATE flag. However, the buffer object for CTB
> is
> freed before GGTT invalidation completes, leading to crashes.
>
> Similarly, for buffer objects allocated in memirq_alloc_pages() and
> __xe_sa_bo_manager_init(), the CTB is already freed by the time GGTT
> invalidation occurs, resulting in system crashes.
>
> To prevent these issues, the XE_BO_FLAG_GGTT_INVALIDATE flag is no
> longer
> used when creating buffer objects in memirq_alloc_pages(),
> __xe_sa_bo_manager_init() and xe_guc_ct_init().
Ok so the idea here is that these allocations are happening once during
driver load (or after reset) and not again until the BO is completely
de-allocated/re-allocated and so we don't need to actually issue an
invalidation since we always allocate the buffer from scratch which
would presumably flush the TLBs at that time? Is that right?
Honestly this feels risky to me. Why can't we just check that the
buffer is still alive when doing the invalidation and if not, skip it?
That way we aren't breaking existing functionality and still handling
the error case.
Thanks,
Stuart
>
> Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: Matthew Auld <matthew.auld@intel.com>
> ---
> drivers/gpu/drm/xe/xe_guc_ct.c | 1 -
> drivers/gpu/drm/xe/xe_memirq.c | 1 -
> drivers/gpu/drm/xe/xe_sa.c | 1 -
> 3 files changed, 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c
> b/drivers/gpu/drm/xe/xe_guc_ct.c
> index 23e8c155025e..f32103811d00 100644
> --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> @@ -258,7 +258,6 @@ int xe_guc_ct_init(struct xe_guc_ct *ct)
> bo = xe_managed_bo_create_pin_map(xe, tile, guc_ct_size(),
> XE_BO_FLAG_SYSTEM |
> XE_BO_FLAG_GGTT |
> - XE_BO_FLAG_GGTT_INVALIDATE
> |
>
> XE_BO_FLAG_PINNED_NORESTORE);
> if (IS_ERR(bo))
> return PTR_ERR(bo);
> diff --git a/drivers/gpu/drm/xe/xe_memirq.c
> b/drivers/gpu/drm/xe/xe_memirq.c
> index 49c45ec3e83c..678b78f42132 100644
> --- a/drivers/gpu/drm/xe/xe_memirq.c
> +++ b/drivers/gpu/drm/xe/xe_memirq.c
> @@ -180,7 +180,6 @@ static int memirq_alloc_pages(struct xe_memirq
> *memirq)
> bo = xe_managed_bo_create_pin_map(xe, tile, bo_size,
> XE_BO_FLAG_SYSTEM |
> XE_BO_FLAG_GGTT |
> - XE_BO_FLAG_GGTT_INVALIDATE
> |
> XE_BO_FLAG_NEEDS_UC |
>
> XE_BO_FLAG_NEEDS_CPU_ACCESS);
> if (IS_ERR(bo)) {
> diff --git a/drivers/gpu/drm/xe/xe_sa.c b/drivers/gpu/drm/xe/xe_sa.c
> index 1d43e183ca21..e11ed0a1ed13 100644
> --- a/drivers/gpu/drm/xe/xe_sa.c
> +++ b/drivers/gpu/drm/xe/xe_sa.c
> @@ -60,7 +60,6 @@ struct xe_sa_manager
> *__xe_sa_bo_manager_init(struct xe_tile *tile, u32 size, u3
> bo = xe_managed_bo_create_pin_map(xe, tile, size,
>
> XE_BO_FLAG_VRAM_IF_DGFX(tile) |
> XE_BO_FLAG_GGTT |
> - XE_BO_FLAG_GGTT_INVALIDATE
> |
>
> XE_BO_FLAG_PINNED_NORESTORE);
> if (IS_ERR(bo)) {
> drm_err(&xe->drm, "Failed to prepare %uKiB BO for SA
> manager (%pe)\n",
next prev parent reply other threads:[~2025-07-11 15:02 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-07-11 11:51 [PATCH] drm/xe: Improve VF provision stability with fault injection Satyanarayana K V P
2025-07-11 12:01 ` ✓ CI.KUnit: success for " Patchwork
2025-07-11 12:44 ` ✓ Xe.CI.BAT: " Patchwork
2025-07-11 15:02 ` Summers, Stuart [this message]
2025-07-11 19:53 ` ✗ Xe.CI.Full: failure " Patchwork
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=bee9b61fc61a2eabfa2baa3e182b2c6744355b2b.camel@intel.com \
--to=stuart.summers@intel.com \
--cc=Michal.Wajdeczko@intel.com \
--cc=intel-xe@lists.freedesktop.org \
--cc=matthew.auld@intel.com \
--cc=matthew.brost@intel.com \
--cc=satyanarayana.k.v.p@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox