From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: "Souza, Jose" <jose.souza@intel.com>
Cc: "intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>,
"Upadhyay, Tejas" <tejas.upadhyay@intel.com>,
"Brost, Matthew" <matthew.brost@intel.com>,
"Ghimiray, Himal Prasad" <himal.prasad.ghimiray@intel.com>,
"Auld, Matthew" <matthew.auld@intel.com>,
"thomas.hellstrom@linux.intel.com"
<thomas.hellstrom@linux.intel.com>,
"Mrozek, Michal" <michal.mrozek@intel.com>
Subject: Re: [PATCH V11 11/12] drm/xe/uapi: Expose ban reason in EXEC_QUEUE_GET_PROPERTY_BAN
Date: Mon, 8 Jun 2026 20:37:59 -0400
Message-ID: <aidgZ1zsihvCPGqq@intel.com>
In-Reply-To: <543ed281612b0f8b1cd289448ae917896f18200c.camel@intel.com>
On Mon, Jun 08, 2026 at 10:03:50AM -0400, Souza, Jose wrote:
> On Fri, 2026-06-05 at 18:08 +0530, Tejas Upadhyay wrote:
> > Extend DRM_XE_EXEC_QUEUE_GET_PROPERTY_BAN to return a bitmask indicating
> > the reason for the ban, rather than a simple boolean. This allows
> > userspace to distinguish between different ban causes:
> >
> > - DRM_XE_EXEC_QUEUE_BAN_REASON_GPU_HANG (bit 0): exec queue was banned
> >   due to a GPU hang or job timeout detected by the TDR.
> > - DRM_XE_EXEC_QUEUE_BAN_REASON_PAGE_OFFLINE (bit 1): exec queue was
> >   banned because a VRAM page backing its resources was taken offline.
> >
> > The ban_reason field is added to struct xe_exec_queue and set at the
> > point where the ban is triggered:
> > - In guc_exec_queue_timedout_job() for GPU hang.
> > - In xe_ttm_vram_purge_page() for memory page offline, before calling
> > xe_exec_queue_kill() or xe_vm_kill().
> >
> > The reset_status op is updated to return u64 with the reason bitmask.
> > When a queue is banned but no explicit reason was recorded (e.g., from a
> > generic CAT error), it defaults to GPU_HANG for backward compatibility.
> > A value of 0 means the exec queue is not banned.
> >
>
> Acked-by: José Roberto de Souza <jose.souza@intel.com>
Do we already have a userspace change that goes with this?
Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Thomas, any thoughts on this vs. the watch_queue you have, or are they orthogonal?
>
> > Assisted-by: Copilot:claude-opus-4.6
> > Signed-off-by: Tejas Upadhyay <tejas.upadhyay@intel.com>
> > cc: Mrozek, Michal <michal.mrozek@intel.com>
> > cc: José Roberto de Souza <jose.souza@intel.com>
> > cc: Vivi, Rodrigo <rodrigo.vivi@intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_exec_queue_types.h | 7 +++++--
> > drivers/gpu/drm/xe/xe_execlist.c | 4 ++--
> > drivers/gpu/drm/xe/xe_guc_submit.c | 24 +++++++++++++++++++-----
> > drivers/gpu/drm/xe/xe_ttm_vram_mgr.c | 7 +++++++
> > include/uapi/drm/xe_drm.h | 12 +++++++++++-
> > 5 files changed, 44 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_exec_queue_types.h
> > b/drivers/gpu/drm/xe/xe_exec_queue_types.h
> > index 2f5ccf294675..77a621da4487 100644
> > --- a/drivers/gpu/drm/xe/xe_exec_queue_types.h
> > +++ b/drivers/gpu/drm/xe/xe_exec_queue_types.h
> > @@ -143,6 +143,9 @@ struct xe_exec_queue {
> > */
> > unsigned long flags;
> >
> > + /** @ban_reason: Bitmask of ban reasons
> > (DRM_XE_EXEC_QUEUE_BAN_REASON_*) */
> > + u32 ban_reason;
> > +
> > union {
> > /** @multi_gt_list: list head for VM bind engines if
> > multi-GT */
> > struct list_head multi_gt_list;
> > @@ -316,8 +319,8 @@ struct xe_exec_queue_ops {
> > * signalled when this function is called.
> > */
> > void (*resume)(struct xe_exec_queue *q);
> > - /** @reset_status: check exec queue reset status */
> > - bool (*reset_status)(struct xe_exec_queue *q);
> > + /** @reset_status: check exec queue ban status, returns ban
> > reason bitmask */
> > + u64 (*reset_status)(struct xe_exec_queue *q);
> > /** @active: check exec queue is active */
> > bool (*active)(struct xe_exec_queue *q);
> > };
> > diff --git a/drivers/gpu/drm/xe/xe_execlist.c
> > b/drivers/gpu/drm/xe/xe_execlist.c
> > index 9fb99c038ea8..35e6e05ba418 100644
> > --- a/drivers/gpu/drm/xe/xe_execlist.c
> > +++ b/drivers/gpu/drm/xe/xe_execlist.c
> > @@ -452,10 +452,10 @@ static void execlist_exec_queue_resume(struct
> > xe_exec_queue *q)
> > /* NIY */
> > }
> >
> > -static bool execlist_exec_queue_reset_status(struct xe_exec_queue
> > *q)
> > +static u64 execlist_exec_queue_reset_status(struct xe_exec_queue *q)
> > {
> > /* NIY */
> > - return false;
> > + return 0;
> > }
> >
> > static bool execlist_exec_queue_active(struct xe_exec_queue *q)
> > diff --git a/drivers/gpu/drm/xe/xe_guc_submit.c
> > b/drivers/gpu/drm/xe/xe_guc_submit.c
> > index 4b247a3019d2..ff28eab7cee2 100644
> > --- a/drivers/gpu/drm/xe/xe_guc_submit.c
> > +++ b/drivers/gpu/drm/xe/xe_guc_submit.c
> > @@ -6,6 +6,7 @@
> > #include "xe_guc_submit.h"
> >
> > #include <linux/bitfield.h>
> > +#include <uapi/drm/xe_drm.h>
> > #include <linux/bitmap.h>
> > #include <linux/circ_buf.h>
> > #include <linux/dma-fence-array.h>
> > @@ -1530,6 +1531,7 @@ guc_exec_queue_timedout_job(struct
> > drm_sched_job *drm_job)
> > if (!exec_queue_killed(q))
> > wedged =
> > guc_submit_hint_wedged(exec_queue_to_guc(q));
> >
> > + q->ban_reason |= DRM_XE_EXEC_QUEUE_BAN_REASON_GPU_HANG;
> > set_exec_queue_banned(q);
> >
> > /* Kick job / queue off hardware */
> > @@ -2211,13 +2213,25 @@ static void guc_exec_queue_resume(struct
> > xe_exec_queue *q)
> > xe_sched_msg_unlock(sched);
> > }
> >
> > -static bool guc_exec_queue_reset_status(struct xe_exec_queue *q)
> > +static u64 guc_exec_queue_reset_status(struct xe_exec_queue *q)
> > {
> > - if (xe_exec_queue_is_multi_queue_secondary(q) &&
> > -
> > guc_exec_queue_reset_status(xe_exec_queue_multi_queue_primary(q)))
> > - return true;
> > + if (xe_exec_queue_is_multi_queue_secondary(q)) {
> > + u64 status = guc_exec_queue_reset_status(
> > + xe_exec_queue_multi_queue_primary(q)
> > );
> > + if (status)
> > + return status;
> > + }
> > +
> > + if (exec_queue_reset(q) ||
> > exec_queue_killed_or_banned_or_wedged(q)) {
> > + u64 reason = q->ban_reason;
> >
> > - return exec_queue_reset(q) ||
> > exec_queue_killed_or_banned_or_wedged(q);
> > + /* If no specific reason was recorded, default to
> > GPU hang */
> > + if (!reason)
> > + reason =
> > DRM_XE_EXEC_QUEUE_BAN_REASON_GPU_HANG;
> > + return reason;
> > + }
> > +
> > + return 0;
> > }
> >
> > static bool guc_exec_queue_active(struct xe_exec_queue *q)
> > diff --git a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > index 35b5eaf590fa..3765e8fcdcec 100644
> > --- a/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > +++ b/drivers/gpu/drm/xe/xe_ttm_vram_mgr.c
> > @@ -7,6 +7,7 @@
> > #include <drm/drm_managed.h>
> > #include <drm/drm_drv.h>
> > #include <drm/drm_buddy.h>
> > +#include <uapi/drm/xe_drm.h>
> >
> > #include <drm/ttm/ttm_placement.h>
> > #include <drm/ttm/ttm_range_manager.h>
> > @@ -537,10 +538,15 @@ static int xe_ttm_vram_purge_page(struct
> > xe_device *xe, struct xe_bo *bo)
> > xe_bo_unlock(bo);
> > /* Ban VM if BO is PPGTT */
> > if (vm && (flags & XE_BO_FLAG_PAGETABLE)) {
> > + struct xe_exec_queue *eq;
> > +
> > down_write(&vm->lock);
> > + list_for_each_entry(eq, &vm->preempt.exec_queues,
> > lr.link)
> > + eq->ban_reason |=
> > DRM_XE_EXEC_QUEUE_BAN_REASON_PAGE_OFFLINE;
> > xe_vm_kill(vm, true);
> > up_write(&vm->lock);
> > }
> > +
> > if (vm)
> > xe_vm_put(vm);
> >
> > @@ -548,6 +554,7 @@ static int xe_ttm_vram_purge_page(struct
> > xe_device *xe, struct xe_bo *bo)
> > /* Ban exec queue if BO is lrc */
> > if (bo->q && xe_exec_queue_get_unless_zero(bo->q)) {
> > /* ban queue */
> > + bo->q->ban_reason |=
> > DRM_XE_EXEC_QUEUE_BAN_REASON_PAGE_OFFLINE;
> > xe_exec_queue_kill(bo->q);
> > xe_exec_queue_put(bo->q);
> > }
> > diff --git a/include/uapi/drm/xe_drm.h b/include/uapi/drm/xe_drm.h
> > index 48e9f1fdb78d..904d58b039fe 100644
> > --- a/include/uapi/drm/xe_drm.h
> > +++ b/include/uapi/drm/xe_drm.h
> > @@ -1503,7 +1503,17 @@ struct drm_xe_exec_queue_get_property {
> > /** @property: property to get */
> > __u32 property;
> >
> > - /** @value: property value */
> > + /**
> > + * @value: property value
> > + *
> > + * For %DRM_XE_EXEC_QUEUE_GET_PROPERTY_BAN, this is a
> > bitmask of:
> > + * - %DRM_XE_EXEC_QUEUE_BAN_REASON_GPU_HANG - banned due to
> > GPU hang/timeout
> > + * - %DRM_XE_EXEC_QUEUE_BAN_REASON_PAGE_OFFLINE - banned
> > due to memory page offline
> > + *
> > + * Value of 0 means the exec queue is not banned.
> > + */
> > +#define DRM_XE_EXEC_QUEUE_BAN_REASON_GPU_HANG (1 << 0)
> > +#define DRM_XE_EXEC_QUEUE_BAN_REASON_PAGE_OFFLINE (1 << 1)
> > __u64 value;
> >
> > /** @reserved: Reserved */
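
For illustration only, here is a minimal sketch of how a userspace consumer might interpret the value returned for DRM_XE_EXEC_QUEUE_GET_PROPERTY_BAN under the semantics quoted above. The two bit definitions are copied from the patch; the helper names xe_queue_is_banned() and ban_reason_str() are hypothetical and not part of any existing library, and the ioctl call that would fill in the value is deliberately left out. Treating unrecognized bits as "unknown" is an assumption about how a consumer might cope with reasons added later, not something the patch specifies.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Bit values copied from the quoted patch (proposed uapi, not yet merged). */
#define DRM_XE_EXEC_QUEUE_BAN_REASON_GPU_HANG     (1 << 0)
#define DRM_XE_EXEC_QUEUE_BAN_REASON_PAGE_OFFLINE (1 << 1)

/* All reason bits this sketch knows about. */
#define BAN_REASON_KNOWN_MASK \
	((uint64_t)(DRM_XE_EXEC_QUEUE_BAN_REASON_GPU_HANG | \
		    DRM_XE_EXEC_QUEUE_BAN_REASON_PAGE_OFFLINE))

/*
 * Hypothetical helper: any non-zero value means "banned". This is the
 * same check a consumer written against the old boolean would perform.
 */
static bool xe_queue_is_banned(uint64_t value)
{
	return value != 0;
}

/*
 * Hypothetical helper: render the bitmask into buf as readable text.
 * Bits outside the known mask are reported as "unknown" (an assumption).
 */
static const char *ban_reason_str(uint64_t value, char *buf, size_t len)
{
	if (!xe_queue_is_banned(value)) {
		snprintf(buf, len, "not banned");
		return buf;
	}

	snprintf(buf, len, "banned:%s%s%s",
		 (value & DRM_XE_EXEC_QUEUE_BAN_REASON_GPU_HANG) ?
			" gpu-hang" : "",
		 (value & DRM_XE_EXEC_QUEUE_BAN_REASON_PAGE_OFFLINE) ?
			" page-offline" : "",
		 (value & ~BAN_REASON_KNOWN_MASK) ? " unknown" : "");
	return buf;
}
```

A caller would pass the value field of struct drm_xe_exec_queue_get_property after the query returns, along with a small stack buffer (64 bytes is ample for the strings above). Because a banned queue always reports a non-zero value, code that only tests for non-zero keeps working unchanged.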
Thread overview: 15+ messages
2026-06-05 12:38 [PATCH V11 00/12] Add memory page offlining support Tejas Upadhyay
2026-06-05 12:38 ` [PATCH V11 01/12] drm/xe: Link VRAM object with gpu buddy Tejas Upadhyay
2026-06-05 12:38 ` [PATCH V11 02/12] drm/buddy: Integrate lockdep annotations for gpu buddy manager Tejas Upadhyay
2026-06-05 12:38 ` [PATCH V11 03/12] drm/gpu: Add gpu_buddy_allocated_addr_to_block helper Tejas Upadhyay
2026-06-05 12:38 ` [PATCH V11 04/12] drm/xe: Link LRC BO and its execution Queue Tejas Upadhyay
2026-06-05 12:38 ` [PATCH V11 05/12] drm/xe: Extend BO purge to handle vram pages as well Tejas Upadhyay
2026-06-05 12:38 ` [PATCH V11 06/12] drm/xe: Handle physical memory address error Tejas Upadhyay
2026-06-05 12:38 ` [PATCH V11 07/12] drm/xe/cri: Add debugfs to inject faulty vram address Tejas Upadhyay
2026-06-05 12:38 ` [PATCH V11 08/12] gpu/buddy: Add routine to dump allocated buddy blocks Tejas Upadhyay
2026-06-05 12:38 ` [PATCH V11 09/12] drm/xe/configfs: Add vram bad page reservation policy Tejas Upadhyay
2026-06-05 12:38 ` [PATCH V11 10/12] drm/xe/cri: Add sysfs interface for bad gpu vram pages Tejas Upadhyay
2026-06-05 12:38 ` [PATCH V11 11/12] drm/xe/uapi: Expose ban reason in EXEC_QUEUE_GET_PROPERTY_BAN Tejas Upadhyay
2026-06-08 14:03 ` Souza, Jose
2026-06-09 0:37 ` Rodrigo Vivi [this message]
2026-06-05 12:38 ` [PATCH V11 12/12] drm/xe: Add soft/hard offline mode for VRAM page retirement Tejas Upadhyay