Intel-XE Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
To: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>,
	 intel-xe@lists.freedesktop.org
Cc: Matthew Brost <matthew.brost@intel.com>,
	Michal Wajdeczko <michal.wajdeczko@intel.com>,
	Matthew Auld <matthew.auld@intel.com>
Subject: Re: [PATCH 1/2] drm/xe/vf: Fix fs_reclaim warning with CCS save/restore BB allocation
Date: Thu, 29 Jan 2026 16:39:37 +0100	[thread overview]
Message-ID: <39c1fd57b78f09761867a469fae16c347d879dcb.camel@linux.intel.com> (raw)
In-Reply-To: <20260129125141.523087-5-satyanarayana.k.v.p@intel.com>

Hi.

On Thu, 2026-01-29 at 12:51 +0000, Satyanarayana K V P wrote:
> CCS save/restore batch buffers are attached during BO allocation and
> detached during BO teardown. The shrinker triggers xe_bo_move(),
> which is
> used for both allocation and deletion paths.
> 
> When BO allocation and shrinking occur concurrently, a circular
> locking
> dependency involving fs_reclaim and swap_guard can occur, leading to
> a
> deadlock such as:
> 
> ======================================================
> WARNING: possible circular locking dependency detected
> ------------------------------------------------------
> 
>       CPU0                    CPU1
>       ----                    ----
>  lock(fs_reclaim);
>                               lock(&sa_manager->swap_guard);
>                               lock(fs_reclaim);
>  lock(&sa_manager->swap_guard);
> 
>  *** DEADLOCK ***
> =====================================================
> 
> To avoid this, allocate CCS save/restore BB BOs using GFP_ATOMIC,
> preventing reclaim from being invoked in this context.
> 
> Fixes: 864690cf4dd62 ("drm/xe/vf: Attach and detach CCS copy commands
> with BO")
> Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
> Suggested-by: Matthew Brost <matthew.brost@intel.com>
> Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> Cc: Matthew Auld <matthew.auld@intel.com>

If shrinking and allocation is indeed happening concurrently, then
GFP_ATOMIC is highly likely to fail. In fact it shouldn't be used if we
can't gracefully recover from a failure or if the failure doesn't
matter at all, like in debugging cases where we can lose data.

I'm trying to wrap my head around the sa manager shadow approach and it
seems to me like external code putting the sa manager in a particular
state and the internal lock is accessed from outside the sa code with
no clearly defined locking rules / asserts? IMO it's a very odd and
fragile construct, in particular when used with guard() where it's
unclear exactly what scope is needing locking.

I'm trying to find some documentation to explain why this is used and
the only thing I can find is 

"Directly clearing the BB lacks atomicity and can lead to undefined
behavior if the vCPU is halted mid-operation...."

But what if the vCPU is halted during xe_sa_bo_sync_shadow()? What is
exactly is the atomicity in this case? 

Thanks,
Thomas


> ---
>  drivers/gpu/drm/xe/xe_bb.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/xe/xe_bb.c b/drivers/gpu/drm/xe/xe_bb.c
> index 8b678297aaa2..355365625df9 100644
> --- a/drivers/gpu/drm/xe/xe_bb.c
> +++ b/drivers/gpu/drm/xe/xe_bb.c
> @@ -62,7 +62,7 @@ struct xe_bb *xe_bb_new(struct xe_gt *gt, u32
> dwords, bool usm)
>  struct xe_bb *xe_bb_ccs_new(struct xe_gt *gt, u32 dwords,
>  			    enum xe_sriov_vf_ccs_rw_ctxs ctx_id)
>  {
> -	struct xe_bb *bb = kmalloc(sizeof(*bb), GFP_KERNEL);
> +	struct xe_bb *bb = kmalloc(sizeof(*bb), GFP_ATOMIC);
>  	struct xe_device *xe = gt_to_xe(gt);
>  	struct xe_sa_manager *bb_pool;
>  	int err;
> @@ -78,7 +78,7 @@ struct xe_bb *xe_bb_ccs_new(struct xe_gt *gt, u32
> dwords,
>  	 */
>  
>  	bb_pool = xe->sriov.vf.ccs.contexts[ctx_id].mem.ccs_bb_pool;
> -	bb->bo = xe_sa_bo_new(bb_pool, 4 * (dwords + 1));
> +	bb->bo = __xe_sa_bo_new(bb_pool, 4 * (dwords + 1),
> GFP_ATOMIC);
>  
>  	if (IS_ERR(bb->bo)) {
>  		err = PTR_ERR(bb->bo);

  reply	other threads:[~2026-01-29 15:39 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-01-29 12:51 [PATCH 0/2] Fix fs_reclaim deadlock caused by CCS save/restore Satyanarayana K V P
2026-01-29 12:51 ` [PATCH 1/2] drm/xe/vf: Fix fs_reclaim warning with CCS save/restore BB allocation Satyanarayana K V P
2026-01-29 15:39   ` Thomas Hellström [this message]
2026-01-29 18:03     ` Matthew Brost
2026-01-29 18:30       ` Thomas Hellström
2026-01-29 19:35         ` Matthew Brost
2026-01-31  1:28           ` Matthew Brost
2026-01-29 12:51 ` [PATCH 2/2] drm/xe/sa: Add lockdep annotations for SA manager swap_guard Satyanarayana K V P
2026-01-29 13:32 ` ✓ CI.KUnit: success for Fix fs_reclaim deadlock caused by CCS save/restore Patchwork
2026-01-29 14:13 ` ✓ Xe.CI.BAT: " Patchwork

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=39c1fd57b78f09761867a469fae16c347d879dcb.camel@linux.intel.com \
    --to=thomas.hellstrom@linux.intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=matthew.auld@intel.com \
    --cc=matthew.brost@intel.com \
    --cc=michal.wajdeczko@intel.com \
    --cc=satyanarayana.k.v.p@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox