Re: [PATCH v5 1/3] drm/xe/migrate: Atomicize CCS copy command setup

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Rodrigo Vivi <rodrigo.vivi@intel.com>
To: Matthew Brost <matthew.brost@intel.com>
Cc: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>,
	<intel-xe@lists.freedesktop.org>,
	Michal Wajdeczko <michal.wajdeczko@intel.com>,
	Matthew Auld <matthew.auld@intel.com>
Subject: Re: [PATCH v5 1/3] drm/xe/migrate: Atomicize CCS copy command setup
Date: Thu, 9 Oct 2025 09:00:43 -0400	[thread overview]
Message-ID: <aOex-1E8yBlGFbwI@intel.com> (raw)
In-Reply-To: <aObsmCaIbrSgHrvd@lstrano-desk.jf.intel.com>

On Wed, Oct 08, 2025 at 03:58:32PM -0700, Matthew Brost wrote:
> On Wed, Oct 08, 2025 at 03:41:47PM +0530, Satyanarayana K V P wrote:
> > The CCS copy command is a 5-dword sequence. If the vCPU halts during
> > save/restore while this sequence is being programmed, partial writes may
> > trigger page faults when saving IGPU CCS metadata. Use the VMOVDQU
> > instruction to write the sequence atomically.
> > 
> > Since VMOVDQU operates on 256-bit chunks, update EMIT_COPY_CCS_DW to emit
> > 8 dwords instead of 5 dwords.
> > 
> > Update emit_flush_invalidate() to use VMOVDQU operating with 128-bit
> > chunks.
> > 
> > Signed-off-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com>
> > Cc: Michal Wajdeczko <michal.wajdeczko@intel.com>
> > Cc: Matthew Brost <matthew.brost@intel.com>
> > Cc: Matthew Auld <matthew.auld@intel.com>
> > 
> > ---
> > V4 -> V5:
> > - Fixed review comments. (Matt B)
> > 
> > V3 -> V4:
> > - Fixed review comments. (Wajdeczko)
> > - Fix issues reported by patchworks.
> > 
> > V2 -> V3:
> > - Added support for 128 bit and 256 bit instructions with memcpy_vmovdqu
> > - Updated emit_flush_invalidate() to use vmovdqu instruction.
> > 
> > V1 -> V2:
> > - Use memcpy_vmovdqu only for x86 arch and for VF. Else use memcpy
> >   (Auld, Matthew)
> > - Fix issues reported by patchworks.
> > ---
> >  drivers/gpu/drm/xe/xe_migrate.c | 93 +++++++++++++++++++++++++--------
> >  1 file changed, 72 insertions(+), 21 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> > index c39c3b423d05..b629072956ee 100644
> > --- a/drivers/gpu/drm/xe/xe_migrate.c
> > +++ b/drivers/gpu/drm/xe/xe_migrate.c
> > @@ -5,7 +5,9 @@
> >  
> >  #include "xe_migrate.h"
> >  
> > +#include <asm/fpu/api.h>
> >  #include <linux/bitfield.h>
> > +#include <linux/cpufeature.h>
> >  #include <linux/sizes.h>
> >  
> >  #include <drm/drm_managed.h>
> > @@ -33,6 +35,7 @@
> >  #include "xe_res_cursor.h"
> >  #include "xe_sa.h"
> >  #include "xe_sched_job.h"
> > +#include "xe_sriov_vf_ccs.h"
> >  #include "xe_sync.h"
> >  #include "xe_trace_bo.h"
> >  #include "xe_validation.h"
> > @@ -644,18 +647,49 @@ static void emit_pte(struct xe_migrate *m,
> >  	}
> >  }
> >  
> > -#define EMIT_COPY_CCS_DW 5
> > +static void memcpy_vmovdqu(void *dst, const void *src, u32 size)
> > +{
> > +#ifdef CONFIG_X86
> > +	kernel_fpu_begin();
> > +	if (size == SZ_128) {
> > +		asm("vmovdqu (%0), %%xmm0\n"
> > +		    "vmovups %%xmm0,   (%1)\n"
> > +		    :: "r" (src), "r" (dst) : "memory");
> > +	} else if (size == SZ_256) {
> > +		asm("vmovdqu (%0), %%ymm0\n"
> > +		    "vmovups %%ymm0,   (%1)\n"
> > +		    :: "r" (src), "r" (dst) : "memory");
> > +	}
> > +	kernel_fpu_end();
> > +#endif
> 
> Everything in this patch LGTM but I think we maintainer input to ensure
> we are breaking some rules about inlined asm code in a driver (no idea
> if this exists) or if a better place would be somewhere common. Can you
> ping Lucas, Thomas, or Rodrigo and ask them about this?

Well, it is possible and we have asm code in i915 for instance (i915_memcpy.c)

But the rule does exist:
https://www.kernel.org/doc/html/latest/process/coding-style.html#inline-assembly

"don’t use inline assembly gratuitously when C can do the job. You can and should
poke hardware from C when possible"

In this case here, please explain why exactly memcpy with smp_wmb barriers and
or WRITE_ONCE code combined couldn't solve it.

Also, please explain how exactly vmovdqu guarantees the atomicity promised by
the commit message. On a quick search here my take is that for this 128 or 256
bits, atomicity is not guaranteed.

So, imho this patch is introducing a unmaintainable, complex, and fragile code
that is not even doing what it is claiming to do. But I will be glad if someone
can challenge this and prove me wrong.

Thanks,
Rodrigo.

> 
> Matt 
> 
> > +}
> > +
> > +static void emit_atomic(struct xe_gt *gt, void *dst, const void *src, u32 size)
> > +{
> > +	u32 instr_size = size * BITS_PER_BYTE;
> > +
> > +	xe_gt_assert(gt, instr_size == SZ_128 || instr_size == SZ_256);
> > +
> > +	if (IS_VF_CCS_READY(gt_to_xe(gt)) && static_cpu_has(X86_FEATURE_AVX))
> > +		memcpy_vmovdqu(dst, src, instr_size);
> > +	else
> > +		memcpy(dst, src, size);
> > +}
> > +
> > +#define EMIT_COPY_CCS_DW 8
> >  static void emit_copy_ccs(struct xe_gt *gt, struct xe_bb *bb,
> >  			  u64 dst_ofs, bool dst_is_indirect,
> >  			  u64 src_ofs, bool src_is_indirect,
> >  			  u32 size)
> >  {
> > +	u32 dw[EMIT_COPY_CCS_DW] = {MI_NOOP};
> >  	struct xe_device *xe = gt_to_xe(gt);
> >  	u32 *cs = bb->cs + bb->len;
> >  	u32 num_ccs_blks;
> >  	u32 num_pages;
> >  	u32 ccs_copy_size;
> >  	u32 mocs;
> > +	u32 i = 0;
> >  
> >  	if (GRAPHICS_VERx100(xe) >= 2000) {
> >  		num_pages = DIV_ROUND_UP(size, XE_PAGE_SIZE);
> > @@ -673,15 +707,23 @@ static void emit_copy_ccs(struct xe_gt *gt, struct xe_bb *bb,
> >  		mocs = FIELD_PREP(XY_CTRL_SURF_MOCS_MASK, gt->mocs.uc_index);
> >  	}
> >  
> > -	*cs++ = XY_CTRL_SURF_COPY_BLT |
> > -		(src_is_indirect ? 0x0 : 0x1) << SRC_ACCESS_TYPE_SHIFT |
> > -		(dst_is_indirect ? 0x0 : 0x1) << DST_ACCESS_TYPE_SHIFT |
> > -		ccs_copy_size;
> > -	*cs++ = lower_32_bits(src_ofs);
> > -	*cs++ = upper_32_bits(src_ofs) | mocs;
> > -	*cs++ = lower_32_bits(dst_ofs);
> > -	*cs++ = upper_32_bits(dst_ofs) | mocs;
> > +	dw[i++] = XY_CTRL_SURF_COPY_BLT |
> > +		  (src_is_indirect ? 0x0 : 0x1) << SRC_ACCESS_TYPE_SHIFT |
> > +		  (dst_is_indirect ? 0x0 : 0x1) << DST_ACCESS_TYPE_SHIFT |
> > +		  ccs_copy_size;
> > +	dw[i++] = lower_32_bits(src_ofs);
> > +	dw[i++] = upper_32_bits(src_ofs) | mocs;
> > +	dw[i++] = lower_32_bits(dst_ofs);
> > +	dw[i++] = upper_32_bits(dst_ofs) | mocs;
> >  
> > +	/*
> > +	 * The CCS copy command is a 5-dword sequence. If the vCPU halts during
> > +	 * save/restore while this sequence is being issued, partial writes may trigger
> > +	 * page faults when saving iGPU CCS metadata. Use the VMOVDQU instruction to
> > +	 * write the sequence atomically.
> > +	 */
> > +	emit_atomic(gt, cs, dw, sizeof(dw));
> > +	cs += EMIT_COPY_CCS_DW;
> >  	bb->len = cs - bb->cs;
> >  }
> >  
> > @@ -993,18 +1035,27 @@ static u64 migrate_vm_ppgtt_addr_tlb_inval(void)
> >  	return (NUM_KERNEL_PDE - 2) * XE_PAGE_SIZE;
> >  }
> >  
> > -static int emit_flush_invalidate(u32 *dw, int i, u32 flags)
> > +/*
> > + * The MI_FLUSH_DW command is a 4-dword sequence. If the vCPU halts during
> > + * save/restore while this sequence is being issued, partial writes may
> > + * trigger page faults when saving iGPU CCS metadata. Use
> > + * emit_atomic() to write the sequence atomically.
> > + */
> > +#define EMIT_FLUSH_INVALIDATE_DW 4
> > +static int emit_flush_invalidate(struct xe_exec_queue *q, u32 *cs, int i, u32 flags)
> >  {
> >  	u64 addr = migrate_vm_ppgtt_addr_tlb_inval();
> > +	u32 dw[EMIT_FLUSH_INVALIDATE_DW] = {MI_NOOP}, j = 0;
> > +
> > +	dw[j++] = MI_FLUSH_DW | MI_INVALIDATE_TLB | MI_FLUSH_DW_OP_STOREDW |
> > +		      MI_FLUSH_IMM_DW | flags;
> > +	dw[j++] = lower_32_bits(addr);
> > +	dw[j++] = upper_32_bits(addr);
> > +	dw[j++] = MI_NOOP;
> >  
> > -	dw[i++] = MI_FLUSH_DW | MI_INVALIDATE_TLB | MI_FLUSH_DW_OP_STOREDW |
> > -		  MI_FLUSH_IMM_DW | flags;
> > -	dw[i++] = lower_32_bits(addr);
> > -	dw[i++] = upper_32_bits(addr);
> > -	dw[i++] = MI_NOOP;
> > -	dw[i++] = MI_NOOP;
> > +	emit_atomic(q->gt, &cs[i], dw, sizeof(dw));
> >  
> > -	return i;
> > +	return i + j;
> >  }
> >  
> >  /**
> > @@ -1049,7 +1100,7 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
> >  	/* Calculate Batch buffer size */
> >  	batch_size = 0;
> >  	while (size) {
> > -		batch_size += 10; /* Flush + ggtt addr + 2 NOP */
> > +		batch_size += EMIT_FLUSH_INVALIDATE_DW * 2; /* Flush + ggtt addr + 1 NOP */
> >  		u64 ccs_ofs, ccs_size;
> >  		u32 ccs_pt;
> >  
> > @@ -1090,7 +1141,7 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
> >  	 * sizes here again before copy command is emitted.
> >  	 */
> >  	while (size) {
> > -		batch_size += 10; /* Flush + ggtt addr + 2 NOP */
> > +		batch_size += EMIT_FLUSH_INVALIDATE_DW * 2; /* Flush + ggtt addr + 1 NOP */
> >  		u32 flush_flags = 0;
> >  		u64 ccs_ofs, ccs_size;
> >  		u32 ccs_pt;
> > @@ -1113,11 +1164,11 @@ int xe_migrate_ccs_rw_copy(struct xe_tile *tile, struct xe_exec_queue *q,
> >  
> >  		emit_pte(m, bb, ccs_pt, false, false, &ccs_it, ccs_size, src);
> >  
> > -		bb->len = emit_flush_invalidate(bb->cs, bb->len, flush_flags);
> > +		bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
> >  		flush_flags = xe_migrate_ccs_copy(m, bb, src_L0_ofs, src_is_pltt,
> >  						  src_L0_ofs, dst_is_pltt,
> >  						  src_L0, ccs_ofs, true);
> > -		bb->len = emit_flush_invalidate(bb->cs, bb->len, flush_flags);
> > +		bb->len = emit_flush_invalidate(q, bb->cs, bb->len, flush_flags);
> >  
> >  		size -= src_L0;
> >  	}
> > -- 
> > 2.51.0
> >

next prev parent reply	other threads:[~2025-10-09 13:00 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-08 10:11 [PATCH v5 0/3] drm/xe/migrate: Atomicize CCS copy command setup Satyanarayana K V P
2025-10-08 10:11 ` [PATCH v5 1/3] " Satyanarayana K V P
2025-10-08 22:58   ` Matthew Brost
2025-10-09 13:00     ` Rodrigo Vivi [this message]
2025-10-09 16:11       ` Matthew Brost
2025-10-09 18:35         ` Rodrigo Vivi
2025-10-09 18:49           ` Matthew Brost
2025-10-09 19:49             ` Rodrigo Vivi
2025-10-09 20:34               ` Michal Wajdeczko
2025-10-09 23:06             ` Matt Roper
2025-10-10  8:41               ` K V P, Satyanarayana
2025-10-10 19:13                 ` Rodrigo Vivi
2025-10-13  4:42                   ` K V P, Satyanarayana
2025-10-08 10:11 ` [PATCH v5 2/3] drm/xe/migrate: Make emit_pte() header write atomic Satyanarayana K V P
2025-10-08 23:27   ` Matthew Brost
2025-10-08 10:11 ` [PATCH v5 3/3] drm/xe/vf: Clear CCS read/write buffers in atomic way Satyanarayana K V P

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aOex-1E8yBlGFbwI@intel.com \
    --to=rodrigo.vivi@intel.com \
    --cc=intel-xe@lists.freedesktop.org \
    --cc=matthew.auld@intel.com \
    --cc=matthew.brost@intel.com \
    --cc=michal.wajdeczko@intel.com \
    --cc=satyanarayana.k.v.p@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.