Intel-XE Archive on lore.kernel.org
From: Matthew Brost <matthew.brost@intel.com>
To: Matthew Auld <matthew.auld@intel.com>
Cc: <intel-xe@lists.freedesktop.org>
Subject: Re: [PATCH 5/6] drm/xe/migrate: support MEM_COPY instruction
Date: Thu, 16 Oct 2025 14:26:18 -0700
Message-ID: <aPFi+mThPIgY27Us@lstrano-desk.jf.intel.com>
In-Reply-To: <aPE9jWAlrUfmb4ve@lstrano-desk.jf.intel.com>

On Thu, Oct 16, 2025 at 11:46:37AM -0700, Matthew Brost wrote:
> On Thu, Oct 16, 2025 at 10:41:33AM +0100, Matthew Auld wrote:
> > On 16/10/2025 01:58, Matthew Brost wrote:
> > > On Wed, Oct 15, 2025 at 03:19:35PM +0100, Matthew Auld wrote:
> > > > Make this the default on xe2+ when doing a copy. This has a few
> > > > advantages over the existing copy instruction:
> > > > 
> > > > 1) It has a special PAGE_COPY mode that claims to be optimised for
> > > >     page-in/page-out, which is the vast majority of current users.
> > > > 
> > > > 2) It also has a simple BYTE_COPY mode that supports byte granularity
> > > >     copying without any restrictions.
> > > > 
> > > > With 2) we can now easily skip the bounce buffer flow when copying
> > > > buffers with strange sizing/alignment, like for memory_access. But that
> > > > is left for the next patch.
> > > > 
> > > 
> > > Have you tested whether this series has an effect on the bandwidth of copies?
> > 
> > I only tested it from a functional POV. Main interest for this series was
> > with 2) atm.
> > 
> > > 
> > > We have some SVM tests which can measure this bandwidth rather
> > > effectively. I can give these tests a try, but it may take a few days.
> > > 
> > > With that, feel free to break out the first 4 patches into an individual
> > > series while we explore the effects on bandwidth for the last two
> > > patches.
> > 
> > Sounds good. Can you point me to those SVM tests? I see some fault and
> > pre-fetch benchmarks in IGT, is it those? I can try them.
> > 
> 
> Yes, the prefetch benchmark test is a good one but it is software
> limited atm so might not give the best view.
> 
> Running 'xe_exec_system_allocator --r many-large-malloc' and then
> looking at the GT stats, the copy bandwidth can be derived. I have
> scripts that do this; I believe Francois uploaded these somewhere
> internally, but here is a public link to a script which parses these [1].
> 
> I can try to find time to see the bandwidth before / after this series
> today and report back.
> 

I didn’t observe a noticeable performance drop when using MEM_COPY_CMD
in the SVM tests. However, for various reasons, this path is still
software-limited in the KMD. Once we land additional software
optimizations to accelerate the copies, switching between commands will
be straightforward. So, there’s no performance concern with these
changes.

> Matt
> 
> [1] https://pastebin.com/rZZN5sgh
> 
> > > 
> > > Matt
> > > 
> > > > BSpec: 57561
> > > > Signed-off-by: Matthew Auld <matthew.auld@intel.com>
> > > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > > ---
> > > >   .../gpu/drm/xe/instructions/xe_gpu_commands.h |  6 ++
> > > >   drivers/gpu/drm/xe/xe_migrate.c               | 64 ++++++++++++++++---
> > > >   2 files changed, 61 insertions(+), 9 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/xe/instructions/xe_gpu_commands.h b/drivers/gpu/drm/xe/instructions/xe_gpu_commands.h
> > > > index 8cfcd3360896..5d41ca297447 100644
> > > > --- a/drivers/gpu/drm/xe/instructions/xe_gpu_commands.h
> > > > +++ b/drivers/gpu/drm/xe/instructions/xe_gpu_commands.h
> > > > @@ -31,6 +31,12 @@
> > > >   #define   XY_FAST_COPY_BLT_D1_DST_TILE4	REG_BIT(30)
> > > >   #define   XE2_XY_FAST_COPY_BLT_MOCS_INDEX_MASK	GENMASK(23, 20)
> > > > +#define MEM_COPY_CMD (2 << 29 | 0x5a << 22 | 0x8)
> > > > +#define   MEM_COPY_PAGE_COPY_MODE REG_BIT(19)
> > > > +#define   MEM_COPY_MATRIX_COPY REG_BIT(17)
> > > > +#define   MEM_COPY_SRC_MOCS_INDEX_MASK	GENMASK(31, 28)
> > > > +#define   MEM_COPY_DST_MOCS_INDEX_MASK	GENMASK(6, 3)
> > > > +
> > > >   #define	PVC_MEM_SET_CMD		(2 << 29 | 0x5b << 22)
> > > >   #define   PVC_MEM_SET_CMD_LEN_DW	7
> > > >   #define   PVC_MEM_SET_MATRIX		REG_BIT(17)
> > > > diff --git a/drivers/gpu/drm/xe/xe_migrate.c b/drivers/gpu/drm/xe/xe_migrate.c
> > > > index 3801152b7f8f..da1fefb96070 100644
> > > > --- a/drivers/gpu/drm/xe/xe_migrate.c
> > > > +++ b/drivers/gpu/drm/xe/xe_migrate.c
> > > > @@ -699,37 +699,83 @@ static void emit_copy_ccs(struct xe_gt *gt, struct xe_bb *bb,
> > > >   }
> > > >   #define EMIT_COPY_DW 10
> > > > -static void emit_copy(struct xe_gt *gt, struct xe_bb *bb,
> > > > -		      u64 src_ofs, u64 dst_ofs, unsigned int size,
> > > > -		      unsigned int pitch)
> > > > +static void emit_xy_fast_copy(struct xe_gt *gt, struct xe_bb *bb, u64 src_ofs,
> > > > +			      u64 dst_ofs, unsigned int size,
> > > > +			      unsigned int pitch)
> > > >   {
> > > >   	struct xe_device *xe = gt_to_xe(gt);
> > > > -	u32 mocs = 0;
> > > >   	u32 tile_y = 0;
> > > > +	xe_gt_assert(gt, GRAPHICS_VER(xe) < 20);
> > > >   	xe_gt_assert(gt, !(pitch & 3));
> > > >   	xe_gt_assert(gt, size / pitch <= S16_MAX);
> > > >   	xe_gt_assert(gt, pitch / 4 <= S16_MAX);
> > > >   	xe_gt_assert(gt, pitch <= U16_MAX);
> > > > -	if (GRAPHICS_VER(xe) >= 20)
> > > > -		mocs = FIELD_PREP(XE2_XY_FAST_COPY_BLT_MOCS_INDEX_MASK, gt->mocs.uc_index);
> > > > -

Can we keep this part in case we want to experiment with switching
between commands on Xe2+? It isn't a huge amount of code to carry in
emit_xy_fast_copy to support Xe2+.
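
To illustrate what I mean, here is a rough userspace sketch rather than
driver code. xy_fast_copy_mocs() is a hypothetical helper name, the graphics
version is passed in as a plain integer, and the mask mirrors
XE2_XY_FAST_COPY_BLT_MOCS_INDEX_MASK from the header above:

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors XE2_XY_FAST_COPY_BLT_MOCS_INDEX_MASK, i.e. GENMASK(23, 20). */
#define XY_FAST_COPY_MOCS_SHIFT	20
#define XY_FAST_COPY_MOCS_MASK	(0xfu << XY_FAST_COPY_MOCS_SHIFT)

/*
 * Hypothetical helper: returns the MOCS bits to OR into the two dwords
 * that carry the pitch in XY_FAST_COPY_BLT. Pre-Xe2 gets 0, exactly as
 * in the code before this patch.
 */
static uint32_t xy_fast_copy_mocs(unsigned int graphics_ver, uint32_t uc_index)
{
	if (graphics_ver < 20)
		return 0;

	/* Stand-in for FIELD_PREP(): the index must fit the 4-bit field. */
	assert(uc_index <= 0xf);
	return (uc_index << XY_FAST_COPY_MOCS_SHIFT) & XY_FAST_COPY_MOCS_MASK;
}
```

If the MOCS handling stays like this, the new GRAPHICS_VER(xe) < 20 assert in
emit_xy_fast_copy would presumably have to go as well.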

> > > >   	if (GRAPHICS_VERx100(xe) >= 1250)
> > > >   		tile_y = XY_FAST_COPY_BLT_D1_SRC_TILE4 | XY_FAST_COPY_BLT_D1_DST_TILE4;
> > > >   	bb->cs[bb->len++] = XY_FAST_COPY_BLT_CMD | (10 - 2);
> > > > -	bb->cs[bb->len++] = XY_FAST_COPY_BLT_DEPTH_32 | pitch | tile_y | mocs;
> > > > +	bb->cs[bb->len++] = XY_FAST_COPY_BLT_DEPTH_32 | pitch | tile_y;
> > > >   	bb->cs[bb->len++] = 0;
> > > >   	bb->cs[bb->len++] = (size / pitch) << 16 | pitch / 4;
> > > >   	bb->cs[bb->len++] = lower_32_bits(dst_ofs);
> > > >   	bb->cs[bb->len++] = upper_32_bits(dst_ofs);
> > > >   	bb->cs[bb->len++] = 0;
> > > > -	bb->cs[bb->len++] = pitch | mocs;
> > > > +	bb->cs[bb->len++] = pitch;
> > > >   	bb->cs[bb->len++] = lower_32_bits(src_ofs);
> > > >   	bb->cs[bb->len++] = upper_32_bits(src_ofs);
> > > >   }
> > > > +static void emit_mem_copy(struct xe_gt *gt, struct xe_bb *bb, u64 src_ofs,
> > > > +			  u64 dst_ofs, unsigned int size, unsigned int pitch)
> > > > +{
> > > > +	u32 mode, copy_type, width;
> > > > +
> > > > +	xe_gt_assert(gt, IS_ALIGNED(size, pitch));
> > > > +	xe_gt_assert(gt, pitch <= U16_MAX);
> > > > +	xe_gt_assert(gt, size);
> > > > +
> > > > +	if (IS_ALIGNED(size, 256) &&
> > > > +	    IS_ALIGNED(lower_32_bits(src_ofs), 256) &&
> > > > +	    IS_ALIGNED(lower_32_bits(dst_ofs), 256)) {

s/256/SZ_256 or perhaps a define for page copy mode alignment
requirements?
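
Something along these lines is what I have in mind. This is only a userspace
sketch: MEM_COPY_PAGE_COPY_ALIGN, mem_copy_can_page_copy() and
mem_copy_width() are hypothetical names, and the 256 value is taken straight
from the checks in this patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name for the page copy mode granularity used in the patch. */
#define MEM_COPY_PAGE_COPY_ALIGN	256u

/* Userspace stand-in for the kernel's IS_ALIGNED() (power-of-two only). */
static bool is_aligned(uint64_t val, uint64_t align)
{
	return (val & (align - 1)) == 0;
}

/*
 * Page copy mode is only usable when size and both offsets meet the
 * alignment. Checking the full 64-bit offset gives the same answer as
 * checking lower_32_bits() for a 256 byte alignment.
 */
static bool mem_copy_can_page_copy(uint64_t src_ofs, uint64_t dst_ofs,
				   unsigned int size)
{
	return is_aligned(size, MEM_COPY_PAGE_COPY_ALIGN) &&
	       is_aligned(src_ofs, MEM_COPY_PAGE_COPY_ALIGN) &&
	       is_aligned(dst_ofs, MEM_COPY_PAGE_COPY_ALIGN);
}

/* Width before the "- 1": 256 byte units for page copy, else the pitch. */
static uint32_t mem_copy_width(uint64_t src_ofs, uint64_t dst_ofs,
			       unsigned int size, unsigned int pitch)
{
	uint32_t width = mem_copy_can_page_copy(src_ofs, dst_ofs, size) ?
			 size / MEM_COPY_PAGE_COPY_ALIGN : pitch;

	assert(width && width <= UINT16_MAX);
	return width;
}
```

The width = size / 256 in the page copy branch could then use the same define.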

Nits aside, everything LGTM.
Matt

> > > > +		mode = MEM_COPY_PAGE_COPY_MODE;
> > > > +		copy_type = 0; /* linear copy */
> > > > +		width = size / 256;
> > > > +	} else {
> > > > +		xe_gt_assert(gt, size / pitch <= U16_MAX);
> > > > +		mode = 0; /* BYTE_COPY */
> > > > +		copy_type = MEM_COPY_MATRIX_COPY;
> > > > +		width = pitch;
> > > > +	}
> > > > +
> > > > +	xe_gt_assert(gt, width <= U16_MAX);
> > > > +
> > > > +	bb->cs[bb->len++] = MEM_COPY_CMD | mode | copy_type;
> > > > +	bb->cs[bb->len++] = width - 1;
> > > > +	bb->cs[bb->len++] = size / pitch - 1; /* ignored by hw for page copy above */
> > > > +	bb->cs[bb->len++] = pitch - 1;
> > > > +	bb->cs[bb->len++] = pitch - 1;
> > > > +	bb->cs[bb->len++] = lower_32_bits(src_ofs);
> > > > +	bb->cs[bb->len++] = upper_32_bits(src_ofs);
> > > > +	bb->cs[bb->len++] = lower_32_bits(dst_ofs);
> > > > +	bb->cs[bb->len++] = upper_32_bits(dst_ofs);
> > > > +	bb->cs[bb->len++] = FIELD_PREP(MEM_COPY_SRC_MOCS_INDEX_MASK, gt->mocs.uc_index) |
> > > > +			    FIELD_PREP(MEM_COPY_DST_MOCS_INDEX_MASK, gt->mocs.uc_index);
> > > > +}
> > > > +
> > > > +static void emit_copy(struct xe_gt *gt, struct xe_bb *bb,
> > > > +		      u64 src_ofs, u64 dst_ofs, unsigned int size,
> > > > +		      unsigned int pitch)
> > > > +{
> > > > +	struct xe_device *xe = gt_to_xe(gt);
> > > > +
> > > > +	if (GRAPHICS_VER(xe) >= 20)

Would it be better to stick this in xe_pci.c / xe_device.info rather
than inline IP version check?
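
Roughly what I am picturing, as a userspace sketch only. The struct, the
has_mem_copy_instr flag and the helper names are all hypothetical and not
existing xe fields; the real thing would be set from the platform
descriptors at probe time:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the device info; has_mem_copy_instr is a hypothetical flag. */
struct fake_xe_info {
	unsigned int graphics_ver;
	bool has_mem_copy_instr;
};

enum copy_cmd {
	COPY_CMD_XY_FAST_COPY,
	COPY_CMD_MEM_COPY,
};

/* Decide once at init time instead of at every emit_copy() call. */
static void fake_info_init(struct fake_xe_info *info, unsigned int graphics_ver)
{
	info->graphics_ver = graphics_ver;
	info->has_mem_copy_instr = graphics_ver >= 20;
}

/* The emit path then only looks at the feature flag. */
static enum copy_cmd pick_copy_cmd(const struct fake_xe_info *info)
{
	return info->has_mem_copy_instr ? COPY_CMD_MEM_COPY :
					  COPY_CMD_XY_FAST_COPY;
}
```

A flag like this would also make it easy to flip between the two commands
when experimenting on Xe2+.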

Nits aside, patch looks correct.

Matt

> > > > +		emit_mem_copy(gt, bb, src_ofs, dst_ofs, size, pitch);
> > > > +	else
> > > > +		emit_xy_fast_copy(gt, bb, src_ofs, dst_ofs, size, pitch);
> > > > +}
> > > > +
> > > >   static u64 xe_migrate_batch_base(struct xe_migrate *m, bool usm)
> > > >   {
> > > >   	return usm ? m->usm_batch_base_ofs : m->batch_base_ofs;
> > > > -- 
> > > > 2.51.0
> > > > 
> > 

Thread overview: 20+ messages
2025-10-15 14:19 [PATCH 0/6] Some migration fixes/improvements Matthew Auld
2025-10-15 14:19 ` [PATCH 1/6] drm/xe/migrate: rework size restrictions for sram pte emit Matthew Auld
2025-10-16  0:36   ` Matthew Brost
2025-10-15 14:19 ` [PATCH 2/6] drm/xe/migrate: fix chunk handling for 2M page emit Matthew Auld
2025-10-16  0:34   ` Matthew Brost
2025-10-15 14:19 ` [PATCH 3/6] drm/xe/migrate: fix batch buffer sizing Matthew Auld
2025-10-16  0:36   ` Matthew Brost
2025-10-15 14:19 ` [PATCH 4/6] drm/xe/migrate: trim " Matthew Auld
2025-10-16  0:38   ` Matthew Brost
2025-10-15 14:19 ` [PATCH 5/6] drm/xe/migrate: support MEM_COPY instruction Matthew Auld
2025-10-16  0:58   ` Matthew Brost
2025-10-16  9:41     ` Matthew Auld
2025-10-16 18:46       ` Matthew Brost
2025-10-16 21:26         ` Matthew Brost [this message]
2025-10-17 11:23           ` Matthew Auld
2025-10-15 14:19 ` [PATCH 6/6] drm/xe/migrate: skip bounce buffer path on xe2 Matthew Auld
2025-10-16 21:28   ` Matthew Brost
2025-10-15 22:58 ` ✓ CI.KUnit: success for Some migration fixes/improvements Patchwork
2025-10-15 23:52 ` ✓ Xe.CI.BAT: " Patchwork
2025-10-16 16:08 ` ✓ Xe.CI.Full: " Patchwork
