From: Matthew Brost <matthew.brost@intel.com>
To: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
Cc: <intel-xe@lists.freedesktop.org>, <stuart.summers@intel.com>,
<francois.dugast@intel.com>, <daniele.ceraolospurio@intel.com>,
<michal.wajdeczko@intel.com>
Subject: Re: [PATCH v3 3/3] drm/xe: Move LRC seqno to system memory to avoid slow dGPU reads
Date: Thu, 26 Feb 2026 09:11:20 -0800 [thread overview]
Message-ID: <aaB+uBE28cSkeuBv@lstrano-desk.jf.intel.com> (raw)
In-Reply-To: <95ad0e5b8ed775c2a8948a09ebab75c7acfd1a74.camel@linux.intel.com>
On Thu, Feb 26, 2026 at 01:25:19PM +0100, Thomas Hellström wrote:
> On Tue, 2026-02-17 at 20:33 -0800, Matthew Brost wrote:
> > The LRC seqno is read by the CPU in the fence signaling path. On dGPU
> > that read can turn into a PCIe transaction when the seqno lives in the
> > main LRC BO, making the hot-path poll/peek much more expensive.
> >
> > Allocate a small dedicated seqno BO in system memory and map the seqno
> > and start_seqno fields from there instead. The GPU still updates the
> > values, but CPU reads stay in cached system memory and avoid PCIe read
> > latency.
> >
> > Update the LRC map/address helpers to accept a BO expression and use
> > the new lrc->seqno_bo for seqno mappings. Unpin/unmap seqno_bo during
> > teardown.
>
> I remember this was discussed also when enabling discrete for the i915
> driver but we didn't have any timing information at that time.
>
> Whether this is a good thing depends on the amount of cpu polling per
> seqno bump, but I figure that's typically always at least one CPU read,
> right?
We store every pending fence for each engine class in a linked list, and
on every hardware IRQ we walk all pending fences to check whether they
have signaled (one hardware seqno read per fence).

We do this because with virtual engines we don’t know which hardware
engine instance each submission will run on, and the GuC is free to
reorder anything. Thus, we can’t rely on a head-of-queue (HOQ) check to
short-circuit this or to look at individual hardware engine instances.
Perhaps we could make the logic slightly more advanced to short-circuit
redundant checks on the same context, but let’s put that aside for now.
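To make the cost concrete, here is a rough sketch of the per-IRQ walk described above. This is not the actual Xe code: `struct pending_fence`, `hw_seqno`, and `fence_irq_walk()` are illustrative names only, chosen to show where the per-fence seqno read happens.

```c
/*
 * Illustrative sketch (not the Xe driver code) of walking a pending-fence
 * list on IRQ. Each fence carries a CPU mapping of its context's LRC
 * seqno; checking it costs one read, which was a PCIe transaction when
 * the seqno lived in VRAM.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

struct pending_fence {
	const uint32_t *hw_seqno;	/* CPU mapping of this context's LRC seqno */
	uint32_t seqno;			/* seqno this fence waits for */
	bool signaled;
	struct pending_fence *next;	/* engine-class pending list */
};

/*
 * With GuC reordering there is no head-of-queue shortcut, so every
 * pending fence is checked on every IRQ: one seqno read each.
 */
static unsigned int fence_irq_walk(struct pending_fence *head)
{
	unsigned int signaled = 0;
	struct pending_fence *f;

	for (f = head; f; f = f->next) {
		/* Wrap-safe seqno comparison, in the dma-fence style. */
		if (!f->signaled && (int32_t)(*f->hw_seqno - f->seqno) >= 0) {
			f->signaled = true;
			signaled++;
		}
	}
	return signaled;
}
```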
In bad cases, there may be quite a few fences we need to iterate over on
every IRQ. If hardware fences are in VRAM, this can add up quickly.
I arrived at the 1.5µs improvement in fence signaling by running
xe_exec_system_allocator plus a script that parses GT stats and outputs
the average time spent in xe_svm_copy, comparing the patched and
unpatched versions across 10 runs. Each run does roughly 2k copies and
provides very reliable data; I’ve been using this method for a number of
SVM optimizations I’ve been working on. I believe I also had
clear-on-free enabled in this branch, so in this case exactly one
hardware fence should have been in the linked list per IRQ.
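The parsing side of that methodology is just an average over the per-copy times; a minimal sketch follows. The stats file contents and the `xe_svm_copy_us` field label are fabricated for illustration (the real GT-stats interface uses different names):

```shell
# Fabricate a few sample GT-stats lines; real stats come from the xe
# GT debugfs interface and look different.
printf 'xe_svm_copy_us: 12\nxe_svm_copy_us: 18\nxe_svm_copy_us: 15\n' \
	> /tmp/gt_stats_sample

# Average the per-copy times, as the parsing script described above does.
awk '/xe_svm_copy_us/ { sum += $2; n++ } END { printf "avg %.1f us\n", sum / n }' \
	/tmp/gt_stats_sample
```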
I would think this is always a good thing given the disparity in read
speeds between cached system memory and VRAM over PCIe.
Matt
>
> >
> > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > ---
> > drivers/gpu/drm/xe/xe_lrc.c | 57 +++++++++++++++++++------------
> > drivers/gpu/drm/xe/xe_lrc_types.h | 6 ++++
> > 2 files changed, 42 insertions(+), 21 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/xe/xe_lrc.c b/drivers/gpu/drm/xe/xe_lrc.c
> > index 38f648b98868..d72146313424 100644
> > --- a/drivers/gpu/drm/xe/xe_lrc.c
> > +++ b/drivers/gpu/drm/xe/xe_lrc.c
> > @@ -715,12 +715,13 @@ u32 xe_lrc_pphwsp_offset(struct xe_lrc *lrc)
> > #define __xe_lrc_pphwsp_offset xe_lrc_pphwsp_offset
> > #define __xe_lrc_regs_offset xe_lrc_regs_offset
> >
> > -#define LRC_SEQNO_PPHWSP_OFFSET 512
> > -#define LRC_START_SEQNO_PPHWSP_OFFSET (LRC_SEQNO_PPHWSP_OFFSET + 8)
> > -#define LRC_CTX_JOB_TIMESTAMP_OFFSET (LRC_START_SEQNO_PPHWSP_OFFSET + 8)
> > +#define LRC_CTX_JOB_TIMESTAMP_OFFSET 512
> > #define LRC_ENGINE_ID_PPHWSP_OFFSET 1024
> > #define LRC_PARALLEL_PPHWSP_OFFSET 2048
> >
> > +#define LRC_SEQNO_OFFSET 0
> > +#define LRC_START_SEQNO_OFFSET (LRC_SEQNO_OFFSET + 8)
> > +
> > u32 xe_lrc_regs_offset(struct xe_lrc *lrc)
> > {
> > return xe_lrc_pphwsp_offset(lrc) + LRC_PPHWSP_SIZE;
> > @@ -747,14 +748,12 @@ size_t xe_lrc_skip_size(struct xe_device *xe)
> >
> > static inline u32 __xe_lrc_seqno_offset(struct xe_lrc *lrc)
> > {
> > - /* The seqno is stored in the driver-defined portion of PPHWSP */
> > - return xe_lrc_pphwsp_offset(lrc) + LRC_SEQNO_PPHWSP_OFFSET;
> > + return LRC_SEQNO_OFFSET;
> > }
> >
> > static inline u32 __xe_lrc_start_seqno_offset(struct xe_lrc *lrc)
> > {
> > - /* The start seqno is stored in the driver-defined portion of PPHWSP */
> > - return xe_lrc_pphwsp_offset(lrc) + LRC_START_SEQNO_PPHWSP_OFFSET;
> > + return LRC_START_SEQNO_OFFSET;
> > }
> >
> > static u32 __xe_lrc_ctx_job_timestamp_offset(struct xe_lrc *lrc)
> > @@ -805,10 +804,11 @@ static inline u32 __xe_lrc_wa_bb_offset(struct xe_lrc *lrc)
> > return xe_bo_size(lrc->bo) - LRC_WA_BB_SIZE;
> > }
> >
> > -#define DECL_MAP_ADDR_HELPERS(elem) \
> > +#define DECL_MAP_ADDR_HELPERS(elem, bo_expr) \
> > static inline struct iosys_map __xe_lrc_##elem##_map(struct xe_lrc *lrc) \
> > { \
> > - struct iosys_map map = lrc->bo->vmap; \
> > + struct xe_bo *bo = (bo_expr); \
> > + struct iosys_map map = bo->vmap; \
> > \
> > xe_assert(lrc_to_xe(lrc), !iosys_map_is_null(&map)); \
> > iosys_map_incr(&map, __xe_lrc_##elem##_offset(lrc)); \
> > @@ -816,20 +816,22 @@ static inline struct iosys_map __xe_lrc_##elem##_map(struct xe_lrc *lrc) \
> > } \
> > static inline u32 __maybe_unused __xe_lrc_##elem##_ggtt_addr(struct xe_lrc *lrc) \
> > { \
> > - return xe_bo_ggtt_addr(lrc->bo) + __xe_lrc_##elem##_offset(lrc); \
> > + struct xe_bo *bo = (bo_expr); \
> > +\
> > + return xe_bo_ggtt_addr(bo) + __xe_lrc_##elem##_offset(lrc); \
> > } \
> >
> > -DECL_MAP_ADDR_HELPERS(ring)
> > -DECL_MAP_ADDR_HELPERS(pphwsp)
> > -DECL_MAP_ADDR_HELPERS(seqno)
> > -DECL_MAP_ADDR_HELPERS(regs)
> > -DECL_MAP_ADDR_HELPERS(start_seqno)
> > -DECL_MAP_ADDR_HELPERS(ctx_job_timestamp)
> > -DECL_MAP_ADDR_HELPERS(ctx_timestamp)
> > -DECL_MAP_ADDR_HELPERS(ctx_timestamp_udw)
> > -DECL_MAP_ADDR_HELPERS(parallel)
> > -DECL_MAP_ADDR_HELPERS(indirect_ring)
> > -DECL_MAP_ADDR_HELPERS(engine_id)
> > +DECL_MAP_ADDR_HELPERS(ring, lrc->bo)
> > +DECL_MAP_ADDR_HELPERS(pphwsp, lrc->bo)
> > +DECL_MAP_ADDR_HELPERS(seqno, lrc->seqno_bo)
> > +DECL_MAP_ADDR_HELPERS(regs, lrc->bo)
> > +DECL_MAP_ADDR_HELPERS(start_seqno, lrc->seqno_bo)
> > +DECL_MAP_ADDR_HELPERS(ctx_job_timestamp, lrc->bo)
> > +DECL_MAP_ADDR_HELPERS(ctx_timestamp, lrc->bo)
> > +DECL_MAP_ADDR_HELPERS(ctx_timestamp_udw, lrc->bo)
> > +DECL_MAP_ADDR_HELPERS(parallel, lrc->bo)
> > +DECL_MAP_ADDR_HELPERS(indirect_ring, lrc->bo)
> > +DECL_MAP_ADDR_HELPERS(engine_id, lrc->bo)
> >
> > #undef DECL_MAP_ADDR_HELPERS
> >
> > @@ -1036,6 +1038,7 @@ static void xe_lrc_finish(struct xe_lrc *lrc)
> > {
> > xe_hw_fence_ctx_finish(&lrc->fence_ctx);
> > xe_bo_unpin_map_no_vm(lrc->bo);
> > + xe_bo_unpin_map_no_vm(lrc->seqno_bo);
> > }
> >
> > /*
> > @@ -1445,6 +1448,7 @@ static int xe_lrc_init(struct xe_lrc *lrc, struct xe_hw_engine *hwe,
> > u32 bo_size = ring_size + lrc_size + LRC_WA_BB_SIZE;
> > struct xe_tile *tile = gt_to_tile(gt);
> > struct xe_device *xe = gt_to_xe(gt);
> > + struct xe_bo *seqno_bo;
> > struct iosys_map map;
> > u32 arb_enable;
> > u32 bo_flags;
> > @@ -1479,6 +1483,17 @@ static int xe_lrc_init(struct xe_lrc *lrc, struct xe_hw_engine *hwe,
> > if (IS_ERR(lrc->bo))
> > return PTR_ERR(lrc->bo);
> >
> > + seqno_bo = xe_bo_create_pin_map_novm(xe, tile, PAGE_SIZE,
> > + ttm_bo_type_kernel,
> > + XE_BO_FLAG_GGTT |
> > + XE_BO_FLAG_GGTT_INVALIDATE |
> > + XE_BO_FLAG_SYSTEM, false);
>
> XE_BO_FLAG_PINNED_NORESTORE?
>
> Thanks,
> Thomas
>
>
> > + if (IS_ERR(seqno_bo)) {
> > + err = PTR_ERR(seqno_bo);
> > + goto err_lrc_finish;
> > + }
> > + lrc->seqno_bo = seqno_bo;
> > +
> > xe_hw_fence_ctx_init(&lrc->fence_ctx, hwe->gt,
> > hwe->fence_irq, hwe->name);
> >
> > diff --git a/drivers/gpu/drm/xe/xe_lrc_types.h b/drivers/gpu/drm/xe/xe_lrc_types.h
> > index a4373d280c39..5a718f759ed6 100644
> > --- a/drivers/gpu/drm/xe/xe_lrc_types.h
> > +++ b/drivers/gpu/drm/xe/xe_lrc_types.h
> > @@ -22,6 +22,12 @@ struct xe_lrc {
> > */
> > struct xe_bo *bo;
> >
> > + /**
> > + * @seqno_bo: Buffer object (memory) for seqno numbers. Always in
> > + * system memory as this is a CPU read, GPU write path object.
> > + */
> > + struct xe_bo *seqno_bo;
> > +
> > /** @size: size of the lrc and optional indirect ring state */
> > u32 size;
> >