Intel-XE Archive on lore.kernel.org
From: Matthew Brost <matthew.brost@intel.com>
To: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
Cc: <intel-xe@lists.freedesktop.org>, <stuart.summers@intel.com>,
	<francois.dugast@intel.com>, <daniele.ceraolospurio@intel.com>,
	<michal.wajdeczko@intel.com>
Subject: Re: [PATCH v3 1/3] drm/xe: Split H2G and G2H into separate buffer objects
Date: Wed, 25 Feb 2026 10:08:11 -0800	[thread overview]
Message-ID: <aZ86i8DWtNikdN6k@lstrano-desk.jf.intel.com> (raw)
In-Reply-To: <b052bef6590a73a3ea726b81b1af55dea36db272.camel@linux.intel.com>

On Wed, Feb 25, 2026 at 11:55:28AM +0100, Thomas Hellström wrote:
> On Tue, 2026-02-24 at 08:12 -0800, Matthew Brost wrote:
> > On Tue, Feb 24, 2026 at 04:58:35PM +0100, Thomas Hellström wrote:
> > > On Tue, 2026-02-17 at 20:33 -0800, Matthew Brost wrote:
> > > > H2G and G2H buffers have different access patterns (H2G is
> > > > CPU-write, GuC-read, while G2H is GPU-write, CPU-read). On dGPU,
> > > > these patterns benefit from different memory placements: H2G in
> > > > VRAM and G2H in system memory. Split the CT buffer into two
> > > > separate buffers—one for H2G and one for G2H—and select the
> > > > optimal placement for each.
> > > > 
> > > > This provides a significant performance improvement on the G2H
> > > > read path, reducing a single read from ~20 µs to under 1 µs on
> > > > BMG.
> > > > 
> > > > Signed-off-by: Matthew Brost <matthew.brost@intel.com>
> > > 
> > > Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > 
> > > Perhaps one could experiment with reading the data from the g2h bo
> > > using MOVNTDQA, like the write-combining memcpy. That would avoid
> > > caching the data and the GuC having to invalidate the cache line
> > > while snooping on the next write.
> > 
> > We can try that, but G2H messages are variable-sized, so I believe it
> > will get a little tricky. Once these are system-memory reads, I recall
> > G2H handling keeping up with something like 15 page-fault G2Hs per µs
> > (maybe that isn't correct — I'll double-check), and that included my
> > not-yet-posted caching implementation, which also takes a spinlock,
> > examines the page-fault cache, and chains the fault onto a list. So I
> > don't think this will end up in the critical path.
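
For reference, the kind of non-temporal read Thomas is describing looks
roughly like the sketch below. This is illustrative userspace code, not
the Xe helpers: `nt_copy` is a made-up name, the 16-byte source alignment
is an assumption, and the real CTB path would have to cope with
variable-sized messages wrapping the ring.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical MOVNTDQA-based read loop: _mm_stream_load_si128 pulls
 * the data without installing it in the cache hierarchy, so the device
 * never has to snoop-invalidate our lines on its next write.
 */
__attribute__((target("sse4.1")))
static void nt_copy(void *dst, const void *src, size_t bytes)
{
	const __m128i *s = (const __m128i *)src;
	__m128i *d = (__m128i *)dst;
	size_t i;

	/* MOVNTDQA requires a 16-byte-aligned source */
	assert(((uintptr_t)src & 15) == 0 && (bytes & 15) == 0);

	for (i = 0; i < bytes / 16; i++)
		/* unaligned store: dst need not be 16-byte aligned */
		_mm_storeu_si128(&d[i], _mm_stream_load_si128((__m128i *)&s[i]));

	/* order the streaming loads before the data is consumed */
	_mm_mfence();
}
```

A variable-sized G2H would need a header peek first to learn the message
length, which is where the trickiness mentioned above comes in.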
> > 
> > > 
> > > But that should probably have less impact, though it could perhaps
> > > speed up GuC writes.
> > 
> > We can play around with this and bounce ideas around. On the H2G side,
> > xe_device_wmb() before the tail update is actually fairly expensive,
> > so if we can use some asm instructions to avoid that, it might be
> > worthwhile.
> 
> That is probably the latency of flushing out buffered writes. I don't
> think there is much to be done to improve that.
> 

What I came up with is advancing the tail and issuing xe_device_wmb()
only periodically when acking a storm of faults. For example, if we need
to ack 40 faults at once, we send out the first few acks immediately,
then rate-control the rest. This gives roughly a 5× speedup (e.g.,
increasing from ~3 H2Gs per µs to ~15 H2Gs per µs).
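
In simplified form, the idea is the uniform-batching sketch below. All
names (`publish_tail`, `ack_faults`, `ACK_BATCH`) and the batch size are
hypothetical, and the real logic in the not-yet-posted series also
front-loads the first few acks rather than batching uniformly.

```c
#include <assert.h>

/*
 * Sketch of rate-controlling write barriers when acking a fault storm:
 * instead of one barrier per ack, publish the H2G tail (plus barrier)
 * only every ACK_BATCH acks and once at the end.
 */
#define ACK_BATCH 8

static unsigned int wmb_count;	/* stands in for xe_device_wmb() cost */
static unsigned int tail;	/* stands in for the H2G ring tail */

static void publish_tail(void)
{
	/* real code: write tail to the CTB descriptor, then xe_device_wmb() */
	wmb_count++;
}

static void ack_faults(unsigned int nr_faults)
{
	unsigned int i;

	for (i = 0; i < nr_faults; i++) {
		tail++;		/* queue one ack H2G */
		if ((i + 1) % ACK_BATCH == 0)
			publish_tail();
	}
	if (nr_faults % ACK_BATCH)
		publish_tail();	/* make any remaining acks visible */
}
```

With 40 queued acks this issues 5 barriers instead of 40, which is where
the ~5× figure above comes from.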

I'm asking the GuC to apply the same strategy when generating G2Hs
during fault storms as well.

My KMD portion of this should be posted shortly as part of fault storm
caching.

Matt

> /Thomas
> 
> > 
> > Matt
> > 
> > > 
> > > /Thomas
> > > 
> > > 
> > > > 
> > > > ---
> > > > v3:
> > > >  - Move BO to ctbs h2g or g2h structure (Michal)
> > > > ---
> > > >  drivers/gpu/drm/xe/xe_guc_ct.c       | 67 +++++++++++++++++++---------
> > > >  drivers/gpu/drm/xe/xe_guc_ct_types.h |  4 +-
> > > >  2 files changed, 47 insertions(+), 24 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/xe/xe_guc_ct.c b/drivers/gpu/drm/xe/xe_guc_ct.c
> > > > index 8a45573f8812..ea07a27757d5 100644
> > > > --- a/drivers/gpu/drm/xe/xe_guc_ct.c
> > > > +++ b/drivers/gpu/drm/xe/xe_guc_ct.c
> > > > @@ -255,6 +255,7 @@ static bool g2h_fence_needs_alloc(struct g2h_fence *g2h_fence)
> > > >  
> > > >  #define CTB_DESC_SIZE		ALIGN(sizeof(struct guc_ct_buffer_desc), SZ_2K)
> > > >  #define CTB_H2G_BUFFER_OFFSET	(CTB_DESC_SIZE * 2)
> > > > +#define CTB_G2H_BUFFER_OFFSET	(CTB_DESC_SIZE * 2)
> > > >  #define CTB_H2G_BUFFER_SIZE	(SZ_4K)
> > > >  #define CTB_H2G_BUFFER_DWORDS	(CTB_H2G_BUFFER_SIZE / sizeof(u32))
> > > >  #define CTB_G2H_BUFFER_SIZE	(SZ_128K)
> > > > @@ -279,10 +280,14 @@ long xe_guc_ct_queue_proc_time_jiffies(struct xe_guc_ct *ct)
> > > >  	return (CTB_H2G_BUFFER_SIZE / SZ_4K) * HZ;
> > > >  }
> > > >  
> > > > -static size_t guc_ct_size(void)
> > > > +static size_t guc_h2g_size(void)
> > > >  {
> > > > -	return CTB_H2G_BUFFER_OFFSET + CTB_H2G_BUFFER_SIZE +
> > > > -		CTB_G2H_BUFFER_SIZE;
> > > > +	return CTB_H2G_BUFFER_OFFSET + CTB_H2G_BUFFER_SIZE;
> > > > +}
> > > > +
> > > > +static size_t guc_g2h_size(void)
> > > > +{
> > > > +	return CTB_G2H_BUFFER_OFFSET + CTB_G2H_BUFFER_SIZE;
> > > >  }
> > > >  
> > > >  static void guc_ct_fini(struct drm_device *drm, void *arg)
> > > > @@ -311,7 +316,8 @@ int xe_guc_ct_init_noalloc(struct xe_guc_ct *ct)
> > > >  	struct xe_gt *gt = ct_to_gt(ct);
> > > >  	int err;
> > > >  
> > > > -	xe_gt_assert(gt, !(guc_ct_size() % PAGE_SIZE));
> > > > +	xe_gt_assert(gt, !(guc_h2g_size() % PAGE_SIZE));
> > > > +	xe_gt_assert(gt, !(guc_g2h_size() % PAGE_SIZE));
> > > >  
> > > >  	err = drmm_mutex_init(&xe->drm, &ct->lock);
> > > >  	if (err)
> > > > @@ -356,7 +362,17 @@ int xe_guc_ct_init(struct xe_guc_ct *ct)
> > > >  	struct xe_tile *tile = gt_to_tile(gt);
> > > >  	struct xe_bo *bo;
> > > >  
> > > > -	bo = xe_managed_bo_create_pin_map(xe, tile, guc_ct_size(),
> > > > +	bo = xe_managed_bo_create_pin_map(xe, tile, guc_h2g_size(),
> > > > +					  XE_BO_FLAG_SYSTEM |
> > > > +					  XE_BO_FLAG_GGTT |
> > > > +					  XE_BO_FLAG_GGTT_INVALIDATE |
> > > > +					  XE_BO_FLAG_PINNED_NORESTORE);
> > > > +	if (IS_ERR(bo))
> > > > +		return PTR_ERR(bo);
> > > > +
> > > > +	ct->ctbs.h2g.bo = bo;
> > > > +
> > > > +	bo = xe_managed_bo_create_pin_map(xe, tile, guc_g2h_size(),
> > > >  					  XE_BO_FLAG_SYSTEM |
> > > >  					  XE_BO_FLAG_GGTT |
> > > >  					  XE_BO_FLAG_GGTT_INVALIDATE |
> > > > @@ -364,7 +380,7 @@ int xe_guc_ct_init(struct xe_guc_ct *ct)
> > > >  	if (IS_ERR(bo))
> > > >  		return PTR_ERR(bo);
> > > >  
> > > > -	ct->bo = bo;
> > > > +	ct->ctbs.g2h.bo = bo;
> > > >  
> > > >  	return devm_add_action_or_reset(xe->drm.dev, guc_action_disable_ct, ct);
> > > >  }
> > > > @@ -389,7 +405,7 @@ int xe_guc_ct_init_post_hwconfig(struct xe_guc_ct *ct)
> > > >  	xe_assert(xe, !xe_guc_ct_enabled(ct));
> > > >  
> > > >  	if (IS_DGFX(xe)) {
> > > > -		ret = xe_managed_bo_reinit_in_vram(xe, tile, &ct->bo);
> > > > +		ret = xe_managed_bo_reinit_in_vram(xe, tile, &ct->ctbs.h2g.bo);
> > > >  		if (ret)
> > > >  			return ret;
> > > >  	}
> > > > @@ -439,8 +455,7 @@ static void guc_ct_ctb_g2h_init(struct xe_device *xe, struct guc_ctb *g2h,
> > > >  	g2h->desc = IOSYS_MAP_INIT_OFFSET(map, CTB_DESC_SIZE);
> > > >  	xe_map_memset(xe, &g2h->desc, 0, 0, sizeof(struct guc_ct_buffer_desc));
> > > >  
> > > > -	g2h->cmds = IOSYS_MAP_INIT_OFFSET(map, CTB_H2G_BUFFER_OFFSET +
> > > > -					   CTB_H2G_BUFFER_SIZE);
> > > > +	g2h->cmds = IOSYS_MAP_INIT_OFFSET(map, CTB_G2H_BUFFER_OFFSET);
> > > >  }
> > > >  
> > > >  static int guc_ct_ctb_h2g_register(struct xe_guc_ct *ct)
> > > > @@ -449,8 +464,8 @@ static int guc_ct_ctb_h2g_register(struct xe_guc_ct *ct)
> > > >  	u32 desc_addr, ctb_addr, size;
> > > >  	int err;
> > > >  
> > > > -	desc_addr = xe_bo_ggtt_addr(ct->bo);
> > > > -	ctb_addr = xe_bo_ggtt_addr(ct->bo) + CTB_H2G_BUFFER_OFFSET;
> > > > +	desc_addr = xe_bo_ggtt_addr(ct->ctbs.h2g.bo);
> > > > +	ctb_addr = xe_bo_ggtt_addr(ct->ctbs.h2g.bo) + CTB_H2G_BUFFER_OFFSET;
> > > >  	size = ct->ctbs.h2g.info.size * sizeof(u32);
> > > >  
> > > >  	err = xe_guc_self_cfg64(guc,
> > > > @@ -476,9 +491,8 @@ static int guc_ct_ctb_g2h_register(struct xe_guc_ct *ct)
> > > >  	u32 desc_addr, ctb_addr, size;
> > > >  	int err;
> > > >  
> > > > -	desc_addr = xe_bo_ggtt_addr(ct->bo) + CTB_DESC_SIZE;
> > > > -	ctb_addr = xe_bo_ggtt_addr(ct->bo) + CTB_H2G_BUFFER_OFFSET +
> > > > -		CTB_H2G_BUFFER_SIZE;
> > > > +	desc_addr = xe_bo_ggtt_addr(ct->ctbs.g2h.bo) + CTB_DESC_SIZE;
> > > > +	ctb_addr = xe_bo_ggtt_addr(ct->ctbs.g2h.bo) + CTB_G2H_BUFFER_OFFSET;
> > > >  	size = ct->ctbs.g2h.info.size * sizeof(u32);
> > > >  
> > > >  	err = xe_guc_self_cfg64(guc,
> > > > @@ -605,9 +619,12 @@ static int __xe_guc_ct_start(struct xe_guc_ct *ct, bool needs_register)
> > > >  	xe_gt_assert(gt, !xe_guc_ct_enabled(ct));
> > > >  
> > > >  	if (needs_register) {
> > > > -		xe_map_memset(xe, &ct->bo->vmap, 0, 0, xe_bo_size(ct->bo));
> > > > -		guc_ct_ctb_h2g_init(xe, &ct->ctbs.h2g, &ct->bo->vmap);
> > > > -		guc_ct_ctb_g2h_init(xe, &ct->ctbs.g2h, &ct->bo->vmap);
> > > > +		xe_map_memset(xe, &ct->ctbs.h2g.bo->vmap, 0, 0,
> > > > +			      xe_bo_size(ct->ctbs.h2g.bo));
> > > > +		xe_map_memset(xe, &ct->ctbs.g2h.bo->vmap, 0, 0,
> > > > +			      xe_bo_size(ct->ctbs.g2h.bo));
> > > > +		guc_ct_ctb_h2g_init(xe, &ct->ctbs.h2g, &ct->ctbs.h2g.bo->vmap);
> > > > +		guc_ct_ctb_g2h_init(xe, &ct->ctbs.g2h, &ct->ctbs.g2h.bo->vmap);
> > > >  
> > > >  		err = guc_ct_ctb_h2g_register(ct);
> > > >  		if (err)
> > > > @@ -624,7 +641,7 @@ static int __xe_guc_ct_start(struct xe_guc_ct *ct, bool needs_register)
> > > >  		ct->ctbs.h2g.info.broken = false;
> > > >  		ct->ctbs.g2h.info.broken = false;
> > > >  		/* Skip everything in H2G buffer */
> > > > -		xe_map_memset(xe, &ct->bo->vmap, CTB_H2G_BUFFER_OFFSET, 0,
> > > > +		xe_map_memset(xe, &ct->ctbs.h2g.bo->vmap, CTB_H2G_BUFFER_OFFSET, 0,
> > > >  			      CTB_H2G_BUFFER_SIZE);
> > > >  	}
> > > >  
> > > > @@ -1963,8 +1980,9 @@ static struct xe_guc_ct_snapshot *guc_ct_snapshot_alloc(struct xe_guc_ct *ct, bo
> > > >  	if (!snapshot)
> > > >  		return NULL;
> > > >  
> > > > -	if (ct->bo && want_ctb) {
> > > > -		snapshot->ctb_size = xe_bo_size(ct->bo);
> > > > +	if (ct->ctbs.h2g.bo && ct->ctbs.g2h.bo && want_ctb) {
> > > > +		snapshot->ctb_size = xe_bo_size(ct->ctbs.h2g.bo) +
> > > > +			xe_bo_size(ct->ctbs.g2h.bo);
> > > >  		snapshot->ctb = kmalloc(snapshot->ctb_size, atomic ? GFP_ATOMIC : GFP_KERNEL);
> > > >  	}
> > > >  
> > > > @@ -2012,8 +2030,13 @@ static struct xe_guc_ct_snapshot *guc_ct_snapshot_capture(struct xe_guc_ct *ct,
> > > >  		guc_ctb_snapshot_capture(xe, &ct->ctbs.g2h, &snapshot->g2h);
> > > >  	}
> > > >  
> > > > -	if (ct->bo && snapshot->ctb)
> > > > -		xe_map_memcpy_from(xe, snapshot->ctb, &ct->bo->vmap, 0, snapshot->ctb_size);
> > > > +	if (ct->ctbs.h2g.bo && ct->ctbs.g2h.bo && snapshot->ctb) {
> > > > +		xe_map_memcpy_from(xe, snapshot->ctb, &ct->ctbs.h2g.bo->vmap, 0,
> > > > +				   xe_bo_size(ct->ctbs.h2g.bo));
> > > > +		xe_map_memcpy_from(xe, snapshot->ctb + xe_bo_size(ct->ctbs.h2g.bo),
> > > > +				   &ct->ctbs.g2h.bo->vmap, 0,
> > > > +				   xe_bo_size(ct->ctbs.g2h.bo));
> > > > +	}
> > > >  
> > > >  	return snapshot;
> > > >  }
> > > > diff --git a/drivers/gpu/drm/xe/xe_guc_ct_types.h b/drivers/gpu/drm/xe/xe_guc_ct_types.h
> > > > index 09d7ff1ef42a..46ad1402347d 100644
> > > > --- a/drivers/gpu/drm/xe/xe_guc_ct_types.h
> > > > +++ b/drivers/gpu/drm/xe/xe_guc_ct_types.h
> > > > @@ -39,6 +39,8 @@ struct guc_ctb_info {
> > > >   * struct guc_ctb - GuC command transport buffer (CTB)
> > > >   */
> > > >  struct guc_ctb {
> > > > +	/** @bo: Xe BO for CTB */
> > > > +	struct xe_bo *bo;
> > > >  	/** @desc: dma buffer map for CTB descriptor */
> > > >  	struct iosys_map desc;
> > > >  	/** @cmds: dma buffer map for CTB commands */
> > > > @@ -126,8 +128,6 @@ struct xe_fast_req_fence {
> > > >   * for the H2G and G2H requests sent and received through the buffers.
> > > >   */
> > > >  struct xe_guc_ct {
> > > > -	/** @bo: Xe BO for CT */
> > > > -	struct xe_bo *bo;
> > > >  	/** @lock: protects everything in CT layer */
> > > >  	struct mutex lock;
> > > >  	/** @fast_lock: protects G2H channel and credits */
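
For reference, the layout arithmetic in the hunks above can be checked in
isolation. This is a host-side sketch: the 64-byte descriptor size is an
assumption (the real requirement is only that `struct guc_ct_buffer_desc`
fits inside the 2K alignment), and the macros are mirrored here rather
than taken from kernel headers.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side mirror of the patch's layout macros, for sanity-checking only */
#define SZ_2K		0x800u
#define SZ_4K		0x1000u
#define SZ_128K		0x20000u
#define PAGE_SIZE	SZ_4K
#define ALIGN(x, a)	(((x) + (a) - 1) & ~((a) - 1))

/* assumption: model sizeof(struct guc_ct_buffer_desc) as 64 bytes */
#define CTB_DESC_SIZE		ALIGN(64u, SZ_2K)
#define CTB_H2G_BUFFER_OFFSET	(CTB_DESC_SIZE * 2)
#define CTB_G2H_BUFFER_OFFSET	(CTB_DESC_SIZE * 2)
#define CTB_H2G_BUFFER_SIZE	(SZ_4K)
#define CTB_G2H_BUFFER_SIZE	(SZ_128K)

static size_t guc_h2g_size(void)
{
	return CTB_H2G_BUFFER_OFFSET + CTB_H2G_BUFFER_SIZE;
}

static size_t guc_g2h_size(void)
{
	return CTB_G2H_BUFFER_OFFSET + CTB_G2H_BUFFER_SIZE;
}
```

Both BOs keep two 2K descriptor slots at their head (hence `* 2` in both
offsets), so each size stays a multiple of PAGE_SIZE, which is exactly
what the new `xe_gt_assert` pair in `xe_guc_ct_init_noalloc` verifies.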


Thread overview: 29+ messages
2026-02-18  4:33 [PATCH v3 0/3] dGPU memory optimizations Matthew Brost
2026-02-18  4:33 ` [PATCH v3 1/3] drm/xe: Split H2G and G2H into separate buffer objects Matthew Brost
2026-02-18 23:12   ` Summers, Stuart
2026-02-19  3:46     ` Matthew Brost
2026-02-24 15:58   ` Thomas Hellström
2026-02-24 16:12     ` Matthew Brost
2026-02-25 10:55       ` Thomas Hellström
2026-02-25 18:08         ` Matthew Brost [this message]
2026-02-26 12:08       ` Thomas Hellström
2026-02-18  4:33 ` [PATCH v3 2/3] drm/xe: Avoid unconditional VRAM reads in H2G path Matthew Brost
2026-02-18 23:20   ` Summers, Stuart
2026-02-26 12:47   ` Thomas Hellström
2026-02-18  4:33 ` [PATCH v3 3/3] drm/xe: Move LRC seqno to system memory to avoid slow dGPU reads Matthew Brost
2026-02-24  2:40   ` Matthew Brost
2026-02-26 12:25   ` Thomas Hellström
2026-02-26 17:11     ` Matthew Brost
2026-02-26 17:26       ` Matthew Brost
2026-02-26 17:56         ` Thomas Hellström
2026-02-26 12:43   ` Thomas Hellström
2026-02-26 16:55     ` Matthew Brost
2026-02-18  4:40 ` ✓ CI.KUnit: success for dGPU memory optimizations Patchwork
2026-02-18  5:23 ` ✗ Xe.CI.BAT: failure " Patchwork
2026-02-18  6:15 ` ✗ Xe.CI.FULL: " Patchwork
2026-02-18  7:07 ` ✓ CI.KUnit: success for dGPU memory optimizations (rev2) Patchwork
2026-02-18  7:36 ` ✓ Xe.CI.BAT: " Patchwork
2026-02-18  7:53 ` ✓ Xe.CI.FULL: " Patchwork
2026-02-18 12:29 ` ✓ CI.KUnit: success for dGPU memory optimizations (rev3) Patchwork
2026-02-18 13:09 ` ✓ Xe.CI.BAT: " Patchwork
2026-02-18 14:08 ` ✗ Xe.CI.FULL: failure " Patchwork
