Intel-XE Archive on lore.kernel.org
From: Matthew Brost <matthew.brost@intel.com>
To: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
Cc: <intel-xe@lists.freedesktop.org>
Subject: Re: [RFC PATCH 0/8] ULLS for kernel submission of migration jobs
Date: Tue, 17 Sep 2024 14:39:01 +0000	[thread overview]
Message-ID: <ZumUhcUYGBLUdfrW@DUT025-TGLU.fm.intel.com> (raw)
In-Reply-To: <5ee3eec0b0c8585b0b278d9f2f779d299aa51e3a.camel@linux.intel.com>

On Tue, Sep 17, 2024 at 08:59:10AM +0200, Thomas Hellström wrote:
> Hi, Matt
> 
> On Mon, 2024-09-16 at 13:55 +0000, Matthew Brost wrote:
> > On Mon, Aug 12, 2024 at 05:26:26PM +0000, Matthew Brost wrote:
> > > On Mon, Aug 12, 2024 at 10:53:01AM +0200, Thomas Hellström wrote:
> > > > Hi, Matt,
> > > > 
> > > > On Sun, 2024-08-11 at 19:47 -0700, Matthew Brost wrote:
> > > > > Ultra low latency for kernel submission of migration jobs.
> > > > > 
> > > > > The basic idea is that faults (CPU or GPU) typically depend on
> > > > > migration jobs. Faults should be addressed as quickly as
> > > > > possible, but context switches via GuC on hardware are slow. To
> > > > > avoid context switches, perform ULLS in the kernel for migration
> > > > > jobs on discrete faulting devices with an LR VM open.
> > > > > 
> > > > > This is implemented by switching the migration layer to ULLS
> > > > > mode upon opening an LR VM. In ULLS mode, migration jobs have a
> > > > > preamble and postamble: the preamble clears the current
> > > > > semaphore value, and the postamble waits for the next semaphore
> > > > > value. Each job submission sets the current semaphore in memory,
> > > > > bypassing the GuC. The net effect is that the migration
> > > > > execution queue never gets switched off the hardware while an LR
> > > > > VM is open.
> > > > > 
> > > > > There may be concerns regarding power management, as the ring
> > > > > program continuously runs on a copy engine, and a force wake
> > > > > reference to a copy engine is held with an LR VM open.
> > > > > 
> > > > > The implementation has been lightly tested but seems to be
> > > > > working.
> > > > > 
> > > > > This approach will likely be put on hold until SVM is
> > > > > operational with benchmarks, but it is being posted early for
> > > > > feedback and as a public checkpoint.
> > > > > 
> > > > > Matt
> > > > 
> > > > The main concern I have with this is that, at least according to
> > > > upstream discussions, pagefaults are so slow anyway that a
> > > > performant stack needs to try extremely hard to avoid them using
> > > > manual prefaults, and if we hit a GPU pagefault we've lost anyway,
> > > > so any migration latency optimization won't matter much.
> > > > 
> > > 
> > > I agree that if pagefaults are getting hit all the time we are in
> > > trouble wrt performance, but that doesn't mean that when they do
> > > occur we shouldn't try to make servicing them as fast as possible.
> > > 
> > 
> > Okay, there is definitely something to this. I have an SVM test that
> > serially bounces (i.e., each bounce results in a GuC context switch)
> > a 2M allocation via fault about 1,000 times between the CPU and GPU.
> > I applied this series and added a modparam to enable/disable ULLS to
> > the tip of my latest working branch [3].
> > 
> > Without ULLS enabled, the test on average took about 3 seconds on
> > BMG. With ULLS enabled, the test on average took 2 seconds. I was
> > expecting a small gain, but this is significant. In other similar
> > tests I have seen speeds nearing twice as fast. It seems to indicate
> > that this series is definitely worth pursuing.
> > 
> > Also note that any operation using the migrate engine will see gains
> > (e.g., clearing BOs, prefetches, eviction) in terms of latency.
> 
> I'm still concerned about adding this, and if we do, we should only
> use it as a last resort if we see significant performance improvements
> in real-world applications. Historically, all the latency
> optimizations were what screwed up the maintainability of the i915
> driver, and I think we

I agree real-world application performance is the acceptance criterion
and maintainability is always a concern. FWIW, I think this series is a
clean implementation, so from a maintainability PoV it is not a huge
concern. If it introduced all sorts of new concepts I'd be concerned,
but it doesn't - it largely fits into the already defined layers rather
nicely.

> should be extremely careful so we don't end up in the same situation.
> There are also concerns around power management.
> 

Agreed, we'd have to look at the PM implications too. I added a
forcewake reference but am really unsure if that is required aside from
keeping our asserts happy. Obviously a spinning batch will use some
amount of power, but the UMDs are already doing this too and that was
deemed acceptable.

> Regardless, ULLS should only really improve things if we fail to
> pipeline migrations on the HW in the non-ULLS case, assuming that
> we're using a single exec-queue in both cases. Is that because we wait
> for

Single VM per device actually, due to locking. Yes, the single-VM tests
show a much larger perf improvement compared to multi-VM / process
tests, because with multiple fault workers feeding the copy engine in
parallel we already avoid a context switch on each copy. There is still
an improvement in the latter case though, albeit not as dramatic (~80%
vs. ~21%).

> each migration to complete before allowing the next one? If so, is that
> something we could look at?

For fault handling within a VM, that is not something we can easily
remove due to the serial nature of the migration API, as the copy is
expected to complete between the migrate_vma_setup and
migrate_vma_finalize calls. Then, *after* this, page collection and
binding still need to be done in the GPU fault case.

For prefetches, I think we basically are going to have to pipeline
things for performance, but I think this is a bit easier as we have a
large working set upfront compared to a single fault address. This will
likely get quite complicated though, much more so than what is in this
series (e.g. a software pipeline with fence generation and async waits /
CBs, possibly multiple workers, etc...).

Matt

> 
> Thanks,
> Thomas
> 
> 
> 
> > 
> > Matt
> > 
> > > > Also, for power management, LR VM open is a very simple strategy,
> > > > which
> > > > is good, but shouldn't it be possible to hook that up to LR job
> > > > running, similar to vm->preempt.rebind_deactivated?
> > > > 
> > > 
> > > That seems possible. Then in this scenario we'd hook the
> > > xe_migrate_lr_vm_get / put calls [1] [2] and runtime PM calls into
> > > the LR VM activate / deactivate calls rather than the LR VM open /
> > > close calls.
> > > 
> > > Matt
> > > 
> > > [1]
> > > https://patchwork.freedesktop.org/patch/607842/?series=137128&rev=1
> > > [2]
> > > https://patchwork.freedesktop.org/patch/607841/?series=137128&rev=1
> > > 
> > > > /Thomas
> > > > 
> > > > 
> > > > > 
> > > > > Matthew Brost (8):
> > > > >   drm/xe: Add xe_hw_engine_write_ring_tail
> > > > >   drm/xe: Add ULLS support to LRC
> > > > >   drm/xe: Add ULLS flags for jobs
> > > > >   drm/xe: Add ULLS migration job support to migration layer
> > > > >   drm/xe: Add MI_SEMAPHORE_WAIT instruction defs
> > > > >   drm/xe: Add ULLS migration job support to ring ops
> > > > >   drm/xe: Add ULLS migration job support to GuC submission
> > > > >   drm/xe: Enable ULLS migration jobs when opening LR VM
> > > > > 
> > > > >  .../gpu/drm/xe/instructions/xe_mi_commands.h  |   6 +
> > > > >  drivers/gpu/drm/xe/xe_guc_submit.c            |  26 +++-
> > > > >  drivers/gpu/drm/xe/xe_hw_engine.c             |  10 ++
> > > > >  drivers/gpu/drm/xe/xe_hw_engine.h             |   1 +
> > > > >  drivers/gpu/drm/xe/xe_lrc.c                   |  49 +++++++
> > > > >  drivers/gpu/drm/xe/xe_lrc.h                   |   3 +
> > > > >  drivers/gpu/drm/xe/xe_lrc_types.h             |   2 +
> > > > >  drivers/gpu/drm/xe/xe_migrate.c               | 130 +++++++++++++++++-
> > > > >  drivers/gpu/drm/xe/xe_migrate.h               |   4 +
> > > > >  drivers/gpu/drm/xe/xe_ring_ops.c              |  32 +++++
> > > > >  drivers/gpu/drm/xe/xe_sched_job_types.h       |   3 +
> > > > >  drivers/gpu/drm/xe/xe_vm.c                    |  10 ++
> > > > >  12 files changed, 268 insertions(+), 8 deletions(-)
> > > > > 
> > > > 
> 


Thread overview: 22+ messages
2024-08-12  2:47 [RFC PATCH 0/8] ULLS for kernel submission of migration jobs Matthew Brost
2024-08-12  2:47 ` [RFC PATCH 1/8] drm/xe: Add xe_hw_engine_write_ring_tail Matthew Brost
2024-08-12  2:47 ` [RFC PATCH 2/8] drm/xe: Add ULLS support to LRC Matthew Brost
2024-08-12  2:47 ` [RFC PATCH 3/8] drm/xe: Add ULLS flags for jobs Matthew Brost
2024-08-12  2:47 ` [RFC PATCH 4/8] drm/xe: Add ULLS migration job support to migration layer Matthew Brost
2024-08-12  2:47 ` [RFC PATCH 5/8] drm/xe: Add MI_SEMAPHORE_WAIT instruction defs Matthew Brost
2024-08-12  2:47 ` [RFC PATCH 6/8] drm/xe: Add ULLS migration job support to ring ops Matthew Brost
2024-08-12  2:47 ` [RFC PATCH 7/8] drm/xe: Add ULLS migration job support to GuC submission Matthew Brost
2024-08-12  2:47 ` [RFC PATCH 8/8] drm/xe: Enable ULLS migration jobs when opening LR VM Matthew Brost
2024-08-12  2:51 ` ✓ CI.Patch_applied: success for ULLS for kernel submission of migration jobs Patchwork
2024-08-12  2:52 ` ✗ CI.checkpatch: warning " Patchwork
2024-08-12  2:53 ` ✓ CI.KUnit: success " Patchwork
2024-08-12  3:05 ` ✓ CI.Build: " Patchwork
2024-08-12  3:07 ` ✓ CI.Hooks: " Patchwork
2024-08-12  3:08 ` ✓ CI.checksparse: " Patchwork
2024-08-12  3:28 ` ✓ CI.BAT: " Patchwork
2024-08-12  4:31 ` ✗ CI.FULL: failure " Patchwork
2024-08-12  8:53 ` [RFC PATCH 0/8] " Thomas Hellström
2024-08-12 17:26   ` Matthew Brost
2024-09-16 13:55     ` Matthew Brost
2024-09-17  6:59       ` Thomas Hellström
2024-09-17 14:39         ` Matthew Brost [this message]
