From: Matthew Brost <matthew.brost@intel.com>
To: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
Cc: <intel-xe@lists.freedesktop.org>
Subject: Re: [RFC PATCH 0/8] ULLS for kernel submission of migration jobs
Date: Tue, 17 Sep 2024 14:39:01 +0000 [thread overview]
Message-ID: <ZumUhcUYGBLUdfrW@DUT025-TGLU.fm.intel.com> (raw)
In-Reply-To: <5ee3eec0b0c8585b0b278d9f2f779d299aa51e3a.camel@linux.intel.com>
On Tue, Sep 17, 2024 at 08:59:10AM +0200, Thomas Hellström wrote:
> Hi, Matt
>
> On Mon, 2024-09-16 at 13:55 +0000, Matthew Brost wrote:
> > On Mon, Aug 12, 2024 at 05:26:26PM +0000, Matthew Brost wrote:
> > > On Mon, Aug 12, 2024 at 10:53:01AM +0200, Thomas Hellström wrote:
> > > > Hi, Matt,
> > > >
> > > > On Sun, 2024-08-11 at 19:47 -0700, Matthew Brost wrote:
> > > > > Ultra low latency for kernel submission of migration jobs.
> > > > >
> > > > > The basic idea is that faults (CPU or GPU) typically depend on
> > > > > migration
> > > > > jobs. Faults should be addressed as quickly as possible, but
> > > > > context
> > > > > switches via GuC on hardware are slow. To avoid context
> > > > > switches,
> > > > > perform ULLS in the kernel for migration jobs on discrete
> > > > > faulting
> > > > > devices with an LR VM open.
> > > > >
> > > > > This is implemented by switching the migration layer to ULLS
> > > > > mode
> > > > > upon
> > > > > opening an LR VM. In ULLS mode, migration jobs have a preamble
> > > > > and
> > > > > postamble: the preamble clears the current semaphore value, and
> > > > > the
> > > > > postamble waits for the next semaphore value. Each job
> > > > > submission
> > > > > sets
> > > > > the current semaphore in memory, bypassing the GuC. The net
> > > > > effect is
> > > > > that the migration execution queue never gets switched off the
> > > > > hardware
> > > > > while an LR VM is open.
> > > > >
> > > > > There may be concerns regarding power management, as the ring
> > > > > program
> > > > > continuously runs on a copy engine, and a force wake reference
> > > > > to a
> > > > > copy
> > > > > engine is held with an LR VM open.
> > > > >
> > > > > The implementation has been lightly tested but seems to be
> > > > > working.
> > > > >
> > > > > This approach will likely be put on hold until SVM is
> > > > > operational
> > > > > with
> > > > > benchmarks, but it is being posted early for feedback and as a
> > > > > public
> > > > > checkpoint.
> > > > >
> > > > > Matt
> > > >
> > > > The main concern I have with this is that, at least according to
> > > > upstream discussions, pagefaults are so slow anyway, a performant
> > > > stack
> > > > needs to try extremely hard to avoid them using manual prefaults,
> > > > and
> > > > if we hit a gpu pagefault, we've lost anyway and any migration
> > > > latency
> > > > optimization won't matter much.
> > > >
> > >
> > > I agree that if pagefaults are getting hit all the time we are in
> > > trouble wrt to performance but that doesn't mean when they do occur
> > > we
> > > shouldn't try to make servicing them as fast as possible.
> > >
> >
> > Okay, there is definitely something to this. I have an SVM test that
> > serially bounces (i.e., each bounce results in a GuC context switch)
> > a
> > 2M allocation via fault about 1,000 times between the CPU and GPU. I
> > applied this series and added a modparam to enable/disable ULLS to
> > the
> > tip of my latest working branch [3].
> >
> > Without ULLS enabled, the test on average took about 3 seconds on
> > BMG.
> > With ULLS enabled, the test on average took 2 seconds. I was
> > expecting a
> > small gain, but this is significant. In other similar tests I have
> > seen speeds nearing twice as fast. It seems to indicate that this
> > series is definitely worth pursuing.
> >
> > Also note that any operation using the migrate engine will see gains
> > (e.g., clearing BOs, prefetches, eviction) in terms of latency.
>
> I'm still concerned about adding this, and if we do we should only use
> it as a last resource if we see significant performance improvements in
> real-world applications. Historically all the latency optimizations was
> what screwed up the maintainability of the i915 driver and I think we
I agree real-world application performance is the acceptance criterion,
and maintainability is always a concern. FWIW, I think this series is a
clean implementation, so from that PoV maintainability is not a huge
concern. If it created all sorts of new concepts I'd be worried, but it
doesn't - it largely fits into the already defined layers rather nicely.
> should be extremely careful so we don't end up in the same situation.
> Concerns also around power management.
>
Agreed, we'd have to look at the PM implications too. I added a
forcewake reference but am really unsure whether that is required aside
from keeping our asserts happy. Obviously a spinning batch will use some
amount of power, but the UMDs are already doing this too and that was
deemed acceptable.
> Regardless, ULLS only really should improve things if we fail to
> pipeline migrations on the HW in the non-ULLS case, assuming that we're
> using a single exec-queue in both cases. Is that because we wait for
It is actually a single VM per device, due to locking. Yes, the
single-VM tests show a much larger perf improvement than the multi-VM /
multi-process tests, because in the latter multiple fault workers
feeding the copy engine in parallel already avoid a context switch on
each copy. There is still an improvement in that case, albeit not as
dramatic (~80% vs. ~21%).
> each migration to complete before allowing the next one? If so, is that
> something we could look at?
For fault handling within a VM, that is not something we can easily
remove due to the serial nature of the migration API, as the copy is
expected to complete between the migrate_vma_setup() and
migrate_vma_finalize() calls. Then, *after* this, page collection and
binding still need to be done in the GPU fault case.

For prefetches, I think we are basically going to have to pipeline
things for performance, but I think this is a bit easier as we have a
large working set upfront compared to a single fault address. This will
likely get quite complicated though, likely much more so than what is in
this series (e.g. a software pipeline with fence generation and async
waits / CBs, possibly multiple workers, etc.).
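For anyone skimming the thread, here is a rough user-space sketch of the
handshake the cover letter describes (preamble clears the semaphore,
postamble waits for the next value, submission writes the value in
memory, bypassing the GuC). This is only a toy model under assumptions:
Python threads stand in for the copy engine and ring, and the names
(UllsMigrateQueue, seqno, etc.) are made up, not the actual Xe ring
programming:

```python
import threading
import time

class UllsMigrateQueue:
    """Toy model of ULLS: the engine thread spins on a semaphore value
    in memory instead of being context-switched off between jobs."""

    def __init__(self):
        self.jobs = []      # stand-in for the ring buffer
        self.seqno = 0      # semaphore value the "hardware" polls
        self.results = []
        self._engine = threading.Thread(target=self._run)
        self._engine.start()

    def _run(self):
        done = 0
        while True:
            # Postamble (MI_SEMAPHORE_WAIT stand-in): spin until the
            # CPU publishes a newer seqno.
            while self.seqno <= done:
                time.sleep(0)
            job = self.jobs[done]
            if job is None:     # sentinel: tear down the queue
                return
            self.results.append(job())
            done += 1           # next iteration consumes this value

    def submit(self, job):
        # Submission writes the job and bumps the semaphore directly,
        # with no scheduler ("GuC") round trip.
        self.jobs.append(job)
        self.seqno += 1

    def close(self):
        self.submit(None)
        self._engine.join()

q = UllsMigrateQueue()
for i in range(4):
    q.submit(lambda i=i: i * 2)
q.close()
print(q.results)    # [0, 2, 4, 6]
```

The point is only the shape: the engine loop never returns to a
scheduler between jobs; the postamble spins on a memory location and
submission bumps it directly.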
Matt
>
> Thanks,
> Thomas
>
>
>
> >
> > Matt
> >
> > > > Also, for power management, LR VM open is a very simple strategy,
> > > > which
> > > > is good, but shouldn't it be possible to hook that up to LR job
> > > > running, similar to vm->preempt.rebind_deactivated?
> > > >
> > >
> > > That seems possible. Then in this scenario we'd hook the
> > > xe_migrate_lr_vm_get / put calls [1] [2] and runtime PM calls into
> > > the LR VM activate / deactivate calls rather than the LR VM open /
> > > close calls.
> > >
> > > Matt
> > >
> > > [1]
> > > https://patchwork.freedesktop.org/patch/607842/?series=137128&rev=1
> > > [2]
> > > https://patchwork.freedesktop.org/patch/607841/?series=137128&rev=1
> > >
> > > > /Thomas
> > > >
> > > >
> > > > >
> > > > > Matthew Brost (8):
> > > > > drm/xe: Add xe_hw_engine_write_ring_tail
> > > > > drm/xe: Add ULLS support to LRC
> > > > > drm/xe: Add ULLS flags for jobs
> > > > > drm/xe: Add ULLS migration job support to migration layer
> > > > > drm/xe: Add MI_SEMAPHORE_WAIT instruction defs
> > > > > drm/xe: Add ULLS migration job support to ring ops
> > > > > drm/xe: Add ULLS migration job support to GuC submission
> > > > > drm/xe: Enable ULLS migration jobs when opening LR VM
> > > > >
> > > > > .../gpu/drm/xe/instructions/xe_mi_commands.h | 6 +
> > > > > drivers/gpu/drm/xe/xe_guc_submit.c | 26 +++-
> > > > > drivers/gpu/drm/xe/xe_hw_engine.c | 10 ++
> > > > > drivers/gpu/drm/xe/xe_hw_engine.h | 1 +
> > > > > drivers/gpu/drm/xe/xe_lrc.c | 49 +++++++
> > > > > drivers/gpu/drm/xe/xe_lrc.h | 3 +
> > > > > drivers/gpu/drm/xe/xe_lrc_types.h | 2 +
> > > > > drivers/gpu/drm/xe/xe_migrate.c | 130
> > > > > +++++++++++++++++-
> > > > > drivers/gpu/drm/xe/xe_migrate.h | 4 +
> > > > > drivers/gpu/drm/xe/xe_ring_ops.c | 32 +++++
> > > > > drivers/gpu/drm/xe/xe_sched_job_types.h | 3 +
> > > > > drivers/gpu/drm/xe/xe_vm.c | 10 ++
> > > > > 12 files changed, 268 insertions(+), 8 deletions(-)
> > > > >
> > > >
>