Intel-XE Archive on lore.kernel.org
From: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
To: Matthew Brost <matthew.brost@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>, intel-xe@lists.freedesktop.org
Subject: Re: [RFC PATCH] drm/xe/bo: Honor madvise(2) advices
Date: Sat, 29 Nov 2025 17:18:02 +0100	[thread overview]
Message-ID: <bd74224684f03a0d1fea8354f406a9ea03e0d288.camel@linux.intel.com> (raw)
In-Reply-To: <aSsXWDSOVnI4V2jG@lstrano-desk.jf.intel.com>

On Sat, 2025-11-29 at 07:55 -0800, Matthew Brost wrote:
> On Sat, Nov 29, 2025 at 01:51:38PM +0100, Thomas Hellström wrote:
> > On Fri, 2025-11-28 at 13:01 -0800, Matthew Brost wrote:
> > > On Fri, Nov 28, 2025 at 12:57:15PM +0000, Matthew Auld wrote:
> > > > On 28/11/2025 10:46, Thomas Hellström wrote:
> > > > > The user can give advices as to how the CPU will access an
> > > > > address range. Use those advices to determine the number of
> > > > > bo pages to prefault on a page-fault.
> > > > > 
> > > > > Do this regardless of whether we can find a way to avoid the
> > > > > fairly slow vm_insert_pfn_prot() to populate buffer
> > > > > object maps.
> > > > > 
> > > > > Initially, fault up to 512 pages on sequential access and
> > > > > a single page on random access.
> > > > > 
> > > > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > > > Cc: Matthew Auld <matthew.auld@intel.com>
> > > > > Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
> > > > > ---
> > > > >   drivers/gpu/drm/xe/xe_bo.c | 18 +++++++++++++++++-
> > > > >   1 file changed, 17 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/xe/xe_bo.c b/drivers/gpu/drm/xe/xe_bo.c
> > > > > index 6fd6ce6c6586..07d0d954f826 100644
> > > > > --- a/drivers/gpu/drm/xe/xe_bo.c
> > > > > +++ b/drivers/gpu/drm/xe/xe_bo.c
> > > > > @@ -1821,15 +1821,31 @@ static int xe_bo_fault_migrate(struct xe_bo *bo, struct ttm_operation_ctx *ctx,
> > > > >   	return err;
> > > > >   }
> > > > > +/*
> > > > > + * Number of prefaulted pages for the MADV_SEQUENTIAL and
> > > > > + * MADV_RANDOM madvise() advices.
> > > > > + */
> > > > > +#define XE_BO_VM_NUM_PREFAULT_SEQ  512
> > > > > +#define XE_BO_VM_NUM_PREFAULT_RAND 1
> > > > > +
> > > > >   /* Call into TTM to populate PTEs, and register bo for PTE removal on runtime suspend. */
> > > > >   static vm_fault_t __xe_bo_cpu_fault(struct vm_fault *vmf, struct xe_device *xe, struct xe_bo *bo)
> > > > >   {
> > > > > +	const struct vm_area_struct *vma = vmf->vma;
> > > > > +	pgoff_t num_prefault;
> > > > >   	vm_fault_t ret;
> > > > >   	trace_xe_bo_cpu_fault(bo);
> > > > > +	if (vma->vm_flags & VM_SEQ_READ)
> > > > > +		num_prefault = XE_BO_VM_NUM_PREFAULT_SEQ;
> > > > > +	else if (vma->vm_flags & VM_RAND_READ)
> > > > > +		num_prefault = XE_BO_VM_NUM_PREFAULT_RAND;
> > > > > +	else
> > > > > +		num_prefault = TTM_BO_VM_NUM_PREFAULT;
> > > > 
> > > > Ah, interesting. Do we know if any UMD is making use of these
> > > > special flags today? Just wondering if this might be a visible
> > > > change or not? Also, would it make sense to document/advertise
> > > > this somewhere for UMD folks, in case this has an immediate
> > > > benefit for them?
> > > > 
> > > 
> > > I also have a question here - does Xe / TTM support faulting in
> > > THP on the CPU side? Is that something we should also look at
> > > doing based on madvise / global THP settings? Would that help
> > > mitigate the slow vm_insert_pfn_prot too?
> > 
> > It would probably help a lot, as long as we actually get 2MiB pages
> > from TTM. 
> > 
> 
> Hmm, yes, this seems like a pretty big win too, considering Mesa now
> always allocates 2M BOs and then suballocates smaller allocations in
> user space. So we should pretty much always be getting 2M pages /
> faults.
> 
> > I had that implemented in TTM once, with vmwgfx the only user, and
> > it was working fine except for one very important detail: I had
> > implemented it based on vma information rather than PTE-based
> > information, so get_user_pages_fast() didn't recognize these pages
> > and was terribly confused. So it had to be ripped out.
> > 
> > If we're going to try that again, we need to talk to x86 arch to
> > get a PMD_PUD_SPECIAL pmd/pud flag that behaves just like
> > PTE_SPECIAL, so that things like get_user_pages_fast() ignore these
> > huge PTEs. Auditing all page-walks in core-mm for this is
> > non-trivial.
> > 
> 
> Agree, core-mm page walks are non-trivial to audit. I recently looked
> at the 2M device pages series, and it really wasn't all that bad,
> though.
> 
> I'm out of my technical depth on the PTE_SPECIAL comment, but I can
> dig in a bit here. We do have an x86 maintainer at Intel (Dave
> Hansen) to whom we can float any ideas on this topic.

Yeah, I was talking to those people back then and they suggested using
a flag at least initially separate from PTE_SPECIAL (there are some
available flags in the huge PMDs). But at the time x86_32 was still a
thing, and with the way 64-bit PTE updates were implemented on x86_32,
using one of the suggested flags would have made those updates racy,
on top of all the other complications. So while I made significant
progress back then, I decided there were more important things to do,
and stopped.

> 
> > But if that is done, we could bring in that stuff again, although
> > Christian wasn't very fond of having it in TTM.
> > 
> 
> We can perhaps bring this up as an option to Christian - from my
> limited knowledge on this topic, this seems like something worthwhile
> to do regardless of the PAT issue, as it just seems like a pretty big
> win. This however is very unlikely to make it into the customer
> kernels which are complaining about this perf issue, so I think we
> need to explore other options too.

Maybe for the issue at hand (I completely understand the NAK of using
apply_to_page_range() from within drivers) we could get a go-ahead for
exporting some useful interfaces from core mm: perhaps a
vmf_insert_pfns_prot() that takes an array of pfns and inserts them,
and perhaps a vmf_insert_pages_as_pfns_prot() that does the same for
arrays of struct page pointers. These could do the slow PAT
adjustments once and then insert all pages/pfns using a single
page-walk, either using apply_to_page_range() or the more generic
pagewalk. We could also look at the i915 implementation, particularly
the iomap stuff, and see if there's something reusable.
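To make that concrete, here is a rough sketch of what such exported
helpers might look like. The names and signatures are invented here
for discussion; nothing like this exists in core mm today:

```
/*
 * Hypothetical interfaces, for discussion only -- not existing code.
 * Both would perform the expensive PAT/pgprot adjustment once, then
 * populate all PTEs in a single page-table walk.
 */
vm_fault_t vmf_insert_pfns_prot(struct vm_area_struct *vma,
				unsigned long addr,
				const unsigned long *pfns,
				unsigned long num_pfns, pgprot_t prot);

/* Same idea, but taking an array of struct page pointers: */
vm_fault_t vmf_insert_pages_as_pfns_prot(struct vm_area_struct *vma,
					 unsigned long addr,
					 struct page **pages,
					 unsigned long num_pages,
					 pgprot_t prot);
```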

/Thomas


> 
> Matt
> 
> > But I think it would also be very beneficial for things like
> > ioremap() and friends.
> > 
> > /Thomas
> > 
> > 
> > > 
> > > Matt
> > > 
> > > > I guess it would be good to add an IGT which uses both flags,
> > > > if we don't already?
> > > > 
> > > > Anyway, I think the change makes sense,
> > > > Reviewed-by: Matthew Auld <matthew.auld@intel.com>
> > > > 
> > > > > +
> > > > >   	ret = ttm_bo_vm_fault_reserved(vmf, vmf->vma->vm_page_prot,
> > > > > -				       TTM_BO_VM_NUM_PREFAULT);
> > > > > +				       num_prefault);
> > > > >   	/*
> > > > >   	 * When TTM is actually called to insert PTEs, ensure no blocking conditions
> > > > >   	 * remain, in which case TTM may drop locks and return VM_FAULT_RETRY.
> > > > 
> > 


Thread overview: 8+ messages
2025-11-28 10:46 [RFC PATCH] drm/xe/bo: Honor madvise(2) advices Thomas Hellström
2025-11-28 10:53 ` ✓ CI.KUnit: success for " Patchwork
2025-11-28 12:57 ` [RFC PATCH] " Matthew Auld
2025-11-28 21:01   ` Matthew Brost
2025-11-29 12:51     ` Thomas Hellström
2025-11-29 15:55       ` Matthew Brost
2025-11-29 16:18         ` Thomas Hellström [this message]
2025-11-29 12:40   ` Thomas Hellström
